1 Overview

1.1 Canadian Hansards
The main corpus for this assignment comes from the official records (Hansards) of the 36th Canadian Parliament, including debates from both the House of Commons and the Senate. This corpus is available at /u/cs401/A2/data/Hansard/ and has been split into Training/ and Testing/ directories. The data set consists of pairs of corresponding files (*.e is the English equivalent of the French *.f) in which every line is a sentence. Sentence alignment has already been performed for you: the n-th sentence in one file corresponds to the n-th sentence in its corresponding file (e.g., line n in fubar.e is aligned with line n in fubar.f). Note that this data consists only of one-to-one sentence pairs; many-to-one, many-to-many, and one-to-many alignments are not included.
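As a minimal sketch of how this alignment can be exploited, the snippet below pairs line n of an English file with line n of its French counterpart. The file names and sample sentences here are invented for illustration; real files live under /u/cs401/A2/data/Hansard/ and follow the same one-line-per-sentence format.

```python
# Illustrative sketch: pairing aligned sentences from a *.e/*.f file pair.
def read_bitext(e_lines, f_lines):
    """Pair line n of the English file with line n of the French file."""
    assert len(e_lines) == len(f_lines), "aligned files must have the same length"
    return [(e.strip(), f.strip()) for e, f in zip(e_lines, f_lines)]

# Toy stand-ins for the lines of fubar.e and fubar.f:
pairs = read_bitext(
    ["hello world\n", "thank you\n"],
    ["bonjour le monde\n", "merci\n"],
)
```

In practice you would obtain `e_lines` and `f_lines` by reading the corresponding `.e` and `.f` files; the zip-by-line pattern works because the corpus guarantees one sentence per line with matching line numbers.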
We will be implementing a simple seq2seq model, with and without attention, based largely on the course
material. You will train the models with teacher-forcing and decode using beam search. We will write
it in PyTorch version 1.2.0 (https://pytorch.org/docs/1.2.0/), which is the version installed on the
teach.cs servers. For those unfamiliar with PyTorch, we suggest you first read the PyTorch tutorial (https:
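To preview the training regime mentioned above, the toy sketch below illustrates teacher forcing only: at each step the decoder is conditioned on the gold previous token rather than its own prediction. The `step` function is an invented deterministic stand-in for a learned decoder cell, not part of the assignment's model.

```python
# Illustrative-only sketch of teacher forcing with a toy "decoder" step.
def step(prev_token, state):
    # Invented deterministic rule standing in for a learned decoder step.
    new_state = state + prev_token
    predicted = (new_state * 7) % 5  # pretend "logits -> argmax"
    return predicted, new_state

def decode_teacher_forced(target, sos=0):
    """At step t, condition on the gold token target[t-1], not the model's output."""
    state, prev, out = 0, sos, []
    for gold in target:
        pred, state = step(prev, state)
        out.append(pred)
        prev = gold  # teacher forcing: feed the reference token forward
    return out
```

During training, the loss compares each `pred` against the corresponding gold token; at test time, `prev` would instead be the model's own prediction (or a beam of candidates under beam search).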
1.3 Tensors and batches
PyTorch, like many deep learning frameworks, operates on tensors, which are multi-dimensional arrays.
When you work in PyTorch, you will rarely, if ever, work with just one bitext pair at a time. You will instead work with multiple sequences packed into one tensor, organized along a batch dimension. This means that a pair of source and target tensors F and E actually corresponds to multiple sequences F = (F^(1), F^(2), ..., F^(N)) and E = (E^(1), E^(2), ..., E^(N)). We work with batches instead of individual sequences because:
a) backpropagating the average gradient over a batch tends to converge faster than single samples, and b)
the computations for individual samples can be performed in parallel. For example, if we want to multiply source sequences F^(n) and F^(n+1) with an embedding matrix W, we can tell one CPU core to compute the result for F^(n) and another for F^(n+1), halving the overall time it would take to compute them one after the other. Learning
to work with tensors can be difficult at first, but is integral to efficient computation. We suggest you read
more about it in the NumPy docs (https://docs.scipy.org/doc/numpy/user/theory.broadcasting.html#array-broadcasting-in-numpy), whose broadcasting rules PyTorch borrows for its tensors.
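Since PyTorch follows NumPy's array conventions, the batching idea above can be sketched in NumPy: stack variable-length token-id sequences into one (N, T) batch, then embed every token with a single vectorized lookup instead of a per-sequence loop. The vocabulary size, embedding dimension, pad index, and token ids below are all invented for illustration.

```python
import numpy as np

def batch_sequences(seqs, pad_id=0):
    # Stack variable-length token-id sequences into one (N, T_max) array,
    # padding short sequences with pad_id (0 is an assumed pad index).
    t_max = max(len(s) for s in seqs)
    batch = np.full((len(seqs), t_max), pad_id, dtype=np.int64)
    for n, s in enumerate(seqs):
        batch[n, : len(s)] = s
    return batch

F = batch_sequences([[4, 7, 9], [5, 2]])   # shape (2, 3); the second row is padded

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4))           # toy embedding matrix: vocab 10, dim 4

embedded = W[F]                            # one vectorized lookup over the whole batch: (2, 3, 4)
per_seq = np.stack([W[f] for f in F])      # same result, computed one sequence at a time
```

The vectorized form `W[F]` computes exactly what the per-sequence loop does, but lets the library dispatch the work in parallel, which is the point of batching.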