NMT Training through the Lens of SMT

This is a post for the EMNLP 2021 paper Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT.

In SMT, distinct competences (e.g. translation, language modeling, reordering) are handled by separate models. In NMT, the whole translation task is modelled with a single neural network. How and when does NMT acquire all of these competences? We show that

  • during training, NMT undergoes three different stages:
      1. target-side language modeling,
      2. learning how to use the source and approaching word-by-word translation,
      3. refining translations, visible in increasingly complex reorderings but almost invisible to standard metrics (e.g. BLEU);


  • not only is this fun, but it can also help in practice! For example, in settings where data complexity matters, such as non-autoregressive NMT.

Neural Machine Translation Inside Out

This is a blog version of my talk at the ACL 2021 workshop Representation Learning for NLP (an updated version of my talk at the NAACL 2021 workshop Deep Learning Inside Out (DeeLIO)).

In the last decade, machine translation shifted from the traditional statistical approaches with distinct components and hand-crafted features to the end-to-end neural ones. We try to understand how NMT works and show that:

  • NMT model components can learn to extract features which were modelled explicitly in SMT;
  • we can also look at how NMT balances the two different types of context: the source and the target prefix;
  • NMT training consists of stages where it focuses on competences mirroring three core SMT components.

Source and Target Contributions to NMT Predictions

This is a post for the ACL 2021 paper Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation.

In NMT, each prediction is based on two types of context: the source and the prefix of the target sentence. We show how to evaluate the relative contributions of source and target to NMT predictions and find that:

  • models suffering from exposure bias are more prone to over-relying on target history (and hence to hallucinating) than the ones where the exposure bias is mitigated;
  • models trained with more data rely on the source more and do so more confidently;
  • the training process is non-monotonic with several distinct stages.
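To make the idea of "relative contributions" concrete, here is a minimal toy sketch. It is not the paper's method (the paper uses Layer-wise Relevance Propagation, whose contributions are non-negative and sum to one by construction); it only illustrates how, given hypothetical per-token attribution scores for a single prediction, one could summarize what fraction of the total attribution comes from the source versus the target prefix:

```python
def source_contribution(source_attr, prefix_attr):
    """Fraction of total (absolute) attribution coming from the source.

    source_attr, prefix_attr: lists of per-token attribution scores for
    one prediction (hypothetical values, for illustration only).
    """
    src = sum(abs(a) for a in source_attr)
    tgt = sum(abs(a) for a in prefix_attr)
    return src / (src + tgt)

# A model over-relying on target history (and hence more prone to
# hallucinating) would show a low source contribution:
print(source_contribution([0.1, 0.05], [0.4, 0.3, 0.15]))  # → 0.15
```

Under this summary, "relying on the source more" corresponds to a higher value, and tracking it over training makes the non-monotonic stages visible.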