The “masked” a part of types of rnn the term refers to a way used throughout training the place future tokens are hidden from the mannequin. “Multi-head” right here means that the mannequin has a number of sets (or “heads”) of realized linear transformations that it applies to the enter. This is important because it enhances the modeling capabilities of …