
Attention mechanisms allow a model to focus on specific parts of its input in natural language processing or computer vision. Weights are assigned to different elements of a data sequence, similar to how the human brain concentrates on particular areas when processing information.

  1. Input Sequence with Embedding: The model receives an input sequence as vectors or embeddings.
  2. Calculation of Relevance: A relevance score is computed for each element of the input sequence, measuring how strongly it relates to the current query or model state.
  3. Softmax and Attention Weight Distribution: A softmax is applied to the relevance scores to produce probability-like attention weights that sum to one.
  4. Context Vector Calculation: Each element of the input sequence is multiplied by its attention weight, and the results are summed into a context vector.
  5. Output Generation: The context vector is combined with the current model state to generate the output for that step.
  6. Context Vector Recalculation: The context vector is recalculated at each step using the input sequence and the previous model state.
  7. Learning Attention Weights: Backpropagation is used during model training to learn the parameters that produce the attention weights; this happens only in the training phase (see the sketch after this list).
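A minimal sketch of steps 1–7, assuming a toy NumPy setup; the dimensions, dot-product relevance function, and parameter names (`W_q`, `W_k`, `W_v`) are illustrative assumptions, not taken from the source.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis (step 3)
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(inputs, state, W_q, W_k, W_v):
    """One step of a simple attention mechanism.

    inputs: (seq_len, d) embedded input sequence   (step 1)
    state:  (d,) current model state
    """
    query = state @ W_q                    # project the state into a query
    keys = inputs @ W_k                    # project the inputs into keys
    values = inputs @ W_v                  # project the inputs into values
    scores = keys @ query                  # relevance of each element (step 2)
    weights = softmax(scores)              # probability-like attention weights (step 3)
    context = weights @ values             # weighted sum -> context vector (step 4)
    output = np.tanh(np.concatenate([context, state]))  # combine with state (step 5)
    return output, weights

# Toy usage: the context vector is recomputed at every step with the new state (step 6)
rng = np.random.default_rng(0)
d = 8
inputs = rng.normal(size=(5, d))
state = rng.normal(size=(d,))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention_step(inputs, state, W_q, W_k, W_v)
print(w.round(3))  # attention weights summing to 1
```

In a real model the projection matrices would be learned by backpropagation (step 7); here they are random placeholders.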

Scaled Dot-Product Attention: Most Common

The model determines which pieces of information (keys, K) most resemble the query (Q). Resemblance is measured by multiplying and summing the components of the query and the keys, i.e. a dot product (4). The resulting scores are scaled to keep them from becoming extremely small or large before the softmax.
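In the standard formulation (a common convention, not spelled out in the source), the scores are divided by the square root of the key dimension $d_k$:

$$\text{score}(Q, K) = \frac{QK^\top}{\sqrt{d_k}}$$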

All of the information (values, V) is then combined according to these relevance weights to generate the context vector (5, 6).
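A minimal sketch of scaled dot-product attention, assuming NumPy and single (unbatched) 2-D matrices; the dimensions and variable names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # resemblance of each query to each key, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> attention weights
    return weights @ V                                # merge values by relevance -> context vectors

# Toy usage
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 6))
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)  # (2, 6): one context vector per query
```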

Multi-Head Attention

WIP