How Attention Works
A comprehensive visual guide to Query, Key, and Value projections and causal masking operations inside transformers.
The self-attention mechanism, introduced in the seminal paper 'Attention Is All You Need' (Vaswani et al., 2017), lets a model dynamically weigh how relevant every other token in a sequence is to the token it is currently processing. Unlike recurrent architectures, which consume tokens one at a time, self-attention processes all tokens in parallel, which makes training considerably faster on parallel hardware such as GPUs.
Query, Key, and Value Matrices
We project every input vector into three spaces using learned weight matrices (W_Q, W_K, W_V) to produce Queries (Q), Keys (K), and Values (V):
- Query (Q): What the token is looking for.
- Key (K): What the token offers for other tokens' queries to match against.
- Value (V): The actual content information to extract.
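As a rough illustration of the projections described above, the sketch below multiplies a small input matrix by three weight matrices. The sizes (`seq_len`, `d_model`, `d_k`) and the random stand-ins for W_Q, W_K, and W_V are assumptions made for this example only; in a real model the weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes chosen for this example, not taken from any real model
seq_len, d_model, d_k = 4, 8, 3

# One row per token: X has shape (seq_len, d_model)
X = rng.normal(size=(seq_len, d_model))

# Random stand-ins for the learned projection matrices
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Each token's vector is projected into query, key, and value space
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Every token gets its own query, key, and value row, so Q, K, and V all keep the sequence length as their first dimension.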
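import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    # Transpose only the last two axes so batched inputs also work
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d_k)
    if mask is not None:
        # mask is 1 at positions to hide; push those scores toward negative infinity
        scores = scores + mask * -1e9
    attention_weights = softmax(scores, axis=-1)
    return np.matmul(attention_weights, v), attention_weights

Causal Masking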
In autoregressive decoder models (like GPT architectures), a token must not attend to the tokens that come after it. To enforce this, we add a mask that places a very large negative value (effectively negative infinity) at every position above the diagonal of the score matrix, leaving the lower triangle, including the diagonal, unchanged. When softmax is computed, the masked future positions receive essentially zero attention weight.
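The masking step can be sketched on its own as follows. This is a minimal example under stated assumptions: the mask uses 1 for a hidden position and 0 for a visible one, the large negative constant is -1e9, and the raw scores are all zeros purely so that the mask's effect is easy to read. The `softmax` helper is defined here only to keep the snippet self-contained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

seq_len = 4  # illustrative sequence length

# 1 marks a future position (column index > row index), 0 an allowed one
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

# Equal raw scores, so any difference in the weights comes from the mask
scores = np.zeros((seq_len, seq_len))

# Masked entries become hugely negative and vanish after softmax
masked_scores = scores + causal_mask * -1e9
weights = softmax(masked_scores, axis=-1)

# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest
print(np.round(weights, 2))
```

The first token can only attend to itself, while the last token attends to the whole sequence; every row still sums to 1.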