Understanding Transformers: Part 3 - The Complete Architecture

November 15, 2023
5 min read
Jinu Nyachhyon

Welcome to Part 3 of our Transformers series! We've covered the attention mechanism and multi-head self-attention. Now let's see how all the pieces fit together in the complete Transformer architecture.

The Big Picture

The original Transformer consists of two main components:

  1. Encoder: Processes the input sequence
  2. Decoder: Generates the output sequence

Each component is built from stacked layers, and each layer contains multiple sub-components working together.

Encoder Architecture

Encoder Layer Components

Each encoder layer contains:

  1. Multi-Head Self-Attention
  2. Position-wise Feed-Forward Network
  3. Residual Connections
  4. Layer Normalization

The Flow Through an Encoder Layer

  1. Input: Token embeddings + positional encoding
  2. Self-Attention: Multi-head self-attention mechanism
  3. Add & Norm: Residual connection + layer normalization
  4. Feed-Forward: Position-wise feed-forward network
  5. Add & Norm: Another residual connection + layer normalization
  6. Output: Enhanced representations

Position-wise Feed-Forward Network

Each position gets processed independently through:

  • Linear transformation
  • ReLU activation
  • Another linear transformation

\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2

This adds non-linearity and lets the model transform each position's representation independently of the others.
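
As a quick sketch of these three steps in code (d_model = 512 and d_ff = 2048 are the sizes used in the original paper; everything else here is just illustrative):

import torch
import torch.nn as nn

# The same two linear layers are applied to every position in the sequence.
ffn = nn.Sequential(
    nn.Linear(512, 2048),  # xW1 + b1
    nn.ReLU(),             # max(0, .)
    nn.Linear(2048, 512),  # (.)W2 + b2
)

x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = ffn(x)                  # applied independently at each of the 10 positions
print(out.shape)              # torch.Size([2, 10, 512])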

Decoder Architecture

Decoder Layer Components

Each decoder layer contains:

  1. Masked Multi-Head Self-Attention
  2. Multi-Head Cross-Attention (encoder-decoder attention)
  3. Position-wise Feed-Forward Network
  4. Residual Connections and Layer Normalization

Masked Self-Attention

The decoder uses masked self-attention to prevent positions from attending to subsequent positions. This maintains the autoregressive property needed for generation.
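
A minimal sketch of how such a mask is typically constructed, assuming the convention that True marks positions that may be attended to:

import torch

def causal_mask(seq_len):
    # Position i may attend to positions j <= i, and to nothing after it.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])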

Cross-Attention

Cross-attention allows the decoder to attend to the encoder's output (see the sketch after this list):

  • Queries: Come from the decoder
  • Keys and Values: Come from the encoder
  • This enables the decoder to access input information while generating
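
A minimal sketch of this wiring, using PyTorch's built-in nn.MultiheadAttention for brevity (the custom module from Part 2 would be wired the same way; shapes are illustrative):

import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(2, 12, d_model)  # (batch, src_len, d_model)
decoder_state = torch.randn(2, 7, d_model)    # (batch, tgt_len, d_model)

# Queries come from the decoder; keys and values come from the encoder.
context, attn_weights = cross_attn(
    query=decoder_state,
    key=encoder_output,
    value=encoder_output,
)
print(context.shape)  # torch.Size([2, 7, 512]): one context vector per target position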

Key Architectural Components

1. Residual Connections

Residual connections help with:

  • Gradient flow: Prevents vanishing gradients in deep networks
  • Training stability: Makes optimization easier
  • Information preservation: Allows information to flow directly

\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))

2. Layer Normalization

Applied after each sub-layer:

  • Normalizes activations across the feature dimension
  • Stabilizes training
  • Reduces internal covariate shift

3. Positional Encoding

Added to input embeddings to provide position information (a minimal sketch follows this list):

  • Sinusoidal encoding in the original paper
  • Learned positional embeddings in many implementations
  • Relative positional encoding in some variants
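
Here is a sketch of the sinusoidal variant from the original paper; the implementation example further down assumes a module with this interface:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding, added to the token embeddings."""

    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        position = torch.arange(max_seq_len).unsqueeze(1)   # (max_seq_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_seq_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_seq_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]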

The Complete Data Flow

Training (Teacher Forcing)

  1. Input Processing:

    • Tokenize input and target sequences
    • Add positional encoding
    • Apply dropout
  2. Encoder Processing:

    • Process input through encoder stack
    • Generate contextualized representations
  3. Decoder Processing:

    • Process target sequence (shifted right)
    • Use masked self-attention
    • Apply cross-attention to encoder output
    • Generate predictions
  4. Loss Calculation:

    • Compare predictions with target
    • Compute cross-entropy loss
    • Backpropagate gradients (see the training-step sketch after this list)
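
A minimal sketch of one teacher-forced training step; here model is assumed to be a seq2seq Transformer that maps (src, tgt_input) to per-token logits, and PAD_ID is an assumed padding token id:

import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding token id
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def training_step(model, optimizer, src, tgt):
    tgt_input = tgt[:, :-1]    # decoder input: target shifted right (drop the last token)
    tgt_labels = tgt[:, 1:]    # labels: the next token at every position

    logits = model(src, tgt_input)               # (batch, tgt_len - 1, vocab_size)
    loss = criterion(
        logits.reshape(-1, logits.size(-1)),     # flatten to (N, vocab_size)
        tgt_labels.reshape(-1),                  # flatten to (N,)
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()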

Inference (Autoregressive Generation)

  1. Encoder: Process input sequence once
  2. Decoder: Generate tokens one by one
    • Start with a special start-of-sequence token
    • Use previously generated tokens as input at each step
    • Apply beam search or sampling to choose each next token (a greedy-decoding sketch follows)
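
A minimal greedy-decoding sketch (beam search keeps several candidate sequences instead of one, but follows the same loop); model, BOS_ID, and EOS_ID are assumptions for illustration:

import torch

BOS_ID, EOS_ID = 1, 2   # assumed start- and end-of-sequence token ids

@torch.no_grad()
def greedy_decode(model, src, max_len=50):
    generated = torch.tensor([[BOS_ID]])   # (1, 1): begin with the start token
    for _ in range(max_len):
        # In practice the encoder output is computed once and cached;
        # here the full model is called each step for simplicity.
        logits = model(src, generated)                            # (1, cur_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # most likely next token
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == EOS_ID:
            break
    return generated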

Implementation Example

Here's a simplified implementation of the encoder side of a Transformer, with an output projection on top:

import math
import torch
import torch.nn as nn

# Assumes a MultiHeadAttention module like the one from Part 2 and a
# PositionalEncoding module like the sinusoidal sketch above.

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        self.encoder_layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])
        
        self.output_projection = nn.Linear(d_model, vocab_size)
    
    def forward(self, x, mask=None):
        # Embedding and positional encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        
        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, mask)
        
        # Output projection
        return self.output_projection(x)
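
A quick usage sketch of the class above (hyperparameter values are just illustrative):

model = Transformer(
    vocab_size=10000, d_model=512, num_heads=8,
    num_layers=6, d_ff=2048, max_seq_len=512,
)

tokens = torch.randint(0, 10000, (2, 20))   # batch of 2 sequences, 20 token ids each
logits = model(tokens)                      # (2, 20, 10000): one score per vocab entry per position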

Transformer Variants

Encoder-Only Models

  • BERT: Bidirectional encoder for understanding tasks
  • RoBERTa: Optimized BERT training
  • DeBERTa: Enhanced position encoding

Decoder-Only Models

  • GPT: Autoregressive generation
  • GPT-2/3/4: Scaled-up versions
  • PaLM: Pathways Language Model

Encoder-Decoder Models

  • T5: Text-to-Text Transfer Transformer
  • BART: Denoising autoencoder
  • mT5: Multilingual T5

Design Choices and Trade-offs

Model Size

  • Parameters: More parameters → better performance but higher cost
  • Layers: Deeper models can capture more complex patterns
  • Hidden size: Wider models have more representational capacity

Attention Heads

  • Number of heads: More heads → more diverse attention patterns
  • Head size: Larger heads → more capacity per head
  • Trade-off: for a fixed d_model, each head works on d_model / num_heads dimensions, so the total parameter count stays roughly constant (see the example below)
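
To make the trade-off concrete (using d_model = 512 from the original paper): the heads split d_model between them, so the projection matrices keep the same total size no matter how many heads there are.

d_model = 512
for num_heads in (1, 8, 16):
    head_dim = d_model // num_heads
    # Q, K, V, and output projections are each d_model x d_model regardless of num_heads.
    projection_params = 4 * d_model * d_model
    print(num_heads, head_dim, projection_params)
# 1 512 1048576
# 8 64 1048576
# 16 32 1048576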

Sequence Length

  • Quadratic complexity: Longer sequences → quadratically more computation
  • Memory requirements: Attention matrices grow quadratically with sequence length (see the numbers below)
  • Solutions: Sparse attention, linear attention, hierarchical models
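
To put rough numbers on the quadratic growth mentioned above, here is the size of a single attention-weight matrix (per head, per layer, stored in float32) at a few sequence lengths:

for seq_len in (512, 2048, 8192):
    entries = seq_len * seq_len          # one attention weight per (query, key) pair
    megabytes = entries * 4 / 1024**2    # 4 bytes per float32 value
    print(seq_len, "tokens ->", megabytes, "MB per head, per layer")
# 512 tokens -> 1.0 MB per head, per layer
# 2048 tokens -> 16.0 MB per head, per layer
# 8192 tokens -> 256.0 MB per head, per layer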

What's Next?

In Part 4, we'll explore:

  • Training techniques and optimization
  • Pre-training strategies
  • Fine-tuning approaches
  • Common training challenges and solutions

We'll also discuss practical considerations for training Transformers effectively.

Key Takeaways

  1. Modular design: Transformers are built from reusable components
  2. Residual connections: Essential for training deep networks
  3. Layer normalization: Stabilizes training and improves convergence
  4. Encoder-decoder structure: Flexible for various tasks
  5. Autoregressive generation: Enables text generation capabilities

The complete Transformer architecture elegantly combines all these components to create a powerful and flexible model that has revolutionized natural language processing and beyond.