AI is Eating the World - the Rise of LLMs
Remember when computers were only better than us at playing chess? Those days are gone. Today ChatGPT helps students understand complex topics, GitHub Copilot assists developers in writing code, and Claude helps researchers analyze data. These aren’t just incremental improvements - they represent a fundamental shift in how we interact with computers. Just five years ago, these capabilities would have seemed like science fiction.
But how did we get here? How do these models actually work? And most importantly - why are they suddenly everywhere? In this section, we’ll break down the technology behind LLMs, from their fundamental building blocks to the clever tricks that make them work. We’ll explore how they’re trained, how they generate text, and why they’ve become so powerful.
From Neural Networks to Transformers
Imagine teaching a computer to understand language like a human. For decades, we tried using traditional neural networks - they were like giving the computer a basic brain that could recognize patterns, but something was missing. These networks could handle simple tasks, but they struggled with understanding context and relationships in language.
Then, in 2017, everything changed with the introduction of Transformers in a paper called “Attention is All You Need”1. What made Transformers so special? Two breakthrough innovations:
- Processing Everything at Once: Think about reading a book. While traditional neural networks had to read one word at a time (like using a finger to follow each word), Transformers can look at an entire page at once. This parallel processing is like having hundreds of eyes reading simultaneously, making them incredibly fast and efficient.
- Understanding Context Through Self-Attention: Here’s where it gets interesting. When you read the sentence “The dog chased the cat because it was scared”, you instantly know that “it” refers to the cat. Traditional networks struggled with this, but Transformers use a clever trick called “self-attention” to handle it beautifully. Every word in a sentence can “look” at every other word and figure out how they’re related.
But there was one more challenge to solve. In language, order matters - “dog bites man” means something very different from “man bites dog”! Since Transformers process everything at once, they needed a way to understand word order. The solution? Position embeddings - like giving each word a special tag that tells the model “I’m the first word” or “I’m the last word” in the sentence.
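To make this a bit more concrete, here is a minimal PyTorch sketch (with made-up sizes and random vectors, not the author’s implementation) of the scaled dot-product self-attention idea, applied to embeddings that carry added position information:

import torch

torch.manual_seed(0)
seq_len, emb_size = 6, 16                  # 6 "words", 16-dimensional embeddings (made-up sizes)

word_emb = torch.randn(seq_len, emb_size)  # one vector per word
pos_emb = torch.randn(seq_len, emb_size)   # "I'm the first word", "I'm the second word", ...
x = word_emb + pos_emb                     # word order is now baked into every vector

# Self-attention: every word "looks" at every other word
scores = x @ x.T / emb_size**0.5           # how related is each pair of words?
weights = torch.softmax(scores, dim=-1)    # each row sums to 1
attended = weights @ x                     # each word becomes a weighted mix of all words
print(attended.shape)                      # torch.Size([6, 16])

Real Transformers learn separate query, key, and value projections (you’ll see that in the MultiHeadAttention class later in this section), but the core idea is exactly this weighted mixing.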
While Transformers are best known for their success in natural language processing, they also deliver state-of-the-art performance in other domains, such as computer vision2.
Recipe to Build a Large Language Model
In this part, we’ll use a Jupyter notebook to run the code. If you want to follow along, you can find the notebook on GitHub: Jupyter Notebook
Remember when ChatGPT3 burst onto the scene in late 2022? Suddenly, AI wasn’t just for tech enthusiasts - everyone from students to grandparents was using it. But what goes into building such a powerful AI system? Let’s break down the key ingredients needed to cook up a large language model (LLM).
Tokenization (Turn Text Into Numbers)
Before an LLM can write like Shakespeare, it needs to learn to read - but not like we do. An LLM only understands numbers, so we need to convert text into a numerical format. This process is called tokenization.
Here’s how it works:
- Text is broken down into smaller pieces called tokens
- Each token gets assigned a unique number
- These numbers become the AI’s vocabulary
Think of it like creating a giant dictionary where:
- Common words might be single tokens: “dog” → 892
- Rare words get split into pieces: “unconventional” → [“un”, “convention”, “al”]
- Special characters, spaces, and punctuation also get their own numbers
Modern LLMs typically use vocabularies of 128,000+ tokens. While a bigger vocabulary helps the AI understand more words, it also makes the model larger and slower. It’s a balance!
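To see tokenization in action, here’s a quick sketch using the tiktoken library (assuming it’s installed; the token ids shown earlier are only illustrative, and the exact splits depend on the tokenizer):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-3.5/GPT-4-era models

print(enc.encode("dog"))                              # a common word maps to a single token
print(enc.encode("unconventional"))                   # a rarer word is split into subword tokens
print(enc.n_vocab)                                    # how many tokens this vocabulary knows
print(enc.decode(enc.encode("The quick brown fox")))  # round-trip back to the original text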
All the (Clean) Data You Can Get
You can’t make a gourmet meal with poor ingredients, and you can’t build a great AI with poor data. Modern LLMs are trained on enormous amounts of text, including:
- Websites
- Books
- Academic papers
- Code repositories
- Social media posts
and high-quality private data (user interactions, chat logs, instructions, and more). Is all of this data licensed for training use? We don’t really know, but it is safe to assume some companies are pushing the boundaries of data privacy.
For example, Meta’s Llama 3.34 is pretrained on 15+ trillion tokens. The dataset is the result of a lot of filtering (and censorship):
To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality
More recently, there has been a push towards training LLMs on synthetic data, which is generated by the model itself (or other models). You can imagine the potential for bias and misinformation in such data.
Enormous Compute Power
Here’s the part that might crush your dreams of building your own LLM from scratch: the computing requirements are staggering.
Let’s put it in perspective. Training Llama 3.3 required:
- 24,000 high-end GPUs
- Hardware costs of around $720 million
- 102 days of non-stop training
To understand the scale: if you wanted to bake a cake, you’d need an oven. Building an LLM is like needing an entire industrial bakery complex!
Don’t worry though - while building LLMs from scratch is out of reach for most, you can still:
- Use pre-trained models
- Fine-tune existing models for specific tasks
- Build applications on top of existing LLMs
The Secret Sauce: Billions of Parameters
Parameters are like the model’s brain cells - they store all the patterns and knowledge learned during training. Modern LLMs have billions of them:
- GPT-3: 175 billion parameters
- Llama 3.1: 405 billion parameters
What’s fascinating is that as we add more parameters, these models develop unexpected abilities5 - like solving math problems or writing code, even though they were only trained to predict the next word in a sequence.
The relationship between parameters and capabilities is still somewhat mysterious. Scientists are still trying to understand why and how these models develop certain abilities at different scales.
How to Train an LLM
Imagine teaching a baby to speak - they start by listening and repeating words, then learn to follow instructions, and finally develop the ability to think through complex problems. Training an LLM follows a similar journey, but with a few key stages:
Supervised Pretraining
Think of this as the LLM’s “childhood” - it’s the longest and most crucial phase, taking up about 95% of the training time. You give it a text like this:
The cat sat on the
And the model predicts the next word
laptop
With billions of examples, the model starts to understand:
- Common word patterns
- Grammar rules
- Basic facts about the world
- Language structure
At this point, the model is like a toddler who can speak but doesn’t yet know how to have a proper conversation or follow instructions.
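To make the pretraining objective concrete, here is a toy sketch (pretending each word is a single token) of how one piece of text turns into many next-word prediction examples:

text = "The cat sat on the laptop"
tokens = text.split()  # pretend every word is exactly one token

# For each position, the model sees the words so far and must predict the next one
for i in range(1, len(tokens)):
    context = " ".join(tokens[:i])
    target = tokens[i]
    print(f"{context!r:>30} -> {target!r}")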
Instruction Following
Now we need to teach our model to be helpful and follow directions. There are two main approaches:
- Reinforcement Learning from Human Feedback (RLHF)
  - Used by OpenAI6 for ChatGPT 3.5
  - Human teachers rate the model’s responses
  - Model learns from this feedback
  - Very expensive and time-consuming
- The Practical Approach
  - Use existing conversation datasets
  - Contains examples of good instructions and responses
  - More affordable and faster
  - Still quite effective
Think of this as teaching the model proper conversation etiquette - when to answer, how to be helpful, and what kind of responses are appropriate (along with a lot of censorship in most cases).
Learning to Think (Chain-of-Thought)
The final stage is teaching the model to “show its work” - just like we teach students to explain their problem-solving process. This is called Chain-of-Thought (CoT)7. For example:
If John has 5 apples and gives 2 to Mary, how many does he have left?
Model’s thought process:
1. Start with John's apples: 5
2. John gives away: 2
3. Calculate remaining: 5 - 2 = 3
Answer: John has 3 apples left
This step helps the model:
- Break down complex problems
- Plan responses step-by-step
- Catch its own mistakes
- Provide more reliable answers
By training the model with examples that include this step-by-step thinking, it learns to approach problems more systematically - just like a student learning to solve math problems by showing their work.
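For illustration only (dataset formats vary), a single chain-of-thought training example might look like a prompt paired with a response that spells out the intermediate steps:

cot_example = {
    "prompt": "If John has 5 apples and gives 2 to Mary, how many does he have left?",
    "response": (
        "1. Start with John's apples: 5\n"
        "2. John gives away: 2\n"
        "3. Calculate remaining: 5 - 2 = 3\n"
        "Answer: John has 3 apples left"
    ),
}
print(cot_example["response"])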
Generating Text (Sampling)
When an LLM model generates text, it’s making choices about which word to write next - just like you choosing your next word in a conversation. The simplest approach is to choose the most likely next token at each step, but there are more sophisticated strategies that can lead to more interesting and varied text.
Temperature: Controlling Creativity vs Consistency
Temperature controls how “confident” the model should be when choosing the next word. The logits are divided by the temperature T before the softmax, so higher temperatures flatten the probability distribution (making less likely words easier to pick), while lower temperatures sharpen it. Here’s how it works:
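First, a minimal sketch (with made-up logits for three candidate next words, not tied to any particular model) of what dividing by the temperature does to the distribution:

import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for three candidate next words

for temperature in [0.3, 1.0, 1.5]:
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Low T  -> probability mass piles onto the top word (predictable)
# High T -> the distribution flattens, giving rarer words a real chance (creative)

In practice, the main decision is how high to set it: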
Low Temperature (Below 0.5)
- Perfect for when you need accurate, predictable responses
- Example Use Cases:
  - Writing technical documentation
  - Generating factual answers
  - Creating business emails
High Temperature (Above 1.0)
- Great for generating creative and unexpected content
- Example Use Cases:
  - Writing creative stories
  - Generating poetry
  - Brainstorming ideas
Top-k Sampling: Setting Boundaries for Choices
Think of top-k sampling like giving a limited menu of word choices. Instead of choosing from every possible word, it only looks at the k most likely options.
Here’s a real-world example:
- Let’s say k = 5
- The AI is writing a sentence: “The cat sat on the ___”
- Top 5 most likely words might be: “mat”, “chair”, “floor”, “bed”, “couch”
- The AI can only choose from these 5 words, preventing it from picking something nonsensical like “helicopter”
Practical Tips:
- Lower k (like 5-10): More focused, consistent text
- Higher k (like 40-50): More variety, potentially more creative
- Start with k = 20 and adjust based on your needs
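If you’re curious how this looks in code, here’s a hedged sketch of how top-k sampling can be implemented with torch.topk (the vocabulary and scores below are made up):

import torch

torch.manual_seed(42)
vocab = ["mat", "chair", "floor", "bed", "couch", "helicopter", "banana", "tuba"]
logits = torch.tensor([3.2, 2.9, 2.7, 2.5, 2.4, -1.0, -2.0, -3.0])  # made-up scores

k = 5
top_values, top_indices = torch.topk(logits, k)    # keep only the k best candidates
probs = torch.softmax(top_values, dim=-1)          # renormalize over that shortlist
choice = top_indices[torch.multinomial(probs, 1)]  # sample one of them
print(vocab[choice.item()])                        # never "helicopter"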
Repetition Penalty
This mechanism discourages the model from generating the same words or phrases repeatedly. It works by penalizing the probability of words that have already been used. This encourages the model to explore alternative ways of phrasing.
- Why It’s Needed:
  - Language models sometimes “loop” or overuse certain phrases, especially in longer texts.
  - Repetition penalty ensures variety and coherence.
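Here’s a simplified sketch of one common way to apply it (dividing the positive logits of tokens that already appeared by the penalty; exact formulas vary between libraries):

import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make tokens that were already generated less attractive to pick again."""
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty  # shrink positive scores
        else:
            logits[token_id] *= penalty  # push negative scores further down
    return logits

logits = torch.tensor([2.0, 0.5, -0.3, 1.0])  # made-up scores for a 4-token vocabulary
already_generated = [0, 2]                    # tokens the model has produced before
print(apply_repetition_penalty(logits, already_generated))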
Pro Tips
- For factual writing: Use low temperature (0.3-0.4) and moderate top-k (10-20)
- For creative writing: Use higher temperature (0.7-0.9) and higher top-k (30-50)
- Always keep repetition penalty on (usually between 1.1-1.2)
- Experiment! The best settings depend on your specific needs
Case study: Build a Tiny LM
Enough talk, let’s build a tiny language model from scratch! We’ll train it on a short text and then generate new text based on that training. Let’s start with the imports:
import numpy as np
import torch
import torch.nn as nn
Vocabulary
The first component will map characters to indices and vice versa. This will help us convert text to numbers and back:
class Vocabulary:
def __init__(self, text: str):
self.char_to_idx = {}
self.idx_to_char = {}
self.vocab_size = 0
self.build_vocab(text)
def build_vocab(self, text):
# Create sorted vocabulary from unique characters
unique_chars = sorted(list(set(text)))
self.char_to_idx = {char: idx for idx, char in enumerate(unique_chars)}
self.idx_to_char = {idx: char for char, idx in self.char_to_idx.items()}
self.vocab_size = len(unique_chars)
def encode(self, text):
"""Convert string to list of indices"""
return [self.char_to_idx[char] for char in text]
def decode(self, indices):
"""Convert list of indices to string"""
return "".join([self.idx_to_char[idx] for idx in indices])
def encode_tensor(self, text):
"""Convert string to PyTorch tensor"""
return torch.tensor([self.encode(text)])
def decode_tensor(self, tensor):
"""Convert PyTorch tensor to string"""
return self.decode(tensor.flatten().tolist())
Let’s try it with a sample text:
text = """
I grew up on the crime side, the New York Times side
Stayin' alive was no jive
At second hands, moms bounced on old men
So then we moved to Shaolin land
""".strip()
vocab = Vocabulary(text)
new_text = "Stayin' alive was no jive"
print(vocab.encode(new_text))
[7, 27, 10, 31, 17, 22, 2, 1, 10, 20, 17, 29, 14, 1, 30, 10, 26, 1, 22, 23, 1, 18, 17, 29, 14]
Transformer
Here’s the full transformer implementation:
EMBEDDING_SIZE = 32
ATTENTION_HEADS = 4
FEED_FORWARD_SIZE = 128
DROPOUT = 0.1
CONTEXT_WINDOW = 128
class MultiHeadAttention(nn.Module):
def __init__(self):
super().__init__()
self.d_k = EMBEDDING_SIZE // ATTENTION_HEADS
self.q_linear = nn.Linear(EMBEDDING_SIZE, EMBEDDING_SIZE)
self.k_linear = nn.Linear(EMBEDDING_SIZE, EMBEDDING_SIZE)
self.v_linear = nn.Linear(EMBEDDING_SIZE, EMBEDDING_SIZE)
self.out = nn.Linear(EMBEDDING_SIZE, EMBEDDING_SIZE)
def forward(self, q, k, v, mask=None):
batch_size = q.size(0)
q = (
self.q_linear(q)
.view(batch_size, -1, ATTENTION_HEADS, self.d_k)
.transpose(1, 2)
)
k = (
self.k_linear(k)
.view(batch_size, -1, ATTENTION_HEADS, self.d_k)
.transpose(1, 2)
)
v = (
self.v_linear(v)
.view(batch_size, -1, ATTENTION_HEADS, self.d_k)
.transpose(1, 2)
)
scores = torch.matmul(q, k.transpose(-2, -1)) / np.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
attn = torch.softmax(scores, dim=-1)
out = torch.matmul(attn, v)
out = out.transpose(1, 2).contiguous().view(batch_size, -1, EMBEDDING_SIZE)
return self.out(out)
class FeedForward(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(EMBEDDING_SIZE, FEED_FORWARD_SIZE),
nn.ReLU(),
            nn.Dropout(DROPOUT),
nn.Linear(FEED_FORWARD_SIZE, EMBEDDING_SIZE),
)
def forward(self, x):
return self.net(x)
class TransformerBlock(nn.Module):
def __init__(self):
super().__init__()
self.attention = MultiHeadAttention()
self.feed_forward = FeedForward()
self.norm1 = nn.LayerNorm(EMBEDDING_SIZE)
self.norm2 = nn.LayerNorm(EMBEDDING_SIZE)
        self.dropout = nn.Dropout(DROPOUT)
def forward(self, x, mask=None):
attended = self.attention(x, x, x, mask)
x = self.norm1(x + self.dropout(attended))
fed_forward = self.feed_forward(x)
x = self.norm2(x + self.dropout(fed_forward))
return x
class Transformer(nn.Module):
def __init__(self, vocab_size: int):
super().__init__()
self.embedding = nn.Embedding(vocab_size, EMBEDDING_SIZE)
self.pos_embedding = nn.Parameter(
torch.randn(1, CONTEXT_WINDOW, EMBEDDING_SIZE)
)
self.transformer = TransformerBlock()
self.fc = nn.Linear(EMBEDDING_SIZE, vocab_size)
def forward(self, x, mask=None):
x = self.embedding(x) + self.pos_embedding[:, : x.size(1)]
x = self.transformer(x, mask)
return self.fc(x)
Our model expects a sequence of numerical indices from the Vocabulary, which are then transformed into learned embeddings. These embeddings, combined with positional information, flow through the Transformer block, where the multi-head attention mechanism allows each position to attend to all other positions, capturing relationships between characters regardless of their distance.
The attended representations then pass through a feed-forward network and layer normalization, producing output logits - the probability distribution over the next character. By repeatedly sampling from these predictions, the model can generate text continuations.
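One detail worth flagging: the forward method accepts an optional mask, but the training loop below calls the model without one, so every character can also attend to characters that come after it. That’s fine for this toy example, but a strictly causal (left-to-right) model would pass a lower-triangular mask, roughly like this:

seq_len = 10                                            # length of the input sequence
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1s on and below the diagonal
# output = model(x, mask=causal_mask)                   # positions where the mask is 0 get -1e9 before the softmax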
Dataset
Our dataset is extremely simple, just a single sentence:
sentence = "The quick brown fox jumps over the lazy dog."
sentence[:-1]
The quick brown fox jumps over the lazy dog
The model will learn to predict the next character, here’s the target sequence:
sentence[1:]
he quick brown fox jumps over the lazy dog.
Train
Training the model is a plain PyTorch loop, except we’ll map the sentence to numerical indices using the Vocabulary:
def train(sentence):
vocab = Vocabulary(sentence)
# Prepare input and target sequences
x = vocab.encode_tensor(sentence[:-1]) # Input sequence
y = vocab.encode_tensor(sentence[1:]) # Target sequence
# Create model and optimizer
model = Transformer(vocab.vocab_size)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
# Training loop
for epoch in range(1000):
optimizer.zero_grad()
output = model(x)
loss = criterion(output.view(-1, vocab.vocab_size), y.view(-1))
loss.backward()
optimizer.step()
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")
return model, vocab
Let’s train the model:
model, vocab = train(sentence)
Epoch 100, Loss: 0.4024
Epoch 200, Loss: 0.0857
Epoch 300, Loss: 0.0362
Epoch 400, Loss: 0.0203
Epoch 500, Loss: 0.0143
Epoch 600, Loss: 0.0096
Epoch 700, Loss: 0.0072
Epoch 800, Loss: 0.0057
Epoch 900, Loss: 0.0046
Epoch 1000, Loss: 0.0038
The loss is decreasing, which means the model is learning to predict the next character in the sequence.
Text generation
Finally, we can generate new text by providing a prefix and letting the model predict the next characters (by picking the character with the highest probability at each step):
def generate(model, prefix, vocab, max_new_chars=32):
model.eval()
current_sequence = vocab.encode(prefix)
result = prefix
for _ in range(max_new_chars):
# Predict next character
x = torch.tensor([current_sequence])
with torch.no_grad():
output = model(x)
next_char_idx = torch.argmax(output[0, -1]).item()
# Add predicted character to sequence
current_sequence.append(next_char_idx)
result += vocab.idx_to_char[next_char_idx]
# Stop if we predict a period
if vocab.idx_to_char[next_char_idx] == ".":
break
return result
Here’s how to generate text:
text = "The quick brown"
generated = generate(model, text, vocab, max_new_chars=4)
print(generated)
The quick brown fox
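Our generate function always picks the single most likely character (greedy decoding). To tie things back to the sampling section, here’s a hedged sketch of how the argmax line could be swapped for temperature plus top-k sampling:

def sample_next_char(logits, temperature=0.8, k=5):
    """Sample the next character index instead of always taking the argmax."""
    logits = logits / temperature                  # rescale confidence
    top_values, top_indices = torch.topk(logits, min(k, logits.size(-1)))
    probs = torch.softmax(top_values, dim=-1)      # renormalize over the top-k shortlist
    return top_indices[torch.multinomial(probs, 1)].item()

# Inside generate(), replace the argmax line with:
# next_char_idx = sample_next_char(output[0, -1])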
And that’s it! You’ve built a tiny language model from scratch. While this model is very simple, you can further expand it by adding more layers, training on larger datasets, and experimenting with different hyperparameters. Have fun!