Which LLM Provider is right for you? - Comparison of the Major Players
Your ideal LLM depends on specific needs: performance, cost, context window, and data privacy. We’ll do a comprehensive overview of the major LLMs and how to choose the right one for you.
Here are my current recommendations (subject to change):
- Daily Driver: Claude 3.5 Sonnet[1] - Reliable, high-performance
- API Access: Google Gemini[2] - Robust, wide-ranging capabilities
- Budget Option: DeepSeek V3[3] - Cost-effective, open model
Here’s a table of the most used LLMs (via OpenRouter - a router/unification layer for all the major LLMs) in Jan 2025[4] (pricing data from Artificial Analysis[5]):
Model | Context Window | Input ($/1M) | Output ($/1M) | Tokens Processed |
---|---|---|---|---|
Anthropic: Claude 3.5 Sonnet (self-moderated) | 200k | 3.00 | 15.00 | 444B |
Anthropic: Claude 3.5 Sonnet | 200k | 3.00 | 15.00 | 261B |
Google: Gemini Flash 1.5 8B | 1M | 0.04 | 0.15 | 136B |
Google: Gemini Flash 1.5 | 1M | 0.07 | 0.30 | 129B |
DeepSeek V3 | 128k | 0.90 | 1.10 | 70.3B |
Meta: Llama 3.2 1B Instruct | 128k | 0.04 | 0.06 | 63B |
Mistral: Mistral Nemo | 128k | 0.06 | 0.14 | 59.1B |
OpenAI: GPT-4o-mini | 128k | 0.15 | 0.60 | 47.2B |
MythoMax 13B | 4096 | - | - | 35.4B |
Meta: Llama 3.1 70B Instruct | 128k | 0.60 | 0.80 | 30.4B |
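Since the table is based on OpenRouter traffic, it’s worth noting that OpenRouter exposes all of these models behind a single OpenAI-compatible endpoint, so switching providers is mostly a matter of changing the model string. Here’s a minimal sketch (we’ll cover the OpenAI client in detail below); the model slugs are examples and should be checked against OpenRouter’s model list:
from openai import OpenAI
# OpenRouter exposes an OpenAI-compatible API; only the base_url and key differ.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)
# Example model slugs -- verify the exact identifiers on openrouter.ai/models.
for model in ["anthropic/claude-3.5-sonnet", "google/gemini-flash-1.5-8b"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain LLM routing in one sentence."}],
    )
    print(model, "->", response.choices[0].message.content)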
When integrating AI into your product, here are some considerations to keep in mind:
- Google Gemini Flash: 1M token context window. Ideal for processing long documents/videos[2].
- DeepSeek V3: Open model. API or self-hosting options. Great performance-to-cost ratio[3].
- Meta Llama 3.2 1B: Exceptional small-model performance. Perfect for sentiment analysis, summarization[6].
Why You Won’t Build Your Own LLM
This part refers to building an LLM from scratch (not fine-tuning an existing model). Training smaller models and/or other architectures is still a viable option - if your use case allows it.
You might be sitting there, fingers hovering over your keyboard, thinking:
“How hard could it be to build my own Large Language Model from scratch?”
Let me walk you through why this is an incredibly complex undertaking that’s beyond the reach of most organizations—even those with significant resources.
- Building a competitive LLM isn’t just expensive—it’s astronomically costly. We’re talking about:
  - Hardware costs exceeding $1 billion
  - Ongoing electricity expenses that could fund entire research departments
  - Specialized GPU clusters that cost millions just to purchase
- Training a state-of-the-art LLM isn’t a weekend project. You’re looking at:
  - Minimum 1-3 months of continuous training
  - Entire teams working full-time
  - Constant iterations and refinements
- This isn’t a task for generalist machine learning engineers. You need:
  - 5+ machine learning researchers
  - Each with 5-10 years of specialized experience in transformer architectures, distributed training, and model optimization
- Even if you overcome the financial, time, and expertise barriers, you’re competing against:
  - Tech giants with billions in research budgets
  - Dedicated teams of world-class AI researchers
  - Models trained on massive, curated datasets
Your most realistic paths forward are:
- Fine-tuning existing models
- Using API services from established providers
- Exploring smaller, specialized model architectures for specific use cases
Remember: Just because you can’t build an LLM from scratch doesn’t mean you can’t create incredible AI applications. The ecosystem of tools, APIs, and pre-trained models has never been more accessible.
Start with an API
Choosing the right Large Language Model (LLM) can feel like navigating a complex maze, but the most important first step is simple: start experimenting! APIs provide the quickest and most accessible entry point to do that. They give you access to powerful AI models with good inference speed, without the need to manage complex infrastructure.
However, there’s an important trade-off to consider: when using an API, you’re sharing your data with the service provider. If data privacy is a critical concern for your project, you might want to explore self-hosting options, which we’ll cover in the next section.
The good news is that most LLM providers have standardized around a similar API structure, primarily inspired by OpenAI’s original client. This means once you learn one, transitioning to others becomes much easier. Let’s set up a basic OpenAI API client:
import pandas as pd
from loguru import logger
from openai import OpenAI
from pydantic import BaseModel
# Configuration constants
MODEL = "gpt-4o-mini" # Choose your model
TEMPERATURE = 0.0 # Control randomness
MAX_COMPLETION_TOKENS = 128 # Limit response length
client = OpenAI(api_key="YOUR API KEY")
While most providers have Python clients, some also include libraries for TypeScript/JavaScript, Java, Go and more. Here’s a list of libraries that the OpenAI team supports: https://platform.openai.com/docs/libraries
To get started, you’ll need to:
- Create an account at https://platform.openai.com/
- Generate an API key
- Keep the key secure and never share it publicly
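A common way to keep the key out of your source code is to load it from an environment variable; the OpenAI client also picks up OPENAI_API_KEY automatically if you don’t pass one explicitly:
import os
from openai import OpenAI
# Read the key from the environment instead of hardcoding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Equivalent shortcut: OpenAI() reads OPENAI_API_KEY on its own.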
Here’s an example API call:
PROMPT = """
You are a UX writer specializing in clear, actionable error messages.
Write a payment failure error message in 2 parts:
- What happened (max 10 words)
- What to do (max 15 words)
The result should be a single error message that is 25 words or less.
Format it as a JSON object.
""".strip()
# Make the API call
response = client.chat.completions.create(
messages=[
{
"role": "user",
"content": PROMPT,
}
],
max_completion_tokens=MAX_COMPLETION_TOKENS,
temperature=TEMPERATURE,
model=MODEL,
)
print(response.choices[0].message.content)
Let’s break down what’s happening in this API call:
- We’re sending a list of `messages` to simulate a conversation
- Each message has a `role`[7] (`user`, `system`, or `assistant`)
- The `role` helps the model understand the context of the conversation
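Here’s a minimal sketch of how the three roles fit together in a multi-turn conversation (reusing the client and constants defined above; the messages themselves are just illustrative):
# The system message sets overall behavior; assistant messages carry the
# model's previous replies so it can keep track of the conversation.
response = client.chat.completions.create(
    model=MODEL,
    temperature=TEMPERATURE,
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    messages=[
        {"role": "system", "content": "You are a concise UX writer."},
        {"role": "user", "content": "Write an error title for a failed upload."},
        {"role": "assistant", "content": "Upload failed."},
        {"role": "user", "content": "Now add a one-sentence next step."},
    ],
)
print(response.choices[0].message.content)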
The response object contains the model’s generated content:
{
"error_message": {
"what_happened": "Payment processing failed.",
"what_to_do": "Please check your payment details and try again."
}
}
along with metadata like token usage:
usage = response.usage
usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
prompt_tokens | completion_tokens | total_tokens |
---|---|---|
73 | 43 | 116 |
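Combining the usage object with the pricing table above gives a rough per-call cost estimate. A small sketch, using gpt-4o-mini’s listed prices ($0.15 input / $0.60 output per million tokens):
# Rough per-call cost based on the usage object and the prices listed above.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60
cost = (
    usage.prompt_tokens * INPUT_PRICE_PER_M
    + usage.completion_tokens * OUTPUT_PRICE_PER_M
) / 1_000_000
print(f"Approximate cost: ${cost:.6f}")  # ~$0.000037 for 73 + 43 tokens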
But you’re a sophisticated AI engineer, so you can use Pydantic[8] to enforce specific output formats:
class ErrorMessage(BaseModel):
type: str
message: str
response = client.beta.chat.completions.parse(
messages=[
{
"role": "user",
"content": PROMPT,
}
],
max_completion_tokens=MAX_COMPLETION_TOKENS,
temperature=TEMPERATURE,
model=MODEL,
response_format=ErrorMessage, # Parse response into a structured format
)
print(response.choices[0].message.parsed)
ErrorMessage(
type='ErrorMessage',
message='Payment failed due to insufficient funds. Please check your account balance or use a different payment method.'
)
This approach gives you:
- Automatic data validation
- Predictable output structures
- Enhanced type safety
You can get much more complex with your objects; we’ll see how to do that in the projects we’re going to build.
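As a quick taste, here’s a sketch of a more involved schema with a nested model and a constrained field (the field names are made up for illustration):
from typing import Literal
from pydantic import BaseModel, Field

class ErrorDetail(BaseModel):
    what_happened: str = Field(description="Max 10 words")
    what_to_do: str = Field(description="Max 15 words")

class ErrorResponse(BaseModel):
    severity: Literal["info", "warning", "error"]
    detail: ErrorDetail

# Same parse() call as before, just with the richer schema.
response = client.beta.chat.completions.parse(
    messages=[{"role": "user", "content": PROMPT}],
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    temperature=TEMPERATURE,
    model=MODEL,
    response_format=ErrorResponse,
)
print(response.choices[0].message.parsed)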
Run an LLM on your local machine
Running an LLM on your local machine requires some setup and a good computing environment. However, this approach offers significant advantages, particularly when you’re dealing with sensitive data or need to process large volumes of information without relying on cloud services.
By far, the most straightforward method for running an LLM locally is using the Ollama application[9]. I’ve created a guide to help you get started, which you can watch in the video below:
- Install Ollama: Follow the installation instructions in the video.
- Download a Model: Open your terminal and pull the Llama 3.2 3B model with a simple command:
ollama run llama3.2
This command accomplishes two things:
  - Downloads the Llama 3.2 model to your machine
  - Loads the model into your system’s memory
- Interact with the Model: Once loaded, you can start chatting immediately. When you’re finished, type `/bye` to exit the application.
The models in Ollama are quantized[10], which is a technical term for compression. Think of quantization like audio compression:
- Reduces file size
- Makes the model run faster
- Slightly reduces model accuracy
While you lose some precision, the trade-off is a more accessible and performant local AI experience.
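To get a feel for the savings, here’s a back-of-the-envelope calculation of the weight size for a 3B-parameter model at different precisions (weights only, ignoring runtime overhead):
# Approximate weight memory for a 3B-parameter model at different precisions.
PARAMS = 3e9

for name, bytes_per_param in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{PARAMS * bytes_per_param / 1e9:.1f} GB")
# FP16: ~6.0 GB, 8-bit: ~3.0 GB, 4-bit: ~1.5 GB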
Ollama provides a Python client[11] that allows seamless integration into your own projects. Here’s a practical example of using the Llama 3.2 model to generate a concise error message:
from ollama import chat
# Model Configuration
MODEL = "llama3.2"
TEMPERATURE = 0 # Set to zero for deterministic responses
PROMPT = """
You are a UX writer specializing in clear, actionable error messages.
Write a payment failure error message in 2 parts:
- What happened (max 10 words)
- What to do (max 15 words)
The result should be a single error message that is 25 words or less.
Format it as a JSON object.
""".strip()
response = chat(
model=MODEL,
messages=[{"role": "user", "content": PROMPT}],
options={"temperature": TEMPERATURE},
format="json",
)
print(response.message.content)
When you run this script, you’ll get a response like:
{
"error": "Payment processing failed due to insufficient funds.",
"action": "Please try again with a different payment method or contact support for assistance."
}
Note how we used the `format="json"` parameter to ensure structured output.
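Recent versions of the Ollama Python client also accept a JSON schema in format, which pairs nicely with Pydantic. A sketch, assuming a recent ollama package with structured output support:
from ollama import chat
from pydantic import BaseModel

class ErrorMessage(BaseModel):
    what_happened: str
    what_to_do: str

# Pass the Pydantic-generated JSON schema to constrain the output shape.
response = chat(
    model=MODEL,
    messages=[{"role": "user", "content": PROMPT}],
    options={"temperature": TEMPERATURE},
    format=ErrorMessage.model_json_schema(),
)
error = ErrorMessage.model_validate_json(response.message.content)
print(error)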
Deploy an LLM for Your Business or Company
Deploying an LLM can be a game-changer for businesses that require high security, custom data processing, or specialized interactions. In my experience, many companies have the idea of deploying an LLM but are unsure where to start. They mostly fantasize about having an internal ChatGPT-like system that works with their private data.
While cloud-based solutions are convenient, running an LLM on your own infrastructure offers unparalleled control and data privacy. Of course, that would require a substantial investment in hardware and setup/monitoring.
Most companies go for >=70B models for their internal deployments. These models are powerful and can handle a wide range of tasks. However, they require a robust server that covers the following requirements (in FP16):
- GPU: around 142GB of GPU memory (that would be 2x NVIDIA A100[12] 80GB)
- RAM: >64GB
- CPU: AMD EPYC or Intel Xeon
- Storage: 300GB+ SSD
- Operating System: Linux (for CUDA compatibility)
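The GPU figure above comes straight from the parameter count. A quick sketch of the arithmetic:
# 70B parameters at FP16 (2 bytes each) is ~140GB of weights alone,
# before activations and the KV cache -- hence 2x 80GB A100s.
PARAMS = 70e9
print(f"FP16 weights: ~{PARAMS * 2 / 1e9:.0f} GB")

# At 4-bit (~0.5 bytes per parameter) the same model shrinks to ~35 GB,
# which is why a single 40GB A100 can work for quantized deployments.
print(f"4-bit weights: ~{PARAMS * 0.5 / 1e9:.0f} GB")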
You can use cloud infrastructure from providers like AWS, Google Cloud, or Azure. Alternatively, you can set up a dedicated server.
If your use case allows for it, you can use 4-bit quantization and run the model on a single A100 40GB GPU[12].
To serve the model, you can use the vLLM library[13]. It’s optimized for real-time and batch predictions, and it provides an OpenAI-compatible API. Let’s walk through a step-by-step deployment process:
- Create and Activate a Virtual Environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
- Install vLLM (Ensure CUDA compatibility with your GPU)
uv pip install vllm
- Deploy Your Chosen Model
vllm serve Qwen/Qwen2.5-1.5B-Instruct
- Test the Deployment
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
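Because vLLM serves an OpenAI-compatible API, the same client from the “Start with an API” section works against it; just point it at your server. A sketch (vLLM doesn’t check the API key unless you configure one, so a placeholder works):
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(response.choices[0].text)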
While this guide provides a quick start, production-ready deployment requires additional considerations:
- Load balancing
- Model caching
- Monitoring and logging
- Security configurations
- Scaling infrastructure
Here are some tips that help me sleep at night (after deploying models):
- Start with smaller models to test your infrastructure
- Always have a backup and recovery plan
- Monitor GPU memory and model performance
- Consider quantization techniques to reduce resource requirements
Counting Tokens (Context Window)
Looking at the model comparison table, you can see that the context window (the number of tokens the model can process at once) is around 128k tokens for most models. On top of that, the more tokens you process, the more you pay. How do you estimate how much it will cost to process your data?
To help you understand token usage, you can use the `tiktoken`[14] package by OpenAI. Let’s explore how to use it.
import pandas as pd
import tiktoken
from loguru import logger
from tiktoken import Encoding
# Initialize the encoder for a specific model
MODEL = "gpt-4o-mini"
encoder = tiktoken.encoding_for_model(MODEL)
We’ll create a function to demonstrate how tokens differ from words:
def show_encoding_comparison(text: str, encoder: Encoding):
# Encode the text into tokens
encoding = encoder.encode(text)
# Count words and tokens
word_count = len(text.split())
token_count = len(encoding)
# Create a DataFrame to show the comparison
df = pd.DataFrame(
{
"word_count": [word_count],
"token_count": [token_count],
"increase": [round(token_count / word_count, 2)],
}
)
logger.debug(f"Statistics:\n{df.to_markdown(index=None)}")
Let’s analyze our sample prompt:
prompt = """
You are a UX writer specializing in clear, actionable error messages.
Write a payment failure error message in 2 parts:
- What happened (max 10 words)
- What to do (max 15 words)
""".strip()
show_encoding_comparison(prompt, encoder)
Results:
word_count | token_count | increase |
---|---|---|
33 | 43 | 1.3 |
Notice how 33 words became 43 tokens: that’s a 30% increase! This matters when you’re processing large volumes of text.
Imagine processing 100,000 documents with this prompt:
Input tokens: 43 * 100,000 = 4,300,000 tokens
Model | Input Cost per Million Tokens | Total Cost for 4.3M Tokens |
---|---|---|
Claude 3.5 Sonnet | $3.00 | $12.90 |
DeepSeek v3 | $0.90 | $3.87 |
Google’s Gemini Flash 1.5 | $0.07 | $0.301 |
Output tokens often cost even more. Let’s look at a sample response:
response = """
{
"type": "payment_failure",
"message": "Your payment method has expired. Please update your card details or use a different payment method to complete the transaction."
}
""".strip()
show_encoding_comparison(response, encoder)
Results:
word_count | token_count | increase |
---|---|---|
25 | 37 | 1.48 |
Output token costs for 100,000 documents:
Model | Output Cost per Million Tokens | Total Cost for 3.7M Tokens |
---|---|---|
Claude 3.5 Sonnet | $15.00 | $55.50 |
DeepSeek v3 | $1.10 | $4.07 |
Google’s Gemini Flash 1.5 | $0.30 | $1.11 |
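To run the same numbers for your own workload, a small helper is enough (prices taken from the tables above):
def estimate_cost(n_docs: int, tokens_per_doc: int, price_per_million: float) -> float:
    """Total cost of processing n_docs, each using tokens_per_doc tokens."""
    return n_docs * tokens_per_doc * price_per_million / 1_000_000

# Input side: 100,000 documents x 43 prompt tokens at Claude 3.5 Sonnet's $3.00/M
print(estimate_cost(100_000, 43, 3.00))   # 12.9
# Output side: 100,000 documents x 37 completion tokens at $15.00/M
print(estimate_cost(100_000, 37, 15.00))  # 55.5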
Key takeaways:
- Tokens aren’t the same as words
- Token counts can increase text “size” by 30-50%
- Always consider both input and output token costs
- Cheaper models might have trade-offs in performance
As the technology evolves, these costs will likely decrease. However, being token-efficient is always a valuable skill.