Which LLM Provider is right for you? - Comparison of the Major Players

Your ideal LLM depends on your specific needs: performance, cost, context window, and data privacy. Here’s a comprehensive overview of the major LLMs and how to choose the right one for you.

Here are my current recommendations (subject to change):

  • Daily Driver: Claude 3.5 Sonnet[^1] - Reliable, high-performance
  • API Access: Google Gemini[^2] - Robust, wide-ranging capabilities
  • Budget Option: DeepSeek V3[^3] - Cost-effective, open model

Here’s a table of the most used LLMs (via OpenRouter - a router/unification layer for all the major LLMs) in Jan 2025[^4] (pricing data from Artificial Analysis[^5]):

| Model | Context Window | Input ($/1M) | Output ($/1M) | Tokens Processed |
| --- | --- | --- | --- | --- |
| Anthropic: Claude 3.5 Sonnet (self-moderated) | 200k | 3.00 | 15.00 | 444B |
| Anthropic: Claude 3.5 Sonnet | 200k | 3.00 | 15.00 | 261B |
| Google: Gemini Flash 1.5 8B | 1M | 0.04 | 0.15 | 136B |
| Google: Gemini Flash 1.5 | 1M | 0.07 | 0.30 | 129B |
| DeepSeek V3 | 128k | 0.90 | 1.10 | 70.3B |
| Meta: Llama 3.2 1B Instruct | 128k | 0.04 | 0.06 | 63B |
| Mistral: Mistral Nemo | 128k | 0.06 | 0.14 | 59.1B |
| OpenAI: GPT-4o-mini | 128k | 0.15 | 0.60 | 47.2B |
| MythoMax 13B | 4096 | - | - | 35.4B |
| Meta: Llama 3.1 70B Instruct | 128k | 0.60 | 0.80 | 30.4B |

When integrating AI into your product, here are some considerations to keep in mind:

  • Google Gemini Flash: 1M token context window. Ideal for processing long documents/videos[^2].
  • DeepSeek V3: Open model. API or self-hosting options. Great performance-to-cost ratio[^3].
  • Meta Llama 3.2 1B: Exceptional small-model performance. Perfect for sentiment analysis, summarization[^6].

Why You Won’t Build Your Own LLM

This part refers to building an LLM from scratch (not fine-tuning an existing model). Training smaller models and/or other architectures is still a viable option - if your use case allows it.

You might be sitting there, fingers hovering over your keyboard, thinking:

“How hard could it be to build my own Large Language Model from scratch?”

Let me walk you through why this is an incredibly complex undertaking that’s beyond the reach of most organizations—even those with significant resources.

  • Building a competitive LLM isn’t just expensive—it’s astronomically costly. We’re talking about:

    • Hardware costs exceeding $1 billion
    • Ongoing electricity expenses that could fund entire research departments
    • Specialized GPU clusters that cost millions just to purchase
  • Training a state-of-the-art LLM isn’t a weekend project. You’re looking at:

    • Minimum 1-3 months of continuous training
    • Entire teams working full-time
    • Constant iterations and refinements
  • This isn’t a task for generalist machine learning engineers. You need:

    • 5+ machine learning researchers
    • Each with 5-10 years of specialized experience in transformer architectures, distributed training, and model optimization
  • Even if you overcome the financial, time, and expertise barriers, you’re competing against:

    • Tech giants with billions in research budgets
    • Dedicated teams of world-class AI researchers
    • Models trained on massive, curated datasets

Your most realistic paths forward are:

  • Fine-tuning existing models
  • Using API services from established providers
  • Exploring smaller, specialized model architectures for specific use cases

Remember: Just because you can’t build an LLM from scratch doesn’t mean you can’t create incredible AI applications. The ecosystem of tools, APIs, and pre-trained models has never been more accessible.

Start with an API

Choosing the right Large Language Model (LLM) can feel like navigating a complex maze, but the most important first step is simple: start experimenting! APIs provide the quickest and most accessible entry point to do that. They give you access to powerful AI models with good inference speed, without the need to manage complex infrastructure.

However, there’s an important trade-off to consider: when using an API, you’re sharing your data with the service provider. If data privacy is a critical concern for your project, you might want to explore self-hosting options, which we’ll cover in the next section.

The good news is that most LLM providers have standardized around a similar API structure, primarily inspired by OpenAI’s original client. This means once you learn one, transitioning to others becomes much easier. Let’s set up a basic OpenAI API client:

import pandas as pd
from loguru import logger
from openai import OpenAI
from pydantic import BaseModel
 
# Configuration constants
MODEL = "gpt-4o-mini"  # Choose your model
TEMPERATURE = 0.0      # Control randomness
MAX_COMPLETION_TOKENS = 128  # Limit response length
 
client = OpenAI(api_key="YOUR API KEY")

While most providers have Python clients, some also include libraries for TypeScript/JavaScript, Java, Go and more. Here’s a list of libraries that the OpenAI team supports: https://platform.openai.com/docs/libraries

To get started, you’ll need to:

  1. Create an account at https://platform.openai.com/
  2. Generate an API key
  3. Keep the key secure and never share it publicly - one common approach is shown below
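A common way to satisfy that last point is to keep the key out of your source code and load it from an environment variable. A minimal sketch, assuming you’ve exported OPENAI_API_KEY in your shell (the client also reads this variable automatically if you omit api_key):

import os

from openai import OpenAI

# Read the key from the environment instead of hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])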

Here’s an example API call:

PROMPT = """
You are a UX writer specializing in clear, actionable error messages.
 
Write a payment failure error message in 2 parts:
 
- What happened (max 10 words)
- What to do (max 15 words)
 
The result should be a single error message that is 25 words or less.
Format it as a JSON object.
""".strip()
 
# Make the API call
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": PROMPT,
        }
    ],
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    temperature=TEMPERATURE,
    model=MODEL,
)
 
print(response.choices[0].message.content)

Let’s break down what’s happening in this API call:

  • We’re sending a list of messages to simulate a conversation
  • Each message has a role[^7] (user, system, or assistant)
  • The role helps the model understand the context of the conversation - see the sketch below for an example that adds a system message
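For example, a system message can set persistent instructions while the user message carries the actual request. Here’s a minimal sketch that reuses the client and constants defined above (the prompt text is just an illustration):

# A system message sets the persona; the user message carries the request
response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a UX writer specializing in clear, actionable error messages.",
        },
        {
            "role": "user",
            "content": "Write a one-sentence error message for a failed file upload.",
        },
    ],
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    temperature=TEMPERATURE,
    model=MODEL,
)

print(response.choices[0].message.content)

For the rest of this section we’ll stick with the single-user-message payment-failure call from above.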

The response object contains the model’s generated content:

Response
{
  "error_message": {
    "what_happened": "Payment processing failed.",
    "what_to_do": "Please check your payment details and try again."
  }
}

along with metadata like token usage:

usage = response.usage
usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
| prompt_tokens | completion_tokens | total_tokens |
| --- | --- | --- |
| 73 | 43 | 116 |
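You can turn those counts into a rough cost estimate using the pricing table from earlier. A quick sketch, assuming gpt-4o-mini pricing of $0.15 per 1M input tokens and $0.60 per 1M output tokens:

# Rough cost estimate for this single call (gpt-4o-mini pricing)
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60

cost = (
    usage.prompt_tokens * INPUT_PRICE_PER_1M
    + usage.completion_tokens * OUTPUT_PRICE_PER_1M
) / 1_000_000
print(f"${cost:.6f}")  # a tiny fraction of a cent for a single call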

But since you’re a sophisticated AI engineer, you can use Pydantic[^8] to enforce specific output formats:

class ErrorMessage(BaseModel):
    type: str
    message: str
 
 
response = client.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": PROMPT,
        }
    ],
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    temperature=TEMPERATURE,
    model=MODEL,
    response_format=ErrorMessage, # Parse response into a structured format
)
 
print(response.choices[0].message.parsed)
ErrorMessage(
  type='ErrorMessage',
  message='Payment failed due to insufficient funds. Please check your account balance or use a different payment method.'
)

This approach gives you:

  • Automatic data validation
  • Predictable output structures
  • Enhanced type safety

You can define much more complex objects; we’ll see how to do that in the projects we’re going to build.
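To give you a taste, here’s a sketch of a nested schema (the field names are invented for illustration, not taken from a real project); the parse call stays exactly the same:

from pydantic import BaseModel


class ErrorParts(BaseModel):
    what_happened: str
    what_to_do: str


class StructuredError(BaseModel):
    severity: str  # e.g. "warning" or "critical"
    parts: ErrorParts
    can_retry: bool


response = client.beta.chat.completions.parse(
    messages=[{"role": "user", "content": PROMPT}],
    max_completion_tokens=MAX_COMPLETION_TOKENS,
    temperature=TEMPERATURE,
    model=MODEL,
    response_format=StructuredError,  # nested models work the same way
)
print(response.choices[0].message.parsed)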

Run an LLM on your local machine

Running an LLM on your local machine requires some setup and a good computing environment. However, this approach offers significant advantages, particularly when you’re dealing with sensitive data or need to process large volumes of information without relying on cloud services.

By far, the most straightforward method for running an LLM locally is using the Ollama application[^9]. I’ve created a guide to help you get started, which you can watch in the video below:

  1. Install Ollama: Follow the installation instructions in the video.

  2. Download a Model: Open your terminal and pull the Llama 3.2 3B model with a simple command:

ollama run llama3.2

This command accomplishes two things:

  • Downloads the Llama 3.2 model to your machine
  • Loads the model into your system’s memory
  3. Interact with the Model: Once loaded, you can start chatting immediately. When you’re finished, type /bye to exit the application.

The models in Ollama are quantized[^10], which is a technical term for compression. Think of quantization like audio compression:

  • Reduces file size
  • Makes the model run faster
  • Slightly reduces model accuracy

While you lose some precision, the trade-off is a more accessible and performant local AI experience.

Ollama provides a Python client[^11] that allows seamless integration into your own projects. Here’s a practical example of using the Llama 3.2 model to generate a concise error message:

from ollama import chat
 
# Model Configuration
MODEL = "llama3.2"
TEMPERATURE = 0  # Set to zero for deterministic responses
 
PROMPT = """
You are a UX writer specializing in clear, actionable error messages.
 
Write a payment failure error message in 2 parts:
 
- What happened (max 10 words)
- What to do (max 15 words)
 
The result should be a single error message that is 25 words or less.
Format it as a JSON object.
""".strip()
 
response = chat(
    model=MODEL,
    messages=[{"role": "user", "content": PROMPT}],
    options={"temperature": TEMPERATURE},
    format="json",
)
 
print(response.message.content)

When you run this script, you’ll get a response like:

Response
{
  "error": "Payment processing failed due to insufficient funds.",
  "action": "Please try again with a different payment method or contact support for assistance."
}

Note how we used the format="json" parameter to ensure structured output.
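If you want the same kind of schema enforcement you get with Pydantic on the OpenAI client, recent versions of the Ollama Python client can also constrain the output to a JSON schema via the format parameter. A sketch, assuming a client version with structured-output support, reusing MODEL, PROMPT, and TEMPERATURE from above:

from ollama import chat
from pydantic import BaseModel


class ErrorMessage(BaseModel):
    type: str
    message: str


response = chat(
    model=MODEL,
    messages=[{"role": "user", "content": PROMPT}],
    options={"temperature": TEMPERATURE},
    format=ErrorMessage.model_json_schema(),  # constrain output to this schema
)

error = ErrorMessage.model_validate_json(response.message.content)
print(error)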

Deploy an LLM for Your Business or Company

Deploying an LLM for businesses that require high security, custom data processing, or specialized interactions can be a game-changer. In my experience, many companies have the idea of deploying an LLM but are unsure where to start. They mostly fantasize about having an internal ChatGPT-like system that works with their private data.

While cloud-based solutions are convenient, running an LLM on your own infrastructure offers unparalleled control and data privacy. Of course, that would require a substantial investment in hardware and setup/monitoring.

Most companies go for >=70B models for their internal deployments. These models are powerful and can handle a wide range of tasks. However, they require a robust server that covers the following requirements (in FP16):

  • GPU: around 142GB of GPU memory (that would be 2x NVIDIA A100[^12] 80GB)
  • RAM: >64GB
  • CPU: AMD EPYC or Intel Xeon
  • Storage: 300GB+ SSD
  • Operating System: Linux (for CUDA compatibility)

You can use cloud infrastructure from providers like AWS, Google Cloud, or Azure. Alternatively, you can set up a dedicated server.

If your use case allows for it, you can use 4-bit quantization and run the model on a single A100 40GB GPU[^12].
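Both figures follow from simple bytes-per-parameter arithmetic (a rough estimate that ignores activation and KV-cache overhead):

# Rough GPU memory estimate for a 70B-parameter model
PARAMS = 70e9

fp16_gb = PARAMS * 2 / 1e9  # ~140GB at 2 bytes/parameter (plus overhead -> ~142GB, i.e. 2x A100 80GB)
int4_gb = PARAMS * 0.5 / 1e9  # ~35GB at 4 bits/parameter -> fits a single A100 40GB

print(f"FP16: ~{fp16_gb:.0f}GB, 4-bit: ~{int4_gb:.0f}GB")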

To serve the model, you can use the vLLM library[^13]. It’s optimized for real-time and batch predictions, and it provides an OpenAI-compatible API. Let’s walk through a step-by-step deployment process:

  1. Create and Activate a Virtual Environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
  2. Install vLLM (Ensure CUDA compatibility with your GPU)
uv pip install vllm
  3. Deploy Your Chosen Model
vllm serve Qwen/Qwen2.5-1.5B-Instruct
  4. Test the Deployment (see the Python sketch after these steps)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
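Because the server exposes an OpenAI-compatible API, you can also reuse the OpenAI Python client from earlier instead of curl. A minimal sketch, assuming the server from step 3 is running on localhost:8000:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(response.choices[0].text)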

While this guide provides a quick start, production-ready deployment requires additional considerations:

  • Load balancing
  • Model caching
  • Monitoring and logging
  • Security configurations
  • Scaling infrastructure

Here are some tips that help me sleep at night (after deploying models):

  • Start with smaller models to test your infrastructure
  • Always have a backup and recovery plan
  • Monitor GPU memory and model performance
  • Consider quantization techniques to reduce resource requirements


Counting Tokens (Context Window)

Looking at the model comparison table, you can see that the context window (the number of tokens the model can process at once) is around 128k tokens for most models. On top of that, the more tokens you process, the more you pay. So how do you estimate how much it will cost to process your data?

To help you understand token usage, you can use the tiktoken[^14] package by OpenAI. Let’s explore how to use it.

import pandas as pd
import tiktoken
from loguru import logger
from tiktoken import Encoding
 
# Initialize the encoder for a specific model
MODEL = "gpt-4o-mini"
encoder = tiktoken.encoding_for_model(MODEL)

We’ll create a function to demonstrate how tokens differ from words:

def show_encoding_comparison(text: str, encoder: Encoding):
    # Encode the text into tokens
    encoding = encoder.encode(text)
 
    # Count words and tokens
    word_count = len(text.split())
    token_count = len(encoding)
 
    # Create a DataFrame to show the comparison
    df = pd.DataFrame(
        {
            "word_count": [word_count],
            "token_count": [token_count],
            "increase": [round(token_count / word_count, 2)],
        }
    )
    logger.debug(f"Statistics:\n{df.to_markdown(index=None)}")

Let’s analyze our sample prompt:

prompt = """
You are a UX writer specializing in clear, actionable error messages.
 
Write a payment failure error message in 2 parts:
 
- What happened (max 10 words)
- What to do (max 15 words)
""".strip()
 
show_encoding_comparison(prompt, encoder)

Results:

| word_count | token_count | increase |
| --- | --- | --- |
| 33 | 43 | 1.3 |

Notice how 33 words became 43 tokens - that’s a 30% increase! This matters when you’re processing large volumes of text.

Imagine processing 100,000 documents with this prompt:

Input tokens: 43 * 100,000 = 4,300,000 tokens

| Model | Input Cost per Million Tokens | Total Cost for 4.3M Tokens |
| --- | --- | --- |
| Claude 3.5 Sonnet | $3.00 | $12.90 |
| DeepSeek V3 | $0.90 | $3.87 |
| Google’s Gemini Flash 1.5 | $0.07 | $0.301 |

Output tokens often cost even more. Let’s look at a sample response:

response = """
{
  "type": "payment_failure",
  "message": "Your payment method has expired. Please update your card details or use a different payment method to complete the transaction."
}
""".strip()
 
show_encoding_comparison(response, encoder)

Results:

| word_count | token_count | increase |
| --- | --- | --- |
| 25 | 37 | 1.48 |

Output token costs for 100,000 documents:

| Model | Output Cost per Million Tokens | Total Cost for 3.7M Tokens |
| --- | --- | --- |
| Claude 3.5 Sonnet | $15.00 | $55.50 |
| DeepSeek V3 | $1.10 | $4.07 |
| Google’s Gemini Flash 1.5 | $0.30 | $1.11 |
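These totals are just token counts multiplied by the price per million tokens. A small helper (using the prices from the tables above) makes it easy to rerun the estimate for your own prompts and models:

def estimate_cost(
    n_documents: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Estimate the total cost of processing a batch of documents."""
    total_input = n_documents * input_tokens
    total_output = n_documents * output_tokens
    return (
        total_input * input_price_per_1m + total_output * output_price_per_1m
    ) / 1_000_000


# Claude 3.5 Sonnet: 100,000 documents, 43 input and 37 output tokens each
print(estimate_cost(100_000, 43, 37, 3.00, 15.00))  # 68.4 = 12.90 + 55.50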

Key takeaways:

  1. Tokens aren’t the same as words
  2. Token counts can increase text “size” by 30-50%
  3. Always consider both input and output token costs
  4. Cheaper models might have trade-offs in performance

As the technology evolves, these costs will likely decrease. However, being token-efficient is always a valuable skill.

References

Footnotes

[^1]: Claude Sonnet by Anthropic
[^2]: Google’s Gemini Flash
[^3]: DeepSeek V3
[^4]: OpenRouter Rankings
[^5]: Artificial Analysis
[^6]: Meta’s Llama 3.2
[^7]: Messages and roles
[^8]: Pydantic
[^9]: Ollama
[^10]: Quantization
[^11]: Ollama Python client
[^12]: NVIDIA A100
[^13]: vLLM
[^14]: tiktoken