The AI Engineer's Toolkit

Run AI Locally with Ollama

Get started with local AI development. Learn to install and use Ollama to run powerful AI models on your own machine for enhanced privacy, speed, and cost-efficiency.

Cloud-based AI APIs are powerful, but they're not always the right choice. Sending sensitive data to third-party servers raises privacy concerns, recurring API costs can become unpredictable, and network latency can slow down your applications. For many developers, these limitations are deal-breakers.

Ollama [1] solves these problems by making it simple to run state-of-the-art Large Language Models directly on your own hardware. This open-source [2] tool streamlines downloading, running, and managing LLMs such as Qwen and Llama, effectively turning your laptop into a private inference server.

This tutorial will get you up and running with Ollama. You'll learn to install the runtime, provision a model, and build a Python interface to control it. By the end, you will have an AI stack on your local machine with no network latency and no per-token costs.

Tutorial Goals

  • Install and configure the Ollama daemon for local inference
  • Manage model lifecycles: pull, run, and remove quantized models
  • Programmatically control the LLM using the Python SDK
  • Implement streaming responses for real-time user experiences

Why Run AI Models Locally?

Running AI models on your own infrastructure lets you "own the intelligence". It offers engineering advantages over cloud APIs:

  • Data Sovereignty: Your data never leaves your metal. This is non-negotiable for healthcare data (HIPAA), legal documents, or proprietary codebases where API transmission constitutes a leak.
  • Cost Predictability: Shift from variable OpEx (per-token billing) to fixed CapEx (hardware). Once the GPU is running, the marginal cost of experimentation is zero.
  • System Reliability: Eliminate external dependencies. Your application runs regardless of internet connectivity, cloud provider outages, or sudden API deprecations.
  • Developer Velocity: Iterate without rate limits. Test complex prompts and pipelines in a tight loop without waiting for quota resets or worrying about the bill.

How to Get Started with Ollama

Getting Ollama running takes just a few minutes. Let's install it, grab a model, and start chatting.

Installation

First, we need the Ollama daemon running. This acts as the backend server that manages model weights and handles inference.

On macOS, using Homebrew is the cleanest method:

brew install --cask ollama

Once installed, open the Ollama application from your Applications folder. This starts the background service on localhost:11434.

Verify the service is running:

ollama --version
Output
ollama version is 0.13.0
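
The daemon also exposes an HTTP API on localhost:11434, which is what the Python SDK talks to later in this tutorial. As an optional extra check, here is a minimal Python sketch (standard library only) that hits the root endpoint; the exact status text is an implementation detail of Ollama and may change between versions:

import urllib.request

# The Ollama daemon serves an HTTP API on localhost:11434.
# The root endpoint returns a short plain-text status message ("Ollama is running").
with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
    print(resp.read().decode())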

Download Your First Model

Now let's grab a capable model to work with. We'll use a small and fast model, perfect for getting started:

ollama pull gemma3:4b

This downloads Gemma 3 4B, a 4-billion-parameter model, to your machine. It'll take a few minutes depending on your internet speed.
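
If you prefer to script model downloads, the Python SDK (installed in the Project Setup section below) exposes an equivalent pull call. A minimal sketch, assuming the SDK is already available in your environment:

import ollama

# Programmatic equivalent of `ollama pull gemma3:4b`.
ollama.pull("gemma3:4b")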

Start Chatting

Time to test your new AI assistant. Start an interactive chat:

ollama run gemma3:4b

You'll see a >>> prompt. Ask it anything:

>>> Explain the difference between diesel and petrol engines in one sentence.
Output
Diesel engines ignite fuel through compression, while petrol (gasoline) engines
use a spark plug to ignite the air-fuel mixture, resulting in different
combustion processes and power outputs.

Useful commands:

  • ollama list - See all your downloaded models
  • /bye - Exit the chat session

For example, listing your models:

ollama list
Output
NAME                                        ID              SIZE      MODIFIED
gemma3:4b                                   a2af6cc3eb7f    3.3 GB    30 sec ago

Project Setup

Clone the repository:

git clone https://github.com/mlexpertio/academy.git

Navigate to the project directory:

cd academy

And install the dependencies:

uv sync

Confirm that your Ollama instance is still running; you're now ready to control your local models from Python.
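
Before writing any chat code, you can also confirm that the SDK reaches the daemon and that your model shows up. A minimal sketch; the exact shape of the returned object varies slightly between SDK versions, so we simply print it:

import ollama

# Programmatic equivalent of `ollama list`: asks the local daemon for available models.
print(ollama.list())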

Basic Chat

The command line is useful for quick sanity checks, but production systems are built in code. The Ollama Python SDK [3] provides a programmatic interface to interact with your local models.

Let's send your first request to the model:

local_ai/basic_chat.py
import ollama
 
MODEL = "gemma3:4b"
prompt = "Explain the difference between diesel and petrol engines in one sentence."
 
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
)
print(response.message.content)

Code Breakdown:

  • model: Identifies which local model to load into memory. If the model is not found locally, the library will raise an error.
  • messages: The standard chat format. You pass a list of dictionaries containing the role (user/assistant/system) and content.
  • response: Returns an object containing metadata (evaluation duration, token counts) and the actual message content in message.content.

Run the script:

uv run local_ai/basic_chat.py
Output
Diesel engines ignite fuel through compression, while petrol (gasoline) engines use a spark plug to ignite the fuel-air mixture, resulting in different combustion processes and power outputs.

Note that the response may differ from the one you got in the command-line chat. Deterministic responses are useful for testing and debugging, and we'll see how to achieve them in a later section.
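
Beyond message.content, the response object also carries the metadata mentioned above (timing and token counts), which is handy for quick performance checks. A minimal sketch; the field names below mirror Ollama's REST API, so verify them against your installed SDK version:

import ollama

MODEL = "gemma3:4b"

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in five words."}],
)

# Durations are reported in nanoseconds; counts are in tokens.
print(response.total_duration)     # total wall-clock time for the request
print(response.prompt_eval_count)  # tokens in the prompt
print(response.eval_count)         # tokens generated in the reply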

Streaming Responses

In a production application, latency is the enemy. If a model takes 5 seconds to generate a paragraph, waiting for the entire completion before showing anything makes your application feel broken.

To fix this, we use streaming. Instead of waiting for the full response, we consume the output token by token as it is generated, which keeps the perceived latency (time to first token) close to zero:

local_ai/stream_chat.py
import ollama
 
MODEL = "gemma3:4b"
prompt = "Write a 3 lines poem about Lexus (the car brand)."
 
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
 
for chunk in response:
    print(chunk.message.content, end="", flush=True)

Code Breakdown:

  • stream=True: Changes the return type from a ChatResponse object to a Python generator that yields chunks as they are produced.

Run the script:

uv run local_ai/stream_chat.py
Output
Here's a 3-line poem about Lexus:

A whisper of leather, a silent grace,
Lexus moves with a sophisticated pace,
Luxury and comfort, in a timeless space.

You will see the text appear in real-time, simulating the typing effect seen in ChatGPT or Claude. This is the standard pattern for all user-facing AI applications.
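
A common variation is to keep the full text while streaming, for example to log it or cache it after the user has already seen it render. A minimal sketch building on the script above:

import ollama

MODEL = "gemma3:4b"

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a 3 lines poem about Lexus (the car brand)."}],
    stream=True,
)

# Print tokens as they arrive and accumulate them for later use.
chunks = []
for chunk in response:
    piece = chunk.message.content
    chunks.append(piece)
    print(piece, end="", flush=True)

full_text = "".join(chunks)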

Customizing Model Parameters

By default, LLMs are non-deterministic. If you ask the same question twice, you might get different answers. In the real world, you need results that are consistent and repeatable.

To get deterministic results, you can pass options to the ollama.chat function:

local_ai/deterministic_chat.py
import ollama
 
MODEL = "gemma3:4b"
prompt = "Output the 3 best modern Ferrari V8 cars in order of preference. Reply with just the names."
 
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    options={
        "temperature": 0.0,
        "num_ctx": 2048,
        "seed": 42,
    },
)
 
print(response.message.content)

Code Breakdown:

  • temperature: Controls the randomness of token sampling. Set it to 0.0 for deterministic output (analytical tasks, code generation, JSON extraction). Increase it to 0.7 or higher for creative writing.
  • seed: Sets the random number generator seed. Using a fixed integer (like 42) lets the model generate the exact same response for the same prompt, which is essential for testing your prompts.
  • num_ctx: The context window size (in tokens). This is the model's "short-term memory." The default is 4096. If you are processing long documents (RAG), you must increase this value, but be aware that higher values consume more VRAM.

Run it:

uv run local_ai/deterministic_chat.py
Output
488 GT3
488 Pista
SF90 Stradale

Because we set temperature: 0.0 and a fixed seed, you will get this exact list every time you run the script. This predictability is the foundation of engineering stable AI systems.
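
If you want to check this on your own machine, a quick way is to call the model twice with identical options and compare the outputs. A minimal sketch:

import ollama

MODEL = "gemma3:4b"
prompt = "Output the 3 best modern Ferrari V8 cars in order of preference. Reply with just the names."
options = {"temperature": 0.0, "num_ctx": 2048, "seed": 42}


def ask() -> str:
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        options=options,
    )
    return response.message.content


# With temperature 0.0 and a fixed seed, both calls should return identical text.
print(ask() == ask())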

Conclusion

By setting up Ollama and connecting it to Python, you have unlocked:

  • Infrastructure Independence: You can build and test AI features without an internet connection or a credit card.
  • Rapid Prototyping: You can iterate on complex pipelines without hitting rate limits or incurring per-token costs.
  • Privacy-First Architecture: You can now process sensitive data that would otherwise be blocked by compliance or security policies.

This local environment is the foundation for the rest of the Academy. You will use this setup to build RAG pipelines, autonomous agents, and fine-tuning workflows.

However, having the engine running is only the first step. To get production-grade outputs from these models, you need to know how to speak their language.

Next, we move to Prompt Engineering for Engineers, where you will learn how to write and structure your prompts to unlock your model's full potential.

References

  1. Ollama's Docs
  2. Ollama's GitHub Repository
  3. Ollama Python SDK