Build Your First AI App

Run AI Locally with Ollama

Get started with local AI development. Learn to install and use Ollama to run powerful AI models on your own machine for enhanced privacy, speed, and cost-efficiency.

Cloud-based AI APIs are powerful, but they're not always the right choice. Sending sensitive data to third-party servers raises privacy concerns, recurring API costs can become unpredictable, and network latency can slow down your applications. For many developers, these limitations are deal-breakers.

Ollama [1] solves these problems by making it simple to run state-of-the-art Large Language Models (LLMs) on your own hardware. This open-source [2] tool streamlines downloading, running, and managing models such as Qwen and Llama, effectively turning your laptop into a private inference server.

This tutorial will get you up and running with Ollama. You'll learn to install the runtime, provision a model, and build a Python interface to control it. By the end, you will have an AI stack running on your local machine with no network round trips and no per-token fees.

Tutorial Goals

  • Install and configure the Ollama daemon for local inference
  • Manage model lifecycles - pull, run, and remove quantized models
  • Programmatically control the LLM using the Python SDK
  • Implement streaming responses for real-time user experiences

Why Run AI Models Locally?

Running AI models on your own infrastructure lets you "own the intelligence". It offers engineering advantages over cloud APIs:

  • Data Sovereignty: Your data never leaves your metal. This is non-negotiable for healthcare data (HIPAA), legal documents, or proprietary codebases where API transmission constitutes a leak.
  • Cost Predictability: Shift from variable OpEx (per-token billing) to fixed CapEx (hardware). Once the GPU is running, the marginal cost of experimentation is zero.
  • System Reliability: Eliminate external dependencies. Your application runs regardless of internet connectivity, cloud provider outages, or sudden API deprecations.
  • Developer Velocity: Iterate without rate limits. Test complex prompts and pipelines in a tight loop without waiting for quota resets or worrying about the bill.

Real World Considerations

While local inference has many advantages, the most powerful models are still in the cloud. Most businesses will still need cloud models for some workloads, often because they offer stronger performance with less operational complexity.

How to Get Started with Ollama

Getting Ollama running takes just a few minutes. Let's install it, grab a model, and start chatting.

Hardware Requirements

Running AI models locally isn't exactly free. You need decent hardware: at least 16GB of RAM or VRAM for small models. State-of-the-art open models are far more demanding. Some of the largest need 160GB+ of VRAM even in compressed (quantized) format.
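
As a rough back-of-the-envelope check, the memory needed for the weights alone is approximately the parameter count times the bits per weight, divided by 8. The sketch below works through that arithmetic. The helper name and the 4-bit figure are illustrative assumptions: real quantized files use slightly more bits per weight on average, and the runtime needs additional memory for the context cache.

```python
def estimate_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# A 4-billion-parameter model at an assumed 4 bits per weight
print(estimate_weight_gb(4e9, 4))   # 2.0
# The same model stored in 16-bit precision
print(estimate_weight_gb(4e9, 16))  # 8.0
```

Treat the result as a lower bound when deciding whether a model fits on your machine.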

Installation

First, we need the Ollama daemon running. This acts as the backend server that manages the model weights and the inference compute.

On macOS, Homebrew is the cleanest method:

Terminal window
brew install --cask ollama

Once installed, open the Ollama application from your Applications folder. This starts the background service on localhost:11434.

Verify the service is running:

Terminal window
ollama --version
Output
ollama version is 0.15.5
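
The version command only confirms that the CLI is installed. To check that the background service is actually answering on localhost:11434, you can probe it over HTTP. This is a minimal sketch using only the standard library. The function name is made up for illustration, and it assumes the server replies to a plain GET on its root URL with status 200.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors
        return False


if is_ollama_running():
    print("Ollama is reachable")
else:
    print("Ollama is not reachable; start the app or run `ollama serve`")
```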

Download Your First Model

Now let's grab a capable model to work with. We'll use a small and fast model, perfect for getting started:

Terminal window
ollama pull gemma3:4b

This downloads Gemma 3 4B, a 4-billion-parameter model, to your machine. It'll take a few minutes depending on your internet speed.

Want more models?

Hundreds of models are available at the Ollama library for free. Be sure to check their licenses and hardware requirements.

Start Chatting

Time to test your new AI assistant. Start an interactive chat:

Terminal window
ollama run gemma3:4b

You'll see a >>> prompt. Ask it anything:

Terminal window
>>> Explain the difference between diesel and petrol engines in one sentence.
Output
Diesel engines ignite fuel through compression, while petrol (gasoline) engines
use a spark plug to ignite the air-fuel mixture, resulting in different
combustion processes and power outputs.

Useful commands:

  • ollama list - See all your downloaded models
  • /bye - Exit the chat session

Terminal window
ollama list
Output
NAME         ID              SIZE      MODIFIED
gemma3:4b    a2af6cc3eb7f    3.3 GB    30 sec ago
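
The same listing can be read from code, which is handy for checking that a model is present before a script tries to use it. The sketch below assumes the server exposes the local model list as JSON at /api/tags with a top-level "models" array. The function names and the sample payload are illustrative, not taken from the tutorial's repository.

```python
import json
import urllib.request
from typing import Any, Dict, List


def installed_model_names(tags_payload: Dict[str, Any]) -> List[str]:
    """Extract model names from a /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Hypothetical payload in the shape the endpoint is assumed to return
sample = {"models": [{"name": "gemma3:4b", "size": 3_300_000_000}]}
print(installed_model_names(sample))                 # ['gemma3:4b']
print("gemma3:4b" in installed_model_names(sample))  # True
```

With the server running, installed_model_names(fetch_tags()) should return the names that ollama list prints.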

Project Setup

Want to follow along? You can find the complete code on GitHub: MLExpert Academy repository

Clone the repository and install the dependencies:

Terminal window
git clone git@github.com:mlexpertio/academy.git
cd academy/toolkit
uv sync

Confirm that your Ollama instance is still running. Now you're ready to run AI models locally.

Basic Chat

The command line is useful for quick sanity checks, but production systems are built in code. The Ollama Python SDK [3] provides a programmatic interface to interact with your local models.

Let's send your first request to the model:

toolkit/local-ai/basic_chat.py
import ollama

MODEL = "gemma3:4b"
prompt = "Explain the difference between diesel and petrol engines in one sentence."

# Send a single-turn chat request to the local Ollama server
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
)

# The generated text lives in the message's content field
print(response.message.content)

Code Breakdown:

  • model: Identifies which local model to load into memory. If the model is not found locally, the library will raise an error.
  • messages: The standard chat format. You pass a list of dictionaries containing the role (user/assistant/system) and content.
  • response: Returns an object containing metadata (evaluation duration, token counts) and the actual message content in message.content.
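
Because messages is just a list of role/content dictionaries, a multi-turn conversation is built by appending each new turn to that list. The sketch below shows one way to manage such a history. The helper names are made up for illustration, and the role check is limited to the three roles named above.

```python
from typing import Dict, List

Message = Dict[str, str]


def new_conversation(system_prompt: str) -> List[Message]:
    """Start a history with a system message that sets the model's behavior."""
    return [{"role": "system", "content": system_prompt}]


def add_message(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more message appended."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unknown role: {role}")
    return history + [{"role": role, "content": content}]


history = new_conversation("You are a concise automotive expert.")
history = add_message(history, "user", "What ignites the fuel in a diesel engine?")
history = add_message(history, "assistant", "Compression heats the air enough to ignite it.")
history = add_message(history, "user", "And in a petrol engine?")

for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

Passing history as the messages argument of ollama.chat, then appending the reply with the assistant role, keeps the model aware of earlier turns.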

Run the script:

Terminal window
uv run local-ai/basic_chat.py
Output
Diesel engines ignite fuel through compression, while petrol (gasoline) engines use a spark plug to ignite the fuel-air mixture, resulting in different combustion processes and power outputs.

Cold Starts

The first time you run this script, it may take some time as Ollama loads the model weights from disk into memory (VRAM, or system RAM if you have no GPU). Subsequent runs will be faster because the model stays loaded for a keep-alive period (5 minutes by default).
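
The response metadata makes a cold start visible. The sketch below assumes the timing counters are reported in nanoseconds under the names load_duration, total_duration, eval_count, and eval_duration. The helper name and the sample numbers are made up for illustration.

```python
from typing import Dict


def summarize_timings(stats: Dict[str, int]) -> Dict[str, float]:
    """Convert nanosecond counters into seconds and tokens per second."""
    ns = 1e9
    eval_seconds = stats["eval_duration"] / ns
    return {
        "load_s": stats["load_duration"] / ns,
        "total_s": stats["total_duration"] / ns,
        "tokens_per_s": stats["eval_count"] / eval_seconds if eval_seconds else 0.0,
    }


# Made-up numbers for illustration: a cold start dominated by model loading
cold = {
    "load_duration": 4_000_000_000,
    "total_duration": 6_000_000_000,
    "eval_count": 50,
    "eval_duration": 2_000_000_000,
}
print(summarize_timings(cold))
# {'load_s': 4.0, 'total_s': 6.0, 'tokens_per_s': 25.0}
```

On a warm run, the load time should shrink to a small fraction of the total.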

Note that the response can differ from the one you got in the command-line chat, because sampling is random by default. Deterministic responses are useful for testing and debugging, so a later section shows how to get them.

Streaming Responses

In a production application, latency is the enemy. If a model takes 5 seconds to generate a paragraph, waiting for the entire completion before showing anything makes your application feel broken.

To fix this, we use streaming. Instead of waiting for the full response, we consume the output token-by-token as it is generated. This cuts perceived latency to the time to first token (TTFT), because the user starts reading while the rest is still being generated:

toolkit/local-ai/stream_chat.py
import ollama

MODEL = "gemma3:4b"
prompt = "Write a 3 lines poem about Lexus (the car brand)."

# stream=True makes chat() return an iterator of partial responses
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

# Print each chunk as soon as it arrives, without a trailing newline
for chunk in response:
    print(chunk.message.content, end="", flush=True)

Code Breakdown:

  • stream=True: Changes the return type from a single ChatResponse object to a Python generator that yields partial ChatResponse chunks as they are produced.

Run the script:

Terminal window
uv run local-ai/stream_chat.py
Output
Here's a 3-line poem about Lexus:
A whisper of leather, a silent grace,
Lexus moves with a sophisticated pace,
Luxury and comfort, in a timeless space.

You will see the text appear in real-time, simulating the typing effect seen in ChatGPT or Claude. This is the standard pattern for all user-facing AI applications.
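
In practice you usually want the full text as well as the live output, for example to store the reply in a chat history. The sketch below prints chunks as they arrive, joins them, and records the time to first token. The function name is illustrative, and the stand-in chunks only mimic the chunk.message.content shape used above so the example runs without a model.

```python
import time
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, Optional[float]]:
    """Print chunks as they arrive; return the full text and time to first token."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        text = chunk.message.content
        if text and first_token_at is None:
            # Seconds elapsed until the first non-empty piece of text
            first_token_at = time.perf_counter() - start
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts), first_token_at


# Stand-in chunks so the sketch runs without a model
fake = [SimpleNamespace(message=SimpleNamespace(content=t)) for t in ["Hel", "lo", "!"]]
full_text, ttft = collect_stream(fake)
```

With a real model, the iterator returned by ollama.chat(..., stream=True) would take the place of fake.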

Customizing Model Parameters

By default, LLMs are non-deterministic. If you ask the same question twice, you might get different answers. In the real world, you need results that are consistent and repeatable.

To get deterministic results, you can pass options to the ollama.chat function:

toolkit/local-ai/deterministic_chat.py
import ollama

MODEL = "gemma3:4b"
prompt = "Output the 3 best modern Ferrari V8 cars in order of preference. Reply with just the names."

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    options={
        "temperature": 0.0,  # remove sampling randomness
        "num_ctx": 2048,     # context window size in tokens
        "seed": 42,          # fixed seed for repeatable sampling
    },
)
print(response.message.content)

Code Breakdown:

  • temperature: Controls how much randomness goes into picking the next token. Set it to 0.0 for the most deterministic output (analytical tasks, code generation, JSON extraction). Increase it to 0.7 or higher for creative writing.
  • seed: Sets the random number generator seed. With a fixed integer (like 42), the model generates the same response for the same prompt and options, which is essential for testing your prompts.
  • num_ctx: The context window size (in tokens). This is the model's "short-term memory." The default is 4096. If you are processing long documents (RAG), you must increase this value, but be aware that higher values consume more VRAM.

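
To build intuition for temperature, the sketch below applies the textbook temperature-scaled softmax to a few made-up token scores. It is an illustration of the general technique, not Ollama's internal sampling code. Treating a temperature of 0 as "always pick the top score" is an assumption made here to avoid dividing by zero.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumed behavior at 0: all probability on the highest score (greedy)
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token toward the alternatives, which is why higher values produce more varied text.
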
Repeatable Results

Getting deterministic results from local Ollama models is relatively easy. With cloud-based models like Google Gemini or OpenAI GPT, it can be close to impossible.

Run it:

Terminal window
uv run local-ai/deterministic_chat.py
Output
488 GT3
488 Pista
SF90 Stradale

Because we set temperature: 0.0 and a fixed seed, you should get the same list every time you run the script on the same machine and model version. This predictability is the foundation of engineering stable AI systems.
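
Rather than comparing outputs by eye, you can check repeatability in code. The sketch below calls a text-generating function several times and reports whether every output matched. The helper name is illustrative, and the stand-in lambdas replace real model calls so the example runs on its own.

```python
import itertools
from typing import Callable


def is_repeatable(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Call generate(prompt) several times and report whether all outputs match."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1


# Stand-in generators so the sketch runs without a model
counter = itertools.count()
print(is_repeatable(lambda p: "488 Pista", "any prompt"))                # True
print(is_repeatable(lambda p: f"answer {next(counter)}", "any prompt"))  # False
```

To test a real configuration, pass a function that wraps the ollama.chat call with your options and returns response.message.content.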

Common Pitfalls

  • Connection refused on http://localhost:11434 - Ollama isn't running. Start the macOS app, or run ollama serve in a separate terminal.
  • model 'gemma3:4b' not found, try pulling it first - Run ollama pull gemma3:4b before invoking the model. The Python SDK doesn't auto-download.
  • Same prompt returns different responses despite temperature=0.0 - You also need a fixed seed. Pass both inside options: options={"temperature": 0, "seed": 42}.
  • Out of memory or extremely slow on CPU - 7B+ models need a GPU or 16GB+ RAM. Drop to a smaller model on older laptops: ollama pull qwen3:0.6b.
  • Empty or truncated response on long prompts - num_ctx defaults to 4096. For RAG or long inputs, raise it to 8192 or 16384, but watch VRAM.
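
To see why raising num_ctx costs memory, the sketch below estimates the size of the key/value cache, which grows linearly with the context length. It uses the generic formula for full attention: two tensors per layer, each of size KV heads times head dimension times context length. The model shape is hypothetical, and architectures that use sliding-window or other attention variants will need less than this.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Approximate key/value cache size for one sequence with full attention."""
    # Factor of 2: one tensor for keys and one for values in every layer
    return 2 * num_layers * num_kv_heads * head_dim * num_ctx * bytes_per_value


# Hypothetical model shape, not the real dimensions of any specific model
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(num_ctx=ctx, **shape) / 2**30
    print(f"num_ctx={ctx:>5}: {gib:.2f} GiB")
# num_ctx= 4096: 0.50 GiB
# num_ctx= 8192: 1.00 GiB
# num_ctx=16384: 2.00 GiB
```

Doubling num_ctx doubles this cache, on top of the memory already taken by the weights.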

Conclusion

By setting up Ollama and connecting it to Python, you have unlocked:

  • Infrastructure Independence: You can build and test AI features without an internet connection or a credit card.
  • Rapid Prototyping: You can iterate on complex pipelines without hitting rate limits or incurring per-token costs.
  • Privacy-First Architecture: You can now process sensitive data that would otherwise be blocked by compliance or security policies.

This local environment is the foundation for the rest of the Academy. You will use this setup to build RAG pipelines, autonomous agents, and fine-tuning workflows.

However, having the engine running is only the first step. To get production-grade outputs from these models, you need to know how to speak their language.

Next, we move to Prompt Engineering for Engineers, where you will learn how to structure prompts to get the most out of your model.

Footnotes

  1. Ollama's Docs

  2. Ollama's GitHub Repository

  3. Ollama Python SDK