Run AI Locally with Ollama
Get started with local AI development. Learn to install and use Ollama to run powerful AI models on your own machine for enhanced privacy, speed, and cost-efficiency.

Cloud-based AI APIs are powerful, but they're not always the right choice. Sending sensitive data to third-party servers raises privacy concerns, recurring API costs can become unpredictable, and network latency can slow down your applications. For many developers, these limitations are deal-breakers.
Ollama solves these problems by making it simple to run state-of-the-art Large Language Models directly on your own hardware. This open-source tool streamlines downloading, running, and managing LLMs, letting you serve models like Qwen and Llama from your laptop and effectively turning it into a private inference server.
This tutorial will get you up and running with Ollama. You'll learn to install the runtime, provision a model, and build a Python interface to control it. By the end, you will have a private, zero-API-cost AI stack running on your local machine, with no network round-trips.
Tutorial Goals
- Install and configure the Ollama daemon for local inference
- Manage model lifecycles - pull, run, and remove quantized models
- Programmatically control the LLM using the Python SDK
- Implement streaming responses for real-time user experiences
Why Run AI Models Locally?
Running AI models on your own infrastructure lets you "own the intelligence". It offers engineering advantages over cloud APIs:
- Data Sovereignty: Your data never leaves your metal. This is non-negotiable for healthcare data (HIPAA), legal documents, or proprietary codebases where API transmission constitutes a leak.
- Cost Predictability: Shift from variable OpEx (per-token billing) to fixed CapEx (hardware). Once the GPU is running, the marginal cost of experimentation is zero.
- System Reliability: Eliminate external dependencies. Your application runs regardless of internet connectivity, cloud provider outages, or sudden API deprecations.
- Developer Velocity: Iterate without rate limits. Test complex prompts and pipelines in a tight loop without waiting for quota resets or worrying about the bill.
Real World Considerations
While local inference has a lot of advantages, the most powerful models still live in the cloud. Most businesses will continue to use cloud APIs for their AI needs, since they often deliver the best possible results with less operational complexity.
How to Get Started with Ollama
Getting Ollama running takes just a few minutes. Let's install it, grab a model, and start chatting.
Hardware Requirements
Running AI models locally isn't exactly free. You need decent hardware (at least 16GB of (V)RAM) to run the models. Want to run state-of-the-art open models? Some of the larger ones need 160GB+ of VRAM even in compressed (quantized) format.
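As a rough sanity check, you can estimate how much memory a model needs: parameter count times bytes per parameter at the chosen quantization, plus some overhead for the runtime and KV cache. A minimal sketch (the 20% overhead factor is an assumption, not a measured value):
# Rough memory estimate for a quantized model.
def estimate_gb(params_billions: float, bits_per_param: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    return weight_bytes * overhead / 1e9  # gigabytes

# A 4B model at 4-bit quantization fits easily in 16GB of (V)RAM...
print(f"4B @ 4-bit:   ~{estimate_gb(4, 4):.1f} GB")
# ...while a 120B model at 4-bit already needs workstation- or server-class memory.
print(f"120B @ 4-bit: ~{estimate_gb(120, 4):.1f} GB")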
Installation
First, we need the Ollama daemon running. This acts as the backend server that manages the model weights and the inference compute.
On macOS, using Homebrew is the cleanest method:
brew install --cask ollama
Once installed, open the Ollama application from your Applications folder. This starts the background service on localhost:11434.
Verify the service is running:
ollama --version
ollama version is 0.13.0
Download Your First Model
Now let's grab a capable model to work with. We'll use a small and fast model, perfect for getting started:
ollama pull gemma3:4b
This downloads the 4-billion-parameter Gemma 3 model to your machine. It'll take a few minutes depending on your internet speed.
Want more models?
Hundreds of models are available at the Ollama library for free. Be sure to check their licenses and hardware requirements.
Start Chatting
Time to test your new AI assistant. Start an interactive chat:
ollama run gemma3:4b
You'll see a >>> prompt. Ask it anything:
>>> Explain the difference between diesel and petrol engines in one sentence.
Diesel engines ignite fuel through compression, while petrol (gasoline) engines
use a spark plug to ignite the air-fuel mixture, resulting in different
combustion processes and power outputs.
Useful commands:
- ollama list - See all your downloaded models
- /bye - Exit the chat session
ollama list
NAME         ID              SIZE      MODIFIED
gemma3:4b    a2af6cc3eb7f    3.3 GB    30 sec ago
Project Setup
Want to follow along? You can find the complete code on GitHub: MLExpert Academy repository
Clone the repository:
git clone https://github.com/mlexpertio/academy.git
Navigate to the project directory:
cd academy
And install the dependencies:
uv sync
Confirm that your Ollama instance is still running. Now you're ready to run AI models locally.
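If you want to double-check from Python that the daemon is reachable, you can hit its HTTP endpoint directly. A minimal sketch, assuming the default address of http://localhost:11434 (the root path normally replies with a short status message):
import urllib.request

# The Ollama daemon listens on localhost:11434 by default.
with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
    # Typically prints: 200 Ollama is running
    print(resp.status, resp.read().decode())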
Basic Chat
The command line is useful for quick sanity checks, but production systems are built in code. The Ollama Python SDK provides a programmatic interface to interact with your local models.
Let's send your first request to the model:
import ollama
MODEL = "gemma3:4b"
prompt = "Explain the difference between diesel and petrol engines in one sentence."
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
)
print(response.message.content)
Code Breakdown:
- model: Identifies which local model to load into memory. If the model is not found locally, the library will raise an error (see the sketch after this breakdown for one way to handle that).
- messages: The standard chat format. You pass a list of dictionaries containing the role (user/assistant/system) and content.
- response: Returns an object containing metadata (evaluation duration, token counts) and the actual message content in message.content.
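Here is one way to handle the missing-model case instead of letting the error propagate: catch the SDK's exception and pull the weights on demand. A minimal sketch, assuming ollama.ResponseError and ollama.pull behave as in recent SDK versions:
import ollama

MODEL = "gemma3:4b"
messages = [{"role": "user", "content": "Say hello in one short sentence."}]

try:
    response = ollama.chat(model=MODEL, messages=messages)
except ollama.ResponseError:
    # The model weights are not on disk yet: download them, then retry once.
    ollama.pull(MODEL)
    response = ollama.chat(model=MODEL, messages=messages)

print(response.message.content)
For long-running services you would typically pull models ahead of time instead, so the first request never pays the download cost.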
Run the script:
uv run local_ai/basic_chat.py
Diesel engines ignite fuel through compression, while petrol (gasoline) engines use a spark plug to ignite the fuel-air mixture, resulting in different combustion processes and power outputs.
Cold Starts
The first time you run this script, it may take some time as Ollama loads the model weights from disk into VRAM. Subsequent runs will be faster as the model remains cached in memory for a default keep-alive period (usually 5 minutes).
Note that the responses are not identical to the ones from the command-line chat. Deterministic responses are useful for testing and debugging, so we'll see how to achieve that in a later section.
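Before moving on, it is worth knowing that the response object also carries timing and token-count metadata, which makes quick benchmarking (including the cold-start effect above) easy. A minimal sketch; the field names (eval_count, eval_duration, reported in nanoseconds) follow recent SDK versions and may differ in older ones:
import ollama

response = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Name three classic V8 engines."}],
)

# eval_count = tokens generated, eval_duration = generation time in nanoseconds.
tokens = response.eval_count
seconds = response.eval_duration / 1e9
print(f"{tokens} tokens in {seconds:.2f}s ({tokens / seconds:.1f} tokens/sec)")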
Streaming Responses
In a production application, latency is the enemy. If a model takes 5 seconds to generate a paragraph, waiting for the entire completion before showing anything makes your application feel broken.
To fix this, we use streaming. Instead of waiting for the full response, we consume the output token by token as it is generated. This reduces the perceived latency (the time to first token) to near zero:
import ollama
MODEL = "gemma3:4b"
prompt = "Write a 3 lines poem about Lexus (the car brand)."
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in response:
    print(chunk.message.content, end="", flush=True)
Code Breakdown:
- stream=True: Changes the return type from a ChatResponse object to a Python generator.
uv run local_ai/stream_chat.py
Here's a 3-line poem about Lexus:
A whisper of leather, a silent grace,
Lexus moves with a sophisticated pace,
Luxury and comfort, in a timeless space.
You will see the text appear in real time, simulating the typing effect seen in ChatGPT or Claude. This is the standard pattern for all user-facing AI applications.
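To put a number on that improvement, you can time how long the first chunk takes to arrive compared to the full completion. A minimal sketch reusing the same streaming call:
import time

import ollama

start = time.perf_counter()
first_token_at = None
parts = []

stream = ollama.chat(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Write a 3 lines poem about Lexus (the car brand)."}],
    stream=True,
)

for chunk in stream:
    if first_token_at is None:
        # Time to first token: what the user actually perceives as "waiting".
        first_token_at = time.perf_counter() - start
    parts.append(chunk.message.content)

total = time.perf_counter() - start
print(f"\nFirst token after {first_token_at:.2f}s, full response after {total:.2f}s")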
Customizing Model Parameters
By default, LLMs are non-deterministic. If you ask the same question twice, you might get different answers. In the real world, you need results that are consistent and repeatable.
To get deterministic results, you can pass options to the ollama.chat function:
import ollama
MODEL = "gemma3:4b"
prompt = "Output the 3 best modern Ferrari V8 cars in order of preference. Reply with just the names."
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    options={
        "temperature": 0.0,
        "num_ctx": 2048,
        "seed": 42,
    },
)
print(response.message.content)
Code Breakdown:
- temperature: Controls how randomly the next token is sampled. Set to 0.0 for determinism (analytical tasks, code generation, JSON extraction). Increase to 0.7 or higher for creative writing.
- seed: Sets the random number generator seed. Using a fixed integer (like 42) lets the model generate the exact same response for the same prompt, which is essential for testing your prompts.
- num_ctx: The context window size (in tokens). This is the model's "short-term memory." The default is 4096. If you are processing long documents (RAG), you must increase this value, but be aware that higher values consume more VRAM.
Repeatable Results
While it is relatively easy to get deterministic results with Ollama models, doing the same with cloud-based models like Google Gemini or OpenAI GPT can be close to impossible.
Run it:
uv run local_ai/deterministic_chat.py
488 GT3
488 Pista
SF90 Stradale
Because we set temperature: 0.0 and a fixed seed, you will get this exact list every time you run the script. This predictability is the foundation of engineering stable AI systems.
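A quick way to verify this on your own machine is to send the identical request twice and compare the outputs; with temperature 0.0 and a fixed seed they should match exactly on the same machine and model build:
import ollama

MODEL = "gemma3:4b"
prompt = "Output the 3 best modern Ferrari V8 cars in order of preference. Reply with just the names."
options = {"temperature": 0.0, "num_ctx": 2048, "seed": 42}


def ask() -> str:
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        options=options,
    )
    return response.message.content


# Two identical requests should produce identical answers.
print(ask() == ask())  # expected: True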
Conclusion
By setting up Ollama and connecting it to Python, you have unlocked:
- Infrastructure Independence: You can build and test AI features without an internet connection or a credit card.
- Rapid Prototyping: You can iterate on complex pipelines without hitting rate limits or incurring per-token costs.
- Privacy-First Architecture: You can now process sensitive data that would otherwise be blocked by compliance or security policies.
This local environment is the foundation for the rest of the Academy. You will use this setup to build RAG pipelines, autonomous agents, and fine-tuning workflows.
However, having the engine running is only the first step. To get production-grade outputs from these models, you need to know how to speak their language.
Next, we move to Prompt Engineering for Engineers, where you will learn how to write and structure prompts that draw out your model's full potential.