In the last few years, large language models (LLMs) have become the brain behind many modern tools — from ChatGPT to Gemini and Claude. They help us code faster, summarize text, and even generate entire articles.
But what if you could run one of these models on your own machine, completely offline, without sending a single byte of your data to the cloud?
That’s exactly what Ollama allows you to do.
In this post, I’ll show you how to deploy your own LLM locally using Ollama on Docker, step by step — from installation to using it via API. We’ll also explore why it can be a game-changer for privacy, experimentation, and control.
🚀 Why Run an LLM Locally?
While cloud-based LLMs like ChatGPT or Gemini are powerful and easy to use, they come with trade-offs:
- 💾 Data Privacy: Anything you send to those services is processed in the cloud. Running your own model locally ensures that all your prompts, code, and data stay on your machine.
- ⚙️ Customization: You can tweak system prompts, memory, or even fine-tune models without limitations.
- 🔒 Offline Access: No internet? No problem. You can still use the model.
- 💸 Cost Control: No API fees or subscriptions — just your local hardware doing the work.
This approach is perfect for developers, researchers, and hobbyists who want to experiment safely with AI on their own terms.
🧩 What Is Ollama?
Ollama is a lightweight runtime that lets you run open-source LLMs (like Llama 3, Mistral, Phi, or Gemma) with a single command.
It exposes a REST API compatible with the OpenAI format, meaning you can integrate it easily with existing tools or scripts.
Think of it as “Docker for models” — you pull, run, and interact with them locally.
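If you already know Docker, the workflow will feel familiar. Here's a rough sketch of the core Ollama commands (the same ones we'll run inside the container later):

ollama pull mistral    # download a model (like docker pull)
ollama run mistral     # start an interactive session (like docker run)
ollama list            # see what you have locally (like docker images)
ollama rm mistral      # delete a model (like docker rmi)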
⚙️ Requirements
Before starting, make sure you have:
- Docker installed (docker --version)
- At least 8 GB of RAM (16 GB recommended)
- Disk space: models range from 2 GB to 15 GB
- Optional: NVIDIA GPU (for better performance)
🧱 Step 1: Run Ollama in Docker
Open a terminal and pull the official Ollama image:
docker pull ollama/ollama
Then start the container:
docker run -d \
--name ollama \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
- The volume ollama:/root/.ollama stores your downloaded models.
- The port 11434 exposes Ollama’s REST API locally.
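Before moving on, you can quickly confirm the API is reachable by hitting the root endpoint; it should answer with a short "Ollama is running" message:

curl http://localhost:11434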
If you have an NVIDIA GPU, add --gpus=all to use hardware acceleration.
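For reference, the GPU-enabled variant would look like this (it assumes the NVIDIA Container Toolkit is already installed on the host):

docker run -d --gpus=all \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama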
🧠 Step 2: Pull a Model
Once the container is up, let’s download a model.
For this example, we’ll use Mistral, a solid open-source model known for good reasoning and small size (~4 GB):
docker exec -it ollama ollama pull mistral
You can also try Llama 3 or Gemma later.
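If you do, the commands follow the same pattern (these are the model names used in the Ollama library at the time of writing):

docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama pull gemma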
Check your installed models:
docker exec -it ollama ollama list
💬 Step 3: Chat in the CLI
Start a local chat session:
docker exec -it ollama ollama run mistral
Example:
>>> Hello, what can you do?
I can summarize text, answer questions, or help you write code — all locally on your machine!
To exit, type /bye or press Ctrl + D.
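You can also skip the interactive session entirely and pass a prompt as an argument for a one-off answer:

docker exec -it ollama ollama run mistral "Explain Docker volumes in one sentence."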
🌐 Step 4: Use the API
Ollama’s REST API listens on port 11434. Its native text-generation endpoint is http://localhost:11434/api/generate, and it also offers an OpenAI-compatible API under /v1 (we’ll use that one from Python in Step 5).
Let’s test it with curl:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a haiku about Docker.",
  "stream": false
}'
Setting "stream": false returns one complete JSON object instead of a stream of partial chunks. You’ll get a response similar to this (alongside some metadata fields):
{"response":"Containers afloat / Isolation in motion / Cloud in a small box"}
🧩 Step 5: Connect from Python
You can even use the OpenAI client library — just point it to Ollama’s local endpoint:
pip install openai
Then create a small Python script:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # the key is required by the client but ignored by Ollama
response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what Docker is in one sentence."}
    ]
)
print(response.choices[0].message.content)
Run it, and you’ll see the response generated locally — no external API involved.
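If you'd rather see tokens appear as they're generated (like the CLI does), the same client supports streaming. Here's a minimal sketch, assuming the same local endpoint and model as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by Ollama

# stream=True yields partial chunks instead of one final message
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "List three benefits of running an LLM locally."}],
    stream=True,
)

for chunk in stream:
    # each chunk carries a small piece of the generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()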
🧭 Best Practices
- 💾 Keep models on a dedicated volume: This avoids re-downloading large files every time you restart Docker.
- 🧹 Clean unused models: docker exec -it ollama ollama rm <model>
- ⚡ Optimize with GPU: Ollama automatically uses the GPU if available (CUDA or Metal).
- 🔐 Stay offline: If you want full privacy, disable internet access for the container. The model works entirely locally.
- 📊 Monitor resources: Some models can use several GB of RAM — use docker stats to watch usage.
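To keep an eye on memory and CPU while a model is loaded, a one-shot snapshot of the container's usage looks like this:

docker stats ollama --no-stream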
🔒 Which Model Is 100% Private?
If privacy is your top priority, I recommend Mistral 7B:
- ✅ Open-source and licensed for local use
- ✅ Excellent performance on general tasks
- ✅ Does not send data anywhere
- ✅ Works well even without GPU
- ⚖️ Around 4 GB on disk
You can pull it with:
docker exec -it ollama ollama pull mistral
This model runs entirely on your machine — no telemetry, no cloud, no data collection.
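If you want to double-check what you pulled, Ollama can print the model's details (parameters, template, and license) straight from your local copy:

docker exec -it ollama ollama show mistral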
🧩 Bonus: Expose Ollama on Your Local Network
If you want to connect from another device on your LAN (e.g. from a laptop or tablet), the -p 11434:11434 mapping from Step 1 already publishes the API on all interfaces, but you can make the bind address explicit. Stop and remove the existing container first (docker rm -f ollama), then run:
docker run -d \
--name ollama \
-v ollama:/root/.ollama \
-p 0.0.0.0:11434:11434 \
ollama/ollama
Then access it from another device using your local IP, e.g.:
http://192.168.1.100:11434/api/generate
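From that second device you could test it with curl (192.168.1.100 is just an example; replace it with your machine's actual LAN IP):

curl http://192.168.1.100:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Hello from across the network!",
  "stream": false
}'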
✅ Conclusion
Running an LLM locally gives you control, privacy, and freedom.
With Ollama on Docker, you can deploy models like Mistral or Llama 3 in minutes, use them offline, and even integrate them into your own tools or scripts.
No subscriptions, no data leaks — just you, your machine, and your AI.
