🧠 Run Your Own LLM Locally with Ollama and Docker

In the last few years, large language models (LLMs) have become the brain behind many modern tools — from ChatGPT to Gemini and Claude. They help us code faster, summarize text, and even generate entire articles.
But what if you could run one of these models on your own machine, completely offline, without sending a single byte of your data to the cloud?

That’s exactly what Ollama allows you to do.

In this post, I’ll show you how to deploy your own LLM locally using Ollama on Docker, step by step — from installation to using it via API. We’ll also explore why it can be a game-changer for privacy, experimentation, and control.


🚀 Why Run an LLM Locally?

While cloud-based LLMs like ChatGPT or Gemini are powerful and easy to use, they come with trade-offs:

  • 💾 Data Privacy: Anything you send to those services is processed in the cloud. Running your own model locally ensures that all your prompts, code, and data stay on your machine.
  • ⚙️ Customization: You can tweak system prompts, memory, or even fine-tune models without limitations.
  • 🔒 Offline Access: No internet? No problem. You can still use the model.
  • 💸 Cost Control: No API fees or subscriptions — just your local hardware doing the work.

This approach is perfect for developers, researchers, and hobbyists who want to experiment safely with AI on their own terms.


🧩 What Is Ollama?

Ollama is a lightweight runtime that lets you run open-source LLMs (like Llama 3, Mistral, Phi, or Gemma) with a single command.
It exposes a REST API compatible with the OpenAI format, meaning you can integrate it easily with existing tools or scripts.

Think of it as “Docker for models” — you pull, run, and interact with them locally.


⚙️ Requirements

Before starting, make sure you have:

  • Docker installed (docker --version)
  • At least 8 GB of RAM (16 GB recommended)
  • Disk space: models range from 2 GB to 15 GB
  • Optional: NVIDIA GPU (for better performance)
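If you’d like to sanity-check these requirements from a script, here’s a minimal Python sketch using only the standard library. The 20 GB threshold is an arbitrary comfort margin I picked for a couple of models plus headroom, not an Ollama requirement:

import shutil

# Check that the Docker CLI is on PATH.
docker_path = shutil.which("docker")
print("Docker CLI:", docker_path or "not found -- install Docker first")

# Check free disk space (models range from roughly 2 GB to 15 GB each).
total, used, free = shutil.disk_usage("/")
print(f"Free disk space: {free / 1e9:.1f} GB")
if free < 20e9:  # arbitrary margin, adjust to taste
    print("Consider freeing some space before pulling large models.")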

🧱 Step 1: Run Ollama in Docker

Open a terminal and pull the official Ollama image:

docker pull ollama/ollama

Then start the container:

docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

  • The volume ollama:/root/.ollama stores your downloaded models.
  • The port 11434 exposes Ollama’s REST API locally.

If you have an NVIDIA GPU, add --gpus=all to use hardware acceleration.
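Before moving on, you can check that the API is reachable. Here’s a minimal sketch in Python using only the standard library; it simply prints whatever the server answers on its root URL (normally a short status message):

import urllib.request

# If the container is up, the root URL answers with a short status message.
with urllib.request.urlopen("http://localhost:11434") as resp:
    print(resp.status, resp.read().decode())

A 200 status means the API is ready for the next steps.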


🧠 Step 2: Pull a Model

Once the container is up, let’s download a model.
For this example, we’ll use Mistral, a solid open-source model known for good reasoning and small size (~4 GB):

docker exec -it ollama ollama pull mistral

You can also try Llama 3 or Gemma later.

Check your installed models:

docker exec -it ollama ollama list
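The same list is available over the API: Ollama’s /api/tags endpoint returns the locally installed models as JSON. A small Python sketch:

import json
import urllib.request

# GET /api/tags lists the models stored in the Ollama volume.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size_gb = model.get("size", 0) / 1e9
    print(f"{model['name']:<20} {size_gb:.1f} GB")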

💬 Step 3: Chat in the CLI

Start a local chat session:

docker exec -it ollama ollama run mistral

Example:

>>> Hello, what can you do?
I can summarize text, answer questions, or help you write code — all locally on your machine!

To exit, press Ctrl + C.


🌐 Step 4: Use the API

Ollama exposes its own REST API at http://localhost:11434/api/generate, plus an OpenAI-compatible endpoint under http://localhost:11434/v1 (we’ll use that one from Python in Step 5).

Let’s test it with curl. Adding "stream": false makes the server return one complete JSON object instead of streaming chunks:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a haiku about Docker.",
  "stream": false
}'

You’ll get back a JSON object whose response field holds the generated text, similar to:

{"response":"Containers afloat / Isolation in motion / Cloud in a small box"}

🧩 Step 5: Connect from Python

You can even use the OpenAI client library — just point it to Ollama’s local endpoint:

pip install openai

Then create a small Python script:

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint lives under /v1.
# The api_key is required by the client but ignored by Ollama, so any string works.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what Docker is in one sentence."}
    ]
)

print(response.choices[0].message.content)

Run it, and you’ll see the response generated locally — no external API involved.
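The OpenAI client can also stream the reply token by token, which feels much more responsive for long answers. A minimal variation on the script above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# stream=True yields chunks as the model generates them.
stream = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()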


🧭 Best Practices

  • 💾 Keep models on a dedicated volume: This avoids re-downloading large files every time you restart Docker.
  • 🧹 Clean unused models: docker exec -it ollama ollama rm <model>
  • ⚡ Optimize with GPU: Ollama uses a GPU automatically when it can reach one; in Docker that means an NVIDIA card with --gpus=all (Metal acceleration applies to the native macOS app, not the container).
  • 🔐 Stay offline: If you want full privacy, disable internet access for the container. The model works entirely locally.
  • 📊 Monitor resources: Some models need several GB of RAM; use docker stats to watch usage, or ask Ollama itself which models are loaded (see the sketch below).
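Beyond docker stats, recent Ollama releases expose a GET /api/ps endpoint that reports which models are currently loaded in memory. A hedged sketch (field names can vary between versions, so it falls back to printing the raw payload):

import json
import urllib.request

# GET /api/ps reports the models currently loaded and their memory footprint
# (available in recent Ollama releases; older versions may return 404).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

if data.get("models"):
    for model in data["models"]:
        print(f"{model.get('name', '?')}: ~{model.get('size', 0) / 1e9:.1f} GB in memory")
else:
    print(json.dumps(data, indent=2))  # nothing loaded, or a different schema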

🔒 Which Model Is 100% Private?

If privacy is your top priority, I recommend Mistral 7B:

  • ✅ Open-source and licensed for local use
  • ✅ Excellent performance on general tasks
  • ✅ Does not send data anywhere
  • ✅ Works well even without GPU
  • ⚖️ Around 4 GB on disk

You can pull it with:

docker exec -it ollama ollama pull mistral

This model runs entirely on your machine — no telemetry, no cloud, no data collection.
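If you want to confirm what the model ships with, Ollama can show a model’s metadata, including its bundled license text, entirely offline through its /api/show endpoint. A small sketch; the exact fields returned vary by model and Ollama version, and older releases expect "name" instead of "model" in the request:

import json
import urllib.request

# POST /api/show returns a model's metadata (modelfile, parameters, license, ...).
# Older Ollama versions use "name" instead of "model" as the request key.
payload = json.dumps({"model": "mistral"}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    info = json.load(resp)

print("Fields:", ", ".join(info.keys()))
print((info.get("license") or "no license text bundled")[:300])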


🧩 Bonus: Expose Ollama on Your Local Network

If you want to connect from another device on your LAN (e.g. from a laptop or tablet), the port just needs to be published on all interfaces. The -p 11434:11434 mapping from Step 1 already does that by default; the explicit form below simply makes the intent obvious. If the container from Step 1 is still running, remove it first with docker rm -f ollama, then run:

docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama

Then access it from another device using your local IP, e.g.:

http://192.168.1.100:11434/api/generate
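From that other device, the Python script from Step 5 works unchanged; just point the client at your server’s LAN IP instead of localhost (192.168.1.100 is the example address above, so replace it with your machine’s actual IP):

from openai import OpenAI

# Same OpenAI-compatible endpoint, reached over the LAN instead of localhost.
client = OpenAI(base_url="http://192.168.1.100:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Are you reachable from the LAN?"}],
)
print(response.choices[0].message.content)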

✅ Conclusion

Running an LLM locally gives you control, privacy, and freedom.
With Ollama on Docker, you can deploy models like Mistral or Llama 3 in minutes, use them offline, and even integrate them into your own tools or scripts.

No subscriptions, no data leaks — just you, your machine, and your AI.
