Beyond the Basics: Optimization and Integration
You've completed Ollama Basics. You have Mistral or Llama2 running, you've generated some responses, and you've tested the REST API. Now: how do you make it better?
This tutorial is for people comfortable with Ollama fundamentals who want to squeeze maximum performance, understand the tuning knobs, and integrate seamlessly with OpenClaw. We're going deep.
Where You Left Off (Quick Review)
You have:
- ✓ Ollama installed and running as a service
- ✓ At least one 7B model downloaded (Mistral recommended)
- ✓ REST API working (tested with curl)
- ✓ Baseline performance metrics (10–15 tokens/sec on a typical CPU)
If you don't have these, go back to Tutorial 1 (Basics) first. This tutorial assumes you're comfortable with Ollama's fundamentals.
What You'll Learn (Advanced)
In this 90-minute tutorial:
- Parameters: Temperature, top_p, context windows—what they do and how to tune them
- Use-Case Tuning: Different settings for different tasks (code generation vs creative writing vs structured output)
- Benchmarking: Objective methods to measure quality and speed
- GPU Detection: Understand when GPUs help (spoiler: you probably don't need one for 7B models)
- Multiple Models: Run models concurrently, queue requests, load balancing
- OpenClaw Integration: Point your bot to local Ollama for a fully private AI brain
- Optimization: Memory tuning, CPU optimization, inference speed tuning
- Monitoring: Health checks, logs, continuous operation
- Troubleshooting: Common issues and how to fix them
Your Hardware
This tutorial works on any modern hardware capable of running Ollama. Here's what to keep in mind:
- CPU: 4+ cores (8+ cores recommended for smooth performance)
- RAM: 16GB minimum, 32GB recommended (more RAM = more models loaded simultaneously)
- Storage: SSD recommended (fast model loading)
- GPU: Optional — CPU-only inference works well for 7B models
Performance numbers in this tutorial are based on a mid-range multi-core CPU with 32GB RAM. Your results may vary depending on your specific hardware — adjust recommendations accordingly.
Why This Matters
For OpenClaw: A well-tuned local LLM can power a bot that feels just as smart as cloud-based solutions, but faster, cheaper, and completely private.
For Learning: Understanding parameters teaches you how language models actually work. You'll develop intuition for LLM behavior.
For Optimization: Squeezing another 20% performance from your hardware means faster responses and better user experience.
A Note on Complexity
This tutorial assumes you're comfortable with terminal commands and basic system administration. You don't need to be an expert, but "comfortable with Linux" is the baseline.
If something is unclear, go back to the Basics tutorial or skip to the sections that interest you. You don't have to do everything in order.
The Knobs That Control Your LLM
LLMs generate text one token at a time. But how they choose which token to generate is controlled by parameters. These are your tuning knobs. Understand them, and you control the model's behavior.
Temperature (Randomness)
Range: 0.0 to 2.0 (or higher)
Default: Usually 0.7 (balanced)
Temperature controls how "creative" or "random" the model gets:
- 0.0 (Cold, Deterministic): Always picks the most likely next token. Same input = identical output every time. Good for: Precise answers, code generation, structured output.
- 0.5 (Cool, Focused): Mostly predictable but with some variation. Good for: Professional writing, Q&A, technical content.
- 0.7 (Balanced, Default): A mix of creativity and consistency. Good for: General chat, creative but coherent responses.
- 1.2 (Warm, Creative): More randomness, more interesting responses. Good for: Brainstorming, creative writing, poetry.
- 1.5+ (Hot, Chaotic): Very random, often incoherent. Good for: Experimentation (usually a mistake).
Try it:
(Note: Ollama reads sampling parameters from the "options" object of the request body; it ignores them at the top level.)
# Cold (deterministic)
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Complete: The future of AI is",
"options": {"temperature": 0.0},
"stream": false
}' | jq -r '.response'
# Hot (creative)
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Complete: The future of AI is",
"options": {"temperature": 1.5},
"stream": false
}' | jq -r '.response'
Run both a few times and compare. At 0.0 you'll get essentially the same answer every run; at 1.5 the answers vary wildly, sometimes into incoherence.
Top-P (Nucleus Sampling)
Range: 0.0 to 1.0
Default: 0.9
Top-P is a more sophisticated diversity control than temperature. It works by:
- Model predicts probabilities for the next token
- Sort tokens by likelihood (highest first)
- Keep only the top tokens that sum to p% probability
- Randomly sample from that restricted set
In practice:
- 0.1: Ultra-focused (only the top 10% of likely tokens)
- 0.5: Focused (top 50% of likely tokens)
- 0.9: Balanced (top 90% of likely tokens, allows more creativity)
Works best with temperature: Top-P controls diversity among the remaining candidates after temperature does its work. Usually keep at 0.9 unless you're experimenting.
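The four-step process above can be sketched in plain Python. This is an illustrative toy, not Ollama's actual implementation (real inference engines do this over the full vocabulary in a single tensor operation):

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    probs: dict mapping token -> probability (should sum to ~1.0).
    Returns the surviving tokens with renormalized probabilities.
    """
    # Sort tokens by likelihood, highest first
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # nucleus is complete
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs, rng=random):
    """Randomly sample one token from the restricted set."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(probs, 0.9)  # drops the long tail
print(sorted(nucleus))              # "xylophone" is gone
```

Lowering p shrinks the nucleus: at p=0.5, only "cat" would survive here, which is why small top_p values feel so focused.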
Top-K
Range: 1 to infinity
Default: 40 (varies by Ollama version)
Top-K is the simpler cousin of top-P. It limits the model to only consider the K most likely tokens:
- K=1: Only consider the most likely token (extremely constrained)
- K=10: Consider top 10 most likely tokens
- K=40: Consider top 40 (good balance)
- K=100+: Very permissive (almost all tokens allowed)
Practical advice: Leave top-K at default and use temperature+top-P instead. Top-K is older and less intuitive than top-P.
Repeat Penalty
Range: 0.0 to 2.0
Default: 1.1
Repeat penalty prevents the model from repeating the same phrase over and over (common LLM failure mode):
- 1.0: No penalty (tokens can repeat freely)
- 1.1: Mild penalty (slight discouragement to repeat)
- 1.5: Strong penalty (heavily discourage repetition)
Use when: Model generates repetitive text ("the the the..." or "and and and...").
Default is good: Usually leave at 1.1.
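Mechanically, the penalty shrinks the scores (logits) of tokens that have already appeared, so repeats become less likely each time. A toy sketch following the common convention (divide positive logits, multiply negative ones); this illustrates the idea, not llama.cpp's exact code:

```python
def apply_repeat_penalty(logits, seen_tokens, penalty=1.1):
    """Discourage already-generated tokens.

    logits: dict token -> raw score. Positive scores are divided by the
    penalty and negative ones multiplied, so a repeat always loses score.
    """
    adjusted = dict(logits)
    for token in seen_tokens:
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "sat": -0.5}
# "the" was already generated: its score drops from 2.0 to ~1.33,
# so "cat" becomes the front-runner
print(apply_repeat_penalty(logits, seen_tokens=["the"], penalty=1.5))
```

At penalty=1.0 the function is a no-op, which matches the "no penalty" row above.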
Context Window (num_ctx)
Range: 128 to ~32000 tokens
Default: 2048
Context window is how many tokens the model can "see" when generating. It's like short-term memory:
- 256: Very short memory (forgets everything quickly)
- 2048: Good balance (remember ~1500 words of conversation)
- 4096: Long memory (remember ~3000 words)
- 8192: Very long (remember entire documents)
Tradeoff: Larger context = slower inference (more computation). On a typical multi-core CPU with a 7B model:
2048 tokens: ~15 tokens/sec (fast)
4096 tokens: ~12 tokens/sec (slightly slower)
8192 tokens: ~8 tokens/sec (noticeably slower)
16384+: ~3-5 tokens/sec (CPU gets saturated)
Recommendation: Use 2048 for chat. Use 4096 for document analysis. Skip 8192+ unless you really need the memory.
Prediction Tokens (num_predict)
Range: -1 (unlimited) to any positive number
Default: -1 (unlimited)
Maximum tokens to generate before stopping. Prevents runaway responses:
- -1: Unlimited (model stops when it feels done)
- 128: Stop after 128 tokens (~100 words)
- 512: Stop after 512 tokens (~400 words)
- 2048: Stop after 2048 tokens (full page)
Use when: You want predictable response lengths. Good for APIs where you need bounded latency.
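A small helper makes the bound explicit in code. This sketch builds the request body shape Ollama expects (per-request knobs like num_predict go inside the "options" object); the word-to-token ratio is a rough rule of thumb, not a guarantee:

```python
import json

def tokens_for_words(words):
    """Rough heuristic: English averages ~0.75 words per token."""
    return int(words / 0.75)

def bounded_request(prompt, model="mistral", max_tokens=512):
    """Build an /api/generate payload with a hard cap on response length.

    model, prompt and stream are top-level fields; num_predict and the
    other sampling knobs belong inside "options".
    """
    return {
        "model": model,
        "prompt": prompt,
        "options": {"num_predict": max_tokens},
        "stream": False,
    }

# Cap responses at roughly 100 words for bounded API latency
payload = bounded_request("Summarize HTTP status codes",
                          max_tokens=tokens_for_words(100))
print(json.dumps(payload, indent=2))
```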
Threads (num_thread)
Range: 1 to your CPU core count
Default: Auto-detect (all cores)
How many CPU cores Ollama uses. For example, on an 8-core CPU:
- 1-2: Slow, leaves CPU idle
- 8: All cores active (default, good)
- >8: With hyperthreading you can go higher (e.g., 16 logical threads on 8 cores), but the gains for inference are usually small
Leave it on auto. Ollama detects your CPU and uses all available threads.
Quick Parameter Summary
temperature → 0.0–2.0 (randomness, default 0.7)
top_p → 0.0–1.0 (diversity, default 0.9)
top_k → 1–∞ (token limit, default 40, skip it)
repeat_penalty → 0–2 (penalize repetition, default 1.1)
num_ctx → 128–32k (memory, default 2048)
num_predict → -1–∞ (max length, default -1)
num_thread → 1–cores (CPU threads, default auto)
Parameters That Fit Your Task
Now that you understand parameters, let's apply them. Different tasks need different settings. A Discord bot behaves differently than a creative writer's assistant. Let's build configurations for real scenarios.
Use Case 1: OpenClaw Integration (Structured Output)
Goal: Consistent, deterministic responses. OpenClaw needs predictable JSON or structured text.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Your prompt here",
"options": {
"temperature": 0.1,
"top_p": 0.9,
"num_predict": 2048,
"num_ctx": 4096
},
"stream": false
}'
Why these values:
- temperature: 0.1 → Focused, deterministic (reproducible output)
- top_p: 0.9 → Still allows valid variation, not robotic
- num_predict: 2048 → Reasonable max length for API
- num_ctx: 4096 → Enough memory for conversation context
Expected behavior: Responses are consistent. Same prompt = mostly same answer (good for testing and debugging).
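If OpenClaw expects strict JSON, you can go further than a low temperature: Ollama's generate endpoint accepts "format": "json", which constrains decoding to valid JSON. A hedged sketch (the retry helper is illustrative, not part of OpenClaw; only parse_or_none runs without a live Ollama):

```python
import json
import requests  # pip install requests

def parse_or_none(text):
    """Return parsed JSON, or None if the text isn't valid JSON."""
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

def ask_json(prompt, model="mistral", retries=2):
    """Request JSON-constrained output and parse it, retrying on bad parses."""
    payload = {
        "model": model,
        # Asking for JSON in the prompt itself improves results
        # even with format enforcement turned on.
        "prompt": prompt + "\nRespond in JSON.",
        "format": "json",
        "options": {"temperature": 0.1},
        "stream": False,
    }
    for _ in range(retries + 1):
        text = requests.post("http://localhost:11434/api/generate",
                             json=payload, timeout=120).json()["response"]
        parsed = parse_or_none(text)
        if parsed is not None:
            return parsed
    raise ValueError("model never produced valid JSON")
```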
Use Case 2: Creative Writing
Goal: Variety and creativity. You want different results each time, but still coherent.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a short story about...",
"options": {
"temperature": 1.2,
"top_p": 0.95,
"num_predict": 1000,
"num_ctx": 2048,
"repeat_penalty": 1.2
},
"stream": true
}'
Why these values:
- temperature: 1.2 → Creative, more varied outputs
- top_p: 0.95 → Very permissive vocabulary
- repeat_penalty: 1.2 → Prevent repetitive phrases
- stream: true → See creativity unfold in real-time
Expected behavior: Each run produces unique, interesting variations on the theme.
Use Case 3: Fast Inference (Real-Time Chat)
Goal: Speed matters. Users are waiting for a response. Trade some context for speed.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "neural-chat",
"prompt": "What is...",
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 512,
"num_ctx": 2048
},
"stream": true
}'
Why these values:
- num_predict: 512 → Shorter responses (faster generation)
- num_ctx: 2048 → Not too large (faster processing)
- stream: true → First token appears faster (perceived speed)
- model: neural-chat → Slightly faster than Mistral
Expected behavior: First token appears in <500ms, total response in 5-10 seconds.
Use Case 4: Code Generation
Goal: Correct, working code. Logic must be sound.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a Python function that...",
"options": {
"temperature": 0.3,
"top_p": 0.9,
"num_predict": 1024,
"num_ctx": 4096,
"repeat_penalty": 1.1
},
"stream": false
}'
Why these values:
- temperature: 0.3 → Focused on correct syntax
- num_ctx: 4096 → Understand full code context
- repeat_penalty: 1.1 → Avoid redundant code
Expected behavior: Code is syntactically correct and logically sound most of the time.
Use Case 5: Long-Form Document Analysis
Goal: Understand and analyze large texts. Need big context window.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Summarize this document:\n\n[LARGE TEXT HERE]",
"options": {
"temperature": 0.5,
"top_p": 0.9,
"num_predict": 1024,
"num_ctx": 8192
},
"stream": false
}'
Why these values:
- num_ctx: 8192 → Can see entire document (trade-off: slower)
- temperature: 0.5 → Faithful to source material
- stream: false → You're doing analysis, not interactive chat
Expected behavior: Accurate summaries that capture main points. Slower (expect 20–30 seconds), but comprehensive.
Testing Your Configuration
Don't just trust recommendations. Test with your own prompts:
#!/bin/bash
PROMPT="Write a haiku about programming"
echo "=== Configuration A ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "'"$PROMPT"'",
"options": {"temperature": 0.5},
"stream": false
}' | jq -r '.response'
echo ""
echo "=== Configuration B ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "'"$PROMPT"'",
"options": {"temperature": 1.0},
"stream": false
}' | jq -r '.response'
Run the same prompt with different configurations. Notice speed and quality differences. Build intuition.
Save Your Configs
Once you find good settings, save them as shell functions for reuse (plain aliases can't take a prompt argument cleanly):
# OpenClaw Config
ollama-openclaw() { curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"'"$1"'","options":{"temperature":0.1,"top_p":0.9},"stream":false}' | jq -r '.response'; }
# Creative Config
ollama-creative() { curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"'"$1"'","options":{"temperature":1.2,"top_p":0.95},"stream":false}' | jq -r '.response'; }
# Fast Config
ollama-fast() { curl -s http://localhost:11434/api/generate -d '{"model":"neural-chat","prompt":"'"$1"'","options":{"temperature":0.7,"num_predict":512},"stream":false}' | jq -r '.response'; }
Source the file and call them like ollama-fast "your prompt". Saves typing and keeps configs consistent.
Measure Speed and Quality Objectively
You can feel that a model is fast, but how fast exactly? How does Mistral compare to Llama2 on your hardware? Benchmarking gives you objective data to guide optimization decisions.
Metrics That Matter
Four key metrics for LLM inference:
- Time to First Token (TTFT): How long before the model starts responding. Goal: <500ms
- Tokens Per Second (TPS): Generation speed. Goal: 10-15 for CPU, 50+ for GPU
- Memory Used: RAM footprint during inference. Goal: <8GB for your setup
- Response Quality: Does the answer make sense? Subjective but important
Benchmark Script
Create a reusable benchmarking script:
#!/bin/bash
MODEL="${1:-mistral}"
PROMPT="Explain machine learning in 3 paragraphs"
echo "Benchmarking $MODEL..."
echo ""
# Run inference and capture timing
START=$(date +%s%N)
RESPONSE=$(curl -s http://localhost:11434/api/generate \
-d "{
\"model\": \"$MODEL\",
\"prompt\": \"$PROMPT\",
\"options\": {\"temperature\": 0.5},
\"stream\": false
}")
END=$(date +%s%N)
# Extract metrics (eval_duration is reported in nanoseconds)
TOTAL_TIME=$(( (END - START) / 1000000 ))  # milliseconds
TOKENS=$(echo "$RESPONSE" | jq -r '.eval_count')
TPS=$(echo "$RESPONSE" | jq -r '.eval_count / .eval_duration * 1000000000')
echo "Model: $MODEL"
echo "Total Time: ${TOTAL_TIME}ms"
echo "Tokens: $TOKENS"
printf "Tokens/Sec: %.2f\n" "$TPS"
echo ""
echo "Response:"
echo "---"
echo "$RESPONSE" | jq -r '.response'
echo "---"
Run it:
bash benchmark.sh mistral
bash benchmark.sh llama2
bash benchmark.sh neural-chat
Compare Models Side-by-Side
Test the same prompt across models to find your best balance of speed and quality:
Model Time Tokens TPS Quality
mistral 8.2s 82 10.0 Excellent
llama2 7.5s 75 10.0 Good
neural-chat 6.1s 85 13.9 Good
Neural Chat is fastest, Mistral is best quality. Your choice depends on your priority.
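You can build a table like this yourself from the timing fields Ollama returns on non-streaming responses: eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A sketch, shown here with synthetic numbers rather than live API calls:

```python
def tokens_per_second(resp):
    """Compute generation speed from an /api/generate response dict.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds, so TPS = tokens / (ns / 1e9).
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1_000_000_000)

def compare(results):
    """results: dict model -> response dict. Print a small comparison table."""
    print(f"{'Model':<14}{'Tokens':>8}{'TPS':>8}")
    for model, resp in results.items():
        print(f"{model:<14}{resp['eval_count']:>8}"
              f"{tokens_per_second(resp):>8.1f}")

# Synthetic numbers matching the table above
compare({
    "mistral":     {"eval_count": 82, "eval_duration": 8_200_000_000},
    "neural-chat": {"eval_count": 85, "eval_duration": 6_100_000_000},
})
```

In a real run you would feed compare() the parsed JSON from each model's /api/generate call.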
Monitor Resources During Benchmark
In another terminal, watch resource usage:
watch -n 0.5 'free -h && echo && top -n 1 -b | head -n 3'
Look for:
- Peak memory usage (should be <8GB)
- CPU utilization (should be 80%+ during inference)
- No thermal throttling (CPU frequency stays stable)
Understanding Hardware Acceleration (You Probably Don't Need It)
GPUs are much faster than CPUs for LLM inference. Many systems have an integrated GPU, but integrated graphics are typically too small to meaningfully accelerate LLMs. Let's understand when GPU acceleration actually helps.
Integrated vs Discrete GPUs
Most CPUs include an integrated GPU, but these are far too weak for LLM acceleration. For reference:
- Typical integrated GPU: ~1–2 TFLOPs
- RTX 4070: ~20 TFLOPs (10–20x more powerful)
- RTX 4090: ~82 TFLOPs (40–80x more powerful)
Integrated GPUs are orders of magnitude weaker than discrete gaming GPUs. For LLMs on integrated graphics, the CPU is actually competitive.
Check if Ollama Detects Your GPU
Ollama logs what hardware it's using at startup. On a systemd install, check the journal; if you run ollama serve manually, check its terminal output or your log file:
journalctl -u ollama | grep -i gpu
If you see GPU references, Ollama detected it. If not, it's using the CPU (the default, which is fine for 7B models).
When GPU Acceleration Helps
GPU acceleration is worth it if:
- Nvidia RTX GPU: 2080 Ti or newer (30x-60x speedup)
- AMD Radeon: RX 6700 or newer (10x-20x speedup)
- Apple Silicon: Built-in GPU (3x-5x speedup)
An integrated GPU? Not worth the complexity. CPU inference is simpler and nearly as fast.
Should You Upgrade Hardware?
For local LLMs, consider GPU if:
- You want to run 13B+ models (currently too slow on CPU)
- You need sub-second response times (production use)
- You want to run multiple concurrent inferences
For now, stick with CPU. A modern multi-core CPU is plenty fast for a single Ollama instance running 7B models.
Leverage Your RAM — Load Once, Use Instantly
With 16GB+ RAM and OLLAMA_KEEP_ALIVE set, you can keep multiple models loaded simultaneously and switch between them with zero cold-start delay. Different models for different tasks — all ready to go.
Check What's Currently Loaded
Before thinking about concurrent models, know what's already in memory:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:7b 845dbda0ea48 5.5 GB 100% CPU 28 minutes from now
mistral:latest f974a74358d6 5.0 GB 100% CPU 25 minutes from now
PROCESSOR shows whether your GPU is involved. On a CPU-only machine you'll see 100% CPU. On a machine with a compatible GPU (NVIDIA with CUDA, AMD with ROCm, or Apple Silicon), you'll see a split like 45%/55% CPU/GPU — higher GPU % means faster inference. The UNTIL column shows the keep-alive expiry for each loaded model.
Concurrent Model Memory Layout
On a 32GB system, loading your full stack looks like this:
OS + System: ~2GB (always used)
Ollama Runtime: ~1GB
qwen2.5:7b (q4_K_M): ~5.5GB — all-rounder, 128K context
mistral:latest: ~5.0GB — reasoning + code workhorse
llama3.2:3b: ~2.5GB — speed tier, quick queries
Reserve Buffer: ~16GB (available for phi4 or larger models)
────────────────────────────────
Total Used: ~16GB
Total Available: ~16GB
With OLLAMA_KEEP_ALIVE set to -1 or a long duration, all three models stay resident.
Switching between them is instant — the model is already warm in memory.
Practical Multi-Model Setup (2026)
A sensible three-tier stack for most use cases:
- qwen2.5:7b — Default tier. Complex reasoning, code, long documents (128K context). Best all-rounder.
- mistral:latest — Alternative for instruction-following and reliability. Battle-tested and predictable.
- llama3.2:3b — Speed tier. Quick lookups, simple tasks, when you want an instant answer.
- phi4 or gemma3:12b — Heavy reasoning (needs 16GB+). Pull when qwen2.5 isn't cutting it on complex problems.
Total disk for the first three: ~12GB. All loaded simultaneously at 32GB: ~16GB RAM. Comfortable headroom.
API Request Routing by Model
With multiple models loaded, route requests to the right model for the task:
# Complex task → Qwen2.5 for reasoning + 128K context
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5:7b","prompt":"Analyze this long document..."}'
# Quick factual lookup → Llama 3.2 3B for speed
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2:3b","prompt":"What does HTTP 429 mean?"}'
# Code review → Mistral for reliable instruction following
curl http://localhost:11434/api/generate \
-d '{"model":"mistral","prompt":"Review this function for bugs..."}'
If all three are loaded in memory (keep-alive), these requests complete without any model loading delay. Ollama queues concurrent requests and processes them in order, switching models transparently.
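In code, routing can be as simple as a lookup table. A minimal sketch (the task tiers and model names mirror this tutorial's stack; adjust them to yours):

```python
import requests  # pip install requests

# Task tier -> model, mirroring the three-tier stack above
MODEL_FOR_TASK = {
    "reasoning": "qwen2.5:7b",   # complex analysis, long context
    "quick":     "llama3.2:3b",  # fast factual lookups
    "code":      "mistral",      # reliable instruction following
}

def pick_model(task):
    """Choose a model for a task tier, falling back to the default tier."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["reasoning"])

def route(task, prompt):
    """Send the prompt to whichever model fits the task."""
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": pick_model(task),
                            "prompt": prompt,
                            "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]
```

Usage: route("quick", "What does HTTP 429 mean?") hits the 3B model, while route("reasoning", ...) goes to Qwen2.5.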
Pinning Models with OLLAMA_KEEP_ALIVE
The key to no-delay model switching is keeping models in memory. Set keep-alive system-wide via systemd:
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
Use -1 only if you have headroom: with -1, models never unload until Ollama restarts. On 16GB RAM with one 7B model loaded, that's fine. Loading a second 7B while the first is pinned will consume ~12GB — check ollama ps and your free RAM before pinning multiple large models permanently.
ollama ps is your dashboard — check it whenever you're curious about what's running.
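Besides the system-wide environment variable, keep_alive can also be set per request: /api/generate accepts a top-level keep_alive field. A sketch of the payload shape:

```python
def pinned_request(prompt, model="mistral", keep_alive="30m"):
    """Build a request that also refreshes how long the model stays loaded.

    keep_alive accepts durations like "30m" or "24h", 0 (unload
    immediately after the response), or -1 (stay loaded until
    Ollama restarts). It is a top-level field, not under "options".
    """
    return {
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    }

# Pin mistral in memory for the rest of this Ollama session
payload = pinned_request("warmup", keep_alive=-1)
```

A request with an empty prompt just loads the model under the given keep_alive, which makes this a handy pre-warming trick at boot.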
Connect OpenClaw to Your Local Ollama
This is the power play: use your local Ollama as the brain for OpenClaw. Your Discord bot will have a private, offline-capable LLM that you control completely. No API keys, no cloud dependency, no monthly bills.
Why Local Ollama + OpenClaw?
Privacy: Every message stays on your laptop. No cloud. No logs in someone else's data center.
Cost: Free. The Ollama infrastructure costs you literally nothing.
Speed: No network latency. Your bot responds instantly (well, as fast as token generation).
Control: You choose the model. You tune the parameters. You own the LLM.
Architecture Overview
Here's the data flow:
Discord User
↓ (sends message)
↓
[Discord Bot (OpenClaw)]
↓ (sends HTTP request)
↓
[Ollama API on localhost:11434]
↓ (inference)
↓
[LLM Model (Mistral, Llama2, etc)]
↓ (returns response)
↓
[Ollama API]
↓ (sends JSON back)
↓
[OpenClaw]
↓ (processes response)
↓
Discord User receives answer
All local. All private. Fast.
Configuration: Point OpenClaw to Ollama
OpenClaw (and most bots) accept LLM configuration. You'll specify:
- LLM Provider: "ollama" or "custom"
- Endpoint: http://localhost:11434
- Model: "mistral" (or your chosen model)
- Temperature: 0.1-0.3 (structured output)
- Max Tokens: 2048 (keep responses reasonable)
- Context Window: 4096 (remember conversation)
Configuration location depends on your OpenClaw setup (usually YAML, JSON, or environment variables). Check your OpenClaw documentation for exact syntax.
Example: Python Integration
If you're building a bot from scratch, here's how to call Ollama from Python:
import requests

def query_ollama(prompt, model="mistral", temperature=0.1):
    """Send a prompt to local Ollama and return the response text."""
    endpoint = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        # Sampling parameters must be nested under "options"
        "options": {"temperature": temperature, "num_predict": 2048},
        "stream": False,
    }
    response = requests.post(endpoint, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("response", "")

# Usage
answer = query_ollama("What is the capital of France?")
print(answer)
This is the exact pattern OpenClaw uses internally. The endpoint is always localhost:11434 (assuming Ollama is running locally).
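For a chat-style bot you usually want "stream": true instead, so tokens can be relayed as they arrive. The response is then newline-delimited JSON, one object per chunk, with a final chunk marked "done". A sketch (the network call assumes a local Ollama; parse_stream itself works on any iterable of lines):

```python
import json
import requests  # pip install requests

def parse_stream(lines):
    """Yield response fragments from newline-delimited JSON chunks."""
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        if chunk.get("done"):
            break  # final chunk carries timing stats, no text
        yield chunk.get("response", "")

def stream_ollama(prompt, model="mistral"):
    """Stream a generation, printing tokens as they arrive."""
    with requests.post("http://localhost:11434/api/generate",
                       json={"model": model, "prompt": prompt,
                             "stream": True},
                       stream=True, timeout=300) as r:
        r.raise_for_status()
        for fragment in parse_stream(r.iter_lines(decode_unicode=True)):
            print(fragment, end="", flush=True)
    print()
```

A Discord bot would accumulate fragments and edit its message periodically instead of printing.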
Testing the Integration
Before plugging into Discord, test locally:
#!/bin/bash
# Test endpoint accessibility
echo "Testing Ollama connectivity..."
curl -sf -o /dev/null http://localhost:11434/api/tags && echo " ✓ Connected" || echo " ✗ Failed"
# Test a simple inference
echo ""
echo "Testing inference..."
RESPONSE=$(curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Hello, how are you?",
"options": {"temperature": 0.1},
"stream": false
}' | jq -r '.response')
echo "Response: $RESPONSE"
If both tests pass, your Ollama is ready for OpenClaw integration.
Performance Expectations
When running on your local hardware with OpenClaw actively processing:
- User sends a message on Discord
- OpenClaw receives and processes it (milliseconds)
- OpenClaw sends the prompt to Ollama (localhost, effectively instant)
- Ollama generates the response (10-15 tokens/sec)
- Discord receives and displays the answer
Average response time: 5-20 seconds, depending on response length.
Example:
"What's the capital of France?" → 3 seconds
Long explanation prompt → 15-20 seconds
Users will perceive this as responsive and fast, especially for a free, local bot.
Scaling Considerations
What If Multiple Users Message Simultaneously?
Ollama queues requests. If 3 users send messages at once:
- User 1 gets response in ~10 seconds
- User 2 gets response in ~20 seconds (queued behind User 1)
- User 3 gets response in ~30 seconds (queued behind both)
This is fine for personal bots or small communities. For high-traffic scenarios, you'd want to run multiple Ollama instances or use GPU acceleration (beyond this tutorial).
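That back-of-envelope queueing math generalizes: with one Ollama instance and OLLAMA_NUM_PARALLEL=1, requests serialize, so user N waits for everyone ahead of them. A toy estimator (assumes every request takes the same time, which real traffic won't):

```python
def completion_times(request_seconds):
    """Estimated finish time for each queued request on a single instance.

    request_seconds: per-request generation time, in arrival order.
    Returns cumulative completion times; each user waits for all
    requests ahead of theirs.
    """
    times, elapsed = [], 0.0
    for seconds in request_seconds:
        elapsed += seconds
        times.append(elapsed)
    return times

# Three users, ~10s of generation each, arriving together:
print(completion_times([10, 10, 10]))  # → [10.0, 20.0, 30.0]
```

If the tail latency here is unacceptable for your community, that is the signal to look at multiple instances or GPU acceleration.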
Troubleshooting Integration
OpenClaw can't reach Ollama:
- Verify Ollama is running: systemctl status ollama
- Verify the API is accessible: curl http://localhost:11434/api/tags
- Check the firewall: sudo ufw allow 11434 (if using ufw)
Responses are slow or timing out:
- Check CPU load: top -n 1 | grep Cpu
- Check memory: free -h
- Reduce the context window in your config (faster inference)
- Use a faster model (Neural Chat instead of Mistral)
Responses are nonsensical:
- Lower temperature in config (more deterministic)
- Reduce num_predict (shorter, more coherent responses)
- Try a different model
Squeeze Maximum Performance
Your CPU is capable, but there are ways to go further. Environment variables, context tuning, CPU pinning — small tweaks compound into noticeable improvements. This section covers the full toolkit.
Key Environment Variables
Ollama exposes a set of environment variables that control how it loads and runs models. These are the most useful ones for a CPU-focused setup:
- OLLAMA_KEEP_ALIVE (default: 5m): how long a model stays loaded after its last request. Set to 30m or -1 (forever) to avoid cold-start delays. On 16GB+ systems, longer keep-alive means faster responses — the model is already warm.
Environment="OLLAMA_KEEP_ALIVE=30m"
- OLLAMA_MAX_LOADED_MODELS (default: 3 on systems with GPU, 1 on CPU-only): how many models may be resident at once. Raise to 3 or 4 on 32GB systems to keep your full model stack warm without unloading.
Environment="OLLAMA_MAX_LOADED_MODELS=3"
- OLLAMA_NUM_PARALLEL (default: 1 on CPU): how many requests each model serves concurrently. Increasing this can improve throughput if you have multiple concurrent users or applications hitting Ollama at the same time, at the cost of higher RAM usage per request. Start at 1 on CPU — most single-user setups don't need more.
Environment="OLLAMA_NUM_PARALLEL=2"
- OLLAMA_FLASH_ATTENTION (set to 1 to enable): most beneficial when running models with large context windows (like qwen2.5:7b at 128K tokens). Can noticeably reduce RAM pressure on long conversations.
Environment="OLLAMA_FLASH_ATTENTION=1"
Apply them system-wide via systemd:
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
CPU Affinity
Pin Ollama to specific cores to reduce OS scheduling overhead and give the model consistent CPU access:
# Add a CPUAffinity line to your [Service] block via systemctl edit ollama.
# Pins Ollama to cores 0-7; systemd unit files don't allow trailing
# comments, so keep the line itself bare:
CPUAffinity=0-7
On a 12-core machine, you might pin Ollama to 8 cores (0–7) and leave 4 cores (8–11) for the OS and other services. Experiment — the benefit varies by workload.
Context Window Tuning
Context window size directly affects inference speed and RAM usage. Smaller context = faster tokens:
Context 1024: ~20 tokens/sec (very fast, short conversations)
Context 2048: ~15 tokens/sec (balanced — good default)
Context 4096: ~10 tokens/sec (slower, good for most tasks)
Context 8192: ~7 tokens/sec (long docs, complex reasoning)
Context 32768+: ~3–5 tokens/sec (Qwen2.5 long context, only when needed)
You can override context size per request via the API using the options.num_ctx field.
For most chat tasks, 2048–4096 is the sweet spot. Only go higher when you're actually sending long inputs.
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen2.5:7b",
"prompt": "Summarize this document...",
"options": {
"num_ctx": 8192
}
}'
Memory Pressure Relief
If you're hitting RAM limits — models are being unexpectedly unloaded, or you see swap activity — try these:
# See current model memory usage
ollama ps
# Check system RAM
free -h
# Check if swap is being used (bad for inference speed)
swapon --show
If you're hitting swap, either reduce OLLAMA_MAX_LOADED_MODELS, switch to smaller model variants
(e.g. llama3.2:3b instead of a 7B), or reduce OLLAMA_KEEP_ALIVE so models unload faster.
Keep Your LLM Healthy
A running system needs monitoring. Health checks, logs, resource tracking—simple practices prevent surprises.
Health Check Script
#!/bin/bash
echo "=== Ollama Health Check ==="
echo ""
# Service status
echo "✓ Service Status:"
systemctl status ollama --no-pager | head -3
echo ""
echo "✓ API Connectivity:"
curl -s http://localhost:11434/api/tags | jq '.models | length' | xargs echo " Models available:"
echo ""
echo "✓ Resources:"
ps aux | grep "ollama serve" | grep -v grep | awk '{print " CPU: " $3 "%, RAM: " $6 " KB"}'
echo ""
echo "✓ Logs (last error):"
ERR=$(journalctl -u ollama 2>/dev/null | grep -i error | tail -1)
echo "${ERR:- No errors}"
Log Location
Where logs live depends on how you run Ollama: a systemd service logs to the journal, while a manual ollama serve run prints to the terminal. Check for issues:
# View recent logs
journalctl -u ollama -n 50
# Find errors
journalctl -u ollama | grep -i error | tail -10
Regular Maintenance
- Weekly: Check disk usage (du -sh ~/.ollama/models)
- Monthly: Update Ollama (re-run the install script: curl -fsSL https://ollama.com/install.sh | sh)
- Quarterly: Clean unused models (ollama rm model_name)
- Yearly: Backup models to an external drive
Before You Go Live
You've learned parameters, benchmarking, optimization. This checklist ensures your Ollama setup is production-ready for OpenClaw.
Pre-Launch Checklist
- ☐ Ollama running as systemd service (auto-start on boot)
- ☐ Model selected and downloaded (Mistral recommended)
- ☐ API endpoint verified (curl http://localhost:11434/api/tags)
- ☐ Temperature set appropriately (0.1-0.3 for OpenClaw)
- ☐ Context window configured (4096 minimum)
- ☐ Max tokens set (2048 to prevent runaway)
- ☐ Baseline performance benchmarked
- ☐ Health check script running
Configuration Backup
Save your working configuration:
# Backup models
cp -r ~/.ollama/models ~/ollama-models-backup
# Document your settings
cat > ~/ollama-config.txt << EOC
Model: mistral
Temperature: 0.1
Top-P: 0.9
Context: 4096
Num Predict: 2048
EOC
OpenClaw Integration Checklist
- ☐ Ollama endpoint configured in OpenClaw (localhost:11434)
- ☐ Model name matches what you pulled
- ☐ Test prompt sent and received successfully
- ☐ Response quality acceptable
- ☐ Response time reasonable (5-20 seconds)
- ☐ No memory leaks after sustained use
- ☐ Reboot test: Ollama starts automatically, works after restart
Common Issues and Fixes
Things go wrong. Here's how to diagnose and fix common Ollama problems.
Ollama Won't Start
Symptom: Service shows inactive or fails to start
systemctl status ollama
# Read error message
# Try manual start to see error
ollama serve
Common fixes:
- Permission denied: sudo chown -R ollama:ollama ~/.ollama
- Port conflict: check sudo lsof -i :11434
- Corrupted model: delete (ollama rm model_name) and re-pull
Out of Memory Errors
Symptom: "OOM Killer" in dmesg, processes killed
free -h
# If available < 2GB during inference, you're hitting limits
dmesg | grep -i killed | tail -5
# Shows what got OOM killed
Fixes:
- Reduce context window (num_ctx: 2048 instead of 8192)
- Remove unused models from disk (ollama rm model_name), or stop a loaded one (ollama stop model_name on newer Ollama versions)
- Close other apps consuming memory
- Add swap if permanently needed
Very Slow Inference
Symptom: <5 tokens/sec (should be 10-15)
top -n 1
# CPU usage <80%? Issue might be elsewhere
# Check if CPU is being shared with other apps
iostat -x 1
# High wait time? Disk bottleneck
Fixes:
- Close heavy apps (browser, IDE, etc)
- Reduce num_ctx to 2048
- Use faster model (Neural Chat)
- Check for thermal throttling (watch -n 1 'grep MHz /proc/cpuinfo')
API Not Responding
Symptom: curl returns Connection refused
systemctl status ollama
# Make sure it's running
netstat -an | grep 11434
# Should show LISTEN on port 11434
curl http://localhost:11434/api/tags
# Should return JSON, not error
Fixes:
- Start the service: sudo systemctl start ollama
- Check the firewall: sudo ufw allow 11434
- Restart: sudo systemctl restart ollama
Bad Quality Responses
Symptom: Responses are nonsensical or repetitive
Fixes:
- Lower temperature (0.3 or less)
- Increase repeat_penalty (1.5)
- Reduce num_predict (limit length)
- Try different model (Llama2 instead of Mistral)
Where to Go From Here
You've completed Ollama Advanced. You understand parameters, tuning, integration, optimization, and troubleshooting. Your local LLM setup is sophisticated and production-ready.
You've Accomplished
- ✓ Understand all Ollama parameters and their effects
- ✓ Tune for specific use cases (OpenClaw, creative, code, etc)
- ✓ Benchmark models objectively
- ✓ Know when GPU acceleration matters (and doesn't for you)
- ✓ Run multiple models concurrently
- ✓ Integrate Ollama seamlessly with OpenClaw
- ✓ Optimize your hardware for maximum LLM performance
- ✓ Monitor and maintain a healthy Ollama system
- ✓ Troubleshoot common problems
Option 1: Deploy Your OpenClaw Bot
You have everything you need. Configure OpenClaw to use your local Ollama, then deploy:
- Point OpenClaw to http://localhost:11434
- Select your tuned model and parameters
- Launch your Discord bot
- Enjoy your private, offline-capable AI agent
Option 2: Explore Advanced Topics
If you want to go deeper:
- Model Fine-Tuning: Customize models for specific tasks (advanced)
- Quantization: Compress models further (4-bit, 2-bit)
- Distributed Inference: Run Ollama across multiple machines
- Web UI: Build a web interface for Ollama
- Monitoring: Prometheus/Grafana metrics tracking
Option 3: Compare with Other LLM Tools
Ollama isn't the only option. If you want to explore:
- LM Studio: Web UI for local models (easier than Ollama CLI)
- vLLM: High-performance inference server (more complex)
- Text Generation WebUI: Feature-rich but steeper learning curve
- GPT4All: Lightweight, beginner-friendly
But honestly? Ollama is the best balance of simplicity and power for your use case.
Keep Learning
Understand Transformers Better:
- Read: "Attention is All You Need" (the original paper)
- Watch: YouTube tutorials on how LLMs work
- Experiment: Try different models, prompt engineering
Follow the Community:
- Ollama GitHub (issues, discussions)
- Hugging Face model hub (find new models)
- Reddit r/LocalLLaMA (community sharing)
You're Part of the AI Revolution
A few years ago, running local LLMs meant compiling C++, wrestling with dependencies, and getting 1-2 tokens/sec. Now? You install Ollama, pull a model, and get 10-15 tokens/sec on CPU. You've got a private, offline-capable AI brain that costs nothing to run.
Your data is yours. Your LLM is yours. No cloud vendor, no API limits, no surveillance.
That's power. Use it wisely.