Beyond the Basics: Optimization and Integration
You've completed Ollama Basics. You have Mistral or Llama2 running, you've generated some responses, and you've tested the REST API. Now: how do you make it better?
This tutorial is for people comfortable with Ollama fundamentals who want to squeeze maximum performance, understand the tuning knobs, and integrate seamlessly with OpenClaw. We're going deep.
Where You Left Off (Quick Review)
You have:
- ✓ Ollama installed and running as a service
- ✓ At least one 7B model downloaded (Mistral recommended)
- ✓ REST API working (tested with curl)
- ✓ Baseline performance metrics (10–15 tokens/sec on a typical CPU)
If you don't have these, go back to Tutorial 1 (Basics) first. This tutorial assumes you're comfortable with Ollama's fundamentals.
What You'll Learn (Advanced)
In this 90-minute tutorial:
- Parameters: Temperature, top_p, context windows—what they do and how to tune them
- Use-Case Tuning: Different settings for different tasks (code generation vs creative writing vs structured output)
- Benchmarking: Objective methods to measure quality and speed
- GPU Detection: Understand when GPUs help (spoiler: you probably don't need one for 7B models)
- Multiple Models: Run models concurrently, queue requests, load balancing
- OpenClaw Integration: Point your bot to local Ollama for a fully private AI brain
- Optimization: Memory tuning, CPU optimization, inference speed tuning
- Monitoring: Health checks, logs, continuous operation
- Troubleshooting: Common issues and how to fix them
Your Hardware
This tutorial works on any modern hardware capable of running Ollama. Here's what to keep in mind:
- CPU: 4+ cores (8+ cores recommended for smooth performance)
- RAM: 16GB minimum, 32GB recommended (more RAM = more models loaded simultaneously)
- Storage: SSD recommended (fast model loading)
- GPU: Optional — CPU-only inference works well for 7B models
Performance numbers in this tutorial are based on a mid-range multi-core CPU with 32GB RAM. Your results may vary depending on your specific hardware — adjust recommendations accordingly.
Why This Matters
For OpenClaw: A well-tuned local LLM can power a bot that feels just as smart as cloud-based solutions, but faster, cheaper, and completely private.
For Learning: Understanding parameters teaches you how language models actually work. You'll develop intuition for LLM behavior.
For Optimization: Squeezing another 20% performance from your hardware means faster responses and better user experience.
A Note on Complexity
This tutorial assumes you're comfortable with terminal commands and basic system administration. You don't need to be an expert, but "comfortable with Linux" is the baseline.
If something is unclear, go back to the Basics tutorial or skip to the sections that interest you. You don't have to do everything in order.
The Knobs That Control Your LLM
LLMs generate text one token at a time. But how they choose which token to generate is controlled by parameters. These are your tuning knobs. Understand them, and you control the model's behavior.
Temperature (Randomness)
Range: 0.0 to 2.0 (or higher)
Default: Usually 0.7 (balanced)
Temperature controls how "creative" or "random" the model gets:
- 0.0 (Cold, Deterministic): Always picks the most likely next token. Same input = identical output every time. Good for: Precise answers, code generation, structured output.
- 0.5 (Cool, Focused): Mostly predictable but with some variation. Good for: Professional writing, Q&A, technical content.
- 0.7 (Balanced, Default): A mix of creativity and consistency. Good for: General chat, creative but coherent responses.
- 1.2 (Warm, Creative): More randomness, more interesting responses. Good for: Brainstorming, creative writing, poetry.
- 1.5+ (Hot, Chaotic): Very random, often incoherent. Good for: Experimentation (usually a mistake).
Try it:
(Note: Ollama reads sampling parameters from the "options" object of the request body; it ignores them at the top level.)
# Cold (deterministic)
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Complete: The future of AI is",
"options": {"temperature": 0.0},
"stream": false
}' | jq -r '.response'
# Hot (creative)
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Complete: The future of AI is",
"options": {"temperature": 1.5},
"stream": false
}' | jq -r '.response'
Run both a few times and compare. At 0.0 you'll get essentially the same answer every run; at 1.5 the answers vary wildly, sometimes into incoherence.
Top-P (Nucleus Sampling)
Range: 0.0 to 1.0
Default: 0.9
Top-P is a more sophisticated diversity control than temperature. It works by:
- Model predicts probabilities for the next token
- Sort tokens by likelihood (highest first)
- Keep only the top tokens that sum to p% probability
- Randomly sample from that restricted set
In practice:
- 0.1: Ultra-focused (only the top 10% of likely tokens)
- 0.5: Focused (top 50% of likely tokens)
- 0.9: Balanced (top 90% of likely tokens, allows more creativity)
Works best with temperature: Top-P controls diversity among the remaining candidates after temperature does its work. Usually keep at 0.9 unless you're experimenting.
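The four-step process above can be sketched in plain Python. This is an illustrative toy, not Ollama's actual implementation (real inference engines do this over the full vocabulary in a single tensor operation):

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p.

    probs: dict mapping token -> probability (should sum to ~1.0).
    Returns the surviving tokens with renormalized probabilities.
    """
    # Sort tokens by likelihood, highest first
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # nucleus is complete
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs, rng=random):
    """Randomly sample one token from the restricted set."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(probs, 0.9)  # drops the long tail
print(sorted(nucleus))              # "xylophone" is gone
```

Lowering p shrinks the nucleus: at p=0.5, only "cat" would survive here, which is why small top_p values feel so focused.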
Top-K
Range: 1 to infinity
Default: 40 (varies by Ollama version)
Top-K is the simpler cousin of top-P. It limits the model to only consider the K most likely tokens:
- K=1: Only consider the most likely token (extremely constrained)
- K=10: Consider top 10 most likely tokens
- K=40: Consider top 40 (good balance)
- K=100+: Very permissive (almost all tokens allowed)
Practical advice: Leave top-K at default and use temperature+top-P instead. Top-K is older and less intuitive than top-P.
Repeat Penalty
Range: 0.0 to 2.0
Default: 1.1
Repeat penalty prevents the model from repeating the same phrase over and over (common LLM failure mode):
- 1.0: No penalty (tokens can repeat freely)
- 1.1: Mild penalty (slight discouragement to repeat)
- 1.5: Strong penalty (heavily discourage repetition)
Use when: Model generates repetitive text ("the the the..." or "and and and...").
Default is good: Usually leave at 1.1.
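Mechanically, the penalty shrinks the scores (logits) of tokens that have already appeared, so repeats become less likely each time. A toy sketch following the common convention (divide positive logits, multiply negative ones); this illustrates the idea, not llama.cpp's exact code:

```python
def apply_repeat_penalty(logits, seen_tokens, penalty=1.1):
    """Discourage already-generated tokens.

    logits: dict token -> raw score. Positive scores are divided by the
    penalty and negative ones multiplied, so a repeat always loses score.
    """
    adjusted = dict(logits)
    for token in seen_tokens:
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "sat": -0.5}
# "the" was already generated: its score drops from 2.0 to ~1.33,
# so "cat" becomes the front-runner
print(apply_repeat_penalty(logits, seen_tokens=["the"], penalty=1.5))
```

At penalty=1.0 the function is a no-op, which matches the "no penalty" row above.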
Context Window (num_ctx)
Range: 128 to ~32000 tokens
Default: 2048
Context window is how many tokens the model can "see" when generating. It's like short-term memory:
- 256: Very short memory (forgets everything quickly)
- 2048: Good balance (remember ~1500 words of conversation)
- 4096: Long memory (remember ~3000 words)
- 8192: Very long (remember entire documents)
Tradeoff: Larger context = slower inference (more computation). On a typical multi-core CPU with a 7B model:
2048 tokens: ~15 tokens/sec (fast)
4096 tokens: ~12 tokens/sec (slightly slower)
8192 tokens: ~8 tokens/sec (noticeably slower)
16384+: ~3-5 tokens/sec (CPU gets saturated)
Recommendation: Use 2048 for chat. Use 4096 for document analysis. Skip 8192+ unless you really need the memory.
Prediction Tokens (num_predict)
Range: -1 (unlimited) to any positive number
Default: -1 (unlimited)
Maximum tokens to generate before stopping. Prevents runaway responses:
- -1: Unlimited (model stops when it feels done)
- 128: Stop after 128 tokens (~100 words)
- 512: Stop after 512 tokens (~400 words)
- 2048: Stop after 2048 tokens (full page)
Use when: You want predictable response lengths. Good for APIs where you need bounded latency.
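A small helper makes the bound explicit in code. This sketch builds the request body shape Ollama expects (per-request knobs like num_predict go inside the "options" object); the word-to-token ratio is a rough rule of thumb, not a guarantee:

```python
import json

def tokens_for_words(words):
    """Rough heuristic: English averages ~0.75 words per token."""
    return int(words / 0.75)

def bounded_request(prompt, model="mistral", max_tokens=512):
    """Build an /api/generate payload with a hard cap on response length.

    model, prompt and stream are top-level fields; num_predict and the
    other sampling knobs belong inside "options".
    """
    return {
        "model": model,
        "prompt": prompt,
        "options": {"num_predict": max_tokens},
        "stream": False,
    }

# Cap responses at roughly 100 words for bounded API latency
payload = bounded_request("Summarize HTTP status codes",
                          max_tokens=tokens_for_words(100))
print(json.dumps(payload, indent=2))
```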
Threads (num_thread)
Range: 1 to your CPU core count
Default: Auto-detect (all cores)
How many CPU cores Ollama uses. For example, on an 8-core CPU:
- 1-2: Slow, leaves CPU idle
- 8: All cores active (default, good)
- >8: With hyperthreading you can go higher (e.g., 16 logical threads on 8 cores), but the gains for inference are usually small
Leave it on auto. Ollama detects your CPU and uses all available threads.
Quick Parameter Summary
temperature → 0.0–2.0 (randomness, default 0.7)
top_p → 0.0–1.0 (diversity, default 0.9)
top_k → 1–∞ (token limit, default 40, skip it)
repeat_penalty → 0–2 (penalize repetition, default 1.1)
num_ctx → 128–32k (memory, default 2048)
num_predict → -1–∞ (max length, default -1)
num_thread → 1–cores (CPU threads, default auto)
Parameters That Fit Your Task
Now that you understand parameters, let's apply them. Different tasks need different settings. A Discord bot behaves differently than a creative writer's assistant. Let's build configurations for real scenarios.
Use Case 1: OpenClaw Integration (Structured Output)
Goal: Consistent, deterministic responses. OpenClaw needs predictable JSON or structured text.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Your prompt here",
"options": {
"temperature": 0.1,
"top_p": 0.9,
"num_predict": 2048,
"num_ctx": 4096
},
"stream": false
}'
Why these values:
- temperature: 0.1 → Focused, deterministic (reproducible output)
- top_p: 0.9 → Still allows valid variation, not robotic
- num_predict: 2048 → Reasonable max length for API
- num_ctx: 4096 → Enough memory for conversation context
Expected behavior: Responses are consistent. Same prompt = mostly same answer (good for testing and debugging).
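If OpenClaw expects strict JSON, you can go further than a low temperature: Ollama's generate endpoint accepts "format": "json", which constrains decoding to valid JSON. A hedged sketch (the retry helper is illustrative, not part of OpenClaw; only parse_or_none runs without a live Ollama):

```python
import json
import requests  # pip install requests

def parse_or_none(text):
    """Return parsed JSON, or None if the text isn't valid JSON."""
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

def ask_json(prompt, model="mistral", retries=2):
    """Request JSON-constrained output and parse it, retrying on bad parses."""
    payload = {
        "model": model,
        # Asking for JSON in the prompt itself improves results
        # even with format enforcement turned on.
        "prompt": prompt + "\nRespond in JSON.",
        "format": "json",
        "options": {"temperature": 0.1},
        "stream": False,
    }
    for _ in range(retries + 1):
        text = requests.post("http://localhost:11434/api/generate",
                             json=payload, timeout=120).json()["response"]
        parsed = parse_or_none(text)
        if parsed is not None:
            return parsed
    raise ValueError("model never produced valid JSON")
```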
Use Case 2: Creative Writing
Goal: Variety and creativity. You want different results each time, but still coherent.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a short story about...",
"options": {
"temperature": 1.2,
"top_p": 0.95,
"num_predict": 1000,
"num_ctx": 2048,
"repeat_penalty": 1.2
},
"stream": true
}'
Why these values:
- temperature: 1.2 → Creative, more varied outputs
- top_p: 0.95 → Very permissive vocabulary
- repeat_penalty: 1.2 → Prevent repetitive phrases
- stream: true → See creativity unfold in real-time
Expected behavior: Each run produces unique, interesting variations on the theme.
Use Case 3: Fast Inference (Real-Time Chat)
Goal: Speed matters. Users are waiting for a response. Trade some context for speed.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "neural-chat",
"prompt": "What is...",
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 512,
"num_ctx": 2048
},
"stream": true
}'
Why these values:
- num_predict: 512 → Shorter responses (faster generation)
- num_ctx: 2048 → Not too large (faster processing)
- stream: true → First token appears faster (perceived speed)
- model: neural-chat → Slightly faster than Mistral
Expected behavior: First token appears in <500ms, total response in 5-10 seconds.
Use Case 4: Code Generation
Goal: Correct, working code. Logic must be sound.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a Python function that...",
"options": {
"temperature": 0.3,
"top_p": 0.9,
"num_predict": 1024,
"num_ctx": 4096,
"repeat_penalty": 1.1
},
"stream": false
}'
Why these values:
- temperature: 0.3 → Focused on correct syntax
- num_ctx: 4096 → Understand full code context
- repeat_penalty: 1.1 → Avoid redundant code
Expected behavior: Code is syntactically correct and logically sound most of the time.
Use Case 5: Long-Form Document Analysis
Goal: Understand and analyze large texts. Need big context window.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Summarize this document:\n\n[LARGE TEXT HERE]",
"options": {
"temperature": 0.5,
"top_p": 0.9,
"num_predict": 1024,
"num_ctx": 8192
},
"stream": false
}'
Why these values:
- num_ctx: 8192 → Can see entire document (trade-off: slower)
- temperature: 0.5 → Faithful to source material
- stream: false → You're doing analysis, not interactive chat
Expected behavior: Accurate summaries that capture main points. Slower (expect 20–30 seconds), but comprehensive.
Testing Your Configuration
Don't just trust recommendations. Test with your own prompts:
#!/bin/bash
PROMPT="Write a haiku about programming"
echo "=== Configuration A ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "'"$PROMPT"'",
"options": {"temperature": 0.5},
"stream": false
}' | jq -r '.response'
echo ""
echo "=== Configuration B ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "'"$PROMPT"'",
"options": {"temperature": 1.0},
"stream": false
}' | jq -r '.response'
Run the same prompt with different configurations. Notice speed and quality differences. Build intuition.
Save Your Configs
Once you find good settings, save them as shell functions for reuse (plain aliases can't take a prompt argument cleanly):
# OpenClaw Config
ollama-openclaw() { curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"'"$1"'","options":{"temperature":0.1,"top_p":0.9},"stream":false}' | jq -r '.response'; }
# Creative Config
ollama-creative() { curl -s http://localhost:11434/api/generate -d '{"model":"mistral","prompt":"'"$1"'","options":{"temperature":1.2,"top_p":0.95},"stream":false}' | jq -r '.response'; }
# Fast Config
ollama-fast() { curl -s http://localhost:11434/api/generate -d '{"model":"neural-chat","prompt":"'"$1"'","options":{"temperature":0.7,"num_predict":512},"stream":false}' | jq -r '.response'; }
Source the file and call them like ollama-fast "your prompt". Saves typing and keeps configs consistent.
Measure Speed and Quality Objectively
You can feel that a model is fast, but how fast exactly? How does Mistral compare to Llama2 on your hardware? Benchmarking gives you objective data to guide optimization decisions.
Metrics That Matter
Four key metrics for LLM inference:
- Time to First Token (TTFT): How long before the model starts responding. Goal: <500ms
- Tokens Per Second (TPS): Generation speed. Goal: 10-15 for CPU, 50+ for GPU
- Memory Used: RAM footprint during inference. Goal: <8GB for your setup
- Response Quality: Does the answer make sense? Subjective but important
Benchmark Script
Create a reusable benchmarking script:
#!/bin/bash
MODEL="${1:-mistral}"
PROMPT="Explain machine learning in 3 paragraphs"
echo "Benchmarking $MODEL..."
echo ""
# Run inference and capture timing
START=$(date +%s%N)
RESPONSE=$(curl -s http://localhost:11434/api/generate \
-d "{
\"model\": \"$MODEL\",
\"prompt\": \"$PROMPT\",
\"options\": {\"temperature\": 0.5},
\"stream\": false
}")
END=$(date +%s%N)
# Extract metrics (eval_duration is reported in nanoseconds)
TOTAL_TIME=$(( (END - START) / 1000000 ))  # milliseconds
TOKENS=$(echo "$RESPONSE" | jq -r '.eval_count')
TPS=$(echo "$RESPONSE" | jq -r '.eval_count / .eval_duration * 1000000000')
echo "Model: $MODEL"
echo "Total Time: ${TOTAL_TIME}ms"
echo "Tokens: $TOKENS"
printf "Tokens/Sec: %.2f\n" "$TPS"
echo ""
echo "Response:"
echo "---"
echo "$RESPONSE" | jq -r '.response'
echo "---"
Run it:
bash benchmark.sh mistral
bash benchmark.sh llama2
bash benchmark.sh neural-chat
Compare Models Side-by-Side
Test the same prompt across models to find your best balance of speed and quality:
Model Time Tokens TPS Quality
mistral 8.2s 82 10.0 Excellent
llama2 7.5s 75 10.0 Good
neural-chat 6.1s 85 13.9 Good
Neural Chat is fastest, Mistral is best quality. Your choice depends on your priority.
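You can build a table like this yourself from the timing fields Ollama returns on non-streaming responses: eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A sketch, shown here with synthetic numbers rather than live API calls:

```python
def tokens_per_second(resp):
    """Compute generation speed from an /api/generate response dict.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds, so TPS = tokens / (ns / 1e9).
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1_000_000_000)

def compare(results):
    """results: dict model -> response dict. Print a small comparison table."""
    print(f"{'Model':<14}{'Tokens':>8}{'TPS':>8}")
    for model, resp in results.items():
        print(f"{model:<14}{resp['eval_count']:>8}"
              f"{tokens_per_second(resp):>8.1f}")

# Synthetic numbers matching the table above
compare({
    "mistral":     {"eval_count": 82, "eval_duration": 8_200_000_000},
    "neural-chat": {"eval_count": 85, "eval_duration": 6_100_000_000},
})
```

In a real run you would feed compare() the parsed JSON from each model's /api/generate call.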
Monitor Resources During Benchmark
In another terminal, watch resource usage:
watch -n 0.5 'free -h && echo && top -n 1 -b | head -n 3'
Look for:
- Peak memory usage (should be <8GB)
- CPU utilization (should be 80%+ during inference)
- No thermal throttling (CPU frequency stays stable)
Understanding Hardware Acceleration (You Probably Don't Need It)
GPUs are much faster than CPUs for LLM inference. Many systems have an integrated GPU, but integrated graphics are typically too small to meaningfully accelerate LLMs. Let's understand when GPU acceleration actually helps.
Integrated vs Discrete GPUs
Most CPUs include an integrated GPU, but these are far too weak for LLM acceleration. For reference:
- Typical integrated GPU: ~1–2 TFLOPs
- RTX 4070: ~20 TFLOPs (10–20x more powerful)
- RTX 4090: ~82 TFLOPs (40–80x more powerful)
Integrated GPUs are orders of magnitude weaker than discrete gaming GPUs. For LLMs on integrated graphics, the CPU is actually competitive.
Check if Ollama Detects Your GPU
Ollama logs what hardware it's using at startup. On a systemd install, check the journal; if you run ollama serve manually, check its terminal output or your log file:
journalctl -u ollama | grep -i gpu
If you see GPU references, Ollama detected it. If not, it's using the CPU (the default, which is fine for 7B models).
When GPU Acceleration Helps
GPU acceleration is worth it if:
- Nvidia RTX GPU: 2080 Ti or newer (30x-60x speedup)
- AMD Radeon: RX 6700 or newer (10x-20x speedup)
- Apple Silicon: Built-in GPU (3x-5x speedup)
An integrated GPU? Not worth the complexity. CPU inference is simpler and nearly as fast.
Should You Upgrade Hardware?
For local LLMs, consider GPU if:
- You want to run 13B+ models (currently too slow on CPU)
- You need sub-second response times (production use)
- You want to run multiple concurrent inferences
For now, stick with CPU. A modern multi-core CPU is plenty fast for a single Ollama instance running 7B models.
Leverage Your RAM — Load Once, Use Instantly
With 16GB+ RAM and OLLAMA_KEEP_ALIVE set, you can keep multiple models loaded simultaneously and switch between them with zero cold-start delay. Different models for different tasks — all ready to go.
Check What's Currently Loaded
Before thinking about concurrent models, know what's already in memory:
ollama ps
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:7b 845dbda0ea48 5.5 GB 100% CPU 28 minutes from now
mistral:latest f974a74358d6 5.0 GB 100% CPU 25 minutes from now
PROCESSOR shows whether your GPU is involved. On a CPU-only machine you'll see 100% CPU. On a machine with a compatible GPU (NVIDIA with CUDA, AMD with ROCm, or Apple Silicon), you'll see a split like 45%/55% CPU/GPU — higher GPU % means faster inference. The UNTIL column shows the keep-alive expiry for each loaded model.
Concurrent Model Memory Layout
On a 32GB system, loading your full stack looks like this:
OS + System: ~2GB (always used)
Ollama Runtime: ~1GB
qwen2.5:7b (q4_K_M): ~5.5GB — all-rounder, 128K context
mistral:latest: ~5.0GB — reasoning + code workhorse
llama3.2:3b: ~2.5GB — speed tier, quick queries
Reserve Buffer: ~16GB (available for phi4 or larger models)
────────────────────────────────
Total Used: ~16GB
Total Available: ~16GB
With OLLAMA_KEEP_ALIVE set to -1 or a long duration, all three models stay resident.
Switching between them is instant — the model is already warm in memory.
Practical Multi-Model Setup (2026)
A sensible three-tier stack for most use cases:
- qwen2.5:7b — Default tier. Complex reasoning, code, long documents (128K context). Best all-rounder.
- mistral:latest — Alternative for instruction-following and reliability. Battle-tested and predictable.
- llama3.2:3b — Speed tier. Quick lookups, simple tasks, when you want an instant answer.
- phi4 or gemma3:12b — Heavy reasoning (needs 16GB+). Pull when qwen2.5 isn't cutting it on complex problems.
Total disk for the first three: ~12GB. All loaded simultaneously at 32GB: ~16GB RAM. Comfortable headroom.
API Request Routing by Model
With multiple models loaded, route requests to the right model for the task:
# Complex task → Qwen2.5 for reasoning + 128K context
curl http://localhost:11434/api/generate \
-d '{"model":"qwen2.5:7b","prompt":"Analyze this long document..."}'
# Quick factual lookup → Llama 3.2 3B for speed
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2:3b","prompt":"What does HTTP 429 mean?"}'
# Code review → Mistral for reliable instruction following
curl http://localhost:11434/api/generate \
-d '{"model":"mistral","prompt":"Review this function for bugs..."}'
If all three are loaded in memory (keep-alive), these requests complete without any model loading delay. Ollama queues concurrent requests and processes them in order, switching models transparently.
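In code, routing can be as simple as a lookup table. A minimal sketch (the task tiers and model names mirror this tutorial's stack; adjust them to yours):

```python
import requests  # pip install requests

# Task tier -> model, mirroring the three-tier stack above
MODEL_FOR_TASK = {
    "reasoning": "qwen2.5:7b",   # complex analysis, long context
    "quick":     "llama3.2:3b",  # fast factual lookups
    "code":      "mistral",      # reliable instruction following
}

def pick_model(task):
    """Choose a model for a task tier, falling back to the default tier."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["reasoning"])

def route(task, prompt):
    """Send the prompt to whichever model fits the task."""
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": pick_model(task),
                            "prompt": prompt,
                            "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]
```

Usage: route("quick", "What does HTTP 429 mean?") hits the 3B model, while route("reasoning", ...) goes to Qwen2.5.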
Pinning Models with OLLAMA_KEEP_ALIVE
The key to no-delay model switching is keeping models in memory. Set keep-alive system-wide via systemd:
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
Use -1 only if you have headroom: with -1, models never unload until Ollama restarts. On 16GB RAM with one 7B model loaded, that's fine. Loading a second 7B while the first is pinned will consume ~12GB — check ollama ps and your free RAM before pinning multiple large models permanently.
ollama ps is your dashboard — check it whenever you're curious about what's running.
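Besides the system-wide environment variable, keep_alive can also be set per request: /api/generate accepts a top-level keep_alive field. A sketch of the payload shape:

```python
def pinned_request(prompt, model="mistral", keep_alive="30m"):
    """Build a request that also refreshes how long the model stays loaded.

    keep_alive accepts durations like "30m" or "24h", 0 (unload
    immediately after the response), or -1 (stay loaded until
    Ollama restarts). It is a top-level field, not under "options".
    """
    return {
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    }

# Pin mistral in memory for the rest of this Ollama session
payload = pinned_request("warmup", keep_alive=-1)
```

A request with an empty prompt just loads the model under the given keep_alive, which makes this a handy pre-warming trick at boot.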
Connect OpenClaw to Your Local Ollama
This is the power play: use your local Ollama as the brain for OpenClaw. Your Discord bot will have a private, offline-capable LLM that you control completely. No API keys, no cloud dependency, no monthly bills.
Why Local Ollama + OpenClaw?
Privacy: Every message stays on your laptop. No cloud. No logs in someone else's data center.
Cost: Free. The Ollama infrastructure costs you literally nothing.
Speed: No network latency. Your bot responds instantly (well, as fast as token generation).
Control: You choose the model. You tune the parameters. You own the LLM.
Architecture Overview
Here's the data flow:
Discord User
↓ (sends message)
↓
[Discord Bot (OpenClaw)]
↓ (sends HTTP request)
↓
[Ollama API on localhost:11434]
↓ (inference)
↓
[LLM Model (Mistral, Llama2, etc)]
↓ (returns response)
↓
[Ollama API]
↓ (sends JSON back)
↓
[OpenClaw]
↓ (processes response)
↓
Discord User receives answer
All local. All private. Fast.
Configuration: Point OpenClaw to Ollama
OpenClaw (and most bots) accept LLM configuration. You'll specify:
- LLM Provider: "ollama" or "custom"
- Endpoint: http://localhost:11434
- Model: "mistral" (or your chosen model)
- Temperature: 0.1-0.3 (structured output)
- Max Tokens: 2048 (keep responses reasonable)
- Context Window: 4096 (remember conversation)
Configuration location depends on your OpenClaw setup (usually YAML, JSON, or environment variables). Check your OpenClaw documentation for exact syntax.
Example: Python Integration
If you're building a bot from scratch, here's how to call Ollama from Python:
import requests

def query_ollama(prompt, model="mistral", temperature=0.1):
    """Send a prompt to local Ollama and return the response text."""
    endpoint = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        # Sampling parameters must be nested under "options"
        "options": {"temperature": temperature, "num_predict": 2048},
        "stream": False,
    }
    response = requests.post(endpoint, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("response", "")

# Usage
answer = query_ollama("What is the capital of France?")
print(answer)
This is the exact pattern OpenClaw uses internally. The endpoint is always localhost:11434 (assuming Ollama is running locally).
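For a chat-style bot you usually want "stream": true instead, so tokens can be relayed as they arrive. The response is then newline-delimited JSON, one object per chunk, with a final chunk marked "done". A sketch (the network call assumes a local Ollama; parse_stream itself works on any iterable of lines):

```python
import json
import requests  # pip install requests

def parse_stream(lines):
    """Yield response fragments from newline-delimited JSON chunks."""
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        if chunk.get("done"):
            break  # final chunk carries timing stats, no text
        yield chunk.get("response", "")

def stream_ollama(prompt, model="mistral"):
    """Stream a generation, printing tokens as they arrive."""
    with requests.post("http://localhost:11434/api/generate",
                       json={"model": model, "prompt": prompt,
                             "stream": True},
                       stream=True, timeout=300) as r:
        r.raise_for_status()
        for fragment in parse_stream(r.iter_lines(decode_unicode=True)):
            print(fragment, end="", flush=True)
    print()
```

A Discord bot would accumulate fragments and edit its message periodically instead of printing.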
Testing the Integration
Before plugging into Discord, test locally:
#!/bin/bash
# Test endpoint accessibility
echo "Testing Ollama connectivity..."
curl -sf -o /dev/null http://localhost:11434/api/tags && echo " ✓ Connected" || echo " ✗ Failed"
# Test a simple inference
echo ""
echo "Testing inference..."
RESPONSE=$(curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Hello, how are you?",
"options": {"temperature": 0.1},
"stream": false
}' | jq -r '.response')
echo "Response: $RESPONSE"
If both tests pass, your Ollama is ready for OpenClaw integration.
Performance Expectations
When running on your local hardware with OpenClaw actively processing:
- User sends a message on Discord
- OpenClaw receives and processes it (milliseconds)
- OpenClaw sends the prompt to Ollama (localhost, effectively instant)
- Ollama generates the response (10-15 tokens/sec)
- Discord receives and displays the answer
Average response time: 5-20 seconds, depending on response length.
Example:
"What's the capital of France?" → 3 seconds
Long explanation prompt → 15-20 seconds
Users will perceive this as responsive and fast, especially for a free, local bot.
Scaling Considerations
What If Multiple Users Message Simultaneously?
Ollama queues requests. If 3 users send messages at once:
- User 1 gets response in ~10 seconds
- User 2 gets response in ~20 seconds (queued behind User 1)
- User 3 gets response in ~30 seconds (queued behind both)
This is fine for personal bots or small communities. For high-traffic scenarios, you'd want to run multiple Ollama instances or use GPU acceleration (beyond this tutorial).
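That back-of-envelope queueing math generalizes: with one Ollama instance and OLLAMA_NUM_PARALLEL=1, requests serialize, so user N waits for everyone ahead of them. A toy estimator (assumes every request takes the same time, which real traffic won't):

```python
def completion_times(request_seconds):
    """Estimated finish time for each queued request on a single instance.

    request_seconds: per-request generation time, in arrival order.
    Returns cumulative completion times; each user waits for all
    requests ahead of theirs.
    """
    times, elapsed = [], 0.0
    for seconds in request_seconds:
        elapsed += seconds
        times.append(elapsed)
    return times

# Three users, ~10s of generation each, arriving together:
print(completion_times([10, 10, 10]))  # → [10.0, 20.0, 30.0]
```

If the tail latency here is unacceptable for your community, that is the signal to look at multiple instances or GPU acceleration.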
Troubleshooting Integration
OpenClaw can't reach Ollama:
- Verify Ollama is running: systemctl status ollama
- Verify the API is accessible: curl http://localhost:11434/api/tags
- Check the firewall: sudo ufw allow 11434 (if using ufw)
Responses are slow or timing out:
- Check CPU load: top -n 1 | grep Cpu
- Check memory: free -h
- Reduce the context window in your config (faster inference)
- Use a faster model (Neural Chat instead of Mistral)
Responses are nonsensical:
- Lower temperature in config (more deterministic)
- Reduce num_predict (shorter, more coherent responses)
- Try a different model
Squeeze Maximum Performance
Your CPU is capable, but there are ways to go further. Environment variables, context tuning, CPU pinning — small tweaks compound into noticeable improvements. This section covers the full toolkit.
Key Environment Variables
Ollama exposes a set of environment variables that control how it loads and runs models. These are the most useful ones for a CPU-focused setup:
- OLLAMA_KEEP_ALIVE (default: 5m): how long a model stays loaded after its last request. Set to 30m or -1 (forever) to avoid cold-start delays. On 16GB+ systems, longer keep-alive means faster responses — the model is already warm.
Environment="OLLAMA_KEEP_ALIVE=30m"
- OLLAMA_MAX_LOADED_MODELS (default: 3 on systems with GPU, 1 on CPU-only): how many models may be resident at once. Raise to 3 or 4 on 32GB systems to keep your full model stack warm without unloading.
Environment="OLLAMA_MAX_LOADED_MODELS=3"
- OLLAMA_NUM_PARALLEL (default: 1 on CPU): how many requests each model serves concurrently. Increasing this can improve throughput if you have multiple concurrent users or applications hitting Ollama at the same time, at the cost of higher RAM usage per request. Start at 1 on CPU — most single-user setups don't need more.
Environment="OLLAMA_NUM_PARALLEL=2"
- OLLAMA_FLASH_ATTENTION (set to 1 to enable): most beneficial when running models with large context windows (like qwen2.5:7b at 128K tokens). Can noticeably reduce RAM pressure on long conversations.
Environment="OLLAMA_FLASH_ATTENTION=1"
Apply them system-wide via systemd:
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
CPU Affinity
Pin Ollama to specific cores to reduce OS scheduling overhead and give the model consistent CPU access:
# Add a CPUAffinity line to your [Service] block via systemctl edit ollama.
# Pins Ollama to cores 0-7; systemd unit files don't allow trailing
# comments, so keep the line itself bare:
CPUAffinity=0-7
On a 12-core machine, you might pin Ollama to 8 cores (0–7) and leave 4 cores (8–11) for the OS and other services. Experiment — the benefit varies by workload.
Context Window Tuning
Context window size directly affects inference speed and RAM usage. Smaller context = faster tokens:
Context 1024: ~20 tokens/sec (very fast, short conversations)
Context 2048: ~15 tokens/sec (balanced — good default)
Context 4096: ~10 tokens/sec (slower, good for most tasks)
Context 8192: ~7 tokens/sec (long docs, complex reasoning)
Context 32768+: ~3–5 tokens/sec (Qwen2.5 long context, only when needed)
You can override context size per request via the API using the options.num_ctx field.
For most chat tasks, 2048–4096 is the sweet spot. Only go higher when you're actually sending long inputs.
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen2.5:7b",
"prompt": "Summarize this document...",
"options": {
"num_ctx": 8192
}
}'
Memory Pressure Relief
If you're hitting RAM limits — models are being unexpectedly unloaded, or you see swap activity — try these:
# See current model memory usage
ollama ps
# Check system RAM
free -h
# Check if swap is being used (bad for inference speed)
swapon --show
If you're hitting swap, either reduce OLLAMA_MAX_LOADED_MODELS, switch to smaller model variants
(e.g. llama3.2:3b instead of a 7B), or reduce OLLAMA_KEEP_ALIVE so models unload faster.
Keep Your LLM Healthy
A running system needs monitoring. Health checks, logs, resource tracking—simple practices prevent surprises.
Health Check Script
#!/bin/bash
echo "=== Ollama Health Check ==="
echo ""
# Service status
echo "✓ Service Status:"
systemctl status ollama --no-pager | head -3
echo ""
echo "✓ API Connectivity:"
curl -s http://localhost:11434/api/tags | jq '.models | length' | xargs echo " Models available:"
echo ""
echo "✓ Resources:"
ps aux | grep "ollama serve" | grep -v grep | awk '{print " CPU: " $3 "%, RAM: " $6 " KB"}'
echo ""
echo "✓ Logs (last error):"
ERR=$(journalctl -u ollama 2>/dev/null | grep -i error | tail -1)
echo "${ERR:- No errors}"
Log Location
Where logs live depends on how you run Ollama: a systemd service logs to the journal, while a manual ollama serve run prints to the terminal. Check for issues:
# View recent logs
journalctl -u ollama -n 50
# Find errors
journalctl -u ollama | grep -i error | tail -10
Regular Maintenance
- Weekly: Check disk usage (du -sh ~/.ollama/models)
- Monthly: Update Ollama (re-run the install script: curl -fsSL https://ollama.com/install.sh | sh)
- Quarterly: Clean unused models (ollama rm model_name)
- Yearly: Backup models to an external drive
Before You Go Live
You've learned parameters, benchmarking, optimization. This checklist ensures your Ollama setup is production-ready for OpenClaw.
Pre-Launch Checklist
- ☐ Ollama running as systemd service (auto-start on boot)
- ☐ Model selected and downloaded (Mistral recommended)
- ☐ API endpoint verified (curl http://localhost:11434/api/tags)
- ☐ Temperature set appropriately (0.1-0.3 for OpenClaw)
- ☐ Context window configured (4096 minimum)
- ☐ Max tokens set (2048 to prevent runaway)
- ☐ Baseline performance benchmarked
- ☐ Health check script running
Configuration Backup
Save your working configuration:
# Backup models
cp -r ~/.ollama/models ~/ollama-models-backup
# Document your settings
cat > ~/ollama-config.txt << EOC
Model: mistral
Temperature: 0.1
Top-P: 0.9
Context: 4096
Num Predict: 2048
EOC
OpenClaw Integration Checklist
- ☐ Ollama endpoint configured in OpenClaw (localhost:11434)
- ☐ Model name matches what you pulled
- ☐ Test prompt sent and received successfully
- ☐ Response quality acceptable
- ☐ Response time reasonable (5-20 seconds)
- ☐ No memory leaks after sustained use
- ☐ Reboot test: Ollama starts automatically, works after restart
Common Issues and Fixes
Things go wrong. Here's how to diagnose and fix common Ollama problems.
Ollama Won't Start
Symptom: Service shows inactive or fails to start
systemctl status ollama
# Read error message
# Try manual start to see error
ollama serve
Common fixes:
- Permission denied: sudo chown -R ollama:ollama ~/.ollama
- Port conflict: check sudo lsof -i :11434
- Corrupted model: delete (ollama rm model_name) and re-pull
Out of Memory Errors
Symptom: "OOM Killer" in dmesg, processes killed
free -h
# If available < 2GB during inference, you're hitting limits
dmesg | grep -i killed | tail -5
# Shows what got OOM killed
Fixes:
- Reduce context window (num_ctx: 2048 instead of 8192)
- Remove unused models from disk (ollama rm model_name), or stop a loaded one (ollama stop model_name on newer Ollama versions)
- Close other apps consuming memory
- Add swap if permanently needed
Very Slow Inference
Symptom: <5 tokens/sec (should be 10-15)
top -n 1
# CPU usage <80%? Issue might be elsewhere
# Check if CPU is being shared with other apps
iostat -x 1
# High wait time? Disk bottleneck
Fixes:
- Close heavy apps (browser, IDE, etc)
- Reduce num_ctx to 2048
- Use faster model (Neural Chat)
- Check for thermal throttling (watch -n 1 'grep MHz /proc/cpuinfo')
API Not Responding
Symptom: curl returns Connection refused
systemctl status ollama
# Make sure it's running
netstat -an | grep 11434
# Should show LISTEN on port 11434
curl http://localhost:11434/api/tags
# Should return JSON, not error
Fixes:
- Start the service: sudo systemctl start ollama
- Check the firewall: sudo ufw allow 11434
- Restart: sudo systemctl restart ollama
Bad Quality Responses
Symptom: Responses are nonsensical or repetitive
Fixes:
- Lower temperature (0.3 or less)
- Increase repeat_penalty (1.5)
- Reduce num_predict (limit length)
- Try different model (Llama2 instead of Mistral)
Where to Go From Here
You've completed Ollama Advanced. You understand parameters, tuning, integration, optimization, and troubleshooting. Your local LLM setup is sophisticated and production-ready.
You've Accomplished
- ✓ Understand all Ollama parameters and their effects
- ✓ Tune for specific use cases (OpenClaw, creative, code, etc)
- ✓ Benchmark models objectively
- ✓ Know when GPU acceleration matters (and doesn't for you)
- ✓ Run multiple models concurrently
- ✓ Integrate Ollama seamlessly with OpenClaw
- ✓ Optimize your hardware for maximum LLM performance
- ✓ Monitor and maintain a healthy Ollama system
- ✓ Troubleshoot common problems
Option 1: Deploy Your OpenClaw Bot
You have everything you need. Configure OpenClaw to use your local Ollama, then deploy:
- Point OpenClaw to http://localhost:11434
- Select your tuned model and parameters
- Launch your Discord bot
- Enjoy your private, offline-capable AI agent
Option 2: Explore Advanced Topics
If you want to go deeper:
- Model Fine-Tuning: Customize models for specific tasks (advanced)
- Quantization: Compress models further (4-bit, 2-bit)
- Distributed Inference: Run Ollama across multiple machines
- Web UI: Build a web interface for Ollama
- Monitoring: Prometheus/Grafana metrics tracking
Option 3: Compare with Other LLM Tools
Ollama isn't the only option. If you want to explore:
- LM Studio: Web UI for local models (easier than Ollama CLI)
- vLLM: High-performance inference server (more complex)
- Text Generation WebUI: Feature-rich but steeper learning curve
- GPT4All: Lightweight, beginner-friendly
But honestly? Ollama is the best balance of simplicity and power for your use case.
Keep Learning
Understand Transformers Better:
- Read: "Attention is All You Need" (the original paper)
- Watch: YouTube tutorials on how LLMs work
- Experiment: Try different models, prompt engineering
Follow the Community:
- Ollama GitHub (issues, discussions)
- Hugging Face model hub (find new models)
- Reddit r/LocalLLaMA (community sharing)
You're Part of the AI Revolution
A few years ago, running local LLMs meant compiling C++, wrestling with dependencies, and getting 1-2 tokens/sec. Now? You install Ollama, pull a model, and get 10-15 tokens/sec on CPU. You've got a private, offline-capable AI brain that costs nothing to run.
Your data is yours. Your LLM is yours. No cloud vendor, no API limits, no surveillance.
That's power. Use it wisely.