The AI Shell That Puts
Your Machine First
Olympus routes every prompt through Ollama locally — free, instant, private. Cloud providers only activate when Ollama can't handle it. Cut cloud token spend 60–90% without changing how you work.
$ oly ask "refactor the auth module to use JWTs"
# routed to ollama — $0.00
Naive AI usage is expensive
Sending every prompt directly to a cloud API means paying per token on every message, including the entire conversation history re-sent on each turn. By turn 20 you're paying for the same tokens 20 times over.
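The compounding is easy to quantify. A back-of-envelope sketch in Go (the turn and token counts are illustrative, not measured):

```go
package main

import "fmt"

// cumulativeTokens models the billing effect of re-sending the full
// history on every turn: each turn adds `perTurn` new tokens, and the
// entire history so far is billed again as input.
func cumulativeTokens(turns, perTurn int) int {
	total := 0
	history := 0
	for t := 0; t < turns; t++ {
		history += perTurn // this turn's new content
		total += history   // the whole history is re-sent and billed
	}
	return total
}

func main() {
	// 20 turns at ~200 tokens each is only 4,000 tokens of content,
	// but the API bills the re-sent history on every turn.
	fmt.Println(cumulativeTokens(20, 200)) // 42000 — over 10x the raw content
}
```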
Production AI teams at Anthropic, Scale AI, Fireworks AI, and Hugging Face don't work this way. They run local models first, compress aggressively, and treat cloud calls as a last resort.
50–70% of queries don't need GPT-4
Research shows the majority of developer queries — explain, fix, summarize, review — are handled equally well by smaller local models.
Context compression cuts 70–85% of tokens
Summarizing conversation history with a local model before each cloud call eliminates the compounding cost of long sessions.
Route by complexity, not habit
Scoring each prompt for complexity and routing to the cheapest capable model — without sacrificing quality — is now standard practice in production.
Ollama first. Cloud as fallback.
Every query flows through a 4-level waterfall. Ollama handles the vast majority. Cloud providers only activate when Ollama is unavailable — and within cloud, subscription providers are always tried before pay-per-token.
Configure the threshold to adjust how much goes local:
local_threshold: 1.0   # 1.0 = everything goes to Ollama (default)
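Olympus's actual scoring heuristic isn't shown here, but the idea can be sketched: assign each prompt a complexity score and keep anything at or below local_threshold on Ollama. The keywords and weights below are illustrative assumptions, not the real implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// score is a hypothetical complexity heuristic returning a value in [0, 1].
// Olympus's real scoring logic is not reproduced here.
func score(prompt string) float64 {
	s := 0.0
	if len(prompt) > 400 { // longer prompts tend to be harder
		s += 0.4
	}
	// Keywords that usually signal multi-step work (illustrative list).
	for _, kw := range []string{"architecture", "migration", "design", "prove"} {
		if strings.Contains(strings.ToLower(prompt), kw) {
			s += 0.3
		}
	}
	if s > 1 {
		s = 1
	}
	return s
}

// route keeps anything at or below localThreshold on Ollama.
func route(prompt string, localThreshold float64) string {
	if score(prompt) <= localThreshold {
		return "ollama"
	}
	return "cloud"
}

func main() {
	fmt.Println(route("explain the token bucket algorithm", 1.0)) // ollama
	fmt.Println(route("design a migration architecture", 0.5))    // cloud
}
```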
Compress before you send. Always.
Conversation history is the silent cost killer. Every cloud call re-sends every prior turn, so by turn 20 you are paying for the same context 20 times over. Olympus uses Ollama to summarize old turns locally, and for free, before any cloud call goes out.
This is the same technique described in LLMLingua-2 (Microsoft Research, 2024) and Anthropic's own long context management guide.
| Context per call | After compression | Reduction |
|---|---|---|
| 4,000 tokens | ~1,200 tokens | ~70% |
| 8,000 tokens | ~1,200 tokens | ~85% |
| 16,000 tokens | ~1,400 tokens | ~91% |
Compression is automatic and invisible. Configure the trigger:
routing:
  compress_after_turns: 10   # summarize history every 10 turns (default)
Everything you need, nothing you don't
Ollama-first routing
Local models handle everything by default. Complexity scoring in pure Go — no model call needed to make the routing decision.
Context compression
Ollama summarizes old conversation turns locally before each cloud call. Automatic, invisible, configurable.
Real-time token streaming
SSE (Claude) and NDJSON (Ollama) streaming with a Bubble Tea TUI that shows every token as it arrives.
Governance panels
Six built-in review panels: code-review, security-review, threat-modeling, cost-analysis, documentation, data-governance.
Plugin providers
Add Groq, Mistral, Together AI, Azure OpenAI, or any OpenAI-compatible API with one command — no code required.
Context checkpointing
Automatic checkpoint at 80% context window. Restore any session with /continue <id>.
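The 80% checkpoint trigger is just a ratio check against the model's context window. A minimal sketch (the function name and token counts are illustrative):

```go
package main

import "fmt"

// shouldCheckpoint fires once the session has consumed 80% of the
// model's context window, i.e. used/window >= 0.8 (in integer math).
func shouldCheckpoint(usedTokens, contextWindow int) bool {
	return usedTokens*5 >= contextWindow*4
}

func main() {
	fmt.Println(shouldCheckpoint(6500, 8192)) // false — 79.3% used
	fmt.Println(shouldCheckpoint(6554, 8192)) // true — crosses 80%
}
```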
Built-in provider waterfall
Built-in providers are configured via olympus configure. Plugin providers are added with olympus providers add.
| Priority | Provider | Auth | Cost | Best for |
|---|---|---|---|---|
| 1 · Primary | Ollama (local) | None — local | Free | Everything. Default for all queries. |
| 2 · Cloud fallback | Claude Pro (OAuth) | Claude Code session | Subscription | Reasoning, long context, complex code |
| 3 · Cloud fallback | GitHub Copilot | GitHub PAT | Subscription | Code generation, diffs |
| 4 · Last resort | Claude API | Anthropic API key | Per token ⚠ | Fallback only — cost warning shown |
| 5 · Plugin | Groq / Mistral / etc. | API key | Per token | Any OpenAI-compatible API |
# Add any OpenAI-compatible API as a plugin provider
olympus providers add groq \
--key gsk_... \
--model llama-3.3-70b-versatile \
--base-url https://api.groq.com/openai/v1
olympus providers list
Get started in two minutes
Requires Ollama running locally. Go 1.22+ is needed only when building from source.
# Install via Homebrew
brew install convergent-systems-co/tap/olympus
# Or build from source
git clone https://github.com/convergent-systems-co/olympus-cli
cd olympus-cli && make install-oly
# Pull a local model (Ollama handles most queries for free)
ollama pull llama3
# Configure cloud providers (optional — Ollama works without any)
olympus configure
# Start the shell
olympus
Developer workflow commands
/fix the null pointer in auth_service.go
/explain the token bucket algorithm
/review the payment processing module
/refactor the database connection pool
/tests the UserService.CreateAccount method
/diff # review staged git changes
/security # security-focused code review
/govern # run all Dark Forge governance panels
What's next
Semantic caching
Cache responses by query embedding. Skip API calls for semantically similar questions. Expected: 20–40% additional reduction.
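A semantic cache of this kind typically stores each answered query's embedding and returns the cached response when a new query's embedding is close enough. A minimal sketch, assuming embeddings are already computed (the threshold value is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length embedding vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type cacheEntry struct {
	embedding []float64
	response  string
}

// lookup returns a cached response when a prior query's embedding is
// similar enough, skipping the API call entirely.
func lookup(cache []cacheEntry, queryEmb []float64, threshold float64) (string, bool) {
	for _, e := range cache {
		if cosine(e.embedding, queryEmb) >= threshold {
			return e.response, true
		}
	}
	return "", false
}

func main() {
	cache := []cacheEntry{{[]float64{1, 0}, "cached answer"}}
	// A near-identical query embedding hits the cache; no API call made.
	if resp, ok := lookup(cache, []float64{0.95, 0.05}, 0.9); ok {
		fmt.Println(resp) // cached answer
	}
}
```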
RAG — codebase context
Vector-index the repo. Send the 3 most relevant file chunks instead of full files. 70–90% context reduction on code queries.
LLMLingua compression
Score individual prompt tokens by importance, prune the lowest-scoring ones. 20–40% prompt size reduction with minimal quality loss.