Open Source · Go · Local-First

The AI Shell That Puts
Your Machine First

Olympus routes every prompt through Ollama locally — free, instant, private. Cloud providers only activate when Ollama can't handle it. Cut cloud token spend 60–90% without changing how you work.

$ olympus
$ oly ask "refactor the auth module to use JWTs"
# routed to ollama — $0.00
60–90%
Cloud token reduction
<200ms
Local response latency
$0.00
Cost for Ollama queries
4
Providers in waterfall

Naive AI usage is expensive

Sending every prompt directly to a cloud API means paying per token on every message, including the entire conversation history re-sent on each turn. By turn 20 you're paying for the same tokens 20 times over.
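The compounding cost is easy to see with a little arithmetic. The sketch below uses illustrative numbers (~400 tokens per turn), not Olympus internals, and models a client that re-sends the full history on every call:

```go
package main

import "fmt"

// totalBilled returns the cumulative tokens billed across n turns when
// every call re-sends the entire conversation history so far.
func totalBilled(tokensPerTurn, n int) int {
	billed := 0
	for turn := 1; turn <= n; turn++ {
		billed += tokensPerTurn * turn // the call at turn t carries t turns of history
	}
	return billed
}

func main() {
	// 20 turns x 400 tokens = 8,000 tokens actually written,
	// but 84,000 tokens billed: turn 1 is paid for 20 times.
	fmt.Println(totalBilled(400, 20)) // prints: 84000
}
```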

Production AI teams at Anthropic, Scale AI, Fireworks AI, and Hugging Face don't work this way. They run local models first, compress aggressively, and treat cloud calls as a last resort.

RouteLLM · Berkeley 2024

50–70% of queries don't need GPT-4

Research shows the majority of developer queries — explain, fix, summarize, review — are handled equally well by smaller local models.

LLMLingua-2 · Microsoft 2024

Context compression cuts 70–85% of tokens

Summarizing conversation history with a local model before each cloud call eliminates the compounding cost of long sessions.

Fireworks AI LLM Router

Route by complexity, not habit

Scoring each prompt for complexity and routing to the cheapest capable model — without sacrificing quality — is now standard practice in production.

Ollama first. Cloud as fallback.

Every query flows through a 4-level waterfall. Ollama handles the vast majority. Cloud providers only activate when Ollama is unavailable — and within cloud, subscription providers are always tried before pay-per-token.

flowchart TD
    Q([User query]) --> O
    O["1 · Ollama (local)\nllama3 / mistral / phi3"]
    O -->|available| OR([Response · free])
    O -->|unavailable| CP
    CP["2 · Claude Pro (OAuth)\nsubscription · no per-token cost"]
    CP -->|available| CPR([Response · subscription])
    CP -->|unavailable| GH
    GH["3 · GitHub Copilot\nsubscription · no per-token cost"]
    GH -->|available| GHR([Response · subscription])
    GH -->|unavailable| CA
    CA["4 · Claude API\n⚠ pay-per-token · last resort"]
    CA -->|available| CAR([Response · cost warning])
    CA -->|unavailable| ERR([Error + diagnosis])
    style O fill:#2E3440,color:#A3BE8C,stroke:#A3BE8C
    style OR fill:#1a2e20,color:#A3BE8C,stroke:#A3BE8C
    style CP fill:#2E3440,color:#88C0D0,stroke:#88C0D0
    style CPR fill:#1a2535,color:#88C0D0,stroke:#88C0D0
    style GH fill:#2E3440,color:#81A1C1,stroke:#81A1C1
    style GHR fill:#1a2535,color:#81A1C1,stroke:#81A1C1
    style CA fill:#3a2020,color:#BF616A,stroke:#BF616A
    style CAR fill:#3a2020,color:#EBCB8B,stroke:#EBCB8B
    style ERR fill:#3a2020,color:#BF616A,stroke:#BF616A
    style Q fill:#252d3a,color:#D8DEE9,stroke:#4C566A

Configure the threshold to adjust how much goes local:

local_threshold: 1.0   # everything to Ollama (default)

Compress before you send. Always.

Conversation history is the silent cost killer. Every cloud call re-sends every prior turn, so by turn 20 you are paying for the same context 20 times over. Olympus uses Ollama to summarize old turns locally, for free, before any cloud call goes out.

This is the same technique described in LLMLingua-2 (Microsoft Research, 2024) and Anthropic's own long context management guide.

flowchart LR
    subgraph Session["Active session (20 turns)"]
        T1["Turns 1–16\n~6,400 tokens\nverbose back-and-forth"]
        T2["Turns 17–20\n~1,600 tokens\nrecent context"]
    end
    subgraph Compress["Local compression (Ollama · free)"]
        OL["Ollama summarizes\nturns 1–16\n~1 second · $0.00"]
        SUM["Dense summary\n~800 tokens\ndecisions · code · errors"]
    end
    subgraph Payload["Cloud API payload"]
        PAY["Summary + turns 17–20\n~2,400 tokens\n70% smaller"]
    end
    T1 -->|"compress_after_turns: 10"| OL
    OL --> SUM
    T2 --> PAY
    SUM --> PAY
    PAY -->|"sent to Claude / Copilot"| Cloud(["Cloud response"])
    style Session fill:#1e2430,color:#D8DEE9,stroke:#4C566A
    style Compress fill:#1a2e20,color:#D8DEE9,stroke:#A3BE8C
    style Payload fill:#1a2535,color:#D8DEE9,stroke:#5E81AC
    style OL fill:#2E3440,color:#A3BE8C,stroke:#A3BE8C
    style SUM fill:#2E3440,color:#88C0D0,stroke:#88C0D0
    style PAY fill:#2E3440,color:#81A1C1,stroke:#81A1C1
    style Cloud fill:#1a2535,color:#88C0D0,stroke:#88C0D0
Turns | Reduction | Tokens per call
10    | ~70%      | 4,000 → ~1,200
20    | ~85%      | 8,000 → ~1,200
40    | ~91%      | 16,000 → ~1,400

Compression is automatic and invisible. Configure the trigger:

routing:
  compress_after_turns: 10   # summarize history every 10 turns (default)

See full architecture →

Everything you need, nothing you don't

🖥
Free · <200ms

Ollama-first routing

Local models handle everything by default. Complexity scoring in pure Go — no model call needed to make the routing decision.
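As an illustration of model-free routing, here is a toy complexity heuristic in plain Go (prompt length plus a few "hard" keywords); it is illustrative only, not Olympus's actual scorer.

```go
package main

import (
	"fmt"
	"strings"
)

// complexityScore returns a 0–1 score from cheap lexical signals: longer
// prompts and "hard" verbs score higher. A score above the configured
// threshold could be routed to a stronger model; no model call is needed
// to make the routing decision.
func complexityScore(prompt string) float64 {
	score := float64(len(strings.Fields(prompt))) / 100 // length signal
	for _, kw := range []string{"architect", "design", "prove", "migrate"} {
		if strings.Contains(strings.ToLower(prompt), kw) {
			score += 0.5
		}
	}
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	fmt.Printf("%.2f\n", complexityScore("explain this regex"))                  // prints: 0.03
	fmt.Printf("%.2f\n", complexityScore("design and migrate the billing schema")) // prints: 1.00
}
```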

🗜
85% token reduction

Context compression

Ollama summarizes old conversation turns locally before each cloud call. Automatic, invisible, configurable.

Streaming

Real-time token streaming

SSE (Claude) and NDJSON (Ollama) streaming with a BubbleTea TUI that shows every token as it arrives.

🛡
Dark Forge

Governance panels

Six built-in review panels: code-review, security-review, threat-modeling, cost-analysis, documentation, data-governance.

🔌
OpenAI-compatible

Plugin providers

Add Groq, Mistral, Together AI, Azure OpenAI, or any OpenAI-compatible API with one command — no code required.

💾
Auto-save

Context checkpointing

Automatic checkpoint at 80% context window. Restore any session with /continue <id>.

Built-in provider waterfall

Built-in providers are configured via olympus configure. Plugin providers are added with olympus providers add.

Priority           | Provider               | Auth                 | Cost         | Best for
1 · Primary        | Ollama (local)         | None — local         | Free         | Everything. Default for all queries.
2 · Cloud fallback | Claude Pro (OAuth)     | Claude Code session  | Subscription | Reasoning, long context, complex code
3 · Cloud fallback | GitHub Copilot         | GitHub PAT           | Subscription | Code generation, diffs
4 · Last resort    | Claude API             | Anthropic API key    | Per token ⚠  | Fallback only — cost warning shown
5 · Plugin         | Groq / Mistral / etc.  | API key              | Per token    | Any OpenAI-compatible API
# Add any OpenAI-compatible API as a plugin provider
olympus providers add groq \
  --key gsk_... \
  --model llama-3.3-70b-versatile \
  --base-url https://api.groq.com/openai/v1

olympus providers list

Get started in two minutes

Requires Go 1.22+ and Ollama running locally.

# Install via Homebrew
brew install convergent-systems-co/tap/olympus

# Or build from source
git clone https://github.com/convergent-systems-co/olympus-cli
cd olympus-cli && make install-oly

# Pull a local model (Ollama handles most queries for free)
ollama pull llama3

# Configure cloud providers (optional — Ollama works without any)
olympus configure

# Start the shell
olympus

Developer workflow commands

/fix      the null pointer in auth_service.go
/explain  the token bucket algorithm
/review   the payment processing module
/refactor the database connection pool
/tests    the UserService.CreateAccount method
/diff     # review staged git changes
/security # security-focused code review
/govern   # run all Dark Forge governance panels

What's next

Phase 2

Semantic caching

Cache responses by query embedding. Skip API calls for semantically similar questions. Expected: 20–40% additional reduction.
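A semantic cache can be sketched as a linear scan over stored query embeddings with a cosine-similarity threshold. The hand-made embeddings below stand in for real model output; this is a sketch of the planned technique, not shipped code.

```go
package main

import (
	"fmt"
	"math"
)

// cachedEntry pairs a stored query embedding with its saved response.
type cachedEntry struct {
	embedding []float64
	response  string
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// lookup returns a cached response when a prior query is semantically
// close enough, skipping the API call entirely.
func lookup(cache []cachedEntry, query []float64, threshold float64) (string, bool) {
	for _, e := range cache {
		if cosine(e.embedding, query) >= threshold {
			return e.response, true
		}
	}
	return "", false
}

func main() {
	cache := []cachedEntry{{[]float64{0.9, 0.1, 0.4}, "token buckets limit burst rate"}}
	// A slightly reworded question embeds almost identically: cache hit, no API call.
	if resp, hit := lookup(cache, []float64{0.88, 0.12, 0.41}, 0.95); hit {
		fmt.Println("cache hit:", resp)
	}
}
```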

Phase 2

RAG — codebase context

Vector-index the repo. Send the 3 most relevant file chunks instead of full files. 70–90% context reduction on code queries.
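The retrieval step can be sketched as scoring every indexed chunk against the query and keeping the top k. The toy word-overlap score below stands in for embedding similarity; all names here are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// chunk is one indexed slice of a repository file.
type chunk struct {
	path, text string
}

// overlap is a toy relevance score: distinct query words that also appear
// in the chunk. A real index would compare embeddings instead.
func overlap(query, text string) int {
	words := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(query)) {
		words[w] = true
	}
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if words[w] {
			n++
			words[w] = false // count each query word once
		}
	}
	return n
}

// topK returns the k most relevant chunks for the query.
func topK(chunks []chunk, query string, k int) []chunk {
	sort.SliceStable(chunks, func(i, j int) bool {
		return overlap(query, chunks[i].text) > overlap(query, chunks[j].text)
	})
	if len(chunks) < k {
		k = len(chunks)
	}
	return chunks[:k]
}

func main() {
	chunks := []chunk{
		{"auth/jwt.go", "func SignToken issues a jwt for the auth session"},
		{"db/pool.go", "connection pool sizing and retry logic"},
		{"auth/middleware.go", "jwt validation middleware for auth routes"},
		{"ui/theme.go", "color palette for the tui"},
	}
	// Only the 3 most relevant chunks are sent, not the full files.
	for _, c := range topK(chunks, "fix jwt auth validation", 3) {
		fmt.Println(c.path)
	}
	// prints: auth/middleware.go, auth/jwt.go, db/pool.go (one per line)
}
```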

Phase 3

LLMLingua compression

Score individual prompt tokens by importance, prune the lowest-scoring ones. 20–40% prompt size reduction with minimal quality loss.