local-first · policy-gated retrieval · running on-device

Companion

v0.0.0-dev · MIT · Open source · Local-first · Tauri 2 + Rust + TypeScript

Built for:: Engineers who want a serious local AI surface they can audit, extend, and own — not a wrapper around someone else’s API.
Not built for:: Anyone who wants a polished consumer chat app today.

Most AI assistants assume the language model is the product. Companion doesn’t. The model is a service; the relationship — memory, presence, trust, timing — is the product. Policy is what keeps the relationship intact over years of use, and it is the only thing in the system that nothing else can route around.

§ I

The problem

A consumer-grade AI assistant is a single-session question machine. You type, it answers, the context evaporates, and the next session starts cold. That works for a search bar. It does not work for anything you want to live alongside for years — a journal, a research partner, a creative collaborator, a long-running second brain. Long-lived relationships need state; state on someone else’s servers is a relationship you don’t actually own.

Companion is the opposite shape. Memory, persona, audit log, and model weights all live on the user’s machine. Remote calls are deliberate and visible — never the default path. The architectural invariant is one line: every action that touches the world first clears a policy gate. No fast paths, no admin mode, no exceptions. The gate is the ground floor; everything else grows on top of it.

§ II

Decisions

The features are consequences. The decisions are the spine. Each one is dated and lives in a versioned ADR in the repo, so the reasoning is auditable later — including by me, when I forget why I made a call six months from now.

kept
Sequential retrieval, post-memory and pre-prompt, with a hard 5-second bailout per phase. Parallel-with-timeout was the obvious move, but retrieval quality materially improves when the query vector is formed from the memory-augmented context, not the raw utterance. The bailout is a safety valve for a broken Ollama, not a normal-path latency target. ADR-0005 →
cut
wgpu::Limits::max_buffer_sizeas a tier-detection signal. On real hardware (24 GB RTX 3090 Ti) the Vulkan backend reported u64::MAXwhile DX12 and GL reported 1 GB — the heuristic was measuring “did the driver expose a sentinel”, not VRAM. Tier classification now keys on device_type + total system RAM, with the per-platform VRAM probe deferred to a future ADR. ADR-0008 →
flipped
Embedding default from nomic-embed-text to bge-m3. ~25-point retrieval-accuracy lift in recent RAG benchmarks (72% vs 57%); 1024-dim dense vs 768; 1.2 GB pull, still CPU-bounded. Text default pinned explicitly to gemma4:e4b rather than riding the :latesttag, which currently resolves to e4b but could drift to the 31B Dense variant — known FA-hang on prompts > 3–4K tokens. ADR-0003 →
refused
A cloud-fallback route for the local model. The whole premise is that nothing leaves the device by default; an “escape hatch” that calls a remote API when the local model struggles erases the architecture. If the local model is wrong for the workload, the fix is a different local model.

Retrieval is the boundary where the model becomes mine.
— ADR-0005, retrieval wiring

§ III

System

Seven independent engines on a shared bus, each with an explicit contract. Sibling layers do not import each other; they coordinate through typed events and a single policy gate. The boundary is enforced by a workspace-level linter that fails CI on a forbidden import edge. That linter is the only reason the architecture survives contact with my own shortcuts.

FIGURE 1. A single turn — the policy gate (brass) is the only path to the world; retrieval (brass) is the only path that adds context the user owns.

Stack — current pins.
Layer	Implementation	Purpose
Shell	Tauri 2.x	Native window · IPC · single binary
Core	Rust + tokio	Layered engines · capability bus · audit
Inference	Ollama · gemma4:e4b	Local text model · 9.6 GB · 128K ctx
Embeddings	bge-m3 · 1024-d	Retrieval lane · CPU-bounded
Storage	SQLite + sqlite-vec	Durable memory · WAL · single writer
UI	React 19 + Vite	Chat surface · presence · onboarding

apps/desktop/src-tauri/src/retrieval.rsrust · ADR-0005

// Embed → query_nearest → fetch_one → rank, with one
// hard wall-clock deadline across the whole pipeline.
// On exceed: warn once, return empty. No silent fallback.
pub fn run_retrieval_context(
    state: &AppState,
    session_id: &str,
    utterance: &str,
    max_items: usize,
    deadline: Duration,
) -> Vec<RetrievalHit> {
    if max_items == 0 || !state.embeddings_enabled() {
        return Vec::new();
    }

    // L5 policy gate — one audit row per retrieval call.
    let decision = state.policy_evaluate(
        Capability::RetrievalContext,
        ResourceScope::None,
    );
    if !matches!(decision, Decision::Allow { .. }) {
        return Vec::new();
    }

    let deadline_at = Instant::now() + deadline;
    let query = timed(deadline_at, || state.embed(utterance))?;
    let hits  = timed(deadline_at, || state.query_nearest(&query))?;
    let rows  = timed(deadline_at, || state.fetch_rows(&hits))?;

    rank(rows, RECENCY_WEIGHT_ALPHA, max_items)
}

retrieval-context.audit.jsonaudit row · L5

{
  "request_id":  "retrieval-1714312880412",
  "capability":  "RetrievalContext",
  "decision":    "Allow",
  "deadline_ms": 5000,
  "embed_ms":    187,
  "query_ms":    42,
  "fetch_ms":    9,
  "rank_alpha":  0.1,
  "hits": [
    { "id": "msg_2419", "score": 0.821, "tier": "durable" },
    { "id": "msg_2210", "score": 0.804, "tier": "session" },
    { "id": "msg_1988", "score": 0.793, "tier": "session" }
  ],
  "bailout":     false
}

FIGURE. One retrieval invocation. Every call lands a row in the audit log; bailout fires before the prompt builder ever blocks. The 5-second deadline applies across embed + query + fetch — not per-step — so a slow stage can never compound.

Companion desktop app — three-pane layout with recent conversations, an active chat showing a retrieval-context capability call, and a side panel listing three ranked retrieval hits with memory tier tags. — FIGURE 2. One turn mid-flight. The retrieval context panel surfaces the three rows the orchestrator handed the model, with their tier and recency baked into the rank. Telemetry strip below: the local model, throughput, resident memory.

Expanded retrieval-context drawer — eight ranked memory hits with relevance scores from 0.91 down to 0.42, each row tagged with a memory tier (DURABLE / EPHEMERAL / SESSION). — FIGURE 3. The drawer one click below the chat. Same retrieval invocation, eight hits ranked by relevance, each row carrying its tier and embed-time. The bottom strip names the model that produced the embeddings, the dimension, and the median look-up.

Companion first-run terminal — install progress for sqlite-vec, gemma4:e4b download at 1.2GB, embedding model warming, and a final ready banner reading on-device with 23 tokens per second throughput. — FIGURE 4. The first run. No installer, no telemetry call, no signed update channel — just three commands and a ready banner. The local model loads once and stays resident; the next conversation starts at the same throughput as the last.

§ IV

Running it locally

Companion is source-only at this stage — there is no signed installer yet because there is no signed installer worth shipping yet. The build chain is small. Rust toolchain, Node, Ollama, the model and embedding pulls; that’s it. The first run brings up an empty profile and the policy gate in the most conservative posture; you grant capabilities as you use them, and every grant is a row in the audit log.

build.shshell# prerequisites: rustup, node 20+, pnpm, Ollama
git clone https://github.com/dbhavery/aether
cd aether
ollama pull gemma4:e4b
ollama pull bge-m3
pnpm install
pnpm tauri dev

Targets macOS 13+, Linux x86_64/arm64, Windows 11. Disk: ~12 GB for both model pulls. Runtime: ~6 GB resident during inference on the e4b variant. The repo’s CONTRIBUTING.md covers per-platform setup quirks.

Acknowledgments

Three notes on what this assembly owes. First, the runtime: Tauri wraps a Rust core (tokio, SQLite via sqlite-vec, hardware probes from sysinfo and wgpu) into a single binary that opens like a native window. Second, the model layer: the Gemma weights from Google DeepMind, served via Ollama, with BGE-M3 doing retrieval. Third, the boundaries: cargo-deny and a small ESLint rule keep the layered architecture from rotting. The interesting parts of this project are the parts I wrote on top of work other people did first.

← Index