Corpora (RAG)

A corpus is a collection of documents Kenaz has indexed and made retrievable to the model. When you attach a corpus to a session, the model can ask "find me the 5 most relevant chunks for this query" without dumping the whole document set into context.
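Conceptually, that query is a nearest-neighbor lookup: the question is embedded, then compared against every pre-embedded chunk. A minimal sketch of the idea, using cosine similarity over toy 2-d vectors (Kenaz's actual index and ranking are internal; this only illustrates what "5 most relevant chunks" means):

```python
# Illustrative top-K retrieval: rank pre-embedded chunks by cosine
# similarity to a query vector. Toy vectors; real embeddings have
# hundreds or thousands of dimensions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Chunk 1 points the same way as the query, so it ranks first.
chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k([0.6, 0.8], chunks, k=2))  # → [1, 2]
```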

Use corpora when you have:

  • A set of internal docs the model should know about (engineering notes, runbooks, contracts).
  • A codebase too large to paste into a turn.
  • Reference material you want the model to cite from rather than hallucinate.

Creating a corpus

Corpora view → New corpus.

  1. Pick a name and a description.
  2. Add sources. Three source types:
    • Local directory — Kenaz walks the directory, reads supported formats, indexes them.
    • Single file — drop a PDF, Markdown, code file.
    • Web URL — Kenaz fetches and indexes a single page (one-shot, no recursion).
  3. Pick an embedding provider. Embeddings come from the same providers as chat — Anthropic, OpenAI, Bedrock, Ollama, OpenRouter — but the model is different (smaller, faster, cheaper). Kenaz suggests a default per provider (text-embedding-3-large on OpenAI, voyage-3 via OpenRouter, nomic-embed-text via Ollama, …).
  4. Click Build. Kenaz extracts text, chunks it, embeds each chunk, writes the index to $XDG_DATA_HOME/kenaz-harness/corpora/<id>/.

Builds run in the background. The corpus is queryable as soon as the first chunks are embedded; the indicator turns green when fully built.
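The extract → chunk → embed → write pipeline can be sketched as follows. `embed` stands in for whichever embedding provider the corpus is configured with, and the on-disk layout here is invented for the example, not Kenaz's actual index format:

```python
# Sketch of a corpus build: read a document, split it into overlapping
# fixed-size chunks, embed each chunk, persist chunk text + vectors.
import json
import pathlib

def chunk(text, size=800, overlap=100):
    """Split text into overlapping fixed-size chunks (a common default)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build(doc_path, index_dir, embed):
    """Index one document; `embed` is a callable text -> vector."""
    text = pathlib.Path(doc_path).read_text()
    out = pathlib.Path(index_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = [{"source": str(doc_path), "text": c, "vector": embed(c)}
               for c in chunk(text)]
    (out / "index.json").write_text(json.dumps(records))
    return len(records)
```

Because each chunk is embedded independently, a build parallelizes well and can be made queryable incrementally, which is why partial results are available before the indicator turns green.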

Supported document formats

  • Plaintext — .md, .txt, .rst, .org
  • Code — every common extension; chunked along function/class boundaries when a tree-sitter parser is available
  • PDF — text-extractable PDFs; OCR is not run automatically
  • HTML — fetched URLs and .html files are stripped to readable text
  • Office documents — .docx, .pptx, .xlsx via local conversion

Binary formats Kenaz doesn't understand are skipped with a log line.
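Kenaz uses tree-sitter parsers for boundary-aware code chunking. To illustrate the idea only, here is the same technique for Python files using the stdlib `ast` module (one chunk per top-level function or class; this is not Kenaz's implementation):

```python
# Boundary-aware chunking sketch: instead of cutting every N characters,
# cut at syntactic units so each chunk is a self-contained definition.
import ast

def chunk_python(source):
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # lineno/end_lineno are 1-based, inclusive.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = "def a():\n    return 1\n\nclass B:\n    pass\n"
print(len(chunk_python(src)))  # → 2
```

Chunks that align with definitions retrieve better: a search for "how is retry handled" lands on the whole retry function, not half of it plus half of its neighbor.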

Attaching to a session

In the chat header → Corpora dropdown → check the corpora you want available. Multiple corpora can be active at once.

When a corpus is attached, the model gets a corpus.search tool that takes a query and returns the top-K chunks with their source filenames. The model decides when to use it — typically once at the start of a turn before generating a final answer.
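The exact tool schema is internal to Kenaz; the general shape of such a search tool, with invented field names, looks like this:

```python
# Hypothetical shape of the corpus.search tool as the model sees it.
# Field names and the result layout are assumptions for illustration.
tool = {
    "name": "corpus.search",
    "description": "Return the top-K chunks most relevant to a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

# A result is a ranked list of chunks with their provenance, so the
# model can cite source filenames in its answer:
result = [
    {"source": "runbooks/oncall.md", "score": 0.91, "text": "…"},
    {"source": "runbooks/paging.md", "score": 0.84, "text": "…"},
]
```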

Updating a corpus

  • Re-index a single file — Corpora view → corpus → file → ⋯ → Re-embed. Useful when a runbook changed.
  • Re-index everything — corpus → ⋯ → Rebuild. Wipes and re-embeds. Cheap on small corpora, slow on large ones.
  • Watch a directory — Source → ⋯ → Watch. Kenaz re-indexes files when their mtime changes. Off by default.
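The watch behavior amounts to an mtime comparison. A sketch, assuming (as the bullet above describes) that each file's modification time is compared against the time recorded when it was last indexed:

```python
# Find files whose on-disk mtime is newer than the mtime recorded at
# index time, i.e. the ones a watcher would re-embed.
import os

def stale_files(paths, last_indexed):
    """last_indexed maps path -> mtime recorded when it was indexed."""
    return [p for p in paths
            if os.path.getmtime(p) > last_indexed.get(p, 0.0)]
```

Note that mtime-based watching misses edits that preserve the timestamp; a full Rebuild is the fallback when in doubt.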

Privacy

  • Documents are read locally and embedded by whichever embedding provider you configured for the corpus. The full text of each chunk is sent to that provider.
  • The resulting embeddings (vectors) and chunk text are stored locally; never uploaded.
  • A corpus.search tool call sends only the query (a few words / a sentence) to the model — not the corpus contents. Once the model picks chunks to read, those chunk contents go in the next turn's context as the model continues.
  • The corpus index sits at $XDG_DATA_HOME/kenaz-harness/corpora/<id>/. Delete the directory or use the UI's Delete corpus action to remove.

Cost

Embedding cost is roughly proportional to total document length. Per-million-token rates as of writing:

  • OpenAI text-embedding-3-large — $0.13 / 1M tokens
  • Voyage AI (via OpenRouter) — $0.18 / 1M tokens
  • Bedrock Titan Embeddings — varies by region
  • Ollama — free (local)

A 5-megabyte Markdown corpus is roughly 1.2M tokens — under a quarter on most providers.
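That estimate follows from the common rule of thumb of roughly 4 bytes of English text per token:

```python
# Back-of-envelope embedding cost: bytes -> tokens -> dollars.
def embed_cost_usd(corpus_bytes, usd_per_million_tokens, bytes_per_token=4):
    tokens = corpus_bytes / bytes_per_token
    return tokens / 1_000_000 * usd_per_million_tokens

# 5 MB at OpenAI's rate: 1.25M tokens, roughly $0.16.
cost = embed_cost_usd(5_000_000, 0.13)
print(f"${cost:.2f}")  # → $0.16
```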

Recurring queries are free — embeddings are computed once at build time and reused on every search.