Corpora (RAG)

A corpus is a collection of documents Kenaz has indexed and made retrievable to the model. When you attach a corpus to a session, the model can ask "find me the 5 most relevant chunks for this query" without dumping the whole document set into context.
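Conceptually, that query is a nearest-neighbor lookup: the question is embedded, then compared against every pre-embedded chunk. A minimal sketch of the idea, using cosine similarity over toy 2-d vectors (Kenaz's actual index and ranking are internal; this only illustrates what "5 most relevant chunks" means):

```python
# Illustrative top-K retrieval: rank pre-embedded chunks by cosine
# similarity to a query vector. Toy vectors; real embeddings have
# hundreds or thousands of dimensions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Chunk 1 points the same way as the query, so it ranks first.
chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k([0.6, 0.8], chunks, k=2))  # → [1, 2]
```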

Use corpora when you have:

  • A set of internal docs the model should know about (engineering notes, runbooks, contracts).
  • A codebase too large to paste into a turn.
  • Reference material you want the model to cite from rather than hallucinate.

Creating a corpus

Corpora view → New corpus.

  1. Pick a name and a description.
  2. Add sources. Three source types:
    • Local directory — Kenaz walks the directory, reads supported formats, indexes them.
    • Single file — drop a PDF, Markdown, code file.
    • Web URL — Kenaz fetches and indexes a single page (one-shot, no recursion).
  3. Pick an embedding provider. Embeddings come from the same providers as chat — Anthropic, OpenAI, Bedrock, Ollama, OpenRouter — but the model is different (smaller, faster, cheaper). Kenaz suggests a default per provider (text-embedding-3-large on OpenAI, voyage-3 via OpenRouter, nomic-embed-text via Ollama, …).
  4. Click Build. Kenaz extracts text, chunks it, embeds each chunk, writes the index to $XDG_DATA_HOME/kenaz-harness/corpora/<id>/.

Builds run in the background. The corpus is queryable as soon as the first chunks are embedded; the indicator turns green when fully built.
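The extract → chunk → embed → write pipeline can be sketched as follows. `embed` stands in for whichever embedding provider the corpus is configured with, and the on-disk layout here is invented for the example, not Kenaz's actual index format:

```python
# Sketch of a corpus build: read a document, split it into overlapping
# fixed-size chunks, embed each chunk, persist chunk text + vectors.
import json
import pathlib

def chunk(text, size=800, overlap=100):
    """Split text into overlapping fixed-size chunks (a common default)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build(doc_path, index_dir, embed):
    """Index one document; `embed` is a callable text -> vector."""
    text = pathlib.Path(doc_path).read_text()
    out = pathlib.Path(index_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = [{"source": str(doc_path), "text": c, "vector": embed(c)}
               for c in chunk(text)]
    (out / "index.json").write_text(json.dumps(records))
    return len(records)
```

Because each chunk is embedded independently, a build parallelizes well and can be made queryable incrementally, which is why partial results are available before the indicator turns green.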

Supported document formats

  • Plaintext — .md, .txt, .rst, .org
  • Code — every common extension; chunked along function/class boundaries when a tree-sitter parser is available
  • PDF — text-extractable PDFs; OCR is not run automatically
  • HTML — fetched URLs and .html files are stripped to readable text
  • Office documents — .docx, .pptx, .xlsx via local conversion

Binary formats Kenaz doesn't understand are skipped with a log line.
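Kenaz uses tree-sitter parsers for boundary-aware code chunking. To illustrate the idea only, here is the same technique for Python files using the stdlib `ast` module (one chunk per top-level function or class; this is not Kenaz's implementation):

```python
# Boundary-aware chunking sketch: instead of cutting every N characters,
# cut at syntactic units so each chunk is a self-contained definition.
import ast

def chunk_python(source):
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # lineno/end_lineno are 1-based, inclusive.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = "def a():\n    return 1\n\nclass B:\n    pass\n"
print(len(chunk_python(src)))  # → 2
```

Chunks that align with definitions retrieve better: a search for "how is retry handled" lands on the whole retry function, not half of it plus half of its neighbor.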

Attaching to a session

In the chat header → Corpora dropdown → check the corpora you want available. Multiple corpora can be active at once.

When a corpus is attached, the model gets a corpus.search tool that takes a query and returns the top-K chunks with their source filenames. The model decides when to use it — typically once at the start of a turn before generating a final answer.
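The exact tool schema is internal to Kenaz; the general shape of such a search tool, with invented field names, looks like this:

```python
# Hypothetical shape of the corpus.search tool as the model sees it.
# Field names and the result layout are assumptions for illustration.
tool = {
    "name": "corpus.search",
    "description": "Return the top-K chunks most relevant to a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

# A result is a ranked list of chunks with their provenance, so the
# model can cite source filenames in its answer:
result = [
    {"source": "runbooks/oncall.md", "score": 0.91, "text": "…"},
    {"source": "runbooks/paging.md", "score": 0.84, "text": "…"},
]
```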

Updating a corpus

  • Re-index a single file — Corpora view → corpus → file → ⋯ → Re-embed. Useful when a runbook changed.
  • Re-index everything — corpus → ⋯ → Rebuild. Wipes and re-embeds. Cheap on small corpora, slow on large ones.
  • Watch a directory — Source → ⋯ → Watch. Kenaz re-indexes files when their mtime changes. Off by default.
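The watch behavior amounts to an mtime comparison. A sketch, assuming (as the bullet above describes) that each file's modification time is compared against the time recorded when it was last indexed:

```python
# Find files whose on-disk mtime is newer than the mtime recorded at
# index time, i.e. the ones a watcher would re-embed.
import os

def stale_files(paths, last_indexed):
    """last_indexed maps path -> mtime recorded when it was indexed."""
    return [p for p in paths
            if os.path.getmtime(p) > last_indexed.get(p, 0.0)]
```

Note that mtime-based watching misses edits that preserve the timestamp; a full Rebuild is the fallback when in doubt.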

Privacy

  • Documents are read locally and embedded by whichever embedding provider you configured for the corpus. The full text of each chunk is sent to that provider.
  • The resulting embeddings (vectors) and chunk text are stored locally; never uploaded.
  • A corpus.search tool call sends only the query (a few words / a sentence) to the model — not the corpus contents. Once the model picks chunks to read, those chunk contents go in the next turn's context as the model continues.
  • The corpus index sits at $XDG_DATA_HOME/kenaz-harness/corpora/<id>/. Delete the directory or use the UI's Delete corpus action to remove.

Cost

Embedding cost is roughly proportional to total document length. Per-million-token rates as of writing:

  • OpenAI text-embedding-3-large — $0.13 / 1M tokens
  • Voyage AI (via OpenRouter) — $0.18 / 1M tokens
  • Bedrock Titan Embeddings — varies by region
  • Ollama — free (local)

A 5-megabyte Markdown corpus is roughly 1.2M tokens — under a quarter on most providers.
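That estimate follows from the common rule of thumb of roughly 4 bytes of English text per token:

```python
# Back-of-envelope embedding cost: bytes -> tokens -> dollars.
def embed_cost_usd(corpus_bytes, usd_per_million_tokens, bytes_per_token=4):
    tokens = corpus_bytes / bytes_per_token
    return tokens / 1_000_000 * usd_per_million_tokens

# 5 MB at OpenAI's rate: 1.25M tokens, roughly $0.16.
cost = embed_cost_usd(5_000_000, 0.13)
print(f"${cost:.2f}")  # → $0.16
```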

Recurring queries are free — embeddings are computed once at build time and reused on every search.