How it works

1. File discovery

unch index walks the repository, applies .gitignore plus explicit --exclude patterns, and decides which source files should be processed.
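
The discovery step can be sketched roughly as follows. This is a minimal illustration, not unch's actual code: the function name discover_files and its parameters are hypothetical, and patterns are matched with simple glob rules rather than full .gitignore semantics (no negation, anchoring, or directory-only patterns).

```python
import fnmatch
import os


def _rel(rel_dir, name):
    # Build a POSIX-style path relative to the repository root.
    return name if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/" + name


def _excluded(rel_path, patterns):
    # A path is excluded if the whole path or any single component matches a pattern.
    parts = rel_path.split("/")
    return any(
        fnmatch.fnmatchcase(rel_path, p) or any(fnmatch.fnmatchcase(part, p) for part in parts)
        for p in patterns
    )


def discover_files(root, exclude_patterns):
    """Return sorted relative paths under root that match no exclude pattern.

    Hypothetical helper: exclude_patterns stands in for the combined
    .gitignore and --exclude rules.
    """
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not _excluded(_rel(rel_dir, d), exclude_patterns)]
        for name in filenames:
            rel_path = _rel(rel_dir, name)
            if not _excluded(rel_path, exclude_patterns):
                kept.append(rel_path)
    return sorted(kept)
```

Pruning dirnames in place keeps the walk from entering excluded directories at all, which matters for large vendored or generated trees.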

2. File hashing

Each candidate file is hashed. The active file-hash state in .semsearch/filehashes.db is used to decide whether the file can be reused or needs to be reindexed.
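
As a rough sketch of the reuse decision: the example below assumes SHA-256 content hashes and uses a plain dict in place of .semsearch/filehashes.db. Both are assumptions for illustration, since the actual hash algorithm and storage schema are not described here, and plan_reindex is a hypothetical name.

```python
import hashlib


def file_digest(data: bytes) -> str:
    # Assumption: SHA-256 over the raw file contents.
    return hashlib.sha256(data).hexdigest()


def plan_reindex(current, stored_hashes):
    """Split files into those that can be reused and those that need reindexing.

    current:       dict of path -> file contents (bytes)
    stored_hashes: dict of path -> hex digest from the previous run
    Returns (reuse, reindex, new_hashes).
    """
    reuse, reindex, new_hashes = [], [], {}
    for path in sorted(current):
        digest = file_digest(current[path])
        new_hashes[path] = digest
        if stored_hashes.get(path) == digest:
            reuse.append(path)      # unchanged since the last run
        else:
            reindex.append(path)    # new file or changed contents
    return reuse, reindex, new_hashes
```

Files that appear in stored_hashes but not in current were deleted or excluded; a real implementation would also need to drop their entries.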

3. Symbol extraction

For supported languages, unch uses Tree-sitter to extract top-level symbols and attached docs. That currently covers:

Go
Rust
TypeScript
JavaScript
Python

For unsupported files or parser failures, unch index can still use the legacy prefix fallback.
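
To illustrate what "top-level symbols and attached docs" means, here is a small stand-in that uses Python's built-in ast module instead of Tree-sitter and therefore handles only Python source. The function name and the shape of the returned records are hypothetical. A parse failure returns None so a caller could switch to a fallback path; the fallback itself is not modeled here.

```python
import ast


def extract_top_level_symbols(source):
    """Return top-level functions and classes with their docstrings.

    Stand-in for a Tree-sitter query: uses the stdlib ast module, so it
    works for Python source only. Returns None if the source cannot be
    parsed, signalling that a fallback is needed.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None

    symbols = []
    # tree.body holds only module-level statements, so nested
    # functions and methods are skipped.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "doc": ast.get_docstring(node) or "",
                "line": node.lineno,
            })
    return symbols
```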

4. Embedding generation

Each extracted symbol is flattened into an indexed document and embedded with the selected provider. Provider options today:

llama.cpp for local GGUF models through yzma
openrouter for remote embedding APIs

Known model ids today:

embeddinggemma
qwen3
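
The flatten-then-embed flow can be pictured with the sketch below. Everything in it is hypothetical: the Embedder interface, the flatten_symbol layout, and ToyEmbedder, which derives numbers from a hash and is not a real embedding model. It only shows how a provider could sit behind a common interface.

```python
import hashlib
from typing import List, Protocol


class Embedder(Protocol):
    """Hypothetical provider interface: text in, vector out."""

    def embed(self, text: str) -> List[float]: ...


class ToyEmbedder:
    """Deterministic placeholder. NOT a real embedding model."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes into the range [0, 1].
        return [digest[i] / 255.0 for i in range(self.dim)]


def flatten_symbol(symbol):
    # Assumed layout: a "kind name" header, then docs, then body, skipping empty parts.
    parts = [f"{symbol['kind']} {symbol['name']}", symbol.get("doc", ""), symbol.get("body", "")]
    return "\n".join(p for p in parts if p)


def embed_symbols(symbols, embedder: Embedder):
    """Return (symbol name, vector) pairs for each symbol."""
    return [(s["name"], embedder.embed(flatten_symbol(s))) for s in symbols]
```

Keeping the provider behind one small interface is what lets a local and a remote backend be swapped without touching the flattening step.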

5. Snapshot activation

Embeddings and symbols are written into provider-scoped and model-scoped snapshots in .semsearch/index.db. When indexing succeeds, the new snapshot becomes active for that provider/model pair. Other provider/model snapshots are left untouched. Embedding vector tables are also separated by dimension, so models with different embedding sizes can coexist in one .semsearch directory.
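
One way to picture the snapshot behavior is the SQLite sketch below. The schema, table names, and function names are invented for illustration and do not reflect the real layout of .semsearch/index.db. It shows three properties described above: a snapshot is written before it is activated, activation only touches the same provider/model pair, and vectors live in a per-dimension table.

```python
import json
import sqlite3


def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " id INTEGER PRIMARY KEY,"
        " provider TEXT NOT NULL,"
        " model TEXT NOT NULL,"
        " dim INTEGER NOT NULL,"
        " active INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def vector_table(dim):
    # One table per embedding size, e.g. vectors_768 (hypothetical naming).
    return f"vectors_{int(dim)}"


def write_snapshot(db, provider, model, vectors):
    """Store (symbol, vector) pairs as a new, still inactive snapshot."""
    dim = len(vectors[0][1])
    table = vector_table(dim)
    db.execute(f"CREATE TABLE IF NOT EXISTS {table} (snapshot_id INTEGER, symbol TEXT, vector TEXT)")
    cur = db.execute(
        "INSERT INTO snapshots (provider, model, dim) VALUES (?, ?, ?)",
        (provider, model, dim),
    )
    snap_id = cur.lastrowid
    db.executemany(
        f"INSERT INTO {table} VALUES (?, ?, ?)",
        [(snap_id, name, json.dumps(vec)) for name, vec in vectors],
    )
    return snap_id


def activate(db, snap_id):
    """Make snap_id the active snapshot for its own provider/model pair only."""
    provider, model = db.execute(
        "SELECT provider, model FROM snapshots WHERE id = ?", (snap_id,)
    ).fetchone()
    with db:  # commit both updates together
        db.execute(
            "UPDATE snapshots SET active = 0 WHERE provider = ? AND model = ?",
            (provider, model),
        )
        db.execute("UPDATE snapshots SET active = 1 WHERE id = ?", (snap_id,))


def active_snapshot(db, provider, model):
    row = db.execute(
        "SELECT id FROM snapshots WHERE provider = ? AND model = ? AND active = 1",
        (provider, model),
    ).fetchone()
    return row[0] if row else None
```

Because activation is a separate step, a failed indexing run in this sketch leaves the previously active snapshot in place.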

6. Search

unch search supports:

semantic
lexical
auto

auto stays semantic-first, but can prefer lexical results when the query looks more like a symbol name or code fragment.
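
The kind of check auto mode might perform can be sketched as below. The heuristic is an assumption, not the actual rule unch applies: it treats code punctuation, snake_case, and camelCase as signs that the query is a symbol name or code fragment. The names looks_like_code and preferred_ranking are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_CODE_PUNCTUATION = re.compile(r"[(){}\[\];=<>]|::")


def looks_like_code(query):
    """Guess whether a query is a symbol name or code fragment."""
    q = query.strip()
    if not q:
        return False
    if _CODE_PUNCTUATION.search(q):
        return True
    if " " in q:
        # Several plain words without code punctuation read as natural language.
        return False
    # Single token: snake_case or an interior capital suggests an identifier.
    return bool(_IDENTIFIER.match(q)) and ("_" in q or any(c.isupper() for c in q[1:]))


def preferred_ranking(query, mode="auto"):
    """Return which ranking to prefer; explicit modes pass through unchanged."""
    if mode != "auto":
        return mode
    return "lexical" if looks_like_code(query) else "semantic"
```

With this heuristic, a query such as parseConfig or snapshot_id leans lexical, while a question like "where are file hashes stored" stays semantic.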

7. Optional remote restore

If the manifest is bound to remote CI, unch search can restore the latest compatible published state before executing the query.
