Semantic Search Engine¶

Deep dive into the TF-IDF search engine powering Aksara's code intelligence.

SearchDocument¶

Every indexed artifact becomes a SearchDocument:

from aksara.search import SearchDocument

doc = SearchDocument(
    kind="model",              # Category (model, route, setting, etc.)
    title="User",              # Display title
    summary="User model with auth fields",
    content="Model: User\nTable: users\nFields: id, email, password_hash",
    metadata={"table_name": "users", "field_count": 3},
    tags=["model", "users", "auth"],
    source="model:User",
)

Fields¶

Field	Type	Description
`id`	`str`	Auto-generated UUID (16 hex chars)
`kind`	`Literal`	Document category
`title`	`str`	Human-readable title
`summary`	`str`	Short description
`content`	`str`	Full text for indexing
`metadata`	`dict`	Arbitrary key-value pairs
`tags`	`list[str]`	Filterable tags
`source`	`str`	Source identifier
`created_at`	`datetime`	Creation timestamp

SearchIndex¶

The in-memory search index:

from aksara.search import SearchIndex, SearchDocument

index = SearchIndex()

# Add documents
index.add(doc)
index.add_many([doc1, doc2, doc3])

# Search
results = index.search(
    "user authentication",
    top_k=10,
    kind="model",        # Filter by kind
    tags=["auth"],       # Filter by tags
    min_score=0.1,       # Score threshold
    mode="hybrid",       # keyword, semantic, or hybrid
)

# Manage
index.remove(doc_id)
index.clear()
print(index.size)
print(index.stats())

TF-IDF Engine¶

The built-in search uses Term Frequency–Inverse Document Frequency (TF-IDF):

Tokenization: Text is split into tokens, splitting camelCase and snake_case
Stop word removal: Common English words are filtered out
TF computation: Term frequency per document
IDF computation: Inverse document frequency across the corpus
Cosine similarity: Query vector compared against document vectors

Scoring¶

Keyword mode: Token overlap ratio (query tokens ∩ document tokens)
Semantic mode: TF-IDF cosine similarity
Hybrid mode: 40% keyword + 60% semantic (default, best overall)

SearchResult¶

@dataclass
class SearchResult:
    document: SearchDocument   # The matched document
    score: float              # 0.0 to 1.0
    highlights: list[str]     # Matched text snippets
    match_type: str           # "keyword", "semantic", or "hybrid"

Embedding Providers¶

The BaseEmbeddingProvider protocol allows pluggable embedding backends:

from aksara.search.embeddings import get_embedding_provider

# Built-in TF-IDF (default, no dependencies)
provider = get_embedding_provider("local")

# Custom provider
from aksara.search.embeddings import register_embedding_provider

class MyEmbedder:
    provider_name = "custom"
    dimensions = 384
    def embed(self, text): ...
    def embed_batch(self, texts): ...

register_embedding_provider("custom", MyEmbedder)