Search & Indexing
Status: Published Last Updated: 2026-03-09
Overview
This document defines how full-text search is implemented within the ReadStore’s Search sub-interface. It covers text processing, query execution, ranking, cross-entity search, and the migration path to a dedicated search engine — built on the Architecture, Data Schema, and API Implementation.
Guiding principles:
- Composable, not separate. Search is a filter that composes with all other filters (person, time range, platform, conversation). Not a separate endpoint or query path.
- Correct across languages first. Personal data spans platforms and languages. The initial implementation prioritizes universal correctness over language-specific sophistication.
- Replaceable. The Search sub-interface is independently swappable. PostgreSQL full-text search is the starting implementation; the interface contract is stable regardless of engine.
- Good enough now, upgradable later. PostgreSQL tsvector search is adequate for personal data scale. Language-aware processing, semantic search, and dedicated engines are clean upgrade paths, not redesigns.
Interface Contract
The Search sub-interface is one of three ReadStore sub-interfaces defined in the Architecture. This section defines the engine-agnostic contract — what any search implementation must support. The remainder of this document describes the PostgreSQL implementation.
Operations
Index: accept an entity with typed, weighted fields and make it searchable.

    type SearchIndexer interface {
        Index(ctx context.Context, doc SearchDocument) error
        IndexBatch(ctx context.Context, docs []SearchDocument) error
        Delete(ctx context.Context, entityType string, entityID uuid.UUID) error
    }

    type SearchDocument struct {
        TenantID   uuid.UUID
        EntityType string        // "message", "event", "document", "person", "conversation"
        EntityID   uuid.UUID
        Timestamp  time.Time     // entity's canonical timestamp (sent_at, start_at, etc.)
        Fields     []SearchField // weighted text fields
    }

    type SearchField struct {
        Name    string      // "title", "body", "display_name", etc.
        Content string
        Weight  FieldWeight // High, Medium, Low, Default
    }

Search: accept a query string, filters, and ordering mode. Return scored results.

    type SearchReader interface {
        Search(ctx context.Context, params SearchParams) (*SearchResult, error)
        SearchMultiEntity(ctx context.Context, params CrossEntitySearchParams) (*SearchResult, error)
    }

    type SearchParams struct {
        TenantID    uuid.UUID
        Query       string
        EntityType  string           // single entity type
        Filters     SearchFilters    // composable with relational filters
        Order       OrderingMode     // Temporal, Relevance, Hybrid
        DecayRate   float64          // recency decay (Hybrid mode)
        Pagination  CursorPagination
        PrefixMatch bool             // treat last token as prefix
    }

    type SearchResult struct {
        Items      []SearchHit
        PageInfo   PageInfo
        TotalCount *int // optional, computed only when requested
    }

    type SearchHit struct {
        EntityType string
        EntityID   uuid.UUID
        Score      float64   // normalized 0.0–1.0
        Timestamp  time.Time
        Excerpt    string    // snippet for display
    }

Required Capabilities
Any implementation of the Search interface must support:
- Phrase matching — quoted multi-word queries match as a phrase
- Negation — exclude terms from results
- Field weighting — matches in high-weight fields (titles) score higher than low-weight fields (body text)
- Relevance scoring — return a normalized score (0.0–1.0) reflecting match quality, considering term frequency, proximity, and field weight
- Recency blending — combine relevance score with a configurable time-decay function
- Composable filtering — search composes with tenant, time range, platform, person, and conversation filters at the query level
- Cross-entity search — search across multiple entity types in a single query, returning a unified scored result set
- Prefix matching — optionally treat the last query token as a prefix
What the Interface Does NOT Specify
- How text is tokenized, stemmed, or normalized (engine-specific)
- How indexes are structured or maintained (engine-specific)
- How scores are computed internally (engine-specific — only the normalized output matters)
- Query parsing syntax beyond the required capabilities (engine may support more)
The Domain layer and API resolvers interact only with SearchReader and SearchIndexer. The Projector calls SearchIndexer when projecting entities. No module outside the Store layer references PostgreSQL-specific search constructs (tsvector, tsquery, GIN, ts_rank).
PostgreSQL Implementation
Everything below describes the initial implementation using PostgreSQL full-text search. All PostgreSQL-specific constructs (tsvector, tsquery, GIN indexes, ts_rank_cd) are confined to the Store layer behind the interface defined above.
Text Processing
Search Configuration
The initial implementation uses PostgreSQL’s simple text search configuration for all search vectors.
simple splits text on whitespace, lowercases all tokens, and performs no stemming or stopword removal. This provides:
- Universal language coverage — no language-specific processing means no language-specific breakage. A dataset with English Slack messages, Spanish WhatsApp messages, and Japanese LINE messages is searched uniformly.
- Predictable matching — exact token matching (after lowercasing). “Running” matches “running” but not “run”.
- No false conflation — aggressive stemming in the wrong language can conflate unrelated words.
  simple avoids this entirely.
Tradeoff accepted: Lower recall for English morphological variants. Searching “run” won’t find “running”. This is acceptable for the initial implementation — the upgrade path to language-aware search addresses it without schema changes.
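The matching behavior described above can be approximated in a few lines of Go. This is only a sketch of the idea, not the PostgreSQL parser: it splits on whitespace and lowercases, and it ignores how the real parser treats punctuation, URLs, and numbers. The function names simpleTokens and matches are hypothetical.

```go
package main

import "strings"

// simpleTokens approximates the 'simple' configuration: split on
// whitespace, lowercase each token, no stemming, no stopword removal.
// Assumption: punctuation and URL handling of the real parser is ignored.
func simpleTokens(text string) []string {
	fields := strings.Fields(text)
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		tokens = append(tokens, strings.ToLower(f))
	}
	return tokens
}

// matches reports whether every query token appears verbatim among the
// document tokens (AND semantics, exact match after lowercasing).
func matches(doc, query string) bool {
	seen := make(map[string]bool)
	for _, t := range simpleTokens(doc) {
		seen[t] = true
	}
	for _, q := range simpleTokens(query) {
		if !seen[q] {
			return false
		}
	}
	return true
}
```

With this model, matches("Running late tonight", "running") is true while matches("Running late tonight", "run") is false, which is the recall tradeoff accepted above.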
Field Weighting
PostgreSQL tsvector supports four weight classes (A, B, C, D) that influence ranking scores. Fields are weighted by semantic importance:
| Entity | Weight A | Weight B | Weight C/D |
|---|---|---|---|
| Message | — | — | body |
| Conversation | title | — | — |
| Event | title | description | — |
| Document | title | body | — |
| Person | display_name | handles | — |
Single-field entities (Messages) use default weight — weighting is irrelevant when there’s only one field. Multi-field entities weight titles above body content so that a title match ranks higher than a body mention.
Search Vector Construction
The Projector builds search vectors when projecting WriteStore entities into ReadStore tables. Construction uses setweight() and to_tsvector():

    -- Example: Document search vector
    search_vector =
        setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('simple', coalesce(body, '')), 'B')

Search vectors are recomputed on every projection: when an entity is created or updated, the Projector rebuilds the full search vector. This is idempotent and ensures the search vector always reflects the current entity state.
Bulk recomputation. When the search configuration changes (e.g., migrating from simple to language-aware configs), the Projector can recompute all search vectors in a batch operation. No schema migration required — the tsvector column type and GIN indexes remain the same.
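One way the Store layer might turn the engine-agnostic SearchField list into the setweight() expression is sketched below. The mapping from High/Medium/Low/Default to A/B/C/D and the names pgWeight and buildVectorExpr are assumptions made for illustration; the types are trimmed copies of the interface types.

```go
package main

import (
	"fmt"
	"strings"
)

// FieldWeight mirrors the interface's weight classes (constant names assumed).
type FieldWeight int

const (
	WeightDefault FieldWeight = iota
	WeightLow
	WeightMedium
	WeightHigh
)

// SearchField is a trimmed copy of the interface type.
type SearchField struct {
	Name    string
	Content string
	Weight  FieldWeight
}

// pgWeight maps a weight to a tsvector class label.
// Assumption: High=A, Medium=B, Low=C, Default=D.
func pgWeight(w FieldWeight) string {
	switch w {
	case WeightHigh:
		return "A"
	case WeightMedium:
		return "B"
	case WeightLow:
		return "C"
	default:
		return "D"
	}
}

// buildVectorExpr returns a SQL expression with one positional placeholder
// per field, plus the field contents as the matching argument list.
func buildVectorExpr(fields []SearchField) (string, []any) {
	if len(fields) == 0 {
		return "''::tsvector", nil
	}
	parts := make([]string, 0, len(fields))
	args := make([]any, 0, len(fields))
	for i, f := range fields {
		parts = append(parts, fmt.Sprintf(
			"setweight(to_tsvector('simple', coalesce($%d, '')), '%s')",
			i+1, pgWeight(f.Weight)))
		args = append(args, f.Content)
	}
	return strings.Join(parts, " || "), args
}
```

Because field content travels as bound parameters and only the weight label is interpolated from a fixed set, the expression stays safe to build dynamically.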
Query Execution
Query Parsing
User search strings are parsed with PostgreSQL’s websearch_to_tsquery function, which provides familiar web-search syntax:
| Input | Interpretation | tsquery |
|---|---|---|
| dinner plans | AND all terms | 'dinner' & 'plans' |
| "dinner plans" | Phrase match | 'dinner' <-> 'plans' |
| dinner -lunch | Negation | 'dinner' & !'lunch' |
| dinner OR lunch | OR | 'dinner' \| 'lunch' |
websearch_to_tsquery handles malformed input gracefully — unbalanced quotes or stray operators fall back to AND behavior rather than producing errors.
Prefix matching. Optionally, the last token in a query can be treated as a prefix (term:*) to support search-as-you-type patterns. This is applied at the application layer before passing to PostgreSQL:

    -- "dinner pl" becomes:
    websearch_to_tsquery('simple', 'dinner') && to_tsquery('simple', 'pl:*')

Prefix matching is available but not the default; consumers opt in via a query parameter.
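The application-layer split could look roughly like the following. It is a sketch under assumptions: the function name splitPrefixQuery is hypothetical, and the last token is reduced to letters and digits because to_tsquery, unlike websearch_to_tsquery, rejects malformed input with a syntax error.

```go
package main

import (
	"strings"
	"unicode"
)

// splitPrefixQuery separates a search string into the part handed to
// websearch_to_tsquery (head) and a prefix term for to_tsquery ("tok:*").
// ok is false when the last token has no letters or digits; in that case
// head is the whole input and the caller can skip prefix matching.
func splitPrefixQuery(input string) (head, prefix string, ok bool) {
	fields := strings.Fields(input)
	if len(fields) == 0 {
		return "", "", false
	}
	last := fields[len(fields)-1]
	// Keep only letters and digits so stray operators never reach to_tsquery.
	var b strings.Builder
	for _, r := range last {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(unicode.ToLower(r))
		}
	}
	if b.Len() == 0 {
		return input, "", false
	}
	head = strings.Join(fields[:len(fields)-1], " ")
	return head, b.String() + ":*", true
}
```

For the input "dinner pl" this yields head "dinner" and prefix "pl:*", which are then bound as the two parameters of the combined tsquery expression.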
Query Structure
Search queries compose with relational filters in a single SQL query. The search condition is an additional WHERE clause:

    SELECT message_id, sent_at, ts_rank_cd(search_vector, query) AS score
    FROM read_messages,
         websearch_to_tsquery('simple', $2) AS query  -- parsed once, reused for match and rank
    WHERE tenant_id = $1
      AND search_vector @@ query
      AND conversation_id = $3          -- optional: conversation filter
      AND sent_at BETWEEN $4 AND $5     -- optional: time range filter
      AND sender_person_id = $6         -- optional: person filter
    ORDER BY score DESC, sent_at DESC
    LIMIT 25

The GIN index on search_vector handles the text matching. Relational filters use their own indexes. PostgreSQL’s query planner combines these, typically as a bitmap AND of the GIN scan and the B-tree scan on the relational filter.
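Since the relational filters are optional, the Store layer has to assemble the WHERE clause dynamically. A minimal sketch of that assembly follows; the MessageFilters shape and the function name buildMessageSearch are hypothetical, IDs are plain strings for brevity, and only two of the optional filters are shown.

```go
package main

import (
	"fmt"
	"strings"
)

// MessageFilters holds optional relational filters (hypothetical shape).
// A nil pointer means the filter is not applied.
type MessageFilters struct {
	ConversationID *string
	SenderPersonID *string
}

// buildMessageSearch assembles the SQL text and its positional arguments.
// $1 is always the tenant and $2 the raw search string; optional filters
// and the limit take the next free placeholder numbers.
func buildMessageSearch(tenantID, query string, f MessageFilters, limit int) (string, []any) {
	args := []any{tenantID, query}
	where := []string{"tenant_id = $1", "search_vector @@ query"}
	add := func(column string, value any) {
		args = append(args, value)
		where = append(where, fmt.Sprintf("%s = $%d", column, len(args)))
	}
	if f.ConversationID != nil {
		add("conversation_id", *f.ConversationID)
	}
	if f.SenderPersonID != nil {
		add("sender_person_id", *f.SenderPersonID)
	}
	args = append(args, limit)
	sql := "SELECT message_id, sent_at, ts_rank_cd(search_vector, query) AS score\n" +
		"FROM read_messages, websearch_to_tsquery('simple', $2) AS query\n" +
		"WHERE " + strings.Join(where, "\n  AND ") + "\n" +
		fmt.Sprintf("ORDER BY score DESC, sent_at DESC\nLIMIT $%d", len(args))
	return sql, args
}
```

Column names are taken from a fixed set in code and every user-supplied value is bound as a parameter, so search stays one query path regardless of which filters are present.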
Ranking
Scoring Function
Results are scored using ts_rank_cd (cover density ranking), which considers both term frequency and term proximity:

    ts_rank_cd(search_vector, query, 32)  -- flag 32: rank / (rank + 1)

Normalization flag 32 rescales the rank to rank / (rank + 1), which bounds the score between 0 and 1. It does not account for document length: dividing by document length is flag 2 (flag 1 divides by 1 + log of the length). Flags combine with bitwise OR, so 2|32 applies both. Length normalization prevents long messages from dominating results purely by having more term occurrences, which matters for a dataset mixing short Slack messages with long email threads.
Field weights (A > B > C > D) are factored into the score automatically: a title match contributes more to the score than a body match.
Score normalization. Without flag 32, ts_rank_cd returns a non-negative float with no fixed upper bound. The PostgreSQL implementation normalizes scores to 0.0–1.0 before returning SearchHit results, satisfying the interface contract. The normalization approach (flag 32, dividing by the maximum score in the result set, or a sigmoid function) is an implementation detail; other engines (Elasticsearch, Meilisearch) produce their own score scales and normalize similarly.
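Two of the candidate normalizations can be sketched in Go for comparison. Neither is stated to be the chosen approach; the function names are hypothetical. The saturating form s / (s + 1) is the same transform PostgreSQL applies with normalization flag 32.

```go
package main

// normalizeByMax scales raw scores into 0.0-1.0 by dividing by the largest
// score in the slice. The top hit always gets 1.0, so values are relative
// to the current result set, not comparable across queries or pages.
func normalizeByMax(raw []float64) []float64 {
	out := make([]float64, len(raw))
	top := 0.0
	for _, s := range raw {
		if s > top {
			top = s
		}
	}
	if top == 0 {
		return out // all zeros: nothing to scale
	}
	for i, s := range raw {
		out[i] = s / top
	}
	return out
}

// normalizeSaturating maps one non-negative score into [0, 1) using
// s / (s + 1). It depends only on the score itself, so it is stable
// across pages of the same query.
func normalizeSaturating(s float64) float64 {
	if s < 0 {
		return 0
	}
	return s / (s + 1)
}
```

The per-row stability of the saturating form matters for cursor pagination: a score embedded in a cursor keeps its meaning on the next page, which max-based scaling does not guarantee.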
Ordering Modes
Three ordering modes, matching the API product spec:
Temporal (default when no search terms):

    ORDER BY sent_at DESC

Pure chronological. No relevance scoring: ts_rank_cd is not computed.
Relevance (default when search terms present):

    ORDER BY ts_rank_cd(search_vector, query, 32) DESC, sent_at DESC

Ranked by match quality. Timestamp is the tiebreaker for equal scores.
Hybrid (relevance weighted by recency):

    ORDER BY (
        ts_rank_cd(search_vector, query, 32)
        * (1.0 / (1.0 + EXTRACT(EPOCH FROM (now() - sent_at)) / 86400.0 * :decay_rate))
    ) DESC

The recency factor is a decay function: 1 / (1 + age_in_days * decay_rate). A higher decay rate increases the recency bias. The decay rate is configurable per deployment.
| Decay Rate | Effect |
|---|---|
| 0.0 | No recency bias (equivalent to pure relevance) |
| 0.01 | Gentle decay — content from months ago still ranks well |
| 0.1 | Moderate decay — recent content noticeably preferred |
| 1.0 | Steep decay — strongly favors last few days |
Default decay rate: 0.01 (gentle). This preserves relevance quality while giving a mild recency nudge, suitable for general-purpose personal data search.
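To get a feel for the decay rates in the table, the hybrid formula can be evaluated directly. The sketch below mirrors 1 / (1 + age_in_days * decay_rate); the function names are hypothetical and the clamp for future timestamps is an added assumption.

```go
package main

// recencyFactor implements 1 / (1 + ageDays * decayRate).
// Assumption: negative ages (timestamps in the future) are clamped to zero.
func recencyFactor(ageDays, decayRate float64) float64 {
	if ageDays < 0 {
		ageDays = 0
	}
	return 1 / (1 + ageDays*decayRate)
}

// hybridScore weights a relevance score by the recency factor.
func hybridScore(rank, ageDays, decayRate float64) float64 {
	return rank * recencyFactor(ageDays, decayRate)
}
```

At the default rate of 0.01, a hit from 100 days ago keeps half of its relevance score (1 / (1 + 1)), while at a rate of 1.0 the same halving happens after a single day.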
Application-Layer Re-ranking
The in-query scoring handles the common path. For advanced cases — BM25-style scoring, ML re-ranking, or intent-specific reordering — the application layer can:
- Fetch top-N results by ts_rank_cd score (over-fetch by 2-3x)
- Apply re-ranking logic in the Domain layer
- Return the re-ranked, trimmed result set
This path is not implemented initially but the architecture supports it without changes to the query layer or search interface.
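The over-fetch, re-rank, and trim steps above could be sketched as follows. This is an illustration only, since the path is not implemented; Hit, rerank, and overFetchLimit are hypothetical names, and the adjust callback stands in for whatever BM25-style or ML scoring would be plugged in.

```go
package main

import "sort"

// Hit is a minimal stand-in for SearchHit.
type Hit struct {
	ID    string
	Score float64
}

// overFetchLimit returns how many rows to request from the query layer,
// for example 3x the page size.
func overFetchLimit(limit, factor int) int {
	return limit * factor
}

// rerank rescoring each over-fetched hit with adjust, sorts by the new
// score (highest first, stable for ties), and trims to limit.
func rerank(hits []Hit, limit int, adjust func(Hit) float64) []Hit {
	out := make([]Hit, len(hits))
	for i, h := range hits {
		out[i] = Hit{ID: h.ID, Score: adjust(h)}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Score > out[j].Score
	})
	if limit >= 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

The input slice is copied before sorting, so the original database ordering remains available to the caller for logging or comparison.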
Intent Hints
The GraphQL schema accepts an optional intent enum on search-capable filter inputs:

    enum SearchIntent {
      CATCH_UP            # recent activity, strong recency bias
      RESEARCH_TOPIC      # thorough search, pure relevance
      FIND_ACTION_ITEMS   # action-oriented content
    }

Initially ignored. The intent parameter is accepted but does not affect ranking. When implemented, intent maps to ranking adjustments:
- CATCH_UP → steep recency decay
- RESEARCH_TOPIC → pure relevance (no recency factor)
- FIND_ACTION_ITEMS → relevance-ranked with content-aware re-ranking (requires Intelligence module)
The parameter is in the schema from the start so consumers can begin passing intent without a schema change when the implementation lands.
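When the mapping is eventually implemented, it might reduce to a small lookup like the one below. The concrete values are assumptions for illustration (1.0 is taken from the "steep" row of the decay table), and RankingPlan and planFor are hypothetical names.

```go
package main

// SearchIntent mirrors the GraphQL enum values.
type SearchIntent string

const (
	IntentCatchUp         SearchIntent = "CATCH_UP"
	IntentResearchTopic   SearchIntent = "RESEARCH_TOPIC"
	IntentFindActionItems SearchIntent = "FIND_ACTION_ITEMS"
)

// RankingPlan describes how a query should be ranked (hypothetical shape).
type RankingPlan struct {
	DecayRate float64 // 0 means pure relevance
	Rerank    bool    // true when application-layer re-ranking is needed
}

// planFor maps an intent to ranking adjustments. Unknown or empty intents
// fall back to the deployment's default decay rate.
func planFor(intent SearchIntent, defaultDecay float64) RankingPlan {
	switch intent {
	case IntentCatchUp:
		return RankingPlan{DecayRate: 1.0} // assumed "steep" value
	case IntentResearchTopic:
		return RankingPlan{DecayRate: 0}
	case IntentFindActionItems:
		return RankingPlan{DecayRate: 0, Rerank: true}
	default:
		return RankingPlan{DecayRate: defaultDecay}
	}
}
```

Keeping the fallback explicit means consumers that pass no intent see exactly the behavior they get today.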
Cross-Entity Search
Timeline Search
The timeline query searches across Messages, Events, and Documents. This is implemented as a UNION query across ReadStore tables:

    WITH q AS (
        SELECT websearch_to_tsquery('simple', $4) AS query  -- $4: the user's search string
    )
    (
        SELECT 'message' AS entity_type, message_id AS entity_id, sent_at AS timestamp,
               ts_rank_cd(search_vector, query, 32) AS score, body AS excerpt
        FROM read_messages, q
        WHERE tenant_id = $1 AND search_vector @@ query AND sent_at BETWEEN $2 AND $3
    )
    UNION ALL
    (
        SELECT 'event', event_id, start_at, ts_rank_cd(search_vector, query, 32), title
        FROM read_events, q
        WHERE tenant_id = $1 AND search_vector @@ query AND start_at BETWEEN $2 AND $3
    )
    UNION ALL
    (
        SELECT 'document', document_id, source_created_at, ts_rank_cd(search_vector, query, 32), title
        FROM read_documents, q
        WHERE tenant_id = $1 AND search_vector @@ query AND source_created_at BETWEEN $2 AND $3
    )
    ORDER BY score DESC, timestamp DESC
    LIMIT 25

Each branch uses its own GIN index. The entityTypes filter on TimelineFilter allows consumers to restrict to specific types, dropping branches from the UNION.
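Dropping branches for the entityTypes filter might be implemented by generating the UNION from a fixed branch list, as in this sketch. The branch struct, the function names, and the $4 placeholder for the search string are assumptions; table and column names come from the query above.

```go
package main

import "strings"

// branch describes one per-entity SELECT of the timeline UNION.
type branch struct {
	entityType, table, idCol, tsCol, excerptCol string
}

// timelineBranches is the fixed, ordered set of searchable timeline entities.
var timelineBranches = []branch{
	{"message", "read_messages", "message_id", "sent_at", "body"},
	{"event", "read_events", "event_id", "start_at", "title"},
	{"document", "read_documents", "document_id", "source_created_at", "title"},
}

// selectBranches returns the branches matching entityTypes. An empty filter
// selects all branches; unknown type names are ignored.
func selectBranches(entityTypes []string) []branch {
	if len(entityTypes) == 0 {
		return timelineBranches
	}
	want := make(map[string]bool, len(entityTypes))
	for _, t := range entityTypes {
		want[t] = true
	}
	var out []branch
	for _, b := range timelineBranches {
		if want[b.entityType] {
			out = append(out, b)
		}
	}
	return out
}

// buildTimelineSearch renders the UNION ALL query for the selected branches.
// It returns "" when no known entity type was requested.
func buildTimelineSearch(entityTypes []string) string {
	branches := selectBranches(entityTypes)
	if len(branches) == 0 {
		return ""
	}
	parts := make([]string, 0, len(branches))
	for _, b := range branches {
		parts = append(parts, "(SELECT '"+b.entityType+"' AS entity_type, "+
			b.idCol+" AS entity_id, "+b.tsCol+" AS timestamp, "+
			"ts_rank_cd(search_vector, query, 32) AS score, "+b.excerptCol+" AS excerpt "+
			"FROM "+b.table+", q "+
			"WHERE tenant_id = $1 AND search_vector @@ query AND "+b.tsCol+" BETWEEN $2 AND $3)")
	}
	return "WITH q AS (SELECT websearch_to_tsquery('simple', $4) AS query)\n" +
		strings.Join(parts, "\nUNION ALL\n") +
		"\nORDER BY score DESC, timestamp DESC\nLIMIT 25"
}
```

Only identifiers from the fixed branch list are interpolated into the SQL text; caller-supplied type names act purely as lookup keys.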
Pagination Across Entities
Cross-entity pagination uses a composite cursor encoding (score, timestamp, entity_type, entity_id). The cursor is opaque to clients but allows stable resumption across the mixed result set.
For temporal ordering (no search), the cursor simplifies to (timestamp, entity_type, entity_id).
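An opaque composite cursor could be encoded as base64-wrapped JSON, as sketched here. The wire format, the short JSON keys, and the names Cursor, EncodeCursor, and DecodeCursor are assumptions; the document only specifies which fields the cursor carries.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"time"
)

// Cursor carries the sort position of the last item on a page.
// Score is nil for temporal ordering, where the cursor simplifies to
// (timestamp, entity_type, entity_id).
type Cursor struct {
	Score      *float64  `json:"s,omitempty"`
	Timestamp  time.Time `json:"t"`
	EntityType string    `json:"et"`
	EntityID   string    `json:"id"`
}

// EncodeCursor serializes the cursor into a URL-safe opaque string.
func EncodeCursor(c Cursor) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// DecodeCursor parses a string produced by EncodeCursor.
func DecodeCursor(s string) (Cursor, error) {
	var c Cursor
	raw, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return c, err
	}
	err = json.Unmarshal(raw, &c)
	return c, err
}
```

For resumption to be stable, the query's ORDER BY would need to end with entity_type and entity_id as tiebreakers so that the cursor tuple identifies a unique position.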
Optimization Path: Unified Search Table
If cross-entity search performance becomes a bottleneck, a read_search table can consolidate all searchable entities into a single table:

    CREATE TABLE read_search (
        entity_type   TEXT NOT NULL,
        entity_id     UUID NOT NULL,
        tenant_id     UUID NOT NULL,
        timestamp     TIMESTAMPTZ NOT NULL,
        search_vector TSVECTOR,
        excerpt       TEXT,
        metadata      JSONB  -- type-specific summary fields
    );

This replaces the UNION with a single index scan. The Projector maintains it alongside the per-entity ReadStore tables. This is additive: per-entity search (searching only messages, only documents) continues to use the per-entity tables and indexes.
GIN Index Maintenance
Write Amplification
GIN indexes are updated when search vectors change. Since the Projector rebuilds search vectors on every entity projection, GIN updates happen on every write. PostgreSQL handles this efficiently via pending lists — GIN updates are batched and merged periodically, not applied synchronously on every insert.
Configuration:
- gin_pending_list_limit: controls pending list size before forced merge. Default (4MB) is fine for personal data write rates.
- VACUUM: GIN indexes need regular vacuuming to reclaim dead entries. Standard autovacuum settings are adequate.
Index Size Monitoring
GIN indexes grow with vocabulary size. The Projector’s search vector construction determines what goes into the index. Monitoring index size relative to table size provides an early signal if the index is growing unexpectedly (e.g., due to noisy platform metadata being indexed accidentally).
Performance Characteristics
- GIN index scans: the primary cost of a search query. At personal data scale (millions of entities), GIN indexes fit in memory and scans complete in low milliseconds.
- UNION overhead: cross-entity search executes three separate GIN scans and merges results. At typical result set sizes (25-100 items per branch), the merge is negligible.
- Ranking computation: ts_rank_cd is computed per matching row. For queries with thousands of matches, this adds CPU cost proportional to match count. Because results are ordered by score, PostgreSQL must rank every matching row before it can apply LIMIT; LIMIT bounds the sort and the rows returned, not the ranking work.
- Concurrent search and write: GIN pending lists let writes proceed without blocking reads. Search queries scan the pending list in addition to the main index, so a very large pending list slows searches. Pending entries are merged into the main index in bulk, by vacuum or once the list exceeds gin_pending_list_limit.
- Scalability ceiling: PostgreSQL full-text search performs well up to low millions of documents per table. Beyond that, query latency may degrade as GIN indexes exceed available memory. The migration path to a dedicated search engine addresses this.
Migration Paths
The Search sub-interface is defined as Go interfaces (SearchReader, SearchIndexer) in the Domain layer. The PostgreSQL implementation is one implementation. Migrations swap the implementation; the interface and all consumers remain unchanged.
To Dedicated Search Engine (Elasticsearch, Meilisearch)
- Implement the Search interface against the new engine
- Projector gains a second projection target — it writes to both PostgreSQL ReadStore (relational + graph) and the search engine
- Search queries route to the new engine; relational and graph queries stay on PostgreSQL
- Remove search_vector columns and GIN indexes from PostgreSQL ReadStore tables (optional; they can be kept as a fallback)
To Language-Aware Search
- Add language detection to the Projector (via the Intelligence module or a lightweight library)
- Store detected language on ReadStore rows
- Switch to_tsvector('simple', ...) to to_tsvector(detected_language, ...) in the Projector
- Bulk recompute search vectors for existing data
- GIN indexes remain unchanged — they index tsvector values regardless of the configuration that produced them
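The step from a detected language to a text search configuration needs a fallback, because PostgreSQL ships configurations for only some languages. A minimal sketch, assuming the detector returns ISO 639-1 codes (optionally with a region suffix) and using a hypothetical function name and a partial list of built-in configurations:

```go
package main

import "strings"

// pgConfigs maps language codes to built-in PostgreSQL text search
// configurations. Assumption: this is a partial list chosen for illustration.
var pgConfigs = map[string]string{
	"en": "english",
	"es": "spanish",
	"fr": "french",
	"de": "german",
	"it": "italian",
	"pt": "portuguese",
	"nl": "dutch",
	"ru": "russian",
}

// tsConfigFor returns the configuration name to pass to to_tsvector.
// Languages without a known configuration fall back to "simple".
func tsConfigFor(lang string) string {
	code := strings.ToLower(strings.TrimSpace(lang))
	if i := strings.IndexAny(code, "-_"); i >= 0 {
		code = code[:i] // "en-US" becomes "en"
	}
	if cfg, ok := pgConfigs[code]; ok {
		return cfg
	}
	return "simple"
}
```

Falling back to simple keeps the current behavior for languages such as Japanese, so the migration can only improve recall for supported languages and never breaks the rest. Queries would also need to be parsed with the same configuration as the rows they target.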
To Semantic Search
- Add pgvector extension to PostgreSQL (or use the dedicated search engine)
- Projector computes embeddings via the Intelligence module and stores them alongside tsvector search vectors
- Search interface supports both keyword (tsvector) and semantic (vector similarity) queries
- Hybrid retrieval: keyword match for precision, vector similarity for recall, combined scoring
Each migration is an implementation swap behind the Search interface. No consumer-facing API changes required.
Privacy Considerations
- Tenant-scoped search. Every search query includes tenant_id in the WHERE clause. GIN index scans are filtered by tenant; there is no code path that searches across tenants.
- Scope-filtered results. AuthContext DataScope filters (platform, person, conversation, time range) are applied as additional WHERE clauses alongside the search condition. Scoped consumers cannot discover entities outside their scope via search.
- Search vectors contain entity content. The tsvector columns are derived from entity text (message bodies, titles, descriptions). They are subject to the same access controls and deletion requirements as the source entities. When an entity is soft-deleted, it is excluded from search results. Hard deletion (GDPR) must also remove the search vector.
- No query logging in search vectors. Search queries (what consumers search for) are logged in the access log, not in the search vectors. The access log has its own retention policy.
Related Documents
Product Specifications
- API — Search as a composable filter, ordering modes, intent hints
- Data Model — Entities and fields that are searchable
Technical Specifications
- Architecture — ReadStore Search sub-interface, interface boundaries, migration path
- Data Schema — search_vector columns, GIN indexes, ReadStore table definitions
- API Implementation — Search parameter on filter inputs, GraphQL schema
- Module Interfaces — SearchReader/SearchIndexer cataloged in cross-module interface map
- Security & Privacy — Security architecture overview and guarantees
- Security & Privacy (Internal) — QueryScope enforcement on search queries, deletion from indexes
Decisions
- ADR-002: PostgreSQL as Initial Storage — Why PostgreSQL for search initially, with migration path to dedicated engines
Vision
- Business Model Assumptions — Search threshold (200 customers) referenced in cost modeling