Search & Indexing

Status: Published
Last Updated: 2026-03-09

Overview

This document defines how full-text search is implemented within the ReadStore’s Search sub-interface. It covers text processing, query execution, ranking, cross-entity search, and the migration path to a dedicated search engine, and it builds on the Architecture, Data Schema, and API Implementation specifications.

Guiding principles:

Composable, not separate. Search is a filter that composes with all other filters (person, time range, platform, conversation). Not a separate endpoint or query path.
Correct across languages first. Personal data spans platforms and languages. The initial implementation prioritizes universal correctness over language-specific sophistication.
Replaceable. The Search sub-interface is independently swappable. PostgreSQL full-text search is the starting implementation; the interface contract is stable regardless of engine.
Good enough now, upgradable later. PostgreSQL tsvector search is adequate for personal data scale. Language-aware processing, semantic search, and dedicated engines are clean upgrade paths, not redesigns.

Interface Contract

The Search sub-interface is one of three ReadStore sub-interfaces defined in the Architecture. This section defines the engine-agnostic contract — what any search implementation must support. The remainder of this document describes the PostgreSQL implementation.

Operations

Index — accept an entity with typed, weighted fields and make it searchable.

type SearchIndexer interface {
    Index(ctx context.Context, doc SearchDocument) error
    IndexBatch(ctx context.Context, docs []SearchDocument) error
    Delete(ctx context.Context, entityType string, entityID uuid.UUID) error
}

type SearchDocument struct {
    TenantID   uuid.UUID
    EntityType string              // "message", "event", "document", "person", "conversation"
    EntityID   uuid.UUID
    Timestamp  time.Time           // entity's canonical timestamp (sent_at, start_at, etc.)
    Fields     []SearchField       // weighted text fields
}

type SearchField struct {
    Name    string                 // "title", "body", "display_name", etc.
    Content string
    Weight  FieldWeight            // High, Medium, Low, Default
}

Search — accept a query string, filters, and ordering mode. Return scored results.

type SearchReader interface {
    Search(ctx context.Context, params SearchParams) (*SearchResult, error)
    SearchMultiEntity(ctx context.Context, params CrossEntitySearchParams) (*SearchResult, error)
}

type SearchParams struct {
    TenantID    uuid.UUID
    Query       string
    EntityType  string             // single entity type
    Filters     SearchFilters      // composable with relational filters
    Order       OrderingMode       // Temporal, Relevance, Hybrid
    DecayRate   float64            // recency decay (Hybrid mode)
    Pagination  CursorPagination
    PrefixMatch bool               // treat last token as prefix
}

type SearchResult struct {
    Items      []SearchHit
    PageInfo   PageInfo
    TotalCount *int               // optional, computed only when requested
}

type SearchHit struct {
    EntityType string
    EntityID   uuid.UUID
    Score      float64            // normalized 0.0–1.0
    Timestamp  time.Time
    Excerpt    string             // snippet for display
}

Required Capabilities

Any implementation of the Search interface must support:

Phrase matching — quoted multi-word queries match as a phrase
Negation — exclude terms from results
Field weighting — matches in high-weight fields (titles) score higher than low-weight fields (body text)
Relevance scoring — return a normalized score (0.0–1.0) reflecting match quality, considering term frequency, proximity, and field weight
Recency blending — combine relevance score with a configurable time-decay function
Composable filtering — search composes with tenant, time range, platform, person, and conversation filters at the query level
Cross-entity search — search across multiple entity types in a single query, returning a unified scored result set
Prefix matching — optionally treat the last query token as a prefix

What the Interface Does NOT Specify

How text is tokenized, stemmed, or normalized (engine-specific)
How indexes are structured or maintained (engine-specific)
How scores are computed internally (engine-specific — only the normalized output matters)
Query parsing syntax beyond the required capabilities (engine may support more)

The Domain layer and API resolvers interact only with SearchReader and SearchIndexer. The Projector calls SearchIndexer when projecting entities. No module outside the Store layer references PostgreSQL-specific search constructs (tsvector, tsquery, GIN, ts_rank_cd).

PostgreSQL Implementation

Everything below describes the initial implementation using PostgreSQL full-text search. All PostgreSQL-specific constructs (tsvector, tsquery, GIN indexes, ts_rank_cd) are confined to the Store layer behind the interface defined above.

Text Processing

Search Configuration

The initial implementation uses PostgreSQL’s simple text search configuration for all search vectors.

simple tokenizes text with PostgreSQL’s default parser (splitting on whitespace and punctuation), lowercases all tokens, and performs no stemming or stopword removal. This provides:

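Universal language coverage: no language-specific processing means no language-specific breakage. A dataset with English Slack messages, Spanish WhatsApp messages, and Japanese LINE messages is searched uniformly. One caveat: the default parser generally does not segment scripts written without spaces between words (such as Japanese), so matching on that text is coarser than on space-delimited languages.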
Predictable matching — exact token matching (after lowercasing). “Running” matches “running” but not “run”.
No false conflation — aggressive stemming in the wrong language can conflate unrelated words. simple avoids this entirely.

Tradeoff accepted: Lower recall for English morphological variants. Searching “run” won’t find “running”. This is acceptable for the initial implementation — the upgrade path to language-aware search addresses it without schema changes.
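The matching behavior described above can be illustrated outside the database. The following is a minimal Go sketch that approximates the simple configuration (split into word tokens, lowercase, no stemming); simpleTokens and matches are hypothetical helpers for illustration, and PostgreSQL's real parser recognizes more token types (URLs, email addresses, numbers) than this approximation does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleTokens approximates the 'simple' configuration: split text into
// word tokens, lowercase them, and apply no stemming or stopword removal.
// This is an illustration only, not PostgreSQL's actual parser.
func simpleTokens(text string) []string {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for i, w := range words {
		words[i] = strings.ToLower(w)
	}
	return words
}

// matches reports whether every query token appears verbatim among the
// document's tokens (AND semantics, exact token match).
func matches(doc, query string) bool {
	have := make(map[string]bool)
	for _, t := range simpleTokens(doc) {
		have[t] = true
	}
	for _, t := range simpleTokens(query) {
		if !have[t] {
			return false
		}
	}
	return true
}

func main() {
	doc := "Running late for dinner"
	fmt.Println(matches(doc, "running")) // exact token after lowercasing
	fmt.Println(matches(doc, "run"))     // no stemming, so no match
}
```

The second call returns false because "run" and "running" are distinct tokens, which is the recall tradeoff accepted above.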

Field Weighting

PostgreSQL tsvector supports four weight classes (A, B, C, D) that influence ranking scores. Fields are weighted by semantic importance:

Entity	Weight A	Weight B	Weight C/D
Message	—	—	body
Conversation	title	—	—
Event	title	description	—
Document	title	body	—
Person	display_name	handles	—

Single-field entities (Messages) use the default weight (D); weighting is irrelevant when there’s only one field. Multi-field entities weight titles above body content so that a title match ranks higher than a body mention.

Search Vector Construction

The Projector builds search vectors when projecting WriteStore entities into ReadStore tables. Construction uses setweight() and to_tsvector():

-- Example: Document search vector
search_vector = setweight(to_tsvector('simple', coalesce(title, '')), 'A') ||
                setweight(to_tsvector('simple', coalesce(body, '')), 'B')

Search vectors are recomputed on every projection — when an entity is created or updated, the Projector rebuilds the full search vector. This is idempotent and ensures the search vector always reflects the current entity state.

Bulk recomputation. When the search configuration changes (e.g., migrating from simple to language-aware configs), the Projector can recompute all search vectors in a batch operation. No schema migration required — the tsvector column type and GIN indexes remain the same.
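As a sketch of how the Projector might assemble the setweight() expression from a SearchDocument's fields, the Go helper below builds the SQL fragment and its arguments. The mapping High=A, Medium=B, Low=C, Default=D and the helper names (weightClass, vectorExpr) are assumptions for illustration; the types are simplified stand-ins for those in the interface contract.

```go
package main

import (
	"fmt"
	"strings"
)

// FieldWeight and SearchField are simplified stand-ins for the contract types.
type FieldWeight int

const (
	Default FieldWeight = iota
	Low
	Medium
	High
)

type SearchField struct {
	Name    string
	Content string
	Weight  FieldWeight
}

// weightClass maps a FieldWeight to a tsvector weight label.
// Assumed mapping: High=A, Medium=B, Low=C, Default=D.
func weightClass(w FieldWeight) string {
	switch w {
	case High:
		return "A"
	case Medium:
		return "B"
	case Low:
		return "C"
	default:
		return "D"
	}
}

// vectorExpr returns a SQL expression with one placeholder per field,
// numbered from $start, plus the field contents in placeholder order.
func vectorExpr(fields []SearchField, start int) (string, []any) {
	parts := make([]string, 0, len(fields))
	args := make([]any, 0, len(fields))
	for i, f := range fields {
		parts = append(parts, fmt.Sprintf(
			"setweight(to_tsvector('simple', coalesce($%d, '')), '%s')",
			start+i, weightClass(f.Weight)))
		args = append(args, f.Content)
	}
	return strings.Join(parts, " || "), args
}

func main() {
	expr, args := vectorExpr([]SearchField{
		{Name: "title", Content: "Q3 plan", Weight: High},
		{Name: "body", Content: "Draft agenda", Weight: Medium},
	}, 1)
	fmt.Println(expr)
	fmt.Println(args...)
}
```

Field contents travel as bound parameters rather than being interpolated into the SQL string, so only the fixed weight labels and the configuration name appear in the generated text.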

Query Execution

Query Parsing

User search strings are parsed with PostgreSQL’s websearch_to_tsquery function, which provides familiar web-search syntax:

Input	Interpretation	tsquery
`dinner plans`	AND all terms	`'dinner' & 'plans'`
`"dinner plans"`	Phrase match	`'dinner' <-> 'plans'`
`dinner -lunch`	Negation	`'dinner' & !'lunch'`
`dinner OR lunch`	OR	`'dinner' | 'lunch'`

websearch_to_tsquery handles malformed input gracefully — unbalanced quotes or stray operators fall back to AND behavior rather than producing errors.

Prefix matching. Optionally, the last token in a query can be treated as a prefix (term:*) to support search-as-you-type patterns. This is applied at the application layer before passing to PostgreSQL:

-- "dinner pl" becomes:
websearch_to_tsquery('simple', 'dinner') && to_tsquery('simple', 'pl:*')

Prefix matching is available but not the default — consumers opt in via a query parameter.
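The application-layer split for prefix matching might look like the following Go sketch. prefixQuery is a hypothetical helper; the sanitization rule (keep only letters and digits in the final token) is an assumption, chosen because to_tsquery, unlike websearch_to_tsquery, raises syntax errors on stray operators.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// prefixQuery splits the input into a head (parsed by websearch_to_tsquery)
// and a final token matched as a prefix. It returns a SQL tsquery expression
// using placeholders $start and $start+1, plus the matching arguments.
func prefixQuery(input string, start int) (string, []any) {
	words := strings.Fields(input)
	if len(words) == 0 {
		return "", nil
	}
	// Keep only letters and digits in the prefix token so that to_tsquery
	// never sees operator characters such as '&', '|', '!' or ':'.
	last := strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return unicode.ToLower(r)
		}
		return -1
	}, words[len(words)-1])
	head := strings.Join(words[:len(words)-1], " ")

	switch {
	case last == "":
		// Nothing usable as a prefix: fall back to plain web-search parsing.
		return fmt.Sprintf("websearch_to_tsquery('simple', $%d)", start), []any{input}
	case head == "":
		return fmt.Sprintf("to_tsquery('simple', $%d)", start), []any{last + ":*"}
	default:
		return fmt.Sprintf(
			"websearch_to_tsquery('simple', $%d) && to_tsquery('simple', $%d)",
			start, start+1), []any{head, last + ":*"}
	}
}

func main() {
	expr, args := prefixQuery("dinner pl", 1)
	fmt.Println(expr)
	fmt.Println(args...)
}
```

A real implementation would also need to decide how to treat a final token that sits inside an open quoted phrase; this sketch leaves that case to websearch_to_tsquery's lenient parsing of the head.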

Query Structure

Search queries compose with relational filters in a single SQL query. The search condition is an additional WHERE clause:

SELECT message_id, sent_at, ts_rank_cd(search_vector, query) AS score
FROM read_messages,
     websearch_to_tsquery('simple', $2) AS query  -- parsed once, used for match and rank
WHERE tenant_id = $1
  AND search_vector @@ query
  AND conversation_id = $3          -- optional: conversation filter
  AND sent_at BETWEEN $4 AND $5     -- optional: time range filter
  AND sender_person_id = $6         -- optional: person filter
ORDER BY score DESC, sent_at DESC
LIMIT 25

The GIN index on search_vector handles the text matching. Relational filters use their own indexes. PostgreSQL’s query planner combines these — typically a bitmap AND of the GIN scan and the B-tree scan on the relational filter.
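Because the relational filters are optional, the Store layer has to assemble the WHERE clause and number the placeholders dynamically. The following Go sketch shows one way to do that; MessageFilters and buildMessageSearch are hypothetical names, string IDs stand in for uuid.UUID, and the column names follow the query above.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// MessageFilters is a hypothetical, simplified stand-in for SearchFilters.
// Empty strings and nil pointers mean "filter not set".
type MessageFilters struct {
	ConversationID string
	SenderPersonID string
	From, To       *time.Time
}

// buildMessageSearch composes the search condition with whichever relational
// filters are set, numbering placeholders in the order arguments are added.
func buildMessageSearch(tenantID, query string, f MessageFilters, limit int) (string, []any) {
	args := []any{tenantID, query}
	where := []string{"tenant_id = $1", "search_vector @@ query"}
	add := func(cond string, v any) {
		args = append(args, v)
		where = append(where, fmt.Sprintf(cond, len(args)))
	}
	if f.ConversationID != "" {
		add("conversation_id = $%d", f.ConversationID)
	}
	if f.From != nil {
		add("sent_at >= $%d", *f.From)
	}
	if f.To != nil {
		add("sent_at <= $%d", *f.To)
	}
	if f.SenderPersonID != "" {
		add("sender_person_id = $%d", f.SenderPersonID)
	}
	args = append(args, limit)
	sql := "SELECT message_id, sent_at, ts_rank_cd(search_vector, query) AS score\n" +
		"FROM read_messages, websearch_to_tsquery('simple', $2) AS query\n" +
		"WHERE " + strings.Join(where, "\n  AND ") + "\n" +
		"ORDER BY score DESC, sent_at DESC\n" +
		fmt.Sprintf("LIMIT $%d", len(args))
	return sql, args
}

func main() {
	sql, args := buildMessageSearch("tenant-1", "dinner plans",
		MessageFilters{ConversationID: "conv-9"}, 25)
	fmt.Println(sql)
	fmt.Println(len(args), "args")
}
```

All user-supplied values are passed as bound parameters; only fixed column names and placeholder numbers are interpolated into the SQL text.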

Ranking

Scoring Function

Results are scored using ts_rank_cd (cover density ranking), which considers both term frequency and term proximity:

ts_rank_cd(search_vector, query, 32)  -- flag 32: rank / (rank + 1), scales the score into 0–1

The normalization argument is a bit mask. Flag 32 maps the raw rank to rank / (rank + 1), which bounds the score between 0 and 1. It does not adjust for document length: that is flag 2 (divide by document length) or flag 1 (divide by 1 + the logarithm of document length). Preventing long messages from dominating results purely by having more term occurrences matters for a dataset mixing short Slack messages with long email threads. The queries in this document pass 32 alone, so length normalization would require combining flags, e.g. 2|32 (34).

Field weights (A > B > C > D) are factored into the score automatically: a title match contributes more to the score than a body match.

Score normalization. Without a normalization flag, ts_rank_cd returns an unbounded positive float; with flag 32 the value is already bounded to 0–1. The PostgreSQL implementation returns scores in the 0.0–1.0 range on SearchHit results, satisfying the interface contract. Any further normalization (e.g., dividing by the maximum score in the result set, or using a sigmoid function) is an implementation detail. Other engines (Elasticsearch, Meilisearch) produce their own score scales and normalize similarly.
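Two of the normalization approaches mentioned can be sketched in Go as follows. Both functions are illustrative, not the implementation's chosen method: saturate is a simple saturating map (standing in for the sigmoid-style option), and normalizeByMax divides by the best score in the result set.

```go
package main

import "fmt"

// saturate maps an unbounded non-negative rank into [0, 1) as rank/(rank+1).
// The mapping is independent of the other results, so a given raw rank always
// produces the same normalized score.
func saturate(rank float64) float64 {
	if rank <= 0 {
		return 0
	}
	return rank / (rank + 1)
}

// normalizeByMax rescales ranks so the best hit in the set scores 1.0.
// Scores produced this way are only comparable within one result set.
func normalizeByMax(ranks []float64) []float64 {
	out := make([]float64, len(ranks))
	top := 0.0
	for _, r := range ranks {
		if r > top {
			top = r
		}
	}
	if top == 0 {
		return out // all zeros: nothing to scale
	}
	for i, r := range ranks {
		out[i] = r / top
	}
	return out
}

func main() {
	fmt.Println(saturate(1), saturate(3))
	fmt.Println(normalizeByMax([]float64{1, 2, 4}))
}
```

The design difference matters for pagination: a per-result-set maximum changes from page to page, so the same hit could receive different scores on different pages, whereas a per-row mapping such as saturate stays stable and can safely appear in a cursor.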

Ordering Modes

Three ordering modes, matching the API product spec:

Temporal (default when no search terms):

ORDER BY sent_at DESC

Pure chronological. No relevance scoring — ts_rank_cd is not computed.

Relevance (default when search terms present):

ORDER BY ts_rank_cd(search_vector, query, 32) DESC, sent_at DESC

Ranked by match quality. Timestamp is the tiebreaker for equal scores.

Hybrid (relevance weighted by recency):

ORDER BY (
  ts_rank_cd(search_vector, query, 32) *
  (1.0 / (1.0 + EXTRACT(EPOCH FROM (now() - sent_at)) / 86400.0 * :decay_rate))
) DESC

The recency factor is a decay function: 1 / (1 + age_in_days * decay_rate). A higher decay rate increases the recency bias. The decay rate is configurable per deployment.

Decay Rate	Effect
0.0	No recency bias (equivalent to pure relevance)
0.01	Gentle decay — content from months ago still ranks well
0.1	Moderate decay — recent content noticeably preferred
1.0	Steep decay — strongly favors last few days

Default decay rate: 0.01 (gentle). This preserves relevance quality while giving a mild recency nudge, suitable for general-purpose personal data search.
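To make the decay table concrete, the Go sketch below evaluates the recency factor 1 / (1 + age_in_days * decay_rate) outside SQL. The function names are hypothetical; the formula is the one used in the Hybrid ORDER BY above.

```go
package main

import (
	"fmt"
	"time"
)

// recencyFactor computes 1 / (1 + ageDays * decayRate).
// Negative ages (timestamps in the future) are clamped to zero.
func recencyFactor(ageDays, decayRate float64) float64 {
	if ageDays < 0 {
		ageDays = 0
	}
	return 1 / (1 + ageDays*decayRate)
}

// hybridScore multiplies a relevance score by the recency factor,
// mirroring the Hybrid ordering expression.
func hybridScore(relevance float64, sentAt, now time.Time, decayRate float64) float64 {
	ageDays := now.Sub(sentAt).Hours() / 24
	return relevance * recencyFactor(ageDays, decayRate)
}

func main() {
	now := time.Date(2026, 3, 9, 0, 0, 0, 0, time.UTC)
	old := now.AddDate(0, 0, -100) // 100 days earlier

	for _, rate := range []float64{0.0, 0.01, 0.1, 1.0} {
		fmt.Printf("decay=%.2f  factor at 100 days=%.4f  hybrid(0.8)=%.4f\n",
			rate, recencyFactor(100, rate), hybridScore(0.8, old, now, rate))
	}
}
```

With the default rate of 0.01, a 100-day-old item keeps half of its relevance score (1 / (1 + 100 × 0.01) = 0.5); at a rate of 1.0 the same halving happens after a single day.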

Application-Layer Re-ranking

The in-query scoring handles the common path. For advanced cases — BM25-style scoring, ML re-ranking, or intent-specific reordering — the application layer can:

Fetch top-N results by ts_rank_cd score (over-fetch by 2-3x)
Apply re-ranking logic in the Domain layer
Return the re-ranked, trimmed result set

This path is not implemented initially but the architecture supports it without changes to the query layer or search interface.
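The three steps above could be sketched as follows. This is a hypothetical outline only: Hit is a reduced stand-in for SearchHit, and the adjust callback represents whatever re-ranking logic (BM25-style, ML, or intent-specific) the Domain layer would eventually supply.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a reduced stand-in for SearchHit.
type Hit struct {
	ID    string
	Score float64
}

// rerank rescoring each over-fetched hit with adjust, sorts by the new
// score (descending, stable for ties), and trims to limit. The input slice
// is left unmodified.
func rerank(hits []Hit, limit int, adjust func(Hit) float64) []Hit {
	out := make([]Hit, len(hits))
	for i, h := range hits {
		out[i] = Hit{ID: h.ID, Score: adjust(h)}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if limit >= 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	// Pretend these came from an over-fetched query (e.g. 3x the page size).
	fetched := []Hit{{"a", 0.9}, {"b", 0.8}, {"c", 0.1}}

	// Hypothetical adjustment: boost one hit the re-ranker considers important.
	boosted := rerank(fetched, 2, func(h Hit) float64 {
		if h.ID == "b" {
			return h.Score + 1
		}
		return h.Score
	})
	fmt.Println(boosted)
}
```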

Intent Hints

The GraphQL schema accepts an optional intent enum on search-capable filter inputs:

enum SearchIntent {
  CATCH_UP         # recent activity, strong recency bias
  RESEARCH_TOPIC   # thorough search, pure relevance
  FIND_ACTION_ITEMS # action-oriented content
}

Initially ignored. The intent parameter is accepted but does not affect ranking. When implemented, intent maps to ranking adjustments:

CATCH_UP → steep recency decay
RESEARCH_TOPIC → pure relevance (no recency factor)
FIND_ACTION_ITEMS → relevance-ranked with content-aware re-ranking (requires Intelligence module)

The parameter is in the schema from the start so consumers can begin passing intent without a schema change when the implementation lands.
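One way the intent-to-ranking mapping could be expressed once it is implemented is sketched below in Go. RankingPlan, planFor, and the implemented switch are hypothetical; the steep decay value of 1.0 is taken from the decay rate table, and the Rerank flag merely marks where content-aware re-ranking would plug in.

```go
package main

import "fmt"

type OrderingMode int

const (
	Temporal OrderingMode = iota
	Relevance
	Hybrid
)

// RankingPlan is a hypothetical summary of how a query should be ranked.
type RankingPlan struct {
	Order     OrderingMode
	DecayRate float64
	Rerank    bool // content-aware re-ranking (would require the Intelligence module)
}

// planFor maps an intent hint to ranking adjustments. While implemented is
// false, the intent is accepted but ignored, matching the initial behavior.
func planFor(intent string, implemented bool) RankingPlan {
	def := RankingPlan{Order: Relevance}
	if !implemented {
		return def
	}
	switch intent {
	case "CATCH_UP":
		return RankingPlan{Order: Hybrid, DecayRate: 1.0} // steep recency decay
	case "RESEARCH_TOPIC":
		return RankingPlan{Order: Relevance} // no recency factor
	case "FIND_ACTION_ITEMS":
		return RankingPlan{Order: Relevance, Rerank: true}
	default:
		return def
	}
}

func main() {
	fmt.Printf("%+v\n", planFor("CATCH_UP", false))
	fmt.Printf("%+v\n", planFor("CATCH_UP", true))
}
```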

Cross-Entity Search

Timeline Search

The timeline query searches across Messages, Events, and Documents. This is implemented as a UNION query across ReadStore tables:

-- $4 is the user's search string (parameter numbering assumed)
(
  SELECT 'message' AS entity_type, message_id AS entity_id,
         sent_at AS timestamp, ts_rank_cd(search_vector, query, 32) AS score,
         body AS excerpt
  FROM read_messages, websearch_to_tsquery('simple', $4) AS query
  WHERE tenant_id = $1 AND search_vector @@ query
    AND sent_at BETWEEN $2 AND $3
)
UNION ALL
(
  SELECT 'event', event_id, start_at, ts_rank_cd(search_vector, query, 32),
         title
  FROM read_events, websearch_to_tsquery('simple', $4) AS query
  WHERE tenant_id = $1 AND search_vector @@ query
    AND start_at BETWEEN $2 AND $3
)
UNION ALL
(
  SELECT 'document', document_id, source_created_at, ts_rank_cd(search_vector, query, 32),
         title
  FROM read_documents, websearch_to_tsquery('simple', $4) AS query
  WHERE tenant_id = $1 AND search_vector @@ query
    AND source_created_at BETWEEN $2 AND $3
)
ORDER BY score DESC, timestamp DESC
LIMIT 25

Each branch uses its own GIN index. The entityTypes filter on TimelineFilter allows consumers to restrict to specific types, dropping branches from the UNION.
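Dropping branches for the entityTypes filter can be done when the SQL is assembled. The Go sketch below is one hypothetical way to do it: the branch table mirrors the tables and columns in the query above, and the use of $4 for the search string is an assumption about parameter numbering.

```go
package main

import (
	"fmt"
	"strings"
)

// branch describes one per-entity SELECT in the timeline UNION.
type branch struct {
	entity, table, idCol, tsCol, excerptCol string
}

var branches = []branch{
	{"message", "read_messages", "message_id", "sent_at", "body"},
	{"event", "read_events", "event_id", "start_at", "title"},
	{"document", "read_documents", "document_id", "source_created_at", "title"},
}

// $1 = tenant, $2/$3 = time range, $4 = search string (assumed numbering).
const branchTmpl = `(
  SELECT '%s' AS entity_type, %s AS entity_id, %s AS timestamp,
         ts_rank_cd(search_vector, query, 32) AS score, %s AS excerpt
  FROM %s, websearch_to_tsquery('simple', $4) AS query
  WHERE tenant_id = $1 AND search_vector @@ query
    AND %s BETWEEN $2 AND $3
)`

// selectBranches returns the branches to include. An empty filter means
// "all entity types"; unknown type names are ignored.
func selectBranches(entityTypes []string) []branch {
	if len(entityTypes) == 0 {
		return branches
	}
	want := make(map[string]bool, len(entityTypes))
	for _, t := range entityTypes {
		want[t] = true
	}
	var out []branch
	for _, b := range branches {
		if want[b.entity] {
			out = append(out, b)
		}
	}
	return out
}

// timelineSQL joins the selected branches with UNION ALL and applies the
// shared ordering and limit. It returns "" when no branch is selected.
func timelineSQL(entityTypes []string, limit int) string {
	selected := selectBranches(entityTypes)
	if len(selected) == 0 {
		return ""
	}
	parts := make([]string, 0, len(selected))
	for _, b := range selected {
		parts = append(parts, fmt.Sprintf(branchTmpl,
			b.entity, b.idCol, b.tsCol, b.excerptCol, b.table, b.tsCol))
	}
	return strings.Join(parts, "\nUNION ALL\n") +
		fmt.Sprintf("\nORDER BY score DESC, timestamp DESC\nLIMIT %d", limit)
}

func main() {
	fmt.Println(timelineSQL([]string{"event", "document"}, 25))
}
```

Only values from the fixed branch table are interpolated into the SQL text; tenant, time range, and search string remain bound parameters.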

Pagination Across Entities

Cross-entity pagination uses a composite cursor encoding (score, timestamp, entity_type, entity_id). The cursor is opaque to clients but allows stable resumption across the mixed result set.

For temporal ordering (no search), the cursor simplifies to (timestamp, entity_type, entity_id).
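An opaque composite cursor can be produced by serializing the tuple and base64-encoding it. The Go sketch below is one possible encoding, not a prescribed wire format: the Cursor struct, its JSON field names, and the string entity ID are assumptions for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// Cursor holds the sort key of the last row on a page:
// (score, timestamp, entity_type, entity_id).
type Cursor struct {
	Score      float64   `json:"s"`
	Timestamp  time.Time `json:"t"`
	EntityType string    `json:"e"`
	EntityID   string    `json:"id"`
}

// Encode serializes the cursor to an opaque, URL-safe token.
func (c Cursor) Encode() (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// DecodeCursor parses a token produced by Encode and rejects anything else.
func DecodeCursor(token string) (Cursor, error) {
	var c Cursor
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	return c, nil
}

func main() {
	c := Cursor{
		Score:      0.42,
		Timestamp:  time.Date(2026, 3, 1, 12, 0, 0, 0, time.UTC),
		EntityType: "message",
		EntityID:   "m-123",
	}
	token, err := c.Encode()
	if err != nil {
		panic(err)
	}
	back, err := DecodeCursor(token)
	if err != nil {
		panic(err)
	}
	fmt.Println(token)
	fmt.Println(back.EntityType, back.EntityID, back.Score, back.Timestamp.Equal(c.Timestamp))
}
```

For resumption to be stable, the query's ORDER BY needs to include every cursor component (entity_type and entity_id as final tiebreakers), so that two rows with identical score and timestamp still have a well-defined order. Base64 hides the structure from casual inspection but is not tamper-proof; a signed or encrypted token would be needed if clients must not be able to forge cursors.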

Optimization Path: Unified Search Table

If cross-entity search performance becomes a bottleneck, a read_search table can consolidate all searchable entities into a single table:

entity_type         TEXT NOT NULL,
entity_id           UUID NOT NULL,
tenant_id           UUID NOT NULL,
timestamp           TIMESTAMPTZ NOT NULL,
search_vector       TSVECTOR,
excerpt             TEXT,
metadata            JSONB           -- type-specific summary fields

This replaces the UNION with a single index scan. The Projector maintains it alongside the per-entity ReadStore tables. This is additive — per-entity search (searching only messages, only documents) continues to use the per-entity tables and indexes.

GIN Index Maintenance

Write Amplification

GIN indexes are updated when search vectors change. Since the Projector rebuilds search vectors on every entity projection, GIN updates happen on every write. PostgreSQL handles this efficiently via pending lists — GIN updates are batched and merged periodically, not applied synchronously on every insert.

Configuration:

gin_pending_list_limit — controls pending list size before forced merge. Default (4MB) is fine for personal data write rates.
VACUUM — GIN indexes need regular vacuuming to reclaim dead entries. Standard autovacuum settings are adequate.

Index Size Monitoring

GIN indexes grow with vocabulary size. The Projector’s search vector construction determines what goes into the index. Monitoring index size relative to table size provides an early signal if the index is growing unexpectedly (e.g., due to noisy platform metadata being indexed accidentally).

Performance Characteristics

GIN index scans — the primary cost of a search query. At personal data scale (millions of entities), GIN indexes fit in memory and scans complete in low milliseconds.
UNION overhead — cross-entity search executes three separate GIN scans and merges results. At typical result set sizes (25-100 items per branch), the merge is negligible.
Ranking computation: ts_rank_cd is computed per matching row, so queries with thousands of matches incur CPU cost proportional to match count. The LIMIT clause does not avoid this cost: the GIN index does not return rows in rank order, so PostgreSQL must rank every match before it can sort. LIMIT only reduces the sort to a top-N selection.
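Concurrent search and write: with fastupdate enabled (the default), GIN inserts go to the pending list, which keeps writes cheap. Searches scan the pending list in addition to the main index, so committed rows are searchable immediately. A long pending list slows searches until it is merged, which happens during VACUUM/autovacuum or when the list exceeds gin_pending_list_limit.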
Scalability ceiling — PostgreSQL full-text search performs well up to low millions of documents per table. Beyond that, query latency may degrade as GIN indexes exceed available memory. The migration path to a dedicated search engine addresses this.

Migration Paths

The Search sub-interface is defined as Go interfaces (SearchReader, SearchIndexer) in the Domain layer. The PostgreSQL implementation is one implementation. Migrations swap the implementation; the interface and all consumers remain unchanged.

To Dedicated Search Engine (Elasticsearch, Meilisearch)

Implement the Search interface against the new engine
Projector gains a second projection target — it writes to both PostgreSQL ReadStore (relational + graph) and the search engine
Search queries route to the new engine; relational and graph queries stay on PostgreSQL
Remove search_vector columns and GIN indexes from PostgreSQL ReadStore tables (optional — can keep as fallback)

To Language-Aware Search

Add language detection to the Projector (via the Intelligence module or a lightweight library)
Store detected language on ReadStore rows
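Switch to_tsvector('simple', ...) to to_tsvector(detected_language, ...) in the Projector, and parse queries with a matching configuration so that query terms are normalized the same way as indexed terms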
Bulk recompute search vectors for existing data
GIN indexes remain unchanged — they index tsvector values regardless of the configuration that produced them
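The switch from a fixed configuration to a detected one needs a mapping from language codes to PostgreSQL text search configurations, with simple as the fallback. The Go sketch below shows a minimal version; configFor and the selection of languages are assumptions, though the configuration names listed are ones that ship with PostgreSQL.

```go
package main

import (
	"fmt"
	"strings"
)

// regconfigs maps ISO 639-1 language codes to built-in PostgreSQL text
// search configurations. The list is deliberately partial.
var regconfigs = map[string]string{
	"de": "german",
	"en": "english",
	"es": "spanish",
	"fr": "french",
	"it": "italian",
	"pt": "portuguese",
}

// configFor returns the configuration for a detected language tag such as
// "en" or "en-US". Languages without a mapped configuration (for example
// "ja") fall back to 'simple', preserving today's behavior for them.
func configFor(lang string) string {
	code := strings.ToLower(strings.SplitN(lang, "-", 2)[0])
	if cfg, ok := regconfigs[code]; ok {
		return cfg
	}
	return "simple"
}

func main() {
	for _, lang := range []string{"en", "es-MX", "ja", ""} {
		fmt.Printf("%q -> %s\n", lang, configFor(lang))
	}
}
```

Returning a name from a fixed allowlist, rather than passing the detected language string through directly, keeps the configuration argument safe to use in generated SQL and avoids errors for languages PostgreSQL has no configuration for.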

To Semantic Search

Add pgvector extension to PostgreSQL (or use the dedicated search engine)
Projector computes embeddings via the Intelligence module and stores them alongside tsvector search vectors
Search interface supports both keyword (tsvector) and semantic (vector similarity) queries
Hybrid retrieval: keyword match for precision, vector similarity for recall, combined scoring

Each migration is an implementation swap behind the Search interface. No consumer-facing API changes required.

Privacy Considerations

Tenant-scoped search. Every search query includes tenant_id in the WHERE clause. GIN index scans are filtered by tenant — there is no code path that searches across tenants.
Scope-filtered results. AuthContext DataScope filters (platform, person, conversation, time range) are applied as additional WHERE clauses alongside the search condition. Scoped consumers cannot discover entities outside their scope via search.
Search vectors contain entity content. The tsvector columns are derived from entity text (message bodies, titles, descriptions). They are subject to the same access controls and deletion requirements as the source entities. When an entity is soft-deleted, it is excluded from search results. Hard deletion (GDPR) must also remove the search vector.
No query logging in search vectors. Search queries (what consumers search for) are logged in the access log, not in the search vectors. The access log has its own retention policy.

Product Specifications

API — Search as a composable filter, ordering modes, intent hints
Data Model — Entities and fields that are searchable

Technical Specifications

Architecture — ReadStore Search sub-interface, interface boundaries, migration path
Data Schema — search_vector columns, GIN indexes, ReadStore table definitions
API Implementation — Search parameter on filter inputs, GraphQL schema
Module Interfaces — SearchReader/SearchIndexer cataloged in cross-module interface map
Security & Privacy — Security architecture overview and guarantees
Security & Privacy (Internal) — QueryScope enforcement on search queries, deletion from indexes

Decisions

ADR-002: PostgreSQL as Initial Storage — Why PostgreSQL for search initially, with migration path to dedicated engines

Vision

Business Model Assumptions — Search threshold (200 customers) referenced in cost modeling