
Ingestion

Status: Published · Last Updated: 2026-03-09

Overview

Ingestion is how data gets into LifeDB. The system is only as valuable as the data it contains, making ingestion existential — not a nice-to-have. The bar is simple: it must be easy and it must work.

Ingestion uses the same write API that external consumers use (see API). Connectors are API consumers with elevated permissions and platform_reported provenance. This means there is one contract to maintain, the write API is battle-tested by ingestion itself, and third-party connectors are architecturally possible.
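To make the unified contract concrete, here is a minimal sketch of a write request as both a connector and a manual client might issue it. The field names (`entity_type`, `payload`, `provenance`, `platform_id`) are illustrative assumptions, not LifeDB's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WriteRequest:
    """Hypothetical write-API payload; the same shape serves all callers."""
    entity_type: str                 # e.g. "message", "event", "document"
    payload: dict                    # normalized entity fields
    provenance: str                  # how the data entered the system
    platform_id: Optional[str] = None  # platform-native id, used as a dedup key


def connector_write(payload: dict, platform_id: str) -> WriteRequest:
    """A connector is just an API consumer writing with platform_reported provenance."""
    return WriteRequest("message", payload, "platform_reported", platform_id)


def manual_write(payload: dict) -> WriteRequest:
    """A user or agent writes the same entity type with manual provenance."""
    return WriteRequest("message", payload, "manual")
```

The only differences between the two callers are permissions and the provenance value; the endpoint and payload shape are shared.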

Ingestion Modes

Live Sync

Ongoing connection to a platform. The user connects once and data flows in continuously. New messages, events, and documents appear in LifeDB as they occur.

  • Setup — ideally one-click OAuth-style authorization. The aspiration is cloud-first: no local software required. Where platform limitations force local access (e.g., iMessage database), the setup should be as guided and minimal as possible.
  • Freshness — varies by medium, matching the natural cadence of the communication type:
    • Instant messaging (iMessage, WhatsApp, Slack, Discord, Telegram, Signal) — seconds. These are real-time communication channels and data should arrive with minimal delay.
    • Email (Gmail, Outlook) — minutes. Email is inherently less urgent, and push notification APIs support near-real-time delivery.
    • Calendar — minutes to hours. Events change less frequently; periodic sync is acceptable.
    • Phone calls — minutes. Call log metadata is available shortly after a call ends.
    • Notes — minutes to hours. Document sync is less latency-sensitive.
  • Ongoing — once connected, stays current without user intervention. Token refresh, reconnection, and catch-up after downtime are automatic.
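The freshness targets above can be captured as a per-medium sync policy. The numeric bounds below are illustrative upper limits consistent with the cadences listed, not specified values:

```python
# Illustrative freshness targets per medium, in seconds.
FRESHNESS_TARGET_S = {
    "instant_messaging": 30,        # seconds
    "email": 5 * 60,                # minutes
    "calendar": 6 * 60 * 60,        # minutes to hours
    "phone_calls": 5 * 60,          # minutes
    "notes": 6 * 60 * 60,           # minutes to hours
}


def is_stale(medium: str, seconds_since_last_sync: float) -> bool:
    """A connector is considered stale when it exceeds its medium's target."""
    return seconds_since_last_sync > FRESHNESS_TARGET_S[medium]
```

A health monitor could use such a table to decide when a connector has fallen behind its medium's natural cadence.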

Bulk Import

One-time import of historical data. Large volume, processed asynchronously.

  • Sources — platform data exports (Slack export, Google Takeout, iMessage database), email archives (MBOX, PST), backup files, CSV/JSON dumps
  • Processing — asynchronous. The user initiates the import and processing happens in the background. Data becomes available incrementally as it's processed, not all at once when the import completes.
  • Idempotent — re-importing the same data does not create duplicates. Upsert semantics based on platform-specific identifiers ensure safe re-runs.
  • Scale — must handle years of data. A user importing a decade of email or five years of Slack history should not encounter limits or failures due to volume.
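The idempotency guarantee can be sketched as upsert-by-dedup-key. This in-memory store stands in for a real database with a unique constraint on (platform, platform identifier):

```python
class EntityStore:
    """Minimal sketch of upsert semantics keyed on platform-specific identifiers."""

    def __init__(self):
        self._by_key = {}

    def upsert(self, platform: str, platform_id: str, payload: dict) -> bool:
        """Insert or overwrite by dedup key; returns True if the entity was new."""
        key = (platform, platform_id)
        is_new = key not in self._by_key
        self._by_key[key] = payload
        return is_new


store = EntityStore()
first = store.upsert("slack", "msg-1", {"text": "hello"})
# Re-running the same import hits the same key: no duplicate is created.
again = store.upsert("slack", "msg-1", {"text": "hello"})
```

Because the key is derived from the platform's own identifiers, a full re-import of the same export file is a safe no-op.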

Manual / Ad-Hoc

Direct writes through the API. A consumer, user, or agent pushes data explicitly.

  • Use cases — logging a phone call manually, adding a note, creating a relationship, correcting ingested data, importing from an unsupported platform
  • Provenance — carries user_confirmed or manual provenance, distinguishing it from platform-reported data
  • Same API — no special endpoints. Uses the standard write operations.

Connector Architecture

What a Connector Is

A connector is a component that reads data from a specific platform and writes it to LifeDB through the write API. It handles:

  • Authentication — connecting to the platform (OAuth, API keys, local file access)
  • Extraction — reading data from the platform in its native format
  • Normalization — mapping platform-specific data to LifeDB’s canonical entities
  • Writing — creating entities and relationships through the write API
  • State management — tracking sync position to enable incremental updates and catch-up
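These five responsibilities suggest a connector contract along the following lines. The interface and method names are hypothetical, not LifeDB's actual Connector interface:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class Connector(ABC):
    """Hypothetical contract mirroring the five responsibilities above."""

    @abstractmethod
    def authenticate(self) -> None: ...                 # OAuth, API keys, or local file access

    @abstractmethod
    def extract(self, cursor: Optional[str]) -> Iterable[dict]: ...  # platform-native records

    @abstractmethod
    def normalize(self, record: dict) -> dict: ...      # map to canonical entities

    @abstractmethod
    def write(self, entity: dict) -> None: ...          # through the standard write API

    @abstractmethod
    def save_cursor(self, cursor: str) -> None: ...     # sync position for incremental catch-up


def sync(connector: Connector, cursor: Optional[str] = None) -> Optional[str]:
    """Generic incremental sync loop shared by all connectors."""
    connector.authenticate()
    last = cursor
    for record in connector.extract(cursor):
        connector.write(connector.normalize(record))
        last = record["id"]                             # assumes records carry a native id
    if last is not None and last != cursor:
        connector.save_cursor(last)
    return last
```

The same loop serves live sync (cursor advances continuously) and catch-up after downtime (extract resumes from the last saved cursor).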

Cloud-First Aspiration

The ideal connector runs entirely in the cloud. The user authorizes access, the connector syncs continuously, and no local software is required.

This is achievable for platforms with cloud APIs:

  • Slack — OAuth + Web API
  • Gmail / Outlook — OAuth + push notifications / API polling
  • Google Calendar / Outlook Calendar — OAuth + API
  • Discord — OAuth + Gateway API

Some platforms require local access:

  • iMessage — requires access to the local SQLite database on macOS
  • Signal — local encrypted database
  • Phone call logs — device-specific access
  • Local notes — Apple Notes (local database), file-based notes

For local-access platforms, the connector may initially require a local agent or script. The setup should be guided and as simple as possible, with the goal of minimizing what runs locally and moving toward cloud-based solutions as platform capabilities evolve.

First-Party and Third-Party

First-party connectors are maintained by LifeDB. They are the default, trusted connectors for supported platforms. They carry full platform_reported provenance trust.

Third-party connectors are possible because the write API is public. Any developer can build a connector for an unsupported platform using the same API. Third-party connectors:

  • Use the same write API with standard authentication
  • Carry provenance identifying them as third-party
  • May have different default trust levels for inferred signals
  • Are not a primary product goal but are an architectural consequence of the unified API

Reliability Guarantees

Data Integrity

  • No data loss — once data is acknowledged by the write API, it is durably stored. Connector failures, retries, and restarts do not cause data loss.
  • No duplicates — upsert semantics prevent duplicate entities from re-ingestion or connector restarts. Platform-specific identifiers serve as deduplication keys.
  • Ordering — messages are stored with their original platform timestamps regardless of ingestion order. Out-of-order ingestion (common during bulk imports) produces correctly ordered query results.
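The ordering guarantee can be illustrated in a few lines: storage preserves arrival order, but queries sort on the original platform timestamp, so out-of-order ingestion still yields chronological results. Field names are illustrative:

```python
# Entities keep their original platform timestamps regardless of ingestion order.
messages = []


def ingest(message: dict) -> None:
    messages.append(message)  # arrival order, e.g. a bulk import running newest-first


def query_conversation() -> list:
    """Queries always sort on the platform timestamp, not on arrival order."""
    return sorted(messages, key=lambda m: m["platform_ts"])


ingest({"text": "second", "platform_ts": 200})  # arrives first
ingest({"text": "first", "platform_ts": 100})   # arrives second
```

This is why bulk imports, which commonly deliver data out of order, still produce correctly ordered query results.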

Connector Health

  • Transparent status — consumers can query which connectors are active, healthy, erroring, or stalled
  • Automatic recovery — transient failures (API timeouts, rate limits, network issues) retry automatically with backoff. No user intervention required.
  • Clear action required — when the user must act (re-authenticate after token expiry, grant additional permissions, resolve a configuration issue), the system communicates exactly what’s needed.
  • Catch-up — after downtime or disconnection, connectors automatically sync missed data on reconnection without manual intervention.
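The recovery behavior might look like the following sketch: transient errors retry with exponential backoff, while errors that require user action are surfaced immediately rather than retried. `ActionRequired` is a hypothetical error class for illustration:

```python
import time


class ActionRequired(Exception):
    """The user must intervene, e.g. re-authenticate after token expiry."""


def with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff; never retry ActionRequired."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ActionRequired:
            raise                                 # surfaced to the user immediately
        except Exception:
            if attempt == max_attempts - 1:
                raise                             # exhausted: mark connector as erroring
            sleep(base_delay * 2 ** attempt)      # 1s, 2s, 4s, 8s, ...
```

Separating the two error classes is what lets the system recover silently from rate limits and timeouts while telling the user exactly what's needed when only they can fix it.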

Processing Pipeline

Ingested data goes through a processing pipeline before being fully available:

  1. Receive and store — raw data is durably stored immediately
  2. Normalize — platform-specific data is mapped to canonical entities
  3. Index — entities are indexed for query and search
  4. Resolve — entity resolution runs asynchronously (see Entity Resolution)
  5. Enrich — pre-computed fields (summaries, extracted entities) are generated asynchronously

Data becomes queryable after step 3. Steps 4 and 5 improve data quality over time without blocking availability. The consumer API always reflects the current state of processing.
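A compressed sketch of the five steps, assuming a dict-backed index and a queue standing in for the asynchronous stages:

```python
def process(raw: bytes, store: dict, queue: list) -> dict:
    """Illustrative pipeline: queryable after indexing, refined asynchronously."""
    entity = {"raw": raw}                  # 1. receive and store (durable immediately)
    entity["normalized"] = raw.decode()    # 2. normalize to a canonical entity
    store[entity["normalized"]] = entity   # 3. index — entity is now queryable
    queue.append(("resolve", entity))      # 4. async entity resolution
    queue.append(("enrich", entity))       # 5. async enrichment (summaries, extraction)
    return entity
```

Steps 4 and 5 mutate an entity that consumers can already see, which is how quality improves over time without blocking availability.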

Data Source Coverage

Initial Focus

Platforms aligned with the vision doc’s scope of communication and interaction data:

| Platform | Mode | Access Method |
| --- | --- | --- |
| iMessage | Live sync + bulk import | Local database |
| Android SMS | Bulk import + live webhook | XML file + webhook (HMAC) |
| WhatsApp | Live sync + bulk import | TBD (API limitations) |
| Slack | Live sync + bulk import | OAuth + Web API |
| Discord | Live sync + bulk import | OAuth + Gateway API |
| Telegram | Live sync + bulk import | Bot API / TDLib |
| Signal | Bulk import | Local encrypted database |
| Gmail | Live sync + bulk import | OAuth + Gmail API |
| Google Contacts | Live sync | OAuth + People API |
| Apple Contacts | Bulk import | Local database |
| Outlook | Live sync + bulk import | OAuth + Graph API |
| Google Calendar | Live sync | OAuth + Calendar API |
| Outlook Calendar | Live sync | OAuth + Graph API |
| Phone calls | Live sync | Device-specific |
| Apple Notes | Live sync + bulk import | Local database |

Specific connector implementation order and feasibility are planning concerns, not product spec concerns. This table captures the target scope.

Extensibility

The connector architecture and write API are designed so new data sources can be added without changes to the core system. A new connector is a new API consumer, not a core modification.

Performance Expectations

Ingestion is throughput-oriented. The priority is reliably processing large volumes of data. Latency between ingestion and full availability (including entity resolution and enrichment) is expected and acceptable.

The consumer API is latency-oriented. Query responses must be near-instantaneous regardless of ingestion load. Ingestion processing must not degrade query performance. These are separate performance profiles and may involve separate infrastructure.

Data availability is incremental:

  • Raw entity data (messages, events, documents) is queryable within seconds to minutes of ingestion
  • Entity resolution improves results over minutes to hours as evidence accumulates
  • Pre-computed intelligence (summaries, extracted entities) populates asynchronously

Performance Visibility

Operators and developers can observe ingestion performance across three dimensions:

Pipeline stage durations — each phase of the processing pipeline (receive, normalize, index, resolve, enrich) reports its duration as a distinct span. This enables identifying which phase is slow, whether latency is concentrated in one stage or distributed, and how stage durations change with different data sources or batch sizes.

Throughput metrics — the system exposes counts of entities processed, broken down by entity type (messages, conversations, persons, events, documents) and ingestion mode (live sync, bulk import, manual). Batch sizes for bulk imports are visible, enabling operators to correlate throughput with resource utilization.

Error rates — errors are reported by entity type and pipeline phase. An operator can determine whether failures are concentrated in a specific connector, entity type, or processing stage. Transient errors (retried successfully) are distinguished from permanent failures.
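As a sketch, the throughput and error dimensions reduce to counters keyed on the labels above. The label names are illustrative, not LifeDB's actual metric schema:

```python
from collections import Counter

# Throughput keyed by (entity_type, ingestion_mode); errors keyed by
# (entity_type, pipeline_phase, transient-vs-permanent).
throughput = Counter()
errors = Counter()


def record_write(entity_type: str, mode: str) -> None:
    throughput[(entity_type, mode)] += 1


def record_error(entity_type: str, phase: str, transient: bool) -> None:
    errors[(entity_type, phase, "transient" if transient else "permanent")] += 1
```

Keying on these label tuples is what lets an operator slice failures by connector, entity type, or stage, and separate retried transients from permanent failures.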


Vision

Specifications

  • Data Model — The canonical entities that ingestion normalizes data into
  • API — The write API that connectors use, unified consumer contract
  • Entity Resolution — Resolution triggered by ingested data
  • Architecture — Observability infrastructure (tracing, metrics, profiling) supporting performance visibility
  • Module Interfaces — Connector and ConnectorManager interfaces
  • Security & Privacy — Security architecture overview and credential encryption guarantees
  • Security & Privacy (Internal) — Credential encryption implementation details

Operational

Decisions

  • Connector implementation order (TBD) — ADR/plan on which connectors to build first
  • Local agent architecture (TBD) — ADR on how local-access connectors work