
Ingestion

Status: Published · Last Updated: 2026-03-09

Overview

Ingestion is how data gets into LifeDB. The system is only as valuable as the data it contains, making ingestion existential — not a nice-to-have. The bar is simple: it must be easy and it must work.

Ingestion uses the same write API that external consumers use (see API). Connectors are API consumers with elevated permissions and platform_reported provenance. This means there is one contract to maintain, the write API is battle-tested by ingestion itself, and third-party connectors are architecturally possible.
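To make the unified contract concrete, here is a minimal sketch of a write request as both a connector and a manual client might issue it. The field names (`entity_type`, `payload`, `provenance`, `platform_id`) are illustrative assumptions, not LifeDB's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WriteRequest:
    """Hypothetical write-API payload; the same shape serves all callers."""
    entity_type: str                 # e.g. "message", "event", "document"
    payload: dict                    # normalized entity fields
    provenance: str                  # how the data entered the system
    platform_id: Optional[str] = None  # platform-native id, used as a dedup key


def connector_write(payload: dict, platform_id: str) -> WriteRequest:
    """A connector is just an API consumer writing with platform_reported provenance."""
    return WriteRequest("message", payload, "platform_reported", platform_id)


def manual_write(payload: dict) -> WriteRequest:
    """A user or agent writes the same entity type with manual provenance."""
    return WriteRequest("message", payload, "manual")
```

The only differences between the two callers are permissions and the provenance value; the endpoint and payload shape are shared.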

Ingestion Modes

Live Sync

Ongoing connection to a platform. The user connects once and data flows in continuously. New messages, events, and documents appear in LifeDB as they occur.

  • Setup — ideally one-click OAuth-style authorization. The aspiration is cloud-first: no local software required. Where platform limitations force local access (e.g., iMessage database), the setup should be as guided and minimal as possible.
  • Freshness — varies by medium, matching the natural cadence of the communication type:
    • Instant messaging (iMessage, WhatsApp, Slack, Discord, Telegram, Signal) — seconds. These are real-time communication channels and data should arrive with minimal delay.
    • Email (Gmail, Outlook) — minutes. Email is inherently less urgent, and push notification APIs support near-real-time delivery.
    • Calendar — minutes to hours. Events change less frequently; periodic sync is acceptable.
    • Phone calls — minutes. Call log metadata is available shortly after a call ends.
    • Notes — minutes to hours. Document sync is less latency-sensitive.
  • Ongoing — once connected, stays current without user intervention. Token refresh, reconnection, and catch-up after downtime are automatic.
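The freshness targets above can be captured as a per-medium sync policy. The numeric bounds below are illustrative upper limits consistent with the cadences listed, not specified values:

```python
# Illustrative freshness targets per medium, in seconds.
FRESHNESS_TARGET_S = {
    "instant_messaging": 30,        # seconds
    "email": 5 * 60,                # minutes
    "calendar": 6 * 60 * 60,        # minutes to hours
    "phone_calls": 5 * 60,          # minutes
    "notes": 6 * 60 * 60,           # minutes to hours
}


def is_stale(medium: str, seconds_since_last_sync: float) -> bool:
    """A connector is considered stale when it exceeds its medium's target."""
    return seconds_since_last_sync > FRESHNESS_TARGET_S[medium]
```

A health monitor could use such a table to decide when a connector has fallen behind its medium's natural cadence.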

Bulk Import

One-time import of historical data. Large volume, processed asynchronously.

  • Sources — platform data exports (Slack export, Google Takeout, iMessage database), email archives (MBOX, PST), backup files, CSV/JSON dumps
  • Processing — asynchronous. The user initiates the import and processing happens in the background. Data becomes available incrementally as it's processed, not all at once when the import completes.
  • Idempotent — re-importing the same data does not create duplicates. Upsert semantics based on platform-specific identifiers ensure safe re-runs.
  • Scale — must handle years of data. A user importing a decade of email or five years of Slack history should not encounter limits or failures due to volume.
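The idempotency guarantee can be sketched as upsert-by-dedup-key. This in-memory store stands in for a real database with a unique constraint on (platform, platform identifier):

```python
class EntityStore:
    """Minimal sketch of upsert semantics keyed on platform-specific identifiers."""

    def __init__(self):
        self._by_key = {}

    def upsert(self, platform: str, platform_id: str, payload: dict) -> bool:
        """Insert or overwrite by dedup key; returns True if the entity was new."""
        key = (platform, platform_id)
        is_new = key not in self._by_key
        self._by_key[key] = payload
        return is_new


store = EntityStore()
first = store.upsert("slack", "msg-1", {"text": "hello"})
# Re-running the same import hits the same key: no duplicate is created.
again = store.upsert("slack", "msg-1", {"text": "hello"})
```

Because the key is derived from the platform's own identifiers, a full re-import of the same export file is a safe no-op.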

Manual / Ad-Hoc

Direct writes through the API. A consumer, user, or agent pushes data explicitly.

  • Use cases — logging a phone call manually, adding a note, creating a relationship, correcting ingested data, importing from an unsupported platform
  • Provenance — carries user_confirmed or manual provenance, distinguishing it from platform-reported data
  • Same API — no special endpoints. Uses the standard write operations.

Connector Architecture

What a Connector Is

A connector is a component that reads data from a specific platform and writes it to LifeDB through the write API. It handles:

  • Authentication — connecting to the platform (OAuth, API keys, local file access)
  • Extraction — reading data from the platform in its native format
  • Normalization — mapping platform-specific data to LifeDB’s canonical entities
  • Writing — creating entities and relationships through the write API
  • State management — tracking sync position to enable incremental updates and catch-up
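These five responsibilities suggest a connector contract along the following lines. The interface and method names are hypothetical, not LifeDB's actual Connector interface:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class Connector(ABC):
    """Hypothetical contract mirroring the five responsibilities above."""

    @abstractmethod
    def authenticate(self) -> None: ...                 # OAuth, API keys, or local file access

    @abstractmethod
    def extract(self, cursor: Optional[str]) -> Iterable[dict]: ...  # platform-native records

    @abstractmethod
    def normalize(self, record: dict) -> dict: ...      # map to canonical entities

    @abstractmethod
    def write(self, entity: dict) -> None: ...          # through the standard write API

    @abstractmethod
    def save_cursor(self, cursor: str) -> None: ...     # sync position for incremental catch-up


def sync(connector: Connector, cursor: Optional[str] = None) -> Optional[str]:
    """Generic incremental sync loop shared by all connectors."""
    connector.authenticate()
    last = cursor
    for record in connector.extract(cursor):
        connector.write(connector.normalize(record))
        last = record["id"]                             # assumes records carry a native id
    if last is not None and last != cursor:
        connector.save_cursor(last)
    return last
```

The same loop serves live sync (cursor advances continuously) and catch-up after downtime (extract resumes from the last saved cursor).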

Cloud-First Aspiration

The ideal connector runs entirely in the cloud. The user authorizes access, the connector syncs continuously, and no local software is required.

This is achievable for platforms with cloud APIs:

  • Slack — OAuth + Web API
  • Gmail / Outlook — OAuth + push notifications / API polling
  • Google Calendar / Outlook Calendar — OAuth + API
  • Discord — OAuth + Gateway API

Some platforms require local access:

  • iMessage — requires access to the local SQLite database on macOS
  • Signal — local encrypted database
  • Phone call logs — device-specific access
  • Local notes — Apple Notes (local database), file-based notes

For local-access platforms, the connector may initially require a local agent or script. The setup should be guided and as simple as possible, with the goal of minimizing what runs locally and moving toward cloud-based solutions as platform capabilities evolve.

First-Party and Third-Party

First-party connectors are maintained by LifeDB. They are the default, trusted connectors for supported platforms. They carry full platform_reported provenance trust.

Third-party connectors are possible because the write API is public. Any developer can build a connector for an unsupported platform using the same API. Third-party connectors:

  • Use the same write API with standard authentication
  • Carry provenance identifying them as third-party
  • May have different default trust levels for inferred signals
  • Are not a primary product goal but are an architectural consequence of the unified API

Reliability Guarantees

Data Integrity

  • No data loss — once data is acknowledged by the write API, it is durably stored. Connector failures, retries, and restarts do not cause data loss.
  • No duplicates — upsert semantics prevent duplicate entities from re-ingestion or connector restarts. Platform-specific identifiers serve as deduplication keys.
  • Ordering — messages are stored with their original platform timestamps regardless of ingestion order. Out-of-order ingestion (common during bulk imports) produces correctly ordered query results.
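The ordering guarantee can be illustrated in a few lines: storage preserves arrival order, but queries sort on the original platform timestamp, so out-of-order ingestion still yields chronological results. Field names are illustrative:

```python
# Entities keep their original platform timestamps regardless of ingestion order.
messages = []


def ingest(message: dict) -> None:
    messages.append(message)  # arrival order, e.g. a bulk import running newest-first


def query_conversation() -> list:
    """Queries always sort on the platform timestamp, not on arrival order."""
    return sorted(messages, key=lambda m: m["platform_ts"])


ingest({"text": "second", "platform_ts": 200})  # arrives first
ingest({"text": "first", "platform_ts": 100})   # arrives second
```

This is why bulk imports, which commonly deliver data out of order, still produce correctly ordered query results.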

Connector Health

  • Transparent status — consumers can query which connectors are active, healthy, erroring, or stalled
  • Automatic recovery — transient failures (API timeouts, rate limits, network issues) retry automatically with backoff. No user intervention required.
  • Clear action required — when the user must act (re-authenticate after token expiry, grant additional permissions, resolve a configuration issue), the system communicates exactly what’s needed.
  • Catch-up — after downtime or disconnection, connectors automatically sync missed data on reconnection without manual intervention.
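The recovery behavior might look like the following sketch: transient errors retry with exponential backoff, while errors that require user action are surfaced immediately rather than retried. `ActionRequired` is a hypothetical error class for illustration:

```python
import time


class ActionRequired(Exception):
    """The user must intervene, e.g. re-authenticate after token expiry."""


def with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff; never retry ActionRequired."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ActionRequired:
            raise                                 # surfaced to the user immediately
        except Exception:
            if attempt == max_attempts - 1:
                raise                             # exhausted: mark connector as erroring
            sleep(base_delay * 2 ** attempt)      # 1s, 2s, 4s, 8s, ...
```

Separating the two error classes is what lets the system recover silently from rate limits and timeouts while telling the user exactly what's needed when only they can fix it.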

Processing Pipeline

Ingested data goes through a processing pipeline before being fully available:

  1. Receive and store — raw data is durably stored immediately
  2. Normalize — platform-specific data is mapped to canonical entities
  3. Index — entities are indexed for query and search
  4. Resolve — entity resolution runs asynchronously (see Entity Resolution)
  5. Enrich — pre-computed fields (summaries, extracted entities) are generated asynchronously

Data becomes queryable after step 3. Steps 4 and 5 improve data quality over time without blocking availability. The consumer API always reflects the current state of processing.
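A compressed sketch of the five steps, assuming a dict-backed index and a queue standing in for the asynchronous stages:

```python
def process(raw: bytes, store: dict, queue: list) -> dict:
    """Illustrative pipeline: queryable after indexing, refined asynchronously."""
    entity = {"raw": raw}                  # 1. receive and store (durable immediately)
    entity["normalized"] = raw.decode()    # 2. normalize to a canonical entity
    store[entity["normalized"]] = entity   # 3. index — entity is now queryable
    queue.append(("resolve", entity))      # 4. async entity resolution
    queue.append(("enrich", entity))       # 5. async enrichment (summaries, extraction)
    return entity
```

Steps 4 and 5 mutate an entity that consumers can already see, which is how quality improves over time without blocking availability.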

Data Source Coverage

Initial Focus

Platforms aligned with the vision doc’s scope of communication and interaction data:

| Platform | Mode | Access Method |
| --- | --- | --- |
| iMessage | Live sync + bulk import | Local database |
| Android SMS | Bulk import + live webhook | XML file + webhook (HMAC) |
| WhatsApp | Live sync + bulk import | TBD (API limitations) |
| Slack | Live sync + bulk import | OAuth + Web API |
| Discord | Live sync + bulk import | OAuth + Gateway API |
| Telegram | Live sync + bulk import | Bot API / TDLib |
| Signal | Bulk import | Local encrypted database |
| Gmail | Live sync + bulk import | OAuth + Gmail API |
| Google Contacts | Live sync | OAuth + People API |
| Apple Contacts | Bulk import | Local database |
| Outlook | Live sync + bulk import | OAuth + Graph API |
| Google Calendar | Live sync | OAuth + Calendar API |
| Outlook Calendar | Live sync | OAuth + Graph API |
| Phone calls | Live sync | Device-specific |
| Apple Notes | Live sync + bulk import | Local database |

Specific connector implementation order and feasibility are planning concerns, not product spec concerns. This table captures the target scope.

Extensibility

The connector architecture and write API are designed so new data sources can be added without changes to the core system. A new connector is a new API consumer, not a core modification.

Performance Expectations

Ingestion is throughput-oriented. The priority is reliably processing large volumes of data. Latency between ingestion and full availability (including entity resolution and enrichment) is expected and acceptable.

The consumer API is latency-oriented. Query responses must be near-instantaneous regardless of ingestion load. Ingestion processing must not degrade query performance. These are separate performance profiles and may involve separate infrastructure.

Data availability is incremental:

  • Raw entity data (messages, events, documents) is queryable within seconds to minutes of ingestion
  • Entity resolution improves results over minutes to hours as evidence accumulates
  • Pre-computed intelligence (summaries, extracted entities) populates asynchronously

Performance Visibility

Operators and developers can observe ingestion performance across three dimensions:

Pipeline stage durations — each phase of the processing pipeline (receive, normalize, index, resolve, enrich) reports its duration as a distinct span. This enables identifying which phase is slow, whether latency is concentrated in one stage or distributed, and how stage durations change with different data sources or batch sizes.

Throughput metrics — the system exposes counts of entities processed, broken down by entity type (messages, conversations, persons, events, documents) and ingestion mode (live sync, bulk import, manual). Batch sizes for bulk imports are visible, enabling operators to correlate throughput with resource utilization.

Error rates — errors are reported by entity type and pipeline phase. An operator can determine whether failures are concentrated in a specific connector, entity type, or processing stage. Transient errors (retried successfully) are distinguished from permanent failures.
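As a sketch, the throughput and error dimensions reduce to counters keyed on the labels above. The label names are illustrative, not LifeDB's actual metric schema:

```python
from collections import Counter

# Throughput keyed by (entity_type, ingestion_mode); errors keyed by
# (entity_type, pipeline_phase, transient-vs-permanent).
throughput = Counter()
errors = Counter()


def record_write(entity_type: str, mode: str) -> None:
    throughput[(entity_type, mode)] += 1


def record_error(entity_type: str, phase: str, transient: bool) -> None:
    errors[(entity_type, phase, "transient" if transient else "permanent")] += 1
```

Keying on these label tuples is what lets an operator slice failures by connector, entity type, or stage, and separate retried transients from permanent failures.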


Vision

Specifications

  • Data Model — The canonical entities that ingestion normalizes data into
  • API — The write API that connectors use, unified consumer contract
  • Entity Resolution — Resolution triggered by ingested data
  • Architecture — Observability infrastructure (tracing, metrics, profiling) supporting performance visibility
  • Module Interfaces — Connector and ConnectorManager interfaces
  • Security & Privacy — Security architecture overview and credential encryption guarantees
  • Security & Privacy (Internal) — Credential encryption implementation details

Operational

Decisions

  • Connector implementation order (TBD) — ADR/plan on which connectors to build first
  • Local agent architecture (TBD) — ADR on how local-access connectors work