02 Data Ingestion

Supported sources

uv run personal-llm ingest --config config/sources.yaml
uv run personal-llm extract
uv run personal-llm classify-domains
uv run personal-llm embed

NFS works through the local_files connector. Mount the share on your Mac and point the configured path at the mount point, for example /Volumes/KnowledgeShare.
Local and NFS files are snapshotted into data/raw/ before extraction. This is recommended because it keeps the training and indexing input stable even if the source share changes later.
Google Drive supports incremental changes through stored page/change tokens.
Email ingestion preserves thread, sender, recipients, timestamps, and attachment metadata.
The simplest email path is file-based ingestion:
mbox if you have mailbox exports
.eml directories if you have one-message-per-file exports
Gmail Takeout if you exported from Gmail
IMAP is optional and only needed when you want direct mailbox sync.
Linkwarden ingestion keeps URL, tags, archive references, and collection metadata.
Code repositories prioritize README files, ADRs, infra code, CI files, and key modules rather than embedding every binary or vendored file.
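The snapshot step described above for local and NFS files can be pictured with a short sketch. This is not the project's actual implementation: the function names `snapshot_tree` and `file_digest` are hypothetical, and the sketch assumes a snapshot is simply a copy of the source tree that is refreshed only when a file's content hash changes.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot_tree(source_root: Path, raw_root: Path) -> list[Path]:
    """Copy new or changed files from source_root into raw_root.

    Hypothetical helper: returns the destination paths that were written.
    """
    copied = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        dest = raw_root / src.relative_to(source_root)
        if dest.exists() and file_digest(dest) == file_digest(src):
            continue  # content unchanged since the last snapshot
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also keeps modification times
        copied.append(dest)
    return copied


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With a share mounted at /Volumes/KnowledgeShare, a call such as `snapshot_tree(Path("/Volumes/KnowledgeShare"), Path("data/raw/share"))` would fill a subfolder of data/raw/ (the subfolder name is an assumption). Later edits on the share do not affect the copy until the function runs again, which is the stability property the snapshot is meant to provide.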

For the first version of the system, prefer:

local_files pointing at a local folder or NFS mount
email.mbox_paths pointing at mailbox exports, or .eml files inside local_files
Linkwarden export JSON if you use Linkwarden
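For the file-based email path, the metadata listed earlier (thread, sender, recipients, timestamps, attachments) can be read with Python's standard library alone. The sketch below is illustrative, not the project's extractor: `message_metadata` and `read_mbox` are hypothetical names, and the output dictionary layout is an assumption.

```python
import mailbox
from email.utils import getaddresses, parsedate_to_datetime


def message_metadata(msg) -> dict:
    """Collect thread, sender, recipient, time, and attachment metadata."""
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    date_header = msg.get("Date")
    attachments = [
        {"filename": part.get_filename(), "content_type": part.get_content_type()}
        for part in msg.walk()
        if part.get_filename()  # parts without a filename are treated as body
    ]
    return {
        # Message-ID, In-Reply-To, and References let you rebuild threads.
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "references": (msg.get("References") or "").split(),
        "sender": msg.get("From"),
        "recipients": [address for _name, address in recipients],
        # A malformed Date header makes parsedate_to_datetime raise.
        "sent_at": parsedate_to_datetime(date_header) if date_header else None,
        "attachments": attachments,
    }


def read_mbox(path: str) -> list[dict]:
    """Return metadata for every message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        return [message_metadata(msg) for msg in box]
    finally:
        box.close()
```

The same `message_metadata` function should also work for .eml files parsed with `email.message_from_binary_file`, since both yield standard message objects.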

Use Google Drive and IMAP only if you specifically need live API-based sync.
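If you do opt into Drive sync, the stored-token approach mentioned earlier can be pictured roughly as follows. In the Drive API v3, `changes.list` returns a `nextPageToken` while more pages remain and a `newStartPageToken` on the last page. The sketch takes the API call as an injected function so it runs without credentials. The state file layout and the names `load_token`, `save_token`, and `sync_changes` are assumptions, not the project's actual code.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_token(state_file: Path) -> Optional[str]:
    """Return the token saved by the previous run, if any."""
    if state_file.exists():
        return json.loads(state_file.read_text())["page_token"]
    return None


def save_token(state_file: Path, token: str) -> None:
    """Persist the token that the next run should resume from."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"page_token": token}))


def sync_changes(
    list_changes: Callable[[str], dict],
    state_file: Path,
    start_token: str,
) -> list[dict]:
    """Fetch all changes since the stored token, then store the new one.

    list_changes(token) stands in for one changes.list request and must
    return a dict shaped like the Drive API response.
    """
    token = load_token(state_file) or start_token
    changes = []
    while True:
        page = list_changes(token)
        changes.extend(page.get("changes", []))
        if "nextPageToken" in page:
            token = page["nextPageToken"]  # more pages in this run
            continue
        # Last page: remember where the next incremental run begins.
        save_token(state_file, page["newStartPageToken"])
        return changes
```

Saving the token only after the last page means an interrupted run repeats from the previous token instead of skipping changes.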

Treat data/raw/ as highly sensitive, since it holds snapshots of your source files.
Use config/security.yaml to define retention, encryption, and export policies.

Ingestion is only the mechanical part of the system. Before you promote extracted content into training data, review:

docs/13_data_quality.md for duplicate, contradiction, freshness, and source-trust rules
docs/11_personal_knowledge.md for what belongs in curated knowledge versus raw RAG-only content