13 Data Quality

This guide covers the quality-control rules that matter before data becomes training material.

Core principle

Do not treat all ingested content as equally trustworthy or equally suitable for LoRA.

The safest pattern for model quality is:

  • broad raw corpus for RAG
  • narrow curated corpus for fine-tuning

Source trust ranking

Use a simple ranking model:

  1. curated knowledge files you wrote intentionally
  2. reviewed internal runbooks, ADRs, policies, and finance models
  3. recent operational documents with clear ownership
  4. external docs and bookmarks
  5. noisy OCR output and low-confidence extraction

Higher-ranked sources should dominate training examples when conflicts exist.
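The ranking above can be encoded directly so that conflict resolution is mechanical rather than ad hoc. A minimal sketch, assuming a simple `(trust, text)` pair per candidate; the `Trust` enum and `resolve_conflict` helper are illustrative names, not part of any existing pipeline:

```python
from enum import IntEnum

class Trust(IntEnum):
    # Lower value = higher trust, matching the ranking above.
    CURATED = 1      # knowledge files you wrote intentionally
    REVIEWED = 2     # reviewed runbooks, ADRs, policies, finance models
    OPERATIONAL = 3  # recent operational docs with clear ownership
    EXTERNAL = 4     # external docs and bookmarks
    NOISY_OCR = 5    # noisy OCR output, low-confidence extraction

def resolve_conflict(candidates):
    """Given (trust, text) pairs that disagree, keep the most trusted one."""
    return min(candidates, key=lambda c: c[0])

best = resolve_conflict([(Trust.EXTERNAL, "vendor's rule"),
                         (Trust.CURATED, "your rule")])
# best is the CURATED candidate
```

Because `IntEnum` values compare numerically, "dominates" reduces to a `min` over the ranking.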

Duplicate handling

The pipeline already supports exact and near-duplicate detection. Your job is to decide which copy is canonical.

Best practice:

  • keep the highest-quality source as canonical
  • retain provenance to alternates in metadata
  • do not fine-tune on the same knowledge repeated many times
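Choosing the canonical copy from a duplicate cluster can follow the same trust ranking. A sketch, assuming each document is a dict with an `id` and a numeric `trust` field (lower = more trusted); the field names are hypothetical:

```python
def pick_canonical(cluster):
    """From a cluster of exact/near duplicates, keep the best copy and
    record the alternates as provenance metadata instead of training data."""
    ranked = sorted(cluster, key=lambda d: d["trust"])
    canonical = ranked[0]
    canonical["alternates"] = [d["id"] for d in ranked[1:]]
    return canonical

cluster = [
    {"id": "bookmark-17", "trust": 4},
    {"id": "runbook-3", "trust": 2},
]
canonical = pick_canonical(cluster)
# canonical["id"] == "runbook-3", alternates == ["bookmark-17"]
```

Keeping alternates as metadata preserves provenance without letting the same knowledge appear many times in the training set.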

Contradictions

Contradictions are normal in personal corpora. Handle them deliberately:

  • if one source is newer and authoritative, mark the older one as superseded
  • if the answer genuinely depends on date, include the date explicitly in the training example
  • if the issue is unresolved, prefer RAG with citations over encoding a strong LoRA answer
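The first rule above (newer and authoritative wins) is easy to apply mechanically once documents carry an update date. A sketch, assuming a `superseded` flag and an `updated` date field, both illustrative:

```python
from datetime import date

def mark_superseded(docs):
    """If one source is newer and authoritative, mark the older conflicting
    copies as superseded instead of training on both."""
    newest = max(docs, key=lambda d: d["updated"])
    for d in docs:
        d["superseded"] = d is not newest
    return newest

a = {"id": "policy-2022", "updated": date(2022, 1, 1)}
b = {"id": "policy-2024", "updated": date(2024, 6, 1)}
winner = mark_superseded([a, b])
# winner is policy-2024; policy-2022 is flagged superseded
```

Superseded documents can stay in RAG for historical questions; the flag only excludes them from training candidates.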

Staleness and freshness

Ask these questions before promoting content into training:

  • is this still your current view
  • is the regulation or policy still current
  • is the system architecture described still the real one
  • are the assumptions in this spreadsheet still valid

If not, keep it available for historical reference, but remove it from the training candidates.
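A last-reviewed timestamp turns these questions into a cheap automatic gate. A minimal sketch, assuming a `last_reviewed` date field and a one-year default window; both are illustrative choices, not prescribed values:

```python
from datetime import date, timedelta

def is_stale(doc, max_age_days=365):
    """Flag documents not reviewed within the window. Stale documents stay
    available for retrieval but drop out of the training-candidate set."""
    return date.today() - doc["last_reviewed"] > timedelta(days=max_age_days)

is_stale({"last_reviewed": date.today()})  # fresh: False
```

The human questions above still matter; the date gate only catches the documents nobody has looked at in a long time.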

Noisy PDFs and OCR

Treat OCR-heavy files with caution.

Promote to training only if:

  • the extracted text is readable
  • tables or formulas survived well enough
  • the document still contains stable knowledge

Otherwise, keep it retrieval-only or exclude it entirely.
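A crude readability heuristic can pre-screen OCR output before a human looks at it. A sketch, assuming that a low share of alphabetic and whitespace characters signals garbled extraction; the 0.7 threshold is an arbitrary illustrative value to tune against your own corpus:

```python
def ocr_quality_gate(text, min_alpha_ratio=0.7):
    """Return True if the OCR text looks readable enough to consider for
    training; otherwise keep the document retrieval-only or excluded."""
    if not text:
        return False
    ok = sum(ch.isalpha() or ch.isspace() for ch in text)
    return ok / len(text) >= min_alpha_ratio

ocr_quality_gate("The control reconciles daily cash balances.")  # True
ocr_quality_gate("#@@ 0O0 ||| l1l1 ~~ %%")                       # False
```

This only catches gross extraction failure; whether tables and formulas survived still needs a manual check.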

Spreadsheets and models

Do not fine-tune on raw spreadsheets blindly.

Instead:

  • extract the stable logic
  • write curated examples that explain the model or control logic
  • keep the spreadsheet itself as retrieval evidence
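A distilled example then looks like a hand-written record of the sheet's logic, not a dump of its cells. A sketch with entirely illustrative field names, formula, and file path:

```python
# A curated training example distilled by hand from a spreadsheet's logic.
# The formula and path below are placeholders, not real data.
distilled_example = {
    "instruction": "How is runway calculated in the cash model?",
    "response": ("Runway = current cash / trailing 3-month average net burn. "
                 "If net cash flow is positive, runway is reported as unlimited."),
    "source": "finance/cash-model.xlsx",  # the sheet itself stays in RAG
    "kind": "distilled-logic",
}
```

The `source` pointer keeps the spreadsheet reachable as retrieval evidence while only the stable logic enters the adapter.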

Sensitive information

Before fine-tuning or remote export:

  • remove secrets
  • redact credentials, personal data, and confidential identifiers where practical
  • separate highly sensitive content from general domain knowledge
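Pattern-based redaction can handle the mechanical part of this step. A minimal sketch; the two patterns below are illustrative only, and a real pipeline needs a reviewed, much broader ruleset plus manual spot checks:

```python
import re

# Illustrative patterns: key/token assignments and email addresses.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
]

def redact(text):
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

redact("api_key=sk-123 contact ops@example.com")
# "api_key=[REDACTED] contact [EMAIL]"
```

Regexes miss context-dependent secrets, so redaction should run before export but never replace the separation of highly sensitive content.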

Quality checklist before dataset generation

  • is the content current
  • is it authoritative
  • is it duplicative
  • is it stable enough to encode into the model
  • is it better handled through RAG instead
  • does it align with the persona and guardrail profile

When unsure:

  • keep the material in RAG
  • do not put it into LoRA training until it has been curated

That rule reduces the risk of encoding outdated or low-quality behavior into the adapter.
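The checklist can be enforced as a single promotion gate over curation metadata. A sketch, assuming boolean flags set during review; the key names are hypothetical:

```python
def training_candidate(doc):
    """Apply the checklist above: promote only content that passes every gate."""
    gates = (
        doc["current"],            # is the content current
        doc["authoritative"],      # is it authoritative
        not doc["duplicate"],      # is it duplicative
        doc["stable"],             # stable enough to encode into the model
        not doc["rag_preferred"],  # better handled through RAG instead
        doc["persona_aligned"],    # aligns with persona and guardrails
    )
    return all(gates)
```

Anything that fails a gate defaults to RAG, which is exactly the "when unsure" rule above.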

Source strategy matrix

Use this as a first-pass default:

| Source type | Default use | Why |
| --- | --- | --- |
| knowledge/persona.md | LoRA + runtime prompt design | This is your intended voice and behavior |
| Curated domain files in knowledge/domains/ | LoRA + optional RAG | Stable heuristics and default positions |
| Internal runbooks and ADRs | Both | Good retrieval sources and good basis for distilled examples |
| Policies and controls docs | Both | Strong source for governance and security behavior |
| Code repositories | Mostly RAG, selective LoRA | Good for retrieval, but LoRA should only encode stable patterns or preferences |
| Spreadsheets and financial models | RAG, plus distilled LoRA examples | Raw sheets are noisy; the logic is what matters |
| Email archives | Mostly RAG | Useful context, but often too noisy or transient for LoRA |
| Linkwarden bookmarks | Mostly RAG | Good discovery and citation layer, weak direct training material |
| OCR-heavy PDFs | RAG only unless curated | Extraction quality is often too unstable for fine-tuning |
| Temporary tickets or logs | Usually neither, or short-term RAG only | High churn and low long-term value |

If a source contains stable judgment and reusable reasoning, it is a LoRA candidate. If it mainly contains changing facts or one-off reference material, keep it in RAG.
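The matrix above can double as a routing default in code. A sketch, assuming coarse source-type labels and destination sets; all names are illustrative, and unknown sources fall back to RAG per the "when unsure" rule:

```python
# First-pass routing defaults from the matrix above.
DEFAULT_USE = {
    "persona": {"lora"},          # plus runtime prompt design
    "curated_domain": {"lora", "rag"},
    "runbooks_adrs": {"lora", "rag"},
    "policies": {"lora", "rag"},
    "code": {"rag"},              # selective LoRA for stable patterns only
    "spreadsheets": {"rag"},      # plus distilled LoRA examples
    "email": {"rag"},
    "bookmarks": {"rag"},
    "ocr_pdfs": {"rag"},
    "tickets_logs": set(),        # usually neither
}

def default_targets(source_type):
    # Unknown source types default to retrieval-only.
    return DEFAULT_USE.get(source_type, {"rag"})
```

The table stays the human-readable reference; the dict just keeps ingestion consistent with it.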