# 15 First Training Run
This guide turns the repository scaffold into a real first LoRA experiment.
## Goal
Run a small, disciplined first cycle:
- create curated knowledge
- generate a small training set
- run a baseline evaluation on the unadapted base model
- train a first adapter
- compare results against the baseline
## Recommended model choice for a dry run versus a real run
Use these defaults:
- qwen2.5-3b-instruct for a fast end-to-end dry run of the entire process
- qwen2.5-7b-instruct for the first serious personalized adapter
Why:
- the 3B profile is faster and cheaper to iterate with
- the 7B profile is a better target for the model you actually want to use for work
- both stay in the same Qwen family, so the prompts and overall behavior transfer more cleanly than switching to a very different model family
## Before you train
You should have:
- knowledge/persona.md edited to match your preferred style
- at least three populated files listed in knowledge/domains/README.md
- source documents ingested and embedded
- guardrail profile chosen
- evaluation prompts ready in evaluation/cases/README.md, especially core_eval_cases.jsonl
Important:
- the retrieval-grounded evaluation cases in core_eval_cases.jsonl only make sense if you have ingested matching internal documents
- if you have not loaded policy notes, IAM notes, finance notes, or architecture notes yet, replace those prompts with retrieval cases that match your real corpus before using the suite as a quality gate
## Recommended first dataset size
Do not start with a giant dataset.
For the first real run, target:
- 25-40 high-quality SFT examples
- 8-12 refusal examples if using standard or strict
- 10-15 boundary cases
Quality matters more than volume at this stage.
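The target mix above can be checked mechanically before training. Here is a minimal sketch that assumes each JSONL record carries a hypothetical type tag ("sft", "refusal", or "boundary"); that tag is this sketch's own convention, not a guarantee about the repo's dataset schema:

```python
import json
from collections import Counter

# Hypothetical target ranges from this guide, keyed by an assumed
# per-record "type" tag that your generation step would need to emit.
TARGETS = {"sft": (25, 40), "refusal": (8, 12), "boundary": (10, 15)}

def dataset_mix(path):
    """Count examples per type in a JSONL training file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get("type", "sft")] += 1
    return counts

def check_mix(counts):
    """Return (type, count, lo, hi) for every type outside its target range."""
    return [
        (t, counts.get(t, 0), lo, hi)
        for t, (lo, hi) in TARGETS.items()
        if not lo <= counts.get(t, 0) <= hi
    ]
```

An empty result from check_mix means the file is inside the recommended ranges; anything else tells you which bucket to grow or trim.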
## Step 1: record the baseline
You need a pre-LoRA reference point. Otherwise you will not know whether the adapter helped or only changed style.
### Baseline A: original base model behavior
export PERSONAL_LLM_MODEL_PROFILE=qwen2.5-3b-instruct
export PERSONAL_LLM_GUARDRAIL_PROFILE=original_model
uv run personal-llm evaluate --cases evaluation/cases/core_eval_cases.jsonl
### Baseline B: base model plus your chosen runtime guardrails
export PERSONAL_LLM_MODEL_PROFILE=qwen2.5-3b-instruct
export PERSONAL_LLM_GUARDRAIL_PROFILE=standard
uv run personal-llm evaluate --cases evaluation/cases/core_eval_cases.jsonl
Why use both:
- Baseline A tells you what the raw model does
- Baseline B tells you what the surrounding prompt and policy stack already fixes before training
Record both in a snapshot note under knowledge/snapshots/README.md.
After the dry run is complete, repeat the same baseline steps with qwen2.5-7b-instruct before the serious adapter run.
## Step 2: run a dry run before the real adapter
Use a tiny dry run first:
- 10-12 examples
- one or two examples from each major domain you care about
- at least two refusal cases
- at least two boundary cases
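A subset with this composition can be drawn from a larger training file instead of hand-picking lines. This is a sketch that assumes a hypothetical per-record type tag ("sft", "refusal", or "boundary") in the JSONL; adapt the tag name to whatever your generation step actually writes:

```python
import json
import random

def sample_dry_run(path, total=12, min_refusal=2, min_boundary=2, seed=0):
    """Pick a small dry-run subset: the required refusal and boundary
    cases first, then fill the remainder with ordinary examples."""
    buckets = {"sft": [], "refusal": [], "boundary": []}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                buckets.setdefault(rec.get("type", "sft"), []).append(rec)
    rng = random.Random(seed)  # fixed seed keeps the dry run reproducible
    picked = rng.sample(buckets["refusal"], min_refusal)
    picked += rng.sample(buckets["boundary"], min_boundary)
    fill = max(total - len(picked), 0)
    picked += rng.sample(buckets["sft"], min(fill, len(buckets["sft"])))
    return picked
```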
Purpose:
- confirm your JSONL format
- confirm the answers sound right
- catch prompt or guardrail mismatches early
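The JSONL format check is easy to script. The following sketch assumes a chat-style SFT record with a "messages" list of role/content dicts; that shape is an assumption about your export format, not a guarantee of this repo's schema, so adjust the required keys to match your actual records:

```python
import json

def validate_jsonl(path, required_roles=("user", "assistant")):
    """Return (line_number, problem) pairs for lines that fail to parse
    or lack a chat-style "messages" list with the required roles."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((n, f"invalid JSON: {e}"))
                continue
            msgs = rec.get("messages")
            if not isinstance(msgs, list):
                problems.append((n, 'missing "messages" list'))
                continue
            roles = [m.get("role") for m in msgs if isinstance(m, dict)]
            for role in required_roles:
                if role not in roles:
                    problems.append((n, f'no "{role}" message'))
    return problems
```

Run it over the dry-run file and fix every reported line before spending any GPU or MLX time.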
If you want to compare multiple guardrail profiles quickly before choosing one, use docs/16_guardrail_matrix.md.
## Step 3: train the first adapter
Use local MLX first unless you already know you need remote GPUs.
uv run personal-llm train-local-mlx \
--config config/training.yaml \
--dataset data/training/generated/train.jsonl
## Step 4: evaluate again
Run the same evaluation suite after training.
export PERSONAL_LLM_GUARDRAIL_PROFILE=standard
uv run personal-llm evaluate --cases evaluation/cases/core_eval_cases.jsonl
Then send a small set of high-risk prompts through the API:
uv run personal-llm serve --reload
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d @evaluation/cases/sample_request.json
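If you have more than a handful of high-risk prompts, scripting the requests beats repeating curl. A sketch against the server started above; the request-body fields here mirror the common OpenAI-style chat schema and are an assumption about what the local server honors, so keep them in line with evaluation/cases/sample_request.json:

```python
import json
import urllib.request

# Matches the default bind address of the serve command above.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"

def chat_body(prompt, model="qwen2.5-3b-instruct", temperature=0.2):
    """Build an OpenAI-style chat completions request body (assumed schema)."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(prompt):
    """POST one prompt to the running server and return the parsed reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(chat_body(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```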
## What success looks like
- answers are more aligned with your persona
- the model is more consistent on your core domains
- refusals remain clean
- boundary cases improve rather than becoming over-refused
- regulated-domain answers become more structured and cautious
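One way to make the baseline comparison concrete is to diff per-category pass rates between runs. This sketch assumes a simplified results shape, a list of records with "category" and "passed" fields; that shape is this sketch's own, not the repo's actual evaluate output, so map your real results into it first:

```python
def pass_rates(results):
    """Per-category pass rate from records shaped {"category", "passed"}."""
    totals, passed = {}, {}
    for r in results:
        c = r["category"]
        totals[c] = totals.get(c, 0) + 1
        passed[c] = passed.get(c, 0) + (1 if r["passed"] else 0)
    return {c: passed[c] / totals[c] for c in totals}

def deltas(baseline, post):
    """Category-level change in pass rate from baseline to post-training."""
    b, p = pass_rates(baseline), pass_rates(post)
    return {c: round(p.get(c, 0.0) - b.get(c, 0.0), 3) for c in set(b) | set(p)}
```

A positive delta on core domains with a near-zero delta on refusal and boundary categories matches the success criteria above; a negative refusal delta is a regression even if everything else improved.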
## Common failure modes
- too generic: add 5-10 persona-aligned SFT examples focused on the missing style, then re-evaluate
- too restrictive: remove or rewrite 3-5 refusal-heavy or over-aggressive boundary examples, or move to relaxed
- not restrictive enough: add 5-8 refusal examples and 3-5 boundary cases, then retrain
- factual weakness: improve retrieval and curated knowledge before adding more LoRA data
- style improved but substance did not: add 5-10 domain examples tied to your real heuristics or source evidence
- retrieval-grounded prompts fail: verify the matching docs were ingested, chunked, and embedded before changing LoRA data
## What to document after the run
Create a snapshot note with:
- date
- model profile
- guardrail profile
- dataset path
- prompt profile
- adapter output path
- baseline result summary
- post-training result summary
- what changed next
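Writing the note by hand works fine, but a tiny helper keeps every snapshot identical in shape. A sketch that renders the checklist above as markdown; the field names are this sketch's own, so keep whatever convention your existing notes under knowledge/snapshots/ already use:

```python
from datetime import date

# Order mirrors the checklist in this section.
FIELDS = [
    "date", "model_profile", "guardrail_profile", "dataset_path",
    "prompt_profile", "adapter_output_path", "baseline_summary",
    "post_training_summary", "next_change",
]

def render_snapshot(**info):
    """Render a snapshot note as markdown, refusing incomplete notes."""
    info.setdefault("date", date.today().isoformat())
    missing = [f for f in FIELDS if f not in info]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    lines = [f"# Training snapshot {info['date']}", ""]
    lines += [f"- {f.replace('_', ' ')}: {info[f]}" for f in FIELDS]
    return "\n".join(lines)
```

Failing loudly on missing fields is deliberate: an incomplete snapshot is exactly what makes the next run hard to compare.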
## Important caution
Do not assume a better-feeling answer means a better model. Compare against the baseline and keep the same eval cases across runs.
## When to stop tuning the first iteration
Ship the first iteration and move to real usage when:
- hard refusal cases pass
- boundary-case behavior is acceptably clean
- your persona checks are at or above the target in docs/07_evaluation.md
- the model is clearly more useful than the baseline on your core domains
Do not keep tuning just because a few answers could still be better. Once the first iteration clears the quality bar, start using it for real work and collect the next wave of improvements from actual usage.