# 09 Guardrails

## Purpose
This repository applies guardrails in several independent layers. This guide explains:
- where the guardrails live
- how to make them stronger
- how to make them weaker
- how to use the original model behavior with minimal repository guardrails
For difficult mixed-topic decisions, also read `docs/12_guardrail_boundary_cases.md`.
## Guardrail layers

The repository uses five guardrail layers:

- domain taxonomy in `config/domain_taxonomy.yaml`
- runtime policy in `src/personal_llm/core/policy.py`
- system and refusal prompts in `prompts/`
- retrieval filtering in the RAG path
- refusal examples in the LoRA training dataset
This matters because changing only one layer does not fully change system behavior.
Example:
- if you weaken the system prompt but leave runtime policy enabled, the API will still refuse disallowed topics
- if you disable runtime policy but keep refusal-heavy LoRA adapters, the model may still refuse because the adapter learned that behavior
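The independence of these layers can be sketched as a pair of checks (the names here are hypothetical illustrations, not the repository's actual API):

```python
# Illustrative sketch only: shows why disabling one guardrail layer
# does not fully change system behavior. Names are hypothetical.
from dataclasses import dataclass


@dataclass
class GuardrailDecision:
    allowed: bool
    reason: str


def apply_guardrails(prompt_label: str,
                     runtime_policy_enabled: bool,
                     adapter_refuses: bool) -> GuardrailDecision:
    # Layer 1: runtime policy (policy.py, driven by config/guardrails.yaml)
    if runtime_policy_enabled and prompt_label == "disallowed":
        return GuardrailDecision(False, "runtime policy")
    # Layer 2: refusal behavior baked into a LoRA adapter during training
    if adapter_refuses:
        return GuardrailDecision(False, "adapter fine-tuning")
    return GuardrailDecision(True, "allowed")


# Disabling the runtime policy alone is not enough: the adapter still refuses.
decision = apply_guardrails("disallowed",
                            runtime_policy_enabled=False,
                            adapter_refuses=True)
print(decision.reason)  # adapter fine-tuning
```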
## Built-in guardrail profiles

The repository includes these profiles in `config/guardrails.yaml`:

- `strict`
- `standard`
- `relaxed`
- `original_model`

Select one by setting `PERSONAL_LLM_GUARDRAIL_PROFILE` in your environment or in `.env`.
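For example, using the `PERSONAL_LLM_GUARDRAIL_PROFILE` variable shown later in this guide:

```shell
# Select the standard profile for the current shell session,
# or put the same assignment in .env to make it persistent.
export PERSONAL_LLM_GUARDRAIL_PROFILE=standard
echo "$PERSONAL_LLM_GUARDRAIL_PROFILE"
```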
### strict
Use when:
- you want a strongly domain-restricted assistant
- you prefer refusal over ambiguity
- you are handling sensitive professional workflows
Behavior:
- runtime policy enabled
- mixed-domain prompts refused
- disallowed retrieval excluded
- refusal examples included in training-set generation
- stricter system prompt
### standard
Use when:
- you want the recommended default
- you want professional-domain steering without refusing every mixed query
Behavior:
- runtime policy enabled
- clearly disallowed prompts refused
- mixed-domain prompts allowed unless the classifier marks them disallowed
- disallowed retrieval excluded
- refusal examples included in training-set generation
### relaxed
Use when:
- you still want professional steering
- you want fewer refusals
- you want the model to keep more of its original breadth
Behavior:
- runtime policy enabled
- mixed-domain prompts not automatically refused
- disallowed retrieval not automatically excluded
- refusal examples omitted from generated LoRA training sets
- softer system and refusal prompts
### original_model
Use when:
- you want the base model behavior with minimal repository guardrails
- you are comparing the raw base model against guarded variants
- you want to test how much the surrounding stack changes behavior
Behavior:
- runtime policy disabled
- no repository-level refusal enforcement
- disallowed retrieval not automatically excluded
- refusal examples omitted from generated LoRA training sets
- minimal repository system prompt
Important:
- this does not change the base model weights by itself
- if you use a previously trained adapter with refusal-heavy fine-tuning, the model may still behave like a guarded model
- for the cleanest baseline, use the original base model without a guardrail-heavy adapter
## How to strengthen guardrails

Use this order:

1. switch to `strict`
2. expand `disallowed_domains` and tighten keywords in `config/domain_taxonomy.yaml`
3. add more refusal examples to your training dataset
4. regenerate training data and retrain the adapter
5. re-run evaluation with refusal-compliance cases
### Add a new blocked topic
Edit `config/domain_taxonomy.yaml`:
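A sketch of what a new entry might look like. The exact schema is defined by the existing file, so the field names below (`disallowed_domains`, `keywords`) are assumptions based on the terms used elsewhere in this guide; match them to the real file:

```yaml
# Hypothetical example entry -- check the existing file for the real schema.
disallowed_domains:
  - name: personal_finance_advice
    keywords:
      - "stock tips"
      - "investment advice"
```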
Then run:
```shell
uv run personal-llm classify-domains
uv run personal-llm build-training-set
uv run personal-llm embed
uv run personal-llm evaluate
```
### Make mixed-topic handling stricter

Use the `strict` profile. That profile refuses mixed-domain prompts instead of trying to salvage the professional part.
## How to weaken guardrails

Use this order:

1. switch to `relaxed`
2. reduce refusal examples in your training data
3. retrain or stop using a strongly guarded adapter
4. only if needed, switch to `original_model`
### Reduce runtime refusals without removing all guardrails

Set the guardrail profile to `relaxed` in your environment or in `.env`. This keeps professional steering but removes the most aggressive repository-level restrictions.
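One way to do this, assuming the `relaxed` profile described above matches the restrictions you want to drop:

```shell
# Keep the runtime policy enabled, but with the softer relaxed behavior.
export PERSONAL_LLM_GUARDRAIL_PROFILE=relaxed
```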
### Keep the professional prompt but stop filtering retrieval

Use the `relaxed` profile. It stops automatically forcing `classification_label=allowed` at retrieval time.
## How to use the original model

If you want the original model with minimal repository guardrails:

- choose the base model profile you want
- set `PERSONAL_LLM_GUARDRAIL_PROFILE=original_model`
- do not use a heavily guardrailed LoRA adapter
- avoid training with refusal-heavy datasets unless you intentionally want that behavior
Example:

```shell
export PERSONAL_LLM_MODEL_PROFILE=qwen2.5-7b-instruct
export PERSONAL_LLM_GUARDRAIL_PROFILE=original_model
uv run personal-llm serve --reload
```
## How LoRA training changes guardrails

The generated training set changes with the selected guardrail profile:

- `strict` and `standard`: refusal examples are included
- `relaxed` and `original_model`: refusal examples are omitted
This means the same base model can end up behaving very differently depending on the training profile.
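A minimal sketch of that profile-conditional behavior (hypothetical names; the real logic lives in `src/personal_llm/pipelines/dataset_builder.py`):

```python
# Hypothetical sketch of profile-conditional dataset assembly.
REFUSAL_PROFILES = {"strict", "standard"}


def build_training_set(task_examples: list[dict],
                       refusal_examples: list[dict],
                       profile: str) -> list[dict]:
    records = list(task_examples)
    if profile in REFUSAL_PROFILES:
        # The adapter learns to refuse; this behavior persists even if
        # runtime guardrails are later disabled.
        records.extend(refusal_examples)
    return records


strict_set = build_training_set([{"q": "a"}], [{"q": "refuse"}], "strict")
relaxed_set = build_training_set([{"q": "a"}], [{"q": "refuse"}], "relaxed")
print(len(strict_set), len(relaxed_set))  # 2 1
```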
Best practice:
- tune runtime guardrails first
- only encode stronger guardrails into LoRA training when you are confident you want that behavior long-term
## Safe experimentation workflow

For a first-time user, use this sequence:

1. start with the base model and `standard`
2. test real prompts
3. if too restrictive, switch to `relaxed`
4. if still too restrictive, test `original_model`
5. only after that, decide whether the LoRA dataset should include refusal examples
That order avoids baking unwanted refusal behavior into adapters too early.
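The profile sweep in that sequence can be scripted; the loop below is illustrative, with the evaluation command from earlier in this guide left commented so you can plug in your own prompt set:

```shell
# Try progressively weaker profiles; evaluate each before retraining anything.
for profile in standard relaxed original_model; do
  export PERSONAL_LLM_GUARDRAIL_PROFILE="$profile"
  echo "testing with profile: $PERSONAL_LLM_GUARDRAIL_PROFILE"
  # uv run personal-llm evaluate   # run your real prompt set here
done
```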
## Files to edit

- `config/guardrails.yaml`: choose or define profiles
- `config/domain_taxonomy.yaml`: add or remove blocked/allowed domains
- `prompts/system_professional_strict.md`: stricter top-level behavior
- `prompts/system_professional.md`: default behavior
- `prompts/system_professional_relaxed.md`: softer professional steering
- `prompts/system_original_model.md`: near-original model behavior
- `prompts/refusal_redirect.md`: default refusal wording
- `prompts/refusal_redirect_relaxed.md`: softer refusal wording
- `src/personal_llm/core/policy.py`: runtime allow/refuse logic
- `src/personal_llm/pipelines/dataset_builder.py`: training-set refusal inclusion
## Removing guardrails completely
That is possible, but I do not recommend it if your goal is a domain-specialized professional assistant.
If you still want the closest thing to that:
- use `original_model`
- disable or ignore any refusal-heavy adapter
- avoid disallowed-topic filtering in retrieval
- keep evaluation separate so you can compare behaviors clearly
The repository is designed for specialization, so fully removing every guardrail works against the main design goal.