05 Model Serving¶
API surface¶
The FastAPI service exposes:
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `GET /health`
- `GET /metrics`
- `POST /admin/reindex`
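The chat endpoint name suggests an OpenAI-compatible request schema. A minimal sketch of building such a request body, assuming that schema (the model name and field values here are placeholders, not the repository's actual defaults):

```python
import json

# Hypothetical payload for POST /v1/chat/completions, assuming the
# endpoint follows the common OpenAI-compatible chat schema.
payload = {
    "model": "default",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize my latest project."}
    ],
    "temperature": 0.2,
}

# Serialize to the JSON body an HTTP client would send.
body = json.dumps(payload).encode("utf-8")
# e.g. urllib.request.Request("http://localhost:8000/v1/chat/completions",
#                             data=body,
#                             headers={"Content-Type": "application/json"})
```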
Serving modes¶
- MLX backend for Apple Silicon local inference
- Ollama backend for managed local model lifecycle
- llama.cpp backend for GGUF-centric deployments
- Mock backend for offline smoke tests
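Because the four modes are interchangeable at the serving layer, they can be modeled behind one small interface. A minimal sketch, assuming a single `generate` method (the `Backend` protocol and `MockBackend` class are illustrative, not the repository's real types):

```python
from typing import Protocol


class Backend(Protocol):
    """Illustrative interface the serving modes could share."""

    def generate(self, prompt: str) -> str: ...


class MockBackend:
    """Offline smoke-test backend: returns a canned reply, no model needed."""

    def generate(self, prompt: str) -> str:
        return f"[mock] {prompt}"


def smoke_test(backend: Backend) -> str:
    # Any real backend (MLX, Ollama, llama.cpp) would slot in here unchanged.
    return backend.generate("ping")
```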
Plain-English differences¶
- MLX: runs the model directly through Apple-Silicon-native libraries. This is the default local path in this repo.
- Ollama: runs the model through a local server with a simple API. It is easier to operate if you prefer a model manager over direct runtime wiring.
- llama.cpp: useful when you want tight control over GGUF-based local inference.
If you are unsure, start with MLX. Move to Ollama later only if you want a cleaner local serving layer or easier model switching outside the Python process.
Backend selection recommendation¶
For first-time use, keep backend selection profile-driven:
- set the model profile
- let the repository choose the matching backend
- use the backend override only when you are intentionally testing a different runtime
Guardrail selection recommendation¶
For first-time use, keep:
PERSONAL_LLM_GUARDRAIL_PROFILE=standard
Then move to:
- `strict` if you want stronger professional-only refusal behavior
- `relaxed` if you want fewer refusals
- `original_model` if you want near-base-model behavior
Guardrail profiles change:
- which system prompt is used
- whether runtime policy refuses requests
- whether disallowed retrieval is filtered out automatically
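The three switches above can be pictured as per-profile flags. A minimal sketch, assuming config/guardrails.yaml maps each profile to a system prompt plus two policy booleans (the file paths and flag names here are assumptions for illustration):

```python
# Illustrative view of what a guardrail profile controls. Prompt paths and
# flag names are hypothetical; the real mapping lives in config/guardrails.yaml.
PROFILES = {
    "strict": {"refuse_off_topic": True, "filter_retrieval": True},
    "standard": {"refuse_off_topic": True, "filter_retrieval": True},
    "relaxed": {"refuse_off_topic": False, "filter_retrieval": True},
    "original_model": {"refuse_off_topic": False, "filter_retrieval": False},
}


def allows_request(profile: str, on_topic: bool) -> bool:
    """Would runtime policy let an off-topic request through?"""
    if on_topic:
        return True
    return not PROFILES[profile]["refuse_off_topic"]
```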
The actual prompt files are documented in prompts/README.md. They already exist for strict, standard, relaxed, and original_model, and are selected by config/guardrails.yaml.
For first-time tuning, keep this mental model:
- knowledge/persona.md defines how you want the assistant to sound
- prompts/*.md translate that into runtime behavior
- LoRA examples make the behavior more stable over time
Request flow¶
- Classify the user query.
- Refuse if the topic is outside the professional domain.
- Retrieve relevant chunks using the configured vector backend.
- Build the final prompt with system policy, retrieved context, and citations.
- Generate the answer through the selected backend.
- Validate the output policy before returning the response.
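The six steps above can be sketched as one pipeline function. Every collaborator here is a stand-in (the classifier, retriever, generator, and policy object are hypothetical), so this shows control flow only, not the repository's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    system_prompt: str

    def validate(self, text: str) -> bool:
        # Trivial output check for the sketch; a real policy would do more.
        return bool(text.strip())


def handle_request(
    query: str,
    classify: Callable[[str], str],    # step 1: topic classifier
    retrieve: Callable[[str], list[str]],  # step 3: vector retrieval
    generate: Callable[[str], str],    # step 5: selected backend
    policy: Policy,
) -> str:
    """Control-flow sketch of the request pipeline; all parts are stand-ins."""
    if classify(query) != "professional":        # steps 1-2: classify, refuse
        return "Refused: outside the professional domain."
    context = "\n".join(retrieve(query))         # step 3: retrieve chunks
    prompt = (                                   # step 4: build final prompt
        f"{policy.system_prompt}\nContext:\n{context}\nUser: {query}"
    )
    answer = generate(prompt)                    # step 5: generate
    if not policy.validate(answer):              # step 6: output policy
        return "Refused: response failed output policy."
    return answer
```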