VOICE_AND_AI — alaivOS Canonical

Last updated: April 13, 2026 (Omega v2.7)

Supersedes:

V1_1_AI_ROADMAP.md, V1_1_ROADMAP.md
OMEGA_V2_7_SESSION_HANDOVER.md, OMEGA_V2_6_SESSION_HANDOVER.md, OMEGA_V2_5_SESSION_HANDOVER.md
SPRINT_EPSILON_KOKORO_EVAL.md, EPSILON_CDN_RENAME.md
SPRINT_S1_SKILL_ROUTER.md, SPRINT_S2_OBSERVER_AGENT.md, SPRINT_S3_PLANNER_EXECUTOR.md, SPRINT_S4_PERSONALITY_UI.md
SPRINT_ALPHA_CHECKUP_PIPELINE.md, SPRINT_ALPHA_SKILL_AUDIT.md, SPRINT_ALPHA_PIPER_TTS.md
SPRINT_BETA2_ASK_LAIV.md, SPRINT_BETA_TTS_DEAD_CODE_SEVERANCE.md, SPRINT_VOICE_BETA.md, SPRINT_FIX_VOICE.md
SPRINT_GAMMA_LOCAL_MODEL_IMPL.md
sprint_A5_sovereign_tts_waveform_modes.md, SPRINT_B13_sovereign_voice_asr.md
sprint_A2_gguf_runtime_research.md, sprint_B10_gguf_runtime.md, gguf_runtime_research_report.md
sprint_A6_voice_translation.md, qwen_mistral_evaluation.md
sprint_B14_elevenlabs_tts.md (DEAD — historical traceability only)
tiered_ai_capability_spec.md, alaivOS_qwen3_tts_strategy.md, alaivOS_ai_model_assessment.md
alaivOS_phone_ai_server_hub_architecture.md
laiv_system_prompt_spec.md, alaivOS_AI_Prompt_Library_v2.md

Status overrides: ElevenLabs = DEAD (0 refs). Cloud Gemini = DEAD (enum + code removed). AiProvider enum = {local, ghost} only. Google Speech Services STT / Google WaveNet TTS = DEAD. Qwen 3.5 (NOT 2.5) is the canonical on-device family. Gemma 4 is server-only. Ghost model = Gemma 4 E4B. Voice pipeline is Kokoro-first (inverted). Per-skill pricing = DEAD.

Cross-reference: GHOST_PROTOCOL.md (credit economy, Ghost routing), PRODUCT_SCOPE.md (module inventory), INFRASTRUCTURE.md (CX43, Bishop, CDN).

1. MODEL LANDSCAPE — 7-TIER REGISTRY

1.1 Tier table (locked in TAW10 — Alpha-A model registry rewrite)

Tier	CDN filename	Model	Download	Loaded RAM	Min free RAM	Role
`on-device-xs`	`laiv-xs.gguf`	Qwen 3.5 0.8B Q4_K_M	989 MB	2.1 GB	2.5 GB	Practical everyday tier for most flagships
`on-device-s`	`laiv-s.gguf`	Qwen 3.5 2B Q4_K_M	2.55 GB	4.1 GB	4.5 GB	Gap-filler between xs and Gemma; critical for 4-5 GB-free devices (LatAm/India)
`on-device-m`	`laiv-m.gguf`	Qwen 3.5 4B Q4_K_M	3.16 GB	5.8 GB	6.2 GB	Full on-device Chief of Staff (native early-fusion multimodal)
`on-device-l`	`laiv-l.gguf`	Gemma 4 E2B Q4_K_M	6.67 GB	7.7 GB	8.0 GB	Reserved for 16+ GB tablets / future phones — practically server-only today
`on-device-xl`	`laiv-xl.gguf`	Gemma 4 E4B Q4_K_M	8.95 GB	10 GB	10.5 GB	Reserved for tablets / future — practically server-only today
`ghost-std`	`laiv-ghost.gguf`	Gemma 4 E4B Q4_K_M	8.95 GB	10 GB	— (server)	Ghost Brain on CX43, 12 tok/s

Filenames are tier labels, not model names. When a better model drops, swap the GGUF behind the same filename. App reads manifest.json v3 for SHA + sizes. Backward compat aliases exist (laiv-core-s.bin / laiv-core-sm.bin / laiv-core-m.bin → Qwen files).
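
A minimal sketch of the integrity check a client could run after downloading a tier file, in Python for illustration (the app itself is Flutter). The manifest.json v3 schema is not shown in this document, so the `size_bytes` and `sha256` field names below are assumptions.

```python
import hashlib

def verify_download(data: bytes, entry: dict) -> bool:
    """Return True only when both size and SHA-256 match the manifest entry."""
    # Cheap size check first, so truncated downloads fail before hashing.
    if len(data) != entry["size_bytes"]:
        return False
    return hashlib.sha256(data).hexdigest() == entry["sha256"]

# Stand-in payload; a real tier file would be the GGUF behind laiv-xs.gguf.
payload = b"not a real GGUF file"
entry = {
    "file": "laiv-xs.gguf",                          # tier label, not a model name
    "size_bytes": len(payload),                      # hypothetical field name
    "sha256": hashlib.sha256(payload).hexdigest(),   # hypothetical field name
}

print(verify_download(payload, entry))         # matching bytes
print(verify_download(payload + b"x", entry))  # corrupted or truncated bytes
```

Because the filename stays fixed across model swaps, a hash comparison like this is what tells the client that the GGUF behind the label has changed.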

1.2 Qwen 3.5 capabilities (locked — NOT Qwen 2.5)

All Qwen 3.5 sizes share:

- Unified vision (all sizes accept images; 4B is native early-fusion multimodal)
- Thinking mode (off by default on 0.8B and 2B; do not flip on without explicit need, because it explodes latency and hides output)
- 262K context window
- 201 languages (preserves multilingual parity; critical for 21-locale support)
- Hermes-style tool calling (ChatML turn format; dual prompt templates live in app)

1.3 Ghost server model (locked — Epsilon TAW9 speed tests)

Gemma 4 E4B @ 12.02 tok/s on CX43 CPU-only, 10 GB loaded RAM.
Gemma 4 E2B @ 21.81 tok/s measured but unused on server (E4B quality preferred for Ghost).
Qwen 3.5 9B preserved in Ollama as fallback only. The original 0.2 tok/s measurement was taken with thinking mode misconfigured; with thinking off it runs at 5.58 tok/s, which is usable but roughly 2× slower than Gemma 4 E4B (12.02 tok/s).
Function calling verified perfect in EN, ES, PT via Ollama native tool_calls. Example: log_expense(amount=50, description="tacos") returns a proper tool_calls array.
Native audio input, native vision, native function calling in a single model.

1.4 Why Gemma 4 is server-only

E2B at 7.7 GB loaded and E4B at 10 GB loaded will not fit any shipping phone. Both are marketed as "edge" models but the loaded-RAM envelope rules phones out. On-device stays Qwen 3.5 exclusively. on-device-l / on-device-xl exist on the CDN and in the tier ladder for future 16+ GB devices and tablets.

1.5 Real-world AMI cascade (J's Pixel 7 Pro reality check)

12 GB total RAM → 3.6 GB free typical. Most flagship users will run Qwen 0.8B. Ghost is the real AI upgrade path.

Device total	Typical free	Best model	Experience
3-4 GB	1-2 GB	None (Flutter smart only)	Data-driven, no model
4 GB	2-3 GB	Qwen 0.8B (2.1 GB)	Basic skills + vision
6-8 GB	3-4 GB	Qwen 0.8B	Same — 2B at 4.1 GB too tight
8-12 GB	3.5-5 GB	Qwen 0.8B or 2B	2B only if apps closed
12-16 GB	5-7 GB	Qwen 2B or 4B	4B (5.8 GB) tight on 12 GB
16+ GB	7-9 GB	Qwen 4B	Best on-device experience

2. AMI — ADAPTIVE MODEL INTELLIGENCE

2.1 Cardinal rules

ONE model loaded at a time, NEVER TWO. No model co-residency, no preloading of a bigger model while a smaller one serves.
No always-resident model. J explicitly rejected "0.8B always loaded." Loading and unloading are dynamic on app lifecycle.
App backgrounded → model unloaded. Zero RAM footprint, zero battery, zero heat while backgrounded.
App foregrounded → AMI checks freeRamMb → picks best tier → loads during natural navigation time (2-8 s hidden behind splash/home rendering). User reaches Laiv with the model already warm.
Users never pick models. AI Engine screen shows dots (●●●○○) indicating the running tier; no file picker. "Manage AI brains" is a small link for power users.

2.2 Triggers (no polling tax)

Android onTrimMemory: TRIM_MEMORY_RUNNING_MODERATE (prepare) → TRIM_MEMORY_RUNNING_LOW (downgrade) → TRIM_MEMORY_RUNNING_CRITICAL (unload all).
iOS didReceiveMemoryWarning — only one level, so iOS unload threshold is more aggressive (modelSize × 1.5 vs Android's 1.2).
Before-inference lazy check — only check RAM when the user actually asks Laiv something.
60-second health tick (only active when a model is loaded): reads /proc/meminfo on Android (takes microseconds, negligible battery impact) and os_proc_available_memory() on iOS. Unloads proactively when free memory falls below the platform threshold.
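
The health tick can be sketched as below, in Python for illustration (the app itself is Flutter). The 1.2 and 1.5 multipliers come from this document; treating modelSize as the loaded-RAM figure from the tier table, and the function names, are assumptions.

```python
# Unload multipliers from this section: iOS is more aggressive than Android.
UNLOAD_FACTOR = {"android": 1.2, "ios": 1.5}

def mem_available_mb(meminfo_text: str) -> int:
    """Parse the MemAvailable line of /proc/meminfo (the kernel reports kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")

def should_unload(platform: str, free_mb: int, model_mb: int) -> bool:
    """True when free memory has dropped below modelSize x platform factor."""
    return free_mb < model_mb * UNLOAD_FACTOR[platform]

# Sample /proc/meminfo excerpt; the numbers are made up for the example.
sample = "MemTotal:       11800000 kB\nMemAvailable:    2867200 kB\n"
free_mb = mem_available_mb(sample)

# Assumes modelSize = loaded RAM of the 0.8B tier (about 2100 MB).
print(free_mb, should_unload("android", free_mb, 2100), should_unload("ios", free_mb, 2100))
```

With the same free memory, the iOS factor triggers an unload where the Android factor does not, which is the intended asymmetry given that iOS offers only one warning level.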

2.3 Decision flow

User asks Laiv something
  → Is a model loaded?
    YES → Is free RAM > headroom (500 MB)?
      YES → Use loaded model
      NO  → Unload, load 0.8B fallback, set "downgraded" flag
    NO  → Check free RAM
      > 5.8 GB → Load 4B (if on disk)
      > 4.1 GB → Load 2B (if on disk)
      > 2.1 GB → Load 0.8B
      ≤ 2.1 GB → Smart Flutter Response (no model) or Ghost (if subscribed)
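
The flow above can be sketched as a single function, in Python for illustration (the app itself is Flutter). Thresholds are the ones in the flow; the function name, return shape, and the choice to fall through to the next smaller tier when a file is not on disk are assumptions.

```python
from typing import Optional

HEADROOM_MB = 500
# (tier, free-RAM threshold in MB) taken from the decision flow, largest first.
LADDER = [("4B", 5800), ("2B", 4100), ("0.8B", 2100)]

def pick_model(free_mb: int, on_disk: set, loaded: Optional[str] = None,
               ghost_subscribed: bool = False) -> tuple:
    """Return (action, target) for one user request."""
    if loaded is not None:
        if free_mb > HEADROOM_MB:
            return ("use", loaded)
        # Below headroom: unload and fall back to 0.8B; the caller sets the "downgraded" flag.
        return ("downgrade", "0.8B")
    for tier, threshold in LADDER:
        # Assumption: a tier that is not on disk is skipped in favor of the next smaller one.
        if free_mb > threshold and tier in on_disk:
            return ("load", tier)
    # No tier fits: Ghost if subscribed, otherwise Smart Flutter Response.
    return ("ghost", None) if ghost_subscribed else ("smart_flutter", None)

print(pick_model(6000, {"0.8B", "2B", "4B"}))
print(pick_model(3600, {"0.8B", "2B"}))
print(pick_model(1500, {"0.8B"}, ghost_subscribed=True))
```

The 3.6 GB case mirrors the Pixel 7 Pro reality check in 1.5: 2B is on disk, but only 0.8B clears its threshold.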

2.4 Downgrade UX

Not a popup. A dismissible glass card inside Laiv, shown once:

"Laiv adapted — your phone is busy with other apps right now. Full power returns automatically." [Switch back now] [Got it]

2.5 Load times (from flash)

Tier	Load time	Perception
0.8B	1-2 s	Instant
2B	3-5 s	Brief pause
4B	6-10 s	Shows loading indicator

2.6 Download chain (onboarding)

Every user gets laiv-xs (Qwen 0.8B, 989 MB) first, so "Laiv is ready!" appears after about 30 s on Wi-Fi. The optimal tier for the device then downloads in the background. The stepping-stone tier downloads last as insurance. Dots upgrade silently when the better model finishes.

2.7 iOS-specific hardening

Use os_proc_available_memory() proactively.
Unload when available drops below modelSize × 1.5 (more aggressive than Android).
Always unload on app backgrounding to be a good citizen; reload on foreground.

3. SMART FLUTTER RESPONSE SYSTEM (bridges model-load gap)

Built by Beta in TAW10. Files: smart_flutter_response.dart, laiv_message_queue.dart.

10 notification-action handlers: budget exceeded, project due, free time, bill due, reconnect contact, exercise nudge, plan your day, cooking nudge, birthday, generic.
Tier-gated: Starter = raw numbers · Spark+ = category breakdown · Core+ = follow-up suggestion.
Personalized with the user's first name from SharedPreferences.
Message queue captures user input while the model loads, so messages are never lost. When the model is ready it processes the queued messages: a response that adds value beyond the Flutter response is appended as a follow-up, and a redundant one is skipped silently.
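
A minimal sketch of the queue behavior described above, in Python for illustration (the real implementation lives in laiv_message_queue.dart). The class shape and the `adds_value` predicate are assumptions; this document does not say how redundancy is judged.

```python
from collections import deque

class LaivMessageQueue:
    """Holds user messages (and the instant Flutter reply) until the model is warm."""

    def __init__(self):
        self._pending = deque()

    def capture(self, user_text: str, flutter_reply: str) -> None:
        self._pending.append((user_text, flutter_reply))

    def drain(self, model_reply_for, adds_value) -> list:
        """Process queued messages in order; keep only model replies that add value."""
        follow_ups = []
        while self._pending:
            user_text, flutter_reply = self._pending.popleft()
            model_reply = model_reply_for(user_text)
            if adds_value(model_reply, flutter_reply):
                follow_ups.append(model_reply)  # appended as a follow-up
            # Redundant replies are dropped silently.
        return follow_ups

queue = LaivMessageQueue()
queue.capture("am I over budget?", "You're $450 over budget.")
queue.capture("hi", "Hi José!")

# Stub model: a lookup table stands in for real inference.
stub_replies = {"am I over budget?": "Dining is the main driver at $680.", "hi": "Hi José!"}
follow_ups = queue.drain(stub_replies.get, lambda model, flutter: model != flutter)
print(follow_ups)
```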

3.1 Notification tap = instant Laiv, no OmniOrb needed

Notification taps route directly to the relevant screen + Laiv immediately greets with a data-driven response. Example: "Budget exceeded" → Money screen → Laiv says "José, you're $450 over budget. Dining hit $680." — all from SQLite, instant. The model loads in the 3-5 s the user spends reading; if they reply, real AI handles the follow-up.

4. RUNTIME — llama_cpp_dart + sherpa_onnx

4.1 Service name mappings (abstract → real)

Abstract	Real implementation
`LocalModelService`	`LocalInferenceService` + `LlamaRuntime` (llama_cpp_dart FFI)
`AdaptiveModelManager` / AMI	Dynamic tier selection by `freeRamMb`, ONE model at a time, lifecycle-driven load/unload
TTS	`SovereignTtsService` (sherpa_onnx Piper ONNX), fallback `CortexVoiceService` (platform TTS)
Voice nav	`NavVoiceService` (audio focus + timing) + `InstructionEnricher` (OSRM steps → natural language × 21 locales)
Navigation	`NavigationService` (state machine: idle → navigating → rerouting → arrived)

4.2 Prompt templates

Dual templates live in the app, selected per loaded model:

- Qwen 3.5 → ChatML turns (<|im_start|>system ... <|im_end|>).
- Gemma 4 → Turn-based (<start_of_turn>user ... <end_of_turn>).
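
A sketch of per-model template selection, in Python for illustration (the app itself is Flutter). The turn markers are the ones named above; folding the system text into the first Gemma user turn, and the function names, are assumptions.

```python
def render_chatml(system: str, user: str) -> str:
    """ChatML turns, used for the Qwen 3.5 family."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def render_gemma(system: str, user: str) -> str:
    """Gemma-style turns. Assumption: system text is folded into the first user turn."""
    return (
        f"<start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

RENDERERS = {"qwen": render_chatml, "gemma": render_gemma}

def render_prompt(model_family: str, system: str, user: str) -> str:
    """Pick the template that matches the currently loaded model."""
    return RENDERERS[model_family](system, user)

print(render_prompt("qwen", "You are Laiv.", "Log 50 for tacos"))
print(render_prompt("gemma", "You are Laiv.", "Log 50 for tacos"))
```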

4.3 Ghost native function calling

17 tool definitions registered for Gemma 4 E4B on CX43. Ollama native tool_calls array is consumed directly — no regex parsing. Verified EN/ES/PT.
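
A sketch of consuming the tool_calls array without regex parsing, in Python for illustration. The response shape follows Ollama's chat API as I understand it (each call carries a function name and an arguments object); the handler registry and `dispatch_tool_calls` are hypothetical, and only one of the 17 tools is registered here.

```python
def log_expense(amount, description):
    """Stand-in handler; a real one would write to the Money module."""
    return f"logged {amount} for {description}"

# Hypothetical registry: tool name -> handler.
TOOLS = {"log_expense": log_expense}

def dispatch_tool_calls(response: dict) -> list:
    """Run every tool call in a chat response and collect the results."""
    results = []
    for call in response.get("message", {}).get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            continue  # unknown tool name: skip rather than guess
        results.append(handler(**fn["arguments"]))
    return results

# Example response for "log 50 for tacos", matching the example in section 1.3.
sample_response = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "log_expense",
                          "arguments": {"amount": 50, "description": "tacos"}}}
        ],
    }
}
print(dispatch_tool_calls(sample_response))
```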

5. MULTI-AGENT ARCHITECTURE — v1.0 (BUILT)

Shipped in sprints S1-S4 (Omega v2.5). Brain Distillation stays v1.1.

5.1 SkillRouter (S1, Alpha + Beta-1)

lib/core/ai/skill_router.dart — pluggable registry; two-phase match (keyword pre-filter → scored confidence).
Shrunk LaivAgent from 882 → 294 lines.
17 built-in skills (per Omega v2.7; raised from the original 12 during TAW device-write audit): log_expense, create_event, find_place, log_meal, check_budget, plan_trip, host_event, suggest_activity, draft_message, navigate, web_lookup, general_chat, plus 5 added during TAW/v2.7 for real writes on meds/sessions/sports/AQ/reconnect flows.
Multilingual trigger keywords EN/ES/PT/FR/DE per skill (pre-filter only; the model does the real understanding).
LaivContext assembled by LaivContextAssembler from Riverpod providers — single source of truth for all skills (user name, cluster, management style, tier, location, module state snapshot).
Unverified claim (v2.7): "Laiv skills execute real writes (17/17)" — device test pending.
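
The two-phase match can be sketched as follows, in Python for illustration (the real router is lib/core/ai/skill_router.dart). The keyword sets, the scoring formula, and the 0.2 cutoff are all assumptions; the document only states that a keyword pre-filter is followed by a scored confidence.

```python
# Illustrative trigger keywords; the real lists cover EN/ES/PT/FR/DE per skill.
SKILL_KEYWORDS = {
    "log_expense": {"paid", "spent", "gasté", "gastei"},
    "log_meal": {"ate", "breakfast", "comí", "comi"},
}
MIN_CONFIDENCE = 0.2  # hypothetical cutoff

def route(text: str) -> str:
    words = set(text.lower().split())
    # Phase 1: keyword pre-filter keeps only skills with at least one trigger hit.
    candidates = [s for s, kws in SKILL_KEYWORDS.items() if words & kws]
    # Phase 2: score survivors. Hypothetical score: share of words that are triggers.
    scored = {s: len(words & SKILL_KEYWORDS[s]) / len(words) for s in candidates}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] < MIN_CONFIDENCE:
        return "general_chat"  # nothing matched confidently
    return best

print(route("paid 12 for tacos"))
print(route("gasté 50 en tacos"))
print(route("hello there"))
```

The pre-filter only narrows the candidate set cheaply; as noted above, the model does the real understanding.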

5.2 ObserverAgent (S2, Beta-2) — READ-ONLY

lib/core/ai/observer/observer_agent.dart — never modifies data; only writes to the observations SQLite table.
11 pattern rules: spending_spike, sleep_exercise, calendar_crunch, weather_spending, relationship_drift, habit_streak, commute_anomaly, meal_pattern, project_deadline, recurring_expense, AQ (rule #11, added in Omega v2.7).
Runs on app_open / morning / evening. Dedupes within 48 h. Confidence threshold 0.6. Failing rules are caught and skipped — never crash the observer.
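
A sketch of one observer pass with the three guards named above (confidence threshold 0.6, 48 h dedupe, failing rules skipped), in Python for illustration. Rule functions returning a bare confidence, and treating exactly 0.6 as passing, are assumptions.

```python
from datetime import datetime, timedelta

CONFIDENCE_THRESHOLD = 0.6
DEDUPE_WINDOW = timedelta(hours=48)

def run_observer(rules: dict, snapshot: dict, recent: dict, now: datetime) -> list:
    """rules: name -> fn(snapshot) -> confidence; recent: name -> last emission time."""
    emitted = []
    for name, rule in rules.items():
        try:
            confidence = rule(snapshot)
        except Exception:
            continue  # a failing rule is skipped and never crashes the observer
        if confidence < CONFIDENCE_THRESHOLD:
            continue
        last = recent.get(name)
        if last is not None and now - last < DEDUPE_WINDOW:
            continue  # same observation already emitted within 48 h
        emitted.append((name, confidence))
    return emitted

now = datetime(2026, 4, 13, 8, 0)
rules = {
    "spending_spike": lambda s: 0.8,
    "habit_streak": lambda s: 0.4,               # below threshold
    "meal_pattern": lambda s: s["missing_key"],  # raises KeyError
    "calendar_crunch": lambda s: 0.9,            # deduped below
}
recent = {"calendar_crunch": now - timedelta(hours=10)}
print(run_observer(rules, {}, recent, now))
```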

5.3 PlannerAgent + ExecutorAgent (S3, Alpha + Beta-1)

Planner: reads pending observations → produces ActionPlans. Max 5 plans per session. 6 action types: review_budget, draft_message, reschedule_event, create_reminder, adjust_departure, log_meal_suggestion.
Executor: NEVER writes to SQLite silently. Every autonomous action is rendered as an ActionPlanCard; SQLite write happens only on explicit user confirmation. 8 module actions supported; partial-failure path triggers rollback. Exactly 1 call site app-wide.
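
The confirm-then-write rule with rollback on partial failure can be sketched with a SQLite transaction, in Python for illustration. The `reminders` table and `execute_plan` are hypothetical; the real Executor supports 8 module actions and renders an ActionPlanCard before any write.

```python
import sqlite3

def execute_plan(conn: sqlite3.Connection, statements: list, confirmed: bool) -> str:
    """Apply all statements atomically, and only after explicit user confirmation."""
    if not confirmed:
        return "awaiting_confirmation"  # nothing is written silently
    try:
        with conn:  # commits on success, rolls back if any statement raises
            for sql, params in statements:
                conn.execute(sql, params)
    except sqlite3.Error:
        return "rolled_back"
    return "applied"

def reminder_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM reminders").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reminders (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.commit()

good = ("INSERT INTO reminders (title) VALUES (?)", ("Review budget",))
bad = ("INSERT INTO reminders (title) VALUES (?)", (None,))  # violates NOT NULL

print(execute_plan(conn, [good, bad], confirmed=False), reminder_count(conn))
print(execute_plan(conn, [good, bad], confirmed=True), reminder_count(conn))
print(execute_plan(conn, [good], confirmed=True), reminder_count(conn))
```

In the second call the first insert succeeds and the second fails, so the whole plan is rolled back and the row count stays at zero.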

5.4 PersonalitySettings (S4, Gamma)

SharedPreferences keys: warmth / verbosity / directness / humor (all 0-100) + preset.
5 presets: Coach (default, matches Chief of Staff), Friend, Assistant, Mentor, Custom.
4 sliders with preview text; selecting a preset snaps sliders; adjusting a slider flips preset to Custom.
toStyleDirective() produces a natural-language block injected into PromptAssembler Layer 1 alongside the active persona from ai_personas table.
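
A sketch of the preset and slider interplay, in Python for illustration (the real class is Dart and persists to SharedPreferences). The preset slider values, the low/medium/high bucketing, and the directive wording are placeholders, not the shipped values.

```python
from dataclasses import dataclass

SLIDERS = ("warmth", "verbosity", "directness", "humor")
# Placeholder values; the real preset values are not given in this document.
PRESETS = {"Coach": (60, 40, 80, 30), "Friend": (90, 60, 40, 70)}

@dataclass
class PersonalitySettings:
    warmth: int = 60
    verbosity: int = 40
    directness: int = 80
    humor: int = 30
    preset: str = "Coach"

    def apply_preset(self, name: str) -> None:
        """Selecting a preset snaps all four sliders."""
        self.warmth, self.verbosity, self.directness, self.humor = PRESETS[name]
        self.preset = name

    def set_slider(self, key: str, value: int) -> None:
        """Adjusting any slider clamps to 0-100 and flips the preset to Custom."""
        if key not in SLIDERS:
            raise KeyError(key)
        setattr(self, key, max(0, min(100, value)))
        self.preset = "Custom"

    def to_style_directive(self) -> str:
        """Hypothetical rendering of the natural-language block for Layer 1."""
        def level(v: int) -> str:
            return "low" if v < 34 else "medium" if v < 67 else "high"
        parts = ", ".join(f"{k} {level(getattr(self, k))}" for k in SLIDERS)
        return f"Style: {parts}."

settings = PersonalitySettings()
settings.apply_preset("Friend")
settings.set_slider("humor", 150)
print(settings.preset, settings.to_style_directive())
```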

6. PROMPTASSEMBLER — 5 LAYERS

Total prompt budget ~2,000 tokens (0.8B) to ~4,000 tokens (Ghost). Every word counts.

Layer	Content	Size
L1 — PERSONA (static)	Who Laiv IS + active persona (from `ai_personas`) + PersonalitySettings style directive	~300 tok
L2 — USER (dynamic)	Name, cluster (if confidence ≥ 0.65), cluster behavioral note, management style directive, focus areas, roadblock	~200 tok
L3 — STATE (real-time)	Today's events, budget status, weather, allergens, AQ context (v2.7), health snapshot	~300 tok
L4 — MODULE (contextual)	Where the user is in the app, module-specific state	~200 tok
L5 — ACTIONS (static)	Tool definitions the model can call	~500 tok
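
A quick arithmetic check of the table against the stated budgets, in Python for illustration. It assumes the ~2,000 and ~4,000 token budgets cover the five layers plus the conversation itself, which this document does not state explicitly.

```python
# Approximate layer sizes from the table above, in tokens.
LAYER_TOKENS = {"L1": 300, "L2": 200, "L3": 300, "L4": 200, "L5": 500}
# Total prompt budgets from this section.
BUDGETS = {"0.8B": 2000, "Ghost": 4000}

fixed = sum(LAYER_TOKENS.values())
# Assumption: whatever the layers do not use is left for conversation turns.
remaining = {target: budget - fixed for target, budget in BUDGETS.items()}

print(fixed)
print(remaining)
```

Under that assumption the layers take about 1,500 tokens, leaving roughly 500 tokens of conversation on 0.8B and 2,500 on Ghost.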

6.1 Layer 1 persona baseline ("Chief of Staff")

Use contractions always. Default to 1-2 sentences. Expand only when the topic demands it.
Never sycophantic ("Great question!"), never corporate (leverage, synergize, optimize).
"I run entirely on the user's device. Their data never leaves their phone." Privacy stated once if asked, then move on.
Vocabulary: "fine-tune" not "calibrate"; "set up" not "configure"; "your life" not "your data"; "I'll handle it" not "I'll process".

6.2 Cluster-specific L2 notes

13 cluster behavioral notes (juggler, warrior, professional, sovereign, student, creative, hustler, elder, chef, healer, tracker, scrapper, optimizer) — one-line directive each.

6.3 Management style directives

gentle — warm, supportive, nudge. "Would you like me to..."
strict — direct, accountable. "You missed yesterday's run. Let's not make it two."
dashboard — minimal, reactive only. No proactive suggestions.

7. VOICE PIPELINE — KOKORO-FIRST (INVERTED)

7.1 Decision rationale

Users never heard the original ElevenLabs reference voice. Kokoro's approximation of it IS the first voice users hear. Piper is trained to match Kokoro (not ElevenLabs), so quality degrades gracefully. Voxtral clones Kokoro output for Ghost HD — same person, frontier quality across the chain.

7.2 The pipeline

ElevenLabs custom voice (existing reference audio, NEVER shipped to users)
    ↓  Bishop extracts StyleTTS 2 style vector
Kokoro 82M .pt (~500 KB) — "Laiv voice" canonical reference
    ↓  Kokoro generates reference corpus (500-1000 sentences EN/ES/PT)
    ├── Fine-tune Piper VITS → ONNX on-device  (one per language, same speaker)
    └── Feed as reference clip to Voxtral 3B → zero-shot embedding for Ghost HD (v1.1+)

7.3 Kokoro 82M (Ghost TTS, v1.0)

StyleTTS 2 architecture, 82M params, Apache 2.0.
54 voices (11 female), 8 languages, CPU-only, sherpa_onnx-supported.
Style vectors stored as .pt files (~500 KB), NOT full model weights.
Zero-shot voice extraction from 15-20 s reference audio is approximate — that's fine because Kokoro IS the reference, not a clone target.
Status: Epsilon eval sprint ready (SPRINT_EPSILON_KOKORO_EVAL.md). Pending final voice selection — J listens to samples of all 11 female voices.

7.4 Piper (on-device TTS)

v1.0 on-device: en_US-hfc_female-medium (63 MB, Piper ONNX, bundled in APK) via sherpa_onnx. EN-accented in other languages but functional.
v1.0 / v1.0.1 on-device (once Bishop is ready): per-language Piper models fine-tuned from the Kokoro corpus, so every language sounds like the same speaker.
Fine-tuning data: 80-150 samples (5-15 min audio) per language. ~20 min on GPU or ~1-2 h on CPU (Bishop Ryzen AI 9 HX 370 + 64 GB DDR5).
NO pre-built Piper ES/PT downloads — different speakers would break voice continuity.
Target: v1.0 if Bishop provisioned in time, otherwise v1.0.1.

7.5 Voxtral 3B (Ghost HD, v1.1+)

Zero-shot embedding from Kokoro reference → frontier-quality multilingual Ghost HD voice. Trained on Bishop. 9 languages.

7.6 Voice stack (final)

Tier	Engine	Voice source	Quality	Where
v1.0 on-device	Piper	Generic `hfc_female` (bundled)	Good, EN accent in other langs	Phone
v1.0 Ghost	Kokoro 82M	Best female voice (pending J selection)	Near-natural, multilingual	CX43
v1.0 / v1.0.1 on-device	Piper	Fine-tuned from Kokoro corpus	Good, Laiv voice, per-language	Phone
v1.1+ Ghost HD	Voxtral 3B	Cloned from Kokoro output	Frontier, 9 languages	Bishop training → CX43 inference

Voice navigation uses the same Piper engine as Laiv TTS. NavVoiceService handles audio focus and timing; InstructionEnricher converts OSRM steps into natural language across 21 locales.

8. LAIV CHECKUP (v1.0 feature — Omega v2.7)

Overnight batch analysis of anonymized user data via Claude Sonnet 4.6 (Anthropic Batch API). Baked into subscription tiers — NOT Ghost credit-gated.

8.1 Three domains

Wellbeing · Planning · Financial Health.

8.2 Trial schedule

Day 0 → Day 1 morning: Baseline checkup (Planning only — built from onboarding data). Delivered in the user's very first Morning Briefing. Sets the hook immediately.
Day 14: Mid-trial (Planning updated + Wellbeing baseline). Delivered alongside Elite trial unlock.
Day 28: Full checkup (all 3 domains) — FREE regardless of subscription status. Even if the user drops to Starter, they get it. The report IS the conversion pitch, not a paywall.

8.3 Tier cadence (post-trial)

One checkup every: Elite 1 mo · Pro 2 mo · Core 3 mo · Spark 6 mo. Starter gets none.

8.4 Dual anonymization pipeline

Device:    PII-stripping collectors (zero PII in payload — aggregates, day-of-week only)
    ↓ HTTPS
CX43:      Gemma 4 E4B anonymizer (second pass) → Anthropic Batch API as Citerius Holdings LLC
    ↓
Result:    SQLite row retained locally forever (~$0.012 per checkup)

Collectors live in lib/core/services/checkup_collectors.dart (one per domain). Relay on CX43:8100. Service orchestrator: lib/core/services/checkup_service.dart.
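
A sketch of what a device-side PII-stripping collector could look like for the Financial Health domain, in Python for illustration (the real collectors are Dart). The input row shape and payload field names are assumptions; the rule it demonstrates, aggregates and day-of-week only, is from the pipeline above.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def collect_financial(transactions: list) -> dict:
    """Reduce raw rows to per-weekday totals; merchants, notes and exact dates are dropped."""
    totals = defaultdict(float)
    for tx in transactions:
        weekday = WEEKDAYS[date.fromisoformat(tx["date"]).weekday()]
        totals[weekday] += tx["amount"]
    # Only aggregates leave the device; no free-text field is copied into the payload.
    return {"domain": "financial_health", "spend_by_weekday": dict(totals)}

# Hypothetical local rows; names and merchants never reach the payload.
rows = [
    {"date": "2026-04-13", "amount": 20.0, "merchant": "Taqueria El Sol"},
    {"date": "2026-04-13", "amount": 10.0, "merchant": "Metro"},
    {"date": "2026-04-14", "amount": 12.5, "merchant": "Farmacia"},
]
payload = collect_financial(rows)
print(payload)
```

The CX43 anonymizer pass then acts as a second line of defense on a payload that should already contain no PII.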

9. LAIV BRAIN DISTILLATION — v1.1+ (NOT v1.0)

Needs real user data. Launch with vanilla Qwen 3.5 (stable, proven), fine-tune after 4-6 weeks of real usage.

Target: custom fine-tunes of Qwen 3.5 0.8B / 2B / 4B on alaivOS-specific training data.
Data: ~3,500 curated examples across 10 categories (brain dump, command routing, receipt OCR, meal → nutrition, daily conversations, financial patterns, travel, proactive suggestions, multilingual, error recovery). All include <think> traces.
Training: Unsloth + HuggingFace TRL · SFT response-only loss · 1 epoch (avoids catastrophic forgetting) · Bishop (CPU) or 1× A100 80 GB rental.
Cost: ~$475-710 total (Claude Opus data generation dominates).
Timeline: 4-6 weeks post-launch.
Custom eval suite: brain dump accuracy, receipt F1, module routing accuracy, meal estimation ±20% of USDA, response length, multilingual parity, voice/tone (J approval on 50 samples), hallucination rate.
Deployment: CDN model swap; manifest.json version bump; users wake up to smarter Laiv with zero action.
Mistral Small 4 evaluation: deferred to v1.1+ (needs GEX44 revenue justification).

10. DEAD / REMOVED

Item	Status	Notes
Cloud Gemini (AiProvider.gemini/openai/claude)	DEAD	Enum reduced to `{local, ghost}`; code removed
ElevenLabs shipped voice	DEAD	72 refs → 0. Custom EL reference stays as Bishop training input only, never shipped
Google Speech Services STT / Google WaveNet TTS	DEAD	Replaced by on-device pipeline
Per-skill Ghost pricing	DEAD	Credits are the ONLY gate (see `GHOST_PROTOCOL.md`)
Qwen 3.5 9B on-device	DEAD	Ghost-only; CDN file retained for 16+ GB future phones
"0.8B always loaded"	DEAD	AMI dynamic load/unload only
Pre-built Piper ES/PT downloads	DEAD	Different speaker — breaks voice continuity
Qwen3-TTS self-host (Mar 2026 strategy doc)	Superseded	Replaced by Kokoro-first pipeline

11. INFRASTRUCTURE TIE-IN

See INFRASTRUCTURE.md for full detail.

CX43 (€17/mo): Ollama (Gemma 4 E4B) on :11434, ghost-router, sports-cache :8300, checkup-relay :8100, nginx :443, coturn :3478/5349. Gemma 4 E4B at 12 tok/s keeps Ghost viable on current hardware — no downgrade to CX23.
Bishop (mini PC, not a GPU server): AMD Ryzen AI 9 HX 370, 64 GB DDR5, Radeon 890M iGPU + XDNA 2 NPU (50 TOPS), no discrete GPU, no CUDA. Role: one-time training (voice pipeline + Brain Distillation) + personal SmartLab. Training on CPU — slower but sufficient for one-time jobs.
CDN (cdn.alaivos.com/models/, Cloudflare R2): 7 tier files + manifest.json v3.
Scaling triggers: >500 Ghost subs → second node or CX43 upgrade; >1000 Ghost subs → dedicated GPU (GEX44 €184/mo).

12. KEY METRICS (April 13, 2026)

Metric	Value
AiProvider enum values	2 (`local`, `ghost`)
On-device tiers	5 (xs/s/m/l/xl) — practical ceiling is `m` today
Ghost tier	1 (`ghost-std`)
Ghost tok/s	12.02 (Gemma 4 E4B, CPU-only)
Function-calling languages verified	EN / ES / PT
SkillRouter skills	17
Observer rules	11 (incl. AQ #11)
Planner action types	6
Executor module actions	8
PersonalitySettings presets	5 (Coach default)
PromptAssembler layers	5
PromptAssembler cluster notes	13
Checkup domains	3
Kokoro female voices shortlisted	11
Piper fine-tune samples needed	80-150 per language
CDN model files	7 + manifest.json

This document is the single source of truth for alaivOS AI + Voice architecture. When it contradicts a sprint brief, this wins. Cross-reference: GHOST_PROTOCOL.md, PRODUCT_SCOPE.md, INFRASTRUCTURE.md.