DATA_PIPELINE.md — alaivOS Proprietary Data Pipeline

Last updated: April 13, 2026 (Omega v2.7)

Status: CANONICAL. Single source of truth for traffic, POI, weather, AQ, autocomplete, sports, and Checkup collection.

Owner: Epsilon (infrastructure / server).

Supersedes: PROPRIETARY_DATA_PIPELINE.md v2.1, SPRINT_EPSILON_PIPELINE.md, SPRINT_EPSILON_PIPELINE_FULL.md, traffic_intelligence_deployment_plan.md, traffic_deployment_guide.md, epsilon_traffic_deployment.md, EPSILON_3SERVER_WEATHER.md, EPSILON_ENRICHMENT_ENGINE.md, EPSILON_LOAD_TOMTOM_KEYS.md, SPRINT_EPSILON_TRAFFIC_DEPLOYMENT.md, SPRINT_EPSILON_PHOTON.md, SPRINT_EPSILON_NEXT.md, SPRINT_POI_MEGA_HARVEST.md, city_tiers.md, city_data_prioritization.md, motorcycle_traffic_calibration.md, airport_transit_data_pipeline.md, holiday_school_calendars.md, map_routing_traffic_strategy_LOCKED.md, alaivOS_map_routing_traffic_strategy.md, EPSILON_GOOGLE_CREDIT_PLAN.md (obsolete).

Cross-refs: INFRASTRUCTURE.md (server hardware), GHOST_PROTOCOL.md (Ghost brain, Kokoro, Checkup credits), MAP_MODULE_SPEC.md (5-view map, traffic chip UI, search stack), ANTI_ABUSE_SPEC.md, PRICING_AND_TIERS.md (feature gates).


1. EXECUTIVE SUMMARY

alaivOS runs a privacy-first, zero-paid-API data pipeline across three Hetzner servers in Helsinki, producing:

  1. Traffic Pattern Intelligence — 5-layer composite ETA (baseline_spline × live_calibration × weather × calendar × event) for 290 priority cities at Full tier, with 1,178 expansion cities queued for the week-5 flood (April 29, 2026).
  2. POI Cache — Overpass/OSM backbone (263+ cities collected) + nightly DDG enrichment harvest (~43,200 queries/day).
  3. Autocomplete + Geocoding — Photon Cloudflare Worker proxy at photon.alaivos.com + Nominatim (free, 1 req/s).
  4. Weather + Air Quality — Open-Meteo weather + AQI pulled on every 30-min snapshot; CDN-pushed per city; app reads from CDN (offline-capable, ≤30 min stale). RainViewer radar tiles are the sole live-net dependency.
  5. Sports Cache — 31 leagues via ESPN + TheSportsDB + Jolpica + boxing scraper, served from ghost-01:8300.
  6. Checkup Relay — dual-anonymized tunnel to Anthropic Batch API at ghost-01:8100.
  7. Voice (Ghost HD) — Kokoro 82M TTS hosted on ghost-01.

Monthly infra cost: €25 (ghost-01 €17 + cx23 €4 + cx23-b €4). Zero paid API dependencies in the POI/search/traffic/weather/AQ stack. Outside that stack, the recurring paid items are the TheSportsDB Patreon ($3/mo) and the metered Anthropic API (~$0.012 per checkup). The one-off Google Places credit ($5,154 MXN ≈ $300 USD, expires June 5, 2026) is used exclusively as an in-app real-time luxury layer behind a dart-define kill-switch, never in the pipeline.


2. SERVER TOPOLOGY — 3 SERVERS (v2 ARCHITECTURE)

Server IP Spec Location Cost TomTom keys Cron entries Role
ghost-01 46.62.149.145 CX43 Helsinki €17/mo 7 (accounts 1-6 + original) 12 Ghost brain (Gemma 4 E4B), Kokoro TTS, Checkup Relay (:8100), Sports Cache (:8300), Americas traffic collection, Overpass/POI harvest, DDG harvest co-runner, weather/AQ push_latest, Coturn, nginx
cx23 204.168.205.135 CX23 Helsinki €4/mo 6 (accounts 7-11 + tt12 bh0xXvG7wJoCkQKvww2hBaH0kAQzbqKt) 3 Europe + APAC traffic collection, weather+AQ snapshots, DDG harvest co-runner
cx23-b 204.168.236.190 CX23 Helsinki €4/mo 6 (tt13 kzThRghjvrb4awuCdCvRF5XdPpg22bz0, tt14 qwmX0zZFXDRJ2oIguze4xV459s33ItVx, tt15 RSSGlqu3U1ronBXH1f6OavczeZu33hkW, tt16 eCzG0MJ652Ue1S8D74BTxCkKWXSYo7Iu, tt17 nDxsWoOASOqi6cGdGgLv01MKllfcv09c, tt18 EJ7e0l1nOIB0g7atT4XgEyYzyUFouqqN) 3 Expansion queue (1,178 cities), DDG harvest primary, weather+AQ snapshots

cx23 and cx23-b rsync daily artifacts to ghost-01 at midnight UTC. ghost-01 runs pattern computation, composite ETA baking, and CDN upload to cdn.alaivos.com/infra/.

v2.4 → v2.7 delta:

Metric v2.4 v2.7
Servers 2 (ghost-01, CX22) 3 (ghost-01, cx23, cx23-b)
TomTom keys 12 19 (7/6/6)
Priority cities 150 290 at Full + 1,178 expansion queued
Daily API budget ~2,500 routing ~47,500 routing
Snapshots/day 3 24 (see §5.5)
Weather / AQ None Open-Meteo every snapshot
POI source Overpass only Overpass + DDG enrichment
DDG harvest None ~43,200 queries/day nightly
Autocomplete Photon Cloudflare Worker
Collection tiers Seed/Lite/Std/Full/Maintenance Standard / Full / Maintenance ONLY (Seed + Lite DEAD)

3. TOMTOM API — 19 KEYS, ~47,500 CALLS/DAY

One TomTom account exposes all APIs with independent per-product free tiers.

| API | Free/account/day | What it returns | Historical? |
|---|---|---|---|
| Routing API | 2,500 | Per-segment travel times + exact ETA | ✅ via departAt |
| Flow Tiles | 50,000 | Color-coded PNG (green/yellow/red) | ❌ live only |

Routing is one-way per call (A→B ≠ B→A). With 19 keys:

| Pool | Per key × 19 | Daily total |
|---|---|---|
| Routing | 2,500 × 19 | 47,500 |
| Tiles | 50,000 × 19 | 950,000 |

Rule: zero wasted calls; both pools consumed to the maximum every day. Weekday overflow → historical departAt backfill; weekend spare → motorcycle calibration, match days, expansion previews.
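The pool totals above follow directly from the per-key free tiers and the 7/6/6 key distribution. A minimal sketch of that arithmetic, where the function and constant names are illustrative and not taken from the pipeline code:

```python
# Free-tier limits per TomTom account per day, as listed in the table above.
ROUTING_PER_KEY = 2_500
TILES_PER_KEY = 50_000

# Key distribution across ghost-01 / cx23 / cx23-b (7/6/6).
KEYS_PER_SERVER = {"ghost-01": 7, "cx23": 6, "cx23-b": 6}


def daily_pools(keys_per_server: dict[str, int]) -> dict[str, int]:
    """Return the total key count and the daily routing and tile budgets."""
    total_keys = sum(keys_per_server.values())
    return {
        "keys": total_keys,
        "routing": total_keys * ROUTING_PER_KEY,
        "tiles": total_keys * TILES_PER_KEY,
    }


def per_server_routing(keys_per_server: dict[str, int]) -> dict[str, int]:
    """Return the routing budget each server can spend from its own keys."""
    return {name: n * ROUTING_PER_KEY for name, n in keys_per_server.items()}


pools = daily_pools(KEYS_PER_SERVER)
print(pools)
print(per_server_routing(KEYS_PER_SERVER))
```

The per-server split is useful when deciding which server can absorb backfill or calibration work without borrowing keys from another.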


4. COLLECTION TIERS — 3 ONLY

Seed and Lite are DEAD. The pipeline only runs three tiers:

| Tier | Cadence | Coverage | Used for |
|---|---|---|---|
| Standard | 24 snapshots/day × 7 days, linear interpolation between baselines | Base 290 cities + expansion graduates | All live ETA |
| Full | 24 snapshots × 7 days × weather+AQ+event overlays; Catmull-Rom cubic spline baseline for Gold cities (28+ days of data) | 290 priority cities | 5-layer composite ETA |
| Maintenance | Spot-check drift detection (>15% deviation triggers recollection), seasonal adjustment, holiday enrichment | Cities that graduated from Full | Long-tail refresh |

Graduation frees ~39,960 calls/day when all 290 cities reach Maintenance — funds expansion.
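The Maintenance drift rule (more than 15% deviation triggers recollection) can be sketched as a single comparison. This is a minimal illustration: the function name, the choice of seconds as the unit, and the use of relative deviation against the stored baseline are assumptions, not the pipeline's actual check.

```python
def needs_recollection(baseline_s: float, spot_check_s: float,
                       threshold: float = 0.15) -> bool:
    """Return True when a spot check drifts from the baseline by more than threshold.

    baseline_s:   stored baseline travel time for a segment or route (seconds)
    spot_check_s: freshly measured travel time for the same slot (seconds)
    threshold:    relative deviation that triggers recollection (15% in this document)
    """
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    deviation = abs(spot_check_s - baseline_s) / baseline_s
    return deviation > threshold


# A 600 s baseline measured at 700 s has drifted ~16.7%, so it is flagged.
print(needs_recollection(600, 700))
# The same baseline measured at 660 s has drifted 10%, so it is left alone.
print(needs_recollection(600, 660))
```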


5. TRAFFIC PATTERN INTELLIGENCE

5.1 Route topology — 4×4 grid + ring road (uniform, all cities)

6 routes per city: 4 zig-zags, each leg of a zig-zag running along a different parallel arterial, plus the ring road in both directions:

# Route Direction
1 EW zig-zag eastbound 4 parallel EW arterials
2 EW zig-zag westbound same, reversed
3 NS zig-zag southbound 4 parallel NS arterials
4 NS zig-zag northbound same, reversed
5 Ring road (Periférico/loop) clockwise
6 Ring road counter-clockwise

Full bidirectional always — both directions every snapshot, no direction-smart shortcuts. Any user route crosses ≥2-3 measured arterials.

5.2 Segment spacing — 250 m

150 waypoints per call → 149 segments × ~250 m ≈ 37.25 km covered per call. Small/medium cities (~80%) need 6 calls/snapshot; large metros (CDMX, São Paulo, Tokyo, LA; ~20%) need ~9 calls/snapshot, for a blended ~6.6 calls/city/snapshot (0.8 × 6 + 0.2 × 9). A 6-call city yields 6 × 149 = 894 bidirectional segments (large metros ~1,074). First/last-mile gap ≤125 m (~18 s, negligible).
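The figures in this subsection can be reproduced with a few lines of integer arithmetic, which also shows how the 290 Full cities fit inside the daily routing pool. A sketch only: the helper names are illustrative, and the 80/20 city split is taken from the text as an approximation.

```python
WAYPOINTS_PER_CALL = 150
SEGMENT_M = 250
FULL_CITIES = 290
SNAPSHOTS_PER_DAY = 24
DAILY_ROUTING_POOL = 47_500  # 19 keys x 2,500 calls

# 150 waypoints bound 149 segments of ~250 m each.
segments_per_call = WAYPOINTS_PER_CALL - 1
metres_per_call = segments_per_call * SEGMENT_M  # 37,250 m = 37.25 km


def blended_calls_x100(small_share_pct: int = 80, small_calls: int = 6,
                       large_calls: int = 9) -> int:
    """Blended calls per city per snapshot, scaled by 100 to stay in integers."""
    return small_share_pct * small_calls + (100 - small_share_pct) * large_calls


def daily_snapshot_calls() -> int:
    """Routing calls per day needed to snapshot every Full city in every slot."""
    return FULL_CITIES * SNAPSHOTS_PER_DAY * blended_calls_x100() // 100


print(segments_per_call, metres_per_call / 1000)       # segments and km per call
print(blended_calls_x100() / 100)                       # blended calls per snapshot
print(daily_snapshot_calls(), DAILY_ROUTING_POOL - daily_snapshot_calls())
```

Under these assumptions the regular snapshots consume most of the routing pool, leaving a small remainder for backfill and calibration work.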

5.3 Segment graph + Dijkstra

Each waypoint is a node; each 250 m segment is a weighted edge whose weight depends on day of week and time of day. Any user route is traced as a Dijkstra shortest path over the graph; summing the edge times and adding a first/last-mile estimate gives ±1 min accuracy.
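A minimal sketch of the shortest-path step, assuming the edge weights have already been resolved for one (day-of-week, slot) pair and are expressed in seconds. The graph layout, node names, and the `first_last_mile_s` parameter are hypothetical and only illustrate the technique.

```python
import heapq

# Adjacency list: node -> list of (neighbour, travel_time_seconds).
Graph = dict[str, list[tuple[str, float]]]


def shortest_travel_time(graph: Graph, source: str, target: str,
                         first_last_mile_s: float = 0.0) -> float:
    """Dijkstra over the segment graph; returns total seconds, or inf if unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist + first_last_mile_s
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return float("inf")


# Toy graph: directed edges, because routing is one-way per call (A→B ≠ B→A).
toy_graph: Graph = {
    "A": [("B", 30.0), ("C", 90.0)],
    "B": [("C", 45.0)],
    "C": [("D", 20.0)],
}
print(shortest_travel_time(toy_graph, "A", "D", first_last_mile_s=18.0))
```

Keeping the graph directed mirrors the full-bidirectional collection rule: each direction of a segment carries its own measured weight.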

5.4 Zoom level

Zoom 13 — all cities, all named roads.

5.5 Snapshot schedule — 24 slots/day

30-min granularity during peaks, hourly otherwise. Every day of the week (Mon-Sun) is a separate baseline (Mon 08:00 ≠ Wed 08:00 ≠ Fri 08:00).

| # | Time (local) | Captures |
|---|---|---|
| 1 | 06:00 | Early risers |
| 2 | 07:00 | School run |
| 3 | 07:30 | Peak AM ramp |
| 4 | 08:00 | Peak AM |
| 5 | 08:30 | Peak AM tail |
| 6 | 09:00 | Late commuters |
| 7 | 10:00 | Mid-morning |
| 8 | 11:00 | Pre-lunch |
| 9 | 12:00 | Lunch rush |
| 10 | 13:00 | Post-lunch |
| 11 | 13:30 | Early afternoon |
| 12 | 14:00 | Afternoon |
| 13 | 14:30 | Mid-afternoon |
| 14 | 15:00 | School pickup starts |
| 15 | 15:30 | School pickup peak |
| 16 | 16:00 | Early evening commute |
| 17 | 16:30 | Evening ramp |
| 18 | 17:00 | Peak PM |
| 19 | 17:30 | Peak PM |
| 20 | 18:00 | PM tail |
| 21 | 18:30 | Late commute |
| 22 | 19:00 | Dinner/leisure |
| 23 | 20:00 | Evening |
| 24 | 21:00 | Late baseline |

5.6 Day-type classification (6 types)

| Code | Type | Pattern |
|---|---|---|
| NW | Normal workday | Full commute peaks |
| SW | School-off workday | Reduced AM peak |
| OH | Office holiday | Reduced all day |
| FH | Full holiday | Minimal commute |
| WE | Weekend | Late starts, leisure |
| SE | Special event | Abnormal venue spikes (matches, concerts) |

Calendars cover 20 countries (see §8).
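One way the six codes could be resolved for a given date is a simple precedence chain. This is a sketch under stated assumptions: the precedence order (SE, then FH, then WE, then OH, then SW, then NW) and the boolean calendar flags are hypothetical, since this document does not specify how overlapping conditions are ranked.

```python
from datetime import date


def classify_day(d: date, *, special_event: bool = False,
                 full_holiday: bool = False, office_holiday: bool = False,
                 school_off: bool = False) -> str:
    """Return one of NW, SW, OH, FH, WE, SE for the given date.

    The flags are assumed to come from the calendar tables and event feed;
    the precedence below is an illustrative assumption.
    """
    if special_event:
        return "SE"
    if full_holiday:
        return "FH"
    if d.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return "WE"
    if office_holiday:
        return "OH"
    if school_off:
        return "SW"
    return "NW"


print(classify_day(date(2026, 4, 13)))                   # a Monday with no flags
print(classify_day(date(2026, 4, 18)))                   # a Saturday
print(classify_day(date(2026, 4, 13), school_off=True))  # Monday during a school break
```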

5.7 City tiers — Gold / Silver / Standard

| Tier | Assignment rule | Baseline model |
|---|---|---|
| Gold | ≥28 days of data AND dual-source (routing + tiles) validated | Catmull-Rom cubic spline between snapshots |
| Silver | 14-27 days of data | Smoothed linear |
| Standard | <14 days or single-source | Linear interpolation |

Cubic spline produces smooth minute-level ETA curves between 30-min snapshots; linear is acceptable for Standard.
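The tier rule and the Gold-tier interpolation can be sketched together. Two assumptions are made here and should be checked against the real implementation: single-source cities are treated as Standard regardless of age, and the spline uses the uniform Catmull-Rom form, which ignores the fact that the schedule mixes 30-minute and hourly gaps.

```python
def assign_tier(days_of_data: int, dual_source: bool) -> str:
    """Map data age and source validation to Gold / Silver / Standard."""
    if not dual_source:
        return "Standard"  # assumption: single-source never rises above Standard
    if days_of_data >= 28:
        return "Gold"
    if days_of_data >= 14:
        return "Silver"
    return "Standard"


def catmull_rom(p0: float, p1: float, p2: float, p3: float, t: float) -> float:
    """Uniform Catmull-Rom value between p1 (t=0) and p2 (t=1).

    p0..p3 are ETAs at four consecutive snapshots; the curve passes through
    p1 and p2 and uses p0 and p3 to shape the tangents.
    """
    t2, t3 = t * t, t * t * t
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t3
    )


def linear(p1: float, p2: float, t: float) -> float:
    """Linear interpolation between two snapshots, as used for Standard cities."""
    return p1 + (p2 - p1) * t


# ETAs (minutes) at four consecutive snapshots; query a quarter of the way
# between the second and third snapshot.
print(catmull_rom(24.0, 28.0, 35.0, 33.0, 0.25), linear(28.0, 35.0, 0.25))
```

The spline bends toward the neighbouring snapshots, so it rounds off the peak instead of drawing a sharp corner at each slot.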

5.8 Tile → routing calibration

Cities with both tiles and routing train a regression: delay_ratio = f(pct_green, pct_yellow, pct_red, hour, city_size). Tile-only cities then get their travel times inferred from that model. Phase 1 yields ~106K dual data points (158 cities × 24 slots × 7 days × 4 samples), enough for a robust model.

5.9 5-layer composite ETA

FINAL_ETA = baseline_spline × live_calibration × weather × calendar × event

| Layer | Source | Notes |
|---|---|---|
| 1. baseline_spline | Historical Mon-Sun × 24-slot table per segment | Catmull-Rom for Gold, linear for Standard |
| 2. live_calibration | TomTom Flow Tiles current snapshot | Scales baseline to right-now conditions |
| 3. weather | Open-Meteo current + city-specific weather→traffic model | Weekly rebuild; 8 categories: clear, fog, light_rain, rain, heavy_rain, storm, snow, ice |
| 4. calendar | 20-country holiday + school-break + puente + pre-holiday exodus + Semana Santa + Buen Fin + Día de Muertos tables | |
| 5. event | Proximity-weighted (Haversine) + time-decay (build → peak → disperse) + attendance-scaled; largest nearby event only, no stacking | Match Day Intelligence feeds this |

UI rule (locked): factor chips display minutes, not percentages — e.g. "Rain expected — adds ~8 min", "Heavy traffic — adds ~12 min vs. your normal 28 min on Mondays." Reference baseline phrasing: "Normally X min on Mondays."

Implementation: 8 files in lib/core/location/traffic_intelligence/, 58 tests.
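A sketch of how the five layers might compose and how multiplicative factors can be turned into the minute-based chips the UI rule requires. The sequential attribution (each layer is credited with the minutes it adds to the running ETA) and the chip wording are assumptions for illustration; the shipped logic lives in the Dart files named above and may differ.

```python
# Multiplicative layers applied on top of the baseline, in formula order.
LAYER_ORDER = ("live_calibration", "weather", "calendar", "event")


def composite_eta(baseline_min: float,
                  factors: dict[str, float]) -> tuple[float, list[str]]:
    """Return (final ETA in minutes, list of minute-based factor chips).

    factors maps a layer name to its multiplier; missing layers default to 1.0.
    """
    eta = baseline_min
    chips: list[str] = []
    for layer in LAYER_ORDER:
        before = eta
        eta *= factors.get(layer, 1.0)
        delta = round(eta - before)  # minutes this layer adds to the running ETA
        if delta > 0:
            chips.append(f"{layer} adds ~{delta} min")
        elif delta < 0:
            chips.append(f"{layer} saves ~{-delta} min")
    return eta, chips


eta, chips = composite_eta(28.0, {"live_calibration": 1.25, "weather": 1.2})
print(round(eta), chips)
```

Sequential attribution makes the chip minutes sum to the total change, at the cost of making each chip depend on layer order.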

5.10 Motorcycle calibration

One-time collection, 3 days in Week 2, 15 diverse cities (gridlock → light traffic), alternating car/moto on same routes (1 account's daily budget). Output = congestion-dependent multiplier applied to cached car patterns at query time:

| Congestion | Multiplier |
|---|---|
| Freeflow | 0.90-0.95× |
| Light | 0.80-0.85× |
| Moderate | 0.65-0.75× |
| Heavy | 0.50-0.60× |
| Gridlock | 0.35-0.45× |

Motorcycle ETA is FREE for ALL tiers including Starter (map feature gate interactiveMap). Collected once, used forever.
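Applying the multiplier at query time amounts to a table lookup. The sketch below returns the ETA range implied by the table above and does not pick a single calibrated value, since the final per-level multipliers come from the calibration run; the function name and congestion-level keys are illustrative.

```python
# Multiplier ranges (low, high) per congestion level, copied from the table above.
MOTO_MULTIPLIERS = {
    "freeflow": (0.90, 0.95),
    "light": (0.80, 0.85),
    "moderate": (0.65, 0.75),
    "heavy": (0.50, 0.60),
    "gridlock": (0.35, 0.45),
}


def motorcycle_eta_range(car_eta_min: float, congestion: str) -> tuple[float, float]:
    """Return (fastest, slowest) motorcycle ETA in minutes for a cached car ETA."""
    try:
        low, high = MOTO_MULTIPLIERS[congestion]
    except KeyError:
        raise ValueError(f"unknown congestion level: {congestion!r}") from None
    return car_eta_min * low, car_eta_min * high


# A 40-minute car trip in heavy traffic maps to roughly 20-24 minutes by motorcycle.
print(motorcycle_eta_range(40.0, "heavy"))
```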

5.11 Match Day Intelligence

  • Past harvest: 250 matches (Liga MX + MLS 2025-2026 at WC venues), 40 departAt calls/match = 10,000 calls, single Sunday batch.
  • Live weekend harvest (Apr-Jun): ~4-6 WC-venue matches × 85 calls = 340-510 calls/weekend.
  • Model: surge = f(attendance, venue_capacity, time_offset) via API-Sports attendance data; scaled by WC capacity ratio × 1.3 international factor for World Cup 2026 prediction.

5.12 Historical backfill

Weekend spare routing calls fill weekday gaps via departAt. Priority order: Gold city gaps → Silver → Standard → upcoming holidays → expansion city previews.


6. POI PIPELINE — OVERPASS BACKBONE + DDG ENRICHMENT

6.1 Overpass / OSM (backbone)

  • 290 cities × 8 categories (food, shopping, health, finance, transport, leisure, tourism, services) = 2,320 queries.
  • Large cities (CDMX, São Paulo, Tokyo) split by administrative subdivision to avoid timeouts.
  • Monthly refresh, 1st of month 03:00 UTC.
  • Status: 263+ cities already collected, harvest running.
  • CDN: cdn.alaivos.com/infra/poi/{city_slug}/{category}.json.
  • Phone-side: user sets home city / creates trip → downloads bundle (~2-5 MB/city) → instant offline lookups.
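The CDN layout above implies one JSON file per city and category. A small sketch of how a client might enumerate the bundle for a city, assuming HTTPS and a hypothetical slug such as `guadalajara` (the real slug format is not specified here):

```python
# Assumption: the CDN is served over HTTPS at the path pattern given above.
CDN_POI_BASE = "https://cdn.alaivos.com/infra/poi"

# The 8 POI categories listed in this section.
CATEGORIES = (
    "food", "shopping", "health", "finance",
    "transport", "leisure", "tourism", "services",
)


def poi_bundle_urls(city_slug: str) -> list[str]:
    """Return the per-category POI URLs that make up one city's offline bundle."""
    return [f"{CDN_POI_BASE}/{city_slug}/{category}.json" for category in CATEGORIES]


urls = poi_bundle_urls("guadalajara")  # hypothetical slug
print(len(urls), urls[0])
print(290 * len(CATEGORIES))  # total Overpass queries per monthly refresh
```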

6.2 DDG enrichment (primary enrichment source)

Cloudflare Worker at search.alaivos.com proxies DuckDuckGo. Nightly batch harvest:

| Parameter | Value |
|---|---|
| Fires | 22:00 UTC nightly |
| Rate | 1 req/sec per server × 3 servers × 4 hours |
| Daily volume | ~43,200 queries/day |
| To-launch total | ~2.5M queries over 57 days |
| Cost | $0 (fully free, fully cacheable) |

DDG is the sole enrichment source. Google Places is not in the pipeline.
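The harvest volumes follow from the rate, server count, and window length. A short sketch of that arithmetic, with illustrative names:

```python
REQS_PER_SEC_PER_SERVER = 1
SERVERS = 3
WINDOW_HOURS = 4
DAYS_TO_LAUNCH = 57


def nightly_queries(servers: int = SERVERS, hours: int = WINDOW_HOURS,
                    rate: int = REQS_PER_SEC_PER_SERVER) -> int:
    """Queries issued in one nightly window across all servers."""
    return rate * servers * hours * 3600


def queries_to_launch(days: int = DAYS_TO_LAUNCH) -> int:
    """Cumulative queries if every nightly window runs at the full rate."""
    return nightly_queries() * days


print(nightly_queries())    # 43200
print(queries_to_launch())  # 2462400, i.e. roughly 2.5M
```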

6.3 Photon autocomplete (Cloudflare Worker)

  • Endpoint: photon.alaivos.com.
  • Worker proxies Komoot's hosted Photon with 24 h edge cache.
  • Cost: $0/mo (Cloudflare Worker free tier).
  • Self-hosting abandoned — Photon index is 17 GB; cx23-b has only 38 GB disk. Worker + global edge cache is strictly better anyway.

6.4 Nominatim

Free geocoding, 1 req/sec, User-Agent: alaivOS/1.0. Used for contact address pins ("Show on map" toggle in People) and rare forward/reverse geocoding. Feature gate: contactMapPins (Spark+).
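Staying within 1 req/s can be enforced with a small client-side throttle placed in front of every call. The sketch below shows only the throttle and the header, with no HTTP request; the class name is hypothetical, and the clock and sleep functions are injectable so the behaviour can be tested without waiting.

```python
import time

# Header value taken from this section.
NOMINATIM_HEADERS = {"User-Agent": "alaivOS/1.0"}


class Throttle:
    """Blocks callers so that consecutive calls are at least min_interval_s apart."""

    def __init__(self, min_interval_s: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Sleep if needed, then reserve the next slot. Returns seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = now + delay + self._interval
        return delay


throttle = Throttle(min_interval_s=1.0)
throttle.wait()  # first call passes immediately; call wait() before each request
```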

6.5 5-layer app search stack (for cross-reference)

App-side search in places_service.dart (not pipeline, but depends on it):

  1. My Places FTS5 (offline SQLite)
  2. POI cache (offline, from §6.1)
  3. Photon (free, §6.3)
  4. Nominatim (free, §6.4)
  5. Google Places — temporary real-time luxury layer, dart-define kill-switch, $5,154 MXN ≈ $300 USD credit, expires June 5, 2026. Not used by pipeline.

Cancelled entirely: Foursquare (mandatory attribution, no caching on free/PAYG), Yelp (24 h max cache, pre-fetch explicitly prohibited).


7. WEATHER + AIR QUALITY

  • Source: Open-Meteo (free, no key).
  • Cadence: every snapshot (every 30 min) per city.
  • AQ: Open-Meteo AQI alongside weather (same call pattern).
  • Distribution: Epsilon's push_latest.py uploads per-city JSON to cdn.alaivos.com/infra/weather/{city}.json every 30 min. ~500 B per city × 290 = ~145 KB total manifest.
  • App-side rule (locked): app reads weather + AQ from CDN, never directly from Open-Meteo. Overlays therefore work offline (≤30 min stale).
  • Only live-net weather dependency: RainViewer radar PNG tiles (satellite imagery, not reasonably cacheable). Rain overlay uses maxNativeZoom: 12 with upscaling.
  • Weather→traffic correlation: city-specific multipliers, weekly rebuild, 8 categories (clear, fog, light_rain, rain, heavy_rain, storm, snow, ice). Feeds Layer 3 of composite ETA.

8. CALENDARS — 20 COUNTRIES

holidays.json (already built) + school-break extension covering:

  • Holidays (federal + state).
  • School breaks (summer, winter, mid-term).
  • Puentes (long-weekend bridges).
  • Pre-holiday exodus days.
  • Semana Santa, Buen Fin, Día de Muertos (MX-specific).
  • Equivalent regional waves (US Thanksgiving Wed, UK bank-holiday Mondays, EU August shutdown, etc.).

Feeds day-type classification (§5.6) and Layer 4 of composite ETA.


9. AIRPORTS + TRANSIT

  • Airports: 47 airports cached at cdn.alaivos.com/infra/airports/ with IATA codes, terminal maps, holiday/flight-schedule metadata.
  • Transit (v1.1): GTFS feeds from OpenMobilityData (free). Walking/cycling via OSRM foot/bicycle profiles (free, self-hosted or demo server with 1 req/s throttle).

10. SPORTS CACHE — 31 LEAGUES

Server: ghost-01:8300. TTL 1 h, stale-on-error (serves last-good payload if upstream fails).

| Source | Leagues | Cost | Notes |
|---|---|---|---|
| ESPN (undocumented JSON) | 14 US sports endpoints | Free | NFL, NBA, MLB, NHL, NCAA FB/BB, MLS, WNBA, etc. |
| TheSportsDB | 15 (football/LatAm/cricket) | $3/mo Patreon (commercial ToS) | Premier, La Liga, Serie A, Bundesliga, Liga MX, Brasileirão, IPL, BBL, etc. |
| Jolpica F1 | 1 (Formula 1) | Free | Drop-in Ergast replacement |
| Boxing scraper | 1 (boxing) | Free | Custom HTML scraper |

App-side: 5 tabs, 4 table styles, 40 ARB keys × 21 locales. Match-day intelligence (§5.11) consumes attendance + match schedule data from this cache.
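The cache policy for this section (1 h TTL, serve the last-good payload when the upstream fails) can be sketched as a small wrapper around a fetch function. This is an in-memory illustration with hypothetical names, not the service running on ghost-01:8300.

```python
import time
from typing import Any, Callable


class StaleOnErrorCache:
    """TTL cache that falls back to the last good value if a refresh fails."""

    def __init__(self, fetch: Callable[[str], Any], ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_s
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit, no upstream call
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # upstream failed: serve the stale payload
            raise  # nothing cached yet, so the error has to surface
        self._store[key] = (now, value)
        return value
```

A failed refresh does not reset the stored timestamp, so the next request retries the upstream and does not treat the stale payload as fresh.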


11. CHECKUP RELAY

  • Endpoint: ghost-01:8100.
  • Dual anonymization pipeline:
      1. Device-side PII strip (regex + NanoAnonymizer).
      2. Server-side Gemma 4 E4B anonymizer pass (NLP-grade, catches what regex misses).
      3. Submit anonymized payload to Anthropic Batch API.
  • Cost: ~$0.012 per checkup.
  • 3 domains, tier-gated cadence (see GHOST_PROTOCOL.md).
  • Response routed back through reverse anonymization on device (re-personalization map held on-device only).
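The device-side regex strip and the on-device re-personalization map can be illustrated with placeholder tokens. This is a deliberately simple sketch: the two patterns, the token format, and the function names are hypothetical, and it does not represent NanoAnonymizer or the Gemma pass. Names and other free-text identifiers are not caught by these patterns, which is the gap the server-side pass is described as covering.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and phone numbers with tokens; return text and token map."""
    mapping: dict[str, str] = {}

    def replacer(kind: str):
        def _replace(match: re.Match) -> str:
            token = f"[{kind}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        return _replace

    text = EMAIL_RE.sub(replacer("EMAIL"), text)
    text = PHONE_RE.sub(replacer("PHONE"), text)
    return text, mapping


def restore_pii(text: str, mapping: dict[str, str]) -> str:
    """Reverse the substitution using the map that never leaves the device."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


message = "Call the clinic at +52 55 1234 5678 or write to ana@example.com"
redacted, token_map = strip_pii(message)
print(redacted)
print(restore_pii(redacted, token_map) == message)
```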

12. VOICE (GHOST HD)

  • Engine: Kokoro 82M (StyleTTS 2, Apache 2.0, 54 voices, 8 languages, CPU-friendly).
  • Host: ghost-01 (CX43), co-resident with Gemma 4 E4B (E4B only loads on demand, unloads on idle timeout; concurrent load test pending in SPRINT_EPSILON_KOKORO_EVAL.md).
  • Kokoro-first pipeline: ElevenLabs reference → Bishop extracts StyleTTS 2 style vector → Kokoro .pt (~500 KB) is canonical "Laiv voice" → generates corpus → (a) fine-tunes Piper VITS → ONNX on-device, (b) feeds Voxtral 3B zero-shot embedding for Ghost HD v1.1+.
  • v1.0 on-device TTS: Piper en_US-hfc_female-medium (63 MB, bundled APK).

13. CRON & BUDGETS — SUMMARY TABLE

| Job | Server | Cadence | Daily cost |
|---|---|---|---|
| Traffic routing snapshots (24/day × 290 Full cities) | ghost-01 (Americas) / cx23 (EU+APAC) | every 30-60 min per slot | ~47,500 routing calls |
| Flow tile snapshots | all 3 | every snapshot | ~950k tiles (budget headroom) |
| Weather + AQ (Open-Meteo) | cx23, cx23-b | every snapshot (30 min) | free |
| push_latest.py → CDN | ghost-01 | every 30 min | free |
| DDG enrichment harvest | ghost-01 + cx23 + cx23-b | 22:00 UTC nightly, 4 h | ~43,200 queries |
| Overpass POI refresh | ghost-01 | 1st of month 03:00 UTC | 2,320 queries |
| Sports cache refresh | ghost-01:8300 | 1 h TTL, refresh on miss | ~30 upstream calls/h |
| Expansion flood (1,178 cities) | cx23-b primary | scheduled April 29, 2026 (week 5) | graduates into Standard tier |
| Match-day past harvest | ghost-01 | one-off Sunday batch | 10,000 calls |
| Match-day live harvest | ghost-01 | weekends Apr-Jun | 340-510 calls/wk |
| Motorcycle calibration | ghost-01 (one account) | 3 days in Week 2 | 1 account budget |

14. CDN STRUCTURE

cdn.alaivos.com/
├── infra/
│   ├── airports/              (47 airports)
│   ├── poi/{city}/{cat}.json  (290 cities × 8 cats)
│   ├── traffic/{city}/patterns.json  (segment graph + Mon-Sun × 24-slot baselines)
│   ├── weather/{city}.json    (Open-Meteo snapshot + AQ)
│   └── calendars/holidays.json (20 countries + school breaks)
└── models/                    (laiv-xs/s/m/l/xl/ghost.gguf + manifest.json)

Privacy: fixed route waypoints only, never user GPS. All pipeline artifacts are city-level, public-road, non-personal.


15. LOCKED DECISIONS

  1. 3 servers only — ghost-01 CX43 + cx23 + cx23-b. No further node adds until >500 Ghost subscribers (then GEX44 €184/mo).
  2. 19 TomTom keys (7/6/6 distribution across ghost-01/cx23/cx23-b). ~47,500 routing + ~950k tiles daily.
  3. 290 priority cities at Full + 1,178 expansion cities queued for week-5 flood (Apr 29, 2026). Target at launch: ~1,090+ cities effectively served.
  4. 3 collection tiers only — Standard, Full, Maintenance. Seed and Lite are DEAD.
  5. 24 snapshots/day — 30-min peaks, hourly otherwise. Every day-of-week is its own baseline (7 Mon-Sun baselines per city).
  6. 250 m segment spacing, 150 waypoints per route, 6 routes per city (4×4 grid + bidirectional ring). Uniform premium quality for all cities.
  7. Full bidirectional always — no direction-smart shortcuts.
  8. 5-layer composite ETA formula locked: baseline_spline × live_calibration × weather × calendar × event. Catmull-Rom cubic spline for Gold, linear for Standard. Factor chips display minutes, not percentages.
  9. 20-country calendar with holidays, school breaks, puentes, Semana Santa, Buen Fin, Día de Muertos.
  10. Event adjustment is proximity + time-decay + attendance-scaled, largest nearby event only, no stacking.
  11. Motorcycle calibration — one-time collection, congestion multiplier, used forever. Motorcycle ETA FREE for all tiers including Starter.
  12. Weather + AQ every snapshot via Open-Meteo — app reads from CDN, never live. Only RainViewer radar needs live net.
  13. DDG is the sole enrichment source — ~43,200 queries/day nightly, free, fully cacheable. Google Places is app-real-time-only behind kill-switch, never in pipeline.
  14. Photon = Cloudflare Worker proxy at photon.alaivos.com with 24 h edge cache. Self-hosting abandoned (17 GB index against only 38 GB of disk on cx23-b).
  15. Nominatim — free, 1 req/s, User-Agent: alaivOS/1.0.
  16. Zero paid API dependencies in POI/search/traffic/weather/AQ. Only paid items: TheSportsDB Patreon $3/mo (sports cache commercial ToS) and Anthropic API (~$0.012/checkup).
  17. Google Places credit = $5,154 MXN ≈ $300 USD (NOT $5,154 USD), expires June 5, 2026, app-real-time luxury only. Foursquare and Yelp cancelled entirely.
  18. Sports cache — 31 leagues (ESPN 14 + TheSportsDB 15 + Jolpica F1 + boxing scraper) on ghost-01:8300, 1 h TTL, stale-on-error.
  19. Checkup Relay — ghost-01:8100, dual anonymization (device → Gemma 4 E4B), Anthropic Batch, ~$0.012/checkup.
  20. Voice (Ghost HD) — Kokoro 82M on ghost-01. Kokoro-first pipeline inverts prior plan.
  21. Segment graph + Dijkstra — any user route traced as shortest path over the 250 m segment graph.
  22. Monthly infra cost: €25 + $3/mo TheSportsDB (+ Anthropic metered).

Every day this runs, the moat gets deeper. 290 cities at 250 m precision, 24 snapshots/day, 5-layer composite ETA, full weather + AQ + DDG enrichment, for €25/month. No competitor has this.