Clinical data infrastructure

From raw wearable exports
to model-ready clinical physiology

Kinetica’s end-to-end pipeline ingests heterogeneous Polar exports, validates schemas, applies physiological quality filters, computes derived HRV and activity features, aligns them on a daily timeline and prepares them for prediction, interpretation and public rendering.

L0–L6 · 7 layers
6 data sources · 8 parsers
243 days · 71 features
5 independent predictors
Pipeline map · L0 → L6
L0 Raw ingest: Polar GDPR export · outside repo
L1 Structured extract: 8 parsers · typed DataFrames
L2 Derived features: HRV · zones · strata
L3 Unified daily frame: 243 rows · 71 cols
L4 Joint with diary: 61 pairs · lag features
L5 Model outputs: 5 targets · AUC 0.829
L6 Portfolio render: polar_live.json · live
L0 · L1
Raw ingest · Structured extract

From heterogeneous Polar exports to typed tables

Raw data lives outside the repository. Polar GDPR exports contain personal cardiac data and are never committed. The L0 archive holds 1,025 JSON files (1.1 GB) from eleven distinct data sources, covering the full observation window from August 2025 to April 2026.

Eight specialized parsers in pipeline/l1_extract/ convert these heterogeneous JSON schemas into typed pandas DataFrames with consistent date indexing and Pydantic schema validation. Every dropped row is logged with its reason. Output: a set of dated parquets written to data/processed/L1/.
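The validate-and-log step described above can be sketched as follows. This is a minimal illustration, not the repository's parser code: the DailyHeartRate schema, its field names, the plausibility bounds and the validate_rows helper are all assumptions made for the example.

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class DailyHeartRate(BaseModel):
    """Hypothetical schema for one parsed day (field names are assumptions)."""

    day: date
    resting_hr: int = Field(ge=25, le=250)  # illustrative bounds in bpm


def validate_rows(raw_rows):
    """Split raw dicts into validated records and (index, reason) drop entries."""
    valid, dropped = [], []
    for idx, raw in enumerate(raw_rows):
        try:
            valid.append(DailyHeartRate(**raw))
        except ValidationError as exc:
            # Name the failing fields so every dropped row carries a reason.
            failed = sorted({str(err["loc"][0]) for err in exc.errors()})
            dropped.append((idx, "invalid: " + ", ".join(failed)))
    return valid, dropped


rows = [
    {"day": "2025-08-01", "resting_hr": 52},
    {"day": "2025-08-02", "resting_hr": 400},  # outside the plausibility bounds
    {"day": "not-a-date", "resting_hr": 55},   # unparseable date
]
valid, dropped = validate_rows(rows)
```

The validated records can then be turned into a date-indexed DataFrame, while the dropped list goes to whatever log the parser uses.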

247 OHR
24/7 optical heart rate monitoring. Wrist-based continuous recording aggregated to daily resting HR and daytime HR load.
parse_247ohr.py
Activity
Daily step count, active calories, activity class distribution and total training load from device activity logs.
parse_activity.py
Sleep
Sleep architecture: total sleep time, deep, REM and light stage minutes, wake-after-sleep-onset and Polar sleep score.
parse_sleep.py
Nightly recovery
Polar nightly recharge score, ANS charge state and HRV readiness index computed from nocturnal RR recordings.
parse_nightly_recovery.py
PPI samples
Raw peak-to-peak RR interval time series. Full nocturnal recording. Foundation for all advanced HRV computation in L2.
parse_ppi_samples.py
Orthostatic tests
Sparse source: 8 structured lying-to-standing autonomic reactivity tests. HR and HRV response curves per test session.
parse_orthostatic.py
L2
Derived features

Advanced HRV, zone distribution and session stratification

Three independent feature computers in pipeline/l2_features/ transform the L1 typed DataFrames into interpretable physiological signals. Each module is deterministic given the same input and writes its own dated parquet to data/processed/L2/.

HRV
Advanced HRV features
NeuroKit2 computes 9 time-domain, frequency-domain and nonlinear metrics from raw PPI samples: SDNN and RMSSD (time domain); LF power, HF power and LF/HF ratio (frequency domain); SD1, SD2, DFA-α1 and sample entropy (nonlinear). The metrics are computed nightly over the full nocturnal RR interval stream. Each night yields a single-row feature vector aligned by calendar date.
compute_hrv_features.py · 9 features · 239 nights
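For orientation, four of the nine metrics have closed-form definitions that are easy to check by hand. The sketch below uses only the Python standard library and is not the NeuroKit2 call the pipeline relies on; the function name time_domain_hrv and the sample intervals are hypothetical. SD1 and SD2 are derived from the standard Poincaré identities rather than from an ellipse fit.

```python
import math
import statistics


def time_domain_hrv(rr_ms):
    """Return SDNN, RMSSD, SD1 and SD2 (in ms) for RR intervals given in ms."""
    rr = [float(x) for x in rr_ms]
    if len(rr) < 3:
        raise ValueError("need at least three intervals")
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    sdnn = statistics.stdev(rr)                       # sample SD of intervals
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    var_diff = statistics.variance(diffs)             # variance of successive differences
    sd1 = math.sqrt(0.5 * var_diff)                   # short-term Poincaré axis
    # Long-term axis; clamp at zero because tiny samples can go slightly negative.
    sd2 = math.sqrt(max(0.0, 2.0 * sdnn ** 2 - 0.5 * var_diff))
    return {"sdnn": sdnn, "rmssd": rmssd, "sd1": sd1, "sd2": sd2}


night = time_domain_hrv([800, 820, 840, 830, 810, 790])
```

A check like this is mainly useful as a sanity test against library output on a short, known series; the frequency-domain metrics, DFA-α1 and sample entropy need the full library implementation.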
Z1–Z5
Heart rate zone distribution
Training sessions are partitioned into 5 HR zones using age-adjusted maximum HR thresholds. Minutes in each zone are aggregated per training day and expressed both as absolute time and as proportion of total session duration. Rest days receive zero-valued zone columns.
compute_zone_distribution.py · 10 columns · 107 training days
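The zone aggregation can be sketched as below. The page does not state how the age-adjusted maximum or the zone boundaries are defined, so this example assumes the common 220 − age estimate and zone floors at 50, 60, 70, 80 and 90 % of that maximum; the function and column names are hypothetical.

```python
import bisect

# Assumed lower bounds of Z1..Z5 as fractions of maximum HR.
ZONE_LOWER_BOUNDS = (0.50, 0.60, 0.70, 0.80, 0.90)


def zone_distribution(hr_samples, age, seconds_per_sample=1.0):
    """Return 10 columns: minutes and share of session duration for Z1..Z5."""
    hr_max = 220 - age  # assumed age-adjusted maximum
    minutes = [0.0] * 5
    for hr in hr_samples:
        # Number of zone floors at or below this sample; 0 means below Z1.
        zone = bisect.bisect_right(ZONE_LOWER_BOUNDS, hr / hr_max)
        if zone:
            minutes[zone - 1] += seconds_per_sample / 60.0
    duration = len(hr_samples) * seconds_per_sample / 60.0
    row = {}
    for idx, value in enumerate(minutes, start=1):
        row[f"z{idx}_min"] = value
        row[f"z{idx}_frac"] = value / duration if duration else 0.0
    return row


session = zone_distribution([100] * 60 + [130] * 120 + [170] * 60, age=40,
                            seconds_per_sample=30)
rest_day = zone_distribution([], age=40)  # all ten columns are zero
```

Samples below the Z1 floor still count toward session duration, so the five proportions need not sum to one.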
STR
Session stratification
Each training session is classified into an intensity stratum — easy, moderate, hard or very hard — using a composite score from zone distribution, session duration and training load index. Rest days are labelled rest. The stratum column propagates through L3 and serves as a stratifier in L4 lag-feature analysis.
compute_session_strata.py · 107 sessions classified
L3 · L4
Unified daily frame · Joint with diary

One row per day, aligned across all sources

l3_unified.py outer-merges all L1 and L2 parquets on calendar date. The result is a unified DataFrame with one row per day across the full observation window. Every date present in any source is preserved; NaN marks absence from a given source — orthostatic tests appear on only 8 days, fitness tests on 11. Sparse sources do not compress the window.
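A minimal sketch of such an outer merge with pandas, using three tiny made-up frames. The column names and values are illustrative only, and the unify_daily helper is hypothetical rather than the contents of l3_unified.py.

```python
from functools import reduce

import pandas as pd


def unify_daily(frames):
    """Outer-merge per-source daily frames on their shared 'date' column."""
    merged = reduce(
        lambda left, right: left.merge(right, on="date", how="outer"), frames
    )
    return merged.sort_values("date").reset_index(drop=True)


sleep = pd.DataFrame({
    "date": pd.to_datetime(["2025-08-01", "2025-08-02", "2025-08-03"]),
    "sleep_score": [78, 81, 74],
})
ortho = pd.DataFrame({  # sparse source: present on a single day
    "date": pd.to_datetime(["2025-08-02"]),
    "ortho_hr_delta": [14],
})
activity = pd.DataFrame({
    "date": pd.to_datetime(["2025-08-03", "2025-08-04"]),
    "steps": [9100, 12040],
})

daily = unify_daily([sleep, ortho, activity])
```

The result has one row for each of the four dates seen in any frame, with NaN wherever a source has no record for that day.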

L4 inner-joins the L3 frame with the subjective symptom diary, reducing it to the paired days on which both physiological data and a diary entry exist. Three temporal lags (t−1, t−2, t−3) are added for each of the four primary autonomic columns, expanding the feature space for time-lagged prediction. The L4 artifact is the direct input to the L5 model training pipeline.
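The lag-then-join step might look like the sketch below. It assumes the lags are computed on a gap-free daily calendar before the inner join, so that t−1 always means the previous calendar day; the page does not say in which order the pipeline does this, and the add_lags helper and column names are hypothetical.

```python
import pandas as pd


def add_lags(daily, columns, lags=(1, 2, 3)):
    """Append <col>_lag<k> columns holding the value from k calendar days earlier."""
    out = daily.set_index("date").sort_index()
    # Reindex to a continuous daily calendar so shift(k) means exactly k days back.
    full = pd.date_range(out.index.min(), out.index.max(), freq="D", name="date")
    out = out.reindex(full)
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out.reset_index()


daily = pd.DataFrame({
    "date": pd.to_datetime(["2025-08-01", "2025-08-02", "2025-08-04"]),
    "rmssd": [40.0, 42.0, 45.0],  # 2025-08-03 is missing
})
diary = pd.DataFrame({
    "date": pd.to_datetime(["2025-08-02", "2025-08-04"]),
    "symptom": [0, 1],
})

lagged = add_lags(daily, ["rmssd"])
paired = lagged.merge(diary, on="date", how="inner")  # diary-aligned days only
```

Shifting after the inner join would instead pull values from the previous diary day, which can be several calendar days earlier.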

L3 rows
243
days · outer merge · all sources
L3 columns
71
features · NaN where source absent
L4 pairs
61
diary-aligned days · inner join
Lag features
+24
t−1 · t−2 · t−3 on 4 autonomic cols
L5 · L6
Model outputs · Portfolio render

Predictor outputs and publish state

The L5 predictor layer runs forward feature selection independently per target symptom. Leave-one-out cross-validation is the primary validation strategy: each paired day is held out as the test case exactly once and used for training in every other fold. A bootstrap confidence interval (1000 resamples) quantifies uncertainty per target. The deployment target is autonomic dysfunction, reaching AUC 0.829 on N = 55 paired days.
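One way to attach a bootstrap interval to a leave-one-out AUC is to pool the held-out scores and resample them with replacement. The standard-library sketch below assumes that approach and a percentile interval; the page does not specify the exact resampling scheme, and both function names are hypothetical.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUC from resampling pooled held-out predictions."""
    auc(labels, scores)  # fail early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # single-class resample: AUC undefined, draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


point = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
low, high = bootstrap_auc_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], n_boot=200)
```

With only a few dozen paired days the interval will be wide, which is the reason to report it alongside the point estimate.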

L6 assembles all pipeline outputs into polar_live.json using an atomic tempfile-rename write strategy. The JSON is consumed at runtime by the React portfolio at kineticaai.com — SSR fallback is served first; client-side hydration fills confidence intervals and feature weights after load. A separate pipeline_state.json artifact documents per-level status, metrics and last-run timestamps.
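The tempfile-rename pattern can be sketched with the standard library as follows. The helper name write_json_atomic and the payload are illustrative; the actual L6 writer may differ in details such as fsync behaviour or JSON formatting.

```python
import json
import os
import tempfile


def write_json_atomic(payload, target_path):
    """Write JSON to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, target_path)  # atomic swap of old file for new
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written temp file behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "polar_live.json")
    write_json_atomic({"headline_auc": 0.829}, target)
    with open(target, encoding="utf-8") as handle:
        published = json.load(handle)
    leftovers = os.listdir(workdir)
```

Because readers only ever see the old file or the complete new one, a crash mid-write cannot leave the site consuming a truncated JSON.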

5 Targets
0.829 Headline AUC
55 Training pairs
LOO-CV Validation
1000× Bootstrap CI
Atomic Publish write
Multi-predictor infrastructure

This pipeline is the common foundation for multiple independent predictors. Each predictor consumes the L4 feature frame but selects its own feature subset, trains its own model and publishes its own output independently. Adding a new predictor requires no changes to L0–L4 — the data infrastructure is shared; the modelling layer is not. Built as infrastructure, not as a one-off preprocessing script.