Clinical data infrastructure

From raw wearable exports to model-ready clinical physiology

Kinetica’s end-to-end pipeline ingests heterogeneous Polar exports, validates schemas, applies physiological quality filters, computes derived HRV and activity features, aligns them on a daily timeline and prepares them for prediction, interpretation and public rendering.

L0 – L6 · 7 layers

6 data sources · 8 parsers

243 days · 71 features

5 independent predictors

Pipeline map · L0 → L6

L0 Raw ingest Polar GDPR export · outside repo

L1 Structured extract 8 parsers · typed DataFrames

L2 Derived features HRV · zones · strata

L3 Unified daily frame 243 rows · 71 cols

L4 Joint with diary 61 pairs · lag features

L5 Model outputs 5 targets · AUC 0.829

L6 Portfolio render polar_live.json · live

L0 · L1

Raw ingest · Structured extract

From heterogeneous Polar exports to typed tables

Raw data lives outside the repository. Polar GDPR exports contain personal cardiac data and are never committed. The L0 archive holds 1,025 JSON files (1.1 GB) from eleven distinct data sources, covering the full observation window from August 2025 to April 2026.

Eight specialized parsers in pipeline/l1_extract/ convert these heterogeneous JSON schemas into typed pandas DataFrames with consistent date indexing and Pydantic schema validation. Every dropped row is logged with its reason. Output: a set of dated parquets written to data/processed/L1/.
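A minimal sketch of the validate-and-log pattern described above, assuming a hypothetical sleep schema. The model name `SleepRecord`, its fields, the bounds and the function `parse_records` are illustrative assumptions; the real Polar field names and the actual parser code are not shown in this text.

```python
import logging
from datetime import date

import pandas as pd
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("l1_extract")

COLUMNS = ["date", "total_sleep_min", "sleep_score"]


class SleepRecord(BaseModel):
    """Hypothetical schema for one night; field names and bounds are assumptions."""

    night: date
    total_sleep_min: float = Field(ge=0, le=1440)
    sleep_score: float = Field(ge=0, le=100)


def parse_records(raw_records):
    """Validate raw dicts; return (date-indexed DataFrame, list of drop reasons)."""
    rows, dropped = [], []
    for i, raw in enumerate(raw_records):
        try:
            rec = SleepRecord(**raw)
        except ValidationError as exc:
            # Keep the first validation message as the logged drop reason.
            reason = f"record {i}: {exc.errors()[0]['msg']}"
            logger.warning("dropped %s", reason)
            dropped.append(reason)
            continue
        rows.append(
            {
                "date": pd.Timestamp(rec.night),
                "total_sleep_min": rec.total_sleep_min,
                "sleep_score": rec.sleep_score,
            }
        )
    frame = pd.DataFrame(rows, columns=COLUMNS).set_index("date").sort_index()
    return frame, dropped
```

Returning the drop reasons alongside the frame keeps the audit trail testable without inspecting log output.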

24/7 OHR

24/7 optical heart rate monitoring. Wrist-based continuous recording aggregated to daily resting HR and daytime HR load.

parse_247ohr.py

Activity

Daily step count, active calories, activity class distribution and total training load from device activity logs.

parse_activity.py

Sleep

Sleep architecture: total sleep time, deep, REM and light stage minutes, wake-after-sleep-onset and Polar sleep score.

parse_sleep.py

Nightly recovery

Polar nightly recharge score, ANS charge state and HRV readiness index computed from nocturnal RR recordings.

parse_nightly_recovery.py

PPI samples

Raw peak-to-peak RR interval time series. Full nocturnal recording. Foundation for all advanced HRV computation in L2.

parse_ppi_samples.py

Orthostatic tests

Sparse source: 8 structured lying-to-standing autonomic reactivity tests. HR and HRV response curves per test session.

parse_orthostatic.py

L2

Derived features

Advanced HRV, zone distribution and session stratification

Three independent feature computers in pipeline/l2_features/ transform the L1 typed DataFrames into interpretable physiological signals. Each module is deterministic given the same input and writes its own dated parquet to data/processed/L2/.

HRV

Advanced HRV features

NeuroKit2 computes 9 time-domain, frequency-domain and nonlinear metrics from raw PPI samples: SDNN, RMSSD, LF power, HF power, LF/HF ratio, SD1, SD2, DFA-α1 and sample entropy. The computation is applied nightly to the full nocturnal RR interval stream. Each night yields a single-row feature vector aligned by calendar date.

compute_hrv_features.py · 9 features · 239 nights
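To make the time-domain and Poincaré metrics concrete, here is a plain NumPy sketch of SDNN, RMSSD, SD1 and SD2 using their standard definitions. It is not the NeuroKit2 code path the pipeline uses, and the function name `hrv_basic` is hypothetical. Frequency-domain power, DFA-α1 and sample entropy are left out.

```python
import numpy as np


def hrv_basic(ppi_ms):
    """Compute SDNN, RMSSD, SD1 and SD2 from an interval series in milliseconds."""
    ppi = np.asarray(ppi_ms, dtype=float)
    diffs = np.diff(ppi)  # successive interval differences

    sdnn = np.std(ppi, ddof=1)             # overall variability
    rmssd = np.sqrt(np.mean(diffs ** 2))   # short-term variability
    sdsd = np.std(diffs, ddof=1)

    # Poincaré descriptors expressed through SDSD and SDNN.
    sd1 = sdsd / np.sqrt(2.0)
    # Clamp at zero: very short series can make the expression slightly negative.
    sd2 = np.sqrt(max(2.0 * sdnn ** 2 - 0.5 * sdsd ** 2, 0.0))

    return {"sdnn": sdnn, "rmssd": rmssd, "sd1": sd1, "sd2": sd2}


features = hrv_basic([800, 810, 790, 805, 795])
```

A real nocturnal recording would also need artifact correction before these formulas are applied; that step is outside this sketch.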

Z1–Z5

Heart rate zone distribution

Training sessions are partitioned into 5 HR zones using age-adjusted maximum HR thresholds. Minutes in each zone are aggregated per training day and expressed both as absolute time and as proportion of total session duration. Rest days receive zero-valued zone columns.

compute_zone_distribution.py · 10 columns · 107 training days
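The zone computation can be sketched as follows. The 220 − age estimate of maximum HR and the 50/60/70/80/90 % zone boundaries are common conventions used here as assumptions; the text does not state which formula or thresholds the pipeline applies, and `zone_columns` is a hypothetical name.

```python
import numpy as np

# Assumed lower bounds of zones 1-5 as fractions of maximum HR.
ZONE_FRACTIONS = np.array([0.5, 0.6, 0.7, 0.8, 0.9])


def zone_columns(hr_bpm, age, sample_seconds=1.0):
    """Return 10 values: minutes and share of session time for zones 1-5."""
    hr = np.asarray(hr_bpm, dtype=float)
    max_hr = 220.0 - age  # assumption: simple age-based estimate
    # digitize gives 0 below zone 1 and 1..5 for the zone each sample falls in.
    zone_idx = np.digitize(hr, max_hr * ZONE_FRACTIONS)
    session_min = hr.size * sample_seconds / 60.0

    out = {}
    for z in range(1, 6):
        minutes = np.count_nonzero(zone_idx == z) * sample_seconds / 60.0
        out[f"z{z}_min"] = minutes
        out[f"z{z}_share"] = minutes / session_min if session_min > 0 else 0.0
    return out


# Six one-minute samples for a 40-year-old; an empty list models a rest day.
session = zone_columns([100, 110, 130, 150, 170, 170], age=40, sample_seconds=60)
rest_day = zone_columns([], age=40)
```

An empty session yields all zeros, which matches the zero-valued zone columns described for rest days.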

STR

Session stratification

Each training session is classified into an intensity stratum — easy, moderate, hard or very hard — using a composite score from zone distribution, session duration and training load index. Rest days are labelled rest. The stratum column propagates through L3 and serves as a stratifier in L4 lag-feature analysis.

compute_session_strata.py · 107 sessions classified

L3 · L4

Unified daily frame · Joint with diary

One row per day, aligned across all sources

l3_unified.py outer-merges all L1 and L2 parquets on calendar date. The result is a unified DataFrame with one row per day across the full observation window. Every date present in any source is preserved; NaN marks absence from a given source — orthostatic tests appear on only 8 days, fitness tests on 11. Sparse sources do not compress the window.
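A small sketch of the outer merge on calendar date, assuming each source is already a date-indexed DataFrame with distinct column names. The function `unify_daily` and the example columns are hypothetical.

```python
from functools import reduce

import pandas as pd


def unify_daily(frames):
    """Outer-merge date-indexed frames so every date in any source is kept."""
    merged = reduce(
        lambda left, right: pd.merge(
            left, right, left_index=True, right_index=True, how="outer"
        ),
        frames,
    )
    return merged.sort_index()


sleep = pd.DataFrame(
    {"sleep_score": [78.0, 81.0, 74.0]},
    index=pd.to_datetime(["2025-08-01", "2025-08-02", "2025-08-03"]),
)
orthostatic = pd.DataFrame(
    {"ortho_hr_rise": [22.0, 19.0]},
    index=pd.to_datetime(["2025-08-02", "2025-08-05"]),
)

# Four distinct dates survive; NaN marks days a source did not cover.
unified = unify_daily([sleep, orthostatic])
```

Because the join is an outer one, the sparse orthostatic frame adds a date rather than removing any.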

L4 inner-joins the L3 frame with the subjective symptom diary, reducing to the paired days where both physiological data and a diary entry coexist. Three temporal lags (t−1, t−2, t−3) are added for each of the four primary autonomic columns, expanding the feature space for time-lagged prediction. The L4 artifact is the direct input to the L5 model training pipeline.
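The lag-and-join step might look like the sketch below. It assumes lags are computed on a continuous daily calendar before the inner join, so that t−1 always means the previous calendar day and not the previous diary day; the text does not say which order the pipeline uses. The names `add_lags`, `join_with_diary`, `rmssd` and `symptom` are hypothetical.

```python
import pandas as pd


def add_lags(frame, columns, lags=(1, 2, 3)):
    """Add <col>_lag<k> columns on a gap-free daily index."""
    full_calendar = pd.date_range(frame.index.min(), frame.index.max(), freq="D")
    out = frame.reindex(full_calendar)  # missing days become NaN rows
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out


def join_with_diary(l3, diary, lag_columns):
    """Lag first, then keep only days that also have a diary entry."""
    return add_lags(l3, lag_columns).join(diary, how="inner")


l3 = pd.DataFrame(
    {"rmssd": [40.0, 42.0, 38.0, 45.0, 41.0]},
    index=pd.date_range("2025-08-01", periods=5, freq="D"),
)
diary = pd.DataFrame(
    {"symptom": [1, 0]},
    index=pd.to_datetime(["2025-08-03", "2025-08-05"]),
)
l4 = join_with_diary(l3, diary, ["rmssd"])
```

Shifting after the join would silently turn t−1 into "previous diary entry", which can be several days earlier when the diary is sparse.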

L3 rows

243

days · outer merge · all sources

L3 columns

71

features · NaN where source absent

L4 pairs

61

diary-aligned days · inner join

Lag features

+24

t−1 · t−2 · t−3 on 4 autonomic cols

L5 · L6

Model outputs · Portfolio render

Predictor outputs and publish state

The L5 predictor layer runs forward feature selection independently per target symptom. Leave-one-out cross-validation is the primary validation strategy: every paired day is held out as the test case exactly once and used for training in every other fold. Bootstrap CI (1000×) quantifies uncertainty per target. The deployment target is autonomic dysfunction, reaching AUC 0.829 on N = 55 paired days.
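The validation scheme can be illustrated with scikit-learn. This is a sketch under assumptions: the text does not name the model family, so a logistic regression stands in, and the bootstrap here resamples the pooled leave-one-out predictions, which may differ from how the pipeline builds its intervals. The function name `loo_auc_with_bootstrap` is hypothetical, and forward feature selection is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loo_auc_with_bootstrap(X, y, n_boot=1000, seed=0):
    """Return (auc, ci_low, ci_high) from pooled leave-one-out predictions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    model = LogisticRegression(max_iter=1000)  # assumption: stand-in classifier

    # Each row is scored by a model fitted on the remaining N-1 rows.
    proba = cross_val_predict(
        model, X, y, cv=LeaveOneOut(), method="predict_proba"
    )[:, 1]
    auc = roc_auc_score(y, proba)

    rng = np.random.default_rng(seed)
    n = len(y)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y[idx])) < 2:
            continue  # AUC is undefined when a resample has one class only
        boot.append(roc_auc_score(y[idx], proba[idx]))
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
    return auc, ci_low, ci_high
```

With samples this small, the interval width is usually more informative than the point estimate, which is why the two are reported together.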

L6 assembles all pipeline outputs into polar_live.json using an atomic tempfile-rename write strategy. The JSON is consumed at runtime by the React portfolio at kineticaai.com — SSR fallback is served first; client-side hydration fills confidence intervals and feature weights after load. A separate pipeline_state.json artifact documents per-level status, metrics and last-run timestamps.
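A minimal sketch of a tempfile-rename write using only the standard library. The function name `write_json_atomic` is hypothetical; the temporary file is created in the target directory so that the final rename stays on one filesystem.

```python
import json
import os
import tempfile


def write_json_atomic(payload, path):
    """Write JSON to a temp file, then rename it over the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)     # readers see the old or the new file
    except BaseException:
        # Clean up the partial temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens the target during a publish never sees a half-written file, because the rename either has happened or has not.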

5 Targets

0.829 Headline AUC

55 Training pairs

LOO-CV Validation

1000× Bootstrap CI

Atomic Publish write

Multi-predictor infrastructure

This pipeline is the common foundation for multiple independent predictors. Each predictor consumes the L4 feature frame but selects its own feature subset, trains its own model and publishes its own output independently. Adding a new predictor requires no changes to L0–L4 — the data infrastructure is shared; the modelling layer is not. Built as infrastructure, not as a one-off preprocessing script.
