IO3 is a context engine — a persistent orchestration layer that enriches every prompt with personal RAG, clinical rules, and ethical reasoning before any LLM processes it. The model is interchangeable. IO3 is not.
Most AI assistants are prompt wrappers — you send text, you get
text back. IO3 is architecturally different. Before any LLM sees your input, the
Context Engine has already loaded your profile, retrieved relevant RAG chunks,
assembled enriched_context as a structured markdown block, and injected
clinical rules from config/clinical_rules.yaml.
The LLM receives a fully enriched prompt every time, without being asked.
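The enrichment steps above can be sketched in a few lines of Python. This is a minimal illustration, not IO3's actual implementation: the `Chunk` type, the function names, and the markdown section titles are all hypothetical. The rules are passed in as a plain list, standing in for whatever is loaded from `config/clinical_rules.yaml`.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved RAG chunk (hypothetical shape)."""
    source: str
    text: str


def assemble_enriched_context(profile: dict, chunks: list, rules: list) -> str:
    """Build a structured markdown block from profile, RAG chunks, and rules."""
    lines = ["## Profile"]
    lines += [f"- {key}: {value}" for key, value in profile.items()]
    lines += ["", "## Retrieved context"]
    lines += [f"- [{chunk.source}] {chunk.text}" for chunk in chunks]
    lines += ["", "## Clinical rules"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


def build_prompt(user_input: str, enriched_context: str) -> str:
    """Prepend the enriched context; refuse to build a prompt without it."""
    if not enriched_context.strip():
        raise ValueError("refusing to call the LLM without enriched_context")
    return f"{enriched_context}\n\n## User input\n{user_input}"


if __name__ == "__main__":
    # Illustrative data only; none of these values come from IO3.
    context = assemble_enriched_context(
        profile={"condition": "type 2 diabetes"},
        chunks=[Chunk("lab-report", "HbA1c measured at 7.2%")],
        rules=["Never recommend a dose change"],
    )
    print(build_prompt("Can I skip breakfast before my test?", context))
```

The guard in `build_prompt` mirrors the idea that the model never sees a bare user input: if the context block is empty, the call fails instead of silently degrading to a prompt wrapper.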
Both the RAG orchestration layer and the ALMA verification mechanism are independently corroborated by published research on clinical AI systems. [→ Scientific basis]
Every input passes through the Context Engine before the classifier sees it.
No LLM call happens without enriched_context. No output leaves without ALMA evaluation.
ALMA operates in two layers. L1 injects axioms as pre-generation context (zero additional tokens after prompt caching). L2 evaluates each output post-generation with a deterministic pipeline, before it reaches the patient.
The 30-case evaluation set covers four risk categories (pharmacological risk, diagnostic overreach, false urgency, and scope violation) at two severity levels (high and medium), plus 10 negative cases expected to pass cleanly. Cases were written to mirror real agent outputs in chronic care contexts, not synthetic toy examples.
| Metric | Value | Note |
|---|---|---|
| Overall accuracy | 1.00 | no missed high-risk cases |
| High severity — recall | 1.00 | all high-risk outputs flagged |
| High severity — F1 | 0.69 | medium cases escalated to high |
| Regex detection (19 cases) | 0.032 ms mean | deterministic, zero tokens |
| Embedding detection (1 case) | 24.6 ms mean | cosine similarity ≥ 0.96 |
| Gray-zone flags (10 cases) | 30.4 ms mean | clinician decides, not ALMA |
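The table reports recall and F1 for the high-severity class but not precision. Assuming the standard definition F1 = 2PR / (P + R), the implied precision can be recovered from the two reported values. The helper below is a hypothetical illustration, and because it works from rounded figures the result is approximate.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Invert F1 = 2PR / (P + R) to recover precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent with a valid precision")
    return f1 * recall / denominator


# Values taken from the high-severity rows of the table above.
precision = precision_from_f1(f1=0.69, recall=1.00)
print(f"{precision:.3f}")  # prints 0.527
```

A precision of roughly 0.53 means about half of the outputs labelled high severity were high severity in the ground truth, which is consistent with the table's note that medium cases were escalated to high.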
ALMA is intentionally deterministic-first: regex patterns cover 19 of the 30 test cases with sub-millisecond latency and no LLM dependency. Embedding similarity handles semantic variants. Gray-zone cases (cosine 0.86–0.89) are never auto-blocked; they are surfaced to the clinician via the LangGraph interrupt, preserving human-on-the-loop control. An LLM is used only as an auditor after decisions are made, not as the decision engine.
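A deterministic-first evaluation order of this kind might look like the sketch below. Everything here is an assumption made for illustration: the regex patterns, the `Verdict` names, and the `evaluate` signature are hypothetical and are not ALMA's actual rules. In particular, the sketch treats every score from 0.86 up to the 0.96 threshold as gray zone, whereas the source only reports observed gray-zone scores of 0.86–0.89. The LangGraph interrupt itself is not shown; the `CLINICIAN_REVIEW` verdict marks where it would fire.

```python
import math
import re
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    CLINICIAN_REVIEW = "clinician_review"


# Hypothetical patterns, for illustration only.
RISK_PATTERNS = [
    re.compile(r"\b(stop|double|skip)\s+(taking\s+)?your\s+(dose|medication)\b", re.I),
    re.compile(r"\byou\s+definitely\s+have\b", re.I),
]

BLOCK_THRESHOLD = 0.96  # embedding match threshold from the table
GRAY_ZONE_FLOOR = 0.86  # lower bound of the reported gray zone


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(output_text, output_vec, risk_vecs):
    """Regex first, then embedding similarity; gray zone goes to a human."""
    # Step 1: regex, deterministic and token-free.
    if any(pattern.search(output_text) for pattern in RISK_PATTERNS):
        return Verdict.BLOCK
    # Step 2: similarity against known-risky reference embeddings.
    best = max((cosine(output_vec, vec) for vec in risk_vecs), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return Verdict.BLOCK
    # Step 3 (assumption): anything from the floor up to the threshold
    # is never auto-blocked and is surfaced to the clinician instead.
    if best >= GRAY_ZONE_FLOOR:
        return Verdict.CLINICIAN_REVIEW
    return Verdict.PASS
```

Ordering the checks from cheapest to most expensive means the embedding comparison runs only when no regex pattern has already matched, which fits the latency split reported in the table.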
IO and ALMA are grounded in two bodies of published evidence: research on retrieval-augmented generation in clinical settings, and empirical studies on why a post-response safety layer is not optional when LLMs operate in healthcare contexts.
[1] Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Singh A et al. · arXiv · 2025 · https://hf.co/papers/2501.09136
Systematic survey of agentic RAG architectures — autonomous agents with dynamic retrieval, planning, and tool use — covering healthcare as a primary application domain. IO's LangGraph + ChromaDB + intent-classification loop is an instance of this architectural pattern.
[2] Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support
Gomez-Cabello CA et al. · Bioengineering · 2025 · https://doi.org/10.3390/bioengineering12111194
Adaptive chunking in clinical RAG pipelines raises accuracy from 50% to 87% vs fixed-length baselines — evidence that the knowledge segmentation strategy inside IO's ChromaDB layer (1,880 audited chunks) is a first-order safety variable, not an implementation detail.
[3] Cardia-AI: Passive Cardiac Event Monitoring Using Smartwatch Sensors and Predictive Analysis via Large Language Models
Momin EA & Mansoor H · Cureus · 2025 · https://doi.org/10.7759/cureus.95083
Published proof-of-concept of the same architectural pattern as IO: wearable sensor stream + longitudinal health record + domain-tuned LLM with guardrails + explicit escalation guidance. Validates feasibility; IO extends the pattern with local-first deployment and autonomous multi-cycle reasoning.
[4] LLMs Can Do Medical Harm: Stress-Testing Clinical Decisions Under Social Pressure
Omar M et al. · medRxiv · 2025 · https://doi.org/10.1101/2025.11.25.25340972
Across 10 million clinical scenarios, LLMs produced harmful outputs in 11.7% of cases; social-pressure framing increased this to 16.6%. A brief safety reminder reduced harm but did not eliminate it. ALMA exists because this residual harm rate is unacceptable in chronic care — and because prompt-level mitigations alone are insufficient.
[5] MATRIX: Multi-Agent Simulation Framework for Safe Interactions and Contextual Clinical Conversational Evaluation
Lim E et al. · arXiv · 2025 · https://hf.co/papers/2508.19163
Safety-oriented taxonomy for clinical dialogue evaluation combining safety engineering methodology, an LLM-based evaluator, and a simulated patient agent — the closest published framework to ALMA's 30-case evaluation set and per-severity metrics. ALMA differs in being a deterministic pre-publish gate rather than a post-hoc benchmark.
[6] A Field Guide to Deploying AI Agents in Clinical Practice
Gallifant J et al. · arXiv · 2025 · https://hf.co/papers/2509.26153
Sociotechnical framework for clinical AI agent deployment covering governance, system drift, and integration with EHR workflows. IO's roadmap (versioned milestones, audit logs, human-on-the-loop design) directly implements the deployment posture this paper recommends.