
Current Status
An honest accounting of what THEOS is and is not, as of February 2026.
This page is the website-facing summary of research/VALIDATED_FINDINGS.md — the canonical lab-quality truth document.
What Is Complete
| Component | Status |
|---|---|
| I→A→D→I wringer loop | Complete — code/theos_core.py |
| Per-engine self-reflection | Complete — engine_reflection_depth parameter |
| Dual-clock governor | Complete — 5 halt conditions, full audit trail |
| Wisdom accumulation and retrieval | Complete — JSON persistence, similarity threshold |
| LLM adapter layer | Complete — Claude, GPT-4, mock |
| MCP server (Claude Desktop) | Complete — code/theos_mcp_server.py |
| Medical, financial, AI safety examples | Complete — examples/ |
| Test suite | 71 tests passing |
| PyPI package | Published — pip install theos-reasoning |
| Patent | Filed — USPTO provisional #63/831,738 |
What Is Supported by Evidence
Two-pass reasoning finds structure single-pass misses
Supported by two documented examples:
- Egotism vs. arrogance — single-pass gives a spectrum model; THEOS gives a 2×2 matrix, revealing that a humble person can be contemptuous (a case the spectrum model cannot explain).
- π — single-pass gives “appears in many fields”; THEOS gives “quantifies the relationship between linear and cyclical phenomena” — a deeper organizing principle.
Wisdom cache works
Measured — 402,000× speedup on repeat queries. This is caching, not reasoning acceleration.
Governor is correctly implemented
Verified — 35 passing unit tests on the governor alone.
What Is Not Yet Established
THEOS consistently outperforms single-pass on insight metrics
Not yet tested. The Insight Detection Rubric experiment (experiments/insight_experiment.py) is ready to run. We need 30+ questions rated by human raters before this claim is defensible.
THEOS outperforms chain-of-thought prompting
Not yet tested. The B condition (single-prompt I→A→D structure) controls for this. If THEOS does not beat B, the improvement may come from structured prompting alone, not from multi-pass iteration.
The native architecture reduces cost
Theoretical. The overlay architecture (current implementation) costs 12–20× a single-pass query. A native architecture using KV cache reuse is projected to cost ~0.5× — but this has not been built or measured.
What Was Measured and Documented Honestly
The February 2026 auto-scoring experiment (30 questions × 3 conditions, scored by Claude on accuracy/depth/utility/coherence/coverage) produced:
THEOS 9.77/15 — Single-pass 14.03/15 — Cohen’s d = −3.46 (large reversed effect)
This is a real result. It is also a result produced by the wrong instrument — standard metrics reward confident completeness, and THEOS is not designed to produce confident completeness. It is designed to produce dialectical tension, structural discovery, and productive disagreement.
The larger the negative result on standard metrics, the stronger the evidence that THEOS is doing something those metrics cannot measure.
→ Full explanation: Why Normal Metrics Fail for THEOS
What Was Corrected
| Error | Correction |
|---|---|
| “5000% faster” performance chart | Actual benchmark: 38× slower on first pass (expected) |
| Jaccard similarity for convergence | Should be semantic cosine similarity |
| Three conflicting governor implementations | Unified into one canonical governor |
| “95% decision accuracy”, “1ms speed”, “<1% hallucination” | No source — quarantined in archive |
| “E = AI²”, “π ≡ ∞” | Metaphor / incorrect — not mathematical claims |
The Next Step
Run the Insight Detection Experiment with human raters.
Everything else — expanded research papers, further documentation, native architecture implementation — should follow after this experiment produces data.
The experiment is ready:
git clone https://github.com/Frederick-Stalnecker/THEOS.git
cd THEOS
export ANTHROPIC_API_KEY=sk-ant-...
python3 experiments/insight_experiment.py --backend anthropic --questions 10
Rate the outputs using the Insight Detection Rubric. If THEOS outperforms single-pass at p < 0.05, the core thesis is supported. If it does not, that result will be reported.
→ Full experiment design and how to contribute results
This page is updated as experiments are run and results are confirmed. The foundation is small and honest. It is being built correctly.
Celeste, working under authority of Frederick Davis Stalnecker — 2026