THEOS

The Experiment

What We Are Testing

THEOS’s core claim is:

Dialectical two-engine reasoning finds structure that a single inference step cannot find — systematically, not occasionally.

The egotism/arrogance case demonstrates this once. The experiment is designed to test whether it holds at scale.

Why the First Experiment Was Wrong

The February 2026 auto-scoring experiment (30 questions, scored by Claude on accuracy/depth/utility/coherence/coverage) produced a reversed result: THEOS 9.77/15, single-pass 14.03/15.

This was a measurement failure, not a THEOS failure.

The rubric rewarded confident completeness. THEOS produces dialectical tension and structural discovery. The instrument did not match the phenomenon.

→ Full explanation: Why Normal Metrics Fail for THEOS

The Insight Detection Rubric (IDR)

The new rubric measures what THEOS claims to produce:

Dimension	The question it answers
Revelation (0–3)	After reading this answer, do you understand something you genuinely did not before?
Structural Discovery (0–3)	Did it find dimensions or axes hidden in the question?
Productive Tension (0–3)	Does it hold two perspectives in tension and produce something neither alone could?
Consequence Derivation (0–3)	Does it derive non-trivial consequences that follow from the structure found?
Question Interrogation (0–3)	Does it name a hidden assumption that changes the answer?

Maximum: 15 points. Same scale as before, fundamentally different dimensions.
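As a rough illustration of how one IDR rating could be recorded and totalled, here is a minimal sketch. The class name `IDRScore` and its field names are hypothetical, chosen to mirror the five dimensions above; they are not taken from the THEOS codebase.

```python
from dataclasses import dataclass, fields

MAX_PER_DIMENSION = 3  # each IDR dimension is rated 0-3


@dataclass(frozen=True)
class IDRScore:
    """One rater's IDR scores for one answer (hypothetical structure)."""
    revelation: int
    structural_discovery: int
    productive_tension: int
    consequence_derivation: int
    question_interrogation: int

    def __post_init__(self):
        # Reject anything that is not an integer in the 0-3 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(
                    f"{f.name} must be an integer from 0 to "
                    f"{MAX_PER_DIMENSION}, got {value!r}"
                )

    @property
    def total(self) -> int:
        # Sum of the five dimensions; the maximum is 5 * 3 = 15.
        return sum(getattr(self, f.name) for f in fields(self))
```

Validating at construction time means a mistyped rating (for example a 4) is caught when it is entered rather than surfacing later as an impossible total.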

The New Experiment Design

Three conditions, 30 questions each:

Label	Condition	LLM Calls	Purpose
A — SP	Direct single-pass answer	1	Baseline
B — IAD-P	Single prompt explicitly asking for I→A→D structure	1	Isolates structure from iteration
C — THEOS	Full two-engine multi-pass I→A→D	6–12	Full framework

The B condition is new. It keeps the number of LLM calls at one, the same as condition A, while asking the same underlying model to follow the I→A→D structure in a single structured prompt. Comparing A with B isolates the effect of structure alone; comparing B with C isolates the effect of multi-pass iteration.

Four possible outcomes, all informative:

A < B < C: Structure helps; iteration adds more. Full THEOS thesis supported.
A < B ≈ C: Structure alone is sufficient; multi-pass overhead may not be needed.
A ≈ B < C: Multi-pass iteration drives discovery; structure alone doesn’t help.
A ≈ B ≈ C: Null result; re-examine question types.
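The four outcomes above can be sketched as a simple classification over the mean IDR score of each condition. This is an illustration only: the function name `classify_outcome` and the `tolerance` used to decide what counts as "≈" are assumptions, not part of the experiment framework. A real analysis would likely replace the fixed tolerance with a statistical test.

```python
def classify_outcome(mean_a, mean_b, mean_c, tolerance=0.5):
    """Map three mean IDR scores onto one of the four listed outcomes.

    `tolerance` is a hypothetical threshold: two means closer than this
    are treated as approximately equal.
    """
    def approx(x, y):
        return abs(x - y) <= tolerance

    a_b_equal = approx(mean_a, mean_b)
    b_c_equal = approx(mean_b, mean_c)

    if a_b_equal and b_c_equal:
        return "A ≈ B ≈ C"
    if a_b_equal and mean_c > mean_b:
        return "A ≈ B < C"
    if mean_b > mean_a and b_c_equal:
        return "A < B ≈ C"
    if mean_b > mean_a and mean_c > mean_b:
        return "A < B < C"
    # Any ordering outside the four listed, e.g. B or C scoring below A.
    return "other"
```

Note that approximate equality is not transitive: with a tolerance of 0.5, A and C can differ by up to 1.0 and still be classified as "A ≈ B ≈ C". Orderings not covered by the four outcomes are returned as "other" rather than forced into one of them.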

How to Contribute Results

The experiment framework is ready. We need contributors to do any of the following:

Option 1 — Run the experiment yourself:

git clone https://github.com/Frederick-Stalnecker/THEOS.git
cd THEOS
export ANTHROPIC_API_KEY=sk-ant-...
python3 experiments/insight_experiment.py --backend anthropic --questions 10

Rate the answers using the Insight Detection Rubric and open an issue with your results.

Option 2 — Rate existing answers: The 30-question answer set (SP / CoT / THEOS) from February 2026 is in: experiments/results/run_20260226_194058.json

Rate them using the IDR rubric (blind — you don’t see which condition produced which answer until after rating). Open an issue with your ratings.
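One way to set up blind rating is to shuffle the answers, replace each condition label with an opaque ID, and keep the ID-to-condition key in a separate place until rating is finished. The sketch below assumes each answer is a dict with `question_id`, `condition`, and `text` fields. That layout is hypothetical and is not the schema of the results file, which would need to be adapted to.

```python
import random


def blind_answers(answers, seed=None):
    """Return (blinded, key): answers without condition labels, plus the
    mapping needed to unblind them after rating.

    `answers` is a list of dicts with 'question_id', 'condition' and
    'text' keys (an assumed layout, for illustration only).
    """
    rng = random.Random(seed)
    shuffled = list(answers)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)

    blinded = []
    key = {}
    for index, answer in enumerate(shuffled):
        blind_id = f"R{index:03d}"
        blinded.append({
            "blind_id": blind_id,
            "question_id": answer["question_id"],
            "text": answer["text"],
        })
        key[blind_id] = answer["condition"]  # store separately from the rater
    return blinded, key


# Made-up sample data in the assumed layout.
sample = [
    {"question_id": 1, "condition": "SP", "text": "answer one"},
    {"question_id": 1, "condition": "THEOS", "text": "answer two"},
    {"question_id": 2, "condition": "SP", "text": "answer three"},
]
blinded_sample, key_sample = blind_answers(sample, seed=0)
```

The rater works only from `blinded_sample`; `key_sample` is consulted after all scores are recorded. Answer style may still hint at the condition, so blinding of this kind reduces bias but does not guarantee it is eliminated.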

Option 3 — Run on your domain: Have a domain where dialectical reasoning matters? (Law, medicine, strategy, ethics, research?) Run the experiment on your own questions and report what you find.

Current Status

Item	Status
30-question answer set generated	✓ Complete
Auto-scored (wrong instrument)	✓ Complete — negative result on wrong metric
Human rating with IDR	Pending
Insight Detection Experiment run	Ready to run
Statistical analysis	Pending first IDR data

The experiment is honest: we will report the IDR result whichever way it falls. If THEOS does not outperform single-pass, we will report that. If it does, we will report that. We are here to find out.
