This document describes the evaluation layer for the photosynthesis tutoring simulation (real_run3.py): what each metric means, how it is computed from the per-turn log, and how to use the outputs for comparisons (e.g. policies on vs. off, learner archetypes).
For equations, policy priority \(\rho\), winner indicators \(\omega\), decay formulas, and turn pseudocode aligned with the same script, see Formal mechanics (real_run3).
Each entry appended in run_turn() is a dictionary with (among others):

| Field | Meaning |
|---|---|
| turn | 0-based turn index |
| condition | Run id from SimulationRunConfig.name (e.g. full_dialograph, llm_baseline) |
| node | Active semantic node id, or None for the LLM-only baseline (no graph) |
| learner_correct | Simulated learner correctness; always set for graph runs and for the LLM baseline |
| confidence | Learner confidence after the turn |
| action | Chosen tutor action (advance, give_hint, review, practice, explain, …) |
| policy | Dialograph policy name, KT_heuristic / KT_BKT / KT_DKT_style, LLM_tutor_baseline, or None (no-policy ablation) |
| kt_mastery | Latent belief scalar after the turn for any KT controller (SimpleKT, BKT, DKTStyle); else null |
| temporal_on | Whether retention / memory-strength updates were enabled for this condition |
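To make the schema concrete, a single graph-on log entry might look like this (field names follow the table above; the values are hypothetical, not taken from a real run):

```python
# Hypothetical example of one per-turn log entry. Field names match the
# table above; the values shown are illustrative, not from a real run.
example_entry = {
    "turn": 4,                       # 0-based turn index
    "condition": "full_dialograph",  # SimulationRunConfig.name
    "node": "light_reactions",       # active semantic node id (None for LLM baseline)
    "learner_correct": True,         # simulated learner correctness
    "confidence": 0.72,              # learner confidence after the turn
    "action": "advance",             # chosen tutor action
    "policy": "KT_BKT",              # decision-layer identifier
    "kt_mastery": 0.64,              # KT controller belief scalar (null if no KT)
    "temporal_on": True,             # retention / memory-strength updates enabled
}
```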
Metrics are pure functions of this log; they do not call the LLM.
Runs are configured with SimulationRunConfig and run_simulation(learner, config, ...). The default batch matches a typical reviewer-facing grid:
| Config name | Graph | Temporal (retention + memory strength) | Decision layer |
|---|---|---|---|
| full_dialograph | Full multinode | On | Dialograph policies |
| no_policy | Full multinode | On | Naive: advance if correct, else hint |
| no_temporal | Full multinode | Off | Dialograph policies |
| single_node | One node only (no navigation) | On | Dialograph policies |
| llm_baseline | Off | Off (no graph memory) | Scripted tutor: llm_baseline_decision (hint / ask / explain) |
| kt_heuristic_baseline | Full multinode | On | SimpleKT — heuristic scalar mastery (legacy “toy” KT for contrast) |
| kt_bkt_baseline | Full multinode | On | BKT — one-skill Bayesian Knowledge Tracing (fixed \(p_{\mathrm{learn}}, p_{\mathrm{slip}}, p_{\mathrm{guess}}\)) |
| kt_dkt_style_baseline | Full multinode | On | DKTStyle — untrained scalar latent with gated updates (DKT-style, not a trained LSTM) |
How to frame comparisons (paper): prefer cautious language, e.g. that Dialograph shows lower premature advancement or higher stability under simulation, rather than absolute claims such as “outperforms all baselines” unless backed by statistics.
| Class | policy_mode | Role |
|---|---|---|
| SimpleKT | kt | Heuristic \(m\): cheap control path for ablations. |
| BKT | kt_bkt | Standard BKT equations (Bayesian observe + learn); parameters not fitted to logs. |
| DKTStyle | kt_dkt_style | Scalar \(h\) with LSTM-like gated nudges; not trained on sequences. |
Each exposes .belief (logged as kt_mastery) and decide() → review / practice / advance with thresholds documented in Formal mechanics.
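The BKT controller's observe-then-learn step is the standard one-skill update, which can be sketched as follows (these are the textbook BKT equations; the parameter defaults here are illustrative, not the script's fixed values):

```python
# Standard one-skill BKT update: Bayesian observe step, then learn transition.
# Parameter values are illustrative defaults, not the script's fixed settings.
def bkt_update(p_mastery: float, correct: bool,
               p_learn: float = 0.2, p_slip: float = 0.1,
               p_guess: float = 0.2) -> float:
    if correct:
        num = p_mastery * (1 - p_slip)              # mastered and did not slip
        den = num + (1 - p_mastery) * p_guess       # ... or unmastered and guessed
    else:
        num = p_mastery * p_slip                    # mastered but slipped
        den = num + (1 - p_mastery) * (1 - p_guess) # ... or unmastered, no lucky guess
    posterior = num / den                           # belief after the observation
    return posterior + (1 - posterior) * p_learn    # chance of learning this step

# A correct answer raises the belief; an incorrect one lowers it.
b = 0.3
b_correct = bkt_update(b, True)
b_incorrect = bkt_update(b, False)
```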
mean_kt_mastery — in compute_metrics, the mean of per-turn kt_mastery (the controller’s belief scalar) when present; otherwise null. See also Paper: KT positioning & claims.
DialographAgentLLM uses OpenRouter’s OpenAI-compatible API (https://openrouter.ai/api/v1).
| Environment variable | Purpose |
|---|---|
| OPENROUTER_API_KEY | Required for live runs |
| OPENROUTER_MODEL | Optional; defaults to openai/gpt-4o-mini. Use any OpenRouter model id (e.g. anthropic/claude-3.5-haiku, meta-llama/llama-3.3-70b-instruct). |
| OPENROUTER_HTTP_REFERER | Optional site URL for OpenRouter rankings |
| OPENROUTER_APP_TITLE | Optional app name (default Dialograph real_run3) |
You can also pass model_name= or api_key= when constructing DialographAgentLLM.
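As a sketch of how those variables resolve to client settings (the actual wiring lives inside DialographAgentLLM; the helper name below is hypothetical, and the HTTP-Referer / X-Title header names follow OpenRouter's conventions):

```python
import os

# Hypothetical helper sketching the precedence described in the table above;
# the real resolution happens inside DialographAgentLLM.
def resolve_openrouter_settings(env=os.environ):
    settings = {
        "base_url": "https://openrouter.ai/api/v1",
        "api_key": env.get("OPENROUTER_API_KEY"),  # required for live runs
        "model": env.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
        "headers": {},
    }
    if env.get("OPENROUTER_HTTP_REFERER"):
        # Optional site URL used by OpenRouter for rankings
        settings["headers"]["HTTP-Referer"] = env["OPENROUTER_HTTP_REFERER"]
    settings["headers"]["X-Title"] = env.get(
        "OPENROUTER_APP_TITLE", "Dialograph real_run3"
    )
    return settings
```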
run_simulation(..., turns=...) defaults to DEFAULT_SIMULATION_TURNS (50) so curves and stability estimates have enough length for analysis. Override turns= for shorter smoke tests or longer runs.
- learning_curve — for each turn where learner_correct is not None, appends cumulative accuracy correct_so_far / (i + 1), where i is the index in the full log.
- sliding_accuracy_k3 — at each index i, accuracy over the last k turns (default k = 3) among entries with non-null learner_correct; entries with no scorable steps in the window are null in JSON.
- concept_stability_by_node and concept_stability_summary — per node id, the fraction of turns (with non-null learner_correct) where the learner was correct; the summary reports mean, min, and max over nodes with at least one scored turn.
- time_to_mastery_turn — the first index i such that confidence is at least MASTERY_CONFIDENCE_THRESHOLD (0.8) for MASTERY_STREAK_LEN (3) consecutive graph-on rows (rows with non-null confidence); if never satisfied, null in JSON.
- premature_advancement_rate — among turns with action == "advance", the fraction where confidence is missing or strictly below PREMATURE_ADVANCE_CONFIDENCE (0.7).
- intervention_effectiveness — considers turns whose action is one of ask_why, give_hint, challenge, restructure (see INTERVENTION_ACTIONS in code); for each such turn where both the current and the next turn have non-null confidence, count whether next-turn confidence strictly increases, and report successes / eligible interventions, or 0.0 if there were no eligible pairs.

These remain useful for tables and baselines:
- mean_confidence, mean_retention, mean_memory_strength
- mastery_rate — fraction of turns with non-null confidence ≥ 0.8
- advance_rate — fraction of all turns with action == "advance"
- error_rate — fraction of turns with non-null learner_correct that are false; null if no such rows

mean_retention and mean_memory_strength average only turns where those fields are present (graph-on with temporal state). For LLM-only baseline rows, they are null in the log, so the corresponding means are null even when mean_confidence exists.
When the log has no rows with confidence (valid empty), legacy means are written as null; time-series and rates still return where defined.
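Two of the metrics above can be re-derived from a raw log list in a few lines. This is a simplified sketch mirroring the definitions; compute_metrics in real_run3.py remains the source of truth:

```python
# Simplified re-implementations of two metrics as pure functions of the log,
# mirroring the definitions above (compute_metrics is the source of truth).
PREMATURE_ADVANCE_CONFIDENCE = 0.7  # documented threshold

def premature_advancement_rate(log):
    """Fraction of "advance" turns with missing or low confidence."""
    advances = [e for e in log if e.get("action") == "advance"]
    if not advances:
        return 0.0
    low = sum(
        1 for e in advances
        if e.get("confidence") is None
        or e["confidence"] < PREMATURE_ADVANCE_CONFIDENCE
    )
    return low / len(advances)

def sliding_accuracy(log, k=3):
    """Accuracy over the last k turns; None where the window has no scorable steps."""
    out = []
    for i in range(len(log)):
        window = [
            e["learner_correct"]
            for e in log[max(0, i - k + 1): i + 1]
            if e.get("learner_correct") is not None
        ]
        out.append(sum(window) / len(window) if window else None)
    return out
```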
MisconceptionLearner was previously always incorrect, which made learning and policy effects hard to observe. The updated learner samples correctness with probability min(0.1 + 0.1 * activation_count, 0.6) using a reproducible RNG (rng_seed optional; default derived from the learner name). That yields improvement with practice while staying mistake-prone early on.
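The correctness model reduces to a couple of lines. This sketch shows the documented probability and sampling rule; the function names are illustrative, not the class's actual methods:

```python
import random

# Sketch of the updated learner's correctness model: probability grows with
# practice and is capped at 0.6. Function names are illustrative only.
def correctness_probability(activation_count: int) -> float:
    return min(0.1 + 0.1 * activation_count, 0.6)

def sample_correct(activation_count: int, rng: random.Random) -> bool:
    # Reproducible draw given a seeded RNG
    return rng.random() < correctness_probability(activation_count)
```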
Running real_run3.py as a script writes JSON under simulation_logs/:
- *_log.json — full trace
- *_metrics.json — output of compute_metrics(log), including arrays for curves

For publication-style claims, prefer comparing premature advancement, concept stability summary, time to mastery, and intervention effectiveness across conditions rather than only scalar means over short runs.
Defined at module top in real_run3.py:
| Constant | Role |
|---|---|
| DEFAULT_SIMULATION_TURNS | Default episode length |
| OPENROUTER_BASE_URL | OpenRouter API base (override only if needed) |
| DEFAULT_OPENROUTER_MODEL | Fallback model id if OPENROUTER_MODEL is unset |
| INTERVENTION_ACTIONS | Actions counted for intervention effectiveness |
| MASTERY_CONFIDENCE_THRESHOLD / MASTERY_STREAK_LEN | Time-to-mastery criterion |
| PREMATURE_ADVANCE_CONFIDENCE | Threshold for “low confidence” advances |
| SLIDING_ACCURACY_WINDOW | Window size k for sliding accuracy |
Adjust these in one place so sweeps and ablations stay consistent.