dialograph

Simulation metrics (real_run3.py)

This document describes the evaluation layer for the photosynthesis tutoring simulation: what each metric means, how it is computed from the per-turn log, and how to use outputs for comparisons (e.g. policies on vs off, learner archetypes).

For the equations (policy priority ρ, winner indicators ω, decay formulas) and the turn pseudocode aligned with the same script, see Formal mechanics (real_run3).

Per-turn log schema

Each entry appended in run_turn() is a dictionary with (among others):

| Field | Meaning |
| --- | --- |
| turn | 0-based turn index |
| condition | Run id from SimulationRunConfig.name (e.g. full_dialograph, llm_baseline) |
| node | Active semantic node id, or None for the LLM-only baseline (no graph) |
| learner_correct | Simulated learner correctness; always set for graph runs and for the LLM baseline |
| confidence | Learner confidence after the turn |
| action | Chosen tutor action (advance, give_hint, review, practice, explain, …) |
| policy | Dialograph policy name (KT_heuristic / KT_BKT / KT_DKT_style), LLM_tutor_baseline, or None (no-policy ablation) |
| kt_mastery | Latent belief scalar after the turn for any KT controller (SimpleKT, BKT, DKTStyle); else null |
| temporal_on | Whether retention / memory-strength updates were enabled for this condition |

Metrics are pure functions of this log; they do not call the LLM.
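As an illustrative sketch of that property, a per-condition accuracy can be computed directly from the logged dictionaries; compute_metrics in real_run3.py is the authoritative implementation, and accuracy_by_condition below is a hypothetical helper, not part of the script:

```python
from collections import defaultdict

def accuracy_by_condition(log):
    """Fraction of correct learner turns per condition, from the per-turn log.

    Operates only on the logged dictionaries and never calls the LLM,
    mirroring how the real metrics are pure functions of the log.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for entry in log:
        if entry.get("learner_correct") is None:
            continue  # skip turns with no graded answer
        totals[entry["condition"]] += 1
        correct[entry["condition"]] += int(entry["learner_correct"])
    return {c: correct[c] / totals[c] for c in totals}

log = [
    {"turn": 0, "condition": "full_dialograph", "learner_correct": True},
    {"turn": 1, "condition": "full_dialograph", "learner_correct": False},
    {"turn": 0, "condition": "llm_baseline", "learner_correct": True},
]
print(accuracy_by_condition(log))  # {'full_dialograph': 0.5, 'llm_baseline': 1.0}
```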

Conditions, ablations, and external baselines

Runs are configured with SimulationRunConfig and run_simulation(learner, config, ...). The default batch matches a typical reviewer-facing grid:

| Config name | Graph | Temporal (retention + memory strength) | Decision layer |
| --- | --- | --- | --- |
| full_dialograph | Full multinode | On | Dialograph policies |
| no_policy | Full multinode | On | Naive: advance if correct, else hint |
| no_temporal | Full multinode | Off | Dialograph policies |
| single_node | One node only (no navigation) | On | Dialograph policies |
| llm_baseline | Off | Off (no graph memory) | Scripted tutor: llm_baseline_decision (hint / ask / explain) |
| kt_heuristic_baseline | Full multinode | On | SimpleKT: heuristic scalar mastery (legacy “toy” KT for contrast) |
| kt_bkt_baseline | Full multinode | On | BKT: one-skill Bayesian Knowledge Tracing (fixed p_learn, p_slip, p_guess) |
| kt_dkt_style_baseline | Full multinode | On | DKTStyle: untrained scalar latent with gated updates (DKT-style, not a trained LSTM) |

How to frame comparisons (paper):

Prefer cautious language, e.g. that Dialograph shows lower premature advancement or higher stability under simulation, rather than an absolute “outperforms all baselines” claim, unless that claim is backed by statistics.

Knowledge-tracing controllers (code reference)

| Class | policy_mode | Role |
| --- | --- | --- |
| SimpleKT | kt | Heuristic mastery scalar m: cheap control path for ablations |
| BKT | kt_bkt | Standard BKT equations (Bayesian observe + learn); parameters not fitted to logs |
| DKTStyle | kt_dkt_style | Scalar latent h with LSTM-like gated nudges; not trained on sequences |

Each exposes .belief (logged as kt_mastery) and a decide() method that returns review, practice, or advance, with thresholds documented in Formal mechanics.
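The BKT controller is described as using the standard equations, which can be sketched in a few lines. This is the textbook observe-then-learn update, not the script's exact code, and the parameter values below are illustrative rather than the fixed ones used in real_run3.py:

```python
def bkt_update(belief, correct, p_learn=0.1, p_slip=0.1, p_guess=0.2):
    """One standard BKT step: Bayesian observation update, then learning transition."""
    if correct:
        # P(known | correct) via Bayes: known learners slip, unknown learners guess
        num = belief * (1.0 - p_slip)
        den = num + (1.0 - belief) * p_guess
    else:
        # P(known | incorrect)
        num = belief * p_slip
        den = num + (1.0 - belief) * (1.0 - p_guess)
    posterior = num / den
    # Learning transition: some probability of acquiring the skill this step
    return posterior + (1.0 - posterior) * p_learn

belief = 0.3
for obs in [True, True, False, True]:
    belief = bkt_update(belief, obs)
```

Correct observations raise the belief and incorrect ones lower it, while the learning transition keeps the belief drifting upward with practice.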

mean_kt_mastery

In compute_metrics, mean of per-turn kt_mastery (the controller’s belief scalar) when present; otherwise null. See also Paper: KT positioning & claims.
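A minimal sketch of that behaviour, assuming only the logged field names described above (the real implementation lives in compute_metrics):

```python
def mean_kt_mastery(log):
    """Mean of per-turn kt_mastery where present; None when no KT controller ran."""
    values = [e["kt_mastery"] for e in log if e.get("kt_mastery") is not None]
    return sum(values) / len(values) if values else None

# Null-valued rows (e.g. the LLM-only baseline) are simply excluded from the mean.
mastery = mean_kt_mastery([
    {"kt_mastery": 0.2},
    {"kt_mastery": 0.4},
    {"kt_mastery": None},
])
```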

LLM backend (OpenRouter)

DialographAgentLLM uses OpenRouter’s OpenAI-compatible API (https://openrouter.ai/api/v1).

| Environment variable | Purpose |
| --- | --- |
| OPENROUTER_API_KEY | Required for live runs |
| OPENROUTER_MODEL | Optional; defaults to openai/gpt-4o-mini. Use any OpenRouter model id (e.g. anthropic/claude-3.5-haiku, meta-llama/llama-3.3-70b-instruct) |
| OPENROUTER_HTTP_REFERER | Optional site URL for OpenRouter rankings |
| OPENROUTER_APP_TITLE | Optional app name (default Dialograph real_run3) |

You can also pass model_name= or api_key= when constructing DialographAgentLLM.
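The precedence (explicit constructor arguments win over environment variables, which win over defaults) can be sketched as follows; resolve_openrouter_config is a hypothetical helper for illustration, not a function in real_run3.py:

```python
import os

DEFAULT_OPENROUTER_MODEL = "openai/gpt-4o-mini"
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

def resolve_openrouter_config(model_name=None, api_key=None):
    """Explicit arguments first, then environment variables, then defaults."""
    return {
        "api_key": api_key or os.environ.get("OPENROUTER_API_KEY"),
        "model": model_name or os.environ.get("OPENROUTER_MODEL", DEFAULT_OPENROUTER_MODEL),
        "base_url": OPENROUTER_BASE_URL,
    }

# Passing model_name= overrides OPENROUTER_MODEL, mirroring the constructor args.
cfg = resolve_openrouter_config(model_name="anthropic/claude-3.5-haiku")
```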

Simulation horizon

run_simulation(..., turns=...) defaults to DEFAULT_SIMULATION_TURNS (50) so curves and stability estimates have enough length for analysis. Override turns= for shorter smoke tests or longer runs.

Time-aware metrics

- learning_curve
- sliding_accuracy_k3
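A sketch of the sliding-window accuracy, assuming k = 3 (per the metric name and SLIDING_ACCURACY_WINDOW) and the logged learner_correct field; the exact handling of ungraded turns in real_run3.py may differ:

```python
def sliding_accuracy(log, k=3):
    """Windowed accuracy per graded turn: mean correctness over the
    current turn and up to k-1 preceding graded turns."""
    outcomes = [int(e["learner_correct"]) for e in log
                if e.get("learner_correct") is not None]
    return [sum(outcomes[max(0, i - k + 1): i + 1]) / min(i + 1, k)
            for i in range(len(outcomes))]

# Early windows are shorter, so the curve is defined from the first turn.
curve = sliding_accuracy([{"learner_correct": c}
                          for c in [False, True, True, True]])
```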

Paper-oriented metrics

- concept_stability_by_node and concept_stability_summary
- time_to_mastery_turn
- premature_advancement_rate
- intervention_effectiveness
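As one example, premature_advancement_rate can be sketched as the fraction of advance actions taken while learner confidence is below a threshold. The threshold value below is an assumption for illustration; the real one is PREMATURE_ADVANCE_CONFIDENCE in real_run3.py:

```python
PREMATURE_ADVANCE_CONFIDENCE = 0.5  # assumed value; see the constants table

def premature_advancement_rate(log, threshold=PREMATURE_ADVANCE_CONFIDENCE):
    """Fraction of advance actions taken at low confidence; None if no advances."""
    advances = [e for e in log
                if e.get("action") == "advance" and e.get("confidence") is not None]
    if not advances:
        return None
    premature = sum(1 for e in advances if e["confidence"] < threshold)
    return premature / len(advances)

rate = premature_advancement_rate([
    {"action": "advance", "confidence": 0.3},   # premature
    {"action": "advance", "confidence": 0.8},   # fine
    {"action": "give_hint", "confidence": 0.2}, # not an advance
])
```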

Legacy scalar metrics (ablations)

These remain useful for tables and baselines:

mean_retention and mean_memory_strength average only turns where those fields are present (graph-on with temporal state). For LLM-only baseline rows, they are null in the log, so the corresponding means are null even when mean_confidence exists.

When the log has no rows with confidence (valid empty), legacy means are written as null; time-series and rates still return where defined.

Simulated learner: MisconceptionLearner

The simulated learner was previously always incorrect, which made learning and policy effects hard to observe. The updated learner samples correctness with probability min(0.1 + 0.1 * activation_count, 0.6) using a reproducible RNG (rng_seed is optional; the default seed is derived from the learner name). That yields improvement with practice while staying mistake-prone early on.
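The sampling rule can be sketched directly from that formula; the function below is an illustrative stand-in for the class method, not the class itself:

```python
import random

def misconception_learner_correct(activation_count, rng):
    """Sample correctness with p = min(0.1 + 0.1 * activation_count, 0.6).

    Practice (higher activation_count) raises the success probability,
    capped at 0.6 so the learner stays mistake-prone.
    """
    p = min(0.1 + 0.1 * activation_count, 0.6)
    return rng.random() < p

# A seeded RNG makes runs reproducible, as with rng_seed in the real class.
rng = random.Random(42)
answers = [misconception_learner_correct(n, rng) for n in range(10)]
```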

Output files

Running real_run3.py as a script writes JSON under simulation_logs/.

For publication-style claims, prefer comparing premature advancement, concept stability summary, time to mastery, and intervention effectiveness across conditions rather than only scalar means over short runs.

Constants (tuning)

Defined at module top in real_run3.py:

| Constant | Role |
| --- | --- |
| DEFAULT_SIMULATION_TURNS | Default episode length |
| OPENROUTER_BASE_URL | OpenRouter API base (override only if needed) |
| DEFAULT_OPENROUTER_MODEL | Fallback model id if OPENROUTER_MODEL is unset |
| INTERVENTION_ACTIONS | Actions counted for intervention effectiveness |
| MASTERY_CONFIDENCE_THRESHOLD / MASTERY_STREAK_LEN | Time-to-mastery criterion |
| PREMATURE_ADVANCE_CONFIDENCE | Threshold for “low confidence” advances |
| SLIDING_ACCURACY_WINDOW | Window size k for sliding accuracy |

Adjust these in one place so sweeps and ablations stay consistent.