Overview

The Evonic evaluation engine lets you test LLM performance across multiple domains with structured, repeatable, scored evaluations. Each test runs through a multi-pass pipeline that produces a normalized score from 0.0 to 1.0.

This section covers everything you need to know about evaluation in Evonic:

| Page | What It Covers |
| --- | --- |
| Evaluation Workflow | The complete end-to-end pipeline: test loading, prompt resolution, LLM passes, scoring, and persistence |
| Test Definitions | How to author and organize test definition JSON files: domain configs, test format, expected outputs |
| System Prompt Hierarchy | How the 3-layer prompt resolution works (Domain → Level → Test), with overwrite vs append modes |
| Evaluator Types | All built-in evaluator strategies: Keyword, Two-Pass, SQL Executor, Tool Call, and Custom evaluators |
| Regex Evaluators | Built-in regex patterns, scoring modes, and how to create custom regex evaluators |
| Headless Mode | Run evaluations from the command line without the web UI |
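
To make the Domain → Level → Test resolution with overwrite and append modes more concrete, here is a minimal sketch in Python. It is not Evonic's implementation: `PromptLayer`, `resolve_system_prompt`, the `mode` field, and the blank-line separator used when appending are all hypothetical names and choices made for illustration. See System Prompt Hierarchy for the actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptLayer:
    """One layer of the hierarchy (hypothetical structure, not Evonic's schema)."""
    text: Optional[str] = None  # None means this layer defines no prompt
    mode: str = "overwrite"     # assumed values: "overwrite" or "append"


def resolve_system_prompt(domain: PromptLayer, level: PromptLayer, test: PromptLayer) -> str:
    """Walk the layers from least to most specific and build the final prompt."""
    resolved = ""
    for layer in (domain, level, test):
        if layer.text is None:
            continue  # layer is silent, keep whatever was resolved so far
        if layer.mode == "append" and resolved:
            # Assumption: appended text is joined with a blank line.
            resolved = f"{resolved}\n\n{layer.text}"
        else:
            # "overwrite" replaces everything inherited from broader layers.
            resolved = layer.text
    return resolved
```

Under these assumptions, a level prompt in append mode extends the domain prompt, while a test prompt in overwrite mode discards both.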

The pipeline works like this:

  1. Test Loading — loads test definitions organized by domain and difficulty level (1–5)
  2. System Prompt Resolution — resolves prompts through a 3-layer hierarchy
  3. PASS 1 — sends the full prompt to the LLM and receives a response
  4. Evaluator Routing — routes the response to the right evaluator strategy
  5. PASS 2 (optional) — extracts the final answer in a strict format for certain evaluators
  6. Scoring — produces a score (0.0–1.0) and pass/fail status
  7. Aggregation — computes weighted scores per domain/level
  8. Persistence — saves everything to SQLite
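
The aggregation step (step 7) can be sketched as a weighted mean of normalized scores grouped by domain and level. This is an illustrative sketch only: `ScoredResult`, `aggregate`, and the per-test `weight` field are assumptions, and the overview does not say how Evonic actually assigns weights.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class ScoredResult:
    """Outcome of one test (hypothetical shape, not Evonic's schema)."""
    domain: str
    level: int           # difficulty level, 1 to 5
    score: float         # normalized score, 0.0 to 1.0
    weight: float = 1.0  # assumed per-test weight


def aggregate(results: Iterable[ScoredResult]) -> Dict[Tuple[str, int], float]:
    """Return the weighted mean score for each (domain, level) pair."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [weighted score sum, weight sum]
    for r in results:
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(f"score out of range: {r.score}")
        bucket = totals[(r.domain, r.level)]
        bucket[0] += r.score * r.weight
        bucket[1] += r.weight
    # Skip groups whose total weight is zero to avoid dividing by zero.
    return {key: s / w for key, (s, w) in totals.items() if w > 0}
```

For example, two level-1 tests in one domain scoring 1.0 (weight 1) and 0.5 (weight 3) would aggregate to (1.0 + 1.5) / 4 = 0.625 under this scheme.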

See Evaluation Workflow for the full breakdown, or jump straight to Test Definitions to start writing your own tests.