Overview
The Evonic evaluation engine lets you test LLM performance across multiple domains with structured, repeatable, scored evaluations. Each test runs through a multi-pass pipeline that produces a normalized score from 0.0 to 1.0.
What You’ll Find Here
This section covers everything you need to know about evaluation in Evonic:
| Page | What It Covers |
|---|---|
| Evaluation Workflow | The complete end-to-end pipeline: test loading, prompt resolution, LLM passes, scoring, and persistence |
| Test Definitions | How to author and organize test definition JSON files — domain configs, test format, expected outputs |
| System Prompt Hierarchy | How the 3-layer prompt resolution works (Domain → Level → Test), with overwrite vs append modes |
| Evaluator Types | All built-in evaluator strategies: Keyword, Two-Pass, SQL Executor, Tool Call, and Custom evaluators |
| Regex Evaluators | Built-in regex patterns, scoring modes, and how to create custom regex evaluators |
| Headless Mode | Run evaluations from the command line without the web UI |
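
To make the Domain → Level → Test hierarchy from the table more concrete, here is a minimal sketch of how a three-layer resolution with overwrite and append modes could behave. It is an illustration only: `PromptLayer`, `resolve_system_prompt`, and the rule that "overwrite" discards whatever the earlier layers produced are assumptions, not Evonic's actual API. The System Prompt Hierarchy page describes the real behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptLayer:
    """Hypothetical container for one layer's prompt (not an Evonic class)."""
    text: Optional[str] = None
    mode: str = "overwrite"  # assumed values: "overwrite" or "append"


def resolve_system_prompt(domain: Optional[PromptLayer],
                          level: Optional[PromptLayer],
                          test: Optional[PromptLayer]) -> str:
    """Walk the layers from least to most specific and build one prompt."""
    resolved = ""
    for layer in (domain, level, test):
        if layer is None or not layer.text:
            continue  # a layer without a prompt leaves the result unchanged
        if layer.mode == "append" and resolved:
            # Assumption: append adds this layer after what is already resolved.
            resolved = f"{resolved}\n\n{layer.text}"
        else:
            # Assumption: overwrite replaces everything from earlier layers.
            resolved = layer.text
    return resolved
```

Under these assumptions, a level prompt in append mode extends the domain prompt, while a test prompt in overwrite mode replaces both.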
Quick Summary
The pipeline works like this:
- Test Loading — loads test definitions organized by domain and difficulty level (1–5)
- System Prompt Resolution — resolves prompts through a 3-layer hierarchy
- PASS 1 — sends the full prompt to the LLM and receives a response
- Evaluator Routing — routes the response to the right evaluator strategy
- PASS 2 (optional) — extracts the final answer in a strict format for certain evaluators
- Scoring — produces a score (0.0–1.0) and pass/fail status
- Aggregation — computes weighted scores per domain/level
- Persistence — saves everything to SQLite
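
The Scoring and Aggregation steps above can be sketched as a weighted average of per-test scores grouped by domain and level. This is a rough illustration under stated assumptions: the field names (`domain`, `level`, `score`, `weight`), the default weight of 1.0, and the function name `aggregate_scores` are hypothetical, and Evonic's actual weighting scheme may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_scores(results: Iterable[dict]) -> Dict[Tuple[str, int], float]:
    """Compute a weighted mean score for each (domain, level) pair.

    Each result is assumed to be a dict with a 0.0-1.0 "score", a "domain",
    a "level" (1-5), and an optional "weight" (hypothetical field names).
    """
    # Maps (domain, level) -> [weighted score sum, weight sum]
    totals = defaultdict(lambda: [0.0, 0.0])
    for result in results:
        key = (result["domain"], result["level"])
        weight = result.get("weight", 1.0)
        totals[key][0] += result["score"] * weight
        totals[key][1] += weight
    # Guard against a zero weight sum so the division never fails.
    return {
        key: (score_sum / weight_sum if weight_sum else 0.0)
        for key, (score_sum, weight_sum) in totals.items()
    }
```

For example, two level-1 tests in the same domain with scores 1.0 (weight 1) and 0.5 (weight 3) would aggregate to 0.625 for that domain and level.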
See Evaluation Workflow for the full breakdown, or jump straight to Test Definitions to start writing your own tests.