Evaluator Types
The evaluator system uses the Strategy Pattern: each evaluator inherits from BaseEvaluator and returns an EvaluationResult with a score (0.0-1.0), status, and details.
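The Strategy Pattern described above can be sketched roughly as follows. Only the names BaseEvaluator and EvaluationResult and the three result fields come from this page; the `evaluate` method signature, the status values, the toy `ExactMatchEvaluator`, and the `REGISTRY` lookup are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvaluationResult:
    score: float                     # expected to lie in [0.0, 1.0]
    status: str                      # hypothetical values: "pass" / "fail"
    details: dict[str, Any] = field(default_factory=dict)


class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, response: str, expected: dict[str, Any]) -> EvaluationResult:
        """Score one model response against the test's expectations."""


class ExactMatchEvaluator(BaseEvaluator):
    """Toy strategy for illustration only; not one of the built-in evaluators."""

    def evaluate(self, response: str, expected: dict[str, Any]) -> EvaluationResult:
        matched = response.strip() == expected["answer"]
        return EvaluationResult(
            score=1.0 if matched else 0.0,
            status="pass" if matched else "fail",
            details={"matched": matched},
        )


# Callers pick a strategy by id and never depend on the concrete class.
REGISTRY: dict[str, BaseEvaluator] = {"exact": ExactMatchEvaluator()}


def run(evaluator_id: str, response: str, expected: dict[str, Any]) -> EvaluationResult:
    return REGISTRY[evaluator_id].evaluate(response, expected)
```

Because every strategy returns the same result type, the calling code can aggregate scores without knowing which evaluator produced them.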
Keyword Evaluator
Domains: conversation
PASS 2: No
Scores responses based on keyword presence, relevance, and Indonesian language fluency.
{
  "evaluator_id": "keyword",
  "expected": {
    "keywords": ["halo", "selamat", "pagi"],
    "forbidden": ["error"]
  }
}
Scoring breakdown:
- Keyword match ratio (what percentage of expected keywords appear)
- Forbidden word penalty
- Indonesian fluency bonus (checks for natural Indonesian phrasing)
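A minimal sketch of how the first two components of the breakdown might combine. The substring matching, the fixed `forbidden_penalty` of 0.25, and the clamping are assumptions; the fluency bonus is left out because this page does not say how it is computed.

```python
def keyword_score(response: str, keywords: list[str], forbidden: list[str],
                  forbidden_penalty: float = 0.25) -> float:
    """Keyword match ratio minus a per-word forbidden penalty, clamped to [0, 1]."""
    text = response.lower()
    # Match ratio: fraction of expected keywords that appear in the response.
    hits = sum(1 for kw in keywords if kw.lower() in text)
    ratio = hits / len(keywords) if keywords else 1.0
    # Each forbidden word found subtracts a fixed penalty (assumed value).
    violations = sum(1 for word in forbidden if word.lower() in text)
    return max(0.0, min(1.0, ratio - forbidden_penalty * violations))
```

For the configuration above, a response containing all three keywords and no forbidden word scores 1.0 under these assumptions.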
Two-Pass Evaluator
Domains: math, reasoning, health
PASS 2: Yes
The first pass collects the full response, including the reasoning. The second pass extracts only the final answer in a strict format, then compares it against the expected value.
{
  "evaluator_id": "two_pass",
  "expected": {
    "answer": "105",
    "type": "numeric"
  }
}
Expected types:
- numeric: numeric comparison with tolerance
- string: exact string match (case-insensitive)
- contains: checks if expected appears in extracted answer
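The three expected types could be compared along these lines. This is a sketch under assumptions: the function name `compare_answer`, the tolerance values, and the case-insensitive handling of `contains` are not stated on this page.

```python
import math


def compare_answer(extracted: str, expected: dict, rel_tol: float = 1e-6) -> bool:
    """Compare the answer extracted in pass 2 against the expected value."""
    want = str(expected["answer"]).strip()
    got = extracted.strip()
    kind = expected.get("type", "string")
    if kind == "numeric":
        try:
            # Tolerance values are assumed; the real evaluator may use others.
            return math.isclose(float(got), float(want), rel_tol=rel_tol, abs_tol=1e-9)
        except ValueError:
            return False  # extracted text is not a number
    if kind == "string":
        return got.lower() == want.lower()
    if kind == "contains":
        return want.lower() in got.lower()
    raise ValueError(f"unknown expected type: {kind}")
```

With the numeric type, an extracted answer of "105.0" matches the expected "105", which a plain string comparison would reject.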
SQL Executor Evaluator
Domains: sql
PASS 2: Yes
Extracts SQL from the LLM response, executes it against a real SQLite database (seed/test_db.sqlite), and validates the results.
{
  "evaluator_id": "sql_executor",
  "expected": {
    "query_type": "SELECT",
    "expected_columns": ["name", "price"],
    "min_rows": 1,
    "max_rows": 100
  }
}
Validation checks:
- SQL syntax (does it execute without error?)
- Column presence (are expected columns in the result?)
- Row count (within expected range?)
- Data quality (are values non-null, reasonable?)
Supports multi-statement SQL (separated by ;), executing each sequentially.
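The validation checks and the sequential multi-statement execution might look roughly like this, using Python's standard `sqlite3` module. The function name `run_and_validate`, the equal weighting of the four checks, and the naive split on `;` are assumptions; the "reasonable values" part of the data quality check is reduced here to a non-null test.

```python
import sqlite3


def run_and_validate(conn: sqlite3.Connection, sql: str, expected: dict):
    """Return (score, details) for the SQL text against the expectations."""
    checks = {"syntax": False, "columns": False, "row_count": False, "non_null": False}
    # Naive split on ';' (assumption): breaks on semicolons inside string literals.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    if not statements:
        return 0.0, {"checks": checks, "error": "no SQL found"}
    try:
        cursor = None
        for statement in statements:
            cursor = conn.execute(statement)  # run each statement in order
        rows = cursor.fetchall()              # results of the last statement
    except sqlite3.Error as exc:
        return 0.0, {"checks": checks, "error": str(exc)}
    checks["syntax"] = True
    columns = [col[0] for col in (cursor.description or [])]
    checks["columns"] = all(c in columns for c in expected.get("expected_columns", []))
    low, high = expected.get("min_rows", 0), expected.get("max_rows", float("inf"))
    checks["row_count"] = low <= len(rows) <= high
    checks["non_null"] = all(value is not None for row in rows for value in row)
    score = sum(checks.values()) / len(checks)  # equal weights (assumed)
    return score, {"checks": checks, "columns": columns, "rows": len(rows)}
```

A sketch like this is easiest to try against an in-memory database (`sqlite3.connect(":memory:")`) rather than the seeded test database.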
Tool Call Evaluator
Domains: tool_calling
PASS 2: Yes
Validates that the LLM correctly invokes function tools with appropriate arguments. Supports both single-tool and chained multi-step tool calls.
{
  "evaluator_id": "tool_call",
  "expected": {
    "tools": ["get_order", "send_notification"],
    "chain": true
  }
}
Scoring:
- Did the LLM call the expected tools?
- Were arguments present and reasonable?
- For chained calls: did it use output from tool A as input to tool B?
- Partial credit for calling some but not all expected tools
Custom Evaluators
Custom evaluators support three modes:
Regex-Only (type: "regex")
Matches response against a regex pattern. See Regex Evaluators for details.
LLM-Prompt (type: "custom")
Uses a second LLM call to evaluate quality:
{
  "id": "natural_text_compare",
  "type": "custom",
  "eval_prompt": "Compare these two texts and rate similarity 1-3:\n\nExpected: {expected}\nActual: {response}\n\nRate: 1=wrong, 2=partial, 3=correct\nScore:",
  "uses_pass2": false
}
Hybrid (type: "hybrid")
Combines LLM evaluation with regex score extraction:
{
  "id": "hybrid_quality_rater",
  "type": "hybrid",
  "eval_prompt": "Rate response quality 0-100...\nSCORE: <number>",
  "extraction_regex": "SCORE:\\s*(\\d+)"
}
Qwen XML Tool Call Format Support
Introduced in v0.2.6.
The Tool Call evaluator now supports Qwen-style XML tool call format in addition to the standard JSON function-calling format. This enables evaluation of models that use Qwen’s XML-based tool invocation syntax.
Qwen XML Format
Qwen models represent tool calls as XML blocks instead of JSON:
<tool_call>
  <tool_name>get_order</tool_name>
  <parameters>
    <order_id>12345</order_id>
  </parameters>
</tool_call>
How the Evaluator Handles It
The Tool Call evaluator automatically detects whether the response uses JSON or XML format:
| Format | Detection | Notes |
|---|---|---|
| JSON | Standard {"name": "...", "arguments": {...}} | Standard OpenAI-style |
| Qwen XML | <tool_call><tool_name>...</tool_name> blocks | Auto-detected and parsed |
The evaluator extracts tool names and parameters from the XML structure and validates them against the expected tools, just like with JSON format.
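A rough sketch of this detection and extraction step, using only the standard library. The function name `parse_tool_calls`, the normalized output shape, and the fallback order (XML blocks first, then JSON) are assumptions; the XML element names follow the sample shown on this page.

```python
import json
import re
import xml.etree.ElementTree as ET

TOOL_CALL_BLOCK = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def parse_tool_calls(response: str) -> list[dict]:
    """Normalize Qwen XML or JSON tool calls to [{'name': ..., 'arguments': {...}}]."""
    blocks = TOOL_CALL_BLOCK.findall(response)
    if blocks:  # Qwen XML detected
        calls = []
        for block in blocks:
            try:
                root = ET.fromstring(block)
            except ET.ParseError:
                continue  # skip malformed blocks instead of failing the whole parse
            name = root.findtext("tool_name", default="").strip()
            params = root.find("parameters")
            args = {}
            if params is not None:
                # Each child element of <parameters> is one argument.
                args = {child.tag: (child.text or "").strip() for child in params}
            calls.append({"name": name, "arguments": args})
        return calls
    try:  # otherwise try the JSON function-calling shape
        data = json.loads(response)
    except json.JSONDecodeError:
        return []
    items = data if isinstance(data, list) else [data]
    return [
        {"name": item["name"], "arguments": item.get("arguments", {})}
        for item in items
        if isinstance(item, dict) and "name" in item
    ]
```

Normalizing both formats to one shape lets the same validation logic run regardless of which syntax the model produced. Note that XML parameter values arrive as strings, while JSON arguments keep their native types.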
Configuration
No additional configuration is needed. The evaluator automatically detects and parses both formats. If you want to explicitly test Qwen XML format:
{
  "evaluator_id": "tool_call",
  "expected": {
    "tools": ["get_order", "send_notification"],
    "chain": true,
    "format": "qwen_xml"
  }
}
Setting "format": "qwen_xml" tells the evaluator to expect the Qwen XML format specifically (useful when the model might produce ambiguous output).
Creating Custom Evaluators
Create custom evaluators via the Settings UI or by adding JSON files to evaluator/test_definitions/evaluators/:
{
  "id": "my_evaluator",
  "name": "My Custom Evaluator",
  "type": "regex",
  "description": "Extracts and validates date format",
  "extraction_regex": "(\\d{4}-\\d{2}-\\d{2})",
  "uses_pass2": false,
  "config": {}
}
See Creating Evaluators for the full development guide.