Evaluator Types

The evaluator system uses the Strategy Pattern — each evaluator inherits from BaseEvaluator and returns an EvaluationResult with a score (0.0-1.0), status, and details.
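
As a rough sketch of that contract (the EvaluationResult fields and the evaluate() signature shown here are assumptions based on the description above, not the actual source):

from dataclasses import dataclass, field


@dataclass
class EvaluationResult:
    score: float               # 0.0-1.0
    status: str                # e.g. "pass" / "fail"
    details: dict = field(default_factory=dict)


class BaseEvaluator:
    def evaluate(self, response: str, expected: dict) -> EvaluationResult:
        raise NotImplementedError


class ExactMatchEvaluator(BaseEvaluator):
    # Trivial strategy: full credit only for an exact answer match.
    def evaluate(self, response: str, expected: dict) -> EvaluationResult:
        hit = response.strip() == expected.get("answer", "")
        return EvaluationResult(
            score=1.0 if hit else 0.0,
            status="pass" if hit else "fail",
            details={"matched": hit},
        )

Each concrete evaluator below plugs into this same interface; only the scoring strategy changes.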

Keyword Evaluator (keyword)

Domains: conversation · Pass 2: No

Scores responses based on keyword presence, relevance, and Indonesian language fluency.

{
  "evaluator_id": "keyword",
  "expected": {
    "keywords": ["halo", "selamat", "pagi"],
    "forbidden": ["error"]
  }
}

Scoring breakdown (see the sketch after this list):

  • Keyword match ratio (what percentage of expected keywords appear)
  • Forbidden word penalty
  • Indonesian fluency bonus (checks for natural Indonesian phrasing)
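
A minimal sketch of that arithmetic (the 0.2 penalty and 0.1 bonus weights are assumptions, and the actual fluency check is not shown here):

def score_keywords(response: str, keywords: list, forbidden: list,
                   fluent: bool = False) -> float:
    text = response.lower()
    # Fraction of expected keywords found in the response
    match_ratio = sum(kw.lower() in text for kw in keywords) / max(len(keywords), 1)
    # Deduct for each forbidden word that appears (weight is assumed)
    penalty = 0.2 * sum(bad.lower() in text for bad in forbidden)
    # Flat bonus if the phrasing was judged naturally Indonesian (assumed weight)
    bonus = 0.1 if fluent else 0.0
    return max(0.0, min(1.0, match_ratio - penalty + bonus))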

Two-Pass Evaluator (two_pass)

Domains: math, reasoning, health · Pass 2: Yes

The first pass gets the full response with reasoning; the second pass extracts just the final answer in a strict format and compares it against the expected value.

{
  "evaluator_id": "two_pass",
  "expected": {
    "answer": "105",
    "type": "numeric"
  }
}

Expected types (comparison logic sketched after this list):

  • numeric — numeric comparison with tolerance
  • string — exact string match (case-insensitive)
  • contains — checks if expected appears in extracted answer
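
A sketch of how those comparisons might work (the numeric tolerance value is an assumption; the doc only says a tolerance exists):

def compare(extracted: str, expected: str, kind: str, tol: float = 1e-6) -> bool:
    if kind == "numeric":
        try:
            return abs(float(extracted) - float(expected)) <= tol
        except ValueError:
            return False                  # extracted answer was not a number
    if kind == "string":
        return extracted.strip().lower() == expected.strip().lower()
    if kind == "contains":
        return expected.lower() in extracted.lower()
    raise ValueError(f"unknown expected type: {kind}")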

SQL Executor Evaluator (sql_executor)

Domains: sql · Pass 2: Yes

Extracts SQL from the LLM response, executes it against a real SQLite database (seed/test_db.sqlite), and validates the results.

{
  "evaluator_id": "sql_executor",
  "expected": {
    "query_type": "SELECT",
    "expected_columns": ["name", "price"],
    "min_rows": 1,
    "max_rows": 100
  }
}

Validation checks:

  • SQL syntax (does it execute without error?)
  • Column presence (are expected columns in the result?)
  • Row count (within expected range?)
  • Data quality (are values non-null, reasonable?)

Supports multi-statement SQL (statements separated by ;), executing each statement sequentially; a sketch of the execute-and-validate flow follows.
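
A sketch of that flow using the standard sqlite3 module (validating only the final statement's result set and the pass/fail semantics are assumptions; the data-quality checks are omitted):

import sqlite3

def run_and_validate(sql: str, expected: dict,
                     db_path: str = "seed/test_db.sqlite") -> dict:
    conn = sqlite3.connect(db_path)
    try:
        cur = None
        # Naive split on ';' mirrors the documented separator; it would
        # break on semicolons inside string literals.
        for stmt in (s.strip() for s in sql.split(";") if s.strip()):
            cur = conn.execute(stmt)      # syntax check: raises on bad SQL
        if cur is None or cur.description is None:
            return {"ok": False, "error": "no result set"}
        cols = [d[0] for d in cur.description]
        rows = cur.fetchall()
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
    finally:
        conn.close()
    ok = (all(c in cols for c in expected["expected_columns"])
          and expected["min_rows"] <= len(rows) <= expected["max_rows"])
    return {"ok": ok, "columns": cols, "row_count": len(rows)}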

Tool Call Evaluator (tool_call)

Domains: tool_calling · Pass 2: Yes

Validates that the LLM correctly invokes function tools with appropriate arguments. Supports both single-tool and chained multi-step tool calls.

{
  "evaluator_id": "tool_call",
  "expected": {
    "tools": ["get_order", "send_notification"],
    "chain": true
  }
}

Scoring (a partial-credit sketch follows this list):

  • Did the LLM call the expected tools?
  • Were arguments present and reasonable?
  • For chained calls: did it use output from tool A as input to tool B?
  • Partial credit for calling some but not all expected tools
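
A sketch of that scoring (the call structure and all weights here are assumptions; checking that tool A's output actually feeds tool B is simplified to an ordering check):

def score_tool_calls(calls: list, expected_tools: list, chain: bool = False) -> float:
    # Each call is assumed to look like {"name": str, "args": dict}
    called = [c["name"] for c in calls]
    coverage = sum(t in called for t in expected_tools) / max(len(expected_tools), 1)
    has_args = bool(calls) and all(c.get("args") for c in calls)
    score = coverage * (1.0 if has_args else 0.8)    # assumed argument penalty
    if chain and called[:len(expected_tools)] != expected_tools:
        score *= 0.5                                 # assumed chaining penalty
    return score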

Custom evaluators support three modes:

Regex

Matches the response against a regex pattern. See Regex Evaluators for details.

LLM-judged

Uses a second LLM call to evaluate quality (a parsing sketch follows the config):

{
  "id": "natural_text_compare",
  "type": "custom",
  "eval_prompt": "Compare these two texts and rate similarity 1-3:\n\nExpected: {expected}\nActual: {response}\n\nRate: 1=wrong, 2=partial, 3=correct\nScore:",
  "uses_pass2": false
}
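
The harness presumably fills {expected} and {response} into eval_prompt, calls the judge model, and maps the 1-3 rating onto the 0.0-1.0 scale. A sketch of that parsing (the score mapping and fallback are assumptions, and call_llm is a hypothetical stand-in for the real client):

import re

def judge_similarity(response: str, expected: str,
                     eval_prompt: str, call_llm) -> float:
    prompt = eval_prompt.format(expected=expected, response=response)
    m = re.search(r"[123]", call_llm(prompt))
    if m is None:
        return 0.0                                    # assumed fallback
    return {"1": 0.0, "2": 0.5, "3": 1.0}[m.group()]  # assumed mapping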

Hybrid

Combines LLM evaluation with regex-based score extraction (sketched after the config):

{
  "id": "hybrid_quality_rater",
  "type": "hybrid",
  "eval_prompt": "Rate response quality 0-100...\nSCORE: <number>",
  "extraction_regex": "SCORE:\\s*(\\d+)"
}
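
A sketch of the extraction step (the clamp to 100 and the None fallback are assumptions; only the regex itself comes from the config above):

import re

def extract_hybrid_score(judge_output: str,
                         extraction_regex: str = r"SCORE:\s*(\d+)"):
    m = re.search(extraction_regex, judge_output)
    if m is None:
        return None                              # assumed fallback
    return min(int(m.group(1)), 100) / 100.0     # normalize 0-100 to 0.0-1.0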

Custom evaluators can be registered via the Settings UI or by creating JSON files in test_definitions/evaluators/:

{
  "id": "my_evaluator",
  "name": "My Custom Evaluator",
  "type": "regex",
  "description": "Extracts and validates date format",
  "extraction_regex": "(\\d{4}-\\d{2}-\\d{2})",
  "uses_pass2": false,
  "config": {}
}
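
A sketch of how such definitions could be discovered at startup (the project's actual loader and validation rules are not shown in this doc):

import json
from pathlib import Path

def load_evaluator_defs(root: str = "test_definitions/evaluators") -> dict:
    defs = {}
    for path in Path(root).glob("*.json"):
        spec = json.loads(path.read_text())
        defs[spec["id"]] = spec               # keyed by the "id" field
    return defs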

See Creating Evaluators for the full development guide.