
Test Definitions

```
test_definitions/
├── conversation/
│   ├── domain.json            # Domain metadata
│   ├── level_1/
│   │   ├── test_greeting.json
│   │   └── test_weather.json
│   ├── level_2/
│   │   └── test_geography.json
│   └── ...
├── math/
├── sql/
├── tool_calling/
├── evaluators/                # Evaluator configs
└── tools/                     # Tool definitions
```

Each domain has a `domain.json` file:

```json
{
  "id": "conversation",
  "name": "Conversation",
  "description": "Tests Indonesian language conversation abilities",
  "icon": "chat",
  "color": "#3B82F6",
  "evaluator_id": "keyword",
  "system_prompt": "Kamu adalah asisten yang ramah dan helpful.",
  "system_prompt_mode": "overwrite",
  "enabled": true
}
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique domain identifier (matches the directory name) |
| `name` | Yes | Display name |
| `description` | No | Description shown in the UI |
| `evaluator_id` | No | Default evaluator for tests in this domain |
| `system_prompt` | No | Default system prompt for all tests |
| `system_prompt_mode` | No | `overwrite` (default) or `append` |
| `tool_ids` | No | Array of tool IDs available in this domain |
| `enabled` | No | Whether to include in evaluations (default: `true`) |
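The optional fields above can be filled in with their defaults at load time. A minimal loader sketch (the `load_domain` helper and `DOMAIN_DEFAULTS` table are illustrative, not part of the harness):

```python
import json
from pathlib import Path

# Defaults for the optional domain.json fields, per the table above.
DOMAIN_DEFAULTS = {
    "description": "",
    "evaluator_id": None,
    "system_prompt": None,
    "system_prompt_mode": "overwrite",
    "tool_ids": [],
    "enabled": True,
}

def load_domain(path):
    """Load a domain.json file, validate required fields, apply defaults."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    for field in ("id", "name"):
        if field not in data:
            raise ValueError(f"domain.json missing required field: {field}")
    # Explicit values in the file win over the defaults.
    return {**DOMAIN_DEFAULTS, **data}
```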

Individual test files live in `level_<n>/test_*.json`:

```json
{
  "id": "math_multiply_1",
  "name": "Simple Multiplication",
  "description": "Tests basic multiplication",
  "prompt": "Berapa hasil dari 15 dikali 7?",
  "expected": {
    "answer": "105",
    "type": "numeric"
  },
  "evaluator_id": "two_pass",
  "system_prompt": null,
  "system_prompt_mode": "overwrite",
  "tool_ids": [],
  "timeout_ms": 30000,
  "weight": 1.0,
  "enabled": true
}
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique test identifier |
| `name` | Yes | Display name |
| `prompt` | Yes | The user message sent to the LLM |
| `expected` | Depends | Expected output (format varies by evaluator) |
| `evaluator_id` | No | Override the domain's default evaluator |
| `system_prompt` | No | Override or extend the domain system prompt |
| `system_prompt_mode` | No | `overwrite` or `append` |
| `tool_ids` | No | Tools available for this specific test |
| `timeout_ms` | No | Per-test timeout (default: 30000) |
| `weight` | No | Score weight within the level (default: 1.0) |
| `enabled` | No | Include in evaluation (default: `true`) |
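The system-prompt override in the table can be sketched as a merge step. This assumes `append` joins the domain prompt and the test prompt with a newline (the `resolve_system_prompt` helper name and the exact join behavior are assumptions):

```python
def resolve_system_prompt(domain: dict, test: dict):
    """Compute the effective system prompt for a test.

    A null test-level prompt falls back to the domain default; otherwise
    the test prompt either replaces the domain prompt ("overwrite") or is
    appended to it ("append").
    """
    if test.get("system_prompt") is None:
        return domain.get("system_prompt")
    if test.get("system_prompt_mode", "overwrite") == "append":
        base = domain.get("system_prompt") or ""
        return (base + "\n" + test["system_prompt"]).strip()
    return test["system_prompt"]
```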
For the keyword evaluator, `expected` lists required and forbidden terms:

```json
{
  "expected": {
    "keywords": ["halo", "selamat"],
    "forbidden": ["error", "maaf"]
  }
}
```

For numeric answers (e.g. math tests):

```json
{
  "expected": {
    "answer": "105",
    "type": "numeric"
  }
}
```

For SQL tests:

```json
{
  "expected": {
    "query_type": "SELECT",
    "expected_columns": ["name", "price"],
    "min_rows": 1
  }
}
```

For tool-calling tests:

```json
{
  "expected": {
    "tools": ["get_weather"],
    "chain": false
  }
}
```
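A keyword check like this can be approximated with case-insensitive substring matching. This is only a sketch; the real evaluator's matching rules may differ:

```python
def keyword_check(response: str, expected: dict) -> bool:
    """Pass if every `keywords` entry appears in the response and no
    `forbidden` entry does (case-insensitive substring match, assumed)."""
    text = response.lower()
    missing = [k for k in expected.get("keywords", []) if k.lower() not in text]
    hit = [f for f in expected.get("forbidden", []) if f.lower() in text]
    return not missing and not hit
```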

For chained tool calls (multi-step):

```json
{
  "expected": {
    "tools": ["get_order", "send_notification"],
    "chain": true
  }
}
```
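One way to read the `chain` flag: with `"chain": true` the listed tools must be called in the given order, otherwise order is ignored. A sketch under that assumption (`tool_calls_match` is a hypothetical helper, not the harness's actual evaluator):

```python
def tool_calls_match(calls: list, expected: dict) -> bool:
    """Check the ordered list of tool names the model invoked against
    an `expected` spec like {"tools": [...], "chain": true/false}."""
    tools = expected.get("tools", [])
    if expected.get("chain"):
        # Chained: tools must appear as an in-order subsequence of calls.
        it = iter(calls)
        return all(t in it for t in tools)
    # Unchained: each tool must be called at least once, any order.
    return set(tools).issubset(calls)
```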
  1. Navigate to `/settings`
  2. Select a domain from the sidebar
  3. Click a level to view its tests
  4. Use **+ Add Test** to create new tests
  5. Click a test to edit its prompt, expected output, and evaluator
```sh
# List tests for a domain/level (quote the URL so the shell
# doesn't treat & as a background operator)
curl "http://localhost:8080/api/settings/tests?domain=math&level=1"

# Create a test
curl -X POST http://localhost:8080/api/settings/tests \
  -H 'Content-Type: application/json' \
  -d '{
    "domain_id": "math",
    "level": 1,
    "name": "Addition Test",
    "prompt": "What is 2 + 2?",
    "expected": {"answer": "4", "type": "numeric"},
    "evaluator_id": "two_pass"
  }'
```

Export all test definitions:

```sh
curl http://localhost:8080/api/settings/export > tests_backup.json
```

Import:

```sh
curl -X POST http://localhost:8080/api/settings/import \
  -H 'Content-Type: application/json' \
  -d @tests_backup.json
```
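The same import call can be issued from Python's standard library. This sketch only builds the request; the `BASE` URL matches the examples above, and parsing the backup as JSON before sending is an extra safety check, not a server requirement:

```python
import json
import urllib.request

BASE = "http://localhost:8080/api/settings"  # assumed local server

def build_import_request(backup_path: str) -> urllib.request.Request:
    """Build the POST request that re-imports a backup file,
    equivalent to the curl call above."""
    with open(backup_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early if the backup is not valid JSON
    return urllib.request.Request(
        f"{BASE}/import",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it:
#   urllib.request.urlopen(build_import_request("tests_backup.json"))
```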