poq.toml Examples

A poq.toml file describes how uploaded data becomes review items: what to ingest, which evidence validators see, which questions they answer, and how review slots are assigned. These examples use the current namespaced layout from the poq.toml reference.

The best first spec is usually small: one source, one review item per row, and a short rubric. Add joins, routing, classes, and AI validators only when the workflow needs them.

Annotation QC

Use this when you already have labeled rows and want independent agreement checks. What makes it interesting: it is the smallest useful shape, but still shows the core loop of data in, evidence shown, rubric answered, consensus produced.

poq.toml
[project]
spec_version = "1"
tag          = "annotation_qc"

[[ingestion.sources]]
id   = "labels"
type = "csv"
path = "labels.csv"

[ingestion.fields]
id             = "labels.item_id"
source_text    = "labels.source_text"
proposed_label = "labels.proposed_label"

[[validation.evidence]]
type            = "markdown"
title           = "Source text"
ingestion_field = "source_text"

[[validation.evidence]]
type            = "markdown"
title           = "Proposed label"
ingestion_field = "proposed_label"

[[validation.rubric]]
id               = "agreement"
label            = "Label agreement"
prompt           = "Does the proposed label correctly describe the source text?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your answer?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "1.00"
stake_usd      = "0.00"

Routed Expert Review

Use this when some items need specialists and others can be handled by a general pool. What makes it interesting: validator classes and routes let higher-risk rows get more reviewers and a senior mix without changing the item schema.

poq.toml
[project]
spec_version = "1"
tag          = "expert_review"

[[ingestion.sources]]
id        = "findings"
type      = "json"
path_glob = "findings/*.json"

[ingestion.fields]
id                = "findings.id"
title             = "findings.title"
summary           = "findings.summary"
source_path       = "findings.sourcePath"
proposed_severity = "findings.proposedSeverity"
detected_by       = "findings.detectedBy"

[[validation.evidence]]
type            = "markdown"
title           = "Finding"
ingestion_field = "summary"

[[validation.evidence]]
type  = "datapoint_facts"
title = "Finding facts"
fields = [
  { label = "Title", field = "title" },
  { label = "Source", field = "source_path" },
  { label = "Proposed severity", field = "proposed_severity" },
  { label = "Detected by", field = "detected_by" },
]

[[validation.rubric]]
id               = "validity"
label            = "Validity"
prompt           = "Is this a real issue?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "impact"
label            = "Impact"
prompt           = "How much practical impact would this issue have?"
scale.type       = "ordinal"
scale.labels     = ["none", "low", "medium", "high", "critical"]
consensus_weight = 1.5

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How certain are you in this assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"

[[validators.classes]]
id         = "generalist"
label      = "General reviewer"
type       = "human"
priority   = 20
reward_usd = "5.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "senior"
label      = "Senior reviewer"
type       = "human"
priority   = 10
reward_usd = "20.00"
stake_usd  = "0.00"

# Higher-risk rows (high or critical proposed severity): 5 reviewers, 2 of them senior.
[[validators.routes]]
match = { proposed_severity = ["high", "critical"] }
total = 5

[[validators.routes.composition]]
class = "senior"
count = 2

[[validators.routes.composition]]
class = "*"
count = 3

# Route without a match table: 3 reviewers.
[[validators.routes]]
total = 3

[[validators.routes.composition]]
class = "*"
count = 3

AI Panel With Human Escalation

Use this when an AI panel can handle first-pass review and humans should focus on contested items. What makes it interesting: each AI class can use a different model or prompt, while low-consensus cases escalate to a human class.

poq.toml
[project]
spec_version = "1"
tag          = "ai_assisted_review"

[[ingestion.sources]]
id   = "cases"
type = "csv"
path = "cases.csv"

[ingestion.fields]
id                = "cases.case_id"
prompt            = "cases.prompt"
candidate_answer  = "cases.candidate_answer"
reference_context = "cases.reference_context"
risk_tier         = "cases.risk_tier"

[[validation.evidence]]
type            = "markdown"
title           = "Prompt"
ingestion_field = "prompt"

[[validation.evidence]]
type            = "markdown"
title           = "Candidate answer"
ingestion_field = "candidate_answer"

[[validation.evidence]]
type            = "markdown"
title           = "Reference context"
ingestion_field = "reference_context"

[[validation.rubric]]
id               = "correctness"
label            = "Correctness"
prompt           = "Is the answer correct given the reference context?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "completeness"
label            = "Completeness"
prompt           = "Does the answer cover the important parts of the prompt?"
scale.type       = "likert_agreement"
scale.size       = 5
consensus_weight = 1.0

[[validation.rubric]]
id               = "safety"
label            = "Safety"
prompt           = "Does the answer avoid unsafe, unsupported, or misleading guidance?"
scale.type       = "likert_agreement"
scale.size       = 5
consensus_weight = 1.5

[validators]
num_validators = 3
reward_usd     = "0.00"
stake_usd      = "0.00"

[[validators.classes]]
id         = "reasoning-model"
label      = "Reasoning model"
type       = "ai"
model      = "provider/reasoning-model"
prompt     = "Review the case carefully. Score each rubric row using only the available evidence."
priority   = 30
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "fast-model"
label      = "Fast model"
type       = "ai"
model      = "provider/fast-model"
prompt     = "Review the case for correctness, completeness, and safety."
priority   = 31
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "policy-model"
label      = "Policy model"
type       = "ai"
model      = "provider/policy-model"
prompt     = "Focus on unsupported claims, policy risks, and missing caveats."
priority   = 32
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "human-senior"
label      = "Senior human reviewer"
type       = "human"
priority   = 10
reward_usd = "15.00"
stake_usd  = "0.00"

# First pass: one slot for each of the three AI classes.
[[validators.routes]]
total = 3

[[validators.routes.composition]]
class = "reasoning-model"
count = 1

[[validators.routes.composition]]
class = "fast-model"
count = 1

[[validators.routes.composition]]
class = "policy-model"
count = 1

# Escalation: low-consensus cases (below 0.6) get one senior human reviewer added.
[[validators.routes.escalation]]
match = { consensus_below = 0.6 }
add = 1

[[validators.routes.escalation.composition]]
class = "human-senior"
count = 1
Last updated Jun 16, 2026