poq.toml examples

A poq.toml file defines how uploaded data becomes review items: what to ingest, what evidence validators see, which questions they answer, and how review slots are assigned. These examples use the current namespaced layout from the poq.toml reference.

The best first spec is usually small: one source, one review item per row, and a short rubric. Add joins, routing, classes, and AI validators only when the workflow needs them.

Annotation Validation

Use this when you already have labeled rows and want independent agreement checks. What makes it interesting: it is the smallest useful shape, yet it still walks through the core sequence of ingesting data, showing evidence, answering a rubric, and producing consensus.

csv-row

annotation-qa/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_genre_qa"

[[ingestion.sources]]
id   = "labels"
type = "csv"
path = "labels.csv"

# --- Key to this example ------------------------------------
# These fields map straight onto the columns in labels.csv,
# turning each track row into one review item.
[ingestion.fields]
id             = "labels.item_id"
source_text    = "labels.source_text"
proposed_label = "labels.proposed_label"
# ------------------------------------------------------------

[[validation.evidence]]
type            = "markdown"
title           = "Track description"
ingestion_field = "source_text"

[[validation.evidence]]
type            = "markdown"
title           = "Proposed genre"
ingestion_field = "proposed_label"

[[validation.rubric]]
id               = "agreement"
label            = "Genre agreement"
prompt           = "Does the proposed genre correctly describe this Beatles track?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your answer?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "1.00"
stake_usd      = "0.00"
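
To make the field mapping concrete, here is a minimal Python sketch of how `source.column` references could resolve against CSV rows. It is an illustration, not the actual ingestion code: the sample rows and the `resolve_fields` helper are hypothetical, and only the three column names are taken from the spec above.

```python
import csv
import io

# Hypothetical contents of labels.csv; only the column names come from the spec.
LABELS_CSV = """item_id,source_text,proposed_label
t1,"Raga-influenced track with sitar",psychedelic rock
t2,"Acoustic ballad with string quartet",baroque pop
"""

# Mirrors [ingestion.fields]: review-item field -> "source_id.column".
FIELDS = {
    "id": "labels.item_id",
    "source_text": "labels.source_text",
    "proposed_label": "labels.proposed_label",
}


def resolve_fields(rows, fields, source_id):
    """Build one review item per row by looking up each mapped column."""
    items = []
    for row in rows:
        item = {}
        for name, ref in fields.items():
            src, column = ref.split(".", 1)
            if src != source_id:
                raise ValueError(f"unknown source in {ref!r}")
            item[name] = row[column]
        items.append(item)
    return items


rows = list(csv.DictReader(io.StringIO(LABELS_CSV)))
items = resolve_fields(rows, FIELDS, "labels")
for item in items:
    print(item["id"], "->", item["proposed_label"])
```

Each CSV row yields exactly one item, which is the "one review item per row" shape recommended for a first spec.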

Image Annotation Validation

Use this when each review item is a row of structured annotation metadata joined to an image file, common in captioning, classification, segmentation, and other image labeling workflows. What makes it interesting: the CSV is the task root, images attach through a join, and validators see the annotation and image together on one page.

csv-image-join

image-annotation-qa/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_cover_qa"

[[ingestion.sources]]
id   = "annotations"
type = "csv"
path = "annotations.csv"

# --- Key to this example ------------------------------------
# The cover images load as their own source, then this join
# attaches each file in images/ to its caption row by matching
# annotations.image_id to the file basename.
[[ingestion.sources]]
id               = "images"
type             = "file_collection"
path_glob        = "images/*.{png,jpg,jpeg,webp,svg}"
file_id_strategy = "basename_without_ext"

[[ingestion.joins]]
left     = "annotations"
right    = "images"
left_on  = "image_id"
right_on = "file_id"
type     = "left"
# ------------------------------------------------------------

[ingestion.fields]
id          = "annotations.item_id"
annotation  = "annotations.annotation_text"
image_path  = "images.path"
context     = "annotations.context_notes"
category    = "annotations.category"
batch_id    = "annotations.batch_id"

[[validation.evidence]]
type            = "image"
title           = "Album cover"
ingestion_field = "image_path"

[[validation.evidence]]
type            = "markdown"
title           = "Caption"
ingestion_field = "annotation"

[[validation.evidence]]
type            = "markdown"
title           = "Context"
ingestion_field = "context"

[[validation.evidence]]
type  = "datapoint_facts"
title = "Item metadata"
fields = [
  { label = "Category", field = "category" },
  { label = "Batch", field = "batch_id" },
]

[[validation.rubric]]
id               = "agreement"
label            = "Caption agreement"
prompt           = "Does the caption correctly describe what is shown on the album cover?"
role             = "influence_gauge"
scale.type       = "likert"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "completeness"
label            = "Completeness"
prompt           = "Does the caption capture the important details of the cover?"
scale.type       = "ordinal"
scale.labels     = ["missing key details", "partial", "mostly complete", "complete"]
consensus_weight = 1.5

[[validation.rubric]]
id               = "quality"
label            = "Overall quality"
prompt           = "How usable is this caption as catalog metadata?"
scale.type       = "ordinal"
scale.labels     = ["poor", "fair", "good", "excellent"]
consensus_weight = 1.0

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "1.00"
stake_usd      = "0.00"
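
The join above can be pictured with a short Python sketch. It assumes that `basename_without_ext` means "file name minus its extension" and that a left join keeps annotation rows that have no matching image; both are readings of the spec, not confirmed behaviour. The sample rows, paths, and helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical rows from annotations.csv and files matched by the glob.
annotations = [
    {"item_id": "a1", "image_id": "cover_01", "annotation_text": "Band portrait"},
    {"item_id": "a2", "image_id": "cover_02", "annotation_text": "Plain sleeve"},
    {"item_id": "a3", "image_id": "cover_99", "annotation_text": "No matching file"},
]
image_paths = ["images/cover_01.jpg", "images/cover_02.png"]


def file_id(path):
    """basename_without_ext: 'images/cover_01.jpg' -> 'cover_01'."""
    return PurePosixPath(path).stem


def left_join(left_rows, paths, left_on):
    """Keep every left row; attach the matching path, or None if absent."""
    by_id = {file_id(p): p for p in paths}
    return [{**row, "path": by_id.get(row[left_on])} for row in left_rows]


joined = left_join(annotations, image_paths, "image_id")
for row in joined:
    print(row["item_id"], row["path"])
```

In this sketch the third row survives the join with an empty `path`, which is worth checking for in real data before validators open the task.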

Markdown Report Review

Use this when one uploaded report file should become many review items, like audit write-ups, policy reviews, incident postmortems, or any document with repeated sections. What makes it interesting: markdown_split turns each header match into a row, metadata regex pulls structured fields from the document and sections, and source_link sends validators to the cited file and line.

markdown-split

markdown-report-review/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_catalog_review"

# --- Key to this example ------------------------------------
# markdown_split turns each `## CAT-01: ...` heading in
# reports/catalog-2026-06.md into its own review item, and the
# metadata regexes below pull repository, file, line, and
# severity out of the report's table and section bullets.
[[ingestion.sources]]
id               = "report"
type             = "markdown_split"
path_glob        = "reports/*.md"
splitter.regex   = '^##\s+(?P<id>[A-Z]+-\d+):?\s+(?P<title>.+)$'
splitter.end_regex = '^##\s+'

[[ingestion.sources.splitter.metadata]]
scope  = "document"
column = "repository"
regex  = '^\|\s*Repository\s*\|\s*(?P<repository>[^|]+?)\s*\|'

[[ingestion.sources.splitter.metadata]]
scope  = "document"
column = "commit_sha"
regex  = '^\|\s*Commit\s*\|\s*`?(?P<commit_sha>[^|`]+?)`?\s*\|'

[[ingestion.sources.splitter.metadata]]
scope  = "section"
column = "source_file"
regex  = '^-\s*\*\*File\*\*:\s*`?(?P<source_file>.+?):L(?P<line_number>\d+)'

[[ingestion.sources.splitter.metadata]]
scope  = "section"
column = "line_number"
regex  = '^-\s*\*\*File\*\*:\s*`?(?P<source_file>.+?):L(?P<line_number>\d+)'

[[ingestion.sources.splitter.metadata]]
scope  = "section"
column = "proposed_severity"
regex  = '^-\s*\*\*Severity\*\*:\s*(?P<proposed_severity>.+)$'
# ------------------------------------------------------------

[ingestion.fields]
id                = "report.row_id"
finding_id        = "report.id"
title             = "report.title"
body              = "report.body"
repository        = "report.repository"
commit_sha        = "report.commit_sha"
source_file       = "report.source_file"
line_number       = "report.line_number"
proposed_severity = "report.proposed_severity"

[[validation.evidence]]
type            = "markdown"
title           = "Discrepancy"
ingestion_field = "body"

[[validation.evidence]]
type        = "source_link"
title       = "Catalog entry"
repository  = "repository"
commit_sha  = "commit_sha"
path        = "source_file"
line_number = "line_number"
label       = "View catalog entry"

[[validation.evidence]]
type  = "datapoint_facts"
title = "Discrepancy metadata"
fields = [
  { label = "Discrepancy ID", field = "finding_id" },
  { label = "Proposed severity", field = "proposed_severity" },
]

[[validation.rubric]]
id               = "validity"
label            = "Validity"
prompt           = "Is this catalog discrepancy a real error?"
role             = "influence_gauge"
scale.type       = "ordinal"
scale.labels     = ["False positive", "Unlikely valid", "Unclear", "Likely valid", "Clearly valid"]
consensus_weight = 2.0

[[validation.rubric]]
id               = "severity"
label            = "Severity"
prompt           = "How severe is this error if real?"
scale.type       = "ordinal"
scale.labels     = ["info", "low", "medium", "high", "critical"]
consensus_weight = 1.5

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"
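
Before uploading a report, it can help to try the splitter and metadata patterns locally. The Python sketch below applies them to a made-up report. It assumes the patterns are matched line by line (Python's `re.MULTILINE`) and that a section runs from its heading to the next `end_regex` match; the report text and helper names are hypothetical, and the real splitter may differ in detail.

```python
import re

# Hypothetical report text; the layout is assumed from the regexes in the spec.
REPORT = """# Catalog review

| Field | Value |
|---|---|
| Repository | example-org/catalog |
| Commit | `abc1234` |

## CAT-01: Wrong release year
- **File**: `data/albums.csv:L12`
- **Severity**: medium

The listed year does not match the liner notes.

## CAT-02: Missing credit
- **File**: `data/tracks.csv:L40`
- **Severity**: low
"""

M = re.MULTILINE  # assumption: patterns are matched line by line
HEADING = re.compile(r'^##\s+(?P<id>[A-Z]+-\d+):?\s+(?P<title>.+)$', M)
END = re.compile(r'^##\s+', M)
REPOSITORY = re.compile(r'^\|\s*Repository\s*\|\s*(?P<repository>[^|]+?)\s*\|', M)
FILE = re.compile(r'^-\s*\*\*File\*\*:\s*`?(?P<source_file>.+?):L(?P<line_number>\d+)', M)
SEVERITY = re.compile(r'^-\s*\*\*Severity\*\*:\s*(?P<proposed_severity>.+)$', M)


def first_group(pattern, text, name):
    """Return the named group of the first match, or None."""
    match = pattern.search(text)
    return match.group(name) if match else None


def split_report(text):
    """One row per heading; document-scope metadata is shared by all rows."""
    repository = first_group(REPOSITORY, text, "repository")
    rows = []
    for heading in HEADING.finditer(text):
        nxt = END.search(text, heading.end())
        section = text[heading.end(): nxt.start() if nxt else len(text)]
        rows.append({
            "id": heading.group("id"),
            "title": heading.group("title"),
            "body": section.strip(),
            "repository": repository,
            "source_file": first_group(FILE, section, "source_file"),
            "line_number": first_group(FILE, section, "line_number"),
            "proposed_severity": first_group(SEVERITY, section, "proposed_severity"),
        })
    return rows


rows = split_report(REPORT)
for row in rows:
    print(row["id"], row["source_file"], row["line_number"], row["proposed_severity"])
```

Note that the same File pattern feeds two columns, which is why the spec repeats it once per `column`.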

Conditional Rubric

Use this when later questions only apply depending on earlier answers. For example, grey out fix-quality scoring when a validator marks an item as a false positive. What makes it interesting: consensus_skip.match on a rubric row references another row's id and trigger labels; when the validator's own answers match, that row greys out in the UI (filler votes still count toward consensus).

conditional-rubric

conditional-rubric-triage/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_credit_triage"

[[ingestion.sources]]
id        = "findings"
type      = "json"
path_glob = "findings/*.json"

[ingestion.fields]
id           = "findings.id"
title        = "findings.title"
description  = "findings.description"
proposed_fix = "findings.proposedFix"

[[validation.evidence]]
type            = "markdown"
title           = "Credit issue"
ingestion_field = "description"

[[validation.evidence]]
type            = "markdown"
title           = "Proposed fix"
ingestion_field = "proposed_fix"

[[validation.rubric]]
id               = "validity"
label            = "Validity"
prompt           = "Is this credit issue a real error?"
role             = "influence_gauge"
scale.type       = "ordinal"
scale.labels     = ["False positive", "Unlikely valid", "Unclear", "Likely valid", "Clearly valid"]
consensus_weight = 2.0

# --- Key to this example ------------------------------------
# consensus_skip.match greys this row out in the validator UI
# whenever the validator's own "validity" answer is
# "False positive" — so the proposed fix is only scored when
# the credit issue is judged real.
[[validation.rubric]]
id               = "fix_quality"
label            = "Fix quality"
prompt           = "How adequate is the proposed correction? (N/A if the issue is a false positive.)"
scale.type       = "ordinal"
scale.labels     = ["Unsound", "Weak", "Partial", "Mostly sound", "Fully sound"]
consensus_skip.match = { validity = ["False positive"] }
consensus_weight = 1.5
# ------------------------------------------------------------

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"
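
The skip condition can be read as a small predicate over one validator's answers. The Python sketch below shows that reading. The `skipped_rows` helper is hypothetical, and treating several keys in one `match` table as a logical AND is an assumption, since the example only uses a single key.

```python
# Mirrors consensus_skip.match on the "fix_quality" row above.
SKIP_RULES = {
    "fix_quality": {"validity": ["False positive"]},
}


def skipped_rows(answers, skip_rules):
    """Return rubric ids whose skip condition matches this validator's answers."""
    skipped = []
    for row_id, match in skip_rules.items():
        # Assumption: every referenced row must carry one of its trigger labels.
        if all(answers.get(dep) in labels for dep, labels in match.items()):
            skipped.append(row_id)
    return skipped


print(skipped_rows({"validity": "False positive"}, SKIP_RULES))
print(skipped_rows({"validity": "Likely valid"}, SKIP_RULES))
```

The check uses only the validator's own answers, so two validators on the same item can see different rows greyed out.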

Model-Prediction Agreement

Use this when a model already emits a prediction for each item and you want to measure how closely an independent human panel lands on it, without letting that prediction anchor the reviewers. What makes it interesting: [[validation.reported_label]] with quality = true turns the model's prediction into a hidden distance reference, so the dimension's Quality Rating becomes how close the panel's consensus sits to the model while Consensus Strength still measures agreement on its own.

reported-label

model-prediction-agreement/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_trivia_agreement"

[[ingestion.sources]]
id   = "responses"
type = "csv"
path = "responses.csv"

[ingestion.fields]
id           = "responses.item_id"
prompt       = "responses.prompt"
response     = "responses.response"
model_rating = "responses.model_rating"

[[validation.evidence]]
type            = "markdown"
title           = "Trivia question"
ingestion_field = "prompt"

[[validation.evidence]]
type            = "markdown"
title           = "Model answer"
ingestion_field = "response"

[[validation.rubric]]
id               = "rating"
label            = "Answer quality"
prompt           = "How good is this answer to the Beatles trivia question?"
role             = "influence_gauge"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["very poor", "poor", "fair", "good", "excellent"]
consensus_weight = 2.0

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your rating?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

# --- Key to this example ------------------------------------
# The model's predicted rating (responses.model_rating) scores
# the dimension by distance and is never shown, so it cannot
# anchor how validators answer.
[[validation.reported_label]]
dimension = "rating"
field     = "model_rating"
quality   = true
map       = { "1" = "very poor", "2" = "poor", "3" = "fair", "4" = "good", "5" = "excellent" }
# ------------------------------------------------------------

[validators]
num_validators = 3
reward_usd     = "2.00"
stake_usd      = "0.00"

Because quality = true, model_rating is stripped from everything validators see, so do not also list it in [[validation.evidence]]. The map translates the model's 1 to 5 output onto the scale labels, and the mapped value must land on a declared anchor or the dimension falls back to raw-value quality. The rating Quality Rating now reads as agreement with the model: 100 when the panel lands exactly on the prediction and lower as it drifts, while Consensus Strength still measures how tightly the panel agreed with itself. See [[validation.reported_label]] for the scoring formula.
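
The anchor lookup described above can be sketched in a few lines of Python. This covers only the translation step from the model's raw rating to a scale value; it does not attempt the scoring formula, which lives in the reference. The `reference_anchor` helper is hypothetical.

```python
# Values and labels of the "rating" rubric row, plus the reported_label map.
SCALE_VALUES = [0, 25, 50, 75, 100]
SCALE_LABELS = ["very poor", "poor", "fair", "good", "excellent"]
LABEL_MAP = {"1": "very poor", "2": "poor", "3": "fair", "4": "good", "5": "excellent"}


def reference_anchor(model_rating):
    """Translate a raw model rating to a scale value, or None if it misses every anchor."""
    label = LABEL_MAP.get(str(model_rating))
    if label not in SCALE_LABELS:
        # Per the text above, this case falls back to raw-value quality.
        return None
    return SCALE_VALUES[SCALE_LABELS.index(label)]


print(reference_anchor(4))
print(reference_anchor(9))
```

A quick pass like this over the source column shows how many rows would miss an anchor before the project is launched.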

JSON Array Unnesting

Use this when each uploaded JSON file wraps many review items in one array, for example an album file with a tracks list. What makes it interesting: unnest.array_key turns one file into one review item per array element, expanding each element's object fields into row columns.

json-unnest

json-array-unnest/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_tracklist_unnest"

# --- Key to this example ------------------------------------
# unnest.array_key expands each element of the "tracks" array
# in albums/abbey-road.json into its own review item;
# root-level scalars like album and releasedYear are not
# copied onto rows.
[[ingestion.sources]]
id               = "album"
type             = "json"
path_glob        = "albums/*.json"
unnest.array_key = "tracks"
# ------------------------------------------------------------

[ingestion.fields]
id         = "album.id"
title      = "album.title"
songwriter = "album.songwriter"
lead_vocal = "album.leadVocal"
duration   = "album.duration"

[[validation.evidence]]
type  = "datapoint_facts"
title = "Track"
fields = [
  { label = "Title", field = "title" },
  { label = "Songwriter", field = "songwriter" },
  { label = "Lead vocal", field = "lead_vocal" },
  { label = "Duration", field = "duration" },
]

[[validation.rubric]]
id               = "validity"
label            = "Credit correct"
prompt           = "Is the songwriter credit for this track correct?"
scale.type       = "ordinal"
scale.labels     = ["no", "yes"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"
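
As a rough picture of what unnesting does, the Python sketch below expands a made-up album file into rows. The JSON content and the `unnest` helper are hypothetical; the only behaviour taken from the spec is that each array element becomes a row and root-level scalars are not copied.

```python
import json

# Hypothetical album file; key names follow the [ingestion.fields] mapping above.
ALBUM_JSON = """
{
  "album": "Example Album",
  "releasedYear": 2000,
  "tracks": [
    {"id": "tr-01", "title": "First Track", "songwriter": "Writer A",
     "leadVocal": "Singer A", "duration": "3:05"},
    {"id": "tr-02", "title": "Second Track", "songwriter": "Writer B",
     "leadVocal": "Singer B", "duration": "2:47"}
  ]
}
"""


def unnest(document, array_key):
    """One row per element of document[array_key]; root-level scalars are dropped."""
    return [dict(element) for element in document[array_key]]


rows = unnest(json.loads(ALBUM_JSON), "tracks")
for row in rows:
    print(row["id"], row["title"], row["leadVocal"])
```

Because `album` and `releasedYear` never reach the rows, anything validators need from the root would have to be repeated inside each array element.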

Source Excerpt Review

Use this when validators need to judge a finding against the actual source code or document text, not just a summary. What makes it interesting: source_excerpt fetches and displays pinned content from a public repository at a specific commit directly in the task panel.

source-excerpt

source-excerpt-review/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_source_review"

[[ingestion.sources]]
id        = "findings"
type      = "json"
path_glob = "findings/*.json"

[ingestion.fields]
id                = "findings.id"
title             = "findings.title"
description       = "findings.description"
repository        = "findings.repository"
commit_sha        = "findings.commitSha"
source_path       = "findings.sourcePath"
proposed_severity = "findings.proposedSeverity"

[[validation.evidence]]
type            = "markdown"
title           = "Claim"
ingestion_field = "description"

# --- Key to this example ------------------------------------
# source_excerpt fetches the pinned file (findings.sourcePath)
# from the repository at findings.commitSha and shows it in
# the task panel, so validators judge against the real log.
[[validation.evidence]]
type       = "source_excerpt"
title      = "Session log at commit"
repository = "repository"
path       = "source_path"
commit_sha = "commit_sha"
# ------------------------------------------------------------

[[validation.evidence]]
type  = "datapoint_facts"
title = "Claim metadata"
fields = [
  { label = "Title", field = "title" },
  { label = "Proposed severity", field = "proposed_severity" },
]

[[validation.rubric]]
id               = "validity"
label            = "Validity"
prompt           = "Does the cited session log support this claim?"
role             = "influence_gauge"
scale.type       = "ordinal"
scale.labels     = ["False positive", "Unlikely valid", "Unclear", "Likely valid", "Clearly valid"]
consensus_weight = 2.0

[[validation.rubric]]
id               = "severity"
label            = "Severity"
prompt           = "How severe is this error if valid?"
scale.type       = "ordinal"
scale.labels     = ["info", "low", "medium", "high", "critical"]
consensus_weight = 1.5

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How confident are you in your assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"

Routed Expert Review

Use this when some items need specialists and others can be handled by a general pool. What makes it interesting: validator classes and routes let higher-risk rows get more reviewers and a senior mix without changing the item schema.

severity-routing

routed-expert-review/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_expert_review"

[[ingestion.sources]]
id        = "findings"
type      = "json"
path_glob = "findings/*.json"

[ingestion.fields]
id                = "findings.id"
title             = "findings.title"
summary           = "findings.summary"
source_path       = "findings.sourcePath"
proposed_severity = "findings.proposedSeverity"
detected_by       = "findings.detectedBy"

[[validation.evidence]]
type            = "markdown"
title           = "Catalog issue"
ingestion_field = "summary"

[[validation.evidence]]
type  = "datapoint_facts"
title = "Issue facts"
fields = [
  { label = "Title", field = "title" },
  { label = "Source", field = "source_path" },
  { label = "Proposed severity", field = "proposed_severity" },
  { label = "Detected by", field = "detected_by" },
]

[[validation.rubric]]
id               = "validity"
label            = "Validity"
prompt           = "Is this a real catalog issue?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "impact"
label            = "Impact"
prompt           = "How much practical impact would this issue have?"
scale.type       = "ordinal"
scale.labels     = ["none", "low", "medium", "high", "critical"]
consensus_weight = 1.5

[[validation.rubric]]
id               = "confidence"
label            = "Confidence"
prompt           = "How certain are you in this assessment?"
role             = "certainty"
scale.type       = "numeric"
scale.values     = [0, 25, 50, 75, 100]
scale.labels     = ["none", "low", "medium", "high", "certain"]
consensus_weight = 1.0

[validators]
num_validators = 3
reward_usd     = "5.00"
stake_usd      = "0.00"

# --- Key to this example ------------------------------------
# Validator classes plus routes send high/critical issues
# (findings.proposedSeverity) — the iconic tracks — to a
# larger, senior-heavy panel, while everything else falls
# through to the default 3-reviewer route.
[[validators.classes]]
id         = "generalist"
label      = "General critic"
type       = "human"
priority   = 20
reward_usd = "5.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "senior"
label      = "Senior critic"
type       = "human"
priority   = 10
reward_usd = "20.00"
stake_usd  = "0.00"

[[validators.routes]]
match = { proposed_severity = ["high", "critical"] }
total = 5

[[validators.routes.composition]]
class = "senior"
count = 2

[[validators.routes.composition]]
class = "*"
count = 3

[[validators.routes]]
total = 3

[[validators.routes.composition]]
class = "*"
count = 3
# ------------------------------------------------------------
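
One way to reason about the routes is as an ordered list where the first match wins. The Python sketch below encodes that reading. Both the first-match order and the rule that composition counts add up to `total` are assumptions inferred from the example, and the helper names are hypothetical.

```python
# Mirrors the two [[validators.routes]] entries above, in file order.
ROUTES = [
    {
        "match": {"proposed_severity": ["high", "critical"]},
        "total": 5,
        "composition": [{"class": "senior", "count": 2}, {"class": "*", "count": 3}],
    },
    {
        "total": 3,
        "composition": [{"class": "*", "count": 3}],
    },
]


def pick_route(item, routes):
    """Return the first route whose match table accepts the item."""
    for route in routes:
        match = route.get("match", {})  # a route without match accepts everything
        if all(item.get(field) in allowed for field, allowed in match.items()):
            return route
    raise LookupError("no route matched and no default route is defined")


def composition_is_consistent(route):
    """Assumed sanity check: slot counts should add up to the route total."""
    return sum(slot["count"] for slot in route["composition"]) == route["total"]


print(pick_route({"proposed_severity": "critical"}, ROUTES)["total"])
print(pick_route({"proposed_severity": "low"}, ROUTES)["total"])
```

Under this reading, the unconditional route has to come last, otherwise it would shadow the severity route.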

AI Panel With Human Escalation

Use this when an AI panel can handle first-pass review and humans should focus on contested items. What makes it interesting: each AI class can use a different model or prompt, and escalation steps run in order until the item reaches verified consensus: first one senior human, then a larger human panel if agreement is still insufficient.

ai-escalation

ai-panel-escalation/

poq.toml

[project]
spec_version = "1"
tag          = "beatles_ai_review"

[[ingestion.sources]]
id   = "cases"
type = "csv"
path = "cases.csv"

[ingestion.fields]
id                = "cases.case_id"
prompt            = "cases.prompt"
candidate_answer  = "cases.candidate_answer"
reference_context = "cases.reference_context"
risk_tier         = "cases.risk_tier"

[[validation.evidence]]
type            = "markdown"
title           = "Trivia question"
ingestion_field = "prompt"

[[validation.evidence]]
type            = "markdown"
title           = "Candidate answer"
ingestion_field = "candidate_answer"

[[validation.evidence]]
type            = "markdown"
title           = "Reference fact"
ingestion_field = "reference_context"

[[validation.rubric]]
id               = "correctness"
label            = "Correctness"
prompt           = "Is the answer correct given the reference fact?"
role             = "influence_gauge"
scale.type       = "likert_agreement"
scale.size       = 7
consensus_weight = 2.0

[[validation.rubric]]
id               = "completeness"
label            = "Completeness"
prompt           = "Does the answer cover the important parts of the question?"
scale.type       = "likert_agreement"
scale.size       = 5
consensus_weight = 1.0

[[validation.rubric]]
id               = "grounding"
label            = "Grounding"
prompt           = "Does the answer avoid unsupported or misleading claims?"
scale.type       = "likert_agreement"
scale.size       = 5
consensus_weight = 1.5

[validators]
num_validators = 3
reward_usd     = "0.00"
stake_usd      = "0.00"

# --- Key to this example ------------------------------------
# Each AI class runs a different model; the base route uses a
# three-model AI panel, and the escalation steps add senior
# music critics in waves until the item reaches verified
# consensus.
[[validators.classes]]
id         = "reasoning-model"
label      = "Reasoning model"
type       = "ai"
model      = "provider/reasoning-model"
prompt     = "Review the trivia answer carefully. Score each rubric row using only the reference fact."
priority   = 30
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "fast-model"
label      = "Fast model"
type       = "ai"
model      = "provider/fast-model"
prompt     = "Review the answer for correctness, completeness, and grounding."
priority   = 31
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "policy-model"
label      = "Policy model"
type       = "ai"
model      = "provider/policy-model"
prompt     = "Focus on unsupported claims and misleading statements about the catalog."
priority   = 32
reward_usd = "0.00"
stake_usd  = "0.00"

[[validators.classes]]
id         = "human-senior"
label      = "Senior music critic"
type       = "human"
priority   = 10
reward_usd = "15.00"
stake_usd  = "0.00"

[[validators.routes]]
total = 3

[[validators.routes.composition]]
class = "reasoning-model"
count = 1

[[validators.routes.composition]]
class = "fast-model"
count = 1

[[validators.routes.composition]]
class = "policy-model"
count = 1

[[validators.routes.escalation]]
add = 1

[[validators.routes.escalation.composition]]
class = "human-senior"
count = 1

[[validators.routes.escalation]]
add = 2

[[validators.routes.escalation.composition]]
class = "human-senior"
count = 2
# ------------------------------------------------------------
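
The escalation order can be sketched as a loop that grows the panel one step at a time. In the Python below, `has_consensus` stands in for whatever check decides that an item has reached verified consensus; that predicate and the `run_with_escalation` helper are hypothetical, and only the panel sizes come from the spec above.

```python
# Base route: one slot per AI class.
BASE_PANEL = ["reasoning-model", "fast-model", "policy-model"]

# Escalation steps in file order: add 1 senior human, then 2 more.
ESCALATION_STEPS = [
    {"add": 1, "class": "human-senior"},
    {"add": 2, "class": "human-senior"},
]


def run_with_escalation(has_consensus):
    """Grow the panel step by step until has_consensus(panel) is true or steps run out."""
    panel = list(BASE_PANEL)
    for step in ESCALATION_STEPS:
        if has_consensus(panel):
            break
        panel += [step["class"]] * step["add"]
    return panel


# Never reaching consensus exhausts both steps: 3 AI slots plus 3 human slots.
print(len(run_with_escalation(lambda panel: False)))
```

Items that settle on the AI panel alone never reach a human, so the senior reward is only paid on contested items.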