Copy this prompt into your AI assistant (ChatGPT, Claude, etc.) along with the video context. It will help you write good questions.
You are helping create questions for THERMAL-KITCHENS, a benchmark
that tests whether AI models can understand cooking videos using both
RGB (visible) and thermal infrared cameras.
The core challenge: thermal cameras reveal physical heat state that
RGB cannot. The best questions are ones where RGB appearance is
misleading or insufficient, and thermal evidence is required to reach
the correct answer.
Key concepts in this benchmark:
- Latent physical state: heat properties invisible to RGB even under
good lighting (residual heat, hotspots, contact traces)
- Cross-modal conflict: RGB appearance actively misleads
(visible steam but cold liquid, seared surface but raw interior,
reheated food that looks hot but is unevenly heated)
- Thermal grounding: connecting thermal evidence to specific objects,
regions, or actions in the scene
- Hazard persistence: dangerous heat that remains after the visible
cause has ended (burner off but surface still hot)
- Temporal thermal evolution: how heat state changes, spreads,
or dissipates over the course of a clip
---
## What makes a good question
The question must require thermal evidence that RGB cannot provide.
Ask about heat states, temperature distributions, or thermal changes
that are invisible to the naked eye.
Good scenarios (aligned with benchmark design):
- Residual heat after action ends: burner off, pan still hot,
RGB shows no active heat source
- Cross-modal conflict: visible steam/smoke suggests heat,
thermal reveals actual low temperature
- Uneven reheating: food looks uniformly warm from RGB,
thermal shows cold center
- Contact hazard persistence: object touched or heated,
hand has left, thermal trace remains
- New cold object introduced: fresh ingredient or cold utensil
placed into hot environment, thermal contrast visible
- Heat transfer between objects: conduction through cookware,
thermal spread across surfaces
---
## What makes a bad question
Avoid these patterns — they will be rejected:
- Commonsense bypass: the answer is obvious without watching the video
("Is the oil hot after frying for 10 minutes?" — everyone knows yes)
- Single-frame sufficient: a single thermal frame answers the question;
no need to watch the clip at all
- RGB bypass: visible cues (bubbling, smoke, color change, action)
already answer the question without thermal
- Food science dependency: the answer requires knowing specific
temperature thresholds (starch gelatinization, safe internal temps)
rather than reading observable evidence from the video
- Threshold language: avoid "boiling point", "near-boiling",
"safe temperature" — use relative descriptions instead
("hotter than", "cooler than", "no observable change")
- Temporal mismatch: the question claims to require the full clip
but a single end frame would suffice
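The threshold-language rule above is the one pattern in this list that can be partly screened mechanically. The following is a minimal sketch, not part of the benchmark tooling: the function name `find_threshold_language` and the phrase list are assumptions, seeded only from the three phrases quoted in the rule. It does exact phrase matching, so variants such as "safe internal cooking temperature" are not caught unless they are added to the list.

```python
# Hypothetical phrase list, seeded from the three examples in the rule above.
THRESHOLD_PHRASES = ["boiling point", "near-boiling", "safe temperature"]


def find_threshold_language(text, phrases=THRESHOLD_PHRASES):
    """Return the threshold phrases that appear in text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in phrases if phrase in lowered]


if __name__ == "__main__":
    draft = "Has the liquid reached its boiling point by the end of the clip?"
    print(find_threshold_language(draft))
```

A non-empty result only signals that the wording should be reviewed; the other patterns (commonsense bypass, RGB bypass, single-frame sufficiency) still need human judgment.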
---
## Option design rules
All four options must be on the same judgment dimension.
Do not mix:
- Present-state descriptions with historical trajectory claims
- Temperature thresholds with relative comparisons
- Yes/No prefixes with standalone statements
At least one distractor should reflect the natural RGB-only
misinterpretation — what a model would answer if it only saw the
visible camera.
Options must be mutually exclusive. If two options could both be
true at the same time, redesign them.
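Some of these rules can be pre-checked before human review. The sketch below is an assumption-laden illustration, not the benchmark's own validator: `check_options` and its messages are hypothetical names, and it only covers the option count, exact duplicates, and mixed Yes/No prefixes. Mutual exclusivity and a shared judgment dimension cannot be verified this way.

```python
import re

# Matches options that open with "Yes," / "No:" and similar. Requiring
# punctuation keeps phrases like "No observable change" from counting
# as a Yes/No prefix.
YES_NO_PREFIX = re.compile(r"^\s*(yes|no)\s*[,:.]", re.IGNORECASE)


def check_options(options):
    """Return a list of rule violations found in a list of option strings."""
    problems = []
    if len(options) != 4:
        problems.append(f"expected 4 options, got {len(options)}")
    normalized = [option.strip().lower() for option in options]
    if len(set(normalized)) != len(normalized):
        problems.append("duplicate options")
    prefixed = [bool(YES_NO_PREFIX.match(option)) for option in options]
    if any(prefixed) and not all(prefixed):
        problems.append("mixes Yes/No prefixes with standalone statements")
    return problems


if __name__ == "__main__":
    print(check_options([
        "Yes, the surface is hotter than before",
        "No, the surface is cooler than before",
        "The surface shows no observable change",
        "The surface cools and then reheats",
    ]))
```

An empty list means none of the three mechanical checks fired; it does not mean the option set is well designed.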
---
## Examples
GOOD question (cross-modal conflict):
"Based on both the visible appearance and the thermal evidence,
what is the most accurate assessment of the liquid's temperature
trend over this period?"
- Works because: RGB (steam) suggests heat, thermal may contradict
- Requires: both modalities, neither alone suffices
BAD question (commonsense bypass):
"After stir-frying for 10 minutes, is the wok hot?"
- Fails because: anyone would answer yes without watching the video
GOOD question (heat distribution):
"Based on the thermal evidence, how does the thermal state of the
newly added layer compare to the existing layers by the end of
the clip?"
- Works because: requires observing thermal contrast between objects
- RGB only shows that something was added, not its temperature
BAD question (food science dependency):
"Has the chicken reached a safe internal cooking temperature?"
- Fails because: requires knowing the safe temperature threshold,
not reading observable thermal evidence
GOOD question (temporal, residual heat):
"Over the course of the clip, which surface retains the highest
thermal reading the longest after the heat source is removed?"
- Works because: requires tracking thermal change across time
- The stem names no specific action or appliance
BAD question (single-frame sufficient):
"By the end of the clip, is the pan hot or cold?"
- Fails because: one thermal frame at the end answers this completely
GOOD question (contact hazard persistence):
"Based on the thermal evidence, is there any observable thermal
trace on the surface after the last contact event in the clip?"
- Works because: requires temporal observation of thermal persistence
- RGB cannot detect residual heat from touch
BAD question (RGB bypass):
"After the lid is placed on the pot, does steam escape?"
- Fails because: steam is visible in RGB, no thermal needed