Kitchen RGBT Annotation

Annotation Prompt

Copy this prompt into your AI assistant (ChatGPT, Claude, etc.) along with the video context. It will help you write good questions.

You are helping create questions for THERMAL-KITCHENS, a benchmark
that tests whether AI models can understand cooking videos using both
RGB (visible) and thermal infrared cameras.

The core challenge: thermal cameras reveal physical heat state that
RGB cannot. The best questions are ones where RGB appearance is
misleading or insufficient, and thermal evidence is required to reach
the correct answer.

Key concepts in this benchmark:
- Latent physical state: heat properties invisible to RGB even under
  good lighting (residual heat, hotspots, contact traces)
- Cross-modal conflict: RGB appearance actively misleads
  (visible steam but cold liquid, seared surface but raw interior,
  reheated food that looks hot but is unevenly heated)
- Thermal grounding: connecting thermal evidence to specific objects,
  regions, or actions in the scene
- Hazard persistence: dangerous heat that remains after the visible
  cause has ended (burner off but surface still hot)
- Temporal thermal evolution: how heat state changes, spreads,
  or dissipates over the course of a clip

---

## What makes a good question

The question must require thermal evidence that RGB cannot provide.
Ask about heat states, temperature distributions, or thermal changes
that are invisible to the naked eye.

Good scenarios (aligned with benchmark design):
- Residual heat after action ends: burner off, pan still hot,
  RGB shows no active heat source
- Cross-modal conflict: visible steam/smoke suggests heat,
  thermal reveals actual low temperature
- Uneven reheating: food looks uniformly warm from RGB,
  thermal shows cold center
- Contact hazard persistence: object touched or heated,
  hand has left, thermal trace remains
- New cold object introduced: fresh ingredient or cold utensil
  placed into hot environment, thermal contrast visible
- Heat transfer between objects: conduction through cookware,
  thermal spread across surfaces

---

## What makes a bad question

Avoid these patterns — they will be rejected:

- Commonsense bypass: the answer is obvious without watching the video
  ("Is the oil hot after frying for 10 minutes?" — everyone knows yes)
- Single-frame sufficient: a single thermal frame answers the question;
  no need to watch the clip at all
- RGB bypass: visible cues (bubbling, smoke, color change, action)
  already answer the question without thermal
- Food science dependency: the answer requires knowing specific
  temperature thresholds (starch gelatinization, safe internal temps)
  rather than reading observable evidence from the video
- Threshold language: avoid "boiling point", "near-boiling",
  "safe temperature" — use relative descriptions instead
  ("hotter than", "cooler than", "no observable change")
- Temporal mismatch: the question claims to require the full clip
  but a single end frame would suffice
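
The threshold-language rule above is mechanical enough to screen automatically before review. A minimal sketch in Python, assuming the banned phrases are exactly the three quoted in that rule; the name `find_threshold_language` and the phrase list are illustrative and not part of the annotation tool:

```python
import re
from typing import List

# Phrases the guideline asks annotators to avoid (copied from the rule above).
# Assumption: this list is illustrative and can be extended.
BANNED_PHRASES = ["boiling point", "near-boiling", "safe temperature"]


def find_threshold_language(text: str) -> List[str]:
    """Return the banned phrases that occur in `text` (case-insensitive)."""
    lowered = text.lower()
    found = []
    for phrase in BANNED_PHRASES:
        # Treat hyphens and whitespace interchangeably, so that
        # "near boiling" and "near-boiling" are both caught.
        parts = re.split(r"[-\s]+", phrase)
        pattern = r"\b" + r"[-\s]+".join(re.escape(p) for p in parts) + r"\b"
        if re.search(pattern, lowered):
            found.append(phrase)
    return found
```

A draft stem or option that returns a non-empty list is a candidate for rewording with relative descriptions ("hotter than", "cooler than").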

---

## Option design rules

All four options must be on the same judgment dimension.
Do not mix:
- Present-state descriptions with historical trajectory claims
- Temperature thresholds with relative comparisons
- Yes/No prefixes with standalone statements

At least one distractor should reflect the natural RGB-only
misinterpretation — what a model would answer if it only saw the
visible camera.

Options must be mutually exclusive. If two options could both be
true at the same time, redesign them.

---

## Examples

GOOD question (cross-modal conflict):
"Based on both the visible appearance and the thermal evidence,
what is the most accurate assessment of the liquid's temperature
trend over this period?"
- Works because: RGB (steam) suggests heat, thermal may contradict
- Requires: both modalities, neither alone suffices

BAD question (commonsense bypass):
"After stir-frying for 10 minutes, is the wok hot?"
- Fails because: anyone answers yes without watching anything

GOOD question (heat distribution):
"Based on the thermal evidence, how does the thermal state of the
newly added layer compare to the existing layers by the end of
the clip?"
- Works because: requires observing thermal contrast between objects
- RGB only shows that something was added, not its temperature

BAD question (food science dependency):
"Has the chicken reached a safe internal cooking temperature?"
- Fails because: requires knowing the safe temperature threshold,
  not reading observable thermal evidence

GOOD question (temporal, residual heat):
"Over the course of the clip, which surface retains the highest
thermal reading the longest after the heat source is removed?"
- Works because: requires tracking thermal change across time
- No actions or appliances named in the stem

BAD question (single-frame sufficient):
"By the end of the clip, is the pan hot or cold?"
- Fails because: one thermal frame at the end answers this completely

GOOD question (contact hazard persistence):
"Based on the thermal evidence, is there any observable thermal
trace on the surface after the last contact event in the clip?"
- Works because: requires temporal observation of thermal persistence
- RGB cannot detect residual heat from touch

BAD question (RGB bypass):
"After the lid is placed on the pot, does steam escape?"
- Fails because: steam is visible in RGB, no thermal needed

Question Templates

This prompt contains 10 question templates, each with anti-patterns and a leak-risk rating, plus a self-check checklist. Each template has been validated against state-of-the-art VLMs. Copy the prompt below and use it with your AI assistant along with the video context.

You are helping create questions for THERMAL-KITCHENS, a benchmark
that tests whether AI models can understand cooking videos using both
RGB (visible) and thermal infrared cameras.

A question is "good" only if it satisfies BOTH:
  (1) A strong LLM given only the question text and options cannot
      reliably answer it (i.e., no language-prior shortcut).
  (2) A strong VLM given only the RGB video cannot reliably answer
      it (i.e., no visual shortcut from appearance or action).

Below are 10 question templates, each with a leak-risk rating, a
cleaned example, and template-specific anti-patterns. Read the
"ANTI-TEMPLATES" section first — it defines what NOT to write.

═══════════════════════════════════════════════════════════════
ANTI-TEMPLATES — Questions that LEAK (do NOT write these)
═══════════════════════════════════════════════════════════════

Anti-pattern 1 — Thermal observation described in the question stem
  Q: "The pan shows a bright thermal signature while the handle
      shows a dim signature. Is the handle safe to touch?"
  Why it leaks: The stem hands the thermal observation to the model.
  The task degrades into a physics reading-comprehension problem
  that any LLM can solve without ever looking at the video.

Anti-pattern 2 — Options enumerate a textbook physical process
  Q: "What is the cooking stage of the eggs?"
     A. Raw  B. Half-cooked  C. Fully cooked  D. Overcooked
  Why it leaks: The four options reconstruct the standard cooking
  progression. An LLM infers the typical state from context
  ("kitchen scene + eggs being cooked → likely fully cooked") and
  picks the most common answer — no vision needed.

Anti-pattern 3 — Options bind "observation" to "conclusion"
  Q: "Which assessment of the water's state is best supported?"
     A. Low thermal intensity, no vapor → water cold
     B. High thermal intensity, visible vapor → water boiling
  Why it leaks: The observation-to-conclusion mapping is given
  inside the options. The model only has to pick the description
  that matches the video, without performing thermal-to-state
  inference. Also: the "visible vapor" cue is RGB-solvable.

Anti-pattern 4 — Option specificity asymmetry
  A. Water is warm
  B. Water is at approximately 68°C after 3 minutes of heating
  Why it leaks: Option B is markedly more specific and
  authoritative. LLMs have a well-documented bias toward detailed,
  confident-sounding options, independent of correctness.

Anti-pattern 5 — "Cannot determine" as a distractor among
specific options
  A. Cannot determine  B. One-third  C. Two-thirds  D. Almost full
  Why it leaks: RLHF-style training tends to make VLMs over-select
  "cannot determine" as a safe hedge. Its presence shifts the
  selection distribution in predictable ways, which leaks signal.

Anti-pattern 6 — One option is physically implausible
  A. The pan is hot everywhere
  B. The pan is cool everywhere
  C. The pan has uneven heating with localized hotspots
  D. The lid stays ice-cold while the base glows red-hot
  Why it leaks: D is physically implausible under normal cooking.
  Any LLM eliminates it instantly, turning a 4-choice question
  (25% chance level) into a 3-choice one (33% chance level).

Anti-pattern 7 — Stem implies a typical scenario
  Q: "The steamer is a covered pot for cooking vegetables. Based on
      the thermal signature, what is the state of the water inside?"
  Why it leaks: "for cooking vegetables" anchors the LLM to the
  typical scenario (water at boiling, producing steam). The answer
  can be guessed from the priors alone.

Anti-pattern 8 — Correct answer is the "counter-intuitive" option
  If annotators consistently make the correct answer the less
  obvious option (to "trick" RGB-only models), LLMs learn this
  meta-pattern and select counter-intuitive options by default.
  This is an "anti-shortcut shortcut"; avoid it.

═══════════════════════════════════════════════════════════════
TEMPLATE 1 — Quantitative Thermal Reading
═══════════════════════════════════════════════════════════════
Leak risk: LOW  (if ranges avoid stereotypical cooking temperatures)

Skill: Read a specific numerical temperature from the thermal
color scale bar for an object in the scene.

Pattern:
  "What is the approximate temperature of [object] at [timestamp]?"
  Options: 4 narrow, adjacent numeric ranges

Cleaned example:
  Q: "At timestamp 00:42, what is the approximate surface
      temperature of the liquid in the pan?"
  A. 45-54°C   B. 55-64°C   C. 65-74°C   D. 75-84°C

Rules:
  - Ranges must be narrow (≤15°C each) and strictly adjacent
  - AVOID stereotypical kitchen temperatures as the correct answer
    (100°C water boil, 180°C frying, 4°C refrigerated) — they are
    language priors
  - The object's temperature should NOT be inferable from what it
    is (e.g. don't ask about a "room-temperature egg" when the
    answer is 20-30°C)
  - The thermal color scale bar must be clearly legible

Template-specific anti-pattern:
  ✗ "What is the temperature of the frying oil?"
     A. 80°C  B. 120°C  C. 180°C  D. 240°C
  (180°C is the textbook frying temperature; LLM picks it without
   thermal input.)
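
The numeric rules for this template can be checked in code before a question goes to review. A minimal sketch in Python, under stated assumptions: "strictly adjacent" is read as "ascending, non-overlapping, with at most `max_gap` °C between neighbouring ranges", and the stereotypical temperatures are the three named in the rules. The function name `check_ranges` and both default parameters are illustrative only:

```python
from typing import List, Tuple

# Stereotypical kitchen temperatures named in the rules above (°C).
STEREOTYPICAL_C = (4.0, 100.0, 180.0)


def check_ranges(ranges: List[Tuple[float, float]], correct_index: int,
                 max_width: float = 15.0, max_gap: float = 1.0) -> List[str]:
    """Return a list of rule violations; an empty list means the options pass."""
    problems = []
    for low, high in ranges:
        if high - low > max_width:
            problems.append(f"range {low}-{high} is wider than {max_width}°C")
    # Options are expected in ascending order; compare each neighbouring pair.
    for (_, prev_high), (next_low, _) in zip(ranges, ranges[1:]):
        gap = next_low - prev_high
        if gap < 0:
            problems.append(f"ranges overlap at {next_low}")
        elif gap > max_gap:
            problems.append(f"gap of {gap}°C before {next_low}")
    # Only the correct option is tested against stereotypical temperatures.
    low, high = ranges[correct_index]
    for temp in STEREOTYPICAL_C:
        if low <= temp <= high:
            problems.append(f"correct range contains stereotypical {temp}°C")
    return problems
```

For instance, `check_ranges([(45, 54), (55, 64), (65, 74), (75, 84)], correct_index=1)` returns an empty list, while a set of wide ranges whose correct option spans 180°C is flagged on both counts.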

═══════════════════════════════════════════════════════════════
TEMPLATE 2 — Multi-Object Temperature Comparison
═══════════════════════════════════════════════════════════════
Leak risk: MEDIUM-HIGH  (typical food temperature priors are strong)

Skill: Compare the thermal states of 3+ objects present in the
same frame.

Pattern:
  "At [timestamp], rank [A], [B], [C] by surface temperature
   from hottest to coldest."
  Options: different ranking permutations

Cleaned example:
  Q: "At timestamp 01:15, rank the surface temperatures of the
      red bowl, the knife blade, and the ceramic plate from
      hottest to coldest."
  A. bowl > knife > plate
  B. knife > plate > bowl
  C. plate > bowl > knife
  D. knife > bowl > plate

Rules:
  - At least 3 objects, all visible in the same frame
  - Objects should have NO obvious temperature hierarchy from
    their identity alone (e.g. don't compare "hot pan vs cold
    milk" — that's a language prior)
  - Include objects whose temperature results from recent
    interaction, not their inherent type
  - Correct ranking should require inspecting thermal, not
    inferring from object identity

Template-specific anti-pattern:
  ✗ "Rank the frying pan, the refrigerated milk, and the ice cube
     from hottest to coldest."
  (Trivially answerable from identity priors alone.)

═══════════════════════════════════════════════════════════════
TEMPLATE 3 — Temporal Thermal Change
═══════════════════════════════════════════════════════════════
Leak risk: LOW-MEDIUM  (magnitude questions harder than direction)

Skill: Track how an object's temperature changes between two
specified time points.

Pattern:
  "Between [time A] and [time B], how did the temperature of
   [object] change?"
  Options: different directions + magnitude bands

Cleaned example:
  Q: "Between 00:30 and 01:45, how did the center of the metal
      plate change in surface temperature?"
  A. Decreased by more than 20°C
  B. Decreased by 5-15°C
  C. Changed by less than 5°C
  D. Increased by 5-15°C

Rules:
  - The direction of change should NOT be obvious from RGB actions
    alone (e.g. don't ask about "a pan being heated on a burner" —
    direction is obvious; only magnitude is non-trivial)
  - Include both increase and decrease options when the direction
    of change is ambiguous
  - Magnitude bands should be narrow enough that guessing direction
    alone doesn't yield >40% accuracy

Template-specific anti-pattern:
  ✗ "After 2 minutes on the burner, how did the pan's temperature
     change?"
     A. Decreased  B. Stayed same  C. Increased slightly
     D. Increased significantly
  (RGB shows burner on → "increased" is the obvious answer.)

═══════════════════════════════════════════════════════════════
TEMPLATE 4 — Cross-Modal Conflict Resolution
═══════════════════════════════════════════════════════════════
Leak risk: HIGH  (easy to accidentally describe thermal in stem)

Skill: Resolve situations where RGB appearance and thermal
evidence disagree about the physical state.

Pattern:
  "[Object] has been [visible RGB action]. What is its actual
   thermal state?"
  Options: states consistent / inconsistent with RGB-only inference

Cleaned example:
  Q: "The metal whisk was placed on the counter 30 seconds ago,
      after being used to stir the pot. At timestamp 02:10,
      what is its surface temperature state?"
  A. Still very hot throughout
  B. Handle cool, head still warm
  C. Uniformly at room temperature
  D. Cooler than the counter surface

Rules:
  - The stem describes only the RGB action and timing, never the
    thermal observation
  - At least one distractor must be the "RGB-only naive" answer
  - The correct answer requires reading thermal to know the
    current state, not inferring from elapsed time alone

Template-specific anti-pattern:
  ✗ "The whisk appears cool in RGB but shows a warm signature in
      thermal. What does this mean?"
  (Stem describes both observations; task collapses to physics
   explanation.)

═══════════════════════════════════════════════════════════════
TEMPLATE 5 — Localized Thermal Distribution
═══════════════════════════════════════════════════════════════
Leak risk: MEDIUM  (must avoid "cooking stage" framing)

Skill: Describe the spatial thermal distribution within a single
object or region.

Pattern:
  "Within [object/region] at [timestamp], how is the temperature
   distributed?"
  Options: different spatial thermal patterns

Cleaned example:
  Q: "At timestamp 01:30, how is the temperature distributed
      across the steak's top surface?"
  A. Uniform temperature throughout the surface
  B. Center hotter than the edges
  C. Edges hotter than the center
  D. One side hotter than the other side

Rules:
  - Ask about SPATIAL distribution, not about cooking progression
  - Avoid options that reference cooking stages (raw, done, etc.)
  - All options must be physically plausible under the scene
  - The correct answer must require looking at thermal

Template-specific anti-pattern:
  ✗ "What is the doneness of the steak?"
     A. Rare  B. Medium-rare  C. Medium  D. Well-done
  (Doneness is a language concept tied to cooking-stage priors.)

═══════════════════════════════════════════════════════════════
TEMPLATE 6 — Safety & Burn Risk Judgment
═══════════════════════════════════════════════════════════════
Leak risk: MEDIUM  (must avoid RGB-cue and typical-scenario leaks)

Skill: Judge whether it is safe to touch or handle an object,
when thermal state contradicts naive RGB inference.

Pattern:
  "Based on RGB and thermal evidence, which of these objects is
   currently safe to touch with a bare hand?"
  Options: specific objects in the scene, plus "none" or "all"

Cleaned example:
  Q: "At timestamp 02:00, which of the following is currently
      safe to touch with a bare hand?"
  A. The ceramic mug on the right
  B. The wooden spoon resting on the pan edge
  C. The metal tongs on the counter
  D. None of the above

Rules:
  - Objects should look SIMILAR in RGB but differ thermally
  - At least one should be "RGB-safe but thermal-hot" or vice versa
  - Do not describe any thermal evidence in the stem

Template-specific anti-pattern:
  ✗ "The pan handle shows a dim thermal signature. Is it safe
      to grab?"
  (Thermal observation in stem → physics comprehension task.)

═══════════════════════════════════════════════════════════════
TEMPLATE 7 — Thermal-to-Object Grounding
═══════════════════════════════════════════════════════════════
Leak risk: LOW  (if thermal region is spatially ambiguous in RGB)

Skill: Identify which physical object in the RGB frame corresponds
to a thermally distinct region.

Pattern:
  "At [timestamp], a distinct thermal region appears at [spatial
   location, neutrally described]. Which object in the RGB view
   produces this signature?"
  Options: different candidate objects visible in the scene

Cleaned example:
  Q: "At timestamp 01:05, a warm region is visible in the thermal
      frame in the lower-left quadrant of the cutting board. Which
      object is producing this signature?"
  A. The knife blade resting on the board
  B. Residual oil from the onion that was diced
  C. The hand that recently touched this area
  D. Heat conducted from the pan nearby

Rules:
  - The spatial location should be neutrally described
  - Multiple candidates must be physically plausible
  - Options should span different causal sources

Template-specific anti-pattern:
  ✗ "What causes the hot spot inside the pan?"
  (If only one item is in the pan, trivially answerable from RGB.)

═══════════════════════════════════════════════════════════════
TEMPLATE 8 — Action-Thermal Effect Magnitude
═══════════════════════════════════════════════════════════════
Leak risk: MEDIUM  (direction is often RGB-obvious; magnitude is not)

Skill: Quantify the thermal consequence of an action visible in RGB.

Pattern:
  "After [RGB-visible action] at [timestamp], by how much does
   the temperature of [object] change within [time window]?"
  Options: different magnitude bands

Cleaned example:
  Q: "After room-temperature water is poured into the pot at
      01:20, by how much does the pot's interior surface
      temperature change within the next 10 seconds?"
  A. Decreases by more than 40°C
  B. Decreases by 15-40°C
  C. Decreases by less than 15°C
  D. Stays approximately the same

Rules:
  - The action must be visible in RGB
  - Magnitude options must genuinely require reading thermal
  - Include at least one "smaller than intuition" option

Template-specific anti-pattern:
  ✗ "After the burner is turned on, does the pan heat up?"
  (Trivially answerable.)

═══════════════════════════════════════════════════════════════
TEMPLATE 9 — Fill Level From Thermal Boundary
═══════════════════════════════════════════════════════════════
Leak risk: LOW  (if "cannot determine" is removed and container is
                 opaque in RGB)

Skill: Estimate the fill level of an opaque container using the
thermal boundary on the container wall.

Pattern:
  "Based on both RGB and thermal evidence, what is the fill level
   of [opaque container]?"
  Options: four specific levels

Cleaned example:
  Q: "At timestamp 01:50, what is the approximate fill level of
      the stainless-steel pot?"
  A. Under one-quarter full
  B. Approximately one-third full
  C. Approximately two-thirds full
  D. Over nine-tenths full

Rules:
  - Container must be genuinely opaque in RGB
  - Thermal must show a CLEAR temperature boundary on the wall
  - DO NOT include "cannot determine"
  - Fill levels should be spaced enough to be distinguishable

Template-specific anti-pattern:
  ✗ "How much water is in the glass measuring cup?"
  (Glass is transparent in RGB — no thermal needed.)

═══════════════════════════════════════════════════════════════
TEMPLATE 10 — Event Counting with Thermal Confirmation
═══════════════════════════════════════════════════════════════
Leak risk: LOW-MEDIUM  (must ensure thermal is actually required)

Skill: Count discrete events where RGB alone is insufficient or
ambiguous.

Pattern:
  "During the clip, how many times does [event] occur?"
  Options: adjacent integers

Cleaned example:
  Q: "During the full clip, how many separate times is a hot
      object (≥60°C) placed onto the counter surface?"
  A. 2 times  B. 3 times  C. 4 times  D. 5 times

Rules:
  - The event must be partially obscured or ambiguous in RGB
  - Thermal provides the disambiguating signal
  - Counts should be adjacent integers, no zero option
  - Avoid counting events that are fully visible in RGB

Template-specific anti-pattern:
  ✗ "How many times does the person open the refrigerator?"
  (Fully countable from RGB; thermal adds nothing.)

═══════════════════════════════════════════════════════════════
GENERAL RULES (apply to every template)
═══════════════════════════════════════════════════════════════

A. STEM RULES
  - Describe the scene and what to reason about, nothing more
  - Never describe thermal observations ("bright signature",
    "dim region", "warm area")
  - Never describe RGB observations that give away the answer
  - Never frame the scene with "typical cooking" language
    ("while heating", "during cooking", "for frying")
  - Use timestamps to anchor when specificity is needed

B. OPTION RULES
  - All 4 options on the same judgment dimension
  - Equal length (within ±30% token count)
  - Equal specificity (no single option is the most detailed)
  - Equal scientific tone (no single option is most hedged or
    most confident)
  - All 4 options physically plausible under the scene
  - Do NOT bind observation + conclusion in a single option
  - Do NOT include "cannot determine" when other options are
    specific
  - Do NOT enumerate a textbook physical progression (raw →
    done, cold → boiling)

C. CORRECT-ANSWER RULES
  - Correct answer should not be the longest or most detailed
  - Correct answer should not be the most hedged ("approximately",
    "roughly") unless all options share this language
  - Correct answer position should be randomized across the pool
  - Avoid systematically making the correct answer the
    "counter-intuitive" one — LLMs learn this meta-pattern

D. MODALITY NECESSITY CHECK
  Before submitting, ask:
  1. Could an LLM with no video answer this from language priors
     about cooking and physics? → If yes, redesign.
  2. Could a VLM with only the RGB stream answer this? → If yes,
     redesign.
  3. Could a VLM with only the thermal stream answer this? → If
     yes, this should be reclassified as a Thermal-only question,
     not RGB-T.

═══════════════════════════════════════════════════════════════
SELF-CHECK CHECKLIST (every question must PASS all 12)
═══════════════════════════════════════════════════════════════

Stem:
  [ ] 1. Stem does not describe any thermal observation
  [ ] 2. Stem does not describe RGB observations that reveal the
         answer
  [ ] 3. Stem does not imply a "typical scenario" ("for cooking",
         "while heating")
  [ ] 4. Stem does not reuse any content word that appears in the
         correct answer

Options:
  [ ] 5. All options on the same judgment dimension
  [ ] 6. All options equal length (±30%)
  [ ] 7. All options equal specificity and scientific tone
  [ ] 8. All options physically plausible under the scene
  [ ] 9. No option binds observation + conclusion
  [ ] 10. No "cannot determine" option when other options are
          specific
  [ ] 11. Options do not enumerate a textbook progression

Modality necessity:
  [ ] 12. Removing the thermal stream should make the question
          genuinely hard (not answerable from RGB + priors)
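
Two of the checklist items (4 and 10) are purely textual and can be pre-screened in code; the rest need human judgment. A minimal sketch in Python, assuming that common function words do not count as overlap for item 4. The stop-word list and both function names are illustrative assumptions:

```python
import re
from typing import List, Set

# Small illustrative stop-word list (an assumption, not from the guideline).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "is", "to",
              "than", "and", "by"}


def content_words(text: str) -> Set[str]:
    """Lower-case alphanumeric words in `text`, minus stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def stem_answer_overlap(stem: str, correct_answer: str) -> Set[str]:
    """Item 4: words shared by the stem and the correct answer."""
    return content_words(stem) & content_words(correct_answer)


def has_cannot_determine(options: List[str]) -> bool:
    """Item 10: True if any option is a "cannot determine" hedge."""
    pattern = re.compile(r"cannot\s+(be\s+)?determine", re.IGNORECASE)
    return any(pattern.search(option) for option in options)
```

A non-empty overlap set or a `True` result marks the question for a manual look; neither check replaces reading the question.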

Empirical verification (after drafting):
  [ ] Send question text + options (no video) to GPT-4o and
      Gemini. If either scores ≥3/4 across 4 samples → redesign.
  [ ] Send question + RGB video (no thermal) to GPT-4o and
      Gemini. If either scores ≥3/4 across 4 samples → redesign.
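
The redesign rule in the two checks above reduces to a simple count once the model answers have been collected. A minimal sketch in Python, assuming the sampled answers are already available as option letters; it makes no model API calls, and the function name and the model labels in the usage line are placeholders:

```python
from typing import Dict, List


def needs_redesign(samples: Dict[str, List[str]], correct: str,
                   threshold: int = 3) -> bool:
    """Return True if any model picked the correct option in at least
    `threshold` of its sampled answers (3 of 4 in the rule above)."""
    return any(
        sum(answer == correct for answer in answers) >= threshold
        for answers in samples.values()
    )
```

For example, `needs_redesign({"model_a": ["B", "B", "B", "A"], "model_b": ["C", "A", "B", "D"]}, correct="B")` returns `True`, because the first model is right in 3 of its 4 samples. The same function serves both the text-only run and the RGB-only run.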

Only after passing all of the above is the question ready for
Stage 3 human adjudication.