MTF'26 Workshop Challenge 📍 Location: Denver, Colorado

Dishcovery Mission II: Where VLM meets Food.

Build a Vision-Language Model that matches food images to detailed captions as accurately as possible! Got a VLM already? Use our 400K image–text dataset (synthetic + noisy real) to pretrain it and test it. Submit predictions, climb the leaderboard, and earn a place to present your method at the 3rd MetaFood CVPR workshop.

30k+ testing samples
2 different phases
Top 2 solutions will be presented at the CVPR workshop
Live links
Portal
Find our Kaggle submission portal here
Dataset
Start downloading our pretraining dataset here
Announcements 🔴 Last Updated: Feb 16, 2026

🔴 [Feb 16] Baseline and Pretraining datasets uploaded! 🔴

🔴 [Feb 16] Challenge is finally live. Good luck to all! 🔴

Challenge overview

Task

Build a Vision-Language Model that matches food images to the right text, with two evaluation tasks: multi-label and single-label (a small prediction sketch follows the list below).

  • Test 1 (multi-label): retrieve the relevant ingredient/component captions for each image
  • Test 2 (single-label): retrieve the single best dense food description for each image
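Below is a minimal sketch of how predictions for both tests could be produced from an image–caption similarity matrix, assuming a CLIP/SigLIP-style dual encoder; the score threshold is an illustrative assumption, not part of the official protocol.

```python
import numpy as np

def predict_from_similarity(sim: np.ndarray, threshold: float = 0.5):
    """sim: (num_images, num_captions) matrix of image-caption similarity scores.

    Returns multi-label predictions (Test 1) and single-label predictions (Test 2).
    The threshold is illustrative; tune it on validation data.
    """
    # Test 1 (multi-label): keep every caption whose score clears the threshold,
    # falling back to the single best caption if none does.
    multi_label = []
    for row in sim:
        picked = np.where(row >= threshold)[0]
        if picked.size == 0:
            picked = np.array([row.argmax()])
        multi_label.append(picked.tolist())

    # Test 2 (single-label): take the highest-scoring caption for each image.
    single_label = sim.argmax(axis=1).tolist()
    return multi_label, single_label
```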

Phases

Compete across two phases with public test sets and a final deadline.

  • Phase I: submit predictions for Test 1 + Test 2 and iterate on the leaderboard
  • Phase II: final submissions for ranking / results
  • Starter kit: a SigLIP baseline is provided to help you get started fast (a loading sketch follows below)
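The starter kit on the portal is the authoritative reference for the baseline. Purely as an illustration of the SigLIP family, the sketch below loads a publicly available SigLIP checkpoint with Hugging Face transformers; the model id, image path, and captions are placeholder assumptions, not necessarily what the baseline uses.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Example public checkpoint; the official baseline may use a different one.
ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example_dish.jpg")                 # placeholder path
captions = ["a bowl of ramen", "a margherita pizza"]   # placeholder captions

inputs = processor(text=captions, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# SigLIP scores each image-text pair independently (sigmoid, not softmax).
scores = torch.sigmoid(out.logits_per_image)
print(scores)
```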

Rules

Keep it fair, reproducible, and compatible with the dataset licenses.

  • Submission format: one CSV containing predictions for Test 1 then Test 2, concatenated (see the sketch after this list)
  • Data usage: you may download and process the data for model development; follow all license terms
  • Heads-up: some training images come from web sources (incl. LAION) and links may disappear—download early
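A minimal sketch of assembling the single submission CSV from already-computed predictions is shown below; the column names, separators, and example ids are placeholder assumptions, so always defer to the exact schema described on the Kaggle portal.

```python
import pandas as pd

# Hypothetical prediction containers; the column names below are placeholders.
# The authoritative format lives on the Kaggle submission portal.
test1_preds = {"img_0001": ["tomato", "basil"], "img_0002": ["rice"]}   # multi-label
test2_preds = {"img_1001": "grilled salmon with asparagus"}             # single-label

rows = []
for image_id, labels in test1_preds.items():       # Test 1 predictions first ...
    rows.append({"image_id": image_id, "prediction": ";".join(labels)})
for image_id, caption in test2_preds.items():      # ... then Test 2, concatenated below
    rows.append({"image_id": image_id, "prediction": caption})

pd.DataFrame(rows).to_csv("submission.csv", index=False)
```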

Evaluation

Primary metric

Final score = harmonic mean of Test 1 (F1-score) and Test 2 (Accuracy); a scoring sketch follows the list below.

  • Phase One is common for all participants and uses two public test sets
  • Test 1: multi-label image-to-captions linking (compute F1-score)
  • Test 2: single correct caption per image (compute Accuracy)
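To make the metric concrete, here is a small scoring sketch assuming scikit-learn; the averaging mode used for the multi-label F1 ("samples" below) is an assumption, since the organizers' exact implementation is not reproduced here.

```python
from sklearn.metrics import accuracy_score, f1_score

def final_score(test1_true, test1_pred, test2_true, test2_pred) -> float:
    """Harmonic mean of Test 1 F1 and Test 2 accuracy.

    test1_* are multi-label indicator arrays of shape (n_samples, n_captions);
    the 'samples' averaging mode is an assumption, not the official choice.
    """
    f1 = f1_score(test1_true, test1_pred, average="samples", zero_division=0)
    acc = accuracy_score(test2_true, test2_pred)
    if f1 + acc == 0:
        return 0.0
    return 2 * f1 * acc / (f1 + acc)
```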

Robustness

Phase Two stress-tests robustness via verification and additional evaluation; see the reproducibility sketch after this list.

  • Organizers reproduce Phase One results from the submitted model + scripts
  • Non-reproducible results lead to disqualification
  • Finalists are evaluated on a private test set to determine winner and runner-up
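Because Phase Two hinges on reproducing your Phase One numbers exactly, it helps to pin down nondeterminism in the scripts you submit. A minimal sketch, assuming PyTorch, of a seeding and determinism setup:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 2026) -> None:
    """Fix common sources of randomness so inference scripts rerun identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by some deterministic cuBLAS kernels.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False

seed_everything()
```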

Limitations

Limits ensure fairness and consistent evaluation conditions; a local budget check is sketched after the list below.

  • Model must fit in an 80GB GPU (provide scripts for any quantization/preprocessing)
  • Total evaluation time for both public test sets must be ≤ 1 hour on a single NVIDIA H100
  • Test images must not be used for training (remove from any external training sources)
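A small sketch, assuming PyTorch, of checking the memory and time budgets locally before submitting; `run_full_evaluation` is a placeholder for your own pipeline that scores both public test sets.

```python
import time
import torch

def check_budget(run_full_evaluation, memory_limit_gb: float = 80.0,
                 time_limit_s: float = 3600.0) -> None:
    """run_full_evaluation: placeholder callable that evaluates both public test sets."""
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()

    run_full_evaluation()
    torch.cuda.synchronize()

    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak GPU memory: {peak_gb:.1f} GB (limit {memory_limit_gb} GB)")
    print(f"evaluation time: {elapsed:.0f} s (limit {time_limit_s:.0f} s)")
```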

Timeline

  • Feb 10, 2026: Dataset release. Starter kit + baselines published in the Kaggle portal.
  • Feb 16, 2026: Validation leaderboard opens. Unlimited submissions on val (rate-limited).
  • May 1, 2026: Phase One deadline + Phase Two start. Upload final predictions + a short method summary.
  • May 15, 2026: Phase Two results released. Final ranking published.
  • CVPR day: Workshop + awards. Top teams are invited for short talks and prizes are awarded.

Leaderboard

Updated weekly
Rank: #1 • Team: Baseline (MTF Organizers) • Score: 0.37 • Submitted: Feb 10, 2026
Want your name here? Submit a prediction file on our portal!

FAQ

What data can I use?

We provide a training set of 400,000 food image–caption pairs (captions include synthetic and noisy real text; some images may also be synthetic). Each sample includes its caption and a link to download the corresponding image. Participants may use external data sources and pre-trained models as long as they comply with this challenge’s license terms and any third-party licenses. Pay-to-use / proprietary models are forbidden. Also, test set images must not be used for training—ensure they are removed from any external training sources you use.
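Because the image links can go stale (see the Rules section), it is worth mirroring the images as soon as possible. A minimal download sketch, assuming the released metadata is a CSV with `image_id` and `url` columns, which are placeholder names:

```python
from pathlib import Path

import pandas as pd
import requests

# Placeholder schema: adjust 'image_id'/'url' to the actual column names in the release.
meta = pd.read_csv("pretraining_metadata.csv")
out_dir = Path("images")
out_dir.mkdir(exist_ok=True)

for row in meta.itertuples():
    target = out_dir / f"{row.image_id}.jpg"
    if target.exists():
        continue
    try:
        resp = requests.get(row.url, timeout=10)
        resp.raise_for_status()
        target.write_bytes(resp.content)
    except requests.RequestException as err:
        # Record failures so broken links can be reported to the organizers early.
        print(f"failed {row.image_id}: {err}")
```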

How is the final ranking computed?

The challenge runs in two phases. In Phase One, models are evaluated on two public test sets: Test 1 (multi-label image-to-captions linking; scored with F1-score) and Test 2 (single correct caption per image; scored with Accuracy). The final evaluation score is the harmonic mean of the Test 1 (F1) and Test 2 (Accuracy) results. In Phase Two, organizers reproduce Phase One results from the submitted model and scripts (non-reproducible results lead to disqualification) and evaluate finalists on a private test set to decide the winner and runner-up.

Team limitations?

Yes. You may submit up to 10 submissions per day. Each team can have a maximum of 5 participants. The top 4 teams after Phase One advance to Phase Two, where you must submit your model, all relevant scripts, and a .txt file listing the sources of information used to train the model.

Where do I ask questions?

Please use the challenge’s official discussion channels (see below). Include any relevant logs and submission identifiers when applicable. Also, report dataset download issues as early as possible so they can be addressed (note: because some of the pretraining images originate from RE-LAION, a small portion may be removed during the challenge).