Build a Vision-Language Model that matches food images to detailed captions as accurately as possible! Got a VLM already? Use our 400K image–text dataset (synthetic + noisy real) to pretrain and test it. Submit predictions, climb the leaderboard, and earn a place to present your method at the 3rd MetaFood CVPR workshop.
- Build a Vision-Language Model that matches food images to the right text, with two evaluation tasks: multi-label and single-label.
- Compete across two phases with public test sets and a final deadline.
- Keep it fair, reproducible, and compatible with the dataset licenses.
- Final score = harmonic mean of Test 1 (F1-score) and Test 2 (Accuracy).
- Phase Two stress-tests robustness via verification and additional evaluation.
- Limits ensure fairness and consistent evaluation conditions.
| Rank | Team | Score | Submitted |
|---|---|---|---|
| #1 | Baseline • MTF Organizers | 0.37 | Feb 10, 2026 |
We provide a training set of 400,000 food image–caption pairs (captions include synthetic and noisy real text; some images may also be synthetic). Each sample includes its caption and a link to download the corresponding image. Participants may use external data sources and pre-trained models as long as they comply with this challenge’s license terms and any third-party licenses. Pay-to-use and proprietary models are forbidden. Test set images must not be used for training; make sure they are removed from any external sources you train on.
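For reference, here is a minimal download sketch. It assumes the metadata ships as a CSV with hypothetical columns `image_url` and `caption` and a hypothetical file name `train_metadata.csv`; adjust the names to the actual release format.

```python
# Minimal sketch, assuming the metadata is a CSV with hypothetical
# columns "image_url" and "caption"; adjust to the actual release.
import csv
from pathlib import Path

import requests

METADATA_CSV = "train_metadata.csv"  # hypothetical file name
OUT_DIR = Path("images")
OUT_DIR.mkdir(exist_ok=True)

with open(METADATA_CSV, newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        dest = OUT_DIR / f"{i:06d}.jpg"
        if dest.exists():
            continue  # resume-friendly: skip images already on disk
        try:
            resp = requests.get(row["image_url"], timeout=10)
            resp.raise_for_status()
            dest.write_bytes(resp.content)
        except requests.RequestException:
            # Some links (e.g., RE-LAION-sourced images) may go dead;
            # log and continue instead of aborting the whole download.
            print(f"skipped row {i}: {row['image_url']}")
```

Saving by row index keeps the script resume-friendly, and failed downloads are logged and skipped rather than aborting the run.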
The challenge runs in two phases. In Phase One, models are evaluated on two public test sets: Test 1 (multi-label image-to-captions linking; scored with F1-score) and Test 2 (single correct caption per image; scored with Accuracy). The final evaluation score is the harmonic mean of the Test 1 (F1) and Test 2 (Accuracy) results. In Phase Two, organizers reproduce Phase One results from the submitted model and scripts (non-reproducible results lead to disqualification) and evaluate finalists on a private test set to decide the winner and runner-up.
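For concreteness, a minimal sketch of the Phase One aggregation, assuming both metrics are reported on a 0–1 scale (the function name is ours, not an official API):

```python
def final_score(test1_f1: float, test2_acc: float) -> float:
    """Harmonic mean of Test 1 F1-score and Test 2 Accuracy."""
    if test1_f1 + test2_acc == 0:
        return 0.0
    return 2 * test1_f1 * test2_acc / (test1_f1 + test2_acc)

# Example: F1 = 0.40 and Accuracy = 0.35 give a final score of about 0.373.
print(round(final_score(0.40, 0.35), 3))
```

Because the harmonic mean is dominated by the weaker metric, a model that excels on only one test still scores low; balanced performance across both tasks is what climbs the leaderboard.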
Yes. You may make up to 10 submissions per day. Each team can have at most 5 participants. The top 4 teams after Phase One advance to Phase Two, where you must submit your model, all relevant scripts, and a .txt file listing the data sources used to train the model.
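A hypothetical shape for that sources file is shown below; the organizers have not specified a required format, so treat this only as an illustration:

```text
Challenge data: MTF 400K food image-caption training set
External data: <dataset name, URL, license>
Pre-trained model: <model name, where obtained, license>
```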
Please use the challenge’s official discussion channels (see below), and include any relevant logs and submission identifiers when applicable. Report dataset download issues as early as possible so they can be addressed; note that because some images in the pretraining data originate from RE-LAION, a small portion may be taken down during the challenge.