Fine-Grained Alignment for Cross-Modal Recipe Retrieval

Muntasir Wahed, Xiaona Zhou, Tianjiao Yu, Ismini Lourentzou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Vision-language pre-trained models have exhibited significant advancements in various multimodal and unimodal tasks in recent years, including cross-modal recipe retrieval. However, a persistent challenge in multimodal frameworks is the lack of alignment between the encoders of different modalities. Although previous works addressed image and recipe embedding alignment, the alignment of individual recipe components has been overlooked. To address this gap, we present Fine-grained Alignment for Recipe Embeddings (FARM), a cross-modal retrieval approach that aligns the encodings of recipe components, including titles, ingredients, and instructions, within a shared representation space alongside corresponding image embeddings. Moreover, we introduce a hyperbolic loss function to effectively capture the similarity information inherent in recipe classes. FARM improves Recall@1 by 1.4% for image-to-recipe and 1.0% for recipe-to-image retrieval. Additionally, FARM achieves up to 6.1% and 15.1% performance improvement in image-to-recipe retrieval when only one and two components of the recipe are available, respectively. Comprehensive qualitative analysis of retrieved images for various recipes showcases the semantic capabilities of our trained models. Code is available at
Original language: English (US)
Title of host publication: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Number of pages: 10
State: Published - Jan 1 2024
Externally published: Yes

