Fine-Grained Alignment for Cross-Modal Recipe Retrieval

Muntasir Wahed, Xiaona Zhou, Tianjiao Yu, Ismini Lourentzou

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Vision-language pre-trained models have exhibited significant advancements in various multimodal and unimodal tasks in recent years, including cross-modal recipe retrieval. However, a persistent challenge in multimodal frameworks is the lack of alignment between the encoders of different modalities. Although previous works addressed image and recipe embedding alignment, the alignment of individual recipe components has been overlooked. To address this gap, we present Fine-grained Alignment for Recipe eMbeddings (FARM), a cross-modal retrieval approach that aligns the encodings of recipe components, including titles, ingredients, and instructions, within a shared representation space alongside corresponding image embeddings. Moreover, we introduce a hyperbolic loss function to effectively capture the similarity information inherent in recipe classes. FARM improves Recall@1 by 1.4% for image-to-recipe and 1.0% for recipe-to-image retrieval. Additionally, FARM achieves up to 6.1% and 15.1% performance improvement in image-to-recipe retrieval tasks, when just one and two components of the recipe are available, respectively. Comprehensive qualitative analysis of retrieved images for various recipes showcases the semantic capabilities of our trained models. Code is available at https://github.com/PLAN-Lab/FARM.

Original languageEnglish (US)
Title of host publicationProceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages5572-5581
Number of pages10
ISBN (Electronic)9798350318920
DOIs
StatePublished - Jan 3 2024
Externally publishedYes
Event2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024 - Waikoloa, United States
Duration: Jan 4 2024Jan 8 2024

Publication series

NameProceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024

Conference

Conference2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Country/TerritoryUnited States
CityWaikoloa
Period1/4/241/8/24

Keywords

  • Algorithms
  • Vision + language and/or other modalities

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'Fine-Grained Alignment for Cross-Modal Recipe Retrieval'. Together they form a unique fingerprint.

Cite this