CLASP: Cross-modal Alignment Using Pre-trained Unimodal Models

Jianing Zhou, Ziheng Zeng, Hongyu Gong, Suma Bhat

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recent advancements in joint speech-text pretraining have significantly advanced the processing of natural language. However, a key limitation is their reliance on parallel speech-text data, posing challenges due to data accessibility. Addressing this, our paper introduces an innovative framework for jointly performing speech and text processing without parallel corpora during pre-training but only downstream. Utilizing pre-trained unimodal models, we extract distinct representations for speech and text, aligning them effectively in a newly defined space using a multi-level contrastive learning mechanism. A unique swap reconstruction mechanism enhances the alignment and is followed by fusion via a multi-head mechanism, seamlessly merging modality-invariant and modality-specific representations. Testing for emotion recognition (Spoken Language Understanding task) and idiom usage detection (Natural Language Understanding task) demonstrates robust performance, with commendable robustness to noise in text or speech data.

Original languageEnglish (US)
Title of host publication62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
EditorsLun-Wei Ku, Andre Martins, Vivek Srikumar
PublisherAssociation for Computational Linguistics (ACL)
Pages11518-11531
Number of pages14
ISBN (Electronic)9798891760998
DOIs
StatePublished - 2024
EventFindings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Hybrid, Bangkok, Thailand
Duration: Aug 11 2024Aug 16 2024

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)0736-587X

Conference

ConferenceFindings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/TerritoryThailand
CityHybrid, Bangkok
Period8/11/248/16/24

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Fingerprint

Dive into the research topics of 'CLASP: Cross-modal Alignment Using Pre-trained Unimodal Models'. Together they form a unique fingerprint.

Cite this