TY - JOUR
T1 - Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Lu, Jiasen
AU - Clark, Christopher
AU - Lee, Sangho
AU - Zhang, Zichen
AU - Khosla, Savya
AU - Marten, Ryan
AU - Hoiem, Derek
AU - Kembhavi, Aniruddha
N1 - We thank Klemen Kotar for helping gather Embodied AI pre-training data, Jonathan Frankle from MosaicML for suggesting the mixture of NLP pre-training data, Jack Hessel for the interleaved image & text dataset, and Michael Schmitz for helping support the compute infrastructure. We also thank Tanmay Gupta for helpful discussions, as well as Hamish Ivison and Ananya Harsh Jha for their insightful discussions about model design. We additionally thank Oscar Michel, Yushi Hu, and Yanbei Chen for their help editing the paper, and Matt Deitke for help setting up the webpage. Savya Khosla and Derek Hoiem were supported in part by ONR award N00014-23-1-2383. This research was made possible with cloud TPUs from Google's TPU Research Cloud (TRC).
PY - 2024
Y1 - 2024
N2 - We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs (images, text, audio, action, bounding boxes, etc.) into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
AB - We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs (images, text, audio, action, bounding boxes, etc.) into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
UR - http://www.scopus.com/inward/record.url?scp=85199592450&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85199592450&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.02497
DO - 10.1109/CVPR52733.2024.02497
M3 - Conference article
AN - SCOPUS:85199592450
SN - 1063-6919
SP - 26429
EP - 26445
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 16 June 2024 through 22 June 2024
ER -