In medical imaging, task-based measures of image quality have been advocated as a means of evaluating the performance of imaging systems and/or reconstruction algorithms. One way to obtain these measures is with a numerical observer. Although the Bayesian ideal observer (IO) is optimal by definition, it is generally nonlinear and frequently intractable to compute. Linear approximations to the IO are therefore sometimes used to obtain task-based statistics. The linear observer that maximizes the signal-to-noise ratio (SNR) of the test statistic is the Hotelling observer (HO). However, the computational cost of obtaining the HO increases with image size and becomes intractable for large-scale images. The problem is compounded for multi-modal data, because each additional modality dramatically increases the size of the composite image. An alternative to computing the HO directly is to approximate its test statistic with a feed-forward neural network (FFNN). However, such methods of learning the HO have not been evaluated on multi-modal data. In this work, a tractable learned multi-modal observer is implemented. The considered task is a signal-known-statistically/background-known-statistically (SKS/BKS) binary signal detection task. A stylized operator representing an ultrasound computed tomography imaging system is considered, together with numerical breast phantoms that have speed-of-sound and attenuation modalities. The considered signal is a microcalcification cluster with a random amplitude. It is demonstrated that the learned HO can closely approximate the HO for the considered task. 2022 SPIE.
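As a point of reference for the conventional HO that the learned observer approximates, the following is a minimal sketch of the textbook sample-based computation: the template is the pooled covariance inverse applied to the difference of class means, and the SNR is computed from the resulting test statistics. It is not the implementation used in this work. The toy Gaussian data, the image size, and the names `hotelling_template` and `observer_snr` are assumptions made purely for illustration.

```python
import numpy as np


def hotelling_template(g0, g1):
    """Sample-based Hotelling template.

    g0, g1: arrays of shape (n_samples, n_pixels) holding vectorized
    signal-absent and signal-present images (hypothetical inputs).
    """
    delta = g1.mean(axis=0) - g0.mean(axis=0)  # difference of class means
    # Pooled covariance: average of the two class covariance matrices.
    k = 0.5 * (np.cov(g0, rowvar=False) + np.cov(g1, rowvar=False))
    # Solve K w = delta rather than forming the inverse explicitly.
    return np.linalg.solve(k, delta)


def observer_snr(t0, t1):
    """SNR of a scalar test statistic under the two hypotheses."""
    pooled_var = 0.5 * (t0.var(ddof=1) + t1.var(ddof=1))
    return (t1.mean() - t0.mean()) / np.sqrt(pooled_var)


# Toy data (assumption): correlated Gaussian backgrounds plus a small signal.
rng = np.random.default_rng(0)
n_pixels, n_samples = 16, 4000
a = rng.normal(size=(n_pixels, n_pixels))
cov = a @ a.T / n_pixels + np.eye(n_pixels)
signal = np.zeros(n_pixels)
signal[5:8] = 0.5

g0 = rng.multivariate_normal(np.zeros(n_pixels), cov, size=n_samples)
g1 = rng.multivariate_normal(signal, cov, size=n_samples)

w = hotelling_template(g0, g1)
snr_ho = observer_snr(g0 @ w, g1 @ w)            # Hotelling observer
snr_mf = observer_snr(g0 @ signal, g1 @ signal)  # plain matched filter
print(f"HO SNR: {snr_ho:.3f}, matched-filter SNR: {snr_mf:.3f}")
```

In this sketch the covariance matrix has `n_pixels` by `n_pixels` entries, so concatenating modalities into one composite image grows the matrix quadratically in the total pixel count. This is the cost that makes the direct computation impractical for large multi-modal images.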