TY - GEN
T1 - Automatic video annotation using multimodal Dirichlet process mixture model
AU - Velivelli, Atulya
AU - Huang, Thomas S.
PY - 2008
Y1 - 2008
AB - In this paper, we infer a multimodal Dirichlet process mixture model from video data; the mixture components in this model follow a Gaussian-multinomial distribution. The multimodal Dirichlet process mixture model clusters freely available multimodal data in videos, i.e., the combination of the visual track and the corresponding keywords extracted from speech transcripts obtained from the audio track. Using the parameters of this model, we build a predictive model that outputs keyword annotations for a given video shot. In the multimodal Dirichlet process mixture model, the keywords follow a multinomial distribution, while the features used to represent a video shot follow a Gaussian distribution. We infer the model by collecting samples from the corresponding Markov chain using a blocked Gibbs sampling algorithm, and we use the inferred parameters to predict video shot annotations that enable text-based retrieval of shots. We compare the performance of our proposed model with that of baseline models that use predicted annotations for retrieval.
UR - http://www.scopus.com/inward/record.url?scp=49249133750&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49249133750&partnerID=8YFLogxK
U2 - 10.1109/ICNSC.2008.4525431
DO - 10.1109/ICNSC.2008.4525431
M3 - Conference contribution
AN - SCOPUS:49249133750
SN - 9781424416851
T3 - Proceedings of 2008 IEEE International Conference on Networking, Sensing and Control, ICNSC
SP - 1366
EP - 1371
BT - Proceedings of 2008 IEEE International Conference on Networking, Sensing and Control, ICNSC
T2 - 2008 IEEE International Conference on Networking, Sensing and Control, ICNSC
Y2 - 6 April 2008 through 8 April 2008
ER -