TY - GEN
T1 - A utility framework for the automatic generation of audio-visual skims
AU - Sundaram, Hari
AU - Xie, Lexing
AU - Chang, Shih Fu
N1 - Publisher Copyright: © 2002 ACM.
PY - 2002/12/1
Y1 - 2002/12/1
N2 - In this paper, we present a novel algorithm for generating audio-visual skims from computable scenes. Skims are useful for browsing digital libraries, and for on-demand summaries in set-top boxes. A computable scene is a chunk of data that exhibits consistencies with respect to chromaticity, lighting and sound. There are three key aspects to our approach: (a) visual complexity and grammar, (b) robust audio segmentation and (c) a utility model for skim generation. We define a measure of visual complexity of a shot, and map complexity to the minimum time for comprehending the shot. Then, we analyze the underlying visual grammar, since it makes the shot sequence meaningful. We segment the audio data into four classes, and then detect significant phrases in the speech segments. The utility functions are defined in terms of complexity and duration of the segment. The target skim is created using a general constrained utility maximization procedure that maximizes the information content and the coherence of the resulting skim. The objective function is subject to multimedia synchronization constraints, visual syntax constraints and penalty functions on audio and video segments. The user study results indicate that the optimal skims differ statistically significantly from other skims at compression rates up to 90%.
AB - In this paper, we present a novel algorithm for generating audio-visual skims from computable scenes. Skims are useful for browsing digital libraries, and for on-demand summaries in set-top boxes. A computable scene is a chunk of data that exhibits consistencies with respect to chromaticity, lighting and sound. There are three key aspects to our approach: (a) visual complexity and grammar, (b) robust audio segmentation and (c) a utility model for skim generation. We define a measure of visual complexity of a shot, and map complexity to the minimum time for comprehending the shot. Then, we analyze the underlying visual grammar, since it makes the shot sequence meaningful. We segment the audio data into four classes, and then detect significant phrases in the speech segments. The utility functions are defined in terms of complexity and duration of the segment. The target skim is created using a general constrained utility maximization procedure that maximizes the information content and the coherence of the resulting skim. The objective function is subject to multimedia synchronization constraints, visual syntax constraints and penalty functions on audio and video segments. The user study results indicate that the optimal skims differ statistically significantly from other skims at compression rates up to 90%.
UR - http://www.scopus.com/inward/record.url?scp=85134290841&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134290841&partnerID=8YFLogxK
U2 - 10.1145/641007.641042
DO - 10.1145/641007.641042
M3 - Conference contribution
AN - SCOPUS:85134290841
T3 - Proceedings of the 10th ACM International Conference on Multimedia, MULTIMEDIA 2002
SP - 189
EP - 198
BT - Proceedings of the 10th ACM International Conference on Multimedia, MULTIMEDIA 2002
PB - Association for Computing Machinery
T2 - 10th ACM International Conference on Multimedia, MULTIMEDIA 2002
Y2 - 1 December 2002 through 6 December 2002
ER -