TY - GEN
T1 - Tracking persons-of-interest via adaptive discriminative features
AU - Zhang, Shun
AU - Gong, Yihong
AU - Huang, Jia-Bin
AU - Lim, Jongwoo
AU - Wang, Jinjun
AU - Ahuja, Narendra
AU - Yang, Ming-Hsuan
N1 - Publisher Copyright: © Springer International Publishing AG 2016.
PY - 2016
Y1 - 2016
AB - Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Low-level features used in existing multitarget tracking methods are not effective for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face features using convolutional neural networks (CNNs). Unlike existing CNN-based approaches that are only trained on large-scale face image datasets offline, we further adapt the pre-trained face CNN to specific videos using automatically discovered training samples from tracklets. Our network directly optimizes the embedding space so that the Euclidean distances correspond to a measure of semantic face similarity. This is technically realized by minimizing an improved triplet loss function. With the learned discriminative features, we apply the Hungarian algorithm to link tracklets within each shot and the hierarchical clustering algorithm to link tracklets across multiple shots to form final trajectories. We extensively evaluate the proposed algorithm on a set of TV sitcoms and music videos and demonstrate significant performance improvement over existing techniques.
UR - http://www.scopus.com/inward/record.url?scp=84990060978&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990060978&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-46454-1_26
DO - 10.1007/978-3-319-46454-1_26
M3 - Conference contribution
AN - SCOPUS:84990060978
SN - 9783319464534
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 415
EP - 433
BT - Computer Vision - 14th European Conference, ECCV 2016, Proceedings
A2 - Leibe, Bastian
A2 - Matas, Jiri
A2 - Sebe, Nicu
A2 - Welling, Max
PB - Springer
T2 - 14th European Conference on Computer Vision, ECCV 2016
Y2 - 11 October 2016 through 14 October 2016
ER -