TY - CONF
T1 - Tracking Anything with Decoupled Video Segmentation
AU - Cheng, Ho Kei
AU - Oh, Seoung Wug
AU - Price, Brian
AU - Schwing, Alexander
AU - Lee, Joon-Young
N1 - Acknowledgments. Work supported in part by NSF grants 2008387, 2045586, 2106825, MRI 1725729 (HAL [25]), and NIFA award 2020-67021-32799.
PY - 2023
AB - Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: hkchengrex.github.io/Tracking-Anything-with-DEVA.
UR - http://www.scopus.com/inward/record.url?scp=85175138296&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175138296&partnerID=8YFLogxK
DO - 10.1109/ICCV51070.2023.00127
M3 - Conference contribution
AN - SCOPUS:85175138296
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 1316
EP - 1326
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Y2 - 2 October 2023 through 6 October 2023
ER -