TY - GEN
T1 - Joint semantic segmentation and 3D reconstruction from monocular video
AU - Kundu, Abhijit
AU - Li, Yin
AU - Dellaert, Frank
AU - Li, Fuxin
AU - Rehg, James M.
PY - 2014
Y1 - 2014
N2 - We present an approach for joint inference of 3D scene structure and semantic labeling for monocular video. Starting with a monocular image stream, our framework produces a 3D volumetric semantic + occupancy map, which is much more useful than a series of 2D semantic label images or a sparse point cloud produced by traditional semantic segmentation and Structure from Motion (SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model defined in 3D space that jointly infers the semantic category and occupancy for each voxel. Such joint inference in the 3D CRF paves the way for more informed priors and constraints, which are otherwise not possible if the two problems are solved separately in their traditional frameworks. We make use of class-specific semantic cues that constrain the 3D structure in areas where multiview constraints are weak. Our model comprises higher-order factors, which help when the depth is unobservable. We also make use of class-specific semantic cues to reduce either the degree of such higher-order factors, or to approximately model them with unaries if possible. We demonstrate improved 3D structure and temporally consistent semantic segmentation for difficult, large-scale, forward-moving monocular image sequences.
AB - We present an approach for joint inference of 3D scene structure and semantic labeling for monocular video. Starting with a monocular image stream, our framework produces a 3D volumetric semantic + occupancy map, which is much more useful than a series of 2D semantic label images or a sparse point cloud produced by traditional semantic segmentation and Structure from Motion (SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model defined in 3D space that jointly infers the semantic category and occupancy for each voxel. Such joint inference in the 3D CRF paves the way for more informed priors and constraints, which are otherwise not possible if the two problems are solved separately in their traditional frameworks. We make use of class-specific semantic cues that constrain the 3D structure in areas where multiview constraints are weak. Our model comprises higher-order factors, which help when the depth is unobservable. We also make use of class-specific semantic cues to reduce either the degree of such higher-order factors, or to approximately model them with unaries if possible. We demonstrate improved 3D structure and temporally consistent semantic segmentation for difficult, large-scale, forward-moving monocular image sequences.
UR - http://www.scopus.com/inward/record.url?scp=84906342308&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906342308&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-10599-4_45
DO - 10.1007/978-3-319-10599-4_45
M3 - Conference contribution
AN - SCOPUS:84906342308
SN - 9783319105987
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 703
EP - 718
BT - Computer Vision, ECCV 2014 - 13th European Conference, Proceedings
PB - Springer
T2 - 13th European Conference on Computer Vision, ECCV 2014
Y2 - 6 September 2014 through 12 September 2014
ER -