TY - GEN
T1 - VL-Con
T2 - 41st International Symposium on Automation and Robotics in Construction, ISARC 2024
AU - Hsu, Shun Hsiang
AU - Fu, Junryu
AU - Golparvar-Fard, Mani
N1 - Publisher Copyright:
© 2024 ISARC. All Rights Reserved.
PY - 2024
Y1 - 2024
AB - Recently, vision-language research has gained significant interest by successfully connecting visual concepts to natural language, advancing computer vision-based construction monitoring with a wide variety of text queries. While vision-language models demonstrate high capability, performance degradation can be expected when adapting them to ever-changing construction scenarios. Compared with the source image-text pairs, it is more challenging to cover the multitude of objects potentially involved in construction activities and their naming conventions. To bridge this domain gap, this study collects construction-specific image-text pairs of building elements and related site work based on ASTM Uniformat II. Image-text pairs for 641 Uniformat activities are retrieved from the LAION-5B dataset based on image and text embeddings computed with CLIP under two different prompts. The collected images are then labeled at the image level to establish the requirements of vision-language datasets for further development. Based on the results, VL-Con, a vision-language dataset of image-text pairs for construction monitoring applications, is proposed with the aid of a construction semantic predictor and prompt engineering. The VL-Con dataset can be accessed at https://github.com/huhuman/VL-Con.
KW - Construction Monitoring
KW - Foundation Model
KW - Vision-Language Dataset
UR - http://www.scopus.com/inward/record.url?scp=85199656319&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85199656319&partnerID=8YFLogxK
U2 - 10.22260/ISARC2024/0146
DO - 10.22260/ISARC2024/0146
M3 - Conference contribution
AN - SCOPUS:85199656319
T3 - Proceedings of the International Symposium on Automation and Robotics in Construction
SP - 1128
EP - 1135
BT - Proceedings of the 41st International Symposium on Automation and Robotics in Construction, ISARC 2024
PB - International Association for Automation and Robotics in Construction (IAARC)
Y2 - 3 June 2024 through 5 June 2024
ER -