TY - GEN
T1 - Non-Sequential Graph Script Induction via Multimedia Grounding
AU - Zhou, Yu
AU - Li, Sha
AU - Li, Manling
AU - Lin, Xudong
AU - Chang, Shih-Fu
AU - Bansal, Mohit
AU - Ji, Heng
N1 - Many thanks to Prof. Mark Yatskar, Prof. Chris Callison-Burch, Prof. Long Chen, and Prof. Juanzi Li for helpful discussions and insightful feedback. We would also like to thank the anonymous reviewers for their constructive suggestions. This research is based upon work supported by U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2023
Y1 - 2023
N2 - Online resources such as wikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility displayed by people executing tasks in real life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose a new challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to wikiHow textual steps and thus transform each video into an observed step path on the latent ground truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute gains on F1@3 for next step prediction and 13.8% absolute gains on Acc@1 for partial sequence completion. Human evaluation shows our model outperforming the wikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relations.
UR - http://www.scopus.com/inward/record.url?scp=85163067652&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163067652&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.acl-long.303
DO - 10.18653/v1/2023.acl-long.303
M3 - Conference contribution
AN - SCOPUS:85163067652
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 5529
EP - 5545
BT - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Y2 - 9 July 2023 through 14 July 2023
ER -