TY - GEN
T1 - A Principled Decomposition of Pointwise Mutual Information for Intention Template Discovery
AU - Ma, Denghao
AU - Chang, Kevin Chen Chuan
AU - Chen, Yueguo
AU - Lv, Xueqiang
AU - Shen, Liang
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/10/21
Y1 - 2023/10/21
N2 - With the rise of Artificial Intelligence (AI), question answering systems such as ChatGPT and Siri have become a common way for users to interact with computers. These systems require a substantial amount of labeled data to train their models. However, labeled data is scarce and challenging to construct. The construction process typically involves two stages: discovering potential sample candidates and manually labeling these candidates. To discover high-quality candidate samples, we study the intention paraphrase template discovery task: given some seed questions or templates of an intention, discover new paraphrase templates that describe the intention and differ sufficiently from the seeds in text. As the first exploration of the task, we identify new quality requirements, i.e., relevance, divergence and popularity, and new challenges, i.e., the paradox of divergent yet relevant paraphrases and the conflict of popular yet relevant paraphrases. To untangle the paradox of divergent yet relevant paraphrases, where the traditional bag of words falls short, we develop usage-centric modeling, which represents a question/template/answer as a bag of usages that users engaged in (e.g., up-votes), and uses a usage-flow graph to interrelate templates, questions and answers. To balance the conflict of popular yet relevant paraphrases, we propose a new and principled decomposition of the well-known Pointwise Mutual Information from the usage perspective (usage-PMI), and then develop a Bayesian inference framework over the usage-flow graph to estimate the usage-PMI. Extensive experiments over three large CQA corpora show a strong performance advantage over baselines adopted from the paraphrase identification task. We release 885,000 high-quality paraphrase templates discovered by our proposed PMI decomposition model; the data is available at https://github.com/ParaQuestions/Intention_template_discovery.
AB - With the rise of Artificial Intelligence (AI), question answering systems such as ChatGPT and Siri have become a common way for users to interact with computers. These systems require a substantial amount of labeled data to train their models. However, labeled data is scarce and challenging to construct. The construction process typically involves two stages: discovering potential sample candidates and manually labeling these candidates. To discover high-quality candidate samples, we study the intention paraphrase template discovery task: given some seed questions or templates of an intention, discover new paraphrase templates that describe the intention and differ sufficiently from the seeds in text. As the first exploration of the task, we identify new quality requirements, i.e., relevance, divergence and popularity, and new challenges, i.e., the paradox of divergent yet relevant paraphrases and the conflict of popular yet relevant paraphrases. To untangle the paradox of divergent yet relevant paraphrases, where the traditional bag of words falls short, we develop usage-centric modeling, which represents a question/template/answer as a bag of usages that users engaged in (e.g., up-votes), and uses a usage-flow graph to interrelate templates, questions and answers. To balance the conflict of popular yet relevant paraphrases, we propose a new and principled decomposition of the well-known Pointwise Mutual Information from the usage perspective (usage-PMI), and then develop a Bayesian inference framework over the usage-flow graph to estimate the usage-PMI. Extensive experiments over three large CQA corpora show a strong performance advantage over baselines adopted from the paraphrase identification task. We release 885,000 high-quality paraphrase templates discovered by our proposed PMI decomposition model; the data is available at https://github.com/ParaQuestions/Intention_template_discovery.
KW - Bayesian inference
KW - Paraphrasing
KW - Pointwise mutual information
UR - http://www.scopus.com/inward/record.url?scp=85178102482&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85178102482&partnerID=8YFLogxK
U2 - 10.1145/3583780.3614767
DO - 10.1145/3583780.3614767
M3 - Conference contribution
AN - SCOPUS:85178102482
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1746
EP - 1755
BT - CIKM 2023 - Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023
Y2 - 21 October 2023 through 25 October 2023
ER -