TY - JOUR
T1 - Decision tree Thompson sampling for mining hidden populations through attributed search
AU - Kumar, Suhansanu
AU - Gao, Heting
AU - Wang, Changyu
AU - Chang, Kevin Chen Chuan
AU - Sundaram, Hari
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature.
PY - 2022/12
Y1 - 2022/12
N2 - Researchers often query online social platforms through their application programming interfaces (API) to find target populations such as people with mental illness (De Choudhury et al. in Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369, https://doi.org/10.1145/2998181.2998220, 2017) and jazz musicians (Heckathorn and Jeffri in Poetics 28(4):307, 2001). Entities of such target population satisfy a property that is typically identified using an oracle (human or a pre-trained classifier). When the property of the target entities is not directly queryable via the API, we refer to the property as ‘hidden’ and the population as hidden population. Our objective in this paper is to sample such target populations that satisfy a certain property that can be verified by an oracle (typically a pre-trained classifier). However, the property itself cannot be a part of the query. We refer to such populations which are not directly queryable via the API as a hidden population. Finding individuals who belong to these populations on social networks is hard because they are non-queryable, and the sampler has to explore from a combinatorial query space within a finite budget limit. Further, limited API calls, a limited number of queryable attributes to explore from and an exponential query space to select a query from make the problem very challenging. To address the challenges, we model the search for hidden population search as a decision problem. By exploiting the correlation between queryable attributes and the population of interest and by hierarchically ordering the query space, we propose a decision-tree-based Thompson sampler (DT-TMP) that efficiently discovers the right combination of attributes to query. Additionally, DT-TMP alleviates the problem of exploring the exponentially large query space through a hierarchical ordering of the queries. Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example, by 54% in Twitter. When the number of matching entities to a query is known in offline experiments, DT-TMP performs exceedingly well by a factor of 0.9–1.5 × over the baseline samplers. In future, we wish to explore the option of finding hidden populations by formulating more complex queries.
AB - Researchers often query online social platforms through their application programming interfaces (API) to find target populations such as people with mental illness (De Choudhury et al. in Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, CSCW ’17. ACM, New York, pp 353–369, https://doi.org/10.1145/2998181.2998220, 2017) and jazz musicians (Heckathorn and Jeffri in Poetics 28(4):307, 2001). Entities of such target population satisfy a property that is typically identified using an oracle (human or a pre-trained classifier). When the property of the target entities is not directly queryable via the API, we refer to the property as ‘hidden’ and the population as hidden population. Our objective in this paper is to sample such target populations that satisfy a certain property that can be verified by an oracle (typically a pre-trained classifier). However, the property itself cannot be a part of the query. We refer to such populations which are not directly queryable via the API as a hidden population. Finding individuals who belong to these populations on social networks is hard because they are non-queryable, and the sampler has to explore from a combinatorial query space within a finite budget limit. Further, limited API calls, a limited number of queryable attributes to explore from and an exponential query space to select a query from make the problem very challenging. To address the challenges, we model the search for hidden population search as a decision problem. By exploiting the correlation between queryable attributes and the population of interest and by hierarchically ordering the query space, we propose a decision-tree-based Thompson sampler (DT-TMP) that efficiently discovers the right combination of attributes to query. Additionally, DT-TMP alleviates the problem of exploring the exponentially large query space through a hierarchical ordering of the queries. Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example, by 54% in Twitter. When the number of matching entities to a query is known in offline experiments, DT-TMP performs exceedingly well by a factor of 0.9–1.5 × over the baseline samplers. In future, we wish to explore the option of finding hidden populations by formulating more complex queries.
KW - Decision tree
KW - Hidden population mining
KW - Reinforcement learning
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=85119077968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119077968&partnerID=8YFLogxK
U2 - 10.1007/s13278-021-00812-5
DO - 10.1007/s13278-021-00812-5
M3 - Article
AN - SCOPUS:85119077968
SN - 1869-5450
VL - 12
JO - Social Network Analysis and Mining
JF - Social Network Analysis and Mining
IS - 1
M1 - 6
ER -