TY - JOUR
T1 - Attribute-Guided Network Sampling Mechanisms
AU - Kumar, Suhansanu
AU - Sundaram, Hari
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6
Y1 - 2021/6
N2 - This article introduces a novel task-independent sampler for attributed networks. The problem is important because while data mining tasks on network content are common, sampling on internet-scale networks is costly. Link-trace samplers such as Snowball sampling, Forest Fire, Random Walk, and Metropolis-Hastings Random Walk are widely used for sampling from networks. The design of these attribute-agnostic samplers focuses on preserving salient properties of network structure, and are not optimized for tasks on node content. This article has three contributions. First, we propose a task-independent, attribute aware link-trace sampler grounded in Information Theory. Our sampler greedily adds to the sample the node with the most informative (i.e., surprising) neighborhood. The sampler tends to rapidly explore the attribute space, maximally reducing the surprise of unseen nodes. Second, we prove that content sampling is an NP-hard problem. A well-known algorithm best approximates the optimization solution within 1-1/e, but requires full access to the entire graph. Third, we show through empirical counterfactual analysis that in many real-world datasets, network structure does not hinder the performance of surprise based link-trace samplers. Experimental results over 18 real-world datasets reveal: surprise-based samplers are sample efficient and outperform the state-of-the-art attribute-agnostic samplers by a wide margin (e.g., 45% performance improvement in clustering tasks).
AB - This article introduces a novel task-independent sampler for attributed networks. The problem is important because while data mining tasks on network content are common, sampling on internet-scale networks is costly. Link-trace samplers such as Snowball sampling, Forest Fire, Random Walk, and Metropolis-Hastings Random Walk are widely used for sampling from networks. The design of these attribute-agnostic samplers focuses on preserving salient properties of network structure, and are not optimized for tasks on node content. This article has three contributions. First, we propose a task-independent, attribute aware link-trace sampler grounded in Information Theory. Our sampler greedily adds to the sample the node with the most informative (i.e., surprising) neighborhood. The sampler tends to rapidly explore the attribute space, maximally reducing the surprise of unseen nodes. Second, we prove that content sampling is an NP-hard problem. A well-known algorithm best approximates the optimization solution within 1-1/e, but requires full access to the entire graph. Third, we show through empirical counterfactual analysis that in many real-world datasets, network structure does not hinder the performance of surprise based link-trace samplers. Experimental results over 18 real-world datasets reveal: surprise-based samplers are sample efficient and outperform the state-of-the-art attribute-agnostic samplers by a wide margin (e.g., 45% performance improvement in clustering tasks).
KW - Task-independent sampling
KW - attributed networks
KW - data mining
UR - http://www.scopus.com/inward/record.url?scp=85108437657&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108437657&partnerID=8YFLogxK
U2 - 10.1145/3441445
DO - 10.1145/3441445
M3 - Article
AN - SCOPUS:85108437657
SN - 1556-4681
VL - 15
JO - ACM Transactions on Knowledge Discovery from Data
JF - ACM Transactions on Knowledge Discovery from Data
IS - 4
M1 - 3441445
ER -