Abstract
Online heterogeneous networks contain vast numbers of nodes, so reaching and extracting the target nodes of specific interest is a pressing problem. In this paper, we propose a novel heterogeneous network crawler, MCrawl. It addresses the problem by iteratively crawling an online heterogeneous network through its available APIs, starting from a set of known target nodes, i.e., seed nodes. Two challenges arise in addressing the problem. First, how do we navigate a vast network starting from only a small set of target nodes? In other words, which nodes in the 'current frontier' should we expand, and in which direction, to reach promising target nodes quickly? We propose motif-based crawling to exploit the complex structures and rich semantics of heterogeneous networks. Second, in many scenarios we do not have a classifier to assess the quality of the harvested nodes, and thus of the motifs to expand. We develop a probabilistic inference framework to estimate the yield and harvest rates of motifs, achieving principled bootstrapping for crawling. In experiments on real networks, MCrawl outperforms baselines by significant margins.
Original language | English (US) |
---|---|
Pages (from-to) | 4285-4297 |
Number of pages | 13 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 34 |
Issue number | 9 |
State | Published - Sep 1 2022 |
Keywords
- Blogs
- Crawlers
- Heterogeneous network crawling
- Heterogeneous networks
- Navigation
- Semantics
- Social networking (online)
- Task analysis
- Network motifs
- Label propagation
- Probabilistic inference
- Harvest and yield rate
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics