TY - GEN
T1 - Mining top-n local outliers in large databases
AU - Jin, Wen
AU - Tung, Anthony K.H.
AU - Han, Jiawei
PY - 2001
Y1 - 2001
N2 - Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data objects requires a large number of k-nearest neighbors searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by the overlapping in the micro-clusters, we proposed a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
AB - Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF) which represents the likelihood of that object being an outlier. Although the concept of local outliers is a useful one, the computation of LOF values for every data objects requires a large number of k-nearest neighbors searches and can be computationally expensive. Since most objects are usually not outliers, it is useful to provide users with the option of finding only n most outstanding local outliers, i.e., the top-n data objects which are most likely to be local outliers according to their LOFs. However, if the pruning is not done carefully, finding top-n outliers could result in the same amount of computation as finding LOF for all objects. In this paper, we propose a novel method to efficiently find the top-n local outliers in large databases. The concept of "micro-cluster" is introduced to compress the data. An efficient micro-cluster-based local outlier mining algorithm is designed based on this concept. As our algorithm can be adversely affected by the overlapping in the micro-clusters, we proposed a meaningful cut-plane solution for overlapping data. The formal analysis and experiments show that this method can achieve good performance in finding the most outstanding local outliers.
UR - http://www.scopus.com/inward/record.url?scp=0035788909&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0035788909&partnerID=8YFLogxK
U2 - 10.1145/502512.502554
DO - 10.1145/502512.502554
M3 - Conference contribution
AN - SCOPUS:0035788909
SN - 158113391X
SN - 9781581133912
T3 - Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 293
EP - 298
BT - Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A2 - Provost, F.
A2 - Srikant, R.
A2 - Schkolnick, M.
A2 - Lee, D.
PB - Association for Computing Machinery (ACM)
T2 - Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001)
Y2 - 26 August 2001 through 29 August 2001
ER -