TY - GEN
T1 - HyperAid
T2 - 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2022
AU - Chien, Eli
AU - Tabaghi, Puoya
AU - Milenkovic, Olgica
N1 - Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/8/14
Y1 - 2022/8/14
N2 - The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more'' tree-like, when evaluated in terms of Gromov's δhyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, lp norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94%$.
AB - The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more'' tree-like, when evaluated in terms of Gromov's δhyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, lp norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94%$.
KW - hierachical clustering
KW - hyperbolic
KW - tree-metric
UR - http://www.scopus.com/inward/record.url?scp=85137141302&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137141302&partnerID=8YFLogxK
U2 - 10.1145/3534678.3539378
DO - 10.1145/3534678.3539378
M3 - Conference contribution
AN - SCOPUS:85137141302
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 201
EP - 211
BT - KDD 2022 - Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
Y2 - 14 August 2022 through 18 August 2022
ER -