ATLAS: The Landscape of Approximate Similarity Search — Two Decades of Algorithmic Advances
📄 Overview
Similarity search methods enable efficient retrieval of vectors similar to a given query and play a central role in a wide range of applications. Among the variants, approximate similarity search methods offer high accuracy with substantially improved efficiency over exact methods. Despite substantial progress, existing studies suffer from major limitations: (i) omission of key algorithmic families; (ii) neglect of recent methodological advances; (iii) lack of rigorous statistical validation; and (iv) limited evaluation on datasets that reflect modern AI applications. To address these gaps, we introduce ATLAS, the most comprehensive benchmark of approximate nearest neighbor search (ANNS) methods to date. Specifically, our contributions are fourfold: (i) a systematic review of five major algorithmic categories; (ii) a large-scale evaluation of 45 methods across 58 datasets; (iii) the introduction of a new measure that captures latency over a recall range, offering a threshold-free, unbiased assessment of query efficiency; and (iv) statistical analysis to ensure the robustness of the conclusions.

Our findings reveal seven key insights: (i) modern quantization-based methods achieve query efficiency comparable to graph-based algorithms while requiring substantially less memory; (ii) across four categories, previously unreported top performers emerge, with two showing statistically significant improvements; (iii) relative algorithm rankings vary across data modalities and vector dimensionality; (iv) parameter settings do not consistently transfer across datasets, and performance is highly sensitive to data characteristics; (v) hardware-accelerated methods exhibit architecture-dependent performance; (vi) performance is highly sensitive to implementation quality; and (vii) both indexing strategies and hardware acceleration yield substantial throughput gains at the cost of reduced accuracy.
Collectively, these findings sharpen our understanding of the ANNS landscape, uncover previously unexplored behaviors, and guide future research.
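To make the idea of a threshold-free efficiency measure concrete, the sketch below averages query latency over a recall interval instead of reading it off at a single recall threshold. This is only an illustration under stated assumptions: the exact definition of the ATLAS measure is not given in this overview, and the function names `recall_at_k` and `mean_latency_over_recall`, the default recall range `[0.8, 1.0]`, and the use of linear interpolation are all hypothetical choices made here.

```python
import numpy as np


def recall_at_k(retrieved, ground_truth):
    """Fraction of the true nearest neighbors that appear in the retrieved set."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)


def mean_latency_over_recall(recalls, latencies, lo=0.8, hi=1.0, num=201):
    """Average latency over the recall interval [lo, hi].

    Assumption: each (recall, latency) pair comes from one parameter setting
    of the same method, and recall values are distinct. The curve between
    measured points is approximated by linear interpolation.
    """
    recalls = np.asarray(recalls, dtype=float)
    latencies = np.asarray(latencies, dtype=float)

    # Sort the operating points by recall so interpolation is well defined.
    order = np.argsort(recalls)
    r, t = recalls[order], latencies[order]

    # Refuse to extrapolate beyond the measured recall range.
    if r[0] > lo or r[-1] < hi:
        raise ValueError("measured recalls do not cover the requested range")

    # Resample the latency curve on a uniform recall grid.
    grid = np.linspace(lo, hi, num)
    curve = np.interp(grid, r, t)

    # Trapezoidal area under the curve, normalized by the interval width.
    area = np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(grid))
    return area / (hi - lo)


if __name__ == "__main__":
    # Hypothetical operating points for one method (latency in milliseconds).
    recalls = [0.80, 0.90, 1.00]
    latencies = [1.0, 2.0, 4.0]
    print(mean_latency_over_recall(recalls, latencies))
```

Because the score integrates over a range, two methods can be compared without first agreeing on a single recall target such as 0.9 or 0.95. A lower value means lower average latency across the range.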
🗄️ Dataset
Because of GitHub's upload size limits, we host the datasets at a different location. Please download them using the following links: