10.14489/vkit.2026.01.pp.012-015

DOI: 10.14489/vkit.2026.01.pp.012-015

Яценко Д. В., Яценко М. Д.
МЕТОД ИНДЕКСИРОВАННОГО ПОИСКА СХОЖИХ ВЕКТОРОВ В БАЗАХ ДАННЫХ
(с. 12-15)

Аннотация. Приведен алгоритм индексирования для ускорения поиска в базе данных датасета векторов, релевантных эталонному по метрике косинусной близости. Алгоритм похож на алгоритм построения расширенного k–d дерева в методе Approximate Best Bin First k–d Tree, где выполняется разбиение признакового пространства на прямоугольные гиперпараллелепипеды. В представленном методе в качестве индекса в исходном наборе данных выступает номер вектора-орта, наиболее близкого по метрике косинусного расстояния к каждому образцу датасета. Таким образом, при поиске векторов, релевантных эталонному, на заранее подготовленном (проиндексированном) по представленному алгоритму датасете можно по значениям индекса отбирать заведомо близкие вектора и отбрасывать иные. В этом случае получение значительно меньшего субдатасета позволяет выполнять поиск при помощи индекса значительно быстрее. Для больших датасетов, которые хранятся в базах данных на жестких дисках, выборка по номеру индекса позволит использовать индексный поиск системы управления базами данных, например, по алгоритмам btree или beetmap. Помимо существенной экономии временных затрат на вычислениях при этом значительно экономится время на операциях ввода–вывода. Авторы применили этот подход практически в задаче поиска релевантных текстов по базе данных векторов размерностью 384, с количеством образцов 500 000. Использование индекса сократило время поиска в 30…40 раз.

Ключевые слова: машинное обучение, индексированный поиск, косинусное сходство, векторизованный текстовый поиск.

Iatsenko D. V., Iatsenko M. D.
IMPLEMENTATION OF AN INDEXED METHOD FOR SEARCHING SIMILAR VECTORS IN DATABASES
(pp. 12-15)

Abstract. This paper addresses the performance bottleneck of cosine-similarity search in large vector datasets stored in database systems. A conventional approach computes cosine similarity between a query vector and every dataset vector and then ranks the results, which is costly for high dimensionality and, when data reside on disk, for I/O. We propose a lightweight two-stage indexing scheme. First, we build an auxiliary table of 2M unit “anchor” vectors (orths) whose coordinates are zero everywhere except one axis with value +1 or −1. During offline preprocessing, each dataset vector is assigned the identifier of the anchor vector with the smallest cosine distance, and this identifier is stored as an index field. At query time, we compute cosine similarity between the query and all anchors and select the top L anchors (or those above a threshold). Then the DBMS retrieves only rows whose stored anchor id matches the selected anchors, leveraging standard index structures such as B-trees or bitmap indexes. Exact cosine ranking is performed only within this reduced subset. The resulting complexity is reduced by approximately M/L compared with a full scan, while also cutting disk accesses. In experiments on a 384-dimensional text-embedding database of about 500,000 vectors (paraphrase-multilingual-MiniLM-L12-v2), using L = 7 decreased search time from ~16 s to ~0.5 s, providing an observed 30–40× speed-up. The approach is easy to integrate into DB pipelines and suits large-scale text retrieval.

Keywords: Machine learning; Indexing search; Cosine similarirty; Vectorize text search.

+ - Информация об авторах (About the Authors) Click to collapse

Рус

Д. В. Яценко (Южный федеральный университет, Ростов-на-Дону, Россия) E-mail: Этот e-mail адрес защищен от спам-ботов, для его просмотра у Вас должен быть включен Javascript
М. Д. Яценко (Южно-Российский государственный политехнический университет, Новочеркасск, Ростовская обл., Россия)

Eng

D. V. Iatsenko (South Federal University, Rostov-on-Don, Russia) E-mail: Этот e-mail адрес защищен от спам-ботов, для его просмотра у Вас должен быть включен Javascript
M. D. Iatsenko (South-Russian State Polytechnic University, Novocherkassk, Rostov region, Russia)

+ - Библиографический список (References) Click to collapse

Рус

1. Duda R. O., Hart P. E., Stork D. G. Pattern Classification. 2-d ed. New York: Wiley-Interscience, 2001. 654 p.
2. Sun J. W., Du W. X., Shi N. C. A survey of kNN algorithm // Information Engineering and Applied Computing. 2018. V. 1. Art. ID 770. DOI: 10.18063/ieac.v1i1.770
3. Cover T., Hart P. E. Nearest neighbor pattern classification // IEEE Transactions on Information Theory. 1967. V. 13, № 1. P. 21–27.
4. Omran M., Engelbrecht A. P., Salman A. An overview of clustering methods // Intell. Data Analysis. 2007. V. 11. P. 583–605.
5. Shahmirzadi O., Lugowski A., Younge K. Text similarity in vector space models: a comparative study [Электронный ресурс]. 2019. 17 p. URL: https://ssrn.com/abstract=3259971 (дата обращения: 20.07.2024).
6. Tsaparas P. Nearest neighbor search in multidimensional spaces [Электронный ресурс] Technical Report 319-02. Toronto, Ontario, Canada, 1999. URL: https://www.researchgate.net/publication/2313984_Nearest_Neighbor_Search_in_Multidimensional_Spaces (дата обращения: 13.07.2025).
7. Friedman J. H., Baskett F., Shustek L. J. An algorithm for finding nearest neighbors // IEEE Trans. Comput. 1975. V. C-24. P. 1000–1006.
8. Nene S. A., Nayar S. K. A simple algorithm for nearest neighbor search in high dimensions // IEEE Trans. Pattern Anal. Mach. Intell. 1997. V. 19, № 9. P. 989–1003.
9. Datar M., Indyk P., Immorlica N., Mirrokni V. S. Locality-sensitive hashing scheme based on p-stable distributions // Proc. 20th Annu. Symp. Comput. Geometry. 2004. P. 253–262.
10. Indyk P., Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality // Proc. 30th Annu. ACM Symp. Theory Comput. 1998. Dallas (Tekhas), USA. 23–26 May 1998. P. 604–613.
11. Kybic J., Vnucko I. Approximate all nearest neighbor search for high dimensional entropy estimation for image registration // Signal Process. 2012. V. 92. P. 1302–1316.
12. VMware Inc: Using indexes in Greenplum Database [Электронный ресурс]. 2024. URL: https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-ddl-ddl-index.html#creating-an-index (дата обращения: 20.07.2024).

Eng

1. Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). Wiley-Interscience.
2. Sun, J. W., Du, W. X., & Shi, N. C. (2018). A survey of kNN algorithm. Information Engineering and Applied Computing, 1, Article 770. https://doi.org/10.18063/ieac.v1i1.770
3. Cover, T., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
4. Omran, M., Engelbrecht, A. P., & Salman, A. (2007). An overview of clustering methods. Intelligent Data Analysis, 11, 583–605.
5. Shahmirzadi, O., Lugowski, A., & Younge, K. (2019). Text similarity in vector space models: A comparative study (SSRN Scholarly Paper ID 3259971). Retrieved July 20, 2024, from https://ssrn.com/abstract=3259971
6. Tsaparas, P. (1999). Nearest neighbor search in multi-dimensional spaces (Technical Report 319-02). University of Toronto. Retrieved July 13, 2025, from https://www.researchgate.net/publication/2313984_Nearest_Neighbor_Search_in_Multidimensional_Spaces
7. Friedman, J. H., Baskett, F., & Shustek, L. J. (1975). An algorithm for finding nearest neighbors. IEEE Transactions on Computers, C-24(10), 1000–1006.
8. Nene, S. A., & Nayar, S. K. (1997). A simple algorithm for nearest neighbor search in high dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9), 989–1003.
9. Datar, M., Indyk, P., Immorlica, N., & Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry (pp. 253–262).
10. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (pp. 604–613).
11. Kybic, J., & Vnucko, I. (2012). Approximate all nearest neighbor search for high dimensional entropy estimation for image registration. Signal Processing, 92, 1302–1316.
12. VMware, Inc. (2024). Using indexes in Green

+ - Заказать электронную версию статьи (Purchase digital version of a single article) Click to collapse

Рус

Статью можно приобрести в электронном виде (PDF формат).

Стоимость статьи 700 руб. (в том числе НДС 20%). После оформления заказа, в течение нескольких дней, на указанный вами e-mail придут счет и квитанция для оплаты в банке.

После поступления денег на счет издательства, вам будет выслан электронный вариант статьи.

Для заказа скопируйте doi статьи:

10.14489/vkit.2026.01.pp.012-015

и заполните форму

Отправляя форму вы даете согласие на обработку персональных данных.

Eng

This article is available in electronic format (PDF).

The cost of a single article is 700 rubles. (including VAT 20%). After you place an order within a few days, you will receive following documents to your specified e-mail: account on payment and receipt to pay in the bank.

After depositing your payment on our bank account we send you file of the article by e-mail.

To order articles please copy the article doi:

10.14489/vkit.2026.01.pp.012-015

and fill out the form