DOI: 10.14489/vkit.2016.03.pp.031-037

Averchenkov V. I., Budylskii D. V., Podvesovskii A. G.
ANALYSIS OF THE APPLICATION OF VECTOR REPRESENTATION MODELS OF TEXTUAL INFORMATION FOR RUSSIAN-LANGUAGE TEXTS
(pp. 31-37)

Abstract. The paper considers the task of vector representation of Russian-language text and its relevance, along with the main principles and approaches used to solve it. The word2vec and GloVe methods are described in detail and evaluated on a corpus of Russian-language texts built from Wikipedia articles. On this corpus, word2vec achieves the better accuracy on the word analogy task. The results confirm that both methods capture relations between words, whose closeness is measured by the cosine distance between their vector representations.

Keywords: distributional semantics; vector representation of text; machine learning; natural language processing; semantic similarity.
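The word analogy test and the cosine measure mentioned in the abstract can be illustrated with a minimal sketch. The vectors below are hand-made toy values chosen for illustration, not embeddings from the article, and the function names are hypothetical; the sketch only shows the general "a is to b as c is to ?" procedure commonly used with word2vec and GloVe vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by vector arithmetic.

    The target point is v_b - v_a + v_c; the answer is the word whose
    vector has the highest cosine similarity to that point, excluding
    the three query words themselves.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# Assumption: hand-made 3-dimensional toy vectors, NOT trained embeddings.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

answer = solve_analogy(toy_vectors, "man", "woman", "king")
print(answer)
```

Accuracy on an analogy test set is then simply the share of questions for which the returned word matches the expected one.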

Averchenkov V. I., Budylskii D. V., Podvesovskii A. G.
APPLICATION OF WORD REPRESENTATIONS FOR RUSSIAN TEXT CORPORA
(pp. 31-37)

Abstract. Machine learning (ML) is becoming increasingly relevant in natural language processing tasks, driven by the growing volume of text data available in online media and social networks. Word representation is an important part of many ML methods for text processing. In this paper we consider the word representation task and its relevance to modern applications. We review its formal definition and common approaches, from the basic one-hot representation to recent ones. The two main problems of word representations are the size of the vector space and a model's ability to represent latent relations between words. The first can be addressed by dimensionality reduction methods such as singular value decomposition. The second has been substantially addressed in recent years by the word2vec and GloVe methods. We applied these models to a Russian Wikipedia text corpus and tested the resulting word embeddings on the word analogy task introduced by Mikolov. For syntactic relations we used word forms of Russian nouns, adjectives, verbs and adverbs. For semantic relations we used male-female and geographic analogies, as in Mikolov's tests. The accuracy results motivate further research with larger corpora, a deeper investigation of parameter influence, and the application of other models to Russian texts.

Keywords: Distributional semantics; Word representation; Machine learning; Natural language processing; Semantic similarity.
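The two problems of the basic one-hot representation named in the abstract, vector size and the inability to express relations between words, can be seen in a small sketch. The vocabulary and function names here are illustrative assumptions, not material from the article.

```python
def one_hot_vectors(vocabulary):
    """Map each word to a vector with a single 1 at the word's own index."""
    size = len(vocabulary)
    return {
        word: [1 if i == j else 0 for j in range(size)]
        for i, word in enumerate(vocabulary)
    }


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Assumption: a tiny illustrative vocabulary.
vocabulary = ["moscow", "russia", "paris", "france"]
vectors = one_hot_vectors(vocabulary)

# Problem 1: the vector length equals the vocabulary size.
dimension = len(vectors["moscow"])

# Problem 2: any two distinct words are orthogonal, so related pairs
# ("moscow", "russia") look exactly as unrelated as ("moscow", "france").
related = dot(vectors["moscow"], vectors["russia"])
unrelated = dot(vectors["moscow"], vectors["france"])
print(dimension, related, unrelated)
```

With a realistic vocabulary of hundreds of thousof words the vectors grow accordingly, which is why dense low-dimensional representations are preferred.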

About the Authors

V. I. Averchenkov, D. V. Budylskii, A. G. Podvesovskii (Bryansk State Technical University)

References

1. Vanin S. (2014). Analysis of textual information: the key to improving business performance. Available at: http://www.cnews.ru/reviews/new/bi_bigdata_2014/articles/analiz_tekstovoj_informatsii_klyuch_k_povysheniyu_effektivnosti/ (Accessed: 08.06.2015).
2. Lukashevich N. V., Chetverkin I. I. (2014). Open evaluation of sentiment analysis systems on Russian-language material. Iskusstvennyi intellekt i priniatie reshenii, (1), pp. 25-33.
3. Lande D. V., Snarskii A. A., Bezsudnov I. V. (2009). Internetika. Navigation in complex networks. Models and algorithms. Moscow: Librokom.
4. Ramos J. (2003). Using TF-IDF to determine word relevance in document queries. Proc. of the First Instructional Conf. on Machine Learning (ICML-2003), pp. 45-65.
5. Deerwester S. et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), pp. 391-407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
6. Landauer T. K., Dumais S. T. (1997). A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), pp. 211-240. doi: 10.1037/0033-295X.104.2.211.
7. Turney P. D., Pantel P. (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), pp. 141-188.
8. Morozova Iu. I. (2012). Building semantic vector spaces in different subject areas. Proceedings of the 3rd school of young scientists of the Institute of Informatics Problems. Moscow, pp. 4-11.
9. Mikolov T. et al. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
10. Mikolov T. et al. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NIPS’2013), pp. 3111-3119.
11. Rong X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
12. Pennington J., Socher R., Manning C. D. (2014). GloVe: global vectors for word representation. Proc. of the Conf. on Empirical Methods in Natural Language Processing (EMNLP’2014), V. 12, pp. 1532-1543.
13. Kol'tsov S. et al. (2014). Interpretation of semantic relationships in texts of the Russian-language segment of LiveJournal based on the LDA topic model. Technologies of information society in science, education and culture: proceedings of the XVII All-Russian united conference «Internet and modern society». St. Petersburg, 19–20 November 2014. St. Petersburg, pp. 135-142.
14. Poritskii V. V., Volchek O. A. (2013). Construction of the vector semantic model based on Russian texts: the first experiments. Information technologies and systems-2013: proceedings of the conference. Kaliningrad, 1 – 6 September 2013, pp. 114-119.
15. Sharnin M. M. et al. (2013). Statistical mechanisms for forming associative portraits of subject areas from large volumes of natural-language text for knowledge extraction systems. Informatika i ee primeneniia, 7(2), pp. 92-99.
