DOI: 10.14489/vkit.2018.12.pp.028-035

Lapshin S. V., Spivak A. I., Lebedev I. S.
AUTOMATIC TEXT CLASSIFICATION USING SEMANTIC-SYNTACTIC RELATIONS BETWEEN WORDS
(pp. 28-35)

Abstract. A method is proposed for improving the quality of automatic text classification by using information about the semantic-syntactic relations between words. Analysis of the graph produced by semantic-syntactic parsing of a text yields a set of features that can either be used to train a separate classifier or be added to statistical features and used jointly in training. A classifier implementing this idea has been developed. An experiment on short scientific texts showed a 12.15 % reduction in the number of classification errors compared with a classifier trained on statistical features.
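The reported 12.15 % is a relative reduction in the number of classification errors. A minimal sketch of that arithmetic follows. The helper name and the error counts are hypothetical, since the abstract gives no absolute numbers.

```python
def relative_error_reduction(baseline_errors: int, improved_errors: int) -> float:
    """Percentage by which the error count dropped relative to the baseline."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - improved_errors) / baseline_errors


# Hypothetical counts, chosen only to make the arithmetic easy to follow;
# they are not taken from the paper.
reduction = relative_error_reduction(baseline_errors=200, improved_errors=176)
print(f"{reduction:.2f} %")  # prints "12.00 %"
```

Under this reading, a classifier that made 200 errors with statistical features alone would make about 176 after a 12 % reduction.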

Keywords: topic classification of texts; semantic analysis; syntactic analysis; extraction of semantic-syntactic features; support vector machine.

Lapshin S. V., Spivak A. I., Lebedev I. S.
AUTOMATIC TEXT CLASSIFICATION USING SEMANTIC-SYNTACTIC WORDS DEPENDENCIES
(pp. 28-35)

Abstract. We present a method for improving the quality metrics of text classification. The improvement is achieved by giving the text classifier additional semantico-syntactic features, which are calculated from a semantico-syntactic representation of the text. In our research, we used the Stanford CoreNLP parser and its “Universal++Dependencies” representation of the parse tree. This allowed us to handle some dependencies between words without additional preprocessing of the parse tree and to obtain a more complete set of semantico-syntactic features. Compared with statistical features, such as TF-IDF (Term Frequency - Inverse Document Frequency) for words or n-grams, our features make it possible to build a more “meaningful” numerical model of texts. At the same time, semantico-syntactic features can either be used to train a separate classifier or be added to statistical features and used jointly in training. We performed an experiment on English texts from arXiv.org. We took the titles and abstracts of 4500 papers from three lexically close, non-overlapping subject areas and used them to train and evaluate two classifiers. The first classifier was trained on statistical features. The second was trained on both statistical and semantico-syntactic features. Both used the support vector machine method and were tuned separately for maximum accuracy using cross-validation. The experiment showed a 12.15 % reduction in the number of classification errors compared with the classifier trained on statistical features.
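The idea of adding parse-derived features to statistical ones can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not define the semantico-syntactic features, so encoding each dependency edge as a `relation(head,dependent)` string and weighting it with TF-IDF is an assumption. The toy documents, the relation labels, and all function names are hypothetical. In the paper's setting the edges would come from a parser such as Stanford CoreNLP, and the resulting vectors would be passed to a support vector machine.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokens plus hand-written dependency edges
# (head, relation, dependent). Nothing here is produced by a real parser.
DOCS = [
    {"tokens": ["neural", "network", "classifies", "texts"],
     "edges": [("network", "amod", "neural"),
               ("classifies", "nsubj", "network"),
               ("classifies", "dobj", "texts")]},
    {"tokens": ["parser", "builds", "dependency", "tree"],
     "edges": [("builds", "nsubj", "parser"),
               ("tree", "compound", "dependency"),
               ("builds", "dobj", "tree")]},
    {"tokens": ["network", "builds", "texts"],
     "edges": [("builds", "nsubj", "network"),
               ("builds", "dobj", "texts")]},
]


def tfidf(docs_terms):
    """Return one {term: tf-idf weight} dict per document."""
    n_docs = len(docs_terms)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in docs_terms for term in set(terms))
    vectors = []
    for terms in docs_terms:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def edge_terms(edges):
    """Encode each dependency edge as one string, e.g. 'nsubj(builds,parser)'."""
    return [f"{rel}({head},{dep})" for head, rel, dep in edges]


def combined_features(docs):
    """Merge word-level and edge-level weights into one sparse vector per document."""
    word_vecs = tfidf([d["tokens"] for d in docs])
    edge_vecs = tfidf([edge_terms(d["edges"]) for d in docs])
    merged = []
    for words, edges in zip(word_vecs, edge_vecs):
        vec = {f"w:{term}": weight for term, weight in words.items()}
        vec.update({f"d:{term}": weight for term, weight in edges.items()})
        merged.append(vec)
    return merged


features = combined_features(DOCS)
```

The first and third toy documents share the words "network" and "texts" but no dependency edge, so the `d:` features separate them where word-level features overlap. This is the kind of distinction the abstract attributes to semantico-syntactic features.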

Keywords: topic classification; semantic analysis; syntactic analysis; extraction of semantico-syntactic features; support vector machine.

About the Authors

S. V. Lapshin (Saint-Petersburg State University, St. Petersburg, Russia)
A. I. Spivak (Saint-Petersburg National Research University of Information Technologies, Mechanics and Optics, St. Petersburg, Russia)
I. S. Lebedev (Saint-Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia)
