DOI: 10.14489/vkit.2021.07.pp.046-056

Lagerev D. G., Makarova E. A.
DETERMINING THE SEMANTIC PROXIMITY OF NEWS MESSAGES BASED ON TITLES ANALYSIS
(pp. 46-56)

Abstract. The paper examines the use of data from unstructured sources, such as social networks and online media, in developing management decisions. It considers how such sources are analyzed when managerial decisions in the socio-economic sphere are developed and made. It describes the difficulties of processing unstructured data, such as the impossibility of fully automatic evaluation of data semantics and the presence of a large amount of duplicate information. Various approaches to determining ratings and metrics for sources and messages are described. The problem of identifying duplicate messages is considered using online media as an example, both by full texts and by titles. In this context, a duplicate of a news message is a repetition of a significant amount of its information in another article. If the text of a news message (article) is not repeated verbatim, the degree to which the article's meaning is duplicated cannot be determined without human expertise. Various metrics for assessing the similarity (semantic proximity) of textual information can help here, and some of them are described in the article. An adaptation of the Word Mover's Distance method for the Russian language is proposed, and a Word2Vec model is trained for use with it. A hybrid approach to identifying and eliminating duplicate messages as part of preprocessing unstructured data for managerial decision-making is proposed. In the experiments, depending on the chosen method, 43 to 74% of duplicates were detected automatically from the publication time and title analysis.

Keywords: Natural language processing; Duplicate detection; Semantic proximity; Data mining; Jaccard coefficient; Cosine distance.
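
To make two of the metrics named in the keywords concrete, the following is a minimal sketch of the Jaccard coefficient and cosine similarity computed over the words of two news titles. It is an illustration only, not the authors' implementation: the whitespace tokenizer, the function names, and the sample titles are all assumptions introduced here, and a real pipeline for Russian text would likely add lemmatization and stop-word removal.

```python
import math
from collections import Counter


def tokenize(title):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return title.lower().split()


def jaccard(title_a, title_b):
    """Jaccard coefficient: |A & B| / |A | B| over the sets of title tokens."""
    set_a, set_b = set(tokenize(title_a)), set(tokenize(title_b))
    union = set_a | set_b
    if not union:
        return 0.0  # two empty titles: treat as no measurable overlap
    return len(set_a & set_b) / len(union)


def cosine_similarity(title_a, title_b):
    """Cosine of the angle between bag-of-words count vectors of the titles."""
    counts_a, counts_b = Counter(tokenize(title_a)), Counter(tokenize(title_b))
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical titles, invented for this example.
t1 = "central bank raises key rate"
t2 = "central bank raises the key rate"
print(round(jaccard(t1, t2), 3))            # 5 shared tokens out of 6 distinct
print(round(cosine_similarity(t1, t2), 3))  # cosine distance is 1 minus this value
```

Both measures compare surface tokens only, so they miss titles that share meaning but not vocabulary; that gap is what embedding-based measures such as Word Mover's Distance are meant to address.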

About the Authors

D. G. Lagerev, E. A. Makarova (Bryansk State Technical University, Bryansk, Russia)

