DOI: 10.14489/vkit.2020.07.pp.044-054

Lagerev D. G., Makarova E. A.
SEARCH AND EXPANSION OF ABBREVIATIONS IN RUSSIAN-LANGUAGE DATA OF MEDICAL INFORMATION SYSTEMS
(pp. 44-54)

Abstract. The paper addresses the integration, processing, and mining of semi-structured data from medical information systems in support of managerial decision-making in healthcare. It describes problems typical of such data: insufficient structure, a large number of errors and of abbreviations and acronyms specific to particular nosologies, and the difficulty of automatically interpreting the semantics of some data fields. An approach to finding and then expanding abbreviations and acronyms in texts is demonstrated, built on a combination of machine and human processing. Experiments on de-identified medical records led to the conclusion that adopting this approach substantially reduces labor costs at the price of a small decrease in the accuracy of abbreviation expansion.

Keywords: medical information system; data mining; natural language processing; abbreviation detection; abbreviation expansion.

Lagerev D. G., Makarova E. A.
FEATURES OF PRELIMINARY PROCESSING OF SEMI-STRUCTURED MEDICAL DATA IN RUSSIAN FOR USE IN ENSEMBLES OF DATA MINING MODELS
(pp. 44-54)

Abstract. The paper considers the problem of integrating, processing, and mining poorly structured data from medical information systems in order to support managerial decisions in healthcare. It describes problems typical of medical data, such as insufficient structure, a large number of abbreviations characteristic of specific nosologies, and the difficulty of automatic semantic interpretation of some fields. The authors demonstrate an approach to the search and expansion of abbreviations in texts, based on a hybrid of machine and human processing that combines the strengths of both. The proposed method increased the number of abbreviations found by automatic methods by 21 %, expanded up to 55 % of cases in automated mode (with a probability of correctness above 70 %), and significantly reduced the time specialists spend processing the remaining abbreviations. Further research will address other problems associated with the processing and specifics of medical data, such as the large number of spelling errors and specific grammatical constructions. Using a hybrid approach to preprocessing poorly structured data will increase the efficiency of management decisions in healthcare by reducing the time experts spend on their creation and support. The hybrid approach to preprocessing Russian-language text data can be applied in other subject areas, although the technique may need to be adjusted to the specifics of the processed data.

Keywords: Health information system; Data mining; Natural language processing; Abbreviation detection; Abbreviation expansion.
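
The split between automated and expert handling mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each detected abbreviation already comes with candidate expansions and an estimated probability of correctness; how the paper computes these probabilities is not stated here. The names `Candidate`, `route`, `process`, and `AUTO_THRESHOLD` are hypothetical, and the probabilities in the sample batch are made up for illustration only.

```python
from dataclasses import dataclass

# Assumption: the cut-off mirrors the "probability of correctness above 70 %"
# quoted in the abstract; the paper's actual scoring method is not reproduced.
AUTO_THRESHOLD = 0.70


@dataclass
class Candidate:
    expansion: str      # proposed full form of the abbreviation
    probability: float  # estimated probability that this expansion is correct


def route(candidates, threshold=AUTO_THRESHOLD):
    """Return ("auto", expansion) when the best candidate is confident enough,
    otherwise ("expert", ranked expansions) for manual review."""
    ranked = sorted(candidates, key=lambda c: c.probability, reverse=True)
    if ranked and ranked[0].probability > threshold:
        return "auto", ranked[0].expansion
    return "expert", [c.expansion for c in ranked]


def process(batch, threshold=AUTO_THRESHOLD):
    """Split a batch into automatically expanded items and an expert queue."""
    expanded, review_queue = {}, {}
    for abbreviation, candidates in batch.items():
        decision, payload = route(candidates, threshold)
        if decision == "auto":
            expanded[abbreviation] = payload
        else:
            review_queue[abbreviation] = payload
    return expanded, review_queue


# Illustrative input with made-up probabilities (not data from the paper).
batch = {
    "АД": [Candidate("артериальное давление", 0.93),
           Candidate("атопический дерматит", 0.05)],
    "СД": [Candidate("сахарный диабет", 0.55),
           Candidate("систолическое давление", 0.40)],
}
expanded, review_queue = process(batch)
```

In this sketch only the first abbreviation clears the threshold and is expanded automatically; the second is passed to a specialist with its candidates already ranked, which is one way the remaining manual work could be shortened.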

About the Authors

D. G. Lagerev, E. A. Makarova (Bryansk State Technical University, Bryansk, Russia)
