10.14489/vkit.2018.12.pp.003-010

DOI: 10.14489/vkit.2018.12.pp.003-010

Белозеров А. А., Вахлаков Д. В., Пересыпкин В. А., Мельников С. Ю., Скавинская Д. В.
ЭВОЛЮЦИОННЫЕ МЕТОДЫ ДИСКРЕТНОЙ ОПТИМИЗАЦИИ В ЗАДАЧЕ КОРРЕКЦИИ ИСКАЖЕННЫХ ТЕКСТОВ
(с. 3-10)

Аннотация. Предложено использовать эволюционные методы дискретной оптимизации для коррекции искаженного текста, символы которого известны в вариантах. Целевая функция – степень согласия скорректированного варианта текста со статистической моделью языка. В качестве методов оптимизации рассмотрены генетические алгоритмы и методы искусственных иммунных систем, в качестве модели языка использована шестиграммная модель на символах. Приведены результаты экспериментов по оценке точности разработанных процедур коррекции искажений типа «замена символа» для английского языка.

Ключевые слова: автоматическая коррекция; модель языка; коррекция текстов; искажения текстов;генетические алгоритмы; искусственные иммунные системы.

Belozerov A. A., Vakhlakov D. V., Peresypkin V. A., Melnikov S. Yu., Skavinskaya D. V.
Garbled TEXT CORRECTION USING EVOLUTIONARY DISCRETE OPTIMIZATION ALGORITHMS
(pp. 3-10)

Abstract. For the garbled text correction problem with information about possible characters’ variants the application of evolutionary discrete optimization algorithms is offered. Number of characters in the corrected text defines the size of parameter vector in optimization problem; each vector component corresponds to the text character and can take a value from the predefined finite set (correcting set). The fitness function is calculated with the language model and the garbled text correction problem is reduced to the maximization of this function. As a language model, the 6-gram character-level model is applied. As optimization methods, the genetic algorithm and artificial immune system method are considered. For long texts, which are large-scale problem, the two-stage evolutionary method with preliminary search in short substrings was developed. The developed software for the defined problem solution based on evolutionary algorithms is described. To estimate the accuracy of developed correction methods, which correct the “character replacement” garble type for English language, the experimental results are presented. The experiments include 27-character alphabet including 26 Latin letters and space symbol, letters case is not considered. The cardinality of the correcting set is constant and equal to the predefined value. Two cases with cardinality equal to 5 and 8 are considered. The dependence of the evolutionary methods accuracy on the number of populations and on the probability of mutation is analyzed. For the two-stage method the dependence of the accuracy on the length of substring in preliminary search is revealed. The correlation between fitness function and the percentage of correctly found characters is studied. The evolutionary methods are able to obtain the high percentage of true characters in a short corrected text. The results of the genetic algorithm are better than the results of the artificial immune system method. For the correction of long texts, the two-stage evolutionary method shows its effectiveness.

Keywords: Automatic correction; Language model; Text correction; Noisy text; Genetic algorithms; Artificial immune systems.

+ - Информация об авторах (About the Authors) Click to collapse

Рус

А. А. Белозеров, Д. В. Вахлаков, В. А. Пересыпкин (ФГУП «Научно-технический центр «Орион», Москва, Россия)
С. Ю. Мельников (ООО «Лингвистические и информационные технологии», Москва, Россия) E-mail: Этот e-mail адрес защищен от спам-ботов, для его просмотра у Вас должен быть включен Javascript
Д. В. Скавинская (Московский авиационный институт (национальный исследовательский университет), Москва, Россия);

Eng

A. A. Belozerov, D. V. Vakhlakov, V. A. Peresypkin (Scientific and Technical Center “Orion”, Moscow, Russia)
S. Yu. Melnikov (Linguistics and Information Technologies Ltd., Moscow, Russia) E-mail: Этот e-mail адрес защищен от спам-ботов, для его просмотра у Вас должен быть включен Javascript
D. V. Skavinskaya (Moscow Aviation Institute (National Research University), Moscow, Russia)

+ - Библиографический список (References) Click to collapse

Рус

1. Kumar A. A Survey on Various OCR Errors // Intern. Journal of Computers and Applications. 2016. V. 143, No. 4. P. 8 – 10. doi: 10.5120/ijca2016910142
2. Бахтенко Е. А., Баланин Е. О. Коррекция ошибок на этапе постобработки при оптическом распознавании символов // Таврический научный обозреватель. 2016. № 8-2(13). C. 10 – 12.
3. Мещеряков Р. В. Структура систем синтеза и распознавания речи // Изв. Томского политехн. университета. Управление, вычислительная техника и информатика. 2009. Т. 315, № 5. С. 121 – 126.
4. MacKenzie I. S., Soukoreff R. W. A Character-Level Error Analysis Technique for Evaluating Text Entry Methods // Proc. of the Second Nordic Conf. on Human-Computer Interaction – NordiCHI. New York, 2002. P. 241 – 244.
5. De Pauw G., Wagacha P. W., de Schryver G.-M. Automatic Diacritic Restoration for Resource-Scarce Languages // Intern. Conf. on Text, Speech and Dialogue. Lecture Notes in Computer Science. 2007. V. 4629. P. 170 – 179. doi: 10.1007/ 978-3-540-74628-7_24
6. Kipyatkova I., Karpov A. Lexicon Size and Language Model Order Optimization for Russian LVCSR // Intern. Conf. on Speech and Computer. Lecture Notes in Computer Science. 2013. V. 8113. P. 219 – 226. doi: 10.1007/978-3-319-01931-4_29
7. Мельников С. Ю., Пересыпкин В. А. О применении вероятностных моделей языка для обнаружения ошибок в искаженных текстах // Вестник компьютерных и информационных технологий. 2016. № 5. С. 29 – 33. doi: 10.14489/vkit.2016.05.pp.029-033
8. Zhou B. Statistical Machine Translation for Speech: A Perspective on Structures, Learning and Decoding // Proc. of the IEEE. Special Issue: Speech Information Processing. 2013. V. 101. P. 1180 – 1202.
9. Extractive Single-Document Summarization Based on Genetic Operators and Guided Local Search / M. Mendoza et al. // Expert Systems with Applications. 2014. V. 41, Is. 9. P. 4158 – 4169. doi: 10.1016/ j.eswa.2013.12.042
10. Man K. F., Tang K. S., Kwong S. Genetic Algorithms in Speech Recognition Systems // Genetic Algorithms: Concepts and Designs (Advanced Textbooks in Control and Signal Processing). Springer-Verland, London, 1999. P. 199 – 257.
11. Bungum L., Gambäck B. Evolutionary Algorithms in Natural Language Processing // Proc. of the Norwegian Artificial Intelligence Symposium. 2010. P. 7 – 19.
12. Picek S., Golub M. On Evolutionary Computation Methods in Cryptography // Proc. of the 34th Intern. Convention, MIPRO. 2011. P. 1496 – 1501.
13. Пантелеев А. В., Скавинская Д. В., Алешина Е. А. Метаэвристические алгоритмы поиска оптимального программного управления. М.: ИНФРА-М, 2016. 396 с. (Сер. Научная мысль).
14. Карпенко А. П. Современные алгоритмы поисковой оптимизации. Алгоритмы, вдохновленные природой: учеб. пособие. М.: Изд-во МГТУ им. Н. Э. Баумана, 2014. 446 с.
15. Панченко Т. В. Генетические алгоритмы / под ред. Ю. Ю. Тарасевича. Астрахань: ИД «Астраханский университет», 2007. 87 с.
16. Гладков Л. А., Курейчик В. В., Курейчик В. М. Генетические алгоритмы. 2-е изд., испр. и доп. М.: ФИЗМАТЛИТ, 2006. 320 с.
17. Holland J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence / University of Michigan Press. Ann Arbor: MIT Press, 1992. 225 p.
18. Дасгупта Д. Искусственные иммунные системы и их применение / пер с англ. А. А. Романюха. М.: ФИЗМАТЛИТ, 2006. 344 c.
19. Kopec G., Said M., Popat K. N-Gram Language Models for Document Image Decoding // Proc. IS&T/SPIE. Electronic Imaging. Document Recognition and Retrieval IX. San Jose, California, US. 2002. V. 4670. 12 p. doi: 10.1117/ 12.450728
20. Технологические аспекты построения системы сбора и предобработки корпусов новостных текстов для создания моделей языка / А. А. Белозеров и др. // Изв. ЮФУ. Технические науки. 2016. № 12(185). С. 29 – 42. doi: 10.18522/2311-3103-2016-12-2942
21. Zhai C., Lafferty J. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval // ACM SIGIR Forum. 2017. V. 51, No. 2. P. 268 – 276. doi:10.1145/383952.384019
22. Wolfram: Computation Meets Knowledge [Электронный ресурс]: офиц. сайт. URL: http://www. wolfram.com (дата обращения: 05.10.2018).

Eng

1. Kumar A. (2016). A Survey on Various OCR Errors. International Journal of Computers and Applications, 143(4), pp. 8-10. doi: 10.5120/ijca2016910142
2. Bahtenko E. A., Balanin E. O. (2016). Error correction at the post-processing stage for optical character recognition. Tavricheskiy nauchnyy obozrevatel', 13(8-2), pp. 10-12. [in Russian language]
3. Mescheryakov R. V. (2009). Structure of speech synthesis and recognition systems. Izvestiya Tomskogo politekhnicheskogo universiteta. Upravlenie, vychislitel'naya tekhnika i informatika, 315(5), pp. 121-126. [in Russian language]
4. MacKenzie I. S., Soukoreff R. W. (2002). A Character-Level Error Analysis Technique for Evaluating Text Entry Methods. Proceedings of the second Nordic conference on Human-computer interaction – NordiCHI. New York, pp. 241-244.
5. De Pauw G., Wagacha P. W., de Schryver G.-M. (2007). Automatic Diacritic Restoration for Resource-Scarce Languages. International Conference on Text, Speech and Dialogue. Lecture Notes in Computer Science, 4629, pp. 170-179. doi: 10.1007/ 978-3-540-74628-7_24
6. Kipyatkova I., Karpov A. (2013). Lexicon Size and Language Model Order Optimization for Russian LVCSR. International Conference on Speech and Computer. Lecture Notes in Computer Science, 8113, pp. 219-226. doi: 10.1007/978-3-319-01931-4_29
7. Mel'nikov S. Yг., Peresypkin V. A. (2016). On the use of probabilistic language models for detecting errors in distorted texts. Vestnik komp'yuternyh i informatsionnyh tekhnologiy, (5), pp. 29-33. doi: 10.14489/vkit.2016.05.pp.02933 [in Russian language]
8. Zhou B. (2013). Statistical Machine Translation for Speech: A Perspective on Structures, Learning and Decoding. Proceedings of the IEEE. Special Issue: Speech Information Processing, 101, pp. 1180-1202.
9. Mendoza M. et al. (2014). Extractive Single-Document Summarization Based on Genetic Operators and Guided Local Search. Expert Systems with Applications, 41(9), pp. 4158-4169. doi: 10.1016/ j.eswa.2013.12.042
10. Man K. F., Tang K. S., Kwong S. (1999). Genetic Algorithms in Speech Recognition Systems. In: Genetic Algorithms: Concepts and Designs (Advanced Textbooks in Control and Signal Processing). Springer-Verland, London, pp. 199-257.
11. Bungum L., Gambäck B. (2010). Evolutionary Algorithms in Natural Language Processing. Proceedings of the Norwegian Artificial Intelligence Symposium, pp. 7-19.
12. Picek S., Golub M. (2011). On Evolutionary Computation Methods in Cryptography. Proceedings of the 34th International Convention, MIPRO, pp. 1496-1501.
13. Panteleev A. V., Skavinskaya D. V., Aleshina E. A. (2016). Metaheuristic algorithms for finding optimal programmed control. Moscow: INFRA-M. (Seriya Nauchnaya mysl'). [in Russian language]
14. Karpenko A. P. (2014). Modern search engine optimization algorithms. Algorithms inspired by nature: study guide. Moscow: Izdatel'stvo MGTU im. N. E. Baumana. [in Russian language]
15. Tarasevich Yu. Yu. (Ed.), Panchenko T. V. (2007). Genetic Algorithms. Astrakhan: ID «Astrahanskiy universitet». [in Russian language]
16. Gladkov L. A., Kureychik V. V., Kureychik V. M. (2006). Genetic Algorithms. 2nd ed. Moscow: FIZMATLIT. [in Russian language]
17. Holland J. H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. University of Michigan Press. Ann Arbor: MIT Press.
18. Dasgupta D. (2006). Artificial Immune Systems and Their Applications. Moscow: FIZMATLIT. [in Russian language]
19. Kopec G., Said M., Popat K. (2002). N-Gram Language Models for Document Image Decoding. Proceedings IS&T/SPIE. Electronic Imaging. Document Recognition and Retrieval IX, Vol. 4670. San Jose, California, US. doi: 10.1117/ 12.450728
20. Belozerov A. A. et al. (2016). Technological aspects of building a system for collecting and preprocessing news corpuses to create language models. Izvestiya YuFU. Tekhnicheskie nauki, 185(12), pp. 29-42. [in Russian language] doi: 10.18522/2311-3103-2016-12-2942
21. Zhai C., Lafferty J. (2017). A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. ACM SIGIR Forum, 51(2), pp. 268-276. doi:10.1145/383952.384019
22. Wolfram: Computation Meets Knowledge. Available at: http://www. wolfram.com (Accessed: 05.10.2018).

+ - Заказать электронную версию статьи (Purchase digital version of a single article) Click to collapse

Рус

Статью можно приобрести в электронном виде (PDF формат).

Стоимость статьи 350 руб. (в том числе НДС 18%). После оформления заказа, в течение нескольких дней, на указанный вами e-mail придут счет и квитанция для оплаты в банке.

После поступления денег на счет издательства, вам будет выслан электронный вариант статьи.

Для заказа скопируйте doi статьи:

10.14489/vkit.2018.12.pp.003-010

и заполните форму

Отправляя форму вы даете согласие на обработку персональных данных.

Eng

This article is available in electronic format (PDF).

The cost of a single article is 350 rubles. (including VAT 18%). After you place an order within a few days, you will receive following documents to your specified e-mail: account on payment and receipt to pay in the bank.

After depositing your payment on our bank account we send you file of the article by e-mail.

To order articles please copy the article doi:

10.14489/vkit.2018.12.pp.003-010

and fill out the form