DOI: 10.14489/vkit.2016.05.pp.029-033

Melnikov S. Yu., Peresypkin V. A.
LANGUAGE PROBABILISTIC MODELS FOR DETECTION OF SPELLING ERRORS IN GARBLED TEXT
(pp. 29-33)

Abstract. Spelling error correction for alphabetic languages is relevant in several domains. Garbled characters in a text can produce two types of word-level errors: non-dictionary errors (the distorted word is not in the language's dictionary) and vocabulary errors (the distorted word is itself a dictionary word). Both types occur in practice, but as recognition systems increasingly operate at the level of words and phrases, vocabulary errors cause the most trouble. We describe a system that corrects garbled text using a probabilistic language model. Two types of random garbling are considered: character-level (each alphabet character is replaced with a different character with probability p) and word-level (a word is replaced with a random word from its Levenshtein-distance neighborhood, so that the character-level distortion rate is close to p). We present experimental results on the system's error detection quality as a function of the type and level of distortion, for English and French texts. The results show that the trigram model used is sensitive to garbling of both types. The specific sensitivity thresholds depend on the alphabet size and on the morphological features of the language. Word-level garbling at Levenshtein distance two is detected by the model better than garbling at Levenshtein distance one.

Keywords: Automatic correction; Language model; Text correction; Garbled text; Levenshtein distance.
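
The two garbling types described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the per-word replacement probability q, and the use of an explicit vocabulary list are assumptions made for illustration; the abstract does not say how the word-level procedure is calibrated so that the character-level distortion rate comes out close to p.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def garble_chars(text: str, p: float, alphabet: str, rng: random.Random) -> str:
    """Character-level garbling: each alphabet character is replaced
    by a different alphabet character with probability p."""
    out = []
    for ch in text:
        if ch in alphabet and rng.random() < p:
            out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)  # non-alphabet characters (e.g. spaces) are kept
    return "".join(out)


def garble_words(words, q, distance, vocabulary, rng):
    """Word-level garbling: with probability q (an assumed parameter), a word
    is replaced by a random vocabulary word at exactly `distance` edits.
    Words with no such neighbour are left unchanged."""
    out = []
    for w in words:
        neighbours = [v for v in vocabulary if levenshtein(w, v) == distance]
        if neighbours and rng.random() < q:
            out.append(rng.choice(neighbours))
        else:
            out.append(w)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    vocabulary = ["cat", "cot", "cart", "dog", "dot", "sat"]  # toy dictionary
    print(garble_chars("the cat sat", 0.2, alphabet, rng))
    print(garble_words(["the", "cat", "sat"], 0.5, 1, vocabulary, rng))
```

In this sketch a word-level replacement always yields a dictionary word, so it produces only vocabulary errors, whereas character-level garbling can yield either error type. Scanning the whole vocabulary for each word is quadratic and only suitable for small dictionaries.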

About the Authors

S. Yu. Melnikov (“Language and Information Technologies” Ltd., Moscow)
V. A. Peresypkin (Scientific and Technical Center “Orion”, Moscow)

