DOI: 10.14489/vkit.2019.12.pp.010-017

Fedorenko Yu. S.
STATISTICAL TESTING TECHNIQUE FOR COMPARING THE PERFORMANCE OF MACHINE LEARNING MODELS
(pp. 10-17)

Abstract. This paper considers statistical testing techniques for comparing the metric values of machine learning models on a test set. Because metric values depend not only on the models but also on the data, different models may turn out to be best on different test sets, so the traditional approach of comparing metric values on a single test set is often insufficient. A statistical comparison of results obtained by cross-validation is sometimes used instead, but the independence of the resulting measurements cannot then be guaranteed, which rules out Student's t-test. Tests that do not require independent measurements exist, but they have lower power. For additive metrics, the paper proposes splitting the test set into N parts and computing the metric value on each part. Since the value on each part is a sum of independent random variables, by the central limit theorem the values obtained on the N parts are realizations of an approximately normally distributed random variable. To estimate the required sample size, normality tests and quantile-quantile plots are used. A modification of Student's t-test is then applied to compare the mean metric values. A simplified approach is also considered, in which a confidence interval is built for the base model; a model whose metric value falls outside this interval is regarded as performing differently from the base model. This approach requires less computation, but an experimental analysis of the binary cross-entropy metric for CTR (click-through rate) prediction models showed that it is coarser than the first.

Keywords: Machine learning; Metrics; Binary cross-entropy; Test set; Statistical testing; Normality test; Student’s t-test; Confidence intervals.
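
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that the "modification of Student's t-test" is Welch's unequal-variance t-test, which the abstract does not confirm, and it reads the simplified approach as a normal-approximation confidence interval for the base model's mean metric. The data and both "models" are synthetic and hypothetical, and the names `bce`, `part_means`, and `welch` are introduced here for illustration only.

```python
import math
import random
import statistics


def bce(y, p, eps=1e-12):
    """Binary cross-entropy of a single prediction p for a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def part_means(labels, preds, n_parts):
    """Mean BCE on each of n_parts consecutive, equal-sized chunks of the test set."""
    losses = [bce(y, p) for y, p in zip(labels, preds)]
    size = len(losses) // n_parts  # leftover examples at the end are ignored
    return [statistics.fmean(losses[i * size:(i + 1) * size])
            for i in range(n_parts)]


def welch(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Synthetic, hypothetical click data: each example has its own click probability.
random.seed(0)
n_examples, n_parts = 20000, 20
true_p = [random.uniform(0.05, 0.5) for _ in range(n_examples)]
labels = [1 if random.random() < q else 0 for q in true_p]

base_preds = true_p                         # hypothetical well-calibrated model
other_preds = [q * 1.5 for q in true_p]     # hypothetical over-predicting model

base = part_means(labels, base_preds, n_parts)
other = part_means(labels, other_preds, n_parts)

# Full comparison: test statistic for the difference of mean per-part metrics.
t, df = welch(base, other)

# Simplified comparison (assumed reading): 95% interval for the base model's
# mean metric, using a normal quantile instead of a t quantile.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * statistics.stdev(base) / math.sqrt(len(base))
ci_low = statistics.fmean(base) - half_width
ci_high = statistics.fmean(base) + half_width
differs = not (ci_low <= statistics.fmean(other) <= ci_high)

print(f"t = {t:.2f}, df = {df:.1f}, other model outside base interval: {differs}")
```

The sketch uses only the standard library, so it stops at the test statistic and degrees of freedom. In practice the p-value and the normality checks would come from a statistics package, for example `scipy.stats.ttest_ind(base, other, equal_var=False)`, `scipy.stats.shapiro`, and `scipy.stats.probplot`. The per-part values would be checked for normality before trusting the t-test, and N would be adjusted if the check fails.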

About the author: Yu. S. Fedorenko (Bauman Moscow State Technical University (National Research University), Moscow, Russia)
