
DOI: 10.14489/vkit.2026.01.pp.036-043


Dmitriev A. A.
DISTRIBUTED LEARNING OF DEEP NETWORKS WITH EVENT-TRIGGERED COMMUNICATION
(pp. 36-43)

Abstract. The article investigates methods for implementing distributed machine learning algorithms in a cluster environment. Three key approaches to parallelizing computations across a cluster during distributed deep-network training are examined. The article addresses challenges inherent to distributed learning within the data-parallelism framework and emphasizes strategies for efficient resource utilization. In particular, it is demonstrated why simple horizontal scaling in such tasks does not guarantee performance gains beyond a certain point. The study analyzes the Distributed Deep Learning with Event-Triggered Communication (DDLETC) approach within this framework. In this method, cluster nodes operate as semi-autonomous agents that exchange messages critical to model training. Common algorithms for coordinating agents in networks, as applied to the DDLETC task, are discussed. The research proposes several strategies for organizing message exchange so that computational resources are used effectively. Various agent-network topologies are explored, in which parameters, data, and service messages are transmitted either directly between nodes or through designated intermediaries. It is shown how the size and number of transmitted messages affect the efficiency of distributed learning. A mathematical model for coordinating agents is introduced to improve cluster resource efficiency, providing a structured approach to optimizing distributed learning. A heuristic algorithm is also proposed to identify an effective combination of the agents' computational resources that achieves the collective effect of distributed model training.

Keywords: deep learning; distributed computing; cluster; data parallelism; event triggers; agents.

A. A. Dmitriev (Saint Petersburg State University of Aerospace Instrumentation, Saint Petersburg, Russia)
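To make the event-triggered communication pattern described in the abstract concrete, the sketch below simulates, in plain Python/NumPy, a few data-parallel agents that train locally and broadcast their parameters to neighbours only when the parameters have drifted past a threshold since the last broadcast. This is a hypothetical, single-process illustration of the general idea only: the drift threshold, ring topology, mixing rule, and all names (Agent, maybe_broadcast, mix) are assumptions made for illustration and do not reproduce the DDLETC algorithm analyzed in the article.

```python
import numpy as np

class Agent:
    """Toy agent: local SGD plus event-triggered parameter broadcasts (illustrative only)."""

    def __init__(self, dim, lr=0.05, threshold=0.1):
        self.w = np.zeros(dim)          # local model parameters
        self.last_sent = np.zeros(dim)  # parameters at the time of the last broadcast
        self.lr = lr                    # local learning rate
        self.threshold = threshold      # event-trigger threshold on parameter drift
        self.inbox = []                 # parameter vectors received from neighbours

    def local_step(self, grad_fn):
        """One local SGD step on this agent's own data shard."""
        self.w -= self.lr * grad_fn(self.w)

    def maybe_broadcast(self, neighbours):
        """Communicate only when parameters drifted enough since the last send."""
        if np.linalg.norm(self.w - self.last_sent) > self.threshold:
            for nb in neighbours:
                nb.inbox.append(self.w.copy())
            self.last_sent = self.w.copy()
            return 1  # an "event" fired
        return 0

    def mix(self, weight=0.5):
        """Blend received neighbour parameters into the local model."""
        if self.inbox:
            self.w = (1 - weight) * self.w + weight * np.mean(self.inbox, axis=0)
            self.inbox.clear()


# Toy run: three agents minimise shifted quadratics over a ring topology.
targets = [np.array([1.0, -2.0]), np.array([0.5, 0.0]), np.array([-1.0, 1.0])]
agents = [Agent(dim=2) for _ in targets]
events = 0
for step in range(200):
    for agent, t in zip(agents, targets):
        agent.local_step(lambda w, t=t: 2.0 * (w - t))  # gradient of ||w - t||^2
    for i, agent in enumerate(agents):
        events += agent.maybe_broadcast([agents[(i + 1) % len(agents)]])
    for agent in agents:
        agent.mix()
print("trigger events:", events, "of", 200 * len(agents), "possible broadcasts")
```

In a real cluster the broadcast would be an actual network send (for example via MPI or a parameter server), and the trigger rule and mixing weights would follow the analysis in the article; here they are only placeholders that show where communication is skipped.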

This article is available in electronic format (PDF).

The cost of a single article is 700 rubles (including 20% VAT). Within a few days of placing an order, an invoice and a receipt for payment at a bank will be sent to the e-mail address you specify.

Once the payment reaches the publisher's account, the electronic version of the article will be e-mailed to you.

To order the article, copy its DOI:

10.14489/vkit.2026.01.pp.036-043

and fill out the form. By submitting the form, you consent to the processing of your personal data.

 
