T-Comm_Article 4_9_2021

ANALYSIS OF THE INFLUENCE OF MACHINE LEARNING ALGORITHM PARAMETERS ON THE RESULTS OF TRAFFIC CLASSIFICATION IN REAL TIME

Irina A. Krasnova, MTUCI, Moscow, Russia, irina_krasnova-angel@mail.ru

Abstract
The paper analyzes how the parameter settings of Machine Learning algorithms affect the results of real-time traffic classification. Two algorithms are considered: Random Forest and XGBoost. Both methods are briefly described, along with the metrics used to evaluate classification results. Experiments are carried out on a data set collected on a real network, separately for TCP and UDP flows. So that the results can be applied in real time, a special feature matrix is built from the first 15 packets of each flow. The Random Forest (RF) parameters tuned are the number of trees, the split criterion, the maximum number of features considered when constructing a split, the tree depth, and the minimum number of samples in a node and in a leaf. For XGBoost, the parameters tuned are the number of trees, the tree depth, the minimum number of samples in a leaf, and the percentages of features and of samples used to build each tree. Increasing the number of trees raises accuracy up to a certain value, but, as the article shows, it is important to make sure that the model is not overfitted. The remaining tree parameters are used to combat overfitting. On the data set studied, eliminating overfitting increased classification accuracy for individual applications by 11-12% for Random Forest and by 12-19% for XGBoost. The results show that parameter tuning is a very important step in building a traffic classification model, because it helps combat overfitting and significantly increases the accuracy of the algorithm's predictions. In addition, with properly configured parameters, XGBoost, which is rarely used in traffic classification studies, becomes a competitive algorithm and outperforms the widespread Random Forest.

Keywords: Machine Learning, Random Forest, XGBoost, QoS, traffic classification, parameter tuning, overtraining.
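
As an illustration of the tuning described in the abstract, the sketch below lists the parameters as search grids and adds a simple check for overfitting. It is not the author's code. The keys are the scikit-learn and XGBoost parameter names that appear to correspond to the parameters named in the abstract. The candidate values, the helper names (`expand_grid`, `overfit_gap`, `looks_overfitted`) and the 0.05 tolerance are assumptions made for the example.

```python
from itertools import product

# Hypothetical search grids. The keys are scikit-learn / XGBoost parameter
# names that seem to match the abstract; the values are illustrative and
# are NOT the ones used in the paper.
RF_GRID = {
    "n_estimators": [50, 100, 200],       # number of trees
    "criterion": ["gini", "entropy"],     # split criterion
    "max_features": ["sqrt", "log2"],     # features considered per split
    "max_depth": [10, 20, None],          # tree depth (None = unlimited)
    "min_samples_split": [2, 10],         # minimum samples in a node
    "min_samples_leaf": [1, 5],           # minimum samples in a leaf
}

XGB_GRID = {
    "n_estimators": [50, 100, 200],       # number of trees
    "max_depth": [4, 6, 8],               # tree depth
    "min_child_weight": [1, 5],           # closest XGBoost analogue of a leaf-size limit
    "colsample_bytree": [0.5, 0.8, 1.0],  # fraction of features per tree
    "subsample": [0.5, 0.8, 1.0],         # fraction of samples per tree
}


def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def overfit_gap(train_accuracy, validation_accuracy):
    """Return how much better the model scores on its own training data."""
    return train_accuracy - validation_accuracy


def looks_overfitted(train_accuracy, validation_accuracy, tolerance=0.05):
    """Flag a model whose train/validation gap exceeds the chosen tolerance."""
    return overfit_gap(train_accuracy, validation_accuracy) > tolerance
```

Each dict produced by `expand_grid` can be passed as keyword arguments to a classifier constructor, for example `RandomForestClassifier(**params)` in scikit-learn or `XGBClassifier(**params)` in XGBoost. A large gap between training and validation accuracy suggests that the tree-limiting parameters should be tightened, rather than more trees added.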

Information about author:

 Irina A. Krasnova, graduate student, MTUCI, Moscow, Russia