T-Comm_Article 5_6_2021

SELECTION OF METRIC AND CATEGORICAL ATTRIBUTES OF RARE ANOMALOUS EVENTS IN A COMPUTER SYSTEM USING DATA MINING METHODS

Oleg I. Sheluhin, Moscow Technical University of Communication and Informatics, Moscow, Russia, sheluhin@mail.ru

Dmitry I. Rakovsky, Moscow Technical University of Communication and Informatics, Moscow, Russia, dimitor1998@mail.ru

Abstract
This paper considers the labeling of multi-attribute experimental data for subsequent use with data mining methods in the detection and classification of rare anomalous events in computer systems (CS). Labeling is carried out using three methods: manual preprocessing, statistical analysis, and cluster analysis. Among attributes of the metric type, the authors identify two macrogroups: «integral attributes» and «impulse attributes». Combining statistical and cluster analysis is shown to increase the accuracy of detecting anomalous events in the CS and to allow attributes to be selected by their information significance. The expediency of manually preprocessing the data before clustering is demonstrated by dividing attributes into macrogroups, analyzing the distribution density with violin plots, and removing the trend component using the difference-stationary series (DS-series) method. Violin plots built for an attribute of the «integral» macrogroup show the distribution of CS states. Removing the trend component by the DS-series method, followed by normalization and reduction to absolute values, allows anomalous outliers to be labeled more accurately, although this approach is not always acceptable. Interpretation of the clustering results, obtained separately for each normalized attribute, shows that the normal values of all attributes are concentrated around zero. The result of labeling is attribute-labeled data in which each attribute at each point in time is assigned one of two states: abnormal or normal.

Keywords: attribute value markup, correlation analysis, k-means, experimental data, trend, violin plot, difference stationary series.
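
The labeling steps named in the abstract (trend removal by differencing, normalization, reduction to absolute values, two-cluster grouping per attribute) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `label_attribute`, the choice of z-score normalization, the hand-written one-dimensional two-cluster k-means (used here instead of a library call), and the convention of labeling the first sample as normal are all assumptions made for illustration.

```python
import math


def label_attribute(series, n_iter=100):
    """Label each sample of one metric attribute as 'normal' or 'abnormal'.

    Hypothetical helper: first difference -> z-score -> absolute value ->
    1-D k-means with two clusters (the cluster nearer zero is 'normal').
    """
    n = len(series)
    # Step 1: first differences remove a trend (difference-stationary idea).
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 2:
        return ["normal"] * n

    # Step 2: z-score normalization (an assumed choice of normalization).
    mean = sum(diffs) / len(diffs)
    std = math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
    if std == 0.0:
        return ["normal"] * n

    # Step 3: absolute values, so ordinary samples sit near zero.
    scores = [abs((d - mean) / std) for d in diffs]

    # Step 4: 1-D k-means, k = 2, centers started at the extremes.
    c_low, c_high = min(scores), max(scores)
    if c_low == c_high:
        return ["normal"] * n
    for _ in range(n_iter):
        low = [s for s in scores if abs(s - c_low) <= abs(s - c_high)]
        high = [s for s in scores if abs(s - c_low) > abs(s - c_high)]
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if new_low == c_low and new_high == c_high:
            break
        c_low, c_high = new_low, new_high

    # The first sample has no difference; label it normal by convention.
    labels = ["normal"]
    for s in scores:
        far_from_zero = abs(s - c_low) > abs(s - c_high)
        labels.append("abnormal" if far_from_zero else "normal")
    return labels


# Synthetic attribute: a linear trend with one injected spike at index 10.
series = [float(i) for i in range(20)]
series[10] += 15.0
labels = label_attribute(series)
```

In this synthetic example both the jump into the spike and the return from it produce large differences, so two consecutive samples are flagged. This is a side effect of differencing and may be one reason such preprocessing is not acceptable for every attribute.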

Information about authors:

Oleg I. Sheluhin, Moscow Technical University of Communication and Informatics, Professor at the Department of Information Security, Moscow, Russia
Dmitry I. Rakovsky, Moscow Technical University of Communication and Informatics, Moscow, Russia