


ANALYSIS OF EXISTING METHODS TO REDUCE THE DIMENSIONALITY OF INPUT DATA

Sergey D. Erokhin, Moscow Technical University of Communications and Informatics, Moscow, Russia, esd@mtuci.ru
Boris B. Borisenko, Moscow Technical University of Communications and Informatics, Moscow, Russia, fepem@yandex.ru
Ivan D. Martishin, Moscow Technical University of Communications and Informatics, Moscow, Russia, martishinid@gmail.com
Alexander S. Fadeev, Moscow Technical University of Communications and Informatics, Moscow, Russia, aleksandr-sml@mail.ru

Abstract
The explosive growth of data arrays, both in the number of records and in the number of attributes, has triggered the development of a number of platforms for handling big data (Amazon Web Services, Google, IBM, Infoworks, Oracle, etc.), as well as parallel algorithms for data analysis (classification, clustering, association rules). This, in turn, has prompted the use of dimensionality reduction techniques. Feature selection, as a data preprocessing strategy, has proven effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning tasks. Dimensionality reduction not only speeds up algorithm execution, but can also improve the final classification/clustering accuracy. Noisy or erroneous input data often leads to poorer algorithm performance, and removing uninformative or low-informative data columns can help an algorithm find more general regions and classification rules and achieve better performance overall. This article reviews commonly used dimensionality reduction methods and their classification. Data transformation consists of two steps: feature generation and feature selection. A distinction is made between scalar feature selection and vector methods (wrapper, filter, embedded and hybrid methods). Each method has its own advantages and disadvantages, which are outlined in the article. The article then describes the application of one of the most effective dimensionality reduction methods, correspondence analysis, to the CSE-CIC-IDS2018 dataset and evaluates its effectiveness for reducing the dimensionality of this dataset in the detection of computer attacks.
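To illustrate the method named in the abstract, below is a minimal sketch of classical correspondence analysis via SVD of the standardized residual matrix, used here as a dimensionality reduction step. The random non-negative matrix stands in for a (suitably scaled) feature table such as CSE-CIC-IDS2018, and the n_components value and the column-ranking heuristic are illustrative assumptions, not the article's exact procedure.

```python
import numpy as np

def correspondence_analysis(N, n_components=2):
    """Classical correspondence analysis of a non-negative data matrix N.

    Returns row and column principal coordinates for the first
    n_components dimensions and the share of inertia they explain.
    Row and column margins are assumed to be strictly positive.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    F = U[:, :k] * s[:k] / np.sqrt(r)[:, None]    # row principal coordinates
    G = Vt[:k].T * s[:k] / np.sqrt(c)[:, None]    # column principal coordinates
    explained_inertia = s[:k] ** 2 / (s ** 2).sum()
    return F, G, explained_inertia

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a non-negative flow-feature table; in the article the
    # input would be columns of the CSE-CIC-IDS2018 dataset (not loaded here).
    X = rng.integers(0, 100, size=(500, 20)).astype(float)
    F, G, inertia = correspondence_analysis(X, n_components=3)
    # Rough importance proxy: squared column coordinates over retained dims.
    col_scores = (G ** 2).sum(axis=1)
    top_features = np.argsort(col_scores)[::-1][:5]
    print("explained inertia of first 3 dims:", inertia)
    print("top-5 feature columns by CA score:", top_features)
```

The row coordinates F give a low-dimensional representation of the records, while the column coordinates G can be used, as in the sketch above, to rank original attributes and keep only the most informative ones before training a classifier.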

Keywords: intrusion detection systems (IDS); dataset; feature generation; feature selection methods; correspondence analysis; computer attacks (CA).


Information about authors:

Sergey D. Erokhin, PhD (technical sciences), associate professor, rector, Moscow Technical University of Communications and Informatics, Moscow, Russia
Boris B. Borisenko, PhD (technical sciences), associate professor, lead researcher, Moscow Technical University of Communications and Informatics, Moscow, Russia
Ivan D. Martishin, researcher, Moscow Technical University of Communications and Informatics, Moscow, Russia
Alexander S. Fadeev, researcher, Moscow Technical University of Communications and Informatics, Moscow, Russia