A Hybrid Malware Detection Framework Utilizing Natural Language Processing and Surface Analysis Features

Computing and Algorithm Insight

Computing and Algorithm Insight is a peer-reviewed journal publishing research in artificial intelligence, soft...

Publishing Model

Open Access
This journal is published by Integra Academic Press.

Abstract

The rising frequency of malware attacks calls for robust detection models, most of which apply machine learning to features derived from surface analysis. Prior research on surface analysis has established image-based methods that use ensemble learning, but natural language processing (NLP) methods that effectively integrate multiple features are still lacking. Existing NLP-based detection schemes typically use a single feature, because merging hybrid features into one data point disrupts the integrity of the word sequence and thereby reduces detection accuracy. To address this gap, this paper introduces a hybrid model that identifies malware using three distinct features obtained through surface analysis. The study validates the application of NLP techniques to hybrid features, overcoming the earlier limitation on sequential data. Empirical results demonstrate the superior performance of this combined approach, which achieves an F-measure of 0.927.

Keywords: Malware Detection; Natural Language Processing (NLP); Hybrid Features; Surface Analysis; Machine Learning
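
The sequence-integrity problem mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the three features, so the feature names and tokens below are hypothetical. The sketch only shows that concatenating several token sequences into one data point creates adjacent word pairs that exist in none of the original sequences, whereas handling each feature separately does not.

```python
from typing import Dict, List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> Set[Bigram]:
    """Return the set of adjacent token pairs in a sequence."""
    return set(zip(tokens, tokens[1:]))


# Hypothetical token sequences for one sample. The feature names and the
# tokens are invented for illustration; the abstract does not specify them.
sample: Dict[str, List[str]] = {
    "strings": ["kernel32.dll", "CreateFileA", "http://"],
    "imports": ["LoadLibraryA", "GetProcAddress"],
    "sections": [".text", ".rsrc"],
}

# Option A: merge every feature into a single sequence (one data point).
merged = [token for sequence in sample.values() for token in sequence]
merged_pairs = bigrams(merged)

# Option B: keep each feature as its own sequence.
per_feature_pairs: Set[Bigram] = set().union(
    *(bigrams(sequence) for sequence in sample.values())
)

# Pairs that appear only because of concatenation: they straddle the
# boundary between two unrelated features.
spurious = merged_pairs - per_feature_pairs
print(sorted(spurious))
```

Under these assumptions, the only pairs unique to the merged sequence are the ones that cross a feature boundary, which is the kind of artificial context a sequence model would otherwise learn from.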

