Intelligent Phishing Website Detection before and after Multiple Informative Feature Selection Techniques: Machine Learning Approach

Adane, Kibreab; Beyene, Berhanu; Abebe, Mohammed

doi:10.22034/ijism.2023.1977974.0

Document Type: Articles

Authors

Kibreab Adane ¹

Berhanu Beyene ²

Mohammed Abebe ³

¹ Ph.D. Student, Faculty of Computing & Software Engineering, Arba Minch University, Institute of Technology, Arba Minch, Ethiopia.

² Associate Prof., Ethiopian Cybersecurity Association, Addis Ababa, Ethiopia.

³ Assistant Prof., Faculty of Computing & Software Engineering, Arba Minch University, Institute of Technology, Arba Minch, Ethiopia.

Abstract

Individuals and organizations that rely on the Internet for communication, collaboration, and daily tasks regularly encounter security and privacy issues unless intelligent cybersecurity defense systems are in place to counter them. Existing evidence shows that phishing website attacks have increased drastically despite the scientific community's best efforts to combat them. Based on the key research gaps identified, the study addresses the following research questions: RQ#1: Which cross-validation techniques and model optimization parameters are appropriate for the given datasets and classifiers? RQ#2: Which classifier(s) yield superior Accuracy, F1-Score, AUC-ROC, and MCC values with acceptable train-test computational time before and after applying the Informative Feature Selection Techniques? RQ#3: What are the strengths and weaknesses of each classifier after multiple Informative Feature Selection Techniques are applied? RQ#4: Are the results of the top-performing classifier and Informative Feature Selection Technique on Dataset one (DS-1) consistent on Dataset two (DS-2)? The study used the Google Colab environment and Python code to conduct rigorous experiments. The experimental findings show that the Cat-Boost (CAT-B) classifier delivered superior phishing website detection performance in terms of Accuracy, F1-Score, AUC-ROC, and MCC, with acceptable train-test computational time, both before and after applying the Uni-Variate Feature Selection (UFS) technique, scoring an accuracy of 0.9764, an F1-Score of 0.9762, an AUC-ROC of 0.996, and an MCC of 0.9528 with a train-test computational time of 6 seconds. The study demonstrates a practical implementation of the CAT-B-UFS technique in Python so that future researchers can easily replicate the results and learn more. For future work, the study proposes implementing deep learning algorithms with proper feature selection techniques, in both individual and hybrid approaches, to obtain more promising results.
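For readers unfamiliar with the four evaluation metrics named in the abstract, the following minimal sketch shows how Accuracy, F1-Score, MCC, and AUC-ROC can be computed for a binary phishing/legitimate classifier. It is an illustration only: the function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made here, not the study's code or data, and in practice a library such as scikit-learn would normally be used instead.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall (tn is unused by design)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2) and is meant
    for illustration, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]  # predicted P(phishing)
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print("Accuracy:", accuracy(*counts))       # 0.75
print("F1-Score:", f1_score(*counts))       # 0.75
print("MCC:     ", mcc(*counts))            # 0.5
print("AUC-ROC: ", auc_roc(y_true, scores)) # 0.9375
```

In this toy example Accuracy and F1-Score are both 0.75 while MCC is only 0.5, because MCC uses all four confusion-matrix cells. AUC-ROC is computed from the raw scores rather than from thresholded predictions, so it does not depend on the chosen threshold.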

Keywords

Machine Learning

Feature Selection Technique

Cat-Boost Classifier

Phishing Website Detection

Uni-Variate Feature Selection

Information Network Security Agency (INSA)

20.1001.1.20088302.2024.22.1.3.0
