Analytical Comparison of Stop Word Recognition Methods in Persian Texts

Samie, Mohammad Ebrahim; Bahmani, Erta; Mozafari, Niloofar

doi:10.22034/ijism.2025.2017335.1322


Document Type : Original Article

Authors

Mohammad Ebrahim Samie ¹

Erta Bahmani ¹

Niloofar Mozafari ²

¹ Department of Computer Engineering and IT, Jahrom University, Jahrom, Iran

² Islamic World Science and Technology Monitoring and Citation Institute (ISC), Shiraz, Iran

https://doi.org/10.22034/ijism.2025.2017335.1322

Abstract

Stop words are mostly function words that connect other words in a sentence. Because they carry little information about the content of a text, they are typically removed during text processing, which makes identifying them an essential preprocessing step. The difficulty is that a word that is usually insignificant can become significant in some contexts, while a word that is usually important can sometimes behave as a stop word. This problem is particularly pronounced in Persian because of the complexities inherent in the language. We therefore analyzed and compared four approaches to stop word identification (dictionary-based, POS tagging-based, Word2Vec-based, and FastText-based) on a corpus of 50,000 Persian sentences from the Hamshahri dataset. The FastText-based approach outperformed the others, with a detection accuracy of 96.98%, suggesting that this method can support the development of an automatic, reliable, and efficient system.
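To make the simplest of the compared approaches concrete, the following is a minimal Python sketch of dictionary-based stop word recognition together with the kind of accuracy computation used to score a detector. It is an illustration only, not the authors' implementation: the eight-word stop list, the example sentence, the gold labels, and the function names are all assumptions chosen for this sketch.

```python
from typing import AbstractSet, List, Sequence

# Hypothetical mini stop list (assumption). A real dictionary-based system
# would load a curated Persian stop word list instead.
STOP_WORDS = frozenset({"و", "در", "به", "از", "که", "را", "با", "این"})


def is_stop_word(token: str, stop_words: AbstractSet[str] = STOP_WORDS) -> bool:
    """Dictionary lookup: a token is a stop word iff it appears in the list."""
    return token in stop_words


def remove_stop_words(tokens: Sequence[str]) -> List[str]:
    """Keep only the tokens that the dictionary does not flag."""
    return [t for t in tokens if not is_stop_word(t)]


def accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy sentence: "I went to school and read the book." (whitespace tokenized)
tokens = "من به مدرسه رفتم و کتاب را خواندم".split()

# Hand-made gold labels for this toy sentence (assumption): the pronoun
# "من" is treated as a stop word, but it is missing from the mini list.
gold = [True, True, False, False, True, False, True, False]

predicted = [is_stop_word(t) for t in tokens]
content_tokens = remove_stop_words(tokens)
score = accuracy(predicted, gold)
print(content_tokens)
print(score)
```

In this toy run the dictionary misses one of the eight tokens because the word is absent from the list. A fixed list cannot adapt to vocabulary or context, and that limitation is what motivates the POS tagging-based and embedding-based approaches compared in the article.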

Keywords

Stop Words

Content Words

Persian Language Processing

POS Tagging

Word2Vec

FastText

20.1001.1.20088302.2025.23.1.7.1

Subjects

natural language processing (NLP)
