Analytical Comparison of Stop Word Recognition Methods in Persian Texts

Document Type : Original Article

Authors

1 Department of Computer Engineering and IT, Jahrom University, Jahrom, Iran

2 Islamic World Science and Technology Monitoring and Citation Institute (ISC), Shiraz, Iran

Abstract
 
Stop words are primarily non-significant words used to connect other words in sentence construction. Since these words do not carry specific information about the text, they are typically removed during text processing. Identifying stop words is therefore an essential operation in text processing. A challenge arises because words that are usually insignificant can become significant depending on the context, while words that are typically important can sometimes be classified as stop words. This problem is particularly pronounced in Persian due to the complexities inherent in the language. Recognizing the importance of identifying stop words in Persian, we analyzed and reviewed various approaches, including a dictionary-based approach, a POS tagging-based approach, a Word2Vec-based approach, and a FastText-based approach, to identify stop words using a corpus of 50,000 Persian sentences from the Hamshahri dataset. Our findings indicate that the FastText-based approach outperformed the others with a detection accuracy of 96.98%, suggesting that this method can lead to the development of an automatic, reliable, and efficient system.
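As an illustration of the simplest of the approaches named above, the dictionary-based one, the following minimal Python sketch filters tokens against a fixed list. It is not the authors' implementation: the eight-word list, the whitespace tokenization, and the names `PERSIAN_STOP_WORDS` and `remove_stop_words` are assumptions made only for this example.

```python
# Hypothetical, hand-picked list of common Persian function words.
# A real dictionary-based system would rely on a much larger curated list.
PERSIAN_STOP_WORDS = frozenset({"و", "در", "به", "از", "که", "این", "را", "با"})


def remove_stop_words(sentence, stop_words=PERSIAN_STOP_WORDS):
    """Split on whitespace and drop every token found in the stop word list."""
    return [token for token in sentence.split() if token not in stop_words]


# Example sentence: "I got this book from the library"
example = "این کتاب را از کتابخانه گرفتم"
content_words = remove_stop_words(example)
print(content_words)
```

Because the list is static, a lookup of this kind cannot tell when a normally insignificant word becomes significant in context, which is the limitation that motivates comparing it with the tagging-based and embedding-based approaches.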
 



  • Receive Date 03 January 2024
  • Revise Date 01 January 2025
  • Accept Date 01 January 2025