Document Type: Articles


Department of Computer Engineering, Alzahra University, Tehran, Iran


Researchers in Persian natural language processing (NLP) have limited access to linguistic tools suitable for text processing, and research on Persian text processing is correspondingly limited. Because datasets are an essential requirement for running and evaluating experiments, we set out to create corpora suited to information retrieval and natural language processing in Persian. The corpora presented in this article are based on the Hamshahri dataset, which, because it is untagged, is suitable only for simple information retrieval and simple natural language processing tasks. We converted this dataset into a tagged collection and improved its text quality. The new corpora minimize the need for text preprocessing. We used the STeP-1 tools for text processing and propose some ideas for fixing bugs in these tools to improve their quality. Finally, we used the new corpora for text retrieval, and the results showed improved performance.

  1. Aleahmad, A., Amiri, H., Rahgozar, M., Oroumchian, F. (2009). Hamshahri: A standard Persian Text Collection. Knowledge-Based Systems, 22(5), 382-387.
  2. Amtrup, J.W., Mansouri Rad, H., Megerdoomian, K., Zajac, R. (2000). Persian-English Machine Translation: An Overview of the Shiraz Project. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-319).
  3. Berenjian, S.H. (2013). Persian Simple (Past, Present & Future) Verb Stemmer, Shiraz: Takhte Jamshid. Available At: (Persian)
  4. Berenjian, S.H. (2013). The Stemmer of Past and Present from the Infinitive Non Transient Verbs in Persian Language, Shiraz: Navid. Available At:
  5. Berenjkoob, M., Mehri, R., Khosravi, H., Nematbakhsh, M.A. (2009). A Method for Stemming and Eliminating Common Words for Persian Text Summarization, Natural Language Processing and Knowledge Engineering, NLP-KE International Conference on, 1-6.
  6. Darrudi, E., Baradaran Hashemi, H., AleAhmad, A., Zare Bidoki, A.M., Habibian, A.H., Mahdikhani, F., et al. (2008). dorIR collection for Persian web retrieval. Technical Report No. DBRG-TR-02.
  7. Eslami, M., Sharifi, M., Alizadeh, S., Zandi, T. (2004). Persian ZAYA Lexicon. 1st Workshop on Persian Language and Computer, Tehran, Iran.
  8. Estahbanati, S., Javidan, R. (2011). A New Stemmer for Farsi Language. Computer Science and Software Engineering (CSSE), CSI International Symposium on, 3(1), 25-29.
  9. Karimpour, R., Ghorbani, A., Pishdad, A., Mohtarami, M., Aleahmad, A., Amiri, H., Oroumchian, F. (2009). Improving Persian information retrieval system using stemming and part of speech tagging. 9th workshop of the cross-language evaluation forum.
  10. Mehrad, J., Berenjian, S.R. (2011). Provided a Persian language singular stemmer system, International Journal of Information Science and Management, 9(2).
  11. Miangah, T.M. (2009). Constructing a Large-Scale English-Persian Parallel Corpus. Meta: Translators' Journal, 54(1), 181-188.
  12. Oroumchian, F., Tasharofi, S., Amiri, H., Hojjat, H., Raja, F. (2006). Creating a Feasible Corpus for Persian POS Tagging. UOWD Technical Reports Series, Number TR 3.
  13. Oroumchian, F., Darrudi, E., Hejazi, M.R. (2004). Assessment of a modern Persian corpus. Proceedings of The 2nd Workshop on Information Technology & its Disciplines (WITID), ITRC, Iran.
  14. Rashidi, A., Zolfy Lighvan, M. (2014). HPS: A Hierarchical Persian Stemming Method. International Journal on Natural Language Computing (IJNLC), 3(1).
  15. Rezvan, Y., Ghandchi, M., Rezvan, F. (2009). Suggesting Correct Words Algorithms Developing in FarsiTeX. Proceedings of the European Computing Conference.
  16. Shamsfard, M., Jafari, H.S., Ilbeygi, M. (2010). STeP-1: A Set of Fundamental Tools for Persian Text Processing. LREC 2010 - 8th Language Resources and Evaluation Conference, 19-21 May, Malta.
  17. Sheykholeslam, M.H., Minaei-Bidgoli, B., Juzi, H. (2012). A Framework for Spelling Correction in Persian Language Using Noisy Channel Model. In Proceedings of Language Resources and Evaluation Conference (LREC).
  18. Sheykh Esmaili, K., Abolhassani, H., Neshati, M., Behrangi, E., Rostami, A., Mohammadi, M. (2007). Mahak: A Test Collection for Evaluation of Farsi Information Retrieval Systems. IEEE/ACS International Conference on Computer Systems and Applications.
  19. Taghiyareh, F., Darrudi, E., Oroumchian, F., Angoshtari, N. (2003). Compression of Persian Text for Web-Based Applications, Without Explicit Decompression. WSEAS Transactions on Computers, 2(4), 961-966.
  20. Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval 11, 427-445.
  21. Tashakori, M., Meybodi, M.R., Oroumchian, F. (2002). Bon: The Persian Stemmer. Information and Communication Technology - EurAsia-ICT, 487-494.
  22. Yang, C.C., Li, K.W. (2004). Building parallel corpora by automatic title alignment using length-based and text-based approaches. Information Processing & Management 40(6), 939-955.