Data Augmentation For Sorani Kurdish News Headline Classification Using Back-Translation And Deep Learning Model

https://doi.org/10.24017/science/2023.1.4

Authors

  • Soran Badawi, Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq

Abstract

With the increase in the volume of news articles and headlines being generated, it is becoming more difficult for individuals to keep up with the latest developments and find relevant news articles in the Kurdish language. To address this issue, this paper proposes a novel data augmentation approach for improving the performance of Kurdish news headline classification using back-translation and a deep learning Bidirectional Long Short-Term Memory (BiLSTM) model. The approach generates synthetic training data by translating Kurdish headlines into a target language, in this case English, and back-translating them into Kurdish, resulting in an augmented dataset. The proposed BiLSTM model is trained on the augmented data and compared with two baseline models, Support Vector Machines (SVM) and Naïve Bayes, trained on the original data. The experimental results demonstrate that the proposed BiLSTM model outperforms the baseline models and other existing models, achieving state-of-the-art performance on the Kurdish news headline classification task. The findings suggest that combining back-translation with the proposed BiLSTM model is a promising approach for data augmentation in low-resource languages, contributing to the advancement of natural language processing in under-resourced languages. Moreover, a Kurdish news headline classification model can improve access to news and information for Kurdish speakers: with it, they can quickly search for news articles that interest them based on their preferred categories, such as politics, sports, or entertainment.
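
The back-translation step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `back_translate_augment`, the two translator callables, and the toy lookup tables are all hypothetical stand-ins for a real Kurdish-to-English and English-to-Kurdish translation system, and skipping round trips that reproduce an existing headline is an assumption of this sketch rather than something the abstract states.

```python
from typing import Callable, List, Tuple

# A translator is any function mapping a sentence to a sentence.
# In practice this would wrap a machine translation model (assumption).
Translator = Callable[[str], str]


def back_translate_augment(
    examples: List[Tuple[str, str]],
    to_pivot: Translator,
    from_pivot: Translator,
) -> List[Tuple[str, str]]:
    """Return the original (headline, label) pairs plus back-translated copies.

    Each headline is translated into the pivot language and back again.
    The paraphrase keeps the original label. Paraphrases that are empty
    or identical to an already-seen headline are skipped (sketch assumption).
    """
    augmented = list(examples)
    seen = {text for text, _ in examples}
    for text, label in examples:
        paraphrase = from_pivot(to_pivot(text)).strip()
        if paraphrase and paraphrase not in seen:
            seen.add(paraphrase)
            augmented.append((paraphrase, label))
    return augmented


# Toy stand-ins for real translation systems (hypothetical data).
_TO_PIVOT = {"headline a": "pivot a", "headline b": "pivot b"}
_FROM_PIVOT = {"pivot a": "headline a (reworded)", "pivot b": "headline b"}


def toy_to_pivot(text: str) -> str:
    return _TO_PIVOT.get(text, text)


def toy_from_pivot(text: str) -> str:
    return _FROM_PIVOT.get(text, text)


data = [("headline a", "sports"), ("headline b", "politics")]
augmented = back_translate_augment(data, toy_to_pivot, toy_from_pivot)
for text, label in augmented:
    print(label, "|", text)
```

In this toy run the first headline comes back reworded and is added with its original label, while the second round-trips unchanged and is skipped. The augmented list would then serve as training data for the classifier.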

Keywords:

Data Augmentation, Deep Learning, Text Classification, Machine Learning, Kurdish Language

How to Cite

[1]
S. Badawi, “Data Augmentation For Sorani Kurdish News Headline Classification Using Back-Translation And Deep Learning Model”, KJAR, vol. 8, no. 1, pp. 27–34, Jun. 2023, doi: 10.24017/science/2023.1.4.

Published

30-06-2023

Section

Pure and Applied Science