Hate Speech Detection from Transliterated Amharic Social Media Comments Using Machine Learning and Deep Learning Approaches

Authors

  • Zeleke Abebaw Addis Ababa Science and Technology University, Department of Software Engineering
  • Andreas Rauber Institute of Information Systems Engineering, Technical University of Vienna, Vienna, Austria
  • Solomon Atnafu Department of Computer Science, Addis Ababa University, Addis Ababa, Ethiopia

DOI:

https://doi.org/10.69660/jcsda.01022404

Keywords:

Hate speech detection, transliteration, Amharic words, Latin script, single channel CNN, Multichannel, SVM

Abstract

The growing use of transliterated script on social media poses significant challenges for hate speech detection, as transliterated text often evades models trained exclusively on formal-script datasets. Existing Amharic hate speech detection studies predominantly apply machine learning approaches to datasets written in formal Amharic script, leaving transliterated comments underexplored. This research addresses the gap by evaluating how auto-transliterated and manually transliterated datasets, merged with an existing Amharic hate speech dataset, affect the performance of machine learning and deep learning classifiers. The study used a total of 3,000 data instances, split 80:20 into training and test sets. The data consists of auto-transliterated text, manually transliterated text, formal Amharic script, and their combinations. The classifiers assessed were a Support Vector Machine (SVM) and single-channel and multichannel Convolutional Neural Networks (CNNs). Experimental results show that the multichannel CNN outperformed single-channel CNN models on the existing Amharic dataset, achieving an F1-score of 0.810 compared to 0.783 and 0.769 for single channel and multichannel CNN, respectively. However, combining transliterated datasets with the existing dataset did not improve classifier performance, likely due to inconsistencies in script transliteration and dataset domain dependencies. The study concludes that transliterated datasets should be treated separately for hate speech detection, and that combining datasets from different domains and transliteration techniques negatively impacts classifier performance.
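
As a rough illustration of the evaluation setup described in the abstract, the sketch below shows an 80:20 split of 3,000 items and an F1-score computation in plain Python. It is not the authors' code: the helper names `split_80_20` and `f1_score`, the fixed random seed, the use of integer indices as stand-ins for comments, and the choice of a binary (single positive class) F1 are all assumptions made for illustration. The abstract does not say whether the reported F1-scores are binary, macro-averaged, or weighted.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and split it 80:20 (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic: exactly 80% when divisible by 5
    return shuffled[:cut], shuffled[cut:]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed averaging; not stated in the source)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 3,000 placeholder indices stand in for the labelled comments.
train, test = split_80_20(range(3000))
print(len(train), len(test))
```

With 3,000 items this yields 2,400 training and 600 test instances. F1 is the harmonic mean of precision and recall, so it penalises a classifier that does well on only one of the two.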

Published

2024-12-30

How to Cite

Abebaw, Z., Rauber, A., & Atnafu, S. (2024). Hate Speech Detection from Transliterated Amharic Social Media Comments Using Machine Learning and Deep Learning Approaches. Journal of Computational Science and Data Analytics, 1(02), 59-75. https://doi.org/10.69660/jcsda.01022404

Section

Special issue: Proceeding of STII2024