Article Info
A Hybrid BERT-Based Normalization Framework for Hate Speech Detection
Zainab Mansur, Nazlia Omar, Sabrina Tiun, Eissa M. Alshari
Abstract
In recent years, hate speech has proliferated on social media. Social media messages often contain out-of-vocabulary (OOV) words; when such words are absent from the training data and embedding models, hate speech detection performance drops significantly. Abbreviations in tweets are increasingly common, yet normalizing them is challenging: a suitable correction candidate must be found, and the right alternative must be chosen, often without long context or normalization resources. In this paper, we seek to enhance the normalization of abbreviations and OOV words by generating appropriate candidates and selecting suitable alternatives. The model uses BERT masked language modeling for candidate generation, a set of developed rules, and a string similarity measure with SymSpell spelling correction. Experiments show that the proposed normalization reduces the number of OOV words by 12%, which in turn improves classification accuracy. The hate speech detection model achieves an F1 score of 85%, higher than the classifier performance obtained with the normalization models proposed in our previous work. The framework thereby enhances the identification of hate speech in texts.
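The core selection step described above, choosing among masked-LM candidates by string similarity to the original OOV token, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate list would come from BERT's fill-mask predictions, and `difflib`'s ratio stands in for the SymSpell-based similarity measure; the token "ppl" and its candidates are hypothetical examples.

```python
import difflib

def rank_candidates(oov_token, mlm_candidates):
    """Rank masked-LM candidate replacements for an OOV token by
    string similarity to the token itself (difflib's ratio is used
    here as a stand-in for the paper's SymSpell-based measure)."""
    scored = [(cand, difflib.SequenceMatcher(None, oov_token, cand).ratio())
              for cand in mlm_candidates]
    # Highest-similarity candidate first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical fill-mask outputs for the abbreviation "ppl" in a tweet:
# the contextually plausible words BERT might propose for the mask slot.
candidates = ["people", "maybe", "you"]
best, score = rank_candidates("ppl", candidates)[0]
print(best)  # the candidate most similar to "ppl"
```

Combining contextual plausibility (from the masked language model) with surface similarity (from the spelling-correction side) is what lets the method pick "people" for "ppl" rather than an unrelated but fluent word.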
Keywords
Normalization, hate speech, BERT, rule-based, abbreviation, out-of-vocabulary, social media, X
Area
Pattern Recognition

