Full Text

Turn on search term navigation

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Multilingual characteristics, lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier which adopts a novel fusion strategy that combines different loss functions and multiple pre-training models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual bidirectional encoder representation from transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masking Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them to obtain three fusion models using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.

Details

Title
A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution
Author
Song, Guizhe 1   VIAFID ORCID Logo  ; Degen Huang 1 ; Xiao, Zhifeng 2 

 School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; [email protected] (G.S.); [email protected] (D.H.) 
 School of Engineering, Penn State Erie, The Behrend College, Erie, PA 16563, USA 
First page
205
Publication year
2021
Publication date
2021
Publisher
MDPI AG
e-ISSN
20782489
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2532337044
Copyright
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.