Department of Natural Language Processing & Knowledge Discovery
Conference paper. Year: 2022

Offline Corpus Augmentation for English-Amharic Machine Translation

Yohannes Biadgligne
  • Role: Author
  • PersonId : 1124663

Abstract

This study investigates the effect of corpus augmentation on the quality of English-Amharic Machine Translation (MT). Two families of models were used: Statistical Machine Translation (SMT) systems with trigram and four-gram language models, and Neural Machine Translation (NMT) models based on Gated Recurrent Units (GRU). Each model was trained independently on the original corpus and on the augmented corpus to measure how augmentation affects translation quality. The original and augmented corpora contain 225,304 and 463,796 English-Amharic parallel sentences, respectively. The augmentation was performed offline, at the token level, before any other MT processing step. Among the available token-level augmentation mechanisms, random insertion, replacement, deletion, and swapping were chosen and implemented. After training, Bilingual Evaluation Understudy (BLEU) scores were collected and analyzed. The models trained on the augmented corpus outperform their counterparts trained on the original corpus in terms of BLEU score. We therefore conclude that corpus augmentation improved the performance of both the SMT and the NMT systems.
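The four token-level operations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say which side of each sentence pair is perturbed, how inserted or replacement tokens are selected, or how many augmented copies are produced per sentence, so the sketch assumes that only the source side is changed, that new tokens are drawn uniformly from a caller-supplied list `vocab`, and that one augmented copy is added per pair. All function names and parameters are hypothetical.

```python
import random
from functools import partial


def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op for fewer than two tokens)."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_insertion(tokens, rng, vocab):
    """Insert one token drawn from vocab at a random position (assumption)."""
    out = list(tokens)
    out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    return out


def random_replacement(tokens, rng, vocab):
    """Replace one randomly chosen token with a token from vocab (assumption)."""
    out = list(tokens)
    if not out:
        return out
    out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def augment_corpus(pairs, vocab, seed=0):
    """Return the original pairs followed by one augmented copy of each pair.

    Assumption: only the source sentence is perturbed; the target is kept as-is.
    """
    pairs = list(pairs)
    rng = random.Random(seed)  # fixed seed keeps the offline step reproducible
    ops = [
        random_swap,
        random_deletion,
        partial(random_insertion, vocab=vocab),
        partial(random_replacement, vocab=vocab),
    ]
    augmented = list(pairs)
    for src, tgt in pairs:
        op = rng.choice(ops)
        augmented.append((" ".join(op(src.split(), rng)), tgt))
    return augmented
```

Because the augmentation runs offline, its output can be written to disk once and reused for both SMT and NMT training. A call such as `augment_corpus([("the cat sat", "...")], vocab=["dog", "tree"])` returns twice as many pairs as it receives, which is of the same order as the growth reported in the abstract (225,304 to 463,796 sentences), although the exact procedure behind those figures is not described here.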
Main file
ICICT2022Augmented_corpusFinal Draft.pdf (871.95 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03547539, version 1 (28-01-2022)

Identifiers

  • HAL Id: hal-03547539, version 1

Cite

Yohannes Biadgligne, Kamel Smaïli. Offline Corpus Augmentation for English-Amharic Machine Translation. 2022 The 5th International Conference on Information and Computer Technologies, Mar 2022, New York, United States. ⟨hal-03547539⟩
