Debre Berhan University Institutional Repository

TEXT-BASED LANGUAGE IDENTIFICATION FOR TYPOLOGICALLY RELATED ETHIOPIAN LANGUAGES USING DEEP LEARNING

dc.contributor.author MIKRE, GETU MIHRETE
dc.date.accessioned 2024-09-03T08:03:28Z
dc.date.available 2024-09-03T08:03:28Z
dc.date.issued 2023-06-26
dc.identifier.uri http://etd.dbu.edu.et:80/handle/123456789/1524
dc.description.abstract Today we live in a world where there are more multilingual than monolingual people. There is an ever-increasing amount of information on the World Wide Web that is written in different languages. Ethiopia is a multilingual country par excellence, and multiple languages are used as media of administration, education and mass communication. However, this textual content may not be expressed in a monolithic format. To use such textual resources for various purposes, language identification (LID) is an important preprocessing task for understanding, organizing and analyzing the content. LID is the detection of the natural language of an input text. It is also a necessary first step for any language-dependent natural language processing task. Although text-based LID has been extensively studied, there is still no comprehensive understanding of the factors that determine its identification accuracy, such as the size of the text fragment to be identified, the amount and variety of training data available, the classification algorithm, and the embedding technique used. LID for very closely related languages is another unsolved problem. Current LID applications and models are unable to accurately identify the language of a given text written in the Ge'ez script because the languages that use it are so similar. The Ethiopic script is an alpha-syllabary or abugida (“አቡጊዳ”) writing system used for several languages spoken in Ethiopia and Eritrea. In this work, we present an LID model for six typologically and phylogenetically related low-resource Ethiopian languages that use the Ge'ez script as their writing system, namely Amharic, Awngi, Ge'ez, Guragigna, Tigrigna and Xamtanga. The corpus was collected automatically from various sources, including Ethiopian mass media websites, social media, Bibles and related publications. We used chars2vec embeddings as features and a deep neural network (DNN) model for classification.
To train and evaluate the proposed LID model, the researchers conducted several experiments with sample texts of different lengths using the best hyperparameter setting. The proposed LID model correctly identified the languages with an accuracy of more than 99% for texts longer than 50 characters and an accuracy of 77.68% for texts 5 characters long. The model also performed well on out-of-vocabulary texts. Where languages are closely related and texts are very short, however, its identification performance was relatively poor. It would therefore be of interest to keep exploring LID models that handle short texts in closely related languages. en_US
dc.language.iso en en_US
dc.subject Amharic, Awngi, Bag-of-Characters, Closely Related Languages, Deep Neural Network, Ethiopic Script, Guragigna, Language Identification, Tigrigna, Xamtanga en_US
dc.title TEXT-BASED LANGUAGE IDENTIFICATION FOR TYPOLOGICALLY RELATED ETHIOPIAN LANGUAGES USING DEEP LEARNING en_US
dc.type Thesis en_US

