TEXT-BASED LANGUAGE IDENTIFICATION FOR TYPOLOGICALLY RELATED ETHIOPIAN LANGUAGES USING DEEP LEARNING

MIKRE, GETU MIHRETE

dc.contributor.author	MIKRE, GETU MIHRETE
dc.date.accessioned	2024-09-03T08:03:28Z
dc.date.available	2024-09-03T08:03:28Z
dc.date.issued	2023-06-26
dc.identifier.uri	http://etd.dbu.edu.et:80/handle/123456789/1524
dc.description.abstract	Today we live in a world where there are more multilingual than monolingual people. An ever-increasing amount of information on the World Wide Web is written in different languages. Ethiopia is a multilingual country par excellence, and multiple languages are used as media of administration, education and mass communication. However, these textual contents may not appear in a single, uniform format. To use such textual resources for various purposes, language identification (LID) is an important preprocessing task for understanding, organizing and analyzing them. LID is the detection of the natural language of an input text. It is also the necessary first step for any language-dependent natural language processing task. Although text-based LID has been studied extensively, there is still no comprehensive understanding of the factors that determine its accuracy. These factors include the size of the text fragment to be identified, the amount and variety of training data available, the classification algorithm and the embedding techniques used. LID for very closely related languages is another unsolved problem. Current LID applications and models are unable to accurately identify the language of a given text written in the Ge'ez script because the languages that use it are so similar. The Ethiopic script is an alpha-syllabary or abugida (“አቡጊዳ”) writing system used for several languages spoken in Ethiopia and Eritrea. In this work, we present an LID model for six typologically and phylogenetically related low-resourced Ethiopian languages that use the Ge'ez script as their writing system, namely Amharic, Awngi, Ge'ez, Guragigna, Tigrigna and Xamtanga. The corpus was collected automatically from various sources, including Ethiopian mass media websites, social media, Bibles and related publications. We used chars2vec embeddings as features and a deep neural network (DNN) model for classification.
To train and evaluate the proposed LID model, we conducted several experiments with sample texts of different lengths using the best-performing hyperparameter settings. The proposed LID model correctly identified the languages with an accuracy of more than 99% for texts longer than 50 characters and an accuracy of 77.68% for texts 5 characters long. The model also performed well on out-of-vocabulary texts. Where languages are closely related and texts are very short, however, its identification performance was relatively poor. Future work should therefore continue to explore LID models that handle short texts in closely related languages.	en_US
dc.language.iso	en	en_US
dc.subject	Amharic, Awngi, Bag-of-Characters, Closely Related Languages, Deep Neural Network, Ethiopic Script, Guragigna, Language Identification, Tigrigna, Xamtanga	en_US
dc.title	TEXT-BASED LANGUAGE IDENTIFICATION FOR TYPOLOGICALLY RELATED ETHIOPIAN LANGUAGES USING DEEP LEARNING	en_US
dc.type	Thesis	en_US