Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model

Tirus Muya Maina; Aaron Mogeni Oirere; Stephen Kahara

Authors

Tirus Muya Maina Computer Science Department, Murang’a University of Technology, Murang’a, Kenya
Aaron Mogeni Oirere Computer Science Department, Murang’a University of Technology, Murang’a, Kenya
Stephen Kahara Computer Science Department, Murang’a University of Technology, Murang’a, Kenya

Keywords:

Annotated, Swahili, Digraph, Corpus, NLP, CNN, Dense layer

Abstract

This study undertakes the development of the Annotated Swahili Digraph Corpus, utilizing a convolutional neural network-based model specifically designed for the extraction of digraphs. This initiative addresses a significant gap in the availability of dedicated digraph corpora for the Swahili language, which is increasingly needed for various applications in Natural Language Processing (NLP). The CNN-based model was accurately crafted to optimize the extraction and classification of digraphs, taking full advantage of the annotated features within the corpus. Digraphs are pairs of letters that create distinct sounds in a language, and Swahili`s linguistic structure presents unique challenges and requirements in this regard. Therefore, specialized tools and models are essential for ensuring accurate transcription and efficient speech recognition that cater specifically to the nuances of the Swahili language. The resulting Swahili Digraph Corpus comprises a comprehensive collection of 31,197 words, each systematically annotated to highlight their respective digraphs. Notably, this corpus features the nine key Swahili digraphs: "ch," "dh," "gh," "kh," "ng’," "ny," "sh," "th," and "ng." Furthermore, it includes annotations for vowel distribution, showcasing the core vowels "a," "e," "i," "o," and "u." This detailed annotated corpus supports a wide array of NLP applications, enabling researchers and developers to utilize accurate linguistic data for tasks such as text processing, machine translation, and speech synthesis. Through this dedicated effort, we aim to enhance the resources available for processing the Swahili language, ultimately contributing to its greater accessibility in the digital landscape.

References

I. A. Okafor, “Distinctive Features: A Linguistic Analysis of Consonant Sounds in English Language,” Ansu Journal of Language and Literary Studies, vol. 2, Issue.2, 2022.

M. Sipser, Introduction to the Theory of Computation, PWS Publishing Company, 1996.

S. S. Rao, Engineering Optimization: Theory and Practice, John Wiley & Sons, 2019.

J. Hopcroft, R. Motwani, and J. D. Ullman, “Introduction to automata theory, languages, and computation,” ACM Sigact News, vol. 32, Issue. 1, pp. 60–65, 2001.

J. H. Hansen and G. Liu, “Unsupervised accent classification for deep data fusion of accent and language information,” Speech Communication, vol. 78, pp. 19–33, 2016

A. F. Atanda, “Multinomial Logistic Regression Probability Ratio-Based Feature Vectors for Malay Vowel Recognition,” Universiti Utara Malaysia, 2021.

W. H. Finch, J. E. Bolin, and K. Kelley, Multilevel Modelling Using R, United Kingdom: CRC Press/Taylor & Francis Group, 2019.

M. S. Azmi, “Development of Malay Word Pronunciation Application using Vowel Recognition,” Malay, vol. 9, Issue.1, 2016.

M. S. Azmi, “Malay Word Pronunciation Test Application for Pre-School Children,” International Journal of Interactive Digital Media, vol. 4, Issue. 2, pp. 2289–4098, 2016.

K. Y. Chan and M. D. Hall, “The importance of vowel formant frequencies and proximity in vowel space to the perception of foreign accent,” Journal of Phonetics, vol. 77, 2019.

M. Mehraj, A. Goel, M. A. Butt, and M. Zaman, “Automatic Speech Recognition Approach for Diverse Voice Commands,” International Journal of Advanced Research in Computer Science, vol. 8, Issue.9, 2017.

J. O. De Sordi, Design Science Research Methodology, Springer International Publishing, 2021.

A. R. Kivaisi, Q. Zhao, and J. T. Mbelwa, “Swahili Speech Dataset Development and Improved Pre-Training Method for Spoken Digit Recognition,” ACM Transactions on Asian and Low-Resource Language Information Processing, 2023.

T. M. Maina, A.M. Oirere, and S.Kahara “A CNN-Based Digraph Extraction Model for Enhanced Swahili Natural Language Processing,” International Journal of Scientific Research in Computer Science and Engineering, Vol.12, Issue.6, pp.43-55, 2024.

T. M. Maina, “The Swahili Digraph Corpus,” Mendeley Data, 2024.

R. Yacouby and D. Axman, “Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models,” in Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems (Eval4NLP), 2020.

H. Dalianis, “Evaluation Metrics and Evaluation,” in Clinical Text Mining, Springer, Cham, pp.45-53, 2018

Development of the Annotated Swahili Digraph Corpus Using a CNN-Based Digraph Extraction Model

Authors

Keywords:

Abstract

References

Downloads

Published

How to Cite

Issue

Section

License

Similar Articles

Most read articles by the same author(s)

Make a Submission

Journal Information

Information

Join Editorial Board

Keywords

Current Issue