Social Sciences and Humanities

COMPARATIVE ANALYSIS OF MACHINE LEARNING ALGORITHMS FOR IDENTIFYING UZBEK DIALECTS AT THE SENTENCE LEVEL

Keywords: Uzbek dialects, dialect classification, natural language processing, TF-IDF, Naive Bayes, BERT, mBERT, data scarcity

Authors

  • Shahnoza POZILOVA, Professor, DSc, Tashkent University of Information Technologies, Uzbekistan
  • Madina RAXIMOVA, Master's student, Tashkent University of Information Technologies, Uzbekistan

This study examines the automatic classification of Uzbek dialects at the sentence level. Although Natural Language Processing (NLP) resources are growing, the scarcity of dialectological corpora remains one of the primary challenges. Two fundamental approaches were tested on a small, author-collected dialect dataset: a TF-IDF + Naive Bayes pipeline and a BERT model (bert-base-multilingual-cased). The main conclusion is that the primary obstacle to building high-accuracy models for Uzbek dialectology is not so much the choice of algorithm as the absence of a high-quality, comprehensively annotated corpus.
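
To make the first of the two approaches concrete, the following is a minimal sketch of sentence-level classification with TF-IDF weights fed into a multinomial Naive Bayes model. It is not the authors' implementation: the class name TfidfNaiveBayes, the whitespace tokenizer, the smoothing value, and the toy sentences and labels (dialect_A, dialect_B) are all hypothetical placeholders, and the tokens are not real Uzbek dialect data. It uses only the Python standard library and omits the vector normalization that some libraries apply by default; in practice a library such as scikit-learn (TfidfVectorizer with MultinomialNB) would likely be used instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    # Simplistic whitespace tokenizer (an assumption; real data may need more care).
    return sentence.lower().split()


class TfidfNaiveBayes:
    """Hypothetical illustration: multinomial Naive Bayes over TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def _tfidf(self, tokens):
        # Term frequency times IDF; tokens unseen during training are ignored.
        tf = Counter(t for t in tokens if t in self.idf)
        return {t: n * self.idf[t] for t, n in tf.items()}

    def fit(self, sentences, labels):
        docs = [tokenize(s) for s in sentences]
        n_docs = len(docs)
        # Document frequency: number of sentences containing each token.
        df = Counter(tok for doc in docs for tok in set(doc))
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
        vocab = sorted(self.idf)

        # Sum the TF-IDF weight of every token within each class.
        class_weights = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for tok, w in self._tfidf(doc).items():
                class_weights[label][tok] += w

        class_counts = Counter(labels)
        self.log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        self.log_likelihood = {}
        for c in class_counts:
            total = sum(class_weights[c].values()) + self.alpha * len(vocab)
            self.log_likelihood[c] = {
                t: math.log((class_weights[c][t] + self.alpha) / total) for t in vocab
            }
        return self

    def predict(self, sentence):
        weights = self._tfidf(tokenize(sentence))
        scores = {
            c: self.log_prior[c]
            + sum(w * self.log_likelihood[c][t] for t, w in weights.items())
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Placeholder training data: invented tokens and labels, not real dialect sentences.
sentences = ["alpha beta gamma", "alpha beta delta", "sigma tau gamma", "sigma tau delta"]
labels = ["dialect_A", "dialect_A", "dialect_B", "dialect_B"]

model = TfidfNaiveBayes().fit(sentences, labels)
print(model.predict("alpha beta"))
print(model.predict("sigma tau"))
```

With only a handful of sentences per class, a model like this can separate classes only by tokens it has already seen, which illustrates why corpus size and annotation quality matter as much as the choice of algorithm.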