STAGES OF IDENTIFYING STABLE LINGUISTIC UNITS USING N-GRAMS
This article examines the scientific and methodological foundations for automatically identifying stable linguistic units in Uzbek using the national text corpus. N-grams of two to five words were selected using statistical measures and then classified as either phraseological units or free word combinations on the basis of linguistic criteria and contextual models. The proposed approach achieved an accuracy of 90%, which suggests that a substantial portion of highly associated combinations exhibit phraseological characteristics. The findings contribute to the development of automatic phraseological dictionaries and improve the processing of multiword expressions in corpus linguistics and NLP systems.
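The candidate-selection step summarized above can be illustrated with a minimal sketch. The abstract does not name the statistical measures used, so pointwise mutual information (PMI) over bigrams is an assumption here, and the function names `extract_ngrams` and `bigram_pmi` are hypothetical. The sketch shows only how N-grams of two to five words might be enumerated and how adjacent word pairs might be scored for association. It does not reproduce the study's pipeline or its classification stage.

```python
import math
from collections import Counter


def extract_ngrams(tokens, n_min=2, n_max=5):
    """Yield every contiguous n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def bigram_pmi(tokens):
    """Return {bigram: PMI} from maximum-likelihood estimates.

    PMI is an assumed association measure; the source only says
    "statistical measures" without specifying which ones.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_xy = count / n_bigrams
        p_x = unigram_counts[w1] / n_unigrams
        p_y = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Placeholder tokens, not corpus data.
    sample = ["a", "b", "c", "a", "b", "d"]
    candidates = Counter(extract_ngrams(sample))
    pmi = bigram_pmi(sample)
    # Rank bigrams from most to least strongly associated.
    for pair, score in sorted(pmi.items(), key=lambda kv: -kv[1]):
        print(pair, candidates[pair], round(score, 3))
```

On a real corpus, PMI tends to overrate rare pairs, so a minimum-frequency cutoff is usually applied first. Scoring the longer N-grams would need a measure generalized beyond word pairs.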
Copyright (c) 2025 «ACTA NUUz»

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.











