The Open Corpus of the Veps and Karelian Languages: Overview and Applications

Abstract

Corpus-linguistic methods and tools have become a growing priority in the study of the Baltic-Finnic languages of the Republic of Karelia. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working on the Open Corpus of the Veps and Karelian Languages (VepKar), an extension of the Veps Corpus created in 2009. VepKar comprises texts in Karelian and Veps, multifunctional dictionaries linked to those texts, and software with an advanced search system that filters by text attributes (language, genre, etc.) and by numerous linguistic categories; lexical and grammatical search in texts relies on the word-form generator that we created earlier. To date, a corpus of 3000 texts has been compiled, the texts have been uploaded and marked up, a system for classifying texts by language, dialect, type, and genre has been introduced, and the word-form generator has been created. Future plans include a speech module for working with audio recordings and a syntactic tagging module that uses the output of morphological analysis. Owing to continuous functional improvements to the corpus manager and the ongoing enrichment of VepKar with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and exhibit as fully as possible the state of the Veps and Karelian languages in the 19
