
This page pertains to UD version 2.

UD Georgian GLC

Language: Georgian (code: ka)
Family: Kartvelian

This treebank has been part of Universal Dependencies since the UD v2.13 release.

The following people have contributed to making this treebank part of UD: Irina Lobzhanidze.

Repository: UD_Georgian-GLC
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 4.0

Genre: fiction, nonfiction

Questions, comments? General annotation questions (either Georgian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [irina_lobzhanidze (æt) iliauni • edu • ge]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation Source
Lemmas: assigned by a program, not checked manually
UPOS: assigned by a program, with some manual corrections, but not a full manual verification
XPOS: assigned by a program, not checked manually
Features: assigned by a program, with some manual corrections, but not a full manual verification
Relations: assigned by a program, with some manual corrections, but not a full manual verification

Description

The Georgian UD Treebank (UD_Georgian-GLC) is the first syntactically annotated corpus of Georgian. It is based on a collection of annotated sentences selected from the Georgian Language Corpus (GLC), available at http://oldcorpora.iliauni.edu.ge/, and was created at the University of Göttingen.

The treebank annotates a representative sample of sentences randomly selected from the GLC (Doborjginidze et al. 2013). The annotations describe the grammatical structure of each sentence and the dependencies within it, supporting a better understanding of Georgian syntax. The tokenization and segmentation principles of the GLC (Georgian Language Corpus) differ slightly from those of the UD (Universal Dependencies) specifications, particularly regarding multiword tokens; to remove this difference, the UD approach has been adopted. The morphosyntactic annotation discussed in Lobzhanidze (2022) was automatically converted to meet the UD requirements. Thus, UD_Georgian-GLC incorporates automatic annotation for lemmas (LEMMA), part-of-speech categories (UPOS, XPOS), morphological features (FEATS), and transliteration and tokenization information (MISC). The heads of the current words (HEAD), dependency relations (DEPREL), and the enhanced dependency graph (DEPS) have been automatically converted, reviewed, and manually corrected.
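The annotation layers listed above correspond to the ten tab-separated columns of a CoNLL-U token line. As a minimal sketch, assuming the standard CoNLL-U column order, one line can be split into named fields as follows. The helper names `parse_feats` and `parse_token_line` are hypothetical, and the sample row is invented for illustration; it is not taken from the treebank and its annotation may not match the treebank's actual conventions.

```python
# Standard CoNLL-U column order (ten tab-separated fields per token line).
CONLLU_FIELDS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_feats(value):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(cols)}")
    token = dict(zip(CONLLU_FIELDS, cols))
    token["FEATS"] = parse_feats(token["FEATS"])
    return token


# Invented example row (NOT from the treebank); layered features such as
# Number[subj] are kept as plain dictionary keys.
sample = "2\tყეფს\tყეფა\tVERB\t_\tNumber[subj]=Sing|Person[subj]=3\t0\troot\t_\t_"
token = parse_token_line(sample)
print(token["UPOS"], token["DEPREL"], token["FEATS"])
```

Keeping FEATS as a dictionary makes layered features such as `Number[subj]` and `Person[obj]` directly addressable without further string handling.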

The current version of the UD_Georgian-GLC treebank includes 151 utterances (sentences) and 2123 tokens. Future releases will expand the treebank with additional texts. The primary objective is to provide a more comprehensive and representative dataset for training and analysis.
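Size figures of this kind can be recomputed from a CoNLL-U file. The following is a rough sketch, assuming the usual CoNLL-U layout (sentences separated by blank lines, comment lines starting with `#`, integer IDs for syntactic words, and range IDs such as `1-2` for multiword tokens). The function name `count_conllu` and the placeholder rows are hypothetical; the sample is not treebank data.

```python
def count_conllu(text):
    """Return (sentences, syntactic words, multiword-token ranges)."""
    sentences = words = mwt_ranges = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # sentence-level comment, e.g. sent_id
            continue
        if not in_sentence:           # first token line of a new sentence
            sentences += 1
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():        # ordinary syntactic word
            words += 1
        elif "-" in token_id:         # multiword token range such as 1-2
            mwt_ranges += 1
        # IDs such as 1.1 (empty nodes) are deliberately not counted
    return sentences, words, mwt_ranges


def row(token_id, form):
    """Build a placeholder token line with all other columns left as '_'."""
    return "\t".join([token_id, form] + ["_"] * 8)


sample = "\n".join([
    "# sent_id = demo-1",
    row("1", "w1"), row("2", "w2"), row("3", "."),
    "",
    "# sent_id = demo-2",
    row("1", "w3"), row("2", "."),
    "",
])
print(count_conllu(sample))
```

Whether a published "token" count refers to surface tokens or to syntactic words depends on how multiword tokens are treated, so the two counters are reported separately here.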

Acknowledgments

The UD_Georgian-GLC release is based on the data from the Georgian Language Corpus (GLC) developed with the financial support of the Shota Rustaveli National Science Foundation (Project Nos. DP2016_23, LE/17/1-30/13, AR/320/4-105/11, Y-04-10).

Special thanks go to Prof. Dr. Stavros Skopeteas of the University of Göttingen for his support and valuable comments on the initial data of the UD_Georgian-GLC treebank, and to Prof. Dr. Dan Zeman for his invaluable contributions in making the dataset available on GitHub and for his helpful suggestions.

References

Doborjginidze, N., Lobzhanidze, I., Gunia, I. (2012). Georgian Language Corpus. Available at http://corpora.iliauni.edu.ge/. Accessed 15 July 2023.

Doborjginidze, N., Lobzhanidze, I., Mirianashvili, G. (2014). Corpus of Georgian Chronicles. Available at http://corpora.iliauni.edu.ge/. Accessed 15 July 2023.

Lobzhanidze, I. (2022). Finite-State Computational Morphology: An Analyzer and Generator for Georgian. Cham: Springer.

Statistics of UD Georgian GLC

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB

Features

Abbr, AdpType, AdvType, Animacy, Aspect, Case, Degree, Evident, Mood, NameType, Number, Number[io], Number[obj], Number[subj], NumForm, NumType, PartType, Person, Person[io], Person[obj], Person[subj], Poss, PronType, PunctType, Subcat, Tense, VerbForm, Voice

Relations

acl, advcl, advmod, advmod:lmod, amod, aux, case, cc, ccomp, conj, cop, csubj, det:poss, flat:name, iobj, mark, nmod, nsubj, nsubj:pass, nummod, obj, obl, parataxis, punct, root, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
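The restriction stated above can be sketched as a filter over one parsed sentence: keep a dependency only when its parent is tagged VERB and its child is tagged as a noun or pronoun, then tally the relation labels. This is an illustrative sketch, not the procedure used to produce the UD statistics. The function name `verb_argument_relations` is hypothetical, the toy sentence is invented, and whether PROPN counts as a "noun" here is an assumption left to the `child_upos` parameter.

```python
from collections import Counter


def verb_argument_relations(sentence, child_upos=("NOUN", "PRON")):
    """Count DEPREL labels on edges from a VERB parent to a noun/pronoun child.

    `sentence` is a list of dicts with integer ID and HEAD plus UPOS and DEPREL.
    """
    by_id = {tok["ID"]: tok for tok in sentence}
    counts = Counter()
    for tok in sentence:
        head = by_id.get(tok["HEAD"])   # None for the root (HEAD = 0)
        if head is None:
            continue
        if head["UPOS"] == "VERB" and tok["UPOS"] in child_upos:
            counts[tok["DEPREL"]] += 1
    return counts


# Invented toy sentence (NOT from the treebank): only structure matters here.
sentence = [
    {"ID": 1, "UPOS": "NOUN", "HEAD": 3, "DEPREL": "nsubj"},
    {"ID": 2, "UPOS": "NOUN", "HEAD": 3, "DEPREL": "obj"},
    {"ID": 3, "UPOS": "VERB", "HEAD": 0, "DEPREL": "root"},
    {"ID": 4, "UPOS": "ADJ", "HEAD": 5, "DEPREL": "amod"},
    {"ID": 5, "UPOS": "NOUN", "HEAD": 3, "DEPREL": "obl"},
    {"ID": 6, "UPOS": "PUNCT", "HEAD": 3, "DEPREL": "punct"},
]
counts = verb_argument_relations(sentence)
print(counts)
```

In the toy sentence, the adjective attached to a noun and the punctuation attached to the verb are both excluded, so only the core and oblique argument relations are counted.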

Relations Overview