This page pertains to UD version 2.

UD for Slovenian

Tokenization and Word Segmentation

In Slovenian UD treebanks, tokens are generally delimited by whitespace or punctuation, with the following exceptions being treated as single tokens:

- words with an apostrophe, e.g. O’Brian
- numerical expressions, e.g. 30:00, 200.000,000
- words with hyphens before suffixes, e.g. OZN-ovski (“UN-like”), a-ju (“to the letter a”), 15-i (“the 15th”)
- abbreviations without whitespace, e.g. dr.
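
The exceptions above can be illustrated with a small regular-expression tokenizer. This is only a minimal sketch, not the tokenizer used to build the treebanks: the pattern, the function name `tokenize` and the three-item abbreviation list are assumptions made for illustration.

```python
import re

# Hypothetical pattern: the single-token exceptions are tried before the
# generic word and punctuation rules.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,:]\d+)+          # numerical expressions, e.g. 30:00
  | \w+(?:[’'\-]\w+)+         # apostrophes and hyphenated suffixes, e.g. 15-i
  | (?:dr|npr|itd)\.          # toy abbreviation list (assumption, not exhaustive)
  | \w+                       # ordinary words
  | [^\w\s]                   # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text on whitespace and punctuation, keeping the exceptions whole."""
    return TOKEN_RE.findall(text)


tokens = tokenize("O’Brian in dr. Novak sta 15-i tekla 30:00.")
print(tokens)
```

A real tokenizer would need a full abbreviation lexicon and a way to tell an abbreviation period from a sentence-final one; the sketch ignores both.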

Multiword tokens are not used in Slovenian UD treebanks. This means that fused words, such as bound pronouns (e.g. name “on me”), multi-word abbreviations (e.g. npr. “for example”) and colloquial contractions (e.g. nauš “won’t”), are currently treated as single tokens.
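
In the CoNLL-U format, a multiword token is written as a line whose ID is a range such as 3-4, so the absence of multiword tokens can be checked mechanically. The following is a minimal sketch under that assumption; the function name `multiword_token_ids` and the sample sentence are made up for illustration and are not taken from a treebank.

```python
def multiword_token_ids(conllu_text):
    """Return the IDs of multiword-token range lines (e.g. "3-4")."""
    ranges = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:  # range IDs mark multiword tokens
            ranges.append(token_id)
    return ranges


# Made-up sample: the fused form "name" occupies a single token line.
sample = (
    "# text = Misli name.\n"
    "1\tMisli\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tname\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
)
print(multiword_token_ids(sample))
```

For a file that follows the convention described above, the returned list is expected to be empty.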

For more details, see the Slovenian tokenization description page.

Morphology

Features

The Slovenian UD tagset includes all features from the universal tagset except for Voice, Typo, NounClass, Evident, Polite and Clusivity. In addition, the universal feature set has been extended with four features that either describe language-specific phenomena (Gender[psor], Number[psor] and Variant) or preserve finer-grained morphological information encoded in the original ssj500k treebank (NumForm).
Nouns have inherent Gender (feminine, masculine and neuter) and inflect for Number (singular, dual or plural), Case (nominative, genitive, dative, accusative, locative, instrumental), Animacy and Definite (indefinite or definite).
Verbs have inherent Aspect. Non-finite forms include infinitives, supine forms and l-participles. Finite forms inflect for Gender (feminine, masculine and neuter), Number (singular, dual or plural), Mood (indicative and imperative) and Person (first, second and third). The verb biti (be) also inflects for conditional Mood and future Tense, while the verbs biti (be), imeti (have) and hoteti (want) also inflect for Polarity (positive and negative).
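
Features appear in the FEATS column of a CoNLL-U file as `Name=Value` pairs joined by `|`. As a minimal sketch, the helper below parses such a string and reports any feature from the excluded list above; the function names and the example strings are assumptions for illustration, not part of any official tooling.

```python
# Universal features that the Slovenian tagset does not use (from the text above).
EXCLUDED = {"Voice", "Typo", "NounClass", "Evident", "Polite", "Clusivity"}


def parse_feats(feats):
    """Turn a FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def unexpected_features(feats):
    """Return the excluded feature names that occur in a FEATS string."""
    return sorted(EXCLUDED.intersection(parse_feats(feats)))


print(parse_feats("Case=Nom|Gender=Masc|Number=Dual"))
print(unexpected_features("Case=Nom|Voice=Act"))
```

Layered features such as Gender[psor] need no special handling here, because the bracketed layer is simply part of the feature name.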

For more details on all other morphological features, see the language-specific guidelines for individual features, which include a detailed explanation of the JOS-to-UD conversion rules, and the Slovenian UD treebanks overview, which gives statistical details on the tagset distribution.

Syntax

Nominal subjects (nsubj) are nominal phrases typically occurring in nominative or (negated) genitive case.
Nominal objects (obj) are non-adpositional predicate arguments in all other cases, regardless of semantic role.
When a sentence has two objects, one is considered the indirect object (iobj), typically the phrase in the dative case.
All prepositional phrases are annotated as oblique arguments (obl), regardless of semantic role.
The copula (cop) label is used for the verb biti (be) in all non-existential uses. The only exceptions are biti-constructions with prepositional phrases, which are currently always labelled as obliques, even in an attributive role (e.g. biti v škripcih “to be in trouble”).
The following subtypes are used in Slovenian treebanks:
- cc:preconj for preconjuncts (e.g. tako in tako X kot Y “both X and Y”)
- discourse:filler for filled pauses (e.g. eee “uhm”)
- flat:foreign for non-first words in quoted foreign phrases (e.g. Chamber of torture)
- flat:name for exocentric complex names (e.g. Novak in Janez Novak)
- parataxis:discourse for clausal discourse markers (e.g. a veš “you know”)
- parataxis:restart for repaired sentence beginnings (e.g. sits in the image is- … this man sits)
The Slovenian treebanks do not use the clf and compound relations.
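
The subtype inventory above lends itself to a simple label check. The sketch below is a hypothetical helper (the name `check_deprel` and its return strings are assumptions) that flags relations the treebanks do not use and subtypes outside the listed set; it does not verify that the base relation itself is a valid universal relation.

```python
# Subtypes listed above for the Slovenian treebanks.
ALLOWED_SUBTYPES = {
    "cc:preconj",
    "discourse:filler",
    "flat:foreign",
    "flat:name",
    "parataxis:discourse",
    "parataxis:restart",
}
# Universal relations that the Slovenian treebanks do not use.
UNUSED_RELATIONS = {"clf", "compound"}


def check_deprel(deprel):
    """Classify a dependency label as "ok", "unused relation" or "unknown subtype"."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED_RELATIONS:
        return "unused relation"
    if ":" in deprel and deprel not in ALLOWED_SUBTYPES:
        return "unknown subtype"
    return "ok"


for label in ["nsubj", "flat:name", "compound", "nsubj:pass"]:
    print(label, check_deprel(label))
```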

For more details on the syntactic annotation, see the Slovenian UD treebanks overview with statistical details on the dependency relation distribution.

Treebanks

There are two Slovenian UD treebanks: