This page pertains to UD version 2.

UD for Slovenian

Tokenization and Word Segmentation

In Slovenian UD treebanks, tokens are generally delimited by whitespace or punctuation, with the following exceptions being treated as single tokens:

Multiword tokens are not used in Slovenian UD treebanks. This means that fused words, such as bound pronouns (e.g. name “on me”), multi-word abbreviations (e.g. npr. “for example”) and colloquial contractions (e.g. nauš “won’t”), are currently treated as single tokens.
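One practical consequence is that a Slovenian UD file should contain no multiword-token range lines. The following minimal sketch shows one way this could be checked. It relies on the CoNLL-U convention that a multiword token is introduced by a line whose ID is a range such as `3-4`. The function name `count_multiword_tokens` and the one-token sample are hypothetical illustrations and are not taken from any treebank; the sample leaves all annotation columns unspecified (`_`).

```python
def count_multiword_tokens(conllu_text: str) -> int:
    """Count multiword-token range lines (IDs such as "3-4") in CoNLL-U text."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and "#" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The first tab-separated column is the token ID.
        token_id = line.split("\t", 1)[0]
        # A hyphen in the ID marks a multiword-token range.
        if "-" in token_id:
            count += 1
    return count


# Hypothetical sample: the contraction "nauš" kept as a single token,
# with every annotation column left unspecified ("_").
sample = "# text = nauš\n" + "\t".join(["1", "nauš"] + ["_"] * 8) + "\n\n"

print(count_multiword_tokens(sample))
```

Under the tokenization policy described above, the count would be expected to be zero for any file from a Slovenian UD treebank; a non-zero count would suggest that the file follows a different segmentation scheme.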

For more details, see the Slovenian tokenization description page.

Morphology

Tags

For more details, see the language-specific guidelines for individual tags with a detailed explanation of the JOS-to-UD conversion rules, and the Slovenian UD treebanks overview with statistical details on the tagset distribution.

Features

For more details on all other morphological features, see the language-specific guidelines for individual features with a detailed explanation of the JOS-to-UD conversion rules, and the Slovenian UD treebanks overview with statistical details on the tagset distribution.

Syntax

For more details on the syntactic annotation, see the Slovenian UD treebanks overview with statistical details on the dependency relation distribution.

Treebanks

There are two Slovenian UD treebanks: