Linguistic Annotations for a Diachronic Corpus of German
Abstract
This paper describes the TüBa-D/DC, a diachronic corpus of German that draws on selected materials from the German Gutenberg Project and enriches them with several layers of linguistic annotation, including part-of-speech tags, lemmata, and constituent structure. The annotation is performed automatically with statistical tools trained on data from the Tübinger Baumbank des Deutschen (TüBa-D/Z). To assess annotation quality, the POS tagging is evaluated on a sample of texts ranging from the 13th to the 20th century. The paper concludes with a description of three query mechanisms provided for users.
Keywords
treebank; German; linguistic annotation