Ebook: Computational Dependency Theory
Dependencies – directed labeled graph structures representing hierarchical relations between morphemes, words, and semantic units – are the standard representation in many fields of computational linguistics. The linguistic significance of these structures often remains vague, however, and those working in the field stress the need for a common notational and formal basis. Although dependency analysis has become quasi-hegemonic in Natural Language Processing (NLP), the connection between computational linguists and dependency linguists remains sporadic. Yet theoretical dependency linguists and computational linguists have much to share.
This book presents papers from the International Conference on Dependency Linguistics (Depling 2011) held in Barcelona, Spain, in September 2011. Beginning with what may be the first formal definition of dependency structure, the book continues with papers covering subjects such as: the interface of syntactic structures with semantics; the mapping of semantic structures to the text surface by means of statistical language generation; the formalization of dependency; advances in dependency parsing; and the link between statistical and rule-based dependency parsing.
This comprehensive collection gives a coherent overview of recent advances in the interplay of linguistics and natural language engineering around dependency grammars, ranging from the definitional challenges of syntactic functions to formal grammars, treebank development, and parsing issues.
In the past decade, dependencies, i.e., directed labeled graph structures representing hierarchical relations between morphemes, words and semantic units, have become the near-standard representation in many fields of computational linguistics – including parsing, generation, and machine translation. The linguistic significance of these structures often remains vague, and the need for the development of common notational and formal grounds is felt strongly by many people working in these fields.
Historically, the generative grammatical tradition, which in its origins solely attempted to construct a system distinguishing grammatical from ungrammatical sentences, left linguistics in a state where the outcome of grammatical analysis, namely phrase structure, was difficult to connect to deeper (semantic, conceptual) structures. The result was a complete separation between, on one side, Natural Language Processing (NLP), which needed deeper analyses for translation, classification, generation, etc., and, on the other side, generative linguistics, which built structures that grew more and more complex as languages farther away from English began to be addressed, with the declared goal of modeling Language as a whole. In the second half of the 20th century, only a few linguists, often referring to Lucien Tesnière, continued to describe language in terms of dependency, mainly because they were working on free word order languages, for which the use of phrase structure is obviously not appropriate.
Since the 1990s, NLP has been turning towards dependency analysis, and in the past five years dependency has become quasi-hegemonic: the very large majority of parsers presented at recent NLP conferences are explicitly dependency-based. It seems, however, that the connection between computational linguists and dependency linguists remains sporadic. A very common procedure is that an existing phrase structure treebank is converted into a dependency format that fits the computational linguist's needs, and other researchers then attempt to reproduce this annotation with statistical or rule-based grammars. This is not to say that the situation was any better when parsers still derived phrase structures and linguistics discussed “move alpha”. Yet we believe that the circumstances are different today and that dependency linguists and computational linguists have a lot to share: we know that statistical parsing gives better results if we have a linguistically coherent corpus analysis. We need to know what the differences are between surface and deep dependency. What are the units that appear in dependency analysis? What kind of analysis works for which application? How can dependency structures be linked to the lexicon and to semantics?
Kim Gerdes, Eva Hajičová and Leo Wanner
The paper proposes a mathematical method of defining dependency and constituency, provided that linguistic criteria characterizing the acceptable fragments of an utterance have been put forward. The method can be used to define the syntactic structures of sentences, as well as discourse structures for texts or even morphematic structures for words. Our methodology leads us to put forward the notion of connection, which is simpler than dependency, and to propose various representations that are not necessarily trees. We also study criteria to refine and hierarchize a connection structure in order to obtain a tree.
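As an illustration, here is a minimal formal sketch of the notion of connection in our own notation; the fragment-based criterion below is an assumption chosen to illustrate the style of such a definition, not necessarily the paper's exact one. Given an utterance U and the set F of its acceptable fragments, two disjoint fragments may be said to be connected whenever their union is itself an acceptable fragment:

    X \frown Y \;\Longleftrightarrow\; X \cap Y = \emptyset \;\text{ and }\; X \cup Y \in F, \qquad X, Y \in F

A dependency then results from additionally designating one member of each connected pair as the head, which is where the refinement and hierarchization criteria come into play.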
Over the last decade, the prominence of statistical NLP applications that use syntactic rather than only word-based shallow clues has increased very significantly. This prominence triggered the creation of large-scale treebanks, i.e., corpora annotated with syntactic structures. However, a look at the annotation schemata used across these treebanks raises some issues. Thus, it is often unclear how the set of syntactic relation labels has been obtained and how it can be organized so as to allow for different levels of granularity in the annotation. Furthermore, it appears questionable that, despite the linguistic insight that syntax is very much language-specific, multilingual treebanks often draw upon the same schemata, with little consideration of the syntactic idiosyncrasies of the languages involved. Our objective is to detail the procedure for establishing an annotation schema for the surface-syntactic annotation of Spanish and to present a restricted set of easy-to-use criteria and a methodology which facilitate the decision process of the annotators, but which can also accommodate the elaboration of a more or less fine-grained tagset. The procedure has been tested on a 3,513-sentence Spanish corpus, a fragment of the AnCora newspaper corpus.
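For illustration, a tagset with adjustable granularity can be represented as a hierarchy in which every fine-grained label has a coarser ancestor; the sketch below is our own, and the labels in it are hypothetical examples rather than the schema proposed in the paper:

    # A hypothetical syntactic-relation hierarchy: each fine-grained label
    # maps to its coarser parent, so an annotation can be collapsed to any
    # desired level of granularity.
    PARENT = {
        "dobj": "obj",     # direct object   -> object
        "iobj": "obj",     # indirect object -> object
        "obj": "arg",      # object          -> argument
        "subj": "arg",     # subject         -> argument
        "arg": "dep",      # argument        -> generic dependency
        "modif": "dep",    # modifier        -> generic dependency
    }

    def coarsen(label: str, levels: int = 1) -> str:
        """Replace a label by its ancestor `levels` steps up the hierarchy."""
        for _ in range(levels):
            label = PARENT.get(label, label)
        return label

    print(coarsen("dobj"))     # -> obj
    print(coarsen("dobj", 2))  # -> arg

Annotators can then work with the fine-grained labels while evaluation or downstream tools collapse them to the coarser level they need.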
In this paper, we investigate errors in syntax annotation, using the Turku Dependency Treebank, a recently published treebank of Finnish, as study material. This treebank uses the Stanford Dependency scheme as its syntax representation, and its published data contains all annotations created in the full double-annotation process as well as timing information, both of which are necessary for this study.
First, we examine which syntactic structures are the most error-prone for human annotators, and compare these results to those of two baseline parsers. We find that annotation decisions involving highly semantic distinctions, as well as certain morphological ambiguities, are especially difficult for both the human annotators and the parsers. Second, we train an automatic system that presents sentences for inspection, ordered by their likelihood of containing errors. We find that the system clearly outperforms the random baseline: for instance, by inspecting the top 10% of sentences as ordered by our system, it is possible to weed out 25% of the errors.
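A minimal sketch of such an error-ranking system follows; it is our own construction, and the per-sentence features and the logistic-regression choice are assumptions for illustration, not the paper's exact setup:

    # Rank sentences by their estimated probability of containing an
    # annotation error, so manual inspection can focus on the top of the
    # list. Features and model choice are illustrative assumptions.
    from sklearn.linear_model import LogisticRegression

    # One feature vector per sentence, e.g.
    # [annotator_disagreement_rate, parser_vs_gold_mismatch, annotation_time_z]
    X_train = [[0.0, 0.1, -0.3], [0.4, 0.5, 1.2], [0.1, 0.0, 0.2], [0.6, 0.7, 2.0]]
    y_train = [0, 1, 0, 1]  # 1 = sentence was found to contain an error

    model = LogisticRegression().fit(X_train, y_train)

    def rank_for_inspection(sentences, features):
        """Return sentences sorted by decreasing estimated error probability."""
        probs = model.predict_proba(features)[:, 1]
        return [s for _, s in sorted(zip(probs, sentences), reverse=True)]

Inspecting a fixed budget of sentences from the top of such a ranking is what yields the 25%-of-errors-in-10%-of-sentences figure reported above.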
One of the reasons for the popularity of dependency approaches in recent computational linguistics is their ability to efficiently derive the core functor-argument structure of a sentence as an interface to semantic interpretation. Exploring this feature of dependency structures further, in this paper we show how basic dependency representations can be mapped to semantic representations as used in Lexical Resource Semantics (LRS), an underspecified semantic formalism originally developed for the HPSG framework and its elaborate syntactic representations. We describe a two-stage process which in the first stage establishes a syntax-semantics interface representation that abstracts away from some differences in surface dependencies and ensures the local reconstruction of arguments for middle- and long-distance dependencies, before the actual LRS semantics is built in the second stage. We evaluate the approach on CREG-109, a small dependency-annotated corpus of answers to reading comprehension questions written by American learners of German.
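As a minimal sketch of what such a first normalization stage can do (our own illustration; the relation labels and the control-verb example are assumptions), surface dependencies can be rewritten so that an argument shared through a control construction is locally reattached to the embedded predicate before semantic composition:

    # Stage 1 (illustrative): normalize surface dependencies into an
    # interface representation, locally reconstructing shared arguments.
    # Example: in "Peter tries to sleep", the surface subject of "tries"
    # is also made available as the subject of "sleep".
    def normalize(deps):
        """deps: list of (head, relation, dependent) surface triples."""
        interface = list(deps)
        for head, rel, dep in deps:
            if rel == "subj":
                # Propagate the subject to controlled infinitival complements.
                for h2, r2, d2 in deps:
                    if h2 == head and r2 == "xcomp":
                        interface.append((d2, "subj", dep))
        return interface

    surface = [("tries", "subj", "Peter"), ("tries", "xcomp", "sleep")]
    print(normalize(surface))
    # [..., ('sleep', 'subj', 'Peter')]

With arguments reconstructed locally in this way, the second stage can build the underspecified LRS representation compositionally.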
In this study, the valence patterns of Chinese parts of speech were extracted from three different Chinese dependency syntactic networks, and their similarities and differences are presented. The results show that the valence patterns of some parts of speech are persistent properties of Chinese that are not affected by style, while other parts of speech are more sensitive to style. The advantages and disadvantages of the network approach are discussed at the end of the paper. The network approach to linguistic study can make complex data concise and easy to understand. However, it also has some deficiencies. First, when the network is large, its structure becomes so complex that an easy overview is no longer possible. Second, although the network readily provides an overview of the language, it usually does not allow the analysis of linguistic details.
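A minimal sketch of how such a dependency syntactic network can be built and queried, using the networkx library (our own illustration; the toy head-dependent pairs are hypothetical):

    # Build a small directed dependency network from (head, dependent)
    # pairs and inspect node degrees -- the kind of aggregate view the
    # network approach provides.
    import networkx as nx

    edges = [("eat", "rice"), ("eat", "I"), ("like", "rice"), ("like", "I")]
    G = nx.DiGraph()
    G.add_edges_from(edges)

    # The out-degree of a verb approximates how many distinct dependents
    # (valence fillers) it governs in the corpus.
    print(G.out_degree("eat"))                        # 2
    print(sorted(G.degree(), key=lambda x: -x[1]))    # nodes by connectivity

Such degree statistics give the concise corpus-level overview mentioned above, while the individual sentences behind each edge are no longer visible, which is exactly the trade-off the paper discusses.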
Semantic stochastic sentence realization is still in its fledgling stage. Most of the available stochastic realizers start from syntactic structures or shallow semantic structures that still contain numerous syntactic features. This is unsatisfactory, since sentence generation traditionally starts from abstract semantic or conceptual structures. However, a change of this state of affairs first requires a change in the annotation of the available corpora: even the multilevel annotated corpora of the CoNLL shared tasks contain syntax-influenced semantic structures. We address both tasks – the amendment of an existing annotation to make it more adequate for generation, and the development of a semantic stochastic realizer. We work with the English CoNLL 2009 corpus, which we map onto an abstract semantic (predicate-argument) annotation and into which we introduce a novel “deep-syntactic” annotation that serves as an intermediate structure between semantics and (surface) syntax. Our realizer consists of a chain of decoders for the mappings between adjacent levels of annotation: semantic → deep-syntactic → syntactic → linearized → morphological.
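A minimal sketch of such a decoder chain (our own illustration; each function is a placeholder for one of the trained statistical decoders, not the paper's implementation):

    # The realizer as a chain of decoders between adjacent annotation
    # levels: semantic -> deep-syntactic -> syntactic -> linearized
    # -> morphological. Each stage stands in for a trained decoder.
    from functools import reduce

    def sem_to_deepsyn(s):   return s | {"level": "deep-syntactic"}
    def deepsyn_to_syn(s):   return s | {"level": "surface-syntactic"}
    def syn_to_linear(s):    return s | {"level": "linearized"}
    def linear_to_morph(s):  return s | {"level": "morphological"}

    PIPELINE = [sem_to_deepsyn, deepsyn_to_syn, syn_to_linear, linear_to_morph]

    def realize(semantic_structure):
        """Apply each decoder in turn to the output of the previous one."""
        return reduce(lambda struct, decode: decode(struct), PIPELINE, semantic_structure)

    print(realize({"pred": "read", "arg1": "John", "arg2": "book"}))

The point of the architecture is that each decoder only has to bridge two adjacent levels, which keeps the individual mapping problems small.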
In linguistics, the influence of Tesnière's results can hardly be overestimated: above all, he introduced the concepts of dependency and valency, which had a considerable influence on the development of linguistics in the second half of the 20th century. However, his Structural Syntax remains largely uninvestigated: in particular, there is still no grammar formalism directly inspired by it that is suitable for theoretical and practical applications in Artificial Intelligence and computational linguistics. The aim of this paper is to fill this gap by proposing a formal grammar that adopts the intuitions and concepts of Tesnière's Structural Syntax as far as possible, adapting them to the needs of contemporary linguistics, notably natural language processing. The result of this modelling process is a new formalism derived from Tesnière's, in which natural language grammars are expressed in constructive mathematical terms (and are therefore suitable for computational treatment) and in which the abstract notion of adposition is of the greatest importance. For these reasons, the resulting grammars are called Constructive Adpositional Grammars (CxAdGrams). This paper explains the linguistic and formal motivations behind CxAdGrams, with special regard to the heritage of Tesnière and the similarities and differences with existing dependency grammar formalisms.
Categorial Dependency Grammars (CDG) are classical categorial grammars adapted to surface syntactic dependencies. They generate unlimited projective and non-projective dependency structures, are completely lexicalized, and can be analysed in polynomial time. We present an extension of CDG designed for the elaboration of wide-coverage hand-crafted grammars, likewise analysed in polynomial time. Moreover, for the extended CDG we introduce a specific method of “Structural Bootstrapping”, consisting of the incremental construction of an extended CDG from representative samples of dependency structures. This method is supported by efficient algorithmic tools. We also outline a wide-coverage dependency grammar of French developed by structural bootstrapping using these tools.
We present “CDG Lab”, an integrated environment for the development of dependency grammars and treebanks. It uses Categorial Dependency Grammars (CDG) as its formal model of dependency grammars. CDG are very expressive: they generate unlimited dependency structures, are analyzed in polynomial time, and are conservatively extendable by regular type expressions without loss of parsing efficiency. Due to these features, they are well adapted to the definition of large-scale grammars. CDG Lab supports the analysis of the correctness of treebanks developed in parallel with evolving grammars.
In this paper, we compare a higher-order graph-based parser and a transition-based parser with beam search. These parsers provide higher accuracy than a second-order MST parser and a deterministic transition-based parser. We apply and compare them on languages that have not been the focus of recent shared tasks. The parsers are implemented in a uniform framework: the transition-based parser was newly implemented, and we revised the graph-based parser. To our knowledge, the graph-based parser achieves the highest published scores for French and Czech, with labeled accuracy scores of 90.40 and 81.43, respectively.
For incomplete sentences, different structural descriptions with different levels of prediction are possible. Two existing dependency parsers have been modified to generate sequences of output hypotheses in an incremental manner. The parsing results can be characterized with respect to different criteria like the amount of predicted information, its quality, monotonicity, delay, inclusiveness and connectedness. We propose an evaluation scheme able to capture these properties and apply it to the parsers in different configurations.
We explore the performance of two dependency parsing approaches, the rule-based WCDG parser and the data-driven MaltParser, on texts written by language learners. We show that WCDG outperforms MaltParser in identifying the main functor-argument relations, whereas MaltParser is more successful than WCDG in establishing optional, adjunct dependency relations. This can be interpreted as a tradeoff between the rich, hand-crafted lexical resources capturing obligatory argument relations in WCDG and the ability of a data-driven parser to identify optional, adjunct relations based on the linguistic and world knowledge encoded in its gold-standard training corpora.
The paper presents ETAP-3, a large-coverage rule-based dependency parser for Russian, and the results of its evaluation according to several criteria. The parser takes the morphological structure of a processed sentence as input and builds a dependency tree for the sentence using a set of syntactic rules. Each rule establishes one labeled and directed link between two words of a sentence that form a specific syntactic construction. The parser makes use of about 65 different syntactic links. The rules are applied by an algorithm that first builds all possible hypothetical links and then uses a variety of filters to delete excessive links so that the remaining ones form a dependency tree. Several types of data, collected either empirically or from SynTagRus, a syntactically tagged corpus of Russian, are used at this filtering stage to refine parser performance. The parser utilizes a highly structured Russian dictionary of 130,000 entries, which contain detailed descriptions of the syntactic, semantic and other properties of words. A notable proportion of the links in the output trees are non-projective. An important feature of the parser is its ability to produce multiple parses for the same sentence: in a special mode of operation, the parser can be instructed to produce further parses in addition to the first one, either automatically or interactively. In the evaluation, SynTagRus is used as the gold standard. Evaluation results show 0.911 unlabeled attachment score, 0.874 labeled attachment score, 0.507 unlabeled structure correctness, and 0.360 labeled structure correctness.
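A minimal sketch of this generate-and-filter strategy (our own illustration; the candidate generation and the single scoring filter shown are hypothetical stand-ins for ETAP-3's rule system and its variety of filters):

    # Generate-and-filter parsing (illustrative): first propose all
    # hypothetical labeled links licensed by some rule, then filter so
    # that each word keeps at most one incoming link, i.e. the surviving
    # links can form a dependency tree.
    from itertools import permutations

    def candidate_links(words, link_ok):
        """All (head, label, dependent) links licensed by a rule predicate."""
        return [(h, lab, d)
                for h, d in permutations(words, 2)
                for lab in ("subj", "obj", "mod")   # hypothetical label set
                if link_ok(h, lab, d)]

    def filter_to_tree(words, links, score):
        """Keep, for each dependent, only its highest-scoring incoming link."""
        tree = []
        for w in words:
            incoming = [l for l in links if l[2] == w]
            if incoming:
                tree.append(max(incoming, key=score))
        return tree

In ETAP-3 itself the filtering is performed by a cascade of linguistically motivated filters informed by empirical data and SynTagRus statistics rather than by a single scoring function, but the overall build-then-prune control flow is the same.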