Interlinking Valency Frames and WordNet Synsets in the LiLa Knowledge Base of Linguistic Resources for Latin

This paper describes the steps taken to model a valency lexicon for Latin (Latin Vallex) according to the principles of the Linked Data paradigm, and to interlink its valency frames with the lexical senses recorded in a manually checked subset of the Latin WordNet. The valency lexicon and the WordNet share lexical entries and are part of the LiLa Knowledge Base, which interlinks multiple linguistic resources for Latin. After describing the overall architecture of LiLa, as well as the structure of the lexical entries of Latin Vallex and Latin WordNet, the paper focuses on how valency frames have been modeled in LiLa, in line with a submodule of the Predicate Model for Ontologies (PreMOn) specifically created for the representation of grammatical valency. A mapping of the valency frames and the WordNet synsets assigned to the lexical entries shared by the two resources is detailed, as well as a number of queries that can be run across the interoperable resources for Latin currently included in LiLa.


Introduction
In lexicography, a widespread approach to the representation of lexical meaning is based on the fundamental assumption of frame semantics by C. Fillmore [1], namely that the meaning of some words can be fully understood only by knowing the frame elements that those lexical items evoke. Such an approach is related to the concept of linguistic valency [2] [3]. The latter concept is used to denote the number of obligatory complements (called 'arguments', but also 'actants' [3] or 'inner participants' [4]) controlled by a word, usually a content verb. The different types of argument are generally represented in valency frames through labels for semantic roles, such as Agent, Patient and Beneficiary.
Several lexical resources are built around the concept of valency and argument frames that describe the argument structure of words with the help of various sets of labels for semantic roles. For instance, the criteria for distinguishing obligatory and non-obligatory complements and the degree of granularity of the set of semantic roles are what mostly distinguishes valency-based resources like PropBank [5] (and NomBank [6]), VerbNet [7] and FrameNet [8] from one another.
Another kind of lexical resource largely used in both theoretical and computational linguistics is WordNet (WN) [9], which is built around the idea of synonymy in the broad sense. In WordNet, words are included in synsets, which are sets of lexical items that share the same sense, so "that [they] are interchangeable in some context without changing the truth value of the proposition in which they are embedded". 2 Despite their differences, the valency-based and the WN approaches to lexical meaning are not only compatible but closely related, because valency frames tend to correspond to lexical senses. This is precisely the case in the PDT-VALLEX valency lexicon for Czech [10], where one valency lexicon entry (i.e., valency frame) is created for each sense of a word. Given that in WN the different senses of a polysemous word are represented through assignments to different synsets, mapping the frame entries of a valency lexicon to the synsets of a WN promises to be an effective way to provide a comprehensive representation of lexical meaning that joins the valency-based and the synset-based approaches.
In order to perform such a mapping, two conditions must hold. Firstly, both a WN and a valency lexicon must be available for a specific language, with a substantial number of lexical entries in common. Secondly, the mapping process entails the design of a specific technique for interlinking these two kinds of lexical resources in as standard a fashion as possible, so as to make the process applicable to data in any language.
For Latin, the former condition is satisfied by the existence of both a Latin WN and a Latin valency lexicon. In this paper, we address the challenges posed by the second condition with the help of a set of models and ontologies developed by the community working in the area of Linguistic Linked Open Data (LLOD), whose objective is to make (meta)data of distributed linguistic resources interoperable on the web. We describe the process of representing a valency lexicon for Latin according to the principles of LLOD and of interlinking its valency frames with the lexical senses recorded in a manually checked subset of the Latin WN. The valency lexicon and the WN are included together (with shared lexical entries) in the LiLa Knowledge Base, which makes distributed linguistic resources for Latin interoperable by applying LLOD principles. The outcome of the work is a new sub-module of the PreMOn ontology (see Section 5.1) dedicated to valency frames and the two linked datasets of the Latin WordNet 3 and Latin Vallex. 4
The paper is organized as follows. Section 2 provides an overview of the related work on (inter)linking valency-based lexical resources with one another and with WNs. Section 3 presents the fundamental architecture of the LiLa Knowledge Base. Section 4 describes the Latin WN and the Latin Vallex resources. Section 5 details how we modelled valency in LiLa and the mapping between the valency frames of Latin Vallex and the synsets of WN. Finally, Section 6 concludes the paper.

Related Work
Over the last decade, several attempts at (inter)linking different lexical resources together have been performed. One of the best known projects is Semlink, which makes use of a set of mappings to link PropBank, VerbNet, FrameNet and WN [11]. Pazienza et alii [12] study the semantics of verb relations by mixing WN, VerbNet and PropBank. Shi and Mihalcea [13] integrate FrameNet, VerbNet and WN into one knowledge-base for semantic parsing purposes.
In the LLOD context, the PreMOn (predicate model for ontologies) resource [14] exposes predicate models for PropBank, NomBank, VerbNet, and FrameNet and mappings between them. More information on PreMOn will be provided in Section 5.1.
Regarding the relations between valency lexica and WNs, Hlaváčková [15] describes the merging of the Czech WordNet (CWN) with the database of verb valency frames for Czech VerbaLex, whose lexical entries are related to each other according to the CWN synsets. Hajič et alii [16] use CWN while performing the lexico-semantic annotation of the Prague Dependency Treebank for Czech (PDT), which is in turn exploited to improve the quality and the coverage of CWN. To pick out the semantic constraints of the verbal arguments in the Polish WordNet (PolNet), the valency structure of verbs is used as a property of verbal synsets, because it is "one of the formal indices of the meaning (it is so that all members of a given synset share the valency structure)" (page 402 of [17]). Finally, Passarotti et alii [18] compare the different views on lexical meaning conveyed by the Latin WN and by a valency lexicon for Latin, evaluating the degree of overlap between a number of homogeneous lexical subsets extracted from the two resources.

The LiLa Knowledge Base
The LiLa: Linking Latin project (2018-2023) 5 was awarded funding from the European Research Council (ERC) to build a Knowledge Base (KB) of linguistic resources for Latin based on the Linked Data paradigm. Our aim is to build a collection of multifarious, interlinked data sets of Latin resources represented with the same vocabulary of knowledge description (by using common data categories and ontologies) [19].
According to the Linked Data paradigm, data in the Semantic Web [20] are interlinked through connections that can be semantically queried, so as to make the structure of web data better serve the needs of users. In order to achieve interoperability between distributed resources for Latin, LiLa makes use of a set of Semantic Web and Linked Data standards and practices. These include ontologies to describe linguistic annotation (OLiA: [21]), corpus annotation (CoNLL-RDF: [22]) and lexical resources (Ontolex-Lemon: [23]).
Following Bird and Liberman [24], the Resource Description Framework (RDF) [25] is used to encode graph-based data structures to represent linguistic annotations in terms of triples: (i) a predicate-property (a relation; in graph terms: a labeled edge) that connects (ii) a subject (a resource; in graph terms: a labeled node) with (iii) its object (another resource/node, or a literal, e.g. a string).
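To make the triple structure concrete, the following minimal sketch stores a few triples as plain Python tuples and looks up the objects reachable from a subject through a given predicate. It is an illustration only, not LiLa's implementation: the `ex:` identifiers are hypothetical, while `ontolex:canonicalForm` and `ontolex:writtenRep` are properties of the Ontolex-Lemon vocabulary.

```python
from typing import NamedTuple, Set, List


class Triple(NamedTuple):
    subject: str    # a resource (labeled node)
    predicate: str  # a property (labeled edge)
    obj: str        # another resource/node, or a literal such as a string


# Hypothetical identifiers, chosen for illustration only.
ENTRY = "ex:entry/abduco"
LEMMA = "ex:lemma/abduco"

graph: Set[Triple] = {
    Triple(ENTRY, "ontolex:canonicalForm", LEMMA),    # resource -> resource
    Triple(LEMMA, "ontolex:writtenRep", '"abduco"'),  # resource -> literal
}


def objects(graph: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by the edge `predicate`."""
    return sorted(t.obj for t in graph
                  if t.subject == subject and t.predicate == predicate)


print(objects(graph, ENTRY, "ontolex:canonicalForm"))
```

In a real setting an RDF library and full IRIs would replace the tuples and prefixed strings used here.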
Given the presence and role played by lemmatization in various linguistic resources, and the good accuracy rates achieved by the best performing lemmatizers for Latin (up to 0.96, as per the results of the EvaLatin 2020 evaluation campaign [26]), LiLa uses the lemma as the most productive interface between lexical resources, annotated corpora and Natural Language Processing (NLP) tools. Consequently, the ontology model of the LiLa KB is highly lexically based, grounded on a simple, but effective assumption that strikes a good balance between feasibility and granularity: textual resources are made of (occurrences of) words, lexical resources describe properties of words, and NLP tools process words. Fig. 1 presents the main components of the ontology model of the LiLa KB, showing the key role of interlinking played by lemmas. Fig. 1 shows that there are three kinds of (meta)data sources in LiLa, namely: (1) lexical resources, which include lexical entries, (2) textual resources, which are made of occurrences of words (tokens), and (3) NLP tools, which produce different outputs according to their task(s) of automatic linguistic analysis. 6 In LiLa, lexical entries, tokens and NLP outputs are interlinked via lemmas.
To make this conceptual architecture work, the core of the LiLa KB consists of a large collection of Latin lemmas: interoperability is achieved by linking all those entries in lexical resources and tokens in corpora that point to the same lemma. The lexical basis of the Latin morphological analyzer Lemlat [27] was used to populate the LiLa collection. Lemlat's database reconciles three reference dictionaries for Classical Latin [28] [29] [30], the entire Onomasticon from Forcellini's Lexicon Totius Latinitatis [31] and the Medieval Latin Glossarium Mediae et Infimae Latinitatis by du Cange et alii [32], for a total of over 150,000 lemmas.
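Linking a resource to the lemma collection can be sketched as an exact string match in which unique matches are linked, matches to several lemmas are set aside for manual disambiguation, and unmatched forms are added to the collection. The code below is a minimal sketch under those assumptions; the function name, the identifier scheme (`l1`, `new:...`) and the toy data are hypothetical.

```python
from collections import defaultdict


def link_to_lemma_bank(resource_lemmas, lemma_bank):
    """Link lemmas of a resource to a lemma collection by exact string match.

    `lemma_bank` maps a lemma identifier to its written form and is extended
    in place when a form is missing. Returns (linked, ambiguous, added).
    """
    by_form = defaultdict(list)
    for lemma_id, form in lemma_bank.items():
        by_form[form].append(lemma_id)

    linked, ambiguous, added = {}, {}, []
    for form in resource_lemmas:
        candidates = by_form.get(form, [])
        if len(candidates) == 1:
            linked[form] = candidates[0]          # unambiguous link
        elif len(candidates) > 1:
            ambiguous[form] = sorted(candidates)  # left for manual review
        else:
            new_id = f"new:{form}"                # missing lemma: add it
            lemma_bank[new_id] = form
            by_form[form].append(new_id)
            linked[form] = new_id
            added.append(form)
    return linked, ambiguous, added


# Toy collection: two homographic lemmas share the form "populus".
bank = {"l1": "amo", "l2": "populus", "l3": "populus"}
linked, ambiguous, added = link_to_lemma_bank(["amo", "populus", "concordo"], bank)
print(linked, ambiguous, added)
```

Keeping the ambiguous cases in a separate dictionary mirrors the fact that they are resolved by hand rather than by rule.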
Besides the collection of Latin lemmas, the linguistic resources linked so far via the LiLa KB are the following: 7
• the Index Thomisticus Treebank, both in its original and in the Universal Dependencies version [33]; 8
• the Latin works of Dante Alighieri, taken from the DanteSearch corpus [34]; 9
• the text of the Late Antiquity comedy Querolus sive Aulularia; 10 14
• the valency lexicon Latin Vallex [41]; 15
• the derivational lexicon Word Formation Latin [42]. 16
Both the lemma collection and the source (meta)data of the resources linked to LiLa (together with their Turtle files, which provide the RDF triples) are freely available from the GitHub page of the CIRCSE research center. 17

Latin WordNet and Latin Vallex
The Latin WordNet (LWN) [43] was initiated in the context of the MultiWordNet project [44], whose aim was to build a number of semantic networks for specific languages aligned with the synsets of the Princeton WordNet (PWN) [45]. Unfortunately, the automatic process employed to set up such an alignment resulted in a dataset that is at the same time largely incomplete (i.e. lacking a number of crucial lemmas, e.g. amo "to love") and inaccurate, especially due to the presence of various modern senses inherited from MultiWordNet.
7 To connect the entries of lexical resources as well as the lemmatized tokens of annotated corpora to the LiLa collection of lemmas, we perform a simple string match between the lemmas in the resource to be connected and those in the LiLa collection. Ambiguous links, which arise when one lemma in the resource is connected to more than one lemma in the LiLa collection, are disambiguated manually. Missing links are resolved by including the missing lemmas in LiLa. The procedure is detailed with reference to the Index Thomisticus Treebank in [19].
Before its inclusion in LiLa, LWN counted 8,973 synsets and 9,124 lemmas. The resource is currently undergoing substantial revision to refine and extend its contents. 18 There are currently 1,424 fully reviewed entries in LiLa, for a total of 2,809 unique synsets. Synsets are found or checked against two main Classical and Late Latin dictionaries [29] [46]. For each sense found in the dictionaries, one or more corresponding synsets are extracted from the 3.0 version of Princeton WordNet [9]. For example, the Latin verb concordo has the following senses:
• "to be in good terms, be friendly, live in harmony (with someone)";
• "to be in agreement, harmonise, agree";
• "to bring about harmony/an harmonious relationship (between things), bring into union".
Each of these senses is then linked to a valency frame in the valency lexicon Latin Vallex (LV).
LV [41] was originally developed while performing the semantic annotation of two Latin treebanks, namely the Index Thomisticus Treebank, which includes works of Thomas Aquinas [47], and the Latin Dependency Treebank, featuring works of different authors of the Classical era [48]. All valency-capable lemmas occurring in the semantically annotated portion of the two treebanks are assigned one lexical entry and one valency frame in LV.
The structure of LV resembles that of the valency lexicon for Czech PDT-VALLEX [10]. On the topmost level, the lexicon is divided into lexical entries. Each entry consists of a sequence of frame entries relevant for the lemma in question. A frame entry contains a sequence of frame slots, each corresponding to one argument of the given lemma. Each frame slot is assigned a semantic role. The set of semantic roles is the same as that used for the semantic annotation of the PDT [49]. Since the development of the lexicon is directly related to the annotation of the texts in the PDT, the surface form of the semantic roles encountered during the annotation process is recorded as well.
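The nesting of lexical entries, frame entries and frame slots can be sketched with a few dataclasses. This is an illustrative data structure only, not the format in which LV is stored; the class and field names are assumptions, and the example frame reuses the four functors ACT, PAT, DIR1 and DIR3 that this paper reports for one sense of abduco.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameSlot:
    functor: str             # semantic role label, e.g. "ACT" or "PAT"
    obligatory: bool = True  # whether the slot is an obligatory complement


@dataclass
class FrameEntry:
    slots: List[FrameSlot] = field(default_factory=list)

    @property
    def arity(self) -> int:
        """Number of obligatory slots in the frame."""
        return sum(1 for slot in self.slots if slot.obligatory)


@dataclass
class LexicalEntry:
    lemma: str
    frames: List[FrameEntry] = field(default_factory=list)


# One frame of abduco, with the functors named in the text.
abduco = LexicalEntry(
    lemma="abduco",
    frames=[FrameEntry([FrameSlot(f) for f in ("ACT", "PAT", "DIR1", "DIR3")])],
)
print(abduco.lemma, [slot.functor for slot in abduco.frames[0].slots])
```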
In view of its extension and linking to LiLa, a different choice has been made with regard to the inner structure and theoretical principles of LV. In order to increase the lexical coverage of LV, the writing of the valency frames was dissociated from the treebank annotation process, and valency is now defined on the basis of senses. For each sense, a valency frame is established intuitively, listing only its obligatory complements. This process involves finding a valency frame for each sense of the dictionary headword, as there might be differences in the number and/or type of arguments for different senses.
Since senses of words in LiLa are represented as PWN synsets evoked by LWN lexical entries, each (valency-bearing) synset is linked to a valency frame of LV, thus interlinking the two resources (see Section 5.1). The job is currently being performed manually. The valency frames included in the first version of LV have been updated, cleaned or rectified, and applied to each valency-bearing synset. Currently, 1,064 lexical entries have been annotated, for a total of 9,806 valency frames, while 1,424 entries of the LWN have been checked and revised manually.

Ontolex and PreMOn
In order to provide a common infrastructure to link the different lexical and textual resources using lemmatization as the connecting point, the LiLa KB adopts the Ontolex-Lemon ontology to model the information in its comprehensive collection of Latin lemmas. Ontolex-Lemon [23] is a de facto standard for the description of lexical entries. Particularly relevant for the aims of the project is the property canonical form, as defined in the ontology, which allows us to express the relation between an entry in any given lexical resource and a lemma in the LiLa collection.
The core properties and classes of Ontolex are sufficiently expressive to account for the relations between words, senses and synsets as defined in WN. The meaning of a lexical entry is captured by expressing the relation to either a denoted entity or an evoked lexical concept in an ontology. An instance of the class Lexical Sense is used to reify this relation. Following the model of a complete publication of the PWN as LLOD, LiLa now includes all the 1,424 entries of the manually revised LWN. 19
While a number of extensions of Ontolex allow for the expression of many properties of the lexical items, including the syntactic frames and their elements, the ontology is not well suited to express the notion of a predicate structure of semantic roles. The Predicate Model for Ontologies (PreMOn) also builds on Ontolex, but is explicitly designed to provide a description of the predicate structure and of the semantic roles connected to each lexical entry. At the same time, it also makes it possible to map different semantic descriptions of any given word (such as a predicate structure and a link to a WN synset) to each other. This feature makes it ideally suited to represent both the structure of the valency frames and their connection to the senses described in the LWN.
PreMOn is based on a core module, whose main elements are the Semantic Class and Semantic Role classes. The former represents the semantic classes from the various predicate models; thus, for instance, rolesets from PropBank and frames from FrameNet are all instances of semantic classes. The different sub-classes of this general class, however, can be further specified in dedicated submodels, where the terminology and the relations that are peculiar to each specific project can also be defined.

The Vallex Submodule
To properly capture the structure of the valency frames in LV, we created a submodule of PreMOn that introduces a series of new subclasses and subproperties (which are prefixed with the namespace pmolv in what follows). The structure of the LV submodule is illustrated in Fig. 2. We define the valency frame as a subclass of pmo:SemanticClass; each different frame of any given entry in our lexicon is an instance of this class. The arguments involved in the valency frames of LV, called frame slots, are treated as a subclass of pmo:SemRole. These slots, which, as mandated for PreMOn's semantic roles, are defined locally to each semantic class, correspond to the so-called "functors" (i.e. semantic values of syntactic dependency relations) of the Functional Generative Description [49], which are classified into the three main categories of inner participants, free modifications, and quasi-valency modifications [49] [50]. We also use the pmolv:complementationType property to distinguish between obligatory and optional modifications. Although the morphemic form, i.e. the morpho-syntactic realization of the role used in language, is not expressed in LV, the submodule identifies one property and one class to specify this information.

Mapping between Valency Frames and Synsets
PreMOn allows users to map pairs of words and predicate structures from different predicate models (e.g. from PropBank and FrameNet) to each other. In order to express this link, the core module defines a special reification of the relation between a given semantic class and a lexical entry, called "Conceptualization". The mapping itself is performed with instances of the class pmo:Mapping, which is defined as a set of conceptualizations, semantic classes, or semantic roles. Thus, in the PreMOn data, word-synset pairs are matched with the predicate analyses from resources like PropBank and VerbNet by means of mapping instances linking the corresponding conceptualizations. 20 The version of LV included in LiLa adopts the same approach to connect valency frames and senses from the LWN. Fig. 3 represents the predicate structure and one of the WN links for the Latin verb abduco, in the sense "to take away, to remove". 21 A mapping instance (furthest node on the right in the figure) holds together the two conceptualizations, which connect the verb to, respectively, synset 00173338-v of WN 22 and a valency frame for that specific sense. 23 The valency frame requires four (obligatory) frame slots, whose roles are filled by the functors ACT (generally, the Actor), PAT (generally, the Patient), 24 DIR1 (Direction From) and DIR3 (Direction To). Note that in the figure only the link between the semantic role for PAT and the functor is shown.
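The mapping for abduco can be sketched as a handful of triples: two conceptualizations that share the same lexical entry, one evoking the WN synset and one evoking the valency frame, held together by a mapping instance. The sketch below is an assumption-laden illustration: all `ex:` identifiers are hypothetical, and the property names `pmo:evokingEntry`, `pmo:evokedConcept` and `pmo:item` are assumed here and should be checked against the PreMOn ontology before use.

```python
# Triples as (subject, predicate, object); property names are assumptions.
triples = [
    ("ex:co-wn", "rdf:type", "pmo:Conceptualization"),
    ("ex:co-wn", "pmo:evokingEntry", "ex:entry/abduco"),
    ("ex:co-wn", "pmo:evokedConcept", "pwn30:00173338-v"),   # WN synset
    ("ex:co-val", "rdf:type", "pmo:Conceptualization"),
    ("ex:co-val", "pmo:evokingEntry", "ex:entry/abduco"),
    ("ex:co-val", "pmo:evokedConcept", "ex:frame/abduco-1"),  # valency frame
    ("ex:mapping-1", "rdf:type", "pmo:Mapping"),
    ("ex:mapping-1", "pmo:item", "ex:co-wn"),
    ("ex:mapping-1", "pmo:item", "ex:co-val"),
]


def mapped_concepts(triples, mapping):
    """Return the concepts evoked by the conceptualizations of one mapping."""
    items = {o for s, p, o in triples if s == mapping and p == "pmo:item"}
    return sorted(o for s, p, o in triples
                  if s in items and p == "pmo:evokedConcept")


print(mapped_concepts(triples, "ex:mapping-1"))
```

Following the mapping node in both directions is what lets a query move from a synset to the valency frame defined for the same sense, and back.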

Querying Interlinked Lexical and Textual Resources
As an example of the benefits of interoperability between different resources enabled by the Linked Data approach to the publication of the joint LWN and LV data, we consider a couple of complex queries that span several resources. At present, the majority of valency frames in LV have two argument slots (5,505, i.e. 66% of the frames). The highest number of obligatory arguments is 4, which is required by 55 frames (0.6%). The mapping between LV and the LWN allows for an easy investigation of the semantics of these verbs. By querying the SPARQL endpoint of LiLa, 25 it is possible to list the synsets that are mapped to the 55 valency frames that require 4 arguments. Table 1 lists the top five synsets associated with tetravalent verbs, ranked by the number of lexical entries associated with them.
20 http://premon.fbk.eu/resource/sense-Ep7UGYgbEXbB3B2uGhZamc.
21 Following the habit of Latin lexicography, the 1st person singular of active present indicative is conventionally chosen as the citation form for verbs.
22 See http://wordnet-rdf.princeton.edu/pwn30/00173338-v.
23 The frame is: http://lila-erc.eu/data/lexicalResources/LatinVallex/id/Conceptualization/co-val-l_86867_00173338-v.
24 It must be kept in mind that, in the PDT-VALLEX formalism, the first arguments are subject to the "argument shift" [49,
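The logic of this query, selecting the frames with four obligatory slots and ranking the mapped synsets by the number of distinct lexical entries, can be sketched in a few lines of Python over toy records. This is not the SPARQL query run against LiLa; the record layout, the placeholder lemmas and the synset labels `syn-x` and `syn-y` are invented for illustration.

```python
from collections import defaultdict

# Toy records: (lemma, frame id, number of obligatory slots, mapped synset).
# Apart from the abduco synset, all values are made up.
records = [
    ("abduco", "f1", 4, "00173338-v"),
    ("verb_a", "f2", 4, "00173338-v"),
    ("verb_a", "f3", 2, "syn-x"),
    ("verb_b", "f4", 4, "syn-y"),
    ("verb_b", "f5", 4, "syn-y"),
]


def top_synsets(records, arity=4, n=5):
    """Rank synsets mapped to frames of a given arity by distinct lemmas."""
    entries_by_synset = defaultdict(set)
    for lemma, _frame, slots, synset in records:
        if slots == arity:
            entries_by_synset[synset].add(lemma)
    # Sort by descending number of lexical entries, then by synset id.
    ranked = sorted(entries_by_synset.items(),
                    key=lambda kv: (-len(kv[1]), kv[0]))
    return [(synset, len(lemmas)) for synset, lemmas in ranked[:n]]


print(top_synsets(records))
```

Counting distinct lemmas rather than frames matters here, since one lexical entry may have several tetravalent frames mapped to the same synset.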
LiLa's connections via lemmatization also make it possible to examine, at least in a preliminary survey, the distribution of these tetravalent verbs in the linked corpora. The three most frequent verbs that have at least one 4-argument frame in the Index Thomisticus Treebank and DanteSearch are reported in Table 2. It must be kept in mind, however, that the raw frequencies reported in the table do not distinguish between the multiple valency frames associated with each verb; thus, the 300 and 60 occurrences of the verb confero "to bring, to carry together" might be instances of any of the 32 valency frames associated with it. 26 Only with a corpus enhanced with word-sense disambiguation will it be possible to distinguish between them.
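A lemma-level frequency count of this kind can be sketched as follows, assuming a corpus reduced to a list of lemmatized tokens and a table giving the arities of the frames of each lemma. The token list and the arity table are invented for illustration. Because the count is keyed on the lemma alone, it cannot tell which frame a given occurrence instantiates, which is exactly the limitation noted above.

```python
from collections import Counter


def tetravalent_frequencies(tokens, arities_by_lemma):
    """Count corpus occurrences of lemmas having at least one 4-slot frame."""
    targets = {lemma for lemma, arities in arities_by_lemma.items()
               if 4 in arities}
    return Counter(lemma for lemma in tokens if lemma in targets)


# Toy data: lemmatized tokens and the arities of each lemma's frames.
tokens = ["confero", "sum", "confero", "abduco", "sum"]
arities_by_lemma = {"confero": [2, 3, 4], "abduco": [4], "sum": [1, 2]}

freq = tetravalent_frequencies(tokens, arities_by_lemma)
print(freq.most_common(3))
```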

Conclusions
In this paper we described how we modeled the contents of a valency lexicon for Latin strictly connected to a WordNet, to interlink it with other linguistic resources for Latin in a Knowledge Base built upon LLOD principles.
The modeling was performed by building a specific submodule of the PreMOn Linked Data resource for representing predicate-based lexical resources. In particular, the OWL ontology provided by PreMOn was used for modeling the valency frames and the semantic roles adopted in the Latin valency lexicon. We reused (and extended) an already existing vocabulary to meet one of the main tenets of the LLOD world, where distributed linguistic data, metadata and resources are made interoperable precisely because they are represented through common data categories and ontologies maintained by the large and active LLOD community.
Although the LiLa Knowledge Base aims first of all to interlink the available linguistic resources for Latin, by building on standard LLOD modules it also aims to provide a reference architecture for the publication of the several (kinds of) resources that have been developed for many languages over the last decades.
Thanks to LiLa, a number of linguistic resources for Latin have already been made interoperable. The inclusion of a valency lexicon, with entries shared with a manually checked subset of the Latin WordNet, is a major achievement for the Knowledge Base and, hopefully, for the entire community interested in accessing and using linguistic resources for Latin, as it enhances the queries that can be performed on the textual resources connected to LiLa with lexical semantic information. In this respect, one of the near-future objectives of the LiLa project is to include a reference Latin-English dictionary in the Knowledge Base [46], while continuing to enlarge the number of corpora that are made interoperable thanks to their connection to LiLa.