Annotating Entities with Fine-Grained Types in Austrian Court Decisions

Abstract. The use of Named Entity Recognition tools on domain-specific corpora is often hampered by insufficient training data. We investigate an approach to produce fine-grained named entity annotations of a large corpus of Austrian court decisions from a small manually annotated training data set. We apply a general-purpose Named Entity Recognition model to produce annotations of common coarse-grained types. Next, a small sample of these annotations is manually inspected by domain experts to produce an initial fine-grained training data set. To use the small manually annotated data set efficiently, we formulate the task of named entity typing as a binary classification task: for each originally annotated occurrence of an entity and for each fine-grained type, we verify whether the entity belongs to that type. For this purpose we train a transformer-based classifier. We randomly sample 547 predictions and evaluate them manually. The incorrect predictions are used to improve the performance of the classifier: the corrected annotations are added to the training set. The experiments show that re-training with even a very small number (5 or 10) of originally incorrect predictions can significantly improve the classifier performance. We finally train the classifier on all available data and re-annotate the whole data set.

text knowledge graph population, information extraction or question answering. Such downstream applications benefit from high-quality NER output, which is why the task of NER is an important and often critical one. At the same time, many NER tools are limited to distinguishing only a relatively small set of entity types, because most of the popular corpora and data sets that are used for training these tools [24,16] are annotated for the entity types person, location and organisation only. It is especially difficult to find annotated corpora in languages other than English. Producing such a data set is very expensive and is infeasible in many practical applications. In domain-specific corpora it might even be difficult to produce annotations of the coarse-grained types mentioned above, because the domain-specific use of language and the special terms can easily confuse general-purpose NER tools, resulting in noisy annotations.
In this paper, we tackle the task of classifying the entities recognized by a general-purpose NER tool into fine-grained entity types chosen by a domain expert. We conduct a case study on a German-language corpus of Austrian court decisions collected from the Legal Information System of the Republic of Austria 4. We consider real industry settings; therefore, we rely on a very small amount of annotated training data that is produced by a domain expert within this work, and we additionally aim at recovering from noisy annotations produced by the general-purpose NER tool on the domain-specific corpus.
The task is motivated by the commercial tool LawThek 5, which offers its customers a way to access Austrian legislation and related legal documents. The customers benefit from additional enrichments produced by the system, and a domain-specific NER model would provide a further extension that helps the user to better retrieve the relevant documents and identify useful information in those documents.
Open Data All the data, including the original corpus and the manual annotations, is publicly available 6. The code to repeat the experiments is also publicly available 7.
to create training data sets with more fine-grained entity types, for example two benchmarks: FIGER [1] and OntoNotes [19]. In the work presented here, we investigate a method that significantly reduces the required amount of training data.
Fine-grained NER Several systems addressing the fine-grained NER task have been successfully applied to those data sets. In [28] and [27] the authors exploit modern transformer-based language models to learn joint embeddings of words and entities from large entity-annotated corpora. These models allow one to effectively combine the semantic signals retrieved from both words and entities. The authors of K-Adapter [26] also build on top of modern transformer language models, adding special adapters that inject multiple kinds of diverse knowledge. These adapters are task-specific and are, therefore, able to continuously infuse knowledge without forgetting.
The mentioned models reach state-of-the-art performance on the mentioned fine-grained data sets. However, these and other similar models (e.g. [23,15]) rely on significant amounts of data, incorporating external knowledge bases to learn each type of entity. In contrast, we are interested in learning the basic "concept" of type verification from very little data and seek a model that is able to benefit from cross-type interactions.
Entity Linking and Distant Supervision The problem of typing entities can also be addressed by the methods of Entity Linking (EL), in which a diversity of string-matching techniques are employed to relate found mentions of NEs to known entities. Common tools for entity linking include DBpedia Spotlight [4], Entity Fishing [13] or Babelfy [14], and the number of approaches to EL continues to increase (see [18] for a recent overview). EL, however, assumes the existence of a catalog of entities (preferably containing additional knowledge). In domain-specific settings, such as the one treated here, this is an unreasonable assumption that limits the potential range of use cases. One reason is that no entity catalog is complete, and the entities specific to a domain are unlikely to be found in general-domain catalogs. Another reason is that, for some purposes, entities of interest in the text do not correspond to any particular real-world entity. For example, the complainant or the buyer are specific entities within a document, but are not a surface form of any particular real-world entity.
The idea of Distant Supervision [7,21] is to employ EL to create an (entity-)annotated corpus. The drawbacks are the inevitable errors produced by the EL tools, especially false negatives. To mitigate this problem, researchers design neural models that are able to cope with such noisy data and/or help to recover from the noise [22,12]. Yet, the Distant Supervision approach and its extensions suffer from the same shortcomings as EL, because these techniques rely on EL and external data sources, whereas our approach is bound only to the domain-specific corpus and domain experts.

Data
The current paper presents a use case study of re-tagging entities in a German-language legal corpus with fine-grained types selected by the domain expert. The corpus is downloaded from the Legal Information System of the Republic of Austria (RIS), a computer-assisted information system on Austrian law 11. The corpus for the current experiments consists of 2500 randomly selected court decisions taken from the category "Judikatur" (= Judicature of the courts), section "Bundesverwaltungsgericht" (= Federal Administrative Court). These documents are not older than 2014 and are provided in the original German language. A large part of the decisions concern asylum procedures. The personal information is anonymised in the documents, see Example 1.

Original Named Entity Annotations
The corpus is initially annotated with a NER tool based on BERT neural networks, which is developed following the work of Kamal Raj 12. The original approach is adapted to allow for training of a new model on the WikiNer data set [17], a general-purpose data set containing four coarse-grained types. In the initial annotation process the tool annotated 39,324 Persons (PER), 215,699 Locations (LOC), 183,045 Organizations (ORG) and 324,926 Miscellanea (MISC). As expected for court decisions, the PER type is the least abundant, while each of the other three types occurs roughly 5 to 8 times as often.
The model is quite confused by the domain-specific usage of language and also by special symbols such as anonymization masks and frequent abbreviations. As a result, we identified many noisy annotations in the original coarse-grained annotations.

Selection of Named Entity Types of Interest
The original annotations are collected and analysed to group the entities and produce new entity types 13. These new types are reviewed by the domain expert from the point of view of the targeted functionality of the final application, and 9 fine-grained entity types are selected for further analysis; see also Table 1 for the definitions of the types.
11 https://www.ris.bka.gv.at/UI/Erv/Info.aspx accessed on March 26
12 https://github.com/kamalkraj/BERT-NER
13 For the purpose of the current work we omit the details of type induction, as the presented classification approach does not depend on the type selection procedure.
Gericht: to recognize the different courts and thereby identify the level at which a certain decision was taken. This could potentially be accomplished with a gazetteer of Austrian courts; however, colloquial usages and international courts would be missed.
Behörde, Administration: to see the involvement of different government offices in processes, thereby identifying the stakeholders from the government.
Verwandtschaft: to group physical persons into clusters. Could be further used for information extraction or grouping by kinship.
Land, Staat: to identify potential international involvement.
Information, Quelle, Daten: to find external information sources that could potentially be interlinked.
Zocken, Spielhallen: to group documents w.r.t. the exact type of criminal activity. This activity was particularly prominent.
Rolle, Gruppe von Personen: to identify roles that persons can be assigned to, used for further grouping and information retrieval, e.g. a person is a complainant, a buyer, a seller, etc.
Kriegshandlung, Konflikt: to group documents w.r.t. the exact type of non-criminal activity that could be a trigger. In this case, many armed conflicts lead to asylum procedures.
Strasse, Addresse: to get more detailed geographic locations, down to an exact address that can be checked against an address database.
The new types are chosen from the analysed data with the idea of providing some additional value for the end user. However, the types overlap and do not necessarily cover all the data. For example, the types "Gericht" and "Behörde" are much closer to each other than to "Strasse, Addresse". On the other hand, the type "Rolle, Gruppe" comprises entities of quite different semantics and could potentially be split into two types; the choice is made in favor of this joint type because in the random sample we find many borderline entities such as "complainants" or "legal representatives". We note that this choice of types makes the classification task more challenging, since an entity might often belong to more than one type. We consider it necessary to cope with this choice, as it was provided by the expert from the point of view of the domain of application itself.
In the following we add an artificial type "Other" that is reserved for 1) original noisy annotations and 2) entities that do not belong to any of the described types.

Manual Annotations
We annotate a small sample of data manually. For this purpose we choose a few seed entities that unambiguously belong to the chosen fine-grained types and identify their occurrences in the documents, see Example 2. We then manually verify the correctness. This original manually annotated data set is publicly available; the number of entities is presented in Table 2. In total there are 109 annotated instances for 9 types, resulting in ≈12 instances per type on average. We choose clean, unambiguous entities as the seed entities. This process is tedious for the expert, as it requires skimming through the documents to identify those entities; therefore, it is not feasible to produce a large initial data set. These initial manual annotations are not expected to represent the whole data set, but rather to provide a good seed data set for the chosen fine-grained types.

Classifier
We design a classifier that is capable of verifying the type of a given entity. We take into account the lack of training data and, therefore, aim at a robust solution that is capable of efficiently using some preliminary training to solve the task. For this reason, we focus our attention on the Target Sense Verification (TSV) task [2]. The core task is a binary classification task: given an entity of interest in a context and the definitions / hypernyms of an entity's sense, decide whether the entity is mentioned in the given sense or not. The main challenge of the task is to generalize the ability of verifying the sense of an entity to unseen domains and senses.
Our task setting can be formulated in a similar way, with the difference that we verify the type of an entity instead of its sense (target type verification). Yet the inputs, namely the target in context, the definition and the synonyms of the target type, are very similar to TSV. Therefore, we reuse the results of the challenge and employ the model from [25] 14 that showed the best results in Task 3 of the challenge. The model is based on a transformer model (we use BERT [5] 15), and it marks the input to let the encoder focus on the target and sense/type identifiers.
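The input format described above can be illustrated with a minimal sketch. It assumes a "$" symbol as the target marker, a comma-joined synonym segment, and a hand-written definition and synonym list for "Gericht"; none of these details are taken from the model in [25] or from Table 1, and in practice a tokenizer would insert the special tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TypeDescription:
    """A fine-grained type with its definition and synonyms (illustrative container)."""
    name: str
    definition: str
    synonyms: List[str]


def mark_target(context: str, start: int, end: int, marker: str = "$") -> str:
    """Surround the target span context[start:end] with a marker symbol.

    The marker symbol is an assumption; any reserved token would serve.
    """
    return f"{context[:start]}{marker} {context[start:end]} {marker}{context[end:]}"


def build_input(context: str, start: int, end: int, type_desc: TypeDescription) -> str:
    """Concatenate the marked context, the type definition and the synonyms."""
    segments = [
        mark_target(context, start, end),
        type_desc.definition,
        ", ".join(type_desc.synonyms),
    ]
    return "[CLS] " + " [SEP] ".join(segments) + " [SEP]"


context = "Das Bundesverwaltungsgericht hat die Beschwerde abgewiesen."
target = "Bundesverwaltungsgericht"
start = context.index(target)
court = TypeDescription(
    name="Gericht",
    definition="a court of law",           # hypothetical definition
    synonyms=["Gerichtshof", "Tribunal"],  # hypothetical synonyms
)
encoded = build_input(context, start, start + len(target), court)
print(encoded)
```

The binary classifier would then receive one such sequence per (entity occurrence, candidate type) pair and decide whether the marked target belongs to the described type.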

General Purpose Fine-tuning
For fine-tuning our model on the proposed task, we chose a learning setup similar to [2]: we created a training set where each instance consists of a target word in a context (e.g. the spring was broken), a target sense indicated by the definition and hypernyms of the target word (e.g., the season of growth and season), and a label indicating whether the target word was used in the target sense (in this case, F). Since it has been shown that the proposed classifier is to some extent able to transfer intrinsic classification capabilities gained on a general-purpose data set into specific domains [25], we generated this training set from the German Wiktionary. For this purpose, we scraped the entries of nouns for which multiple meanings, definitions, hypernyms and examples were available. We removed senses that were too close (i.e., those that were listed as [1a] and [1b] instead of [1] and [2]) and manually cleaned the data set to reduce noise. The final training set consists of 3,564 instances, with 55% positive examples and 45% negative ones.
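As a rough illustration of how such instances could be derived from a dictionary entry, the following sketch filters out sub-senses and pairs each example sentence with each remaining sense. The entry for "Bank" is hand-written for illustration and not scraped, and pairing every example with every other sense to obtain negatives is an assumption, not a description of the actual data preparation.

```python
import re
from typing import Dict, List, Tuple

# Sense keys such as "[1a]" or "[2b]" denote closely related sub-senses.
SUBSENSE = re.compile(r"^\[\d+[a-z]\]$")

Instance = Tuple[str, str, List[str], bool]


def keep_distinct_senses(senses: Dict[str, dict]) -> Dict[str, dict]:
    """Drop senses whose key marks them as sub-senses of another sense."""
    return {key: sense for key, sense in senses.items() if not SUBSENSE.match(key)}


def make_instances(senses: Dict[str, dict]) -> List[Instance]:
    """Pair each example with each kept sense; the label is True only for its own sense."""
    kept = keep_distinct_senses(senses)
    instances: List[Instance] = []
    for example_key, example_sense in kept.items():
        for example in example_sense["examples"]:
            for sense_key, sense in kept.items():
                label = sense_key == example_key
                instances.append((example, sense["definition"], sense["hypernyms"], label))
    return instances


# Hypothetical, hand-written entry for the German noun "Bank".
bank_entry = {
    "[1]": {"definition": "Sitzgelegenheit für mehrere Personen",
            "hypernyms": ["Möbel"],
            "examples": ["Wir saßen auf der Bank im Park."]},
    "[2]": {"definition": "Geldinstitut",
            "hypernyms": ["Unternehmen"],
            "examples": ["Die Bank gewährte den Kredit."]},
    "[2a]": {"definition": "Gebäude eines Geldinstituts",
             "hypernyms": ["Gebäude"],
             "examples": ["Die Bank steht am Marktplatz."]},
}
bank_instances = make_instances(bank_entry)
```

With this entry, the sub-sense "[2a]" is discarded and the two remaining senses yield two positive and two negative instances.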

Training
We always start from a model tuned on the TSV data set. We recall that the classification task is binary. The input is encoded as a token sequence in which T_i stands for the i-th token of the context, TARGET T_i stands for the i-th token of the target entity, DEF T_i for the i-th token of the definition of the target type, SYN_j T_i for the i-th token of the j-th synonym of the target type, and [CLS], [SEP] are the special tokens used by the model; "..." denotes the continuation of an enumeration, i.e. of T_i or DEF T_i. It is straightforward to generate negative training examples: it is enough to substitute the definition and the synonyms of the correct type with the definition and synonyms of some other type. In the preliminary experiments on a different, similar data set reported in [11] we identified that the optimal number of negative examples is ≈70% of the available other types; therefore, in our experiments we generate 6 negative examples for each positive example.
In the first experiment, with only manually annotated data, we use all the data for continued fine-tuning because the size of the data set is small, see Section 3.3 and in particular Table 2. We train the model for 3 epochs. For the subsequent experiments with manually verified data we use up to 15 instances per type from the verified data, with 2 negative examples per positive, as the development data set. The model is trained for 7 epochs in each run, and the best-scoring model (in terms of F1 score on the development data set) is chosen for further evaluation. The training batch size is set to 8; the remaining parameters are set either as reported by the authors or to the defaults of the training framework PyTorch [20].
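The negative example generation described above can be sketched as follows. The function name, the tuple layout and the decision to include "Other" in the pool of candidate types are assumptions made for illustration; only the idea of substituting the correct type with 6 other types comes from the text.

```python
import random
from typing import List, Optional, Tuple

ALL_TYPES = [
    "Gericht",
    "Behörde, Administration",
    "Verwandtschaft",
    "Land, Staat",
    "Information, Quelle, Daten",
    "Zocken, Spielhallen",
    "Rolle, Gruppe von Personen",
    "Kriegshandlung, Konflikt",
    "Strasse, Addresse",
    "Other",  # assumption: the artificial type is part of the candidate pool
]


def make_training_pairs(
    entity_id: str,
    gold_type: str,
    all_types: List[str],
    n_negatives: int = 6,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str, bool]]:
    """Return one positive (entity, type, True) pair and n_negatives negative pairs.

    Negatives pair the same entity occurrence with a randomly chosen wrong type,
    which stands for swapping in that type's definition and synonyms.
    """
    rng = rng or random.Random(0)
    other_types = [t for t in all_types if t != gold_type]
    wrong_types = rng.sample(other_types, min(n_negatives, len(other_types)))
    return [(entity_id, gold_type, True)] + [(entity_id, t, False) for t in wrong_types]


pairs = make_training_pairs("doc17:entity3", "Gericht", ALL_TYPES)
```

Each returned pair would then be turned into one encoded input sequence with a binary label.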

Experiment
We recap that the goal of the experiment is to annotate the originally recognized entities of coarse types with new fine-grained types as defined in Table 1. For this purpose we first fine-tune the model (Section 4) on the German TSV data set (Section 4.1). Then we manually annotate a small sample of data with target fine-grained types (Section 3.3). Further we fine-tune the model on this small manually annotated data set and generate predictions of new entities. We randomly take 547 predictions and manually evaluate the correctness of predictions (Section 5.1). Finally, we combine all the manually annotated entities -the initial manually annotated data set and the verified sample -and fine tune our model on the whole data set. We use this latter model to generate predictions for the complete corpus.
As described in Section 3.1 the original coarse annotations contain many noisy annotations that actually do not belong to any type. Another goal of the experiment is to evaluate how efficiently we can recover from these noisy annotations and correctly reject them.

Manual Verification
After fine-tuning on the manually annotated data set, we manually verify the results. The trained model is used to automatically generate predictions on the original corpus. Then, a sample of those predictions is taken to manually verify their correctness. We take ≈55 randomly sampled predictions per predicted fine-grained type.
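A minimal sketch of drawing such a per-type sample is given below. The dictionary keys "entity" and "type" and the fixed seed are assumptions for illustration; the text only states that roughly 55 predictions per predicted type are sampled at random.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_per_type(predictions: List[dict], per_type: int = 55, seed: int = 0) -> Dict[str, List[dict]]:
    """Group predictions by predicted type and draw up to per_type items from each group."""
    rng = random.Random(seed)
    by_type: Dict[str, List[dict]] = defaultdict(list)
    for prediction in predictions:
        by_type[prediction["type"]].append(prediction)
    return {
        type_name: rng.sample(items, min(per_type, len(items)))
        for type_name, items in by_type.items()
    }


# Toy predictions: one frequent and one rare predicted type.
toy_predictions = (
    [{"entity": f"e{i}", "type": "Gericht"} for i in range(100)]
    + [{"entity": f"f{i}", "type": "Verwandtschaft"} for i in range(10)]
)
verification_sample = sample_per_type(toy_predictions)
```

Types with fewer predictions than the requested sample size are taken in full, which is one possible reason why the per-type counts are only approximately equal.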
For each of the fine-grained types, we create a separate spreadsheet containing one prediction per row. In detail, each row gives the surface form of the entity, the predicted type, the position of the entity in the context and the full context. Six independent reviewers were tasked with examining the correctness of these samples. The review is performed as a binary classification task, by determining whether the prediction is correct or not. Additionally, in case of a false prediction, the reviewers are required to provide a proper classification for the entity according to the 9 types described in Table 1. The reviewers are instructed to be tolerant only in the case of incomplete boundaries of the annotations.
Example 3. In the following sample, Eurostat is tagged as of type "Information, Quelle, Daten". Even though the annotation is incomplete, i.e. it does not contain the date reference, this prediction is considered correct.
As the predictions are often incorrect, the resulting verified data set is quite unbalanced, with many wrongly annotated entities that end up in the type "Other". The actual frequencies of the verified fine-grained types are presented in 2.

Results
In the first run the model is trained on manually annotated data for 3 epochs; the results are presented in Table 3 16. We note that only one type, "Gericht", reaches an F1 score above 0.5. It is remarkable that the performance does not always correlate with the size of the training set for a given type; for example, "Strasse, Addresse" with only 6 training samples reaches an F1 of 0.35, which is significantly higher than the F1 score of "Kriegshandlung, Konflikt" with 15 training samples.
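For readers who want to reproduce per-type scores of this kind without an external library, the following sketch computes precision, recall and F1 for each type from gold and predicted labels. The toy labels are invented for illustration and do not correspond to any row of Table 3.

```python
from typing import Dict, List, Tuple


def per_type_scores(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for every type occurring in gold or predicted."""
    scores: Dict[str, Tuple[float, float, float]] = {}
    for type_name in sorted(set(gold) | set(predicted)):
        tp = sum(1 for g, p in zip(gold, predicted) if g == type_name and p == type_name)
        fp = sum(1 for g, p in zip(gold, predicted) if g != type_name and p == type_name)
        fn = sum(1 for g, p in zip(gold, predicted) if g == type_name and p != type_name)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[type_name] = (precision, recall, f1)
    return scores


# Invented toy labels, four entity occurrences.
toy_gold = ["Gericht", "Gericht", "Other", "Land, Staat"]
toy_predicted = ["Gericht", "Other", "Other", "Gericht"]
toy_scores = per_type_scores(toy_gold, toy_predicted)
```

In this toy case "Gericht" has one true positive, one false positive and one false negative, so its precision, recall and F1 are all 0.5, while "Land, Staat" is never predicted and scores 0.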
Overall accuracy and F1 are below 0.3. These low scores can be explained by the fact that the real verified data differ significantly from the clean seed instances used for the manual annotations. Therefore, we do further runs of the experiment taking a few samples of the manually verified types. In these runs we extend the training set by 5 and 10 instances per type, respectively, taken from the incorrectly labelled part of the manually verified data set. We add those instances with the corrected tag and, in the preparation of the training set, we generate negative examples for them as described in Section 4.2. We only do this for those types that have at least 20 incorrectly labelled instances, namely for "Information, Quelle, Daten", "Rolle, Gruppe von Personen", "Behörde, Administration", "Gericht" and "Land, Staat". We also add 5 and 10 instances from the type "Other" to generate negative examples for training. In total, we thus use roughly 5% and 10% of the manually verified data set, respectively, to extend the training set. For both settings 3 runs were performed with 5 and 10 randomly chosen instances; average results are reported.
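The extension of the training set can be sketched as below. The record fields "entity", "predicted" and "corrected" are hypothetical, and counting incorrectly labelled instances per corrected type is an assumption, since the text does not state how the threshold of 20 is applied.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def extend_training_set(
    train: List[Tuple[str, str]],
    verified: List[dict],
    k: int,
    min_incorrect: int = 20,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Add k corrected instances for every type with at least min_incorrect wrong predictions."""
    rng = random.Random(seed)
    incorrect: Dict[str, List[dict]] = defaultdict(list)
    for item in verified:
        if item["predicted"] != item["corrected"]:
            # Assumption: group wrong predictions by the reviewer's corrected type.
            incorrect[item["corrected"]].append(item)
    additions: List[Tuple[str, str]] = []
    for type_name, items in sorted(incorrect.items()):
        if len(items) >= min_incorrect:
            for item in rng.sample(items, min(k, len(items))):
                additions.append((item["entity"], type_name))
    return train + additions


# Toy verified data: 25 wrong predictions corrected to "Gericht", 5 corrected to "Verwandtschaft".
toy_verified = (
    [{"entity": f"g{i}", "predicted": "Other", "corrected": "Gericht"} for i in range(25)]
    + [{"entity": f"v{i}", "predicted": "Gericht", "corrected": "Verwandtschaft"} for i in range(5)]
    + [{"entity": f"c{i}", "predicted": "Gericht", "corrected": "Gericht"} for i in range(10)]
)
toy_train = [("seed1", "Gericht"), ("seed2", "Land, Staat")]
extended = extend_training_set(toy_train, toy_verified, k=5)
```

Repeating the call with different seeds and averaging the resulting scores would mirror the three randomized runs per setting.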
The results of training on the data set extended by 5 additional instances are presented in Table 4. We note that now for 5 types the F1 scores are 0.5 and above. Remarkably, for the type "Zocken, Spielhallen" the results have grown significantly, though no additional training instances were added. On the other hand, instances of the type "Strasse, Addresse" are now much more often missed (false negatives); therefore, recall and F1 are much lower. This demonstrates the cross-type interactions in our model, i.e. the model seems to be able to learn the idea of type verification in general and does not merely fit to the specific training data at hand. Overall accuracy and F1 scores grow by ≈0.2, which can be considered a very significant growth. This demonstrates that with the current model and training routine even a very small amount of real annotated data can have a significant impact on classification results.
16 We used the functionality of scikit-learn (https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report visited on 02.04.2021) to produce the classification report.
We are further interested in whether adding more data can still have a significant impact and proceed with 10 additional instances per type for the populated types (with support of more than 20); the results are presented in Table 5. We observe a further growth of the scores: now 6 types have F1 scores of 0.5 and above. Though the absolute values now demonstrate a more moderate growth of not more than 0.1 for accuracy and average F1, the variance has significantly decreased for most scores. This might be due to the fact that the verified data set is very challenging for classification, including noisy and arguable entities. Therefore, we slowly see a "saturation" of the scores, i.e. some entities are outliers within their types and can either not be classified well or corrupt the model if added as training instances. However, as the variance decreases, with 10 added instances we observe less impact from those noisy instances. We also note further cross-type learning as, for example, for "Strasse, Addresse" and "Zocken, Spielhallen" the scores keep growing without any further training instances of these types.
We see that using the extended data set we also manage to train the model to recover from noise to a certain extent. In the latter experiment the F1 score for the type "Other" is above 0.5. Yet, in both extended experiments we observe that the precision for this type is often higher than the recall.
Analysis of errors Some errors are listed in Table 6. In the first row the entity Art (which stands for "article") is annotated as "Other" because the phrase is incomplete. However, the model still classifies it as "Information, Quelle, Daten". In the second row we see the inverse error. These errors are due to incomplete original annotations; therefore, we think some heuristics to extend the annotations to complete entities could be useful. In the third and fourth rows we see examples of entities that could be classified into more than one type; however, the manual annotators had to choose only one type for their input.

Conclusion
We aim at producing an annotated fine-grained domain-specific data set for training an NER tool, while attempting to reduce the high cost of manual annotations produced by domain experts. In particular, we address a realistic case of 1) having to cope with the choice of fine-grained types made by domain experts and 2) having only a small gold-standard training data set. Therefore, we formulate the (named) entity typing task as a binary classification task and explore the cross-type learning of the model. We exploit a binary classifier that has shown good results on a similar binary task of sense verification. The experiments demonstrate that the initial clean and manually annotated data set might not be enough to achieve good classification results. However, adding even a small number of randomly sampled incorrectly classified entities to the training set might significantly improve the performance. Moreover, we observe that the performance of the model increases even for those types where no additional instances are added, which indicates a cross-type learning effect. We also observe the model's increasing ability to recover from the original incorrect (noisy) annotations produced by the general-purpose NER model.
Finally, we train the model on all the available manual data and (re-)annotate the whole original corpus.