Object-Action Association Extraction from Knowledge Graphs

Infusing autonomous artificial systems with knowledge about the physical world they inhabit is of utmost importance and a long-standing goal in Artificial Intelligence (AI) research. Training systems with relevant data is a common approach; yet, it is not always feasible to find the data needed, especially since a large portion of this knowledge is commonsense. In this paper, we propose a novel method for extracting and evaluating relations between objects and actions from knowledge graphs, such as ConceptNet and WordNet. We present a complete methodology for locating, enriching, evaluating, cleaning and exposing knowledge from such resources, taking into consideration semantic similarity methods. One important aspect of our method is the flexibility in deciding how to deal with the noise that exists in the data. We compare our method with typical approaches found in the relevant literature, such as methods that exploit the topology or the semantic information in a knowledge graph, and embeddings. We test the performance of these methods on the Something-Something Dataset.


Introduction
Humans are able to understand relations between real-world objects and actions relying not only on observations, but also on their commonsense knowledge. Machines, on the other hand, need a large quantity of data, in order to learn and reason about object-action relations, as for instance to correlate the object Knife with the action Cut. Yet, recognizing such types of associations is crucial for a wide spectrum of applications involving autonomous entities. Commonsense Knowledge Graphs (KGs), such as ConceptNet [1] and WordNet [2], contain to some extent knowledge about object-action relations, and can help construct knowledge bases, which subsequently can be used by machines. However, inserting knowledge into a knowledge base from crowd-built KGs hides risks, as these may contain information that is noisy or false. Therefore, evaluation methods are crucial when exploiting knowledge from such KGs.
A method that correctly identifies positive and negative object-action associations in the presence of noise can increase the quality of data that a machine can utilize, which in turn can improve the performance of autonomous, artificial intelligence (AI) systems. Driven by this need, in this paper we compare a number of methods of different nature that are commonly used in practice to extract associations from KGs, organizing them into topology-based, semantics-based and embeddings-based. We also introduce a novel semantics-based approach to identify such object-action correlations that can achieve or improve state-of-the-art performance, while offering flexibility in ironing out noise. Its main characteristic is the exploitation of patterns of relations, which carry important information as to which associations to trust and which to dismiss.
Informally, the problem we aim to solve is the following: given a directed KG and a pair of nodes, one of which refers to a real-world object class and the other to a real-world action class, we try to infer whether these two nodes are associated or not, and to what degree, i.e., if the action can be performed by/on that object. This demand is amplified by the volume of current research in fields, such as robotic manipulation, where object affordances play a key role in enabling a robot to accomplish tasks (see for instance [3] for a recent survey of the relevant literature). Of course, the methods described in our study are not limited to the household domain, but can be applied for the detection of a broader class of associations; yet, framing our analysis on the given domain helps us compare more accurately the behaviour of diverse methods.
The main contributions of this paper are (a) a comparative analysis of popular methods for extracting associations from KGs focusing on a specific domain, that of household objects, (b) the proposal of a new, enhanced method that lays more emphasis on the semantic knowledge that exists in the KG, and (c) the generation of a dataset of positive and negative object-action relations, comprising labels that are commonly used for benchmarking both research and practical approaches. Our method and dataset are publicly available.
The rest of this paper is structured as follows. Section 2 discusses related work. Section 3 presents the existing and proposed methods for identifying object-action relations. Section 4 describes our experimental evaluation, and Section 5 concludes the paper.

Related Work
Extracting commonsense knowledge from problem-agnostic repositories has been applied to a diversity of AI-related domains to solve various problems. The authors of [4] rely upon ConceptNet to identify word similarities, which they then use in order to improve the performance of sentence-based image retrieval algorithms. A more elaborate use of KGs is presented in [5], where the authors approach the problem of zero-shot label learning in images by creating KGs based on labels detected visually and on correlations found in external sources. The authors rely on WordNet to populate the graph and use Wu Palmer similarity to specify the properties. In [6], the authors assign labels to a visual scene using Bayesian logic networks and relying on commonsense knowledge extracted from WordNet, ConceptNet, and Wikipedia. WordNet is utilized in order to disambiguate seed words with the aid of their hypernym. ConceptNet properties, such as LocatedAt or UsedFor, which may pinpoint the location of an object, are also retrieved. With this method, the system can generate a compact semantic knowledge base given only a small number of objects. Similar methods are used in [7,8,9,10].
The aforementioned studies attempt to integrate knowledge from general-purpose Web resources in a KG without, however, paying much attention to the validity of the information extracted from such resources. Moreover, they rely on the rather simplistic assumption that if two nodes are connected via any edge, then the two nodes are semantically related. We, on the other hand, try to iron out the noise or erroneous information that might exist in such Web resources before adding new knowledge to a KG.
The representation, as well as identification, of object-action relations has been the focus of interest of many studies in the field of cognitive robotics. In the projects KnowRob [11] and RoboSherlock [12], semantic correlation of physical entities is indeed captured, yet object-action relations are either learned exclusively through observed data, or captured in a problem-specific way. In [13], the authors integrate knowledge from ConceptNet in a KG. Given an object or action label, the authors construct subgraphs of ConceptNet with only two properties, in order to train a data-driven model, which can predict if an object is related to an action. Similar approaches are also used in RoboCSE [14], which uses embeddings to represent object and action labels and infer object-action relations based on the similarity of their vectors. Our proposed method exploits both semantically relevant and commonsense information captured in general-purpose repositories, which can complement and enrich the outcomes of the aforementioned studies.
The study of Zhou et al. [15] is more closely related to ours. The authors train a Long Short-Term Memory (LSTM) network to predict the path between two nodes of the ConceptNet graph. They collect, for a set of node pairs, the highest-quality paths, defining quality as the most natural set of edges that connects two given nodes. For instance, the path Lead […] In [16,17], a data-driven model predicts a path between two nodes of ConceptNet; the quality of a path is hand-coded by the authors. Our method, on the other hand, aims to determine the importance of a path through training, rather than through manual annotation. This has two benefits: it takes into account, to a larger extent, the structural and semantic characteristics of the underlying KG, and it is more adaptive to changes in the KG or the application domain.

Methodology
In this section, we formulate the problem and describe the different methods we evaluated. We classify the methods based on the information they utilize into Topology-based, Semantics-based and Embeddings-based. Topology-based methods exploit the structure of the graph, while semantics-based methods also take into account the types of relations connecting the two nodes. Embeddings-based methods use vector representations of graphs, potentially taking into account the structure of the graph, as well as the semantics of the node labels. Our novel Relation Pattern Method belongs to the semantics-based methods.

Problem Formulation
The problem we aim to solve is the following. Given a directed knowledge graph $G = (E, R)$, where $E$ is the set of nodes, corresponding to entities, and $R$ is the set of edges, corresponding to relations, and a pair of nodes $(e_1, e_2)$ with $e_1, e_2 \in E$, where $e_1$ represents an action and $e_2$ an object ($E$ may contain other types of nodes as well), find whether $e_1$ and $e_2$ are related. We consider the two nodes, $e_1$ and $e_2$, as related if the following question yields a positive answer: "Can the action $e_1$ be performed by/on the object $e_2$?". For instance, the question "Can the action Fold be performed by the object Knife?" should yield a negative answer.
Before presenting the methods we evaluated to solve the aforementioned problem, we first describe how we can create the graph $G$ from a given set of labels $L$ that refer to real-world objects or actions. We extract the object and action labels from the Something-Something Dataset, a dataset that is commonly used by the Machine Vision community (see Section 4.1 for more details). Note, however, that any set of object and action labels can be used to create $G$. For every label $l_i \in L$, we generate a graph $S_i$, by appending information relevant to $l_i$ from ConceptNet [1] and WordNet [2] in two steps. Then, we construct $G$ by unifying all $|L|$ graphs $S_1, \ldots, S_{|L|}$, i.e., every graph $S_i$ is a subgraph of $G$.
Step 1: For each object or action label, we search for a node with the same lemmatized label in the ConceptNet knowledge graph and extract a subgraph containing a subset of the properties found in ConceptNet that are considered relevant to the domain of interest. The subgraphs contain 2-hop paths from the object or action label. We consider all ConceptNet edge types except three: Causes and Desires, which, although seemingly relevant, are human-centric and describe the sentiments that an event causes in humans, and ExternalURL, so as not to append information from external resources other than WordNet.
Step 2: The next step is to insert context knowledge into the subgraph. We retrieve knowledge from WordNet by looking at the super-classes of each node in the subgraph created in Step 1; if any super-class of a node falls into a domain-specific category of super-classes, we keep the node in the graph, otherwise we delete it. The super-classes we consider are: […] We consider this specific set of classes following the findings of [18], which showed that almost all nodes in the WordNet graph that refer to a real-world object or action have at least one of these as a super-class. This pruning of nodes based on WordNet super-classes can give domain-specific concepts, e.g., when interested in household appliances. Figure 1 shows part of the subgraph for the label Knife. We highlight in red the node that is pruned in Step 2. After creating a graph for each object and action label, as described in Steps 1 and 2, we end up with a set of graphs $\{S_1, \ldots, S_n\}$, such that $S_i = (E_i, R_i)$ for $i = 1, \ldots, n$, where $E_i$ is the set of nodes and $R_i$ the set of edges in $S_i$. Thus, the final graph is defined as

$$G = (E, R) = \left( \bigcup_{i=1}^{n} E_i,\; \bigcup_{i=1}^{n} R_i \right)$$
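The two construction steps and the final union can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the `Subgraph` type, the `superclasses_of` lookup and the `allowed` set of super-classes are hypothetical stand-ins (the actual super-class list is not reproduced here), and the toy triples are invented for illustration.

```python
from typing import NamedTuple

class Subgraph(NamedTuple):
    nodes: frozenset  # entity labels
    edges: frozenset  # (head, relation, tail) triples

def prune(s, superclasses_of, allowed):
    """Step 2 sketch: keep a node only if one of its super-classes is allowed."""
    nodes = frozenset(n for n in s.nodes
                      if superclasses_of.get(n, set()) & allowed)
    # Drop every edge that touches a removed node.
    edges = frozenset(e for e in s.edges if e[0] in nodes and e[2] in nodes)
    return Subgraph(nodes, edges)

def unify(subgraphs):
    """Final graph: union of all node sets and union of all edge sets."""
    nodes, edges = set(), set()
    for s in subgraphs:
        nodes |= s.nodes
        edges |= s.edges
    return Subgraph(frozenset(nodes), frozenset(edges))

# Toy data (assumed, not taken from ConceptNet or WordNet).
superclasses_of = {"knife": {"artifact"}, "blade": {"artifact"},
                   "cut": {"act"}, "slice": {"act"}, "sadness": {"feeling"}}
allowed = {"artifact", "act"}  # hypothetical domain-specific super-classes

s_knife = Subgraph(
    frozenset({"knife", "cut", "blade", "sadness"}),
    frozenset({("knife", "UsedFor", "cut"),
               ("knife", "RelatedTo", "blade"),
               ("knife", "RelatedTo", "sadness")}))
s_cut = Subgraph(frozenset({"cut", "slice"}),
                 frozenset({("cut", "RelatedTo", "slice")}))

g = unify(prune(s, superclasses_of, allowed) for s in (s_knife, s_cut))
```

In this toy run the node `sadness` is pruned because none of its super-classes is in the allowed set, mirroring the role of the pruned node in the Knife subgraph.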

Topology-based Methods
We apply the two most commonly used methods proposed in the relevant literature [9,10] that exploit the topology of a graph, in order to infer the extent to which two nodes are related.
The Connecting Paths Method takes into consideration each sequence of edges that begins from the object node and reaches the action node after a finite number of steps, or vice versa. The authors omit paths that contain loops, but do not take into account the type of edges a path contains. Given two subgraphs $S_1$ and $S_2$, as described in Section 3.1, corresponding to an object node and an action node respectively, the connectPath metric for $S_1$ and $S_2$ is defined as

$$\mathrm{connectPath}(S_1, S_2) = \frac{|C_1 \cup C_2|}{|P_1 \cup P_2|} \tag{1}$$

A. Vassiliades et al. / Object-Action Association Extraction from Knowledge Graphs
where $C_1$ is the set of paths that start from the object node and reach the action node, $C_2$ is the set of paths that start from the action node and reach the object node, $P_1$ is the set of all paths that start from the object node, and $P_2$ the set of all paths that start from the action node. Since $(C_1 \cup C_2) \subseteq (P_1 \cup P_2)$, it follows that $0 \le \mathrm{connectPath} \le 1$.
Example 1 Let $S_{knife}$ be the subgraph for the object node knife and $S_{fold}$ be the subgraph for the action node fold, and let $S_{knife}$ have two paths that start from the node knife, […]

Some recent studies that apply this method, as is or with small variations, are [16,4,5,15]. In fact, they also focus on inferring object-action relations ([16,5]) and on object identification ([4,15]).
The Common Nodes Method divides the number of common nodes by the number of total nodes in two given graphs. Two nodes are considered common when they refer to the same entity in ConceptNet, i.e., the nodes have the same label. Duplicate nodes are cleared, allowing only one occurrence of each node. The commonNodes metric between two subgraphs $S_1$ and $S_2$ is defined as

$$\mathrm{commonNodes}(S_1, S_2) = \frac{|E_1 \cap E_2|}{|E_1 \cup E_2|} \tag{2}$$

where $E_i$ is the set of nodes in $S_i$. Essentially, the commonNodes metric between two graphs is the Jaccard similarity of the sets of nodes in these graphs. Example 2 shows how the commonNodes metric works.

Example 2 The commonNodes metric for the subgraphs $S_{knife}$ and $S_{fold}$ is

$$\mathrm{commonNodes}(S_{knife}, S_{fold}) = \frac{|E_{knife} \cap E_{fold}|}{|E_{knife} \cup E_{fold}|}$$

where $E_{knife}$ is the set of nodes in the $S_{knife}$ subgraph, and $E_{fold}$ is the set of nodes in the $S_{fold}$ subgraph.

Recent studies that apply this method, as is or with small variations, are [4,19,20]. The focus is on object identification and on finding the similarity of two nodes in a knowledge graph.
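Because the commonNodes metric is the Jaccard similarity of two node sets, it can be sketched in a few lines of Python. The node labels below are invented for illustration and are not the actual contents of the knife and fold subgraphs.

```python
def common_nodes(nodes_1, nodes_2):
    """Jaccard similarity of two node-label collections (duplicates collapse)."""
    e1, e2 = set(nodes_1), set(nodes_2)
    union = e1 | e2
    return len(e1 & e2) / len(union) if union else 0.0

# Toy node sets (assumed).
e_knife = ["knife", "blade", "cut", "kitchen", "cut"]  # duplicate "cut" is cleared
e_fold = ["fold", "paper", "cut"]
score = common_nodes(e_knife, e_fold)
```

The two toy sets share one label out of six distinct labels, giving a score of 1/6.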

Semantics-based Methods
As semantics-based methods, we consider the very popular WUP similarity and present a novel Relation Pattern Method, which exploits the pattern of connections in a KG to infer whether two nodes are semantically related.
The Wu Palmer Similarity (WUP) uses the acyclic graph of WordNet to calculate relatedness by considering the depth of two nodes in the WordNet taxonomies, along with the depth of their LCS (Least Common Subsumer). Given two nodes from the WordNet acyclic graph, the LCS of these nodes is their most specific common ancestor. The score can never be zero because the depth of the LCS is never zero (the depth of the root of the taxonomy is one). This metric calculates the similarity based on how close the nodes are to each other in the WordNet acyclic graph. The WUP similarity between an object node ($n_o$) and an action node ($n_a$) is defined as

$$\mathrm{WUP}(n_o, n_a) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(n_o, n_a))}{\mathrm{depth}(n_o) + \mathrm{depth}(n_a)} \tag{3}$$

where depth(·) is the depth of an entity in the WordNet graph.

Example 3
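The WUP similarity for the object knife and the action fold is

$$\mathrm{WUP}(knife, fold) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(knife, fold))}{\mathrm{depth}(knife) + \mathrm{depth}(fold)}$$

Many studies use the WUP similarity in a wide spectrum of domains. Recent studies, such as [5,18], use WUP scores to infer object-action relations and to identify objects.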
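To make the WUP computation concrete, the sketch below evaluates it on a small hand-made taxonomy. This is a simplification under stated assumptions: the `parent` map is a toy single-parent tree rather than WordNet (where a node may have several hypernyms and nouns and verbs live in separate taxonomies), and the depths are therefore illustrative only.

```python
def ancestors(parent, node):
    """Chain from node up to the root, node first."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def depth(parent, node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(parent, node))

def wup(parent, a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)) on a single-parent taxonomy."""
    shared = set(ancestors(parent, b))
    # First shared node on the way up from a is the most specific common ancestor.
    lcs = next(n for n in ancestors(parent, a) if n in shared)
    return 2 * depth(parent, lcs) / (depth(parent, a) + depth(parent, b))

# Toy taxonomy (assumed): child -> parent, with "entity" as the root.
parent = {"artifact": "entity", "tool": "artifact", "knife": "tool",
          "spoon": "tool", "act": "entity", "fold": "act"}

knife_spoon = wup(parent, "knife", "spoon")
knife_fold = wup(parent, "knife", "fold")
```

In the toy tree, knife and spoon meet at `tool` (depth 3), while knife and fold only meet at the root (depth 1), so the second score is much lower but still non-zero, as noted above.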
Our proposed method, the Relation Pattern Method, is based on the assumption that some of the paths connecting two nodes carry more semantically relevant information than others. For instance, the path object node […] of the ConceptNet relations that we considered more relevant to the problem at hand; yet, this subset can change according to the context of the problem. An important aspect of our proposed method is the flexibility in deciding how to deal with the noise that exists in the data.
A relation pattern is any connecting path that is composed of at least one of the aforementioned relations, except for paths that only contain the relation Synonym. The latter are omitted, to avoid connecting an object and an action node having similar labels. In the end, 155 different relation patterns were produced; whenever a relation pattern is found between an object and an action node in the KG, we consider it as an indication that the two nodes are associated. If $P = \{pattern_1, \ldots, pattern_{155}\}$ is the set of all relation patterns, then, for each $pattern_i \in P$, the goal is to assign a weight of importance $W_{pattern_i}$, in order to specify how confident we are that the given pattern produces correct associations. For instance, in the next section, we assign the weights based on how well each pattern performs in our training data. Of course, other heuristics can be used instead.
Since it is reasonable to consider more than one pattern before reaching a conclusion about the relation between two labels, one can group patterns together, based on their performance, their domain-specific relevance, or other criteria. For quantifying the performance of a cluster $W_C$, one can consider, for instance, the weighted sum of the weights of each individual pattern, the max or min of these weights, or other heuristics-based metrics. In our evaluation, we adopt an even simpler approach as the baseline case, namely to treat all patterns with weight above a given threshold as equally relevant.
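A minimal Python sketch of how relation patterns and the baseline cluster rule might be represented follows. It rests on assumptions: a pattern is modelled as the ordered tuple of relation labels along a connecting path, edge direction is ignored, and the weights in `weights` are invented for illustration (the 0.1 threshold is the one used for the baseline clusters in the evaluation).

```python
def relation_pattern(path):
    """Ordered relation labels along a path of (head, relation, tail) edges."""
    return tuple(rel for _head, rel, _tail in path)

def is_relation_pattern(pattern):
    """Reject empty paths and paths made only of Synonym edges."""
    return bool(pattern) and any(rel != "Synonym" for rel in pattern)

def cluster_says_related(found, weights, threshold=0.1):
    """Baseline rule: a single found pattern with weight above the threshold suffices."""
    return any(weights.get(p, 0.0) > threshold for p in found)

# Toy connecting path and made-up weights (assumed).
path = [("knife", "RelatedTo", "blade"), ("blade", "UsedFor", "cut")]
pattern = relation_pattern(path)
weights = {("RelatedTo", "UsedFor"): 0.62, ("Synonym", "RelatedTo"): 0.05}

related = cluster_says_related([pattern], weights)
```

Swapping `cluster_says_related` for a weighted sum, max or min over the found patterns would give the other cluster metrics mentioned above.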

Embeddings-based Methods
Recently, there has been a surge of interest in the field of KG embeddings for the task of link prediction [21]. Studies following this methodology represent nodes of a KG as vectors in a latent space, which are generated by taking into account both the textual and structural features of those nodes. The textual features are considered by acquiring the word and sentence embeddings of node labels, from word embeddings that have been pre-trained on large document corpora, such as Wikipedia. The structural features consider, typically in an iterative way, the node embeddings of each node's neighbors in a KG.
For example, AllenAI-CommonSense [22], which constitutes the state of the art for link prediction in ConceptNet, employs a pre-trained BERT [23] model that is fine-tuned on ConceptNet, using Graph Convolutional Networks (GCN) [24] for embedding the ConceptNet graph. This model returns a list of possible relations between a given pair of ConceptNet nodes, ranked in descending order of likelihood (aka confidence score).
We can use the results of this system in two different ways for our purposes: a) we can either consider the confidence score returned by AllenAI-CommonSense for a specific relation (e.g., ReceivesAction), given two query nodes from our graph G, or b) we can consider that the answer to the problem formulated in Section 3.1 is positive for two query nodes, when the relation "ReceivesAction" is within the top answers for those query nodes.
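The two ways of reading the model output can be sketched as follows. The sketch assumes that the output is available as a list of (relation, confidence) pairs; this format, the function names and the scores are illustrative assumptions and not the actual interface of AllenAI-CommonSense.

```python
def score_for(ranked, relation="ReceivesAction"):
    """Option (a): confidence of one relation, 0.0 if it was not returned."""
    return dict(ranked).get(relation, 0.0)

def in_top_k(ranked, k, relation="ReceivesAction"):
    """Option (b): positive when the relation is among the k most likely answers."""
    top = sorted(ranked, key=lambda item: item[1], reverse=True)[:k]
    return any(rel == relation for rel, _score in top)

# Made-up ranking for one (object, action) query pair.
ranked = [("UsedFor", 0.41), ("RelatedTo", 0.22),
          ("ReceivesAction", 0.09), ("AtLocation", 0.03)]

confidence = score_for(ranked)
positive_at_3 = in_top_k(ranked, 3)
positive_at_1 = in_top_k(ranked, 1)
```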

Evaluation
In this section, we first describe how we created the ground truth from the Something-Something Dataset and then we discuss the experimental setup and the results that each method achieved.

Data Collection
Rather than using a random set of action and object labels, aiming to achieve an adequate coverage of entities for the household domain, we decided to extract the set of labels for our evaluation from the Something-Something Dataset. Something-Something consists of a large collection of short video clips (more than 220k) containing actions performed on and with common household objects. The actions involve either one type of object (e.g., opening a bottle) or two distinct types of objects (e.g., putting coins inside a box). Due to its vast number of sample videos, the Something-Something Dataset has become a de facto benchmark for the assessment of systems addressing the task of action recognition. The dataset provides for each clip a short description that contains action and object(s) labels.
Ground Truth Creation: From Something-Something we initially extracted 247 object labels and 35 action labels, which produced 8,645 object-action pairs. We replaced all object labels in plural form with their singular form; for example, notes was replaced with note. Then, we removed certain object and action labels that we did not consider context related. Next, for the remaining action and object labels, we issued a query to the ConceptNet KG using the ConceptNet Web API, in order to identify which labels are indeed part of the graph. We ended up with 148 object labels and 25 action labels. Since some actions share a label with some objects (3 in total), we renamed these labels as follows: (a) pile → pileO and pile → vpile, (b) stack → stackO and stack → vstack, and (c) cover → coverO and cover → vcover, to refer to the object and action label, respectively.
Eventually, 3,700 object-action relations were kept in total. Those pairs that existed in the description of at least one video in the Something-Something Dataset were automatically characterized as positive pairs. The remaining pairs were manually annotated, in order to determine whether they are negative or whether they are positive but no clip in the dataset happened to refer to them. In the end, 1,965 positive and 1,735 negative object-action relations were produced, forming our ground truth.
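The pairing and automatic labelling step can be sketched in Python as below. The labels and the `seen_in_videos` set are toy stand-ins, and `split_pairs` is a hypothetical helper name; the manual annotation of the remaining pairs is not modelled.

```python
from itertools import product

def split_pairs(objects, actions, seen_in_videos):
    """Pairs seen in a clip description are positive; the rest need annotation."""
    auto_positive, needs_annotation = [], []
    for pair in product(objects, actions):
        if pair in seen_in_videos:
            auto_positive.append(pair)
        else:
            needs_annotation.append(pair)
    return auto_positive, needs_annotation

# Toy labels (assumed); the real sets have 148 objects and 25 actions.
objects = ["knife", "paper", "bottle"]
actions = ["cut", "fold"]
seen_in_videos = {("paper", "fold"), ("knife", "cut")}

auto_positive, needs_annotation = split_pairs(objects, actions, seen_in_videos)
```

The same Cartesian product explains the reported totals: 247 × 35 = 8,645 initial pairs and 148 × 25 = 3,700 retained pairs, which split into 1,965 positive and 1,735 negative relations.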

Experimental Setup
The evaluation of the methods described in Section 3 was performed using 10-fold cross-validation over the 3,700 positive and negative relations described in Section 4.1. We used scikit-learn to split our data into 10 folds. Each fold contained 370 relations, 52% of which were positive and 48% negative, which reflects the distribution of the relations in the original dataset.
Each iteration of the 10-fold cross-validation process was used to train the different models. Specifically, for the Connecting Paths Method, the WUP, and the Common Nodes Method, the training folds helped specify the optimal threshold for each method that maximizes the F1 score. For the Relation Pattern Method, the training phase helped compute the weight of importance $W_{pattern_i}$ of each relation pattern $pattern_i \in P$, as described in Section 3.3. During testing, we measured the performance of each method with the given thresholds and weights. Patterns that performed poorly during training were omitted completely.
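For the threshold-based methods, the training step amounts to a one-dimensional search, which the following Python sketch illustrates. It assumes that candidate thresholds are the scores observed in the training folds and that a score equal to the threshold counts as positive; the text does not specify either detail, and the scores and labels are toy values.

```python
def f1_score(predicted, labels):
    """F1 from boolean predictions and boolean ground-truth labels."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Candidate threshold with the highest training F1 (lowest one on ties)."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: f1_score([s >= t for s in scores], labels))

# Toy training scores (e.g., WUP values) and ground-truth labels (assumed).
train_scores = [0.1, 0.2, 0.6, 0.7, 0.9]
train_labels = [False, True, False, True, True]

threshold = best_threshold(train_scores, train_labels)
```

The selected threshold would then be applied unchanged to the held-out fold.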
We characterize the results as True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) according to the following definitions:
• TP is when a pair of object-action nodes ($n_o$, $n_a$) is related in the ground truth and also achieves a score above the threshold (for the threshold-based methods) or the pattern under consideration connects node $n_o$ with $n_a$ (for the pattern-based methods)
• FP is when a pair of object-action nodes ($n_o$, $n_a$) is not related in the ground truth, but achieves a score above the threshold (for the threshold-based methods) or the pattern under consideration connects node $n_o$ with $n_a$ (for the pattern-based methods)
• TN is when a pair of object-action nodes ($n_o$, $n_a$) is not related in the ground truth and also achieves a score below the threshold (for the threshold-based methods) or the pattern under consideration does not connect node $n_o$ with $n_a$ (for the pattern-based methods)
• FN is when a pair of object-action nodes ($n_o$, $n_a$) is related in the ground truth, but achieves a score below the threshold (for the threshold-based methods) or the pattern under consideration does not connect node $n_o$ with $n_a$ (for the pattern-based methods)

Finally, note that we define the weight of importance $W_{pattern_i}$ for $pattern_i$ as the harmonic mean between precision $P$ and recall $R$ (Equation 4):

$$W_{pattern_i} = \frac{2 \cdot P \cdot R}{P + R} \tag{4}$$

In other words, it indicates how reliably this relation pattern identifies related object-action pairs, balancing precision against recall.

Additionally, we evaluated the embeddings-based method AllenAI-CommonSense (top-k) over all ground truth object-action pairs, both positive and negative, by testing whether the relation "ReceivesAction" was within the top-k results for each pair, for k ∈ {1, 3, 5}. If "ReceivesAction" is within the top-k results, we consider this as a predicted positive pair, otherwise, a predicted negative, and follow the same conventions (TP, FP, TN, FN) as described above.
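The pattern-based case of these definitions, together with the weight of importance as the harmonic mean of precision and recall, can be sketched in Python as follows. The ground-truth pairs and the set of pairs connected by the pattern are toy values chosen only to exercise all four outcomes that occur.

```python
def confusion(connected, truth):
    """Count (TP, FP, TN, FN) for one pattern over the ground-truth pairs."""
    tp = fp = tn = fn = 0
    for pair, related in truth.items():
        hit = pair in connected  # the pattern connects this object-action pair
        if related and hit:
            tp += 1
        elif not related and hit:
            fp += 1
        elif not related:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def pattern_weight(connected, truth):
    """Weight of importance: harmonic mean of precision and recall."""
    tp, fp, _tn, fn = confusion(connected, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy ground truth and toy pattern matches (assumed).
truth = {("knife", "cut"): True, ("knife", "fold"): False,
         ("paper", "fold"): True, ("paper", "cut"): True}
connected = {("knife", "cut"), ("knife", "fold")}

counts = confusion(connected, truth)
weight = pattern_weight(connected, truth)
```

With precision 1/2 and recall 1/3 in this toy case, the weight is 0.4.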
We note that another variation of this approach would have been to restrict the predicted results to those having a confidence score above a predefined threshold. However, our experiments showed that this method performs best when this minimum confidence threshold is 0 (confidence scores are extremely low in too many cases), so we do not report numbers for this variation.

Table 1 summarizes the overall performance measures for each method. Although the differences among the first three popular approaches are small, the WUP similarity seems to achieve higher scores both in terms of accuracy (.555) and of F1 score (.696). We also see that there are patterns that achieve similar or better scores in one measure or the other, but not in both (the weight of importance coincides with the F1 score). Due to the plurality of relation patterns, we display only the Top-20 relation patterns. The AllenAI-CommonSense (top-k) methods, despite their high accuracy, underperform in F1 scores, compared to the other methods. This is due to a considerable difference noticed in the accuracy for positive pairs (.19) with respect to that for negative pairs (.854).

Results
An investigation of the figures for the Relation Pattern method reveals some interesting insights, not easily detectable with the other methods. First of all, we can see that at least one occurrence of the relation RelatedTo exists in almost all relation patterns. This is because, although not explicitly stated in the ConceptNet documentation, RelatedTo plays the role of a super-property, i.e., it subsumes the other relations. While one would expect that less abstract relations among nodes, such as UsedFor, would produce better results, this is not the case. This conclusion reflects, to some extent, the quality of data in ConceptNet and provides hints as to where there exists room for data cleaning.
We also observe that certain longer paths, such as RelatedTo ←→ […], achieve better performance than shorter paths involving the same type of relations, e.g., […]. This might seem odd at first, as one would expect that the closer two nodes are in the graph, the more tightly semantically related they would be. This finding is probably owed to the nature of our problem. In contrast to entity resolution, for instance, the nodes whose association we try to find are of different types, namely object and action.
Of course, such similar paths obtain practical meaning if considered as a group, rather than as individuals. For this reason, Table 1 also reports indicatively the performance of two clusters of patterns, one that is composed only of RelatedTo and Synonym relations, and one composed of UsedFor and Synonym relations. We adopt a simplistic approach in deciding what the answer of a cluster is: any pattern above a threshold is considered relevant. As such, even if a single pattern is found in the graph, the corresponding object-action pair is considered related. As we only wish to measure a baseline case, we set a rather generous threshold for including patterns in the cluster, namely any pattern with weight above 0.1. More elaborate methods can of course be implemented, e.g., by taking into consideration the weight of importance among the patterns of each cluster or by utilizing domain-specific criteria. Yet, even with this baseline, we notice that clustering paths can produce improved state-of-the-art F1 scores (.702). By ignoring the relative importance of each individual pattern though, we end up introducing noise, as shown in the precision scores if compared to the best performing patterns, an aspect that a more advanced method could eliminate.
Overall, probably the most important advantage of our proposed method, beyond its prominent performance, is the flexibility in deciding how to deal with noise in the data. By carefully choosing which patterns to trust, one can decide where to focus when importing new data. Such an adaptive behavior is not offered by the other methods, such as data-driven models, which are more vulnerable to noisy data, due to the domain-agnostic way of treating the KG.

Conclusion
In this paper, we presented a novel method for extracting and evaluating relations between objects and actions from KGs, such as ConceptNet and WordNet. We compared our method with popular approaches proposed in the relevant literature, such as methods that exploit the topology or the semantic information in a KG, and embeddings. Our method can improve state-of-the-art performance in terms of F1 scores. Its most important advantage, however, beyond its very good performance, is the flexibility in finding and adapting to the noise in the data. In the future, we plan to integrate knowledge from other commonsense knowledge graphs, such as ATOMIC [25,26], and to evaluate our methods on other types of relations, such as those between an object and a state, and causal relations (i.e., in which states the object can be before and after we perform an action on it).