Enhancing the spaCy Named Entity Recognizer for Crowdsensing

Fernández-Pedauye, Julio; Periñán-Pascual, Carlos; Arcas-Túnez, Francisco; Cecilia, José M.

doi:10.3233/AISE200061

Abstract

Social sensing leverages user-contributed data from social media by treating participants as “social sensors”, i.e. agents that provide information about their environment through social-media services such as Twitter, Facebook or Instagram. Social sensors can complement physical sensors because (1) they can explain why or how specific events occurred, and (2) they offer an alternative source when physical sensors malfunction or a sensor network cannot be afforded. However, one of the main challenges for social sensors is determining where a particular event occurred. Social-media services rely on user preferences to geolocate posts, which is not a widespread practice and therefore limits the success of these techniques as early warning systems. In this paper, we analyze the spaCy named entity recognizer (NER), an open-source tool widely used by the community, to identify named entities in Spanish microtexts taken from social networks. The spaCy NER is based on artificial neural networks, and our preliminary results show that further training is needed to increase its accuracy. Indeed, it is well known that supervised methods are domain dependent, so their performance tends to decrease when the target documents come from a domain different from that of the training dataset. To address this, a training tool has been designed to automatically generate datasets suitable for training the spaCy NER with Twitter-based microtexts in Spanish. Using the dataset generated by this tool, the spaCy NER reaches an F-score of 0.7, outperforming by a wide margin the same tool trained on classic datasets such as AnCora, WikiNER or CoNLL.
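
The abstract does not describe the format of the generated datasets. As a rough illustration only, the sketch below builds one training example in the character-offset form commonly used for spaCy NER training, `(text, {"entities": [(start, end, label), ...]})`. The helper name `to_spacy_example`, the sample microtext and the `LOC` label are assumptions made for this example and are not taken from the paper or its tool.

```python
from typing import Dict, List, Tuple

# Hypothetical helper, not the authors' training tool: it turns a microtext
# plus a list of (surface form, label) mentions into a character-offset
# annotation of the kind spaCy NER training data typically uses.
def to_spacy_example(
    text: str, mentions: List[Tuple[str, str]]
) -> Tuple[str, Dict[str, List[Tuple[int, int, str]]]]:
    entities = []
    cursor = 0  # search left to right so repeated surface forms map in order
    for surface, label in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            continue  # mention not present in the text; skip it
        end = start + len(surface)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


# Illustrative Spanish microtext (invented for this sketch).
example = to_spacy_example(
    "Inundaciones en Murcia tras el desbordamiento del río Segura",
    [("Murcia", "LOC"), ("río Segura", "LOC")],
)
```

Each offset pair can be checked by slicing the text, e.g. `example[0][16:22]` gives `"Murcia"`. A real dataset generator would also need to handle tokenization boundaries and overlapping mentions, which this sketch ignores.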
