OneSeC

In this work we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning.

Abstract

The well-known problem of knowledge acquisition is one of the biggest issues in Word Sense Disambiguation (WSD), where annotated data are still scarce in English and almost absent in other languages. In this paper we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning. Our automatically generated data consistently lead a supervised WSD model to state-of-the-art performance when compared with other automatic and semi-automatic methods. Moreover, our approach outperforms its competitors in multilingual and domain-specific settings, where it beats the existing state of the art on all languages and most domains.

References

Just “OneSeC” for Producing Multilingual Sense-Annotated Data
Bianca Scarlini, Tommaso Pasini and Roberto Navigli
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 1-3 August 2019.

Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains
Bianca Scarlini, Tommaso Pasini and Roberto Navigli
Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 2020.

Authors

Bianca Scarlini

PhD Student @ Sapienza

scarlini [at] di.uniroma1.it

Tommaso Pasini

Postdoc @ Sapienza

pasini [at] di.uniroma1.it

Roberto Navigli

Full Professor @ Sapienza

navigli [at] di.uniroma1.it

Download ACL Data in 5 languages

Data are available for English, Italian, Spanish, French and German, covering ~ 3000 nominal lemmas (321 MB - compressed tar.gz).

Download LREC Data in 5 languages and 5 domains in English

Data are available for English, Italian, Spanish, French and German, and for 5 semantic domains in English, covering ~ 80000 nominal lemmas (18 GB - compressed tar.gz).
Data can also be downloaded separately: EN, IT, ES, FR, DE, domains.