MIDAS: Migrating Information from Different Annotated Sets – a CLEOPATRA demonstrator

This is the first of a series of blog posts discussing different aspects of the CLEOPATRA Research and Development week, which was held online at the end of March. The project ESRs organized themselves into groups to develop demonstrators, and this is the report from Group 3, written by Gabriel Maia.

Group 3, MIDAS (Migrating Information from Different Annotated Sets), proposed uniting named entity recognition, sentiment analysis of citations, and event extraction under a single text-annotation umbrella.

The goal was to create an API that could read and annotate text, with a focus on under-resourced European languages. The text could be sent directly in the request to the API, or provided as a URL or a text file, and the API would return a structured object containing the text that had been read, along with annotations for:

  1. Named entities, together with their type according to a newly developed unified classification hierarchy, the Universal Named-Entity Recognition Framework (UNER). UNER is inspired by the work of Sekine [2] and consists of a three-level hierarchy of named-entity classes, applied uniformly across all supported languages.
  2. Sentiment analysis of citations and quotations found in the text, including whether they have a positive or negative connotation.
  3. Event triggers and arguments such as location, time and participants, extracted following the ACE 2005 [4] definition of events.
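The structured object combining these three layers can be illustrated with a small mock-up. All field names, spans, and label strings below are assumptions for illustration, not the actual MIDAS schema:

```python
def annotate(text: str) -> dict:
    """Mock annotator returning the three layers MIDAS would combine.

    The values are hard-coded for demonstration; a real system would run
    NER, citation-sentiment, and event-extraction models over the text.
    """
    return {
        "text": text,
        "entities": [
            # UNER-style three-level type path (illustrative labels)
            {"span": (0, 8), "type": "Name/Location/City"},
        ],
        "citations": [
            # toy values: span of the quoted material plus its polarity
            {"span": (19, len(text)), "sentiment": "negative"},
        ],
        "events": [
            # ACE-style trigger and arguments (illustrative)
            {"trigger": "said", "arguments": {"participant": "officials"}},
        ],
    }

resp = annotate("Sarajevo officials said the budget would shrink.")
```
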

The resulting dataset would thus be annotated in a three-fold manner, optimising for recall. For this, we would use a stack of pre-existing models pre-trained on English, taking the union of their outputs and resolving annotation conflicts by giving priority to the models with the highest precision scores. The corpus chosen was SETimes [3], which consists of parallel corpora covering English and many south-eastern European languages.
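The conflict-resolution idea can be sketched as follows: union the spans proposed by all models (raising recall), and where two spans overlap, keep the one from the model with the higher precision score. The model names, spans, and scores below are invented for illustration:

```python
def merge_annotations(model_outputs: dict, precision: dict) -> list:
    """Union spans from several models; on overlap, keep the span proposed
    by the model with the highest precision score.

    model_outputs: {model_name: [(start, end, label), ...]}
    precision:     {model_name: float}
    """
    merged = []
    # Consider higher-precision models first so their spans win conflicts.
    for model in sorted(model_outputs, key=precision.get, reverse=True):
        for start, end, label in model_outputs[model]:
            overlaps = any(start < e and s < end for s, e, _ in merged)
            if not overlaps:
                merged.append((start, end, label))
    return sorted(merged)

outputs = {
    "ner_a": [(0, 8, "Name/Location/City")],
    "ner_b": [(0, 8, "Name/Organization"), (20, 26, "Time/Date")],
}
scores = {"ner_a": 0.92, "ner_b": 0.85}
merged = merge_annotations(outputs, scores)
# ner_a wins the overlapping span; ner_b's non-conflicting span is still
# kept, which is how the union step raises recall.
```
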

Next, the dataset would be passed through a crowd-sourcing phase, this time optimising for precision. The crowd workers would not be able to propose new tags; instead, they would judge whether the existing tags were correct. Specifically, they would be able to:

  • Remove a tag if they deemed it erroneous;
  • Adjust the span of a tag if needed;
  • Correct the typing of a tag if needed.

This would give us a curated annotated dataset for the English SETimes, which we would then use to propagate tags across the other languages in SETimes. The end result would be parallel corpora annotated for named entities, sentiment, and events, including many under-resourced languages. These corpora could then be used to train annotating models.
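Tag propagation across a parallel corpus can be sketched as projecting each annotated English token onto the target-language sentence through a word alignment. This is a simplified illustration assuming a token-level alignment is already available (e.g. from a tool such as fast_align); the real pipeline was never built:

```python
def project_tags(src_tags: dict, alignment: list) -> dict:
    """Project token-level entity tags through a word alignment.

    src_tags:  {source_token_index: label}
    alignment: list of (source_index, target_index) pairs
    Returns    {target_token_index: label}
    """
    tgt_tags = {}
    for src_i, tgt_i in alignment:
        if src_i in src_tags:
            tgt_tags[tgt_i] = src_tags[src_i]
    return tgt_tags

# English: "Sarajevo hosted talks" -> target-language word order differs.
src = {0: "Name/Location/City"}   # token 0 ("Sarajevo") is tagged
aligned = [(0, 1), (1, 2), (2, 0)]  # toy word alignment
print(project_tags(src, aligned))   # {1: 'Name/Location/City'}
```

In practice, span boundaries and one-to-many alignments would need more care than this token-for-token mapping suggests.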

Through the R&D Week, it became clear to us that our project and approach suffered from a number of issues:

  1. The project did not adhere to a single clear topic; instead it meshed together three different pipelines, each with its own topic;
  2. It was not clearly defined what the end goal would be: the dataset, the trained models, or the methodology;
  3. The crowdsourcing step was too ambitious: producing a dataset large enough to train models would require too much funding, and even more so for annotating the whole of SETimes, whether the end result was a model or a dataset.

We have thus used this opportunity to build an API structure which we can still use moving forward, but have taken a step back to re-evaluate the project and what we feasibly want it to be. We have decided to focus it on Named Entity Recognition for under-resourced languages by leveraging resources such as Wikipedia and DBpedia [1], instead of depending on SETimes and crowdsourcing.

Gabriel Maia, King’s College London

References:

  1. DBpedia. https://wiki.dbpedia.org/ (2019). Accessed 28/02/2020.
  2. Sekine, S.: The Definition of Sekine's Extended Named Entities. https://nlp.cs.nyu.edu/ene/version7_1_0Beng.html (July 2007). Accessed 28/02/2020.
  3. Tyers, F.M., Alperen, M.S.: South-East European Times: A parallel corpus of Balkan languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pp. 49–53 (2010).
  4. Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia, 57 (2006).