ESR secondments at Tilde

One of the tasks of the CLEOPATRA project is to introduce early stage researchers (ESRs) to novel language technologies through secondments in industrial environments. For this purpose, two ESRs from the University of Zagreb visited the Tilde company to learn more about AI-powered machine translation solutions and terminology services. Both researchers were involved in research and development activities related to the challenging task of speech-to-speech translation. Their direct supervisor was Mārcis Pinnis, Chief AI Officer at Tilde.

Diego Alves’s PhD studies are devoted to the research and development of natural language processing tools and resources for under-resourced languages. During his stay at Tilde, he was introduced to its machine translation technologies and, together with Tilde’s researchers, worked on data augmentation methods that make machine translation systems robust to the errors produced by speech recognizers. He reported his results in several research seminars at Tilde. Research results achieved during the secondment were presented at the Baltic HLT 2020 conference and summarized in the publication: Diego Alves, Askars Salimbajevs and Mārcis Pinnis. 2020. Data Augmentation for Pipeline-Based Speech Translation. Frontiers in Artificial Intelligence and Applications, volume 328: Human Language Technologies – The Baltic Perspective, 73-79.

The research activities of Gaurish Thakkar are related to cross-lingual sentiment analysis. Gaurish arrived in Riga in early spring and spent several weeks at the Tilde office with Tilde’s researchers. Later, when strong COVID-19 restrictions were introduced, he continued his secondment in Riga virtually, with regular meetings with his supervisor and employees at Tilde. Research results achieved during the secondment were presented at the Baltic HLT 2020 conference and summarized in the publication: Gaurish Thakkar and Mārcis Pinnis. 2020. Pretraining and Fine-Tuning Strategies for Sentiment Analysis of Latvian Tweets. Frontiers in Artificial Intelligence and Applications, volume 328: Human Language Technologies – The Baltic Perspective, 55-61.

Tilde is a leading European language technology company. It provides custom machine translation systems, online terminology and knowledge management services, mobile applications, intelligent virtual assistants, speech processing (analysis and generation) solutions, and proofing tools. Tilde has particular expertise in developing high-demand cloud-based and desktop language technology solutions for complex, highly inflected languages, particularly smaller European languages. Tilde’s services are used by the European Commission and by international companies such as IBM and Microsoft.

Inguna Skadiņa, Chief Scientific Officer, Tilde

Using the UK Web Archive to study Olympic legacy

Caio Mello’s secondment to the British Library, to work with the team at the UK Web Archive, has resulted in the publication of three blog posts discussing how he has used the archived web to study the topic of Olympic legacy. The first considers the question ‘What is left behind? Exploring the Olympic Games legacies through the UK Web Archive’, while the second asks ‘Boris Johnson, fertility and the royal baby: how far does the concept of Olympic Legacy go?’. The final post in the series presents Caio’s research with Daniela Major, ‘Exploring media events with Shine’, the prototype search interface for the archive of the .uk country code Top Level Domain from 1996 to 2013. Web archives are a key primary source for the history of the late 20th and early 21st centuries, and Caio’s blog posts help to demonstrate their value for the study of transnational sporting events such as the Olympic Games.


TIME: Temporal discourse analysis applied to media articles

This is the third of a series of blog posts discussing different aspects of the CLEOPATRA Research and Development week, which was held online at the end of March. The project ESRs organized themselves into groups to develop demonstrators, and this is the report from Team TIME (Gullal Singh Cheema, Daniela Major, Caio Mello and Abdul Sittar).

Working in an interdisciplinary project is an effort as rewarding as it is challenging.

With a group like ours, made up of computer scientists and social science researchers, communication and a readiness to listen to and learn from your colleagues are key. We soon realised that our differences, as well as our shared interests, pushed us to find ways of combining the methodologies of the social sciences with the analytical tools of computer science.

This is how we arrived at TIME: Temporal Discourse Analysis applied to Media Articles. During the weeks preceding the R&D week, we defined research questions and thought about the best ways to answer them. The social scientists in the group were especially interested in analysing media texts on two topics: the concept of Olympic legacy and the concept of Euroscepticism. The choice of media outlets also followed the logic of our research questions: a comparative approach was a priority for both topics. For Olympic legacy, we chose to scrape data on the Rio and London Olympics from both Brazilian and British media; for Euroscepticism, our choice fell on English and Spanish media coverage.

One of the main challenges was to build a tool that could answer our questions about the data we collected. However, any tool we built had to be able to answer more than just the questions we wanted to ask. After all, an analytical tool should meet the demands of many researchers, not just a few. The effort, therefore, was to think outside the box and create something that could be used not only by our group, but by a wide array of people, from academics to the general public.

The tool we created offers a graphical visualisation of the concepts as employed by each media outlet. We hope that in the near future this cross-temporal tool can be fed with more search terms and more media outlets, so as to support a wider range of analyses.
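At its core, a cross-temporal tool of this kind counts how often a concept appears in each outlet per time period and plots the resulting series. A minimal sketch of that counting step is below; the outlet names and article structure are illustrative assumptions, not the team's actual data model.

```python
from collections import Counter

def term_frequency_by_month(articles, term):
    """Count mentions of a term per (outlet, month) across a set of articles.

    articles: iterable of dicts with 'outlet', 'date' (YYYY-MM-DD) and 'text'.
    Returns a Counter keyed by (outlet, 'YYYY-MM').
    """
    counts = Counter()
    for article in articles:
        month = article["date"][:7]  # truncate the ISO date to year-month
        counts[(article["outlet"], month)] += article["text"].lower().count(term.lower())
    return counts

# Illustrative input: one British and one Brazilian article.
articles = [
    {"outlet": "UK-paper", "date": "2012-08-01",
     "text": "Olympic legacy debated; legacy again."},
    {"outlet": "BR-paper", "date": "2016-08-10",
     "text": "O legado olímpico do Rio."},
]
print(term_frequency_by_month(articles, "legacy"))
```

A visualisation layer would then plot one line per outlet over the monthly counts, making the comparative approach described above directly visible.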

The weeks we spent collecting the data and building the tool were evenly divided between the social scientists and the computer scientists. It is worth keeping in mind that working in an interdisciplinary team involves a certain degree of self-awareness. The point is not that everyone should know about everything, but that the knowledge of each team member complements that of the others.

Returning to our original idea of challenges and rewards: like all research projects, the CLEOPATRA network requires a willingness to work as a team and, above all, the effort to understand the points of view of others, even when they are a little outside our comfort zone. But the work pays off!

Daniela Major, University of London

MLM: a large-scale dataset with Multiple Languages and Modalities

This is the second of a series of blog posts discussing different aspects of the CLEOPATRA Research and Development week, which was held online at the end of March. The project ESRs organized themselves into groups to develop demonstrators, and this is the report from Team GOAL (Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh and Swati Suman).

We’re a team of four researchers from different universities. We started working on MLM in December. Generating a publicly available, large-scale dataset covering multiple modalities and languages is one of our core objectives. At the end of March, we had our first Hackathon Week for the Cleopatra project, where we presented a pilot version of our demonstrator. We could say that it was a success!

Our demonstrator revolves around MLM, so let’s check out what it’s all about.

What is MLM?

MLM is a Wikidata-derived dataset intended for training multitask systems. Such systems are anticipated to outperform single-task alternatives when trained simultaneously on sequences of tasks. In our case, the system performs both information retrieval and location estimation.

The dataset is composed of text, images, location data, and knowledge graph triples. We’re planning to release our dataset in two versions.

  • The first version (MLM-v1) will contain text for ~236k entities in three languages (English, German, and French);
  • The second version (MLM-v2) will contain text for ~254k entities in ten languages (English, German, French, Italian, Spanish, Polish, Romanian, Dutch, Hungarian, and Portuguese).
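For illustration, a single MLM entry might bundle the four modalities like this. This is a sketch only: the field names and layout are our assumptions for this post, not the released schema.

```python
# Hypothetical shape of one MLM entity record (all keys are illustrative).
record = {
    "wikidata_id": "Q64",                        # Wikidata identifier (Berlin)
    "text": {"en": "Berlin is the capital of Germany.",
             "de": "Berlin ist die Hauptstadt Deutschlands.",
             "fr": "Berlin est la capitale de l'Allemagne."},
    "images": ["berlin_01.jpg"],                 # references to image files
    "coordinates": {"lat": 52.52, "lon": 13.405},
    "triples": [("Q64", "capital of", "Q183")],  # knowledge-graph triples
}
```

A record of this shape supplies everything both tasks need: the text and images feed retrieval, while the coordinates provide the supervision signal for location estimation.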

Now, let’s find out why this idea for the demonstrator is worth pursuing.

Why MLM? 

Systems trained on large datasets are believed to display rudimentary forms of the human ability to generalize. However, today’s systems struggle to deal with the diversity of real-world data and to extend what they have learned to new tasks.

MLM is a large-scale resource composed of different modalities and languages. It will offer the opportunity to train and evaluate multitasking systems, which aim to reduce application complexity and improve generalization. Related applications will also benefit from machine learning models able to perform a number of independent tasks.

What did we do in the Hackathon?

Our final aim for the week was to create a live demo visualizing our models’ geolocation estimates on maps. This was important because raw latitude and longitude values alone make it hard to judge the quality of the output. For inference, we subdivided the surface of the earth into a number of geographic cells and trained a deep neural network on geo-tagged inputs, which could be either text or images.
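The cell-based formulation turns geolocation into a classification problem: the network predicts a discrete cell, and the demo plots that cell's centre on the map. A minimal sketch of the cell bookkeeping follows; the actual partitioning scheme the team used is not described here, so this assumes a simple uniform grid of fixed-size latitude/longitude cells.

```python
# Assumed cell size in degrees (the real scheme may use adaptive cells).
CELL_DEG = 5.0
N_COLS = int(360 // CELL_DEG)  # cells per row of latitude

def coord_to_cell(lat, lon):
    """Map a (lat, lon) pair to a discrete cell index used as a class label."""
    row = int((lat + 90) // CELL_DEG)
    col = int((lon + 180) // CELL_DEG)
    return row * N_COLS + col

def cell_to_coord(cell):
    """Map a predicted cell index back to its centre coordinates for display."""
    row, col = divmod(cell, N_COLS)
    lat = row * CELL_DEG - 90 + CELL_DEG / 2
    lon = col * CELL_DEG - 180 + CELL_DEG / 2
    return lat, lon

# Training pairs each geo-tagged text or image with coord_to_cell(lat, lon);
# at inference time the demo plots cell_to_coord(predicted_cell) on the map.
```

The advantage of discretisation is that the model's softmax over cells gives an interpretable distribution over regions, at the cost of quantisation error bounded by the cell size.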

Visualizing results on Google Maps

Now, let’s see how our demo helps to visualize the output of the model’s location prediction.

What are our goals for the next demonstrator?

We are planning to build a multitask system capable of handling multiple objectives. Using the rich data provided by MLM, we intend to include some or all of the following tasks:

  • Given a news article, predict locations that conceptually resemble the news story. For this task, we aim to pinpoint the locations on the map with the highest conceptual similarity, based on common knowledge-graph entities in news reports.
  • Given a location-based image and/or text, identify semantically similar locations and places, then link to a relevant news source to collect trending articles at those locations. Text and/or images will also be displayed based on user preference.
  • Given a coordinate, identify:
    • semantically similar locations;
    • a textual description of the location;
    • the image(s) that best describe the location;
    • relevant news sources for collecting trending articles at that location.

How do we collaborate as a team?

To keep track of our iterative progress towards well-defined goals, we hold bi-weekly Scrum meetings and document our work. There we organize our ideas into sprint goals and distribute the work among us. We also exchange thoughts, challenges, and other information between meetings.

Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh and Swati Suman

MIDAS: Migrating Information from Different Annotated Sets – a Cleopatra demonstrator

This is the first of a series of blog posts discussing different aspects of the CLEOPATRA Research and Development week, which was held online at the end of March. The project ESRs organized themselves into groups to develop demonstrators, and this is the report from Group 3, written by Gabriel Maia.

Group number 3, called MIDAS (Migrating Information from Different Annotated Sets), proposed uniting the functionalities of named entity recognition, sentiment analysis of citations, and event extraction under a single text annotation umbrella.

The goal was the creation of an API that could read and annotate text, with a focus on under-resourced European languages. The text could be sent directly in the requests to the API, or inside URLs or text files, and the API would return a structured object containing the text that had been read, and annotations for:

  1. Named entities, together with their type according to a newly developed unified classification hierarchy, the Universal Named-Entity Recognition Framework (UNER). UNER is inspired by the work of Sekine [2] and consists of a three-level hierarchy of named-entity classes applied consistently across all supported languages.
  2. Sentiment analysis of citations and quotations found in the text, including whether they have a positive or negative connotation.
  3. Event triggers and arguments such as location, time and participants, extracted following the ACE 2005 [4] definition of events.
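As a rough illustration, the structured object returned by such an API might look like the following. This is a sketch only: the schema was never finalised, and every key and label below is an assumption for this post, not the actual MIDAS format.

```python
# Hypothetical shape of a MIDAS API response (all keys are illustrative).
response = {
    "text": "The mayor praised the new stadium built for the Games.",
    "entities": [
        # UNER-style three-level type path (the path shown is invented)
        {"span": [4, 9], "surface": "mayor", "type": "Person/Role/Political"},
    ],
    "sentiment": [
        # Polarity judgements for citations/quotations found in the text
        {"span": [10, 17], "surface": "praised", "polarity": "positive"},
    ],
    "events": [
        # ACE 2005-style trigger plus arguments (location, time, participants)
        {"trigger": "built",
         "arguments": {"artifact": "stadium", "reason": "the Games"}},
    ],
}
```

Keeping all three annotation layers in one object is what makes the "single text annotation umbrella" possible: downstream consumers pick the layers they need without re-running the pipeline.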

The resulting dataset would thus be annotated in a three-fold manner, optimising recall. For this, we would run a stack of pre-existing models, pre-trained on English, over an English corpus, resolving annotation conflicts by giving priority to the models with the highest precision scores. The corpus chosen was SETimes [3], since it consists of parallel corpora pairing English with many south-eastern European languages.

Next, the dataset would be passed through a crowd-sourcing phase, where we would optimise for precision. The crowd workers would not be able to propose new tags, but would be responsible for judging whether the tags were correct or not. They would also be able to:

  • Remove a tag if they deemed it erroneous;
  • Adjust the span of a tag if needed;
  • Correct the typing of a tag if needed.

This would give us a curated annotated dataset for the English SETimes, which we would then use to propagate tags across the other languages in SETimes. The end result would be parallel corpora annotated for named entities, sentiment, and events, including many under-resourced languages. These corpora could then be used to train annotating models.
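Tag propagation across parallel corpora is typically done via word alignment: each annotated English token carries its tag over to the target-language tokens it aligns with. The team's exact projection method is not described here, so the sketch below assumes a precomputed source-to-target token alignment.

```python
def project_tags(src_tags, alignment):
    """Project per-token NE tags from an English sentence onto a parallel
    sentence, given an alignment from source to target token indices.

    src_tags:  list of tags, one per source token, e.g. ["B-LOC", "O", ...]
    alignment: dict {source_index: [target_indices]}
    Returns a dict {target_index: tag} covering aligned, non-"O" tokens.
    """
    projected = {}
    for src_idx, tag in enumerate(src_tags):
        if tag == "O":
            continue  # only entity tags are worth propagating
        for tgt_idx in alignment.get(src_idx, []):
            projected[tgt_idx] = tag
    return projected

# English "London hosted the Games"; token 0 aligns to target token 0.
tags = ["B-LOC", "O", "O", "O"]
print(project_tags(tags, {0: [0]}))  # {0: 'B-LOC'}
```

In practice, the quality of the projected corpus is bounded by the alignment quality, which is one reason a curated English layer matters before propagation.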

During the R&D Week, it became clear to us that our project and approach suffered from a number of issues:

  1. The project lacked a single clear topic, instead meshing together three pipelines, each with its own focus;
  2. It was not clearly defined whether the end goal was the dataset, the trained models, or the methodology;
  3. The crowdsourcing step was too ambitious: it would have required too much funding to produce a dataset large enough for training models, and even more so for the whole of SETimes, whether the end result was a model or a dataset.

We have thus used this opportunity to build an API structure which we can still use moving forward, but have taken a step back to re-evaluate the project and what we feasibly want it to be. We have decided to focus it on Named Entity Recognition for under-resourced languages by leveraging resources such as Wikipedia and DBpedia [1], instead of depending on SETimes and crowdsourcing.

Gabriel Maia, King’s College London


  1. DBpedia. (2019). (Accessed on 28/02/2020.)
  2. Sekine, S.: The Definition of Sekine’s Extended Named Entities. (July 2007). (Accessed on 28/02/2020.)
  3. Tyers, F.M., Alperen, M.S.: South-East European Times: A parallel corpus of Balkan languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pp. 49-53 (2010).
  4. Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia (2006).

Cleopatra at the 17th Extended Semantic Web Conference, Heraklion, Greece

We will be running our first international workshop on cross-lingual, event-centric open analytics at the 17th ESWC, which this year will be held in Heraklion, Greece, 31 May – 4 June 2020.

The goal of the interdisciplinary workshop is to bring together researchers and practitioners from the fields of the Semantic Web, the Web, NLP, IR, Human Computation, Visual Analytics and Digital Humanities to discuss and evaluate methods and solutions for effective and efficient analytics of event-centric multilingual information spread across heterogeneous sources. These methods will help to deliver analytical results to users in a meaningful way, and assist them in crossing language barriers in order to understand event representations in other languages and contexts.

You can read more about the workshop at

ESR presentations and publications

Several of the Cleopatra ESRs have already had papers and posters accepted at international conferences (in Iceland, the Republic of Ireland and France). All of the presentations will be made available to read on an open access basis:

  • Daniela Major has had a paper accepted at the 27th International Conference of Europeanists – Europe’s Past, Present and Future (22-24 June 2020), Reykjavik, on ‘The “New Destiny of Portugal”: the idea of Europe in Portuguese Presidential Discourse’.
  • Daniela Major and Caio Mello have had a poster accepted at ‘Engaging with web archives: opportunities, challenges and potentialities’, Maynooth, 15-16 April 2020, on ‘Tracking and analysing media events through web archives’.
  • Diego Alves and Gaurish Thakkar (with Marko Tadić) have had a poster accepted at the 12th International Conference on Language Resources and Evaluation (LREC2020), Marseille, on ‘Evaluating Language Tools for Fifteen EU-official Under-resourced Languages’.