Individual Research Projects
There are 15 Early Stage Researchers in the CLEOPATRA programme. Details of their projects are given below:
1. Fact extraction and cross-lingual alignment (Gottfried Wilhelm Leibniz University Hannover, LUH)
2. Interactive user access models to cross-lingual information (Gottfried Wilhelm Leibniz University Hannover, LUH)
3. Crowd quality and training in hybrid multilingual information processing and analytics (King’s College London, KCL)
4. Incentives design for hybrid multilingual information processing and analytics (King’s College London, KCL)
5. Fact validation across multilingual text corpora (Rheinische Friedrich-Wilhelms-Universität Bonn, UBO)
6. Interactive multilingual question answering (Rheinische Friedrich-Wilhelms-Universität Bonn, UBO)
7. Relations of textual and visual information (TIB Hannover – German National Library of Science and Technology, TIB)
8. Contextualisation of images in multilingual sources (TIB Hannover – German National Library of Science and Technology, TIB)
9. National and transnational media coverage of European parliamentary elections, 2004-2014 (University of London, UoL)
10. Nationalism, internationalism and sporting identity: the London and Rio Olympics/Paralympics (University of London, UoL)
11. Information propagation with barriers (Institut Jožef Stefan, JSI)
12. Cross-lingual news reporting bias (Institut Jožef Stefan, JSI)
13. Multilingual Wikipedia as ‘first draft of history’ (University of Amsterdam, UvA)
14. NLP for under-resourced languages (Sveuciliste u Zagrebu Filozofski Fakultet – University of Zagreb, FFZG)
15. Cross-lingual sentiment detection (Sveuciliste u Zagrebu Filozofski Fakultet – University of Zagreb, FFZG)
18. Multi Modal Fact Validation (TIB Hannover – German National Library of Science and Technology, TIB)
|ESR1: Tin Kuculo||PhD enrolment: LUH|
|Project Title: Fact extraction and cross-lingual alignment (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Extract and interlink mentions of related facts and their multilingual context and establish their semantic and temporal relations in comparable corpora by leveraging hybrid computational methods while utilizing NLP and ML-based technologies.|
|Expected Results: Hybrid computational methods for extraction and interlinking of related facts and their context across languages.|
|News articles are a rich source of event-centric information published across the globe, reflecting a wide variety of cultural differences and political views. In my work, I aim to develop information extraction methods that help to extract concise event representations from multilingual news corpora to facilitate in-depth analysis of differences in event representations across languages and communities. One particularly appealing direction in this context is the extraction of event coreference chains, i.e. sentences that contain mentions of the event of interest in a news article. Such coreference chains can build a basis for comprehensive and concise event summaries that facilitate comparison of event representations across news sources. In particular, I aim to develop methods supporting the cross-lingual analysis of events represented in English, French, German, and Croatian news sources.|
|ESR2: Sara Abdollahi||PhD enrolment: LUH|
|Project Title: Interactive user access models to cross-lingual information (WP5 – Hybrid computation, user interaction and question answering)|
|Objectives: Develop user interaction models that enable users to efficiently and effectively access extracted event-centric multilingual information and its context and analyse language-specific differences.|
|Expected Results: Models for interactive efficient access to structured multilingual information and its context validated through user studies.|
|The importance and perception of prominent entities, concepts and events, such as Brexit or the migration crisis, vary strongly across language communities.
Cross-lingual entity recommendation, which I’m currently working on, aims to suggest the entities relevant to the user’s information needs in the context of specific languages.
In my research, I aim to develop language-specific entity recommendation methods that provide a powerful user experience in cross-lingual information exploration, while taking into account linguistic and cultural aspects.
|ESR3: Gabriel Amaral||PhD enrolment: SOTON / KCL|
|Project Title: Crowd quality and training in hybrid multilingual information processing and analytics (WP5 – Hybrid computation, user interaction and question answering)|
|Objectives: Design a mixed-crowdsourcing workflow to produce high-quality multilingual data for knowledge graphs.|
|Expected Results: Human-machine workflows and quality assurance methods.|
|I’m a Computer Scientist and Researcher from Ceará, Brazil, with a background in Natural Language Processing (NLP), Machine Learning (ML) and Logical Agents. My role in CLEOPATRA is to investigate how Crowdsourcing can prove itself a valuable asset for the enrichment of knowledge-based systems in a multilingual and multicultural environment, as well as for collecting resources to help support minority languages and their speakers.|
|ESR4: Elisavet Koutsiana||PhD enrolment: KCL|
|Project Title: Incentives design for hybrid multilingual information processing and analytics (WP5 – Hybrid computation, user interaction and question answering)|
|Objectives: Understand what motivates people to engage in knowledge graph creation and curation activities, across language contexts, and devise incentive mechanisms to foster useful knowledge graph contributions.|
|Expected Results: Methodologies and insights into crowd behaviour in the context of multilingual knowledge graphs.|
|Human behaviour has been one of the main topics of research for many years. Rapid technological development and the daily use of the internet in communication, social media, learning, finance, journalism, etc. have led us to extend this research to the study of user behaviour.
Crowdsourcing is a model used for assigning micro-tasks to an open group of users through the internet. My research aims to understand the motivation and incentives of users participating in several forms of crowdsourcing in the field of multilingual information science. The main objective of my research is to explore behavioral characteristics to find any engagement patterns as well as potential challenges discouraging users’ participation.
Through this study I aim to analyse how to attract more users and how to encourage them to participate. In addition, one more aspect of this study is the use of multilingual knowledge bases to understand whether there are different incentives in cross-lingual information and how these connect with users’ contribution.
|ESR5: Jason Armitage||PhD enrolment: UBO|
|Project Title: Fact validation across multilingual text corpora (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Develop hybrid methods for cross-lingual fact validation and leverage multilingual distributed sources to provide a more complete set of source candidates in order to validate the facts.|
|Expected Results: Methods for hybrid cross-lingual fact validation using heterogeneous information sources.|
|Unimodal systems that process natural language predominate in research on fact validation – but reporting on events is rarely comprised of text alone. The co-existence of multiple modalities creates an opportunity to implement machine learning systems that retrieve, combine, and model diverse inputs. Fact validation systems that process multiple modalities are able to assess additional evidence, learn rich representations, and conduct supplementary analyses on source data.
Fact validation is composed of the sub-tasks of parsing inputs, retrieving evidence, and inferring relations between the two. In related applications, approaches that compose a design – after breaking problems down into steps – are better suited to tasks with composite structure. Learning on multiple modalities introduces an additional level of complexity for monolithic architectures. This research aims to implement and evaluate systems optimised to process diverse data and adapt to the multiple sub-tasks comprising the fact validation pipeline.
|ESR6: Endri Kacupaj||PhD enrolment: UBO|
|Project Title: Interactive multilingual question answering (WP5 – Hybrid computation, user interaction and question answering)|
|Objectives: Train neural networks to convert natural language queries to a formal query language, which will then be answered using existing knowledge bases. Enable efficient user interaction and feedback to enhance results.|
|Expected Results: End-to-end interactive Question Answering prototype trained using a neural network which will support a more expressive query language and user interaction, in particular to support event-centric questions.|
|Question answering over Knowledge Graphs has emerged as an intuitive way of querying structured data sources and has witnessed significant progress over the years. However, there is still plenty of space for improvement and there exist specific challenges that are still far from being effectively solved. In this research project, we aim to address some of these challenges and provide innovative solutions in the field. Our research will mainly focus on machine learning approaches.|
|ESR7: Gullal Singh Cheema||PhD enrolment: TIB|
|Project Title: Relations of textual and visual Information (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Develop and research (deep) learning systems that are able to 1.) find the paragraphs and sentences in a text which are relevant to image content, and 2.) predict the granularity and semantic level of text-image relations.|
|Expected Results: A cross-lingual model of semantic relations of textual and visual information with different levels and granularities.|
|Information on the web today, irrespective of genre, is usually multimodal in nature: it is curated by combining textual, visual and audio elements to engage more of the target audience’s senses. On the one hand, information in two or more modalities can be parallel or closely correlated, as in image captioning; on the other, it can be complementary, as in advertisements where a person needs background knowledge or context to understand the content. Most research in the Machine Learning community has addressed the former case and proposed systems that perform with high accuracy, because learning systems such as deep networks can encode correlated multimodal information effectively. However, in the latter case, progress has been slower because the complex relationships between different modalities are hard to model with existing machine learning systems.
By modeling these complex relationships in multimodal data, we can develop better information retrieval engines, make models that understand content on social media and as a consequence control offensive or fake content, and in general train systems to be better at encoding or representing multimodal information.
|ESR8: Golsa Tahmasebzadeh||PhD enrolment: TIB|
|Project Title: Contextualisation of images in multilingual sources (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Research how surrounding text information can be utilized to infer and refine spatial and temporal information about an image in multilingual Web sources to support cross-lingual alignment.|
|Expected Results: Methods to detect temporal and spatial information for images by exploiting visual information and their multilingual textual context.|
|In the past few years, tremendous growth in the amount of multimodal data has created a need for technologies that categorize it, extract valuable information from it and provide users around the world with the information they want. To this end, this research focuses on the validation and contextualization of images in multimodal multilingual sources. More specifically, the objective is to detect the capture date and location of images by utilizing information from their surrounding text. This research topic is useful in various domains, such as information retrieval from multimodal multilingual sources, temporal or spatial study of events in news sources, and fake news detection, to name but a few.|
|ESR9: Daniela Major||PhD enrolment: UoL|
|Project Title: National and transnational media coverage of European parliamentary elections, 2004-2014 (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)|
|Objectives: Explore information flows between national media, identify translingual concepts and topics emerging during the elections.|
|Expected Results: Identification of issues remaining bounded by language or national political cultures in election information flows.|
|My project concerns the way European media covered the European Parliamentary elections from 2004 to 2019. I’ll be looking at newspapers in digital formats from different countries and languages. I will be studying which themes come up the most and how they change throughout the years as well as the usage of political concepts such as Nationalism, Sovereignty and Europeanisation in the media coverage.
The point of this project is to contribute to a larger understanding of the role of national media in the European Public Sphere as well as measuring the influence of the media in shaping the public’s image of the European Parliament.
Ideally, this project would also provide some pointers as to how the European Institutions should communicate with the electorate thus bridging the gap between these institutions and public opinion.
|ESR10: Caio Mello||PhD enrolment: UoL|
|Project Title: Nationalism, internationalism and sporting identity: the London and Rio Olympics (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)|
|Objectives: Explore online discussion of the two recent Olympics, which took place on different continents, in different time zones, and in different linguistic contexts.|
|Expected Results: Identification of differences in coverage between nations of major sporting events, analysing factors such as the type of activity, the location of the event, and the languages of the host nations.|
|The Olympic Games happen every 4 years. This means that every 4 years a city has to be chosen as a host city. It is easy to think about the impact of hosting such a big event in your own country. Usually governments have to prepare everything for their guests and be aware that the local population is expecting something that will remain as a legacy after the event ends. But what are people actually expecting? What usually happens after the Olympics? Are people happy or unhappy with the legacy left behind at the end of the games? We can try to answer these questions by reading what was published on the internet before, during and after the games in the countries that have hosted the Olympics. Of course, there are many publications about this topic on social media and in news media, and it would be very difficult to read everything. Because of that, we can use computers to help us read this material and select the important things. Computers can, for example, analyse the most recurrent words related to this topic and provide us with insights about what kind of legacy people usually expect and what their feelings are when they face the materialization of their plans some years later. This kind of research has many possible applications. It can help governments to plan better public policies as well as provide us with tools to understand the impacts of such big events, which can be used to find solutions.|
|ESR11: Abdul Sittar||PhD enrolment: JSI|
|Project Title: Information propagation with barriers (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)|
|Objectives: Model the phenomenon of information propagation within the dynamic network of interconnected events. In other words, the objective is to model the characteristics of information spreading once a physical event happens somewhere in the world.|
|Expected Results: A model that facilitates tracking how the information about events spreads across languages, borders and cultures including the relations between barriers and the information spreading (e.g. delays, blocks, filters).|
|Only a few studies have focused on the task of analysing event-centric information, and those are usually limited to a monolingual setting. Our objective is to explore event-centric information spreading across different barriers in a cross-lingual setting using primarily news data. To address the problem, we will define several topical areas with different characteristics, such as sports events, political events and natural disasters. For each area we will find several specific events for which we can collect cross-lingual news data. We will model the data using machine learning techniques, defining an appropriate event representation that captures the dynamics of information spreading across barriers.
The main focus of this research is to develop methodology, techniques and tools for modeling the characteristics of information which is spreading from different sources across different barriers. As a part of the Cleopatra project, it will expand the existing research results and provide insight towards new areas of information spreading in the contemporary world.
|ESR12: Swati Suman||PhD enrolment: JSI|
|Project Title: Cross-lingual news reporting bias (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)|
|Objectives: Analyse cross-lingual news reporting bias along several dimensions: topic, language, geography, political orientation, source, sentiment, time, attention and some other contextual features.|
|Expected Results: Models describing information consumption in different parts of the world and feature analysis with respect to bias.|
|News bias is the product of the inherent bias present in its coverage. It occurs when a news outlet publishes a news story selectively or incorrectly. If the news is biased, then it can bias the thought process and decision making of the person listening, watching and/or reading it. Thus news bias analysis is not only important to the general public but it is also essential for journalists, publishers and other people involved in the news production process.
In general, news bias is determined by studying linguistic attributes such as keywords and other syntactic and semantic features. But with the advent of machine learning in recent years, advanced algorithmic methods have been developed to determine bias in written news content. The majority of online written news content is available in English and, even more importantly, most of the tools and resources for text analysis are designed for the English language. Consequently, most research work seeks to strengthen techniques for monolingual (mostly English) news bias analysis and there is less consideration of cross-lingual and cross-cultural bias in news reporting. The purpose of our research is therefore to develop methods, techniques and tools for analyzing bias in cross-lingual news reporting. It will rely on and expand the language processing pipelines that have been created as part of the Cleopatra project.
|ESR13: Anna Katrine Jørgensen||PhD enrolment: UvA|
|Project Title: Multilingual Wikipedia as ‘first draft of history’ (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)|
|Objectives: Perform cross-cultural comparison of Wikipedia language versions of articles on emerging news events and their temporal evolution.|
|Expected Results: Identification of language-specific and community-specific differences across Wikipedia language editions with respect to coverage of emerging news events over time.|
|Current events are popular and frequent pages on Wikipedia, with each language version offering a different representation of the events. These representations are sequences of temporally and culturally situated drafts of history that are continuously revised and repositioned by the Wikipedians. Despite their popularity, there is still a lack of research on how current event representations are created, how they develop over time, and how they differ across language versions.
In my PhD, I research the creation and development of current event pages on Wikipedia and perform large-scale temporal analysis of events across different European Wikipedia language versions. I specifically research events that have a social or cultural significance for the European Union or European cultures such as the European Refugee Crisis, the Arab Spring, and Brexit.
I also work on cultural recommendation of page creations based on cultural and social relevance, preferences and importance, as well as the relationship between events, location and related entities.
|ESR14: Diego Alves||PhD enrolment: FFZG|
|Project Title: NLP for under-resourced languages (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Extend Language Processing Pipelines (LPPs) for the well-resourced EU languages and gradually add new languages.|
|Expected Results: A set of stable LPPs covering the core Language Technology tasks for most of the EU official languages.|
|Do you remember all the boring syntax analysis and other grammar studies you had to do in High School? Well, computers can do that! And good news, they can be as good as graduate linguists! However, in order to be efficient, computers need a lot of linguistic data and, nowadays, as several automatic methods for natural language processing are available, it is difficult to know which are the best ones. My work consists of finding the optimal combination of software tools to process different European languages, especially the under-resourced ones.|
|ESR15: Gaurish Thakkar||PhD enrolment: FFZG|
|Project Title: Cross-lingual sentiment detection (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Produce and test a cross-lingual sentiment detection module with support for under-resourced EU languages.|
|Expected Results: A module for cross-lingual sentiment analysis, integrated in the CKPP.|
|Human expressions are innate to all human activities. Knowing what others think about a particular topic or entity is a vital step in decision making. This ranges from buying a dress from an online store to selling/buying stocks at the very latest price point.
While sentiment analysis in highly resourced languages has received full attention, the same cannot be said for low-resourced languages. It is the scarcity of resources that makes sentiment analysis less effective for these languages. This study aims to develop tools/methods to perform sentiment detection in a cross-lingual fashion for low-resourced languages. This involves investigating how resources from high-resourced languages can aid sentiment analysis in scarce-resourced languages.
|ESR18: Sahar Tahmasebi||PhD enrolment: TIB|
|Project Title: Multi Modal Fact Validation (WP4 – Event-centric cross-lingual Information processing)|
|Objectives: Develop methods for multimodal fact validation by exploiting hybrid inputs, providing evidence for them and learning rich representations.|
|Expected Results: Methods for multimodal fact validation using heterogeneous information sources.|
|Unimodal systems that process natural language predominate in research on fact validation – but reporting on events is rarely comprised of text alone. The co-existence of multiple modalities creates an opportunity to implement machine learning systems that retrieve, combine, and model diverse inputs. Fact validation systems that process multiple modalities are able to assess additional evidence, learn rich representations, and conduct supplementary analyses on source data.
Fact validation is composed of the sub-tasks of parsing inputs, retrieving evidence, and inferring relations between the two. In related applications, approaches that compose a design – after breaking problems down into steps – are better suited to tasks with composite structure. Learning on multiple modalities introduces an additional level of complexity for monolithic architectures. This research aims to implement and evaluate systems optimized to process diverse data and adapt to the multiple sub-tasks comprising the fact validation pipeline.
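The three sub-tasks named above (parsing inputs, retrieving evidence, inferring relations) can be illustrated as separate, composable steps. The following is only a minimal sketch, not the project's method: the function names, the token-overlap retrieval, the 0.6 threshold and the two-sentence corpus are all assumptions made for illustration, and a real system would use learned models over text and images at each step.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    """A parsed claim: the raw text plus a bag of lower-cased tokens."""
    text: str
    tokens: frozenset


def parse_claim(text: str) -> Claim:
    # Step 1 (parsing): hypothetical tokeniser, whitespace split plus punctuation strip.
    tokens = frozenset(w.strip(".,!?").lower() for w in text.split())
    return Claim(text=text, tokens=tokens - {""})


def overlap(claim: Claim, document: str) -> int:
    # Number of claim tokens that also occur in the document.
    return len(claim.tokens & parse_claim(document).tokens)


def retrieve_evidence(claim: Claim, corpus: List[str], top_k: int = 2) -> List[str]:
    # Step 2 (retrieval): rank documents by token overlap, keep the best non-zero matches.
    ranked = sorted(corpus, key=lambda doc: overlap(claim, doc), reverse=True)
    return [doc for doc in ranked[:top_k] if overlap(claim, doc) > 0]


def infer_relation(claim: Claim, evidence: List[str], threshold: float = 0.6) -> str:
    # Step 3 (inference): toy rule, a claim counts as supported when one evidence
    # document covers at least `threshold` of its tokens (an assumed cut-off).
    if not evidence:
        return "not enough evidence"
    best = max(overlap(claim, doc) / len(claim.tokens) for doc in evidence)
    return "supported" if best >= threshold else "not enough evidence"


def validate(text: str, corpus: List[str]) -> str:
    # The pipeline is the composition of the three independent steps.
    claim = parse_claim(text)
    evidence = retrieve_evidence(claim, corpus)
    return infer_relation(claim, evidence)


# Illustrative two-document corpus (made up for this sketch).
CORPUS = [
    "The 2016 Olympics took place in Rio de Janeiro.",
    "London hosted the Olympics in 2012.",
]

if __name__ == "__main__":
    print(validate("The Olympics took place in Rio", CORPUS))
    print(validate("Paris hosted a festival", CORPUS))
```

Because each step is its own function, any one of them could be replaced (for example, by a multimodal retriever) without touching the others, which is the kind of compositional design the description contrasts with monolithic architectures.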