ESR projects

Individual Research Projects

There are 15 Early Stage Researchers in the CLEOPATRA programme. Details of their projects are given below:

1. Fact extraction and cross-lingual alignment (Gottfried Wilhelm Leibniz University Hannover, LUH)
2. Interactive user access models to cross-lingual information (Gottfried Wilhelm Leibniz University Hannover, LUH)
3. Crowd quality and training in hybrid multilingual information processing and analytics (King’s College London, KCL)
4. Incentives design for hybrid multilingual information processing and analytics (King’s College London, KCL)
5. Fact validation across multilingual text corpora (Rheinische Friedrich-Wilhelms-Universität Bonn, UBO)
6. Interactive multilingual question answering (Rheinische Friedrich-Wilhelms-Universität Bonn, UBO)
7. Relations of textual and visual information (TIB Hannover – German National Library of Science and Technology, TIB)
8. Contextualisation of images in multilingual sources (TIB Hannover – German National Library of Science and Technology, TIB)
9. National and transnational media coverage of European parliamentary elections, 2004-2014 (University of London, UoL)
10. Nationalism, internationalism and sporting identity: the London and Rio Olympics/Paralympics (University of London, UoL)
11. Information propagation with barriers (Institut Jožef Stefan, JSI)
12. Cross-lingual news reporting bias (Institut Jožef Stefan, JSI)
13. Multilingual Wikipedia as ‘first draft of history’ (University of Amsterdam, UvA)
14. NLP for under-resourced languages (Sveuciliste u Zagrebu Filozofski Fakultet – University of Zagreb, FFZG)
15. Cross-lingual sentiment detection (Sveuciliste u Zagrebu Filozofski Fakultet – University of Zagreb, FFZG)
18. Multi Modal Fact Validation (TIB Hannover – German National Library of Science and Technology, TIB)
19. Cross Cultural Comparison of Russian State Controlled and Independent Media (University of Amsterdam, UvA)

 

ESR1: Tin Kuculo        PhD enrolment: LUH
Project Title: Fact extraction and cross-lingual alignment (WP4 – Event-centric cross-lingual Information processing)
Objectives: Extract and interlink mentions of related facts and their multilingual context and establish their semantic and temporal relations in comparable corpora by leveraging hybrid computational methods while utilizing NLP and ML-based technologies.
Impact: QuoteKG was developed, a multilingual knowledge graph of quotes, containing nearly a million quotes across 55 languages from over 69,000 public figures, to aid in understanding world history and influencing societal discourse. Wikiquote was used as a primary source to address the challenges of context scarcity in quote collections, language variation, and quote alignment, and a language-agnostic transformer model for cross-lingual alignment was employed. QuoteKG, with its rich metadata and context for each quote, facilitates cultural and historical research, demonstrated by its application in curating event quote collections and conducting event-centric analyses.
Future research: The Semantic Web and NLP communities have developed different methods for event-centric information extraction and representation. Knowledge graphs focus on named events, while NLP-based approaches focus on finer-grained, self-contained events. Follow-up work aims to combine the strengths of the Semantic Web and NLP perspectives and presents a new approach, Ontology-Guided Event Extraction (O-GEE), to address challenges associated with event ontologies, the extraction of relevant event relations, and the alignment of event ontologies in NLP and the Semantic Web.

A two-step approach is planned to extract fine-grained event locations from large-scale knowledge graphs, such as Wikidata, DBpedia, and YAGO. The method first integrates a geographic knowledge graph into an event knowledge graph to counteract the lack of detailed location information, and then employs a graph neural network on the combined event knowledge graph to extract precise event locations.

 

ESR2: Sara Abdollahi PhD enrolment: LUH
Project Title: Interactive user access models to cross-lingual information (WP5 – Hybrid computation, user interaction and question answering)
Objectives: Develop user interaction models that enable users to efficiently and effectively access extracted event-centric multilingual information and its context and analyse language-specific differences.
Impact: The LaSER language-specific event recommendation algorithm was developed to support users (mostly researchers and journalists) in cross-language exploration, Web navigation and exploratory search.
Future research: Document retrieval for events using query expansion and knowledge graphs, and building event-centric collections from web archives, following on from my research during my secondment at the British Library.

 

ESR3: Gabriel Amaral PhD enrolment: SOTON / KCL
Project Title: Crowd quality and training in hybrid multilingual information processing and analytics (WP5 – Hybrid computation, user interaction and question answering)
Objectives: Design a mixed-crowdsourcing workflow to produce high-quality multilingual data for knowledge graphs.
Impact: The research has made great strides in the Wikidata Research community in uncovering the impact that the quality of references (for Wikidata claims) has on the perceived quality of the information itself. We explored how humans perceive the quality of such references, how best to measure it, what the current state of these quality metrics is in Wikidata, and what direct steps could be taken to improve it. This part of our research was extremely well received by the community and was granted a Wikimedia Foundation Research Award of the Year. We followed this up by constructing an automated pipeline to assist in the maintenance of Wikidata (or any KG) references. We also explored how such a model can be explained in order to gain the trust and understanding of its users.
Future research: Follow-ups for this research would look into other ways to automate fully or partially the measurement of relevant metrics for reference quality. This could focus on the main metrics we have established (relevance, authoritativeness, and ease of use), or others. NLP models are making huge advancements (e.g. ChatGPT), so applying these techniques to this task seems like a natural next step.

 

ESR4: Elisavet Koutsiana PhD enrolment: KCL
Project Title: Incentives design for hybrid multilingual information processing and analytics (WP5 –  Hybrid computation, user interaction and question answering)
Objectives: Understand what motivates people to engage in knowledge graph creation and curation activities, across language contexts, and devise incentive mechanisms to foster useful knowledge graph contributions.
Impact: The contribution investigated how members of the community work and interact through their discussions providing valuable inputs to peer production communities, particularly the collaborative knowledge graph communities. The research study confirmed how important discussions are as a source of insights regarding members’ collaboration and provided frameworks for studying online discussions in collaborative knowledge graph projects, like qualitative analysis coding schemes for investigating the main topics discussed, as well as for identifying argumentation patterns and the role of discussion participants in controversial discussions. An overview of the complete corpus of Wikidata discussions was provided, as well as publicly available code for descriptive statistical analysis of the corpus. The research suggests design improvements and topics for follow-up studies in knowledge graph quality and editor engagement.
Future research: Future directions of this research could investigate how decisions are made and how they impact knowledge graph construction, and how discussions and members’ interaction impact the quality of the knowledge graph.

ESR5: Jason Armitage PhD enrolment (2019-20): UBO
Project Title: Fact validation across multilingual text corpora (WP4 –  Event-centric cross-lingual Information processing)
Objectives: Develop hybrid methods for cross-lingual fact validation and leverage multilingual distributed sources to provide a more complete set of source candidates in order to validate the facts.
Expected Results: Methods for hybrid cross-lingual fact validation using heterogeneous information sources.
Unimodal systems that process natural language predominate in research on fact validation, but reports on events rarely consist of text alone. The co-existence of multiple modalities creates an opportunity to implement machine learning systems that retrieve, combine, and model diverse inputs. Fact validation systems that process multiple modalities are able to assess additional evidence, learn rich representations, and conduct supplementary analyses on source data.

Fact validation is composed of the sub-tasks of parsing inputs, retrieving evidence, and inferring relations between the two. In related applications, approaches that compose a design – after breaking problems down into steps – are better suited to tasks with composite structure. Learning on multiple modalities introduces an additional level of complexity for monolithic architectures. This research aims to implement and evaluate systems optimised to process diverse data and adapt to the multiple sub-tasks comprising the fact validation pipeline.

 

ESR6: Endri Kacupaj PhD enrolment: UBO
Project Title: Interactive multilingual question answering (WP5 –  Hybrid computation, user interaction and question answering)
Objectives: Train neural networks to convert natural language queries to a formal query language, which will then be answered using existing knowledge bases. Enable efficient user interaction and feedback to enhance results.
Impact: Conversational Question Answering over Knowledge Graphs with Answer Verbalization. The impact of the research can be summarised as follows:

    • We provided a semi-automated framework for generating multiple paraphrased responses to a question using methods such as back-translation. Furthermore, we released the first KGQA dataset with multiple paraphrased verbalized responses.
    • We presented the first multi-task-based answer verbalization framework that employs questions and logical forms as inputs and simultaneously trains four modules for generating natural language answers.
    • We developed two multi-task neural semantic parsing approaches for (complex) conversational question answering over knowledge graphs. The two approaches differ in the number of sub-tasks they perform and in the deep neural architectural modules they use.

For Conversational QA, we studied whether the availability of entire dialog history, domain information, and verbalized answers can act as context sources in determining the ranking of KG paths while retrieving correct answers. We proposed an approach that models conversational context and KG paths in a shared space by jointly learning the embeddings for homogeneous representation.

Further research: The contributions of this research pave the way for a more extensive research agenda. A few such directions are listed below:

    • Empirical evaluations show that, for AI systems, explanations of the retrieved answers improve trustworthiness, especially when a prediction is wrong. Hence, how an answer verbalization can be explained remains an important open research direction.
    • For Conversational QA, path ranking is a relatively new method for approaching the task. While its predictions are more difficult to comprehend than those of semantic parsing models, path ranking will be favored since it does not depend on any annotation process or data. We see future ConvQA systems emerging in that direction; however, the explainability of ranking strategies remains a tough challenge that will eventually have to be addressed.
    • For the developed approaches, we employ multi-task learning via hard parameter sharing. Multi-task learning can be crucial for approaching complicated tasks such as question answering. We expect the development of massive and flexible multi-task learners able to perform numerous tasks. Question answering will be achieved by combining parts of them, and for any new knowledge learned, the model will update a comparatively small number of parameters. Likewise, we believe that multi-task learning is essential in developing artificial intelligence with more human-like qualities.

 

ESR7: Gullal Singh Cheema PhD enrolment: TIB
Project Title: Relations of textual and visual Information (WP4 –  Event-centric cross-lingual Information processing)
Objectives: Develop and research (deep) learning systems that are able to 1.) find the paragraphs and sentences in a text which are relevant to image content, and 2.) predict the granularity and semantic level of text-image relations.
Impact:

    • A novel method for text-based claim detection was proposed, as well as a novel approach for hate speech detection in multimodal memes (a multi-task learning model).
    • A new multimodal claim detection dataset was created including a multi-topic and diverse set of claims with annotations for multiple tasks.
    • A theoretical framework was proposed for multimodal news analysis based on research in semiotics (image-text relations), journalism (news values) and computational science.
Further research: A follow-up on multimodal claims is already part of the CheckThat! 2023 challenge on Check Worthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and Their Sources. The new test dataset, part of the challenge, contains tweets from 2021 and 2022. A few papers are planned on computational research into image-text relations, multimodal hate speech and the analysis of large multimodal language models.

 

ESR8: Golsa Tahmasebzadeh PhD enrolment: TIB
Project Title: Contextualisation of images in multilingual sources (WP4 –  Event-centric cross-lingual Information processing)
Objectives: Research how surrounding text information can be utilized to infer and refine spatial and temporal information about an image in multilingual Web sources to support cross-lingual alignment.
Impact: Contextualization of news photos based on geolocation estimation and event type classification can impact the research field in various aspects such as:

    • Data enrichment: Contextualizing news documents helps researchers enrich their datasets with valuable information such as the geographic source of an event.
    • Improved data analysis: Geolocation estimation and event type classification provide additional insights for analysing news photos. For instance, researchers can explore the spatial and temporal patterns in news coverage.
    • Verification and fact-checking: Geolocation estimation and event type classification can aid researchers in verifying the accuracy of news documents. By integrating this meta-information, researchers can further explore misinformation detection and the identification of manipulated images.
    • Multimedia representation: Geolocation-based image-text relations can help social scientists in analysing media content and representation. For instance, they can investigate how different locations are depicted in images and the accompanying text. Furthermore, they can explore cultural stereotypes or media bias.
Further research: Possible follow-up research directions are:

    • Study geolocation explainability between image and text pairs of news
    • Study the impact of geolocation estimation on fake news detection and news recommendation

 

ESR9: Daniela Major PhD enrolment:  UoL
Project Title: National and transnational media coverage of European parliamentary elections, 2004-2014 (WP6 –  Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Explore information flows between national media, identify translingual concepts and topics emerging during the elections.
Impact: Research in Cleopatra has contributed to the following:

    • Development of a framework to combine qualitative and quantitative methods for the study of media coverage of attitudes to Europe in UK and Portuguese media. The focus was on accessibility and reproducibility of the method by historians with limited digital research skills.
    • Development of a new (and easily reproducible) method for building a multilingual corpus derived from online newspapers, both open access and subscription content, based on thematic criteria
    • The study of media coverage of the European Union has contributed to understandings of how discussion in the digital public sphere influences the attitudes of European citizens to the EU, including pre- and post-Brexit in the UK. It has uncovered significant similarities but also significant differences between the topics, institutions and individuals who feature prominently in UK and Portuguese media, and particularly in how the two countries mobilise their histories and national traditions in relation to the EU.
Further research: Possible future developments of the research include:

    • Expanding debates in multilingual Digital Humanities, with special focus on democratising access to resources and data.
    • Introducing reproducible digital methods to historians, including through publication of results in mainstream history journals.
    • Encouraging reflection on the nature of the digital public sphere in the European Union.

 

ESR10: Caio Mello PhD enrolment: UoL
Project Title: Nationalism, internationalism and sporting identity: the London and Rio Olympics (WP6 –  Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Explore online discussion of the two recent Olympics, which took place on different continents, in different time zones, and in different linguistic contexts.
Impact: Research in Cleopatra has contributed to the following:

    • Development of a framework to combine quantitative (digital) and qualitative methods for the study of the media coverage of events of significant importance for society. The focus was especially on mechanisms of critical data and digital methods thinking by using different approaches to review the implications of their use in the social sciences. Results are published in the paper (https://doi.org/10.1007/s42803-022-00052-9).
    • Development of a tutorial (forthcoming) on the Programming Historian on how to retrieve and analyse data publicly provided by the British Library. Results were shared at the UK Web Archive conference 2022, organised internally by the institution for its collaborators. Documented in video (https://www.youtube.com/watch?v=9XbCcVqXVeo).
    • The study on the media coverage of the Olympic legacy has contributed to the research community's understanding of the different dimensions in which the event is narrated and cities are affected as a result of hosting mega events. The results were shared and discussed at the event ‘Documenting the Olympics and the Paralympics’, organised by me (ESR10) in partnership with the British Library, British Society of Sports History (BSSH), the International Centre for Sports History and Culture at De Montfort University (ICSHC) and the School of Advanced Study (SAS). Documented as podcast (https://open.spotify.com/episode/5zDSk4JsfOvO8Jq4Aq1cXL?go=1&sp_cid=0e0eeb5dba04330acafd5dd9774ffa5c&utm_source=embed_player_p&utm_medium=desktop&nd=1).
Further research: Possible future developments of this research include:

    • Expanding debates in multilingual Digital Humanities, with special focus on democratising access to resources and data.
    • Deepening research on digital methods, their potentialities and limitations. It would be important to investigate how explainable AI can help Digital Humanities scholars to improve our understanding of the impact of using such tools.
    • Provoking reflections and debates on interdisciplinary research and how to develop mechanisms to improve collaboration across disciplines in aspects such as joint publications, productive discussions and development of projects.

 

ESR11: Abdul Sittar PhD enrolment:  JSI
Project Title: Information propagation with barriers (WP6 –  Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Model the phenomenon of information propagation within the dynamic network of interconnected events. In other words, the objective is to model the characteristics of information spreading once a physical event happens somewhere in the world.
Impact: Many factors influence news selection, reporting, and spreading, including cultural, political, economic, geographic, and linguistic ones. Analysing these factors in news spreading related to different international events is an open research area. The impact of the research project can be summarised as follows:

    • A new methodology to analyse news spreading barriers: a corpus has been collected and annotated containing news articles related to three different domains: natural disasters, sports, and climate change. The collected corpus has been extended with background information related to six barriers (linguistic, economic, time-zone, geographic, political, and cultural). For the mono-lingual news, network analysis is performed, whereas for the temporal analysis cascading analysis is performed.
    • A new approach to enhance the topic modelling technique and understand political and economic differences in news reporting: an enhanced topic modelling technique that uses LDA with a combination of 1- to 6-grams and article pooling based on queries to improve the quality of topics without modifying the structure of LDA. The proposed approach has been applied to news articles related to COVID-19.
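
As an illustration of the preprocessing named in the last point, here is a minimal sketch; it is not the project's actual code. It assumes whitespace tokenisation, n-grams joined with underscores, and articles supplied as (query, text) pairs. The function names and the sample data are hypothetical. The pooled token lists would then be passed to an unmodified LDA implementation, which is not shown.

```python
from collections import defaultdict


def ngrams(tokens, n_min=1, n_max=6):
    """Return all contiguous n-grams of length n_min..n_max, joined by underscores."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out


def pool_by_query(articles):
    """Merge the n-grams of all articles retrieved by the same query into one pooled document.

    n-grams are built per article, so no n-gram spans two different articles.
    """
    pools = defaultdict(list)
    for query, text in articles:
        pools[query].extend(ngrams(text.lower().split()))
    return dict(pools)


# Hypothetical sample: (query used to retrieve the article, article text)
sample = [
    ("covid", "cases rise"),
    ("covid", "vaccine"),
    ("economy", "markets fall"),
]
pooled = pool_by_query(sample)
```

Each value in `pooled` is one long pseudo-document per query, which gives the topic model more co-occurrence evidence than short individual articles would.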
Further research: The developed methodologies will be used for social media analysis within the new TWON project (Twin of online social networks).

 

ESR12: Swati PhD enrolment: JSI
Project Title: Cross-lingual news reporting bias (WP6 – Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Analyse cross-lingual news reporting bias along several dimensions: topic, language, geography, political orientation, source, sentiment, time, attention and some other contextual features.
Impact: Introduction of a novel knowledge-infused learning framework for enhancing the prediction of political bias in news headlines and its extensive evaluation.

We believe that our proposed framework would be a valuable tool for copy editors responsible for rewriting headlines. It also has the potential to be useful in practical applications such as e-journalism and manual news-bias prediction portals, where it could be used to automatically classify headlines into different bias types. In addition, it could help to reduce the number of articles that require manual examination, which is a time-consuming process prone to annotator bias.

    • Extension of the proposed framework to deal with low-resource multilingual headlines under imbalanced sample distribution in a language-agnostic setting with comprehensive qualitative analysis. Our proposed language-agnostic solution is adaptable to real world scenarios where a system is expected to deal with low-resource situations.
    • Design and implementation of flexible data generation methods to generate custom-labelled datasets for related tasks. A major issue that frequently arises in real-world scenarios is a lack of annotated data. Our proposed data generation strategy, which is capable of working with low-resource languages with an imbalanced distribution, is intended to address this issue. In addition, it will facilitate future expansion and the creation of custom datasets for related tasks. The generated data would not only be ideal for automated systems like the “bias flipper” but would also be beneficial for social scientists and researchers interested in analysing news reporting bias.
Further research:

    • We plan to analyze the speed of reporting, time-span, and importance given to the events by the outlets. In addition, we are also looking into how the outlets change their coverage style over time.
    • We intend to diversify our additional knowledge sources. In particular, we intend to investigate how knowledge sources such as Wiktionary and ConceptNet influence the task of polarity prediction.
    • We are planning to extend this study beyond polarity prediction to its quantification and correction. We believe it would be interesting to experiment with auxiliary tasks involving news headlines in a multitask learning paradigm.
    • We intend to have a distribution of articles with positive and negative sentiment for specific events and outlets. This would reveal not only the outlet’s political orientation but also the editorial’s overall attitude.

 

ESR13: Anna Katrine Jørgensen PhD enrolment (2019-20): UvA
Project Title: Multilingual Wikipedia as ‘first draft of history’ (WP6 –  Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Perform cross-cultural comparison of Wikipedia language versions of articles on emerging news events and their temporal evolution.
Expected Results: Identification of language-specific and community-specific differences across Wikipedia language editions with respect to coverage of emerging news events over time.
Current events are popular and frequent pages on Wikipedia, with each language version offering a different representation of the events. These representations are sequences of temporally and culturally situated drafts of history that are continuously revised and repositioned by the Wikipedians. Despite their popularity, there is still a lack of research on how current-event representations are created, how they develop over time, and how they differ across language versions.

In my PhD, I research the creation and development of current event pages on Wikipedia and perform large-scale temporal analysis of events across different European Wikipedia language versions. I specifically research events that have a social or cultural significance for the European Union or European cultures such as the European Refugee Crisis, the Arab Spring, and Brexit.

I also work on cultural recommendation of page creations based on cultural and social relevance, preferences and importance, as well as the relationship between events, location and related entities.

 

ESR14: Diego Alves PhD enrolment: FFZG
Project Title: NLP for under-resourced languages (WP4 –  Event-centric cross-lingual Information processing)
Objectives: Extend Language Processing Pipelines (LPPs) for the well-resourced EU languages and gradually add new languages.
Impact: The new typological methods that were proposed made it possible to compare and classify languages in an innovative way, providing valuable information for improving dependency parsing. The optimized methods were applied to low-resourced European languages (i.e. Lithuanian, Hungarian, Irish, and Maltese), improving their scores for automatic syntactic annotation. A Speech-To-Text translation method from Latvian to English was proposed that improves results by synthesising errors in the training phase; this solution improved overall scores. As part of CLEOPATRA activities (RD Week and Hackathon), a new hierarchy for named-entity recognition (UNER) and classification was developed, together with a dedicated pipeline for extracting and annotating data from Wikipedia, which can be applied to any language available there. Constructive discussions while using the data available in one of the collections of Arquivo.pt allowed the researchers from that institution to come up with new services to support users of their resources.
Further research: The typological methods identified for better understanding the linguistic phenomena that arise when languages are combined to train dependency parsing models will be further improved, and the research will be expanded to other language families. Regarding the named-entity project, the idea is to continue working in this domain, first improving the automatic extraction and annotation and then creating UNER datasets for all EU languages.

 

ESR15: Gaurish Thakkar PhD enrolment: FFZG
Project Title: Cross-lingual sentiment detection (WP4 –  Event-centric cross-lingual Information processing)
Objectives: Produce and test a cross-lingual sentiment detection module with support for under-resourced EU languages.
Impact: Sentiment analysis for low-resourced languages is an under-researched area. The impact of the research carried out within Cleopatra  in the field can be summarised as follows:

    • Study of various large language models using probing mechanisms (negation) to improve the final sentiment analysis performance for low-resourced Slavic languages
    • Improving the overall scores using multitask learning and resources from the same and distant family languages in low- and high-resource settings
    • Study of various data augmentation techniques for the task of sentiment analysis and their performance in low-resourced languages
    • Development of new resource(s) for SA in Slavic languages
Further research: Ideas for follow-up are:

    • Experiment with datasets from multiple language families and the transfer of sentiment for low-resourced languages.
    • Introduction of multiple modalities in the form of audio and video annotation from other language families for SA.
    • More sophisticated data augmentation tools for low-resourced languages.
    • Focus on individual instances’ importance and domain sensitivity when using the source and target languages resources in cross-lingual experimentation.
    • Study of noisy labels in low-resourced settings using confident learning is another field to be explored.

ESR17: Sahar Tahmasebi PhD enrolment: TIB
Project Title: Multi Modal Fact Validation (WP4 – Event-centric cross-lingual Information processing)
Objectives: Develop methods for multimodal fact validation by exploiting hybrid inputs, provide evidence for them and learn rich representations.
Impact: Research in this field provided a new approach for multimodal fake news detection with better generalisation capability on realistic use cases.
Further research: Possible follow-up research directions are:

    • Improving the applicability of the method by considering related evidence (for example user credibility, geolocation, etc.) and extending the methods to other modalities that can be used to spread fake news
    • Providing a dataset for chart fact checking in order to validate a textual claim based on scientific charts (e.g. bar charts, line charts, pie charts, etc.)

ESR18: Alberto Olivieri PhD enrolment (2022-3): UvA
Project Title: Cross Cultural Comparison of Russian State Controlled and Independent Media (WP6 –  Event-centric cross-lingual analytics and cross-cultural studies)
Objectives: Compare the images of the war/special military operation in Ukraine, and thereby the underlying semantic message that is implicitly conveyed by the actors involved.
Expected Results: Insights on how state and independent actors develop, build, and spread narratives and on the visual semantics involved.
The project focuses on the clash of narratives in contemporary society, the influence of these narratives on society at large, and the semantics they use. Today we witness ferocious debates over what counts as truth, waged from opposite sides of a narrative battle that can no longer be ignored. These narratives are powerful tools that can fundamentally shape an individual’s belief system and influence how that individual perceives reality. We should therefore try to understand better how they work in the digital age, with its new and evolving challenges.
