Open Event Knowledge Graph

One of the key planned outputs of the CLEOPATRA project is its Open Event Knowledge Graph (OEKG). Its second version (V2.0) was released on 31 January 2021 and builds on the EventKG and several other data sets created by the CLEOPATRA ESRs. Further information is available on the OEKG website, and the officially released dataset is on Zenodo. The White Paper describing the second version of the OEKG, by Endri Kacupaj, Simon Gottschalk, Elena Demidova and Maria Maleshkova with contributions from the CLEOPATRA ESRs, is available for download. A separate page covers the first release of the OEKG (OEKG V1.0), published in March 2020.

The OEKG V2.0 contains more than 1 million events, in 15 languages (English, French, German, Italian, Russian, Portuguese, Spanish, Dutch, Polish, Norwegian (Bokmål), Romanian, Croatian, Slovene, Bulgarian and Danish). The OEKG is composed of seven different data sets from multiple application domains, including question answering, entity recommendation and named entity recognition. These data sets were all integrated through an easy-to-use and robust pipeline.


The OEKG integrates seven datasets. The following table provides an overview of these datasets, including the number of triples in the OEKG. At the end of this page, there is an extended list of data set descriptions.

| Dataset | Short Description | Triples |
| --- | --- | --- |
| EventKG light | A light-weight version of EventKG, a multilingual, event-centric knowledge graph | 434,752,387 |
| EventKG+Click | A data set of language-specific event-centric user interaction traces | 118,662 |
| VQuAnDa | A verbalization question answering dataset | 38,243 |
| MLM | A benchmark dataset for multitask learning with multiple languages and modalities | 942,753 |
| Information Spreading | A data set for information spreading over the news | 277,992 |
| TIME | Two collections of news articles related to the Olympic legacy and Euroscepticism | 70,754 |
| UNER | The universal named-entity recognition framework | 206,622 |
| OEKG | The Open Event Knowledge Graph | 436,407,413 |
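As a quick consistency check on the table, the seven per-dataset triple counts can be added up and compared with the OEKG row. The sketch below only uses the numbers from the table; the variable and function names are illustrative.

```python
# Triple counts per integrated dataset, copied from the overview table.
TRIPLES = {
    "EventKG light": 434_752_387,
    "EventKG+Click": 118_662,
    "VQuAnDa": 38_243,
    "MLM": 942_753,
    "Information Spreading": 277_992,
    "TIME": 70_754,
    "UNER": 206_622,
}
OEKG_TOTAL = 436_407_413  # the "OEKG" row of the table


def share_of_total(name: str) -> float:
    """Return the fraction of all OEKG triples contributed by one dataset."""
    return TRIPLES[name] / OEKG_TOTAL


total = sum(TRIPLES.values())
print(total == OEKG_TOTAL)  # the seven rows add up to the OEKG row
print(f"EventKG light share: {share_of_total('EventKG light'):.2%}")
```

The check shows that EventKG light accounts for well over 99% of the triples, so the other six datasets are small extensions layered on top of it.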


The following figure shows the OEKG V2.0 schema. The schema is based on the EventKG schema, extended to integrate the other six data sets.

Figure: Excerpt of the OEKG V2.0 schema.

For a more detailed description of the OEKG schema and a list of the prefixes, check the OEKG website.

SPARQL Endpoint and Examples

Check the OEKG website for the SPARQL endpoint to the OEKG and for selected examples demonstrating the use of the OEKG for type-specific image retrieval, hybrid question answering over knowledge graphs and news articles, as well as language-specific event recommendation.
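To illustrate what a request to such an endpoint might look like, here is a minimal sketch that encodes a SPARQL query as a GET URL following the SPARQL 1.1 Protocol. The endpoint URL is a hypothetical placeholder, not the real OEKG address, and the query assumes that events are typed as sem:Event as in the EventKG schema; check the schema description and prefix list before relying on it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical placeholder: substitute the endpoint URL published on the OEKG website.
ENDPOINT = "https://example.org/sparql"

# Assumption: events are instances of sem:Event and carry language-tagged rdfs:label values.
QUERY = """
PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?event ?label WHERE {
  ?event a sem:Event ;
         rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
LIMIT 10
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL (the protocol's 'query' parameter)."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be passed to any HTTP client, typically with an `Accept: application/sparql-results+json` header to request JSON results.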

Detailed Dataset Descriptions

The following tables provide more insights into the datasets that are integrated into the OEKG.


Partner organization: LUH
Name of the dataset: EventKG
Description of the dataset: The EventKG is a multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as less structured sources such as the Wikipedia Current Events Portal and Wikipedia event lists in 15 languages. The EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications.
Languages: English, German, French, Italian, Portuguese, Russian, Spanish, Dutch, Polish, Croatian, Bulgarian, Norwegian (Bokmål), Romanian, Slovene and Danish
Format: RDF (.nq, .ttl)
Dataset size: ~150 GB
Technical requirements (repository, libraries, …): SPARQL
Licensing: Creative Commons Attribution Share Alike 4.0 International
Further details: Publications:

Simon Gottschalk and Elena Demidova. EventKG – the Hub of Event Knowledge on the Web – and Biographical Timeline Generation. Semantic Web Journal. In press.

Simon Gottschalk and Elena Demidova. EventKG: A Multilingual Event-Centric Temporal Knowledge Graph. In Proceedings of the Extended Semantic Web Conference (ESWC 2018).

SPARQL endpoint:

Example application:


Partner organization: UBO
Name of the dataset: VQuAnDa: Verbalization QUestion ANswering DAtaset
Description of the dataset: VQuAnDa is an answer verbalization dataset based on LC-QuAD, a commonly used large-scale question answering dataset. It contains 5,000 questions, each with the corresponding SPARQL query and the verbalized answer. The target knowledge base is DBpedia, specifically the April 2016 version.
Languages: English only
Dataset size: 5k samples (question, SPARQL query, answer verbalization)
Technical requirements (repository, libraries, …): SPARQL
Licensing: Attribution 4.0 International (CC BY 4.0)
Publication: Kacupaj, Endri, et al. “VQuAnDa: Verbalization Question Answering Dataset.” European Semantic Web Conference. Springer, Cham, 2020.


Partner organization: FFZG, KCL, LUH
Name of the dataset: UNER (Universal Named Entity Recognition)
Description of the dataset: The dataset is composed of parallel corpora based on the content published on the SETimes news portal (news and views from Southeast Europe), annotated in terms of events as defined in the ACE 2005 corpus and of named entities following a new classification hierarchy composed of 3 levels:

1st level: 8 supertypes

2nd level: 47 types

3rd level: 69 subtypes

Languages: Albanian, Bulgarian, Bosnian, Croatian, English, Greek, Macedonian, Romanian, Serbian and Turkish
Format: BIO index based
Dataset size: 200k sentences for each language
Licensing: CC-BY-SA
Documentation: Under development.
Publication: Alves, Diego, et al. “UNER: Universal Named-Entity Recognition Framework.” CLEOPATRA – 1st International Workshop on Cross-lingual Event-centric Open Analytics, 2020.
Further details: The dataset is being developed by pre-annotating the English corpus with automatic tools, followed by a correction step via crowdsourcing; the corrected annotations are then automatically propagated to the other languages.

SETimes dataset:
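For readers unfamiliar with BIO-based annotation, the sketch below shows how BIO tags (B- for the beginning of an entity, I- for its continuation, O for tokens outside any entity) can be grouped into entity spans. The tokens and the type labels `Person` and `Location` are made up for illustration and are not taken from the UNER hierarchy; the actual file layout of the dataset may differ.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open entity
            # (such stray I- tokens are dropped in this sketch).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Illustrative sentence with hypothetical type labels.
tokens = ["Maria", "Silva", "visited", "Zagreb", "."]
tags = ["B-Person", "I-Person", "O", "B-Location", "O"]
print(bio_to_spans(tokens, tags))
# [('Maria Silva', 'Person'), ('Zagreb', 'Location')]
```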


Partner organization(s): UBO, TIB, JSI
Involved ESRs: ESR 5 (Jason Armitage), ESR 6 (Endri Kacupaj), ESR 8 (Golsa Tahmasebzadeh), ESR 12 (Swati)
Name of the dataset: MLM
Description of the dataset: MLM is a processed data extraction from Wikidata and Wikipedia for multilingual and multimodal tasks. The primary aim is to train and evaluate systems designed to perform multiple tasks over diverse data.
Languages: English, French, German
Format: JSON (text, geo-coordinates, triples); PNG (images)
Dataset size: ≈200k samples (four modalities per sample)
Technical requirements (repository, libraries, …): None
Licensing: Creative Commons Public Licence
Publication: Armitage, Jason, et al. “MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities.” Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.


Partner organization(s): LUH
Involved ESRs: ESR 2 (Sara Abdollahi)
Name of the dataset: EventKG+Click
Description of the dataset: EventKG+Click is a novel cross-lingual dataset that reflects the language-specific relevance of events and their relations. This dataset aims to provide a reference source to train and evaluate novel models for event-centric cross-lingual user interaction, with a particular focus on the models supported by knowledge graphs.

EventKG+Click consists of two subsets:

  1. EventKG+Click_event, which contains relevance scores, location-closeness, recency and Wikipedia link count factors for more than 4 thousand events; and
  2. EventKG+Click_relation, with nearly 10 thousand event-centric click-through pairs, together with their language-specific number of clicks, their relation relevance and the co-mentions of the relation, i.e. the number of sentences in the whole Wikipedia language edition that mention both the source and the target.

Languages: English, German, Russian
Dataset size: 3 MB in total; 4113 events in EventKG+Click_event; 9119 event-centric click-through pairs in EventKG+Click_relation
Licensing: CC BY-SA 4.0
Publication: Sara Abdollahi, Simon Gottschalk, and Elena Demidova. “EventKG+Click: A Dataset of Language-specific Event-centric User Interaction Traces.” CLEOPATRA – 1st International Workshop on Cross-lingual Event-centric Open Analytics, 2020.


Partner organization(s): JSI
Involved ESRs: ESR 11 (Abdul Sittar)
Name of the dataset: Information Spreading Over News
Description of the dataset: This data set focuses on three contrasting events (Global Warming, FIFA World Cup and earthquake). The main purpose of collecting this data set is to understand information spreading patterns and to detect barriers in events related to different domains such as sports, natural disasters and climate change.
Languages: Five languages (English, Spanish, German, Slovene, Portuguese)
Dataset size: 2682, 3147 and 1944 news articles related to the FIFA World Cup, earthquake and Global Warming, respectively
Licensing: Creative Commons Attribution 4.0 International
Documentation: Each article includes the following metadata: id, title, body, similarity-score, class, event, article-url, publisher, political-alignment, publishing time, country, country-timezone, country-economic-conditions, country-culture, and country-lat/long. This metadata will be used to create the OEKG's schema.
Publication: Sittar, Abdul, Dunja Mladenić, and Tomaž Erjavec. “A Dataset for Information Spreading over the News”. Conference on Data Mining and Data Warehouses (SiKDD) (2020).
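The article metadata listed under Documentation could be modelled as a simple record type. The sketch below is an illustration only: the field names mirror the listed metadata, but the Python types, the split of country-lat/long into two fields and the class name are assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass
class NewsArticle:
    """One news article with the metadata fields listed in the documentation."""
    id: str
    title: str
    body: str
    similarity_score: float
    article_class: str  # "class" in the dataset; renamed because it is a Python keyword
    event: str
    article_url: str
    publisher: str
    political_alignment: str
    publishing_time: str  # assumed to be a timestamp string
    country: str
    country_timezone: str
    country_economic_conditions: str
    country_culture: str
    country_lat: float   # "country-lat/long" split into two fields (assumption)
    country_long: float


# Article counts per event, as given under "Dataset size".
ARTICLES_PER_EVENT = {"FIFA World Cup": 2682, "earthquake": 3147, "Global Warming": 1944}

print([f.name for f in fields(NewsArticle)])
print(sum(ARTICLES_PER_EVENT.values()))  # 7773 articles in total
```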


Partner organization(s): UOL
Involved ESRs: ESR 9 (Daniela Major) and ESR 10 (Caio Castro Mello)
Name of the dataset: TIME: Temporal Discourse Analysis applied to Media Articles
Description of the dataset: During the weeks preceding the CLEOPATRA R&D week we defined research questions and thought about the best ways to answer them. The social scientists in the group were especially interested in analysing media texts on two different topics (the concept of Olympic legacy and the concept of Euroscepticism). The choice of media outlets also followed the logic of our research questions: the comparative approach was always a priority for both topics. In the case of the concept of legacy, we chose to scrape data on the Rio and London Olympics in both Brazilian and British media. For Euroscepticism, we chose English and Spanish media coverage.
Languages: English, Portuguese, Spanish