Open Event Knowledge Graph (OEKG V1.0)
One of the key planned outputs of the CLEOPATRA project is its Open Event Knowledge Graph (OEKG). The core of the OEKG (V1.0) is built on EventKG V3.0, which was released on 31 March 2020. You can find more information at http://eventkg.l3s.uni-hannover.de/, and the officially released dataset on Zenodo (https://zenodo.org/record/3733829). The White Paper describing the first version of the OEKG, by Maria Maleshkova, Elena Demidova, Simon Gottschalk and Endri Kacupaj, with contributions from CLEOPATRA ESRs, can be downloaded here: Open Event Knowledge Graph V1.
EventKG V3.0/OEKG V1.0 now contains more than 1 million events in 15 languages (English, French, German, Italian, Russian, Portuguese, Spanish, Dutch, Polish, Norwegian (Bokmål), Romanian, Croatian, Slovene, Bulgarian and Danish).
The language coverage and the number of events in the knowledge graph were significantly extended based on the work done during a CLEOPATRA demonstrator session in February 2020, so thanks to all of our contributors!
Below, you can find detailed descriptions of all the newly developed CLEOPATRA datasets which are already available:
Partner organization | LUH |
Name of the dataset | EventKG |
Description of the dataset | EventKG is a multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as from less structured sources such as the Wikipedia Current Events Portal and Wikipedia event lists in six languages. EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications. |
Multilingual (which languages) | English, German, French, Italian, Portuguese, Russian |
URL | http://eventkg.l3s.uni-hannover.de/ |
Dataformat (RDF, JSON, XML, text) | RDF (.nq, .ttl) |
Dataset size | ~30 GB |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Creative Commons Attribution Share Alike 4.0 International |
Documentation | https://github.com/sgottsch/eventkg |
Further details | Publications:
Simon Gottschalk and Elena Demidova. EventKG – the Hub of Event Knowledge on the Web – and Biographical Timeline Generation. Semantic Web Journal. In press.
Simon Gottschalk and Elena Demidova. EventKG: A Multilingual Event-Centric Temporal Knowledge Graph. In Proceedings of the Extended Semantic Web Conference (ESWC 2018).
SPARQL endpoint: http://eventkginterface.l3s.uni-hannover.de/sparql (a minimal query sketch follows this table)
Example application: http://eventkg-timeline.l3s.uni-hannover.de/ |
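For readers who want to try the SPARQL endpoint above, here is a minimal sketch in Python using the SPARQLWrapper library. The sem:Event class and prefix follow EventKG's SEM-based schema; please verify them against the documentation linked above before relying on them.

```python
# Minimal sketch: listing a few events from the public EventKG SPARQL endpoint.
# Requires: pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://eventkginterface.l3s.uni-hannover.de/sparql")
endpoint.setQuery("""
    PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?event ?label WHERE {
        ?event a sem:Event ;
               rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["event"]["value"], "-", row["label"]["value"])
```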
Partner organization | LUH |
Name of the dataset | Event-QA |
Description of the dataset | Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs.
The Event-QA dataset contains 300 semantic queries over EventKG and the corresponding verbalisations. |
Multilingual (which languages) | English, German, Portuguese |
URL | http://eventcqa.l3s.uni-hannover.de/ |
Dataformat (RDF, JSON, XML, text) | JSON |
Dataset size | 300 queries |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Creative Commons Attribution Share Alike 4.0 International |
Documentation | |
Further details | Cite as: Tarcisio Souza Costa, Simon Gottschalk and Elena Demidova. “Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs”. http://eventcqa.l3s.uni-hannover.de/
The number of queries will be increased to ca. 1,000 in the next release (a minimal loading sketch follows this table). |
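A minimal loading sketch for the Event-QA JSON release follows; the file name and the field names ("question", "query") are illustrative assumptions, so please check the actual schema at http://eventcqa.l3s.uni-hannover.de/.

```python
# Minimal sketch for iterating over Event-QA. The file name "EventQA.json" and
# the field names "question"/"query" are assumptions for illustration only;
# verify against the released files before use.
import json

with open("EventQA.json", encoding="utf-8") as f:
    dataset = json.load(f)  # assuming a top-level JSON list of samples

for sample in dataset[:3]:
    print("Question:", sample["question"])  # natural-language verbalisation
    print("Query:   ", sample["query"])     # corresponding SPARQL over EventKG
```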
Partner organization | LUH |
Name of the dataset | The German Web corpus |
Description of the dataset | The German Web corpus covers all Web pages from the .de top-level domain captured by the Internet Archive from 1996 to 2013. The HTML portion (~30 TB) comprises 4.05 billion captures of 1 billion URLs; the overall corpus is ~80 TB and also includes English content.
From this corpus, a collection of German news sites was created based on a set of 400 domains of German news websites (see the WARC reading sketch after this table). |
Multilingual (which languages) | German (primarily), English |
URL | Available only at LUH, on site |
Dataformat (RDF, JSON, XML, text) | WARC, JSON |
Dataset size | ~80 TB
German news collection: 4.3 TB (32,794,626 captures) |
Technical requirements (repository, libraries, …) | Hadoop cluster, ElasticSearch |
Licensing | Research only |
Documentation | http://alexandria-project.eu/datasets/german-and-uk-web-archive/
German news collection: https://github.com/tarcisiosouza/elastic-client-api |
Further details | – |
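Since the corpus is stored as WARC files, here is a minimal reading sketch with the warcio library; the file path is a placeholder, as the corpus itself is only accessible on site at LUH.

```python
# Minimal sketch for reading captures from a WARC file with warcio
# (pip install warcio). "example.warc.gz" is a placeholder path.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html), "bytes")
```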
Partner organization | UBO |
Name of the dataset | FactBench |
Description of the dataset | FactBench is a multilingual benchmark for the evaluation of fact validation algorithms. All facts in FactBench are scoped with a timespan in which they were true, enabling the validation of temporal relation extraction algorithms. FactBench currently supports English, German and French. The current release is available at the URL below. |
Multilingual (which languages) | English, German, French |
URL | https://github.com/DeFacto/FactBench/tree/master/core |
Dataformat (RDF, JSON, XML, text) | RDF models |
Dataset size | 1500 correct statements, 780 negative examples |
Technical requirements (repository, libraries, …) | SPARQL or MQL |
Licensing | The MIT License (MIT) |
Documentation | https://github.com/DeFacto/FactBench |
Further details | Used by DeFacto |
Partner organization | UBO |
Name of the dataset | VQuAnDa: Verbalization QUestionANswering DAtaset |
Description of the dataset | VQuAnDa is an answer verbalization dataset based on a commonly used large-scale Question Answering dataset, LC-QuAD. It contains 5,000 questions, the corresponding SPARQL queries, and the verbalized answers. The target knowledge base is DBpedia, specifically the April 2016 version (see the loading sketch after this table). |
Multilingual (which languages) | No (English) |
URL | https://figshare.com/projects/VQuAnDa/72488 |
Dataformat (RDF, JSON, XML, text) | JSON |
Dataset size | 5k samples (question, SPARQL query, answer verbalization) |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Attribution 4.0 International (CC BY 4.0) |
Documentation | http://vquanda.sda.tech/ |
Further details |
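A minimal loading sketch for VQuAnDa follows; the file name and field names ("question", "query", "verbalized_answer") are assumptions based on the dataset description, so please confirm them against the files on figshare.

```python
# Minimal sketch for loading a VQuAnDa JSON file. The file name and the field
# names are assumptions for illustration; verify against the figshare release.
import json

with open("train.json", encoding="utf-8") as f:  # placeholder file name
    data = json.load(f)

sample = data[0]
print("Question:     ", sample["question"])
print("SPARQL query: ", sample["query"])
print("Verbalization:", sample["verbalized_answer"])
```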
Partner organization | FCT-FCCN |
Name of the dataset | Arquivo.pt web archive |
Description of the dataset | Arquivo.pt is a research infrastructure that preserves content written in several languages, broadly of interest to the Portuguese community and related to research and education in general.
Arquivo.pt has been developing special web collections about international events such as the European Elections, online news, Wikipedia and the centenary of World War I. Arquivo.pt also collected and preserved 50.4 million Web files related to R&D activities funded by the EU since 1994 (FP4 to FP7). All the outputs from this study were made publicly available, and we believe they constitute a unique and precious resource for research activities in all fields of knowledge. Arquivo.pt provides access to its collection of historical web data through a public web user interface and an API that enables the refinement of queries, e.g. by special collection (see the API sketch after this table). ESRs can also access Arquivo.pt Big Data Analytics, based on Hadoop, to perform investigations that require automatic processing of large-scale web collections. |
Multilingual (which languages) | Mostly Portuguese, English, French and Spanish. No language restrictions are applied, so in principle documents in any language may be found. |
URL | https://arquivo.pt https://arquivo.pt/api |
Dataformat (RDF, JSON, XML, text) | JSON, XML, HTML |
Dataset size | 6,062 million web files collected from 14 million websites, stored in 336 TB (compressed) |
Technical requirements (repository, libraries, …) | Knowledge about JSON and REST APIs |
Licensing | https://sobre.arquivo.pt/en/about/terms-and-conditions/ |
Documentation | https://github.com/arquivo |
Further details | https://sobre.arquivo.pt/en/ |
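As a starting point for the API mentioned above, here is a minimal sketch against the Arquivo.pt full-text search endpoint; the endpoint, parameter names and response fields are taken from the public API documentation, but treat them as assumptions and verify them at https://arquivo.pt/api.

```python
# Minimal sketch: querying the Arquivo.pt full-text search API with requests.
# Endpoint, parameters and response fields should be verified against
# https://arquivo.pt/api before use.
import requests

resp = requests.get(
    "https://arquivo.pt/textsearch",
    params={"q": "European Elections", "maxItems": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("response_items", []):
    print(item.get("tstamp"), item.get("title"), item.get("linkToArchive"))
```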
Partner organization | University of Southampton |
Name of the dataset | Global web news feed (RSS) |
Description of the dataset | Monthly collections of news articles, harvested from a seeded RSS list. Each month contains around 30 million posts. Deduplication of posts is required (see the sketch after this table). |
Multilingual (which languages) | English |
URL | https://webobservatory.soton.ac.uk/datasets/NKtKuwrMei8SFQG4H |
Dataformat (RDF, JSON, XML, text) | |
Dataset size | Various sizes |
Technical requirements (repository, libraries, …) | |
Licensing | |
Documentation | |
Further details |
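Since the description notes that deduplication is required, here is a minimal sketch of one way to do it; the input format and field names ("title", "link") are assumptions, as the collection's exact schema is not specified here.

```python
# Minimal deduplication sketch: posts are keyed by a hash of (title, link).
# The input format and field names are assumptions for illustration.
import hashlib
import json

def dedupe(posts):
    seen, unique = set(), []
    for post in posts:
        key = hashlib.sha1(
            (post.get("title", "") + post.get("link", "")).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

with open("posts.json", encoding="utf-8") as f:  # placeholder path
    posts = json.load(f)
print(len(dedupe(posts)), "unique posts out of", len(posts))
```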
Partner organization | University of Southampton |
Name of the dataset | Crisisnet qualitative data reports (USHAHIDI) |
Description of the dataset | A collection of 7,000+ qualitative reports collected from the Ushahidi + CrisisNet platform. These have been written and curated by first responders at major disaster events (e.g. the Haiti earthquake). Each record contains a timestamp, eventID, and message/text relating to a specific event. |
Multilingual (which languages) | English |
URL | https://webobservatory.soton.ac.uk/datasets/3cZxMoGEfmoMCTEA7 |
Dataformat (RDF, JSON, XML, text) | Text |
Dataset size | Various sizes |
Technical requirements (repository, libraries, …) | |
Licensing | |
Documentation | |
Further details |
Partner organization | TIB |
Name of the dataset | Im2GPS |
Description of the dataset | Im2GPS is a test set for geolocation estimation. The test set contains 237 geo-tagged photos, of which 5% depict specific touristic sites while the remainder are only recognizable in a generic sense. The test set was originally crawled from Flickr. |
Multilingual (which languages) | – |
URL | http://graphics.cs.cmu.edu/projects/im2gps/ |
Dataformat (RDF, JSON, XML, text) | JPG |
Dataset size | 237 images, 40.8 MB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons licenses |
Documentation | – |
Further details | GPS tags of the Im2GPS test set must be extracted from EXIF data (see the sketch below) |
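A minimal sketch of the EXIF step mentioned above, using Pillow to turn the EXIF GPS rationals into decimal coordinates (it assumes a recent Pillow version, where the values support float()); the same step applies to the Im2GPS3k test set described below.

```python
# Minimal sketch: extracting decimal GPS coordinates from a JPEG's EXIF data
# with Pillow (pip install Pillow). "photo.jpg" is a placeholder file name.
from PIL import Image
from PIL.ExifTags import GPSTAGS

def to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals to a signed decimal."""
    degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
    return -degrees if ref in ("S", "W") else degrees

def gps_from_exif(path):
    exif = Image.open(path)._getexif() or {}
    raw = exif.get(34853)  # 34853 is the EXIF tag ID of the GPSInfo block
    if not raw:
        return None
    gps = {GPSTAGS.get(tag, tag): value for tag, value in raw.items()}
    return (to_decimal(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_decimal(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

print(gps_from_exif("photo.jpg"))  # e.g. (48.8584, 2.2945)
```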
Partner organization | TIB |
Name of the dataset | Im2GPS3k |
Description of the dataset | Im2GPS3k is a test set for geolocation estimation. The test set contains 3,000 geo-tagged images that are different from the images in the Im2GPS benchmark. The dataset was originally collected from Flickr. |
Multilingual (which languages) | – |
URL | http://www.mediafire.com/file/7ht7sn78q27o9we/im2gps3ktest.zip |
Dataformat (RDF, JSON, XML, text) | JPG |
Dataset size | 3,000 images, 479.1 MB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons licenses |
Documentation | – |
Further details | GPS tags of the Im2GPS3k test set must be extracted from EXIF data (the sketch after the Im2GPS table applies here as well) |
Partner organization | TIB |
Name of the dataset | MP-16 dataset |
Description of the dataset | The MediaEval Placing Task 2016 (MP-16) dataset is a subset of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset and includes around five million geo-tagged images from Flickr without any restrictions. Besides photos of well-known places and landmarks, the dataset also contains ambiguous photos, e.g. of indoor environments and food. |
Multilingual (which languages) | English |
URL | http://multimedia-commons.s3-website-us-west-2.amazonaws.com/?prefix=subsets/YLI-GEO/mp16/metadata/ |
Dataformat (RDF, JSON, XML, text) | SQL |
Dataset size | 4.7 M training images |
Technical requirements (repository, libraries, …) | |
Licensing | Creative Commons licenses |
Documentation | |
Further details |
Partner organization | TIB |
Name of the dataset | Date Estimation in the Wild Dataset |
Description of the dataset | A collection of Flickr images for predicting when an image has been taken. The meta information provided was gathered via the Flickr API and covers the period from 1900 to 1999. |
Multilingual (which languages) | English |
URL | https://doi.org/10.22000/0001abcde |
Dataformat (RDF, JSON, XML, text) | JPG, CSV |
Dataset size | 1,029,710 images |
Technical requirements (repository, libraries, …) | Python |
Licensing | Meta data: CC BY 4.0 Attribution; image licenses: listed in meta.csv |
Documentation | This package contains: Meta information for 1,029,710 images (meta.csv) |
Further details | Publication: E. Müller, M. Springstein, R. Ewerth: “When was this picture taken?” – Image Date Estimation in the Wild. In: Proceedings of 39th European Conference on Information Retrieval (ECIR), Aberdeen, UK, 2017, 619-625. https://link.springer.com/chapter/10.1007/978-3-319-56608-5_57 |
Partner organization | TIB |
Name of the dataset | Semantic Image-Text-Classes |
Description of the dataset | This dataset comprises image-text pairs of eight different semantic image-text classes. Image-text pairs can be assigned to these classes by observing their purpose and classifying their interplay in conveying information. The dataset consists of 224,856 (automatically labeled) image-text pairs for training and 800 pairs with human-verified labels for testing. |
Multilingual (which languages) | English |
URL | https://doi.org/10.25835/0010577 |
Dataformat (RDF, JSON, XML, text) | PNG and JSON |
Dataset size | 225,656 image-text pairs, 45.3 GB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons Attribution-NonCommercial 3.0 |
Documentation | – |
Further details | Otto, C., Springstein, M., Anand, A., Ewerth, R., “Understanding, Categorizing and Predicting Semantic Image-Text Relations”, ACM International Conference on Multimedia Retrieval (ICMR), Ottawa, Canada, 2019. |
Partner organization | FFZG, KCL, LUH |
Name of the dataset | UNER (Universal Named Entity Recognition) |
Description of the dataset | The dataset is composed of parallel corpora based on content published on the SETimes.com news portal (news and views from Southeast Europe), annotated with events as defined in the ACE 2005 corpus and with named entities following a new classification hierarchy composed of 3 levels:
1st level: 8 supertypes |
Multilingual (which languages) | Albanian, Bulgarian, Bosnian, Croatian, English, Greek, Macedonian, Romanian, Serbian and Turkish. |
URL | TBD |
Dataformat (RDF, JSON, XML, text) | XML
(BIO Index based) |
Dataset size | 200k sentences for each language. |
Technical requirements (repository, libraries, …) | TBD |
Licensing | CC-BY-SA |
Documentation | Under development. |
Further details | The dataset is being developed by pre-annotating the English corpus with automatic tools, followed by a correction step via crowdsourcing; the corrected annotations are then automatically propagated to the other languages.
SETimes dataset: http://nlp.ffzg.hr/resources/corpora/setimes/ |
Partner organization(s) | UvA |
Involved ESRs | ESR 13 (Anna Jørgensen) |
Name of the dataset | “2019-20 coronavirus outbreak” on Wikipedia |
Description of the dataset (2-3 sentences) | The dataset contains the full edit histories of the “2019-20 coronavirus outbreak” pages from 70 language versions of Wikipedia (see the MediaWiki API sketch after this table).
The dataset is highly multilingual, covering a wide variety of alphabets, language families and language sizes (from Chinese to Scots). It is also highly multimodal. Core data: content, images, links, table of contents, URLs; metadata: image captions, article categorization, reference types, URL countries, user IDs |
Multilingual (which languages) | af, ar, az, bcl, be, bg, bn, br, ca, cdo, cs, cv, da, de, el, en, eo, es, et, eu, fa, fi, fr, ga, hak, he, hi, ht, hu, hy, id, is, it, ja, ka, kk, ko, ku, lij, lmo, lt, lv, mr, ms, my, nl, nn, pl, pt, ro, ru, sah, sc, sco, sq, sr, sv, sw, ta, th, tl, tr, ug, uk, ur, vec, vi, wuu, zh |
URL | |
Dataformat (RDF, JSON, XML, text) | JSON
Text |
Dataset size | 4.37 GB |
Technical requirements (repository, libraries, …) | None |
Licensing | Creative Commons Public Licence |
Documentation | Here (will be migrated to data storage soon) |
Publications | Forthcoming |
Further details | “2019-20 coronavirus outbreak” on Wikipedia is due for release at the end of March 2020 |
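For illustration only (this is not necessarily the pipeline used to build the dataset): edit histories like these can be retrieved through the standard MediaWiki API, as in the sketch below for the English article. Note that the article title follows the dataset description and may since have been renamed on Wikipedia.

```python
# Illustrative sketch: fetching recent revisions of the English article through
# the MediaWiki API. Other language versions use their own wiki domains.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "2019–20 coronavirus outbreak",  # title as in the dataset description
        "rvprop": "timestamp|user|comment",
        "rvlimit": 5,
        "format": "json",
    },
    timeout=30,
)
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page.get("revisions", []):
    print(rev["timestamp"], rev["user"], rev.get("comment", ""))
```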
Partner organization(s) | LUH |
Involved ESRs | ESR2 (Sara Abdollahi) |
Name of the dataset | EventKG+Click |
Description of the dataset (2-3 sentences) | EventKG+Click is a novel cross-lingual dataset that reflects the language-specific relevance of events and their relations. This dataset aims to provide a reference source to train and evaluate novel models for event-centric cross-lingual user interaction, with a particular focus on models supported by knowledge graphs.
EventKG+Click consists of two subsets: 1. EventKG+Click_event, which contains relevance scores, location-closeness, recency and Wikipedia link count factors for more than 4 thousand events; and 2. EventKG+Click_relation, with nearly 10 thousand event-centric click-through pairs, their language-specific numbers of clicks, relation relevance, and relation co-mentions, i.e. the number of sentences across the Wikipedia language editions that mention both the source and the target. |
Multilingual (which languages) | English, German, Russian |
URL | https://github.com/saraabdollahi/EventKG-Click |
Dataformat (RDF, JSON, XML, text) | text |
Dataset size | 3 MB in total
4113 events in EventKG+Click_event |
Technical requirements (repository, libraries, …) | |
Licensing | CC BY-SA 4.0 |
Documentation | |
Publications | |
Further details |
Partner organizations | UBO, TIB, JSI |
Involved ESRs | ESR 5 (Jason Armitage), ESR 6 (Endri Kacupaj), ESR 8 (Golsa Tahmasebzadeh), ESR 12 (Swati) |
Name of the dataset | Wiki-MLM |
Description of the dataset (2-3 sentences) | Wiki-MLM is a processed data extraction from Wikipedia for multilingual and multimodal tasks. The primary aim is to train and evaluate systems designed to perform multiple tasks over diverse data. |
Multilingual (which languages) | English, French, German |
URL | |
Dataformat (RDF, JSON, XML, text) | Text, geo-coordinates, triples: JSON; images: PNG |
Dataset size | ≈150k samples (four modalities per sample) |
Technical requirements (repository, libraries, …) | None |
Licensing | Creative Commons Public Licence |
Documentation | |
Publications | Paper due in April 2020 |
Further details | Wiki-MLM is due for first release in April 2020 |