Open Event Knowledge Graph

Open Event Knowledge Graph (OEKG V1.0)

One of the key planned outputs of the CLEOPATRA project is its Open Event Knowledge Graph (OEKG). The core of the OEKG (V1.0) is built on EventKG V3.0, which was released on 31 March 2020. Further information is available at http://eventkg.l3s.uni-hannover.de/, and the officially released dataset is on Zenodo (https://zenodo.org/record/3733829). The White Paper describing the first version of the OEKG, by Maria Maleshkova, Elena Demidova, Simon Gottschalk and Endri Kacupaj, with contributions from CLEOPATRA ESRs, can be downloaded here: Open Event Knowledge Graph V1.

EventKG V3.0/OEKG V1.0 now contains more than 1 million events, in 15 languages (English, French, German, Italian, Russian, Portuguese, Spanish, Dutch, Polish, Norwegian (Bokmål), Romanian, Croatian, Slovene, Bulgarian and Danish).

The language coverage and the number of events in the knowledge graph were significantly extended based on the work done during a CLEOPATRA demonstrator session in February 2020, so thanks to all of our contributors!

Below, you can find detailed descriptions of all the newly developed CLEOPATRA datasets which are already available:

Partner organization LUH
Name of the dataset EventKG
Description of the dataset The EventKG is a multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as less structured sources such as the Wikipedia Current Events Portal and Wikipedia event lists in six languages. The EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications.
Multilingual (which languages) English, German, French, Italian, Portuguese, Russian
URL http://eventkg.l3s.uni-hannover.de/
Dataformat (RDF, JSON, XML, text)  RDF (.nq, .ttl)
Dataset size ~ 30GB
Technical requirements (repository, libraries, …) SPARQL
Licensing Creative Commons Attribution Share Alike 4.0 International
Documentation https://github.com/sgottsch/eventkg
Further details Publications:
Simon Gottschalk and Elena Demidova. EventKG – the Hub of Event Knowledge on the Web – and Biographical Timeline Generation. Semantic Web Journal. In press.
Simon Gottschalk and Elena Demidova. EventKG: A Multilingual Event-Centric Temporal Knowledge Graph. In Proceedings of the Extended Semantic Web Conference (ESWC 2018).
SPARQL endpoint: http://eventkginterface.l3s.uni-hannover.de/sparql
Example application: http://eventkg-timeline.l3s.uni-hannover.de/
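
Since EventKG is accessed through SPARQL, a first query against the endpoint listed above might look like the following minimal sketch. It uses only the Python standard library and the standard SPARQL protocol (`query` parameter, JSON results via the `Accept` header). The query is deliberately schema-agnostic (it fetches five arbitrary triples), because the EventKG vocabulary is not described on this page. The helper names `build_request` and `run_query` are illustrative, and the availability of the endpoint is an assumption.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as listed in the EventKG entry; it may not be reachable at all times.
ENDPOINT = "http://eventkginterface.l3s.uni-hannover.de/sparql"

# Schema-agnostic query: fetch any five triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(ENDPOINT, QUERY)` would return a list of dictionaries, one per result row, keyed by the variable names `s`, `p` and `o`.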

 

Partner organization LUH
Name of the dataset Event-QA
Description of the dataset Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs.

The Event-QA dataset contains 300 semantic queries and the corresponding verbalisations for EventKG.

Multilingual (which languages) English, German, Portuguese
URL http://eventcqa.l3s.uni-hannover.de/

https://zenodo.org/record/2621415#.XXiLXCgzY2w 

Dataformat (RDF, JSON, XML, text)  JSON
Dataset size 300 queries
Technical requirements (repository, libraries, …) SPARQL
Licensing Creative Commons Attribution Share Alike 4.0 International
Documentation
Further details Cite as: “Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs” Tarcisio Souza Costa; Simon Gottschalk; Elena Demidova 

http://eventcqa.l3s.uni-hannover.de/
https://github.com/tarcisiosouza/Event-QA 

The number of queries will be increased to ca. 1000 in the next release.

 

Partner organization LUH
Name of the dataset The German Web corpus
Description of the dataset The German Web corpus covers all Web pages from the .de top-level domain as captured by the Internet Archive from 1996 to 2013. The HTML portion (~30TB) comprises 4.05 billion captures of 1 billion URLs. The overall size is ~80TB, and the corpus also includes English content.

From this corpus, a collection of German news sites was created based on a set of 400 domains of German news websites.

Multilingual (which languages) German (primarily), English
URL Available only at LUH, on site
Dataformat (RDF, JSON, XML, text)  WARC, JSON
Dataset size ~80TB

German news collection: 4.3TB (32,794,626 captures)

Technical requirements (repository, libraries, …) Hadoop cluster, ElasticSearch
Licensing Research only
Documentation http://alexandria-project.eu/datasets/german-and-uk-web-archive/ 

German news collection: https://github.com/tarcisiosouza/elastic-client-api

Further details

 

Partner organization UBO
Name of the dataset FactBench
Description of the dataset FactBench is a multilingual benchmark for the evaluation of fact validation algorithms. All facts in FactBench are scoped with a timespan in which they were true, enabling the validation of temporal relation extraction algorithms. FactBench currently supports English, German and French. The current release is available at the URL listed below.
Multilingual (which languages) Yes (English, German, French)
URL https://github.com/DeFacto/FactBench/tree/master/core
Dataformat (RDF, JSON, XML, text)  RDF models
Dataset size 1500 correct statements, 780 negative examples
Technical requirements (repository, libraries, …) SPARQL or MQL
Licensing The MIT License (MIT)
Documentation https://github.com/DeFacto/FactBench
Further details Used by DeFacto

 

Partner organization UBO
Name of the dataset VQuAnDa: Verbalization QUestionANswering DAtaset
Description of the dataset VQuAnDa is an answer verbalization dataset that is based on a commonly used large-scale Question Answering dataset – LC-QuAD. It contains 5,000 questions, the corresponding SPARQL query, and the verbalized answer. The target knowledge base is DBpedia, specifically the April 2016 version.
Multilingual (which languages) No (English)
URL https://figshare.com/projects/VQuAnDa/72488 
Dataformat (RDF, JSON, XML, text)  JSON
Dataset size 5k samples (question, SPARQL query, answer verbalization)
Technical requirements (repository, libraries, …) SPARQL
Licensing Attribution 4.0 International (CC BY 4.0)
Documentation http://vquanda.sda.tech/
Further details

 

Partner organization FCT-FCCN
Name of the dataset Arquivo.pt web archive
Description of the dataset Arquivo.pt is a research infrastructure that preserves content written in several languages broadly interesting to the Portuguese community and related to research and education in general.

Arquivo.pt has been developing special web collections about international events such as European Elections, online news, Wikipedia or the celebration of the 100 years of World War. Arquivo.pt also collected and preserved 50.4 million Web files related to R&D activities funded by the EU since 1994 (FP4 to FP7). All the outputs from this study were made publicly available and we believe they constitute a unique and precious resource for research activities in all fields of knowledge. 

Arquivo.pt provides access to its collection of historical web data through a public web user interface or an API that enables the refinement of queries (e.g. by special collection). 

ESRs can also access Arquivo.pt Big Data Analytics, based on Hadoop, for investigations that require automatic processing of large-scale web collections.

Multilingual (which languages) Mostly Portuguese, English, French and Spanish. No language restrictions are applied, so in theory documents in any language may be found.
URL https://arquivo.pt
https://arquivo.pt/api
Dataformat (RDF, JSON, XML, text)  JSON, XML, HTML
Dataset size 6062 million web files collected from 14 million websites, stored in 336 TB (compressed format)
Technical requirements (repository, libraries, …) Knowledge about JSON and REST APIs
Licensing https://sobre.arquivo.pt/en/about/terms-and-conditions/
Documentation https://github.com/arquivo
Further details https://sobre.arquivo.pt/en/
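
As a rough illustration of the API access mentioned in the Arquivo.pt entry, the sketch below builds a full-text search request using only the Python standard library. The `/textsearch` path and the `q` and `maxItems` parameter names are assumptions that should be checked against the documentation at https://arquivo.pt/api before use. The function names are illustrative, and the structure of the JSON response is not assumed here.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint path; verify against https://arquivo.pt/api.
TEXTSEARCH_URL = "https://arquivo.pt/textsearch"


def build_search_url(terms, max_items=10, base_url=TEXTSEARCH_URL):
    """Build a search URL; 'q' and 'maxItems' are assumed parameter names."""
    params = {"q": terms, "maxItems": max_items}
    return base_url + "?" + urllib.parse.urlencode(params)


def search(terms, max_items=10, timeout=30):
    """Run the search and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(build_search_url(terms, max_items), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `build_search_url("european elections", 5)` yields a URL that can also be opened in a browser to inspect the response format before writing any parsing code.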

 

Partner organization University of Southampton
Name of the dataset Global web news feed (RSS)
Description of the dataset Monthly collections of news articles, harvested from a seeded RSS list. Each month contains around 30 million posts. Users need to check for and remove duplicates.
Multilingual (which languages) English
URL https://webobservatory.soton.ac.uk/datasets/NKtKuwrMei8SFQG4H
Dataformat (RDF, JSON, XML, text) 
Dataset size Various sizes
Technical requirements (repository, libraries, …)
Licensing
Documentation
Further details
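
Because the RSS feed collections need a duplicate check, one simple approach is to fingerprint each post and keep only the first occurrence. The sketch below assumes each post is a dictionary with `title` and `link` fields; these field names are hypothetical, since the actual record format is not documented on this page. It removes exact duplicates after normalising case and whitespace, and it will not catch near-duplicates such as lightly edited reposts.

```python
import hashlib


def fingerprint(post):
    """Hash the normalised title and link (hypothetical field names)."""
    raw = post.get("title", "") + " " + post.get("link", "")
    normalised = " ".join(raw.lower().split())  # lower-case, collapse whitespace
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def deduplicate(posts):
    """Return the posts in their original order, keeping the first of each duplicate group."""
    seen = set()
    unique = []
    for post in posts:
        fp = fingerprint(post)
        if fp not in seen:
            seen.add(fp)
            unique.append(post)
    return unique
```

For a month of around 30 million posts, the set of fingerprints should still fit in memory on a typical workstation, but that depends on the machine and is an assumption rather than a measured result.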

 

Partner organization University of Southampton
Name of the dataset Crisisnet qualitative data reports (USHAHIDI)
Description of the dataset A collection of 7,000+ qualitative reports collected from the Ushahidi + CrisisNet platform. These have been written and curated by first responders at major disaster events (e.g. Haiti Earthquake). Each record contains a timestamp, eventID, and message/text relating to a specific event.
Multilingual (which languages) English
URL https://webobservatory.soton.ac.uk/datasets/3cZxMoGEfmoMCTEA7
Dataformat (RDF, JSON, XML, text)  Text
Dataset size Various sizes
Technical requirements (repository, libraries, …)
Licensing
Documentation
Further details

 

Partner organization TIB
Name of the dataset Im2GPS
Description of the dataset Im2GPS is a test set for geolocation estimation. The test set contains 237 geo-tagged photos, of which 5% depict specific touristic sites and the remainder are only recognizable in a generic sense. The test set was originally crawled from Flickr.
Multilingual (which languages)
URL http://graphics.cs.cmu.edu/projects/im2gps/
Dataformat (RDF, JSON, XML, text)  JPG
Dataset size 237 images, 40.8 MB
Technical requirements (repository, libraries, …)
Licensing Creative commons licenses
Documentation
Further details GPS tags of the Im2GPS test set must be extracted from EXIF data
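
EXIF stores latitude and longitude as three rational numbers (degrees, minutes, seconds) plus a hemisphere reference (`N`/`S` or `E`/`W`), so the extracted tags usually need to be converted to signed decimal degrees before they can be used as ground truth. The sketch below shows only that conversion step. Reading the raw tags from the JPG files requires an EXIF library of your choice and is not shown. The function name `dms_to_decimal` is illustrative.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert EXIF-style (degrees, minutes, seconds) to signed decimal degrees.

    Each component may be a plain number or a (numerator, denominator) pair,
    which is how EXIF rationals are often exposed by libraries.
    """
    parts = [Fraction(*v) if isinstance(v, tuple) else Fraction(v) for v in dms]
    degrees, minutes, seconds = parts
    value = degrees + minutes / 60 + seconds / 3600
    if ref in ("S", "W"):  # southern and western hemispheres are negative
        value = -value
    return float(value)
```

For example, `dms_to_decimal((40, 26, 46), "N")` gives a latitude of roughly 40.4461, and the same function applies to longitudes with an `E` or `W` reference.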

 

Partner organization TIB
Name of the dataset Im2GPS3k
Description of the dataset Im2GPS3k is a test set for geolocation estimation. The test set contains 3,000 geo-tagged images that are different from the images in the Im2GPS benchmark. The dataset was originally collected from Flickr.
Multilingual (which languages)
URL http://www.mediafire.com/file/7ht7sn78q27o9we/im2gps3ktest.zip
Dataformat (RDF, JSON, XML, text)  JPG
Dataset size 3,000 images, 479.1 MB
Technical requirements (repository, libraries, …)
Licensing Creative commons licenses
Documentation
Further details GPS tags of the Im2GPS3k test set must be extracted from EXIF data

 

Partner organization TIB
Name of the dataset MP-16 dataset
Description of the dataset The MediaEval Placing Task 2016 (MP-16) dataset is a subset of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset and includes around five million geo-tagged images from Flickr without any restrictions. Alongside photos of well-known places and landmarks, the dataset also contains ambiguous photos of, e.g., indoor environments and food.
Multilingual (which languages) English
URL http://multimedia-commons.s3-website-us-west-2.amazonaws.com/?prefix=subsets/YLI-GEO/mp16/metadata/
Dataformat (RDF, JSON, XML, text)  SQL
Dataset size 4.7 M training images
Technical requirements (repository, libraries, …)
Licensing Creative commons licenses
Documentation
Further details

 

Partner organization TIB
Name of the dataset Date Estimation in the Wild Dataset
Description of the dataset Collection of Flickr images for predicting when an image has been taken. The meta information provided was gathered by the Flickr API server and covers a range from 1900 to 1999.
Multilingual (which languages) English
URL https://doi.org/10.22000/0001abcde

https://github.com/TIB-Visual-Analytics/DEW-Downloader

Dataformat (RDF, JSON, XML, text)  JPG, CSV
Dataset size 1,029,710 images
Technical requirements (repository, libraries, …) Python 
Licensing Meta: CC BY 4.0 Attribution; Images: see meta.csv
Documentation This package contains: Meta information for 1,029,710 images (meta.csv)
Further details Publication: E. Müller, M. Springstein, R. Ewerth:  “When was this picture taken?” – Image Date Estimation in the Wild. In: Proceedings of 39th European Conference on Information Retrieval (ECIR), Aberdeen, UK, 2017, 619-625. https://link.springer.com/chapter/10.1007/978-3-319-56608-5_57

 

Partner organization TIB
Name of the dataset Semantic Image-Text-Classes
Description of the dataset This dataset comprises image-text pairs of eight different semantic image-text classes. Pairs of images and text can be assigned to these classes by observing their purpose and classifying their interplay in the process of conveying information. The dataset consists of 224,856 (automatically labeled) image-text pairs for training and 800 pairs with human-verified labels for testing.
Multilingual (which languages) English
URL https://doi.org/10.25835/0010577
Dataformat (RDF, JSON, XML, text)  PNG and JSON
Dataset size 225,656 image-text pairs, 45.3 GB
Technical requirements (repository, libraries, …)
Licensing Creative Commons Attribution-NonCommercial 3.0
Documentation
Further details Otto, C., Springstein, M., Anand, A., Ewerth, R., “Understanding, Categorizing and Predicting Semantic Image-Text Relations”, ACM International Conference on Multimedia Retrieval (ICMR), Ottawa, Canada, 2019. 

 

Partner organization FFZG, KCL, LUH
Name of the dataset UNER (Universal Named Entity Recognition)
Description of the dataset The dataset is composed of parallel corpora based on the content published on the SETimes.com news portal (news and views from Southeast Europe), annotated with events as defined in the ACE 2005 corpus and with named entities following a new classification hierarchy composed of 3 levels:

1st level: 8 supertypes
2nd level: 47 types
3rd level: 69 subtypes

Multilingual (which languages) Albanian, Bulgarian, Bosnian, Croatian, English, Greek, Macedonian, Romanian, Serbian and Turkish.
URL TBD
Dataformat (RDF, JSON, XML, text)  XML 

(BIO Index based)

Dataset size 200k sentences for each language.
Technical requirements (repository, libraries, …) TBD
Licensing CC-BY-SA
Documentation Under development.
Further details The database is being developed in three steps: the English corpus is pre-annotated with automatic tools, the annotations are corrected via crowdsourcing and, finally, they are automatically propagated to the other languages.

SETimes dataset: http://nlp.ffzg.hr/resources/corpora/setimes/
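
The UNER data format is described above only as "BIO Index based", so the exact encoding is not specified here. As a reminder of the common BIO convention (B marks the first token of an entity, I marks the following tokens inside it, O marks tokens outside any entity), the sketch below converts token-level entity spans into BIO tags. The labels `PER` and `LOC` and the function name are purely illustrative and are not taken from the UNER type hierarchy.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    `spans` is a list of (start, end, label) with token indices,
    where `end` is exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


tokens = ["Angela", "Merkel", "visited", "Sarajevo", "."]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
# tags is ["B-PER", "I-PER", "O", "B-LOC", "O"]
```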

 

Partner organization(s) UvA
Involved ESRs ESR 13 (Anna Jørgensen)
Name of the dataset “2019-20 coronavirus outbreak” on Wikipedia
Description of the dataset (2-3 sentences) The data set contains the full edit histories of the “2019-20 coronavirus outbreak” pages from 70 language versions on Wikipedia. 

The data set is highly multilingual, covering a wide variety of alphabets and language families as well as languages of very different sizes (from Chinese to Scots).

It is also highly multimodal. Core data: content, images, links, table of contents, URLs. Metadata: image captions, article categorization, reference types, URL countries, user ID.

Multilingual (which languages) af, ar, az, bcl, be, bg, bn, br, ca, cdo, cs, cv, da, de, el, en, eo, es, et, eu, fa, fi, fr, ga, hak, he, hi, ht, hu, hy, id, is, it, ja, ka, kk, ko, ku, lij, lmo, lt, lv, mr, ms, my, nl, nn, pl, pt, ro, ru, sah, sc, sco, sq, sr, sv, sw, ta, th, tl, tr, ug, uk, ur, vec, vi, wuu, zh
URL
Dataformat (RDF, JSON, XML, text)  JSON

Text
IP addresses
Images: links to commons.wikimedia.org

Dataset size 4.37 GB
Technical requirements (repository, libraries, …) None
Licensing Creative Commons Public Licence
Documentation Here (will be migrated to data storage soon)
Publications Forthcoming
Further details “2019-20 coronavirus outbreak” on Wikipedia is due for release at the end of March 2020

 

Partner organization(s) LUH
Involved ESRs ESR2 (Sara Abdollahi)
Name of the dataset EventKG+Click
Description of the dataset (2-3 sentences) EventKG+Click is a novel cross-lingual dataset that reflects the language-specific relevance of events and their relations. This dataset aims to provide a reference source to train and evaluate novel models for event-centric cross-lingual user interaction, with a particular focus on the models supported by knowledge graphs.

EventKG+Click consists of two subsets:

1. EventKG+Click_event, which contains relevance scores, location-closeness, recency and Wikipedia link count factors for more than 4 thousand events; and

2. EventKG+Click_relation, which contains nearly 10 thousand event-centric click-through pairs together with their language-specific number of clicks, relation relevance and co-mentions of the relation, i.e. the number of sentences across whole Wikipedia language editions that mention both the source and the target.

Multilingual (which languages) English, German, Russian
URL https://github.com/saraabdollahi/EventKG-Click
Dataformat (RDF, JSON, XML, text)  text
Dataset size 3 MB in total

4113 events in EventKG+Click_event
9119 event-centric click-through pairs in EventKG+Click_relation

Technical requirements (repository, libraries, …)
Licensing CC BY-SA 4.0
Documentation
Publications
Further details

 

Partner organizations UBO, TIB, JSI
Involved ESRs ESR 5 (Jason Armitage), ESR 6 (Endri Kacupaj), ESR 8 (Golsa Tahmasebzadeh), ESR 12 (Swati)
Name of the dataset Wiki-MLM
Description of the dataset (2-3 sentences) Wiki-MLM is a processed extract of Wikipedia data for multilingual and multimodal tasks. The primary aim is to train and evaluate systems designed to perform multiple tasks over diverse data.
Multilingual (which languages) English, French, German
URL
Dataformat (RDF, JSON, XML, text)  Text, geo-coordinates, triples – JSON
Images – PNG 
Dataset size ≈150k samples (four modalities per sample)
Technical requirements (repository, libraries, …) None
Licensing Creative Commons Public Licence
Documentation
Publications Paper due in April 2020
Further details Wiki-MLM is due for first release in April 2020