Open Event Knowledge Graph (OEKG V1.0)
One of the key planned outputs of the CLEOPATRA project is its Open Event Knowledge Graph (OEKG). The core of the OEKG (V1.0) is built on EventKG V3.0, which was released on 31 March 2020. You can find more information at http://eventkg.l3s.uni-hannover.de/, and the officially released dataset on Zenodo (https://zenodo.org/record/3733829). The White Paper describing the first version of the OEKG, by Maria Maleshkova, Elena Demidova, Simon Gottschalk and Endri Kacupaj, with contributions from CLEOPATRA ESRs, can be downloaded here: Open Event Knowledge Graph V1.
EventKG V3.0/OEKG V1.0 now contains more than 1 million events in 15 languages (English, French, German, Italian, Russian, Portuguese, Spanish, Dutch, Polish, Norwegian (Bokmål), Romanian, Croatian, Slovene, Bulgarian and Danish).
The language coverage and the number of events in the knowledge graph were significantly extended based on the work done during a CLEOPATRA demonstrator session in February 2020, so thanks to all of our contributors!
Below, you can find detailed descriptions of all the newly developed CLEOPATRA datasets which are already available:
Partner organization | LUH |
Name of the dataset | EventKG |
Description of the dataset | EventKG is a multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as from less structured sources such as the Wikipedia Current Events Portal and Wikipedia event lists in six languages. EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications. |
Multilingual (which languages) | English, German, French, Italian, Portuguese, Russian |
URL | http://eventkg.l3s.uni-hannover.de/ |
Dataformat (RDF, JSON, XML, text) | RDF (.nq, .ttl) |
Dataset size | ~30 GB |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Creative Commons Attribution Share Alike 4.0 International |
Documentation | https://github.com/sgottsch/eventkg |
Further details | Publications:
Simon Gottschalk and Elena Demidova. EventKG – the Hub of Event Knowledge on the Web – and Biographical Timeline Generation. Semantic Web Journal. In press.
Simon Gottschalk and Elena Demidova. EventKG: A Multilingual Event-Centric Temporal Knowledge Graph. In Proceedings of the Extended Semantic Web Conference (ESWC 2018).
SPARQL endpoint: http://eventkginterface.l3s.uni-hannover.de/sparql (a minimal query sketch follows this table)
Example application: http://eventkg-timeline.l3s.uni-hannover.de/ |
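For readers who want to try the SPARQL endpoint above, here is a minimal sketch in Python using the SPARQLWrapper library. The sem:Event class and prefix follow EventKG's SEM-based schema; please verify them against the documentation linked above before relying on them.

```python
# Minimal sketch: listing a few events from the public EventKG SPARQL endpoint.
# Requires: pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://eventkginterface.l3s.uni-hannover.de/sparql")
endpoint.setQuery("""
    PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?event ?label WHERE {
        ?event a sem:Event ;
               rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["event"]["value"], "-", row["label"]["value"])
```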
Partner organization | LUH |
Name of the dataset | Event-QA |
Description of the dataset | Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs.
The Event-QA dataset contains 300 semantic queries over EventKG and the corresponding verbalisations. |
Multilingual (which languages) | English, German, Portuguese |
URL | http://eventcqa.l3s.uni-hannover.de/ |
Dataformat (RDF, JSON, XML, text) | JSON |
Dataset size | 300 queries |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Creative Commons Attribution Share Alike 4.0 International |
Documentation | |
Further details | Cite as: Tarcisio Souza Costa, Simon Gottschalk and Elena Demidova. “Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs”. http://eventcqa.l3s.uni-hannover.de/
The number of queries will be increased to ca. 1,000 in the next release (a minimal loading sketch follows this table). |
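A minimal loading sketch for the Event-QA JSON release follows; the file name and the field names ("question", "query") are illustrative assumptions, so please check the actual schema at http://eventcqa.l3s.uni-hannover.de/.

```python
# Minimal sketch for iterating over Event-QA. The file name "EventQA.json" and
# the field names "question"/"query" are assumptions for illustration only;
# verify against the released files before use.
import json

with open("EventQA.json", encoding="utf-8") as f:
    dataset = json.load(f)  # assuming a top-level JSON list of samples

for sample in dataset[:3]:
    print("Question:", sample["question"])  # natural-language verbalisation
    print("Query:   ", sample["query"])     # corresponding SPARQL over EventKG
```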
Partner organization | LUH |
Name of the dataset | The German Web corpus |
Description of the dataset | The German Web corpus covers all Web pages from the .de top-level domain captured by the Internet Archive from 1996 to 2013. The HTML portion (~30 TB) comprises 4.05 billion captures of 1 billion URLs; the overall corpus is ~80 TB and also includes English content.
From this corpus, a collection of German news sites was created based on a set of 400 domains of German news websites (see the WARC reading sketch after this table). |
Multilingual (which languages) | German (primarily), English |
URL | Available only at LUH, on site |
Dataformat (RDF, JSON, XML, text) | WARC, JSON |
Dataset size | ~80 TB
German news collection: 4.3 TB (32,794,626 captures) |
Technical requirements (repository, libraries, …) | Hadoop cluster, ElasticSearch |
Licensing | Research only |
Documentation | http://alexandria-project.eu/datasets/german-and-uk-web-archive/
German news collection: https://github.com/tarcisiosouza/elastic-client-api |
Further details | – |
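Since the corpus is stored as WARC files, here is a minimal reading sketch with the warcio library; the file path is a placeholder, as the corpus itself is only accessible on site at LUH.

```python
# Minimal sketch for reading captures from a WARC file with warcio
# (pip install warcio). "example.warc.gz" is a placeholder path.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            print(url, len(html), "bytes")
```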
Partner organization | UBO |
Name of the dataset | FactBench |
Description of the dataset | FactBench is a multilingual benchmark for the evaluation of fact validation algorithms. All facts in FactBench are scoped with a timespan in which they were true, enabling the validation of temporal relation extraction algorithms. FactBench currently supports English, German and French. The current release is available at the URL below. |
Multilingual (which languages) | English, German, French |
URL | https://github.com/DeFacto/FactBench/tree/master/core |
Dataformat (RDF, JSON, XML, text) | RDF models |
Dataset size | 1500 correct statements, 780 negative examples |
Technical requirements (repository, libraries, …) | SPARQL or MQL |
Licensing | The MIT License (MIT) |
Documentation | https://github.com/DeFacto/FactBench |
Further details | Used by DeFacto |
Partner organization | UBO |
Name of the dataset | VQuAnDa: Verbalization QUestionANswering DAtaset |
Description of the dataset | VQuAnDa is an answer verbalization dataset based on a commonly used large-scale Question Answering dataset, LC-QuAD. It contains 5,000 questions, the corresponding SPARQL queries, and the verbalized answers. The target knowledge base is DBpedia, specifically the April 2016 version (see the loading sketch after this table). |
Multilingual (which languages) | No (English) |
URL | https://figshare.com/projects/VQuAnDa/72488 |
Dataformat (RDF, JSON, XML, text) | JSON |
Dataset size | 5k samples (question, SPARQL query, answer verbalization) |
Technical requirements (repository, libraries, …) | SPARQL |
Licensing | Attribution 4.0 International (CC BY 4.0) |
Documentation | http://vquanda.sda.tech/ |
Further details |
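A minimal loading sketch for VQuAnDa follows; the file name and field names ("question", "query", "verbalized_answer") are assumptions based on the dataset description, so please confirm them against the files on figshare.

```python
# Minimal sketch for loading a VQuAnDa JSON file. The file name and the field
# names are assumptions for illustration; verify against the figshare release.
import json

with open("train.json", encoding="utf-8") as f:  # placeholder file name
    data = json.load(f)

sample = data[0]
print("Question:     ", sample["question"])
print("SPARQL query: ", sample["query"])
print("Verbalization:", sample["verbalized_answer"])
```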
Partner organization | FCT-FCCN |
Name of the dataset | Arquivo.pt web archive |
Description of the dataset | Arquivo.pt is a research infrastructure that preserves content written in several languages, broadly of interest to the Portuguese community and related to research and education in general.
Arquivo.pt has been developing special web collections about international events such as the European Elections, online news, Wikipedia and the centenary of World War I. Arquivo.pt also collected and preserved 50.4 million Web files related to R&D activities funded by the EU since 1994 (FP4 to FP7). All the outputs from this study were made publicly available, and we believe they constitute a unique and precious resource for research activities in all fields of knowledge. Arquivo.pt provides access to its collection of historical web data through a public web user interface and an API that enables the refinement of queries, e.g. by special collection (see the API sketch after this table). ESRs can also access Arquivo.pt Big Data Analytics, based on Hadoop, to perform investigations that require automatic processing of large-scale web collections. |
Multilingual (which languages) | Mostly Portuguese, English, French and Spanish. No language restrictions are applied, so in principle documents in any language may be found. |
URL | https://arquivo.pt https://arquivo.pt/api |
Dataformat (RDF, JSON, XML, text) | JSON, XML, HTML |
Dataset size | 6,062 million web files collected from 14 million websites, stored in 336 TB (compressed) |
Technical requirements (repository, libraries, …) | Knowledge about JSON and REST APIs |
Licensing | https://sobre.arquivo.pt/en/about/terms-and-conditions/ |
Documentation | https://github.com/arquivo |
Further details | https://sobre.arquivo.pt/en/ |
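As a starting point for the API mentioned above, here is a minimal sketch against the Arquivo.pt full-text search endpoint; the endpoint, parameter names and response fields are taken from the public API documentation, but treat them as assumptions and verify them at https://arquivo.pt/api.

```python
# Minimal sketch: querying the Arquivo.pt full-text search API with requests.
# Endpoint, parameters and response fields should be verified against
# https://arquivo.pt/api before use.
import requests

resp = requests.get(
    "https://arquivo.pt/textsearch",
    params={"q": "European Elections", "maxItems": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("response_items", []):
    print(item.get("tstamp"), item.get("title"), item.get("linkToArchive"))
```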
Partner organization | University of Southampton |
Name of the dataset | Global web news feed (RSS) |
Description of the dataset | Monthly collections of news articles, harvested from a seeded RSS list. Each month contains around 30 million posts. Deduplication of posts is required (see the sketch after this table). |
Multilingual (which languages) | English |
URL | https://webobservatory.soton.ac.uk/datasets/NKtKuwrMei8SFQG4H |
Dataformat (RDF, JSON, XML, text) | |
Dataset size | Various sizes |
Technical requirements (repository, libraries, …) | |
Licensing | |
Documentation | |
Further details |
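Since the description notes that deduplication is required, here is a minimal sketch of one way to do it; the input format and field names ("title", "link") are assumptions, as the collection's exact schema is not specified here.

```python
# Minimal deduplication sketch: posts are keyed by a hash of (title, link).
# The input format and field names are assumptions for illustration.
import hashlib
import json

def dedupe(posts):
    seen, unique = set(), []
    for post in posts:
        key = hashlib.sha1(
            (post.get("title", "") + post.get("link", "")).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

with open("posts.json", encoding="utf-8") as f:  # placeholder path
    posts = json.load(f)
print(len(dedupe(posts)), "unique posts out of", len(posts))
```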
Partner organization | University of Southampton |
Name of the dataset | Crisisnet qualitative data reports (USHAHIDI) |
Description of the dataset | A collection of 7,000+ qualitative reports collected from the Ushahidi + CrisisNet platform. These have been written and curated by first responders at major disaster events (e.g. the Haiti earthquake). Each record contains a timestamp, eventID, and message/text relating to a specific event. |
Multilingual (which languages) | English |
URL | https://webobservatory.soton.ac.uk/datasets/3cZxMoGEfmoMCTEA7 |
Dataformat (RDF, JSON, XML, text) | Text |
Dataset size | Various sizes |
Technical requirements (repository, libraries, …) | |
Licensing | |
Documentation | |
Further details |
Partner organization | TIB |
Name of the dataset | Im2GPS |
Description of the dataset | Im2GPS is a test set for geolocation estimation. The test set contains 237 geo-tagged photos, of which 5% depict specific touristic sites while the remainder are only recognizable in a generic sense. The test set was originally crawled from Flickr. |
Multilingual (which languages) | – |
URL | http://graphics.cs.cmu.edu/projects/im2gps/ |
Dataformat (RDF, JSON, XML, text) | JPG |
Dataset size | 237 images, 40.8 MB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons licenses |
Documentation | – |
Further details | GPS tags of the Im2GPS test set must be extracted from EXIF data (see the sketch below) |
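A minimal sketch of the EXIF step mentioned above, using Pillow to turn the EXIF GPS rationals into decimal coordinates (it assumes a recent Pillow version, where the values support float()); the same step applies to the Im2GPS3k test set described below.

```python
# Minimal sketch: extracting decimal GPS coordinates from a JPEG's EXIF data
# with Pillow (pip install Pillow). "photo.jpg" is a placeholder file name.
from PIL import Image
from PIL.ExifTags import GPSTAGS

def to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals to a signed decimal."""
    degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
    return -degrees if ref in ("S", "W") else degrees

def gps_from_exif(path):
    exif = Image.open(path)._getexif() or {}
    raw = exif.get(34853)  # 34853 is the EXIF tag ID of the GPSInfo block
    if not raw:
        return None
    gps = {GPSTAGS.get(tag, tag): value for tag, value in raw.items()}
    return (to_decimal(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_decimal(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

print(gps_from_exif("photo.jpg"))  # e.g. (48.8584, 2.2945)
```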
Partner organization | TIB |
Name of the dataset | Im2GPS3k |
Description of the dataset | Im2GPS3k is a test set for geolocation estimation. The test set contains 3,000 geo-tagged images that are different from the images in the Im2GPS benchmark. The dataset was originally collected from Flickr. |
Multilingual (which languages) | – |
URL | http://www.mediafire.com/file/7ht7sn78q27o9we/im2gps3ktest.zip |
Dataformat (RDF, JSON, XML, text) | JPG |
Dataset size | 3,000 images, 479.1 MB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons licenses |
Documentation | – |
Further details | GPS tags of the Im2GPS3k test set must be extracted from EXIF data (the sketch after the Im2GPS table applies here as well) |
Partner organization | TIB |
Name of the dataset | MP-16 dataset |
Description of the dataset | The MediaEval Placing Task 2016 (MP-16) dataset is a subset of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset and includes around five million geo-tagged images from Flickr without any restrictions. Besides photos of well-known places and landmarks, the dataset also contains ambiguous photos, e.g. of indoor environments and food. |
Multilingual (which languages) | English |
URL | http://multimedia-commons.s3-website-us-west-2.amazonaws.com/?prefix=subsets/YLI-GEO/mp16/metadata/ |
Dataformat (RDF, JSON, XML, text) | SQL |
Dataset size | 4.7 M training images |
Technical requirements (repository, libraries, …) | |
Licensing | Creative Commons licenses |
Documentation | |
Further details |
Partner organization | TIB |
Name of the dataset | Date Estimation in the Wild Dataset |
Description of the dataset | A collection of Flickr images for predicting when an image has been taken. The meta information provided was gathered via the Flickr API and covers the period from 1900 to 1999. |
Multilingual (which languages) | English |
URL | https://doi.org/10.22000/0001abcde |
Dataformat (RDF, JSON, XML, text) | JPG, CSV |
Dataset size | 1,029,710 images |
Technical requirements (repository, libraries, …) | Python |
Licensing | Meta data: CC BY 4.0 Attribution; image licenses: listed in meta.csv |
Documentation | This package contains: Meta information for 1,029,710 images (meta.csv) |
Further details | Publication: E. Müller, M. Springstein, R. Ewerth: “When was this picture taken?” – Image Date Estimation in the Wild. In: Proceedings of 39th European Conference on Information Retrieval (ECIR), Aberdeen, UK, 2017, 619-625. https://link.springer.com/chapter/10.1007/978-3-319-56608-5_57 |
Partner organization | TIB |
Name of the dataset | Semantic Image-Text-Classes |
Description of the dataset | This dataset comprises image-text pairs of eight different semantic image-text classes. Image-text pairs can be assigned to these classes by observing their purpose and classifying their interplay in conveying information. The dataset consists of 224,856 (automatically labeled) image-text pairs for training and 800 pairs with human-verified labels for testing. |
Multilingual (which languages) | English |
URL | https://doi.org/10.25835/0010577 |
Dataformat (RDF, JSON, XML, text) | PNG and JSON |
Dataset size | 225,656 image-text pairs, 45.3 GB |
Technical requirements (repository, libraries, …) | – |
Licensing | Creative Commons Attribution-NonCommercial 3.0 |
Documentation | – |
Further details | Otto, C., Springstein, M., Anand, A., Ewerth, R., “Understanding, Categorizing and Predicting Semantic Image-Text Relations”, ACM International Conference on Multimedia Retrieval (ICMR), Ottawa, Canada, 2019. |
Partner organization | FFZG, KCL, LUH |
Name of the dataset | UNER (Universal Named Entity Recognition) |
Description of the dataset | The dataset is composed of parallel corpora based on content published on the SETimes.com news portal (news and views from Southeast Europe), annotated with events as defined in the ACE 2005 corpus and with named entities following a new classification hierarchy composed of 3 levels:
1st level: 8 supertypes |
Multilingual (which languages) | Albanian, Bulgarian, Bosnian, Croatian, English, Greek, Macedonian, Romanian, Serbian and Turkish. |
URL | TBD |
Dataformat (RDF, JSON, XML, text) | XML
(BIO Index based) |
Dataset size | 200k sentences for each language. |
Technical requirements (repository, libraries, …) | TBD |
Licensing | CC-BY-SA |
Documentation | Under development. |
Further details | The dataset is being developed by pre-annotating the English corpus with automatic tools, followed by a correction step via crowdsourcing; the corrected annotations are then automatically propagated to the other languages.
SETimes dataset: http://nlp.ffzg.hr/resources/corpora/setimes/ |
Partner organization(s) | UvA |
Involved ESRs | ESR 13 (Anna Jørgensen) |
Name of the dataset | “2019-20 coronavirus outbreak” on Wikipedia |
Description of the dataset (2-3 sentences) | The dataset contains the full edit histories of the “2019-20 coronavirus outbreak” pages from 70 language versions of Wikipedia (see the MediaWiki API sketch after this table).
The dataset is highly multilingual, covering a wide variety of alphabets, language families and language sizes (from Chinese to Scots). It is also highly multimodal. Core data: content, images, links, table of contents, URLs; metadata: image captions, article categorization, reference types, URL countries, user IDs |
Multilingual (which languages) | af, ar, az, bcl, be, bg, bn, br, ca, cdo, cs, cv, da, de, el, en, eo, es, et, eu, fa, fi, fr, ga, hak, he, hi, ht, hu, hy, id, is, it, ja, ka, kk, ko, ku, lij, lmo, lt, lv, mr, ms, my, nl, nn, pl, pt, ro, ru, sah, sc, sco, sq, sr, sv, sw, ta, th, tl, tr, ug, uk, ur, vec, vi, wuu, zh |
URL | |
Dataformat (RDF, JSON, XML, text) | JSON
Text |
Dataset size | 4.37 GB |
Technical requirements (repository, libraries, …) | None |
Licensing | Creative Commons Public Licence |
Documentation | Here (will be migrated to data storage soon) |
Publications | Forthcoming |
Further details | “2019-20 coronavirus outbreak” on Wikipedia is due for release at the end of March 2020 |
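For illustration only (this is not necessarily the pipeline used to build the dataset): edit histories like these can be retrieved through the standard MediaWiki API, as in the sketch below for the English article. Note that the article title follows the dataset description and may since have been renamed on Wikipedia.

```python
# Illustrative sketch: fetching recent revisions of the English article through
# the MediaWiki API. Other language versions use their own wiki domains.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "2019–20 coronavirus outbreak",  # title as in the dataset description
        "rvprop": "timestamp|user|comment",
        "rvlimit": 5,
        "format": "json",
    },
    timeout=30,
)
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page.get("revisions", []):
    print(rev["timestamp"], rev["user"], rev.get("comment", ""))
```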
Partner organization(s) | LUH |
Involved ESRs | ESR2 (Sara Abdollahi) |
Name of the dataset | EventKG+Click |
Description of the dataset (2-3 sentences) | EventKG+Click is a novel cross-lingual dataset that reflects the language-specific relevance of events and their relations. This dataset aims to provide a reference source to train and evaluate novel models for event-centric cross-lingual user interaction, with a particular focus on models supported by knowledge graphs.
EventKG+Click consists of two subsets: 1. EventKG+Click_event, which contains relevance scores, location-closeness, recency and Wikipedia link count factors for more than 4 thousand events; and 2. EventKG+Click_relation, with nearly 10 thousand event-centric click-through pairs, their language-specific numbers of clicks, relation relevance, and relation co-mentions, i.e. the number of sentences across the Wikipedia language editions that mention both the source and the target. |
Multilingual (which languages) | English, German, Russian |
URL | https://github.com/saraabdollahi/EventKG-Click |
Dataformat (RDF, JSON, XML, text) | text |
Dataset size | 3 MB in total
4113 events in EventKG+Click_event |
Technical requirements (repository, libraries, …) | |
Licensing | CC BY-SA 4.0 |
Documentation | |
Publications | |
Further details |
Partner organizations | UBO, TIB, JSI |
Involved ESRs | ESR 5 (Jason Armitage), ESR 6 (Endri Kacupaj), ESR 8 (Golsa Tahmasebzadeh), ESR 12 (Swati) |
Name of the dataset | Wiki-MLM |
Description of the dataset (2-3 sentences) | Wiki-MLM is a processed data extraction from Wikipedia for multilingual and multimodal tasks. The primary aim is to train and evaluate systems designed to perform multiple tasks over diverse data. |
Multilingual (which languages) | English, French, German |
URL | |
Dataformat (RDF, JSON, XML, text) | Text, geo-coordinates, triples: JSON; images: PNG |
Dataset size | ≈150k samples (four modalities per sample) |
Technical requirements (repository, libraries, …) | None |
Licensing | Creative Commons Public Licence |
Documentation | |
Publications | Paper due in April 2020 |
Further details | Wiki-MLM is due for first release in April 2020 |