Latest Core Dataset Releases

Publication Year: 2020

About

The DBpedia latest core is the refurbished equivalent of the "core" folder used for releases before 2020 (e.g. http://downloads.dbpedia.org/2016-10/core/). It contains a small but useful subset of the datasets from the DBpedia Extractions. This subset is also loaded into the DBpedia Main SPARQL endpoint. With the Databus Latest-Core Collection, it is easy to fetch a fresh, custom-tailored selection of DBpedia files for a specific use case (e.g. a custom list of languages).

DBpedia Extraction Groups

The New DBpedia Release Cycle follows an Extraction as a Platform (EaaP) approach. At regular intervals (normally each month), the DBpedia extraction framework is run automatically over Wikipedia (all languages) and Wikidata dumps to extract around 5000 RDF files, packaged in 50 artifacts and 4 high-level groups: Generic (using generic parsers and language-specific RDF properties), Mappings (using editable ontology mappings from mappings.dbpedia.org), Text (abstract and article full-text extraction), and Wikidata (mapped and cleaned Wikidata data using the DBpedia Ontology). A full description of the release cycle can be found in Hofer et al., The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows, SEMANTiCS 2020.

The Databus Latest-Core Collection 

Databus Collections can be seen as customizable, dynamic shopping carts of data files. The collection link for the latest core files is https://databus.dbpedia.org/dbpedia/collections/latest-core . This collection updates automatically and always refers to the latest available files. Only a small part of the data from the DBpedia Extraction Groups (approx. 100 of 4000 files, or 2.5%) is selected in the latest-core collection. To customize it, create your own Databus collection: (1) register or log in, then (2) go to the collection and click "Action" -> "Edit Copy".

Query the Data

OpenLink/Virtuoso provides a DBpedia Technology Preview (Feedback).

Download Data

How to retrieve the data manually

Go to https://databus.dbpedia.org/dbpedia/collections/latest-core and click on the individual download links.

How to quickly set up your own SPARQL endpoint

Run commands under quickstart using this Docker Container

How to retrieve data automatically

  1. Retrieve the data query
    • Visit the collection page and click on Actions > Copy Query to Clipboard 
    • or run curl https://databus.dbpedia.org/dbpedia/collections/latest-core -H "accept: text/sparql" > query.sparql
  2. Select one of the following options:
    • Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
      curl -X POST --data-urlencode query@query.sparql -d format=text/tab-separated-values  https://databus.dbpedia.org/repo/sparql
      The query returns a list of download links, which can then be fetched with wget.
    • Give the query to the Databus Client. The Client provides additional options for compression and format conversion, so you don't need to do it manually.
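
The wget step above can be sketched as follows. This is a minimal illustration and not part of the official tooling: it assumes the endpoint's tab-separated output has a single column with a header row, and that each URL is wrapped either in double quotes or in angle brackets. The helper name clean_tsv and the example.org URLs are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: turn the TSV result of the collection query into a plain URL list.

# clean_tsv (hypothetical helper): drop the header row, remove carriage
# returns, strip surrounding quotes or angle brackets, keep only URLs.
clean_tsv() {
  tail -n +2 | tr -d '\r' | sed -e 's/^["<]//' -e 's/[">]$//' | grep -E '^https?://'
}

# Offline demonstration with made-up URLs (not real DBpedia files).
sample=$(printf '"file"\n"https://example.org/a.ttl.bz2"\n<https://example.org/b.ttl.bz2>\n')
urls=$(printf '%s\n' "$sample" | clean_tsv)
printf '%s\n' "$urls"
```

With real data, the output of the curl POST command shown in the list can be piped through clean_tsv into a file such as files.txt, and wget -i files.txt then downloads every listed file.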

Feedback and Contribution

Feedback and Debugging

Running the extraction each month via Extraction as a Platform means that the DBpedia community and consumers can help to debug and extend the extraction via GitHub or the DBpedia Forum. A preliminary guide on How to Improve DBpedia is available. Any changes reaching the master branch will be available in the next monthly release.

Improving

Documentation and Statistics

Current issues

  • sameAs links to other DBpedia Chapters, i.e. de.dbpedia.org (in progress)
  • rdfs:label, rdfs:comment and dbo:abstract are currently available only in English; previously English plus 19 languages, could be up to 140 languages (in progress)
  • ImageExtractor was malfunctioning and has been disabled, i.e. only images from infoboxes are extracted, without clean licenses (will be fixed with https://databus.dbpedia.org/dbpedia/wikidata/images/)
  • sameAs links to external Linked Data sites are currently not updated (in progress, we are centralizing this with Global ID management)
  • sdtypes from Mannheim need to be checked
  • Umbel in store, but not in Databus collection, loaded from https://github.com/structureddynamics/UMBEL/blob/master/External%20Ontol...
  • Yago types are missing (in progress)

What the future holds

  • Fused data: We already created several tests for a fused dataset of dbo properties. This dataset enriches the English version with Wikidata and dbo properties from over 20 Wikipedia languages, resulting in a denser graph.
  • Community extensions such as caligraph.org or https://ner.vse.cz/datasets/linkedhypernyms/ can now be streamlined, contributed more easily via the Databus, and routed to the main endpoint and chapter knowledge graphs.
  • Links, Mappings, Ontologies: A special focus of DBpedia will be to take the role of a custodian for links, mappings, and ontologies on the web of data, making them easier to contribute and more centrally available.