Latest Core Dataset Releases
About
The DBpedia latest core is the refurbished equivalent of the "core" folder used for previous releases before 2020 (e.g. http://downloads.dbpedia.org/2016-10/core/). It contains a small but useful subset of datasets from the DBpedia Extractions. Moreover this subset is loaded to the DBpedia Main SPARQL endpoint. With the help of Databus Latest-Core Collection it is quite easy to fetch a fresh custom-tailored selection of DBpedia files for a specific use case (e.g. a custom list of languages).
DBpedia Extraction Groups
The New DBpedia Release Cycle follows an Extraction as a Platform (EaaP) approach. In regular intervals (normally each month), the DBpedia extraction framework will be run automatically over Wikipedia (all languages) and Wikidata dumps to extract around 5000 RDF files packaged in 50 artifacts and 4 high-level groups: Generic (using generic parsers and language-specific RDF properties), Mappings (using editable ontology mappings from mappings.dbpedia.org), Text (abstract and article full-text extraction), Wikidata (mapped and cleaned Wikidata data using the DBpedia Ontology). A full description of the release cycle can be found in Hofer et al., The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows, SEMANTiCS 2020.
The Databus Latest-Core Collection
Databus Collections can be seen as a customizable dynamic shopping carts of data (files). The collection link for latest core files is https://databus.dbpedia.org/dbpedia/collections/latest-core . This collection updates automatically and always refers to the latest available files. A small part of data from DBpedia Extraction Groups (approx. 100 of 4000 files or 2.5%) is selected in the latest-core collection. If you would like to customize it, it is advised to create your own Databus collection: 1. register/login 2. go to the collection and click "Action" -> "Edit Copy"
Query the Data
OpenLink/Virtuoso provides a DBpedia Technology Preview (Feedback):
- https://dbpedia.demo.openlinksw.com/sparql/
- https://dbpedia.demo.openlinksw.com/resource/Leipzig
- https://dbpedia.demo.openlinksw.com/fct/
Download Data
How to retrieve the data manually
Go to https://databus.dbpedia.org/dbpedia/collections/latest-core and click on the individual download links.
How to quickly set up your own SPARQL endpoint
Run commands under quickstart using this Docker Container
How to retrieve data automatically
- Retrieve the data query
- Visit the collection page and click on Actions > Copy Query to Clipboard
- or run
curl https://databus.dbpedia.org/dbpedia/collections/latest-core -H "accept: text/sparql" > query.sparql
- Select one of the following options:
- Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
curl -X POST --data-urlencode query@query.sparql -d format=text/tab-separated-values https://databus.dbpedia.org/repo/sparql
The query will return a list of download links, which can be retrieved with wget - Give the query to the Databus Client. The Client provides additional options for compression and format conversion, so you don't need to do it manually.
- Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
Feedback and Contribution
Feedback and Debugging
Running the extraction each month via Extraction as a Platform means that that the DBpedia community and consumers can help to debug and extend the extraction via Github or the DBpedia Forum. A preliminary guide on How to Improve DBpedia is available. Any changes reaching the master branch will be available for the next monthly release.
Improving
Documentation and Statistics
- Overview statistics about entities, types, etc. are currently work in progress, in particular:
- A GSoC project implementing a knowledge graph dashboard
- Generation of VOID Class/Property Partition for the whole Databus
- Better documentation of the workflow and where the community can contribute
Current issues
- sameAs links to other DBpedia Chapters, i.e. de.dbpedia.org (in progress)
- rdfs:label/comment/dbo:abstract only in English, was en + 19 languages, could be up to 140 languages (in progress)
- ImageExtractor was malfunctioning and disabled, i.e. only images from infoboxes are extracted, no clean licenses. (Will be fixed with https://databus.dbpedia.org/dbpedia/wikidata/images/)
- sameAs links to external Linked Data sites are currently not updated, (in progress, we are centralizing this with Global ID management
- sdtypes from Mannheim need to be checked
- Umbel in store, but not in Databus collection, loaded from https://github.com/structureddynamics/UMBEL/blob/master/External%20Ontol...
- Yago types are missing (in progress)
What the future holds
- Fused data: We already created several tests for a fused dataset of dbo properties. This dataset enriches the English version with Wikidata and dbo properties from over 20 Wikipedia languages, resulting in a denser graph.
- Community extensions such as caligraph.org or https://ner.vse.cz/datasets/linkedhypernyms/ can now be streamlined and easier contributed with the Databus and routed to the main endpoint and chapter knowledge graphs.
- Links, Mappings, Ontologies: A special focus of DBpedia will be to take the role of a custodian for links, mappings, ontologies on the web of data and make these easier to contribute and more centrally available.