
Lexicalization is defined by WordNet as "the process of making a word to express a concept" [1]. In the context of this project, lexicalizations are surface forms that refer to a given DBpedia Resource. The DBpedia Lexicalizations Dataset stores the relationships between DBpedia Resources and the set of surface forms that we found referring to those resources in Wikipedia.

We use the graph of labels, redirects and disambiguations in DBpedia to extract a lexicon that associates multiple surface forms with a resource and connects multiple resources to an ambiguous label. Labels of DBpedia resources are created from Wikipedia page titles, which can be seen as community-approved surface forms. Redirects to URIs indicate synonyms or alternative surface forms, including common misspellings and acronyms; their labels also become surface forms. Disambiguations provide ambiguous surface forms that are "confusable" with all resources they link to; their labels become surface forms for all target resources listed on the disambiguation page. Note that we strip trailing parentheses from the labels when constructing surface forms. For example, the label "Copyright (band)" produces the surface form "Copyright". This means that labels of resources and of redirects can also introduce ambiguous surface forms, in addition to the labels that come from titles of disambiguation pages. The resulting collection of surface forms constitutes a controlled set of commonly used labels for the target resources.

Wikilinks are another source of textual references to DBpedia resources. We use them to estimate the likelihood that a surface form refers to a specific candidate resource. We treat each wikilink as evidence that its anchor text is a commonly used surface form for the DBpedia resource represented by the link target. By counting the number of times a surface form occurs with a DBpedia resource, we can estimate a score that indicates the strength of association between the surface form and the resource.
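The extraction described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the function names (`surface_form`, `build_lexicon`, `association_scores`), the input shapes, and the resource identifiers are hypothetical, and the score shown (the relative frequency of a resource among all wikilinks sharing the same anchor text) is only one plausible way to turn co-occurrence counts into an association score.

```python
import re
from collections import Counter, defaultdict

# Matches one trailing parenthesized qualifier, e.g. " (band)".
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def surface_form(label):
    """Strip a trailing parenthesized qualifier from a label."""
    return _TRAILING_PARENS.sub("", label).strip()


def build_lexicon(labels, redirects, disambiguations):
    """Map each surface form to the set of resources it may refer to.

    Hypothetical input shapes (assumptions, not the dataset's format):
      labels:          iterable of (resource, label)
      redirects:       iterable of (redirect_label, target_resource)
      disambiguations: iterable of (page_label, list_of_target_resources)
    """
    lexicon = defaultdict(set)
    for resource, label in labels:
        lexicon[surface_form(label)].add(resource)
    for label, target in redirects:
        lexicon[surface_form(label)].add(target)
    for label, targets in disambiguations:
        for target in targets:
            lexicon[surface_form(label)].add(target)
    return lexicon


def association_scores(wikilinks):
    """Score each (anchor text, resource) pair from a list of wikilinks.

    Assumed score: count(anchor, resource) / count(anchor).
    """
    pair_counts = Counter(wikilinks)
    form_counts = Counter(anchor for anchor, _ in wikilinks)
    return {
        (anchor, resource): count / form_counts[anchor]
        for (anchor, resource), count in pair_counts.items()
    }


# Toy data for illustration only.
lexicon = build_lexicon(
    labels=[("Copyright", "Copyright"), ("Copyright_(band)", "Copyright (band)")],
    redirects=[("Copyrights", "Copyright")],
    disambiguations=[],
)
scores = association_scores([
    ("Copyright", "Copyright"),
    ("Copyright", "Copyright"),
    ("Copyright", "Copyright"),
    ("Copyright", "Copyright_(band)"),
])
```

In this toy example the surface form "Copyright" becomes ambiguous between two resources once the parentheses are stripped, and the wikilink counts are what allow the candidates to be ranked.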

The DBpedia Lexicalizations Dataset is produced by the DBpedia Spotlight team. It is used to create the index behind the DBpedia Lookup service, as well as the candidate mappings behind DBpedia Spotlight.


The data can be downloaded at: http://spotlight.dbpedia.org/download/