Initial commit of topical stuff
dirkweissenborn committed Jul 25, 2012
0 parents commit 8c61232
Showing 474 changed files with 56,624 additions and 0 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
*.iml
.idea
.classpath
.project
.settings/
target
*.log
*~
index/output
116 changes: 116 additions & 0 deletions README.md
@@ -0,0 +1,116 @@
# DBpedia Spotlight
#### Shedding Light on the Web of Documents

DBpedia Spotlight looks for ~3.5M things of ~320 types in text and tries to link them to their global unique identifiers in [DBpedia](http://dbpedia.org).

#### Demonstration

Go to our [demonstration](http://spotlight.dbpedia.org/demo/) page, copy+paste some text and play with the parameters to see how it works.

#### Call our web service

You can use our demonstration [Web Service](http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service) directly from your application.

    curl http://spotlight.dbpedia.org/rest/annotate \
    --data-urlencode "text=President Obama called Wednesday on Congress to extend a tax break
    for students included in last year's economic stimulus package, arguing
    that the policy provides more generous assistance." \
    --data "confidence=0.2" \
    --data "support=20"

#### Run your own server

If you need service reliability and lower response times, you can run DBpedia Spotlight in your own [InHouse-Server](http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/InHouse_Server).

    wget http://spotlight.dbpedia.org/download/release-0.5/dbpedia-spotlight-quickstart.zip
    unzip dbpedia-spotlight-quickstart.zip
    cd dbpedia-spotlight-quickstart/
    ./run.sh

#### Build from source

We provide a [Java/Scala API](http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Java%2FScala%20API) so that you can use our code in your own application.

[![Build Status](https://secure.travis-ci.org/dbpedia-spotlight/dbpedia-spotlight.png?branch=master)](http://travis-ci.org/dbpedia-spotlight/dbpedia-spotlight)

## Introduction

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. "Michael Jordan"), and subsequently matches these names to unique identifiers (e.g. [dbpedia:Michael_I._Jordan](http://dbpedia.org/page/Michael_I._Jordan), the machine learning professor, or [dbpedia:Michael_Jordan](http://dbpedia.org/page/Michael_Jordan), the basketball player). It can also be used as a building block for [Named Entity Recognition](http://en.wikipedia.org/wiki/Named_entity_recognition), Keyphrase Extraction, Tagging and other information extraction tasks.

Text annotation has the potential of enhancing a wide range of applications, including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.

Take a look at our [Known Uses](http://dbpedia.org/spotlight/knownuses) page for other examples of how DBpedia Spotlight can be used. If you use DBpedia Spotlight in your project, please add a link to http://spotlight.dbpedia.org. If you use it in a paper, please use the citation available at the end of this page.

You can try out DBpedia Spotlight through our Web Application or Web Service endpoints. The Web Application is a user interface that allows you to enter text in a form and generates an HTML annotated version of the text with links to DBpedia. The Web Service endpoints provide programmatic access to the demo, allowing you to retrieve data also in XML or JSON.
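
As a minimal sketch of retrieving JSON instead of the default output, the helper below wraps the `curl` call from the "Call our web service" section. The function name `annotate_json`, the `ENDPOINT` and `DRY_RUN` variables are hypothetical, and the sketch assumes the endpoint selects JSON based on an `Accept: application/json` header.

```shell
#!/bin/bash
# Hypothetical helper: annotate text and request JSON output.
# ENDPOINT defaults to the demo URL from this README. That the service
# honors "Accept: application/json" is an assumption of this sketch.
ENDPOINT="${ENDPOINT:-http://spotlight.dbpedia.org/rest/annotate}"

annotate_json() {
  local text="$1" confidence="${2:-0.2}" support="${3:-20}"
  # When DRY_RUN is set to a non-empty value, the command is printed
  # instead of executed, so it can be inspected without network access.
  ${DRY_RUN:+echo} curl -s "$ENDPOINT" \
    -H "Accept: application/json" \
    --data-urlencode "text=$text" \
    --data "confidence=$confidence" \
    --data "support=$support"
}
```

Running `DRY_RUN=1 annotate_json "President Obama called Wednesday on Congress" 0.2 20` prints the command that would be sent; dropping `DRY_RUN` performs the request.
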
## Documentation

We split the documentation according to the depth at which we give explanations. Please feel free to take a look at our:
* [User's Manual](http://dbpedia.org/spotlight/usersmanual), if you are not interested in details of how things happen, but you would like to use the system in your website or software project.
* [Technical Documentation](http://dbpedia.org/spotlight/technicaldocumentation), if you want to have an overview of technical details before you go into the source code.
* [Source code](http://sourceforge.net/projects/dbp-spotlight/), if you really want to know every detail. Our source code is open, free and loves to meet new people. ;)


## Downloads

DBpedia Spotlight looks for ~3.5M things of ~320 types in text and tries to disambiguate them to their global unique identifiers in DBpedia. It uses the entire Wikipedia to learn how to annotate DBpedia resources. The full dataset cannot be distributed alongside the code, but it can be downloaded in various sizes from the [download page](http://dbpedia.org/spotlight/downloads). A tiny dataset is included in the distribution for demonstration purposes only.
After you've downloaded the files, update the configuration in server.properties so that it contains the correct paths to them. More info [here](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation).

## Licenses

The program can be used under the terms of the [Apache License, 2.0](http://www.apache.org/licenses/LICENSE-2.0.html).
Part of the code uses [LingPipe](http://alias-i.com/lingpipe/) under the [Royalty Free License](http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt). Therefore, this license also applies to the output of the currently deployed web service.

The documentation on this website is shared under the [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

## Citation

If you use this work in your research, please cite:

Pablo N. Mendes, Max Jakob, Andrés García-Silva and Christian Bizer. [DBpedia Spotlight: Shedding Light on the Web of Documents](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-GarciaSilva-Bizer-DBpediaSpotlight-ISEM2011.pdf). *Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)*. Graz, Austria, 7–9 September 2011.

```bibtex
@inproceedings{isem2011mendesetal,
title = {DBpedia Spotlight: Shedding Light on the Web of Documents},
author = {Pablo N. Mendes and Max Jakob and Andr\'{e}s Garc\'{i}a-Silva and Christian Bizer},
year = {2011},
booktitle = {Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)},
abstract = {Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.}
}
```

The corpus used to evaluate DBpedia Spotlight in this work is described [here](http://wiki.dbpedia.org/spotlight/evaluation).

## Support and Feedback
The best way to get help with DBpedia Spotlight is to send a message to our [mailing list](https://lists.sourceforge.net/mailman/listinfo/dbp-spotlight-users) at *[email protected]*.

You can also join the #dbpedia-spotlight IRC channel on Freenode. We also listen to [Tweets](http://search.twitter.com/search.atom?q=+dbpedia+spotlight).

We'd love it if you gave us some feedback.



## Team

The DBpedia Spotlight team includes the people listed below. Individual contributions are acknowledged in the source code and publications.

#### Maintainers
[Pablo Mendes](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/team/MendesPablo.html) (Freie Universität Berlin), Jun 2010-present.

[Max Jakob](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/team/JakobMax.html) (Freie Universität Berlin), Jun 2010-Sep 2011, Apr 2012-present.

[Jo Daiber](http://jodaiber.de/) (Charles University in Prague), Mar 2011-present.

Prof. Dr. [Chris Bizer](http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/team/BizerChristian.html) (Freie Universität Berlin), supervisor, Jun 2010-present.

#### Collaborators
[Andrés García-Silva](http://grafias.dia.fi.upm.es/Sem4Tags/about.html) (Universidad Politécnica de Madrid), Jul-Dec 2010.

[Rohana Rajapakse](http://www.linkedin.com/pub/rohana-rajapakse/3/9a1/8) (Goss Interactive Ltd.), Oct 2011.


## Acknowledgements

This work has been funded by:
* [Neofonie GmbH](http://www.neofonie.de/), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications. (Jun 2010-Jun 2011)
* The European Commission through the project [LOD2 - Creating Knowledge out of Linked Data](http://lod2.eu/). (Jun 2010-present)
3 changes: 3 additions & 0 deletions bin/getSurfaceFormMapFromOccs.sh
@@ -0,0 +1,3 @@
cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 | perl -F/\\t/ -lane 'print "$F[1]\t$F[0]";' > output/surfaceForms-fromOccs.tsv
sort output/surfaceForms-fromOccs.tsv | uniq -c > output/surfaceForms-fromOccs.count
grep -Pv "^\s*[123] " output/surfaceForms-fromOccs.count | sed -r "s|^\s*[0-9]+\s(.+)|\1|" > output/surfaceForms-fromOccs-thresh3.tsv
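
The thresholding idea behind this script can be sketched on a tiny made-up sample. The sketch assumes that columns 2 and 3 of the occurrences file hold the URI and the surface form, and it swaps the `sort | uniq -c | grep` chain for a single `awk` pass. The function name `surface_forms_above_threshold` and the sample rows are hypothetical.

```shell
#!/bin/bash
# Sketch: keep (surface form, URI) pairs that occur more than min_count times.
# Assumption: column 2 is the URI and column 3 is the surface form.
surface_forms_above_threshold() {
  local occs_file="$1" min_count="${2:-3}"
  awk -F'\t' -v min="$min_count" '
    { count[$3 "\t" $2]++ }   # key is surfaceForm<TAB>URI
    END { for (k in count) if (count[k] > min) print k }
  ' "$occs_file" | sort
}

# Made-up sample rows: id, URI, surface form, context.
demo=$(mktemp)
for i in 1 2 3 4; do
  printf 'o%s\tBerlin\tBerlin\tsome context\n' "$i"
done > "$demo"
printf 'o5\tBerlin\tcapital of Germany\tsome context\n' >> "$demo"

# Only the pair seen four times survives a threshold of 3.
surface_forms_above_threshold "$demo" 3
rm -f "$demo"
```

Counting in `awk` avoids matching the count against text inside the surface form itself, which is the risk with an unanchored pattern over `uniq -c` output.
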
55 changes: 55 additions & 0 deletions bin/index.sh
@@ -0,0 +1,55 @@
# You are expected to run the commands in this script from inside the bin directory in your DBpedia Spotlight installation
# Adjust the paths here if you don't. This script is meant more as a step-by-step guide than a real automated run-all.
# If this is your first time running the script, we advise you to copy/paste commands from here, closely watching the messages
# and the final output.
#
# @author maxjakob, pablomendes

here=`pwd`

INDEX_CONFIG_FILE=../conf/indexing.properties

# the indexing process merges occurrences in memory to speed up the process. the more memory the better
export JAVA_OPTS="-Xmx14G"
export MAVEN_OPTS="-Xmx14G"
export SCALA_OPTS="-Xmx14G"

# you have to run maven2 from the module that contains the indexing classes
cd ../index
# the indexing process will generate files in the directory below
mkdir -p output

# first step is to extract valid URIs, synonyms and surface forms from DBpedia
mvn scala:run -DmainClass=org.dbpedia.spotlight.util.ExtractCandidateMap "-DaddArgs=$INDEX_CONFIG_FILE"

# now we collect parts of Wikipedia dump where DBpedia resources occur and output those occurrences as Tab-Separated-Values
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia "-DaddArgs=$INDEX_CONFIG_FILE|output/occs.tsv"

# (recommended) sorting the occurrences by URI will speed up context merging during indexing
sort -t $'\t' -k2 output/occs.tsv >output/occs.uriSorted.tsv

# create a lucene index out of the occurrences
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.IndexMergedOccurrences "-DaddArgs=$INDEX_CONFIG_FILE|output/occs.uriSorted.tsv"

# (optional) make a backup copy of the index before you lose all the time you've put into this
cp -R output/index output/index-backup

# (optional) preprocess surface forms however you want: produce acronyms, abbreviations, alternative spellings, etc.
# in the example below we scan paragraphs for uri->sf mappings that occurred together more than 3 times.
../bin/getSurfaceFormMapFromOccs.sh
cp output/surfaceForms.tsv output/surfaceForms-fromTitRedDis.tsv
cat output/surfaceForms-fromTitRedDis.tsv output/surfaceForms-fromOccs.tsv > output/surfaceForms.tsv

# add surface forms to index
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.AddSurfaceFormsToIndex "-DaddArgs=$INDEX_CONFIG_FILE"
# or
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.CandidateIndexer "-DaddArgs=output/surfaceForms.tsv|output/candidateIndex|3|case-insensitive|overwrite"

# add entity types to index
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.AddTypesToIndex "-DaddArgs=$INDEX_CONFIG_FILE"

# (optional) reduce index size by unstoring fields (attention: you won't be able to see contents of fields anymore)
mvn scala:run -DmainClass=org.dbpedia.spotlight.lucene.index.CompressIndex "-DaddArgs=$INDEX_CONFIG_FILE|10"

# train a linker (the simplest one is based on similarity thresholds)
# mvn scala:run -DmainClass=org.dbpedia.spotlight.evaluation.EvaluateDisambiguationOnly
13 changes: 13 additions & 0 deletions bin/package.sh
@@ -0,0 +1,13 @@
#!/bin/bash
#rm tmp/ -rf
deb=dbpedia-spotlight-0.4.9.deb
cd ../rest
mvn package
mkdir tmp
cd tmp
ar -x ../$deb
cat debian-binary control.tar.gz data.tar.gz > combined-contents
gpg -abs -o _gpgorigin combined-contents
ar rc $deb \
_gpgorigin debian-binary control.tar.gz data.tar.gz
cp $deb ../
1 change: 1 addition & 0 deletions bin/run.sh
@@ -0,0 +1 @@
mvn scala:run -DaddArgs=conf/test.properties
5 changes: 5 additions & 0 deletions bin/stopwords.sh
@@ -0,0 +1,5 @@
# Runs ExtractStopwords.java to get a list of top terms from the index, then cleans it up a bit

#java -jar dbpedia-spotlight.jar ExtractStopwords index CONTEXT 2000 > top-df-terms.set
cut -f 1 -d " " top-df-terms.set | sed s/CONTEXT:// | egrep -v "[0-9]+" | sort -u > stopwords.set
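
The cleanup step can be tried on its own with a few made-up lines. The sketch below assumes that each line of `top-df-terms.set` looks like `CONTEXT:term count`, which is inferred from the `cut` and `sed` calls and not confirmed by the source. The function name `clean_stopwords` is hypothetical.

```shell
#!/bin/bash
# Sketch of the cleanup pipeline, reading from stdin.
# Assumption: input lines have the form "CONTEXT:term count".
clean_stopwords() {
  cut -f 1 -d " " |        # keep the "CONTEXT:term" part
    sed 's/CONTEXT://' |   # drop the field prefix
    grep -Ev '[0-9]+' |    # discard terms that contain digits
    sort -u                # deduplicate
}

# Made-up sample: the numeric term and the duplicate are removed.
printf 'CONTEXT:the 100\nCONTEXT:of 90\nCONTEXT:2011 80\nCONTEXT:the 70\n' | clean_stopwords
```
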

67 changes: 67 additions & 0 deletions conf/indexing.properties
@@ -0,0 +1,67 @@
# Wikipedia Dump
# --------------
org.dbpedia.spotlight.data.wikipediaDump = /media/Data/Wikipedia/enwiki-20110722-pages-articles.xml

# Location for DBpedia resources index (output)
org.dbpedia.spotlight.index.dir = /media/Data/Wikipedia/index

# DBpedia Datasets
# ----------------
org.dbpedia.spotlight.data.labels = /media/Data/Wikipedia/labels_en.nt
org.dbpedia.spotlight.data.redirects = /media/Data/Wikipedia/redirects_en.nt
org.dbpedia.spotlight.data.disambiguations = /media/Data/Wikipedia/disambiguations_en.nt
org.dbpedia.spotlight.data.instanceTypes = /media/Data/Wikipedia/instance_types_en.nt
org.dbpedia.spotlight.data.sortedArticlesCategories = /media/Data/Wikipedia/sorted.article_categories_en.nt
org.dbpedia.spotlight.data.categories=/media/Data/Wikipedia/skos_categories_en.nt
org.dbpedia.spotlight.data.concepts=/media/Data/Wikipedia/topical_concepts.nt


# Files created from DBpedia Datasets
# -----------------------
org.dbpedia.spotlight.data.conceptURIs = output/conceptURIs.list
org.dbpedia.spotlight.data.redirectsTC = output/redirects_tc.tsv
org.dbpedia.spotlight.data.surfaceForms = output/surfaceForms.tsv

# Language-specific config
# --------------
org.dbpedia.spotlight.language = English
org.dbpedia.spotlight.lucene.analyzer = SnowballAnalyzer

# Stop word list
org.dbpedia.spotlight.data.stopWords.english = /media/Data/Wikipedia/stopwords.en.list
#org.dbpedia.spotlight.data.stopWords.portuguese = /data/spotlight/3.6/pt/stopwords.pt.list
#org.dbpedia.spotlight.data.stopWords.spanish = /data/spotlight/3.6/es/stopwords.es.list

# URI patterns that should not be indexed. e.g. List_of_*
org.dbpedia.spotlight.data.badURIs.english = /media/Data/Wikipedia/blacklistedURIPatterns.en.list

# Will discard surface forms that are too long (reduces complexity of spotting and generally size in disk/memory)
org.dbpedia.spotlight.data.maxSurfaceFormLength = 50
# Will index only words closest to resource occurrence
org.dbpedia.spotlight.data.maxContextWindowSize = 200
org.dbpedia.spotlight.data.minContextWindowSize = 0

# Other files
org.dbpedia.spotlight.data.priors = /home/pablo/eval/grounder/gold/g1b_spotlight.words.uris.counts

# Yahoo! Boss properties
# ----------------------
# application ID
org.dbpedia.spotlight.yahoo.appID =
# number of results returned for one query (maximum: 50)
org.dbpedia.spotlight.yahoo.maxResults = 50
# number of iterations; each iteration returns YahooBossResults results
org.dbpedia.spotlight.yahoo.maxIterations = 100
## important for Yahoo! Boss query string: both language and region must be set according to
## http://developer.yahoo.com/search/boss/boss_guide/supp_regions_lang.html
org.dbpedia.spotlight.yahoo.language = en
org.dbpedia.spotlight.yahoo.region = us

# Topic configurations
# -----------------------
org.dbpedia.spotlight.topic.dictionary=/media/Data/Wikipedia/Dictionary/model.word_id.dict
org.dbpedia.spotlight.topic.categories.dictionary=/media/Data/Wikipedia/Dictionary/cluster.topic.dict
org.dbpedia.spotlight.topic.flattenedHierarchy=/media/Data/Wikipedia/FlattenedHierarchyByTopics
org.dbpedia.spotlight.topic.info=/home/dirk/workspace/dbpedia-spotlight/index/src/main/resources/topic_descriptions.xml

org.dbpedia.spotlight.topic.dictionary.maxsize=128000
44 changes: 44 additions & 0 deletions conf/indexing.properties.default
@@ -0,0 +1,44 @@
# Maximum heap space for indexing
# -------------------------------
org.dbpedia.spotlight.index.heapspace = 2g


# Wikipedia Dump
# --------------
org.dbpedia.spotlight.data.wikipediaDump = data/enwiki-20100312-pages-articles.xml
org.dbpedia.spotlight.index.occurrences =
org.dbpedia.spotlight.index.dir =


# DBpedia Datasets
# ----------------
org.dbpedia.spotlight.data.labels = data/dbpedia/labels_en.nt
org.dbpedia.spotlight.data.redirects = data/dbpedia/redirects_en.nt
org.dbpedia.spotlight.data.disambiguations = data/dbpedia/disambiguations_en.nt
org.dbpedia.spotlight.data.instanceTypes = data/dbpedia/instance_types_en.nt


# Important created files
# -----------------------
org.dbpedia.spotlight.data.conceptURIs = data/conceptURIs.list
org.dbpedia.spotlight.data.redirectsTC = data/redirects_tc.tsv
org.dbpedia.spotlight.data.surfaceForms = data/surface_forms-Wikipedia-TitRedDis.tsv


# Stop word list
# --------------
org.dbpedia.spotlight.data.stopWords = data/stopword.list


# Yahoo! Boss properties
# ----------------------
# application ID
org.dbpedia.spotlight.yahoo.appID = please-specify-if-you-want-to-experiment-with-WebOccurrences!
# number of results returned for one query (maximum: 50)
org.dbpedia.spotlight.yahoo.maxResults = 50
# number of iterations; each iteration returns YahooBossResults results
org.dbpedia.spotlight.yahoo.maxIterations = 100
## important for Yahoo! Boss query string: both language and region must be set according to
## http://developer.yahoo.com/search/boss/boss_guide/supp_regions_lang.html
org.dbpedia.spotlight.yahoo.language = en
org.dbpedia.spotlight.yahoo.region = us