
WeScience

StephanOepen edited this page May 19, 2010 · 33 revisions

Background

The WeScience initiative is an ongoing effort to provide resources that enable eScience research and development in our own field, i.e. Computational Linguistics (or Natural Language Processing). Some of the motivating ideas and goals are sketched by [http://www.delph-in.net/wescience/tlt09.pdf Ytrestøl, Flickinger, & Oepen (2009)]. WeScience aims to (help) improve the accessibility of scholarly literature and digital libraries, with a special emphasis on community or open-access resources. Current development is focused on semantic parsing of encyclopedic articles (from the on-line community resource [http://en.wikipedia.org Wikipedia]), with the long-term goal of relating natural language semantics and taxonomic knowledge, for example in relation extraction or ontology learning applications. As a complementary element, we plan to include a selection of scientific articles (from the [http://aclweb.org/anthology-new/ ACL Anthology], for example), with candidate applications including function and attitude analysis for citations, attribution tracking, indexing by complex content properties (for example specific sub-fields, hypotheses, or methods used), association with encyclopedia entries (or ontology nodes), and so-called 'semantic search'.

WeScience, in its early stages of 2008, 2009, and 2010, is a semi-formal collaboration between the [http://www.ifi.uio.no/research/groups/lns/lt.html University of Oslo], the [http://lingo.stanford.edu/ Center for the Study of Language and Information], the [http://www.dfki.de/lt German Research Center for AI], and [http://www.coli.uni-saarland.de Saarland University], with partial funding from the University of Oslo, the [http://www.ub.uit.no/wiki/openaccess/index.php/NORA Norwegian Open Research Archives], and the [http://www.notur.no Norwegian Metacenter for Computational Science].

Current State of Development

WeScience, at least as of early 2010, comprises two components, the WeScience Corpus (discussed in more detail by Ytrestøl, et al. (2009)) and the WeScience Treebank. The corpus contains a selection of [http://en.wikipedia.org Wikipedia] articles in the domain of Natural Language Processing, pre-processed to strip irrelevant markup and segmented into sentence-like units. WeScience defines a simple, line-oriented textual exchange format for the corpus, aiming to strike a good balance between computer and human readability (there are also formal considerations that make the use of XML infeasible). Each sentence-like unit has a unique 8-digit identifier, with the first four digits referencing the underlying article. The corpus is broken into 16 sections, each with a maximum of 1000 segments, where no article is split across sections. Sections 14 through 16 are reserved for evaluation purposes.
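The identifier scheme above can be illustrated with a minimal sketch; the helper name and the example identifier below are hypothetical illustrations, not part of any official WeScience tooling.

```python
# Sketch: decode a WeScience segment identifier into its two parts.
# Per the description above, identifiers are 8 digits long, with the
# first four digits referencing the underlying article.

def decode_identifier(identifier: str) -> tuple[int, int]:
    """Split an 8-digit WeScience identifier into (article, segment)."""
    if len(identifier) != 8 or not identifier.isdigit():
        raise ValueError(f"not an 8-digit identifier: {identifier!r}")
    return int(identifier[:4]), int(identifier[4:])

# Hypothetical example: article 1001, segment 70.
article, segment = decode_identifier("10010070")
```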

The corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008, and more details of the corpus construction (selection and pre-processing) are available as a [http://www.delph-in.net/wescience/Ytrestol:09.pdf technical report] (Ytrestøl, 2009).

Development of the WeScience Treebank builds on the LinGO [http://www.delph-in.net/erg English Resource Grammar] (ERG) and [http://www.delph-in.net/redwoods Redwoods] discriminant-based treebanking approach. The [http://svn.delph-in.net/erg/trunk April 2010] release of the ERG includes a sub-set of the WeScience Corpus in treebanked form (see below).

Obtaining the Corpus and Treebank

As of early 2009, the WeScience Corpus has been released in three versions. Revisions 0.1 and 0.2 were purely internal releases and are now superseded by the present release, revision 0.3. This is publicly and freely available in a variety of formats. The recommended method of obtaining the WeScience Corpus is via the Subversion (SVN) revision management system. A command like:

  svn co http://svn.emmtee.net/trunk/uio/wescience wescience

will retrieve the latest development version (i.e. revision 0.3, as of early 2009) and create a new subdirectory wescience/. This directory will contain both the raw, un-processed [http://en.wikipedia.org Wikipedia] articles (in the raw/ sub-directory) and the actual WeScience Corpus, in the format described above (in the txt/ sub-directory). For those without a functional SVN client, this data is also available as a compressed tar(1) [http://www.delph-in.net/wescience/corpus.0.2.tgz archive].

The WeScience Corpus is also available as so-called itsdb skeletons, i.e. the result of importing the pre-processed text files into the itsdb database format. These skeletons have been part of the itsdb distribution through the LOGON tree (see the LogonTop page) since late 2008. The WeScience skeletons are called ws01 through ws16, and these same names are used in organizing the WeScience Treebank.
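The skeleton naming convention can be reproduced with a one-line sketch (an illustrative snippet only, not part of the itsdb or LOGON distributions):

```python
# Sketch: the sixteen WeScience skeleton names, ws01 through ws16,
# as used in the LOGON tree and in the WeScience Treebank.
skeletons = [f"ws{i:02d}" for i in range(1, 17)]

# Per the corpus description, the last three sections (ws14-ws16)
# are reserved for evaluation purposes.
evaluation = skeletons[-3:]
```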

Regarding availability of the first release of the WeScience Treebank, please watch this space (or the [http://www.delph-in.net DELPH-IN] [http://lists.delph-in.net mailing lists]). At present, treebanks for the first thirteen WeScience sections are provided in itsdb format as part of the ERG release, which can be obtained from SVN through the following command:

  svn co http://svn.delph-in.net/erg/trunk erg

To experiment with these treebanks, for the time being, parts of the larger DELPH-IN toolchain are required, and we recommend working with the trunk (i.e. the head revision) of the integrated [wiki:LogonTop LOGON distribution]. Assuming a functional, up-to-date LOGON installation, one can export the itsdb treebanks into various textual formats, for example using a command like the following:

  cd $LOGONROOT
  ./redwoods --binary --erg --target /tmp/wescience \
    --export derivation,tree,mrs ws01

As part of the imminent first public WikiWoods release, we will also provide a textual version of WeScience for direct download.

Outlook: Next Steps

Acknowledgements
