
WeScience

StephanOepen edited this page Jan 23, 2009 · 33 revisions

Background

The WeScience initiative is an on-going effort to provide resources that may facilitate eScience-like research in our own field, i.e. Computational Linguistics. Some of the motivating ideas and goals are sketched by [http://www.delph-in.net/wescience/tlt09.pdf Ytrestøl, Oepen, & Flickinger (2009)]. WeScience, in its early stages of 2008 and 2009, is a semi-formal collaboration between the [http://www.ifi.uio.no/research/groups/lns/lt.html University of Oslo], the [http://lingo.stanford.edu/ Center for the Study of Language and Information], and [http://www.coli.uni-saarland.de Saarland University], with partial funding from the University of Oslo.

Current State of Development

WeScience, at least as of early 2009, comprises two components: the WeScience Corpus (discussed in more detail by Ytrestøl et al. (2009)) and the WeScience Treebank. The corpus comprises a selection of [http://en.wikipedia.org Wikipedia] articles in the domain of Natural Language Processing, pre-processed to strip irrelevant markup and segmented into sentence-like units. WeScience defines a simple, line-oriented textual exchange format for the corpus, aiming to strike a good balance between computer and human readability (there are formal considerations too that make the use of XML infeasible). Each sentence-like unit has a unique 8-digit identifier, with the first three digits (ignoring one leading digit, which is always 1) referencing the underlying article. The corpus is broken into 16 sections, each containing at most 1000 segments, and no article is split across sections. Sections 14 through 16 are reserved for evaluation purposes.
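
As a rough illustration of the identifier scheme, the following Python sketch splits an identifier into its article reference and the remaining digits. The function name parse_segment_id and the sample identifier are hypothetical, and since the meaning of the last four digits is not spelled out on this page, the sketch returns them uninterpreted.

```python
def parse_segment_id(identifier: str) -> tuple[str, str]:
    """Split an 8-digit WeScience identifier into (article, rest).

    Assumed layout, following the description above:
      digit 1     always '1'
      digits 2-4  reference to the underlying article
      digits 5-8  not described on this page; returned as-is
    """
    if len(identifier) != 8 or not (identifier.isascii() and identifier.isdigit()):
        raise ValueError(f"expected 8 digits, got {identifier!r}")
    if identifier[0] != "1":
        raise ValueError(f"leading digit must be 1, got {identifier!r}")
    return identifier[1:4], identifier[4:]


if __name__ == "__main__":
    # '10010020' is a made-up identifier, used only for illustration.
    article, rest = parse_segment_id("10010020")
    print(article, rest)
```

Keeping the parts as strings preserves leading zeros, which matters if the article reference is later used to match file or section names.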

Development of the WeScience Treebank builds on the LinGO [http://www.delph-in.net/erg English Resource Grammar] (ERG) and the [http://www.delph-in.net/redwoods Redwoods] discriminant-based treebanking approach. The [http://svn.delph-in.net/erg/trunk forthcoming release] of the ERG will include a sub-set of the WeScience Corpus in treebanked form; this release is planned for general availability by the end of January 2009.

Retrieving the Corpus and Treebank

As of early 2009, the WeScience Corpus has been released in three versions. Revisions 0.1 and 0.2 were purely internal releases and are now superseded by the present release, revision 0.3, which is publicly and freely available in a variety of formats. The recommended method of obtaining the WeScience Corpus is through the Subversion (SVN) revision management system. A command like:

  svn co http://svn.emmtee.net/trunk/uio/wescience wescience

will retrieve the latest development version (i.e. revision 0.3, as of early 2009) and create a new subdirectory wescience/. This directory will contain both the raw, un-processed [http://en.wikipedia.org Wikipedia] articles (in the raw/ sub-directory) and the actual WeScience Corpus, in the format described above (in the txt/ sub-directory). For those without a functional SVN client (Windows users, maybe), this data is also available as a compressed Un*x tar(1) [http://www.delph-in.net/wescience/corpus.0.2.tgz archive].

The WeScience Corpus is available as so-called itsdb skeletons too, i.e. the result of importing the text files (the pre-processed ones, obviously) into the itsdb database. These skeletons have been part of the itsdb distribution through the LOGON tree (see the LogonTop page) since late 2008. The WeScience skeletons are called ws01 through ws16, and these same names are used in organizing the WeScience Treebank.
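
The naming convention can be sketched in Python as follows. The helper names are hypothetical, and the split into development and evaluation skeletons simply assumes that the skeleton numbers correspond to the 16 corpus sections, of which 14 through 16 are reserved for evaluation.

```python
# Assumption: skeleton wsNN corresponds to corpus section NN (1 through 16).
SECTIONS = range(1, 17)
EVALUATION_SECTIONS = range(14, 17)  # sections reserved for evaluation


def skeleton_name(section: int) -> str:
    """Return the skeleton name for a section number, e.g. 1 -> 'ws01'."""
    if section not in SECTIONS:
        raise ValueError(f"section must be between 1 and 16, got {section}")
    return f"ws{section:02d}"


def development_skeletons() -> list[str]:
    """Names of all skeletons that are not reserved for evaluation."""
    return [skeleton_name(s) for s in SECTIONS if s not in EVALUATION_SECTIONS]


def evaluation_skeletons() -> list[str]:
    """Names of the skeletons reserved for evaluation."""
    return [skeleton_name(s) for s in EVALUATION_SECTIONS]


if __name__ == "__main__":
    print(development_skeletons())
    print(evaluation_skeletons())
```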

Regarding availability of the first release of the WeScience Treebank, please watch this space (or the [http://www.delph-in.net DELPH-IN] [http://lists.delph-in.net mailing lists]) for an imminent announcement.
