Extracting more metadata (such as authors) #10

Open
wetneb opened this issue Feb 29, 2016 · 2 comments

Comments

@wetneb

wetneb commented Feb 29, 2016

Hi,

In some cases we might need to extract not just identifiers but also the rest of the metadata contained in {{cite}} templates. That task is less trivial: author lists, for instance, can be entered in many different ways. For this reason, I have wrapped the Lua code that parses citations on Wikipedia in a Python library, and the result is here:
https://github.com/dissemin/wikiciteparser

Any comments / contributions / anything welcome!

@wetneb
Author

wetneb commented Jun 6, 2016

@halfak: Just in case you are still interested in evaluating what proportion of citations do not have any identifier, I have run my citation parser on a fresh dump of the English Wikipedia.

The dump is on Zenodo.

Of course, this parser covers much more than just scholarly citations (it parses {{cite web}} for instance). It also misses a lot of citations that your method catches (all unformatted citations with an identifier matching your regular expressions). So the scope is quite different.

Here are a few quick stats:

  • the total number of citations extracted:
    $ wc -l enwiki_2016-06-01_CS1_citations.tsv
    12743634
  • the number of "cite journal" instances:
    $ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | wc -l
    955050
  • "cite journal" instances without any external identifier:
    $ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | grep -v "ID_list" | wc -l
    309305
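
For anyone who prefers a single pass over the file, the three counts above can be reproduced with a rough Python sketch like the one below. It mirrors the `grep` pipelines by doing plain substring matching on each line, so it assumes nothing about the TSV columns. The function names `citation_stats` and `share_without_id` and the sample lines are made up for illustration; they are not part of the dataset or of wikiciteparser.

```python
from typing import Dict, Iterable


def citation_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines the same way the grep pipelines do (substring matching)."""
    total = journal = journal_no_id = 0
    for line in lines:
        total += 1                          # like `wc -l`
        if "cite journal" in line:          # like `grep "cite journal"`
            journal += 1
            if "ID_list" not in line:       # like `grep -v "ID_list"`
                journal_no_id += 1
    return {"total": total, "journal": journal, "journal_no_id": journal_no_id}


def share_without_id(stats: Dict[str, int]) -> float:
    """Fraction of "cite journal" lines that carry no ID_list field."""
    if stats["journal"] == 0:
        return 0.0
    return stats["journal_no_id"] / stats["journal"]


# Hypothetical sample lines, only to show the call; the real rows in the
# dump may look different.
sample = [
    "Page A\tcite journal\tID_list: DOI ...",
    "Page B\tcite journal\tTitle only",
    "Page C\tcite web\tURL ...",
]
stats = citation_stats(sample)
print(stats, share_without_id(stats))
```

To run it on the real dump, pass an open file handle instead of `sample`, e.g. `citation_stats(open("enwiki_2016-06-01_CS1_citations.tsv", encoding="utf-8"))`. One small difference from `wc -l`: a final line without a trailing newline is still counted here. With the figures reported above, 309305 / 955050 comes to roughly 32.4% of "cite journal" instances having no external identifier.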

cc @nemobis who might also be interested in this dataset

@nemobis

nemobis commented Jun 6, 2016 via email
