Use translation-server to export directly to CSL JSON #1

Open
dhimmel opened this issue Dec 15, 2019 · 5 comments

@dhimmel

dhimmel commented Dec 15, 2019

Really awesome package. I was working on a similar pandoc-filter using the manubot python package a while ago in manubot/manubot#99, but we never finished it.

I like your syntax for specifying citekey aliases that point to URLs. Looks like you also support some types of persistent identifiers directly in the citekey. Manubot currently supports several types of IDs for citation by persistent ID. Would be interested in coordinating to keep our syntaxes compatible.

Looks like the core of the functionality of pandoc-url2cite occurs around:

pandoc-url2cite/index.ts

Lines 62 to 76 in b28374a

async function getCslForUrl(url: string) {
    // uses zotero extractors from https://github.com/zotero/translators to get information from URLs
    // https://www.mediawiki.org/wiki/Citoid/API
    // It should be possible to run a citoid or [zotero translation-server](https://github.com/zotero/translation-server) locally,
    // but this works fine for now and is much simpler than trying to run that server in e.g. docker automatically.
    // A server is needed since Zotero extractors run within the JS context of the website.
    // It might be possible to fake the context and just run most extractors in Node, but that would be much more fragile and need a lot of testing.
    // It should also be possible to use something like puppeteer to fetch the website headlessly and then run the extractor.
    console.warn("fetching citation from url", url);
    const res = await fetch(
        `https://en.wikipedia.org/api/rest_v1/data/citation/bibtex/${encodeURIComponent(
            url
        )}`
    );

So first you use Wikipedia's Citoid to create BibTeX and then use pandoc-citeproc to convert it to CSL JSON. Note that you can also use the translation-server API to go from Zotero metadata directly to CSL JSON (python code). In theory, you could get higher quality metadata by avoiding the BibTeX passthrough.

Feel free to use our public translation-server instance we host for Manubot at https://translate.manubot.org/ as described at manubot/manubot#82. When we last checked, Citoid lagged behind translation-server... not sure if that is still the case.
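
As a rough sketch of the direct route: translation-server accepts a URL as a plain-text body on `/web` and returns Zotero items, and `/export?format=csljson` converts those items to CSL JSON. The names `FetchLike` and `getCslViaTranslationServer` are made up for illustration, the fetch function is passed in so the sketch does not depend on a particular runtime, and cases such as the multiple-choice (300) response from `/web` are simply treated as failures here.

```typescript
// Minimal structural type for the parts of fetch() used below, so the sketch
// compiles without DOM typings. The global fetch should satisfy it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: URL -> Zotero items (/web) -> CSL JSON (/export).
async function getCslViaTranslationServer(
  url: string,
  fetchImpl: FetchLike,
  server = "https://translate.manubot.org"
): Promise<unknown> {
  // Step 1: ask the server to run the matching Zotero translator on the URL.
  const web = await fetchImpl(`${server}/web`, {
    method: "POST",
    headers: { "Content-Type": "text/plain" },
    body: url,
  });
  if (!web.ok) throw new Error(`/web failed with status ${web.status}`);
  const zoteroItems = await web.json();

  // Step 2: convert the Zotero items straight to CSL JSON, skipping BibTeX.
  const exported = await fetchImpl(`${server}/export?format=csljson`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(zoteroItems),
  });
  if (!exported.ok) throw new Error(`/export failed with status ${exported.status}`);
  return exported.json();
}
```

Passing the server as a parameter would also make it easy to point the same code at a locally running translation-server instead of the public instance.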

@phiresky
Owner

Looks like you also support some types of persistent identifiers directly in the citekey

Yep, but only doi: and isbn:. I think it's usually better to use URLs anyways though.

I like your syntax for specifying citekey aliases that point to URLs.

Yep, it's supposed to be just the markdown link syntax (how pandoc parses it anyway if the citations extension is turned off). In fact, I should probably also support [@abc](https://example.com).

Would be interested in coordinating to keep our syntaxes compatible.

The syntax for pandoc-url2cite citekeys is basically defined by this regex:

return /^https?:\/\/|^doi:|^isbn:/.test(s);

Every cite key that matches this regex is passed to Citoid, everything else will be resolved as an alias to a different cite key (or fail if unmatched). Looks like manubot has the same syntax for most ids (prefixed with doi: / isbn:), but it also adds a url: prefix to URLs.
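
To make that rule concrete, here is a minimal sketch. The regex is the one quoted above; `resolveCitekey`, the `Map` of aliases, and the idea of following a chain of aliases with a cycle check are assumptions for illustration rather than the actual implementation.

```typescript
// Regex from the thread: keys matching it are handed to the citation resolver.
function isResolvable(key: string): boolean {
  return /^https?:\/\/|^doi:|^isbn:/.test(key);
}

// Hypothetical alias lookup: follow aliases until a resolvable key is reached,
// failing on unmatched keys and on alias cycles.
function resolveCitekey(key: string, aliases: Map<string, string>): string {
  const seen = new Set<string>();
  let current = key;
  while (!isResolvable(current)) {
    if (seen.has(current)) throw new Error(`alias cycle at "${current}"`);
    seen.add(current);
    const target = aliases.get(current);
    if (target === undefined) throw new Error(`unmatched citekey "${key}"`);
    current = target;
  }
  return current;
}
```

With an alias map containing `abc -> https://example.com`, `resolveCitekey("abc", aliases)` returns the URL, while a key like `doi:10.1000/xyz` is returned unchanged.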

I'd like to keep the "bare" url specifiers because for me that is the preferred syntax and I'll probably be purely using those.

Some thoughts

  1. IDs like DOIs have fewer special characters than URLs, so they don't conflict with other markdown syntax and can be directly used in the [@foo] citation. I in fact also support [@http://etc] but that breaks if the URL contains a &. Maybe I'll try to ask the pandoc author again to add support for [@{https://}] as described in url as citekey/referencekey jgm/pandoc-citeproc#308.
  2. IDs are shorter. Mainly useful for inline use without aliasing.
  3. I still prefer URLs because it's clear how to resolve them as a human. Each of those IDs should already have a "canonical" URL for resolving them, so imo I probably want to just use that URL. The effort of defining a citekey alias seems pretty low for me.
  4. Have you thought about declaring http: / https: the "citation source" for URLs and ://example.com the value? Then it would be compatible without the url: prefix. I mean that's kind of the point of the protocol anyways, right? I mean in theory you could try to do it "correctly" by reading all the standards of URNs / URIs and try to be compatible, but then you probably need to use urn:isbn:123 which is annoying.
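
Point 4 could look roughly like the sketch below: split every citekey at the first colon, so that `doi`, `isbn`, `http` and `https` are all just "sources". `ParsedCitekey`, `parseCitekey` and `toUrl` are hypothetical names; note that with this split the value of a URL key is `//example.com`, and the URL is rebuilt by putting the scheme back in front.

```typescript
interface ParsedCitekey {
  source: string; // e.g. "doi", "isbn", "http", "https"
  value: string; // everything after the first colon
}

// Split "source:value" at the first colon; keys without a prefix are
// treated as aliases and yield null.
function parseCitekey(key: string): ParsedCitekey | null {
  const colon = key.indexOf(":");
  if (colon <= 0) return null;
  return {
    source: key.slice(0, colon).toLowerCase(),
    value: key.slice(colon + 1),
  };
}

// When the source is itself a URL scheme, the original URL can be rebuilt.
function toUrl(parsed: ParsedCitekey): string | null {
  return parsed.source === "http" || parsed.source === "https"
    ? `${parsed.source}:${parsed.value}`
    : null;
}
```

Under this scheme a `url:` prefix is not needed, since `https://example.com` already parses as source `https` with value `//example.com`.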

Citoid to create bibtex

Yeah I was wondering why Citoid doesn't have a CSL export option since it's an obvious choice.

I've already had problems with the conversion:

  • I used citation-js first to convert the bibtex to CSL - but that had a bug and also lost important information like the abstract
  • Using pandoc-citeproc --bib2json has the problem that it outputs escaped markdown in title / abstract etc fields during the conversion (e.g. [a] becomes \[a\]) so I have to hackily unescape that.
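
For the second problem, the unescaping could be sketched as below. This assumes the escapes are always a backslash in front of an ASCII punctuation character (as in `\[a\]`); `unescapeMarkdown` is a hypothetical helper, not necessarily what the filter does.

```typescript
// Remove a backslash that precedes an ASCII punctuation character,
// e.g. "\[a\]" becomes "[a]". Backslashes before letters, digits or
// whitespace are left untouched.
function unescapeMarkdown(text: string): string {
  return text.replace(/\\([!-\/:-@\[-`{-~])/g, "$1");
}
```

This would be applied to string fields such as the title and abstract after the conversion. It stays a workaround: a field that legitimately contains a backslash followed by punctuation would lose the backslash.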

So using something that can output CSL directly would be a good idea. But I was already pretty anxious about using an external API as the resolver for multiple reasons (changes / downtime in the future, trust, etc). I really would like to use a local resolver but translation-server is far too much of a behemoth. I was really happy that I found that Wikipedia provides a server since I can trust that to be able to handle traffic, not go down soon and be somewhat stable. I'll consider using the manubot server (thanks!) if I encounter more problems in the future (like, as you said, outdated converters).

Thank you for the suggestion. I did not know about manubot, looks interesting. I definitely like what the output looks like, though the input might be too complicated / opinionated for me. Also tbh I'm kind of missing an overview of what manubot is, how it compares or relates to latex, pandoc and other markdown processors, what a manuscript is (is a paper a manuscript?), and if I can use it to write for existing journals (that need latex).

@phiresky
Owner

Actually I just found the manubot.org homepage which explains stuff better - my fault haha. I just went through the repos, maybe link from the rootstock repo to the homepage as well?

@dhimmel
Author

dhimmel commented Dec 15, 2019

Thanks for the link to jgm/pandoc-citeproc#308. The limited character set supported by pandoc citekeys has been a big barrier for us as well. I'll chime in on the issue.

In fact, I should probably also support [@abc](https://example.com)

And this would define the @abc citekey for use everywhere? That seems like a nice syntax.

I'd like to keep the "bare" url specifiers because for me that is the preferred syntax and I'll probably be purely using those.

Makes sense. I think you make a good point that perhaps Manubot could use the http / https prefix instead of url.

I still prefer URLs because it's clear how to resolve them as a human. Each of those IDs should already have a "canonical" URL for resolving them, so imo I probably want to just use that URL. The effort of defining a citekey alias seems pretty low for me.

Yeah, requiring URLs to give all viewers the ability to immediately resolve identifiers is a nice perk. The main downside is brevity, i.e. @pmid:28936969 versus @https://www.ncbi.nlm.nih.gov/pubmed/28936969.
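
One possible middle ground, sketched below, is to accept the short form and expand it to a canonical URL before resolving. The `pmid` URL shape is the one from the example above; the `doi` template, the table itself and the name `expandToUrl` are assumptions for illustration.

```typescript
// Hypothetical table of prefix -> canonical URL builders.
const urlTemplates = new Map<string, (id: string) => string>([
  ["doi", (id) => `https://doi.org/${id}`],
  ["pmid", (id) => `https://www.ncbi.nlm.nih.gov/pubmed/${id}`],
]);

// Expand "prefix:id" to a canonical URL when the prefix is known;
// anything else (including full URLs) is returned unchanged.
function expandToUrl(citekey: string): string {
  const match = /^([a-z]+):(.+)$/.exec(citekey);
  if (match === null) return citekey;
  const [, prefix = "", id = ""] = match;
  const template = urlTemplates.get(prefix);
  return template ? template(id) : citekey;
}
```

A key such as `pmid:28936969` then stays short in the text while still mapping to a URL that a reader can open.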

wikipedia provides a server since I can trust that to be able to handle traffic, not go down soon and be somewhat stable

Yes, the reliability of the Wikipedia infrastructure is a big plus. If you were to add support for translate.manubot.org, it probably would make sense as an option (perhaps off by default) or as something that, if it fails, falls back to the Wikipedia endpoint.
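
A fallback like that could be sketched as a list of resolvers tried in order. `Resolver` and `resolveWithFallback` are hypothetical names; each resolver is assumed to be an async function that either returns citation data for a URL or throws.

```typescript
type Resolver = (url: string) => Promise<unknown>;

// Try each resolver in order and return the first successful result.
// If every resolver fails, report all collected error messages.
async function resolveWithFallback(
  url: string,
  resolvers: Resolver[]
): Promise<unknown> {
  const errors: string[] = [];
  for (const resolve of resolvers) {
    try {
      return await resolve(url);
    } catch (err) {
      errors.push(err instanceof Error ? err.message : String(err));
    }
  }
  throw new Error(`all resolvers failed for ${url}: ${errors.join("; ")}`);
}
```

The list could hold a translate.manubot.org resolver first and the Wikipedia endpoint second, or only the Wikipedia one when the option is off.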

manubot, looks interesting. I definitely like what the output looks like, though the input might be too complicated / opinionated for me.

Yes, I see your filter as more general purpose. Manubot is really a toolset to continuously publish a manuscript whose source is tracked in a git repo, so GitHub can be used for collaborative writing.

@phiresky
Owner

And this would define the @abc citekey for use everywhere? That seems like a nice syntax.

Yeah. Imo it's kind of a mistake that pandoc decided to introduce a different and incompatible syntax for citekeys - But it looks to me like it can be unified back pretty well if it would add the [@x](url) and [@x]: url in the way it already works with -f markdown-citations, with slightly different semantics than links.

If you were to add support for translate.manubot.org, it probably would make sense as an option

Yep, good idea. Will probably only happen the next time I write an academic document (and have problems with it) so it might be a bit :D. Might also make sense to push citoid to support csl export. From what I understand it should only be like a one line whitelist change. I'm not even sure what citoid does apart from being a translation-server.

@dhimmel
Author

dhimmel commented Dec 16, 2019

Might also make sense to push citoid to support csl export

Definitely! I'm not sure who maintains the citoid infrastructure, but keep me up to date if you have any leads.

dhimmel added a commit to manubot/manubot that referenced this issue Jan 14, 2020
merges #99
closes #13
refs #120
refs phiresky/pandoc-url2cite#1

Create a pandoc filter, named `pandoc-manubot-cite`, for
citation-by-identifier. This filter can be used separately from
the rest of Manubot. However, if used following `manubot process`,
use the new `--skip-citations` option when processing.

The Pandoc filter relies on Pandoc's more advanced parsing of
Markdown documents, such that citations in code blocks will no
longer be interpreted by Manubot.

Adds support for defining citekey aliases using reference
link syntax.