fix for filenames with spaces #738

Open: wants to merge 2 commits into base: dev

@a1ix2 commented Mar 5, 2025

transformtei complains when filenames contain spaces:

~$ src/Stylesheets/bin/teitohtml 'Zhang et al. - 2023 - Chemical Transdifferentiation of Somatic Cells Un.grobid.tei.xml'
basename: extra operand ‘al.’
Try 'basename --help' for more information.
~$ src/Stylesheets/bin/teitohtml Zhang\ et\ al.\ -\ 2023\ -\ Chemical\ Transdifferentiation\ of\ Somatic\ Cells\ Un.grobid.tei.xml
basename: extra operand ‘al.’
Try 'basename --help' for more information.
~$
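The `extra operand` message is what `basename` reports when it receives more operands than it accepts, which is what happens when a path containing spaces is expanded without quotes. A minimal sketch of that failure mode, assuming a variable named `file` stands in for the script's input argument (the actual variable names inside transformtei are not shown in this thread, and the filename is a shortened, hypothetical one):

```shell
#!/usr/bin/env bash
# Hypothetical input path; any name containing spaces shows the problem.
file='Zhang et al. - 2023 - Example.grobid.tei.xml'

# Unquoted: the shell splits $file on spaces, so basename sees several
# operands and exits with an error (GNU basename reports "extra operand").
unquoted_name=$(basename $file 2>/dev/null) || unquoted_name='<basename failed>'

# Quoted: the whole path reaches basename as a single operand.
quoted_name=$(basename "$file")

printf 'unquoted: %s\n' "$unquoted_name"
printf 'quoted:   %s\n' "$quoted_name"
```

Only the quoted call returns the full filename; the unquoted call fails in the same way as the transcript above.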

After modifying transformtei as indicated in #737 (comment), everything seems to work:

~$ dev/Stylesheets/bin/teitohtml Zhang\ et\ al.\ -\ 2023\ -\ Chemical\ Transdifferentiation\ of\ Somatic\ Cells\ Un.grobid.tei.xml
WARNING: No localsource set. Will get a copy from /home/alix/dev/Stylesheets/source/p5subset.xml if necessary.
Convert /home/alix/Zhang et al. - 2023 - Chemical Transdifferentiation of Somatic Cells Un.grobid.tei.xml to /home/alix/Zhang et al. - 2023 - Chemical Transdifferentiation of Somatic Cells Un.grobid.tei.html (tei to html) using profile default lang=en
Picked up _JAVA_OPTIONS: -Dsun.java2d.xrender=false -Dsun.java2d.pmoffscreen=false
     [echo] XSLT generate HTML files (language en, style /home/alix/dev/Stylesheets/profiles/default/html/to.xsl, in /home/alix/Zhang et al. - 2023 - Chemical Transdifferentiation of Somatic Cells Un.grobid.tei.xml, out /home/alix/Zhang et al. - 2023 - Chemical Transdifferentiation of Somatic Cells Un.grobid.tei.html)
Warning: XML resolver not found; external catalogs will be ignored
     [xslt] Found binaryObject without @url.
     [xslt] Found binaryObject without @url.

BUILD SUCCESSFUL
Total time: 2 seconds
~$

@martindholmes
Contributor

Both Test and Test2 are passing. Need to do some more specific tests...

@martindholmes
Contributor

@a1ix2 I'm testing the actual bin/ symlinks, and finding that the problem with spaces is not solved. For instance, if I do this (in the Stylesheets folder):

bin/docxtotei tempdir/test.docx

then everything works; but if I rename the directory to:

tempdir with spaces

then both of these fail:

bin/docxtotei tempdir\ with\ spaces/test.docx
bin/docxtotei "tempdir with spaces/text.docx"

Since this is the common use case, I'm wondering what your use case was, and in what situation the problem was solved for you.

@a1ix2
Author

a1ix2 commented Mar 7, 2025

@martindholmes I've added a few modifications that should make it work when both filenames and directories contain spaces. I'm extremely rusty on my bash, so I'm following recommendations I found here https://unix.stackexchange.com/questions/118433/quoting-within-command-substitution-in-bash
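The quoting rule from that answer can be sketched as follows. This is an illustration only, not the code in this PR: `output_path_for` is a hypothetical helper, and it assumes the input name ends in `.xml`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from transformtei: derive an output path
# by swapping the input's .xml extension, keeping any spaces intact.
output_path_for() {
  local input="$1" ext="$2"
  local dir base
  # Quotes inside $(...) are parsed independently of the outer quotes,
  # so "$(dirname "$input")" is safe and does not end the outer string.
  dir="$(dirname "$input")"
  base="$(basename "$input" .xml)"
  printf '%s/%s.%s\n' "$dir" "$base" "$ext"
}

output_path_for 'tempdir with spaces/test file.xml' html
```

The call prints `tempdir with spaces/test file.html`: both the directory and the filename keep their spaces because every expansion, including the ones nested inside the command substitutions, is quoted.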

My use case: I'm trying to convert a large number of TEI XML files obtained from GROBID (an academic PDF parser) into a format that is simpler to parse during their ingestion into a RAG. I was looking at teitohtml and teitolite, and was running the exact commands I included in the first post above.

@martindholmes
Contributor

Test2 passes, but Test fails for me:

java -Djava.awt.headless=true -jar ../lib/lib/jing-20120724.0.0.jar -c actual-results/oddbyexample.rnc bare/test2.xml
java -Djava.awt.headless=true -jar ../lib/lib/jing-20120724.0.0.jar actual-results/oddbyexample.xsd bare/test2.xml
fatal: file not found: /home/mholmes/temp/Stylesheets/Test/actual-results/oddbyexample.xsd (No such file or directory)
make[1]: *** [Makefile:590: test-oddity] Error 1

So it seems that oddbyexample.xsd is not being generated. When I run the relevant line manually:

../bin/teitoxsd  --localsource=../source/p5subset.xml actual-results/oddbyexample.odd  actual-results/oddbyexample.xsd

I get files called oddbyexample.xsd.tmp and oddbyexample.xsd.rng, but I don't get a plain xsd file.

This is hard to figure out. It rather looks as though the failure comes when Stylesheets/xsd/postprocess.xsl is being run, but I can't yet figure out what the problem is.

@martindholmes
Contributor

Meanwhile, both of these still fail:

bin/docxtotei tempdir\ with\ spaces/test.docx
bin/docxtotei "tempdir with spaces/text.docx"

So I don't really know where to go from here. My reading of the situation is that just changing the transformtei script is nowhere near sufficient; we will probably have to work through every stage of every transformation to ensure that paths are being passed along in a safe manner, which will be no small task. @a1ix2 I don't know if you want to take this on, or whether your life would be much simpler if you just batch-renamed all your input files to replace spaces and other unwise characters with underscores.
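For the batch-rename route, a minimal sketch might look like the following. `sanitize_names` is a hypothetical helper, not part of the Stylesheets; it assumes the files sit directly in one directory and that underscores are an acceptable replacement for anything outside letters, digits, dots, underscores, and hyphens.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename every regular file in a directory so that
# each run of characters outside [A-Za-z0-9._-] becomes one underscore.
sanitize_names() {
  local dir="$1" path name safe
  for path in "$dir"/*; do
    [ -f "$path" ] || continue
    name="$(basename "$path")"
    # -c: act on the complement of the allowed set; -s: squeeze repeats.
    safe="$(printf '%s' "$name" | tr -cs 'A-Za-z0-9._-' '_')"
    [ "$name" = "$safe" ] && continue
    # -n: do not overwrite an existing file if two names collide.
    mv -n -- "$path" "$dir/$safe"
  done
}
```

Calling `sanitize_names "/path/to/input dir"` would turn a file named `Zhang et al. 2023.xml` into `Zhang_et_al._2023.xml`. Names that collide after sanitizing are left untouched rather than overwritten, so a second pass with manual review may be needed.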

@a1ix2
Author

a1ix2 commented Mar 8, 2025

Test fails in the dev branch for me too, so that failure is likely unrelated to this PR. Test2 also fails in the dev branch, for reasons I can't parse (I'm very new to this).

These two both work for me in the a1ix2:double-quotes branch:

~/dev/Stylesheets/bin/docxtotei temp\ dir\ with\ spaces/test.docx
~/dev/Stylesheets/bin/docxtotei "temp dir with spaces/test.docx"
