Add \lstset{texcl=true} to latex publish template #5

Jaakkonen · 2021-10-02T23:30:25Z

This makes comment lines to be treated as LaTeX in code blocks outputted in publish command. Fixes a bug where having UTF-8 characters in comments would cause pdflatex to fail when rendering a document with error ! Package inputenc Error: Invalid UTF-8 byte sequence..

This makes comment lines to be treated as LaTeX in code blocks outputted in `publish` command. Fixes a bug where having UTF-8 characters in comments would cause `pdflatex` to fail when rendering a document with error `! Package inputenc Error: Invalid UTF-8 byte sequence.`.

siko1056 · 2021-10-03T13:54:01Z

Thank you for this PR 🙂

This change seams pretty straight forward. I'll apply it in the major Octave repo soon at https://www.octave.org/hg/octave. This is just a mirror of that repo.

siko1056 · 2021-10-04T01:47:09Z

@Jaakkonen How would you like to get attributed for your change? Do you prefer anonymity or another name than the current one in the patch https://github.com/gnu-octave/octave/pull/5.patch?

Jaakkonen <[email protected]>

siko1056 · 2021-10-04T01:53:28Z

Link to Savannah https://savannah.gnu.org/bugs/?61272

mmuetzel · 2021-10-05T15:11:06Z

Is this change save? What would happen if the comment contained invalid LaTeX commands, e.g. a stray \ character.

Can you show an example for which publish fails for you?

Jaakkonen · 2021-10-09T22:44:54Z

Is this change safe? What would happen if the comment contained invalid LaTeX commands, e.g. a stray \ character.

You can already inject LaTeX commands in title comments (Lines starting with % after a line starting with %% without a non-comment line in between). But yes, this will make latex commands in comments be interpreted and thus I think some string substitutions should be done before passing the comments to latex as shown in following caveats section.

Can you show an example for which publish fails for you?

file.m:

%% Errors fixed by the change
% Plain latex not interpreted \frac{1}{2}
% Mathmode is interpreted even now $\frac{1}{2}$
% UTF-8 in title comment works even without the patch
% ääeeéé
a = [1,2,3];

% But UTF-8 in comments not under title are fixed by this commit
% äääööéé

% \text{äeöé} will work even without this commit but UTF-8
% will be hidden in the output

Running

$ octave --silent --eval "publish('file.m', 'pdf')"

will fail without this patch and will work with it.

Caveats

However there are regressions regarding latex control characters in non-title comments:
caveats.m:

%% Regressions caused by enabling texcl
a = [1,2,3];

% % Comments after a comment sign will not be shown

% Underscores will not work a_b

% Misaligned & will not work either

% Unmatched math-mode breaks things too $

% # does not work either in horizontal mode

% Unmatched brackets seem to break as well {}}{ a{b

% And brackets in {1}{2} discarded

% Tilde ~ disappears

% a^b is also broken

% Plain \ is discarded or treated as command start

% Some latex outside math mode causes issues
% \frac{1}{2}

Running

$ octave --silent --eval "publish('caveats.m', 'pdf')"

will work without this patch but will fail with it.

Caveat fixes

Doing some string substitutions gets it back to as expected:

file.m:

%% Regressions caused by enabling texcl
a = [1,2,3];

% \% Comments after a comment sign will not be shown

% Underscores will not work a\_b

% Misaligned \& will not work either

% Unmatched math-mode breaks things too \$

% \# does not work either in horizontal mode

% Unmatched brackets seem to break as well \{\}\}\{ a\{b

% And brackets in \{1\}\{2\} discarded

% Tilde \textasciitilde disappears

% a\textasciicircum b is also broken

% Plain \textbackslash  is discarded or treated as command start

% Some latex outside math mode still causes
% issues
% \textbackslash frac\{1\}\{2\}

I think these substitutions should be added in the processing before merging this patch.
List of control characters here: https://tex.stackexchange.com/a/34586

How would you like to get attributed for your change?

It looks like Github decided to use its own proxy email but [email protected] could be better for this :)

mmuetzel · 2021-10-10T09:40:06Z

Doing the necessary string substitution correctly automatically looks rather difficult to me. But I might be missing something obvious.

Alternatively, we could take a different approach. E.g. the one outlined at https://en.wikibooks.org/wiki/LaTeX/Source_Code_Listings#Encoding_issue

diff -r 76393baa188d scripts/miscellaneous/private/__publish_latex_output__.m
--- a/scripts/miscellaneous/private/__publish_latex_output__.m	Fri Oct 08 17:37:32 2021 +0200
+++ b/scripts/miscellaneous/private/__publish_latex_output__.m	Sun Oct 10 11:23:08 2021 +0200
@@ -146,7 +146,26 @@
 'frame=single,',
 'tabsize=2,',
 'showstringspaces=false,',
-'breaklines=true}');
+'breaklines=true,',
+'inputencoding=utf8,',
+'extendedchars=true,',
+'literate=',
+'  {á}{{\''a}}1 {é}{{\''e}}1 {í}{{\''i}}1 {ó}{{\''o}}1 {ú}{{\''u}}1',
+'  {Á}{{\''A}}1 {É}{{\''E}}1 {Í}{{\''I}}1 {Ó}{{\''O}}1 {Ú}{{\''U}}1',
+'  {à}{{\`a}}1 {è}{{\`e}}1 {ì}{{\`i}}1 {ò}{{\`o}}1 {ù}{{\`u}}1',
+'  {À}{{\`A}}1 {È}{{\''E}}1 {Ì}{{\`I}}1 {Ò}{{\`O}}1 {Ù}{{\`U}}1',
+'  {ä}{{\"a}}1 {ë}{{\"e}}1 {ï}{{\"i}}1 {ö}{{\"o}}1 {ü}{{\"u}}1',
+'  {Ä}{{\"A}}1 {Ë}{{\"E}}1 {Ï}{{\"I}}1 {Ö}{{\"O}}1 {Ü}{{\"U}}1',
+'  {â}{{\^a}}1 {ê}{{\^e}}1 {î}{{\^i}}1 {ô}{{\^o}}1 {û}{{\^u}}1',
+'  {Â}{{\^A}}1 {Ê}{{\^E}}1 {Î}{{\^I}}1 {Ô}{{\^O}}1 {Û}{{\^U}}1',
+'  {ã}{{\~a}}1 {ẽ}{{\~e}}1 {ĩ}{{\~i}}1 {õ}{{\~o}}1 {ũ}{{\~u}}1',
+'  {Ã}{{\~A}}1 {Ẽ}{{\~E}}1 {Ĩ}{{\~I}}1 {Õ}{{\~O}}1 {Ũ}{{\~U}}1',
+'  {œ}{{\oe}}1 {Œ}{{\OE}}1 {æ}{{\ae}}1 {Æ}{{\AE}}1 {ß}{{\ss}}1',
+'  {ű}{{\H{u}}}1 {Ű}{{\H{U}}}1 {ő}{{\H{o}}}1 {Ő}{{\H{O}}}1',
+'  {ç}{{\c c}}1 {Ç}{{\c C}}1 {ø}{{\o}}1 {å}{{\r a}}1 {Å}{{\r A}}1',
+'  {€}{{\euro}}1 {£}{{\pounds}}1 {«}{{\guillemotleft}}1',
+'  {»}{{\guillemotright}}1 {ñ}{{\~n}}1 {Ñ}{{\~N}}1 {¿}{{?`}}1 {¡}{{!`}}1',
+'}');
 
   latex_head = sprintf ("%s\n",
 "",

Unfortunately, that would limit the supported characters to the ones explicitly included with a substitution.
I'm not a user of publish myself. Maybe users would prefer LaTeX input in comments.
But, in general it would be preferable if any change wouldn't break existing code.

Maybe, we could add the following setting from further up in that article to allow users to explicitly escape to LaTeX:

  escapeinside={\%*}{*)},          % if you want to add LaTeX within your code

But maybe chose escape sequences that are less likely to show up in existing code (i.e., not including comment characters).
What do you think?

Jaakkonen · 2021-10-12T13:00:22Z

The literate seems to be the way to go as that keeps backwards compability. I couldn't get the \begin{lstinputlisting} approach mentioned on the wikibook to work as that just hanged in tex to pdf render step.

siko1056 · 2021-10-13T02:01:10Z

Admittedly, this issue turns out more compilcated than I expected 😓

Probably it would be best and leave things as they are (following the plain Matlab design).

Another way to go is to make use of the ability to easily customize the publish formatter.

octave/scripts/miscellaneous/publish.m

Lines 261 to 263 in 963d643

    
           ## Check existence of custom formatter 
        
           custom_formatter = ["__publish_", options.format, "_output__"]; 
        
           if (! exist (custom_formatter, "file"))

For example, copy and rename the file scripts/miscellaneous/private/__publish_latex_output__.m to __publish_mylatex_output__.m and store it in your current project directory. Change the copied file to your needs, finally change your workflow command to

$ octave --silent --eval "publish('caveats.m', 'mylatex'), for i = 1:2, system ('cd html && pdflatex caveats.tex'); endfor"

Is this a satisfyable solution?

mmuetzel · 2021-10-13T08:21:16Z

The documentation of publish currently says:

All formatting syntax of
## @var{file} is treated according to the specified output format and included
## in the report.

It could be argued that the current behavior (i.e., not interpreting LaTeX commands in some contexts when that is the specified output format) is a bug.
I didn't read the documentation carefully enough before. Sorry. (Maybe we could also improve the wording of that sentence. It wasn't clear to me what "and included in the report" means in that sentence.)

So, changing the behavior in that respect could be considered a bug fix - not a regression.
With that in mind, the original proposed change looks good to me. But we should probably add a note about this change in behavior to the NEWS file.

siko1056 · 2021-10-13T08:44:36Z

Better documentation is always good, @mmuetzel please improve it as you please 🙂

The mentioned sentence from the publish documentation refers to "Publishing Markup", a Markdown-like invention by the Mathworks, e.g. *bold*. This "Publishing Markup" (a.k.a. formatting) is treatet for each output format accordingly by the "formatter" function, as described above.

Injecting LaTeX markup in comment sections (and expecting respective HTML-pendants as output), etc., was neither supported by Matlab and only works by coincidence. Again, the only documented supported markup is "Publishing Markup".

"The generated reports interpret any Publishing Markup in comments, which is explained in detail in the GNU Octave manual."

Returning to the original problem, an encoding issue with LaTeX listings, the English character limitation is better solved indvidually for each used language / special characters, as described above.

mmuetzel · 2021-10-13T09:07:02Z

@siko1056: Thanks for clarifying.

To make sure I didn't misunderstand again, is the following correct?
All Publishing Markup in comments is converted to formatting syntax in the specified output format.

In that case, the current bug is that additionally in title comments, LaTeX syntax is interpreted when the output format is "latex".
IIUC, that shouldn't be the case. Or should it?

siko1056 · 2021-10-13T09:21:18Z

There is no final answer to it. When I worked on publish and compared a lot with Matlab, some "native" (LaTeX / HTML) markup injections (intentionally?) worked (slipped through). Escaped for the output format was mostly in Matlab what causes errors or bad output. That is what I meant by "works by coincidence". The answer what works and not, probably changes with each Matlab release.

For my part, I am not very much interested in publish anymore, since Jupyter Notebook fill in this gab way better these days. This bug only caught my eye, as the fix seemed easy, but LaTeX and special chars seems to be a bigger issue which cannot be trivially solved for all potential use cases of "Publishing Markup". However, individually things can be better tackled with the described flexibility of Octave's publish, as described above.

mmuetzel · 2021-10-13T09:44:13Z

Sorry. I don't understand your answer. Let me re-phrase my question:
Would this sentence

Publishing Markup in comments is converted to formatting syntax in the specified output format.

be a valid replacement for

All formatting syntax of @var{file} is treated according to the specified output format and included in the report.

?

That still leaves open whether "native" formatting will be displayed literally or will be interpreted.

As @Jaakonen pointed out, the current behavior is inconsistent (in that "native" LaTeX formatting is interpreted in title comments but displayed literally in follow up comments).

I understand that you don't want to spend much time for changes in publish. But getting it to consistently interpret "native" LaTeX formatting everywhere might be easy. The one-line change originally proposed here might do that.

In that case, this additional sentence in the documentation might clarify that:

Additionally, formatting syntax native to the specified output format can be used.

siko1056 · 2021-10-13T10:02:31Z

Regarding the first sentence, there is a difference between a comment and a (let's call it) "Publish comment", which start with

%% Title
% body

or without title

%%
% body

Not every comment is a comment 😓 Matlab did not give them a precise name, they use the concept of "section breaks" for the description, which does not exist in Octave

Markup only works in comments that immediately follow a section break.

Thus the first sentence does not make much sense here.

Regarding the last sentence, this should not be advertised, as users start relying on it 😇 Only "Publishing Markup" is supported. Of course please continue the development and change the game to your needs, same with the one-liner change. I don't see such easy fix and don't work on it now. If you found it, please apply it 🙂

mmuetzel · 2021-10-13T10:24:07Z

Tbh, I don't understand the current documentation of publish and I don't know what that function is supposed to do.
I thought I did. But there seem to be much more subtleties to it than I expected.
So, I'm probably not the right person to make any changes to that function.
(I should have realized that earlier. Sorry for that.)

One more attempt for a documentation change:

Publishing Markup in section comments is converted to formatting syntax in the specified output format.
Section comments are comment blocks that start with a line with double comment character.

siko1056 · 2021-10-13T15:39:21Z

Fully agree to this sentence 👍

Basically publish does the same as Jupyter Notebook, but was invented much earlier before Matlab 2006. Jupyter has a clear distinction between Markdown cells and code cells. Only the code cells must be evaluated by Octave. The Markdown is treated seperately.

In publish, this concept of a Markdown cells is simulated with those comment blocks

##
# Publish Markup here

while everything else automatically is a code cell.

The current implementation of publish embeds those code cells for LaTeX (PDF) output inside \begin{lstlisting} environments, which works quite well for English codes. Thus from that point on the control is with pdflatex.

During the development, there was an endless cycle of try and error to fine tune what characters when to escape. And I am afraid to get dragged into this cycle again 😇 It takes lots of time and you always come up with new cenarios you never have imagined somebody uses this. Since Jupyter exists (I shortly learned after "finishing" my work 😆 ) I thought to have wasted my time a lot on this format (Jupyter suites for my use cases simply better, more supported by other tools, etc.).

However, Octave now can also publish documents in a mostly Matlab compatible way 🙂

* scripts/miscellaneous/publish.m: Clarify documentation. Publishing Markup is only interpreted in section comments. Remove sentence that suggests that commands in the output format might be interpreted in that format. * scripts/miscellaneous/private/__publish_latex_output__.m: Add substitution rules for some multi-byte UTF-8 characters (mostly Latin-based) to LaTeX publish template. This fixes a bug where using these characters in comments would cause `pdflatex` to fail. See: #5

mmuetzel · 2021-10-17T11:11:18Z

Letting this sink a bit, I opted for pushing a change that uses the literate approach here:
http://hg.savannah.gnu.org/hgweb/octave/rev/33d895260fa4

While this doesn't fix the issue with multi-byte UTF-8 characters in non-section comments completely, it probably avoids it for most (Western) users.

I forgot to include the bug number of the "proxy bug" on savannah. Sorry.
But I added a link to this pull request in the commit message.

* scripts/miscellaneous/publish.m: Clarify documentation. Publishing Markup is only interpreted in section comments. Remove sentence that suggests that commands in the output format might be interpreted in that format. * scripts/miscellaneous/private/__publish_latex_output__.m: Add substitution rules for some multi-byte UTF-8 characters (mostly Latin-based) to LaTeX publish template. This fixes a bug where using these characters in comments would cause `pdflatex` to fail. See: gnu-octave/octave#5

…dimensions (bug #65261) * data.cc (Fsize): Add explanation of case where DIM arg exceeds ndims (A). Add Example #5 to docstrings showing case of DIM arg exceeding ndims (A).

siko1056 self-assigned this Oct 3, 2021

siko1056 added the enhancement New feature or request label Oct 3, 2021

mmuetzel closed this Oct 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add \lstset{texcl=true} to latex publish template #5

Add \lstset{texcl=true} to latex publish template #5

Jaakkonen commented Oct 2, 2021

siko1056 commented Oct 3, 2021

siko1056 commented Oct 4, 2021

siko1056 commented Oct 4, 2021

mmuetzel commented Oct 5, 2021

Jaakkonen commented Oct 9, 2021

mmuetzel commented Oct 10, 2021

Jaakkonen commented Oct 12, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021 •

edited

Loading

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 17, 2021

Add \lstset{texcl=true} to latex publish template #5

Add \lstset{texcl=true} to latex publish template #5

Conversation

Jaakkonen commented Oct 2, 2021

siko1056 commented Oct 3, 2021

siko1056 commented Oct 4, 2021

siko1056 commented Oct 4, 2021

mmuetzel commented Oct 5, 2021

Jaakkonen commented Oct 9, 2021

Caveats

Caveat fixes

mmuetzel commented Oct 10, 2021

Jaakkonen commented Oct 12, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021 • edited Loading

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 13, 2021

siko1056 commented Oct 13, 2021

mmuetzel commented Oct 17, 2021

mmuetzel commented Oct 13, 2021 •

edited

Loading