Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add \lstset{texcl=true} to latex publish template #5

Closed
wants to merge 1 commit into from

Conversation

Jaakkonen
Copy link

This makes comment lines to be treated as LaTeX in code blocks outputted in publish command. Fixes a bug where having UTF-8 characters in comments would cause pdflatex to fail when rendering a document with error ! Package inputenc Error: Invalid UTF-8 byte sequence..

This makes comment lines to be treated as LaTeX in code blocks outputted in `publish` command. Fixes a bug where having UTF-8 characters in comments would cause `pdflatex` to fail when rendering a document with error `! Package inputenc Error: Invalid UTF-8 byte sequence.`.
@siko1056
Copy link
Member

siko1056 commented Oct 3, 2021

Thank you for this PR 🙂

This change seams pretty straight forward. I'll apply it in the major Octave repo soon at https://www.octave.org/hg/octave. This is just a mirror of that repo.

@siko1056 siko1056 self-assigned this Oct 3, 2021
@siko1056 siko1056 added the enhancement New feature or request label Oct 3, 2021
@siko1056
Copy link
Member

siko1056 commented Oct 4, 2021

@Jaakkonen How would you like to get attributed for your change? Do you prefer anonymity or another name than the current one in the patch https://github.com/gnu-octave/octave/pull/5.patch?

Jaakkonen <[email protected]>

@siko1056
Copy link
Member

siko1056 commented Oct 4, 2021

Link to Savannah https://savannah.gnu.org/bugs/?61272

@mmuetzel
Copy link
Member

mmuetzel commented Oct 5, 2021

Is this change save? What would happen if the comment contained invalid LaTeX commands, e.g. a stray \ character.

Can you show an example for which publish fails for you?

@Jaakkonen
Copy link
Author

Is this change safe? What would happen if the comment contained invalid LaTeX commands, e.g. a stray \ character.

You can already inject LaTeX commands in title comments (Lines starting with % after a line starting with %% without a non-comment line in between). But yes, this will make latex commands in comments be interpreted and thus I think some string substitutions should be done before passing the comments to latex as shown in following caveats section.

Can you show an example for which publish fails for you?

file.m:

%% Errors fixed by the change
% Plain latex not interpreted \frac{1}{2}
% Mathmode is interpreted even now $\frac{1}{2}$
% UTF-8 in title comment works even without the patch
% ääeeéé
a = [1,2,3];

% But UTF-8 in comments not under title are fixed by this commit
% äääööéé

% \text{äeöé} will work even without this commit but UTF-8
% will be hidden in the output

Running

$ octave --silent --eval "publish('file.m', 'pdf')"

will fail without this patch and will work with it.

Caveats

However there are regressions regarding latex control characters in non-title comments:
caveats.m:

%% Regressions caused by enabling texcl
a = [1,2,3];

% % Comments after a comment sign will not be shown

% Underscores will not work a_b

% Misaligned & will not work either

% Unmatched math-mode breaks things too $

% # does not work either in horizontal mode

% Unmatched brackets seem to break as well {}}{ a{b

% And brackets in {1}{2} discarded

% Tilde ~ disappears

% a^b is also broken

% Plain \ is discarded or treated as command start

% Some latex outside math mode causes issues
% \frac{1}{2}

Running

$ octave --silent --eval "publish('caveats.m', 'pdf')"

will work without this patch but will fail with it.

Caveat fixes

Doing some string substitutions gets it back to as expected:

file.m:

%% Regressions caused by enabling texcl
a = [1,2,3];

% \% Comments after a comment sign will not be shown

% Underscores will not work a\_b

% Misaligned \& will not work either

% Unmatched math-mode breaks things too \$

% \# does not work either in horizontal mode

% Unmatched brackets seem to break as well \{\}\}\{ a\{b

% And brackets in \{1\}\{2\} discarded

% Tilde \textasciitilde disappears

% a\textasciicircum b is also broken

% Plain \textbackslash  is discarded or treated as command start

% Some latex outside math mode still causes
% issues
% \textbackslash frac\{1\}\{2\}

I think these substitutions should be added in the processing before merging this patch.
List of control characters here: https://tex.stackexchange.com/a/34586

How would you like to get attributed for your change?

It looks like Github decided to use its own proxy email but [email protected] could be better for this :)

@mmuetzel
Copy link
Member

Doing the necessary string substitution correctly automatically looks rather difficult to me. But I might be missing something obvious.

Alternatively, we could take a different approach. E.g. the one outlined at https://en.wikibooks.org/wiki/LaTeX/Source_Code_Listings#Encoding_issue

diff -r 76393baa188d scripts/miscellaneous/private/__publish_latex_output__.m
--- a/scripts/miscellaneous/private/__publish_latex_output__.m	Fri Oct 08 17:37:32 2021 +0200
+++ b/scripts/miscellaneous/private/__publish_latex_output__.m	Sun Oct 10 11:23:08 2021 +0200
@@ -146,7 +146,26 @@
 'frame=single,',
 'tabsize=2,',
 'showstringspaces=false,',
-'breaklines=true}');
+'breaklines=true,',
+'inputencoding=utf8,',
+'extendedchars=true,',
+'literate=',
+'  {á}{{\''a}}1 {é}{{\''e}}1 {í}{{\''i}}1 {ó}{{\''o}}1 {ú}{{\''u}}1',
+'  {Á}{{\''A}}1 {É}{{\''E}}1 {Í}{{\''I}}1 {Ó}{{\''O}}1 {Ú}{{\''U}}1',
+'  {à}{{\`a}}1 {è}{{\`e}}1 {ì}{{\`i}}1 {ò}{{\`o}}1 {ù}{{\`u}}1',
+'  {À}{{\`A}}1 {È}{{\''E}}1 {Ì}{{\`I}}1 {Ò}{{\`O}}1 {Ù}{{\`U}}1',
+'  {ä}{{\"a}}1 {ë}{{\"e}}1 {ï}{{\"i}}1 {ö}{{\"o}}1 {ü}{{\"u}}1',
+'  {Ä}{{\"A}}1 {Ë}{{\"E}}1 {Ï}{{\"I}}1 {Ö}{{\"O}}1 {Ü}{{\"U}}1',
+'  {â}{{\^a}}1 {ê}{{\^e}}1 {î}{{\^i}}1 {ô}{{\^o}}1 {û}{{\^u}}1',
+'  {Â}{{\^A}}1 {Ê}{{\^E}}1 {Î}{{\^I}}1 {Ô}{{\^O}}1 {Û}{{\^U}}1',
+'  {ã}{{\~a}}1 {ẽ}{{\~e}}1 {ĩ}{{\~i}}1 {õ}{{\~o}}1 {ũ}{{\~u}}1',
+'  {Ã}{{\~A}}1 {Ẽ}{{\~E}}1 {Ĩ}{{\~I}}1 {Õ}{{\~O}}1 {Ũ}{{\~U}}1',
+'  {œ}{{\oe}}1 {Œ}{{\OE}}1 {æ}{{\ae}}1 {Æ}{{\AE}}1 {ß}{{\ss}}1',
+'  {ű}{{\H{u}}}1 {Ű}{{\H{U}}}1 {ő}{{\H{o}}}1 {Ő}{{\H{O}}}1',
+'  {ç}{{\c c}}1 {Ç}{{\c C}}1 {ø}{{\o}}1 {å}{{\r a}}1 {Å}{{\r A}}1',
+'  {€}{{\euro}}1 {£}{{\pounds}}1 {«}{{\guillemotleft}}1',
+'  {»}{{\guillemotright}}1 {ñ}{{\~n}}1 {Ñ}{{\~N}}1 {¿}{{?`}}1 {¡}{{!`}}1',
+'}');
 
   latex_head = sprintf ("%s\n",
 "",

Unfortunately, that would limit the supported characters to the ones explicitly included with a substitution.
I'm not a user of publish myself. Maybe users would prefer LaTeX input in comments.
But, in general it would be preferable if any change wouldn't break existing code.

Maybe, we could add the following setting from further up in that article to allow users to explicitly escape to LaTeX:

  escapeinside={\%*}{*)},          % if you want to add LaTeX within your code

But maybe chose escape sequences that are less likely to show up in existing code (i.e., not including comment characters).
What do you think?

@Jaakkonen
Copy link
Author

The literate seems to be the way to go as that keeps backwards compability. I couldn't get the \begin{lstinputlisting} approach mentioned on the wikibook to work as that just hanged in tex to pdf render step.

@siko1056
Copy link
Member

Admittedly, this issue turns out more compilcated than I expected 😓

Probably it would be best and leave things as they are (following the plain Matlab design).

Another way to go is to make use of the ability to easily customize the publish formatter.

## Check existence of custom formatter
custom_formatter = ["__publish_", options.format, "_output__"];
if (! exist (custom_formatter, "file"))

For example, copy and rename the file scripts/miscellaneous/private/__publish_latex_output__.m to __publish_mylatex_output__.m and store it in your current project directory. Change the copied file to your needs, finally change your workflow command to

$ octave --silent --eval "publish('caveats.m', 'mylatex'), for i = 1:2, system ('cd html && pdflatex caveats.tex'); endfor"

Is this a satisfyable solution?

@mmuetzel
Copy link
Member

The documentation of publish currently says:

All formatting syntax of
## @var{file} is treated according to the specified output format and included
## in the report.

It could be argued that the current behavior (i.e., not interpreting LaTeX commands in some contexts when that is the specified output format) is a bug.
I didn't read the documentation carefully enough before. Sorry. (Maybe we could also improve the wording of that sentence. It wasn't clear to me what "and included in the report" means in that sentence.)

So, changing the behavior in that respect could be considered a bug fix - not a regression.
With that in mind, the original proposed change looks good to me. But we should probably add a note about this change in behavior to the NEWS file.

@siko1056
Copy link
Member

Better documentation is always good, @mmuetzel please improve it as you please 🙂

The mentioned sentence from the publish documentation refers to "Publishing Markup", a Markdown-like invention by the Mathworks, e.g. *bold*. This "Publishing Markup" (a.k.a. formatting) is treatet for each output format accordingly by the "formatter" function, as described above.

Injecting LaTeX markup in comment sections (and expecting respective HTML-pendants as output), etc., was neither supported by Matlab and only works by coincidence. Again, the only documented supported markup is "Publishing Markup".

"The generated reports interpret any Publishing Markup in comments, which is explained in detail in the GNU Octave manual."

Returning to the original problem, an encoding issue with LaTeX listings, the English character limitation is better solved indvidually for each used language / special characters, as described above.

@mmuetzel
Copy link
Member

mmuetzel commented Oct 13, 2021

@siko1056: Thanks for clarifying.

To make sure I didn't misunderstand again, is the following correct?
All Publishing Markup in comments is converted to formatting syntax in the specified output format.

In that case, the current bug is that additionally in title comments, LaTeX syntax is interpreted when the output format is "latex".
IIUC, that shouldn't be the case. Or should it?

@siko1056
Copy link
Member

There is no final answer to it. When I worked on publish and compared a lot with Matlab, some "native" (LaTeX / HTML) markup injections (intentionally?) worked (slipped through). Escaped for the output format was mostly in Matlab what causes errors or bad output. That is what I meant by "works by coincidence". The answer what works and not, probably changes with each Matlab release.

For my part, I am not very much interested in publish anymore, since Jupyter Notebook fill in this gab way better these days. This bug only caught my eye, as the fix seemed easy, but LaTeX and special chars seems to be a bigger issue which cannot be trivially solved for all potential use cases of "Publishing Markup". However, individually things can be better tackled with the described flexibility of Octave's publish, as described above.

@mmuetzel
Copy link
Member

Sorry. I don't understand your answer. Let me re-phrase my question:
Would this sentence

Publishing Markup in comments is converted to formatting syntax in the specified output format.

be a valid replacement for

All formatting syntax of @var{file} is treated according to the specified output format and included in the report.

?

That still leaves open whether "native" formatting will be displayed literally or will be interpreted.

As @Jaakonen pointed out, the current behavior is inconsistent (in that "native" LaTeX formatting is interpreted in title comments but displayed literally in follow up comments).

I understand that you don't want to spend much time for changes in publish. But getting it to consistently interpret "native" LaTeX formatting everywhere might be easy. The one-line change originally proposed here might do that.

In that case, this additional sentence in the documentation might clarify that:

Additionally, formatting syntax native to the specified output format can be used.

@siko1056
Copy link
Member

Regarding the first sentence, there is a difference between a comment and a (let's call it) "Publish comment", which start with

%% Title
% body

or without title

%%
% body

Not every comment is a comment 😓 Matlab did not give them a precise name, they use the concept of "section breaks" for the description, which does not exist in Octave

Markup only works in comments that immediately follow a section break.

Thus the first sentence does not make much sense here.

Regarding the last sentence, this should not be advertised, as users start relying on it 😇 Only "Publishing Markup" is supported. Of course please continue the development and change the game to your needs, same with the one-liner change. I don't see such easy fix and don't work on it now. If you found it, please apply it 🙂

@mmuetzel
Copy link
Member

Tbh, I don't understand the current documentation of publish and I don't know what that function is supposed to do.
I thought I did. But there seem to be much more subtleties to it than I expected.
So, I'm probably not the right person to make any changes to that function.
(I should have realized that earlier. Sorry for that.)

One more attempt for a documentation change:

Publishing Markup in section comments is converted to formatting syntax in the specified output format.
Section comments are comment blocks that start with a line with double comment character.

@siko1056
Copy link
Member

Fully agree to this sentence 👍

Basically publish does the same as Jupyter Notebook, but was invented much earlier before Matlab 2006. Jupyter has a clear distinction between Markdown cells and code cells. Only the code cells must be evaluated by Octave. The Markdown is treated seperately.

In publish, this concept of a Markdown cells is simulated with those comment blocks

##
# Publish Markup here

while everything else automatically is a code cell.

The current implementation of publish embeds those code cells for LaTeX (PDF) output inside \begin{lstlisting} environments, which works quite well for English codes. Thus from that point on the control is with pdflatex.

During the development, there was an endless cycle of try and error to fine tune what characters when to escape. And I am afraid to get dragged into this cycle again 😇 It takes lots of time and you always come up with new cenarios you never have imagined somebody uses this. Since Jupyter exists (I shortly learned after "finishing" my work 😆 ) I thought to have wasted my time a lot on this format (Jupyter suites for my use cases simply better, more supported by other tools, etc.).

However, Octave now can also publish documents in a mostly Matlab compatible way 🙂

siko1056 pushed a commit that referenced this pull request Oct 17, 2021
* scripts/miscellaneous/publish.m: Clarify documentation. Publishing Markup is
  only interpreted in section comments. Remove sentence that suggests that
  commands in the output format might be interpreted in that format.
* scripts/miscellaneous/private/__publish_latex_output__.m: Add substitution
  rules for some multi-byte UTF-8 characters (mostly Latin-based) to LaTeX
  publish template. This fixes a bug where using these characters in comments
  would cause `pdflatex` to fail.

See: #5
@mmuetzel
Copy link
Member

Letting this sink a bit, I opted for pushing a change that uses the literate approach here:
http://hg.savannah.gnu.org/hgweb/octave/rev/33d895260fa4

While this doesn't fix the issue with multi-byte UTF-8 characters in non-section comments completely, it probably avoids it for most (Western) users.

I forgot to include the bug number of the "proxy bug" on savannah. Sorry.
But I added a link to this pull request in the commit message.

@mmuetzel mmuetzel closed this Oct 17, 2021
mtmiller pushed a commit to mtmiller/octave that referenced this pull request Oct 17, 2021
* scripts/miscellaneous/publish.m: Clarify documentation. Publishing Markup is
  only interpreted in section comments. Remove sentence that suggests that
  commands in the output format might be interpreted in that format.
* scripts/miscellaneous/private/__publish_latex_output__.m: Add substitution
  rules for some multi-byte UTF-8 characters (mostly Latin-based) to LaTeX
  publish template. This fixes a bug where using these characters in comments
  would cause `pdflatex` to fail.

See: gnu-octave/octave#5
siko1056 pushed a commit that referenced this pull request Feb 8, 2024
…dimensions (bug #65261)

* data.cc (Fsize): Add explanation of case where DIM arg exceeds ndims (A).
Add Example #5 to docstrings showing case of DIM arg exceeding ndims (A).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants