Added parsing for saltstack interpreted files #86

raddessi · 2019-11-14T18:20:48Z

Hey! This is the follow up to #61 and the continuation of #60, I hope this is more inline with how you would like. Let me know what you think.

I'm not 100% positive of how different tags should be added, IE if all the salt fiels should be tagged with 'salt' and then additionally the 'salt-' tag, for now I only added the non-default renderer tags since I think matching a default renderer tag when the contents of the file is not actually readable by that renderer would cause issues. I'm on the fence though.

raddessi · 2019-11-14T18:31:05Z

Also I'm not sure of a way to cleanly deal with the possible combination of any of these renderers as you are able to mix and match them at will, IE: #! jinja|py.. Should they just not be tagged?

raddessi · 2019-11-14T18:39:52Z

Actually now that I think about it I do think this needs some work, but I'll need your input for direction. I've added some failing tests that illustrate the problem.. The test expectations should probably be set first though and I'm not sure about that. Thanks

asottile · 2019-11-19T06:15:34Z

I think the approach is better than last time, I'm not sure the #! foo|bar syntax is something that identify should try and handle -- it's not at all standard and seems invented by saltstack (?)

let me know what you think, sorry for the brevity I'm trying to get through my backlog of email :S happy to elaborate on things!

raddessi · 2019-11-19T17:55:54Z

Definitely cleaner than last time :) Yes you are correct, they allow for multiple rendering passes on files so foo|bar means it would first be rendered with foo then with bar. The default renderer is no shebang is specified is actually jinja|yaml (docs here).. so I'm not really sure how to represent all the possible combinations in a tag as anything other than just "salt" which is why in the current way I set this up a file with the .sls extension should only get the salt tag if there is no renderer specified, and it gets a more specific tag for renderers when there is one and only one specifed. IE files with #! py get tagged with both salt-py and python, but not salt since their format is not the default jinja/yaml.. I'm not 100% on the logic there though like I said.

The backstory here is that I have a lot of pillars written in pure python using either the py or pyobjects renderer and I want pre-commit to be able to identify those and run checks against them, but since it currently only tags the files as salt that is too broad as that matches all .sls files.

I think the usefulness of tagging the mix-and-match combinations is very low since it would not be easy to run linters or other tools against them as the combinations of syntaxes usually produces some hard to parse contents, but tagging the files when there is only one renderer that is not the normal jinja|yaml would be very useful. At least to me.

raddessi · 2019-11-19T18:04:08Z

Maybe it is better to keep tagging anything ending with .sls as salt only, but then add the extra interpreter tag when we match a single renderer? So #! py would get tagged with salt and python rather than trying to make a subset of mixed tags like salt-py etc?

asottile · 2019-11-19T18:11:57Z

yeah I think that makes the most sense, and would be less of a breaking change from the current behaviour. I think it's also fine / expected to not handle the foo|bar syntax since that's salt specific and imo not really in the spirit of this tool (to handle ever weird combination that other tools might do)

raddessi · 2021-02-14T03:05:49Z

Hey there! I'm so sorry for letting this sit in your PR queue, I'll do my best to get this closed as quickly as possible. Check out this iteration, I think it is much cleaner. I wasn't able to make the code completely generic since the logic to parse shebang formats would probably change with different types but I think this is pretty decent.

One remaining issue is that coverage thinks some lines are not being tested and as far as I can see I think they are.. do you see an obvious issue?

Thanks again for keeping this alive, I hope your year is going well :)

raddessi · 2021-02-14T03:06:39Z

identify/identify.py

+    if ext == '.sls':
+        if shebang:
+            # try to match tags for the file extension of the first interpreter
+            try:
+                first_interpreter = shebang[0].split('|')[0]
+                ret.update(
+                    extensions.EXTENSIONS[
+                        interpreter_to_extension_map.get(
+                            first_interpreter, first_interpreter,
+                        )
+                    ],
+                )
+            except (IndexError, KeyError):
+                pass
+        else:
+            # the default interpreter is jinja
+            ret.update(extensions.EXTENSIONS['jinja'])
+
+    return ret


These are the lines coverage is complaining about missing tests. I'm stumped.

Have you tried the tests locally to verify that this block is actually executing?

yep, I verified 4 locations get hit

CAM-Gerlach

I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.

With regard to your overall approach, it would simplify the implementation considerably, make it more flexible/general, and less brittle/special case, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and having to hardcode interpreters for each extension, and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.

CAM-Gerlach · 2021-02-25T06:33:11Z

identify/extensions.py

@@ -100,6 +101,7 @@
    'lr': {'text', 'lektor'},
    'lua': {'text', 'lua'},
    'm': {'text', 'c', 'objective-c'},
+    'mako': {'text', 'mako'},


I have no authority whatsoever on the matter, but for reference, this change was previously rejected in PR #137 , for reasons of mako refusing to specify a standard file extension. However, for reference, over 170 000 files on Github have the .mako extension (while 655 have the .mao extension), of which over 160 000 are identified as Mako by Linguist and a spot check of the remaining 10 000 suggests they too are likely all or almost all Mako. A check of mime-db indicates no other registered file types using this extension, and Mako template is the only type registered on FileInfo for .mako.

Right on, I threw a couple of these in only because I didn;t see them in there already and thought why not, but if there are issues I'll just pull mako no problem there.

CAM-Gerlach · 2021-02-25T06:47:53Z

identify/extensions.py

@@ -52,9 +52,10 @@
    'gif': {'binary', 'image', 'gif'},
    'go': {'text', 'go'},
    'gotmpl': {'text', 'gotmpl'},
+    'gpg': {'text', 'gnupg'},


Suggested change

'gpg': {'text', 'gnupg'},

'gpg': {'binary', 'gnupg'},

There are 101 000 results on Github, and spot-checking the results, as well as Linguist, MIME-DB and FileInfo indicates no conflicts. However, this GnuPG mailing list response, this SO answer and the FileInfo listing, as well as an examining a sample of .gpg files on Github, indicates that this extension is designated for binary-format GnuPG data, with asc being the ASCII-armored text equivalent for both PGP and GnuPG (with 635 000 Github hits, but only 470 000 of them keys, the rest being identified as AGS scripts and ASCIIDoc text).

Great catch, I'll update when I get back

CAM-Gerlach · 2021-02-25T07:21:35Z

identify/extensions.py

@@ -179,6 +181,7 @@
    'tgz': {'binary', 'gzip'},
    'thrift': {'text', 'thrift'},
    'tiff': {'binary', 'image', 'tiff'},
+    'tmpl': {'text', 'cheetah'},


This association seems rather questionable. While it is documented that tmpl is considered the default file extension for Cheetah files, the likely ambiguity seems much too high to be acceptable. While there are no conflicts registered in Linguist or MIME-DB, there are two (of uncertain provenience) in FileInfo, and it is apparently used for Go templates.

Much worse, of the 2 300 000 Github file results, many/most don't seem to use the Cheetah $ syntax (that I'm not really familiar with beyond a quick skim of the docs), and most appear to be Javascript templates, XML templates, GoHTML templates, or other files. Furthermore, it seems rather improbable that more than a small fraction are Cheetah given, for instance, the Mustache template engine has 14.4k stars, an unambiguously standard extension (.mustache) and 995 000 files, while the Github repos I can find associated with Cheetah have only 131, 63, and 41 stars respectively, 2 OoM less despite having many more files associated with the extension.

ah I did not find those on a quick search, good call again. I'll remove this one.

CAM-Gerlach · 2021-02-25T07:23:04Z

identify/identify.py

+    if ext == '.sls':
+        if shebang:
+            # try to match tags for the file extension of the first interpreter
+            try:
+                first_interpreter = shebang[0].split('|')[0]
+                ret.update(
+                    extensions.EXTENSIONS[
+                        interpreter_to_extension_map.get(
+                            first_interpreter, first_interpreter,
+                        )
+                    ],
+                )
+            except (IndexError, KeyError):
+                pass
+        else:
+            # the default interpreter is jinja
+            ret.update(extensions.EXTENSIONS['jinja'])
+
+    return ret


Have you tried the tests locally to verify that this block is actually executing?

raddessi · 2021-02-25T16:01:42Z

I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.

With regard to your overall approach, it would simplify the implementation considerably, make it more flexible/general, and less brittle/special case, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and having to hardcode interpreters for each extension, and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.

There is a lot to unpack here :) Yeah I wasn't very happy about the setup here overall and I think this sounds like a good idea but I would have to think it over more. I'm currently on a work trip and will be knee deep in patch panels for the next week or so but as soon as I get back I'll look this over. Thanks for the reviews!

raddessi

@CAM-Gerlach Thank you again for your review :) I started working on your recommendations and saw another way to clean it up a bit more so I did that to get both of your thoughts on it.

It may be over reaching to add these interpreters since they aren't standards but I think it at least makes more sense than the previous iterations. I left a few comments on points I'd like opinions on if you have time. Thanks again

raddessi · 2021-03-09T20:29:03Z

identify/identify.py

+
+    ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')


I don't love that this introduces a need to parse the extension on all files.. but I don't see a way around that when the logic to determine the extra requirement is by extension. Thoughts?

Why not just

Suggested change

ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')

ext = os.path.splitext(path)[-1].lstrip('.')

Also, you could move this to the if statement, given the other change I suggest below, though that would only avoid it in a subset of cases.

There isn't really any way I can see around it given this design, but you can at least only parse and check it once place instead of two, as is currently being done (as my suggestions implement).

raddessi · 2021-03-09T20:31:01Z

identify/identify.py

+            tags.update(
+                tags_from_interpreter(
+                    shebang[0].split('|')[0].strip(),
+                ),
+            )


The splitting of the | delimited interpreters may be better located in the _shebang_split function.. I can move there if this looks good.

If the | convention to separate multiple interpreters is non-standard and only used with one particular file type, I don't know about trying to parse it on all files. I can't find much Googling about other uses, but the one usage I can find has different syntax than salt uses. But that's ultimately up to @asottile .

raddessi · 2021-03-09T20:32:40Z

identify/identify.py

+    ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
+    if (
+        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
+        not os.access(
+            path,
+            os.X_OK,
+        )
+    ):


This is the other place the extension is needed.. should it be passed to the function instead of re-parsed?

I would suggest eliminating this check entirely unless there's something I'm missing (which is entirely possible); see below.

raddessi · 2021-03-09T20:33:46Z

identify/identify.py

        # skip the slow calculation if the lengths are very different
-        if norm and abs(len(norm) - len(norm_license)) / len(norm) > .05:
+        if norm and abs(len(norm) - len(norm_license)) / len(norm) > 0.05:
            continue


I didn't make a few of these changes, they were done via precommit or black.. Maybe the precommit config changed recently?

raddessi · 2021-03-09T20:36:04Z

tests/identify_test.py

+        (
+            'jinja|yaml|gpg', {
+                'salt', 'file', 'non-executable', 'text', 'jinja',
+            },
+        ),


This does have the benefit of files with the extension .gpg being marked as binary while files with the gpg interpreter still get marked as text.

I apologize, I quoted the wrong section. The relevant line is 153:

('gpg', {'salt', 'file', 'non-executable', 'text', 'gnupg'}),

CAM-Gerlach

Thanks! I had a few comments on your implementation; the rest is up to @asottile .

CAM-Gerlach · 2021-03-09T22:45:03Z

identify/identify.py

+
+    ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')


Why not just

Suggested change

ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')

ext = os.path.splitext(path)[-1].lstrip('.')

Also, you could move this to the if statement, given the other change I suggest below, though that would only avoid it in a subset of cases.

There isn't really any way I can see around it given this design, but you can at least only parse and check it once place instead of two, as is currently being done (as my suggestions implement).

CAM-Gerlach · 2021-03-09T22:53:26Z

identify/identify.py

+        try:
+            tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])
+        except KeyError:
+            pass
+


Suggested change

try:

tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])

except KeyError:

pass

I would think it would both make more sense and be more efficient to add the extension-based tags in the same function the others are added in, tags_from_filename() around line 116, just like is done with EXTENSIONS_NEED_BINARY_CHECK, instead of a bespoke code block here that fires for every unknown file, and follows a different pattern than the others. This also means you can just do os.path.splitext(path)[-1].lstrip('.') right in the if statement for only those files that need it (saving a bit of performance at the cost of readibility).

CAM-Gerlach · 2021-03-09T23:05:31Z

identify/identify.py

+            tags.update(
+                tags_from_interpreter(
+                    shebang[0].split('|')[0].strip(),
+                ),
+            )


If the | convention to separate multiple interpreters is non-standard and only used with one particular file type, I don't know about trying to parse it on all files. I can't find much Googling about other uses, but the one usage I can find has different syntax than salt uses. But that's ultimately up to @asottile .

CAM-Gerlach · 2021-03-09T23:09:11Z

identify/identify.py

+    ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
+    if (
+        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
+        not os.access(
+            path,
+            os.X_OK,
+        )
+    ):


I would suggest eliminating this check entirely unless there's something I'm missing (which is entirely possible); see below.

CAM-Gerlach · 2021-03-09T23:17:15Z

identify/identify.py

+    ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
+    if (
+        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
+        not os.access(
+            path,
+            os.X_OK,
+        )
+    ):
        return ()



Suggested change

ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')

if (

ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and

not os.access(

path,

os.X_OK,

)

):

return ()

@asottile The only function that calls this already checks this same condition before running parse_shebang_from_file(), so is there a reason it needs to be checked twice? There might be something I'm missing, but if not, eliminating the redundant check would reduce complexity and increase performance without an obvious downside, and it also prevents any other third-party consumers from using it on files not marked executable if that restriction doesn't make sense for their application.

CAM-Gerlach · 2021-03-09T23:20:24Z

tests/identify_test.py

+        (
+            'jinja|yaml|gpg', {
+                'salt', 'file', 'non-executable', 'text', 'jinja',
+            },
+        ),


raddessi · 2021-04-08T17:32:13Z

FYI only, I'm playing around with this still. I don't really love any of these approaches so far as there is either a spaghetti code path or some duplicated processing.. I'm playing with either a custom path object class with @property getters to cache the duplicated calls or just a functools.lru_cache on top of some functions.. I need to do some performance comparisons before I send anything up. I don't want to add duplicated calls but I also don't want to add a ton of changes just to avoid them either.. I'm still on the fence about the whole path forward.

EDIT:
When I mentioned a custom path object I mean something like:

class IdentifyPath:

    def __init__(self, path: str) -> None:
        self._path = path

        super().__init__()

    @property
    def mode(self):
        try:
            return self._mode
        except AttributeError:
            try:
                self._mode = os.lstat(self.path).st_mode
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._mode

    @property
    def path(self):
        return self._path

    @property
    def filename(self):
        try:
            return self._filename
        except AttributeError:
            _, filename = os.path.split(self.path)
            self._filename = filename

        return self._filename

    @property
    def ext(self):
        try:
            return self._ext
        except AttributeError:
            _, ext = os.path.splitext(self.filename)
            self._ext = ext

        return self._ext

    @property
    def executable(self):
        try:
            return self._executable
        except AttributeError:
            try:
                self._executable = os.access(self.path, os.X_OK)
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._executable

    @property
    def shebang(self):
        try:
            return self._shebang
        except AttributeError:
            self._shebang = parse_shebang_from_file(self)

        return self._shebang

It's named terribly I know, the entire test suite would have to be changed, and I'm worried about setup cost performance like I said. I'll keep poking at it..

raddessi added 2 commits November 14, 2019 10:29

Added EXTENSIONS_NEED_SHEBANG_CHECK and logic

4504fdf

Added tests for tags_from_extension_specific_shebang

7617fa2

Added a few more test cases that illustrate a shortcoming

e74a495

raddessi added 2 commits February 13, 2021 11:31

Merge branch 'master' into master.shebang-check

74b5638

Finally updated, but coverage is not happy

d5d039f

raddessi commented Feb 14, 2021

View reviewed changes

CAM-Gerlach reviewed Feb 25, 2021

View reviewed changes

raddessi added 3 commits March 9, 2021 11:39

Merge branch 'master' into master.shebang-check

a3b69db

Removed unclear extensions and changes gpg type

9e7bb37

Next attempt at doing this better

96a56b9

raddessi commented Mar 9, 2021

View reviewed changes

raddessi changed the title ~~Add tags_from_extension_specific_shebang parser~~ Added parsing for saltstack interpreted files Mar 9, 2021

CAM-Gerlach reviewed Mar 9, 2021

View reviewed changes

chore: Merge branch 'master' into master.shebang-check

11d2ce5

graingert mentioned this pull request Jun 24, 2021

identify jest snapshot files eg those starting: // Jest Snapshot v1, https://goo.gl/fbAQLP with .snap file extension #215

Closed

kotki4017 approved these changes Dec 10, 2021

View reviewed changes

asottile closed this Sep 29, 2022


		ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')


	ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
	ext = os.path.splitext(path)[-1].lstrip('.')

Added parsing for saltstack interpreted files #86

Added parsing for saltstack interpreted files #86

Conversation

raddessi commented Nov 14, 2019

raddessi commented Nov 14, 2019

raddessi commented Nov 14, 2019

asottile commented Nov 19, 2019

raddessi commented Nov 19, 2019 • edited Loading

raddessi commented Nov 19, 2019

asottile commented Nov 19, 2019

raddessi commented Feb 14, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAM-Gerlach left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAM-Gerlach Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

raddessi Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

raddessi commented Feb 25, 2021

raddessi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAM-Gerlach Mar 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

raddessi Mar 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAM-Gerlach left a comment

Choose a reason for hiding this comment

CAM-Gerlach Mar 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

raddessi commented Apr 8, 2021 • edited Loading

raddessi commented Nov 19, 2019 •

edited

Loading

CAM-Gerlach left a comment •

edited

Loading

CAM-Gerlach Feb 25, 2021 •

edited

Loading

raddessi Feb 25, 2021 •

edited

Loading

CAM-Gerlach Mar 9, 2021 •

edited

Loading

raddessi Mar 9, 2021 •

edited

Loading

CAM-Gerlach Mar 9, 2021 •

edited

Loading

raddessi commented Apr 8, 2021 •

edited

Loading