Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added parsing for saltstack interpreted files #86

Closed
wants to merge 9 commits into from

Conversation

raddessi
Copy link

Hey! This is the follow up to #61 and the continuation of #60, I hope this is more inline with how you would like. Let me know what you think.

I'm not 100% positive of how different tags should be added, IE if all the salt fiels should be tagged with 'salt' and then additionally the 'salt-' tag, for now I only added the non-default renderer tags since I think matching a default renderer tag when the contents of the file is not actually readable by that renderer would cause issues. I'm on the fence though.

@raddessi
Copy link
Author

Also I'm not sure of a way to cleanly deal with the possible combination of any of these renderers as you are able to mix and match them at will, IE: #! jinja|py.. Should they just not be tagged?

@raddessi
Copy link
Author

Actually now that I think about it I do think this needs some work, but I'll need your input for direction. I've added some failing tests that illustrate the problem.. The test expectations should probably be set first though and I'm not sure about that. Thanks

@asottile
Copy link
Member

I think the approach is better than last time, I'm not sure the #! foo|bar syntax is something that identify should try and handle -- it's not at all standard and seems invented by saltstack (?)

let me know what you think, sorry for the brevity I'm trying to get through my backlog of email :S happy to elaborate on things!

@raddessi
Copy link
Author

raddessi commented Nov 19, 2019

Definitely cleaner than last time :) Yes you are correct, they allow for multiple rendering passes on files so foo|bar means it would first be rendered with foo then with bar. The default renderer is no shebang is specified is actually jinja|yaml (docs here).. so I'm not really sure how to represent all the possible combinations in a tag as anything other than just "salt" which is why in the current way I set this up a file with the .sls extension should only get the salt tag if there is no renderer specified, and it gets a more specific tag for renderers when there is one and only one specifed. IE files with #! py get tagged with both salt-py and python, but not salt since their format is not the default jinja/yaml.. I'm not 100% on the logic there though like I said.

The backstory here is that I have a lot of pillars written in pure python using either the py or pyobjects renderer and I want pre-commit to be able to identify those and run checks against them, but since it currently only tags the files as salt that is too broad as that matches all .sls files.

I think the usefulness of tagging the mix-and-match combinations is very low since it would not be easy to run linters or other tools against them as the combinations of syntaxes usually produces some hard to parse contents, but tagging the files when there is only one renderer that is not the normal jinja|yaml would be very useful. At least to me.

@raddessi
Copy link
Author

Maybe it is better to keep tagging anything ending with .sls as salt only, but then add the extra interpreter tag when we match a single renderer? So #! py would get tagged with salt and python rather than trying to make a subset of mixed tags like salt-py etc?

@asottile
Copy link
Member

yeah I think that makes the most sense, and would be less of a breaking change from the current behaviour. I think it's also fine / expected to not handle the foo|bar syntax since that's salt specific and imo not really in the spirit of this tool (to handle ever weird combination that other tools might do)

@raddessi
Copy link
Author

Hey there! I'm so sorry for letting this sit in your PR queue, I'll do my best to get this closed as quickly as possible. Check out this iteration, I think it is much cleaner. I wasn't able to make the code completely generic since the logic to parse shebang formats would probably change with different types but I think this is pretty decent.

One remaining issue is that coverage thinks some lines are not being tested and as far as I can see I think they are.. do you see an obvious issue?

Thanks again for keeping this alive, I hope your year is going well :)

Comment on lines 93 to 111
if ext == '.sls':
if shebang:
# try to match tags for the file extension of the first interpreter
try:
first_interpreter = shebang[0].split('|')[0]
ret.update(
extensions.EXTENSIONS[
interpreter_to_extension_map.get(
first_interpreter, first_interpreter,
)
],
)
except (IndexError, KeyError):
pass
else:
# the default interpreter is jinja
ret.update(extensions.EXTENSIONS['jinja'])

return ret
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the lines coverage is complaining about missing tests. I'm stumped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried the tests locally to verify that this block is actually executing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I verified 4 locations get hit

Copy link
Contributor

@CAM-Gerlach CAM-Gerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.

With regard to your overall approach, it would simplify the implementation considerably, make it more flexible/general, and less brittle/special case, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and having to hardcode interpreters for each extension, and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.

@@ -100,6 +101,7 @@
'lr': {'text', 'lektor'},
'lua': {'text', 'lua'},
'm': {'text', 'c', 'objective-c'},
'mako': {'text', 'mako'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no authority whatsoever on the matter, but for reference, this change was previously rejected in PR #137 , for reasons of mako refusing to specify a standard file extension. However, for reference, over 170 000 files on Github have the .mako extension (while 655 have the .mao extension), of which over 160 000 are identified as Mako by Linguist and a spot check of the remaining 10 000 suggests they too are likely all or almost all Mako. A check of mime-db indicates no other registered file types using this extension, and Mako template is the only type registered on FileInfo for .mako.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right on, I threw a couple of these in only because I didn;t see them in there already and thought why not, but if there are issues I'll just pull mako no problem there.

@@ -52,9 +52,10 @@
'gif': {'binary', 'image', 'gif'},
'go': {'text', 'go'},
'gotmpl': {'text', 'gotmpl'},
'gpg': {'text', 'gnupg'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'gpg': {'text', 'gnupg'},
'gpg': {'binary', 'gnupg'},

There are 101 000 results on Github, and spot-checking the results, as well as Linguist, MIME-DB and FileInfo indicates no conflicts. However, this GnuPG mailing list response, this SO answer and the FileInfo listing, as well as an examining a sample of .gpg files on Github, indicates that this extension is designated for binary-format GnuPG data, with asc being the ASCII-armored text equivalent for both PGP and GnuPG (with 635 000 Github hits, but only 470 000 of them keys, the rest being identified as AGS scripts and ASCIIDoc text).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, I'll update when I get back

@@ -179,6 +181,7 @@
'tgz': {'binary', 'gzip'},
'thrift': {'text', 'thrift'},
'tiff': {'binary', 'image', 'tiff'},
'tmpl': {'text', 'cheetah'},
Copy link
Contributor

@CAM-Gerlach CAM-Gerlach Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This association seems rather questionable. While it is documented that tmpl is considered the default file extension for Cheetah files, the likely ambiguity seems much too high to be acceptable. While there are no conflicts registered in Linguist or MIME-DB, there are two (of uncertain provenience) in FileInfo, and it is apparently used for Go templates.

Much worse, of the 2 300 000 Github file results, many/most don't seem to use the Cheetah $ syntax (that I'm not really familiar with beyond a quick skim of the docs), and most appear to be Javascript templates, XML templates, GoHTML templates, or other files. Furthermore, it seems rather improbable that more than a small fraction are Cheetah given, for instance, the Mustache template engine has 14.4k stars, an unambiguously standard extension (.mustache) and 995 000 files, while the Github repos I can find associated with Cheetah have only 131, 63, and 41 stars respectively, 2 OoM less despite having many more files associated with the extension.

Copy link
Author

@raddessi raddessi Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I did not find those on a quick search, good call again. I'll remove this one.

Comment on lines 93 to 111
if ext == '.sls':
if shebang:
# try to match tags for the file extension of the first interpreter
try:
first_interpreter = shebang[0].split('|')[0]
ret.update(
extensions.EXTENSIONS[
interpreter_to_extension_map.get(
first_interpreter, first_interpreter,
)
],
)
except (IndexError, KeyError):
pass
else:
# the default interpreter is jinja
ret.update(extensions.EXTENSIONS['jinja'])

return ret
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried the tests locally to verify that this block is actually executing?

@raddessi
Copy link
Author

I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.

With regard to your overall approach, it would simplify the implementation considerably, make it more flexible/general, and less brittle/special case, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and having to hardcode interpreters for each extension, and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.

There is a lot to unpack here :) Yeah I wasn't very happy about the setup here overall and I think this sounds like a good idea but I would have to think it over more. I'm currently on a work trip and will be knee deep in patch panels for the next week or so but as soon as I get back I'll look this over. Thanks for the reviews!

Copy link
Author

@raddessi raddessi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CAM-Gerlach Thank you again for your review :) I started working on your recommendations and saw another way to clean it up a bit more so I did that to get both of your thoughts on it.

It may be over reaching to add these interpreters since they aren't standards but I think it at least makes more sense than the previous iterations. I left a few comments on points I'd like opinions on if you have time. Thanks again

Comment on lines +69 to +70

ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love that this introduces a need to parse the extension on all files.. but I don't see a way around that when the logic to determine the extra requirement is by extension. Thoughts?

Copy link
Contributor

@CAM-Gerlach CAM-Gerlach Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

Suggested change
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
ext = os.path.splitext(path)[-1].lstrip('.')

Also, you could move this to the if statement, given the other change I suggest below, though that would only avoid it in a subset of cases.

There isn't really any way I can see around it given this design, but you can at least only parse and check it once place instead of two, as is currently being done (as my suggestions implement).

Comment on lines +81 to +85
tags.update(
tags_from_interpreter(
shebang[0].split('|')[0].strip(),
),
)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The splitting of the | delimited interpreters may be better located in the _shebang_split function.. I can move there if this looks good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the | convention to separate multiple interpreters is non-standard and only used with one particular file type, I don't know about trying to parse it on all files. I can't find much Googling about other uses, but the one usage I can find has different syntax than salt uses. But that's ultimately up to @asottile .

Comment on lines +219 to +226
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
not os.access(
path,
os.X_OK,
)
):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other place the extension is needed.. should it be passed to the function instead of re-parsed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest eliminating this check entirely unless there's something I'm missing (which is entirely possible); see below.

Comment on lines 279 to 281
# skip the slow calculation if the lengths are very different
if norm and abs(len(norm) - len(norm_license)) / len(norm) > .05:
if norm and abs(len(norm) - len(norm_license)) / len(norm) > 0.05:
continue
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make a few of these changes, they were done via precommit or black.. Maybe the precommit config changed recently?

Comment on lines +157 to +161
(
'jinja|yaml|gpg', {
'salt', 'file', 'non-executable', 'text', 'jinja',
},
),
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does have the benefit of files with the extension .gpg being marked as binary while files with the gpg interpreter still get marked as text.

Copy link
Author

@raddessi raddessi Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize, I quoted the wrong section. The relevant line is 153:

('gpg', {'salt', 'file', 'non-executable', 'text', 'gnupg'}),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@raddessi raddessi changed the title Add tags_from_extension_specific_shebang parser Added parsing for saltstack interpreted files Mar 9, 2021
Copy link
Contributor

@CAM-Gerlach CAM-Gerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I had a few comments on your implementation; the rest is up to @asottile .

Comment on lines +69 to +70

ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
Copy link
Contributor

@CAM-Gerlach CAM-Gerlach Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

Suggested change
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
ext = os.path.splitext(path)[-1].lstrip('.')

Also, you could move this to the if statement, given the other change I suggest below, though that would only avoid it in a subset of cases.

There isn't really any way I can see around it given this design, but you can at least only parse and check it once place instead of two, as is currently being done (as my suggestions implement).

Comment on lines +74 to +78
try:
tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])
except KeyError:
pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])
except KeyError:
pass

I would think it would both make more sense and be more efficient to add the extension-based tags in the same function the others are added in, tags_from_filename() around line 116, just like is done with EXTENSIONS_NEED_BINARY_CHECK, instead of a bespoke code block here that fires for every unknown file, and follows a different pattern than the others. This also means you can just do os.path.splitext(path)[-1].lstrip('.') right in the if statement for only those files that need it (saving a bit of performance at the cost of readibility).

Comment on lines +81 to +85
tags.update(
tags_from_interpreter(
shebang[0].split('|')[0].strip(),
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the | convention to separate multiple interpreters is non-standard and only used with one particular file type, I don't know about trying to parse it on all files. I can't find much Googling about other uses, but the one usage I can find has different syntax than salt uses. But that's ultimately up to @asottile .

Comment on lines +219 to +226
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
not os.access(
path,
os.X_OK,
)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest eliminating this check entirely unless there's something I'm missing (which is entirely possible); see below.

Comment on lines +219 to 228
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
not os.access(
path,
os.X_OK,
)
):
return ()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
not os.access(
path,
os.X_OK,
)
):
return ()

@asottile The only function that calls this already checks this same condition before running parse_shebang_from_file(), so is there a reason it needs to be checked twice? There might be something I'm missing, but if not, eliminating the redundant check would reduce complexity and increase performance without an obvious downside, and it also prevents any other third-party consumers from using it on files not marked executable if that restriction doesn't make sense for their application.

Comment on lines +157 to +161
(
'jinja|yaml|gpg', {
'salt', 'file', 'non-executable', 'text', 'jinja',
},
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@raddessi
Copy link
Author

raddessi commented Apr 8, 2021

FYI only, I'm playing around with this still. I don't really love any of these approaches so far as there is either a spaghetti code path or some duplicated processing.. I'm playing with either a custom path object class with @property getters to cache the duplicated calls or just a functools.lru_cache on top of some functions.. I need to do some performance comparisons before I send anything up. I don't want to add duplicated calls but I also don't want to add a ton of changes just to avoid them either.. I'm still on the fence about the whole path forward.

EDIT:
When I mentioned a custom path object I mean something like:

class IdentifyPath:

    def __init__(self, path: str) -> None:
        self._path = path

        super().__init__()

    @property
    def mode(self):
        try:
            return self._mode
        except AttributeError:
            try:
                self._mode = os.lstat(self.path).st_mode
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._mode

    @property
    def path(self):
        return self._path

    @property
    def filename(self):
        try:
            return self._filename
        except AttributeError:
            _, filename = os.path.split(self.path)
            self._filename = filename

        return self._filename

    @property
    def ext(self):
        try:
            return self._ext
        except AttributeError:
            _, ext = os.path.splitext(self.filename)
            self._ext = ext

        return self._ext

    @property
    def executable(self):
        try:
            return self._executable
        except AttributeError:
            try:
                self._executable = os.access(self.path, os.X_OK)
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._executable

    @property
    def shebang(self):
        try:
            return self._shebang
        except AttributeError:
            self._shebang = parse_shebang_from_file(self)

        return self._shebang

It's named terribly I know, the entire test suite would have to be changed, and I'm worried about setup cost performance like I said. I'll keep poking at it..

@asottile asottile closed this Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants