Added parsing for saltstack interpreted files #86
Conversation
Also, I'm not sure of a way to cleanly deal with the possible combination of any of these renderers, as you are able to mix and match them at will.
Actually, now that I think about it, I do think this needs some work, but I'll need your input for direction. I've added some failing tests that illustrate the problem. The test expectations should probably be set first, though, and I'm not sure about that. Thanks
I think the approach is better than last time. Let me know what you think; sorry for the brevity, I'm trying to get through my backlog of email :S Happy to elaborate on things!
Definitely cleaner than last time :) Yes, you are correct, they allow for multiple rendering passes on files. The backstory here is that I have a lot of pillars written in pure python. I think the usefulness of tagging the mix-and-match combinations is very low, since it would not be easy to run linters or other tools against them (the combinations of syntaxes usually produce some hard-to-parse contents), but tagging files that use a single non-default renderer still seems useful.
Maybe it is better to keep tagging anything ending with .sls as salt by default?
Yeah, I think that makes the most sense, and it would be less of a breaking change from the current behaviour. I think it's also fine / expected not to handle the mixed renderer combinations.
Hey there! I'm so sorry for letting this sit in your PR queue; I'll do my best to get this closed as quickly as possible. Check out this iteration, I think it is much cleaner. I wasn't able to make the code completely generic, since the logic to parse shebang formats would probably change with different types, but I think this is pretty decent. One remaining issue is that coverage thinks some lines are not being tested, and as far as I can see they are. Do you see an obvious issue? Thanks again for keeping this alive, I hope your year is going well :)
identify/identify.py
Outdated
if ext == '.sls':
    if shebang:
        # try to match tags for the file extension of the first interpreter
        try:
            first_interpreter = shebang[0].split('|')[0]
            ret.update(
                extensions.EXTENSIONS[
                    interpreter_to_extension_map.get(
                        first_interpreter, first_interpreter,
                    )
                ],
            )
        except (IndexError, KeyError):
            pass
    else:
        # the default interpreter is jinja
        ret.update(extensions.EXTENSIONS['jinja'])

return ret
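For context, Salt's "render pipe" shebang (e.g. `#!jinja|yaml`, as exercised by the tests later in this thread) names one or more renderers separated by `|`. A minimal standalone sketch of extracting the first renderer; the function and the mapping are hypothetical illustrations, not identify's actual API:

```python
# Hypothetical helper: pull the first renderer out of a Salt render-pipe
# shebang line such as "#!jinja|yaml" or "#!py".
INTERPRETER_TO_EXTENSION = {'jinja': 'jinja', 'py': 'py', 'yaml': 'yaml'}


def first_renderer(shebang_line: str) -> str:
    """Return the first renderer named in a Salt shebang line."""
    # Strip the leading "#!" characters, then split on the "|" pipe separator.
    interpreters = shebang_line.lstrip('#!').strip()
    return interpreters.split('|')[0].strip()


print(first_renderer('#!jinja|yaml|gpg'))  # -> jinja
print(first_renderer('#!py'))              # -> py
```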
These are the lines coverage is complaining about missing tests. I'm stumped.
Have you tried the tests locally to verify that this block is actually executing?
yep, I verified 4 locations get hit
I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.
With regard to your overall approach, it would simplify the implementation considerably, and make it more flexible/general and less brittle/special-cased, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and to hardcode interpreters for each extension and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.
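The proposal above can be sketched roughly as follows. All names and dict contents here are illustrative assumptions drawn from this thread, not identify's actual data or API:

```python
import os

# Hypothetical dict, mirroring the style of EXTENSIONS_NEED_BINARY_CHECK:
# extensions whose files need a shebang check even when not executable.
EXTENSIONS_NEED_SHEBANG_CHECK = {
    'sls': {'salt', 'text'},
}


def needs_shebang_check(filename: str) -> bool:
    """Return True if this filename's extension requires reading the shebang."""
    ext = os.path.splitext(filename)[-1].lstrip('.')
    return ext in EXTENSIONS_NEED_SHEBANG_CHECK


# The existing flow in tags_from_path would then become roughly:
#     if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)):
#         ...run the shebang check...

print(needs_shebang_check('top.sls'))   # -> True
print(needs_shebang_check('setup.py'))  # -> False
```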
identify/extensions.py
Outdated
@@ -100,6 +101,7 @@
     'lr': {'text', 'lektor'},
     'lua': {'text', 'lua'},
     'm': {'text', 'c', 'objective-c'},
+    'mako': {'text', 'mako'},
I have no authority whatsoever on the matter, but for reference, this change was previously rejected in PR #137, for reasons of mako refusing to specify a standard file extension. However, over 170 000 files on GitHub have the .mako extension (while 655 have the .mao extension), of which over 160 000 are identified as Mako by Linguist, and a spot check of the remaining 10 000 suggests they too are likely all or almost all Mako. A check of mime-db indicates no other registered file types use this extension, and Mako template is the only type registered on FileInfo for .mako.
Right on, I threw a couple of these in only because I didn't see them in there already and thought why not, but if there are issues I'll just pull mako, no problem there.
identify/extensions.py
Outdated
@@ -52,9 +52,10 @@
     'gif': {'binary', 'image', 'gif'},
     'go': {'text', 'go'},
     'gotmpl': {'text', 'gotmpl'},
+    'gpg': {'text', 'gnupg'},
Suggested change:
-    'gpg': {'text', 'gnupg'},
+    'gpg': {'binary', 'gnupg'},
There are 101 000 results on GitHub, and spot-checking the results, as well as Linguist, MIME-DB and FileInfo, indicates no conflicts. However, this GnuPG mailing list response, this SO answer and the FileInfo listing, as well as examining a sample of .gpg files on GitHub, indicate that this extension is designated for binary-format GnuPG data, with asc being the ASCII-armored text equivalent for both PGP and GnuPG (with 635 000 GitHub hits, but only 470 000 of them keys, the rest being identified as AGS scripts and AsciiDoc text).
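As a side note, the text/binary distinction is easy to sniff if ever needed: ASCII-armored OpenPGP data (conventionally .asc) starts with a "-----BEGIN PGP" header line, while .gpg files are typically the binary OpenPGP format. A crude standalone check, illustrative only and not proposed for identify:

```python
def looks_ascii_armored(first_bytes: bytes) -> bool:
    """Heuristic: ASCII-armored OpenPGP data begins with a BEGIN PGP header."""
    return first_bytes.lstrip().startswith(b'-----BEGIN PGP')


print(looks_ascii_armored(b'-----BEGIN PGP PUBLIC KEY BLOCK-----\n'))  # -> True
print(looks_ascii_armored(b'\x99\x01\x0d\x04'))                        # -> False
```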
Great catch, I'll update when I get back
identify/extensions.py
Outdated
@@ -179,6 +181,7 @@
     'tgz': {'binary', 'gzip'},
     'thrift': {'text', 'thrift'},
     'tiff': {'binary', 'image', 'tiff'},
+    'tmpl': {'text', 'cheetah'},
This association seems rather questionable. While it is documented that tmpl is considered the default file extension for Cheetah files, the likely ambiguity seems much too high to be acceptable. While there are no conflicts registered in Linguist or MIME-DB, there are two (of uncertain provenance) in FileInfo, and it is apparently used for Go templates. Much worse, of the 2 300 000 GitHub file results, many/most don't seem to use the Cheetah $ syntax (which I'm not really familiar with beyond a quick skim of the docs); most appear to be Javascript templates, XML templates, GoHTML templates, or other files. Furthermore, it seems rather improbable that more than a small fraction are Cheetah given that, for instance, the Mustache template engine has 14.4k stars, an unambiguously standard extension (.mustache) and 995 000 files, while the GitHub repos I can find associated with Cheetah have only 131, 63, and 41 stars respectively, two orders of magnitude less despite having many more files associated with the extension.
Ah, I did not find those on a quick search; good call again. I'll remove this one.
There is a lot to unpack here :) Yeah, I wasn't very happy about the setup here overall, and I think this sounds like a good idea, but I would have to think it over more. I'm currently on a work trip and will be knee-deep in patch panels for the next week or so, but as soon as I get back I'll look this over. Thanks for the reviews!
@CAM-Gerlach Thank you again for your review :) I started working on your recommendations and saw another way to clean it up a bit more so I did that to get both of your thoughts on it.
It may be overreaching to add these interpreters since they aren't standards, but I think it at least makes more sense than the previous iterations. I left a few comments on points I'd like opinions on if you have time. Thanks again
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
I don't love that this introduces a need to parse the extension on all files.. but I don't see a way around that when the logic to determine the extra requirement is by extension. Thoughts?
Why not just
-ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
+ext = os.path.splitext(path)[-1].lstrip('.')
Also, you could move this into the if statement, given the other change I suggest below, though that would only avoid it in a subset of cases. There isn't really any way I can see around it given this design, but you can at least parse and check it in one place instead of two, as is currently done (as my suggestions implement).
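To illustrate why the shorter form is equivalent: os.path.splitext only considers dots in the final path component (dots in directory names are ignored), so the inner os.path.split is redundant:

```python
import os

path = 'some.dir/state.sls'  # dotted directory name to show the edge case

# Long form from the PR: split off the basename first, then take the extension.
long_form = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')

# Short form suggested above: splitext already ignores dots before the last separator.
short_form = os.path.splitext(path)[-1].lstrip('.')

print(long_form, short_form)  # -> sls sls
```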
tags.update(
    tags_from_interpreter(
        shebang[0].split('|')[0].strip(),
    ),
)
The splitting of the '|'-delimited interpreters may be better located in the _shebang_split function. I can move it there if this looks good.
If the '|' convention to separate multiple interpreters is non-standard and only used with one particular file type, I don't know about trying to parse it on all files. I can't find much Googling about other uses, and the one usage I can find has different syntax than salt uses. But that's ultimately up to @asottile .
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
        not os.access(
            path,
            os.X_OK,
        )
):
This is the other place the extension is needed. Should it be passed to the function instead of re-parsed?
I would suggest eliminating this check entirely unless there's something I'm missing (which is entirely possible); see below.
 # skip the slow calculation if the lengths are very different
-if norm and abs(len(norm) - len(norm_license)) / len(norm) > .05:
+if norm and abs(len(norm) - len(norm_license)) / len(norm) > 0.05:
     continue
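That guard is a cheap pre-filter: it skips the slow license-similarity calculation whenever the normalized lengths differ by more than 5%. A standalone illustration of the same check (as a helper function rather than the real code's inline continue):

```python
def lengths_close(norm: str, norm_license: str, tol: float = 0.05) -> bool:
    """Mirror of the guard: False means this license would be skipped."""
    return bool(norm) and abs(len(norm) - len(norm_license)) / len(norm) <= tol


print(lengths_close('a' * 100, 'a' * 103))  # -> True  (3% length difference)
print(lengths_close('a' * 100, 'a' * 120))  # -> False (20% length difference)
```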
I didn't make a few of these changes, they were done via precommit or black.. Maybe the precommit config changed recently?
(
    'jinja|yaml|gpg', {
        'salt', 'file', 'non-executable', 'text', 'jinja',
    },
),
This does have the benefit of files with the extension .gpg being marked as binary while files with the gpg interpreter still get marked as text.
I apologize, I quoted the wrong section. The relevant line is 153:
('gpg', {'salt', 'file', 'non-executable', 'text', 'gnupg'}),
Nice!
Thanks! I had a few comments on your implementation; the rest is up to @asottile .
try:
    tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])
except KeyError:
    pass
Suggested change (remove this block):
-try:
-    tags.update(extensions.EXTENSIONS_NEED_INTERPRETER_CHECK[ext])
-except KeyError:
-    pass
I would think it would both make more sense and be more efficient to add the extension-based tags in the same function the others are added in, tags_from_filename() around line 116, just like is done with EXTENSIONS_NEED_BINARY_CHECK, instead of a bespoke code block here that fires for every unknown file and follows a different pattern than the others. This also means you can do os.path.splitext(path)[-1].lstrip('.') right in the if statement for only those files that need it (saving a bit of performance at the cost of readability).
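A rough sketch of the suggested placement: add the extension-based tags inside tags_from_filename() alongside the existing lookup. The dict contents and the function body here are simplified assumptions; the real tags_from_filename in identify handles multi-part filenames differently:

```python
import os

# Illustrative data only, not identify's real tables.
EXTENSIONS = {'py': {'text', 'python'}}
EXTENSIONS_NEED_INTERPRETER_CHECK = {'sls': {'salt', 'text'}}


def tags_from_filename(filename: str) -> set:
    """Simplified sketch: collect tags from both extension tables in one place."""
    ext = os.path.splitext(filename)[-1].lstrip('.')
    ret = set()
    ret.update(EXTENSIONS.get(ext, set()))
    ret.update(EXTENSIONS_NEED_INTERPRETER_CHECK.get(ext, set()))
    return ret


print(tags_from_filename('top.sls'))   # -> {'salt', 'text'}
print(tags_from_filename('setup.py'))  # -> {'text', 'python'}
```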
ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
if (
        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
        not os.access(
            path,
            os.X_OK,
        )
):
    return ()
Suggested change (remove this block):
-ext = os.path.splitext(os.path.split(path)[-1])[-1].lstrip('.')
-if (
-        ext not in extensions.EXTENSIONS_NEED_INTERPRETER_CHECK and
-        not os.access(
-            path,
-            os.X_OK,
-        )
-):
-    return ()
@asottile The only function that calls this already checks this same condition before running parse_shebang_from_file(), so is there a reason it needs to be checked twice? There might be something I'm missing, but if not, eliminating the redundant check would reduce complexity and increase performance without an obvious downside, and it would also let third-party consumers use it on files not marked executable, if that restriction doesn't make sense for their application.
FYI only, I'm playing around with this still. I don't really love any of these approaches so far, as there is either a spaghetti code path or some duplicated processing. I'm playing with a custom path object class with lazily-computed properties. EDIT:

class IdentifyPath:
    def __init__(self, path: str) -> None:
        self._path = path
        super().__init__()

    @property
    def mode(self):
        try:
            return self._mode
        except AttributeError:
            try:
                self._mode = os.lstat(self.path).st_mode
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')
            return self._mode

    @property
    def path(self):
        return self._path

    @property
    def filename(self):
        try:
            return self._filename
        except AttributeError:
            _, filename = os.path.split(self.path)
            self._filename = filename
            return self._filename

    @property
    def ext(self):
        try:
            return self._ext
        except AttributeError:
            _, ext = os.path.splitext(self.filename)
            self._ext = ext
            return self._ext

    @property
    def executable(self):
        try:
            return self._executable
        except AttributeError:
            try:
                self._executable = os.access(self.path, os.X_OK)
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')
            return self._executable

    @property
    def shebang(self):
        try:
            return self._shebang
        except AttributeError:
            self._shebang = parse_shebang_from_file(self)
            return self._shebang

It's named terribly, I know; the entire test suite would have to be changed, and I'm worried about setup-cost performance like I said. I'll keep poking at it.
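If the hand-rolled try/except caching boilerplate is the concern, functools.cached_property (Python 3.8+, so only if the project's Python floor allows it) expresses the same compute-once-on-first-access pattern more compactly. A sketch with parse_shebang_from_file stubbed out for self-containment:

```python
import os
from functools import cached_property


def parse_shebang_from_file(path: str) -> tuple:
    """Stub standing in for identify's real shebang parser."""
    return ()


class IdentifyPath:
    def __init__(self, path: str) -> None:
        self.path = path

    @cached_property
    def filename(self) -> str:
        # Computed once on first access, then cached on the instance.
        return os.path.split(self.path)[1]

    @cached_property
    def ext(self) -> str:
        return os.path.splitext(self.filename)[1]

    @cached_property
    def executable(self) -> bool:
        return os.access(self.path, os.X_OK)

    @cached_property
    def shebang(self) -> tuple:
        return parse_shebang_from_file(self.path)


p = IdentifyPath('states/top.sls')
print(p.filename, p.ext)  # -> top.sls .sls
```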
Hey! This is the follow-up to #61 and the continuation of #60; I hope this is more in line with what you would like. Let me know what you think.
I'm not 100% sure how different tags should be added, i.e. whether all the salt files should be tagged with 'salt' and then additionally the 'salt-' tag. For now I only added the non-default renderer tags, since I think matching a default renderer tag when the contents of the file are not actually readable by that renderer would cause issues. I'm on the fence though.