feat: convert scan code to .opossum #174
Conversation
* Add dedicated package
* Restructure generate function
* Move `ScanCodeData.model_validate` out of `create_opossum_metadata`
* Check length of `headers` field in `create_opossum_metadata` and test it
* Use files data to populate the 3 remaining mandatory fields of opossum files: resources, resourcesToAttributions, externalAttributions (a sketch follows below)
* Exchanged tests/data/scancode.json for a complete example
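A hedged sketch of how such file entries might populate those three fields; the helper name, the ScanCode field names (`license_detections`, `license_expression`), and the attribution payload are assumptions for illustration, not the PR's actual code:

```python
import uuid
from typing import Any


def create_opossum_file_data(files: list[dict[str, Any]]) -> dict[str, Any]:
    resources: dict[str, Any] = {}
    resources_to_attributions: dict[str, list[str]] = {}
    external_attributions: dict[str, dict[str, Any]] = {}

    for file in files:
        parts = [part for part in file["path"].split("/") if part]
        if not parts:
            continue
        # Build the nested resources tree: folders become dicts, files become 1.
        node = resources
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if file.get("type") == "file":
            node[parts[-1]] = 1
        else:
            node.setdefault(parts[-1], {})

        # Attach one external attribution per detected license.
        for detection in file.get("license_detections", []):
            attribution_id = str(uuid.uuid4())
            external_attributions[attribution_id] = {
                "source": {"name": "ScanCode", "documentConfidence": 50},
                "licenseName": detection.get("license_expression", ""),
            }
            resources_to_attributions.setdefault("/" + file["path"], []).append(
                attribution_id
            )

    return {
        "resources": resources,
        "resourcesToAttributions": resources_to_attributions,
        "externalAttributions": external_attributions,
    }
```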
Force-pushed from 55aa973 to ecbb705.
Force-pushed from 56085d3 to 3fefb63.
* Files showed up in OpossumUI correctly, but attributions were not showing up at all
* Prepending a "/" to the paths in resourcesToAttributions fixes this (see the sketch below)
* This is apparently because OpossumUI always considers "/" to be the root for all file paths
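A minimal sketch of that normalization (the helper name is illustrative):

```python
def to_opossum_path(scancode_path: str) -> str:
    # OpossumUI treats "/" as the root of the file tree, so every key in
    # resourcesToAttributions needs a leading slash.
    return scancode_path if scancode_path.startswith("/") else "/" + scancode_path


assert to_opossum_path("src/opossum_lib/cli.py") == "/src/opossum_lib/cli.py"
```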
Force-pushed from 3fefb63 to 6e9ca39.
* While writing tests I noticed that `check_schema` is not working recursively
* To fix that I added the method `revalidata` to `Node` (see the sketch below)
* Added some tests to check that functionality
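A hedged sketch of what a recursive revalidation could look like, assuming `Node` is a pydantic model with a `children` list (both assumptions, and the method is spelled `revalidate` here); this is not the PR's code:

```python
from __future__ import annotations

from pydantic import BaseModel


class Node(BaseModel):
    name: str
    children: list[Node] = []

    def revalidate(self) -> None:
        # Re-run validation on this node, then recurse so that schema problems
        # deep inside the tree are reported as well.
        self.__class__.model_validate(self.model_dump())
        for child in self.children:
            child.revalidate()


Node.model_rebuild()
```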
Force-pushed from aeda890 to bba92a7.
Force-pushed from abed078 to 76e22fd.
* After discussion with Markus and Alex, changed min to max when aggregating the score of multiple matches (see the sketch below)
* Lower-scoring matches are likely noise
* The best match should set the confidence of the detection
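A tiny sketch of that aggregation rule (the score structure is assumed):

```python
def aggregate_confidence(match_scores: list[float]) -> float:
    # The detection's confidence comes from its best match; lower-scoring
    # matches are treated as noise.
    return max(match_scores, default=0.0)


assert aggregate_confidence([99.0, 40.0, 12.5]) == 99.0
```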
* Resolved conflicts in cli and test_cli by combining the command line args
* Fixed a few places where the switch from `@dataclass` to BaseModel broke the constructors
* The conversion pipeline now also needs to convert from Resource to ResourceInFile
* Improve clarity of option help strings and unify the phrasing
* Update README
* Slightly rename an internal function to be a bit clearer
This PR is really large. Is there some way to break it up into smaller pieces next time?
src/opossum_lib/cli.py (outdated)

@@ -74,11 +85,14 @@ def validate_input_exit_on_error(spdx: list[str], opossum: list[str]) -> None:

 def convert_after_valid_input(
-    spdx: list[str], opossum_files: list[str]
+    spdx: list[str], scan_code_json: list[str], opossum_files: list[str]
not new but inconsistent: here we use "opossum_files" while previously the same variable is just called "opossum".
I agree that this is inconsistent. I think I would prefer to name them all like `format_files` and made that choice consistently throughout cli.py.
try:
    with open(filename) as inp:
        json_data = json.load(inp)
except json.JSONDecodeError as jsde:
unusual to give errors specific acronyms. generally best to avoid acronyms altogether or use standard ones, like `e`, in this case.
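A minimal sketch of the suggested naming; the surrounding error handling is illustrative, not the PR's actual code:

```python
import json
import sys


def load_scancode_json(filename: str) -> dict:
    try:
        with open(filename) as inp:
            return json.load(inp)
    except json.JSONDecodeError as e:
        # "e" is the conventional short alias for the caught exception.
        print(f"Error: {filename} is not valid JSON: {e}", file=sys.stderr)
        sys.exit(1)
```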
* Consolidate variable names (snake_case, and choosing the same name across functions)
* Sort fields in model.py
* Use an enum for File.type (a sketch follows below)
* Simplify and refactor create_attribution_mapping
* Extract the document name for ScanCode as a constant
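For the File.type change, a sketch of what such an enum could look like; the member names are assumptions based on the two entry kinds in ScanCode output:

```python
from enum import Enum


class FileType(Enum):
    FILE = "file"
    DIRECTORY = "directory"
```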
* Convert more camelCase to snake_case
* Refactor convert_scancode_to_opossum to have fewer small functions
Force-pushed from f124201 to 7d4996b.
tests/test_cli.py (outdated)
# Doing individual asserts as otherwise the diff viewer does no longer work
# in case of errors
assert result.exit_code == 0
i'm confused. what's the "diff viewer" and why do we need to repeat the same assertion we already did in line 104? and also, what is an "individual" assert? what would be the opposite of an individual assert? a collective assert?
The duplicated assert was a mistake from merging in main after #49 went to main.
The comment stems from the main branch and explains why we don't do `assert expected_opossum_dict == opossum_dict`. If you compare these really large dicts and the check fails, then the diff view from pytest is completely useless because the dict is so large that it gets truncated and you never see the actual difference. That's why #49 chose to do the comparison in a separate function that goes field by field, so you have a better chance of seeing what the difference actually is.
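A sketch of that field-by-field pattern (the helper name is assumed): asserting each top-level key separately keeps the failure output small enough to actually show the difference.

```python
from typing import Any


def assert_equal_field_by_field(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    assert sorted(actual.keys()) == sorted(expected.keys())
    for key in expected:
        assert actual[key] == expected[key], f"mismatch in field {key!r}"
```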
tests/test_cli.py (outdated)
)


def inline_attributions(
the name of this function is obscure to me: from the name, i have a hard time understanding what this function does. my best guess is that this function is inlining some attributions. but inlining into what?
md["projectId"] = expected_md["projectId"] | ||
assert md == expected_md | ||
|
||
# Python has hash salting, which means the hashes changes between sessions. |
i'm not following how this comment is related to the code: i don't see any hashes anywhere. which hashes are we talking about here? i thought we're comparing dicts, not hashes.
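For background on that comment (general Python behavior, not something specific to this PR): string hashes are salted per interpreter process unless PYTHONHASHSEED is fixed, so any identifier derived from hash() differs between test sessions; presumably that is why projectId is overwritten with the expected value before comparing. A small self-contained demonstration:

```python
import subprocess
import sys

code = 'print(hash("opossum"))'
first = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
second = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)

# Two separate interpreter sessions almost always print different hash values.
print(first.stdout.strip(), second.stdout.strip())
```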
root_node = _create_reference_Node_structure()

with mock.patch(
unusual to see a mock in the middle of a test. in python, you'd usually mock this upfront with a fixture or with a decorator like you did in the previous test. but i'm wondering if we really need/should mock here. every mock you introduce makes the test more fragile and reduces the confidence it provides (again, read my blog on TNG confluence where i'm also talking about this point). also, it seems that `get_attribution_info` is a private method (by the way it's used), which means you're creating a mock that depends on an implementation detail.
in summary, i think we should find a way to run this test without any mock at all.
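A sketch of the decorator variant mentioned above; the patch target and test body are placeholders, not the PR's code:

```python
from unittest import mock


@mock.patch(
    "opossum_lib.scancode.resource_tree.get_attribution_info",  # target path assumed
    return_value=[],
)
def test_node_structure_without_attributions(patched_get_attribution_info: mock.Mock) -> None:
    # The patch is active for the whole test body, so no mock setup is needed
    # in the middle of the test; a pytest fixture would work similarly.
    ...
```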
assert get_attribution_info(file) == []


def test_get_attribution_info_file_multiple() -> None:
general comment: it's best if test titles read like real sentences, e.g., "test_get_attribution_returns_three_ducks"
def _create_file(path: str, type: FileType, **kwargs: Any) -> File:
    default_properties = {
similar to my point from earlier: it would be great if this was not an untyped dict but somehow coupled to the dataclass you defined elsewhere. then, if the dataclass gets updated, i don't need to wait for a test failure to get feedback that this dict also needs to be updated. this falls in the category of use-your-static-checks-to-the-max (also discussed in my blog post).
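One hedged way to couple the defaults to the model, so that mypy rather than a failing test reports when the model changes; the import path and the idea that File needs only path and type are assumptions:

```python
from typing import Any

# Import path assumed for illustration.
from opossum_lib.scancode.model import File, FileType


def _create_file(path: str, type: FileType, **kwargs: Any) -> File:
    # Typed construction instead of a raw dict of default properties.
    defaults = File(path=path, type=type)
    return defaults.model_copy(update=kwargs)
```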
two tips as the review process continues:

* When I run the test command from your PR description I just get an error.
* Something that is still missing in general: the license texts. I believe they can be part of the ScanCode output. I would be fine with parsing that in a separate PR, but there needs to be a task for this if there isn't already.
That was a case of already outdated documentation. I updated the command.
This is part of a bigger problem: ScanCode has a lot of options for what it should output. On the one hand, we need to think about which options we absolutely need and document that. On the other hand, if there is additional information (such as the license information), it would be nice to use it as well. That means there could be a lot of combinations of information. This is touched on lightly in #171, but I created #184 to track this in more detail.
As discussed, we'll refactor this code later but want to merge now.
* Improve clarity of comments in test_resource_tree.py
* Use types for _create_file and move it to its own file because it got long; this will be refactored in the future to use a builder pattern (see the sketch below)
* Minor change to convert_scancode_to_opossum.py to avoid an intermediate untyped dict
* Cleanup in test_cli.py
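A rough sketch of the builder pattern mentioned in the second bullet, reusing the assumed File/FileType from the earlier sketch; class and defaults are purely illustrative:

```python
class FileBuilder:
    """Builds File test objects with defaults that individual tests override."""

    def __init__(self) -> None:
        self._path = "some/file.txt"
        self._type = FileType.FILE

    def with_path(self, path: str) -> "FileBuilder":
        self._path = path
        return self

    def with_type(self, type: FileType) -> "FileBuilder":
        self._type = type
        return self

    def build(self) -> File:
        return File(path=self._path, type=self._type)


test_file = FileBuilder().with_path("src/main.py").build()
```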
Summary of changes
This PR introduces the capability to read in JSON files created by ScanCode via the new --scan-code-json option.
Context and reason for change
ScanCode is an important and capable tool for extracting licensing information and thus a great source of information for OpossumUI.
How can the changes be tested
Run
uv run opossum-file generate --scan-code-json tests/data/scancode_input.json
and open the newly created output.opossum in OpossumUI. You should find the file tree correctly reflected and the files should have some signals attached to them.
Fixes #171