
feat: convert scan code to .opossum #174

Merged
merged 24 commits into main from feat-convert-scan-code on Jan 16, 2025

Conversation

@abraemer (Contributor) commented Jan 13, 2025

Summary of changes

This PR introduces the capability to read in JSON files created by ScanCode.

  • New command line option: --scan-code-json
  • A new submodule parses ScanCode's JSON files and reuses the existing types/methods for writing opossum files (see the sketch below)
  • Further rationale on how fields are mapped/populated can be found in Convert ScanCode Json result files to Opossum files #171
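A minimal sketch of that reading path, using illustrative model and field names rather than the exact classes from the new submodule: the ScanCode JSON carries headers and files sections, which are validated into pydantic models (via `model_validate`) and then mapped onto the existing opossum writer types.

```python
import json

from pydantic import BaseModel


class ScanCodeHeader(BaseModel):
    # Small subset of the header fields ScanCode writes; field names are assumptions.
    tool_name: str
    tool_version: str


class ScanCodeFile(BaseModel):
    # Small subset of the per-file entries; the real model carries many more fields.
    path: str
    type: str


class ScanCodeData(BaseModel):
    headers: list[ScanCodeHeader]
    files: list[ScanCodeFile]


def load_scancode_json(filename: str) -> ScanCodeData:
    # Read the raw JSON and validate it into typed models in one step.
    with open(filename) as inp:
        json_data = json.load(inp)
    return ScanCodeData.model_validate(json_data)
```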

Context and reason for change

ScanCode is an important and capable tool for gathering licensing information and is therefore a great source of data for OpossumUI.

How can the changes be tested

Run `uv run opossum-file generate --scan-code-json tests/data/scancode_input.json` and open the newly created output.opossum in OpossumUI. You should find the file tree correctly reflected, with signals attached to the files.

Fixes #171

Hellgartner and others added 5 commits January 13, 2025 14:05
* Add dedicated package
* Restructure generate function
* move `ScanCodeData.model_validate` out of `create_opossum_metadata`
* check length of `headers` field in `create_opossum_metadata` and test it (sketched below)
* use files data to populate the 3 remaining mandatory fields of opossum files: resources, resourcesToAttributions, externalAttributions
* exchanged tests/data/scancode.json for a complete example
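A rough sketch of that metadata step, reusing the illustrative ScanCodeData model from the summary above; the header-count check follows the commit message, while the returned fields and the fresh project id are assumptions rather than the exact mapping (see #171 for that).

```python
import uuid


def create_opossum_metadata(scancode_data: ScanCodeData) -> dict:
    # The ScanCode output is expected to contain exactly one header entry.
    if len(scancode_data.headers) != 1:
        raise RuntimeError(
            f"Expected exactly one ScanCode header, got {len(scancode_data.headers)}"
        )
    metadata = scancode_data.headers[0].model_dump()
    # Assumption: a fresh project id is generated for every conversion run.
    metadata["projectId"] = str(uuid.uuid4())
    return metadata
```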
@abraemer force-pushed the feat-convert-scan-code branch from 55aa973 to ecbb705 on January 13, 2025 13:11
@abraemer marked this pull request as draft on January 13, 2025 14:30
@abraemer changed the title from "Feat convert scan code" to "feat: convert scan code to .opossum" on Jan 14, 2025
@abraemer force-pushed the feat-convert-scan-code branch from 56085d3 to 3fefb63 on January 14, 2025 09:11
* Files showed up in OpossumUI correctly, but attributions were not showing up at all
* Prepending a "/" to the paths in resourcesToAttributions fixes this (see the sketch after this list)
* This is apparently because OpossumUI always treats "/" as the root for all file paths
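A minimal sketch of that path fix, with a hypothetical helper name: every key used in resourcesToAttributions is made absolute before the opossum file is written.

```python
def _to_opossum_path(path: str) -> str:
    # OpossumUI treats "/" as the root of the file tree, so paths used as keys
    # in resourcesToAttributions must start with a slash.
    return path if path.startswith("/") else "/" + path
```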
@abraemer force-pushed the feat-convert-scan-code branch from 3fefb63 to 6e9ca39 on January 14, 2025 09:15
* while writing tests I noticed that `check_schema` does not work recursively
* to fix that I added the method `revalidata` to `Node`
* and added some tests to cover that functionality
@abraemer force-pushed the feat-convert-scan-code branch from aeda890 to bba92a7 on January 14, 2025 10:19
@abraemer force-pushed the feat-convert-scan-code branch from abed078 to 76e22fd on January 14, 2025 11:14
* after discussion with Markus and Alex, changed min to max when aggregating the score of multiple matches (see the sketch after this list)
* lower-scoring matches are likely noise
* the best match should set the confidence of the detection
* resolved conflicts in cli and test_cli by combining the command line args
* fixed a few places where the switch from `@dataclass` to BaseModel broke the constructors
* the conversion pipeline now also needs to convert from Resource to ResourceInFile
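A one-line illustration of that aggregation change, with a hypothetical function name: the confidence of a detection follows its best match rather than its worst one.

```python
def aggregate_confidence(match_scores: list[float]) -> float:
    # Use the best match as the detection confidence; lower-scoring matches
    # are likely noise and should not drag the confidence down.
    return max(match_scores)
```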
@abraemer marked this pull request as ready for review on January 15, 2025 05:06
@abraemer requested a review from Hellgartner on January 15, 2025 05:07
* Improve clarity of option help strings and unify the phrasing
* Update readme
* slightly rename an internal function to be a bit clearer
@mstykow (Member) left a comment

This PR is really large. Is there some way to break it up into smaller pieces next time?

@@ -74,11 +85,14 @@ def validate_input_exit_on_error(spdx: list[str], opossum: list[str]) -> None:


 def convert_after_valid_input(
-    spdx: list[str], opossum_files: list[str]
+    spdx: list[str], scan_code_json: list[str], opossum_files: list[str]
@mstykow (Member):

not new but inconsistent: here we use "opossum_files" while previously the same variable is just called "opossum".

@abraemer (Author):

I agree that this is inconsistent. I would prefer to name them all like format_files, and I made that choice consistent throughout cli.py.

try:
    with open(filename) as inp:
        json_data = json.load(inp)
except json.JSONDecodeError as jsde:
@mstykow (Member):

unusual to give errors specific acronyms. generally best to avoid acronyms altogether or use standard ones, like `e`, in this case.

@mstykow self-assigned this on Jan 15, 2025
* consolidate variable names (snake_case, and the same name across functions)
* sort fields in model.py
* use an enum for File.type
* simplify and refactor create_attribution_mapping
* extract document name for scan code as a constant
* convert more camelCase to snake_case
* refactor convert_scancode_to_opossum to have fewer small functions
@abraemer force-pushed the feat-convert-scan-code branch from f124201 to 7d4996b on January 15, 2025 15:09
Comment on lines 108 to 110
# Doing individual asserts as otherwise the diff viewer no longer works
# in case of errors
assert result.exit_code == 0
@mstykow (Member):

i'm confused. what's the "diff viewer" and why do we need to repeat the same assertion we already did in line 104? and also, what is an "individual" assert? what would be the opposite of an individual assert? a collective assert?

@abraemer (Author):

The duplicated assert was a mistake from merging in main after #49 landed.
The comment stems from the main branch and explains why we don't do assert expected_opossum_dict == opossum_dict. If you compare these really large dicts and the check fails, the "diff view" from pydantic is completely useless: the dict is so large that it gets truncated and you never see the actual difference. That's why #49 chose to do the comparison in a separate function that goes field by field, so you have a better chance of seeing what the difference actually is.
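A tiny illustration of that field-by-field idea (hypothetical helper, not the exact code from #49): each top-level field is compared on its own, so a failing assert shows a small, readable diff instead of one truncated dump.

```python
def assert_opossum_dicts_equal(expected: dict, actual: dict) -> None:
    # Compare the key sets first, then each field on its own so that a failure
    # points at the specific field instead of dumping the whole (truncated) dict.
    assert expected.keys() == actual.keys()
    for key in expected:
        assert actual[key] == expected[key], f"Mismatch in field {key!r}"
```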

)


def inline_attributions(
@mstykow (Member):

the name of this function is obscure to me: from the name, i have a hard time understanding what this function does. my best guess is that this function is inlining some attributions. but inlining into what?

md["projectId"] = expected_md["projectId"]
assert md == expected_md

# Python has hash salting, which means the hashes change between sessions.
@mstykow (Member):

i'm not following how this comment is related to the code: i don't see any hashes anywhere. which hashes are we talking about here? i thought we're comparing dicts, not hashes.


root_node = _create_reference_Node_structure()

with mock.patch(
@mstykow (Member):

unusual to see a mock in the middle of a test. in python, you'd usually mock this upfront with a fixture or with a decorator like you did in the previous test. but i'm wondering if we really need/should mock here. every mock you introduce makes the test more fragile and reduces the confidence it provides (again, read my blog on TNG confluence where i'm also talking about this point). also, it seems that get_attribution_info is a private method (by the way it's used) which means you're creating a mock that depends on an implementation detail.

in summary, i think we should find a way to run this test without any mock at all.

assert get_attribution_info(file) == []


def test_get_attribution_info_file_multiple() -> None:
@mstykow (Member):

general comments: it's best if test titles read like real sentences, e.g., "test_get_attribution_returns_three_ducks"



def _create_file(path: str, type: FileType, **kwargs: Any) -> File:
    default_properties = {
@mstykow (Member):

similar to my point from earlier: it would be great if this was not an untyped dict but somehow coupled to the dataclass you defined elsewhere. then, if the dataclass gets updated, i don't need to wait for a test failure to get feedback that this dict also needs to be updated. this falls in the category of use-your-static-checks-to-the-max (also discussed in my blog post).
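A rough sketch of what that suggestion could look like, assuming File is a pydantic BaseModel whose only required fields are the ones passed here (the actual field set and defaults in this PR differ): build a typed default instance and apply per-test overrides via model_copy, so a renamed or removed field is flagged by the static checker instead of a late test failure.

```python
from typing import Any


def _create_file(path: str, type: FileType, **overrides: Any) -> File:
    # Construct a fully typed default File first, then apply per-test overrides;
    # the type checker now validates the defaults against the model definition.
    defaults = File(path=path, type=type)
    return defaults.model_copy(update=overrides)
```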

@mstykow (Member) commented Jan 15, 2025

two tips as the review process continues:

  1. resolve any conversations that you addressed and we no longer need to discuss
  2. use the "re-request review" button next to my (reviewer) name when you want me to take another look

@mstykow (Member) commented Jan 15, 2025

When I run the test command from your PR description I just get an error.

@mstykow (Member) commented Jan 15, 2025

Something that is still missing in general are the license texts. I believe they can be part of the ScanCode output. I would be fine with parsing that in a separate PR but there needs to be a task for this if there isn't already.

@abraemer (Author) commented Jan 16, 2025

> When I run the test command from your PR description I just get an error.

That was a case of already outdated documentation; I have updated the command.

> Something that is still missing in general are the license texts. I believe they can be part of the ScanCode output. I would be fine with parsing that in a separate PR but there needs to be a task for this if there isn't already.

This is part of a bigger problem: ScanCode has a lot of options for what it should output. On the one hand, we need to decide which options we absolutely require and document that. On the other hand, if additional information is present (such as the license information), it would be nice to use it as well. This means there could be many combinations of information to handle. This is touched on briefly in #171, but I created #184 to track it in more detail.

@mstykow (Member) left a comment

As discussed, we'll refactor this code later but want to merge now.

* improve clarity of comments in test_resource_tree.py
* use types for _create_file and move it to its own file because it got long. This will be refactored in the future to use a builder pattern.
* minor change to convert_scancode_to_opossum.py to avoid an intermediate untyped dict
* cleanup in test_cli.py
@abraemer merged commit 5054a60 into main on Jan 16, 2025
10 checks passed
@abraemer deleted the feat-convert-scan-code branch on January 16, 2025 10:38