
feat: convert scan code to .opossum #174

Merged
merged 24 commits into main from feat-convert-scan-code on Jan 16, 2025

Conversation

@abraemer (Contributor) commented Jan 13, 2025

Summary of changes

This PR introduces the capability to read in JSON files created by ScanCode.

  • New command line option: --scan-code-json
  • A new submodule parses ScanCode's JSON files and reuses the existing types/methods for writing opossum files (see the sketch below)
  • Further rationale on how fields are mapped/populated can be found in Convert ScanCode Json result files to Opossum files #171
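A minimal sketch of that reading path, using illustrative model and field names rather than the exact classes from the new submodule: the ScanCode JSON carries headers and files sections, which are validated into pydantic models (via `model_validate`) and then mapped onto the existing opossum writer types.

```python
import json

from pydantic import BaseModel


class ScanCodeHeader(BaseModel):
    # Small subset of the header fields ScanCode writes; field names are assumptions.
    tool_name: str
    tool_version: str


class ScanCodeFile(BaseModel):
    # Small subset of the per-file entries; the real model carries many more fields.
    path: str
    type: str


class ScanCodeData(BaseModel):
    headers: list[ScanCodeHeader]
    files: list[ScanCodeFile]


def load_scancode_json(filename: str) -> ScanCodeData:
    # Read the raw JSON and validate it into typed models in one step.
    with open(filename) as inp:
        json_data = json.load(inp)
    return ScanCodeData.model_validate(json_data)
```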

Context and reason for change

ScanCode is an important and capable tool for gathering licensing information and is therefore a great source of data for OpossumUI.

How can the changes be tested

Run `uv run opossum-file generate --scan-code-json tests/data/scancode_input.json` and open the newly created output.opossum in OpossumUI. You should find the file tree correctly reflected, with signals attached to the files.

Fixes #171

Hellgartner and others added 5 commits January 13, 2025 14:05
* Add dedicated package
* Restructure generate function
* move `ScanCodeData.model_validate` out of `create_opossum_metadata`
* check length of `headers` field in `create_opossum_metadata` and test it (sketched below)
* use files data to populate the 3 remaining mandatory fields of opossum files: resources, resourcesToAttributions, externalAttributions
* exchanged tests/data/scancode.json for a complete example
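A rough sketch of that metadata step, reusing the illustrative ScanCodeData model from the summary above; the header-count check follows the commit message, while the returned fields and the fresh project id are assumptions rather than the exact mapping (see #171 for that).

```python
import uuid


def create_opossum_metadata(scancode_data: ScanCodeData) -> dict:
    # The ScanCode output is expected to contain exactly one header entry.
    if len(scancode_data.headers) != 1:
        raise RuntimeError(
            f"Expected exactly one ScanCode header, got {len(scancode_data.headers)}"
        )
    metadata = scancode_data.headers[0].model_dump()
    # Assumption: a fresh project id is generated for every conversion run.
    metadata["projectId"] = str(uuid.uuid4())
    return metadata
```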
@abraemer force-pushed the feat-convert-scan-code branch from 55aa973 to ecbb705 on January 13, 2025 13:11
@abraemer marked this pull request as draft on January 13, 2025 14:30
@abraemer changed the title from "Feat convert scan code" to "feat: convert scan code to .opossum" on Jan 14, 2025
@abraemer force-pushed the feat-convert-scan-code branch from 56085d3 to 3fefb63 on January 14, 2025 09:11
* Files showed up in OpossumUI correctly, but attributions were not showing up at all
* Prepending a "/" to the paths in resourcesToAttributions fixes this (see the sketch after this list)
* This is apparently because OpossumUI always treats "/" as the root for all file paths
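A minimal sketch of that path fix, with a hypothetical helper name: every key used in resourcesToAttributions is made absolute before the opossum file is written.

```python
def _to_opossum_path(path: str) -> str:
    # OpossumUI treats "/" as the root of the file tree, so paths used as keys
    # in resourcesToAttributions must start with a slash.
    return path if path.startswith("/") else "/" + path
```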
@abraemer force-pushed the feat-convert-scan-code branch from 3fefb63 to 6e9ca39 on January 14, 2025 09:15
* while writing tests I noticed that `check_schema` does not work recursively
* to fix that I added the method `revalidata` to `Node`
* and added some tests to cover that functionality
@abraemer force-pushed the feat-convert-scan-code branch from aeda890 to bba92a7 on January 14, 2025 10:19
@abraemer force-pushed the feat-convert-scan-code branch from abed078 to 76e22fd on January 14, 2025 11:14
* after discussion with Markus and Alex, changed min to max when aggregating the score of multiple matches (see the sketch after this list)
* lower-scoring matches are likely noise
* the best match should set the confidence of the detection
* resolved conflicts in cli and test_cli by combining the command line args
* fixed a few places where the switch from `@dataclass` to BaseModel broke the constructors
* the conversion pipeline now also needs to convert from Resource to ResourceInFile
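A one-line illustration of that aggregation change, with a hypothetical function name: the confidence of a detection follows its best match rather than its worst one.

```python
def aggregate_confidence(match_scores: list[float]) -> float:
    # Use the best match as the detection confidence; lower-scoring matches
    # are likely noise and should not drag the confidence down.
    return max(match_scores)
```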
@abraemer marked this pull request as ready for review on January 15, 2025 05:06
@abraemer requested a review from Hellgartner on January 15, 2025 05:07
* Improve clarity of option help strings and unify the phrasing
* Update readme
* slightly rename an internal function to be a bit clearer
@mstykow (Member) left a comment

This PR is really large. Is there some way to break it up into smaller pieces next time?

@@ -74,11 +85,14 @@ def validate_input_exit_on_error(spdx: list[str], opossum: list[str]) -> None:


 def convert_after_valid_input(
-    spdx: list[str], opossum_files: list[str]
+    spdx: list[str], scan_code_json: list[str], opossum_files: list[str]
@mstykow (Member):

not new but inconsistent: here we use "opossum_files" while previously the same variable is just called "opossum".

@abraemer (Author):

I agree that this is inconsistent. I would prefer to name them all like format_files, and I made that choice consistent throughout cli.py.

try:
    with open(filename) as inp:
        json_data = json.load(inp)
except json.JSONDecodeError as jsde:
@mstykow (Member):

unusual to give errors specific acronyms. generally best to avoid acronyms altogether or use standard ones, like `e`, in this case.

@mstykow self-assigned this on Jan 15, 2025
* consolidate variable names (snake_case, and the same name across functions)
* sort fields in model.py
* use an enum for File.type
* simplify and refactor create_attribution_mapping
* extract document name for scan code as a constant
* convert more camelCase to snake_case
* refactor convert_scancode_to_opossum to have fewer small functions
@abraemer force-pushed the feat-convert-scan-code branch from f124201 to 7d4996b on January 15, 2025 15:09
Comment on lines 108 to 110
# Doing individual asserts as otherwise the diff viewer no longer works
# in case of errors
assert result.exit_code == 0
@mstykow (Member):

i'm confused. what's the "diff viewer" and why do we need to repeat the same assertion we already did in line 104? and also, what is an "individual" assert? what would be the opposite of an individual assert? a collective assert?

@abraemer (Author):

The duplicated assert was a mistake from merging in main after #49 landed.
The comment stems from the main branch and explains why we don't do assert expected_opossum_dict == opossum_dict. If you compare these really large dicts and the check fails, the "diff view" from pydantic is completely useless: the dict is so large that it gets truncated and you never see the actual difference. That's why #49 chose to do the comparison in a separate function that goes field by field, so you have a better chance of seeing what the difference actually is.
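A tiny illustration of that field-by-field idea (hypothetical helper, not the exact code from #49): each top-level field is compared on its own, so a failing assert shows a small, readable diff instead of one truncated dump.

```python
def assert_opossum_dicts_equal(expected: dict, actual: dict) -> None:
    # Compare the key sets first, then each field on its own so that a failure
    # points at the specific field instead of dumping the whole (truncated) dict.
    assert expected.keys() == actual.keys()
    for key in expected:
        assert actual[key] == expected[key], f"Mismatch in field {key!r}"
```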

)


def inline_attributions(
@mstykow (Member):

the name of this function is obscure to me: from the name, i have a hard time understanding what this function does. my best guess is that this function is inlining some attributions. but inlining into what?

md["projectId"] = expected_md["projectId"]
assert md == expected_md

# Python has hash salting, which means the hashes change between sessions.
@mstykow (Member):

i'm not following how this comment is related to the code: i don't see any hashes anywhere. which hashes are we talking about here? i thought we're comparing dicts, not hashes.


root_node = _create_reference_Node_structure()

with mock.patch(
@mstykow (Member):

unusual to see a mock in the middle of a test. in python, you'd usually mock this upfront with a fixture or with a decorator like you did in the previous test. but i'm wondering if we really need/should mock here. every mock you introduce makes the test more fragile and reduces the confidence it provides (again, read my blog on TNG confluence where i'm also talking about this point). also, it seems that get_attribution_info is a private method (by the way it's used) which means you're creating a mock that depends on an implementation detail.

in summary, i think we should find a way to run this test without any mock at all.

assert get_attribution_info(file) == []


def test_get_attribution_info_file_multiple() -> None:
@mstykow (Member):

general comments: it's best if test titles read like real sentences, e.g., "test_get_attribution_returns_three_ducks"



def _create_file(path: str, type: FileType, **kwargs: Any) -> File:
    default_properties = {
@mstykow (Member):

similar to my point from earlier: it would be great if this was not an untyped dict but somehow coupled to the dataclass you defined elsewhere. then, if the dataclass gets updated, i don't need to wait for a test failure to get feedback that this dict also needs to be updated. this falls in the category of use-your-static-checks-to-the-max (also discussed in my blog post).
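A rough sketch of what that suggestion could look like, assuming File is a pydantic BaseModel whose only required fields are the ones passed here (the actual field set and defaults in this PR differ): build a typed default instance and apply per-test overrides via model_copy, so a renamed or removed field is flagged by the static checker instead of a late test failure.

```python
from typing import Any


def _create_file(path: str, type: FileType, **overrides: Any) -> File:
    # Construct a fully typed default File first, then apply per-test overrides;
    # the type checker now validates the defaults against the model definition.
    defaults = File(path=path, type=type)
    return defaults.model_copy(update=overrides)
```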

@mstykow (Member) commented Jan 15, 2025

two tips as the review process continues:

  1. resolve any conversations that you addressed and we no longer need to discuss
  2. use the "re-request review" button next to my (reviewer) name when you want me to take another look

@mstykow (Member) commented Jan 15, 2025

When I run the test command from your PR description I just get an error.

@mstykow (Member) commented Jan 15, 2025

Something that is still missing in general are the license texts. I believe they can be part of the ScanCode output. I would be fine with parsing that in a separate PR but there needs to be a task for this if there isn't already.

@abraemer (Author) commented Jan 16, 2025

> When I run the test command from your PR description I just get an error.

That was a case of already outdated documentation; I have updated the command.

> Something that is still missing in general are the license texts. I believe they can be part of the ScanCode output. I would be fine with parsing that in a separate PR but there needs to be a task for this if there isn't already.

This is part of a bigger problem: ScanCode has a lot of options for what it should output. On the one hand, we need to decide which options we absolutely require and document that. On the other hand, if additional information is present (such as the license information), it would be nice to use it as well. This means there could be many combinations of information to handle. This is touched on briefly in #171, but I created #184 to track it in more detail.

@mstykow (Member) left a comment

As discussed, we'll refactor this code later but want to merge now.

* improve clarity of comments in test_resource_tree.py
* use types for _create_file and move it to its own file because it got long. This will be refactored in the future to use a builder pattern.
* minor change to convert_scancode_to_opossum.py to avoid an intermediate untyped dict
* cleanup in test_cli.py
@abraemer merged commit 5054a60 into main on Jan 16, 2025
10 checks passed
@abraemer deleted the feat-convert-scan-code branch on January 16, 2025 10:38