
feat: migrate semantic from osv-scanner #316

Open · wants to merge 19 commits into main from semantic/add

Conversation

Collaborator

@G-Rath G-Rath commented Dec 1, 2024

This brings across the semantic package from osv-scanner, which provides support for parsing and comparing versions across different ecosystems per their unique specifications (or lack thereof).

Main actions / discussion points:

  • deal with cachedregexp (though worth noting that semantic includes some chonky regexps which do take a second or two to compile)
  • discuss naming and package structure
  • re-write the generators (primarily so that they're better documented; despite trying to do what I thought was a good job the first time round, I keep finding myself just lost enough whenever I re-read them while implementing a new one that I want to put some work into improving them)
  • set up the semantic weekly workflow

Resolves #257

"math/big"
"strings"

"github.com/google/osv-scalibr/internal/cachedregexp"
Collaborator

Can't we just define all regexes at the top-level and refer to those? Then we don't need a cachedregexp implementation

Collaborator Author

Yes, but then everyone that imported the library would pay the cost of compiling every regexp regardless of whether the function using that regexp actually gets called - which is especially relevant in a CLI context, where it's expected that only a handful of "branches" will be executed in a single run (i.e. on an average project we'd expect to scan one or two lockfiles, rather than running every single extractor and semantic comparator).

Plus, I like not having to think about it: with this I can just use cachedregexp inline every single time I need a regexp rather than having to think about the cost - semantic in particular has the largest regexp I've ever had the horror of using for parsing Python versions which is sourced from an official PEP so we don't really want to touch it.

(by extension I guess I'm implying I have a preference for collocating regexps to their usage too, as typically they're only used in one function)

Granted the key thing is that the regexps are not inlined - beyond that we're talking about saving a couple of milliseconds more, but in contrast as far as I know the only cost to using cachedregexp is an extra ~8 bytes in the binary size, which already varies more than that across different OSs 🤷

(fun side fact: moving that Python regexp inline will take go test ./semantic/... -count=10 from 3 seconds to 20 seconds, and it just keeps growing from there 😅)

Collaborator

One option, if we really don't want the global map, is to lazily initialise the few big regexps - but as Gareth said, then you have to start thinking about the cost of each regexp and whether it is worth doing so.

Collaborator

I'd like to challenge this slowly compiling regex. Can we do something about that? Like change the regex to be more efficient, or use code for the heavy part and multiple regexes for sub-problems? I haven't looked into the regex yet - I just want to make sure we don't ignore an easy solution.

If there is really no chance of that: I think cachedregexp is a good alternative for those cases which are not on the hot path. We need to avoid using cachedregexp everywhere though, as it has an overhead.

on an average project we'd expect to scan one or two lockfiles, rather than running every single extractor and semantic comparator

This is just one use case. When scanning entire systems we usually run most plugins eventually. It can happen that a single plugin runs more than 100k times. In those use cases it's very important to keep hot paths fast. Thus we cannot use cachedregexp on the hot path of FileRequired. Usage inside Extract is likely fine. In an ideal world, we would have benchmarks for this so we can make quantitative decisions.

We plan to add better benchmarks to Scalibr, so it is easier to understand performance impact of various changes.

Collaborator Author

I'd like to challenge this slowly compiling regex. Can we do something about that?

Technically yes, but as I said, this particular regexp comes from a third party - specifically, it's the official regexp provided by Python in their PEP on how versions should be parsed - meaning that by using it we're officially compliant with the whole ecosystem and know for sure that any version which is not matched by that regexp is not a valid version.

Attempting to optimize it or break it up would risk introducing different behaviour, and so far it's not been an issue so long as it's only compiled once - but yes, technically we could break it up (though I'd not want to do that as part of this migration, as I think there'd be too much risk).

We need to avoid using cachedregexp everywhere tho, as it has an overhead.

Can you explain more about this overhead? My understanding is that there is already an overhead by having the regexps as globals, which is that they all get compiled when the package is imported/used - are you saying that the cost of calling the function or doing the lookup in the map every time has a more impactful overhead than just always compiling every regexp, or is there another source of overhead I'm missing?

This is just one use case. When scanning entire systems we usually run most plugins eventually.

"scanning entire systems" is also just one use case 🙃

@alowayed alowayed Dec 12, 2024

I want the ability to trigger all regexes and populate the cache on startup instead of invocation.

I'm fine leaving this as is, but would like a way to force compile all regexes at startup. Specifically because we'll be running this in a server and:

  1. I don't want to risk a crash due to specific requests. I'd rather the crash happen at server startup.
  2. All functions will eventually be hit. I'd rather have the upfront startup costs than variable request latencies.

I'm not sure how to do this in the current architecture. I like that the regexes live inside the specific ecosystem files, but I'm not opposed to @erikvarga's suggestion to move them into the cache package so this is easier. As a stop-gap, if submitted as is, we'll just call all functions that use cachedregexp at server startup to force the compilation. But I don't like that long term.

"cachedregexp everywhere tho, as it has an overhead"

@vpasdf what overhead are you concerned about? Additional binary size, slowdown of calling the functions, something else? Would pre-calling all functions at startup, mimicking inline, fix this for you?

Collaborator Author

@alowayed that's a very good point, and has me leaning towards removing cachedregexp, since the performance difference between having it vs using globals is a lot smaller than compiling on the hot path, and in a server context those risks are tons worse than the warm fuzzy "we're being efficient devs" feeling cachedregexp can give us.

Ideally, long-term it would be awesome if we could have something like this plus some kind of "hey, if you're on a server, call precompileEverything" pattern, though sadly I can't think of a good pattern for enabling that which doesn't have tradeoffs - ultimately most of the ideas I have come back to having globals anyway...

Since this should be straightforward to remove, and I've still not decided if I want to do it here or in a dedicated PR, I'll leave this to action later, as I'd still like to hear more about the overhead @vpasdf has hinted at (for my own learning, if nothing else) - but confirming I'm now leaning towards not using cachedregexp for the time being.

Collaborator

I'm concerned about the overhead of the hashmap lookup compared to compiling up front and having the regexp in a global variable - so, CPU time. This assumes, of course, that the hashmap lookup might be slower than the regex matching itself; for the Python case that's likely not true, but for trivial regexes I'm not sure.

There is also some memory overhead through the hashmap, as hashmaps store some metadata around the actual data. But I think this is negligible here.

I'm fine with having cachedregexp. I think @alowayed made a good point about the server use case. If we use it in FileRequired I want to see benchmarks first, but I think that's not the plan right now anyway.

Collaborator

I'd prefer removing the cachedregexp in the same PR. No point in introducing and then removing it right after if we already know we want to remove it.

At the very least we should make sure we only use it for expensive regexes - e.g. add a comment about suggested usage to the class and remove its usage for regexes like ^[a-z] here.

@G-Rath
Collaborator Author

G-Rath commented Dec 11, 2024

@another-rex @erikvarga thinking about this in the public context, and especially in regards to error handling, I'm wondering if we want to start by making just about everything except MustParse and Parse private? That way we'll only be exposing a function that returns an error, allowing us to defer deciding whether we want the individual parse/compare functions to return an error.

I don't know much about the desired uses for semantic internally, so I'm not sure if there's anyone wanting specific ecosystems, but ultimately that should still be achievable by using Parse; it'd just mean everyone would include the whole package in their binary.

i.e. this usage of semantic.ParsePackagistVersion should be replaceable with semantic.MustParse(lower, "Packagist").CompareStr(upper)

// when parsed as the concrete Version relative to the subject Version.
//
// The result will be 0 if v == w, -1 if v < w, or +1 if v > w.
CompareStr(str string) int


Feels like this should be:

Compare(v Version) int

instead of CompareStr. This would solve some of the error/panic discussions above, and is also more in line with how most custom comparison in Go works (comparing objects of the same type).

Collaborator Author

No, because then we'd have to handle being given a version from a different ecosystem, which itself would require returning an error - that was the reason I did CompareStr. Though yes, if we're talking about returning an error at all here, then we could explore changing this.


It's still possible for someone to provide a version from a different ecosystem as a string, no?

Here's the user journey that would result in this:

  1. semantic is run in a server that exposes an osv.dev-like API that accepts package name, version, ecosystem, etc. and returns vulnerabilities.
  2. A user provides some incorrect package like foo, claims the ecosystem is Debian, and provides a non-Debian version, let's say 1.2.3-not-a-debian-vesion!@#$.
  3. The server retrieves a vulnerability with version range [1.0.0, 2.0.0].
  4. The server runs lowerBound := semantic.Parse("1.0.0", "Debian").
  5. The server runs lowerBound.CompareStr("1.2.3-not-a-debian-vesion!@#$"), leading to a panic.

I'm surprised osv-scanner doesn't crash when there's a non-Debian version in, say, a status file that's scanned.

For our use case, we'd likely do:

func CompareVuln(vulnVersion, packageVersion, ecosystem string) error {
  defer func() {
    if r := recover(); r != nil {
      fmt.Println("Recovered:", r)
    }
  }()

  vulnV, err := semantic.Parse(vulnVersion, ecosystem)
  if err != nil {
    return err
  }

  _ = vulnV.CompareStr(packageVersion)
  // Rest omitted for brevity.
  return nil
}

Collaborator Author

@G-Rath G-Rath Dec 19, 2024

Sure, it's possible, but it just doesn't break anything: implicitly or not, the native comparators generally resolve every input down to a result rather than erroring, and so we do too. I.e. in your example, semantic handles that just fine, in the same way that dpkg does:

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ go test ./pkg/semantic/...
--- FAIL: TestVersion_Compare_Ecosystems (0.00s)
    --- FAIL: TestVersion_Compare_Ecosystems/Debian (0.00s)
        compare_test.go:235: Expected 1.2.3-not-a-debian-vesion!@#$ to be less than 1.2.3, but it was greater than
        compare_test.go:235: 1 of 31 failed
FAIL
FAIL    github.com/g-rath/osv-detector/pkg/semantic     0.230s
FAIL

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ dpkg --compare-versions '1.2.3-not-a-debian-vesion!@#$' 'lt' '1.2.3'
dpkg: warning: version '1.2.3-not-a-debian-vesion!@#$' has bad syntax: invalid character in revision number

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ echo $?
1

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ dpkg --compare-versions '1.2.3-not-a-debian-vesion!@#$' 'gt' '1.2.3'
dpkg: warning: version '1.2.3-not-a-debian-vesion!@#$' has bad syntax: invalid character in revision number

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ echo $?
0

(again, I'd really like to say this is the case for "all comparators", but it's been so long since I implemented some of these that I want to confirm that before throwing such a big claim around)

Collaborator

Hmm, looking at the code, wouldn't this panic with 1.2.3-not-a-debian:version!@#$?

Collaborator Author

@G-Rath G-Rath Dec 20, 2024

Yup, that one does it - which also matches dpkg --compare-versions, and is why I didn't want to double down on "this never panics" 😅

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0 took 4s
❯ go test ./pkg/semantic/...
--- FAIL: TestVersion_Compare_Ecosystems (0.00s)
    --- FAIL: TestVersion_Compare_Ecosystems/Debian (0.00s)
panic: failed to convert 1.2.3-not-a-debian to a number [recovered]
        panic: failed to convert 1.2.3-not-a-debian to a number

goroutine 35 [running]:
testing.tRunner.func1.2({0x52fc20, 0xc00038c040})
        /home/jones/.goenv/versions/1.21.11/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
        /home/jones/.goenv/versions/1.21.11/src/testing/testing.go:1548 +0x397
panic({0x52fc20?, 0xc00038c040?})
        /home/jones/.goenv/versions/1.21.11/src/runtime/panic.go:914 +0x21f
github.com/g-rath/osv-detector/pkg/semantic.convertToBigIntOrPanic({0xc00039c000, 0x12})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/utilities.go:13 +0xa5
github.com/g-rath/osv-detector/pkg/semantic.parseDebianVersion({0xc00039c000?, 0x0?})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/version-debian.go:153 +0xc7
github.com/g-rath/osv-detector/pkg/semantic.Parse({0xc00039c000?, 0x2?}, {0x55c248?, 0x4c0505?})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/parse.go:29 +0x777
github.com/g-rath/osv-detector/pkg/semantic_test.parseAsVersion(0xc0001d0680, {0xc00039c000, 0x1e}, {0x55c248, 0x6})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/compare_test.go:97 +0x6b
github.com/g-rath/osv-detector/pkg/semantic_test.expectCompareResult(0xc0001d0680, {0x55c248, 0x6}, {0xc00039c000, 0x1e}, {0xc00039c021, 0x5}, 0xffffffffffffffff)
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/compare_test.go:115 +0x92
github.com/g-rath/osv-detector/pkg/semantic_test.expectEcosystemCompareResult(0xc0001d0680, {0x55c248, 0x6}, {0xc00039c000, 0x1e}, {0xc00039c01f, 0x1}, {0xc00039c021, 0x5})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/compare_test.go:141 +0x97
github.com/g-rath/osv-detector/pkg/semantic_test.runAgainstEcosystemFixture(0xc0001d0680, {0x55c248, 0x6}, {0x55f5da, 0x13})
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/compare_test.go:78 +0x328
github.com/g-rath/osv-detector/pkg/semantic_test.TestVersion_Compare_Ecosystems.func1(0x0?)
        /home/jones/workspace/projects-personal/osv-detector/pkg/semantic/compare_test.go:235 +0x5a
testing.tRunner(0xc0001d0680, 0xc000121560)
        /home/jones/.goenv/versions/1.21.11/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 18
        /home/jones/.goenv/versions/1.21.11/src/testing/testing.go:1648 +0x3ad
FAIL    github.com/g-rath/osv-detector/pkg/semantic     0.006s
FAIL

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ dpkg --compare-versions '1.2.3-not-a-debian:version!@#$' 'gt' '1.2.3'
dpkg: error: version '1.2.3-not-a-debian:version!@#$' has bad syntax: epoch in version is not number

osv-detector on  main [$!?] via 🐹 v1.21.11 via  v20.11.0
❯ echo $?
2

@G-Rath
Collaborator Author

G-Rath commented Dec 12, 2024

@alowayed @zpavlinovic @erikvarga thanks for the reviews and discussions so far! I think there's been enough for me to do another pass before we continue further - notably, I think we should look at making semantic return an error so we can avoid panics, even if we end up just returning one or two errors, as it'll be easier long-term to have "return an error" as an option than to retroactively add support for it.

I'll probably also go ahead for now with making most things private, to make it easier to iterate on the internals - I think that comes down to MustParse and Parse being the public entry points, but I'll confirm once I've gotten back into the weeds.

I'll also try to do a general refresher on the package and related native version comparers as a whole: while I do have confidence in what I've said so far, I wrote a lot of this 1-2 years ago and it's required very little change besides adding new comparators, which, while a great sign of stability (or lack of use... 😅), does mean I'm not feeling as confident as I'd like when trying to think through tradeoffs and make architecture decisions.

Feel free to continue discussions and review, but also feel free to wait until I've actioned the above 🙂

@G-Rath G-Rath force-pushed the semantic/add branch 2 times, most recently from 12bee2c to cc237d6 Compare January 5, 2025 23:17
"math/big"
)

func convertToBigInt(str string) (*big.Int, error, bool) {
Collaborator Author

For now I went with having this return a bool along with an error, because a lot of the existing code was v, vIsNumber and it felt weird to either rename that to vErr or to keep the name but compare it against nil.

I don't think this is super bad, but now that I've got everything transitioned, if others feel it would be more Go-y to change all the checks to check the err instead, I can do that.

Collaborator Author

Here's a concrete example of what I mean:

vv, vErr := convertToBigInt(vt.value)
wv, wErr := convertToBigInt(wt.value)

// numeric tokens have the same natural order
if vErr == nil && wErr == nil {
	return vv.Cmp(wv) == -1, nil
}

vs

vv, vIsNumber := convertToBigInt(vt.value)
wv, wIsNumber := convertToBigInt(wt.value)

// numeric tokens have the same natural order
if vIsNumber == nil && wIsNumber == nil {
	return vv.Cmp(wv) == -1, nil
}

Though, I've just realised I could probably switch to checking if vv is nil...? But I think that has a similar trade-off in readability 🤔

Collaborator

I can see the benefit of introducing a boolean for isNumber, but we should make sure people understand what the code is supposed to return. Either add a comment about the return values, or make them named, e.g.

func convertToBigInt(str string) (res *big.Int, err error, isNumber bool) {

@G-Rath
Collaborator Author

G-Rath commented Jan 8, 2025

@alowayed @zpavlinovic @erikvarga @another-rex I've updated the implementation so that instead of panicking, semantic will now return an error - for now I've taken the simplest route of just returning an error straight away in the places where we would previously panic.

I think long-term there should be some exploration into refactoring the code to better leverage what's already been asserted, to eliminate the code paths that cannot actually happen - e.g. there are functions where we convert to a big.Int an input that has already been matched by a \d-based regexp.

There are a few places already where we could just ignore the error, as we know for sure the input can only ever be a number, but I'm guessing people's preference is to just always check-and-return the error (let me know if I'm wrong, as that would let me revert some logic changes for the Alpine implementation in particular).

I'm probably still going to remove cachedregexp at some point, but since the outcome of that should be very predictable, and it sounds like no one is strongly against this PR landing with it still in place, feel free to go ahead with re-reviewing 🙂

@G-Rath G-Rath marked this pull request as ready for review January 14, 2025 02:35
// returning an ErrUnsupportedEcosystem error if the ecosystem is not supported.
func Parse(str string, ecosystem string) (Version, error) {
//nolint:exhaustive // Using strings to specify ecosystem instead of lockfile types
switch ecosystem {
Collaborator

Can we add the ecosystems we support in osv.dev (https://github.com/google/osv.dev/blob/master/osv/ecosystems/_ecosystems.py) here? The missing ones are all pretty simple to add I believe, either using an existing sorting method, or literally just an int.

Collaborator Author

Exploiting my IDE, it looks like these are the ones not in our switch statement:

	case "Bitnami":
		SemverEcosystem()
	case "SwiftURL":
		SemverEcosystem()
		// # Non SemVer-based ecosystems
	case "Bioconductor":
		Bioconductor()
	case "Chainguard":
		Chainguard()
	case "GHC":
		GHC()
	case "Hackage":
		Hackage()
	case "Wolfi":
		Wolfi()
		// # Ecosystems which require a release version for enumeration, which is
		// # handled separately in get().
		// # Ecosystems missing implementations:
	case "Android":
		OrderingUnsupportedEcosystem()
	case "ConanCenter":
		OrderingUnsupportedEcosystem()
	case "GitHub Actions":
		OrderingUnsupportedEcosystem()
	case "Linux":
		OrderingUnsupportedEcosystem()
	case "OSS-Fuzz":
		OrderingUnsupportedEcosystem()
	case "Photon OS":
		OrderingUnsupportedEcosystem()

Of those, the SemVer ones can just be dropped in, and we can't support the last six, so that leaves five that would need custom implementations (though I wouldn't be surprised if Bioconductor is just CRAN).

I'm happy to look into those, but think they probably shouldn't be a blocker to landing this?

Collaborator

These are not blockers and can be done in a followup, the hardest part will just be the tests I think.

Bioconductor -> Semver
Chainguard -> Alpine
GHC -> Semver
Hackage -> SemverLike (just numbers and . separators)
Wolfi -> Alpine

@erikvarga
Collaborator

PR looks good to me now, apart from some nits - and I'd still prefer cachedregexp to be introduced separately, discussing its merits there (vs. removing it in a separate PR).


Successfully merging this pull request may close these issues.

Migrate semantic package from osv-scanner to osv-scalibr
5 participants