Skip to content

Add script for finding missing IO specializations #683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jeltsch
Copy link
Collaborator

@jeltsch jeltsch commented Apr 14, 2025

This pull requests adds a script that helps in finding missing IO specializations, along with some example input and output.

When running the command scripts/find-missing-io-specialisations.sh src/**/*.hs on my Ubuntu 24.04 machine while being at the current main tip, 78a3231, I get the following output:

Database.LSMTree.Internal.Arena.newArenaManager
Database.LSMTree.Internal.Arena.withArena
Database.LSMTree.Internal.Arena.scrambleArena
Database.LSMTree.Internal.Arena.scrambleBlock
Database.LSMTree.Internal.Arena.withUnmanagedArena
Database.LSMTree.Internal.Arena.allocateFromArena
Database.LSMTree.Internal.Arena.newBlockWithFree
Database.LSMTree.Internal.BlobFile.readBlob
Database.LSMTree.Internal.BlobRef.releaseBlobRef
Database.LSMTree.Internal.BlobRef.readRawBlobRef
Database.LSMTree.Internal.ChecksumHandle.dropCache
Database.LSMTree.Internal.ChecksumHandle.closeHandle
Database.LSMTree.Internal.CRC32C.expectValidFile
Database.LSMTree.Internal.doesSnapshotDirExist
Database.LSMTree.Internal.IncomingRun.newIncomingSingleRun
Database.LSMTree.Internal.IncomingRun.newIncomingMergingRun
Database.LSMTree.Internal.IncomingRun.depositNominalCredits
Database.LSMTree.Internal.Merge.writeReaderEntry
Database.LSMTree.Internal.Merge.writeSerialisedEntry
Database.LSMTree.Internal.MergeSchedule.mkUnionCache
Database.LSMTree.Internal.MergeSchedule.duplicateUnionCache
Database.LSMTree.Internal.MergeSchedule.releaseUnionCache
Database.LSMTree.Internal.MergingRun.unsafeNew
Database.LSMTree.Internal.MergingRun.mergeType
Database.LSMTree.Internal.MergingRun.atomicReadCredits
Database.LSMTree.Internal.MergingRun.atomicModifyInt
Database.LSMTree.Internal.MergingRun.supplyChecked
Database.LSMTree.Internal.MergingRun.supplyCreditsRelative
Database.LSMTree.Internal.MergingRun.supplyCreditsAbsolute
Database.LSMTree.Internal.MergingTree.expectCompleted
Database.LSMTree.Internal.MergingTree.Lookup.mapMStrict
Database.LSMTree.Internal.RunReader.seekToDiskPage
Database.LSMTree.Internal.UniqCounter.newUniqCounter
Database.LSMTree.Internal.UniqCounter.incrUniqCounter
Database.LSMTree.Internal.Vector.mapMStrict
Database.LSMTree.Internal.Vector.imapMStrict
Database.LSMTree.Internal.Vector.forMStrict
Database.LSMTree.Internal.Vector.unsafeInsertWithMStrict

It should be checked that the same output arises on other systems, in particular such that do not use GNU sed.

@jeltsch jeltsch self-assigned this Apr 14, 2025
@jeltsch jeltsch added the enhancement New feature or request label Apr 14, 2025
@wenkokke
Copy link
Collaborator

As a preliminary note, it would be good to change the script to require no parameters (ie inline the source glob) and perhaps rename it to start with lint, for consistency.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 14, 2025

I‘d like to keep the flexibility of being able to pass arbitrarily chosen Haskell source code to the script but realize that it would be good to have a script that just checks the present source code (probably of the main library only). What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

@wenkokke
Copy link
Collaborator

What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

Sounds great! Another option would be to default to all Haskell files (or just src/**/*.hs) but allow the first argument to the script to override this.

LC_COLLATE=C LC_CTYPE=C sed -En '
:start
/^module / {
s/module +([^ ]*).*/\1 /
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to explicitly match |$ after the capture group for clarity, to not depend on greediness?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used to put such extra information about delimiters into regular expressions, but now think that it’s fine to rely on greediness. After all, we’re relying on it in other situations already, where it’s hard or even impossible not to do so.

@wenkokke
Copy link
Collaborator

I think it might be a good idea to add dinner comments to the sed script that explain the logic a bit, not per se on a command by command basis, but to explain the algorithm.

@wenkokke
Copy link
Collaborator

OTOH, it's sufficiently clear the script isn't malicious, so once there's a CI job that runs it on macOS and Windows, I'm happy to merge it under the understanding that it's not a complete guarantee on the absence of missing specialisations... which is true anyway, given the assumptions it makes.

@jorisdral
Copy link
Collaborator

Are there cases where the script returns false positives, and if not now, is it conceivable that it could do so in the future? If so, it might be good to think about how such false positives could be addressed, specially if it's a CI job we run that can hold up PRs

@wenkokke
Copy link
Collaborator

Are there cases where the script returns false positives, and if not now, is it conceivable that it could do so in the future? If so, it might be good to think about how such false positives could be addressed, specially if it's a CI job we run that can hold up PRs

@jeltsch already floated the idea of an allowlist for false positives.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

f0f80d6 disables reporting of operations with inline directives. The expected result of running scripts/find-missing-io-specialisations.sh src/**/*.hs is now as follows:

Database.LSMTree.Internal.Arena.newArenaManager
Database.LSMTree.Internal.Arena.scrambleArena
Database.LSMTree.Internal.Arena.scrambleBlock
Database.LSMTree.Internal.Arena.withUnmanagedArena
Database.LSMTree.Internal.Arena.allocateFromArena
Database.LSMTree.Internal.Arena.newBlockWithFree
Database.LSMTree.Internal.ChecksumHandle.dropCache
Database.LSMTree.Internal.ChecksumHandle.closeHandle
Database.LSMTree.Internal.CRC32C.expectValidFile
Database.LSMTree.Internal.doesSnapshotDirExist
Database.LSMTree.Internal.IncomingRun.depositNominalCredits
Database.LSMTree.Internal.MergeSchedule.mkUnionCache
Database.LSMTree.Internal.MergeSchedule.duplicateUnionCache
Database.LSMTree.Internal.MergeSchedule.releaseUnionCache
Database.LSMTree.Internal.MergingTree.expectCompleted
Database.LSMTree.Internal.MergingTree.Lookup.mapMStrict
Database.LSMTree.Internal.RunReader.seekToDiskPage

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

OTOH, it's sufficiently clear the script isn't malicious, so once there's a CI job that runs it on macOS and Windows, I'm happy to merge it under the understanding that it's not a complete guarantee on the absence of missing specialisations... which is true anyway, given the assumptions it makes.

Are you thinking here of a CI job that applies the script to the codebase to ensure that there are no missing specializations or one that tests the script using the data in scripts/find-missing-io-specialisations.test?

Beside that, do you think it is necessary to have CI run this script on Windows? After all, it’s not part of the software we want to ship in the end but a utility intended to help us (and perhaps future contributors). I mean, if supporting Windows, using some Unix compatibility layer, is easy, then I’m not against it; I just don’t want to spend much effort on making this script Windows-compatible (note that initially I intended this script to just assist me in a particular task and not even make it into the repository 🙂)

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 16, 2025

I think it might be a good idea to add dinner comments to the sed script that explain the logic a bit, not per se on a command by command basis, but to explain the algorithm.

Done in 325c20e.

@jeltsch
Copy link
Collaborator Author

jeltsch commented Apr 20, 2025

What about keeping the script as it is and adding a wrapper script that passes the main library’s modules and have its name start with lint-?

Sounds great!

I’ve realized this in 539f06a. The wrapper script not only automatically picks the source files of the main library but also enables users to explicitly allow certain missing IO specializations.

@jeltsch jeltsch force-pushed the jeltsch/find-missing-io-specialisations branch 2 times, most recently from 5cc18ea to 11eb81f Compare April 20, 2025 16:57
@jeltsch jeltsch force-pushed the jeltsch/find-missing-io-specialisations branch 2 times, most recently from 130de00 to b018b5a Compare April 20, 2025 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants