Reduce binary distribution size: unbundle optional Hadoop/Kafka deps + de-duplicate shared jars #8819
Draft
rzo1 wants to merge 6 commits into
storm-autocreds pulls in the full Hadoop/HBase client dependency tree (~79 MB, 43 jars unique to it) but is only needed on secure (Kerberos) clusters and is off by default. Ship only the README, consistent with the other external/* connectors, and add bin/storm-autocreds-fetch to retrieve the plugin and its runtime dependencies from Maven Central into extlib-daemon on demand. Also removes the now-unused storm-autocreds-bin assembly module.
The storm-kafka-monitor jars (and their Kafka client dependencies, ~38 MB) are only needed to display Kafka spout lag in the UI or to run the bin/storm-kafka-monitor command. Ship only the README, consistent with the other external/* connectors, and add bin/storm-kafka-monitor-fetch to retrieve the tool and its runtime dependencies from Maven Central into lib-tools/storm-kafka-monitor on demand. Guard the UI against the jars being absent: TopologySpoutLag now detects whether storm-kafka-monitor is installed and, when it is not, surfaces an actionable message (and logs it once) instead of failing the lag shell-out. The bin/storm-kafka-monitor wrapper prints the same hint instead of a ClassNotFound error. Also removes the now-unused storm-kafka-monitor-bin assembly module.
Prepares de-duplication of the jars shared by the daemon (lib) and worker (lib-worker) classpaths into a single lib-common directory. storm.py now includes lib-common on both classpaths; when the directory is absent (older layouts) it contributes nothing, so the change is backward compatible.
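As a rough illustration of why the change is backward compatible, the sketch below builds wildcard classpath entries and adds lib-common only when the directory exists. The function name and the exact ordering of entries are assumptions; the real logic in storm.py may differ.

```python
import os


def classpath_entries(storm_home, role_dir):
    """Build wildcard classpath entries for one role.

    role_dir is "lib" for daemons or "lib-worker" for workers.
    """
    entries = [os.path.join(storm_home, "*")]
    common = os.path.join(storm_home, "lib-common")
    # Older layouts have no lib-common; in that case nothing is added.
    if os.path.isdir(common):
        entries.append(os.path.join(common, "*"))
    entries.append(os.path.join(storm_home, role_dir, "*"))
    return entries
```

With an older layout the result is exactly the pre-change classpath, which is what makes the storm.py change safe to ship ahead of the assembly change.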
The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). dedup-libs.py moves the shared jars into a single lib-common directory and removes the duplicate copies from lib, reclaiming ~71 MB. It only de-duplicates byte-identical jars (same name and sha-256), so a version mismatch is never silently merged; tool classpaths (lib-tools/*, lib-webapp) are left untouched. Paired with the lib-common classpath support in bin/storm.py. Wiring this into the binary assembly is a follow-up (it must be validated by a full -Pdist distribution build).
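The byte-identical rule can be sketched as follows. This is not the content of dedup-libs.py; the function names, the directory arguments, and the choice to move the worker copy (rather than the daemon copy) into lib-common are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large jars are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup(lib, lib_worker, lib_common):
    """Move jars shared by lib and lib_worker into lib_common.

    A jar is shared only if both the file name and the sha-256 match.
    Returns the sorted list of jar names that were de-duplicated.
    """
    lib, lib_worker, lib_common = Path(lib), Path(lib_worker), Path(lib_common)
    lib_common.mkdir(parents=True, exist_ok=True)
    moved = []
    for worker_jar in sorted(lib_worker.glob("*.jar")):
        daemon_jar = lib / worker_jar.name
        if not daemon_jar.is_file():
            continue  # not on the daemon classpath at all
        if sha256_of(daemon_jar) != sha256_of(worker_jar):
            continue  # same name, different bytes: leave both copies alone
        shutil.move(str(worker_jar), str(lib_common / worker_jar.name))
        daemon_jar.unlink()  # drop the duplicate copy from lib
        moved.append(worker_jar.name)
    return moved
```

Comparing the hash as well as the name is what prevents two different builds of a same-named jar from being collapsed into one.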
final-package now stages the daemon jars (copy-dependencies -> staging/lib) and the shared jars (storm-client-bin's lib-worker -> staging/lib-common), runs dedup-libs.py to remove the byte-identical copies from lib, and the assembly packages the staged lib/ and lib-common/ directories. The storm-client-bin tree (which only carried lib-worker) is no longer copied directly. Verified through the prepare-package phase: staging/lib = 48 jars, staging/lib-common = 40 jars, zero overlap, full 88-jar daemon set preserved across lib + lib-common, ~71 MB reclaimed. The final tar/zip packaging requires a full -Pdist (native) distribution build and must be validated in CI / on Linux.
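The invariants quoted above (no overlap between the two directories, and the full jar set preserved across them) could be checked with something like the sketch below. The function name and the returned dictionary are hypothetical; this is not part of the PR's build.

```python
from pathlib import Path


def check_staging(lib, lib_common, expected_total):
    """Check that lib and lib_common are disjoint and together hold every jar."""
    lib_names = {p.name for p in Path(lib).glob("*.jar")}
    common_names = {p.name for p in Path(lib_common).glob("*.jar")}
    overlap = sorted(lib_names & common_names)
    total = len(lib_names | common_names)
    return {
        "lib": len(lib_names),
        "lib_common": len(common_names),
        "overlap": overlap,
        "total": total,
        "ok": not overlap and total == expected_total,
    }
```

For the staging tree described above, one would call check_staging("staging/lib", "staging/lib-common", 88) and expect 48 and 40 jars with an empty overlap.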
The binary distribution now de-duplicates jars shared by the daemon and worker classpaths into a lib-common directory, which bin/storm.py adds to the classpath after the storm home wildcard. Update the expected classpath assertions in test_storm_cli.py to include lib-common.
What & why
The binary distribution ships ~395 MB of jars, much of it optional or duplicated. This PR trims it substantially without removing any capability: optional pieces become fetch-on-demand, and jars shared between the daemon and worker classpaths are de-duplicated.
A smaller artifact means less download, storage, and registry bandwidth, a smaller container image, and a lighter CI/CD carbon footprint. 🌱
Changes
- storm-autocreds no longer bundled (-79 MB). It pulls in the full Hadoop/HBase client tree but is only used on secure (Kerberos) clusters and is off by default. It now ships only the README (like the other external/* connectors); bin/storm-autocreds-fetch retrieves the plugin and its deps into extlib-daemon.
- storm-kafka-monitor no longer bundled (-38 MB). It is only needed to show Kafka spout lag in the UI or to run bin/storm-kafka-monitor. bin/storm-kafka-monitor-fetch installs it into lib-tools/storm-kafka-monitor. The UI degrades gracefully when it is absent (TopologySpoutLag detects it, shows an actionable message, logs once) and the wrapper prints a hint instead of ClassNotFound.
- lib-common de-duplication (-71 MB). The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). Shared jars are now kept once in lib-common/ (added to both classpaths via bin/storm.py) and removed from lib. dedup-libs.py only merges byte-identical jars (same name and sha-256), so a version mismatch is never silently merged; tool classpaths are left untouched.

Net: roughly -188 MB (~47%) of the bundled jar payload, with no loss of functionality.
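The net figure follows directly from the three per-change savings and the ~395 MB starting payload; a quick check of the arithmetic, using only the approximate numbers quoted in this description:

```python
# Approximate savings in MB, as quoted in the list above.
savings_mb = {
    "storm-autocreds": 79,
    "storm-kafka-monitor": 38,
    "lib-common de-duplication": 71,
}
total_before_mb = 395  # approximate bundled jar payload before this PR

saved_mb = sum(savings_mb.values())
fraction = saved_mb / total_before_mb
print(f"saved {saved_mb} MB of {total_before_mb} MB ({fraction:.1%})")
# prints: saved 188 MB of 395 MB (47.6%)
```

Since the inputs are rounded estimates, the percentage is only indicative.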
Notes for reviewers
Opening as draft to get early feedback on the approach. Still need to verify on a Linux system (full -Pdist native distribution build).