Reduce binary distribution size: unbundle optional Hadoop/Kafka deps + de-duplicate shared jars #8819

Draft
rzo1 wants to merge 6 commits into master from reduce-distro-size

@rzo1 rzo1 commented Jun 30, 2026

What & why

The binary distribution ships ~395 MB of jars, much of it optional or duplicated. This PR trims it substantially without removing any capability: optional pieces become fetch-on-demand, and jars shared between the daemon and worker classpaths are de-duplicated.

A smaller artifact means less download, storage, and registry bandwidth, a smaller container image, and a lighter CI/CD carbon footprint. 🌱

Changes

  1. storm-autocreds no longer bundled (-79 MB). It pulls the full Hadoop/HBase client tree but is only used on secure (Kerberos) clusters and is off by default. Now ships only the README (like the other external/* connectors); bin/storm-autocreds-fetch retrieves the plugin and its deps into extlib-daemon.

  2. storm-kafka-monitor no longer bundled (-38 MB). Only needed to show Kafka spout lag in the UI or to run bin/storm-kafka-monitor. bin/storm-kafka-monitor-fetch installs it into lib-tools/storm-kafka-monitor. The UI degrades gracefully when it is absent (TopologySpoutLag detects it, shows an actionable message, logs once) and the wrapper prints a hint instead of ClassNotFound.

  3. lib-common de-duplication (-71 MB). The worker classpath (lib-worker) is a byte-identical subset of the daemon classpath (lib). Shared jars are now kept once in lib-common/ (added to both classpaths via bin/storm.py) and removed from lib. dedup-libs.py only merges byte-identical jars (name and sha-256), so no version is silently merged; tool classpaths are left untouched.

Net: roughly -188 MB (~47%) of the bundled jar payload, with no loss of functionality.
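The net figure can be re-derived from the three line items above. A quick check, using only the numbers quoted in this description (the variable names are just for illustration):

```python
# Savings per change in MB, as quoted in the list above.
savings_mb = {
    "storm-autocreds": 79,
    "storm-kafka-monitor": 38,
    "lib-common de-duplication": 71,
}
bundled_mb = 395  # approximate jar payload before this PR

total_saved = sum(savings_mb.values())
fraction = total_saved / bundled_mb

print(f"saved ~{total_saved} MB, about {fraction:.1%} of {bundled_mb} MB")
```

The sum is 188 MB, which is a little under half of the ~395 MB payload, consistent with the rounded percentage above.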

Notes for reviewers

Opening as draft to get early feedback on the approach. Still need to verify on a Linux system (full -Pdist native distribution build).

@rzo1 rzo1 requested review from GGraziadei, jnioche and reiabreu June 30, 2026 18:16
rzo1 added 6 commits July 1, 2026 13:31
storm-autocreds pulls in the full Hadoop/HBase client dependency tree
(~79 MB, 43 jars unique to it) but is only needed on secure (Kerberos)
clusters and is off by default. Ship only the README, consistent with
the other external/* connectors, and add bin/storm-autocreds-fetch to
retrieve the plugin and its runtime dependencies from Maven Central into
extlib-daemon on demand.
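
For orientation, this is roughly how a Maven coordinate maps to a download URL in the standard Maven repository layout. It is only a sketch: the helper name `maven_jar_url` and the version in the usage line are hypothetical, and it does not resolve transitive dependencies, which the fetch script also has to handle in some way not shown here.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id, artifact_id, version, base=MAVEN_CENTRAL):
    """Build the jar URL for a coordinate using the standard repository layout.

    Hypothetical helper for illustration; not taken from the fetch script.
    """
    group_path = group_id.replace(".", "/")  # org.apache.storm -> org/apache/storm
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


# The version below is a placeholder, not a statement about any release.
print(maven_jar_url("org.apache.storm", "storm-autocreds", "X.Y.Z"))
```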

Also removes the now-unused storm-autocreds-bin assembly module.

The storm-kafka-monitor jars (and their Kafka client dependencies, ~38 MB)
are only needed to display Kafka spout lag in the UI or to run the
bin/storm-kafka-monitor command. Ship only the README, consistent with the
other external/* connectors, and add bin/storm-kafka-monitor-fetch to
retrieve the tool and its runtime dependencies from Maven Central into
lib-tools/storm-kafka-monitor on demand.

Guard the UI against the jars being absent: TopologySpoutLag now detects
whether storm-kafka-monitor is installed and, when it is not, surfaces an
actionable message (and logs it once) instead of failing the lag shell-out.
The bin/storm-kafka-monitor wrapper prints the same hint instead of a
ClassNotFound error.

Also removes the now-unused storm-kafka-monitor-bin assembly module.

Prepares de-duplication of the jars shared by the daemon (lib) and worker
(lib-worker) classpaths into a single lib-common directory. storm.py now
includes lib-common on both classpaths; when the directory is absent (older
layouts) it contributes nothing, so the change is backward compatible.

The worker classpath (lib-worker) is a byte-identical subset of the daemon
classpath (lib). dedup-libs.py moves the shared jars into a single lib-common
directory and removes the duplicate copies from lib, reclaiming ~71 MB. It only
de-duplicates byte-identical jars (same name and sha-256), so a version
mismatch is never silently merged; tool classpaths (lib-tools/*, lib-webapp)
are left untouched.
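
A minimal sketch of the "same name and sha-256" rule, assuming flat directories of jars. This is not the contents of dedup-libs.py; the function names and signature are hypothetical and only illustrate why a version mismatch cannot be merged by accident.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large jars are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup(lib, lib_worker, lib_common):
    """Move jars shared by lib and lib_worker into lib_common.

    A jar counts as shared only if the file name and the SHA-256 both match.
    Returns the names of the jars that were moved.
    """
    lib, lib_worker, lib_common = Path(lib), Path(lib_worker), Path(lib_common)
    lib_common.mkdir(parents=True, exist_ok=True)
    moved = []
    for worker_jar in sorted(lib_worker.glob("*.jar")):
        daemon_jar = lib / worker_jar.name
        if not daemon_jar.is_file():
            continue  # only present on the worker side
        if sha256_of(daemon_jar) != sha256_of(worker_jar):
            continue  # same name, different bytes: leave both copies alone
        shutil.move(str(worker_jar), str(lib_common / worker_jar.name))
        daemon_jar.unlink()  # drop the duplicate from the daemon classpath
        moved.append(worker_jar.name)
    return moved
```

Comparing hashes rather than names alone is what makes the step safe: two jars with the same file name but different contents stay where they are.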

Paired with the lib-common classpath support in bin/storm.py. Wiring this into
the binary assembly is a follow-up (it must be validated by a full -Pdist
distribution build).

final-package now stages the daemon jars (copy-dependencies -> staging/lib) and
the shared jars (storm-client-bin's lib-worker -> staging/lib-common), runs
dedup-libs.py to remove the byte-identical copies from lib, and the assembly
packages the staged lib/ and lib-common/ directories. The storm-client-bin tree
(which only carried lib-worker) is no longer copied directly.

Verified through the prepare-package phase: staging/lib = 48 jars,
staging/lib-common = 40 jars, zero overlap, full 88-jar daemon set preserved
across lib + lib-common, ~71 MB reclaimed. The final tar/zip packaging requires
a full -Pdist (native) distribution build and must be validated in CI / on Linux.

The binary distribution now de-duplicates jars shared by the daemon and
worker classpaths into a lib-common directory, which bin/storm.py adds to
the classpath after the storm home wildcard. Update the expected classpath
assertions in test_storm_cli.py to include lib-common.
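
To make the classpath change concrete, here is a sketch of how a wildcard entry for lib-common could be added so that a missing directory contributes nothing. The function names are hypothetical and this is not the code in bin/storm.py; in particular, placing lib-common before lib or lib-worker is an assumption, since the description only says it comes after the storm home wildcard.

```python
import os


def wildcard_dir(path):
    """Return ['<path>/*'] if the directory exists, otherwise an empty list."""
    return [os.path.join(path, "*")] if os.path.isdir(path) else []


def classpath_entries(storm_home, daemon=True):
    """Assemble classpath entries: storm home wildcard, lib-common, then lib or lib-worker."""
    entries = [os.path.join(storm_home, "*")]  # storm home wildcard
    # Absent on older layouts, in which case nothing is added.
    entries += wildcard_dir(os.path.join(storm_home, "lib-common"))
    entries += wildcard_dir(os.path.join(storm_home, "lib" if daemon else "lib-worker"))
    return entries


def classpath(storm_home, daemon=True):
    """Join the entries with the platform path separator."""
    return os.pathsep.join(classpath_entries(storm_home, daemon))
```

Because the directory check returns an empty list, an older layout without lib-common yields the same classpath as before.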
@rzo1 rzo1 force-pushed the reduce-distro-size branch from 7e37f99 to 15a9235 Compare July 1, 2026 11:31
@rzo1 rzo1 added this to the 3.0.0 milestone Jul 1, 2026