fix(registry): bump docstring truncation cap to 4000 chars (#335) #337
Open
SAY-5 wants to merge 2 commits into
Closes sktime#335.

`EstimatorNode.to_dict` truncated the estimator docstring at 500 characters before exposing it to the LLM. sktime's numpydoc-style docstrings put the `Parameters` section, the most useful piece for an agent reasoning about hyperparameters, past the first 500 chars in many estimators (e.g. composite forecasters, classifiers with 10+ parameters), so the truncation hid exactly the content the LLM needed.

Bump the cap to 4000 characters via a new module-level constant (`_DOCSTRING_MAX_CHARS`) and a helper (`_truncate_docstring`):

- short / None docstrings pass through unchanged;
- longer ones get truncated to the cap with a trailing `...` marker so the consumer knows content was elided;
- 4000 is large enough to keep the `Parameters` block of every sktime estimator we have looked at while still keeping the MCP tool result reasonably compact.

Tests

- `test_short_docstring_unchanged`
- `test_none_passes_through`
- `test_long_docstring_truncated_with_ellipsis`
- `test_cap_preserves_parameters_section_past_500_chars`: pins the sktime#335 regression; a docstring with the `Parameters` section starting at offset 600 must still expose it after truncation.

Verified locally: 4 passed; the last subtest fails when the cap is reverted to 500, confirming the test catches the regression.

Signed-off-by: SAY-5 <[email protected]>
This was referenced Apr 28, 2026
Closes #335.
What
`EstimatorNode.to_dict` truncated the estimator docstring at 500
characters before exposing it to the LLM. sktime's numpydoc-style
docstrings put the `Parameters` section, the most useful piece for
an agent reasoning about hyperparameters, past the first 500 chars
in many estimators (composite forecasters, classifiers with 10+
parameters, etc.), so the truncation hid exactly the content the LLM
needed to configure the model.
Fix
Bump the cap to 4000 characters via a new module-level constant
(`_DOCSTRING_MAX_CHARS`) and a helper (`_truncate_docstring`):
- short / `None` docstrings pass through unchanged
- longer ones get truncated to the cap with a trailing `...` marker
  so the consumer knows content was elided
- 4000 is large enough to keep the `Parameters` block of every
  sktime estimator I sampled while still keeping the MCP tool result
  reasonably compact
The issue suggested either bumping the limit or implementing a
"prioritize Parameters section" parser. I went with the simpler cap
bump since smarter parsing has a much larger surface (numpydoc /
Google-style / plain text variations); happy to follow up with a
section-aware truncator if you'd prefer that direction.
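To make the alternative concrete, here is a rough sketch of what a section-aware truncator could look like for numpydoc-style docstrings only. Everything in it is hypothetical (`truncate_keeping_parameters`, `_PARAMS_HEADER`, the quarter-of-budget summary limit); it is not part of this PR and ignores Google-style and plain-text docstrings, which is exactly the surface area mentioned above.

```python
import re
from typing import Optional

# Hypothetical: matches a numpydoc "Parameters" title line followed by a
# dashed underline. Other docstring styles are not handled.
_PARAMS_HEADER = re.compile(
    r"^[ \t]*Parameters[ \t]*\n[ \t]*-{3,}[ \t]*$", re.MULTILINE
)


def truncate_keeping_parameters(
    doc: Optional[str], max_chars: int = 4000
) -> Optional[str]:
    """Keep the first paragraph plus the Parameters section within a budget."""
    if doc is None or len(doc) <= max_chars:
        return doc
    match = _PARAMS_HEADER.search(doc)
    if match is None:
        # No recognisable Parameters header: fall back to a plain cut.
        return doc[:max_chars] + "..."
    # First paragraph as a summary, limited to a quarter of the budget.
    summary = doc[: match.start()].strip().split("\n\n", 1)[0][: max_chars // 4]
    params = doc[match.start():]
    budget = max_chars - len(summary)
    tail = params if len(params) <= budget else params[:budget] + "..."
    # The elision markers push the result slightly past max_chars.
    return summary + "\n\n...\n\n" + tail
```

Even this small version needs a style-specific regex and a budget policy, which is why the plain cap bump is the lower-risk change.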
Tests
`tests/test_core.py::TestDocstringTruncation`:
- `test_short_docstring_unchanged`
- `test_none_passes_through`
- `test_long_docstring_truncated_with_ellipsis`
- `test_cap_preserves_parameters_section_past_500_chars`, the #335
  regression ("[BUG] Hardcoded 500-character docstring truncation
  limits LLM reasoning"): a docstring with a `Parameters` block
  starting at offset 600 must still surface after truncation.
Verification
- `pytest tests/test_core.py::TestDocstringTruncation -v` → 4 passed.
- Reverting `_DOCSTRING_MAX_CHARS` to 500 makes the last subtest
  fail with `assert 'Parameters' in '<x...>'`, confirming the test
  catches the regression.