fix(registry): bump docstring truncation cap to 4000 chars (#335) #337

Open
SAY-5 wants to merge 2 commits into sktime:main from SAY-5:fix/docstring-truncation-335

Conversation

SAY-5 commented Apr 28, 2026

Closes #335.

What

EstimatorNode.to_dict truncated the estimator docstring at 500
characters before exposing it to the LLM. sktime's numpydoc-style
docstrings put the Parameters section, the most useful piece for
an agent reasoning about hyperparameters, past the first 500 chars
in many estimators (composite forecasters, classifiers with 10+
parameters, etc.), so the truncation hid exactly the content the LLM
needed to configure the model.

Fix

Bump the cap to 4000 characters via a new module-level constant
(_DOCSTRING_MAX_CHARS) and a helper (_truncate_docstring):

  • short / None docstrings pass through unchanged
  • longer ones get truncated to the cap with a trailing ... marker
    so the consumer knows content was elided
  • 4000 is large enough to keep the Parameters block of every
    sktime estimator I sampled while still keeping the MCP tool result
    reasonably compact

The issue suggested either bumping the limit or implementing a
"prioritize Parameters section" parser. I went with the simpler cap
bump since smarter parsing has a much larger surface (numpydoc /
Google-style / plain text variations); happy to follow up with a
section-aware truncator if you'd prefer that direction.
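
For comparison, a rough sketch of what the section-aware alternative could look like for numpydoc-style docstrings only. Nothing here is part of this PR: the function name, the `summary_chars` parameter, and the regular expression are all hypothetical, and Google-style or plain-text docstrings simply fall back to the plain cap.

```python
import re
from typing import Optional

# Hypothetical pattern: a "Parameters" line followed by a dashed underline,
# as numpydoc writes section headers. Indentation is allowed.
_PARAMS_HEADER = re.compile(
    r"^[ \t]*Parameters[ \t]*\n[ \t]*-{3,}[ \t]*$", re.MULTILINE
)


def truncate_prioritizing_parameters(
    doc: Optional[str], max_chars: int = 4000, summary_chars: int = 300
) -> Optional[str]:
    """Keep the start of the docstring plus as much of Parameters as fits."""
    if doc is None or len(doc) <= max_chars:
        return doc
    match = _PARAMS_HEADER.search(doc)
    # No numpydoc Parameters header, or it already sits near the top:
    # fall back to the plain cap with a trailing marker.
    if match is None or match.start() <= summary_chars:
        return doc[:max_chars] + "..."
    # Keep a short summary, mark the elided middle, then spend the
    # remaining budget on the Parameters section.
    head = doc[:summary_chars].rstrip() + "\n...\n"
    budget = max(0, max_chars - len(head))
    start = match.start()
    tail = doc[start:start + budget]
    if start + budget < len(doc):
        tail += "..."  # content after the kept part was also elided
    return head + tail
```

Even this narrow version needs a header pattern, a budget split, and two elision markers, which illustrates the larger surface mentioned above.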

Tests

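tests/test_core.py::TestDocstringTruncation:

  • test_short_docstring_unchanged
  • test_none_passes_through
  • test_long_docstring_truncated_with_ellipsis
  • test_cap_preserves_parameters_section_past_500_chars, which pins
    the #335 regression: a docstring whose Parameters section starts
    at offset 600 must still expose it after truncation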

Verification

  • pytest tests/test_core.py::TestDocstringTruncation -v → 4 passed.
  • Reverting _DOCSTRING_MAX_CHARS to 500 makes the last subtest
    fail with assert 'Parameters' in '<x...>', confirming the test
    catches the regression.
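
The regression check can be illustrated with a self-contained sketch. The `truncate` function is a hypothetical stand-in for the PR's helper (assumed to cut at the cap and append `...`), and `make_docstring` is an illustrative fixture, not the test code in `tests/test_core.py`.

```python
from typing import Optional


def truncate(doc: Optional[str], max_chars: int) -> Optional[str]:
    """Hypothetical stand-in for the helper: cut at the cap, append '...'."""
    if doc is None or len(doc) <= max_chars:
        return doc
    return doc[:max_chars] + "..."


def make_docstring(offset: int = 600, total: int = 5000) -> str:
    """Build a fake docstring whose Parameters header starts after ``offset`` filler chars."""
    header = "\nParameters\n----------\n"
    filler = "x" * offset
    tail = "y" * (total - offset - len(header))
    return filler + header + tail


doc = make_docstring()

# With the new cap, the Parameters header (just past offset 600) survives.
assert "Parameters" in truncate(doc, 4000)

# With the old cap, only filler plus the marker remains, so the same
# membership check would fail; this is the regression being pinned.
assert "Parameters" not in truncate(doc, 500)
```

With the cap at 500 the truncated value is a run of `x` characters followed by `...`, which is consistent with the failure message quoted above.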

Closes sktime#335.

`EstimatorNode.to_dict` truncated the estimator docstring at 500
characters before exposing it to the LLM. sktime's numpydoc-style
docstrings put the `Parameters` section — the most useful piece
for an agent reasoning about hyperparameters — past the first 500
chars in many estimators (e.g. composite forecasters, classifiers
with 10+ parameters), so the truncation hid exactly the content the
LLM needed.

Bump the cap to 4000 characters via a new module-level constant
(`_DOCSTRING_MAX_CHARS`) and a helper (`_truncate_docstring`):

- short / None docstrings pass through unchanged;
- longer ones get truncated to the cap with a trailing `...`
  marker so the consumer knows content was elided;
- 4000 is large enough to keep the `Parameters` block of every
  sktime estimator we have looked at while still keeping the MCP
  tool result reasonably compact.

Tests
- `test_short_docstring_unchanged`
- `test_none_passes_through`
- `test_long_docstring_truncated_with_ellipsis`
- `test_cap_preserves_parameters_section_past_500_chars` — pins
  the sktime#335 regression: a docstring with the `Parameters` section
  starting at offset 600 must still expose it after truncation.

Verified locally: 4 passed; the last subtest fails when the cap
is reverted to 500, confirming the test catches the regression.

Signed-off-by: SAY-5 <[email protected]>

Development

Successfully merging this pull request may close these issues.

[BUG] Hardcoded 500-character docstring truncation limits LLM reasoning

1 participant