-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(deps): bump nltk from 3.8.1 to 3.9.1 fix unit tests for dependabot/pip/nltk-3.9 branch #51
chore(deps): bump nltk from 3.8.1 to 3.9.1 fix unit tests for dependabot/pip/nltk-3.9 branch #51
Conversation
📝 Walkthrough📝 Walkthrough📝 Walkthrough📝 WalkthroughWalkthroughThis pull request introduces several modifications across multiple files, focusing on enhancing error handling, updating dependency specifications, and refining test cases. Key changes include improved exception handling in the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant UnstructuredParser
participant Logger
User->>UnstructuredParser: parse_records()
UnstructuredParser->>UnstructuredParser: Handle exceptions
alt Exception occurs
UnstructuredParser->>Logger: Log error
UnstructuredParser-->>User: Yield structured error message
else No Exception
UnstructuredParser-->>User: Return parsed records
end
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
unit_tests/sources/declarative/requesters/error_handlers/backoff_strategies/test_wait_time_from_header.py (1)
40-40
: Good catch on using a raw string for the regex pattern! 🎯I notice we're using a raw string
r"([-+]?\d+)"
for this regex pattern - nice improvement! Should we also update the regex in the test case below ("[-+]?\d+"
) to be consistent? wdyt?- ("test_wait_time_fœrom_header_with_regex_no_match", "wait_time", "...", "[-+]?\d+", None), # noqa + ("test_wait_time_fœrom_header_with_regex_no_match", "wait_time", "...", r"[-+]?\d+", None), # noqaairbyte_cdk/sources/file_based/file_types/unstructured_parser.py (2)
391-413
: LGTM! Nice improvements to file type detectionThe error handling and fallback strategy look solid. The file pointer reset is a great catch! Just one tiny thought - would it make sense to add a debug log when falling back to extension-based detection? This could help with troubleshooting, wdyt?
if extension in EXT_TO_FILETYPE: + logger.debug(f"Falling back to extension-based detection for {remote_file.uri}") return EXT_TO_FILETYPE[extension]
191-191
: Consider using Optional for better type clarityThe
FileType | None
union type is correct, but would you consider usingOptional[FileType]
instead? It's more idiomatic in Python and communicates the same intent, wdyt?- filetype: FileType | None = self._get_filetype(file_handle, remote_file) + filetype: Optional[FileType] = self._get_filetype(file_handle, remote_file) - if filetype is None or filetype not in self._supported_file_types():Also applies to: 193-193
unit_tests/connector_builder/test_connector_builder_handler.py (1)
Line range hint
577-607
: Consider enhancing OAuth token refresh test coverage.The test effectively verifies the config update, but what do you think about adding more specific assertions to verify:
- The actual values in the refresh token response?
- The number of times the refresh token endpoint is called?
- The structure of the request made to the refresh token endpoint?
This would make the test more robust against potential regressions, wdyt?
# Example additional assertions: assert output.record.data["latest_config_update"] == { "credentials": { "refresh_token": "an updated refresh token" } } mock_refresh.assert_called_once() mock_refresh.assert_called_with( headers={"Content-Type": "application/x-www-form-urlencoded"}, data={ "grant_type": "refresh_token", "refresh_token": "a refresh token" } )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (5)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py
(6 hunks)pyproject.toml
(2 hunks)unit_tests/connector_builder/test_connector_builder_handler.py
(6 hunks)unit_tests/sources/declarative/requesters/error_handlers/backoff_strategies/test_header_helper.py
(1 hunks)unit_tests/sources/declarative/requesters/error_handlers/backoff_strategies/test_wait_time_from_header.py
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- unit_tests/sources/declarative/requesters/error_handlers/backoff_strategies/test_header_helper.py
🔇 Additional comments (8)
pyproject.toml (4)
72-75
: Helpful numpy version constraint comment! 👍
The comment explaining why numpy needs to be <2
is super helpful for future maintainers. Just wondering - since this PR is marked for debugging, are you seeing any specific numpy-related issues in the CI? Happy to help investigate if needed!
Line range hint 1-150
: General observation about PR purpose
Hey! 👋 I notice this PR is titled "Debugging problem - Do not merge" but the changes I'm seeing are primarily dependency updates (Python version constraint, NLTK upgrade, numpy comment). Would you mind clarifying if these dependency changes are related to the debugging effort? This would help reviewers better understand the context and purpose of these changes.
Also, since this is marked as "Do not merge", should we consider creating a separate PR for these dependency updates once the debugging is complete? wdyt? 🤔
32-32
: LGTM on Python version constraint! Just a thought...
The addition of <3.13
upper bound is a good practice for preventing future compatibility issues. However, since this PR is marked as "Debugging problem - Do not merge", I'm curious if this change is related to the debugging effort you're working on? wdyt?
72-72
: Question about NLTK version bump
I notice you're upgrading NLTK from 3.8.1 to 3.9.1. Since this PR is marked for debugging, is this version bump helping resolve a specific issue? Would love to understand the context! 🤔
✅ Verification successful
NLTK version bump appears to be for unit test fixes
From the git history and test files, I can see this change is part of debugging unit tests, specifically for the unstructured file parser which uses NLTK for text processing. The version bump from 3.8.1 to 3.9.1 appears to be related to fixing test issues.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for NLTK usage in the codebase
rg -l "import nltk|from nltk"
# Check the changelog for any relevant fixes
curl -s https://raw.githubusercontent.com/nltk/nltk/3.9.1/NEWS.md
Length of output: 180
Script:
#!/bin/bash
# Let's check the git history to see if there are any related changes or context
git log -p --all | rg -B 5 -A 5 "nltk.*3\.9\.1"
# Let's also check for any issues or discussions mentioning NLTK
git log --grep="nltk" --grep="NLTK" -i --all
# Let's see the actual usage in the test file
cat unit_tests/sources/file_based/scenarios/unstructured_scenarios.py
Length of output: 2280
Script:
#!/bin/bash
# Let's check the actual NLTK usage in the test file to understand the context
cat unit_tests/sources/file_based/scenarios/unstructured_scenarios.py
# Let's also check if there are any related changes in the PR description or commits
git show --name-status HEAD
Length of output: 67903
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)
418-423
: LGTM! Clear and informative error messages
The updated signature and error message formatting look great. The explicit handling of None
file types will be really helpful for debugging.
unit_tests/connector_builder/test_connector_builder_handler.py (3)
10-10
: LGTM! Clean import additions.
The new imports for Literal
and AirbyteStateBlob
are well-placed and properly utilized in the codebase.
Also applies to: 42-42
79-80
: LGTM! Improved state management with AirbyteStateBlob.
The change enhances type safety by properly encapsulating state data within AirbyteStateBlob
. The implementation is clean and consistent across both state fixtures.
Also applies to: 90-100
640-640
: LGTM! Precise return type annotation.
The addition of Literal[False]
as the return type annotation makes the method's contract more explicit and type-safe.
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py
Outdated
Show resolved
Hide resolved
@@ -186,7 +198,7 @@ def _read_file( | |||
remote_file, | |||
self._get_file_type_error_message(filetype), | |||
) | |||
if filetype in {FileType.MD, filetype is FileType.TXT}: | |||
if filetype in {FileType.MD, FileType.TXT}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
category_depth = dpath.get(el, "metadata/category_depth", default=1) or 1 | ||
if not isinstance(category_depth, int): | ||
category_depth = ( | ||
int(category_depth) if isinstance(category_depth, (str, float)) else 1 | ||
) | ||
heading_str = "#" * category_depth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
try: | ||
nltk.data.find("tokenizers/punkt.zip") | ||
nltk.data.find("tokenizers/punkt_tab.zip") | ||
except LookupError: | ||
nltk.download("punkt") | ||
nltk.download("punkt_tab") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Nice work, @aldogonzalez8 !
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Documentation