feat: add Session binding capability via session_id in Request #1086 (Merged)
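In brief (see the full examples below): passing `session_id` to `Request.from_url` binds the request to one specific session from the pool. A sketch of the call, where `session` is a hypothetical, already-created `Session` instance:

request = Request.from_url('https://example.org/', session_id=session.id)

If the bound session has been retired from the pool by the time the request runs, the crawler raises `RequestCollisionError`, which the first example below handles.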
Changes from all commits (17 commits):
38b5531 basic implementation for bind request to session (Mantisus)
08a7d01 update docs (Mantisus)
a67d3b3 add tests (Mantisus)
039f12f move session_id to crawlee_data (Mantisus)
88c0e8c resolve (Mantisus)
7fb5fde add docs and examples (Mantisus)
300cf4f Polishment (vdusek)
6037829 Update src/crawlee/_request.py (vdusek)
5e94912 Update src/crawlee/_request.py (vdusek)
25a21a0 Update src/crawlee/crawlers/_basic/_basic_crawler.py (vdusek)
14b8974 add session_id in extended_unique_key (Mantisus)
d633930 resolve (Mantisus)
b5fc9a4 remove empty part from extended if session_id empty (Mantisus)
1a8cd22 Update docs/guides/code_examples/session_management/multi_sessions_ht… (Mantisus)
b5d1dc5 fix typo (Mantisus)
9557ac5 rename `_raise_request_collision` to `_check_request_collision` (Mantisus)
448f9e6 fix typo (Mantisus)
docs/guides/code_examples/session_management/multi_sessions_http.py (new file, 85 additions)
import asyncio
from datetime import timedelta
from itertools import count
from typing import Callable

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawlingContext, HttpCrawler, HttpCrawlingContext
from crawlee.errors import RequestCollisionError
from crawlee.sessions import Session, SessionPool


# Define a function for creating sessions with simple logic for unique `id` generation.
# This is necessary if you need to specify a particular session for the first request,
# for example during authentication.
def create_session_function() -> Callable[[], Session]:
    counter = count()

    def create_session() -> Session:
        return Session(
            id=str(next(counter)),
            max_usage_count=999_999,
            max_age=timedelta(hours=999_999),
            max_error_score=100,
            blocked_status_codes=[403],
        )

    return create_session


async def main() -> None:
    crawler = HttpCrawler(
        # Adjust request limits according to your pool size.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=500),
        # Requests are bound to specific sessions, so no rotation is needed.
        max_session_rotations=0,
        session_pool=SessionPool(
            max_pool_size=10, create_session_function=create_session_function()
        ),
    )

    @crawler.router.default_handler
    async def basic_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Initialize the session and bind the next request to this session if needed.
    @crawler.router.handler(label='session_init')
    async def session_init(context: HttpCrawlingContext) -> None:
        next_requests = []
        if context.session:
            context.log.info(f'Init session {context.session.id}')
            next_request = Request.from_url(
                'https://placeholder.dev', session_id=context.session.id
            )
            next_requests.append(next_request)

        await context.add_requests(next_requests)

    # Handle errors when a session is blocked and no longer available in the pool
    # when attempting to execute requests bound to it.
    @crawler.failed_request_handler
    async def error_processing(context: BasicCrawlingContext, error: Exception) -> None:
        if isinstance(error, RequestCollisionError) and context.session:
            context.log.error(
                f'Request {context.request.url} failed, because the bound '
                'session is unavailable'
            )

    # Create a pool of requests bound to their respective sessions.
    # Use `always_enqueue=True` if session initialization happens on a non-unique
    # address, such as the site's main page.
    init_requests = [
        Request.from_url(
            'https://example.org/',
            label='session_init',
            session_id=str(session_id),
            use_extended_unique_key=True,
        )
        for session_id in range(1, 11)
    ]

    await crawler.run(init_requests)


if __name__ == '__main__':
    asyncio.run(main())
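The `error_processing` handler above only logs the collision. One possible recovery, sketched here under the assumption that re-running the URL on any live session is acceptable (this is not part of the PR's example), is to re-enqueue the request without a session binding and bypass deduplication with `always_enqueue=True`:

@crawler.failed_request_handler
async def retry_unbound(context: BasicCrawlingContext, error: Exception) -> None:
    if isinstance(error, RequestCollisionError):
        # Re-enqueue the same URL unbound; `always_enqueue=True` gives it a
        # fresh unique key so deduplication does not drop the retry.
        await context.add_requests(
            [Request.from_url(context.request.url, always_enqueue=True)]
        )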
docs/guides/code_examples/session_management/one_session_http.py (new file, 56 additions)
import asyncio
from datetime import timedelta

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BasicCrawlingContext, HttpCrawler, HttpCrawlingContext
from crawlee.errors import SessionError
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = HttpCrawler(
        # Limit requests per minute to reduce the chance of being blocked.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
        # Disable session rotation.
        max_session_rotations=0,
        session_pool=SessionPool(
            # Only one session in the pool.
            max_pool_size=1,
            create_session_settings={
                # High value for the session usage limit.
                'max_usage_count': 999_999,
                # High value for the session lifetime.
                'max_age': timedelta(hours=999_999),
                # A high score allows the session to encounter more errors
                # before crawlee decides the session is blocked.
                # Make sure you know how to handle these errors.
                'max_error_score': 100,
                # A 403 status usually indicates you're already blocked.
                'blocked_status_codes': [403],
            },
        ),
    )

    # Basic request handling logic.
    @crawler.router.default_handler
    async def basic_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Handler for session initialization (authentication, initial cookies, etc.).
    @crawler.router.handler(label='session_init')
    async def session_init(context: HttpCrawlingContext) -> None:
        if context.session:
            context.log.info(f'Init session {context.session.id}')

    # Monitor if our session gets blocked and explicitly stop the crawler.
    @crawler.error_handler
    async def error_processing(context: BasicCrawlingContext, error: Exception) -> None:
        if isinstance(error, SessionError) and context.session:
            context.log.info(f'Session {context.session.id} blocked')
            crawler.stop()

    await crawler.run([Request.from_url('https://example.org/', label='session_init')])


if __name__ == '__main__':
    asyncio.run(main())
Review discussion:
Shouldn't `session_id` be used for `unique_key` computation? I expect that users might get hindered by deduplication if they try to re-enqueue a failed request with a different session. CC @vdusek - you wrote a big part of the unique key functionality.
Yes, deduplication will affect this. But I expect that users will use existing mechanisms to return a `Request` to the `Queue`, avoiding deduplication, by passing either `unique_key` or `always_enqueue=True`.
Good point! Currently, it infers the `unique_key` from the URL, method, headers, and payload (in its extended form). You can, of course, use `session_id` together with `always_enqueue` and it will work, but that feels like a workaround to me. I believe we should include the `session_id` in the extended `unique_key` computation.
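A minimal sketch of the behavior this thread converges on (implemented in commit 14b8974, "add session_id in extended_unique_key"); the assertions reflect my reading of the merged behavior, not verified output:

from crawlee import Request

# Without the extended unique key, requests differing only in `session_id`
# share a unique key and are deduplicated to a single queue entry.
plain_a = Request.from_url('https://example.org/', session_id='1')
plain_b = Request.from_url('https://example.org/', session_id='2')
assert plain_a.unique_key == plain_b.unique_key

# With `use_extended_unique_key=True`, the session binding becomes part of
# the extended unique key, so both requests survive deduplication.
bound_a = Request.from_url(
    'https://example.org/', session_id='1', use_extended_unique_key=True
)
bound_b = Request.from_url(
    'https://example.org/', session_id='2', use_extended_unique_key=True
)
assert bound_a.unique_key != bound_b.unique_key

# The existing bypasses mentioned above: an explicit key, or a generated
# one via `always_enqueue=True`.
retry_a = Request.from_url('https://example.org/', unique_key='retry-1')
retry_b = Request.from_url('https://example.org/', always_enqueue=True)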