refactor!: Introduce new storage clients #1107

Draft
wants to merge 4 commits into
base: master

vdusek
Collaborator

@vdusek vdusek commented Mar 19, 2025

Draft version

  • This is a draft PR; only dataset-related components are implemented at this stage.
  • Let's discuss the current state before proceeding with the KVS and RQ implementations.

Description

  • In this first iteration, the following have been updated or implemented:
    • DatasetClient,
    • FileSystemDatasetClient,
    • MemoryDatasetClient,
    • and the Dataset has been updated accordingly.
  • Much of the functionality previously in the Dataset has been removed and will be implemented in the specific storage clients instead.
  • The memory client is now split into file system and memory implementations, eliminating the need for a persist_storage flag.
  • All collection clients have been removed.
  • Storage client method names have been aligned with storage naming.
  • Users are now warned when using method arguments that are not supported.
  • Creation management in the storage clients has been removed; creation management in the storages/ module will be removed later.
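
To make the split into memory and file system implementations concrete, here is a minimal sketch of one abstract dataset client with two implementations. The class names mirror the ones listed above, but the method signatures, the constructor arguments, and the one-JSON-file-per-item layout are assumptions made for illustration; they are not the code in this PR.

```python
import asyncio
import json
import tempfile
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator
from pathlib import Path
from typing import Any


class DatasetClient(ABC):
    """Hypothetical common interface; the real signatures may differ."""

    @abstractmethod
    async def push_data(self, item: dict[str, Any]) -> None: ...

    @abstractmethod
    def iterate(self) -> AsyncIterator[dict[str, Any]]: ...


class MemoryDatasetClient(DatasetClient):
    """Keeps items in a list; nothing is written to disk."""

    def __init__(self) -> None:
        self._items: list[dict[str, Any]] = []

    async def push_data(self, item: dict[str, Any]) -> None:
        self._items.append(item)

    async def iterate(self) -> AsyncIterator[dict[str, Any]]:
        for item in self._items:
            yield item


class FileSystemDatasetClient(DatasetClient):
    """Writes each item to its own JSON file (assumed layout).

    Blocking file I/O is used here only to keep the sketch short.
    """

    def __init__(self, directory: Path) -> None:
        self._directory = directory
        self._directory.mkdir(parents=True, exist_ok=True)
        self._count = len(list(self._directory.glob('*.json')))

    async def push_data(self, item: dict[str, Any]) -> None:
        self._count += 1
        # Zero-padded file names keep lexicographic order equal to insertion order.
        path = self._directory / f'{self._count:09d}.json'
        path.write_text(json.dumps(item), encoding='utf-8')

    async def iterate(self) -> AsyncIterator[dict[str, Any]]:
        for path in sorted(self._directory.glob('*.json')):
            yield json.loads(path.read_text(encoding='utf-8'))


async def collect(client: DatasetClient) -> list[dict[str, Any]]:
    """Reads all items from any client through the shared interface."""
    return [item async for item in client.iterate()]


async def demo() -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    memory_client = MemoryDatasetClient()
    with tempfile.TemporaryDirectory() as tmp:
        fs_client = FileSystemDatasetClient(Path(tmp) / 'default')
        for client in (memory_client, fs_client):
            await client.push_data({'name': 'John'})
            await client.push_data({'name': 'John', 'age': 20})
        return await collect(memory_client), await collect(fs_client)


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

With this shape the choice of persistence is made by picking a client class, so no persist_storage-style flag is needed on either implementation.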

Suggested Dataset breaking changes

  • iterate_items → iterate method
  • storage_object → metadata property
  • get_info → metadata property
  • check_and_serialize method has been removed - The clients should handle serialization.
  • from_storage_object method has been removed - Use the open method with name and/or id instead.
  • set_metadata method has been removed - Do we want to support it (e.g. for renaming)?

Issues

Todo

  • I am not sure about the wrapping storage clients - StorageClient, file_system_storage_client, memory_storage_client.
  • Implement KVS and RQ once there is a consensus on the final form.

Usage example

import asyncio

from crawlee.storage_clients import file_system_storage_client
from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open(
        purge_on_start=False,
        storage_client=file_system_storage_client,
    )
    print(f'default dataset - ID: {dataset.id}, name: {dataset.name}')

    await dataset.push_data({'name': 'John'})
    await dataset.push_data({'name': 'John', 'age': 20})
    await dataset.push_data({})

    dataset_with_name = await Dataset.open(
        name='my_dataset',
        storage_client=file_system_storage_client,
    )
    print(f'named dataset - ID: {dataset_with_name.id}, name: {dataset_with_name.name}')

    await dataset_with_name.push_data([{'age': 30}, {'age': 25}])

    print('Default dataset items:')
    async for item in dataset.iterate(skip_empty=True):
        print(item)

    print('Named dataset items:')
    async for item in dataset_with_name.iterate():
        print(item)

    items = await dataset.get_data()
    print(items)


if __name__ == '__main__':
    asyncio.run(main())

Testing

  • Adjust existing tests and add new ones if necessary once there is agreement on the final form.

Checklist

  • CI passed

@github-actions github-actions bot added this to the 110th sprint - Tooling team milestone Mar 19, 2025
@github-actions github-actions bot added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Mar 19, 2025
@vdusek vdusek marked this pull request as draft March 19, 2025 16:54
@vdusek vdusek changed the title from "Memory storage refactor" to "refactor!: Introduce new storage clients" Mar 19, 2025
@vdusek vdusek added the enhancement (New feature or request.) and debt (Code quality improvement or decrease of technical debt.) labels and removed the enhancement (New feature or request.) label Mar 19, 2025
invalid = [arg for arg in unsupported_args if arg not in (False, None)]
if invalid:
    logger.warning(
        f'The arguments {invalid} of iterate_items are not supported by the {self.__class__.__name__} client.'
Collaborator

f'The arguments {invalid} of get_data ...
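
One way to avoid this kind of copy-paste slip would be to pass the method name into a shared helper instead of hard-coding it in each message. A minimal sketch follows; `warn_unsupported_args` and its parameters are hypothetical names, not part of this PR, and the filter mirrors the `not in (False, None)` check from the snippet above.

```python
import logging

logger = logging.getLogger(__name__)


def warn_unsupported_args(client_name: str, method_name: str, **unsupported_args: object) -> list[str]:
    """Logs a warning naming the calling method and returns the offending argument names.

    Hypothetical helper: values equal to False or None are treated as "not set".
    """
    invalid = [name for name, value in unsupported_args.items() if value not in (False, None)]
    if invalid:
        logger.warning(
            f'The arguments {invalid} of {method_name} are not supported by the {client_name} client.'
        )
    return invalid


if __name__ == '__main__':
    logging.basicConfig(level=logging.WARNING)
    # Only `clean` is set to a non-default value, so only it is reported.
    warn_unsupported_args('MemoryDatasetClient', 'get_data', clean=True, unwind=None, flatten=False)
```

Each client method would then call the helper with its own name, so the message cannot drift out of sync with the method that emits it.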


Args:
    kwargs: Keyword arguments for the storage client method.
    offset: Skips the specified number of items at the start.
Collaborator

It would be good to explicitly say in what order some arguments take effect to avoid misinterpretations.

For example: offset + desc

  • reverse and then offset
    or
  • offset and then reverse?
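
To illustrate why the order matters, here is a small sketch on a plain list. The two helper functions are hypothetical stand-ins for the two interpretations above; they do not describe the behavior of any client in this PR.

```python
def reverse_then_offset(items: list[int], offset: int) -> list[int]:
    """Interpretation 1: sort descending first, then skip `offset` items."""
    return list(reversed(items))[offset:]


def offset_then_reverse(items: list[int], offset: int) -> list[int]:
    """Interpretation 2: skip `offset` items first, then sort descending."""
    return list(reversed(items[offset:]))


items = [1, 2, 3, 4, 5]
print(reverse_then_offset(items, 2))  # [3, 2, 1]
print(offset_then_reverse(items, 2))  # [5, 4, 3]
```

The same arguments return disjoint sets of items here, which is why the docstring should state which order applies.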

@vdusek vdusek force-pushed the memory-storage-refactor branch from 1f433c8 to ec32ec9 Compare March 20, 2025 16:12