chore: Add templates apify integration tests #1109

Draft · wants to merge 17 commits into master
53 changes: 53 additions & 0 deletions .github/workflows/templates_e2e_tests.yaml
@@ -0,0 +1,53 @@
name: Templates end-to-end tests

on:
workflow_dispatch:
secrets:
APIFY_TEST_USER_API_TOKEN:
description: API token of the Python testing user on Apify
required: true

jobs:
end_to_end_tests:
name: End-to-end tests
strategy:
fail-fast: false

runs-on: "ubuntu-latest"
env:
python-version: "3.13"
node-version: "22"

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Setup node
uses: actions/setup-node@v4
with:
node-version: ${{ env.node-version }}

- name: Install dependencies
run: npm install -g apify-cli

- name: Set up Python ${{ env.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ env.python-version }}

# Poetry is installed so that poetry.lock can be patched with a custom wheel file for Poetry-based templates
- name: Install poetry
run: pipx install poetry

- name: Set up uv package manager
uses: astral-sh/setup-uv@v5
with:
python-version: ${{ env.python-version }}

- name: Install Python dependencies
run: make install-dev

- name: Run templates end-to-end tests
run: make e2e-templates-tests
env:
APIFY_TEST_USER_API_TOKEN: ${{ secrets.APIFY_TEST_USER_API_TOKEN }}
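
Because the workflow is triggered by `workflow_dispatch`, it can be started from the Actions tab or programmatically. Below is a minimal sketch of a programmatic trigger via the GitHub REST API; `OWNER`, `REPO`, and `GITHUB_TOKEN` are placeholders, and the `APIFY_TEST_USER_API_TOKEN` secret is taken from the repository's configured secrets rather than from this request:

```python
# Sketch: manually dispatch the workflow via the GitHub REST API.
# OWNER, REPO, and GITHUB_TOKEN are placeholders, not values from this PR.
import json
import urllib.request

request = urllib.request.Request(
    'https://api.github.com/repos/OWNER/REPO/actions/workflows/templates_e2e_tests.yaml/dispatches',
    data=json.dumps({'ref': 'master'}).encode(),
    headers={
        'Accept': 'application/vnd.github+json',
        'Authorization': 'Bearer GITHUB_TOKEN',
    },
    method='POST',
)
with urllib.request.urlopen(request) as response:  # GitHub returns HTTP 204 on success.
    assert response.status == 204
```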
16 changes: 11 additions & 5 deletions CONTRIBUTING.md
@@ -72,13 +72,19 @@ To run unit tests with HTML coverage report:
make unit-tests-cov
```

<!--
TODO:
## End-to-end tests

## Integration tests
Prerequisites for running end-to-end tests:
- [apify-cli](https://docs.apify.com/cli/docs/installation) installed
- `apify-cli` available on your `PATH`
- your [Apify token](https://docs.apify.com/platform/integrations/api#api-token) available in the `APIFY_TEST_USER_API_TOKEN` environment variable

...
-->

To run end-to-end tests:

```sh
make e2e-templates-tests
```

## Documentation

3 changes: 3 additions & 0 deletions Makefile
@@ -35,6 +35,9 @@ unit-tests-cov:
integration-tests:
uv run pytest --numprocesses=$(INTEGRATION_TESTS_CONCURRENCY) --verbose tests/integration

e2e-templates-tests:
uv run pytest --numprocesses=$(INTEGRATION_TESTS_CONCURRENCY) --verbose tests/e2e/project_template

format:
uv run ruff check --fix
uv run ruff format
8 changes: 6 additions & 2 deletions pyproject.toml
@@ -92,6 +92,8 @@ crawlee = "crawlee._cli:cli"

[dependency-groups]
dev = [
"apify_client", # For integration tests.
"build~=1.2.2", # For integration tests.
"mypy~=1.15.0",
"pre-commit~=4.2.0",
"proxy-py~=2.4.0",
@@ -119,7 +121,7 @@ packages = ["src/crawlee"]
[tool.ruff]
line-length = 120
include = ["src/**/*.py", "tests/**/*.py", "docs/**/*.py", "website/**/*.py"]
extend-exclude = ["project_template"]
extend-exclude = ["src/crawlee/project_template"]

[tool.ruff.lint]
select = ["ALL"]
@@ -207,7 +209,7 @@ timeout = 1200
[tool.mypy]
python_version = "3.9"
plugins = ["pydantic.mypy"]
exclude = ["project_template"]
exclude = ["src/crawlee/project_template"]
files = ["src", "tests", "docs", "website"]
check_untyped_defs = true
disallow_incomplete_defs = true
@@ -229,6 +231,8 @@ module = [
"jaro", # Untyped and stubs not available
"loguru", # Example code shows integration of loguru and crawlee for JSON logging.
"sklearn.linear_model", # Untyped and stubs not available
"cookiecutter.*", # Untyped and stubs not available
"inquirer.*", # Untyped and stubs not available
]
ignore_missing_imports = true

9 changes: 5 additions & 4 deletions src/crawlee/_cli.py
@@ -7,10 +7,10 @@
from typing import Annotated, Optional, cast

try:
import inquirer # type: ignore[import-untyped]
import inquirer
import typer
from cookiecutter.main import cookiecutter # type: ignore[import-untyped]
from inquirer.render.console import ConsoleRender # type: ignore[import-untyped]
from cookiecutter.main import cookiecutter
from inquirer.render.console import ConsoleRender
from rich.progress import Progress, SpinnerColumn, TextColumn
except ModuleNotFoundError as exc:
raise ImportError(
@@ -21,7 +21,8 @@
cli = typer.Typer(no_args_is_help=True)

template_directory = importlib.resources.files('crawlee') / 'project_template'
cookiecutter_json = json.load((template_directory / 'cookiecutter.json').open())
with open(str(template_directory / 'cookiecutter.json')) as f:
cookiecutter_json = json.load(f)

crawler_choices = cookiecutter_json['crawler_type']
http_client_choices = cookiecutter_json['http_client']
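
The updated `_cli.py` opens the packaged `cookiecutter.json` by converting the `importlib.resources` traversable to a filesystem path, which works for regular on-disk installs. For reference, a sketch of an alternative (not part of this PR) that also covers packages imported from a zip archive would use `importlib.resources.as_file`:

```python
# Sketch of an alternative (not in this PR): as_file materializes the resource
# on disk even when the package is imported from a zip archive.
import json
from importlib.resources import as_file, files

template_directory = files('crawlee') / 'project_template'

with as_file(template_directory / 'cookiecutter.json') as path:
    cookiecutter_json = json.loads(path.read_text())
```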
1 change: 0 additions & 1 deletion src/crawlee/project_template/hooks/post_gen_project.py
@@ -2,7 +2,6 @@
import subprocess
from pathlib import Path

Path('_pyproject.toml').rename('pyproject.toml')

# % if cookiecutter.package_manager in ['poetry', 'uv']
Path('requirements.txt').unlink()
Dockerfile (Poetry-based project template)
@@ -20,8 +20,7 @@ RUN pip install -U pip setuptools \
# Second, copy just poetry.lock and pyproject.toml into the Actor image,
# since those should be the only files that affect the dependency install in the next step,
# in order to speed up the build
COPY pyproject.toml ./
COPY poetry.lock ./
COPY pyproject.toml poetry.lock ./

# Install the dependencies
RUN echo "Python version:" \
44 changes: 44 additions & 0 deletions tests/e2e/conftest.py
@@ -0,0 +1,44 @@
import subprocess
from pathlib import Path

import pytest
from filelock import FileLock

_CRAWLEE_ROOT_PATH = Path(__file__).parent.parent.parent.resolve()


@pytest.fixture(scope='session')
def crawlee_wheel_path(tmp_path_factory: pytest.TempPathFactory, testrun_uid: str) -> Path:
"""Build the package wheel if it hasn't been built yet, and return the path to the wheel."""
# Make sure the wheel is not being built concurrently by multiple pytest-xdist runners,
# by locking the build process with a file lock.
with FileLock(tmp_path_factory.getbasetemp().parent / 'crawlee_wheel_build.lock'):
# Make sure the wheel is built exactly once across all the pytest-xdist runners,
# using an indicator file that records that the wheel was already built.
was_wheel_built_this_test_run_file = tmp_path_factory.getbasetemp() / f'wheel_was_built_in_run_{testrun_uid}'
if not was_wheel_built_this_test_run_file.exists():
subprocess.run(
args='python -m build',
cwd=_CRAWLEE_ROOT_PATH,
shell=True,
check=True,
capture_output=True,
)
was_wheel_built_this_test_run_file.touch()

# Read the current package version, necessary for getting the right wheel filename.
pyproject_toml_file = (_CRAWLEE_ROOT_PATH / 'pyproject.toml').read_text(encoding='utf-8')
for line in pyproject_toml_file.splitlines():
if line.startswith('version = '):
delim = '"' if '"' in line else "'"
crawlee_version = line.split(delim)[1]
break
else:
raise RuntimeError('Unable to find version string.')

wheel_path = _CRAWLEE_ROOT_PATH / 'dist' / f'crawlee-{crawlee_version}-py3-none-any.whl'

# Sanity check that the expected wheel file was actually built.
assert wheel_path.exists()

return wheel_path
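
The fixture reads the package version by scanning `pyproject.toml` line by line, which avoids any TOML parser dependency. For comparison, a sketch that assumes Python 3.11+ could read the same value with the standard-library `tomllib`:

```python
# Sketch (assumes Python 3.11+): read the package version via tomllib instead
# of scanning pyproject.toml line by line.
import tomllib
from pathlib import Path

_CRAWLEE_ROOT_PATH = Path(__file__).parent.parent.parent.resolve()

with (_CRAWLEE_ROOT_PATH / 'pyproject.toml').open('rb') as f:
    crawlee_version = tomllib.load(f)['project']['version']
```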
87 changes: 87 additions & 0 deletions tests/e2e/project_template/test_static_crawlers_templates.py
@@ -0,0 +1,87 @@
import os
import re
import subprocess
from pathlib import Path

import pytest
from apify_client import ApifyClientAsync
from cookiecutter.main import cookiecutter

from crawlee._cli import default_start_url, template_directory
from crawlee._utils.crypto import crypto_random_object_id
from tests.e2e.project_template.utils import patch_crawlee_version_in_pyproject_toml_based_project

# To run these tests locally, make sure you have apify-cli installed and available on your PATH.
# https://docs.apify.com/cli/docs/installation


@pytest.mark.parametrize('http_client', ['httpx', 'curl-impersonate'])
@pytest.mark.parametrize('crawler_type', ['parsel', 'beautifulsoup'])
@pytest.mark.parametrize('package_manager', ['uv', 'poetry'])
async def test_static_crawler_actor_at_apify(
tmp_path: Path, crawlee_wheel_path: Path, package_manager: str, crawler_type: str, http_client: str
) -> None:
# Generate new actor name
actor_name = f'crawlee-python-template-integration-test-{crypto_random_object_id(8).lower()}'

# Create project from template
cookiecutter(
template=str(template_directory),
no_input=True,
extra_context={
'project_name': actor_name,
'package_manager': package_manager,
'crawler_type': crawler_type,
'http_client': http_client,
'enable_apify_integration': True,
'start_url': default_start_url,
},
accept_hooks=False, # Do not install the newly created environment.
output_dir=tmp_path,
)

patch_crawlee_version_in_pyproject_toml_based_project(
project_path=tmp_path / actor_name, wheel_path=crawlee_wheel_path
)

# Build the Actor using the same sequence of CLI commands a user would run
subprocess.run( # noqa: ASYNC221, S603
['apify', 'login', '-t', os.environ['APIFY_TEST_USER_API_TOKEN']], # noqa: S607
capture_output=True,
check=True,
cwd=tmp_path / actor_name,
)
subprocess.run(['apify', 'init', '-y', actor_name], capture_output=True, check=True, cwd=tmp_path / actor_name) # noqa: ASYNC221, S603, S607

build_process = subprocess.run(['apify', 'push'], capture_output=True, check=False, cwd=tmp_path / actor_name) # noqa: ASYNC221, S603, S607
# Get the Actor ID from the build log. Note that apify-cli writes the build log to stderr, not stdout.
actor_id_regexp = re.compile(r'https://console\.apify\.com/actors/(.*)#/builds/\d*\.\d*\.\d*')
actor_id_matches = actor_id_regexp.findall(build_process.stderr.decode())
assert actor_id_matches, f'Actor ID not found in the build log: {build_process.stderr.decode()}'
actor_id = actor_id_matches[0]

client = ApifyClientAsync(
token=os.getenv('APIFY_TEST_USER_API_TOKEN'), api_url=os.getenv('APIFY_INTEGRATION_TESTS_API_URL')
)
actor = client.actor(actor_id)

# Run actor
try:
assert build_process.returncode == 0
started_run_data = await actor.start()
actor_run = client.run(started_run_data['id'])

finished_run_data = await actor_run.wait_for_finish()
actor_run_log = await actor_run.log().get()
finally:
# Delete the actor once it is no longer needed.
await actor.delete()

# Asserts
additional_run_info = f'Full actor run log: {actor_run_log}'
assert actor_run_log
assert finished_run_data
assert finished_run_data['status'] == 'SUCCEEDED', additional_run_info
assert (
'Crawler.stop() was called with following reason: The crawler has reached its limit of 50 requests per crawl.'
) in actor_run_log, additional_run_info
assert int(re.findall(r'requests_finished\s*│\s*(\d*)', actor_run_log)[-1]) >= 50, additional_run_info
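
The final assertion extracts the `requests_finished` counter from the statistics table in the Actor run log. A standalone illustration of that parsing is below; the sample line is fabricated to match the format the regex expects:

```python
# Standalone illustration of the log parsing above; the sample line is fabricated.
import re

sample_log = '│ requests_finished │ 53 │'
requests_finished = int(re.findall(r'requests_finished\s*│\s*(\d*)', sample_log)[-1])
assert requests_finished == 53
```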
52 changes: 52 additions & 0 deletions tests/e2e/project_template/utils.py
@@ -0,0 +1,52 @@
import re
import shutil
import subprocess
from pathlib import Path


def patch_crawlee_version_in_pyproject_toml_based_project(project_path: Path, wheel_path: Path) -> None:
"""Ensure that the test is using current version of the crawlee from the source and not from Pypi."""
# Copy prepared .whl file
shutil.copy(wheel_path, project_path)

# Get any extras
with open(project_path / 'pyproject.toml') as f:
pyproject = f.read()
crawlee_extras_matches = re.findall(r'crawlee(\[.*\])', pyproject)
crawlee_extras = crawlee_extras_matches[0] if crawlee_extras_matches else ''

# Inject the crawlee wheel file into the Docker image and update the project to depend on it.
with open(project_path / 'Dockerfile') as f:
modified_lines = []
for line in f:
modified_lines.append(line)
if line.startswith('COPY pyproject.toml'):
if 'uv.lock' in line:
package_manager = 'uv'
elif 'poetry.lock' in line:
package_manager = 'poetry'
else:
raise RuntimeError('This does not look like a uv or poetry based project.')

# Create the lock file that the Dockerfile expects to exist (even though it will be patched inside the container).
subprocess.run(
args=[package_manager, 'lock'],
cwd=str(project_path),
check=True,
capture_output=True,
)

# Add commands that copy the .whl into the Docker image and update the project with it.
# The patching is done in the Dockerfile because poetry does not properly support relative paths
# for wheel packages, so the absolute path (inside the container) must be generated by running
# the `add` command in the container.
modified_lines.extend(
[
f'COPY {wheel_path.name} ./',
# If the crawlee version was not bumped, poetry may lazily keep the already-installed crawlee,
# so make sure that one is patched as well.
f'RUN pip install ./{wheel_path.name}{crawlee_extras} --force-reinstall',
f'RUN {package_manager} add ./{wheel_path.name}{crawlee_extras}',
f'RUN {package_manager} lock',
]
)
with open(project_path / 'Dockerfile', 'w') as f:
f.write('\n'.join(modified_lines))
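
For illustration, given a uv-based project and a wheel named `crawlee-0.6.0-py3-none-any.whl` with a `[beautifulsoup]` extra (both invented for this example), the lines appended to the Dockerfile by the function above would be:

```python
# Hypothetical result of the patching above for a uv project; the wheel name
# and extra are invented for illustration.
appended_dockerfile_lines = [
    'COPY crawlee-0.6.0-py3-none-any.whl ./',
    'RUN pip install ./crawlee-0.6.0-py3-none-any.whl[beautifulsoup] --force-reinstall',
    'RUN uv add ./crawlee-0.6.0-py3-none-any.whl[beautifulsoup]',
    'RUN uv lock',
]
```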
1 change: 0 additions & 1 deletion tests/integration/README.md

This file was deleted.

73 changes: 73 additions & 0 deletions uv.lock