Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: async reader sometime stops reading messages without any errors. #538

Open
vgvoleg opened this issue Dec 23, 2024 · 0 comments
Open
Labels
bug Something isn't working

Comments

@vgvoleg
Copy link
Collaborator

vgvoleg commented Dec 23, 2024

Bug Report

YDB Python SDK version:

latest

Environment

linux

Current behavior:

Examples of errors in logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_call.py", line 452, in _consume_request_iterator
    await self._write(request)
  File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_call.py", line 489, in _write
    raise asyncio.InvalidStateError(_RPC_ALREADY_FINISHED_DETAILS)
asyncio.exceptions.InvalidStateError: RPC already finished.
Exception while consuming the request_iterator: <AioRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Internal error from Core"
	debug_error_string = "Failed "execute_batch": (<grpc._cython.cygrpc.SendMessageOperation object at 0x7f35b0e77a30>,)"
>
Exception while consuming the request_iterator: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "connections to all backends failing; last error: UNKNOWN: ipv6:%5B2a02:6b8:c41:21:0:1517:fdbf:63a5%5D:2135: Failed to connect to remote host: connect: Connection refused (111)"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"connections to all backends failing; last error: UNKNOWN: ipv6:%5B2a02:6b8:c41:21:0:1517:fdbf:63a5%5D:2135: Failed to connect to remote host: connect: Connection refused (111)", grpc_status:14, created_time:"2024-09-20T18:58:56.948845895+03:00"}"
>

Expected behavior:

Steps to reproduce:

Related code:

workaround with reconnect works fine

while True:
    its_time_to_die = (time.perf_counter() - start_time) > topic_client.worker_ttl
    if its_time_to_die:
        # periodically rebooting worker as a workaround for error (RPC already finished)
        raise Exception(
            client=topic_client, message=f"ttl ({topic_client.worker_ttl} sec) has expired"
        )

    try:
        try:
            batch: PublicBatch | None = await asyncio.wait_for(
                reader.receive_batch(), timeout=RECEIVE_MESSAGE_TIMEOUT
            )
        except asyncio.TimeoutError as exc:
            if isinstance(exc.__cause__, asyncio.CancelledError):
                # for cases when future is cancelling immediately for a long time
                # (e.g. while dc maintenance)
                # we need to sleep instead of calling receive_batch() like a ddos
                # TODO: backoff
                await asyncio.sleep(RECEIVE_MESSAGE_TIMEOUT)
            continue

        if batch:
            log_context["batch"] = str(batch)  # FIXME: potentially heavyweight
            if batch.alive:
                await self._process_batch(batch, log_context=log_context)
            if batch.alive:
                await asyncio.wait_for(reader.commit_with_ack(batch), timeout=RECEIVE_MESSAGE_TIMEOUT)

    except Exception as e:
        raise Exception(
            client=topic_client,
            message="YDB Internal Error. Failed to process batch",
            log_context=log_context,
        ) from e

Other information:

@vgvoleg vgvoleg added the bug Something isn't working label Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant