
Add scan support #41


Open
ChenShuai1981 opened this issue Dec 17, 2022 · 5 comments
Labels: question (Further information is requested)

@ChenShuai1981

ChenShuai1981 commented Dec 17, 2022

Could you provide an example of Http Periodically Scan Source (not lookup source)? Does it support renew access token after expiration?

@kristoffSC
Collaborator

kristoffSC commented Dec 18, 2022

Hi @ChenShuai1981
Scan source is currently not supported by this connector, hence no example is available :) For now we have only the lookup source. This would be a great feature though; would you like to contribute? :)

The proper Flink interfaces would have to be implemented.

This feature would be a nice one, though it would be very "client specific".

@davidradl
Contributor

@ChenShuai1981 the lookup support that exists currently ends up issuing gets, puts or posts on single records. For the scan to work, I suspect we would need to issue searches, and get involved with paging the results. This could really impact the performance of a scan, as we could end up effectively doing table scans, unless we could do predicate pushdown.
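
For illustration only (this is not existing connector code), simple equality predicates could be pushed down into HTTP query parameters via Flink's SupportsFilterPushDown ability; the FilterToQueryParams helper below is a hypothetical sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.flink.table.expressions.CallExpression;
import org.apache.flink.table.expressions.FieldReferenceExpression;
import org.apache.flink.table.expressions.ResolvedExpression;
import org.apache.flink.table.expressions.ValueLiteralExpression;
import org.apache.flink.table.functions.BuiltInFunctionDefinitions;

// Hypothetical helper: turns "field = 'literal'" predicates into query
// parameters; anything it cannot handle is left for Flink to apply.
class FilterToQueryParams {

    static Map<String, String> toQueryParams(
            List<ResolvedExpression> filters, List<ResolvedExpression> remaining) {
        Map<String, String> params = new HashMap<>();
        for (ResolvedExpression filter : filters) {
            if (!tryAccept(filter, params)) {
                remaining.add(filter);
            }
        }
        return params;
    }

    // This sketch only accepts EQUALS(fieldRef, string literal).
    private static boolean tryAccept(ResolvedExpression filter, Map<String, String> params) {
        if (!(filter instanceof CallExpression)) {
            return false;
        }
        CallExpression call = (CallExpression) filter;
        if (call.getFunctionDefinition() != BuiltInFunctionDefinitions.EQUALS) {
            return false;
        }
        List<ResolvedExpression> children = call.getResolvedChildren();
        if (children.size() != 2
                || !(children.get(0) instanceof FieldReferenceExpression)
                || !(children.get(1) instanceof ValueLiteralExpression)) {
            return false;
        }
        String field = ((FieldReferenceExpression) children.get(0)).getName();
        Optional<String> value =
                ((ValueLiteralExpression) children.get(1)).getValueAs(String.class);
        value.ifPresent(v -> params.put(field, v));
        return value.isPresent();
    }
}
```

The accepted parameters would then be appended to the scan request URL, and everything left in `remaining` would still be filtered by Flink after the fetch.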

@ChenShuai1981
Author

> @ChenShuai1981 the lookup support that exists currently ends up issuing gets, puts or posts on single records. For the scan to work, I suspect we would need to issue searches, and get involved with paging the results. This could really impact the performance of a scan, as we could end up effectively doing table scans, unless we could do predicate pushdown.

Yes, you are right. Since the content provider updates information irregularly, we have to periodically send GET/POST requests to fetch it and sync it into our database. Scenarios include web crawling and system integration. Generally speaking, if the result set is too large, the provider will return a streaming response.

@davidradl
Contributor

I am prototyping adding scan support. The reasons I think this is useful are:

  • this will make this connector more comprehensive - so it could realistically be contributed to Flink. I hope to do that in a FLIP once the scan support is there.
  • we have had issues with the lookup connector where sometimes the predicates come through as filters rather than lookup keys, e.g. issue 143.
  • I feel that issue HTTP-99 and its PR (HTTP-99 Add support for generic json and URL query creator #149) lay down a more comprehensive way to drive the HTTP connector without needing custom Java.

The design I am thinking of is:

  • keep the polling factory and the like as-is
  • keep the query creators as-is. The only difference between what we need for a scan query and a lookup query is that the lookup query needs the lookup keys.
  • introduce a new getScanRuntimeProvider, based on the Flink example
  • the socket example uses the HTTP client in a streaming way. In order to reuse the existing polling factory and associated customization, I am looking to call the polling client's pull() in the pollNext of the SourceReader (see the sketch after this list).
  • implement an equivalent of the fix I did for JDBC https://issues.apache.org/jira/browse/FLINK-33365 so the lookup code gets access to the filters
  • As per the socket example, there will be no parallelism, split support or watermark support in the source. I assume we will get standard watermark behaviour implemented after the source, as per non-Kafka sources.
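
To make the getScanRuntimeProvider and pollNext points concrete, here is a minimal sketch of the wiring I have in mind; HttpScanTableSource, HttpScanSource and the exact pull() signature are assumptions for illustration, not the final code:

```java
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceProvider;

// Hypothetical table source exposing the scan runtime provider.
public class HttpScanTableSource implements ScanTableSource {

    // Assumed FLIP-27 Source wrapping the existing polling client and query creator.
    private final HttpScanSource source;

    public HttpScanTableSource(HttpScanSource source) {
        this.source = source;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
        // Single-parallelism FLIP-27 source, as discussed above.
        return SourceProvider.of(source);
    }

    @Override
    public DynamicTableSource copy() {
        return new HttpScanTableSource(source);
    }

    @Override
    public String asSummaryString() {
        return "HTTP scan source";
    }
}
```

And inside the reader created by that source, pollNext would reuse the existing polling client rather than streaming over a raw socket (method shown in isolation; the rest of the SourceReader is omitted, and the pull() call is an assumed signature):

```java
// Inside the hypothetical HttpScanSourceReader (a SourceReader<RowData, ...>):
@Override
public InputStatus pollNext(ReaderOutput<RowData> output) {
    // Assumed: pull() issues one HTTP request and returns the decoded rows.
    Collection<RowData> rows = pollingClient.pull(scanRequestRow);
    rows.forEach(output::collect);
    // In bounded mode we stop once the API reports no further pages.
    return morePagesExpected ? InputStatus.MORE_AVAILABLE : InputStatus.END_OF_INPUT;
}
```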

Below is a picture of the sort of architecture I am prototyping. Please let me know if this sounds reasonable. New bits are in green. It is not totally formal UML, but hopefully you get the idea.

[Image: architecture diagram of the prototyped scan support; new components in green]

Let me know your thoughts @ChenShuai1981 @kristoffSC @grzegorz8. As this is likely to involve a major rewrite (changing package names and the like), I suggest we consider upping the version to 2.0.

@grzegorz8 grzegorz8 changed the title Could you provide an example of Http Scan Source (not lookup source) ? Add scan support May 2, 2025
@grzegorz8
Member

@davidradl Hey! Thank you for addressing this feature - it would be great to have it in the connector. Overall, all the points you mentioned look good. However, I have a few comments:

  • Scan abilities.
    • Filter pushdown, limit pushdown - this is fine.
    • Projection pushdown - I assume REST endpoints rarely accept the list of fields to return, so this is not a priority.
  • Parallelism. I agree, let’s stick with parallelism=1. While parallel execution might be possible, I don't think it's worth the added complexity at this point.
  • Result pagination.
    • REST APIs typically return results in pages. How do we want to handle this? This is related to: Enhance HTTP lookup join to support N:M relationships #118
    • We’ll need to consider both bounded and unbounded data scenarios, e.g. for the unbounded scenario we might want to implement some backoff delay strategy when no further results are available at the moment.
  • Checkpointing. If the scan source is going to be used in streaming execution mode, we need to support checkpointing. What we store in the checkpoint largely depends on the pagination strategy implemented in the source REST API. To ensure a reliable scan capability, the REST API pagination strategy has to be stable and idempotent (a minimal sketch of the reader state follows below).
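
A minimal sketch, just to illustrate the checkpointing point; HttpSourceSplit, splitId and nextPageToken are made-up names and the method is shown in isolation:

```java
// Inside the hypothetical scan SourceReader: persist the pagination position so
// that a restart resumes from the last completed page. This is only safe if the
// REST API's paging is stable and idempotent, as noted above.
@Override
public List<HttpSourceSplit> snapshotState(long checkpointId) {
    return Collections.singletonList(new HttpSourceSplit(splitId, nextPageToken));
}
```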
