
Add scan support #41


Open
ChenShuai1981 opened this issue Dec 17, 2022 · 5 comments
Labels: question (Further information is requested)

@ChenShuai1981

ChenShuai1981 commented Dec 17, 2022

Could you provide an example of Http Periodically Scan Source (not lookup source)? Does it support renew access token after expiration?

@kristoffSC
Collaborator

kristoffSC commented Dec 18, 2022

Hi @ChenShuai1981
Scan source is currently not supported by this connector, hence no example is available :) For now we have only the lookup source. This would be a great feature though; would you like to contribute? :)

The proper Flink interfaces would have to be implemented.

This feature would be a nice one, though it would be very "client specific".

@davidradl
Contributor

@ChenShuai1981 the lookup support that exists currently ends up issuing gets, puts or posts on single records. For the scan to work, I suspect we would need to issue searches, and get involved with paging the results. This could really impact the performance of a scan, as we could end up effectively doing table scans, unless we could do predicate pushdown.
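
For illustration only (this is not existing connector code), simple equality predicates could be pushed down into HTTP query parameters via Flink's SupportsFilterPushDown ability; the FilterToQueryParams helper below is a hypothetical sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.flink.table.expressions.CallExpression;
import org.apache.flink.table.expressions.FieldReferenceExpression;
import org.apache.flink.table.expressions.ResolvedExpression;
import org.apache.flink.table.expressions.ValueLiteralExpression;
import org.apache.flink.table.functions.BuiltInFunctionDefinitions;

// Hypothetical helper: turns "field = 'literal'" predicates into query
// parameters; anything it cannot handle is left for Flink to apply.
class FilterToQueryParams {

    static Map<String, String> toQueryParams(
            List<ResolvedExpression> filters, List<ResolvedExpression> remaining) {
        Map<String, String> params = new HashMap<>();
        for (ResolvedExpression filter : filters) {
            if (!tryAccept(filter, params)) {
                remaining.add(filter);
            }
        }
        return params;
    }

    // This sketch only accepts EQUALS(fieldRef, string literal).
    private static boolean tryAccept(ResolvedExpression filter, Map<String, String> params) {
        if (!(filter instanceof CallExpression)) {
            return false;
        }
        CallExpression call = (CallExpression) filter;
        if (call.getFunctionDefinition() != BuiltInFunctionDefinitions.EQUALS) {
            return false;
        }
        List<ResolvedExpression> children = call.getResolvedChildren();
        if (children.size() != 2
                || !(children.get(0) instanceof FieldReferenceExpression)
                || !(children.get(1) instanceof ValueLiteralExpression)) {
            return false;
        }
        String field = ((FieldReferenceExpression) children.get(0)).getName();
        Optional<String> value =
                ((ValueLiteralExpression) children.get(1)).getValueAs(String.class);
        value.ifPresent(v -> params.put(field, v));
        return value.isPresent();
    }
}
```

The accepted parameters would then be appended to the scan request URL, and everything left in `remaining` would still be filtered by Flink after the fetch.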

@ChenShuai1981
Author

> @ChenShuai1981 the lookup support that exists currently ends up issuing gets, puts or posts on single records. For the scan to work, I suspect we would need to issue searches, and get involved with paging the results. This could really impact the performance of a scan, as we could end up effectively doing table scans, unless we could do predicate pushdown.

Yes, you are right. Since the content provider updates information irregularly, we have to periodically send GET/POST requests to fetch it and sync it into our database. Scenarios include web crawling and system integration. Generally speaking, if the result set is too large, the provider will return a streaming response.

@davidradl
Contributor

I am prototyping adding scan support. The reasons I think this is useful are:

  • this will make this connector more comprehensive - so it could realistically be contributed to Flink. I hope to do that in a FLIP once the scan support is there.
  • we have had issues with the lookup connector where sometimes the predicates come through as filters rather than lookup keys, e.g. issue 143.
  • I feel that issue HTTP-99 and its PR (HTTP-99 Add support for generic json and URL query creator #149) lay down a more comprehensive way to drive the HTTP connector without needing custom Java.

The design I am thinking of is:

  • keep the polling factory and the like as-is
  • keep the query creators as-is. The only difference between what we need for a scan query and a lookup query is that the lookup query needs the lookup keys.
  • introduce a new getScanRuntimeProvider, based on the Flink example
  • the socket example uses the HTTP client in a streaming way. In order to reuse the existing polling factory and associated customization, I am looking to call the polling client's pull() in the pollNext of the SourceReader (see the sketch after this list).
  • implement an equivalent of the fix I did for JDBC https://issues.apache.org/jira/browse/FLINK-33365 so the lookup code gets access to the filters
  • As per the socket example, there will be no parallelism, split support or watermark support in the source. I assume we will get standard watermark behaviour implemented after the source, as per non-Kafka sources.
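
To make the getScanRuntimeProvider and pollNext points concrete, here is a minimal sketch of the wiring I have in mind; HttpScanTableSource, HttpScanSource and the exact pull() signature are assumptions for illustration, not the final code:

```java
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceProvider;

// Hypothetical table source exposing the scan runtime provider.
public class HttpScanTableSource implements ScanTableSource {

    // Assumed FLIP-27 Source wrapping the existing polling client and query creator.
    private final HttpScanSource source;

    public HttpScanTableSource(HttpScanSource source) {
        this.source = source;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
        // Single-parallelism FLIP-27 source, as discussed above.
        return SourceProvider.of(source);
    }

    @Override
    public DynamicTableSource copy() {
        return new HttpScanTableSource(source);
    }

    @Override
    public String asSummaryString() {
        return "HTTP scan source";
    }
}
```

And inside the reader created by that source, pollNext would reuse the existing polling client rather than streaming over a raw socket (method shown in isolation; the rest of the SourceReader is omitted, and the pull() call is an assumed signature):

```java
// Inside the hypothetical HttpScanSourceReader (a SourceReader<RowData, ...>):
@Override
public InputStatus pollNext(ReaderOutput<RowData> output) {
    // Assumed: pull() issues one HTTP request and returns the decoded rows.
    Collection<RowData> rows = pollingClient.pull(scanRequestRow);
    rows.forEach(output::collect);
    // In bounded mode we stop once the API reports no further pages.
    return morePagesExpected ? InputStatus.MORE_AVAILABLE : InputStatus.END_OF_INPUT;
}
```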

Below is a picture of the sort of architecture I am prototyping. Please let me know if this sounds reasonable. New bits are in green. It is not totally formal UML, but hopefully you get the idea.

[Image: architecture diagram of the prototyped scan support; new components in green]

Let me know your thoughts @ChenShuai1981 @kristoffSC @grzegorz8. As this is likely to involve a major rewrite (changing package names and the like), I suggest we consider upping the version to 2.0.

@grzegorz8 grzegorz8 changed the title Could you provide an example of Http Scan Source (not lookup source) ? Add scan support May 2, 2025
@grzegorz8
Member

@davidradl Hey! Thank you for addressing this feature - it would be great to have it in the connector. Overall, all the points you mentioned look good. However, I have a few comments:

  • Scan abilities.
    • Filter pushdown, limit pushdown - this is fine.
    • Projection pushdown - I assume REST endpoints rarely accept the list of fields to return, so this is not a priority.
  • Parallelism. I agree, let’s stick with parallelism=1. While parallel execution might be possible, I don't think it's worth the added complexity at this point.
  • Result pagination.
    • REST APIs typically return results in pages. How do we want to handle this? This is related to: Enhance HTTP lookup join to support N:M relationships #118
    • We’ll need to consider both bounded and unbounded data scenarios, e.g. for the unbounded scenario we might want to implement some backoff delay strategy when no further results are available at the moment.
  • Checkpointing. If the scan source is going to be used in streaming execution mode, we need to support checkpointing. What we store in the checkpoint largely depends on the pagination strategy implemented in the source REST API. To ensure a reliable scan capability, the REST API pagination strategy has to be stable and idempotent (a minimal sketch of the reader state follows below).
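
A minimal sketch, just to illustrate the checkpointing point; HttpSourceSplit, splitId and nextPageToken are made-up names and the method is shown in isolation:

```java
// Inside the hypothetical scan SourceReader: persist the pagination position so
// that a restart resumes from the last completed page. This is only safe if the
// REST API's paging is stable and idempotent, as noted above.
@Override
public List<HttpSourceSplit> snapshotState(long checkpointId) {
    return Collections.singletonList(new HttpSourceSplit(splitId, nextPageToken));
}
```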
