
Run AQL queries as Celery tasks #74

Closed
naglepuff opened this issue Aug 27, 2021 · 8 comments · Fixed by #79


@naglepuff
Collaborator

Related to #43 and intended as a follow-up for PR #71 .

We should leverage Celery when running AQL queries. For large networks, complex queries could take a significant amount of time to run. For this reason, we probably shouldn't run these queries during the HTTP request/response.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

I'm not super sure about this. Bringing in Celery means we have to manage the query return asynchronously, which adds quite a bit of both development and UX overhead.

It's possible that we would want to run "large" queries asynchronously, but I think at the moment there are higher-impact things we can do to make querying more usable and efficient:

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.
  2. Look into indexing so that we have a way to ensure that queries do run quickly.
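As a sketch of option (1): ArangoDB's HTTP cursor API accepts a server-side `maxRuntime` option (in seconds, available since ArangoDB 3.6) that aborts queries exceeding the limit. A minimal helper for building the request body might look like this (the helper name is hypothetical, not part of any existing codebase):

```python
def cursor_payload(query, bind_vars=None, max_runtime=10.0):
    """Build the body for POST /_api/cursor with a server-side time limit.

    max_runtime is in seconds; queries exceeding it are killed by ArangoDB
    rather than tying up the HTTP request.
    """
    return {
        "query": query,
        "bindVars": bind_vars or {},
        "options": {"maxRuntime": max_runtime},
    }
```

The same option is exposed by most ArangoDB drivers, so the limit could be applied without changing how queries are dispatched.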

(2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

@AlmightyYakob, I'm interested in your thoughts on this as well.

@jjnesbitt
Member

I'm in favor of this change; @naglepuff and I actually discussed this offline, which is where this issue came from.

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.

10 seconds really isn't much time. I think most of the queries being done on multilink, for example, would take much longer than that. The benefit of using Celery for this is that we can vastly widen the max query time, even to just a minute, which is more than we could do in the request. Essentially, since the queries would often run up against the max request time, rather than tying our hands by trying to fit everything into 30 seconds, we could run them asynchronously. Yes, it adds some additional overhead in retrieving the response, but that can be abstracted away in the client library, similar to how uploads are.

  1. Look into indexing so that we have a way to ensure that queries do run quickly.
    (2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

To me, indexing is an orthogonal issue here. If we can use indexing correctly to reduce query times, that's great, but I honestly wouldn't rely on it to ensure our queries run in a short time. Even if indexes improve document retrieval in the general case, it's unclear whether those improvements would persist when running arbitrary AQL queries that traverse a graph. To be clear, I'm in favor of using indexing; I just don't think it alone should be a substitute for running AQL queries asynchronously.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

I'm in favor of this change; @naglepuff and I actually discussed this offline, which is where this issue came from.

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.

10 seconds really isn't much time. I think most of the queries being done on multilink, for example, would take much longer than that. The benefit of using Celery for this is that we can vastly widen the max query time, even to just a minute, which is more than we could do in the request. Essentially, since the queries would often run up against the max request time, rather than tying our hands by trying to fit everything into 30 seconds, we could run them asynchronously. Yes, it adds some additional overhead in retrieving the response, but that can be abstracted away in the client library, similar to how uploads are.

Do we have a specific instance where AQL queries have been timing out? If so, this becomes much more of an immediate concern, but I'm not aware of any such demands right now.

  1. Look into indexing so that we have a way to ensure that queries do run quickly.
    (2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

To me, indexing is an orthogonal issue here. If we can use indexing correctly to reduce query times, that's great, but I honestly wouldn't rely on it to ensure our queries run in a short time. Even if indexes improve document retrieval in the general case, it's unclear whether those improvements would persist when running arbitrary AQL queries that traverse a graph. To be clear, I'm in favor of using indexing; I just don't think it alone should be a substitute for running AQL queries asynchronously.

I don't think it's orthogonal, for a simple reason: if properly indexing our own use cases pushes off the time at which we require async querying, then we have that time to develop other things that matter more in the short term. Investigating indexing brings one further benefit: it improves end user UX. The final factor here is that I have a feeling that getting indexing into our workflow would require less time investment than getting async AQL working. I'm happy to be shown otherwise, but to me it's a low cost investment to speed things up, leaving us more time to figure out if and how to do async queries.

@jjnesbitt
Member

Do we have a specific instance where AQL queries have been timing out? If so, this becomes much more of an immediate concern, but I'm not aware of any such demands right now.

I believe it happens with the Marclab work that @gotdairyya is doing. Maybe she can chime in on that.

The final factor here is that I have a feeling that getting indexing into our workflow would require less time investment than getting async AQL working. I'm happy to be shown otherwise, but to me it's a low cost investment to speed things up, leaving us more time to figure out if and how to do async queries.

I actually think the opposite: I think adding indexing would require more time investment than asynchronous AQL. To do so, we need to:

  • Investigate how we can accomplish indexing in ArangoDB
  • Develop a way to automatically create indexes for datasets (which raises questions about when/where we should create these indexes)
  • Profile the implementation to ensure that the indexing actually does solve the problem
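As a sketch of the first bullet: ArangoDB exposes index creation through its HTTP API (`POST /_api/index?collection=<name>`). A request body for a persistent index over chosen fields might be built like this (the helper name is hypothetical; the field names are placeholders):

```python
def persistent_index_payload(fields, unique=False, sparse=False):
    """Build the body for POST /_api/index?collection=<name>.

    A 'persistent' index speeds up equality and range lookups on the
    listed fields; whether it helps arbitrary graph-traversal AQL is
    exactly what profiling would need to establish.
    """
    return {
        "type": "persistent",
        "fields": list(fields),
        "unique": unique,
        "sparse": sparse,
    }
```

The "when/where" question from the second bullet then becomes a matter of deciding which dataset fields get passed to a call like this at upload time versus on demand.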

There's also the possibility that creating and using these indexes doesn't actually impact our query performance all that much, or at all, as we really don't know what performance impact they will have. If we have a query that takes 60 seconds, and indexing decreases the running time to 40 seconds, that's great (a 33% improvement!), but it's still over the 30 second threshold. Alternatively, we know that doing this asynchronously will solve our problem, and we already know how Celery tasks work, as well as how to implement them. Essentially:

  • Kick off a Celery task and return the task model instance to the endpoint
  • Run the query in the task method, and store the result on the task instance
  • The client queries the task until it's done, and retrieves the result
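The three steps above can be sketched schematically with plain Python (this is a toy in-memory stand-in for the real pieces, not the actual Celery/Django code; all names are hypothetical):

```python
import uuid

# Stand-in for the task table that would back the task model instances.
TASKS = {}

def dispatch_query(query):
    """Step 1: kick off a 'task' and hand its id back to the endpoint.

    In the real implementation this would be a Celery .delay() call that
    creates a task model row.
    """
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "PENDING", "query": query, "result": None}
    return task_id

def run_task(task_id):
    """Step 2: the worker runs the query and stores the result on the task.

    The placeholder string stands in for db.aql.execute(...) output.
    """
    task = TASKS[task_id]
    task["result"] = f"rows for: {task['query']}"
    task["status"] = "SUCCEEDED"

def poll(task_id):
    """Step 3: the client checks the task until it's done, then reads the result."""
    task = TASKS[task_id]
    return task["result"] if task["status"] == "SUCCEEDED" else None
```

As noted above, a client library can hide the dispatch-then-poll loop behind a single blocking call, the same way uploads are handled.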

These two approaches also might not be too far apart in terms of level of effort; I'm not sure. My main point is that I don't think the indexing approach is trivial, and that it would take some decent time. I personally think we should eventually do both, if we can and if they both have a place.

At the end of the day, I'm happy to do either, but I do feel that running the queries asynchronously has a higher chance of success.

@gotdairyya

I believe it happens with the Marclab work that @gotdairyya is doing. Maybe she can chime in on that.

Yeah, honestly, anything beyond a 2-hop query starts to get pretty slow, especially if the queries have fewer parameters defined. I haven't tried to run any queries beyond the 2-hop example because either my browser crashes or my laptop can't handle it. But at the moment, the Marclab is not interested in these longer-hop queries.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

Going to respond to both @AlmightyYakob and @gotdairyya here:

Jake, you've convinced me that async querying is the better short term solution, but I want to make sure the client applications are able to show progress correctly (meaning, properly show a throbber and/or have UI where you can check on these things) so as not to wreck the UX.

Derya, your complaint about 2-hop querying is exactly why I believe indexing is critical. While async querying will solve your immediate problem in that the queries will complete, you're going to be waiting exponentially longer for each hop you add. I think we can solve much of that with indexing, though of course I'm not sure until we investigate.

A final note about our approach to engaging with Arango indexes: I don't think we should aim for automatic indexing right off the bat. Instead, we should offer some simple way to create indexes on columns or groups of columns via the UI. But before that, we can simply experiment with manually creating indexes on the Arango instance in order to measure performance.

So let's go ahead and do async AQL queries, but let's also keep indexing in mind as something that requires investigation at the very least.

@jjnesbitt
Member

So let's go ahead and do async AQL queries, but let's also keep indexing in mind as something that requires investigation at the very least.

I've filed #77 to track this.

@jjnesbitt
Member

One extra benefit of running AQL queries as celery tasks is that the query and results would be stored in the model used to facilitate the feature. So, after the query has already been run, you're able to view the results again by retrieving the AQL task. This is in contrast to the current implementation, in which AQL queries are ephemeral, and not stored in any permanent way.
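A hypothetical shape for such a record (the class and field names are illustrative, not the actual multinet-api model):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AqlQueryTask:
    """Illustrative record of a dispatched AQL query and its stored result.

    Because the query text and its rows live on the task, the result can be
    re-read later by fetching the task, instead of re-running the query.
    """
    query: str
    bind_vars: dict = field(default_factory=dict)
    status: str = "PENDING"
    results: Optional[List] = None

    def complete(self, rows: List) -> None:
        # Persist the rows alongside the query so both survive the request.
        self.results = rows
        self.status = "FINISHED"
```

In the real system this would be a Django model row rather than an in-memory object, but the persistence benefit described above is the same.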
