
Run AQL queries as Celery tasks #74

Closed
naglepuff opened this issue Aug 27, 2021 · 8 comments · Fixed by #79


@naglepuff
Collaborator

Related to #43 and intended as a follow-up for PR #71 .

We should leverage Celery when running AQL queries. For large networks, complex queries could take a significant amount of time to run. For this reason, we probably shouldn't run these queries during the HTTP request/response.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

I'm not super sure about this. Bringing in Celery means we have to manage the query return asynchronously, which adds quite a bit of both development and UX overhead.

It's possible that we would want to run "large" queries asynchronously, but I think at the moment there are higher-impact things we can do to make querying more usable and efficient:

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.
  2. Look into indexing so that we have a way to ensure that queries do run quickly.
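As a sketch of option (1): ArangoDB's HTTP cursor API accepts a server-side `maxRuntime` option (in seconds, available since ArangoDB 3.6) that aborts queries exceeding the limit. A minimal helper for building the request body might look like this (the helper name is hypothetical, not part of any existing codebase):

```python
def cursor_payload(query, bind_vars=None, max_runtime=10.0):
    """Build the body for POST /_api/cursor with a server-side time limit.

    max_runtime is in seconds; queries exceeding it are killed by ArangoDB
    rather than tying up the HTTP request.
    """
    return {
        "query": query,
        "bindVars": bind_vars or {},
        "options": {"maxRuntime": max_runtime},
    }
```

The same option is exposed by most ArangoDB drivers, so the limit could be applied without changing how queries are dispatched.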

(2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

@AlmightyYakob, I'm interested in your thoughts on this as well.

@jjnesbitt
Member

I'm in favor of this change; @naglepuff and I actually discussed this offline, which is where this issue came from.

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.

10 seconds really isn't much time. I think most of the queries being done on multilink, for example, would take much longer than that. The benefit of using Celery for this is that we can vastly widen the max query time, even to just a minute, which is more than we could do in the request. Essentially, since the queries would often run up against the max request time, rather than tying our hands by trying to fit everything into 30 seconds, we could run them asynchronously. Yes, it adds some additional overhead in retrieving the response, but that can be abstracted away in the client library, similar to how uploads are.

  1. Look into indexing so that we have a way to ensure that queries do run quickly.
    (2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

To me, indexing is an orthogonal issue here. If we can use indexing correctly to reduce query times, that's great, but I honestly wouldn't rely on it to ensure our queries run in a short time. Even if indexes improve document retrieval in the general case, it's unclear whether those improvements would persist when running arbitrary AQL queries that traverse a graph. To be clear, I'm in favor of using indexing; I just don't think it alone should be a substitute for running AQL queries asynchronously.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

I'm in favor of this change; @naglepuff and I actually discussed this offline, which is where this issue came from.

  1. Limit queries to something like 10 seconds so that they can be done in-request without overwhelming the server or the client with too much data.

10 seconds really isn't much time. I think most of the queries being done on multilink, for example, would take much longer than that. The benefit of using Celery for this is that we can vastly widen the max query time, even to just a minute, which is more than we could do in the request. Essentially, since the queries would often run up against the max request time, rather than tying our hands by trying to fit everything into 30 seconds, we could run them asynchronously. Yes, it adds some additional overhead in retrieving the response, but that can be abstracted away in the client library, similar to how uploads are.

Do we have a specific instance where AQL queries have been timing out? If so, this becomes much more of an immediate concern, but I'm not aware of any such demands right now.

  1. Look into indexing so that we have a way to ensure that queries do run quickly.
    (2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf.

To me, indexing is an orthogonal issue here. If we can use indexing correctly to reduce query times, that's great, but I honestly wouldn't rely on it to ensure our queries run in a short time. Even if indexes improve document retrieval in the general case, it's unclear whether those improvements would persist when running arbitrary AQL queries that traverse a graph. To be clear, I'm in favor of using indexing; I just don't think it alone should be a substitute for running AQL queries asynchronously.

I don't think it's orthogonal, for a simple reason: if properly indexing our own use cases pushes off the time at which we require async querying, then we have that time to develop other things that matter more in the short term. Investigating indexing brings one further benefit: it improves end user UX. The final factor here is that I have a feeling that getting indexing into our workflow would require less time investment than getting async AQL working. I'm happy to be shown otherwise, but to me it's a low cost investment to speed things up, leaving us more time to figure out if and how to do async queries.

@jjnesbitt
Member

Do we have a specific instance where AQL queries have been timing out? If so, this becomes much more of an immediate concern, but I'm not aware of any such demands right now.

I believe it happens with the Marclab work that @gotdairyya is doing. Maybe she can chime in on that.

The final factor here is that I have a feeling that getting indexing into our workflow would require less time investment than getting async AQL working. I'm happy to be shown otherwise, but to me it's a low cost investment to speed things up, leaving us more time to figure out if and how to do async queries.

I actually think the opposite: I think adding indexing would require more time investment than asynchronous AQL. To do so, we need to:

  • Investigate how we can accomplish indexing in ArangoDB
  • Develop a way to automatically create indexes for datasets (which raises questions about when/where we should create these indexes)
  • Profile the implementation to ensure that the indexing actually does solve the problem
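As a sketch of the first bullet: ArangoDB exposes index creation through its HTTP API (`POST /_api/index?collection=<name>`). A request body for a persistent index over chosen fields might be built like this (the helper name is hypothetical; the field names are placeholders):

```python
def persistent_index_payload(fields, unique=False, sparse=False):
    """Build the body for POST /_api/index?collection=<name>.

    A 'persistent' index speeds up equality and range lookups on the
    listed fields; whether it helps arbitrary graph-traversal AQL is
    exactly what profiling would need to establish.
    """
    return {
        "type": "persistent",
        "fields": list(fields),
        "unique": unique,
        "sparse": sparse,
    }
```

The "when/where" question from the second bullet then becomes a matter of deciding which dataset fields get passed to a call like this at upload time versus on demand.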

There's also the possibility that creating and using these indexes doesn't actually impact our query performance all that much, or at all, as we really don't know what performance impact they will have. If we have a query that takes 60 seconds, and indexing decreases the running time to 40 seconds, that's great (a 33% improvement!), but it's still over the 30 second threshold. Alternatively, we know that doing this asynchronously will solve our problem, and we already know how Celery tasks work, as well as how to implement them. Essentially:

  • Kick off a Celery task and return the task model instance to the endpoint
  • Run the query in the task method, and store the result on the task instance
  • The client queries the task until it's done, and retrieves the result
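The three steps above can be sketched schematically with plain Python (this is a toy in-memory stand-in for the real pieces, not the actual Celery/Django code; all names are hypothetical):

```python
import uuid

# Stand-in for the task table that would back the task model instances.
TASKS = {}

def dispatch_query(query):
    """Step 1: kick off a 'task' and hand its id back to the endpoint.

    In the real implementation this would be a Celery .delay() call that
    creates a task model row.
    """
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "PENDING", "query": query, "result": None}
    return task_id

def run_task(task_id):
    """Step 2: the worker runs the query and stores the result on the task.

    The placeholder string stands in for db.aql.execute(...) output.
    """
    task = TASKS[task_id]
    task["result"] = f"rows for: {task['query']}"
    task["status"] = "SUCCEEDED"

def poll(task_id):
    """Step 3: the client checks the task until it's done, then reads the result."""
    task = TASKS[task_id]
    return task["result"] if task["status"] == "SUCCEEDED" else None
```

As noted above, a client library can hide the dispatch-then-poll loop behind a single blocking call, the same way uploads are handled.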

These two approaches also might not be too far apart in terms of level of effort; I'm not sure. My main point is that I don't think the indexing approach is trivial, and that it would take some decent time. I personally think we should eventually do both, if we can and if they both have a place.

At the end of the day, I'm happy to do either, but I do feel that running the queries asynchronously has a higher chance of success.

@gotdairyya

I believe it happens with the Marclab work that @gotdairyya is doing. Maybe she can chime in on that.

Yeah, honestly, anything beyond a 2-hop query starts to get pretty slow, especially if the queries have fewer parameters defined. I haven't tried to run any queries beyond the 2-hop example because either my browser crashes or my laptop can't handle it. But at the moment, the Marclab is not interested in these longer-hop queries.

@waxlamp
Collaborator

waxlamp commented Aug 31, 2021

Going to respond to both @AlmightyYakob and @gotdairyya here:

Jake, you've convinced me that async querying is the better short term solution, but I want to make sure the client applications are able to show progress correctly (meaning, properly show a throbber and/or have UI where you can check on these things) so as not to wreck the UX.

Derya, your complaint about 2-hop querying is exactly why I believe indexing is critical. While async querying will solve your immediate problem in that the queries will complete, you're going to be waiting exponentially longer for each hop you add. I think we can solve much of that with indexing, though of course I'm not sure until we investigate.

A final note about our approach to engaging with Arango indexes: I don't think we should aim for automatic indexing right off the bat. Instead, we should offer some simple way to create indexes on columns or groups of columns via the UI. But before that, we can simply experiment with manually creating indexes on the Arango instance in order to measure performance.

So let's go ahead and do async AQL queries, but let's also keep indexing in mind as something that requires investigation at the very least.

@jjnesbitt
Member

So let's go ahead and do async AQL queries, but let's also keep indexing in mind as something that requires investigation at the very least.

I've filed #77 to track this.

@jjnesbitt
Member

One extra benefit of running AQL queries as celery tasks is that the query and results would be stored in the model used to facilitate the feature. So, after the query has already been run, you're able to view the results again by retrieving the AQL task. This is in contrast to the current implementation, in which AQL queries are ephemeral, and not stored in any permanent way.
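A hypothetical shape for such a record (the class and field names are illustrative, not the actual multinet-api model):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AqlQueryTask:
    """Illustrative record of a dispatched AQL query and its stored result.

    Because the query text and its rows live on the task, the result can be
    re-read later by fetching the task, instead of re-running the query.
    """
    query: str
    bind_vars: dict = field(default_factory=dict)
    status: str = "PENDING"
    results: Optional[List] = None

    def complete(self, rows: List) -> None:
        # Persist the rows alongside the query so both survive the request.
        self.results = rows
        self.status = "FINISHED"
```

In the real system this would be a Django model row rather than an in-memory object, but the persistence benefit described above is the same.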
