Run AQL queries as Celery tasks #74
I'm not super sure about this. Bringing in Celery means we have to manage the query return asynchronously, which adds quite a bit of both development and UX overhead. It's possible that we would want to run "large" queries asynchronously, but I think at the moment there are higher-impact things we can do to make querying more usable/efficient:
(2) is something Jake and I have really not looked into at all. We should learn about how Arango handles DB indexes, and we should think about whether to expose that feature in the Multinet API, or else do some sort of automatic indexing on the user's behalf. @AlmightyYakob, I'm interested in your thoughts on this as well.
I'm in favor of this change; @naglepuff and I actually discussed this offline, which is where this issue came from.
10 seconds really isn't much time. I think most of the queries being done on multilink, for example, would take much longer than that. The benefit of using Celery for this is that we can vastly widen the maximum query time; even raising it to just a minute would be more than we could do within the request. Essentially, since the queries would often run up against the maximum request time, rather than tying our hands by trying to fit everything into 30 seconds, we could run them asynchronously. Yes, it adds some additional overhead in retrieving the response, but that can be abstracted away in the client library, similar to how uploads are handled.
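As a minimal sketch of what that client-side abstraction could look like, assuming a hypothetical pair of endpoints for creating and polling a query task (the paths, field names, and statuses below are placeholders, not the real Multinet API):

```python
# Hypothetical sketch of how the client library could hide the async round-trip.
# Endpoint paths, payload shapes, and status values here are assumptions.
import time
import requests

API_ROOT = "https://api.multinet.app/api"  # placeholder base URL


def run_aql(workspace: str, query: str, poll_interval: float = 1.0) -> list:
    """Submit an AQL query as a task, then poll until the results are ready."""
    resp = requests.post(
        f"{API_ROOT}/workspaces/{workspace}/queries/",
        json={"query": query},
    )
    resp.raise_for_status()
    task_id = resp.json()["id"]

    while True:
        task = requests.get(
            f"{API_ROOT}/workspaces/{workspace}/queries/{task_id}/"
        ).json()
        if task["status"] == "FINISHED":
            return task["results"]
        if task["status"] == "FAILED":
            raise RuntimeError(task.get("error", "query failed"))
        time.sleep(poll_interval)


# results = run_aql("some_workspace", "FOR v IN nodes LIMIT 10 RETURN v")
```

The caller just sees a blocking `run_aql()` call; only the polling loop needs to know the query actually ran as a task.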
To me, indexing is an orthogonal issue here. If we can use indexing correctly to reduce query times, that's great, but I honestly wouldn't rely on it to ensure our queries run quickly. Even if indexes improve document retrieval in the general case, it's unclear whether those improvements would persist when running arbitrary AQL queries that traverse a graph. To be clear, I'm in favor of using indexing; I just don't think it alone should be a substitute for running AQL queries asynchronously.
Do we have a specific instance where AQL queries have been timing out? If so, this becomes much more of an immediate concern, but I'm not aware of any such demands right now.
I don't think it's orthogonal, for a simple reason: if properly indexing our own use cases pushes off the point at which we require async querying, then we have that time to develop other things that matter more in the short term. Investigating indexing brings one further benefit: it improves end-user UX. The final factor here is that I have a feeling that getting indexing into our workflow would require less time investment than getting async AQL working. I'm happy to be shown otherwise, but to me it's a low-cost investment to speed things up, leaving us more time to figure out if and how to do async queries.
I believe it happens with the Marclab work that @gotdairyya is doing. Maybe she can chime in on that.
I actually think the opposite: I think adding indexing would require more time investment than asynchronous AQL. To do so, we need to:
There's also the possibility that creating and using these indexes doesn't actually impact our query performance much, or at all; we really don't know what effect they will have. If we have a query that takes 60 seconds and indexing decreases the running time to 40 seconds, that's great (a 33% improvement!), but it's still over the 30-second threshold. On the other hand, we know that doing this asynchronously will solve our problem, and we already know how Celery tasks work and how to implement them. Essentially:
These two approaches also might not be too far apart in level of effort, I'm not sure. My main point is that I don't think the indexing approach is trivial, and it would take some decent time. I personally think we should eventually do both, if we can and if they both have a place. At the end of the day, I'm happy to do either, but I do feel that running the queries asynchronously has a higher chance of success.
Yeah, honestly anything beyond a 2-hop query starts to get pretty slow, especially if the queries have fewer parameters defined. I haven't tried to run any queries beyond the 2-hop example because either my browser crashes or my laptop can't handle it. But at the moment, the Marclab is not interested in these longer-hop queries.
Going to respond to both @AlmightyYakob and @gotdairyya here:

Jake, you've convinced me that async querying is the better short-term solution, but I want to make sure the client applications are able to show progress correctly (meaning, properly show a throbber and/or have UI where you can check on these things) so as not to wreck the UX.

Derya, your complaint about 2-hop querying is exactly why I believe indexing is critical. While async querying will solve your immediate problem in that the queries will work, you're going to be waiting exponentially longer for each hop you add. I think we can solve much of that with indexing, though of course I'm not sure until we investigate.

A final note about our approach to engaging with Arango indexes: I don't think we should aim for automatic indexing right off the bat. Instead, we should offer some simple way to create indexes on columns or groups of columns via the UI. But before that, we can simply experiment with manually creating indexes on the Arango instance in order to measure performance.

So let's go ahead and do async AQL queries, but let's also keep indexing in mind as something that requires investigation at the very least.
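For what it's worth, that manual experiment could be as small as the following sketch, assuming direct access to the Arango instance via python-arango (the database, collection, filtered field, and credentials are all placeholders):

```python
# Rough sketch: time a query before and after adding a persistent index,
# directly against the Arango instance. Names and credentials are placeholders.
import time
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("some_workspace", username="root", password="password")

QUERY = """
FOR n IN nodes
    FILTER n.label == @label
    RETURN n
"""


def time_query() -> float:
    start = time.monotonic()
    list(db.aql.execute(QUERY, bind_vars={"label": "astrocyte"}))
    return time.monotonic() - start


before = time_query()

# Add a persistent index on the filtered attribute, then re-run the same query.
db.collection("nodes").add_persistent_index(fields=["label"], unique=False)

after = time_query()
print(f"before: {before:.2f}s  after: {after:.2f}s")
```

Whether an index on a filtered attribute actually helps the multi-hop graph traversals is exactly what an experiment like this would tell us.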
I've filed #77 to track this.
One extra benefit of running AQL queries as Celery tasks is that the query and results would be stored in the model used to facilitate the feature. So, after the query has already been run, you can view the results again by retrieving the AQL task. This is in contrast to the current implementation, in which AQL queries are ephemeral and not stored in any permanent way.
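One possible shape for that model and task, purely as a sketch: every name below (`AqlQuery`, its fields, the assumed `Workspace.get_arango_db()` helper) is hypothetical rather than the actual implementation.

```python
# Sketch of persisting AQL queries and their results so they can be retrieved
# after the task completes. Model, field, and helper names are hypothetical.
from celery import shared_task
from django.db import models


class AqlQuery(models.Model):
    class Status(models.TextChoices):
        PENDING = "PENDING"
        RUNNING = "RUNNING"
        FINISHED = "FINISHED"
        FAILED = "FAILED"

    # Assumes an existing Workspace model in the API app.
    workspace = models.ForeignKey("api.Workspace", on_delete=models.CASCADE)
    query = models.TextField()
    results = models.JSONField(null=True, blank=True)
    status = models.CharField(max_length=16, choices=Status.choices, default=Status.PENDING)
    error = models.TextField(null=True, blank=True)
    created = models.DateTimeField(auto_now_add=True)


@shared_task
def execute_aql_query(query_id: int) -> None:
    """Run the stored query against Arango and persist the results."""
    aql_query = AqlQuery.objects.get(id=query_id)
    aql_query.status = AqlQuery.Status.RUNNING
    aql_query.save(update_fields=["status"])
    try:
        db = aql_query.workspace.get_arango_db()  # assumed helper on Workspace
        aql_query.results = list(db.aql.execute(aql_query.query))
        aql_query.status = AqlQuery.Status.FINISHED
    except Exception as exc:
        aql_query.error = str(exc)
        aql_query.status = AqlQuery.Status.FAILED
    aql_query.save()
```

Because the results land on the model, a later GET against the task record can return them without re-running the query.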
Related to #43 and intended as a follow-up for PR #71.
We should leverage Celery when running AQL queries. For large networks, complex queries could take a significant amount of time to run, so we probably shouldn't run them during the HTTP request/response cycle.
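For illustration only, a rough sketch of how the request/response split could look at the API layer, building on the hypothetical `AqlQuery` model and `execute_aql_query` task sketched above (view name, module paths, and response shape are placeholders):

```python
# Sketch of handing the query off to Celery instead of running it in-request.
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import AqlQuery, Workspace   # hypothetical names from the sketch above
from .tasks import execute_aql_query      # hypothetical task from the sketch above


@api_view(["POST"])
def create_aql_query(request, workspace_name: str):
    """Record the query and dispatch it to Celery rather than running it inline."""
    aql_query = AqlQuery.objects.create(
        workspace=Workspace.objects.get(name=workspace_name),
        query=request.data["query"],
    )
    execute_aql_query.delay(aql_query.id)
    # Return immediately; the client (or the client library) polls for results.
    return Response(
        {"id": aql_query.id, "status": aql_query.status},
        status=status.HTTP_202_ACCEPTED,
    )
```

The request itself only creates a record and enqueues work, so its duration no longer depends on how long the query takes to run.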