Create model for async AQL tasks #79

naglepuff · 2021-09-10T16:56:09Z

Closes #74
This PR creates a Celery task that allows for asynchronous AQL query execution. It introduces new endpoints:

POST /api/workspaces/{workspace}/queries/: Expects a query as part of the request. Creates an AqlQuery task and starts it. Returns the task object.

GET /api/workspaces/{workspace}/queries/{query_id}/: Return the AqlQuery object represented by the query_id in the URI. This can be used to view the status of the task, query results, and any error messages.

GET /api/workspaces/{workspace}/queries/: Return all AQL queries objects for a given workspace.

GET /api/workspaces/{workspace}/queries/{query_id}/results/: Return the results of a given AQL query

Reviewers, please give opinions on the following

For these queries, the time limit has been bumped to 60 seconds
For now, it may be best to leave the /workspaces/{workspace}/aql endpoint intact, but it should be removed in the future in favor of these new endpoints. The old endpoint is still used by our client applications. Once those are updated, there will be no need for the original endpoint.
Should requests to the GET endpoint add some kind of pagination for the results?

Also create migration for the new AqlQuery model.

Also fix some linting issues.

When a query task is first created, it won't have any results. The best representation for this is Null, because a placeholder or default values like an empty dictionary could imply that a query returned no results.

Creation of tasks representing mutating queries are allowed, but no results will be stored in them. They will finish with an error message from Arango.

waxlamp · 2021-09-10T18:31:48Z

This PR creates a Celery task that allows for asynchronous AQL query execution. It introduces two new endpoints:

POST /api/workspaces/{workspace}/queries/: Expects a query as part of the request. Creates an AqlQuery task and starts it. Returns the task object.

GET /api/workspaces/{workspace}/queries/{query_id}/: Return the AqlQuery object represented by the query_id in the URI. This can be used to view the status of the task, query results, and any error messages.

We probably want a GET /api/workspaces/{workspace}/queries/ endpoint too. This will help eventually to create UI to inspect running jobs, etc. I only mention it because it seems like it could come for free from a DRF viewset.

Reviewers, please give opinions on the following

For these queries, the time limit has been bumped to 60 seconds

Sounds good to me. How hard will it be to adjust these limits per, e.g., user in the future?

For now, it may be best to leave the /workspaces/{workspace}/aql endpoint intact (since we're in hibernation mode), but it should be removed in the future in favor of these new endpoints

Is there a difficulty involved with removing it now?

If you don't remove it now, please file an issue to remind us to do so in the future.

Should requests to the GET endpoint add some kind of pagination for the results?

Why would you need pagination to view a single query?

naglepuff · 2021-09-10T18:55:24Z

We probably want a GET /api/workspaces/{workspace}/queries/ endpoint too. This will help eventually to create UI to inspect running jobs, etc. I only mention it because it seems like it could come for free from a DRF viewset.

You're absolutely right. We do actually get this for free, which is why I didn't mention it in my PR. I'll clarify that.

For these queries, the time limit has been bumped to 60 seconds

Sounds good to me. How hard will it be to adjust these limits per, e.g., user in the future?

I don't image it would be too difficult, although I haven't spent any time looking into it. What would this look like? Would owners of workspaces be able to control query limits for readers/writers via some kind of setting? I can create an issue for this if you'd like.

For now, it may be best to leave the /workspaces/{workspace}/aql endpoint intact, but it should be removed in the future in favor of these new endpoints. The old endpoint is still used by our client applications. Once those are updated, there will be no need for the original endpoint.

Is there a difficulty involved with removing it now?

I believe the current deployment of the girder4 client is still dependent on the old non-async endpoint. In my opinion that shouldn't stop this PR from moving forward. Once multinetjs is updated to use the async endpoint, we should delete the non-async one.

Should requests to the GET endpoint add some kind of pagination for the results?

Why would you need pagination to view a single query?

Maybe that was the wrong question. We currently limit query result sizes to 20MB. If a user runs a query, and their result set is 20MB, is it ok to send the entire set of data in one request, or should there be a way to limit that.

waxlamp · 2021-09-10T19:29:30Z

We probably want a GET /api/workspaces/{workspace}/queries/ endpoint too. This will help eventually to create UI to inspect running jobs, etc. I only mention it because it seems like it could come for free from a DRF viewset.

You're absolutely right. We do actually get this for free, which is why I didn't mention it in my PR. I'll clarify that.

For these queries, the time limit has been bumped to 60 seconds

Sounds good to me. How hard will it be to adjust these limits per, e.g., user in the future?

I don't image it would be too difficult, although I haven't spent any time looking into it. What would this look like? Would owners of workspaces be able to control query limits for readers/writers via some kind of setting? I can create an issue for this if you'd like.

No, we don't even need to file an issue yet. I was just idly thinking about it. 60 seconds should definitely cover us for now and if that becomes a hindrance we can revisit what to do here.

For now, it may be best to leave the /workspaces/{workspace}/aql endpoint intact, but it should be removed in the future in favor of these new endpoints. The old endpoint is still used by our client applications. Once those are updated, there will be no need for the original endpoint.

Is there a difficulty involved with removing it now?

I believe the current deployment of the girder4 client is still dependent on the old non-async endpoint. In my opinion that shouldn't stop this PR from moving forward. Once multinetjs is updated to use the async endpoint, we should delete the non-async one.

Oh yes that makes sense. I agree that the old endpoint shouldn't stop this one from moving forward.

Should requests to the GET endpoint add some kind of pagination for the results?

Why would you need pagination to view a single query?

Maybe that was the wrong question. We currently limit query result sizes to 20MB. If a user runs a query, and their result set is 20MB, is it ok to send the entire set of data in one request, or should there be a way to limit that.

Ah I see. I think for right now, there's no need for special handling (because most typical requests we're working with won't send 20MB responses). More generally, I think maybe the query model could report how big the result is, and there could be a /results endpoint for a given query which would send back the full query results; the client could always use a Range request to manage a large query result.

But this is a design discussion to have if and when we start to see unmanageably large query results in practice.

jjnesbitt

Overall looks great! Just a few suggestions, mostly around naming.

For these queries, the time limit has been bumped to 60 seconds

👍

For now, it may be best to leave the /workspaces/{workspace}/aql endpoint intact, but it should be removed in the future in favor of these new endpoints. The old endpoint is still used by our client applications. Once those are updated, there will be no need for the original endpoint.

As long as we track this with an issue as @waxlamp suggested, I'm okay with it for the time being.

Should requests to the GET endpoint add some kind of pagination for the results?

This is a good question. I'm kind of inclined to think that we shouldn't include the results field in AqlQueryTaskSerializer, and should instead have a /results endpoint (I believe @waxlamp also suggested this), which would return the paginated JSON.

My reasoning for this is that the results field can be large in many cases, and if you just want to retrieve the status of a task, you might not want all that data. If we kept the results field where it is now and we did want to support pagination, we'd need to do it in some nonstandard way. Whereas if we separated results out like suggested above, it's the same pagination we'd use everywhere else.

On the other hand, I'm not sure if pagination is necessary yet. We could just keep that at the back of our minds, and add in the result separation with pagination at a later point, if it seems necessary.

multinet/api/migrations/0011_alter_aqlquery_query_results.py

multinet/api/tasks/aql/aql_query.py

multinet/api/models/tasks.py

multinet/api/tests/test_query.py

waxlamp · 2021-09-10T21:53:30Z

Should requests to the GET endpoint add some kind of pagination for the results?

This is a good question. I'm kind of inclined to think that we shouldn't include the results field in AqlQueryTaskSerializer, and should instead have a /results endpoint (I believe @waxlamp also suggested this), which would return the paginated JSON.

My reasoning for this is that the results field can be large in many cases, and if you just want to retrieve the status of a task, you might not want all that data. If we kept the results field where it is now and we did want to support pagination, we'd need to do it in some nonstandard way. Whereas if we separated results out like suggested above, it's the same pagination we'd use everywhere else.

On the other hand, I'm not sure if pagination is necessary yet. We could just keep that at the back of our minds, and add in the result separation with pagination at a later point, if it seems necessary.

This is a +1 for a /results/ endpoint then: you are right that a major use of the query models will be to report the status of a list of queries, in which case we'd want to avoid returning the results for completed queries, and on top of that, if we can get something for free from DRF (pagination, in this case) then that's a big draw.

(I had been worried that AQL queries can return any JSON value, but it turns out that indeed, they always return lists: https://www.arangodb.com/docs/stable/aql/fundamentals-query-results.html.)

Also remove custom exception handling from that task, and instead leverage MultinetCeleryTask to handle any exceptions during task execution.

naglepuff · 2021-09-13T16:18:42Z

@AlmightyYakob I've addressed the comments on your initial review.

I've also implemented a /results endpoint. See 1b61e62.

jjnesbitt

LGTM, thanks!

naglepuff added 11 commits September 2, 2021 14:49

Create model for async AQL tasks

c5f92d2

Create celery task for AQL queries

4757b6b

Use TextField for AQL queries

e4a0229

Also create migration for the new AqlQuery model.

Create ViewSet for AQL Queries

8237df6

Also fix some linting issues.

Change query_results to allow Null

b259829

When a query task is first created, it won't have any results. The best representation for this is Null, because a placeholder or default values like an empty dictionary could imply that a query returned no results.

Specify CharField for user in aql serializer

0cd120c

Add queryset to AqlQueryViewSet

070c838

Set the query_results without using json

229a315

Initial tests for async aql

b98c23c

Add tests with mutating queries

02e96b4

Creation of tasks representing mutating queries are allowed, but no results will be stored in them. They will finish with an error message from Arango.

Increase time limit for async AQL queries

65209c2

naglepuff requested a review from jjnesbitt September 10, 2021 16:56

This was referenced Sep 10, 2021

Remove non-async AQL query endpoint #80

Open

Use async AQL query endpoints multinet-app/multinetjs#58

Open

jjnesbitt requested changes Sep 10, 2021

View reviewed changes

naglepuff added 3 commits September 13, 2021 09:45

Rename query_results and squash migrations

110f2b2

Rename ExecuteAqlQueryTask

0d40fd1

Also remove custom exception handling from that task, and instead leverage MultinetCeleryTask to handle any exceptions during task execution.

Create /results endpoint for AQL queries

1b61e62

naglepuff marked this pull request as ready for review September 13, 2021 16:18

naglepuff requested a review from jjnesbitt September 13, 2021 16:18

jjnesbitt approved these changes Sep 13, 2021

View reviewed changes

naglepuff merged commit 33a9a28 into master Sep 14, 2021

naglepuff deleted the aql-async branch September 14, 2021 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Create model for async AQL tasks #79

Create model for async AQL tasks #79

Uh oh!

naglepuff commented Sep 10, 2021 •

edited

Loading

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

naglepuff commented Sep 10, 2021

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

jjnesbitt left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

naglepuff commented Sep 13, 2021

Uh oh!

jjnesbitt left a comment

Uh oh!

Uh oh!

Create model for async AQL tasks #79

Create model for async AQL tasks #79

Uh oh!

Conversation

naglepuff commented Sep 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

naglepuff commented Sep 10, 2021

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

jjnesbitt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waxlamp commented Sep 10, 2021

Uh oh!

naglepuff commented Sep 13, 2021

Uh oh!

jjnesbitt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

naglepuff commented Sep 10, 2021 •

edited

Loading