Current paging solution > 10k is not RESTful #109

Open
rsmith013 opened this issue Dec 17, 2020 · 4 comments

@rsmith013

How can we make random-access pagination work on top of Elasticsearch?

rsmith013 commented Jan 8, 2021

I have been looking this week at possible solutions for random, deep pagination (past the 10,000th result), and I have not been able to come up with one that provides timely responses.

I had thought that I could build a cache. Random pagination from a cached response is indeed very quick; the challenge comes when the query response is not cached and the cache has to be built first.

Some simple analysis for building these cache objects:

Number of items in the dataset:
Min: 10275
Median: 29805.0
Max: 847907

Processing time (h:mm:ss, extrapolated from building the cache for a dataset with 59,700 items at a page size of 1,000, using processing time in seconds = 6.23 * number_of_pages + 5.37):
Min: 0:01:09
Median: 0:03:11
Max: 1:28:07
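
As a rough check, the estimates above can be reproduced from the linear fit. This is only a sketch: it assumes the fit is in seconds, that `number_of_pages` is the fractional value `items / 1000`, and that fractions of a second are truncated, since those choices match the figures quoted. The function name `estimated_build_time` is made up for illustration.

```python
from datetime import timedelta

PAGE_SIZE = 1000                # items per page, as stated above
SECONDS_PER_PAGE = 6.23         # slope of the fit (assumed to be seconds)
FIXED_OVERHEAD_SECONDS = 5.37   # intercept of the fit (assumed to be seconds)


def estimated_build_time(item_count: int) -> timedelta:
    """Estimate how long building the cache takes for a result set."""
    pages = item_count / PAGE_SIZE  # fractional page count (assumption)
    seconds = SECONDS_PER_PAGE * pages + FIXED_OVERHEAD_SECONDS
    return timedelta(seconds=int(seconds))  # drop fractions of a second


for label, items in [("Min", 10275), ("Median", 29805), ("Max", 847907)]:
    print(f"{label}: {estimated_build_time(items)}")
# Min: 0:01:09
# Median: 0:03:11
# Max: 1:28:07
```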

The lower times might be acceptable as a one-off, which would then allow parallelised workflows to interact with the whole result set and reduce subsequent response times, but the upper end clearly is not.

A cache might be useful, more generally, if your workflows often repeat the same query to the same endpoints. This will require minimal engineering but might still provide a useful improvement.

In my research around the subject, it seems the answer to deep pagination is that you don’t. Instead you:

  • Negate the need for scrolling all results by providing aggregated metadata
  • Allow users to subset their search to get under the limit (e.g. geo-spatial, time ranges, …)

To help me figure out the next step, could you please answer the following questions:

  • What are your use cases/needs for paging all results?
  • What metadata, at the description document level, would help remove the need for paging all items?
  • Do your workflows regularly perform the same queries which might benefit from caching?

@rsmith013

Spacebel use an extra parameter in the GET request to pass state. This parameter is included in the next and prev links in the response, which means that the client doesn't need a session.
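
A minimal sketch of that idea, not Spacebel's actual implementation: the server copies the current request URL and sets one extra query parameter carrying the state, so each link is self-describing. The helper name `with_state`, the example URL, and the use of `search_after` as the parameter name are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_state(url: str, state: str, param: str = "search_after") -> str:
    """Return `url` with the state parameter set, replacing any old value."""
    parts = urlsplit(url)
    # Keep every existing query parameter except a stale state value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, state))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical request URL and state token.
current = "https://example.org/search?q=sst&search_after=old"
next_link = with_state(current, "abc123")
print(next_link)
# https://example.org/search?q=sst&search_after=abc123
```

Because the state travels in the link itself, any client can follow `next` or `prev` without the server holding per-client session data.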

@rsmith013 rsmith013 self-assigned this Mar 26, 2021
@rsmith013

We can use Base64 encoding and decoding to convert the Elasticsearch sort key into a string that can be sent in the URL.

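import base64
import json

# Send with response: serialise the sort key and encode it as a URL-safe token
sort_key = response['sort']
sort_b = json.dumps(sort_key).encode('utf-8')
b64 = base64.urlsafe_b64encode(sort_b).decode('ascii')

# Process with request: decode the token back into the original sort key
search_after = request.GET['search_after']
sort_b = base64.urlsafe_b64decode(search_after)
sort_key = json.loads(sort_b)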
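
The same round trip can be sketched as self-contained helpers, together with how the decoded key might be fed back into an Elasticsearch request body through its `search_after` field. The helper names (`encode_cursor`, `decode_cursor`, `build_search_body`), the sort fields, and the sample values are hypothetical; only the `search_after` body field and the standard-library calls are taken as given.

```python
import base64
import json


def encode_cursor(sort_values):
    """Turn a hit's sort values into a URL-safe string token."""
    raw = json.dumps(sort_values, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(token):
    """Recover the sort values from a token made by encode_cursor."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def build_search_body(query, sort, size, token=None):
    """Build a search body; with a token, resume after that sort key."""
    body = {"query": query, "sort": sort, "size": size}
    if token:
        body["search_after"] = decode_cursor(token)
    return body


# Hypothetical sort values from the last hit of a page.
last_sort = [1608163200000, "item-42"]
token = encode_cursor(last_sort)
body = build_search_body(
    {"match_all": {}},
    [{"datetime": "asc"}, {"item_id": "asc"}],
    10,
    token,
)
assert body["search_after"] == last_sort
```

The sort should end in a unique tie-breaker field so that consecutive pages neither skip nor repeat items. A token built this way is not tamper-proof, so the decoded value is best treated as untrusted input.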
