Current paging solution > 10k is not RESTful #109

rsmith013 · 2020-12-17T16:32:28Z

How can we make random-access pagination work on top of Elasticsearch?

rsmith013 · 2020-12-17T16:48:28Z

Perhaps making use of:

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-request-scroll.html#sliced-scroll

rsmith013 · 2021-01-08T17:01:18Z

I have been looking this week at possible solutions for random, deep pagination (past the 10,000th result) and I have not been able to come up with one that provides timely responses.

I had thought that I could build a cache. Indeed, random pagination from a cached response is very quick. The challenge comes when the query response is not cached and the cache has to be built first.

Some simple analysis for building these cache objects:

Number of items in the dataset:
Min: 10275
Median: 29805.0
Max: 847907

Processing time (based on building the cache for a dataset with 59,700 items and page sizes of 1,000, giving processing time in seconds = 6.23 * number_of_pages + 5.37):
Min: 0:01:09
Median: 0:03:11
Max: 1:28:07
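
As a rough check, the fitted line can be applied to the three dataset sizes. This is a minimal sketch, assuming the formula gives seconds and that number_of_pages is the fractional value items / 1,000 (both assumptions reproduce the figures quoted here). The function and constant names are illustrative only.

```python
from datetime import timedelta

PAGE_SIZE = 1_000          # page size used in the measurement
SECONDS_PER_PAGE = 6.23    # slope of the fitted line
FIXED_OVERHEAD = 5.37      # intercept of the fitted line (assumed to be seconds)


def estimated_build_time(n_items: int) -> timedelta:
    """Estimate how long building the cache takes for n_items results."""
    pages = n_items / PAGE_SIZE  # fractional pages (assumption)
    seconds = SECONDS_PER_PAGE * pages + FIXED_OVERHEAD
    return timedelta(seconds=int(seconds))  # truncate to whole seconds


for label, n_items in [("Min", 10_275), ("Median", 29_805), ("Max", 847_907)]:
    print(f"{label}: {estimated_build_time(n_items)}")
```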

The lower times might be acceptable as a one-off cost, which would then allow parallelised workflows to interact with the whole result set and reduce subsequent response times, but the upper end clearly is not.

A cache might be useful, more generally, if your workflows often repeat the same query to the same endpoints. This will require minimal engineering but might still provide a useful improvement.

In my research around the subject, it seems the answer to deep pagination is that you don’t. Instead you:

Negate the need for scrolling all results by providing aggregated metadata
Allow the users to subset their search to get under the limit (i.e. geo-spatial, time-ranges, …)
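
One way the time-range subsetting could work on the client side is to bisect a range until every piece is under the limit. This is only a sketch: `split_until_under_limit` and `count_items` are hypothetical names, and `count_items` stands in for whatever call returns the number of hits for a time range.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

LIMIT = 10_000  # the paging limit discussed in this issue

TimeRange = Tuple[datetime, datetime]


def split_until_under_limit(
    start: datetime,
    end: datetime,
    count_items: Callable[[datetime, datetime], int],
    limit: int = LIMIT,
) -> List[TimeRange]:
    """Bisect [start, end) until each sub-range holds at most `limit` items."""
    too_small_to_split = end - start <= timedelta(seconds=1)
    if count_items(start, end) <= limit or too_small_to_split:
        return [(start, end)]
    middle = start + (end - start) / 2
    return (split_until_under_limit(start, middle, count_items, limit)
            + split_until_under_limit(middle, end, count_items, limit))


def fake_count(start: datetime, end: datetime) -> int:
    """Stand-in for a real count query: pretends there are 100 items per day."""
    return int((end - start).total_seconds() / 86_400 * 100)


ranges = split_until_under_limit(datetime(2020, 1, 1), datetime(2021, 1, 1), fake_count)
for sub_start, sub_end in ranges:
    print(sub_start, sub_end, fake_count(sub_start, sub_end))
```

Each sub-range can then be paged normally, because none of them exceeds the limit.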

To help me figure out the next step, please could you answer the following questions:

What are your use cases/needs for paging all results?
What metadata, at the description document level, would help remove the need for paging all items?
Do your workflows regularly perform the same queries which might benefit from caching?

rsmith013 · 2021-01-29T11:02:55Z

Spacebel use an extra parameter in the GET request to pass state. This parameter is included in the next and prev links in the response, which means that the client doesn't need a session.
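
A minimal sketch of that idea, assuming the state is an opaque string and the extra parameter is called `search_after` (the parameter name, the `build_page_link` helper and the example.org URL are all placeholders, not Spacebel's actual API):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_page_link(base_url: str, query: dict, state: str) -> str:
    """Repeat the original query and append the opaque paging state."""
    params = dict(query)
    params["search_after"] = state  # hypothetical parameter name
    return f"{base_url}?{urlencode(params)}"


next_link = build_page_link(
    "https://example.org/search",   # placeholder URL
    {"q": "sea ice", "size": "20"},  # the client's original query
    state="opaque-token",
)
print(next_link)

# The server recovers the state from the link alone, so no session is needed.
received = parse_qs(urlsplit(next_link).query)
print(received["search_after"][0])
```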

rsmith013 · 2021-03-26T16:58:25Z

We can use Base64 encoding to convert the Elasticsearch sort key into a string that can be sent in the URL, and decode it again when the request comes in.

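import base64
import json

# Send with the response: serialise the sort key and encode it as a URL-safe string
sort_key = response['sort']
sort_b = json.dumps(sort_key).encode('utf-8')
b64 = base64.urlsafe_b64encode(sort_b).decode('ascii')

# Process with the request: decode the string back into the sort key
search_after = request.GET['search_after']
sort_b = base64.urlsafe_b64decode(search_after)
sort_key = json.loads(sort_b)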
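
The same round trip as a self-contained sketch that can be run without a request or response object. The helper names and the example sort key are made up for illustration.

```python
import base64
import json


def encode_sort_key(sort_key: list) -> str:
    """Serialise a sort key to a URL-safe Base64 token."""
    raw = json.dumps(sort_key, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_sort_key(token: str) -> list:
    """Recover the sort key from a token produced by encode_sort_key."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw)


example_key = [1611918175000, "item-42"]  # illustrative values only
token = encode_sort_key(example_key)
restored = decode_sort_key(token)
print(token)
print(restored)
```

The URL-safe alphabet avoids `+` and `/`, but the token can still end in `=` padding, so it is worth passing it through a URL builder such as `urllib.parse.urlencode` when writing the links.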

rsmith013 pushed a commit that referenced this issue Jan 8, 2021

WIP: Investigating using cache to allow random access pagination #109

84a4993

rsmith013 self-assigned this Mar 26, 2021
