HEAD request unnecessarily sends query data back to PostgREST #2849

Closed · bnjmnt4n opened this issue Jul 5, 2023 · 8 comments · Fixed by #2857

@bnjmnt4n commented Jul 5, 2023

Environment

  • PostgreSQL version: 15.2
  • PostgREST version: 10.2
  • Operating system: Ubuntu 20.04, WSL

Description of issue

Sending a HEAD request actually executes the query and sends the aggregated JSON results back to PostgREST, even though the body will just be discarded. I think PostgREST should compute only the count and send that back, to avoid unnecessary CPU and I/O on data that is never used.

I ran into this whilst using a HEAD request with Prefer: count=exact. Adding an explicit limit=0 to avoid actually sending any data decreased the response timing by nearly half.
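For reference, a rough sketch of the comparison (big_table is a placeholder table name; timings will vary):

# plain HEAD with an exact count: the full JSON body is built and then thrown away
curl -I 'localhost:3000/big_table' -H 'Prefer: count=exact'

# with limit=0 no rows are materialized, which roughly halved the response time
curl -I 'localhost:3000/big_table?limit=0' -H 'Prefer: count=exact'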

@steve-chavez (Member)

It sounds like a possible optimization on HEAD, but I don't think we should do it.

RFC 2616 says:

> The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.

Note "identical to GET". Also, I've used HEAD to debug a GET before and just a slight change in the query wouldn't make it reliable for this.

> I ran into this whilst using a HEAD request with Prefer: count=exact. Adding an explicit limit=0 to avoid actually sending any data decreased the response timing by nearly half.

If you only need the count, I think we can add a Prefer: return=headers-only for GET (INSERT already has this).
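A sketch of what that could look like (hypothetical, not implemented):

GET /users HTTP/1.1
Prefer: count=exact, return=headers-only

## which would respond with the count but no body
HTTP/1.1 200 OK
Content-Range: */100000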

How does that sound?

@steve-chavez added the idea label (Needs of discussion to become an enhancement, not ready for implementation) on Jul 5, 2023
@bnjmnt4n (Author) commented Jul 6, 2023

I don't have a very strong opinion on this, just that the current behavior seems a bit unexpected and unintuitive. I think sticking with the current behavior is fine as long as the documentation is updated to make this explicit.

@steve-chavez (Member)

> just that the current behavior seems a bit unexpected and unintuitive

On second thought, I think you're right. It's expected that HEAD is a cheap/fast operation. Will fix this.

@steve-chavez added the bug and perf labels and removed the idea label on Jul 7, 2023
@steve-chavez (Member) commented Jul 7, 2023

To test this, the simplest way that comes to mind is comparing the width of the query:

-- we get 40 when selecting columns
postgres=# explain select * from projects ;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..22.00 rows=1200 width=40)
(1 row)

-- we get 0 when not selecting columns
postgres=# explain select  from projects ;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..22.00 rows=1200 width=0)
(1 row)

However, we can't do that, since we select columns for the count and custom headers.
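For instance, even selecting a single column for counting keeps the width above zero (a sketch; the cost numbers are illustrative):

-- width is non-zero as soon as any column is selected
postgres=# explain select id from projects ;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..22.00 rows=1200 width=4)
(1 row)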

So the way to test it should be comparing the generated queries:

EXPLAIN (FORMAT TEXT, VERBOSE)
WITH pgrst_source AS
  (SELECT "test"."projects"."id", "test"."projects"."name" FROM "test"."projects")
SELECT
  null::bigint AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  coalesce(json_agg(_postgrest_t), '[]') AS body,
  nullif(current_setting('response.headers', true), '') AS response_headers,
  nullif(current_setting('response.status', true), '') AS response_status
FROM (SELECT * FROM pgrst_source) _postgrest_t;

-- we get the aggregation here, json_agg
 Aggregate  (cost=28.00..28.02 rows=1 width=112)
   Output: NULL::bigint, count(ROW(projects.id, projects.name)), COALESCE(json_agg(ROW(projects.id, projects.name)), '[]'::json), NULLIF(current_setting('response.headers'::text, true), ''::text), NULLIF(current_setting('response.status'::text, true), ''::text)
   ->  Seq Scan on test.projects  (cost=0.00..22.00 rows=1200 width=36)
         Output: projects.id, projects.name, projects.client_id

EXPLAIN (FORMAT TEXT, VERBOSE)
WITH pgrst_source AS
  (SELECT "test"."projects"."id", "test"."projects"."name" FROM "test"."projects")
SELECT
  null::bigint AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  '' AS body,
  nullif(current_setting('response.headers', true), '') AS response_headers,
  nullif(current_setting('response.status', true), '') AS response_status
FROM (SELECT * FROM pgrst_source) _postgrest_t;

-- here we get the empty body, ''::text
 Aggregate  (cost=25.00..25.02 rows=1 width=112)
   Output: NULL::bigint, count(ROW(projects.id, projects.name)), ''::text, NULLIF(current_setting('response.headers'::text, true), ''::text), NULLIF(current_setting('response.status'::text, true), ''::text)
   ->  Seq Scan on test.projects  (cost=0.00..22.00 rows=1200 width=36)
         Output: projects.id, projects.name, projects.client_id

Edit 1: One problem here is that we also use HTTP requests to get the EXPLAIN output. So if we tested a HEAD request, there would be no body to read the plan from.

I'm thinking of adding an internal config that would let us obtain the plan output in a header.

Edit 2: Considering #2825, something that seems useful now and later is to have a shortcut for the pg aggregate used. So I believe we can add this:

HEAD /users?select=name&order=id HTTP/1.1
Accept: application/vnd.pgrst.plan; options=verbose

## which will respond
200 OK
X-Pg-Aggregate: json_agg(builtin)

So when no aggregate is used, we can reply with an X-Pg-Aggregate: none. This would only happen when the verbose option is provided, to match EXPLAIN behavior.

Edit 3: The above is considerable work, so for now I'm just enforcing this behavior with new types.

@laurenceisla (Member) commented Mar 18, 2025

To note: limit=0 is still faster for HEAD requests with Prefer: count=exact. Even after the JSON aggregate was replaced with ''::text (which reduced the HEAD execution time), the pagination count is still executed. For example:

A request without limit=0:

curl 'localhost:3000/big_table' -H 'Prefer: count=exact' -I

HTTP/1.1 200 OK
...
Content-Range: 0-99999/100000

A request with limit=0:

curl 'localhost:3000/big_table?limit=0' -H 'Prefer: count=exact' -I

HTTP/1.1 206 Partial Content
...
Content-Range: */100000

As seen in the Content-Range, both count the total and return 100000, but they differ in the "rows per page" part. With limit=0 it's faster because it counts 0 rows, but without limit=0 it counts the whole response (takes almost twice the time). This is more noticeable when db-max-rows is not set or when it's big enough.

Not proposing to change anything (not sure what you think), since right now HEAD behaves as close as possible to a GET request (sans processing the result with a JSON agg). Maybe we could add this difference to the docs somewhere?

@steve-chavez (Member) commented Mar 18, 2025

> but without limit=0 it counts the whole response (takes almost twice the time).
> Content-Range: */100000

Why is the page count slow? Shouldn't it be quick compared to the whole table count?

Maybe share EXPLAIN analyze output of these requests?

@laurenceisla (Member) commented Mar 19, 2025

> Why is the page count slow? Shouldn't it be quick compared to the whole table count?

It's quicker if db-max-rows is set to a lower value. If it's not set or if it's big enough then it will take more time.
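For instance, capping the page size via the db-max-rows config keeps the page count cheap (the value is illustrative):

# postgrest.conf: the page count now scans at most 100 rows
db-max-rows = 100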

> Maybe share EXPLAIN analyze output of these requests?

(These requests do not have db-max-rows set)

Without `limit=0`

curl 'localhost:3030/big_table' -H 'Prefer: count=exact' -I
EXPLAIN (FORMAT TEXT, ANALYZE, VERBOSE)
WITH pgrst_source AS
  (SELECT "public"."big_table".* FROM "public"."big_table") ,
 pgrst_source_count AS
  (SELECT 1 FROM "public"."big_table")
SELECT
  (SELECT pg_catalog.count(*) FROM pgrst_source_count) AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  ''::text AS body,
  nullif(current_setting('response.headers', TRUE), '') AS response_headers,
  nullif(current_setting('response.status', TRUE), '') AS response_status,
  '' AS response_inserted
FROM
  (SELECT * FROM pgrst_source) _postgrest_t
Aggregate  (cost=4562.01..4562.03 rows=1 width=144) (actual time=33.566..33.566 rows=1 loops=1)
  Output: (InitPlan 1).col1, count(ROW(big_table.id, big_table.name, big_table.val, big_table.num)), ''::text, NULLIF(current_setting('response.headers'::text, true), ''::text), NULLIF(current_setting('response.status'::text, true), ''::text), ''::text
  InitPlan 1
    ->  Aggregate  (cost=2281.00..2281.01 rows=1 width=8) (actual time=4.900..4.900 rows=1 loops=1)
          Output: count(*)
          ->  Seq Scan on public.big_table big_table_1  (cost=0.00..2031.00 rows=100000 width=0) (actual time=0.009..2.627 rows=100000 loops=1)
                Output: big_table_1.id, big_table_1.name, big_table_1.val, big_table_1.num
  ->  Seq Scan on public.big_table  (cost=0.00..2031.00 rows=100000 width=50) (actual time=0.009..7.667 rows=100000 loops=1)
        Output: big_table.id, big_table.name, big_table.val, big_table.num
Planning Time: 0.151 ms
Execution Time: 33.598 ms

With `limit=0`

curl 'localhost:3030/big_table?limit=0' -H 'Prefer: count=exact' -I
EXPLAIN (FORMAT TEXT, ANALYZE, VERBOSE)
WITH pgrst_source AS
  (SELECT "public"."big_table".* FROM "public"."big_table" LIMIT 0 OFFSET 0) ,
 pgrst_source_count AS
  (SELECT 1 FROM "public"."big_table")
SELECT
  (SELECT pg_catalog.count(*) FROM pgrst_source_count) AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  ''::text AS body,
  nullif(current_setting('response.headers', TRUE), '') AS response_headers,
  nullif(current_setting('response.status', TRUE), '') AS response_status,
  '' AS response_inserted
FROM
  (SELECT * FROM pgrst_source) _postgrest_t
Aggregate  (cost=2281.03..2281.05 rows=1 width=144) (actual time=17.579..17.580 rows=1 loops=1)
  Output: (InitPlan 1).col1, count(ROW(big_table.id, big_table.name, big_table.val, big_table.num)), ''::text, NULLIF(current_setting('response.headers'::text, true), ''::text), NULLIF(current_setting('response.status'::text, true), ''::text), ''::text
  InitPlan 1
    ->  Aggregate  (cost=2281.00..2281.01 rows=1 width=8) (actual time=17.566..17.566 rows=1 loops=1)
          Output: count(*)
          ->  Seq Scan on public.big_table big_table_1  (cost=0.00..2031.00 rows=100000 width=0) (actual time=0.015..9.224 rows=100000 loops=1)
                Output: big_table_1.id, big_table_1.name, big_table_1.val, big_table_1.num
  ->  Limit  (cost=0.00..0.02 rows=1 width=50) (actual time=0.001..0.002 rows=0 loops=1)
        Output: big_table.id, big_table.name, big_table.val, big_table.num
        ->  Seq Scan on public.big_table  (cost=0.00..2031.00 rows=100000 width=50) (never executed)
              Output: big_table.id, big_table.name, big_table.val, big_table.num
Planning Time: 0.289 ms
Execution Time: 17.657 ms

@wolfgangwalther (Member)

> Why is the page count slow? Shouldn't it be quick compared to the whole table count?

The whole table always needs to be counted for the "total number of items". We either do so once (with limit=0) or twice (without limit=0). So it makes sense that the time required doubles.

A possible optimization would be to skip the extra counting query for requests without a limit, as sketched below. Of course, that only works without max-rows, so maybe it's a pointless edge case to optimize for.
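A rough sketch against the generated query shown above (assuming no limit and no db-max-rows; response_headers/response_status omitted for brevity):

WITH pgrst_source AS
  (SELECT "public"."big_table".* FROM "public"."big_table")
SELECT
  -- without a limit the page is the whole table, so one count can serve as both
  pg_catalog.count(_postgrest_t) AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  ''::text AS body
FROM
  (SELECT * FROM pgrst_source) _postgrest_t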
