Conversation

@xqqp (Contributor) commented Oct 19, 2025

Description

The maximum cost of the Ristretto cache is set to a number of megabytes, but when inserting a posting list item into the cache, the cost is set to 1. Because of this, the cache only grows and no items get evicted. This results in high memory consumption, especially with the new UID cache (see #9513).

This PR introduces a cost function which estimates the memory size of each item. For more details see predictable-labs#6. Credits to @darkcoderrises for the implementation.

The default cache size is changed from 1024 MB to 4096 MB to reflect the more accurate cost estimation. See below for how the cache size relates to the memory occupied by the Dgraph process when this PR is applied.
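
For context, here is a minimal, self-contained sketch of how a size-based cost function can be attached to a Ristretto cache. The value type ([]byte), key, and numbers are illustrative assumptions, not the PR's actual code, which estimates the size of posting list items.

package main

import (
	"fmt"
	"log"

	"github.com/dgraph-io/ristretto/v2"
)

func main() {
	cache, err := ristretto.NewCache(&ristretto.Config[string, []byte]{
		NumCounters: 1e6,        // number of keys to track frequency of
		MaxCost:     1024 << 20, // total budget, here 1024 MiB expressed in bytes
		BufferItems: 64,         // number of keys per Get buffer
		// Cost is invoked when Set is called with a cost of 0; instead of a
		// constant 1 per item, charge the approximate byte size of the value.
		Cost: func(value []byte) int64 {
			return int64(len(value))
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cache.Close()

	// Passing cost 0 lets the Cost function above estimate the item size.
	cache.Set("some-key", make([]byte, 4<<20), 0)
	cache.Wait()

	if v, ok := cache.Get("some-key"); ok {
		fmt.Println("cached bytes:", len(v))
	}
}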

size-mb=1024

Dgraph process occupies up to 3.7 GiB of memory

heap:

  344.49MB 30.34% 30.34%   407.26MB 35.87%  github.com/hypermodeinc/dgraph/v25/posting.(*List).calculateUids
  130.37MB 11.48% 41.83%   130.37MB 11.48%  github.com/dgraph-io/ristretto/v2.newCmRow
   83.20MB  7.33% 49.15%    83.20MB  7.33%  github.com/dgraph-io/badger/v4/skl.newArena
   69.50MB  6.12% 55.28%    69.50MB  6.12%  github.com/hypermodeinc/dgraph/v25/posting.(*List).Uids.func1
   65.16MB  5.74% 61.02%    65.16MB  5.74%  github.com/dgraph-io/ristretto/v2/z.(*Bloom).Size
   53.27MB  4.69% 65.71%    53.27MB  4.69%  github.com/hypermodeinc/dgraph/v25/posting.(*List).calculateUids.func1

size-mb=2048

Dgraph process occupies up to 5.4 GiB of memory

heap:

  637.07MB 33.68% 33.68%   699.33MB 36.97%  github.com/hypermodeinc/dgraph/v25/posting.(*List).calculateUids
  260.63MB 13.78% 47.46%   260.63MB 13.78%  github.com/dgraph-io/ristretto/v2.newCmRow
  130.04MB  6.88% 54.34%   130.04MB  6.88%  github.com/dgraph-io/ristretto/v2/z.(*Bloom).Size
   99.01MB  5.23% 59.57%    99.01MB  5.23%  github.com/hypermodeinc/dgraph/v25/posting.(*List).Uids.func1
   89.89MB  4.75% 64.32%    89.89MB  4.75%  github.com/dgraph-io/ristretto/v2.(*lockedMap[go.shape.*uint8]).Set
   83.20MB  4.40% 68.72%    83.20MB  4.40%  github.com/dgraph-io/badger/v4/skl.newArena

size-mb=4096

Dgraph process occupies up to 8.4 GiB of memory

heap:

 1343.44MB 41.61% 41.61%  1404.71MB 43.51%  github.com/hypermodeinc/dgraph/v25/posting.(*List).calculateUids
  520.15MB 16.11% 57.72%   520.15MB 16.11%  github.com/dgraph-io/ristretto/v2.newCmRow
     260MB  8.05% 65.77%      260MB  8.05%  github.com/dgraph-io/ristretto/v2/z.(*Bloom).Size
  136.70MB  4.23% 70.01%   136.70MB  4.23%  github.com/dgraph-io/ristretto/v2.(*lockedMap[go.shape.*uint8]).Set
  106.51MB  3.30% 73.31%   106.51MB  3.30%  reflect.New
   83.20MB  2.58% 75.88%    83.20MB  2.58%  github.com/dgraph-io/badger/v4/skl.newArena

size-mb=8192

Dgraph process occupies up to 13.3 GiB of memory

heap:

 2539.78MB 45.10% 45.10%  2601.10MB 46.19%  github.com/hypermodeinc/dgraph/v25/posting.(*List).calculateUids
 1040.01MB 18.47% 63.56%  1040.01MB 18.47%  github.com/dgraph-io/ristretto/v2.newCmRow
     520MB  9.23% 72.80%      520MB  9.23%  github.com/dgraph-io/ristretto/v2/z.(*Bloom).Size
  223.52MB  3.97% 76.77%   223.52MB  3.97%  reflect.New
  157.97MB  2.81% 79.57%   157.97MB  2.81%  github.com/dgraph-io/ristretto/v2.(*lockedMap[go.shape.*uint8]).Set
  136.01MB  2.42% 81.99%   259.03MB  4.60%  github.com/hypermodeinc/dgraph/v25/posting.copyList

Closes #9513

@xqqp requested a review from a team on October 19, 2025 at 21:20
@xqqp force-pushed the fix_cache_costs branch 6 times, most recently from a933320 to 7bbc497 on October 22, 2025 at 15:42
@xqqp (Contributor, Author) commented Oct 22, 2025

Test failures seem unrelated to this change. @matthewmcneely, when you have time please take a look; without this patch, v25 is unusable for me because memory quickly fills up.

@matthewmcneely (Contributor) commented:

Test failures seem unrelated to this change. @matthewmcneely, when you have time please take a look; without this patch, v25 is unusable for me because memory quickly fills up.

@xqqp Yeah, the shared runners we are now using don't seem to be up to the task. We're working on it.

@matthewmcneely (Contributor) commented:

@xqqp Thanks again for this PR and your patience. We're still trying to get our runners configured through a third party following the move back to dgraph-io. In the meantime, a few questions regarding your PR.

  1. Yesterday I ran Dgraph from your branch, live loaded the 1M triples set, and ran a ton of queries for 5 minutes. I did the same with v25. I'm not seeing a significant difference in heap between the two. Can you tell me the situation in which you saw memory growing unbounded? Maybe I'm not running enough queries, or not for long enough.
  2. I'm reluctant to increase the default cache size to 4G. Anyone upgrading and not aware of that change might be in for a surprise, but probably only those running on smallish setups. Still, we try to avoid blindsiding if at all possible. Question: there's no downside to keeping the default cache at 1G, right? We can make it clear in the release notes that setting that value to some fraction of the available RAM is a good idea and why. Thoughts?

Thanks again, and I'll let you know when we get our runners configured and get this PR successfully thru CI

@xqqp (Contributor, Author) commented Nov 7, 2025

@matthewmcneely Thanks for looking into this.

Yesterday I ran Dgraph from your branch, live loaded the 1M triples set, and ran a ton of queries for 5 minutes. I did the same with v25. I'm not seeing a significant difference in heap between the two. Can you tell me the situation in which you saw memory growing unbounded? Maybe I'm not running enough queries, or not for long enough.

See predictable-labs#6 (comment) regarding this. Try querying a few million unique nodes and the issue should pop up very quickly. I implemented a short program, shown below, which reproduces the issue on the 21 million triples dataset. If you load more data, the issue becomes even more prevalent.

Golang Query example
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"strconv"

	"github.com/dgraph-io/dgo/v250"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	dgraphClient, err := dgo.NewClient("localhost:9080",
		dgo.WithGrpcOption(grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024*1024*1024))),
		dgo.WithGrpcOption(grpc.WithTransportCredentials(insecure.NewCredentials())))
	if err != nil {
		log.Fatalf("Failed to connect to Dgraph: %v", err)
	}
	defer dgraphClient.Close()

	ctx := context.Background()
	step := 50000
	lastNodeUID := "0x0"
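	// Page through every node that has dgraph.type in batches of `step`,
	// using first/after pagination keyed on the last UID seen in the previous batch.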
	for i := 0; ; i += step {
		query := `query Q($first:int,$after:string){
				q(func: has(dgraph.type),first:$first,after:$after){
					uid
					performance.film {
						uid
					}
					performance.actor {
						uid
					}
					performance.character{
						uid
					}
				}
			  }`
		resp, err := dgraphClient.NewTxn().QueryWithVars(ctx, query, map[string]string{
			"$first": strconv.Itoa(step), "$after": lastNodeUID})
		if err != nil {
			log.Fatalf("Failed to execute query: %v", err)
		}
		var r struct {
			Q []struct {
				UID string `json:"uid"`
			} `json:"q"`
		}

		if err = json.Unmarshal(resp.GetJson(), &r); err != nil {
			log.Fatalf("Failed to unmarshal json: %v", err)
		}

		if len(r.Q) == 0 {
			break
		}

		lastNodeUID = r.Q[len(r.Q)-1].UID
		fmt.Println("step", step)
	}
}

I'm reluctant to increase the default cache size to 4G. Anyone upgrading and not aware of that change might be in for a surprise, but probably only those running on smallish setups. Still, we try to avoid blindsiding if at all possible. Question: there's no downside to keeping the default cache at 1G, right? We can make it clear in the release notes that setting that value to some fraction of the available RAM is a good idea and why. Thoughts?

With this PR, the cache cost of each item is set to an approximation of its actual memory size. Without this PR, the cost of each cache item is 1, which means the cache will keep up to 1099511627776 items (default 1024 MiB max cost), so items are rarely evicted. If this PR is applied and the default 1024 MiB cache size is kept, the cache will hold far fewer items than before (because the approximated cost is now charged), which could decrease performance. For that reason I increased the default cache size. If you want to keep the default cache size, that is also fine with me.
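
To make the difference concrete, here is a small sketch (illustrative keys, sizes, and budget, not Dgraph's actual configuration) contrasting the two costing modes against the same Ristretto budget:

package main

import (
	"fmt"

	"github.com/dgraph-io/ristretto/v2"
)

func main() {
	const maxCost = 1 << 20 // illustrative budget: 1 MiB

	cache, err := ristretto.NewCache(&ristretto.Config[uint64, []byte]{
		NumCounters: 1e5,
		MaxCost:     maxCost,
		BufferItems: 64,
	})
	if err != nil {
		panic(err)
	}
	defer cache.Close()

	// Old behaviour: a constant cost of 1 means MaxCost bounds the number of
	// items rather than bytes, so up to 1<<20 items of any size would be kept.
	cache.Set(1, make([]byte, 64<<10), 1)

	// New behaviour: charging the value's byte size means MaxCost bounds the
	// bytes, so only ~16 of these 64 KiB values fit in the 1 MiB budget.
	cache.Set(2, make([]byte, 64<<10), 64<<10)

	cache.Wait()
	for _, key := range []uint64{1, 2} {
		if _, ok := cache.Get(key); ok {
			fmt.Println("key", key, "admitted")
		}
	}
}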
