Skip to content

Add Memory::alloc_static_zeroed to allocate memory that's filled with zeroes.#104124

Merged
Repiteo merged 1 commit into
godotengine:masterfrom
Ivorforce:alloc-static-calloc
May 19, 2025
Merged

Add Memory::alloc_static_zeroed to allocate memory that's filled with zeroes.#104124
Repiteo merged 1 commit into
godotengine:masterfrom
Ivorforce:alloc-static-calloc

Conversation

@Ivorforce
Copy link
Copy Markdown
Member

@Ivorforce Ivorforce commented Mar 14, 2025

This is generally faster than malloc followed by memset. It's essentially the Memory equivalent of calloc.

The reason is that pages handed out by the OS are usually zeroed out already. This is for security purposes, as you cannot risk handing out traces of memory used by another application. The OS will do this in its spare time.
Most allocators make use of this in calloc. If large amounts of memory are requested, it will just request zeroed pages from OS, and can hand them back immediately. If small amounts are requested, the allocator will just use memset to clear out the requested memory.

This can be exploited for performance benefit for e.g. resize_zeroed and is_zero_constructible structs: If we request zeroed pages for these calls, we can omit our own memset calls, which will make the allocation near instantaneous.

@aaronp64
Copy link
Copy Markdown
Contributor

#91633 looks similar, adding Memory::calloc_static. Just taking a quick look through, I like the idea of using p_ensure_zero to reuse most of the function if we can, not sure if there's other differences between the two.

For naming I'm leaning more toward calloc_static, but either sounds good to me.

@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Mar 14, 2025

#91633 looks similar

This is the same idea indeed!

I think calling the function calloc_static is somewhat dangerous: calloc normally accepts 2 parameters. If you call Memory::calloc_static(1, 512), you may expect it to generate 512 bytes if you know calloc (analogously to calloc(1, 512)). Instead, you get 1 byte with p_pad_align=true. This is because Memory::calloc_static accepts parameters bytes and pad_align, while calloc accepts num and bytes. It's a misunderstanding waiting to happen.

I instead opted to call it zeroed because that's consistent with other functions with similar behavior in the Godot codebase. The name is also (I think) more obvious what it does, at least to people who have never heard of calloc.

@DeeJayLSP

This comment was marked as resolved.

@DeeJayLSP
Copy link
Copy Markdown
Contributor

DeeJayLSP commented Apr 16, 2025

I found use for this in these files (all of them have occurrences of memalloc/Memory::alloc_static followed by resetting all values to 0):

  • core/io/zip_io.cpp
  • core/os/a_hash_map.h
  • core/templates/hash_map.h
  • core/templates/hash_set.h
  • core/templates/oa_hash_map.h
  • driver/gles3/shader_gles3.cpp

@Ivorforce Ivorforce force-pushed the alloc-static-calloc branch from 34261cf to adb793d Compare April 26, 2025 09:06
@Ivorforce Ivorforce requested a review from a team as a code owner April 26, 2025 09:06
@Ivorforce
Copy link
Copy Markdown
Member Author

Thanks a lot @DeeJayLSP!
I've patched up all the spots you mentioned. Might as well make this a performance PR to motivate it better!

@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Apr 26, 2025

I benchmarked a ~4% improvement for (large) HashMap insertion:

	// Setup
	HashMap<int, int> hash_map;

	auto t0 = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < 20000000; i ++) {
		// Test
		hash_map.insert(i, i);
	}
	auto t1 = std::chrono::high_resolution_clock::now();
	std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";

This printed:

  • 2789ms on master
  • 2681ms on this PR

Comment thread core/templates/hash_set.h Outdated
@Ivorforce Ivorforce force-pushed the alloc-static-calloc branch from adb793d to 6b0cd9c Compare May 12, 2025 23:46
@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented May 12, 2025

I rebased (to include #106020 changes), and ran a slightly more complicated test (to cover more cases):

	for (int size : { 1, 2, 8, 64, 1024, 4096}) {
		auto t0 = std::chrono::high_resolution_clock::now();

		for (int run = 0; run < 20000000 / size; run++) {
			HashMap<int64_t, int64_t> dictionary;
			for (int idx = 0; idx < size; idx ++) {
				// Test
				dictionary.insert(idx, idx);
			}
		}
		auto t1 = std::chrono::high_resolution_clock::now();

		std::cout << "size:" << size << std::endl;
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}

This printed on master (db66343):

size:1
1464ms
size:2
944ms
size:8
579ms
size:64
709ms
size:1024
980ms
size:4096
1328ms

On this pr:

size:1
1268ms
size:2
838ms
size:8
550ms
size:64
672ms
size:1024
841ms
size:4096
1149ms

This new test suggests that, at least for fundamental types in key/value, HashMap insertion can be 15% faster (for normal sizes anyway). This makes the change more valuable than I previously assumed.

Admittedly, some of the speed difference may come from the new version using memset, but calloc should in some cases also skip the memset. This might be a bit harder to measure perfectly though, because whether it can skip it or not depends on how many blank pages of memory are available at the time of call.

@Ivorforce Ivorforce modified the milestones: 4.x, 4.5 May 12, 2025
…th zeroes.

This is generally faster than `malloc` followed by `memset` / loop-set to 0.
@Ivorforce Ivorforce force-pushed the alloc-static-calloc branch from 6b0cd9c to 3207066 Compare May 12, 2025 23:57
@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented May 18, 2025

Out of interest, I just tested again using alloc_static with memset instead of alloc_static_zeroed on HashMap. I got this:

master:

size:1
1430ms
size:2
921ms
size:8
563ms
size:64
670ms
size:1024
817ms
size:4096
1134ms

memset:

size:1
1283ms
size:2
838ms
size:8
545ms
size:64
652ms
size:1024
806ms
size:4096
1088ms

It's the same test, but quite different results from last time for master, so this definitely is a bit flaky.
In any case, it does seem like memset covers most of the performance benefit, but not all of it. I would therefore recommend to move forwards with this PR rather than detouring over memset first.

@Repiteo Repiteo merged commit 2d42b88 into godotengine:master May 19, 2025
20 checks passed
@Repiteo
Copy link
Copy Markdown
Contributor

Repiteo commented May 19, 2025

Thanks!

@Ivorforce Ivorforce deleted the alloc-static-calloc branch May 19, 2025 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants