fix_: cache read-only communities to reduce memory pressure #6519


Merged
merged 1 commit into develop from fix/community-extensive-memory-allocations
May 15, 2025

Conversation


@osmaczko osmaczko commented Apr 11, 2025

Full database reads, especially on message receipt, caused repeated allocations and high RAM usage due to unmarshaling full community objects. This change introduces a lightweight cache (up to 5 entries, 1-minute TTL) to avoid redundant DB access and deserialization for commonly used communities.
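For illustration, here is a minimal sketch of the read-through caching approach (not the exact PR code). It assumes the jellydator/ttlcache library visible in the diff further down, a trimmed-down ReadonlyCommunity interface, and a hypothetical loadFromDB helper standing in for the real persistence layer.

```go
package communities

import (
	"time"

	"github.com/jellydator/ttlcache/v3"
)

// ReadonlyCommunity is a stand-in for the read-only interface introduced by this PR.
type ReadonlyCommunity interface {
	IDString() string
}

type Manager struct {
	cache *ttlcache.Cache[string, ReadonlyCommunity]
}

func NewManager() *Manager {
	return &Manager{
		// Up to 5 entries with a 1-minute TTL, matching the parameters in this PR.
		cache: ttlcache.New(
			ttlcache.WithCapacity[string, ReadonlyCommunity](5),
			ttlcache.WithTTL[string, ReadonlyCommunity](time.Minute),
		),
	}
}

// GetByIDReadonly serves a cached community when available and only falls back
// to the database (and a full unmarshal) on a cache miss.
func (m *Manager) GetByIDReadonly(id []byte) (ReadonlyCommunity, error) {
	if item := m.cache.Get(string(id)); item != nil {
		return item.Value(), nil
	}
	community, err := m.loadFromDB(id)
	if err != nil {
		return nil, err
	}
	m.cache.Set(string(id), community, ttlcache.DefaultTTL)
	return community, nil
}

// loadFromDB is a placeholder for the real read: a full database fetch plus
// protobuf deserialization of the community.
func (m *Manager) loadFromDB(id []byte) (ReadonlyCommunity, error) {
	_ = id
	return nil, nil
}
```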

Issue found during investigation of status-im/status-mobile#22463

CPU and memory usage are slightly better:
[screenshot: cache_fix_metrics]

Total memory allocation (not to be confused with memory in use) during a 4-minute app run dropped from 10 GB to less than 2 GB.
[screenshot: cache_fix_mem_allocation]

Interestingly, the Go runtime is quite greedy and reluctant to return unused memory to the operating system. See below:

HeapAlloc: 177 MB
HeapSys:   337 MB
HeapInuse: 199 MB
HeapIdle: 137 MB
HeapReleased: 12 MB
HeapToReturnToOS: 125 MB
StackInuse: 6 MB
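
For context, numbers like these can be gathered with runtime.ReadMemStats. A small sketch (note that HeapToReturnToOS is not a runtime field; it is derived here as HeapIdle minus HeapReleased):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	mb := func(b uint64) uint64 { return b / (1 << 20) }
	fmt.Printf("HeapAlloc: %d MB\n", mb(m.HeapAlloc))
	fmt.Printf("HeapSys:   %d MB\n", mb(m.HeapSys))
	fmt.Printf("HeapInuse: %d MB\n", mb(m.HeapInuse))
	fmt.Printf("HeapIdle: %d MB\n", mb(m.HeapIdle))
	fmt.Printf("HeapReleased: %d MB\n", mb(m.HeapReleased))
	// Idle-but-unreleased pages are what the runtime could still hand back to the OS.
	fmt.Printf("HeapToReturnToOS: %d MB\n", mb(m.HeapIdle-m.HeapReleased))
	fmt.Printf("StackInuse: %d MB\n", mb(m.StackInuse))
}
```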


status-im-auto commented Apr 11, 2025

Jenkins Builds

Older builds (25):
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 1124215 #1 2025-04-11 10:39:47 ~2 min ios 📦zip
✔️ 1124215 #1 2025-04-11 10:40:03 ~3 min android 📦aar
✔️ 1124215 #1 2025-04-11 10:41:28 ~4 min macos 📦zip
✔️ 1124215 #1 2025-04-11 10:42:03 ~5 min macos 📦zip
✔️ 1124215 #1 2025-04-11 10:42:33 ~5 min windows 📦zip
✔️ 1124215 #1 2025-04-11 10:43:01 ~6 min linux 📦zip
✔️ 1124215 #1 2025-04-11 10:47:04 ~10 min tests-rpc 📄log
✔️ 1124215 #1 2025-04-11 11:13:51 ~36 min tests 📄log
✔️ 98457f8 #2 2025-04-11 10:52:44 ~3 min ios 📦zip
✔️ 98457f8 #2 2025-04-11 10:52:50 ~3 min android 📦aar
✔️ 98457f8 #2 2025-04-11 10:54:01 ~4 min windows 📦zip
✔️ 98457f8 #2 2025-04-11 10:54:17 ~4 min macos 📦zip
✔️ 98457f8 #2 2025-04-11 10:54:51 ~5 min macos 📦zip
✔️ 98457f8 #2 2025-04-11 10:55:57 ~6 min linux 📦zip
✖️ 98457f8 #2 2025-04-11 10:58:45 ~8 min tests-rpc 📄log
✔️ 98457f8 #2 2025-04-11 11:49:57 ~35 min tests 📄log
✔️ 98457f8 #3 2025-04-11 18:30:29 ~5 min tests-rpc 📄log
✔️ 0bb489f #3 2025-04-14 20:28:37 ~2 min android 📦aar
✔️ 0bb489f #3 2025-04-14 20:28:59 ~3 min ios 📦zip
✔️ 0bb489f #3 2025-04-14 20:30:43 ~4 min windows 📦zip
✔️ 0bb489f #3 2025-04-14 20:31:05 ~5 min macos 📦zip
✔️ 0bb489f #3 2025-04-14 20:31:06 ~5 min macos 📦zip
✔️ 0bb489f #3 2025-04-14 20:31:47 ~5 min linux 📦zip
✔️ 0bb489f #4 2025-04-14 20:35:09 ~9 min tests-rpc 📄log
✔️ 0bb489f #3 2025-04-14 21:01:49 ~35 min tests 📄log
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 37e5e84 #4 2025-05-13 13:59:28 ~2 min android 📦aar
✔️ 37e5e84 #4 2025-05-13 14:00:26 ~3 min ios 📦zip
✔️ 37e5e84 #4 2025-05-13 14:00:57 ~4 min macos 📦zip
✔️ 37e5e84 #4 2025-05-13 14:02:18 ~5 min macos 📦zip
✔️ 37e5e84 #4 2025-05-13 14:02:26 ~5 min linux 📦zip
✔️ 37e5e84 #4 2025-05-13 14:03:20 ~6 min windows 📦zip
✔️ 37e5e84 #5 2025-05-13 14:08:16 ~11 min tests-rpc 📄log
✖️ 37e5e84 #4 2025-05-13 14:32:39 ~35 min tests 📄log
✖️ 37e5e84 #5 2025-05-13 15:24:28 ~36 min tests 📄log
✖️ 37e5e84 #6 2025-05-13 16:03:37 ~32 min tests 📄log
✖️ 37e5e84 #7 2025-05-13 19:09:31 ~34 min tests 📄log
✖️ 37e5e84 #8 2025-05-14 07:08:08 ~33 min tests 📄log
✖️ 37e5e84 #9 2025-05-14 09:18:19 ~33 min tests 📄log
✖️ 37e5e84 #10 2025-05-14 13:40:29 ~31 min tests 📄log
✔️ eeac6d9 #5 2025-05-14 13:47:25 ~3 min android 📦aar
✔️ eeac6d9 #5 2025-05-14 13:47:34 ~3 min ios 📦zip
✔️ eeac6d9 #5 2025-05-14 13:49:04 ~4 min macos 📦zip
✔️ eeac6d9 #5 2025-05-14 13:49:37 ~5 min macos 📦zip
✔️ eeac6d9 #5 2025-05-14 13:50:00 ~5 min windows 📦zip
✔️ eeac6d9 #5 2025-05-14 13:50:46 ~6 min linux 📦zip
✖️ eeac6d9 #6 2025-05-14 13:54:47 ~10 min tests-rpc 📄log
✔️ eeac6d9 #11 2025-05-14 14:20:29 ~35 min tests 📄log
✖️ eeac6d9 #7 2025-05-14 18:38:03 ~7 min tests-rpc 📄log
✔️ eeac6d9 #8 2025-05-14 18:49:48 ~7 min tests-rpc 📄log

@osmaczko osmaczko force-pushed the fix/community-extensive-memory-allocations branch from 1124215 to 98457f8 Compare April 11, 2025 10:49
@osmaczko osmaczko changed the title fix_: introduce read-only community cache to reduce memory pressure fix_: cache read-only communities to reduce memory pressure Apr 11, 2025

@igor-sirotin igor-sirotin left a comment


noice!

@@ -3993,6 +3996,10 @@ func (m *Manager) GetByID(id []byte) (*Community, error) {
return community, nil
}

func (m *Manager) GetByIDReadonly(id []byte) (ReadonlyCommunity, error) {
Collaborator


Can you please add a function description for both GetByIDReadonly and GetByID?
And perhaps mention that GetByIDReadonly should be used wherever possible.


@qfrank qfrank left a comment


Thank you for the improvement!

IsControlNode() bool
CanPost(pk *ecdsa.PublicKey, chatID string, messageType protobuf.ApplicationMetadataMessage_Type) (bool, error)
IsBanned(pk *ecdsa.PublicKey) bool
}
Contributor


Nice to know what ReadonlyCommunity can do 👍

Contributor Author


This is just a subset; it should be extended to cover all read-only functions. To iterate quickly, I only included the ones that pprof showed were called frequently.


codecov bot commented Apr 11, 2025

Codecov Report

Attention: Patch coverage is 77.14286% with 8 lines in your changes missing coverage. Please review.

Project coverage is 60.48%. Comparing base (ae86dd5) to head (eeac6d9).
Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
protocol/communities/manager.go 75.75% 4 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6519      +/-   ##
===========================================
+ Coverage    60.41%   60.48%   +0.07%     
===========================================
  Files          841      841              
  Lines       104917   104930      +13     
===========================================
+ Hits         63388    63471      +83     
+ Misses       33940    33899      -41     
+ Partials      7589     7560      -29     
Flag Coverage Δ
functional 25.93% <54.28%> (+0.43%) ⬆️
unit 58.30% <77.14%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
protocol/communities/community.go 75.00% <ø> (ø)
protocol/messenger.go 64.98% <100.00%> (+0.30%) ⬆️
protocol/messenger_handler.go 59.87% <100.00%> (+0.08%) ⬆️
protocol/communities/manager.go 65.76% <75.75%> (+0.08%) ⬆️

... and 29 files with indirect coverage changes


@jrainville jrainville left a comment


Nice. It's very cleanly done

@@ -432,6 +434,7 @@ func NewManager(
communityLock: NewCommunityLock(logger),
mediaServer: mediaServer,
communityImageVersions: make(map[string]uint32),
cache: ttlcache.New(ttlcache.WithCapacity[string, ReadonlyCommunity](5), ttlcache.WithTTL[string, ReadonlyCommunity](time.Minute)),
Contributor


Interesting solution @osmaczko. Have you tried other combinations of caching parameters before settling on these?

Around the time the TTL expires, do you see any potential risk that different parts of the code might receive stale cached data while others get fresh data? I've run into similar timing issues in the past, which is why I'm asking. It might be a point of concern when used from multiple goroutines.

Contributor Author


Interesting solution @osmaczko. Have you tried other combinations of caching parameters before settling on these?

I haven’t experimented with other parameter combinations. I set them based on the results observed in the pprof output. The test scenario involved joining the Status community, fetching historical messages, and passively observing activity. The 1-minute TTL is aligned with the duration of history batch processing, which takes approximately one minute—as indicated by the CPU spike between 30s and 80s in the second screenshot. The choice of 5 communities is somewhat arbitrary. A fully deserialized Status community consumes roughly 18MB of RAM, so five communities amount to about 90MB, which I considered a reasonable threshold, rounding it to ~100MB.

Around the time the TTL expires, do you see any potential risk that different parts of the code might receive stale cached data while others get fresh data? I've run into similar timing issues in the past, which is why I'm asking. It might be a point of concern when used from multiple goroutines.

Good question. The cache is invalidated in a thread-safe way each time a community is saved (see the sketch below). In theory, the behavior remains identical to the previous implementation, except that data is now read from the cache instead of directly from the database. Unless there's a subtle edge case I've overlooked, I don't see any risks with this approach.
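
To make the invalidation concrete, a small sketch extending the one in the description above (the persistence interface, Community type, and saveCommunity name are illustrative, not the exact PR code); ttlcache's Delete is safe for concurrent use:

```go
// Community is a placeholder for the full, mutable community type.
type Community struct{ id string }

func (c *Community) IDString() string { return c.id }

// persistence stands in for the real storage layer.
type persistence interface {
	SaveCommunity(*Community) error
}

// saveCommunity writes the community and drops any cached read-only view, so
// the next GetByIDReadonly call reads fresh data from the database.
func (m *Manager) saveCommunity(db persistence, community *Community) error {
	if err := db.SaveCommunity(community); err != nil {
		return err
	}
	m.cache.Delete(community.IDString())
	return nil
}
```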


osmaczko commented Apr 11, 2025

Interestingly, the Go runtime is quite greedy and reluctant to return unused memory to the operating system. See below:

HeapAlloc: 177 MB
HeapSys:   337 MB
HeapInuse: 199 MB
HeapIdle: 137 MB
HeapReleased: 12 MB
HeapToReturnToOS: 125 MB
StackInuse: 6 MB

Regarding this, we could try using GOMEMLIMIT to set a soft memory limit for mobile builds. This will prompt the runtime to stay within the specified memory budget and may cause it to return memory more eagerly.

CC: @ilmotta @qfrank
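
For reference, a minimal sketch of the soft limit, assuming an illustrative 250 MiB budget; the same effect can be had without code changes by exporting GOMEMLIMIT=250MiB for the build:

```go
package main

import "runtime/debug"

func main() {
	// Soft memory limit of 250 MiB (illustrative value only). As the heap
	// approaches the limit, the runtime collects more aggressively and returns
	// memory to the OS sooner, instead of growing the heap based on GOGC alone.
	debug.SetMemoryLimit(250 << 20)

	// ... application startup
}
```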

@osmaczko
Contributor Author

"Out-of-memory errors (OOMs) have been a pain-point for Go applications. A class of these errors come from the same underlying cause: a temporary spike in memory causes the Go runtime to grow the heap, but it takes a very long time (on the order of minutes) to return that unneeded memory back to the system."
source: golang/go#30333

"Both of these situations, dealing with out-of-memory errors and homegrown garbage collection tuning, have a straightforward solution that other platforms (like Java and TCMalloc) already provide its users: a configurable memory limit, enforced by the Go runtime. A memory limit would give the Go runtime the information it needs to both respect users' memory limits, and allow them to optionally use that memory always, to cut back the cost of garbage collection."
source: https://github.com/golang/proposal/blob/master/design/48409-soft-memory-limit.md

@osmaczko
Contributor Author

Be careful: setting the soft memory limit too low makes the CPU go bonkers:
[screenshot: mem_usage_new_account]


ilmotta commented Apr 15, 2025

"Out-of-memory errors (OOMs) have been a pain-point for Go applications. A class of these errors come from the same underlying cause: a temporary spike in memory causes the Go runtime to grow the heap, but it takes a very long time (on the order of minutes) to return that unneeded memory back to the system." source: golang/go#30333

@osmaczko Would it help to manually trigger the GC when the mobile app goes to the background to force reclaim memory? Should we entertain the idea of manipulating the GC, like affecting the GOGC value with SetGCPercent at specific checkpoints in the code? I have no experience manipulating the GC in Go, but these ideas could be fruitful. Or perhaps have we tried a bit of this already and it didn't bring meaningful improvements?

@osmaczko
Contributor Author

@osmaczko Would it help to manually trigger the GC when the mobile app goes to the background to force reclaim memory?

I think that's a good approach. The only concern is how long it takes the Go runtime to release memory, and whether it's fast enough before the OS terminates the application. If we decide to go this route, we could use runtime/debug.FreeOSMemory.

Relevant paragraph on that from golang/go#30333:

"The second example of such tuning is calling runtime/debug.FreeOSMemory at some regular interval, forcing a garbage collection to trigger sooner, usually to respect some memory limit. This case is much more dangerous, because calling it too frequently can lead a process to entirely freeze up, spending all its time on garbage collection. Working with it takes careful consideration and experimentation to be both effective and avoid serious repercussions."

We need to be cautious, but I believe triggering this only when the app moves to the background is relatively safe.
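
A minimal sketch of that idea, assuming a hypothetical onAppBackgrounded hook wired up to the mobile lifecycle callbacks:

```go
package app

import "runtime/debug"

// onAppBackgrounded is a hypothetical lifecycle hook, not an existing status-go
// function. FreeOSMemory forces a garbage collection and returns as much memory
// to the OS as possible, so it is deliberately called only on backgrounding.
func onAppBackgrounded() {
	debug.FreeOSMemory()
}
```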

Should we entertain the idea of manipulating the GC, like affecting the GOGC value with SetGCPercent at specific checkpoints in the code? I have no experience manipulating the GC in Go, but these ideas could be fruitful. Or perhaps have we tried a bit of this already and it didn't bring meaningful improvements?

I don't have experience with that. Based on the excerpt below, I don't think we should:

"This out-of-memory avoidance led to the Go community developing its own homegrown garbage collector tuning.

The first example of such tuning is the heap ballast. In order to increase their productivity metrics while also avoiding out-of-memory errors, users sometimes pick a low GOGC value, and fool the GC into thinking that there's a fixed amount of memory live. This solution elegantly scales with GOGC: as the real live heap increases, the impact of that fixed set decreases, and GOGC's contribution to the heap size dominates. In effect, GOGC is larger when the heap is smaller, and smaller (down to its actual value) when the heap is larger. Unfortunately, this solution is not portable across platforms, and is not guaranteed to continue to work as the Go runtime changes. Furthermore, users are forced to do complex calculations and estimate runtime overheads in order to pick a heap ballast size that aligns with their memory limits."


ilmotta commented Apr 16, 2025

Thanks @osmaczko! Indeed, we should always be careful with GC tuning. For mobile there's such a wide variety of devices that it's nearly impossible to find a single hard-coded value that works optimally for everybody.

We have yet to verify the impact of forcefully freeing up memory on different devices. There's also the scenario where a user switches between apps multiple times in a row; in that case it would be better not to call FreeOSMemory every time the app is backgrounded, but to debounce the call instead (see the sketch below). To decide how long to debounce, though, we would need a clearer picture of how long it takes for the GC to do its thing under casual usage, and how much CPU pressure we would add by cleaning up (potentially) too frequently. Complexity 😅
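
To sketch the debouncing idea (the lifecycle hooks and the delay value are assumptions, not measurements): repeated app switches keep pushing the release back, and foregrounding cancels it entirely.

```go
package app

import (
	"runtime/debug"
	"sync"
	"time"
)

// memoryReleaser debounces FreeOSMemory: the release runs only after the app has
// stayed in the background for `delay`, so rapid app switching never triggers it.
type memoryReleaser struct {
	mu    sync.Mutex
	timer *time.Timer
	delay time.Duration // e.g. 30 * time.Second (illustrative)
}

func (r *memoryReleaser) OnAppBackgrounded() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.timer != nil {
		r.timer.Stop()
	}
	r.timer = time.AfterFunc(r.delay, debug.FreeOSMemory)
}

func (r *memoryReleaser) OnAppForegrounded() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.timer != nil {
		r.timer.Stop()
		r.timer = nil
	}
}
```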

@jrainville
Member

@osmaczko can we merge this?

@osmaczko
Contributor Author

@osmaczko can we merge this?

Need to resolve status-im/status-desktop#17781 first. Let me take a look.

@osmaczko osmaczko force-pushed the fix/community-extensive-memory-allocations branch from 0bb489f to 37e5e84 Compare May 13, 2025 13:56
Full database reads, especially on message receipt, caused repeated
allocations and high RAM usage due to unmarshaling full community
objects. This change introduces a lightweight cache (up to 5 entries,
1-minute TTL) to avoid redundant DB access and deserialization for
commonly used communities.
@osmaczko osmaczko force-pushed the fix/community-extensive-memory-allocations branch from 37e5e84 to eeac6d9 Compare May 14, 2025 13:44
@osmaczko osmaczko merged commit b76b2bc into develop May 15, 2025
21 checks passed
@osmaczko osmaczko deleted the fix/community-extensive-memory-allocations branch May 15, 2025 07:40
@github-project-automation github-project-automation bot moved this from Code Review to Done in Status Desktop/Mobile Board May 15, 2025
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Cache read-only communities to reduce memory pressure
6 participants