fix_: cache read-only communities to reduce memory pressure #6519
Conversation
Force-pushed from 1124215 to 98457f8
noice!
@@ -3993,6 +3996,10 @@ func (m *Manager) GetByID(id []byte) (*Community, error) {
    return community, nil
}

func (m *Manager) GetByIDReadonly(id []byte) (ReadonlyCommunity, error) {
Can you please add some func description for both GetByIDReadonly and GetByID? And perhaps mention that GetByIDReadonly must be used where possible.
Thank you for the improvement!
    IsControlNode() bool
    CanPost(pk *ecdsa.PublicKey, chatID string, messageType protobuf.ApplicationMetadataMessage_Type) (bool, error)
    IsBanned(pk *ecdsa.PublicKey) bool
}
Nice to know what ReadonlyCommunity can do 👍
This is just a subset. It should be extended to cover all read-only functions; I only included the ones that pprof showed were called frequently, so I could iterate fast.
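For illustration, a minimal sketch of the kind of hot read path that could switch to the read-only accessor. The helper and its parameters are hypothetical, not code from this PR; only GetByIDReadonly and the three interface methods shown above are taken from the change:

```go
package example

import (
	"crypto/ecdsa"

	"github.com/status-im/status-go/protocol/communities"
	"github.com/status-im/status-go/protocol/protobuf"
)

// canSenderPost is a hypothetical helper: it only needs read access to the
// community, so it uses the cached GetByIDReadonly instead of GetByID,
// which deserializes the full community on every call.
func canSenderPost(
	manager *communities.Manager,
	communityID []byte,
	chatID string,
	sender *ecdsa.PublicKey,
	messageType protobuf.ApplicationMetadataMessage_Type,
) (bool, error) {
	community, err := manager.GetByIDReadonly(communityID)
	if err != nil {
		return false, err
	}
	if community.IsBanned(sender) {
		return false, nil
	}
	return community.CanPost(sender, chatID, messageType)
}
```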
Codecov Report. Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #6519      +/-   ##
===========================================
+ Coverage    60.41%   60.48%    +0.07%
===========================================
  Files          841      841
  Lines       104917   104930       +13
===========================================
+ Hits         63388    63471       +83
+ Misses       33940    33899       -41
+ Partials      7589     7560       -29

Flags with carried forward coverage won't be shown.
Nice. It's very cleanly done
@@ -432,6 +434,7 @@ func NewManager(
        communityLock:          NewCommunityLock(logger),
        mediaServer:            mediaServer,
        communityImageVersions: make(map[string]uint32),
        cache:                  ttlcache.New(ttlcache.WithCapacity[string, ReadonlyCommunity](5), ttlcache.WithTTL[string, ReadonlyCommunity](time.Minute)),
Interesting solution @osmaczko. Have you tried other combinations of caching parameters before settling on these?
Around the time the TTL expires, do you see any potential risk that different parts of the code might receive stale or cached data while others get fresh data? I've run into similar timing issues in the past; that's why I'm asking. It might be a point of concern when used with some goroutines.
> Interesting solution @osmaczko. Have you tried other combinations of caching parameters before settling on these?
I haven’t experimented with other parameter combinations. I set them based on the results observed in the pprof output. The test scenario involved joining the Status community, fetching historical messages, and passively observing activity. The 1-minute TTL is aligned with the duration of history batch processing, which takes approximately one minute—as indicated by the CPU spike between 30s and 80s in the second screenshot. The choice of 5 communities is somewhat arbitrary. A fully deserialized Status community consumes roughly 18MB of RAM, so five communities amount to about 90MB, which I considered a reasonable threshold, rounding it to ~100MB.
> Around the time the TTL expires, do you see any potential risk that different parts of the code might receive stale or cached data while others get fresh data? I've run into similar timing issues in the past; that's why I'm asking. It might be a point of concern when used with some goroutines.
Good question. The cache is invalidated in a thread-safe way each time a new community is saved. In theory, the behavior remains identical to the previous implementation, except that data is now read from the cache instead of directly from the database. Unless there's a subtle edge case I've overlooked, I don't see any risks with this approach.
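For readers following along, a minimal standalone sketch of the get-or-load pattern described here, using github.com/jellydator/ttlcache/v3 (the same library as in the diff above). The store/load/invalidate names are illustrative, not the actual Manager code; only the capacity of 5 and the 1-minute TTL are taken from the PR:

```go
package example

import (
	"time"

	"github.com/jellydator/ttlcache/v3"
)

// ReadonlyCommunity stands in for the interface from the PR; only the cache
// wiring is shown here.
type ReadonlyCommunity interface {
	IsControlNode() bool
}

type store struct {
	cache *ttlcache.Cache[string, ReadonlyCommunity]
	load  func(id string) (ReadonlyCommunity, error) // e.g. read + unmarshal from DB
}

func newStore(load func(id string) (ReadonlyCommunity, error)) *store {
	c := ttlcache.New(
		ttlcache.WithCapacity[string, ReadonlyCommunity](5),
		ttlcache.WithTTL[string, ReadonlyCommunity](time.Minute),
	)
	go c.Start() // evict expired entries in the background
	return &store{cache: c, load: load}
}

// getReadonly returns the cached community if present, otherwise loads and caches it.
func (s *store) getReadonly(id string) (ReadonlyCommunity, error) {
	if item := s.cache.Get(id); item != nil {
		return item.Value(), nil
	}
	community, err := s.load(id)
	if err != nil {
		return nil, err
	}
	s.cache.Set(id, community, ttlcache.DefaultTTL)
	return community, nil
}

// invalidate is called whenever the community is saved, so readers never see
// data older than the last write.
func (s *store) invalidate(id string) {
	s.cache.Delete(id)
}
```

Because every save deletes the cached entry, a reader can at worst see the same data it would have read from the database a moment earlier, which matches the thread-safety argument above.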
Regarding this, we could try using GOMEMLIMIT to set a soft memory limit for mobile builds. This will prompt the runtime to stay within the specified memory budget and may cause it to return memory more eagerly.

"Out-of-memory errors (OOMs) have been a pain-point for Go applications. A class of these errors come from the same underlying cause: a temporary spike in memory causes the Go runtime to grow the heap, but it takes a very long time (on the order of minutes) to return that unneeded memory back to the system."

"Both of these situations, dealing with out-of-memory errors and homegrown garbage collection tuning, have a straightforward solution that other platforms (like Java and TCMalloc) already provide its users: a configurable memory limit, enforced by the Go runtime. A memory limit would give the Go runtime the information it needs to both respect users' memory limits, and allow them to optionally use that memory always, to cut back the cost of garbage collection."
Force-pushed from 98457f8 to 0bb489f
@osmaczko Would it help to manually trigger the GC when the mobile app goes to the background to force reclaiming memory? Should we entertain the idea of manipulating the GC, like affecting the GOGC value with
I think that's a good approach. The only concern is how long it takes the Go runtime to release memory, and whether it's fast enough before the OS terminates the application. If we decide to go this route, we could use runtime/debug.FreeOSMemory. Relevant paragraph on that from golang/go#30333:

"The second example of such tuning is calling runtime/debug.FreeOSMemory at some regular interval, forcing a garbage collection to trigger sooner, usually to respect some memory limit. This case is much more dangerous, because calling it too frequently can lead a process to entirely freeze up, spending all its time on garbage collection. Working with it takes careful consideration and experimentation to be both effective and avoid serious repercussions."

We need to be cautious, but I believe triggering this only when the app moves to the background is relatively safe.
I don't have experience with that. Based on the quote below, I don't think we should:

"This out-of-memory avoidance led to the Go community developing its own homegrown garbage collector tuning. The first example of such tuning is the heap ballast. In order to increase their productivity metrics while also avoiding out-of-memory errors, users sometimes pick a low GOGC value, and fool the GC into thinking that there's a fixed amount of memory live. This solution elegantly scales with GOGC: as the real live heap increases, the impact of that fixed set decreases, and GOGC's contribution to the heap size dominates. In effect, GOGC is larger when the heap is smaller, and smaller (down to its actual value) when the heap is larger. Unfortunately, this solution is not portable across platforms, and is not guaranteed to continue to work as the Go runtime changes. Furthermore, users are forced to do complex calculations and estimate runtime overheads in order to pick a heap ballast size that aligns with their memory limits."
Thanks @osmaczko! Indeed, we should always be careful with GC tuning. For mobile there's such a wide variety of devices that it's nearly impossible to find a single hard-coded value that works optimally for everybody. We have yet to verify the impact of forcefully freeing up memory on different devices. There's also the scenario where a user switches between apps multiple times in a row; in that case it would be better not to call FreeOSMemory every time the app is backgrounded, and to debounce the call instead. But to decide how long to debounce, we would need a clearer picture of how long it takes the GC to do its thing with casual usage, and how much CPU pressure we would add by cleaning up (potentially) too frequently. Complexity 😅
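A sketch of the debounced variant discussed here, with placeholder timings; the lifecycle hooks (OnAppBackgrounded/OnAppForegrounded) are assumed to be wired up to the mobile bindings and are not existing status-go APIs:

```go
package example

import (
	"runtime/debug"
	"sync"
	"time"
)

// backgroundReclaimer debounces FreeOSMemory so that rapid app switches don't
// trigger repeated forced collections.
type backgroundReclaimer struct {
	mu    sync.Mutex
	timer *time.Timer
}

// OnAppBackgrounded schedules a single FreeOSMemory call after the app has
// stayed in the background for `delay`; returning to the foreground cancels it.
func (r *backgroundReclaimer) OnAppBackgrounded(delay time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.timer != nil {
		r.timer.Stop()
	}
	r.timer = time.AfterFunc(delay, func() {
		// Forces a GC and returns as much memory to the OS as possible.
		debug.FreeOSMemory()
	})
}

// OnAppForegrounded cancels any pending reclaim.
func (r *backgroundReclaimer) OnAppForegrounded() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.timer != nil {
		r.timer.Stop()
		r.timer = nil
	}
}
```

Usage would be something like r.OnAppBackgrounded(30 * time.Second) from the app-state handler; the 30-second delay is exactly the kind of number that, as noted above, would need measuring rather than guessing.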
@osmaczko can we merge this?
Need to resolve status-im/status-desktop#17781 first. Let me take a look.
Force-pushed from 0bb489f to 37e5e84
Force-pushed from 37e5e84 to eeac6d9
Full database reads, especially on message receipt, caused repeated allocations and high RAM usage due to unmarshaling full community objects. This change introduces a lightweight cache (up to 5 entries, 1-minute TTL) to avoid redundant DB access and deserialization for commonly used communities.
Issue found during investigation of status-im/status-mobile#22463
CPU and memory usage are a bit better.

[screenshot]
Total memory allocation (not to be confused with memory in use) during a 4-minute app run dropped from 10 GB to less than 2 GB.

Interestingly, the Go runtime is quite greedy and reluctant to return unused memory to the operating system. See below: