-
Notifications
You must be signed in to change notification settings - Fork 163
Faster parsing of GC(id) #463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is called for all lines, so cost adds up.
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes GC log parsing by replacing regex-based GC ID extraction with manual string parsing. The new implementation packs both the GC cycle ID and the textual length into a single long value for efficient reuse, particularly benefiting the UnifiedG1GCParser which needs both values.
Changes:
- Replaced
GCID_COUNTERregex pattern with manual string parsing methods inGCLogParser - Refactored
UnifiedG1GCParserfrom Stream API to for-loop for better performance - Updated all GC ID extraction call sites in
UnifiedGenerationalParserto use new methods
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java | Removed regex-based GCID_COUNTER; added manual parsing methods (extractGCID, extractGCCycleIdAndTextualLength, extractGCCycleId, extractGCCycleIdTextualLength) with bit-packing optimization |
| parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedG1GCParser.java | Refactored parse() method from Stream API to for-loop; uses new extractGCCycleIdAndTextualLength method; removed unused imports (Pattern, Matcher, AbstractMap); added wildcard import |
| parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedGenerationalParser.java | Replaced all 7 occurrences of GCID_COUNTER.parse() with extractGCID() calls |
| parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java | Added unit tests for new GC ID extraction methods covering both normal and malformed input cases |
Comments suppressed due to low confidence (3)
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:257
- The JavaDoc is incomplete. It should document what the method returns, including the meaning of the packed long value and that -1 indicates failure. Consider adding a @return tag that explains: "a packed long with the GC cycle ID in the upper 32 bits and the textual length in the lower 32 bits, or -1 if the GC cycle ID cannot be extracted".
/**
* Returns a packed long containing two ints:
* - the GC cycle id in the high bytes
* - the length of the text containing the GC cycle id,e.g. 'GC(10)'
* See {@link #extractGCCycleId(long)} and {@link #extractGCCycleIdTextualLength(long)}
*/
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:292
- Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the textual length of the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the textual length in the lower 32 bits @return the textual length in characters, or 0 if the packed value is -1".
protected static int extractGCCycleIdTextualLength(long packedGcCycleIdAndEnd) {
if (packedGcCycleIdAndEnd == -1) {
return 0;
}
return (int) packedGcCycleIdAndEnd;
}
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:272
- The behavior change in calculating the end position differs from the original regex implementation. The old regex
GC\\((\\d+)\\)matched including the trailing space after ")", so lineAfterGcId would start without a leading space. The new implementation returns end+1 (position after ")"), causing lineAfterGcId to include the leading space. While Matcher.find() will still match patterns that start after the space, this creates an inconsistency with the original behavior. Consider adjusting to match the original behavior by finding and skipping the trailing space, e.g., checking if line.charAt(end + 1) == ' ' and incrementing end accordingly.
// add the closing parenthesis to the length
int endToReturn = end + 1;
| // pack the two ints in a long | ||
| return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL); | ||
| } catch (Exception e) { | ||
| throw new RuntimeException("Failed to extract gc cycle id from " + line, e); |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling here is inconsistent with the test expectations. When the input is malformed (e.g., missing closing parenthesis), the test expects the method to return -1, but this catch block will throw a RuntimeException instead. The catch block should return -1 for malformed input to match the expected behavior and maintain consistent error handling with the rest of the method.
| throw new RuntimeException("Failed to extract gc cycle id from " + line, e); | |
| Logger.getLogger(GCLogParser.class.getName()) | |
| .log(Level.FINE, "Failed to extract GC cycle id from line: " + line, e); | |
| return -1; |
| // we search for the value between parenthesis | ||
| int start = line.indexOf('('); |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method finds the first '(' in the line, but doesn't verify it's part of "GC(". If a line has an earlier '(' before "GC(", this will extract the wrong value. For example, if a log line contains "(something) GC(35)", it would try to parse "something" as the GC ID. Consider using line.indexOf("GC(") and then adding 2 to get the position after "GC(", or change line.indexOf('(') to line.indexOf("GC(") + 2 to ensure you're finding the correct parenthesis.
This issue also appears in the following locations of the same file:
- line 271
| // we search for the value between parenthesis | |
| int start = line.indexOf('('); | |
| // we search for the value between parenthesis, specifically the one from "GC(" | |
| int gcMarker = line.indexOf("GC("); | |
| if (gcMarker == -1) { | |
| return -1; | |
| } | |
| // gcMarker points to 'G', so gcMarker + 2 is the index of '(' in "GC(" | |
| int start = gcMarker + 2; |
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java
Outdated
Show resolved
Hide resolved
| protected static int extractGCCycleId(long packedGcCycleIdAndEnd) { | ||
| if (packedGcCycleIdAndEnd == -1) { | ||
| return -1; | ||
| } | ||
| return (int) (packedGcCycleIdAndEnd >> 32); | ||
| } |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the GC cycle ID in the upper 32 bits @return the GC cycle ID, or -1 if the packed value is -1".
This issue also appears in the following locations of the same file:
- line 287
| import com.microsoft.gctoolkit.event.MalformedEvent; | ||
| import com.microsoft.gctoolkit.event.MemoryPoolSummary; | ||
| import com.microsoft.gctoolkit.event.RegionSummary; | ||
| import com.microsoft.gctoolkit.event.*; |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wildcard import should be avoided. While this change replaces multiple specific imports with a wildcard import, it makes it unclear which specific classes are being used from the event package. Consider keeping the explicit imports or at least listing them individually to improve code readability and maintainability.
| import com.microsoft.gctoolkit.event.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad bot suggested edit but the comment is good.
| @Test | ||
| void extractGCID() { | ||
| assertEquals(35, GCLogParser.extractGCID("[2025-10-21T16:44:29.311+0200][3645.640s] GC(35) Pause Young (Allocation Failure)")); | ||
| } |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for an important edge case: lines that contain a '(' character before "GC(". For example, a line like "[timestamp] (something) GC(35) pause" would incorrectly try to parse "something" as the GC ID. Add a test case to verify the method correctly identifies the "GC(" pattern and not just any parenthesis.
karianna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the late review! Looks mostly good. Possibly needs a comment at the top of extractGCID explaining hy we are packing the id and length in a single data structure (perf)
| return (trace != null) ? trace.getIntegerGroup(1) : -1; | ||
| static int extractGCID(String line) { | ||
| long packed = extractGCCycleIdAndTextualLength(line); | ||
| if (packed == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit-pick - Perhaps extract -1 to a constant called NOT_FOUND or something similar?
| // pack the two ints in a long | ||
| return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL); | ||
| } catch (Exception e) { | ||
| throw new RuntimeException("Failed to extract gc cycle id from " + line, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should actually try to recover from this and just process the next line...
| import com.microsoft.gctoolkit.event.MalformedEvent; | ||
| import com.microsoft.gctoolkit.event.MemoryPoolSummary; | ||
| import com.microsoft.gctoolkit.event.RegionSummary; | ||
| import com.microsoft.gctoolkit.event.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer explicit imports
parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java
Show resolved
Hide resolved
|
|
||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| import static org.junit.jupiter.api.Assertions.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer explicit imports
|
@obourgain - Do you have any pre/post numbers showing the per impact of this change? It's good overall (some nit-picks) but since it does add a bit of complexity I want to ensure we have a good corresponding perf gain. |
…r.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This is called for all lines, so cost adds up.
Most use cases use just the gcid, but in UnifiedG1GCParser the end position is also required to do a subString() and reduce the length of the String that will be parsed by all the Patterns.
My impl has a little behavior change. The regex
GC\\((\\d+)\\)was matching the space after the gcid while my implementation uses the index of the closing)as end, so the subString() has an additional space at first position. I can change my implementation upon request to keep the same behavior.