Faster parsing of GC(id) #463

obourgain · 2025-11-13T10:18:49Z

This is called for all lines, so cost adds up.

Most use cases use just the gcid, but in UnifiedG1GCParser the end position is also required to do a subString() and reduce the length of the String that will be parsed by all the Patterns.

My impl has a little behavior change. The regex GC\\((\\d+)\\) was matching the space after the gcid while my implementation uses the index of the closing ) as end, so the subString() has an additional space at first position. I can change my implementation upon request to keep the same behavior.

This is called for all lines, so cost adds up.

obourgain · 2025-11-13T15:09:29Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR optimizes GC log parsing by replacing regex-based GC ID extraction with manual string parsing. The new implementation packs both the GC cycle ID and the textual length into a single long value for efficient reuse, particularly benefiting the UnifiedG1GCParser which needs both values.

Changes:

Replaced GCID_COUNTER regex pattern with manual string parsing methods in GCLogParser
Refactored UnifiedG1GCParser from Stream API to for-loop for better performance
Updated all GC ID extraction call sites in UnifiedGenerationalParser to use new methods

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java	Removed regex-based GCID_COUNTER; added manual parsing methods (extractGCID, extractGCCycleIdAndTextualLength, extractGCCycleId, extractGCCycleIdTextualLength) with bit-packing optimization
parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedG1GCParser.java	Refactored parse() method from Stream API to for-loop; uses new extractGCCycleIdAndTextualLength method; removed unused imports (Pattern, Matcher, AbstractMap); added wildcard import
parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedGenerationalParser.java	Replaced all 7 occurrences of GCID_COUNTER.parse() with extractGCID() calls
parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java	Added unit tests for new GC ID extraction methods covering both normal and malformed input cases

Comments suppressed due to low confidence (3)

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:257

The JavaDoc is incomplete. It should document what the method returns, including the meaning of the packed long value and that -1 indicates failure. Consider adding a @return tag that explains: "a packed long with the GC cycle ID in the upper 32 bits and the textual length in the lower 32 bits, or -1 if the GC cycle ID cannot be extracted".

    /**
     * Returns a packed long containing two ints:
     * - the GC cycle id in the high bytes
     * - the length of the text containing the GC cycle id,e.g. 'GC(10)'
     * See {@link #extractGCCycleId(long)} and {@link #extractGCCycleIdTextualLength(long)}
     */

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:292

Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the textual length of the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the textual length in the lower 32 bits @return the textual length in characters, or 0 if the packed value is -1".

    protected static int extractGCCycleIdTextualLength(long packedGcCycleIdAndEnd) {
        if (packedGcCycleIdAndEnd == -1) {
            return 0;
        }
        return (int) packedGcCycleIdAndEnd;
    }

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:272

The behavior change in calculating the end position differs from the original regex implementation. The old regex GC\\((\\d+)\\) matched including the trailing space after ")", so lineAfterGcId would start without a leading space. The new implementation returns end+1 (position after ")"), causing lineAfterGcId to include the leading space. While Matcher.find() will still match patterns that start after the space, this creates an inconsistency with the original behavior. Consider adjusting to match the original behavior by finding and skipping the trailing space, e.g., checking if line.charAt(end + 1) == ' ' and incrementing end accordingly.

            // add the closing parenthesis to the length
            int endToReturn = end + 1;

Copilot · 2026-01-21T02:04:17Z

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

+            // pack the two ints in a long
+            return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL);
+        } catch (Exception e) {
+            throw new RuntimeException("Failed to extract gc cycle id from " + line, e);


The error handling here is inconsistent with the test expectations. When the input is malformed (e.g., missing closing parenthesis), the test expects the method to return -1, but this catch block will throw a RuntimeException instead. The catch block should return -1 for malformed input to match the expected behavior and maintain consistent error handling with the rest of the method.

Suggested change

throw new RuntimeException("Failed to extract gc cycle id from " + line, e);

Logger.getLogger(GCLogParser.class.getName())

.log(Level.FINE, "Failed to extract GC cycle id from line: " + line, e);

return -1;

Copilot · 2026-01-21T02:04:17Z

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

+        // we search for the value between parenthesis
+        int start = line.indexOf('(');


The method finds the first '(' in the line, but doesn't verify it's part of "GC(". If a line has an earlier '(' before "GC(", this will extract the wrong value. For example, if a log line contains "(something) GC(35)", it would try to parse "something" as the GC ID. Consider using line.indexOf("GC(") and then adding 2 to get the position after "GC(", or change line.indexOf('(') to line.indexOf("GC(") + 2 to ensure you're finding the correct parenthesis.

This issue also appears in the following locations of the same file:

line 271

Suggested change

// we search for the value between parenthesis

int start = line.indexOf('(');

// we search for the value between parenthesis, specifically the one from "GC("

int gcMarker = line.indexOf("GC(");

if (gcMarker == -1) {

return -1;

}

// gcMarker points to 'G', so gcMarker + 2 is the index of '(' in "GC("

int start = gcMarker + 2;

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

Copilot · 2026-01-21T02:04:18Z

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

+    protected static int extractGCCycleId(long packedGcCycleIdAndEnd) {
+        if (packedGcCycleIdAndEnd == -1) {
+            return -1;
+        }
+        return (int) (packedGcCycleIdAndEnd >> 32);
+    }


Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the GC cycle ID in the upper 32 bits @return the GC cycle ID, or -1 if the packed value is -1".

This issue also appears in the following locations of the same file:

line 287

Copilot · 2026-01-21T02:04:18Z

parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedG1GCParser.java

-import com.microsoft.gctoolkit.event.MalformedEvent;
-import com.microsoft.gctoolkit.event.MemoryPoolSummary;
-import com.microsoft.gctoolkit.event.RegionSummary;
+import com.microsoft.gctoolkit.event.*;


The wildcard import should be avoided. While this change replaces multiple specific imports with a wildcard import, it makes it unclear which specific classes are being used from the event package. Consider keeping the explicit imports or at least listing them individually to improve code readability and maintainability.

Suggested change

import com.microsoft.gctoolkit.event.*;

Bad bot suggested edit but the comment is good.

Copilot · 2026-01-21T02:04:18Z

parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java

+    @Test
+    void extractGCID() {
+        assertEquals(35, GCLogParser.extractGCID("[2025-10-21T16:44:29.311+0200][3645.640s] GC(35) Pause Young (Allocation Failure)"));
+    }


Missing test coverage for an important edge case: lines that contain a '(' character before "GC(". For example, a line like "[timestamp] (something) GC(35) pause" would incorrectly try to parse "something" as the GC ID. Add a test case to verify the method correctly identifies the "GC(" pattern and not just any parenthesis.

karianna

Apologies for the late review! Looks mostly good. Possibly needs a comment at the top of extractGCID explaining hy we are packing the id and length in a single data structure (perf)

karianna · 2026-01-21T01:59:46Z

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

-        return (trace != null) ? trace.getIntegerGroup(1) : -1;
+    static int extractGCID(String line) {
+        long packed = extractGCCycleIdAndTextualLength(line);
+        if (packed == -1) {


Nit-pick - Perhaps extract -1 to a constant called NOT_FOUND or something similar?

karianna · 2026-01-21T02:02:04Z

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java

+            // pack the two ints in a long
+            return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL);
+        } catch (Exception e) {
+            throw new RuntimeException("Failed to extract gc cycle id from " + line, e);


I wonder if we should actually try to recover from this and just process the next line...

karianna · 2026-01-21T02:02:32Z

parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedG1GCParser.java

-import com.microsoft.gctoolkit.event.MalformedEvent;
-import com.microsoft.gctoolkit.event.MemoryPoolSummary;
-import com.microsoft.gctoolkit.event.RegionSummary;
+import com.microsoft.gctoolkit.event.*;


Prefer explicit imports

parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java

karianna · 2026-01-21T02:03:38Z

parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java

+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.*;


Prefer explicit imports

karianna · 2026-01-21T02:06:21Z

@obourgain - Do you have any pre/post numbers showing the per impact of this change? It's good overall (some nit-picks) but since it does add a bit of complexity I want to ensure we have a good corresponding perf gain.

…r.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…rTest.java

Faster parsing of GC(id)

72455fb

This is called for all lines, so cost adds up.

karianna requested a review from Copilot January 21, 2026 01:58

Copilot started reviewing on behalf of karianna January 21, 2026 01:58 View session

Copilot AI reviewed Jan 21, 2026

View reviewed changes

karianna requested changes Jan 21, 2026

View reviewed changes

karianna and others added 2 commits January 21, 2026 15:06

Update parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParse…

88bbbe2

…r.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParse…

0ae42da

…rTest.java

		// we search for the value between parenthesis
		int start = line.indexOf('(');

-        // we search for the value between parenthesis
-        int start = line.indexOf('(');
+        // we search for the value between parenthesis, specifically the one from "GC("
+        int gcMarker = line.indexOf("GC(");
+        if (gcMarker == -1) {
+            return -1;
+        }
+        // gcMarker points to 'G', so gcMarker + 2 is the index of '(' in "GC("
+        int start = gcMarker + 2;


		import org.junit.jupiter.api.Test;

		import static org.junit.jupiter.api.Assertions.*;

Faster parsing of GC(id) #463

Are you sure you want to change the base?

Faster parsing of GC(id) #463

Uh oh!

Conversation

obourgain commented Nov 13, 2025

Uh oh!

obourgain commented Nov 13, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

karianna Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

karianna left a comment

Choose a reason for hiding this comment

Uh oh!

karianna Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

karianna Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

karianna Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

karianna Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

karianna commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants