Skip to content

Conversation

@obourgain
Copy link

This is called for all lines, so cost adds up.

Most use cases use just the gcid, but in UnifiedG1GCParser the end position is also required to do a subString() and reduce the length of the String that will be parsed by all the Patterns.

My impl has a little behavior change. The regex GC\\((\\d+)\\) was matching the space after the gcid while my implementation uses the index of the closing ) as end, so the subString() has an additional space at first position. I can change my implementation upon request to keep the same behavior.

This is called for all lines, so cost adds up.
@obourgain
Copy link
Author

@microsoft-github-policy-service agree

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes GC log parsing by replacing regex-based GC ID extraction with manual string parsing. The new implementation packs both the GC cycle ID and the textual length into a single long value for efficient reuse, particularly benefiting the UnifiedG1GCParser which needs both values.

Changes:

  • Replaced GCID_COUNTER regex pattern with manual string parsing methods in GCLogParser
  • Refactored UnifiedG1GCParser from Stream API to for-loop for better performance
  • Updated all GC ID extraction call sites in UnifiedGenerationalParser to use new methods

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java Removed regex-based GCID_COUNTER; added manual parsing methods (extractGCID, extractGCCycleIdAndTextualLength, extractGCCycleId, extractGCCycleIdTextualLength) with bit-packing optimization
parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedG1GCParser.java Refactored parse() method from Stream API to for-loop; uses new extractGCCycleIdAndTextualLength method; removed unused imports (Pattern, Matcher, AbstractMap); added wildcard import
parser/src/main/java/com/microsoft/gctoolkit/parser/UnifiedGenerationalParser.java Replaced all 7 occurrences of GCID_COUNTER.parse() with extractGCID() calls
parser/src/test/java/com/microsoft/gctoolkit/parser/GCLogParserTest.java Added unit tests for new GC ID extraction methods covering both normal and malformed input cases
Comments suppressed due to low confidence (3)

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:257

  • The JavaDoc is incomplete. It should document what the method returns, including the meaning of the packed long value and that -1 indicates failure. Consider adding a @return tag that explains: "a packed long with the GC cycle ID in the upper 32 bits and the textual length in the lower 32 bits, or -1 if the GC cycle ID cannot be extracted".
    /**
     * Returns a packed long containing two ints:
     * - the GC cycle id in the high bytes
     * - the length of the text containing the GC cycle id,e.g. 'GC(10)'
     * See {@link #extractGCCycleId(long)} and {@link #extractGCCycleIdTextualLength(long)}
     */

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:292

  • Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the textual length of the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the textual length in the lower 32 bits @return the textual length in characters, or 0 if the packed value is -1".
    protected static int extractGCCycleIdTextualLength(long packedGcCycleIdAndEnd) {
        if (packedGcCycleIdAndEnd == -1) {
            return 0;
        }
        return (int) packedGcCycleIdAndEnd;
    }

parser/src/main/java/com/microsoft/gctoolkit/parser/GCLogParser.java:272

  • The behavior change in calculating the end position differs from the original regex implementation. The old regex GC\\((\\d+)\\) matched including the trailing space after ")", so lineAfterGcId would start without a leading space. The new implementation returns end+1 (position after ")"), causing lineAfterGcId to include the leading space. While Matcher.find() will still match patterns that start after the space, this creates an inconsistency with the original behavior. Consider adjusting to match the original behavior by finding and skipping the trailing space, e.g., checking if line.charAt(end + 1) == ' ' and incrementing end accordingly.
            // add the closing parenthesis to the length
            int endToReturn = end + 1;

// pack the two ints in a long
return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL);
} catch (Exception e) {
throw new RuntimeException("Failed to extract gc cycle id from " + line, e);
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling here is inconsistent with the test expectations. When the input is malformed (e.g., missing closing parenthesis), the test expects the method to return -1, but this catch block will throw a RuntimeException instead. The catch block should return -1 for malformed input to match the expected behavior and maintain consistent error handling with the rest of the method.

Suggested change
throw new RuntimeException("Failed to extract gc cycle id from " + line, e);
Logger.getLogger(GCLogParser.class.getName())
.log(Level.FINE, "Failed to extract GC cycle id from line: " + line, e);
return -1;

Copilot uses AI. Check for mistakes.
Comment on lines +263 to +264
// we search for the value between parenthesis
int start = line.indexOf('(');
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method finds the first '(' in the line, but doesn't verify it's part of "GC(". If a line has an earlier '(' before "GC(", this will extract the wrong value. For example, if a log line contains "(something) GC(35)", it would try to parse "something" as the GC ID. Consider using line.indexOf("GC(") and then adding 2 to get the position after "GC(", or change line.indexOf('(') to line.indexOf("GC(") + 2 to ensure you're finding the correct parenthesis.

This issue also appears in the following locations of the same file:

  • line 271
Suggested change
// we search for the value between parenthesis
int start = line.indexOf('(');
// we search for the value between parenthesis, specifically the one from "GC("
int gcMarker = line.indexOf("GC(");
if (gcMarker == -1) {
return -1;
}
// gcMarker points to 'G', so gcMarker + 2 is the index of '(' in "GC("
int start = gcMarker + 2;

Copilot uses AI. Check for mistakes.
Comment on lines +280 to +285
protected static int extractGCCycleId(long packedGcCycleIdAndEnd) {
if (packedGcCycleIdAndEnd == -1) {
return -1;
}
return (int) (packedGcCycleIdAndEnd >> 32);
}
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing JavaDoc documentation for this method. Add JavaDoc that explains the method's purpose, parameters, and return value. For example: "Extracts the GC cycle ID from a packed long value. @param packedGcCycleIdAndEnd the packed long containing the GC cycle ID in the upper 32 bits @return the GC cycle ID, or -1 if the packed value is -1".

This issue also appears in the following locations of the same file:

  • line 287

Copilot uses AI. Check for mistakes.
import com.microsoft.gctoolkit.event.MalformedEvent;
import com.microsoft.gctoolkit.event.MemoryPoolSummary;
import com.microsoft.gctoolkit.event.RegionSummary;
import com.microsoft.gctoolkit.event.*;
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wildcard import should be avoided. While this change replaces multiple specific imports with a wildcard import, it makes it unclear which specific classes are being used from the event package. Consider keeping the explicit imports or at least listing them individually to improve code readability and maintainability.

Suggested change
import com.microsoft.gctoolkit.event.*;

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad bot suggested edit but the comment is good.

Comment on lines +15 to +18
@Test
void extractGCID() {
assertEquals(35, GCLogParser.extractGCID("[2025-10-21T16:44:29.311+0200][3645.640s] GC(35) Pause Young (Allocation Failure)"));
}
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for an important edge case: lines that contain a '(' character before "GC(". For example, a line like "[timestamp] (something) GC(35) pause" would incorrectly try to parse "something" as the GC ID. Add a test case to verify the method correctly identifies the "GC(" pattern and not just any parenthesis.

Copilot uses AI. Check for mistakes.
Copy link
Member

@karianna karianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late review! Looks mostly good. Possibly needs a comment at the top of extractGCID explaining hy we are packing the id and length in a single data structure (perf)

return (trace != null) ? trace.getIntegerGroup(1) : -1;
static int extractGCID(String line) {
long packed = extractGCCycleIdAndTextualLength(line);
if (packed == -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-pick - Perhaps extract -1 to a constant called NOT_FOUND or something similar?

// pack the two ints in a long
return (((long) gcId) << 32) | (endToReturn & 0xFFFFFFFFL);
} catch (Exception e) {
throw new RuntimeException("Failed to extract gc cycle id from " + line, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should actually try to recover from this and just process the next line...

import com.microsoft.gctoolkit.event.MalformedEvent;
import com.microsoft.gctoolkit.event.MemoryPoolSummary;
import com.microsoft.gctoolkit.event.RegionSummary;
import com.microsoft.gctoolkit.event.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer explicit imports


import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer explicit imports

@karianna
Copy link
Member

@obourgain - Do you have any pre/post numbers showing the per impact of this change? It's good overall (some nit-picks) but since it does add a bit of complexity I want to ensure we have a good corresponding perf gain.

karianna and others added 2 commits January 21, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants