Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Unified Line Coverage Formula #898

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Kartikayy007
Copy link
Contributor

This PR standardizes how line coverage is calculated throughout the system to ensure consistent and mathematically correct measurements.

Reference: Issue #727

Solution

Implement Standardized Formulas

As defined in the issue:

  • Single Target Coverage:
    Cov(f1) / Linked(f1)

  • Coverage Improvement:
    [Cov(f1) - Cov(f0)] / [Linked(f1 ∪ f0)]

    • Improvement is the difference in covered lines divided by the union of all linked lines.

Implementation Details

New Coverage Module

  • Created: experiment/coverage.py with standardized formula implementations.
  • Implemented:
    • calculate_coverage() for single targets.
    • calculate_coverage_improvement() for comparing targets.

Updated Calculation Points

  • stage/execution_stage.py: Updated line coverage diff calculation.
  • run_all_experiments.py: Fixed project coverage gain calculation.
  • report/aggregate_coverage_diff.py: Standardized coverage diff computation.
  • experiment/evaluator.py: Aligned with the unified formula.

Non-Destructive Operations

  • Use copy() method to prevent modifications to original coverage objects.
  • Ensures subtraction operations don’t affect original data.

@Kartikayy007
Copy link
Contributor Author

CC: @DonggeLiu please review



def calculate_coverage(cov: textcov.Textcov, linked_lines: int) -> float:
"""Calculate coverage according to formula: Cov(f) / Linked(f)."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'Calculates' for consistency, same below.

@@ -465,7 +467,9 @@ def check_target(self, ai_binary, target_path: str) -> Result:
run_result.coverage.subtract_covered_lines(existing_textcov)

if total_lines and run_result.coverage:
coverage_diff = run_result.coverage.covered_lines / total_lines
union_linked_lines = max(run_result.coverage.total_lines, total_lines)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the term "union" in the mathematical sense, where each line represents an element of a set. Each set can correspond to covered lines or linked lines in the project.

For example, suppose our fuzz target links lines {1,2,3,4} of the project at compile time and covers lines {1,2,3} during fuzzing, while an older fuzz target links lines {1,2,5} and covers {2,5} . Then:

  • The union linked lines is {1,2,3,4,5}.
  • The new fuzz target's coverage is 3/5 ({1,2,3} out of {1,2,3,4,5}).
  • The old fuzz target's coverage is 2/5 ({2,5} out of {1,2,3,4,5}).
  • The coverage increase is 2/5 (newly covered lines {1,3} out of {1,2,3,4,5}).

Ideally, the denominator should represent "the total reachable lines" or "the total number of lines" in the project. However, they are difficult to determine accurately if certain files are not linked at compile time.

Given this is a complicate task, please feel free to prioritize on writing and refining your proposal : )
That's the most important factor for you application.
If you have a draft, I am more than happy to provide general feedback to ensure you are on the right track.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct sir I'll prioritise my proposal firstly and will surely take a look back to it and completely resolve it, Thank you !

hit_count=line.hit_count)
new_cov.files[name] = new_file

return new_cov
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use python's builtin deepcopy?


if total_lines:
coverage_gain[project] = {
'language':
oss_fuzz_checkout.get_project_language(project),
'coverage_diff':
total_cov.covered_lines / total_lines,
coverage_utils.calculate_coverage_improvement(
total_cov, existing_textcov, union_linked_lines),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we have subtracted existing_textcov above:

total_existing_lines = sum(lines)
total_cov_covered_lines_before_subtraction = total_cov.covered_lines
total_cov.subtract_covered_lines(existing_textcov)

While this action should be idempotent, repeated actions add unnecessary complexity and may confuse readers.

@DonggeLiu
Copy link
Collaborator

As always, thanks sooo much for looking into this!

@DonggeLiu DonggeLiu marked this pull request as draft March 20, 2025 11:04
Kartikayy007 and others added 3 commits March 23, 2025 14:23
Changes:
- Add dedicated coverage.py module with standardized implementations
- Fix double-subtraction issue in evaluator.py
- Use non-destructive operations to preserve original coverage objects
- Replace custom copy implementation with Python's built-in deepcopy
@Kartikayy007 Kartikayy007 marked this pull request as ready for review March 24, 2025 22:16
@Kartikayy007
Copy link
Contributor Author

Kartikayy007 commented Mar 24, 2025

Hello! @DonggeLiu
been thinking for the mathematical union of sets of lines. I have described a set-based solution that would actually implement the mathematical union you suggested.

# Changes to experiment/textcov.py

@dataclasses.dataclass
class Textcov:
   """Textcov with set-based line tracking."""
   # Original properties (keep for backward compatibility)
   functions: dict[str, Function] = dataclasses.field(default_factory=dict)
   files: dict[str, File] = dataclasses.field(default_factory=dict)
   language: str = 'c++'
   
   # New set-based tracking
   covered_line_numbers: set[int] = dataclasses.field(default_factory=set)
   linked_line_numbers: set[int] = dataclasses.field(default_factory=set)
   
   def subtract_covered_lines(self, other: 'Textcov'):
       """Remove lines covered in other from self using set operations."""
       # Original function implementation (keep for backward compatibility)
       if self.language == 'python':
           for file in other.files.values():
               if file.name in self.files:
                   self.files[file.name].subtract_covered_lines(file)
       else:
           for function in other.functions.values():
               if function.name in self.functions:
                   self.functions[function.name].subtract_covered_lines(
                       function, self.language)
       
       # New set-based implementation
       self.covered_line_numbers -= other.covered_line_numbers
   
   @property
   def covered_lines(self):
       """Return count of covered lines."""
       # Prefer set-based count if available
       if self.covered_line_numbers:
           return len(self.covered_line_numbers)
           
       # Fall back to original implementation
       if self.language == 'python':
           return sum(f.covered_lines for f in self.files.values())
       return sum(f.covered_lines for f in self.functions.values()
)


# Changes to experiment/coverage.py

def calculate_coverage_improvement(new_cov: textcov.Textcov,
                                  existing_cov: textcov.Textcov,
                                  union_linked_lines: int = None) -> float:
   """Calculates coverage improvement: [Cov(f1) - Cov(f0)] / [Linked(f1 ∪ f0)]."""
   # Make a copy to avoid modifying the original
   diff_cov = new_cov.copy()
   diff_cov.subtract_covered_lines(existing_cov)
   
   # If set-based line numbers are available, use them for true mathematical union
   if new_cov.linked_line_numbers and existing_cov.linked_line_numbers:
       # Calculate true union of linked lines
       true_union = new_cov.linked_line_numbers.union(existing_cov.linked_line_numbers)
       union_size = len(true_union)
       return diff_cov.covered_lines / union_size if union_size else 0.0
   
   # Fall back to approximation if sets aren't available
   if not union_linked_lines:
       return 0.0
   return diff_cov.covered_lines / union_linked_lines

It would store actual line numbers in sets and perform corresponding set operations. This is closer to your example, What do you think? im confused I think here parsing and backward compatibility aren't fully addressed and mixing old and new approaches
Should we implement this for improved accuracy?

…alculation

This change eliminates the redundant subtraction of covered lines from the total coverage calculation, streamlining the process and ensuring accurate coverage metrics.
@Kartikayy007
Copy link
Contributor Author

please take a look whenever you are free Thank you

@DonggeLiu
Copy link
Collaborator

Hi @Kartikayy007,
I am not very confident about this new solution.
For example, how does it handle the line number of multiple files, or different files under the same name but different path?

I reckon this issue will likely be rather complicated, feel free to leave it for now.

@demoncoder-crypto
Copy link

@DonggeLiu I have done a alternative approach please do let me know If its up to expectation

@Kartikayy007
Copy link
Contributor Author

@DonggeLiu I have done a alternative approach please do let me know If its up to expectatio
Can you you explain?

@DonggeLiu
Copy link
Collaborator

I will close this PR if you don't mind? @Kartikayy007

@Kartikayy007
Copy link
Contributor Author

Kartikayy007 commented Apr 11, 2025

I will close this PR if you don't mind? @Kartikayy007

Please put this in draft.

@DonggeLiu DonggeLiu marked this pull request as draft April 14, 2025 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants