CDS Extractor Rewrite Phase 2 : Improve Performance and Precision #195

Open
wants to merge 29 commits into
base: main
Conversation

data-douser
Collaborator

@data-douser data-douser commented May 18, 2025

This PR implements the planned "Phase 2" of the full rewrite of the CDS extractor, focusing on improving performance and precision. It introduces significant changes to the CodeQL CDS extractor, including a major refactor of the extraction process, updates to scripts, and improvements to configuration and debugging. Throughout this multi-phase rewrite process, the approach has been documented in the extractors/cds/tools/autobuild.md file.

The changes in this PR do not fully implement the "autobuild" run mode for the CDS extractor, but they get reasonably close. New "run modes" were added to the renamed cds-extractor.ts script (formerly index-files.ts), and the arguments to the script have been updated to allow for run modes such as index-files (legacy), debug-parser (new), and autobuild (planned / WIP).

While staying within the limitations of the index-files approach, the changes in this PR attempt to integrate parsing and compiling of .cds files in a manner that is "project aware", meaning that we try to compile only the top-level .cds files in order to avoid duplicating both compilation work and indexed .cds.json files.
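
To illustrate what "top-level" could mean here, the following is a minimal sketch that treats a .cds file as top-level when no other file in the same project imports it. The `ImportGraph` type and the `findTopLevelCdsFiles` function are hypothetical names used for illustration only; they are not taken from the extractor's source.

```typescript
// Hypothetical shape: maps each .cds file in a project to the .cds files it imports.
type ImportGraph = Map<string, string[]>;

// Returns the files that are not imported by any other file in the graph.
function findTopLevelCdsFiles(imports: ImportGraph): string[] {
  const imported = new Set<string>();
  imports.forEach(targets => {
    targets.forEach(target => imported.add(target));
  });
  return Array.from(imports.keys())
    .filter(file => !imported.has(file))
    .sort();
}

// Example project: app/index.cds imports srv/service.cds, which imports db/schema.cds.
const graph: ImportGraph = new Map<string, string[]>([
  ["srv/service.cds", ["db/schema.cds"]],
  ["db/schema.cds", []],
  ["app/index.cds", ["srv/service.cds"]],
]);

const roots = findTopLevelCdsFiles(graph);
console.log(roots);
```

Compiling only the files in `roots` would already pull in the imported files, which is where the de-duplication would come from. A project whose files import each other in a cycle would yield no roots under this sketch and would need a fallback.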

Key Changes:

New Features and Functionality:

  • Introduction of cds-extractor.ts: Added a new script to handle CDS file processing, including project dependency graph building, environment setup, and integration with CodeQL tools. This script replaces the previous index-files.js script.
  • Run Modes: Adds support for running the cds-extractor.ts script with different "run mode" values, including autobuild, debug-parser, and index-files.
  • New Project-Aware Dependency Handling: The cds-extractor.ts script has been rewritten to include features like project dependency graph building, project-aware compilation, and diagnostic handling for CDS files. This enables more efficient and context-aware processing of CDS files.
  • Debugging: Enhanced debugging of the project dependency graph (post-parsing, pre-compilation) is now supported via the debug-parser run mode of the cds-extractor (Node.js) script.

Script Updates:

  • Batch Script (index-files.cmd) Updates: Updated references from index-files.js to cds-extractor.js, added _run_mode parameter, and adjusted logging and execution commands to align with the new script. [1] [2] [3]
  • Shell Script (index-files.sh) Updates: Similar updates as the batch script, including parameter additions and script name changes for consistency. [1] [2] [3] [4]

Configuration Improvements:

  • ESLint Configuration (eslint.config.mjs): Refactored imports for better readability, updated rules and plugin configurations, and added comments to clarify TypeScript and JavaScript-specific settings.

Miscellaneous:

  • .gitignore Update: Added an entry to ignore debug/ files created during debugging of the CDS extractor.

Documentation Updates:

  • Updated Workflow Diagrams: The README.md now reflects the new cds-extractor workflow, replacing outdated references to index-files with the new process and steps for project-aware compilation. [1] [2]

This commit:

- Implements an initial version of a project-aware CDS parser.
- Creates a dedicated "cds" package at "extractors/cds/tools/src/cds".
- Converts existing unit tests to use the new path for functions
  related to parsing and/or compiling .cds files.
This commit:

- fixes a typo in a comment, as identified in a previous PR ( advanced-security#188 );

- updates the logic of the CDS extractor's `findPackageJsonDirs` function;

- fixes a regression in the CDS extractor where a "project directory"
  was not properly recognized when its path was the same as the
  "source root" directory for the CDS extractor scan;

- adds unit tests to cover edge cases identified for the
  `findPackageJsonDirs` function.
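
The source-root edge case described above can be sketched as follows. This is not the extractor's actual `findPackageJsonDirs` implementation; `findProjectDir` is a hypothetical helper that assumes POSIX-style paths relative to the source root, with the empty string standing for the source root itself.

```typescript
// Hypothetical helper: finds the nearest ancestor directory that holds a package.json.
// Paths are relative to the source root; "" denotes the source root itself.
function findProjectDir(
  relFilePath: string,
  dirsWithPackageJson: Set<string>,
): string | undefined {
  const segments = relFilePath.split("/").slice(0, -1); // drop the file name
  // Walk from the file's own directory up to, and including, the source root.
  for (let depth = segments.length; depth >= 0; depth--) {
    const dir = segments.slice(0, depth).join("/");
    if (dirsWithPackageJson.has(dir)) {
      return dir;
    }
  }
  return undefined;
}

// The project directory is the source root itself.
const atRoot = findProjectDir("srv/service.cds", new Set<string>([""]));
// A nested project takes precedence over the source root.
const nested = findProjectDir("apps/a/db/schema.cds", new Set<string>(["", "apps/a"]));
console.log(JSON.stringify(atRoot), JSON.stringify(nested));
```

The loop bound `depth >= 0` is what makes the source root a candidate; stopping at `depth > 0` would reproduce the kind of regression described in the commit message.
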
Renames the entrypoint to the CDS extractor script and refactors its
arguments in order to support using different "run modes" for the
extractor, including:

  - "autobuild" : work-in-progress, just a stub right now;
  - "debug-parser" : used for debugging CDS project & file parsing;
  - "index-files" : legacy mode, useful for backwards compatibility;

Updates the usage (help) message for the script to represent the
required arguments for each of the currently planned run modes.
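
As a rough sketch of how a run mode argument could be validated, assuming only the three mode names listed above: `parseRunMode` and `RUN_MODES` are hypothetical names, and the real script's argument order and usage message are not shown in this PR text.

```typescript
// The three run modes named in this PR.
const RUN_MODES = ["autobuild", "debug-parser", "index-files"] as const;
type RunMode = (typeof RUN_MODES)[number];

// Hypothetical validator: returns a typed run mode or throws with a usage hint.
function parseRunMode(arg: string | undefined): RunMode {
  const match = RUN_MODES.find(mode => mode === arg);
  if (match === undefined) {
    throw new Error(
      `Invalid run mode '${String(arg)}'. Expected one of: ${RUN_MODES.join(", ")}`,
    );
  }
  return match;
}

const mode = parseRunMode("debug-parser");
console.log(`run mode: ${mode}`);
```

Deriving the `RunMode` type from the constant array keeps the accepted values and the error message in one place.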

Adds support for the "debug-parser" run mode, which debugs to a file
under the `extractors/cds/tools/out/debug/` directory. Useful for
in-progress rewrite of the CDS extractor to be more performant when
running and more useful in terms of yielding a CodeQL database that
allows for high-precision query results for CDS projects/queries.
Adds extended unit tests for the "parser" component of the CDS
extractor, using the CDS projects nested under this repository's
`javascript/frameworks/cap/test/queries` directory as testing
targets and reference points for test cases.
Adds more extensive unit tests of CDS extractor code related to the
use of the `cds` compiler.

Adds unit tests for CDS extractor functions in "projectMapping.ts".
Fixes the setup of the CDS extractor environment to ensure that the
codeql CLI can be reliably found and to avoid duplicate runs of
the CDS parser's graph building process for "debug-parser" versus
other run modes.
Cleans up DEBUG logging and improves existing CDS extractor logging
in order to provide more useful indications of the CDS compiler
version used to compile a given `*.cds.json` file.
Initial attempt to use the `cds compile` CLI command in a way that
allows for de-duplication of individual `.cds` files that are already
included by another `.cds` file in the project.
…/codeql-sap-js into data-douser/cds-ts-rewrite-2
Updates the mermaid flowchart for the CDS extractor in order to
reflect recent changes to how the CDS extractor actually works.
Copilot

This comment was marked as outdated.

data-douser and others added 7 commits June 7, 2025 18:26
Fixes detection of .cds files in CDS projects by ensuring that
"node_modules" subdirectories are explicitly ignored and "srv" and "db"
subdirectories are explicitly included.
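
One possible reading of this rule is sketched below. It assumes that candidate files are .cds files located directly in the project directory or anywhere under its "srv" or "db" subdirectories, and never under a "node_modules" directory. `isProjectCdsFile` is a hypothetical name and the extractor's real detection logic may differ.

```typescript
// Subdirectories that are explicitly searched for .cds files (per the commit message).
const INCLUDED_SUBDIRS = ["srv", "db"];

// Hypothetical filter over paths that are relative to a CDS project directory.
function isProjectCdsFile(relPath: string): boolean {
  if (!relPath.endsWith(".cds")) {
    return false;
  }
  const dirs = relPath.split("/").slice(0, -1); // directory segments only
  // Never pick up files from installed dependencies, at any depth.
  if (dirs.some(dir => dir === "node_modules")) {
    return false;
  }
  // Accept files in the project root, or anywhere below an included subdirectory.
  return dirs.length === 0 || INCLUDED_SUBDIRS.some(sub => sub === dirs[0]);
}

const candidates = [
  "srv/service.cds",
  "db/schema.cds",
  "node_modules/@sap/cds/common.cds",
  "srv/node_modules/dep/model.cds",
  "index.cds",
].filter(isProjectCdsFile);
console.log(candidates);
```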

Migrates some logic from cds-extractor.ts (entrypoint) script to
testable functions under extractors/cds/tools/src/ directory.

Adds and improves unit tests related to code changes from this commit.
Removes an unintended change in CDS compile (to .cds.json) behavior
due to the (mis)use of the "--parse" command.

Fixes a regression in the expected query results in at least one case:

`javascript/frameworks/cap/src/sensitive-exposure/SensitiveExposure.ql`
Refactors cds extractor `src/cds/compiler` and `src/cds/parser`
packages for improved maintainability.

Simplifies the main logic of the CDS extractor such that we always
build a graph that maps CDS projects to their imports / dependencies,
which is part of the longer process of deprecating the "index-files"
run mode of the CDS extractor (in favor of autobuild, eventually).

Attempts to fix CDS file and project parsing for test projects such as:
`javascript/frameworks/cap/test/queries/loginjection/log-injection-without-protocol-none`
Fixes a regression where the project base directory was being used
to set the `cwd` of the process spawned for running the CDS compiler
for "project-aware" compilation. Adds unit tests to ensure the `cwd`
is always set to the value of the `sourceRoot` directory.
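
The intent of this fix can be sketched as a pure function that builds the compiler invocation. `buildCompileInvocation` and `CompileInvocation` are hypothetical names, and the argument list is reduced to `compile ... --to json`; the flags the extractor actually passes are not shown in this PR text.

```typescript
interface CompileInvocation {
  command: string;
  args: string[];
  cwd: string;
}

// Hypothetical helper: file paths are given relative to the project directory,
// which is itself relative to the source root ("" means the source root).
function buildCompileInvocation(
  cdsCommand: string,
  sourceRoot: string,
  projectDir: string,
  projectFiles: string[],
): CompileInvocation {
  // Express every file relative to sourceRoot so that paths recorded in the
  // compiler output do not depend on which project is being compiled.
  const files = projectFiles.map(file =>
    projectDir === "" ? file : `${projectDir}/${file}`,
  );
  return {
    command: cdsCommand,
    args: ["compile", ...files, "--to", "json"],
    cwd: sourceRoot, // always the source root, never the project directory
  };
}

const invocation = buildCompileInvocation("cds", "/repo", "bookshop", [
  "srv/service.cds",
]);
console.log(invocation);
```

The resulting object could be handed to Node's `child_process.spawnSync(command, args, { cwd })`. Keeping the construction separate from the spawn call makes the `cwd` invariant easy to cover with a unit test.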

Further refactoring of the `cds/compiler` and `cds/parser` packages
within the source code of the CDS extractor.

This commit is expected to actually cause more problems with existing
queries, despite fixing the relative-file-path problem / regression.
Some changes to existing CodeQL queries and/or expected results may be
required as, at this point, the JSON data generated by the CDS compiler
(via the CDS extractor) seems valid.
@data-douser data-douser requested a review from Copilot June 11, 2025 15:57
@data-douser data-douser self-assigned this Jun 11, 2025
@data-douser data-douser added enhancement New feature or request javascript Pull requests that update javascript code labels Jun 11, 2025

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements the Phase 2 rewrite of the CDS extractor to improve performance and precision by introducing a project-aware parsing and compilation workflow, and by unifying scripts into a single cds-extractor.ts entry point with multiple run modes.

  • Added a new cds-extractor.ts script with index-files, debug-parser, and autobuild modes
  • Refactored parsing and compilation logic into modular src/cds/parser and src/cds/compiler packages
  • Updated shell and batch wrappers, ESLint config, package.json, and documentation to reference the new script and modes

Reviewed Changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| extractors/cds/tools/src/cdsCompiler.ts | Removed legacy compiler utility in favor of new modules |
| extractors/cds/tools/src/cds/parser/types.ts | Added parser data types for entities, imports, services |
| extractors/cds/tools/src/cds/parser/index.ts | Exported parser APIs |
| extractors/cds/tools/src/cds/parser/graph.ts | Implemented project dependency graph builder |
| extractors/cds/tools/src/cds/parser/debug.ts | Added debug-info output for parser |
| extractors/cds/tools/src/cds/index.ts | Re-exported compiler and parser |
| extractors/cds/tools/src/cds/compiler/version.ts | Added function to retrieve CDS compiler version |
| extractors/cds/tools/src/cds/compiler/types.ts | Extended compilation result type |
| extractors/cds/tools/src/cds/compiler/project.ts | Added project lookup helper |
| extractors/cds/tools/src/cds/compiler/index.ts | Exported compiler APIs |
| extractors/cds/tools/src/cds/compiler/compile.ts | Core project-aware and individual compile logic |
| extractors/cds/tools/src/cds/compiler/command.ts | Updated determineCdsCommand to accept cache dir |
| extractors/cds/tools/package.json | Updated main script, dependencies, and scripts |
| extractors/cds/tools/index-files.ts | Removed legacy TypeScript entry |
| extractors/cds/tools/index-files.sh | Renamed to use cds-extractor.js and added run-mode |
| extractors/cds/tools/index-files.cmd | Same updates for Windows batch wrapper |
| extractors/cds/tools/eslint.config.mjs | Refactored imports and updated file patterns |
| extractors/cds/tools/cds-extractor.ts | New unified entry point for all run modes |
| extractors/cds/tools/.gitignore | Ignored debug output |
| extractors/README.md | Updated diagram to reflect new script flow |
Comments suppressed due to low confidence (1)

extractors/cds/tools/src/cds/parser/graph.ts:22

  • Add unit tests for buildCdsProjectDependencyGraph to cover scenarios with multiple nested projects and various import types, ensuring parser accuracy and preventing regressions.
export function buildCdsProjectDependencyGraph(

Fixes newly introduced code-scanning alerts due to insecure use of files
created under the system `/tmp/` directory in some recently implemented
unit tests.
@data-douser
Collaborator Author

data-douser commented Jun 11, 2025

cds-extractor.debug.log-injection-without-protocol-none.tgz

The attached .tgz file contains sub-directories for the old/ versus new/ versions of the .cds.json files generated by the CDS extractor for the test project at repo-relative path javascript/frameworks/cap/test/queries/loginjection/log-injection-without-protocol-none/, where "old" represents what is currently produced by the main branch and "new" represents the output as of commit `af066d7`.

From what I can tell, we have the same data in the "old" versus "new", except that the "new" representation collapses all of that data into 1 file instead of 3.

@lcartey @jeongsoolee09

@data-douser
Collaborator Author

data-douser commented Jun 11, 2025

I noticed some tool-level errors related to node dependency installation failures. I suspect that many of these failures may actually be due to outdated (now deprecated) dependency versions in the associated project's package.json file.

I also noticed that these errors do not seem to prevent (this version of) the CDS extractor from continuing to try to use the cds compile command for project directories where node dependency installation failed (and generated a tool diagnostic error ^^).

This seems like odd behavior, but I am not sure it is wrong. The purpose of (code scanning) diagnostic errors (afaik) is to indicate that a problem was encountered that may cause some code results to be missing, and we probably want to make a best effort at compiling .cds to .cds.json as long as we have access to some form/version of cds compile command. Open to feedback here.
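
The best-effort behavior discussed above could look roughly like the sketch below: a failed dependency installation is recorded as a diagnostic, and compilation is still attempted. `processProject` and `ProjectResult` are hypothetical names, and the install and compile steps are injected as plain callbacks rather than real `npm` or `cds` invocations.

```typescript
interface ProjectResult {
  projectDir: string;
  compiled: boolean;
  diagnostics: string[];
}

// Hypothetical control flow: record failures as diagnostics, but keep going.
function processProject(
  projectDir: string,
  installDependencies: (dir: string) => void,
  compile: (dir: string) => void,
): ProjectResult {
  const diagnostics: string[] = [];
  try {
    installDependencies(projectDir);
  } catch (error) {
    // Results may be incomplete, so report it, but do not give up on the project.
    diagnostics.push(`dependency installation failed: ${String(error)}`);
  }
  let compiled = false;
  try {
    compile(projectDir);
    compiled = true;
  } catch (error) {
    diagnostics.push(`compilation failed: ${String(error)}`);
  }
  return { projectDir, compiled, diagnostics };
}

// Installation fails (for example, a deprecated dependency version), compile still runs.
const result = processProject(
  "bookshop",
  () => {
    throw new Error("npm install exited with code 1");
  },
  () => {},
);
console.log(result);
```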

Creates tests and code for a new, unified `cdsExtractorLog` function and
integrates this function throughout the CDS extractor code.

Updates `test/jest.setup.ts` config for the CDS extractor in order to
simplify setup of the source root directory config required by the new
`cdsExtractorLog` function.
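
A unified logging helper that depends on a configured source root might be shaped like the sketch below. The actual signature and behavior of `cdsExtractorLog` are not shown in this PR text, so every name here (`setSourceRoot`, `formatLogMessage`, `log`) and the choice to throw when the source root is unset are assumptions made for illustration.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Assumption: the logger needs the source root to be configured once, up front.
let configuredSourceRoot: string | undefined;

function setSourceRoot(root: string): void {
  configuredSourceRoot = root;
}

// Builds the log line; kept separate from I/O so that it is easy to unit test.
function formatLogMessage(level: LogLevel, message: string): string {
  if (configuredSourceRoot === undefined) {
    throw new Error("source root must be configured before logging");
  }
  return `[cds-extractor] ${level.toUpperCase()} (${configuredSourceRoot}): ${message}`;
}

// Writes warnings and errors to stderr, everything else to stdout.
function log(level: LogLevel, message: string): void {
  const line = formatLogMessage(level, message);
  if (level === "warn" || level === "error") {
    console.error(line);
  } else {
    console.log(line);
  }
}

setSourceRoot("/repo");
log("info", "building project dependency graph");
```

A shared setup file for tests, such as the `test/jest.setup.ts` mentioned above, would be a natural place to call something like `setSourceRoot` once for all test suites.
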
@data-douser data-douser marked this pull request as ready for review June 12, 2025 11:53
@data-douser
Collaborator Author

data-douser commented Jun 20, 2025

For the expected Code Scanning results that are currently missing for the javascript/frameworks/cap/src/loginjection/LogInjection.ql query, the problem seems to be related to extra "barrier" nodes in the data flow path.

The getType predicate of the HandlerParameterData class is returning no results for a database created from the latest changes on the data-douser:data-douser/cds-ts-rewrite-2 branch. An empty set of types causes the isBarrier predicate of the CAPLogInjectionConfiguration class to hold in more cases than it should, which is blocking the query's evaluation of config.hasFlowPath(source, sink) for valid flow paths.

After further research, I think the Code Scanning results may be missing because the model.cds.json files are not being compiled/generated for some test projects due to npm install errors. I will adjust the error-handling of the CDS extractor such that we try harder to generate the .cds.json files, even when we encounter errors for a given project.

For testing of this advanced-security/codeql-sap-js project, though, we probably need to update the required CDS version to a currently supported value. Example log-injection-not-depending-on-request/package.json.
