Add support for code stemming with tree-sitter #19

Merged on Feb 19, 2025 (9 commits)

Signed-off-by: Keshav Priyadarshi <[email protected]>
@keshav-space keshav-space self-assigned this Feb 5, 2025
Member

@JonoYang JonoYang left a comment

@keshav-space The code looks alright, but I need to play around with this to better understand code stemming.

@JonoYang
Member

@keshav-space I am using this branch of matchcode-toolkit in aboutcode-org/matchcode-tests#2 and it is failing when stemming some C source files.

location = '/tmp/scancode-tk-tests -7j8ig1k0/rfxpd8b_/Dataset.zip/Dataset/Control/Human1/itemToString_Human1.c'

    def get_parser(location):
        """
        Get the appropriate tree-sitter parser and grammar config for
        file at location.
        """
        file_type = Type(location)
        language = file_type.programming_language
    
        if not language or language not in TS_LANGUAGE_CONF:
            return
    
        language_info = TS_LANGUAGE_CONF[language]
        wheel = language_info["wheel"]
    
        try:
            grammar = importlib.import_module(wheel)
        except ModuleNotFoundError:
            raise TreeSitterWheelNotInstalled(f"{wheel} package is not installed")
    
>       parser = Parser(language=Language(grammar.language()))
E       ValueError: Incompatible Language version 15. Must be between 13 and 14

venv/lib/python3.10/site-packages/matchcode_toolkit/stemming.py:79: ValueError

https://dev.azure.com/nexB/matchcode-tests/_build/results?buildId=15446&view=logs&j=41fca3e8-fcfe-5670-e26e-f33ade403b7f&t=cc4dfe40-db93-5fe0-c2b6-39217bc3c5fe&l=123

@keshav-space
Member Author

> I am using this branch of matchcode-toolkit in aboutcode-org/matchcode-tests#2 and it is failing when stemming some C source files.

@JonoYang Looking into it

@JonoYang
Member

@keshav-space The issue was that I was not using the same version of tree-sitter in matchcode-tests. Once I created requirements files and pinned the dependencies to the same versions we have in matchcode-toolkit, it worked.
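
For anyone hitting the same `Incompatible Language version` error, a quick way to confirm a version mismatch like the one described above is to compare the installed packages against the expected pins. The following is only a minimal sketch using the standard library: the helper name `find_version_mismatches` is hypothetical, and the package names and version numbers in `EXAMPLE_PINS` are placeholders, not the actual pins from matchcode-toolkit.

```python
from importlib import metadata


def find_version_mismatches(pins, get_version=metadata.version):
    """
    Return a mapping of package name to (expected, installed) for every
    pinned package whose installed version differs from the pin.
    ``installed`` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Placeholder pins (assumption): replace these with the versions that
# matchcode-toolkit actually pins before relying on the result.
EXAMPLE_PINS = {
    "tree-sitter": "0.0.0",
    "tree-sitter-c": "0.0.0",
}

if __name__ == "__main__":
    for name, (expected, installed) in find_version_mismatches(EXAMPLE_PINS).items():
        print(f"{name}: expected {expected}, installed {installed}")
```

Running this in the environment that raises the `ValueError` would list any package whose installed version differs from the pin, which is the situation that pinning the dependencies resolved here.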

@JonoYang
Member

Tests are passing in matchcode-toolkit. Thanks!

@JonoYang JonoYang merged commit ca21376 into main Feb 19, 2025
6 checks passed
@JonoYang JonoYang deleted the code-stemming branch February 19, 2025 19:56
Linked issue: AI-GCS: Design and implement "Code Stemming", e.g., token replacement and abstraction
2 participants