Skip to content

Change to get_value_str() to escape regexes broke capa2yara.py #1909

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ruppde opened this issue Dec 13, 2023 · 11 comments · May be fixed by #2626
Open

Change to get_value_str() to escape regexes broke capa2yara.py #1909

ruppde opened this issue Dec 13, 2023 · 11 comments · May be fixed by #2626
Labels
good first issue Good for newcomers

Comments

@ruppde
Copy link
Contributor

ruppde commented Dec 13, 2023

Description

With 58e94a3 the regexes returned by get_value_str() are escaped which breaks e.g.

bytesv = kid.get_value_str()

Steps to Reproduce

Run

python ./scripts/capa2yara.py  rules/host-interaction/file-system/reference-absolute-stream-path-on-windows.yml 2>&1 |grep x5D

The 2nd line shows the regex escaped, which is of no use in yara:

...
INFO:capa2yara:doing kids: [regex(string =~ /^(\\\\\?\\)?([\w]\:|\\)(\\((?![\<\>\"\/\|\*\?\:\\])[\x20-\x5B\x5D-\x7E])+)+\:\$?[a-zA-Z0-9_]+/)] - len: 1
INFO:capa2yara:doing regex: '/^(\\\\\\\\\\\\\\\\\\\\?\\\\\\\\)?([\\\\w]\\\\:|\\\\\\\\)(\\\\\\\\((?![\\\\<\\\\>\\\\\\"\\\\/\\\\|\\\\*\\\\?\\\\:\\\\\\\\])[\\\\x20-\\\\x5B\\\\x5D-\\\\x7E])+)+\\\\:\\\\$?[a-zA-Z0-9_]+/'
...

Expected behavior:

No escaping

Actual behavior:

See above

Versions

Most recent github version

Additional Information

How should we fix this? Introduce another function which returns the regex unescaped?

(capa2yara.py is the only script in scripts/ which uses the function, so shouldn't have broken more)

@williballenthin
Copy link
Contributor

williballenthin commented Dec 14, 2023

here's the regex in the raw rule yaml:

- string: /^(\\\\\?\\)?([\w]\:|\\)(\\((?![\<\>\"\/\|\*\?\:\\])[\x20-\x5B\x5D-\x7E])+)+\:\$?[a-zA-Z0-9_]+/

https://github.com/mandiant/capa-rules/blob/57b3911a72462e0597ca0d6685f8b02b38857765/host-interaction/file-system/reference-absolute-stream-path-on-windows.yml#L17C1-L17C112

and as logged above, from get_value_str:

[regex(string =~ /^(\\\\\?\\)?([\w]\:|\\)(\\((?![\<\>\"\/\|\*\?\:\\])[\x20-\x5B\x5D-\x7E])+)+\:\$?[a-zA-Z0-9_]+/)]

and they look the same to me. So get_value_str returns the regex as provided in the capa rule. This can (almost directly) be passed to re.compile to create a regular expression instance. I assume it can be converted to a yara-compatible regex trivially, too.

Can you explain the problem in a little more detail?

@ruppde
Copy link
Contributor Author

ruppde commented Dec 14, 2023

The line after that is the problem. That's what ends up in the yara rule and it's escaped when it shouldn't:
...
INFO:capa2yara:doing regex: '/^(\\\\\\\\\\?\\\\)?([\\w]\\:|\\\\)(\\\\((?![\\<\\>\\\"\\/\\|\\*\\?\\:\\\\])[\\x20-\\x5B\\x5D-\\x7E])+)+\\:\\$?[a-zA-Z0-9_]+/'
...

@williballenthin
Copy link
Contributor

is this possibly because the logging statement uses %r instead of %s? i don't think there's any extra escaping being done by capa.

@ruppde
Copy link
Contributor Author

ruppde commented Dec 15, 2023

Should be this line from the commit above:
image

@mr-tz mr-tz added the good first issue Good for newcomers label Feb 5, 2024
@acelynnzhang
Copy link
Contributor

I took a look at this a few days ago, but I couldn't figure out what was going on. Here's my output after reverting 58e94a3:

❯ python scripts/capa2yara.py -q rules/host-interaction/file-system/reference-absolute-stream-path-on-windows.yml > test.yar
...

import "pe"


private rule capa_pe_file : CAPA {
    meta:
        description = "Match in PE files. Used by other CAPA rules"
    condition:
        uint16be(0) == 0x4d5a
        or uint16be(0) == 0x558b
        or uint16be(0) == 0x5649
}


rule capa_reference_absolute_stream_path_on_Windows : CAPA  {
  meta:
 	description = "reference absolute stream path on Windows (converted from capa rule)"
	namespace = "host-interaction/file-system"
	static scope = "basic block"
	dynamic scope = "call"
	hash = "51828683DC26BFABD3994494099AE97D"
	reference = "This YARA rule converted from capa rule: https://github.com/mandiant/capa-rules/blob/master/rules/host-interaction/file-system/reference-absolute-stream-path-on-windows.yml"
	capa_nursery = "False"
	date = "2024-04-05"
	minimum_yara = "3.8"
	license = "Apache-2.0 License"

  strings:
 	$re_aaa = /\x00(\\\\\?\\)?([\w]\:|\\)(\\((?![\<\>\"\/\|\*\?\:\\])[\x20-\x5B\x5D-\x7E])+)+\:\$?[a-zA-Z0-9_]+/ ascii wide

  condition:
    capa_pe_file and
 (
			$re_aaa
	)
}

It seems like YARA doesn't like spaces in meta field names, so I had to add underscores to static scope and dynamic scope. Even after that, it still doesn't work with the converted rule. I'm also not sure where the null byte came from. I tried to play around with the original capa regex, but couldn't get that to work either.

❯ yara test.yar 51828683dc26bfabd3994494099ae97d.elf_
error: rule "capa_reference_absolute_stream_path_on_Windows" in test.yar(43): invalid regular expression "$re_aaa": syntax error

What did the regex look like when the script was working?

@Dronesh77
Copy link

Hi @ruppde ,

I wanted to follow up on the issue you raised regarding the regex escaping problem in capa2yara.py. I have attempted to address this by introducing a new function that returns the regex unescaped specifically for use in capa2yara.py.

I have modified the code to include this new function, which should resolve the issue of over-escaped regex patterns in the generated YARA rules.

def get_unescaped_regex(regex):
    unescaped = regex.replace("\\\\", "\\")
    unescaped = unescaped.replace("\\/", "/")
    
    # Remove escaping of other special characters that don't need escaping in YARA
    unescaped = unescaped.replace("\\(", "(").replace("\\)", ")").replace("\\[", "[").replace("\\]", "]")
    
    return unescaped

# ... (rest of the imports and global variables)

def convert_rule(rule, rulename, cround, depth):
    # ... (previous code remains the same)

    elif s_type == "regex":
        regex = kid.get_value_str()
        logger.info("doing regex: %r", regex)

        # Use the new function to get unescaped regex
        regex = get_unescaped_regex(regex)

        # change capas /xxx/i to yaras /xxx/ nocase, count will be used later to decide appending 'nocase'
        regex, count = re.subn(r"/i$", "/", regex)

        # remove / in the beginning and end
        regex = regex[1:-1]

        # all .* in the regexes of capa look like they should be maximum 100 chars so take 1000 to speed up rules and prevent yara warnings on poor performance
        regex = regex.replace(".*", ".{,1000}")

        # capa uses python regex which accepts /reg(|.exe)/ but yaras regex engine doesn't not => fix it
        # /reg(|.exe)/ => /reg(.exe)?/
        regex = re.sub(r"\(\|([^\)]+)\)", r"(\1)?", regex)

        # change beginning of line to null byte, e.g. /^open => /\x00open
        regex = re.sub(r"^\^", r"\\x00", regex)

        regex = "/" + regex + "/"
        if count:
            regex += " nocase"

        var_name = "re_" + var_names.pop(0)
        yara_strings += "\t" + "$" + var_name + " = " + regex + " ascii wide " + convert_description(kid) + "\n"
        yara_condition += "\t" + "$" + var_name + " "

    # ... (rest of the code remains the same)

# ... (rest of the functions and main() remain the same)

@ruppde
Copy link
Contributor Author

ruppde commented Mar 12, 2025

hi @Dronesh77,

that looks great, could you please make a pull request?

Dronesh77 pushed a commit to Dronesh77/capa that referenced this issue Mar 13, 2025
@Dronesh77 Dronesh77 linked a pull request Mar 13, 2025 that will close this issue
3 tasks
@Dronesh77
Copy link

Hi @ruppde ,

I’ve raised the pull request as requested. Could you please review it and let me know your feedback?

@ruppde
Copy link
Contributor Author

ruppde commented Mar 24, 2025

Hi @Dronesh77,
in https://github.com/mandiant/capa/pull/2626/files only the Readme is changed and the mode of capa2yara.py ?

@Dronesh77
Copy link

Hi @ruppde ,

I’ve made the following changes to address the regex escaping issue in capa2yara.py:

Introduced a new function: I added get_unescaped_regex() to handle unescaped regex patterns specifically for YARA compatibility. This function removes unnecessary escaping of special characters and adjusts the regex for proper use in YARA rules.

Updated regex handling: In the convert_rule() function, I replaced the old logic with this new function to ensure that the regex is correctly formatted. This includes:

Removing redundant escape characters (e.g., \\ → \).

Adjusting patterns like /reg(|.exe)/ to /reg(.exe)?/, which aligns with YARA's regex engine.

Modifying beginning-of-line markers (e.g., ^open → \x00open) for compatibility.

Improved performance: To prevent potential YARA warnings about poor performance, I limited patterns like .* to . {,1000}.

@ruppde
Copy link
Contributor Author

ruppde commented Apr 7, 2025

@Dronesh77 sounds great but the PR does not reflect it, it only changes the file permissions:

Image

from https://github.com/mandiant/capa/pull/2626/files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants