Bright DPX images make MP3 identification regex extremely slow #233

@jukuisma

Summary

Identification gets stuck for over a minute on bright production DPX images at:

trying  b'(?s)\\xff[\\xfa\\xfb\\xf2\\xf3][\\x10-\\xeb].{46,1439}\\xff[\\xfa\\xfb\\xf2\\xf3][\\x10-\\xeb].{46,1439}\\Z'

The MP3 signatures (fmt/134) are defined in: https://raw.githubusercontent.com/openpreserve/fido/refs/heads/main/fido/conf/formats-v116.xml

$ curl -s https://raw.githubusercontent.com/openpreserve/fido/refs/heads/main/fido/conf/formats-v116.xml | grep -A 16 '<puid>fmt/134</puid>'
    <puid>fmt/134</puid>
    <mime>audio/mpeg</mime>
    <name>MPEG 1/2 Audio Layer 3</name>
    <version />
    <alias>MP3</alias>
    <pronom_id>687</pronom_id>
    <extension>mp3</extension>
    <apple_uti>public.mp3</apple_uti>
    <signature>
      <name>MPEG-1 Audio Layer 3 with ID3v2 Tag</name>
      <note>Regularly-spaced frame headers should always be discoverable near EOF. An ID3v1 tag of up to 355 bytes may be present at EOF.</note>
      <pattern>
        <position>EOF</position>
        <pronom_pattern>FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{47-1795}</pronom_pattern>
        <regex>(?s)\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{47,1795}\Z</regex>
      </pattern>
      <pattern>

Brightness seems like a red herring, but it is the actual cause: the bright image data makes this regex partially match at many offsets, and the resulting backtracking makes identification slow.

I'm looking at improving this but any and all feedback would be much appreciated.
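
A minimal sketch for timing the pattern from the `trying` line above in isolation. It assumes the pattern is searched against a buffer read from the end of the file; `time_eof_match` and the 128 KiB `tail_size` are hypothetical names and values, not taken from fido.

```python
import re
import time

# Pattern copied from the "trying" line above.
PATTERN = re.compile(
    rb"(?s)\xff[\xfa\xfb\xf2\xf3][\x10-\xeb].{46,1439}"
    rb"\xff[\xfa\xfb\xf2\xf3][\x10-\xeb].{46,1439}\Z"
)


def time_eof_match(data: bytes, tail_size: int = 128 * 1024):
    """Search the last tail_size bytes of data and return (matched, seconds).

    tail_size is an assumed buffer size for illustration only.
    """
    tail = data[-tail_size:]
    start = time.perf_counter()
    matched = PATTERN.search(tail) is not None
    return matched, time.perf_counter() - start
```

To try it on a file, read the file in binary mode and pass the bytes, e.g. `time_eof_match(open("rand.dpx", "rb").read())`. Start with a small `tail_size` and increase it gradually, since the slow case can take a long time.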

Steps to reproduce

I have the production images and can reproduce this locally. I'm currently trying to generate similar images with random data that makes this regex slow, but haven't gotten it to work yet. I'll attach one such image and rough instructions to create it below:

# Create 4k all white png in GIMP
# Convert it to dpx:
$ ffmpeg -i white.png white.dpx

# Create 0x17bb00 * 0x10 bytes of random xxd formatted data:
$ cat rand.py
import os

def rand_line(offset):
    """
    <hex-offset>: (f[0-f]<rand> <rand><rand> ){4}
    """
    line = f"{offset:08x}: "

    rand = os.urandom(4)

    byte1 = f"{rand[0] | 0xf0:02x}"
    byte2 = f"{rand[1]:02x}"
    byte3 = f"{rand[2]:02x}"
    byte4 = f"{rand[3]:02x}"

    for i in range(4):
        line += f"{byte1}{byte2} {byte3}{byte4} "

    return f"{line[:-1]}\n"

with open("rand.xxd", "w") as outfile:
    for i in range(0x17bb00):
        offset = i * 0x10
        if offset % 1024**2 == 0:
            print(f"{offset:08x}/{0x17bb00*0x10:08x}\r", end="")
        outfile.write(rand_line(offset))

# Create the binary
$ xxd -r rand.xxd > rand.bin

# Remove all image data from white.dpx
# i.e., all ff bytes after offset 0x680
# 00000680: ffff ffff ffff ffff ffff ffff ffff ffff  ................
$ vim -b white.dpx

# Concatenate new random image data
$ cat white.dpx rand.bin > rand.dpx

rand.dpx.tar.gz
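
A rough way to check how often generated data even starts a partial match is to count the offsets where the three-byte frame header from the pattern occurs. The sketch below compares data laid out like the rand.py output above with uniformly random bytes. The helper names (`rand_py_like`, `uniform`, `count_headers`) and the sample size are made up for illustration, and this only counts header candidates; it does not measure backtracking cost.

```python
import random
import re

# First three bytes of each frame header in the pattern from the "trying" line.
HEADER = re.compile(rb"\xff[\xfa\xfb\xf2\xf3][\x10-\xeb]")


def rand_py_like(n_lines: int, rng: random.Random) -> bytes:
    """Mimic rand.py: each 16-byte line repeats one 4-byte group whose
    first byte has its high nibble forced to f."""
    out = bytearray()
    for _ in range(n_lines):
        group = bytearray(rng.getrandbits(32).to_bytes(4, "big"))
        group[0] |= 0xF0
        out += group * 4
    return bytes(out)


def uniform(n_bytes: int, rng: random.Random) -> bytes:
    """Uniformly random bytes, for comparison."""
    return rng.getrandbits(8 * n_bytes).to_bytes(n_bytes, "big")


def count_headers(data: bytes) -> int:
    """Count offsets where a frame-header candidate starts."""
    return sum(1 for _ in HEADER.finditer(data))


rng = random.Random(0)  # fixed seed so runs are repeatable
n_lines = 1 << 16       # 1 MiB of data per sample
bright = count_headers(rand_py_like(n_lines, rng))
plain = count_headers(uniform(16 * n_lines, rng))
print(f"rand.py-like: {bright} candidates, uniform: {plain} candidates")
```

Note that a buffer made only of `ff` bytes yields no candidates at all, because the second header byte must be one of `fa`, `fb`, `f2`, `f3`.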

Labels: P2 (Medium priority issues to be scheduled in a future release)
