File Corruption after Repeatedly Writing Hist Histograms #1578

@LarsHebenstiel

Uproot version 5.7.1

For my analysis I construct many intermediate histograms in Python and want to save them to a temporary file so I don't have to re-run the time-consuming pre-processing stage. Partway through processing, however, the ROOT file becomes corrupted, and uproot.update("./intermediate.root") fails on it due to the assert on this line. At that point uproot can still read the histograms in the file, and ROOT can still open it without reporting issues.

Here is a download link (~360MB) to the corrupted root file: Google Drive - intermediate.root

Basically the preprocessing consists of:

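import hist
import uproot

# We will store intermediate results in "intermediate.root"
with uproot.create("intermediate.root") as file:
    pass

# raw_data_files, raw_data_file_number and output_hist_list are defined in
# parts of the script that are not shown here
for raw_data_file in raw_data_files:
    # Read in data...
    # Construct hist Histograms...
    directory = 'rawroot_' + str(raw_data_file_number) + '/'
    # Re-open the same file for every raw data file and add its histograms
    with uproot.update("intermediate.root") as file:
        output_hist: hist.Hist
        for output_hist in output_hist_list:
            file[directory + output_hist.name] = output_hist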

I don't see any obvious issues with this approach, but the pre-processing fails partway through: the corruption I observe blocks any further writes to the intermediate result file.
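To narrow down where the failure starts, the loop above could be wrapped so that it records which raw data file triggers the assert and how large the file was just before that write. This is only a sketch: run_updates and write_batch_with_uproot are hypothetical names, and the only thing taken from the report is that the failure surfaces as an assert (an AssertionError) when the file is re-opened for updating.

```python
import os


def run_updates(path, batches, write_batch):
    """Write each batch with write_batch(path, index, batch).

    Returns None if every write succeeds, otherwise a tuple of
    (index of the failing batch, file size in bytes before that write).
    """
    for index, batch in enumerate(batches):
        size_before = os.path.getsize(path) if os.path.exists(path) else 0
        try:
            write_batch(path, index, batch)
        except AssertionError:
            # Stop at the first failing write and report where it happened
            return index, size_before
    return None


def write_batch_with_uproot(path, index, output_hist_list):
    """Hypothetical writer that mirrors the update loop from the report."""
    import uproot  # third-party; imported here so run_updates has no dependencies

    with uproot.update(path) as file:
        for output_hist in output_hist_list:
            file[f"rawroot_{index}/{output_hist.name}"] = output_hist
```

Calling run_updates("intermediate.root", batches, write_batch_with_uproot), with batches being one list of histograms per raw data file, would then return the index of the first raw data file whose update fails, if any.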

Here are some bug-testing steps I have taken:

  1. Maybe the issue is because of the large number of histograms I am writing to the file?
    This didn't seem likely to me, but I ran a toy example anyway in which I successfully wrote 100,000 histograms to a file using the same directory structure as in my sample file. No issues here.

  2. Maybe the last histograms I wrote into the file during pre-processing are corrupt for some reason?
    I checked the possibly suspicious histograms in python before they get written and they seem fine.

  3. What if I try writing each rawroot raw data file's intermediate output into a separate new intermediate ROOT file instead of using directories like in my sample file?
    This works! Pre-processing completes and I can continue with the analysis, so for now I am using this workaround.

  4. Since I can read the intermediate.root file just fine, I tried re-reading all written histograms back in and immediately copying them to another root file.
    As long as I do all of this inside a single with block, I can keep writing more histograms without any issue:

# This works
with uproot.open('intermediate.root') as in_file:
    with uproot.create('intermediate_copy.root') as out_file:
        # Copy all histograms from in to out
        # Now write more histograms to out_file, this succeeds
        pass
    pass

# Now if I try opening with uproot.update('intermediate_copy.root') I have no issues

However, just making a copy and then re-opening the copy does not work:

# This doesn't work
with uproot.open('intermediate.root') as in_file:
    with uproot.create('intermediate_copy.root') as out_file:
        # Copy all histograms from in to out
        pass
    pass

# Fails with same assert error
with uproot.update('intermediate_copy.root') as file:
    pass
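The "Copy all histograms from in to out" step in the two snippets above could look roughly like the following. This is a minimal sketch of one possible way to do the copy, not necessarily what was run for this report: copy_histograms and is_histogram_class are hypothetical helper names, and converting each object through to_hist() before writing is an assumption.

```python
HISTOGRAM_PREFIXES = ("TH1", "TH2", "TH3")


def is_histogram_class(classname):
    """Return True for ROOT histogram class names such as TH1D or TH2F."""
    return classname.startswith(HISTOGRAM_PREFIXES)


def copy_histograms(in_path, out_path):
    """Copy every histogram from in_path into a freshly created out_path."""
    import uproot  # third-party; to_hist() additionally needs the hist package

    with uproot.open(in_path) as in_file, uproot.recreate(out_path) as out_file:
        # recursive=True descends into the rawroot_N/ directories;
        # directories themselves are skipped by the class name filter
        keys = in_file.keys(
            recursive=True, cycle=False, filter_classname=is_histogram_class
        )
        for key in keys:
            # Keys contain the directory path, so the layout is preserved
            out_file[key] = in_file[key].to_hist()
```

According to the observations above, re-opening such a copy with uproot.update hit the same assert, so a sketch like this is useful for reproducing the test rather than as a fix.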

So, I'm kind of at a loss as to what the problem might be. My best (only?) guess at this point is that repeatedly re-opening a ROOT file, adding a few histograms, and then closing it is somehow causing this corruption. Then again, performing a deep copy of all histograms in my file using just a single with block also results in a corrupted file. I have a workaround for now (using multiple intermediate files, one for each raw data file), but I'm curious what's causing this.
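For reference, the workaround with one intermediate file per raw data file can be sketched as below. The file naming scheme and the helper names intermediate_path and write_intermediate are hypothetical, and writing the histograms at the top level of each file instead of inside a rawroot_N/ directory is an assumption.

```python
from pathlib import Path


def intermediate_path(out_dir, raw_data_file_number):
    """Build the path of the intermediate file for one raw data file."""
    return Path(out_dir) / f"intermediate_{raw_data_file_number}.root"


def write_intermediate(out_dir, raw_data_file_number, output_hist_list):
    """Write all histograms of one raw data file into its own ROOT file."""
    import uproot  # third-party; imported here so intermediate_path is standalone

    path = intermediate_path(out_dir, raw_data_file_number)
    # Each file is created once and never re-opened with uproot.update
    with uproot.recreate(str(path)) as file:
        for output_hist in output_hist_list:
            file[output_hist.name] = output_hist
    return path
```

Because every file is written in a single with block, this avoids the repeated open, write, close cycle on one shared file that the loop at the top of this report performs.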

Labels: bug (unverified). The problem described would be a bug, but needs to be triaged.
