Add tests that arrow IPC data is validated #7096

Merged · alamb merged 8 commits into apache:main from alamb/ipc_validation_tests on Feb 12, 2025

Conversation

@alamb (Contributor) commented Feb 7, 2025

Which issue does this PR close?

Rationale for this change

To test disabling validation in the IPC reader, I first need to show that validation is actually occurring.

It is also good in general to have tests showing that we validate data when it is read.

What changes are included in this PR?

Add tests showing that when invalid Arrow data is written to IPC files, the StreamReader, FileReader, and FileDecoder catch and report those errors.

Are there any user-facing changes?

This PR is entirely tests; there are no changes in functionality.
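
A sketch of the shape these tests take (a hedged illustration assuming the usual arrow_ipc writer/reader APIs; the helper name matches the one used later in this PR, but the body here is illustrative, not the PR's exact code):

use std::io::Cursor;
use arrow_array::{ArrayRef, RecordBatch};
use arrow_ipc::reader::FileReader;
use arrow_ipc::writer::FileWriter;

/// Writes a batch containing `array` to an in-memory IPC file and asserts
/// that reading it back fails with `expected_err`. The StreamReader and
/// FileDecoder paths are exercised analogously.
fn expect_ipc_validation_error(array: ArrayRef, expected_err: &str) {
    let batch = RecordBatch::try_from_iter([("a", array)]).unwrap();

    // Writing does not validate the array contents, so this succeeds
    let mut buf = Vec::new();
    {
        let schema = batch.schema();
        let mut writer = FileWriter::try_new(&mut buf, schema.as_ref()).unwrap();
        writer.write(&batch).unwrap();
        writer.finish().unwrap();
    }

    // Reading the batch back validates the data and should error
    let mut reader = FileReader::try_new(Cursor::new(buf), None).unwrap();
    let err = reader.next().unwrap().unwrap_err();
    assert_eq!(err.to_string(), expected_err);
}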

@github-actions bot added the arrow (Changes to the arrow crate) label Feb 7, 2025
@@ -1744,27 +1745,73 @@ mod tests {
});
}

fn roundtrip_ipc(rb: &RecordBatch) -> RecordBatch {
/// Write the record batch to an in-memory buffer in IPC File format
@alamb (Contributor Author) commented:
I just refactored these helpers into smaller chunks


/// Return the first record batch read from the IPC File buffer
/// using the FileDecoder API
fn read_ipc_with_decoder(buf: Vec<u8>) -> Result<RecordBatch, ArrowError> {
@alamb (Contributor Author) commented:
Every time I find myself using FileDecoder I end up copy/pasting the example.

I am starting to think having something like

/// Incrementally decodes [`RecordBatch`]es from an IPC file stored in an Arrow
/// [`Buffer`] using the [`FileDecoder`] API.
///
/// This is a wrapper around the example in the `FileDecoder` which handles the
/// low level interaction with the Arrow IPC format.
struct IPCBufferDecoder {
/// Memory (or memory mapped) Buffer with the data
buffer: Buffer,
/// Decoder that reads Arrays that refers to the underlying buffers
decoder: FileDecoder,
/// Location of the batches within the buffer
batches: Vec<Block>,
}
in the actual crate might be useful

(given I am about to go copy/paste it again into comet...)
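
For reference, a sketch of the rest of such a wrapper, mirroring the FileDecoder doc example this is copied from (assuming arrow_ipc's root_as_footer, read_footer_length, and convert::fb_to_schema; not necessarily what would land in the crate):

// (struct IPCBufferDecoder as sketched above)
use std::sync::Arc;
use arrow_array::RecordBatch;
use arrow_buffer::Buffer;
use arrow_ipc::convert::fb_to_schema;
use arrow_ipc::reader::{read_footer_length, FileDecoder};
use arrow_ipc::root_as_footer;

impl IPCBufferDecoder {
    /// Parse the footer, read any dictionaries, and record the batch locations
    fn new(buffer: Buffer) -> Self {
        // The footer length and "ARROW1" magic live in the last 10 bytes
        let trailer_start = buffer.len() - 10;
        let footer_len = read_footer_length(buffer[trailer_start..].try_into().unwrap()).unwrap();
        let footer = root_as_footer(&buffer[trailer_start - footer_len..trailer_start]).unwrap();

        let schema = fb_to_schema(footer.schema().unwrap());
        let mut decoder = FileDecoder::new(Arc::new(schema), footer.version());

        // Read any dictionaries up front so batches can refer to them
        for block in footer.dictionaries().iter().flatten() {
            let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
            let data = buffer.slice_with_length(block.offset() as usize, block_len);
            decoder.read_dictionary(block, &data).unwrap();
        }

        // Copy the batch locations out of the flatbuffer footer
        let batches = footer
            .recordBatches()
            .map(|b| b.iter().copied().collect())
            .unwrap_or_default();

        Self { buffer, decoder, batches }
    }

    /// Decode batch `i`, returning arrays that refer back to `self.buffer`
    fn get_batch(&self, i: usize) -> Option<RecordBatch> {
        let block = &self.batches[i];
        let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
        let data = self.buffer.slice_with_length(block.offset() as usize, block_len);
        self.decoder.read_record_batch(block, &data).unwrap()
    }
}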

@@ -2492,4 +2539,109 @@ mod tests {
assert_eq!(decoded_batch.expect("Failed to read RecordBatch"), batch);
});
}

#[test]
fn test_validation_of_invalid_list_array() {
@alamb (Contributor Author) commented:
Any thoughts on how I can make an invalid PrimitiveArray would be most appreciated.

Everything I tried to mismatch the len and the actual data resulted in panics in bounds checks (even with try_new_unchecked).

I think it is probably a good thing that it is so hard to create invalid arrays, but it would be nice to test this path.

A contributor commented:

ArrayData will normally let you do various inadvisable things
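
For example, ArrayDataBuilder::build_unchecked will happily produce a StringArray whose value buffer is not valid UTF-8 (a hypothetical sketch, not the PR's exact test):

#[test]
fn invalid_utf8_sketch() {
    use arrow_array::StringArray;
    use arrow_buffer::Buffer;
    use arrow_data::ArrayDataBuilder;
    use arrow_schema::DataType;

    // One string spanning bytes 0..5 of the value buffer
    let offsets = Buffer::from_slice_ref([0i32, 5]);
    // 0xFF can never appear in well-formed UTF-8
    let values = Buffer::from_slice_ref(b"ab\xFFcd");
    let data = unsafe {
        ArrayDataBuilder::new(DataType::Utf8)
            .len(1)
            .add_buffer(offsets)
            .add_buffer(values)
            .build_unchecked() // skips the validation try_new would do
    };
    // No panic here; the invalid bytes surface later, e.g. on IPC read
    let _array = StringArray::from(data);
}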

@alamb (Contributor Author) commented:
One thing I tried was making null buffers with different sizes than the underlying values -- that actually failed the ArrayData creation checks, but when the data was re-read via IPC the resulting arrays were fine.

Maybe it was due to padding or something. I'll play around with it some more.

@alamb (Contributor Author) commented:

I concluded it was impossible to create an invalid PrimitiveArray -- when I made one that had invalid nulls, I got a panic in the BooleanBuffer constructor here:

@alamb (Contributor Author) commented:

I tried a more invasive change to add tests for validation of PrimitiveArrays and concluded that the current code will panic rather than error.

Specifically, this code, which generates invalid IPC data, results in a panic on decode rather than an Err:

    #[test]
    fn test_validation_of_invalid_primitive_array() {
        // Int32Array with not enough nulls
        let array = unsafe {
            let nulls = NullBuffer::from(&[true, false, true, false]);

            let buffer = ScalarBuffer::<i32>::from_iter(0..8000);
            let data = ArrayDataBuilder::new(DataType::Int32)
                .len(8000)
                .add_buffer(buffer.into())
                .nulls(Some(nulls)) // too few nulls
                .build_unchecked();

            Int32Array::from(data)
        };

        expect_ipc_validation_error(
            Arc::new(array),
            "Invalid argument error: Nulls do not match",
        );
    }

Since I am not really trying to change / improve validation in this PR, I do not plan to change how it works (though I will file a ticket reporting the panic vs. error behavior).

@alamb (Contributor Author) commented:

let err = read_ipc(&buf).unwrap_err();
assert_eq!(err.to_string(), expected_err);

// TODO verify there is no error when validation is disabled
@alamb (Contributor Author) commented:

I plan to add a few more lines here when we disable validation
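
For reference, a stream-format version of this round-trip check might look like the following (a sketch assuming the StreamWriter/StreamReader APIs; the PR's actual read_ipc helpers may differ):

use std::io::Cursor;
use arrow_array::RecordBatch;
use arrow_ipc::reader::StreamReader;
use arrow_ipc::writer::StreamWriter;
use arrow_schema::ArrowError;

/// Round-trips `batch` through the IPC Stream format, returning the
/// error produced when the reader validates the decoded data
fn read_ipc_stream_err(batch: &RecordBatch) -> ArrowError {
    let mut buf = Vec::new();
    {
        let schema = batch.schema();
        let mut writer = StreamWriter::try_new(&mut buf, schema.as_ref()).unwrap();
        writer.write(batch).unwrap();
        writer.finish().unwrap();
    }
    let mut reader = StreamReader::try_new(Cursor::new(buf), None).unwrap();
    reader.next().unwrap().unwrap_err()
}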

@alamb marked this pull request as ready for review February 7, 2025 17:53
@alamb changed the title from "Tests for IPC validation" to "Add tests that arrow IPC data is validated" Feb 7, 2025
@alamb (Contributor Author) commented Feb 10, 2025

I think this PR is ready for review

};
expect_ipc_validation_error(
Arc::new(array),
"Invalid argument error: Invalid UTF8 sequence at string index 3 (3..45): invalid utf-8 sequence of 1 bytes from index 38"
A contributor commented:

This error message is a bit peculiar, but that's not really something for this PR

@alamb merged commit 78c9df9 into apache:main Feb 12, 2025
26 checks passed
@alamb (Contributor Author) commented Feb 12, 2025

Thanks again @tustvold

@alamb deleted the alamb/ipc_validation_tests branch February 12, 2025 11:44