-
Notifications
You must be signed in to change notification settings - Fork 873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tests that arrow IPC data is validated #7096
Conversation
Add tests for invalid arrays
@@ -1744,27 +1745,73 @@ mod tests { | |||
}); | |||
} | |||
|
|||
fn roundtrip_ipc(rb: &RecordBatch) -> RecordBatch { | |||
/// Write the record batch to an in-memory buffer in IPC File format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just refactored these helpers into smaller chunks
|
||
/// Return the first record batch read from the IPC File buffer | ||
/// using the FileDecoder API | ||
fn read_ipc_with_decoder(buf: Vec<u8>) -> Result<RecordBatch, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every time I find myself using FileDecoder
I end up copy/pasting the example.
I am starting to think having something like
arrow-rs/arrow/examples/zero_copy_ipc.rs
Lines 84 to 96 in 468e992
/// Incrementally decodes [`RecordBatch`]es from an IPC file stored in a Arrow | |
/// [`Buffer`] using the [`FileDecoder`] API. | |
/// | |
/// This is a wrapper around the example in the `FileDecoder` which handles the | |
/// low level interaction with the Arrow IPC format. | |
struct IPCBufferDecoder { | |
/// Memory (or memory mapped) Buffer with the data | |
buffer: Buffer, | |
/// Decoder that reads Arrays that refers to the underlying buffers | |
decoder: FileDecoder, | |
/// Location of the batches within the buffer | |
batches: Vec<Block>, | |
} |
(given I am about to go copy/paste it again into comet...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is another example:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -2492,4 +2539,109 @@ mod tests { | |||
assert_eq!(decoded_batch.expect("Failed to read RecordBatch"), batch); | |||
}); | |||
} | |||
|
|||
#[test] | |||
fn test_validation_of_invalid_list_array() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on how I can make an invalid PrimitiveArray
would be most apprecaited.
Anything i tried to mismatch the len and the actual data, resulted in panics in bounds checks (even with the try_new_unchecked
).
I think it is probably a good thing that it is so hard to create invalid arrays but it would be nice to test this path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayData will normally let you do various inadvisable things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I tried to do was make null buffers that had different sizes than the underlying values -- that actually failed the ArrayData creation checks, but when it was re-read via IPC the resulting arrays were fine.
Maybe it was due to padding or something. I'll play around with it some more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concluded it was impossible to create an invalid PrimitiveArray -- when I made one that had invalid nulls, then I got a panic in the BooleanBufferConstructor here:
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a more invasive change to add tests for validation for PrimitiveArrays and concluded that the current code will panic rather than error
Specifically, this code which generates invalid IPC data results in a panic on decode rather than an Result
#[test]
fn test_validation_of_invalid_primitive_array() {
// Int32Array with not enough nulls
let array = unsafe {
let nulls = NullBuffer::from(&[true, false, true, false]);
let buffer = ScalarBuffer::<i32>::from_iter(0..8000);
let data = ArrayDataBuilder::new(DataType::Int32)
.len(8000)
.add_buffer(buffer.into())
.nulls(Some(nulls)) // too few nulls
.build_unchecked();
Int32Array::from(data)
};
expect_ipc_validation_error(
Arc::new(array),
"Invalid argument error: Nulls do not match",
);
}
Since I am not really trying to change / improve validation in this PR or project, I do not plan to change how it works (though I will file a ticket with the report of panic vs error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let err = read_ipc(&buf).unwrap_err(); | ||
assert_eq!(err.to_string(), expected_err); | ||
|
||
// TODO verify there is no error when validation is disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to add a few more lines here when we disable validation
@@ -2492,4 +2539,109 @@ mod tests { | |||
assert_eq!(decoded_batch.expect("Failed to read RecordBatch"), batch); | |||
}); | |||
} | |||
|
|||
#[test] | |||
fn test_validation_of_invalid_list_array() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concluded it was impossible to create an invalid PrimitiveArray -- when I made one that had invalid nulls, then I got a panic in the BooleanBufferConstructor here:
assert!( |
I think this PR is ready for review |
}; | ||
expect_ipc_validation_error( | ||
Arc::new(array), | ||
"Invalid argument error: Invalid UTF8 sequence at string index 3 (3..45): invalid utf-8 sequence of 1 bytes from index 38" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message is a bit peculiar, but that's not really something for this PR
Thanks again @tustvold |
Which issue does this PR close?
Rationale for this change
To test disabling validation in the IPC reader I first need to show there is validation actually occuring.
It is also probably good in general to have tests showing we validate data when read
What changes are included in this PR?
Add tests that show when invalid arrow data is written to IPC files, the StreamReader, FileReader and FileDecoder catch and verify those errors
Are there any user-facing changes?
This is entirely tests, no changes in functionality