feat: add file_io and local impl by adapting arrow::filesystem #30
Has iceberg-cpp decided on an IO strategy already? It might be more productive to start writing the IO-less components, such as parsing the various metadata files, etc.
FileIO is one task in the Minimum Viable Product's work items. IMHO, we need these interfaces even for parsing the various metadata files, whether they live on the local filesystem or on S3. I'm not sure what you mean by IO strategy; can you elaborate?
For instance, async vs. sync: #2 (comment). Or whether the core library should do any IO at all: #2 (comment). (You could also perhaps imagine an interface where the core library just works in terms of an interface to an Avro file and only parses manifests/snapshots, deferring all I/O to other implementations.)
Is my understanding correct that this FileIO pertains to locations that are local, on HDFS, or on S3?
Yeah, I hope this FileIO can be extended to other storage systems.
Can we provide both?
By "without doing any IO", do you mean that users do the read and pass the buffer to iceberg for parsing only? In my use case, I prefer giving iceberg a directory (local filesystem or S3 bucket) and letting iceberg do the work.
I think we can. It is pretty straight-forward to do something similar to
How about adding a PoC to explore the idea from @pitrou: #2 (comment). We may need to add a lightweight reader interface in it as well.
Will try, thanks.
Thanks for working on this! IMHO, we'd better agree on the APIs first before implementing the whole thing to avoid wasted effort.
IIUC, we mainly need three kinds of I/O in the iceberg library:
1. File operations
- Delete file: e.g. delete uncommitted files, delete expired snapshots, etc.
2. Metadata writer/reader
- Table metadata file: write/read a JSON file with a single TableMetadata
- Manifest list file: write/read an Avro file with one or more ManifestFile
- Manifest file: write/read an Avro file with one or more ManifestEntry
3. Data/Delete file writer/reader
- Data file: write/read a parquet/orc/avro file with the data file schema
- Delete file: write/read a parquet/orc/avro/puffin file with the delete file schema
- Stats file: puffin file (sorry, I'm still not very familiar with it, but it is simply a container for multiple blobs)
The majority of I/O operations happen inside the parquet/avro/orc libraries, which we cannot control. Perhaps we can simply define the structures below to hide explicit I/O?
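#include <cstddef>
#include <map>
#include <string>

#include <arrow/c/abi.h>  // ArrowArray from the Arrow C data interface

struct FileInfo {
  std::string path;
  size_t size;
  std::map<std::string, std::string> storageCredentials;
};

class FileReader {
 public:
  virtual ~FileReader() = default;
  // Reads the next batch from the underlying file.
  virtual ArrowArray ReadNext() = 0;
};

class FileWriter {
 public:
  virtual ~FileWriter() = default;
  // Writes one batch to the underlying file.
  virtual void Write(ArrowArray array) = 0;
};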
Then the only remaining I/O we need to address is the table metadata file, which stores a single JSON string. This is the only reason to keep the InputFile and OutputFile interfaces. I'm not sure if it is fine to replace them by adding FileIO::ReadFileFully and FileIO::WriteFile?
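To make the ReadFileFully/WriteFile idea concrete, here is a minimal sketch of what such a FileIO could look like, with a local-filesystem implementation built only on the C++17 standard library. Only the names FileIO, ReadFileFully, and WriteFile come from the comment above; DeleteFile, LocalFileIO, the exact signatures, and the exception-based error handling are assumptions for illustration, not the API that was merged.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical minimal interface. Signatures and error handling are
// assumptions; the real project may prefer a Result/Status type.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::string ReadFileFully(const std::string& location) = 0;
  virtual void WriteFile(const std::string& location,
                         const std::string& content) = 0;
  virtual void DeleteFile(const std::string& location) = 0;
};

// Illustrative local implementation (hypothetical name).
class LocalFileIO : public FileIO {
 public:
  std::string ReadFileFully(const std::string& location) override {
    std::ifstream in(location, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open for reading: " + location);
    std::ostringstream buffer;
    buffer << in.rdbuf();  // slurp the whole file into memory
    return buffer.str();
  }

  void WriteFile(const std::string& location,
                 const std::string& content) override {
    std::ofstream out(location, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open for writing: " + location);
    out << content;
    if (!out.flush()) throw std::runtime_error("write failed: " + location);
  }

  void DeleteFile(const std::string& location) override {
    // std::filesystem::remove returns false when the file did not exist.
    if (!std::filesystem::remove(location)) {
      throw std::runtime_error("no such file: " + location);
    }
  }
};
```

With this shape, reading a table metadata file is a single call that returns the whole JSON string, which matches the assumption in this thread that metadata fits in memory.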
CMakeLists.txt
@@ -38,6 +38,7 @@ option(ICEBERG_BUILD_SHARED "Build shared library" OFF)
 option(ICEBERG_BUILD_TESTS "Build tests" ON)
 option(ICEBERG_ARROW "Build Arrow" ON)
 option(ICEBERG_AVRO "Build Avro" ON)
+option(ICEBERG_IO "Build IO" ON)
Can we remove this? I think we don't need a separate library for I/O. We can just define FileIO and related interfaces in libiceberg. iceberg-arrow could possibly implement these APIs by adapting arrow::FileSystem. We can provide an implementation for a local or in-memory filesystem in the unit test modules.
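As a rough illustration of the in-memory option for unit tests, the sketch below backs a FileIO with a std::map. The interface shape, the InMemoryFileIO name, and the exception-based errors are all assumptions made for this example; the thread only establishes that FileIO should live in libiceberg and that a test-only implementation is acceptable.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical minimal interface, repeated here so the example stands alone.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::string ReadFileFully(const std::string& location) = 0;
  virtual void WriteFile(const std::string& location,
                         const std::string& content) = 0;
  virtual void DeleteFile(const std::string& location) = 0;
};

// Test double that keeps file contents in a map keyed by location
// (hypothetical name; no real storage is touched).
class InMemoryFileIO : public FileIO {
 public:
  std::string ReadFileFully(const std::string& location) override {
    auto it = files_.find(location);
    if (it == files_.end()) {
      throw std::runtime_error("not found: " + location);
    }
    return it->second;
  }

  void WriteFile(const std::string& location,
                 const std::string& content) override {
    files_[location] = content;  // overwrite if the location already exists
  }

  void DeleteFile(const std::string& location) override {
    if (files_.erase(location) == 0) {
      throw std::runtime_error("not found: " + location);
    }
  }

 private:
  std::map<std::string, std::string> files_;
};
```

Because tests depend only on the abstract interface, an Arrow-backed implementation in iceberg-arrow could later be swapped in without changing the test bodies.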
At that point, do we even need an "IO" abstraction? Just require that the user provide the full string. (Is a streaming JSON reader necessary here? I got the impression that Iceberg expects the metadata to fit in memory.)
Good question! IMHO, a streaming JSON reader is overkill in this case. That's why I propose to define a minimal FileIO like:
Yeah, in that case I don't think we need InputFile/OutputFile. Is the idea that iceberg-cpp would effectively not deal with I/O (or would only deal with I/O of metadata files), and that for data files it's the library user's responsibility to configure everything (e.g. arrow-cpp filesystems or OpenDAL or whatever they want)?
Exactly. Does that make sense to you?
It will mean different implementations won't be exactly drop-in portable, but I think that's too big a goal to tackle anyway.
Just FTR, it would be better to use
I'll refactor the FileIO as per Gang's suggestion. Thank you all for your valuable input. Starting with a minimal implementation and expanding it as needed seems like an ideal approach.
add reader and writer interfaces Signed-off-by: Junwang Zhao <[email protected]>
+1 Thanks @zhjwpku!
@lidavidm Can you take a look? Thanks ;)
This PR adds a file IO interface and an Arrow local filesystem implementation.
FileIO is a pluggable interface for reading, writing, and deleting metadata files, not for data files.