Skip to content

feat: add file_io and local impl by adapting arrow::filesystem #30

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 1, 2025

Conversation

zhjwpku
Copy link
Collaborator

@zhjwpku zhjwpku commented Jan 15, 2025

This PR add file io interface and arrow local filesystem implementation.

FileIO is a pluggable interface for reading, writing, and deleting metadata files, not for data files.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 15, 2025

@wgtmac @Fokko @pitrou @Xuanwo @raulcd Please take a look at this early implementation of FileIO, I plan to add more interfaces, before that, I hope to hear some advices, thanks.

@wgtmac
Copy link
Member

wgtmac commented Jan 15, 2025

cc @lidavidm @MisterRaindrop

@pitrou
Copy link
Member

pitrou commented Jan 15, 2025

Has iceberg-cpp decided on an IO strategy already?

It might be more productive to start writing the IO-less components, such as parsing the various metadata files, etc.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 16, 2025

Has iceberg-cpp decided on an IO strategy already?

It might be more productive to start writing the IO-less components, such as parsing the various metadata files, etc.

FileIO is one task in the Minimum Viable Product's workitems. IMHO, we need these interfaces even for parsing the various metadata files, either on the local filesystem or S3.

I'm not sure what do you mean by IO strategy, can you elaborate?

@lidavidm
Copy link
Member

For instance, async vs sync: #2 (comment)

Or, whether the core library should do any IO at all: #2 (comment)

(You could also perhaps imagine an interface where the core library just works in terms of an interface to an Avro file and only parses manifests/snapshots, deferring all I/O to other implementations.)

@MisterRaindrop
Copy link

Is my understanding correct that this FileIo pertains to locations that are local, on HDFS, or on S3?

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 16, 2025

Is my understanding correct that this FileIo pertains to locations that are local, on HDFS, or on S3?

Yeah, I hope this FileIO to be extended to other storages.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 16, 2025

For instance, async vs sync: #2 (comment)

Can we provide both?

Or, whether the core library should do any IO at all: #2 (comment)

Without doing any IO, do you mean users do the read and pass the buffer to iceberg for parsing only?

In my use case, I prefer giving a directory(local fs or s3 bucket) to iceberg and let iceberg do the work.

(You could also perhaps imagine an interface where the core library just works in terms of an interface to an Avro file and only parses manifests/snapshots, deferring all I/O to other implementations.)

@wgtmac
Copy link
Member

wgtmac commented Jan 17, 2025

Can we provide both?

I think we can. It is pretty straight-forward to do something similar to arrow::RandomnAccessFile: https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/interfaces.h#L268-L300

Without doing any IO, do you mean users do the read and pass the buffer to iceberg for parsing only?

How about adding a PoC to explore the idea from @pitrou: #2 (comment). We may need to add a lightweight reader interface in it as well.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 17, 2025

Can we provide both?

I think we can. It is pretty straight-forward to do something similar to arrow::RandomnAccessFile: https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/interfaces.h#L268-L300

Without doing any IO, do you mean users do the read and pass the buffer to iceberg for parsing only?

How about adding a PoC to explore the idea from @pitrou: #2 (comment). We may need to add a lightweight reader interface in it as well.

Will try, thanks.

@zhjwpku zhjwpku force-pushed the file_io branch 11 times, most recently from eb07767 to f454825 Compare March 23, 2025 01:08
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! IMHO, we'd better agree on the APIs first before implementing the whole thing to avoid wasted effort.

IIUC, mainly we need three kinds of I/Os in the iceberg library:

1. File operations

  • Delete file: e.g. delete uncommitted file, delete expired snapshots, etc.

2. Metadata writer/reader

  • Table metadata file: write/read json file with a single TableMetadata
  • Manifest list file: write/read avro file with one or more ManifestFile
  • Manifest file: write/read avro file with one or more ManifestEntry

3. Data/Delete file writer/reader

  • Data file: write/read parquet/orc/avro file with data file schema
  • Delete file: write/read parquet/orc/avro/puffin file with delete file schema
  • Stats file: puffin file (sorry I'm still not very familiar with it but it is simply a container for multiple blobs)

The majority of I/O operations happen in the parquet/avro/orc libraries which we cannot control. Perhaps we can simply define structures below to hide explicit I/Os?

struct FileInfo {
  std::string path;
  size_t size;
  std::map<std::string, std::string> storageCredentials;
};

class FileReader {
public:
  virtual ArrowArray ReadNext() = 0;
};

class FileWriter {
public:
  virtual void Write(ArrowArray array) = 0;
};

Then the only remaining I/O we need to address is the table metadata file which stores a single json string. This is the only reason to keep the InputFile and OutputFile interfaces. I'm not sure if it is fine to replace them by adding FileIO::ReadFileFully and FileIO::WriteFile?

@zhjwpku @lidavidm @Fokko @mapleFU WDYT?

CMakeLists.txt Outdated
@@ -38,6 +38,7 @@ option(ICEBERG_BUILD_SHARED "Build shared library" OFF)
option(ICEBERG_BUILD_TESTS "Build tests" ON)
option(ICEBERG_ARROW "Build Arrow" ON)
option(ICEBERG_AVRO "Build Avro" ON)
option(ICEBERG_IO "Build IO" ON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this? I think we don't need a separate library for I/O. We can just define FileIO and related interfaces in the libiceberg. iceberg-arrow could possibly implement these APIs by adapting to arrow::FileSystem. We can provide an implementation for local or memory filesystem in the unit test modules.

@lidavidm
Copy link
Member

Then the only remaining I/O we need to address is the table metadata file which stores a single json string. This is the only reason to keep the InputFile and OutputFile interfaces. I'm not sure if it is fine to replace them by adding FileIO::ReadFileFully and FileIO::WriteFile?

At that point, do we even need an "IO" abstraction? Just require that the user provide the full string (is a streaming JSON reader necessary here? I got the impression that Iceberg expects that metadata to fit in memory?)

@wgtmac
Copy link
Member

wgtmac commented Mar 25, 2025

Good question! IMHO, a streaming JSON reader is an overkill in this case. That's why I propose to define a minimal FileIO like:

class FileIO {
public:
  virtual expected<void, Error> DeleteFile(const std::string& file_path);
  virtual expected<void, Error> WriteFile(const std::string& file_path, const std::string& content, bool overwrite);
  virtual expected<std::string, Error> ReadFile(const std::string& file_path, std::optional<size_t> length);
};

@lidavidm
Copy link
Member

Yeah, in that case I don't think we need InputFile/OutputFile.

Is the idea that basically, iceberg-cpp would effectively not deal with I/O (or would only deal with I/O of metadata files) and for data files, it's the library user's responsibility to configure everything (e.g. arrow-cpp filesystems or OpenDAL or whatever they want)?

@wgtmac
Copy link
Member

wgtmac commented Mar 25, 2025

Exactly. Does that make sense to you?

@lidavidm
Copy link
Member

It will mean different implementations won't be exactly drop-in portable but I think that's too big a goal to tackle anyways

@pitrou
Copy link
Member

pitrou commented Mar 25, 2025

Good question! IMHO, a streaming JSON reader is an overkill in this case. That's why I propose to define a minimal FileIO like:

class FileIO {
public:
  virtual expected<void, Error> DeleteFile(const std::string& file_path);
  virtual expected<void, Error> WriteFile(const std::string& file_path, const std::string& content, bool overwrite);
  virtual expected<std::string, Error> ReadFile(const std::string& file_path, std::optional<size_t> length);
};

Just FTR, it would be better to use std::string_view as much as possible when the data isn't supposed to live past the function call. I'm thinking especially of the content argument to WriteFile, since it might be big.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Mar 25, 2025

I'll refactor the FileIO as per Gang's suggestion. Thank you all for your valuable input. Starting with a minimal implementation and expanding it as needed seems like an ideal approach.

@zhjwpku zhjwpku force-pushed the file_io branch 2 times, most recently from 4aad42c to 5894d3e Compare March 28, 2025 11:38
@zhjwpku zhjwpku changed the title feat: add file_io and local fs impl feat: add file_io and local impl by adapting arrow::filesystem Mar 28, 2025
@zhjwpku zhjwpku requested review from wgtmac and lidavidm March 28, 2025 11:59
zhjwpku added 5 commits March 31, 2025 21:04
add reader and writer interfaces

Signed-off-by: Junwang Zhao <[email protected]>
Signed-off-by: Junwang Zhao <[email protected]>
Signed-off-by: Junwang Zhao <[email protected]>
Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added 3 commits April 1, 2025 11:10
Signed-off-by: Junwang Zhao <[email protected]>
Signed-off-by: Junwang Zhao <[email protected]>
@zhjwpku zhjwpku requested a review from wgtmac April 1, 2025 06:50
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @zhjwpku!

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Apr 1, 2025

@lidavidm Can you take a look? Thanks ;)

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Apr 1, 2025

@Fokko @Xuanwo I'd appreciate if you can review this PR when you got some time.

@Fokko Fokko merged commit d54d079 into apache:main Apr 1, 2025
6 checks passed
@zhjwpku zhjwpku deleted the file_io branch April 2, 2025 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants