Commit 46cb3fb

Do not check parquet directory for consistency
Maybe there is a better way to actually do some form of check? The problem is that if we have multiple nodes/utility workers, one of them may lag behind and run the check when the directory already contains files. However, blocking on this does not seem better either. Is there a better way to allow checking than just an option to disable this check?

Signed-off-by: Sebastian Berg <[email protected]>
1 parent 5d9b045 commit 46cb3fb

3 files changed: +10 -3 lines changed

cpp/include/legate_dataframe/parquet.hpp

Lines changed: 4 additions & 0 deletions
@@ -98,6 +98,10 @@ class ParquetReadArray : public Task<ParquetReadArray, OpCode::ParquetReadArray>
  *     ├── part-2.parquet
  *     └── ...
  *
+ * This function may create the directory but does not ensure it is empty.
+ * If a previous write wrote more partitions the old files will remain
+ * leaving the directory in an inconsistent state.
+ *
  * @param tbl The table to write.
  * @param path Destination directory for data.
  */

cpp/src/parquet.cpp

Lines changed: 1 addition & 3 deletions
@@ -492,9 +492,7 @@ ParquetReadInfo get_parquet_info(const std::vector<std::string>& file_paths,
 void parquet_write(LogicalTable& tbl, const std::string& dirpath)
 {
   std::filesystem::create_directories(dirpath);
-  if (!std::filesystem::is_empty(dirpath)) {
-    throw std::invalid_argument("if path exist, it must be an empty directory");
-  }
+
   auto runtime = legate::Runtime::get_runtime();
   legate::AutoTask task =
     runtime->create_task(get_library(), task::ParquetWrite::TASK_CONFIG.task_id());
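
The race described in the commit message can be illustrated with a small Python sketch. It is a simulation under stated assumptions, not the library's code. `checked_prepare` and `simulate_lagging_worker` are hypothetical names, the two workers are run one after the other in a single process, and the part file name is only illustrative.

```python
import pathlib
import tempfile


def checked_prepare(dirpath: pathlib.Path) -> None:
    """Mimic the removed check: create the directory, then require it to be empty."""
    dirpath.mkdir(parents=True, exist_ok=True)
    if any(dirpath.iterdir()):
        raise ValueError("if path exist, it must be an empty directory")


def simulate_lagging_worker() -> bool:
    """Return True if the lagging worker's check fails during a fresh write."""
    with tempfile.TemporaryDirectory() as tmp:
        target = pathlib.Path(tmp) / "out"
        # Worker A passes the check and starts writing its partition.
        checked_prepare(target)
        (target / "part-0.parquet").write_bytes(b"")
        # Worker B reaches the same check later and sees A's file,
        # even though the directory was empty when the write began.
        try:
            checked_prepare(target)
        except ValueError:
            return True
        return False
```

In this simulation the second call raises even though the target directory did not exist before the write started, which is the spurious failure that removing the check avoids.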

python/legate_dataframe/lib/parquet.pyx

Lines changed: 5 additions & 0 deletions
@@ -68,6 +68,11 @@ def parquet_write(LogicalTable tbl, path: pathlib.Path | str) -> None:
         ├── part.2.parquet
         └── ...
 
+    .. note::
+        This function will create the directory but does not ensure it is empty.
+        If a previous write had more partitions the old files will remain
+        leaving the directory in an inconsistent state.
+
     See Also
     --------
     parquet_read: Read parquet data
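
Since the write no longer rejects a non-empty directory, a caller who reuses a path may want to remove old partition files first. The helper below is a minimal sketch of one way to do that, not part of legate-dataframe. `clear_stale_parts` is a hypothetical name, and the `*.parquet` pattern is an assumption based on the file names shown in the docstring.

```python
import pathlib


def clear_stale_parts(path, pattern="*.parquet"):
    """Delete files matching `pattern` directly inside `path`.

    Returns the sorted names of the removed files. A missing directory
    is not an error and yields an empty list.
    """
    directory = pathlib.Path(path)
    if not directory.is_dir():
        return []
    removed = []
    for entry in sorted(directory.glob(pattern)):
        # Only touch regular files; leave subdirectories alone.
        if entry.is_file():
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Such a helper would presumably be called once, before `parquet_write(tbl, path)` is launched, rather than from every worker. Otherwise it would reintroduce the kind of race the commit message describes.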
