-
Notifications
You must be signed in to change notification settings - Fork 72
[GEN-1667] Delete Table Rows Using Filtered DataFrame #1254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far! 🔥
I did an initial pass-through and was wondering: What happens when a user feed a dataframe with row_ids/versions/etags that don't exist in that table? what does the API spit back and is that error-handling enough?
@jaymedina Thanks for the feedback, I will address them accordingly. |
"The dataframe must contain the 'ROW_ID' and 'ROW_VERSION' columns." | ||
) | ||
|
||
if self.__class__.__name__ in CLASSES_THAT_CONTAIN_ROW_ETAG: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanFauble When I add test case for the below ValueError, I got the AttributeError: 'Dataset' object has no attribute 'delete_rows_async'
for my test case of Dataset. As this line, I believe we originally intent to use delete_row
in the Dataset, EntityView, DatasetCollection, and SubmissionView but I didn't find the function in any of the models. Have we not had the chance to add this function yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these other items are special @danlu1 . You can't actually delete rows of those Entity types, the reason is because there is a separate items
attribute on the entity itself that dictates what should be present on the table:
https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/Dataset.html
If you notice that:
class Dataset(
DatasetSynchronousProtocol,
AccessControllable,
ViewBase,
ViewStoreMixin,
DeleteMixin,
ColumnMixin,
GetMixin,
QueryMixin,
ViewUpdateMixin,
ViewSnapshotMixin,
):
Doesn't contain the TableDeleteRowMixin
which gives this ability to the class
elif df is not None: | ||
rows_to_delete = df | ||
client.logger.info( | ||
f"Found {len(rows_to_delete)} rows to delete for given dataframe." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Found {len(rows_to_delete)} rows to delete for given dataframe." | |
f"Received {len(rows_to_delete)} rows to delete for given dataframe." |
if ( | ||
"ROW_ID" not in rows_to_delete.columns | ||
or "ROW_VERSION" not in rows_to_delete.columns | ||
): | ||
raise ValueError( | ||
"The dataframe must contain the 'ROW_ID' and 'ROW_VERSION' columns." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be put before the log message is printed so the validation occurs first?
* Deleting specific rows - Query for the rows you want to delete and call syn.delete on the results | ||
|
||
```python | ||
table.delete_rows(query=f"SELECT * FROM {table.id} WHERE Strand = '+'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few lines up there is a part of this markdown that needs to be updated: Deleting specific rows - Query for the rows you want to delete and call syn.delete on the results
Could you update this please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No major complaints from me. Thanks for making these changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 awesome work and great review. I'll defer to Bryan for final review once this is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @danlu1 ! I tested the examples, and they worked for the most part. However, when I tried deleting a row with wrong version numbers, I expected an error but the request actually went through. Here's an example:
async def main():
results = await query_async(query="SELECT * FROM syn63301937")
print(results)
Here I saw:
Downloading files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 576/576 [00:00<00:00, 1.38kB/s, syn63301937]
ROW_ID ROW_VERSION Component Filename Id entityId
0 12 6 MockFilename schematic - main/MockFilenameComponent/txt3.txt 6b27d97b-0344-4499-8f08-8535ad72b802 syn61682653
1 13 6 MockFilename schematic - main/MockFilenameComponent/this_fi... 7185f1ee-00bb-42d4-9051-1b61166fde0f syn61682653
2 14 6 MockFilename schematic - main/MockFilenameComponent/txt4.txt 9ce08bef-35a0-40d5-8b06-96c6071b1609 syn6168265
3 15 6 MockFilename schematic - main/MockFilenameComponent/txt6.txt 07973efb-7d36-475a-b29f-0aa470c722b5 NaN
As you can see, the row version number is 6. But when I did:
df = pd.DataFrame({"ROW_ID": [10, 11], "ROW_VERSION": [100, 100]})
async def main():
await Table(id="syn63301937").delete_rows_async(df=df)
asyncio.run(main())
I expected the request to fail because row version 100 doesn't exist. But the request appeared to go through:
(synapsePythonClient) lpeng@Mac synapsePythonClient % python3 /Users/lpeng/code/synapsePythonClient/test.py
Welcome, Lingling Peng! You are using the 'default' profile.
Found 2 rows to delete for given dataframe.
Uploading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 33.0/33.0 [00:00<00:00, 68.8B/s, syn63301937_upload_636feaba-4b89-4d20-961c-034866078060.csv]
/entity/syn63301937/table/transaction/async: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:01<00:00, 1.21s/it]
df = pd.DataFrame({"ROW_ID": [1, 2], "ROW_VERSION": [1, 1]}) | ||
async def main(): | ||
await Table(id="syn1234").delete_rows_async(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danlu1 I got an error when running the example, here it should be:
df = pd.DataFrame({"ROW_ID": [1, 2], "ROW_VERSION": [1, 1]})
async def main():
await Table(id="syn1234").delete_rows_async(df=df)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Yes, the param should be specified. I will update.
@linglp Thanks for testing the code. Can you do me a favor and confirm the actual row with wrong row_version were deleted? The "Found 2 rows to delete for given dataframe." printed before the row deletion so I'd like to learn the behavior of the code. Appreciate your help! |
Problem:
At present,
delete_rows
accepts only a table query to provide ROW_ID and ROW_VERSION values for deletion. This will be extended to also accept a dataframe, replicating the deprecated function’s ability to specify rows through a row_id_vers_list.Solution:
Testing:
Unit test and integration test are added and they all passed.