Add initial remote table registration implementation #25

Merged
merged 5 commits into glencoesoftware:main
Jan 31, 2025

Conversation

DavidStirling
Member

This PR begins work towards a mechanism for registering tables without uploading the data to the OMERO Managed Repository, so as to allow users to store table data elsewhere.

Per internal discussions this initial version operates offline. omero2pandas will create a local tiledb file and return the path, which the user should register with OMERO using external tooling. Future work will add a system for doing this registration automatically.

I've tried to match the omero-plus schema as closely as possible. Some testing for compatibility needs to be done.

N.b. For any of this to be useful the OMERO server must have the TileDB OMERO.tables backend installed. PyTables is the default backend.

Installation

pip install omero2pandas[remote]

Usage

import omero2pandas
omero2pandas.upload_table(csv_or_dataframe, table_name, local_path="/path/to/my_table.tiledb")

Due to constraints in the API this will require an OMERO login. To test offline do the following:

from omero2pandas.remote import register_table
register_table(csv_or_dataframe, chunk_size, local_path="/path/to/my_table.tiledb", remote_path=None)

chunk_size sets how many rows to load/save in a single operation. remote_path will eventually provide a server-visible version of local_path for cases where the upload is performed from a machine other than the OMERO server.
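
The chunked loading that chunk_size controls can be sketched roughly as follows. This is an illustration under assumptions, not the omero2pandas implementation: iter_chunks is a hypothetical helper, the column names are made up, and the only pandas features used are DataFrame.iloc slicing and pandas.read_csv with its chunksize argument.

```python
import io

import pandas as pd


def iter_chunks(source, chunk_size=1000):
    """Yield DataFrame chunks of at most chunk_size rows.

    source may be a DataFrame or anything pandas.read_csv accepts.
    Hypothetical helper, for illustration only.
    """
    if isinstance(source, pd.DataFrame):
        # Slice the in-memory frame chunk_size rows at a time
        for start in range(0, len(source), chunk_size):
            yield source.iloc[start:start + chunk_size]
    else:
        # With chunksize set, read_csv returns an iterator of DataFrames
        yield from pd.read_csv(source, chunksize=chunk_size)


# Made-up example data
df = pd.DataFrame({"object_id": range(5), "area": [1.0, 2.5, 3.0, 4.5, 5.0]})
frame_sizes = [len(chunk) for chunk in iter_chunks(df, chunk_size=2)]

# The same rows read back from CSV text, again two rows at a time
csv_text = df.to_csv(index=False)
csv_sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(csv_text), chunk_size=2)]
```

Either way, no more than chunk_size rows are handed to the writer at once, which keeps memory use bounded for large tables.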

@erindiel
Member

register_table('objects.csv', 100, local_path="objects.tiledb", remote_path=None)

This worked as expected to produce a .tiledb file from my input .csv. I was able to use internal tooling to register this as an OMERO.table. This is a great workflow 🎉 !

In trying upload_table, I had a couple of failures:

upload_table('objects.csv', 'remote-table', local_path="objects-upload.tiledb", links=[("Image", 36604), ("Roi", 1983)])

Connected to localhost
Generating TileDB file...:   0%|          | 1/? rows, 00:00
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 2
      1 from omero2pandas import upload_table
----> 2 upload_table('objects.csv', 'remote-table', local_path="objects-upload.tiledb", links=[("Image", 36604), ("Roi", 1983)])

File ~/OMERO.venv/lib/python3.10/site-packages/omero2pandas/__init__.py:245, in upload_table(source, table_name, parent_id, parent_type, links, chunk_size, omero_connector, server, port, username, password, local_path, remote_path)
    243     if not register_table:
    244         raise ValueError("Remote table support is not installed")
--> 245     ann_id = register_table(
    246         source, chunk_size, local_path, remote_path)
    247 else:
    248     ann_id = create_table(source, table_name, links, conn, chunk_size)

File ~/OMERO.venv/lib/python3.10/site-packages/omero2pandas/remote.py:50, in register_table(source, chunk_size, local_path, remote_path)
     48 row_idx = 0
     49 for chunk in data_iterator:
---> 50     tiledb.from_pandas(write_path, chunk, sparse=True, full_domain=True,
     51                        tile=10000, attr_filters=None,
     52                        row_start_idx=row_idx, allows_duplicates=False,
     53                        mode="append" if row_idx else "ingest")
     54     progress_monitor.update(len(chunk))
     55     row_idx += len(chunk)

File ~/OMERO.venv/lib/python3.10/site-packages/tiledb/dataframe_.py:579, in from_pandas(uri, dataframe, **kwargs)
    576     raise ValueError(f"`ctx` expected a TileDB Context object but saw {type(ctx)}")
    578 with tiledb.scope_ctx(ctx):
--> 579     _from_pandas(uri, dataframe, tiledb_args)

File ~/OMERO.venv/lib/python3.10/site-packages/tiledb/dataframe_.py:625, in _from_pandas(uri, dataframe, tiledb_args)
    617 if date_spec:
    618     dataframe = dataframe.assign(
    619         **{
    620             name: pd.to_datetime(dataframe[name], format=format)
    621             for name, format in date_spec.items()
    622         }
    623     )
--> 625 dataframe.columns = dataframe.columns.map(str)
    626 column_infos = _get_column_infos(
    627     dataframe, tiledb_args.get("column_types"), tiledb_args.get("varlen_types")
    628 )
    630 with tiledb.scope_ctx(tiledb_args.get("ctx")):

AttributeError: 'str' object has no attribute 'columns'

I then tried passing in a DataFrame instead, using upload_table(df, 'remote-table', local_path="objects-upload.tiledb"):

Connected to localhost
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 upload_table(df, 'remote-table', local_path="objects-upload.tiledb")

File ~/OMERO.venv/lib/python3.10/site-packages/omero2pandas/__init__.py:245, in upload_table(source, table_name, parent_id, parent_type, links, chunk_size, omero_connector, server, port, username, password, local_path, remote_path)
    243     if not register_table:
    244         raise ValueError("Remote table support is not installed")
--> 245     ann_id = register_table(
    246         source, chunk_size, local_path, remote_path)
    247 else:
    248     ann_id = create_table(source, table_name, links, conn, chunk_size)

File ~/OMERO.venv/lib/python3.10/site-packages/omero2pandas/remote.py:41, in register_table(source, chunk_size, local_path, remote_path)
     38     total_rows = None
     39 else:
     40     data_iterator = (source.iloc[i:i + chunk_size]
---> 41                      for i in range(0, len(source), chunk_size))
     42     total_rows = len(source)
     43 progress_monitor = tqdm(
     44     desc="Generating TileDB file...", initial=1, dynamic_ncols=True,
     45     total=total_rows,
     46     bar_format='{desc}: {percentage:3.0f}%|{bar}| '
     47                '{n_fmt}/{total_fmt} rows, {elapsed} {postfix}')

TypeError: 'NoneType' object cannot be interpreted as an integer

I can fix both of the above by including a chunk_size.

I do not get any links (to ROI or Image, IDs passed in above) within OMERO. I believe this is consistent with what you described:

> omero2pandas will create a local tiledb file and return the path, which the user should register with OMERO using external tooling. Future work will add a system for doing this registration automatically.

So currently why would I use upload_table? Should I be testing anything in addition to the local creation of .tiledb file?

@DavidStirling
Member Author

Thanks @erindiel

> I can fix both of the above by including a chunk_size.

Yup, since we don't pre-scan tables in this mode we needed a default chunk size. That was the cause of both errors.
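
The second traceback can be reproduced without OMERO or TileDB, because range rejects None as a step. The following is a minimal sketch in which chunk_starts is a hypothetical stand-in for the slicing loop shown in the traceback, and the fallback value of 4 is arbitrary:

```python
def chunk_starts(n_rows, chunk_size):
    """Return the start offset of each chunk, as a slicing loop would."""
    return list(range(0, n_rows, chunk_size))


# Leaving chunk_size unset (None) fails inside range()
try:
    chunk_starts(10, None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Substituting a default when chunk_size is None avoids the error
starts = chunk_starts(10, None or 4)
```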

> I do not get any links (to ROI or Image, IDs passed in above) within OMERO. I believe this is consistent with what you described

Yes

> So currently why would I use upload_table? Should I be testing anything in addition to the local creation of .tiledb file?

Realistically you wouldn't 😃

I anticipate that a later version of this functionality will perform the registration, so I've put the functionality in place ready for it. Strictly speaking people can import the remote submodule and create a tiledb in isolation, but it's not something we explicitly document as an intended feature.

Member

@erindiel erindiel left a comment

Latest changes resolved the issue with not specifying chunk_size for upload_table(). Looks good!

Member

@kkoz kkoz left a comment

Just a couple of nitpicks. Worked for me on a couple of simple tests.

LOGGER.debug(f"Remote path would be {str(remote_path)}")
if write_path.exists():
    raise ValueError(f"Table file {write_path} already exists")
# path.as_uri() exists but mangles any spaces in the path!
Member

I'm not sure what this comment means. Is it an explanation of why you use str(write_path)?

Member Author

Yes. The first argument of tiledb.from_pandas is formally called uri, but it doesn't seem to handle escaped special characters such as spaces. I wanted to head off people asking "why don't you just use the pathlib as_uri method?"
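
The difference between the two forms is easy to see with the standard library alone. In this small sketch the path is hypothetical, and PurePosixPath is used only so the result does not depend on the operating system:

```python
from pathlib import PurePosixPath

# Hypothetical table location containing a space
write_path = PurePosixPath("/path/to/my table.tiledb")

as_uri = write_path.as_uri()  # percent-encodes the space as %20
as_str = str(write_path)      # keeps the path exactly as written
```

Passing str(write_path) therefore hands over the literal path, while as_uri() produces an escaped file:// URI.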

Comment on lines +34 to +35
# Use a default chunk size if not set
chunk_size = chunk_size or 1000
Member

Should probably just use a default value in the function signature instead.

Member Author

@DavidStirling DavidStirling Jan 30, 2025

Yes, I'd normally do this. However, in this context we may receive chunk_size=None from the higher-level omero2pandas.upload_table function, which uses None to indicate that the chunk size should be calculated automatically. With a local TileDB file we don't need to worry about Ice message size limits, so we skip the pre-scan that works out the chunk size for normal table uploads. This line simply provides a fallback default.

Nonetheless I'll add some defaults to this function's signature so that people using it in isolation have an easier time.
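
Combining a signature default with the None fallback might look like the sketch below. It assumes nothing about the merged code: register_table_sketch and DEFAULT_CHUNK_SIZE are hypothetical names, and the function only returns the chunk size it would use.

```python
DEFAULT_CHUNK_SIZE = 1000  # fallback value taken from the reviewed lines


def register_table_sketch(source, chunk_size=DEFAULT_CHUNK_SIZE):
    """Hypothetical stand-in: return the chunk size that would be used."""
    # A higher-level caller may still pass chunk_size=None explicitly,
    # so the signature default alone is not enough.
    chunk_size = chunk_size or DEFAULT_CHUNK_SIZE
    return chunk_size


used_default = register_table_sketch("objects.csv")
used_fallback = register_table_sketch("objects.csv", chunk_size=None)
used_explicit = register_table_sketch("objects.csv", chunk_size=100)
```

One design note: the `or` idiom also treats chunk_size=0 as unset. That is harmless here, since a zero step would be invalid for chunking anyway.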

@DavidStirling DavidStirling requested a review from kkoz January 30, 2025 09:03
Member

@sbesson sbesson left a comment

One README suggestion to clarify the current scope of this extension. Otherwise, happy to get this in.

@sbesson sbesson merged commit 3f15348 into glencoesoftware:main Jan 31, 2025
1 check passed