Skip to content
This repository has been archived by the owner on May 21, 2024. It is now read-only.

add formats that make sample (or other) ids explicit #56

Open
gregcaporaso opened this issue Sep 29, 2023 · 7 comments
Open

add formats that make sample (or other) ids explicit #56

gregcaporaso opened this issue Sep 29, 2023 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@gregcaporaso
Copy link
Contributor

@colinvwood and @Oddant1 noticed while working on bokulich-lab/q2-annotate#63 and bokulich-lab/q2-assembly#46 that many of the formats in q2-types-genomics require some knowledge of the underlying format to identify the sample ids. This is requiring code like:

result = ContigSequencesDirFmt()

for sample_fp in samples:
    # These paths are defined in the ContigSequencesDirFmt class as
    # {sample_id}_contigs.(fa | fasta). This should get the id from a
    # name like that
    sample_id = sample_fp.name.rsplit("_contigs", 1)[0]

to crop up fairly frequently (that example is from bokulich-lab/q2-assembly#46). If a developer misses this, or doesn't handle the stripping of _contigs correctly (e.g., replacing all occurrences, rather than just stripping the last occurrence) downstream results could have misnamed sample ids (e.g., sample-1 could easily become sample-1_contigs, or sample2_contigs [a terrible sample name] could become sample2).

We're thinking that, after the alpha release, it will make sense to define new formats that contain manifests which explicitly map sample ids to filenames, and then add transformers from these existing formats to the new manifest formats. That way we encapsulate all knowledge about how the filenames map to sample ids in those transformers, and actions that need these data can use the manifest formats and have explicit access to sample ids. This would be similar to how we handle the "demux formats" in q2-types (see an example of how this is used here).

We would want to keep the existing formats as-is so that existing artifacts (generated pre- or post-alpha) would continue to work.

@gregcaporaso gregcaporaso changed the title add formats that make sample (and other) ids explicit add formats that make sample (or other) ids explicit Sep 29, 2023
@gregcaporaso
Copy link
Contributor Author

@ebolyen pointed out that we can add a method on the format that gives us tuples or a dict mapping sample ids to relative filepaths in the Archive, so that will achieve the same goal without having to create new manifest formats. So we should do that for the next release (2023.11), and take a pass to use these throughout in places where we're parsing filenames to access the sample ids.

@misialq
Copy link
Contributor

misialq commented Oct 2, 2023

Sounds good - thanks @gregcaporaso and @ebolyen! 🙏

@gregcaporaso
Copy link
Contributor Author

Since this issue was opened in the context of the ContigSequencesDirFmt, let's update all code that is using that format (e.g., here) as needed to use the lookup that @colinvwood added in #57, and then close this one out.

@colinvwood, could you put this on your list? It's a nice-to-have for the 2023.11 release, but not essential.

@colinvwood
Copy link
Contributor

colinvwood commented Dec 4, 2023

@gregcaporaso what about other formats that could use this approach, e.g. the PR @misialq linked above--should we open separate issues? that is in fact its own issue not a PR, whoops

@colinvwood
Copy link
Contributor

After looking at the places in q2-assembly where we could use this new approach, I think that things will be much more refactorable once we have corresponding functionality for the mag directory format(s).

@gregcaporaso
Copy link
Contributor Author

Ok, makes sense. Do we have an issue for updating the MAG directory format(s)? Could you create one if not?

@colinvwood
Copy link
Contributor

colinvwood commented Dec 6, 2023

Changes in q2-assembly are (at least partially) blocked by bokulich-lab/q2-assembly#66

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants