Skip to content

Spring 2025 NEWS update. #1898

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 22, 2025
Merged

Conversation

whitwham
Copy link
Contributor

This is an update of the things that have happened since the last release. We will continue to update it until the next release.

NEWS Outdated
* Fix a bug in breakend detection. It was incorrectly assuming that the ALT
allele is of equal length to REF allele, but the VCF specification allows
breakend insertions.
(PR #1858, fixes bcftools#2317. Reported by Nicolai von K�gelgen).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ü should be UTF-8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was just a copy from GitHub.

I added this as an introductory paragraph rather than a bulleted item
given the significance.

I also link to out web pages, but this needs updating further with
modern data sets and to clarify the tooling so it's more than a set of
benchmarks.  This is something I'll work on separately.
NEWS Outdated
Note this release changes the default output CRAM version from 3.0 to 3.1.
HTSlib and SAMtools have been able to read CRAM 3.1 since version 1.12,
however other tools may not yet be able to cope. We know Noodles reads CRAM
3.1 and htsjdk has a draft implementation, but not yet released.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
3.1 and htsjdk has a draft implementation, but not yet released.
3.1 and htsjdk has a draft implementation that has not yet been released.

NEWS Outdated
Comment on lines 41 to 42
htscodec name tokeniser uses a nul between names there is no reason why
another value could be used. This change lets CRAM recognise other separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
htscodec name tokeniser uses a nul between names there is no reason why
another value could be used. This change lets CRAM recognise other separator
htscodec name tokeniser uses a NUL between names there is no reason why
another value could not be used. This change lets CRAM recognise other separator

NEWS Outdated
invalid (by the spec) and previous handling was inconsistent.
(PR #1882, fixes #1866)

* Updated VCF code to work with VCF 4.4 prefixed phasing info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Updated VCF code to work with VCF 4.4 prefixed phasing info.
* Updated VCF code to work with VCF 4.4 prefixed phasing info.

NEWS Outdated
Comment on lines 81 to 84
* Remove UR:file:// and UR:ftp:// from ref search path, plus REF_PATH to EBI.
Removing EBI as the default fallback when REF_PATH not set prevents the
unintended DDOS on EBI's servers.
(PR#1881)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves to be lifted to the top alongside the CRAM 3.1 stuff, and rewritten to be much more user-focussed — something like:

HTSlib no longer fetches CRAM reference data from EBI's server by default. Your organisation may wish to set up local infrastructure to supply reference sequences, e.g., using the new ref-cache tool included in this HTSlib release. You can use REF_PATH and/or REF_CACHE environment variables to etc etc. See https://www.htslib.org/doc/reference_seqs.html for details.

The bit about UR:file:// and UR:ftp:// is comparatively minor and separate and could stay in the Updates section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, like the CRAM 3.1 stuff, this should probably be copied over to the top of the samtools NEWS entry too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea.

…es. Plus some minor text improvements (thanks to @jmarshall).
NEWS Outdated
HTSlib no longer fetches CRAM reference data from EBI's server by default. Your
organisation may wish to set up local infrastructure to supply reference
sequences, e.g., using the new ref-cache tool included in this HTSlib release.
You can use REF_PATH and/or REF_CACHE environment variables to etc etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to elucidate the “etc etc” 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh for Heaven's Sake! I even read them to check your grammar. I'll just blame it all on you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took the editorial decision to delete the etc etc sentence in the end. The following line points people to the external web site which covers it all anyway. The vast majority of users they aren't likely to notice a change (unless the vast majority of users put up with periodic wedges and network failures due to an inaccessible refget server).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put that draft sentence in the suggested paragraph text because it seemed useful to draw attention to what individual users can do (i.e., REF_PATH/REF_CACHE) to reinstate their reference data access, in addition to the rather more heavy-weight server solution that might usually be set up at organisation scale.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that ordinary users will very much notice this change. For example, I've attempted the following without any environment variables set:

  1. Constructed a foo.sam file with a GRCh38 chr1 header and a read mapped to this chromosome:
@SQ	SN:chr1	AN:1	LN:248956422	AS:38	M5:6aef897c3d6ff0c78aff06ac189178dd	UR:file:///someone/elses/path/to/GRCh38.fa
read1	0	chr1	6000001	10	200M	*	0	0	GCACCCTCTCTGCACCCTGTCTGCGCCGTCCGTGCCCCGTCTGCACCTTGTCTGTGCCCCGTCCACACCCTGCCTGCACTCTGTCTGTGCCCCATCTGGGCCCTCTGCACGCCCTACCTGTGCCCTGTCTGTCTAGATAGATCCTGCAGCGCCTCCCGCCTCTGCTCAGCCAGCCCCCCGCACCCCCTAGAGGGTATTGC	bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
  1. Used samtools-1.21 view -Co foo.cram foo.sam to convert this file to CRAM. As a byproduct of this conversion, samtools-1.21 populated ~/.cache/hts-ref/6a/ef/897c… with a chr1 cache.
  2. Used samtools-1.21 view foo.cram to view the file. Samtools-1.21 used this cache to view the file quickly.
  3. However current develop samtools could not view the file:
$ samtools view foo.cram 
[E::fai_build3_core] Failed to open the file ///someone/elses/path/to/GRCh38.fa : No such file or directory
[E::refs_load_fai] Failed to open reference file '///someone/elses/path/to/GRCh38.fa'
[W::cram_get_ref] Failed to populate reference "chr1"
[W::cram_get_ref] See https://www.htslib.org/doc/reference_seqs.html for further suggestions
…etc…

Even though that cache exists, current develop samtools does not see it by default. You can set REF_CACHE to point to it and current develop samtools will happily display the file; however, I would think that most ordinary users will need every available pointer to become aware of these environment variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'll have a think about how better to phrase it. Note the web page documentation isn't just about the new ref-cache tool. It's there as a better description of how it all fits together, and is mentioned in error messages (as you saw). It's easier to point people to that than a man page, especially htslib users given essentially all library documentation is in samtools.1 man page (ie a totally different project).

My point about users not noticing is more about the main existing user base, but yes new users will (and already do) hit problems. My assertion there is based on the observation that the current EBI's ref-get server is sufficiently unreliable that anyone relying it on will quite rapidly hit a failure. So I'd imagine most who have been down that road have already found workarounds (all bar the one university which allegedly accounts for about half of the entire ref-get load!).

Copy link
Contributor

@jkbonfield jkbonfield May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason the default location for REF_CACHE was removed is it was only automatically added when fetching from the EBI, and removing that means we no longer need to automatically cache either. Always setting REF_CACHE means other non-EBI downloads start to cache copies too, which would include any local proxies. We don't (necessarily) want two level proxies, although it can have some benefits in terms of partial file loads as the REF_CACHE is always a local file system and permits mmaps.

REF_CACHE was used in two ways - as somewhere to read from, and most importantly somewhere to write to for remote downloads. The whole point of REF_CACHE was it was caching, ie somewhere locally to write, and if you wanted read-only support then that's what REF_PATH is for. However I hadn't really considered the transition for people previously using a samtools that populated REF_CACHE from the EBI and are now using a samtools that doesn't populate. That was probably an error which should have been picked up, but sadly it sat for 4 months before being reviewed meaning I'd forgotten the nuances of the change too. We should have merged it promptly so we and others could be stress testing things, rather than delaying until days before release. :(

I'll consider reinstating this so having no REF_CACHE set defaults to searching in the standard location (whether via REF_PATH or REF_CACHE is irrelevant as it's a read-only thing and wouldn't actually change the environment). That will aid the transition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider if the default REF_CACHE should be left so it works for pre-existing caches. On balance, I decided it was better for it to be set explicitly - otherwise you get a situation where samtools-1.22 onwards mysteriously do or do not work depending on if they find a pre-existing cache. Requiring the variable to be set makes the behaviour much more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the NEWS file again, but discussions on ongoing regarding the behaviour of a default REF_CACHE search path. (I was wrong to suggest it wasn't thought about. I'd forgotten about the details in my PR, but it was considered during merge.)

jkbonfield and others added 4 commits May 19, 2025 16:28
I merged it with the following line as we don't need to regurgitate
the full details in NEWS, just point people to the authorative source
and give them a hint on the key terms to search for.
@whitwham whitwham marked this pull request as ready for review May 22, 2025 10:37
@daviesrob daviesrob merged commit 9a6534d into samtools:develop May 22, 2025
9 checks passed

* Convert U to T instead of U to N when sam_parsing. Though SAM format itself
can contain U the BAM format cannot.
(PR #1854, fixes samtools#2131 reported by James Furguson)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't notice this previously.

Suggested change
(PR #1854, fixes samtools#2131 reported by James Furguson)
(PR #1854, fixes samtools#2131 reported by James Ferguson)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now PR #1919.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants