Changelog
Catarina Loureiro edited this page Dec 19, 2024 · 2 revisions
The following changes cover everything from BiG-SCAPE 1 to BiG-SCAPE 2.0.0.
The BiG-SCAPE logic has been separated into three workflows:
- BiG-SCAPE Cluster: Most common use case, performs distance calculations and clustering of BGCs.
- BiG-SCAPE Query: Uses the logic in BiG-SCAPE Cluster to return all BGCs that show similarity to a user-defined Query BGC.
- BiG-SCAPE Benchmark: Allows benchmarking of the clustering/GCF calling of a given run or set of runs against a user-provided curated set of GCF assignments.
- Allows selection of any antiSMASH record type (region, candidate cluster, protocluster, protocore) as the working record to be used throughout the run.
- Each record will be showcased as a node in the sequence similarity network, and, when relevant, topo-links/edges (dashed lines) are shown between records that originate from the same region.
- Protoclusters/Protocores within interleaved and chemical hybrid candidate clusters are merged into one record.
- Added classification based on antiSMASH Class and Category, in addition to BiG-SCAPE 1’s legacy classification. BiG-SCAPE 2’s default is to classify based on antiSMASH Category.
- BiG-SCAPE 1’s legacy classification into 8 groups remains compatible with antiSMASH versions up to v7.
- BiG-SCAPE 1’s legacy weight distribution of each distance component can be paired with the legacy classification mode, as well as new antiSMASH-based classification (for antiSMASH versions v6 and up).
- A user-defined reference set of antiSMASH-processed .gbks can be provided.
- The --mibig-version [version_number] flag downloads (if necessary) and uses antiSMASH-processed MIBiG .gbks, for MIBiG versions 3.1 and up.
- MIBiG .gbks are now available pre-processed by a custom antiSMASH version, which ensures that MIBiG .gbks that would not trigger an antiSMASH rule are still processed. BiG-SCAPE downloads these directly from https://dl.secondarymetabolites.org/mibig/.
- Any custom antiSMASH-processed set of MIBiG .gbk files can also be provided.
- .gbks that were not processed by antiSMASH but do contain CDS and sequence features can be provided with the --force-gbk flag. Note: this feature is in beta, use with caution.
- Replaced all intermediary files with an SQLite database.
- An already populated SQLite database can be provided such that all relevant information, i.e. records and edges that are present in the input folder, can be reused.
- Cancelled runs retain their data on disk, so a re-run can continue from where the previous run stopped.
- Added an option to run entirely in-memory to reduce runtime.
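Because run data now lives in a single SQLite file, it can be inspected with standard tooling. The following is a minimal sketch using Python's built-in sqlite3 module. The function name `list_tables` is hypothetical, and no BiG-SCAPE table names are assumed: the function only lists whatever tables the given file contains.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file, sorted."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        connection.close()
    return [name for (name,) in rows]
```

Note that sqlite3.connect creates an empty database file if the given path does not exist, so check the path before calling this on a run's output.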
- Added a full .network file, which contains all calculated distances without any cutoffs applied. Distances equal to 1 are not included.
- Removed independently generated arrower.py SVGs of each individual .gbk.
- Removed Clans, a second layer of clustering that attempted to group families into clans.
- Added a config.yml file which stores a series of run and analysis parameters, many of which were previously given as command line arguments. A given SQLite DB file can only be used with a single config.yml file.
- Added a profiler/profile report (RAM, CPU, time) and a logger with several levels of verbosity. Note: the profiler might behave strangely on macOS.
- Added a run log file and config log file.
- Looks for input .gbks either recursively throughout the input folder, or in flat mode looking only at the first level of the input folder.
- Replaced --banned_classes with --exclude-classes.
- Added --include-classes, --exclude-categories, --include-categories.
- Interactive visualization reads data from a BiG-SCAPE generated SQLite database file.
- Interactive visualization structure and usage have been updated; see the relevant section of the wiki.
- Interactive visualization can be run in (beta) dark mode.
- Interactive visualization sends an alert if it is opened without html_content present.
- Output folder structure has been updated; see the relevant section of the wiki.
- Both LCS and extend work on protein domain level instead of CDS level. At the end of the LCS-extend process, boundaries are defaulted to outer edges of included CDSs.
- Added an LCS selection that prefers biosynthetic content, length and centrality, in this order.
- At the end of the LCS process, as well as the extend process, BiG-SCAPE checks whether to keep the LCS and the extension based on a percentage of the total length of the shortest record in the pair, instead of the constant number of CDSs used in v1.
- Extend methods:
- Added a greedy extend implementation which maximizes the range of selected CDS to the first and last CDS that still contain a matching domain between two records.
- Added a simple match extend implementation which ignores the position and type of domain matches when extending from the LCS.
- In the case where both records in the pair have the same domain length, Legacy extension mode chooses the query and target records arbitrarily.
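To illustrate the greedy extend idea described above, here is a minimal sketch. It is not BiG-SCAPE's implementation: it assumes a record is represented as a list of CDSs in genomic order, each CDS being a set of domain identifiers, and the function name `greedy_extend` and the domain labels are hypothetical.

```python
def greedy_extend(cds_domains, other_domains):
    """Return (first, last) indices of the CDSs sharing a domain with the other record.

    cds_domains: list of sets, one set of domain IDs per CDS, in genomic order.
    other_domains: set of all domain IDs found in the other record of the pair.
    Returns None if no CDS contains a shared domain.
    """
    matching = [
        index
        for index, domains in enumerate(cds_domains)
        if domains & other_domains  # non-empty intersection means a match
    ]
    if not matching:
        return None
    return matching[0], matching[-1]


record = [{"dom_X"}, {"dom_A", "dom_B"}, set(), {"dom_C"}, {"dom_Y"}]
other = {"dom_A", "dom_C", "dom_Z"}
print(greedy_extend(record, other))  # (1, 3)
```

The selected range spans everything between the outermost matching CDSs, including CDSs in between that have no match (index 2 in the example).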
- Connected components (CCs) are analysed for their degree of density. For CCs with density ≥ 0.85, affinity propagation’s internal preference parameter is set to -5.
- In cases where AP cannot converge, BiG-SCAPE 2.0 will assign all nodes to a single family, and pick a family center arbitrarily.
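The density check described above can be sketched as follows. This assumes the standard definition of density for an undirected graph without self-loops, 2E / (N(N-1)), which the changelog does not spell out. The function names are hypothetical, and returning None for less dense components is a placeholder for whatever preference BiG-SCAPE uses in that case, which is not stated here.

```python
def component_density(num_nodes, num_edges):
    """Density of an undirected graph without self-loops: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


def ap_preference(num_nodes, num_edges, threshold=0.85, dense_preference=-5):
    """Return the preference for dense components, or None if below the threshold."""
    if component_density(num_nodes, num_edges) >= threshold:
        return dense_preference
    return None
```

For example, a fully connected component of 4 nodes has 6 edges and density 1.0, so `ap_preference(4, 6)` returns -5, while a chain of 10 nodes (9 edges, density 0.2) returns None.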
- DSS:
- Fixed a bug where DSS would only consider unshared domains from one of the records in the pair. DSS calculation now takes into account unshared domains of both BGC records together.
- Fixed a bug where DSS calculation could change depending on the filenames of two BGC records with the same content.
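The first DSS fix comes down to considering the unshared domains of both records rather than of only one. The sketch below illustrates just that difference with plain Python sets. It is not the DSS formula itself, and the function names and domain labels are made up for the example.

```python
def unshared_one_sided(domains_a, domains_b):
    """Behaviour before the fix, as described: unshared domains from one record only."""
    return domains_a - domains_b


def unshared_both(domains_a, domains_b):
    """Behaviour after the fix, as described: unshared domains of both records together."""
    return domains_a ^ domains_b  # symmetric difference


a = {"dom_A", "dom_B", "dom_C"}
b = {"dom_B", "dom_C", "dom_D", "dom_E"}
print(sorted(unshared_one_sided(a, b)))  # ['dom_A']
print(sorted(unshared_both(a, b)))       # ['dom_A', 'dom_D', 'dom_E']
```

The symmetric version returns the same set regardless of which record is passed first, whereas the one-sided version does not.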
- GCF Trees: GCF alignments are based on the three most frequent domains that occur in a GCF. Visual alignment of GCF members is based on the domain LCS start coordinate.
- BiG-SCAPE now considers strandedness in CDS overlap cutoff.
- GCF heatmap: genomes are clustered solely on GCF presence/absence, and no longer based on family similarity.
- BiG-SCAPE's 'BGC per Genome' statistic is based on the 'ORGANISM' field in the GenBank file. If that field is not filled, v1.1.8 uses the filename instead, while v2 groups these records into 'Unknown'.
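The v2 grouping behaviour for the 'BGC per Genome' statistic can be sketched as below. The input is assumed to be a list of (filename, organism) pairs already extracted from the GenBank files, with an unfilled field represented as an empty string or None. The function name `bgcs_per_genome` and that input shape are hypothetical.

```python
from collections import Counter


def bgcs_per_genome(records):
    """Count BGCs per organism; records without an organism go to 'Unknown'."""
    counts = Counter()
    for _filename, organism in records:
        # Treat None, empty, and whitespace-only values as "not filled".
        key = organism.strip() if organism else ""
        counts[key or "Unknown"] += 1
    return dict(counts)


example = [
    ("a.gbk", "Streptomyces coelicolor"),
    ("b.gbk", "Streptomyces coelicolor"),
    ("c.gbk", ""),
    ("d.gbk", None),
]
print(bgcs_per_genome(example))
# {'Streptomyces coelicolor': 2, 'Unknown': 2}
```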
The entire codebase was completely refactored, resulting in numerous other small changes and fixes. For a full list, see the git shortlog.