
10. GCF Calling

Nico Louwen edited this page Feb 11, 2025 · 2 revisions

Once all relevant distances are calculated, the final networks are generated by removing all edges whose distance exceeds each of the user-defined --gcf-cutoffs. For every cutoff, BiG-SCAPE creates a network using all distances lower than or equal to that cutoff. BiG-SCAPE 2 also generates a full.network file that contains all computed distances for each given run.
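The per-cutoff filtering can be pictured with a minimal Python sketch. This is an illustration only, not BiG-SCAPE's actual code: the function name `networks_by_cutoff`, the edge tuples and the record names are hypothetical.

```python
def networks_by_cutoff(distances, cutoffs):
    """Build one edge list per cutoff, keeping edges with distance <= cutoff.

    `distances` is a list of (record_a, record_b, distance) tuples.
    This helper is hypothetical and only illustrates the filtering rule.
    """
    return {
        cutoff: [(a, b, d) for a, b, d in distances if d <= cutoff]
        for cutoff in cutoffs
    }


# Hypothetical pairwise distances between four BGC records
distances = [
    ("bgc1", "bgc2", 0.10),
    ("bgc2", "bgc3", 0.30),
    ("bgc1", "bgc3", 0.45),
    ("bgc3", "bgc4", 0.80),
]

networks = networks_by_cutoff(distances, cutoffs=[0.3, 0.5])
for cutoff, edges in networks.items():
    print(cutoff, edges)
```

Note that the edge at distance 0.30 is kept for the 0.3 cutoff, because the comparison is inclusive.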

Each connected component (CC) in this network is then clustered with the Affinity Propagation (AP) algorithm to generate GCFs.
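As a rough illustration of what a connected component is, the sketch below groups records that are linked, directly or indirectly, by the edges that survived a cutoff. It is a plain depth-first search written for this page, not BiG-SCAPE's implementation; `connected_components` is a hypothetical name, and records without any edge are not covered.

```python
from collections import defaultdict


def connected_components(edges):
    """Return a list of sets, one per connected component.

    `edges` is an iterable of (record_a, record_b) pairs. Hypothetical helper.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited record
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


components = connected_components([("a", "b"), ("b", "c"), ("d", "e")])
print(components)
```

Each returned set would then be handed to the clustering step separately.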

Affinity Propagation’s internal preference parameter can be set in the config.yml file (PREFERENCE: 0.0) and affects the number of GCFs created (Frey and Dueck, 2007). Preference can be seen as a ‘self-similarity’: two records cannot be grouped into one family if the preference is higher than the similarity between them. In practice, with preference = 0, AP neither rewards nor penalizes the generation of new families; with preference > 0, AP rewards it; and with preference < 0, AP penalizes it. Thus, in general, a higher preference leads to more GCFs. The remaining Affinity Propagation parameters are set to damping=0.9, max_iter=1000 and convergence_iter=200.

In BiG-SCAPE 2.0, we introduce CC property checks to ensure that tightly connected and uniform CCs are not over-split into several GCFs by overzealous AP behaviour. In particular, BiG-SCAPE 2.0 analyses each CC’s density, i.e. the ratio between the number of edges and the maximum number of possible edges. For any CC with a density of 0.85 or higher (DENSITY: 0.85, see config.yml), BiG-SCAPE 2.0 penalizes the generation of new GCFs, thus assigning its BGC records to a smaller number of GCFs.
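The density check can be sketched as follows, assuming an undirected CC with n records, for which the maximum number of possible edges is n(n-1)/2. The function names and the `DENSITY_THRESHOLD` constant are hypothetical; only the 0.85 value comes from the config.yml default mentioned above.

```python
DENSITY_THRESHOLD = 0.85  # mirrors the DENSITY default; the constant name is hypothetical


def cc_density(num_nodes, num_edges):
    """Ratio between observed edges and the maximum possible number of edges."""
    if num_nodes < 2:
        raise ValueError("a connected component needs at least two records")
    max_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / max_edges


def is_dense(num_nodes, num_edges, threshold=DENSITY_THRESHOLD):
    """True when the CC reaches the density threshold (inclusive)."""
    return cc_density(num_nodes, num_edges) >= threshold


# A CC with 5 records can have at most 10 edges
print(cc_density(5, 9), is_dense(5, 9))
print(cc_density(5, 8), is_dense(5, 8))
```

With 9 of 10 possible edges the CC counts as dense; with 8 of 10 it does not.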

Note: In cases where AP cannot converge, BiG-SCAPE 2.0 will assign all nodes to a single family, and pick a family center randomly.

GCF Trees

For each GCF, BiG-SCAPE creates Newick-formatted (.newick) trees using FastTree. The multiple sequence alignments (MSAs) passed to FastTree are built from the protein domains that are present in the family center (the exemplar BGC record, as determined by the Affinity Propagation algorithm) and that occur most frequently among the BGC records of the family.

This is done in the following way:

  • Of the domains present in the exemplar BGC record, the most common ones, i.e. the domains that appear with the highest frequencies in the entire family, are selected. By default, domains with the top 3 occurrence frequencies are included. This can be adjusted in the config.yml file (TOP_FREQS: 3). For example, in a GCF with 8 members, some domains might occur in all 8 members, others in 7 or 6, and some in only one member. Here, all domains that occur in 6-8 members will be included in the MSA. The generated MSA and Newick trees are provided in the output files.
  • These protein domains have previously been aligned to their reference pHMMs (commonly Pfam profiles), so, as in the DSS calculation, the alignments-to-reference are used here. For each BGC record, the relevant domain alignments-to-reference are concatenated in the same order and passed to FastTree, which constructs the tree based on the sequence similarity of these domains alone. It is important to note that FastTree is not aware of any BGC architectural changes, and that this method is a simplified proxy for a true phylogeny (which can best be achieved by using CORASON).
  • In the case where BGC records in this family do not contain any of the top frequency domains, they will not be included in the GCF tree and the following message will be displayed in the UI:
NOTE: This family contains members that could not be placed in the tree due to missing common domains.
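The domain selection described in the list above can be sketched in Python. This is a simplified reading of the text, not BiG-SCAPE's code: it assumes that a domain's frequency is the number of family members containing it, and the function names, record names and domain labels are all hypothetical.

```python
from collections import Counter


def select_tree_domains(family, exemplar, top_freqs=3):
    """Pick the exemplar's domains whose family-wide frequency is among the
    `top_freqs` highest distinct frequency values.

    `family` maps record name -> set of domain labels (hypothetical layout).
    """
    exemplar_domains = set(family[exemplar])
    counts = Counter()
    for domains in family.values():
        # Count each exemplar domain once per member that contains it
        for domain in set(domains) & exemplar_domains:
            counts[domain] += 1
    top_values = sorted(set(counts.values()), reverse=True)[:top_freqs]
    return {domain for domain, count in counts.items() if count in top_values}


def members_in_tree(family, selected):
    """Records with at least one selected domain; the others cannot be placed."""
    return [name for name, domains in family.items() if set(domains) & selected]


# Hypothetical family of five records, with r1 as the exemplar
family = {
    "r1": {"domA", "domB", "domC", "domD"},
    "r2": {"domA", "domB", "domC"},
    "r3": {"domA", "domB"},
    "r4": {"domA", "domE"},
    "r5": {"domE"},
}

selected = select_tree_domains(family, exemplar="r1", top_freqs=3)
placed = members_in_tree(family, selected)
print(sorted(selected), placed)
```

In this toy family, domD occurs only in the exemplar and falls outside the top 3 frequencies, and r5 shares no selected domain, so it would be left out of the tree.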

Finally, to aid interpretation, the BGC records themselves are drawn alongside the tree and aligned based on the LCS found during record comparison.
