cpockrandt edited this page Jan 31, 2021 · 15 revisions

FAQ

Here is a list of (what we believe might be) frequently asked questions. If we are wrong and your questions remain unanswered, please open an issue. ☺️

Input

Alignments

  1. Where did you get the whole-genome multiple sequence alignments from to compute tracks?

    For the tracks that we computed, we used the multiple sequence alignments available on the UCSC FTP server (search for "multiple alignments"). If they are not available for the latest genome release, you can use an alignment for an older release and "lift over" the resulting wig files to the new release (a script is available in the repository in ./util).

  2. Can I use such an alignment to score other species (not just the reference species)?

    PhyloCSF++ always expects the first sequence of an alignment to be the reference sequence. In theory, you could take a multiple sequence alignment and choose a different reference sequence. However, many multiple sequence alignments are reference-guided, so the other sequences will have significantly lower coverage. We are currently working on publishing a tool to compute multiple sequence alignments and score them directly with PhyloCSF++. We will update the README as soon as this tool becomes available.

  3. Can I use compressed alignment files?

    PhyloCSF++ uses memory mapping to process each file in parallel without having to load the entire alignment file into memory. Hence, it is not possible to use compressed input files. For the same reason, you cannot use process substitution to redirect the output of a command as input (e.g., phylocsf++ ... <(gunzip aln.maf.gz)). Please decompress your alignments before you pass them to PhyloCSF++.
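
    The decompression step could look like the following minimal sketch. It assumes gzip-compressed input; the file name aln.maf.gz and its toy content are made up for illustration, and the PhyloCSF++ call itself is left out.

```shell
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny placeholder alignment and compress it (stand-in for a real download).
printf '##maf version=1\n\na score=0\ns hg38.chr1 0 3 + 10 ATG\n\n' > aln.maf
gzip aln.maf    # replaces aln.maf with aln.maf.gz

# Decompress into a regular file on disk; -c writes to stdout, so the .gz is kept.
gzip -dc aln.maf.gz > aln.maf

# aln.maf is now an ordinary file that can be memory-mapped.
ls -l aln.maf aln.maf.gz
```

    Writing to a regular file (rather than a pipe or process substitution) is the point here, since only a file on disk can be memory-mapped.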

  4. Does it make a difference whether I pass multiple alignment files or a single file that contains all alignments?

    For tracks, the output will be exactly the same. For scoring entire alignments, the only difference is that a separate file with scores is written for every input file. There is, however, an important difference in speed for both tracks and regions: PhyloCSF++ parallelizes over the alignments in each file, not over all files. To get the best speed-up from parallelization, a file should contain a large number of alignments. In the worst case, you have many files with only a single alignment each, and no parallelization happens at all.
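
    If your alignments are spread over many small files, merging them first gives the parallelization more to work on. The following is only a sketch: it assumes MAF input (comment lines start with #, each alignment block starts with an "a" line and ends with a blank line), and the file names part1.maf etc. are hypothetical. Keep in mind that for scoring alignments a merged input also means a single score file.

```shell
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

# Three toy single-alignment MAF files (placeholders for real inputs).
for i in 1 2 3; do
    printf '##maf version=1\n\na score=0\ns hg38.chr%s 0 3 + 10 ATG\n\n' "$i" > "part$i.maf"
done

# Write one header, then append every file with its comment lines removed.
{
    printf '##maf version=1\n'
    for f in part*.maf; do
        grep -v '^#' "$f"
    done
} > merged.maf

# Count the alignment blocks that ended up in the merged file.
grep -c '^a ' merged.maf
```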

  5. How can I get scores for different

Models

  1. New models have been uploaded to the original repository. Can I use them?

    To make the tool easier to use, we included all available models in the program. Nonetheless, you can also specify a path to a model yourself. Alternatively, open an issue and we will include the new models in our program.

  2. How can I compute models for my own set of species?

    To train your own model, you need a phylogenetic tree with evolutionary distances, as well as codon frequencies and codon substitution rates for both coding and non-coding regions. At the moment, neither PhyloCSF nor PhyloCSF++ has a publicly available tool to compute such a model, but we are working on it and plan to include it in PhyloCSF++ in the near future. Since models are compatible between PhyloCSF and PhyloCSF++, you will also be able to use models computed with PhyloCSF++ with the original PhyloCSF software.

  3. Why do I need the genome length and coding regions to create smoothened tracks?

    The tracks are smoothened with an HMM, which needs this information as prior training data.

  4. How do I get the coding regions for the smoothened tracks?

    PhyloCSF++ needs a tab-separated file in the following format:

    chrom strand phase start-coord stop-coord

    You can extract this file directly from a gene annotation file (gff/gtf) with the following command:

    awk -F'\t' 'BEGIN { OFS="\t" } ($3 == "CDS") { print $1, $7, $8, $4, $5 }' genes.gff > CodingExons.txt

    The chromosome names do not have to match the chromosome names in the alignment file.
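
    To illustrate what the awk command above produces, here is a small worked example on a made-up annotation. The file content is invented for illustration; only the awk line is taken from the answer above. It relies on the standard GFF column order (1 = sequence, 3 = feature type, 4 = start, 5 = end, 7 = strand, 8 = phase).

```shell
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

# Toy GFF with one gene, one exon and two CDS features (tab-separated, 9 columns).
printf 'chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1\n'  > genes.gff
printf 'chr1\tsrc\texon\t100\t199\t.\t+\t.\tID=e1\n' >> genes.gff
printf 'chr1\tsrc\tCDS\t120\t199\t.\t+\t0\tID=c1\n'  >> genes.gff
printf 'chr1\tsrc\tCDS\t300\t450\t.\t+\t1\tID=c2\n'  >> genes.gff

# Keep only CDS rows and reorder the columns to: chrom strand phase start stop.
awk -F'\t' 'BEGIN { OFS="\t" } ($3 == "CDS") { print $1, $7, $8, $4, $5 }' genes.gff > CodingExons.txt

cat CodingExons.txt
```

    Only the two CDS rows survive, printed as tab-separated lines "chr1 + 0 120 199" and "chr1 + 1 300 450"; the gene and exon rows are dropped.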

Output

  1. I get different scores than with the original PhyloCSF tool. Is this a bug?

    Most parts of the software use randomization (the MLE and OMEGA strategies, as well as the smoothening of the tracks). This will sometimes lead to minor differences in the scores. If you get significant differences, please open an issue and provide us with the alignment and the options that you ran PhyloCSF++ with, and we will investigate the cause.

  2. What do I need to do with the track files after I have computed them?

    If you want to load your tracks into a genome browser, you usually want to index them first, i.e., convert the wig files into bw (bigWig) files. For this you can use wigToBigWig. It requires a chrom.sizes file, i.e., a file that contains the sequence length of each chromosome/sequence, which you can also find on the UCSC FTP server. You can parallelize the conversion of all wig files with GNU parallel by running the following in the output directory:

    find . -name '*.wig' | parallel --will-cite -j10 'wigToBigWig {} hg38.chrom.sizes {.}.bw'
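
    If no chrom.sizes file is available for your assembly, one option is to derive it from a FASTA index. This is a sketch under the assumption that you have a samtools-style .fai index, whose first two columns are the sequence name and its length; genome.fa.fai and the values in it are made up here.

```shell
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

# Toy FASTA index: name, length, byte offset, bases per line, bytes per line.
printf 'chr1\t248956422\t6\t60\t61\n'      > genome.fa.fai
printf 'chrM\t16569\t253105720\t60\t61\n' >> genome.fa.fai

# chrom.sizes needs only the first two columns: sequence name and length.
cut -f1,2 genome.fa.fai > genome.chrom.sizes

cat genome.chrom.sizes
```

    The sequence names in the chrom.sizes file need to be the ones used in the wig files.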
