Home
Here is a list of (what we believe might be) frequently asked questions. If your question is not answered here, please open an issue on GitHub.
-
Where did you get the whole-genome multiple sequence alignments from to compute tracks?
For the tracks we computed, we took the multiple sequence alignments available on the UCSC FTP server (search for "multiple alignments"). If they are not available for the latest genome release, you can use the alignments for an older release and "lift over" the wig files to the new release (a script is available in the repository).
-
Can I use such an alignment to score other species (not just the reference species)?
PhyloCSF++ always expects the first sequence of an alignment to be the reference sequence. In theory, you could take a multiple sequence alignment and choose a different reference sequence. However, many multiple sequence alignments are reference-guided, so the other sequences will have significantly lower coverage. We are currently working on publishing a tool to compute multiple sequence alignments and score them. We will update the README when this tool becomes available.
-
Can I use compressed alignment files?
PhyloCSF++ uses memory-mapping and processes each file in parallel. Hence, it is not possible to use compressed input files. For the same reason, you cannot use process substitution to redirect the output of a command as input, for example:
phylocsf++ ... <(gunzip -c aln.maf.gz)
Decompress the files to disk first and pass the uncompressed files instead.
-
Does it make a difference whether I pass multiple alignment files or a single file that contains all alignments?
For tracks, the output will be exactly the same. For scoring entire alignments, the only difference is that a file with scores is written for every input file separately. There is, however, an important difference for both tracks and regions: PhyloCSF++ parallelizes over the alignments in each file, not over all files. To get the best speed-up from parallelization, a file should contain a large number of alignments. The worst case is many files with only a single alignment each: then no parallelization happens at all.
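If your alignments are spread over many small files, one option is to merge them before scoring so that PhyloCSF++ has more alignments per file to parallelize over. The following is a minimal sketch and not part of PhyloCSF++: merge_maf is a hypothetical helper, and it assumes the inputs are uncompressed MAF files that share the same reference species and whose header and comment lines start with "#". It is worth checking the merged file on a small subset first.

```shell
# Hypothetical helper (not part of PhyloCSF++): concatenate MAF files into one.
# Usage: merge_maf first.maf second.maf ... > merged.maf
merge_maf() {
    awk '
        # Start of a later file: make sure a blank line separates the blocks.
        FNR == 1 && NR > 1 && last != "" { print ""; last = "" }
        # Keep only the very first "#" line (the ##maf header); drop all other comments.
        /^#/ { if (NR == 1) print; next }
        # Copy alignment lines and blank separators unchanged.
        { print; last = $0 }
    ' "$@"
}
```

Keep in mind that when scoring entire alignments, one score file is written per input file, so a merged input yields a single score file.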
-
New models have been uploaded to the original repository. Can I use them?
To make the tool easier to use, we included all available models in the program. Nonetheless, you can also specify a path to the model. Please let us know if new models become available and we will include them in our program.
-
How can I compute models for my own set of species?
To train your own model, you need a phylogenetic tree with evolutionary distances, as well as codon frequencies and codon substitution rates for both coding and non-coding regions. At the moment, neither PhyloCSF nor PhyloCSF++ has a tool to compute such a model, but we are working on it and plan to include it in PhyloCSF++ in the near future.
-
Why do I need the genome length and coding regions to create smoothened tracks?
To smoothen the tracks, an HMM is used and it needs this information as training data.
-
How do I get the coding regions for the smoothened tracks?
PhyloCSF++ needs a tab-separated file in the following format:
chrom strand phase start-coord stop-coord
You can extract this file directly from a gene annotation file (GFF/GTF) with the following command. The chromosome names do not have to match the chromosome names in the alignment file.
awk -F'\t' 'BEGIN { OFS="\t" } ($3 == "CDS") { print $1, $7, $8, $4, $5 }' genes.gff
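Before passing the extracted file to PhyloCSF++, it can help to sanity-check it. The sketch below is not part of PhyloCSF++: check_coding_regions is a hypothetical helper, and the constraints it tests (five tab-separated columns, strand is + or -, phase is 0, 1 or 2 as for CDS features in GFF, integer coordinates with start not greater than stop) are assumptions based on the column layout above rather than documented requirements.

```shell
# Hypothetical helper (not part of PhyloCSF++): report suspicious lines in a
# coding-regions file with the columns: chrom strand phase start-coord stop-coord.
# Usage: check_coding_regions coding_regions.tsv
# Prints each offending line; returns non-zero if at least one was found.
check_coding_regions() {
    awk -F'\t' '
        NF != 5 || $2 !~ /^[+-]$/ || $3 !~ /^[012]$/ || $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $4 + 0 > $5 + 0 {
            print "line " NR ": " $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Some annotation files use "." as the phase of a CDS feature; such lines would be reported by this check, so you can decide how to handle them.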
-
What do I need to do with the computed tracks?
If you want to load your tracks into a genome browser, you usually want to index them first, i.e., convert the wig files into bw files. For this you can use wigToBigWig. wigToBigWig will ask for a chrom.sizes file, a file that contains the sequence length for each chromosome/sequence. You can parallelize the conversion:
find . -name '*.wig' | parallel --will-cite -j10 'bwFile={}; bwFile="${bwFile:0:-3}bw"; wigToBigWig {} hg38.chrom.sizes "$bwFile"'
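If no chrom.sizes file is at hand for your assembly, one can be derived from the genome FASTA file. This is a minimal sketch, assuming an uncompressed FASTA file whose sequence names (the first word of each header line) match the names used in the wig files. fasta_chrom_sizes is a hypothetical helper, not part of wigToBigWig or PhyloCSF++.

```shell
# Hypothetical helper: print "name<TAB>length" for every sequence in a FASTA file.
# Usage: fasta_chrom_sizes genome.fa > genome.chrom.sizes
fasta_chrom_sizes() {
    awk '
        # Header line: emit the previous sequence, then start counting a new one.
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word of the header without the ">"
            len = 0
            next
        }
        # Sequence line: add its length to the running total.
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

For example, fasta_chrom_sizes hg38.fa > hg38.chrom.sizes would produce the file used in the command above (hg38.fa is a placeholder file name).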