diff --git a/CHANGES.md b/CHANGES.md index fb5d1c59..8191eb52 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,5 +1,5 @@ * Fixed a bug in --quantMode TranscriptomeSAM that prevented output to Aligned.toTranscriptome.out.bam of the reads mapped to the very last annotated transcript. -* Cleaned up the code to remove compilation warnings. +* Cleaned up the code to remove compilation warnings (thanks to github.com/yhoogstrate). * Implemented --outSAMunmapped Within KeepPairs option to record unmapped mate adjacent to the mapped one, in case single-end alignments are allowed. For multi-mappers, the unmapped mate will be recored mulitple times adjacent to the mappet mate of each alignment. diff --git a/bin/Linux_x86_64/STAR b/bin/Linux_x86_64/STAR index 3732af1b..ffe5ed59 100755 Binary files a/bin/Linux_x86_64/STAR and b/bin/Linux_x86_64/STAR differ diff --git a/bin/Linux_x86_64/STARlong b/bin/Linux_x86_64/STARlong index c34a7888..c71585ae 100755 Binary files a/bin/Linux_x86_64/STARlong and b/bin/Linux_x86_64/STARlong differ diff --git a/bin/Linux_x86_64_static/STAR b/bin/Linux_x86_64_static/STAR index ea50cb7b..f5196444 100755 Binary files a/bin/Linux_x86_64_static/STAR and b/bin/Linux_x86_64_static/STAR differ diff --git a/bin/Linux_x86_64_static/STARlong b/bin/Linux_x86_64_static/STARlong index 27f5cc8f..d4b85bb7 100755 Binary files a/bin/Linux_x86_64_static/STARlong and b/bin/Linux_x86_64_static/STARlong differ diff --git a/doc/STARmanual.pdf b/doc/STARmanual.pdf index c9238df6..383ccae6 100644 Binary files a/doc/STARmanual.pdf and b/doc/STARmanual.pdf differ diff --git a/extras/doc-latex/STARmanual.tex b/extras/doc-latex/STARmanual.tex index ac2bec67..4ea2443d 100644 --- a/extras/doc-latex/STARmanual.tex +++ b/extras/doc-latex/STARmanual.tex @@ -34,8 +34,8 @@ \newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}} -\title{STAR manual 2.5.0a} -\author{Alexnder Dobin\\ +\title{STAR manual 2.5.1a} +\author{Alexander Dobin\\ dobin@cshl.edu} \maketitle \tableofcontents @@ -150,7 +150,7 @@ \subsubsection{Using a list of annotated junctions.} Note, that the \opt{sjdbFileChrStartEnd} file can contain duplicate (identical) junctions, STAR will collapse (remove) duplicate junctions. \subsubsection{Very small genome.} -For small genomes, the parameter \opt{genomeSAindexNbases} needs to be scaled down, with a typical value of \code{min(14, log2(GenomeLength)/2 - 1)}. For example, for 1~megaBase genome, this is equal to 9, for 100~kiloBase genome, this is equal to 7. +For small genomes, the parameter \opt{genomeSAindexNbases} \textbf{must} to be scaled down, with a typical value of \code{min(14, log2(GenomeLength)/2 - 1)}. For example, for 1~megaBase genome, this is equal to 9, for 100~kiloBase genome, this is equal to 7. \subsubsection{Genome with a large number of references.} If you are using a genome with a large (\textgreater 5,000) number of references (chrosomes/scaffolds), you may need to reduce the \opt{genomeChrBinNbits} to reduce RAM consumption. The following scaling is recommended: \opt{genomeChrBinNbits} = \code{min(18, log2(GenomeLength/NumberOfReferences))}. For example, for 3~gigaBase genome with 100,000 chromosomes/scaffolds, this is equal to 15. @@ -303,8 +303,9 @@ \subsection{Splice junctions.} \section{Chimeric and circular alignments.} To switch on detection of chimeric (fusion) alignments (in addition to normal mapping), \opt{chimSegmentMin} should be set to a positive value. Each chimeric alignment consists of two "segments". Each segment is non-chimeric on its own, but the segments are chimeric to each other (i.e. the segments belong to different chromosomes, or different strands, or are far from each other). Both segments may contain splice junctions, and one of the segments may contain portions of both mates. \opt{chimSegmentMin} parameter controls the minimum mapped length of the two segments that is allowed. For example, if you have 2x75 reads and used \opt{chimSegmentMin} 20, a chimeric alignment with 130b on one chromosome and 20b on the other will be output, while 135 + 15 won't be. + \subsection{STAR-Fusion.} -STAR-Fusion is a software package for detecting fusion transcript from STAR chimeric output. It is developed and maintained by Brian Haas. Please visit its GitHub page for instructions and documentation: \url{https://github.com/STAR-Fusion/STAR-Fusion}. +STAR-Fusion is a software package for detecting fusion transcript from STAR chimeric output. It is developed and maintained by Brian Haas (@Broad Institute), whose effort was inspired by earlier work done by Nicolas Stransky in the landmark publication "The landscape of kinase fusions in cancer" by Stransky et al., Nat Commun 2014, in addition to very nice work done by Daniel Nicorici with his FusionCatcher software. Please visit its GitHub page for instructions and documentation: \url{https://github.com/STAR-Fusion/STAR-Fusion}. \subsection{Chimeric alignments in the main BAM files.} Chimeric alignments can be included together with normal alignments in the main (sorted or unsorted) BAM file(s) using \opt{chimOutType} \optv{WithinBAM}. In these files, formatting of chimeric alignments follows the latest SAM/BAM specifications. diff --git a/extras/doc-latex/parametersDefault.tex b/extras/doc-latex/parametersDefault.tex index cc9d050c..7dec92ad 100644 --- a/extras/doc-latex/parametersDefault.tex +++ b/extras/doc-latex/parametersDefault.tex @@ -67,6 +67,9 @@ \optName{genomeSAsparseD} \optValue{1} \optLine{int{\textgreater}0: suffux array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction} +\optName{genomeSuffixLengthMax} + \optValue{-1} + \optLine{int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.} \end{optTable} \optSection{Splice Junctions Database}\label{Splice_Junctions_Database} \begin{optTable} @@ -188,7 +191,7 @@ \end{optOptTable} \optName{outReadsUnmapped} \optValue{None} - \optLine{string: output of unmapped reads (besides SAM)} + \optLine{string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).} \begin{optOptTable} \optOpt{None} \optOptLine{no output} \optOpt{Fastx} \optOptLine{output in separate fasta/fastq files, Unmapped.out.mate1/2} @@ -249,11 +252,17 @@ \optLine{int{\textgreater}=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.} \optName{outSAMunmapped} \optValue{None} - \optLine{string: output of unmapped reads in the SAM format} + \optLine{string(s): output of unmapped reads in the SAM format} + \optLine{1st word:} \begin{optOptTable} \optOpt{None} \optOptLine{no output} \optOpt{Within} \optOptLine{output unmapped reads within the main SAM file (i.e. Aligned.out.sam)} \end{optOptTable} + \optLine{2nd word:} +\begin{optOptTable} + \optOpt{KeepPairs} \optOptLine{record unmapped mate for each alignment, and, in case of unsroted output, keep it adjacent to its mapped mate.} +\end{optOptTable} + \optLine{Only affects multi-mapping reads} \optName{outSAMorder} \optValue{Paired} \optLine{string: type of sorting for the SAM output} @@ -377,28 +386,29 @@ \optLine{int: the score range below the maximum score for multimapping alignments} \optName{outFilterMultimapNmax} \optValue{10} - \optLine{int: read alignments will be output only if the read maps fewer than this value, otherwise no alignments will be output} + \optLine{int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value. } + \optLine{Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out .} \optName{outFilterMismatchNmax} \optValue{10} - \optLine{int: alignment will be output only if it has fewer mismatches than this value} + \optLine{int: alignment will be output only if it has no more mismatches than this value.} \optName{outFilterMismatchNoverLmax} \optValue{0.3} - \optLine{int: alignment will be output only if its ratio of mismatches to *mapped* length is less than this value} + \optLine{float: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.} \optName{outFilterMismatchNoverReadLmax} - \optValue{1} - \optLine{int: alignment will be output only if its ratio of mismatches to *read* length is less than this value} + \optValue{1.0} + \optLine{float: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.} \optName{outFilterScoreMin} \optValue{0} - \optLine{int: alignment will be output only if its score is higher than this value} + \optLine{int: alignment will be output only if its score is higher than or equal to this value.} \optName{outFilterScoreMinOverLread} \optValue{0.66} - \optLine{float: outFilterScoreMin normalized to read length (sum of mates' lengths for paired-end reads)} + \optLine{float: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads)} \optName{outFilterMatchNmin} \optValue{0} - \optLine{int: alignment will be output only if the number of matched bases is higher than this value} + \optLine{int: alignment will be output only if the number of matched bases is higher than or equal to this value.} \optName{outFilterMatchNminOverLread} \optValue{0.66} - \optLine{float: outFilterMatchNmin normalized to read length (sum of mates' lengths for paired-end reads)} + \optLine{float: sam as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).} \optName{outFilterIntronMotifs} \optValue{None} \optLine{string: filter alignment using their motifs} diff --git a/source/STARmanual.tex b/source/STARmanual.tex deleted file mode 100644 index ab4107a8..00000000 --- a/source/STARmanual.tex +++ /dev/null @@ -1,451 +0,0 @@ -\documentclass[12pt]{article} -\usepackage[colorlinks]{hyperref} -%\usepackage{color} -\usepackage[usenames,dvipsnames,svgnames,table]{xcolor} -\usepackage{tabularx} -\usepackage{longtable} -\usepackage{hanging} -\usepackage{changepage} -\usepackage[top=1in, bottom=1in, left=0.7in, right=0.7in]{geometry} -\usepackage{enumitem} - -\begin{document} -\hypersetup{ - linkcolor=MidnightBlue - } - - -%%%%%% global commands -% option name, no reference -\newcommand{\optn}[1]{\textcolor{violet}{\texttt{--#1}}} -% option name -\newcommand{\opt}[1]{\hyperlink{#1}{\optn{#1}}} -%option value, to be replaces with user value -\newcommand{\optv}[1]{\texttt{#1}} -\newcommand{\optvr}[1]{\textit{\texttt{#1}}} - -\newcommand{\code}[1]{\texttt{#1}} - -\newcommand{\codelines}[1]{\begin{adjustwidth}{0.5in}{0in} - \raggedright\texttt{#1} - \end{adjustwidth}} - -\newcommand{\ofilen}[1]{\texttt{#1}} - -\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}} - -\title{STAR manual 2.4.2a} -\author{Alexnder Dobin\\ -dobin@cshl.edu} -\maketitle -\tableofcontents - -\newpage - -%\section{Introduction} -\section{Getting started.} -\subsection{Installation.} - -STAR source code and binaries can be downloaded from GitHub: named releases from \url{https://github.com/alexdobin/STAR/releases}, or the master branch from \url{https://github.com/alexdobin/STAR}. The pre-compiled STAR executables are located \code{bin/} subdirectory. The \code{static} executables are the easisest to use, as they are statically compiled and are not dependents on external libraries. - -To compile STAR from sources run \code{make} in the source directory for a Linux-like environment, or run \code{make STARforMac} for Mac OS X. This will produce the executable \code{'STAR'} inside the source directory. - -\subsubsection{Installation - in depth and troubleshooting.} -STAR is compiled with gcc c++ compiler and depends only on standard gcc libraries. Some generic instructions on installing correct gcc environments are given below. - -\paragraph{Ubuntu.}\hfill -\codelines{\$ sudo apt-get update\\ -\$ sudo apt-get install g++\\ -\$ sudo apt-get install make -} - -\paragraph{Red Hat, CentOS, Fedora.}\hfill -\codelines{\$ sudo yum update\\ - \$ sudo yum install make\\ - \$ sudo yum install gcc-c++\\ - \$ sudo yum install glibc-static - } - -\paragraph{SUSE.}\hfill -\codelines {\$ sudo zypper update\\ - \$ sudo zypper in gcc gcc-c++ -} - -\paragraph{Mac OS X.\newline} -Current versions of Mac OS X Xcode are shipped with Clang replacing the standard gcc compiler. Presently, standard Clang does not support OpenMP which creates problems for STAR compilation. One option to avoid this problem is to install gcc (preferrably using \code{homebrew} package manager). Another option is to add OpenMP functionality to Clang. - -\subsection{Basic workflow.} -Basic STAR workflow consists of 2 steps: -\begin{enumerate} -\item -Generating genome indexes files (see \sechyperref{Generating_genome_indexes}.\newline -In this step user supplied the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generate genome indexes that are utilized in the 2nd (mapping) step. The genome indexes are saved to disk and need only be generated \textbf{once} for each genome/annotation combination. A limited collection of STAR genomes is available from \url{http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/}, however, it is strongly recommended that users generate their own genome indexes with most up-to-date assemblies and annotations. -\item -Mapping reads to the genome (see \sechyperref{Running_mapping_jobs}).\newline -In this step user supplies the genome files generated in the 1st step, as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome, and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal (wiggle) tracks etc. Output files are described in \sechyperref{Output_files}. Mapping is controlled by a variety of input parameters (options) that are described in brief in \sechyperref{Description_of_all_options}, and in more detail in \sechyperref{Running_mapping_jobs}. -\end{enumerate} - -STAR command line has the following format:\codelines{ -STAR --option1-name option1-value(s)--option2-name option2-value(s) ... -} -If an option can accept multiple values, they are separated by spaces, and in a few cases - by commas. - - -\section{Generating genome indexes.}\label{Generating_genome_indexes} -\subsection{Basic options.} -The basic options to generate genome indices are as follows: -\codelines{\opt{runThreadN} \optvr{NumberOfThreads}\\ -\opt{runMode} \optv{genomeGenerate}\\ -\opt{genomeDir} \optvr{/path/to/genomeDir}\\ -\opt{genomeFastaFiles} \optvr{/path/to/genome/fasta1 /path/to/genome/fasta2 ...} \\ -\opt{sjdbGTFfile} \optvr{/path/to/annotations.gtf}\\ -\opt{sjdbOverhang} \optvr{ReadLength-1}\\ -} - -\begin{itemize} -\item[] -\opt{runThreadN} option defines the number of threads to be used for genome generation, it has to be set to the number of available cores on the server node. - -\item[] -\opt{runMode} \optv{genomeGenerate} option directs STAR to run genome indices generation job. - -\opt{genomeDir} specifies path to the directory (henceforth called "genome directory" where the genome indices are stored. This directory has to be created (with \code{mkdir}) before STAR run and needs to writing permissions. The file system needs to have at least 100GB of disk space available for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path will have to be supplied at the mapping step to identify the reference genome. - -\item[] -\opt{genomeFastaFiles} specified one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called “chromosomes”) are allowed for each fasta file. You can rename the chromosomes’ names in the chrName.txt keeping the order of the chromosomes in the file: the names from this file will be used in all output alignment files (such as .sam). The tabs are not allowed in chromosomes’ names, and spaces are not recommended. - -\item[] -\opt{sjdbGTFfile} specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is \textbf{highly recommended} whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step. - -\item[] -\opt{sjdbOverhang} specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the \optvr{ReadLength-1}, where \optvr{ReadLength} is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is \optvr{max(ReadLength)-1}. \textbf{In most cases, the default value of 100 will work as well as the ideal value.} -\end{itemize} - -Genome files comprise binary genome sequence, suffix arrays, text chromosome names/lengths, splice junctions coordinates, and transcripts/genes information. Most of these files use internal STAR format and are not intended to be utilized by the end user. It is strongly \textbf{not recommended} to change any of these file with one exception: you can rename the chromosome names in the chrName.txt keeping the order of the chromosomes in the file: the names from this file will be used in all output files (e.g. SAM/BAM). - -\subsection{Advanced options.} -\subsubsection{Which chromosomes/scaffolds/patches to include?} -It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should \textbf{not} be included in the genome. - -Examples of acceptable genome sequence files: -\begin{itemize} -\item \textbf{ENSEMBL:} files marked with .dna.primary.assembly, such as: -\url{ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz} -\item \textbf{NCBI:} "no alternative - analysis set": \url{ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz} -\end{itemize} -\subsubsection{Which annotations to use?} -The use of the most comprehensive annotations for a given species is strongly recommended. Very importantly, chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses \code{chr1, chr2, ...} naming convention, and ENSEMBL uses \code{1, 2, ...} naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless chromosomes are renamed to match between the FASTA anf GTF files. - -For mouse and human, the Gencode annotations are recommended: \url{http://www.gencodegenes.org/}. - -\subsubsection{Annotations in GFF format.} -In addition to the aforementioned options, for GFF3 formatted annotations you need to use \opt{sjdbGTFtagExonParentTranscript} \optv{Parent}. In general, for \opt{sjdbGTFfile} files STAR only processes lines which have \opt{sjdbGTFfeatureExon} (=\optv{exon} by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the \opt{sjdbGTFtagExonParentTranscript} (=\optv{transcript\_id} by default) GTF/GFF attribute. - -\subsubsection{Using a list of annotated junctions.} -STAR can also utilize annotations formatted as a list of splice junctions coordinates in a text file: \opt{sjdbFileChrStartEnd} \optvr{/path/to/sjdbFile.txt}. This file should contains 4 columns separated by tabs: -\codelines{Chr \quad{\textbackslash}tab\quad Start \quad{\textbackslash}tab\quad End \quad{\textbackslash}tab\quad Strand=+/-/.} -Here Start and End are first and last bases of the introns (1-based chromosome coordinates). -This file can be used in addition to the \opt{sjdbGTFfile}, in which case STAR will extract junctions from both files. - -Note, that the \opt{sjdbFileChrStartEnd} file can contain duplicate (identical) junctions, STAR will collapse (remove) duplicate junctions. - -\subsubsection{Very small genome.} -For small genomes, the parameter \opt{genomeSAindexNbases} needs to be scaled down, with a typical value of \code{min(14, log2(GenomeLength)/2 - 1)}. For example, for 1~megaBase genome, this is equal to 9, for 100~kiloBase genome, this is equal to 7. - -\subsubsection{Genome with a large number of references.} -If you are using a genome with a large (\textgreater 5,000) number of references (chrosomes/scaffolds), you may need to reduce the \opt{genomeChrBinNbits} to reduce RAM consumption. The following scaling is recommended: \opt{genomeChrBinNbits} = \code{min(18, log2(GenomeLength/NumberOfReferences))}. For example, for 3~gigaBase genome with 100,000 chromosomes/scaffolds, this is equal to 15. - -\section{Running mapping jobs.}\label{Running_mapping_jobs} -\subsection{Basic options.} -The basic options to run a mapping job are as follows: -\codelines{\opt{runThreadN} \optvr{NumberOfThreads}\\ -\opt{genomeDir} \optvr{/path/to/genomeDir}\\ -\opt{readFilesIn} \optvr{/path/to/read1} [\optvr{/path/to/read2}] -} - -\begin{itemize} -\item[] -%\opt{runThreadN} option defines the number of threads to be used for mapping, it has to be set to the number of available cores on the server node. - -\opt{genomeDir} specifies path to the genome directory where genome indices where generated (see \sechyperref{Generating_genome_indexes}). - -\item[] -\opt{readFilesIn} name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). If using Illumina paired-end reads, the \optvr{read1} and \optv{read2} files have to be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split in multiple lines) FASTA file are supported. If the read files are compressed, use the \opt{readFilesCommand} \optvr{UncompressionCommand} option, where \optvr{UncompressionCommand} is the un-compression command that takes the file name as input parameter, and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use -\code{\opt{readFilesCommand} \optv{zcat}} -OR -\code{\opt{readFilesCommand} \optv{gzip -c}}. -For bzip2-compressed files, use -\code{\opt{readFilesCommand} \optv{bzip2 -c}}. -\end{itemize} - -\subsection{Advanced options.} -There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. - -\subsubsection{Using annotations at the mapping stage.} -Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory. - -\subsubsection{ENCODE options} -An example of ENCODE standard options for long RNA-seq pipeline is given below: -\begin{itemize} -\item[] -\opt{outFilterType} BySJout\\ -reduces the number of "spurious" junctions -\item[] -\opt{outFilterMultimapNmax} 20\\ -max number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped -\item[] -\opt{alignSJoverhangMin} 8\\ -minimum overhang for unannotated junctions -\item[] -\opt{alignSJDBoverhangMin} 1\\ -minimum overhang for annotated junctions -\item[] -\opt{outFilterMismatchNmax} 999\\ -maximum number of mismatches per pair, large number switches off this filter -\item[] -%\opt{outFilterMismatchNoverReadLmax} 0.04\\ -max number of mismatches per pair relative to read length: for 2x100b, max number of mismatches is 0.06*200=8 for the paired read -\item[] -\opt{alignIntronMin} 20\\ -minimum intron length -\item[] -\opt{alignIntronMax} 1000000\\ -maximum intron length -\item[] -\opt{alignMatesGapMax} 1000000\\ -maximum genomic distance between mates -\end{itemize} - -\section{Output files.}\label{Output_files} -STAR produces multiple output files. All files have standard name, however, you can change the file prefixes using \opt{outFileNamePrefix} \optvr{/path/to/output/dir/prefix}. By default, this parameter is \optv{./}, i.e. all output files are written in the current directory. -\subsection{Log files.} -\begin{itemize} -\item[] -\ofilen{Log.out}: main log file with a lot of detailed information about the run. This file is most useful for troubleshooting and debugging. -\item[] -\ofilen{Log.progress.out}: reports job progress statistics, such as the number of processed reads, \% of mapped reads etc. It is updated in ~1 minute intervals. -\item[] -\ofilen{Log.final.out}: summary mapping statistics after mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in \ofilen{SJ.out.tab}. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases. -\end{itemize} - -\subsection{SAM.} -\ofilen{Aligned.out.sam} - alignments in standard SAM format. -\subsubsection{Multimappers.} -The number of loci Nmap a read maps to is given by \code{NH:i:} field. Value of 1 corresponds to unique mappers, while values \textgreater1 corresponds to multi-mappers. \code{HI} attrbiutes enumerates multiple alignments of a read starting with 1. - -The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads. This scheme is same as the one used by TopHat and is compatible with Cufflinks. The default MAPQ=255 for the unique mappers maybe changed with \opt{outSAMmapqUnique} \optvr{Integer0to255} option to ensure compatibility with downstream tools such as GATK. - -For multi-mappers, all alignments except one are marked with 0x100 (secondary alignment) in the FLAG (column 2 of the SAM). The unmarked alignment is either the best one (i.e. highest scoring), or is randomly selected from the alignments of equal quality. This default behavior can be changed with \opt{outSAMprimaryFlag} \optv{AllBestScore} option, that will output all alignments with the best score as primary alignments (i.e. 0x100 bit in the FLAG unset). - -\subsubsection{SAM attributes.} -The SAM attributes can be specified by the user using \opt{outSAMattributes} \optvr{A1 A2 A3 ...} option which accept a list of 2-character SAM attrbiutes. The implemented attrbutes are: \optv{NH HI NM MD AS nM jM jI XS}. By default, STAR outputs \optv{NH HI AS nM} attributes. -\begin{itemize} -\item[] -\optv{NH HI NM MD} have standard meaning as defined in the SAM format specifications. -\item[] -\optv{AS} id the local alignment score (paired for paired-edn reads). -\item[] -\optv{nM} is the number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches in each mate. -\item[] -\optv{jM:B:c,M1,M2,...} intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value. -\item[] -\optv{jI:B:I,Start1,End1,Start2,End2,...} Start and End of introns for all junctions (1-based). -\item[] -\optv{jM jI} attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks. -\end{itemize} - -\subsubsection{Compatibility with Cufflinks/Cuffdiff.} -For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with \optv{XS} strand attribute, which STAR will generate with \opt{outSAMstrandField} \optv{intronMotif} option. As required, the XS strand attribute will be generated for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions) will be suppressed. - -If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the library option \opt{library-type} options. For example, \code{cufflinks ... --library-type fr-firststrand} should be used for the “standard” dUTP protocol, including Illumina's stranded Tru-Seq. This option has to be used only for Cufflinks runs and not for STAR runs. - -In addition, it is recommended to remove the non-canonical junctions for Cufflinks runs using \opt{outFilterIntronMotifs} \optv{RemoveNoncanonical}. - -\subsection{Unsorted and sorted-by-coordinate BAM.} -STAR can output alignments directly in binary BAM format, thus saving time on converting SAM files to BAM. It can also sort BAM files by coordinates, which is required by many downstream applications. -\begin{itemize} -\raggedright -\item[] -\opt{outSAMtype} \optv{BAM Unsorted}\\ -output unsorted \ofilen{Aligned.out.bam} file. The paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well. This "unsorted" file can be directly used with downstream software such as \code{HTseq}, without the need of name sorting. The order of the reads will match that of the input FASTQ(A) files only if one thread is used \opt{runThread} \optv{1}, and \opt{outFilterType} \opt{BySJout} is \textbf{not} used. -\item[] -\opt{outSAMtype} \optv{BAM SortedByCoordinate}\\ -output sorted by coordinate \ofilen{Aligned.sortedByCoord.out.bam} file, similar to \code{samtools sort} command. -\item[] -\opt{outSAMtype} \optv{BAM Unsorted SortedByCoordinate}\\ -output both unsorted and sorted files. -\end{itemize} - -\subsection{Splice junctions.} -\ofilen{SJ.out.tab} contains high confidence collapsed splice junctions in tab-delimited format. Note that STAR defines the junction start/end as intronic bases, while many other software define them as exonic bases. The columns have the following meaning: -%\begin{enumerate}[label=\bfseries Exercise \arabic*:] -\begin{itemize}[leftmargin=1in] -\item[column 1:] chromosome -\item[column 2:] first base of the intron (1-based) -\item[column 3:] last base of the intron (1-based) -\item[column 4:] strand (0: undefined, 1: +, 2: -) -\item[column 5:] intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT -\item[column 6:] 0: unannotated, 1: annotated (only if splice junctions database is used) -\item[column 7:] number of uniquely mapping reads crossing the junction -\item[column 8:] number of multi-mapping reads crossing the junction -\item[column 9:] maximum spliced alignment overhang -\end{itemize} -The filtering for this output file is controlled by the \optn{outSJfilter*} parameters, as described in \sechyperref{Output_Filtering:_Splice_Junctions}. -%\subsection{Wiggle/bedGraph.} - -\section{Chimeric and circular alignments.} -To switch on detection of chimeric (fusion) alignments (in addition to normal mapping), \opt{chimSegmentMin} should be set to a positive value. Each chimeric alignment consists of two "segments". Each segment is non-chimeric on its own, but the segments are chimeric to each other (i.e. the segments belong to different chromosomes, or different strands, or are far from each other). Both segments may contain splice junctions, and one of the segments may contain portions of both mates. \opt{chimSegmentMin} parameter controls the minimum mapped length of the two segments that is allowed. For example, if you have 2x75 reads and used \opt{chimSegmentMin} 20, a chimeric alignment with 130b on one chromosome and 20b on the other will be output, while 135 + 15 won't be. -\subsection{STAR-Fusion.} -STAR-Fusion is a software package for detecting fusion transcript from STAR chimeric output. It is developed and maintained by Brian Haas. Please visit its GitHub page for instructions and documentation: \url{https://github.com/STAR-Fusion/STAR-Fusion}. - -\subsection{Chimeric alignments in the main BAM files.} -Chimeric alignments can be included together with normal alignments in the main (sorted or unsorted) BAM file(s) using \opt{chimOutType} \optv{WithinBAM}. In these files, formatting of chimeric alignments follows the latest SAM/BAM specifications. - -\subsection{Chimeric alignments in \ofilen{Chimeric.out.sam} .} -When chimeric detection is switched on, STAR will output normal alignments into \ofilen{Aligned.*.sam/bam}, and will output chimeric alignments into a separate file \ofilen{Chimeric.out.sam}. -Some reads may be output to both normal SAM/BAM files, and \ofilen{Chimeric.out.sam} for the following reason. STAR will output a non-chimeric alignment into \ofilen{Aligned.out.sam} with soft-clipping a portion of the read. If this portion is long enough, and it maps well and uniquely somewhere else in the genome, there will also be a chimeric alignment output into \ofilen{Chimeric.out.sam}. For instance, if you have a paired-end read where the second mate can be split chimerically into 70 and 30 bases. The 100b of the first mate + 70b of the 2nd mate map non-chimerically,and the mapping length/score are big enough, so they will be output into \ofilen{Aligned.out.sam} file. At the same time, the chimeric segments 100-mate1 + 70-mate2 and 30-mate2 will be output into \ofilen{Chimeric.out.sam}. - -\subsection{Chimeric alignments in \ofilen{Chimeric.out.junction}} -In addition to \ofilen{Chimeric.out.sam}, STAR will generate \ofilen{Chimeric.out.junction} file which maybe more convenient for downstream analysis. -The format of this file is as follows. Every line contains one chimerically aligned read, e.g.: -\begin{verbatim} -chr22 23632601 + chr9 133729450 + 1 0 0 -SINATRA-0006:3:3:6387:5665#0 23632554 47M29S 133729451 47S29M40p76M -\end{verbatim} - -The first 9 columns give information about the chimeric junction: -\begin{itemize}[leftmargin=1in] -\item[column 1:] chromosome of the donor -\item[column 2:] first base of the intron of the donor (1-based) -\item[column 3:] strand of the donor -\item[column 4:] chromosome of the acceptor -\item[column 5:] first base of the intron of the acceptor (1-based) -\item[column 6:] strand of the acceptor -\item[column 7:] junction type: -1=encompassing junction (between the mates), 1=GT/AG, 2=CT/AC -\item[column 8:] repeat length to the left of the junction -\item[column 9:] repeat length to the right of the junction -\end{itemize} - -Columns 10-14 describe the alignments of the two chimeric segments, it is SAM like. Alignments are given with respect to the (+) strand -\begin{itemize}[leftmargin=1in] -\item[column 10:] read name -\item[column 11:] first base of the first segment (on the + strand) -\item[column 12:] CIGAR of the first segment -\item[column 13:] first base of the second segment -\item[column 14:] CIGAR of the second segment -\end{itemize} - -Unlike standard SAM, both mates are recorded in one line here. The gap of length \code{L} between the mates is marked by the \code{p} in the CIGAR string. -If the mates overlap, \code{L<0}. - -For strand definitions, when aligning paired end reads, the sequence of the second mate is reverse complemented. - -For encompassing junctions, i.e. junction type: -1=junction is between the mates, columns 2 and 5 represent the bounds on the chimeric junction loci. For the 1st mate, it will be the genomic base following the last 3' mapped base. For the 2nd mate (which is reverse complemented to have the same orientation as 1st mate), it will be the genomic base preceding the 5' mapped base. For example, if there is a chimeric junction that connects chr1/+strand/base1000 to chr2/+strand/base2000, and read 1 maps to chr1/+strand/bases800-900, and read 2 (after reverse complementing) maps to chr2/+strand/bases2100-2200, then columns 2 and 5 will have 901 and 2099. - -To filter chimeric junctions and find the number of reads supporting each junction you could use, for example: -\begin{verbatim} -cat Chimeric.out.junction | -awk '$1!="chrM" && $4!="chrM" && $7>0 && $8+$9<=5 {print $1,$2,$3,$4,$5,$6,$7,$8,$9}' | -sort | uniq -c | sort -k1,1rn -\end{verbatim} -This will keep only the canonical junctions with the repeat length less than 5 and will remove chimeras with mitochondrion genome. - -When I do it for one of our K562 runs, I get: -\begin{verbatim} -181 chr1 144676873 - chr1 147917466 + 1 0 1 - 29 chr5 69515744 - chr5 34182973 - 1 3 1 - 28 chr1 143910077 - chr1 149459550 - 1 1 0 - 27 chr22 23632601 + chr9 133729450 + 1 0 0 - 20 chr12 90313405 - chr21 40684813 - 1 2 0 - 20 chr22 23632601 + chr9 133655755 + 1 0 1 - 20 chr9 123636256 - chr9 123578959 + 1 1 4 - 15 chr16 85589970 + chr6 16762582 + 1 3 2 - 15 chr3 197348574 - chr3 195392936 + 1 1 0 - 14 chr18 39584506 + chr18 39560613 - 1 2 0 -\end{verbatim} -Note that line 4 and 6 here are BCR/ABL fusions. You would need to filter these junctions further to see which of them connect known but not homologous genes. - - -\section{Output in transcript coordinates.} -With \opt{quantMode} \optv{TranscriptomeSAM} option STAR will output alignments translated into transcript coordinates in the \ofilen{Aligned.toTranscriptome.out.bam} file (in addition to alignments in genomic coordinates in \ofilen{Aligned.*.sam/bam} files). These transcriptomic alignments can be used with various transcript quantification software that require reads to be mapped to transcriptome, such as RSEM or eXpress. For example, RSEM command line would look as follows: \codelines{rsem-calculate-expression ... --bam Aligned.toTranscriptome.out.bam /path/to/RSEM/reference RSEM}. -Note, that STAR first aligns reads to entire genome, and only then searches for concordance between alignments and transcripts. I believe this approach might offer certain advantages compared to the alignment to transcriptome only, by not forcing the alignments to annotated transcripts. - -By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed. Use \opt{quantTranscriptomeBan} \optv{Singleend} to allow insertions, deletions ans soft-clips in the transcriptomic alignments, which can be used by some expression quantification software (e.g. eXpress). - -\section{Counting number of reads per gene.} -With \opt{quantMode} \optv{GeneCounts} option STAR will count number reads per gene while mapping. -A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-end read are checked for overlaps. -The counts coincide with those produced by htseq-count with default parameters. -This option requires annotations (GTF or GFF with --sjdbGTFfile option) used at the genome generation step, or at the mapping step. -STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options: -\begin{itemize}[leftmargin=1in] -\item[column 1:] gene ID -\item[column 2:] counts for unstranded RNA-seq -\item[column 3:] counts for the 1st read strand aligned with RNA (htseq-count option -s yes) -\item[column 4:] counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse) -\end{itemize} -Select the output according to the strandedness of your data. -Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. -With \opt{quantMode} \optv{TranscriptomeSAM} \optv{GeneCounts}, and get both the \ofilen{Aligned.toTranscriptome.out.bam} and \ofilen{ReadsPerGene.out.tab} outputs. - - -\section{2-pass mapping.} - -For the most sensitive novel junction discovery,I would recommend running STAR in the 2-pass mode. It does not increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping. - -\subsection{Multi-sample 2-pass mapping.} -For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples. -\begin{enumerate} -\item Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either a the genome generation step, or mapping step. -\item Run 2nd mapping pass for all samples , listing SJ.out.tab files from all samples in \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj1.tab /path/to/sj2.tab ...}. -\end{enumerate} - -\subsection{Per-sample 2-pass mapping.} - -Annotated junctions will be included in both the 1st and 2nd passes. -To run STAR 2-pass mapping for each sample separately, use \opt{twopassMode} \optv{Basic} option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass. This option can be used with annotations, which can be included either at the run-time (see \#1), or at the genome generation step. - -\opt{twopass1readsN} defines the number of reads to be mapped in the 1st pass. The default and most sensitive approach is to set it to -1 (or make it bigger than the number of reads in the sample) - in which case all reads in the input read file(s) are used in the 1st pass. While it can reduce mapping time by $\sim40\%$, it is not recommended to use a small portion of the reads in the 1st step, since it will significantly reduce sensitivity for the low expressed novel junctions. The idea to use a portion of the reads in the 1st pass was inspired by Kim, Langmead and Salzberg in Nature Methods 12, 357–360 (2015). - -\subsection{2-pass mapping with re-generated genome.} -This is the original 2-pass method which involves genome re-generation step in-between 1st and 2nd passes. Since 2.4.1a, it is recommended to use the on the fly 2-pass options as described above. -\begin{enumerate} -\item Run 1st pass STAR for all samples with "usual" parameters. Genome indices generated with annotations are recommended. -\item Collect all junctions detected in the 1st pass by merging \ofilen{SJ.out.tab} files from all runs. Filter the junctions by removing likelie false positives, e.g. junctions in the mitochondrion genome, or non-canonical junctions supported by a few reads. If you are using annotations, only novel junctions need to be considered here, since annotated junctions will be re-used in the 2nd pass anyway. -\item Use the filtered list of junctions from the 1st pass with \opt{sjdbFileChrStartEnd} option, together with annotations (via \opt{sjdbGTFfile} option) to generate the new genome indices for the 2nd pass mapping. This needs to be done only once for all samples. -\item Run the 2nd pass mapping for all samples with the new genome index. -\end{enumerate} - - -\section{Description of all options.}\label{Description_of_all_options} -For each STAR version, the most up-to-date information about all STAR parameters can be found in the \code{parametersDefault} file in the STAR source directory. The parameters in the \code{parametersDefault}, as well as in the descriptions below, are grouped by function: -\begin{itemize} -\item[] Special attention has to be paid to parameters that start with \optn{out*}, as they control the STAR output. -\item[] In particular, \optn{outFilter*} parameters control the filtering of output alignments which[] you might want to tweak to fit your needs. -\item[] Output of “chimeric” alignments is controlled by \optn{chim*} parameters. -\item[] Genome generation is controlled by \optn{genome*} parameters. -\item[] Annotations (splice junction database) are controlled by \optn{sjdb*} options at the genome generation step. -\item[] Tweaking \optn{score*}, \optn{align*}, \optn{seed*}, \optn{win*} parameters, which requires understanding of the STAR alignment algorithm, is recommended only for advanced users. -\end{itemize} - -Below, allowed parameter values are typed in magenta, and default values - in blue. - - - - -\newcommand{\pright}[1]{\begin{flushright} \begin{minipage}{0.8\textwidth}\raggedright #1 \end{minipage} \end{flushright}} - -\newcommand{\optSection}[1]{\subsection{#1}} -\newcommand{\optName}[1]{\hypertarget{#1}{\textcolor{violet}{\texttt{--#1}}}} -\newcommand{\optValue}[1]{\pright{\textcolor[rgb]{0,0.5,0}{default: \texttt{#1}}}} -\newcommand{\optLine}[1]{\pright{#1}} -\newenvironment{optTable}{}{} -\newenvironment{optOptTable}{\begin{flushright} \begin{minipage}{0.75\textwidth}\raggedright }{\end{minipage}\end{flushright}} - -\newcommand{\optOpt}[1]{\textcolor{blue}{\texttt{#1}}\par} -\newcommand{\optOptLine}[1]{{\hangpara{0.3in}{0}#1\par}} - -\input{parametersDefault.tex} - - -\end{document} diff --git a/source/VERSION b/source/VERSION index 5f615494..bd08704d 100644 --- a/source/VERSION +++ b/source/VERSION @@ -1 +1 @@ -#define STAR_VERSION "STAR_2.5.0c_modified" +#define STAR_VERSION "STAR_2.5.1a"