assignBarcodes is a fast, parallelized utility designed for targeted sequencing analysis in single-cell experiments. It efficiently assigns feature barcodes from FASTQ files to a known set of sequence barcodes, serving as a powerful, open-source alternative to proprietary tools.
Key features include:
- Exhaustive Search: Unlike other tools, assignBarcodes can perform an exhaustive search, enabling it to identify feature barcodes in both ATAC-seq and RNA-seq data, significantly increasing coverage in targeted sequencing.
- Advanced Error Correction: Implements customizable error correction for sequence barcodes, handling substitutions and indels, inspired by methodologies used in tools like CellRanger.
- Fuzzy Matching: Provides fuzzy matching capabilities for feature sequences to account for sequencing errors.
- UMI Deduplication: Correctly handles UMI-based deduplication with strategies tailored for different sequencing assays (e.g., targeted vs. whole-transcriptome). It intelligently handles UMI sequencing errors by identifying and collapsing connected cliques of similar UMIs.
- High Performance: Achieves high processing speeds (e.g., ~1 million reads/second on a standard laptop for targeted sequencing) through multi-level parallelization.
On Ubuntu/Debian:
sudo apt-get update
sudo apt-get -y install build-essential zlib1g-dev libglib2.0-dev libcairo2-dev
cd process_features
make
sudo cp assignBarcodes /usr/local/bin
If heatmap generation is not required, you can omit the libcairo2-dev dependency and compile using:
make NO_HEATMAP=1
Alternatively, you can use the Docker container biodepot/process_features:latest or build it from the provided Dockerfile.
assignBarcodes [options] [directory1 directory2 ...]
The tool can accept input FASTQ files in two ways:
- By directory: Positional arguments specifying directories containing FASTQ files. The tool will search for files matching the patterns for barcode, forward, and reverse reads.
- By file list: Using --barcode_fastqs, --forward_fastqs, and --reverse_fastqs to provide comma-separated lists of files.
Flag | Argument | Description | Default |
---|---|---|---|
-w, --whitelist | [filename] | Whitelist file for sequence barcodes. | (required) |
-f, --featurelist | [filename] | Feature list file (CSV with 'name' and 'sequence' columns). | (required) |
-d, --directory | [path] | Base output directory. A subdirectory will be created for each sample. | (required) |
--filtered_barcodes | [filename] | A file containing a list of barcodes to process, one per line. If provided, only these barcodes will be processed. | |
--min_heatmap | [int] | Minimum deduped count for a feature to be included in the Feature Counts Heatmap. | 0 |
--barcode_fastqs | [string] | Comma-separated list of barcode FASTQ files. | |
--forward_fastqs | [string] | Comma-separated list of forward read FASTQ files. | |
--reverse_fastqs | [string] | Comma-separated list of reverse read FASTQ files. | |
--barcode_fastq_pattern | [string] | Pattern to identify barcode FASTQ files in directories. | _R1_ |
--forward_fastq_pattern | [string] | Pattern to identify forward read FASTQ files. | _R2_ |
--reverse_fastq_pattern | [string] | Pattern to identify reverse read FASTQ files. | _R3_ |
-k, --keep_existing | | If output files exist, skip processing for that sample. | false |
Flag | Argument | Description | Default |
---|---|---|---|
-b, --barcode_length | [int] | Length of the sequence barcode. | 16 |
-u, --umi_length | [int] | Length of the Unique Molecular Identifier (UMI). | 12 |
-o, --feature_constant_offset | [int] | Expected starting position of the feature sequence in the read. Used for an initial directed search. | 0 |
-B, --barcode_constant_offset | [int] | Starting position of the barcode and UMI in the read. | 0 |
--limit_search | [int] | Limit the search for the feature sequence to N bases around feature_constant_offset. Set to -1 to search the entire read. | -1 |
-r, --reverse_complement_whitelist | | Reverse complement the whitelist barcodes before use. | false |
-a, --as_named | | Treat all input files as part of a single sample. | false |
Flag | Argument | Description | Default |
---|---|---|---|
-m, --maxHammingDistance | [int] | Maximum Hamming distance for a feature sequence to be considered a match. | 1 |
-s, --stringency | [int] | Stringency for UMI deduplication. See UMI de-duplication section for details. | 1 |
-i, --min_counts | [int] | Minimum read count for a UMI clique to be considered for counting. | 1 |
-M, --min_posterior | [float] | Minimum posterior probability to rescue a barcode with sequencing errors. | 0.975 |
--max_barcode_mismatches | [int] | Maximum number of mismatches allowed to rescue a sequence barcode. | 3 |
--feature_n | [int] | Maximum number of 'N' bases allowed in a feature sequence. | 3 |
--barcode_n | [int] | Maximum number of 'N' bases allowed in a sequence barcode. | 1 |
--max_reads | [long] | Maximum number of reads to process from each FASTQ file. | 0 (all) |
--min_prediction | [int] | Minimum prediction threshold for feature assignment (advanced, rarely needed). | 1 |
The EM fitting functionality has been removed from this version. Feature assignment now uses direct counting only.
Flag | Argument | Description | Default |
---|---|---|---|
-t, --threads | [int] | Maximum number of concurrent processes (samples to process in parallel). | 8 |
-S, --search_threads | [int] | Manually set the number of threads for the feature search step (per consumer thread). Overrides automatic allocation. | 4 |
-c, --consumer_threads_per_set | [int] | Manually set the number of consumer threads per sample. Overrides automatic allocation. | 1 |
-R, --read_buffer_lines | [int] | Number of lines for the read buffer. | 1024 |
-L, --average_read_length | [int] | Estimated average read length for buffer allocation. | 300 |
Flag | Argument | Description | Default |
---|---|---|---|
-v, --debug | | Enable verbose debug output. | false |
--translate_NXT | | Complement positions 8 and 9 of cell barcodes at output/filter stages. | false |
./assignBarcodes \
-d ./output_dir/ \
-w /path/to/10x_whitelist.txt \
-f /path/to/features.csv \
-T 32 \
-t 8 \
-m 5 \
-u 12 \
-o 26 \
--limit_search 5 \
--barcode_fastq_pattern R1 \
--forward_fastq_pattern R2 \
/path/to/sample1_fastqs/ \
/path/to/sample2_fastqs/
This command processes two samples located in separate directories. It uses 32 available threads, forking up to 8 sample-processing jobs at a time.
For targeted sequencing, most of the read sequences will have a constant start sequence of fixed length. assignBarcodes attempts to make a match here first. If no match is found, it then performs a more expensive exhaustive search. The scope of this search can be controlled with the --limit_search flag.
The exhaustive search checks the entire read against all the feature barcodes at all possible starting positions in the read. For ATAC-seq the search is done in both orientations. We use a novel method that converts the query and match sequences to bitcodes. Hamming distances are then evaluated on chunks of 4 base pairs using bitwise operations and a lookup table, a form that the compiler can vectorize for even greater speedup. Additionally, the search is broken down into four independent subsearches which are performed in parallel, for a 16x speedup over the simple Hamming search.
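The bitcode comparison described above can be illustrated with a minimal sketch. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3) packed four bases per byte and a 256-entry table indexed by the XOR of two packed bytes; the names `init_mismatch_lut`, `pack_seq`, and `packed_hamming` are hypothetical and are not taken from the assignBarcodes source, whose actual encoding and table layout may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of mismatched bases for each possible XOR of two packed bytes. */
static uint8_t mismatch_lut[256];

/* Fill the table once: a 2-bit field that is non-zero marks one mismatch. */
static void init_mismatch_lut(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t n = 0;
        for (int k = 0; k < 4; k++)
            if ((x >> (2 * k)) & 3)
                n++;
        mismatch_lut[x] = n;
    }
}

/* Assumed encoding for this sketch: A=0, C=1, G=2, T=3; anything else is rejected. */
static int base_code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Pack len bases into (len + 3) / 4 bytes. Returns 0 on success, -1 on a non-ACGT base. */
static int pack_seq(const char *seq, size_t len, uint8_t *out)
{
    for (size_t i = 0; i < len; i++) {
        int c = base_code(seq[i]);
        if (c < 0)
            return -1;
        if (i % 4 == 0)
            out[i / 4] = 0;
        out[i / 4] |= (uint8_t)(c << (2 * (i % 4)));
    }
    return 0;
}

/* Hamming distance in bases: one XOR and one table lookup per 4-base chunk. */
static int packed_hamming(const uint8_t *a, const uint8_t *b, size_t nbytes)
{
    int d = 0;
    for (size_t i = 0; i < nbytes; i++)
        d += mismatch_lut[a[i] ^ b[i]];
    return d;
}
```

The loop in `packed_hamming` has no branches and touches only contiguous bytes, which is the kind of structure a compiler can usually vectorize. Call `init_mismatch_lut()` once before the first comparison.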
The error correction handles Ns (unknown base pairs) and sequencing errors. To account for sequencing errors, a barcode that differs by at most 1 base pair from a single valid barcode is assigned to that barcode. If there are multiple candidate barcodes, we use the quality scores and the number of barcode variants observed to find the most likely match based on the posterior probability. This approach is described in the Cell Ranger documentation.
To handle Ns, the user specifies the maximum number of Ns (--barcode_n) that are tolerated. Every possible base is substituted for each N, and the resulting candidates are compared against the whitelist to see whether a unique barcode is found.
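The N-substitution step can be sketched as follows. This is an illustration under stated assumptions rather than the tool's code: `resolve_n_barcode` and `whitelist_index` are hypothetical names, and the linear whitelist scan stands in for whatever lookup structure assignBarcodes actually uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of bc in the whitelist, or -1 if absent. */
static int whitelist_index(const char *bc, const char *const *wl, int n_wl)
{
    for (int i = 0; i < n_wl; i++)
        if (strcmp(bc, wl[i]) == 0)
            return i;
    return -1;
}

/* Substitute A/C/G/T for every N (at most max_n of them) and return the
 * whitelist index when exactly one whitelist barcode matches, else -1. */
static int resolve_n_barcode(const char *bc, const char *const *wl,
                             int n_wl, int max_n)
{
    char buf[64];
    int n_pos[8], n_count = 0;
    size_t len = strlen(bc);

    if (len >= sizeof buf)
        return -1;
    for (size_t i = 0; i < len; i++) {
        if (bc[i] != 'N')
            continue;
        if (n_count >= max_n || n_count >= 8)
            return -1;                      /* too many Ns to rescue */
        n_pos[n_count++] = (int)i;
    }
    memcpy(buf, bc, len + 1);

    int found = -1;
    int combos = 1 << (2 * n_count);        /* 4^n_count substitutions */
    for (int c = 0; c < combos; c++) {
        for (int k = 0; k < n_count; k++)
            buf[n_pos[k]] = "ACGT"[(c >> (2 * k)) & 3];
        int idx = whitelist_index(buf, wl, n_wl);
        if (idx < 0)
            continue;
        if (found >= 0 && found != idx)
            return -1;                      /* ambiguous: two barcodes fit */
        found = idx;
    }
    return found;
}
```

The number of candidates grows as 4 to the power of the number of Ns, which is one reason to keep the tolerated number of Ns small.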
To handle sequencing errors, the user specifies a maximum Hamming distance (-m). If a sequence matches a feature barcode within the maximum Hamming distance, and that feature barcode is the unique closest match, the sequence is assigned to it. For Ns, up to a maximum specified by the user (--feature_n), all possible substitutions are generated and checked against the feature sequences. If there is a unique best match (minimum Hamming distance) that is less than or equal to the maximum Hamming distance, the sequence is assigned to that feature barcode. Assignments are tentative, pending the completion of the comprehensive search (unless there is an exact match). If there is no exact match, the comprehensive search attempts to find a better match.
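A minimal sketch of the "unique best match within the maximum Hamming distance" rule is shown below. The function names are hypothetical, the comparison is a plain character-by-character Hamming distance, and N expansion is left out for brevity.

```c
#include <stddef.h>

/* Plain Hamming distance between two sequences of the same length. */
static int seq_hamming(const char *a, const char *b, size_t len)
{
    int d = 0;
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i])
            d++;
    return d;
}

/* Return the index of the feature that is the unique closest match to query
 * with distance <= max_dist, or -1 when there is no match or a tie. */
static int best_unique_feature(const char *query, const char *const *features,
                               int n_features, size_t len, int max_dist)
{
    int best = -1, best_d = max_dist + 1, ties = 0;
    for (int i = 0; i < n_features; i++) {
        int d = seq_hamming(query, features[i], len);
        if (d < best_d) {
            best_d = d;
            best = i;
            ties = 1;
        } else if (d == best_d && d <= max_dist) {
            ties++;                 /* another feature at the same distance */
        }
    }
    return (best >= 0 && ties == 1) ? best : -1;
}
```

Returning -1 on a tie mirrors the requirement that the best match be unique; a tied read stays unassigned instead of being given to an arbitrary feature.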
Feature assignment now uses direct counting without probabilistic modeling. Each read that matches a feature sequence within the allowed Hamming distance is directly counted towards that feature. This provides a simpler, more transparent approach to feature quantification.
Unique Molecular Identifiers (UMIs) are random sequences included with barcodes that are used to identify groups of reads that are duplicates of the same original molecule. De-duplicating counts using UMIs gives a more accurate reflection of the relative abundance of the originating sequences.
The de-duping algorithm depends on the methodology used.
In scRNA-seq and bulk RNA-seq, counts assigned to a sequence barcode-umi are collapsed only if they map to the same position. This is a relatively rare event that occurs only due to the large number of reads. For CRISPR-targeted sequencing, there is a much lower number of possible mappings, so it is very common for a barcode-umi to map to multiple feature sequences, which requires a more complicated strategy. assignBarcodes gathers counts for all the sequences associated with a barcode-umi and applies a user-selected option that determines how the counts are de-duped.
Sequencing errors in UMIs can occur, and a single-base sequencing error in a UMI is much more likely than two random UMIs that differ by 1 base pair. To account for this, assignBarcodes aggregates barcode-umi sets that differ by 1 base pair in the UMI into a connected component.
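Grouping UMIs that differ by one base into connected components can be sketched with a small union-find. This is an illustrative example, not the tool's implementation: the function names are hypothetical, and the all-pairs comparison is quadratic in the number of UMIs, which is only reasonable for the handful of UMIs seen under one sequence barcode.

```c
#include <stddef.h>

/* Returns 1 when two equal-length UMIs differ at exactly one position. */
static int umi_differs_by_one(const char *a, const char *b, size_t len)
{
    int d = 0;
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i] && ++d > 1)
            return 0;
    return d == 1;
}

/* Union-find root lookup with path halving. */
static int uf_find(int *parent, int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Label every UMI with the root index of its connected component.
 * Returns the number of components. */
static int umi_components(const char *const *umis, int n, size_t len, int *label)
{
    for (int i = 0; i < n; i++)
        label[i] = i;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (umi_differs_by_one(umis[i], umis[j], len)) {
                int ri = uf_find(label, i), rj = uf_find(label, j);
                if (ri != rj)
                    label[rj] = ri;
            }
    int n_comp = 0;
    for (int i = 0; i < n; i++) {
        label[i] = uf_find(label, i);
        if (label[i] == i)
            n_comp++;
    }
    return n_comp;
}
```

Note that membership is transitive: AAAA and AATT differ by two bases, yet both end up in the same component when AAAT is also present.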
Once the connected component is formed, the counts are aggregated based on two variables, stringency and minimum counts, which the user provides with the -s and -i flags.
- --stringency 0: RNA-seq strategy. Any feature with at least --min_counts gets a single deduped count for that UMI clique. This is the only case where a UMI clique can yield counts for multiple features.
- --stringency 1-999: Finds the feature with the highest count. If there is a unique winner, and its count is greater than total_counts * (stringency / 1000) and total_counts > min_counts, the feature gets a count of 1. Otherwise, no count is assigned.
- --stringency >=1000: The most stringent option. A count is assigned only if a single feature is detected within the UMI clique and its raw count is greater than min_counts.
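The three stringency regimes can be written out as a short decision function. The sketch below follows the rules exactly as stated in the list above; `dedup_clique` is a hypothetical name, `counts` holds the raw read count per feature within one UMI clique, and `out` receives a 0/1 deduped count per feature.

```c
/* Apply the stringency rules to one UMI clique.
 * Returns the number of features that received a deduped count. */
static int dedup_clique(const unsigned *counts, int n_features,
                        int stringency, unsigned min_counts,
                        unsigned char *out)
{
    unsigned total = 0, best = 0;
    int winner = -1, n_best = 0, n_nonzero = 0, assigned = 0;

    for (int i = 0; i < n_features; i++) {
        out[i] = 0;
        total += counts[i];
        if (counts[i] > 0)
            n_nonzero++;
        if (counts[i] > best) {
            best = counts[i];
            winner = i;
            n_best = 1;
        } else if (counts[i] == best && best > 0) {
            n_best++;               /* tie for the highest count */
        }
    }

    if (stringency <= 0) {
        /* RNA-seq strategy: every observed feature with at least
         * min_counts reads gets one deduped count. */
        for (int i = 0; i < n_features; i++)
            if (counts[i] > 0 && counts[i] >= min_counts) {
                out[i] = 1;
                assigned++;
            }
        return assigned;
    }
    if (winner < 0)
        return 0;                   /* empty clique */
    if (stringency >= 1000) {
        /* Only a clique with a single detected feature is counted. */
        if (n_nonzero == 1 && best > min_counts) {
            out[winner] = 1;
            return 1;
        }
        return 0;
    }
    /* 1-999: a unique winner must exceed the stringency fraction of reads. */
    if (n_best == 1 &&
        (double)best > (double)total * stringency / 1000.0 &&
        total > min_counts) {
        out[winner] = 1;
        return 1;
    }
    return 0;
}
```

For a clique with counts {5, 3, 0} and min_counts 1, stringency 0 assigns two features, stringency 500 assigns only the first (5 > 8 × 0.5), and stringency 700 assigns none (5 is not greater than 8 × 0.7).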
There are two main levels of parallelization used in assignBarcodes:
- Process-Level Parallelism: For handling multiple samples, assignBarcodes can fork a separate process for each sample. The maximum number of concurrent processes is controlled by -t. This is highly efficient for processing large datasets with many samples.
- Thread-Level Parallelism: Within each sample's process, a multi-threaded producer-consumer model is used.
  - Producer-Consumer Model: One thread reads the FASTQ files (barcode, forward, and reverse reads) and populates a buffer. Multiple consumer threads pull data from this buffer to perform barcode processing and feature assignment.
  - Parallel Hamming Search: The exhaustive search for feature sequences is parallelized using OpenMP. The search is broken down into four independent sub-searches that are executed concurrently. The number of threads for this search can be controlled with -S.
The number of consumer threads is managed with the -c flag. This two-level parallel architecture ensures high performance by maximizing CPU utilization across multiple cores and machines.
Sequence barcodes (--whitelist) are provided one barcode per line. Standard sequence barcode whitelists from 10x work fine.
Feature barcodes (--featurelist) should be provided as a comma-separated file with a header line. The header line must contain a 'sequence' field and a 'name' field. Other fields are ignored.
The assignments are written in Matrix Market format, which consists of 3 files listing the barcodes, the features, and the count matrix.
In the output directory for each sample, stats.txt contains run statistics.
Total feature counts 31993392
Total deduped feature counts 6765455
Total unique barcode UMIs 9148050
Total whitelisted barcodes 259694
Total_unmatched_reads 7236844
Percentage reads assigned to barcode 81.5529
Each assigned feature sequence is listed in the feature_sequences.txt file.
Feature Index Sequence Hamming_distance Counts Feature_name
1 CAACTGCGTCCATGAAACAATAGACGCAGTTGAGAGTGGC 0 11 5_PDX1
2 GGTATGTGAACATACAACATAGGAGTTGGTTACAAGGAAT 0 32 12_PAX6
2 GGTATGTGAACATACAACATAGaAGTTGGTTACAAGGAAT 1 1 12_PAX6
...
Each of the matched sequences is displayed under the feature index that they are matched to. Mismatches relative to the reference feature are shown in lowercase.
An interactive HTML plot file named umi_counts_histogram.html is generated in each sample's output directory. This plot displays the cumulative histogram of feature counts per barcode. Earlier versions overlaid an Expectation-Maximization (EM) model fit on this histogram; because the EM fitting functionality has been removed from this version, the EM-specific items in the list (marked "earlier versions") describe only that earlier behavior.
Key features of this plot include:
- Interactive Scales: A dropdown menu allows switching the Y-axis between linear and logarithmic scales for better visualization of count distributions.
- Component Visualization (earlier versions): The individual components of the EM fit (e.g., noise, signal, multiplets) are plotted as separate lines.
- Cutoff Lines (earlier versions): Vertical lines indicate the minimum and maximum signal cutoffs determined by the EM model.
- Detailed Information: Hovering over the plot provides detailed information about the observed counts.
The repository is organized into the following main directories:
- src/: Contains all the C source code files.
  - main.c: The main entry point of the application, handles command-line argument parsing and orchestrates the overall workflow.
  - assignBarcodes.c: Core logic for barcode assignment, error correction, and feature matching.
  - EMfit.c: Removed - Expectation-Maximization algorithm no longer used.
  - plot_histogram.c: Functions for generating interactive QC histograms.
  - io.c: Functions related to reading FASTQ files and handling input.
  - memory.c: Memory management utilities, including memory pools for efficient allocation.
  - queue.c: Implementation of a queue data structure used for parallel processing.
  - utils.c: Helper functions used across the application.
  - globals.c: Definitions of global variables.
  - heatmap.c: Functions for generating QC heatmap images.
  - plasma_colormap_16.h, plasma_colormap_64.h, plasma_colormap_256.h, plasma_colormap_1024.h: Color map definitions for heatmaps.
- include/: Contains all the header files.
  - common.h: Common headers, structs, and macros used throughout the project.
  - prototypes.h: Function prototypes for functions defined in the src directory.
  - globals.h: Header for global variables.
  - io.h: Header for I/O functions.
  - memory.h: Header for memory management utilities.
  - queue.h: Header for the queue data structure.
  - utils.h: Header for utility functions.
  - plot_histogram.h: Header for histogram plotting functions.
  - heatmap.h: Header for heatmap generation functions.
  - EMfit.h: Removed - EM fitting header no longer needed.
  - process.h: Header for process management.
- scripts/: Contains utility scripts for testing and other purposes.
- graphics/: Contains image files used in the documentation.
- Makefile: The main makefile for compiling the project.
- Dockerfile: For building the Docker container.
An interactive HTML plot (umi_counts_histogram.html) is generated in each sample's output directory. This plot provides a simple visualization of UMI count distributions for quality control purposes.
Key features:
- Simple Histogram: Shows the frequency distribution of UMI counts across all features.
- Interactive Display: Basic Plotly-based visualization with hover information.
- Quality Assessment: Helps identify overall count distribution patterns and potential data quality issues.
A heatmap image (Feature_counts_heatmap.png) is generated for each sample. In this heatmap:
- Rows: Features.
- Columns: UMI counts (starting from 1).
- Color Intensity: Number of barcodes with that UMI count for the feature.
- Bar Graph: Above the heatmap, a bar graph shows the total number of barcodes for each UMI count across all features.
- Color Bar: Indicates the scale of counts.
- Filtering: Only features with deduped counts above the threshold set by --min_heatmap are shown.
This heatmap provides a visual summary of the count distribution for each feature, helping to identify features with abnormal count profiles or multiplet artifacts.
A second heatmap (Feature_types_heatmap.png) is generated for each sample to visualize feature richness. In this heatmap:
- Rows: Features.
- Columns: The total number of unique feature types present in a barcode (richness level).
- Color Intensity: The number of barcodes where the given feature (row) was observed that contained a specific total number of feature types (column).
- Bar Graph: Above the heatmap, a bar graph shows the total number of barcodes for each richness level across all features.
- Color Bar: Indicates the scale of counts.
- Filtering: Only features with at least one observed count are shown.
This heatmap helps visualize the complexity of features within single barcodes, which is useful for identifying potential multiplets and assessing the overall quality of the feature capture.
For more details on the plotting implementation, see src/plot_histogram.c and src/heatmap.c.
demux_fastq is a light-weight helper that groups raw FASTQ reads into per-sample folders based on an 8-base probe barcode ("probe 8-mer") embedded in one of the reads. It re-uses the same feature-matching core that powers assignBarcodes but is optimised for the simple one-to-one lookup required for demultiplexing.
# --probe_read: the probe 8-mer lives in R2
# --probe_offset: 0-based offset inside that read
# --threads: number of consumer threads
./demux_fastq \
--probe_barcodes tables/probe-barcodes-fixed-rna-profiling-rna.txt \
--sample_map tables/probe-barcode-to-sample-mapping.txt \
--outdir demux_out \
--probe_read R2 \
--probe_offset 68 \
--threads 4 \
<FASTQ_DIR>
The command above scans <FASTQ_DIR> for matching R1/R2/R3 files (or accepts explicit file lists via the CLI flags), extracts the probe barcode from each read, maps it back to a sample name, and writes the complete (potentially paired) reads into demux_out/<sample>/.
file | description |
---|---|
probe_barcodes | TSV with at least three columns: variant 8-mer, unused, human-readable name. Only the first and third columns are used. |
sample_map | TSV mapping a library ID + probe name to a final sample identifier. First column must contain the library ID (e.g. SC123 ), second the probe name (BC001 ), third the sample name. |
flag | arg | description | default |
---|---|---|---|
--probe_barcodes | path | Required. See above. | – |
--sample_map | path | Required. See above. | – |
--outdir | dir | Output directory. One sub-dir per sample will be created. | . |
--probe_read | enum | Which read contains the probe 8-mer. One of R1, R2, R3. | R2 |
--probe_offset | int | 0-based offset of the 8-mer inside the chosen read. | – |
--barcode_fastqs, --forward_fastqs, --reverse_fastqs | list | Comma-separated explicit FASTQ lists. Skip autodetection. | – |
--barcode_fastq_pattern, --forward_fastq_pattern, --reverse_fastq_pattern | str | Filename substrings to locate FASTQs when scanning directories. | _R1_, _R2_, _R3_ |
--direct_search | – | Force a direct 64-bit compare of 8-mers instead of the hash lookup. By default the program auto-chooses the direct path when the number of probe variants ≤128. | |
--max_records | int | Stop after N reads (dev/debug). 0 = no limit. | 0 |
--threads | int | Number of consumer threads that process the shared read buffer. One producer thread is always used. | 1 |
-v, --debug | – | Verbose debug output. | off |
- K-mer extraction – For every read the probe 8-mer is sliced out (offset controlled by --probe_offset). Non-ACGT bases abort the lookup.
- Direct vs hash lookup – Implemented in barcode_match.c::feature_lookup_kmer().
  - Direct 64-bit path – On machines with unaligned-load tolerance the 8 bases are copied into a uint64_t, looked up in a lazily-built parallel array of pre-converted probe 8-mers (uint64_t[]). O( #variants ) linear search; fast because the array ≤128 elements and the compare is one CPU instruction.
  - Hash path – Falls back to a GLib GHashTable keyed by a GBytes wrapper around the packed 2-bit encoding. Constant-time for large variant sets.
- Producer/consumer I/O –
  - Producer thread streams R1/R2/R3 with zlib (gzgets) and writes full read blocks (all 4 FASTQ lines × present reads) into a ring buffer.
  - Multiple consumer threads pull blocks, perform probe lookup, resolve the sample, and write to gzipped sinks.
  - Per-sink mutexes guard concurrent writes so different threads can write different samples in parallel while preventing interleaving within a file.
- Output sinks – Created lazily in get_or_open_sink(). One sink per sample, filenames re-use the original read basename. Closed and destroyed after each input set to keep file descriptors bounded.
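The direct 64-bit path can be illustrated with a minimal sketch. It is not the code in barcode_match.c: `kmer_to_u64` and `lookup_probe_direct` are hypothetical names, the probe table is assumed to have been built with the same conversion, and `memcpy` is used for the 8-byte load so the example does not depend on unaligned-access behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the 8 ASCII bases of a probe 8-mer into one 64-bit word. */
static uint64_t kmer_to_u64(const char *s)
{
    uint64_t v;
    memcpy(&v, s, sizeof v);
    return v;
}

/* Look up the 8-mer at `offset` in `read`.
 * Returns a 1-based probe index, or 0 when there is no match, the read is
 * too short, or the 8-mer contains a non-ACGT base. */
static int lookup_probe_direct(const char *read, size_t read_len, size_t offset,
                               const uint64_t *table, int n_probes)
{
    if (offset + 8 > read_len)
        return 0;
    for (int k = 0; k < 8; k++) {
        char c = read[offset + k];
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T')
            return 0;               /* non-ACGT bases abort the lookup */
    }
    uint64_t key = kmer_to_u64(read + offset);
    for (int i = 0; i < n_probes; i++)
        if (table[i] == key)        /* one 64-bit compare per probe */
            return i + 1;
    return 0;
}
```

For a table of at most 128 entries the linear scan touches at most 1 KB of contiguous memory, which is why it can compete with a hash lookup at this size.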
The bridge functions added to barcode_match.c expose a slim API:
int feature_lookup_kmer(const char *seq, int len,
const struct feature_arrays *fa,
int direct_search);
demux_fastq relies exclusively on this call. Internally it delegates to:
- ensure_feature_u64() – builds the aligned uint64_t probe table on first use.
- feature_lookup_code() – shared hash lookup used by both demux_fastq and assignBarcodes.
Because both tools share the same feature arrays and global feature_code_hash, improvements in one place automatically benefit the other.
<run>/
sample_SC123_R1_001.fastq.gz # barcode read (16 bp bc + umi)
sample_SC123_R2_001.fastq.gz # forward read (probe 8-mer at offset 68)
sample_SC123_R3_001.fastq.gz # reverse read (may be empty)
After running the utility the output folder looks like:
demux_out/
undetermined/
sample_SC123_unk_R1_001.fastq.gz # reads without a valid probe
SampleA/
sample_SC123_R1_001.fastq.gz
sample_SC123_R2_001.fastq.gz
sample_SC123_R3_001.fastq.gz
SampleB/
...
With --threads 4 on a 4-core laptop the producer saturates ~200 MB/s decompression while consumers reach ~80 kreads/s each, fully utilising all cores. Direct 64-bit probe matching is ~35 % faster than the hash path for the default 128-variant table.
test_files/ contains large binary/FASTQ test datasets and is therefore ignored in version control. The root .gitignore already includes a test* wildcard entry, which covers test_files/ while still retaining useful scripts such as scripts/test_demux.sh. No additional ignore patterns are required.
demux_bam reads a STARsolo-aligned BAM and produces a Matrix-Market triplet where rows are sample-barcode probes (e.g. BC001), columns are cell barcodes (CB), and values are the number of unique UMIs that survived all filters (duplicates, non-primary, map-quality, etc.).
# --search_nearby: optional ±1 / ±2 fallback
# --count_intergene: optional, keep GX='-' reads
./demux_bam \
--bam input.bam \
--outdir out_probe_matrix \
--sample_probes tables/probe-barcodes-fixed-rna-profiling-rna.txt \
--probe_offset 68 \
--search_nearby \
--count_intergene \
-S 4 -t 1
flag | arg | description | default |
---|---|---|---|
--bam | path | Input BAM (STARsolo style) | – |
--outdir | dir | Output directory | . |
--sample_probes | path | TSV: variant-8-mer, unused, probe-name | – |
--probe_offset | int | 0-based offset of the 8-mer inside the read | 68 |
--search_nearby | – | Also test offsets +1, +2, -1, -2 | off |
--cb_tag | str | BAM tag for cell barcode | CB |
--ub_tag | str | BAM tag for UMI | UB |
--gene_tag | str | Preferred gene tag (fallback GE) | GX |
--count_intergene | – | Keep reads whose GX starts with '-' | off |
--save_read_to_cb | – | Write read_to_cb_umi_gene.txt map | off |
--hts_threads, -S | int | BAM BGZF threads | 2 |
--threads, -t | int | Consumer shards (currently 1 = single-thread) | 1 |
--min_mapq | int | Minimum MAPQ to keep | 0 |
--no_primary_filter | – | Keep secondary & supplementary | off |
--keep_dup | – | Keep PCR/optical duplicates | off |
--max_records | int | Process first N reads (debug) | 0 |
-v, --debug | – | Verbose diagnostics | off |
- barcodes.tsv – all CBs that received ≥1 probe count
- features.tsv – probe names (from --sample_probes)
- matrix.mtx – rows = probes, cols = CBs, vals = uint32 counts
- stats.txt – run summary (total / usable reads, per-probe totals)
- read_to_cb_umi_gene.txt (only with --save_read_to_cb)
- Tag extraction – pull CB, UB, and the preferred gene_tag (fallback GE). Skip the read if any tag is missing; also skip it when GX == '-' unless --count_intergene is set.
- Filtering
  - primary-alignment filter (unless --no_primary_filter)
  - duplicate filter via BAM_FDUP (unless --keep_dup)
  - MAPQ filter (--min_mapq)
- Probe lookup – slice the 8-mer at --probe_offset; if not found and --search_nearby is enabled, retry offsets ±1 / ±2 until located or exhausted.
- Counting – For every accepted read:
  - Take the probe index p returned by the lookup (1-based; 0 = no match).
  - Maintain a GHashTable<cb_id → uint32_t[n_probes]>.
  - The table entry is allocated from a memory pool; each element is a 32-bit counter, allowing up to 4.29 × 10⁹ UMIs per (CB, probe).
  - Increment arr[p-1] (only if p>0).
- Output – After all reads:
  - Write features.tsv, barcodes.tsv (alphabetical CB order), matrix.mtx (coordinate format, integer field).
  - Write summary stats and, if requested, the per-read map file.
Compared with the previous TripKey/majority-vote scheme this design:
- removes the heavy (CB,UB,Gene) dedup hash (60+ GB on large runs),
- relies on the BAM duplicate flag plus primary_only for deduplication,
- cuts memory to O(#active_CB × #probes × 4 bytes),
- eliminates ambiguous CB handling (each accepted read contributes directly to counts).
The previous TripKey/majority-vote version of demux_bam is described below.
./demux_bam \
--bam input.bam \
--outdir out_probe_matrix \
--sample_probes tables/probe-barcodes-fixed-rna-profiling-rna.txt \
--probe_offset 68 \
-S 2 -t 1
flag | arg | description | default |
---|---|---|---|
--bam | path | Input BAM (STARsolo-style) | – |
--outdir | dir | Output directory | . |
--sample_probes | path | TSV with columns: variant 8-mer, unused, BC name | – |
--probe_offset | int | 0-based offset of the 8-mer in the read | 68 |
--cb_tag | str | BAM tag for cell barcode | CB |
--ub_tag | str | BAM tag for UMI | UB |
--gene_tag | str | Preferred gene tag; fallback to GE if missing | GX |
--hts_threads / -S | int | BAM I/O threads | 2 |
--threads / -t | int | Consumer shards (single-thread in current stage) | 1 |
--min_mapq | int | Minimum MAPQ filter | 0 |
--no_primary_filter | – | Keep secondary/supplementary alignments | off |
--keep_dup | – | Keep PCR/optical duplicates | off |
--max_records | int | Process first N reads (debug) | 0 |
-v, --debug | – | Verbose logging | off |
- barcodes.tsv – CB strings (ambiguous CBs dropped by majority rule)
- features.tsv – probe names (BCxxx) loaded from --sample_probes
- matrix.mtx – rows: probes; cols: CBs; values: # unique UMIs after dedup and majority selection
- Extract per-read tags: CB, UB, and GX (fallback GE).
- Dedup key = (CB, UB, gene); pack UB into a 64-bit code; require A/C/G/T only.
- Probe detection: slice the 8-mer at --probe_offset from the read sequence; lookup with the shared k-mer matcher to obtain a probe index (0 = none).
- For each dedup triplet, accumulate votes per probe index observed among reads that map to it.
- At finalize, choose the majority probe index for each triplet; ties between non-zero indices mark the corresponding CB as ambiguous and it is excluded from outputs.
- Each non-ambiguous triplet contributes 1 to (CB, chosen_probe_index) in the final matrix.
Notes:
- This tool does not map probes to final sample IDs yet. It provides a probe×barcode matrix as an intermediate for downstream sample assignment.
- assignBarcodes is unaffected by this executable.