From 3ba48a7ad830a64c50ab6c9468557f86474bcb76 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 14 Aug 2025 17:55:54 +0100 Subject: [PATCH 01/48] Add AI-assisted Groovy essentials --- docs/side_quests/groovy_essentials.md | 1725 +++++++++++++++++ docs/side_quests/index.md | 1 + side-quests/groovy_essentials/README.md | 80 + .../data/metadata/analysis_parameters.yaml | 25 + .../groovy_essentials/data/samples.csv | 4 + .../data/sequences/sample_001.fastq | 12 + .../data/sequences/sample_002.fastq | 12 + .../data/sequences/sample_003.fastq | 12 + side-quests/groovy_essentials/main.nf | 555 ++++++ side-quests/groovy_essentials/nextflow.config | 39 + .../templates/analysis_script.sh | 27 + 11 files changed, 2492 insertions(+) create mode 100644 docs/side_quests/groovy_essentials.md create mode 100644 side-quests/groovy_essentials/README.md create mode 100644 side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml create mode 100644 side-quests/groovy_essentials/data/samples.csv create mode 100644 side-quests/groovy_essentials/data/sequences/sample_001.fastq create mode 100644 side-quests/groovy_essentials/data/sequences/sample_002.fastq create mode 100644 side-quests/groovy_essentials/data/sequences/sample_003.fastq create mode 100644 side-quests/groovy_essentials/main.nf create mode 100644 side-quests/groovy_essentials/nextflow.config create mode 100644 side-quests/groovy_essentials/templates/analysis_script.sh diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md new file mode 100644 index 0000000000..7b0c394f14 --- /dev/null +++ b/docs/side_quests/groovy_essentials.md @@ -0,0 +1,1725 @@ +# Groovy Essentials for Nextflow Developers + +Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. While Nextflow provides the workflow orchestration framework, Groovy provides the programming language foundations that make your workflows flexible, maintainable, and powerful. + +Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. + +Think of it like cooking: Nextflow is your kitchen setup - the stove, pans, and organization system that lets you cook efficiently. Groovy is your knife skills, ingredient preparation techniques, and recipe adaptation abilities that make the actual cooking successful. You need both to create great meals, but knowing which tool to reach for when is essential. + +Many Nextflow developers struggle with distinguishing when to use Nextflow versus Groovy features, processing file names and configurations, and handling errors gracefully. This side quest will bridge that gap by taking you on a journey from basic workflow concepts to production-ready pipeline mastery. + +**Our Mission**: Transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. 
+ +Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: +- **Messy data?** We'll add robust parsing and null-safe operators +- **Complex file naming schemes?** We'll master regex patterns and string manipulation +- **Need intelligent sample routing?** We'll implement conditional logic and strategy selection +- **Worried about failures?** We'll add comprehensive error handling and validation +- **Code getting repetitive?** We'll learn functional programming with closures and composition +- **Processing thousands of samples?** We'll leverage powerful collection operations + +Each section builds on the previous, showing you how Groovy transforms simple workflows into powerful, production-ready pipelines that can handle the complexities of real-world bioinformatics data. + +You will learn: + +- How to distinguish between Nextflow and Groovy constructs in your workflows +- String processing and pattern matching for bioinformatics file names +- Transforming file collections into command-line arguments +- Conditional logic for controlling process execution +- Basic validation and error handling patterns +- Essential Groovy operators: safe navigation, Elvis, and Groovy Truth +- Advanced closures and functional programming techniques +- Collection operations and file path manipulations + +These skills will help you write cleaner, more maintainable workflows that handle different input types appropriately and provide useful feedback when things go wrong. + +--- + +## 0. Warmup + +### 0.1. Prerequisites + +Before taking on this side quest you should: + +- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial +- Understand basic Nextflow concepts (processes, channels, workflows) +- Have basic familiarity with Groovy syntax (variables, maps, lists) + +You may also find it helpful to review [Basic Groovy](../basic_training/groovy.md) if you need a refresher on fundamental concepts. + +### 0.2. Starting Point + +Let's move into the project directory and explore our working materials. + +```bash +cd side-quests/groovy_essentials +``` + +You'll find a `data` directory with sample files and a main workflow file that we'll evolve throughout this tutorial. + +```console title="Directory contents" +> tree +. +├── data/ +│ ├── samples.csv +│ ├── sequences/ +│ │ ├── sample_001.fastq +│ │ ├── sample_002.fastq +│ │ └── sample_003.fastq +│ └── metadata/ +│ └── analysis_parameters.yaml +├── templates/ +│ └── analysis_script.sh +├── main.nf +├── nextflow.config +└── README.md +``` + +Our sample CSV contains information about biological samples that need different processing based on their characteristics: + +```console title="samples.csv" +sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score +SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5 +SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2 +SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1 +``` + +We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows. + +--- + +## 1. Nextflow vs Groovy: Understanding the Boundaries + +### 1.1. Identifying What's What + +One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. 
Let's start by examining a typical workflow and identifying the boundaries: + +```groovy title="main.nf" linenums="1" +workflow { + // Nextflow: Channel factory and operator + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + // Groovy: Map operations and string manipulation + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + + // Groovy: Conditional logic and string interpolation + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + // Nextflow: Return tuple for channel + return [sample_meta + [priority: priority], file(row.file_path)] + } + .view() +} +``` + +Let's break this down: + +**Nextflow constructs:** +- `workflow { }` - Nextflow workflow definition +- `Channel.fromPath()` - Nextflow channel factory +- `.splitCsv()`, `.map()`, `.view()` - Nextflow channel operators +- `file()` - Nextflow file object factory + +**Groovy constructs:** +- `def sample_meta = [:]` - Groovy map definition +- `.toLowerCase()`, `.replaceAll()` - Groovy string methods +- `.toInteger()`, `.toDouble()` - Groovy type conversion +- Ternary operator `? :` - Groovy conditional expression +- Map addition `+` operator - Groovy map operations + +Run this workflow to see the processed output: + +```bash title="Test the initial processing" +nextflow run main.nf +``` + +```console title="Processed sample data" +N E X T F L O W ~ version 25.04.3 + +Launching `main.nf` [fervent_darwin] DSL2 - revision: 8a9c4f8e21 + +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], data/sequences/sample_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], data/sequences/sample_002.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], data/sequences/sample_003.fastq] +``` + +### 1.2. 
The Collect Confusion: Nextflow vs Groovy + +A perfect example of Nextflow/Groovy confusion is the `collect` operation, which exists in both contexts but does completely different things: + +**Groovy's `collect`** (transforms each element): +```groovy +// Groovy collect - transforms each item in a list +def numbers = [1, 2, 3, 4] +def squared = numbers.collect { it * it } +// Result: [1, 4, 9, 16] +``` + +**Nextflow's `collect`** (gathers all channel elements): +```groovy +// Nextflow collect - gathers all channel items into a list +Channel.of(1, 2, 3, 4) + .collect() + .view() +// Result: [1, 2, 3, 4] (single channel emission) +``` + +Let's demonstrate this with our sample data: + +=== "After" + + ```groovy title="main.nf" linenums="25" hl_lines="1-15" + + // Demonstrate Groovy vs Nextflow collect + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Groovy collect: transform each element + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids}" + + // Nextflow collect: gather channel elements + ch_collected = Channel.of('sample_001', 'sample_002', 'sample_003') + .collect() + ch_collected.view { "Nextflow collect result: ${it}" } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="25" + ``` + +Run the workflow to see both collect operations in action: + +```bash title="Test collect operations" +nextflow run main.nf +``` + +```console title="Different collect behaviors" +Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] +Nextflow collect result: [sample_001, sample_002, sample_003] +``` + +### Takeaway + +In this section, you've learned: + +- **Distinguishing Nextflow from Groovy**: How to identify which language construct you're using +- **Context matters**: The same operation name can have completely different behaviors +- **Workflow structure**: Nextflow provides the orchestration, Groovy provides the logic +- **Data transformation patterns**: When to use Groovy methods vs Nextflow operators + +Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows. + +Now that we can distinguish between Nextflow and Groovy constructs, let's enhance our sample processing pipeline with more sophisticated data handling capabilities. + +--- + +## 2. Advanced String Processing for Bioinformatics + +Our basic pipeline processes CSV metadata well, but this is just the beginning. In production bioinformatics, you'll encounter files from different sequencing centers with varying naming conventions, legacy datasets with non-standard formats, and the need to extract critical information from filenames themselves. + +The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. Let's transform our pipeline to handle the messy realities of real-world bioinformatics data. + +### 2.1. Pattern Matching and Regular Expressions + +Many bioinformatics workflows encounter files with complex naming conventions that encode important metadata. Let's see how Groovy's pattern matching can extract this information automatically. 
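+Groovy has two closely related regex operators, and knowing which one you are using avoids a lot of confusion. The find operator `=~` returns a `java.util.regex.Matcher` and is truthy when the pattern occurs anywhere in the string, while the match operator `==~` returns a plain boolean and only succeeds when the whole string matches. A minimal sketch:
+
+```groovy
+// =~ (find): returns a Matcher, truthy when the pattern occurs anywhere
+def matcher = 'sample_001.fastq' =~ /(\d+)/
+if (matcher) {
+    println "Found digits: ${matcher[0][1]}"   // -> Found digits: 001
+}
+
+// ==~ (match): returns a boolean, true only when the WHOLE string matches
+assert 'sample_001.fastq' ==~ /\w+_\d+\.fastq/
+assert !('sample_001.fastq' ==~ /\d+/)   // only part of the string is digits
+```
+
+The examples below use `=~` together with capture groups, which is the most common combination when pulling metadata out of file names.
+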
+ +Let's start with a simple example of extracting sample information from file names: + +=== "After" + + ```groovy title="main.nf" linenums="40" hl_lines="1-15" + + // Pattern matching for sample file names + def sample_files = [ + 'Human_Liver_001.fastq', + 'mouse_brain_002.fastq', + 'SRR12345678.fastq' + ] + + // Simple pattern to extract organism and tissue + def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + + sample_files.each { filename -> + def matcher = filename =~ pattern + if (matcher) { + println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}" + } else { + println "${filename} -> No standard pattern match" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="40" + ``` + +This demonstrates key Groovy string processing concepts: + +1. **Regular expression literals** using `~/pattern/` syntax +2. **Pattern matching** with the `=~` operator +3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. + +Run this to see the pattern matching in action: + +```bash title="Test pattern matching" +nextflow run main.nf +``` + +```console title="Pattern matching results" +Human_Liver_001.fastq -> Organism: Human, Tissue: Liver, ID: 001 +mouse_brain_002.fastq -> Organism: mouse, Tissue: brain, ID: 002 +SRR12345678.fastq -> No standard pattern match +``` + +### 2.2. Creating Reusable Parsing Functions + +Let's create a simple function to parse sample names and return structured metadata: + +=== "After" + + ```groovy title="main.nf" linenums="60" hl_lines="1-20" + + // Function to extract sample metadata from filename + def parseSampleName(String filename) { + def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + def matcher = filename =~ pattern + + if (matcher) { + return [ + organism: matcher[0][1].toLowerCase(), + tissue: matcher[0][2].toLowerCase(), + sample_id: matcher[0][3], + valid: true + ] + } else { + return [ + filename: filename, + valid: false + ] + } + } + + // Test the parser + sample_files.each { filename -> + def parsed = parseSampleName(filename) + println "File: ${filename}" + if (parsed.valid) { + println " Organism: ${parsed.organism}, Tissue: ${parsed.tissue}, ID: ${parsed.sample_id}" + } else { + println " Could not parse filename" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="60" + ``` + +This demonstrates key Groovy function patterns: + +- **Function definitions** with `def functionName(parameters)` +- **Map creation and return** for structured data +- **Conditional returns** based on pattern matching success + +### 2.3. Dynamic Script Logic in Processes + +In Nextflow, dynamic behavior comes from using Groovy logic within process script blocks, not generating script strings. Here are realistic patterns: + +=== "After" + + ```groovy title="main.nf" linenums="115" hl_lines="1-40" + + // Process with conditional script logic + process QUALITY_FILTER { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}_filtered.fastq") + + script: + // Groovy logic to determine parameters based on metadata + def quality_threshold = meta.organism == 'human' ? 30 : + meta.organism == 'mouse' ? 28 : 25 + def min_length = meta.priority == 'high' ? 75 : 50 + + // Conditional script sections + def extra_qc = meta.priority == 'high' ? 
'--strict-quality' : '' + + """ + echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" + + # Dynamic quality filtering based on sample characteristics + fastp \\ + --in1 ${reads} \\ + --out1 ${meta.id}_filtered.fastq \\ + --qualified_quality_phred ${quality_threshold} \\ + --length_required ${min_length} \\ + ${extra_qc} + + echo "Applied quality threshold: ${quality_threshold}" + echo "Applied length threshold: ${min_length}" + """ + } + + // Process with completely different scripts based on organism + process ALIGN_READS { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}.bam") + + script: + if (meta.organism == 'human') { + """ + echo "Using human-specific STAR alignment" + STAR --runMode alignReads \\ + --genomeDir /refs/human/STAR \\ + --readFilesIn ${reads} \\ + --outSAMtype BAM SortedByCoordinate \\ + --outFileNamePrefix ${meta.id} + mv ${meta.id}Aligned.sortedByCoord.out.bam ${meta.id}.bam + """ + } else if (meta.organism == 'mouse') { + """ + echo "Using mouse-specific bowtie2 alignment" + bowtie2 -x /refs/mouse/genome \\ + -U ${reads} \\ + --sensitive \\ + | samtools sort -o ${meta.id}.bam - + """ + } else { + """ + echo "Using generic alignment for ${meta.organism}" + minimap2 -ax sr /refs/generic/genome.fa ${reads} \\ + | samtools sort -o ${meta.id}.bam - + """ + } + } + + // Using templates (Nextflow's built-in templating) + process GENERATE_REPORT { + input: + tuple val(meta), path(results) + + output: + path("${meta.id}_report.html") + + script: + template 'report_template.sh' + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="115" + ``` + +Now let's look at the template file that would go with this: + +=== "After" + + ```bash title="templates/report_template.sh" linenums="1" hl_lines="1-25" + #!/bin/bash + + # This template has access to all variables from the process input + # Groovy expressions are evaluated at runtime + + echo "Generating report for sample: ${meta.id}" + echo "Organism: ${meta.organism}" + echo "Quality score: ${meta.quality}" + + # Conditional logic in template + <% if (meta.organism == 'human') { %> + echo "Including human-specific quality metrics" + human_qc_script.py --input ${results} --output ${meta.id}_report.html + <% } else { %> + echo "Using standard quality metrics for ${meta.organism}" + generic_qc_script.py --input ${results} --output ${meta.id}_report.html + <% } %> + + # Groovy variables can be used for calculations + <% + def priority_bonus = meta.priority == 'high' ? 0.1 : 0.0 + def adjusted_score = (meta.quality + priority_bonus).round(2) + %> + + echo "Adjusted quality score: ${adjusted_score}" + echo "Report generation complete" + ``` + +=== "Before" + + ```bash title="templates/report_template.sh" + ``` + +This demonstrates realistic Nextflow patterns: + +- **Conditional script blocks** using Groovy if/else in the script section +- **Variable interpolation** directly in script blocks +- **Template files** with Groovy expressions (using `<% %>` and `${}`) +- **Dynamic parameter calculation** based on metadata + +### 2.4. Transforming File Collections into Command Arguments + +A particularly powerful pattern is using Groovy logic in the script block to transform collections of files into properly formatted command-line arguments. 
This is essential when tools expect multiple files as separate arguments: + +=== "After" + + ```groovy title="main.nf" linenums="200" hl_lines="1-35" + + // Process that needs to handle multiple input files + process JOINT_ANALYSIS { + input: + path sample_files // This will be a list of files + path reference + + output: + path "joint_results.txt" + + script: + // Transform file list into command arguments + def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ') + def sample_names = sample_files.collect { file -> + file.baseName.replaceAll(/\..*$/, '') + }.join(',') + + """ + echo "Processing ${sample_files.size()} samples" + echo "Sample names: ${sample_names}" + + # Use the transformed arguments in the actual command + analysis_tool \\ + ${file_args} \\ + --reference ${reference} \\ + --output joint_results.txt \\ + --samples ${sample_names} + """ + } + + // Process that builds complex command based on file characteristics + process VARIABLE_COMMAND { + input: + tuple val(meta), path(files) + + output: + path "${meta.id}_processed.txt" + + script: + // Complex command building based on file types and metadata + def input_flags = files.collect { file -> + def extension = file.getExtension() + switch(extension) { + case 'bam': + return "--bam-input ${file}" + case 'vcf': + return "--vcf-input ${file}" + case 'bed': + return "--intervals ${file}" + default: + return "--data-input ${file}" + } + }.join(' ') + + // Additional flags based on metadata + def extra_flags = meta.quality > 35 ? '--high-quality' : '' + + """ + echo "Building command for ${meta.id}" + + variant_caller \\ + ${input_flags} \\ + ${extra_flags} \\ + --output ${meta.id}_processed.txt + """ + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="200" + ``` + +Key patterns demonstrated: + +- **File collection transformation**: Using `.collect{}` to transform each file into a command argument +- **String joining**: Using `.join(' ')` to combine arguments with spaces +- **File name manipulation**: Using `.baseName` and `.replaceAll()` for sample names +- **Conditional argument building**: Using switch statements or conditionals to build different arguments based on file types +- **Multiple transformations**: Building both file arguments and sample name lists from the same collection + +### Takeaway + +In this section, you've learned: + +- **Regular expression patterns** for bioinformatics file name parsing +- **Reusable parsing functions** that return structured metadata +- **Process script logic** with conditional parameter selection +- **File collection transformation** into command-line arguments using `.collect{}` and `.join()` +- **Command building patterns** based on file types and metadata + +These string processing techniques form the foundation for handling complex data pipelines that need to adapt to different input formats and generate appropriate commands for bioinformatics tools. + +With our pipeline now capable of extracting rich metadata from both CSV files and file names, we can make intelligent decisions about how to process different samples. Let's add conditional logic to route samples through appropriate analysis strategies. + +--- + +## 3. Conditional Logic and Process Control + +### 3.1. Strategy Selection Based on Sample Characteristics + +Now that our pipeline can extract comprehensive sample metadata, we can use this information to automatically select the most appropriate analysis strategy for each sample. 
Different organisms, sequencing depths, and quality scores require different processing approaches. + +=== "After" + + ```groovy title="main.nf" linenums="175" hl_lines="1-40" + + // Dynamic process selection based on sample characteristics + def selectAnalysisStrategy(Map sample_meta) { + def strategy = [:] + + // Sequencing depth determines processing approach + if (sample_meta.depth < 10_000_000) { + strategy.approach = 'low_depth' + strategy.processes = ['quality_check', 'simple_alignment'] + strategy.sensitivity = 'high' + } else if (sample_meta.depth < 50_000_000) { + strategy.approach = 'standard' + strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling'] + strategy.sensitivity = 'standard' + } else { + strategy.approach = 'high_depth' + strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling', 'structural_variants'] + strategy.sensitivity = 'sensitive' + } + + // Organism-specific adjustments + switch(sample_meta.organism) { + case 'human': + strategy.reference = 'GRCh38' + strategy.known_variants = 'dbSNP' + break + case 'mouse': + strategy.reference = 'GRCm39' + strategy.known_variants = 'mgp_variants' + break + default: + strategy.reference = 'custom' + strategy.known_variants = null + } + + // Quality-based modifications + if (sample_meta.quality < 30) { + strategy.extra_qc = true + strategy.processes = ['extensive_qc'] + strategy.processes + } + + return strategy + } + + // Test strategy selection + ch_samples + .map { meta, file -> + def strategy = selectAnalysisStrategy(meta) + println "\nSample: ${meta.id}" + println " Strategy: ${strategy.approach}" + println " Processes: ${strategy.processes.join(' -> ')}" + println " Reference: ${strategy.reference}" + println " Extra QC: ${strategy.extra_qc ?: false}" + + return [meta + strategy, file] + } + .view { meta, file -> "Ready for processing: ${meta.id} (${meta.approach})" } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="175" + ``` + +This demonstrates several Groovy patterns commonly used in Nextflow workflows: + +- **Numeric literals** with underscores for readability (`10_000_000`) +- **Switch statements** for multi-way branching +- **List concatenation** with `+` operator +- **Elvis operator** `?:` for null handling +- **Map merging** to combine metadata with strategy + +### 3.2. Conditional Process Execution + +In Nextflow, you control which processes run for which samples using `when` conditions and channel routing: + +=== "After" + + ```groovy title="main.nf" linenums="225" hl_lines="1-60" + + // Different processes for different strategies + process BASIC_QC { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}_basic_qc.html") + + when: + meta.approach == 'low_depth' + + script: + """ + fastqc --quiet ${reads} -o ./ + mv *_fastqc.html ${meta.id}_basic_qc.html + """ + } + + process COMPREHENSIVE_QC { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}_comprehensive_qc.html") + + when: + meta.approach in ['standard', 'high_depth'] + + script: + def sensitivity = meta.sensitivity == 'high' ? 
'--strict' : '' + """ + fastqc ${sensitivity} ${reads} -o ./ + # Additional QC for comprehensive analysis + seqtk fqchk ${reads} > sequence_stats.txt + mv *_fastqc.html ${meta.id}_comprehensive_qc.html + """ + } + + process SIMPLE_ALIGNMENT { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}.bam") + + when: + meta.approach == 'low_depth' + + script: + """ + minimap2 -ax sr ${meta.reference} ${reads} \\ + | samtools sort -o ${meta.id}.bam - + """ + } + + process SENSITIVE_ALIGNMENT { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}.bam") + + when: + meta.approach in ['standard', 'high_depth'] + + script: + def params = meta.sensitivity == 'sensitive' ? '--very-sensitive' : '--sensitive' + """ + bowtie2 ${params} -x ${meta.reference} -U ${reads} \\ + | samtools sort -o ${meta.id}.bam - + """ + } + + // Workflow logic that routes to appropriate processes + workflow ANALYSIS_PIPELINE { + take: + samples_ch + + main: + // All samples go through appropriate QC + basic_qc_results = BASIC_QC(samples_ch) + comprehensive_qc_results = COMPREHENSIVE_QC(samples_ch) + + // Combine QC results + qc_results = basic_qc_results.mix(comprehensive_qc_results) + + // All samples go through appropriate alignment + simple_alignment_results = SIMPLE_ALIGNMENT(samples_ch) + sensitive_alignment_results = SENSITIVE_ALIGNMENT(samples_ch) + + // Combine alignment results + alignment_results = simple_alignment_results.mix(sensitive_alignment_results) + + emit: + qc = qc_results + alignments = alignment_results + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="225" + ``` + +This shows realistic Nextflow patterns: + +- **Separate processes** for different strategies rather than dynamic generation +- **When conditions** to control which processes run for which samples +- **Mix operator** to combine results from different conditional processes +- **Process parameterization** using metadata in script blocks + +### 3.3. 
Channel-based Workflow Routing + +The realistic way to handle conditional workflow assembly is through channel routing and filtering: + +=== "After" + + ```groovy title="main.nf" linenums="285" hl_lines="1-50" + + workflow { + // Read and enrich sample data with strategy + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map { row -> + def meta = [ + id: row.sample_id, + organism: row.organism, + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + + // Add strategy information using our selectAnalysisStrategy function + def strategy = selectAnalysisStrategy(meta) + + return [meta + strategy, file(row.file_path)] + } + + // Split channel based on strategy requirements + ch_samples + .branch { meta, reads -> + low_depth: meta.approach == 'low_depth' + return [meta, reads] + standard: meta.approach == 'standard' + return [meta, reads] + high_depth: meta.approach == 'high_depth' + return [meta, reads] + } + .set { samples_by_strategy } + + // Route each strategy through appropriate processes + ANALYSIS_PIPELINE(samples_by_strategy.low_depth) + ANALYSIS_PIPELINE(samples_by_strategy.standard) + ANALYSIS_PIPELINE(samples_by_strategy.high_depth) + + // For high-depth samples, also run structural variant calling + high_depth_alignments = ANALYSIS_PIPELINE.out.alignments + .filter { meta, bam -> meta.approach == 'high_depth' } + + STRUCTURAL_VARIANTS(high_depth_alignments) + + // Collect and organize all results + all_qc = ANALYSIS_PIPELINE.out.qc.collect() + all_alignments = ANALYSIS_PIPELINE.out.alignments.collect() + + // Generate summary report based on what was actually run + all_alignments + .map { alignments -> + def strategies = alignments.collect { meta, bam -> meta.approach }.unique() + def total_samples = alignments.size() + + println "Pipeline Summary:" + println " Total samples processed: ${total_samples}" + println " Strategies used: ${strategies.join(', ')}" + + strategies.each { strategy -> + def count = alignments.count { meta, bam -> meta.approach == strategy } + println " ${strategy}: ${count} samples" + } + } + .view() + } + + // Additional process for high-depth samples + process STRUCTURAL_VARIANTS { + input: + tuple val(meta), path(bam) + + output: + tuple val(meta), path("${meta.id}.vcf") + + script: + """ + delly call -g ${meta.reference} ${bam} -o ${meta.id}.vcf + """ + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="285" + ``` + +Key Nextflow patterns demonstrated: + +- **Channel branching** with `.branch{}` to split samples by strategy +- **Conditional process execution** using `when:` directives and filtering +- **Channel routing** to send different samples through different processes +- **Result collection** and summary generation +- **Process reuse** - the same workflow processes different sample types + +### Takeaway + +In this section, you've learned: + +- **Strategy selection** using Groovy conditional logic +- **Process control** with `when` conditions and channel routing +- **Workflow branching** using channel operators like `.branch()` and `.filter()` +- **Metadata enrichment** to drive process selection + +These patterns help you write workflows that process different sample types appropriately while keeping your code organized and maintainable. + +Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's add validation and error handling to make our pipeline robust enough for real-world use. 
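+Before tackling validation, here is the routing pattern from this section distilled into a minimal, standalone sketch you can run on its own. The metadata values are made up and the `view` calls stand in for real processes, so treat it as an illustration of `.branch()` rather than production code:
+
+```groovy
+workflow {
+    // Toy samples standing in for the strategy-enriched metadata built earlier
+    ch_samples = Channel.of(
+        [[id: 's1', approach: 'low_depth'],  'r1.fastq'],
+        [[id: 's2', approach: 'standard'],   'r2.fastq'],
+        [[id: 's3', approach: 'high_depth'], 'r3.fastq']
+    )
+
+    // branch: split one channel into named sub-channels based on metadata
+    ch_samples
+        .branch { meta, reads ->
+            low:  meta.approach == 'low_depth'
+            deep: meta.approach in ['standard', 'high_depth']
+        }
+        .set { routed }
+
+    // Each sub-channel would normally feed a different process; view() is a placeholder
+    routed.low.view  { meta, reads -> "simple pipeline:    ${meta.id}" }
+    routed.deep.view { meta, reads -> "sensitive pipeline: ${meta.id}" }
+}
+```
+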
+ +--- + +## 4. Error Handling and Validation Patterns + +### 4.1. Basic Input Validation + +Before our pipeline processes samples through complex conditional logic, we should validate that the input data meets our requirements. Let's create validation functions that check sample metadata and provide useful error messages: + +=== "After" + + ```groovy title="main.nf" linenums="330" hl_lines="1-25" + + // Simple validation function + def validateSample(Map sample) { + def errors = [] + + // Check required fields + if (!sample.sample_id) { + errors << "Missing sample_id" + } + + if (!sample.organism) { + errors << "Missing organism" + } + + // Validate organism + def valid_organisms = ['human', 'mouse', 'rat'] + if (sample.organism && !valid_organisms.contains(sample.organism.toLowerCase())) { + errors << "Invalid organism: ${sample.organism}" + } + + // Check sequencing depth is numeric + if (sample.sequencing_depth) { + try { + def depth = sample.sequencing_depth as Integer + if (depth < 1000000) { + errors << "Sequencing depth too low: ${depth}" + } + } catch (NumberFormatException e) { + errors << "Invalid sequencing depth: ${sample.sequencing_depth}" + } + } + + return errors + } + + // Test validation + def test_samples = [ + [sample_id: 'SAMPLE_001', organism: 'human', sequencing_depth: '30000000'], + [sample_id: '', organism: 'alien', sequencing_depth: 'invalid'], + [sample_id: 'SAMPLE_003', organism: 'mouse', sequencing_depth: '500000'] + ] + + test_samples.each { sample -> + def errors = validateSample(sample) + if (errors) { + println "Sample ${sample.sample_id}: ${errors.join(', ')}" + } else { + println "Sample ${sample.sample_id}: Valid" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="330" + ``` + +### 4.2. Try-Catch Error Handling + +Let's implement simple try-catch patterns for handling errors: + +=== "After" + + ```groovy title="main.nf" linenums="370" hl_lines="1-25" + + // Process sample with error handling + def processSample(Map sample) { + try { + // Validate first + def errors = validateSample(sample) + if (errors) { + throw new RuntimeException("Validation failed: ${errors.join(', ')}") + } + + // Simulate processing + def result = [ + id: sample.sample_id, + organism: sample.organism, + processed: true + ] + + println "✓ Successfully processed ${sample.sample_id}" + return result + + } catch (Exception e) { + println "✗ Error processing ${sample.sample_id}: ${e.message}" + + // Return partial result + return [ + id: sample.sample_id ?: 'unknown', + organism: sample.organism ?: 'unknown', + processed: false, + error: e.message + ] + } + } + + // Test error handling + test_samples.each { sample -> + def result = processSample(sample) + println "Result for ${result.id}: processed = ${result.processed}" + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="370" + ``` + +### 4.3. 
Setting Defaults and Validation + +Let's create a simple function that provides defaults and validates configuration: + +=== "After" + + ```groovy title="main.nf" linenums="400" hl_lines="1-25" + + // Simple configuration with defaults + def getConfig(Map user_params) { + // Set defaults + def defaults = [ + quality_threshold: 30, + max_cpus: 4, + output_dir: './results' + ] + + // Merge user params with defaults + def config = defaults + user_params + + // Simple validation + if (config.quality_threshold < 0 || config.quality_threshold > 40) { + println "Warning: Quality threshold ${config.quality_threshold} out of range, using default" + config.quality_threshold = defaults.quality_threshold + } + + if (config.max_cpus < 1) { + println "Warning: Invalid CPU count ${config.max_cpus}, using default" + config.max_cpus = defaults.max_cpus + } + + return config + } + + // Test configuration + def test_configs = [ + [:], // Empty - should get defaults + [quality_threshold: 35, max_cpus: 8], // Valid values + [quality_threshold: -5, max_cpus: 0] // Invalid values + ] + + test_configs.each { user_config -> + def config = getConfig(user_config) + println "Input: ${user_config} -> Output: ${config}" + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="400" + ``` + +### Takeaway + +In this section, you've learned: + +- **Basic validation functions** that check required fields and data types +- **Try-catch error handling** for graceful failure handling +- **Configuration with defaults** using map merging and validation + +These patterns help you write workflows that handle invalid input gracefully and provide useful feedback to users. + +Before diving into advanced closures, let's master some essential Groovy language features that make code more concise and null-safe. These operators and patterns are used throughout production Nextflow workflows and will make your code more robust and readable. + +--- + +## 5. Essential Groovy Operators and Patterns + +With our pipeline now handling complex conditional logic, we need to make it more robust against missing or malformed data. Bioinformatics workflows often deal with incomplete metadata, optional configuration parameters, and varying input formats. Let's enhance our pipeline with essential Groovy operators that handle these challenges gracefully. + +### 5.1. Safe Navigation and Elvis Operators in Workflows + +The safe navigation operator (`?.`) and Elvis operator (`?:`) are essential for null-safe programming when processing real-world biological data: + +=== "After" + + ```groovy title="main.nf" linenums="320" hl_lines="1-25" + + workflow { + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map { row -> + // Safe navigation prevents crashes on missing fields + def sample_id = row.sample_id?.toLowerCase() ?: 'unknown_sample' + def organism = row.organism?.toLowerCase() ?: 'unknown' + + // Elvis operator provides defaults + def quality = (row.quality_score as Double) ?: 30.0 + def depth = (row.sequencing_depth as Integer) ?: 1_000_000 + + // Chain operators for conditional defaults + def reference = row.reference ?: (organism == 'human' ? 'GRCh38' : 'custom') + + // Groovy Truth - empty strings and nulls are false + def priority = row.priority ?: (quality > 40 ? 
'high' : 'normal') + + return [ + id: sample_id, + organism: organism, + quality: quality, + depth: depth, + reference: reference, + priority: priority + ] + } + .view { meta -> + "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="320" + ``` + +### 5.2. String Patterns and Multi-line Templates + +Groovy provides powerful string features for parsing filenames and generating dynamic commands: + +=== "After" + + ```groovy title="main.nf" linenums="370" hl_lines="1-30" + + workflow { + // Demonstrate slashy strings for regex (no need to escape backslashes) + def parseFilename = { filename -> + // Slashy string - compare to regular string: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$" + def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/ + def matcher = filename =~ pattern + + if (matcher) { + return [ + organism: matcher[0][1].toLowerCase(), + tissue: matcher[0][2].toLowerCase(), + sample_id: matcher[0][3] + ] + } else { + return [organism: 'unknown', tissue: 'unknown', sample_id: 'unknown'] + } + } + + // Multi-line strings with interpolation for command generation + def generateCommand = { meta -> + def depth_category = meta.depth > 10_000_000 ? 'high' : 'standard' + def db_path = meta.organism == 'human' ? '/db/human' : '/db/other' + + // Multi-line string with variable interpolation + """ + echo "Processing ${meta.organism} sample: ${meta.sample_id}" + analysis_tool \\ + --sample ${meta.sample_id} \\ + --depth-category ${depth_category} \\ + --database ${db_path} \\ + --threads ${params.max_cpus ?: 4} + """ + } + + // Test the patterns + ch_files = Channel.of('Human_Liver_001.fastq', 'Mouse_Brain_002.fastq') + .map { filename -> + def parsed = parseFilename(filename) + def command = generateCommand([sample_id: parsed.sample_id, organism: parsed.organism, depth: 15_000_000]) + return [parsed, command] + } + .view { parsed, command -> "Parsed: ${parsed}, Command: ${command.split('\n')[0]}..." } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="370" + ``` + +### 5.3. Combining Operators for Robust Data Handling + +Let's combine these operators in a realistic workflow scenario: + +=== "After" + + ```groovy title="main.nf" linenums="420" hl_lines="1-20" + + workflow { + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map { row -> + // Combine safe navigation and Elvis operators + def meta = [ + id: row.sample_id?.toLowerCase() ?: 'unknown', + organism: row.organism ?: 'unknown', + quality: (row.quality_score as Double) ?: 30.0, + files: row.file_path ? 
[file(row.file_path)] : [] + ] + + // Use Groovy Truth for validation + if (meta.files && meta.id != 'unknown') { + return [meta, meta.files] + } else { + log.info "Skipping sample with missing data: ${meta.id}" + return null + } + } + .filter { it != null } // Remove invalid samples using Groovy Truth + .view { meta, files -> + "Valid sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="420" + ``` + +### Takeaway + +In this section, you've learned: + +- **Safe navigation operator** (`?.`) for null-safe property access +- **Elvis operator** (`?:`) for default values and null coalescing +- **Groovy Truth** - how null, empty strings, and empty collections evaluate to false +- **Slashy strings** (`/pattern/`) for regex patterns without escaping +- **Multi-line string interpolation** for command templates +- **Numeric literals with underscores** for improved readability + +These patterns make your code more resilient to missing data and easier to read, which is essential when processing diverse bioinformatics datasets. + +--- + +## 6. Advanced Closures and Functional Programming + +Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. Instead of copy-pasting similar closures across different channel operations, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable. + +### 6.1. Named Closures for Reusability + +So far we've used anonymous closures defined inline within channel operations. When you find yourself repeating the same transformation logic across multiple processes or workflows, named closures can eliminate duplication and improve readability: + +=== "After" + + ```groovy title="main.nf" linenums="350" hl_lines="1-30" + + // Define reusable closures for common transformations + def extractSampleInfo = { row -> + [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + quality: row.quality_score.toDouble(), + depth: row.sequencing_depth.toInteger() + ] + } + + def addPriority = { meta -> + meta + [priority: meta.quality > 40 ? 'high' : 'normal'] + } + + def formatForDisplay = { meta, file_path -> + "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}" + } + + workflow { + // Use named closures in channel operations + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map(extractSampleInfo) // Named closure + .map(addPriority) // Named closure + .map { meta -> [meta, file("./data/sequences/${meta.id}.fastq")] } + .view(formatForDisplay) // Named closure + + // Reuse the same closures elsewhere + ch_filtered = ch_samples + .filter { meta, file -> meta.quality > 30 } + .map { meta, file -> addPriority(meta) } // Reuse closure + .view(formatForDisplay) // Reuse closure + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="350" + ``` + +### 6.2. Function Composition + +Groovy closures can be composed together using the `>>` operator, allowing you to build complex transformations from simple, reusable pieces: + +=== "After" + + ```groovy title="main.nf" linenums="390" hl_lines="1-25" + + // Simple transformation closures + def normalizeId = { meta -> + meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')] + } + + def addQualityCategory = { meta -> + def category = meta.quality > 40 ? 
'excellent' : + meta.quality > 30 ? 'good' : + meta.quality > 20 ? 'acceptable' : 'poor' + meta + [quality_category: category] + } + + def addProcessingFlags = { meta -> + meta + [ + needs_extra_qc: meta.quality < 30, + high_priority: meta.organism == 'human' && meta.quality > 35 + ] + } + + // Compose transformations using >> operator + def enrichSample = normalizeId >> addQualityCategory >> addProcessingFlags + + workflow { + Channel.fromPath(params.input) + .splitCsv(header: true) + .map(extractSampleInfo) + .map(enrichSample) // Apply composed transformation + .view { meta -> + "Processed: ${meta.id} (${meta.quality_category}) - Extra QC: ${meta.needs_extra_qc}" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="390" + ``` + +### 6.3. Currying for Specialized Functions + +Currying allows you to create specialized versions of general-purpose closures by fixing some of their parameters: + +=== "After" + + ```groovy title="main.nf" linenums="430" hl_lines="1-20" + + // General-purpose filtering closure + def qualityFilter = { threshold, meta -> meta.quality >= threshold } + + // Create specialized filters using currying + def highQualityFilter = qualityFilter.curry(40) + def standardQualityFilter = qualityFilter.curry(30) + + workflow { + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map(extractSampleInfo) + + // Use the specialized filters in different channel operations + ch_high_quality = ch_samples.filter(highQualityFilter) + ch_standard_quality = ch_samples.filter(standardQualityFilter) + + // Both channels can be processed differently + ch_high_quality.view { meta -> "High quality: ${meta.id}" } + ch_standard_quality.view { meta -> "Standard quality: ${meta.id}" } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="430" + ``` + +### 6.4. Closures Accessing External Variables + +Closures can access and modify variables from their defining scope, which is useful for collecting statistics: + +=== "After" + + ```groovy title="main.nf" linenums="480" hl_lines="1-20" + + workflow { + // Variable in the workflow scope + def sample_count = 0 + def human_samples = 0 + + // Closure that accesses and modifies external variables + def countSamples = { meta -> + sample_count++ // Modifies external variable + if (meta.organism == 'human') { + human_samples++ // Modifies another external variable + } + return meta // Pass data through unchanged + } + + Channel.fromPath(params.input) + .splitCsv(header: true) + .map(extractSampleInfo) + .map(countSamples) // Closure modifies external variables + .collect() // Wait for all samples to be processed + .view { + "Processing complete: ${sample_count} total samples, ${human_samples} human samples" + } + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="480" + ``` + +### Takeaway + +In this section, you've learned: + +- **Named closures** for eliminating code duplication and improving readability +- **Function composition** with `>>` operator to build complex transformations +- **Currying** to create specialized versions of general-purpose closures +- **Variable scope access** in closures for collecting statistics and generating reports + +These advanced patterns help you write more maintainable, reusable workflows that follow functional programming principles while remaining easy to understand and debug. + +With our pipeline now capable of intelligent routing, robust error handling, and advanced functional programming patterns, we're ready for the final enhancement. 
As your workflows scale to process hundreds or thousands of samples, you'll need sophisticated data processing capabilities that can organize, filter, and analyze large collections efficiently. + +The functional programming patterns we just learned work beautifully with Groovy's powerful collection methods. Instead of writing loops and conditional logic, you can chain together expressive operations that clearly describe what you want to accomplish. + +--- + +## 7. Collection Operations and File Path Manipulations + +### 7.1. Common Collection Methods in Channel Operations + +When processing large datasets, channel operations often need to organize and analyze sample collections. Groovy's collection methods integrate seamlessly with Nextflow channels to provide powerful data processing capabilities: + +=== "After" + + ```groovy title="main.nf" linenums="500" hl_lines="1-40" + + // Sample data with mixed quality and organisms + def samples = [ + [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']], + [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']], + [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']], + [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']], + [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']] + ] + + // findAll - filter collections based on conditions + def high_quality_samples = samples.findAll { it.quality > 40 } + println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}" + + // groupBy - group samples by organism + def samples_by_organism = samples.groupBy { it.organism } + println "Grouping by organism:" + samples_by_organism.each { organism, sample_list -> + println " ${organism}: ${sample_list.size()} samples" + } + + // unique - get unique organisms + def organisms = samples.collect { it.organism }.unique() + println "Unique organisms: ${organisms.join(', ')}" + + // flatten - flatten nested file lists + def all_files = samples.collect { it.files }.flatten() + println "All files: ${all_files.take(5).join(', ')}... (${all_files.size()} total)" + + // sort - sort samples by quality + def sorted_by_quality = samples.sort { it.quality } + println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}" + + // reverse - reverse the order + def reverse_quality = samples.sort { it.quality }.reverse() + println "Highest quality first: ${reverse_quality.collect { "${it.id}(${it.quality})" }.join(', ')}" + + // count - count items matching condition + def human_samples = samples.count { it.organism == 'human' } + println "Human samples: ${human_samples} out of ${samples.size()}" + + // any/every - check conditions across collection + def has_high_quality = samples.any { it.quality > 40 } + def all_have_files = samples.every { it.files.size() > 0 } + println "Has high quality samples: ${has_high_quality}" + println "All samples have files: ${all_have_files}" + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="500" + ``` + +### 7.2. File Path Manipulations + +Working with file paths is essential in bioinformatics workflows. 
Groovy provides many useful methods for extracting information from file paths: + +=== "After" + + ```groovy title="main.nf" linenums="550" hl_lines="1-30" + + // File path manipulation examples + def sample_files = [ + '/path/to/data/patient_001_R1.fastq.gz', + '/path/to/data/patient_001_R2.fastq.gz', + '/path/to/results/patient_002_analysis.bam', + '/path/to/configs/experiment_setup.json' + ] + + sample_files.each { file_path -> + def f = file(file_path) // Create Nextflow file object + + println "\nFile: ${file_path}" + println " Name: ${f.getName()}" // Just filename + println " BaseName: ${f.getBaseName()}" // Filename without extension + println " Extension: ${f.getExtension()}" // File extension + println " Parent: ${f.getParent()}" // Parent directory + println " Parent name: ${f.getParent().getName()}" // Just parent directory name + + // Extract sample ID from filename + def matcher = f.getName() =~ /^(patient_\d+)/ + if (matcher) { + println " Sample ID: ${matcher[0][1]}" + } + } + + // Group files by sample ID using path manipulation + def files_by_sample = sample_files + .findAll { it.contains('patient') } // Only patient files + .groupBy { file_path -> + def filename = file(file_path).getName() + def matcher = filename =~ /^(patient_\d+)/ + return matcher ? matcher[0][1] : 'unknown' + } + + println "\nFiles grouped by sample:" + files_by_sample.each { sample_id, files -> + println " ${sample_id}: ${files.size()} files" + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="550" + ``` + +### 7.3. The Spread Operator + +The spread operator (`*.`) is a powerful Groovy feature for calling methods on all elements in a collection: + +=== "After" + + ```groovy title="main.nf" linenums="590" hl_lines="1-20" + + // Spread operator examples + def file_paths = [ + '/data/sample1.fastq', + '/data/sample2.fastq', + '/results/output1.bam', + '/results/output2.bam' + ] + + // Convert to file objects + def files = file_paths.collect { file(it) } + + // Using spread operator - equivalent to files.collect { it.getName() } + def filenames = files*.getName() + println "Filenames: ${filenames.join(', ')}" + + // Get all parent directories + def parent_dirs = files*.getParent()*.getName() + println "Parent directories: ${parent_dirs.unique().join(', ')}" + + // Get all extensions + def extensions = files*.getExtension().unique() + println "File types: ${extensions.join(', ')}" + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="590" + ``` + +### Takeaway + +In this section, you've learned: + +- **Collection filtering** with `findAll` and conditional logic +- **Grouping and organizing** data with `groupBy` and `sort` +- **File path manipulation** using Nextflow's file object methods +- **Spread operator** (`*.`) for concise collection operations + +These patterns help you process and organize complex datasets efficiently, which is essential for handling real-world bioinformatics data. + +--- + +## Summary + +Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how Groovy transforms simple Nextflow workflows into powerful data processing systems. + +Here's how we progressively enhanced our pipeline: + +1. **Nextflow vs Groovy Boundaries**: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial differences between constructs like `collect`. 
+ +2. **String Processing**: You learned regular expressions, parsing functions, and file collection transformation for building dynamic command-line arguments. + +3. **Conditional Logic**: You added intelligent routing that automatically selects analysis strategies based on sample characteristics like organism, quality scores, and sequencing depth. + +4. **Error Handling**: You made the pipeline robust by adding validation functions, try-catch error handling, and configuration management with sensible defaults. + +5. **Essential Groovy Operators**: You mastered safe navigation (`?.`), Elvis (`?:`), Groovy Truth, slashy strings, and other key language features that make code more resilient and readable. + +6. **Advanced Closures**: You learned functional programming techniques including named closures, function composition, currying, and closures with variable scope access for building reusable, maintainable code. + +7. **Collection Operations**: You added sophisticated data processing capabilities using Groovy collection methods like `findAll`, `groupBy`, `unique`, `flatten`, and the spread operator to handle large-scale sample processing. + +### Key Benefits + +- **Clearer code**: Understanding when to use Nextflow vs Groovy helps you write more organized workflows +- **Better error handling**: Basic validation and try-catch patterns help your workflows handle problems gracefully +- **Flexible processing**: Conditional logic lets your workflows process different sample types appropriately +- **Configuration management**: Using defaults and simple validation makes your workflows easier to use + +### From Simple to Sophisticated + +The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows: + +1. **Started simple**: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries +2. **Added intelligence**: Dynamic file name parsing with regex patterns and conditional routing based on sample characteristics +3. **Made it robust**: Null-safe operators, validation, error handling, and graceful failure management +4. **Made it maintainable**: Advanced closure patterns, function composition, and reusable components that eliminate code duplication +5. **Scaled it efficiently**: Collection operations for processing hundreds of samples with powerful data filtering and organization + +This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows. + +### Next Steps + +With these Groovy fundamentals mastered, you're ready to: + +- Write cleaner workflows with proper separation between Nextflow and Groovy logic +- Transform file collections into properly formatted command-line arguments +- Handle different file naming conventions and input formats gracefully +- Build reusable, maintainable code using advanced closure patterns and functional programming +- Process and organize complex datasets using collection operations +- Add basic validation and error handling to make your workflows more user-friendly + +Continue practicing these patterns in your own workflows, and refer to the [Groovy documentation](http://groovy-lang.org/documentation.html) when you need to explore more advanced features. 
+ +### Key Concepts Reference + +- **Language Boundaries** + ```groovy + // Nextflow: workflow orchestration + Channel.fromPath('*.fastq').splitCsv(header: true) + + // Groovy: data processing + sample_data.collect { it.toUpperCase() } + ``` + +- **String Processing** + ```groovy + // Pattern matching + filename =~ ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + + // Function with conditional return + def parseSample(filename) { + def matcher = filename =~ pattern + return matcher ? [valid: true, data: matcher[0]] : [valid: false] + } + + // File collection to command arguments (in process script block) + script: + def file_args = input_files.collect { file -> "--input ${file}" }.join(' ') + """ + analysis_tool ${file_args} --output results.txt + """ + ``` + +- **Error Handling** + ```groovy + try { + def errors = validateSample(sample) + if (errors) throw new RuntimeException("Invalid: ${errors.join(', ')}") + } catch (Exception e) { + println "Error: ${e.message}" + } + ``` + +- **Essential Groovy Operators** + ```groovy + // Safe navigation and Elvis operators + def id = data?.sample?.id ?: 'unknown' + if (sample.files) println "Has files" // Groovy Truth + + // Slashy strings for regex + def pattern = /^\w+_R[12]\.fastq$/ + def script = """ + echo "Processing ${sample.id}" + analysis --depth ${depth ?: 1_000_000} + """ + ``` + +- **Advanced Closures** + ```groovy + // Named closures and composition + def enrichData = normalizeId >> addQualityCategory >> addFlags + def processor = generalFunction.curry(fixedParam) + + // Closures with scope access + def collectStats = { data -> stats.count++; return data } + ``` + +- **Collection Operations** + ```groovy + // Filter, group, and organize data + def high_quality = samples.findAll { it.quality > 40 } + def by_organism = samples.groupBy { it.organism } + def file_names = files*.getName() // Spread operator + def all_files = nested_lists.flatten() + ``` + +## Resources + +- [Groovy Documentation](http://groovy-lang.org/documentation.html) +- [Nextflow Operators](https://www.nextflow.io/docs/latest/operator.html) +- [Regular Expressions in Groovy](https://groovy-lang.org/syntax.html#_regular_expression_operators) +- [JSON Processing](https://groovy-lang.org/json.html) +- [XML Processing](https://groovy-lang.org/processing-xml.html) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 4579e89b1b..99244a99e7 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -28,6 +28,7 @@ Otherwise, select a side quest from the menu below. ## Menu of Side Quests +- [Groovy Essentials for Nextflow](./groovy_essentials.md) - [Introduction to nf-core](./nf-core.md) - [Sample-specific data](./metadata.md) - [Splitting and Grouping](./splitting_and_grouping.md) diff --git a/side-quests/groovy_essentials/README.md b/side-quests/groovy_essentials/README.md new file mode 100644 index 0000000000..1a91d5d1c0 --- /dev/null +++ b/side-quests/groovy_essentials/README.md @@ -0,0 +1,80 @@ +# Groovy Essentials for Nextflow Developers + +This directory contains the supporting materials for the [Groovy Essentials side quest](../../docs/side_quests/groovy_essentials.md). 
+ +## Contents + +- `main.nf` - Comprehensive workflow demonstrating all Groovy concepts from the side quest +- `nextflow.config` - Configuration file showcasing Nextflow's parameter system and profiles +- `data/` - Sample input data for the tutorial + - `samples.csv` - Sample metadata CSV file with realistic bioinformatics data + - `sequences/` - Sample FASTQ files for testing workflows + - `metadata/` - Additional metadata files (YAML configuration examples) +- `templates/` - Template scripts demonstrating Nextflow's templating system + +## Quick Start + +To run the demonstration workflow: + +```bash +cd side-quests/groovy_essentials +nextflow run main.nf +``` + +The workflow will demonstrate all the Groovy patterns covered in the side quest: + +1. **Nextflow vs Groovy boundaries** - See how workflow orchestration differs from programming logic +2. **String processing** - Pattern matching and file name parsing examples +3. **Conditional logic** - Dynamic strategy selection based on sample characteristics +4. **Error handling** - Validation and graceful error recovery patterns +5. **Essential Groovy operators** - Safe navigation, Elvis operator, Groovy Truth, and slashy strings +6. **Advanced closures** - Named closures, function composition, currying, and scope access +7. **Collection operations** - Advanced data processing with Groovy's collection methods + +## Testing the Workflow + +You can test different aspects of the workflow: + +```bash +# Run with different quality thresholds +nextflow run main.nf --quality_threshold_min 30 --quality_threshold_high 45 + +# Use the testing profile for reduced resource usage +nextflow run main.nf -profile test + +# Run in stub mode to test logic without executing tools +nextflow run main.nf -stub +``` + +## Learning Objectives + +This side quest teaches essential Groovy skills for Nextflow developers: + +- **Language boundaries**: Distinguish between Nextflow workflow orchestration and Groovy programming logic +- **String processing**: Use regular expressions and pattern matching for bioinformatics file names +- **Command building**: Transform file collections into command-line arguments using Groovy methods +- **Conditional logic**: Implement intelligent routing and process selection based on sample characteristics +- **Error handling**: Add validation, try-catch patterns, and graceful failure management +- **Essential operators**: Master safe navigation, Elvis operator, Groovy Truth, and slashy strings for robust code +- **Advanced closures**: Master named closures, function composition, currying, and functional programming patterns +- **Collection operations**: Process and organize large datasets using Groovy's powerful collection methods + +## Progressive Learning + +The `main.nf` file demonstrates a complete sample processing pipeline that evolves from basic metadata handling to a sophisticated, production-ready workflow. Each section builds on the previous, showing how Groovy transforms simple Nextflow workflows into powerful data processing systems. + +Follow the [main documentation](../../docs/side_quests/groovy_essentials.md) for detailed explanations, step-by-step examples, and hands-on exercises that correspond to each section of the demonstration workflow. 
+ +## Next Steps + +After completing this side quest, you'll be ready to: + +- Write cleaner workflows with proper separation between Nextflow and Groovy logic +- Handle complex file naming conventions and input formats gracefully +- Build intelligent pipelines that adapt to different sample types and data characteristics +- Write null-safe, robust code using essential Groovy operators +- Create reusable, maintainable code using advanced closure patterns and functional programming +- Process large-scale datasets efficiently using advanced collection operations +- Add robust error handling for production-ready workflows + +Continue exploring the [other side quests](../README.md) to further develop your Nextflow expertise! diff --git a/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml b/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml new file mode 100644 index 0000000000..115321672d --- /dev/null +++ b/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml @@ -0,0 +1,25 @@ +analysis: + quality: + min_score: 30 + trim_adapters: true + remove_duplicates: false + + alignment: + reference: "GRCh38" + aligner: "STAR" + max_mismatches: 2 + + quantification: + method: "featureCounts" + feature_type: "exon" + count_overlaps: false + +resources: + max_cpus: 8 + max_memory: "16GB" + temp_dir: "/tmp" + +output: + publish_mode: "copy" + compress: true + formats: ["bam", "counts", "qc_report"] diff --git a/side-quests/groovy_essentials/data/samples.csv b/side-quests/groovy_essentials/data/samples.csv new file mode 100644 index 0000000000..829f791e71 --- /dev/null +++ b/side-quests/groovy_essentials/data/samples.csv @@ -0,0 +1,4 @@ +sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score +SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5 +SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2 +SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1 diff --git a/side-quests/groovy_essentials/data/sequences/sample_001.fastq b/side-quests/groovy_essentials/data/sequences/sample_001.fastq new file mode 100644 index 0000000000..55bf845a27 --- /dev/null +++ b/side-quests/groovy_essentials/data/sequences/sample_001.fastq @@ -0,0 +1,12 @@ +@sample_001_read_1 +ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_001_read_2 +GCATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_001_read_3 +TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/data/sequences/sample_002.fastq b/side-quests/groovy_essentials/data/sequences/sample_002.fastq new file mode 100644 index 0000000000..71351a97a7 --- /dev/null +++ b/side-quests/groovy_essentials/data/sequences/sample_002.fastq @@ -0,0 +1,12 @@ +@sample_002_read_1 +CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_002_read_2 +ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_002_read_3 +GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/data/sequences/sample_003.fastq b/side-quests/groovy_essentials/data/sequences/sample_003.fastq new file mode 100644 index 0000000000..4acdd0f18e --- /dev/null +++ b/side-quests/groovy_essentials/data/sequences/sample_003.fastq @@ -0,0 +1,12 @@ +@sample_003_read_1 
+GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC ++ +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +@sample_003_read_2 +CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG ++ +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +@sample_003_read_3 +ATATATATATATATATATATATATATATATATATATATAT ++ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/main.nf b/side-quests/groovy_essentials/main.nf new file mode 100644 index 0000000000..0cccf0e04d --- /dev/null +++ b/side-quests/groovy_essentials/main.nf @@ -0,0 +1,555 @@ +#!/usr/bin/env nextflow + +/* + * Groovy Essentials Demo Workflow + * + * This workflow demonstrates essential Groovy concepts in Nextflow contexts + * Follow along with docs/side_quests/groovy_essentials.md for detailed explanations + */ + +// Basic pipeline parameters +params.input = "./data/samples.csv" +params.quality_threshold_min = 25 +params.quality_threshold_high = 40 + +//============================================================================= +// SECTION 1: Nextflow vs Groovy Boundaries +//============================================================================= + +workflow { + println "=== Groovy Essentials Demo ===" + + // 1.1: Basic sample processing (Nextflow + Groovy) + ch_samples = Channel.fromPath(params.input) + .splitCsv(header: true) + .map { row -> + // Groovy: Map operations and string manipulation + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + + // Groovy: Conditional logic and string interpolation + def priority = sample_meta.quality > params.quality_threshold_high ? 'high' : 'normal' + + // Nextflow: Return tuple for channel + return [sample_meta + [priority: priority], file(row.file_path)] + } + + // 1.2: Demonstrate Groovy vs Nextflow collect + demonstrateCollectDifference() + + // 2: String processing and pattern matching + demonstrateStringProcessing() + + // 3: Strategy selection and conditional logic + ch_enriched_samples = ch_samples.map { meta, file_path -> + def strategy = selectAnalysisStrategy(meta) + return [meta + strategy, file_path] + } + + // 4: Validation and error handling + ch_validated_samples = ch_enriched_samples.map { meta, file_path -> + def errors = validateSample(meta) + if (errors) { + log.warn "Sample ${meta.id} has validation issues: ${errors.join(', ')}" + } + return [meta, file_path] + } + + // 5: Essential Groovy operators demonstration + demonstrateGroovyOperators() + + // 6: Advanced closures demonstration + demonstrateAdvancedClosures() + + // 7: Collection operations demonstration + demonstrateCollectionOperations() + + // Display final processed samples + ch_validated_samples.view { meta, file_path -> + "Final: ${meta.id} (${meta.organism}, ${meta.approach}, priority: ${meta.priority})" + } +} + +//============================================================================= +// SECTION 1.2: Collect Confusion Demo +//============================================================================= + +def demonstrateCollectDifference() { + println "\n=== Groovy vs Nextflow Collect ===" + + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Groovy collect: transform each element + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids}" + + // Nextflow collect: gather channel elements + ch_collected = 
Channel.of('sample_001', 'sample_002', 'sample_003') + .collect() + ch_collected.view { "Nextflow collect result: ${it}" } +} + +//============================================================================= +// SECTION 2: String Processing and Pattern Matching +//============================================================================= + +def demonstrateStringProcessing() { + println "\n=== String Processing Demo ===" + + // Pattern matching for sample file names + def sample_files = [ + 'Human_Liver_001.fastq', + 'mouse_brain_002.fastq', + 'SRR12345678.fastq' + ] + + // Simple pattern to extract organism and tissue + def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + + sample_files.each { filename -> + def matcher = filename =~ pattern + if (matcher) { + println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}" + } else { + println "${filename} -> No standard pattern match" + } + } +} + +// Function to extract sample metadata from filename +def parseSampleName(String filename) { + def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + def matcher = filename =~ pattern + + if (matcher) { + return [ + organism: matcher[0][1].toLowerCase(), + tissue: matcher[0][2].toLowerCase(), + sample_id: matcher[0][3], + valid: true + ] + } else { + return [ + filename: filename, + valid: false + ] + } +} + +//============================================================================= +// SECTION 3: Conditional Logic and Strategy Selection +//============================================================================= + +def selectAnalysisStrategy(Map sample_meta) { + def strategy = [:] + + // Sequencing depth determines processing approach + if (sample_meta.depth < 10_000_000) { + strategy.approach = 'low_depth' + strategy.processes = ['quality_check', 'simple_alignment'] + strategy.sensitivity = 'high' + } else if (sample_meta.depth < 50_000_000) { + strategy.approach = 'standard' + strategy.processes = ['quality_check', 'bwa_alignment', 'variant_calling'] + strategy.sensitivity = 'medium' + } else { + strategy.approach = 'high_depth' + strategy.processes = ['quality_check', 'bwa_alignment', 'variant_calling', 'structural_variants'] + strategy.sensitivity = 'low' + } + + // Organism-specific adjustments + switch(sample_meta.organism) { + case 'human': + strategy.reference = 'GRCh38' + strategy.known_variants = 'dbSNP' + break + case 'mouse': + strategy.reference = 'GRCm39' + strategy.known_variants = 'mgp_variants' + break + default: + strategy.reference = 'custom' + strategy.known_variants = null + } + + // Quality-based modifications + if (sample_meta.quality < 30) { + strategy.extra_qc = true + strategy.processes = ['extensive_qc'] + strategy.processes + } + + return strategy +} + +//============================================================================= +// SECTION 4: Error Handling and Validation +//============================================================================= + +def validateSample(Map sample) { + def errors = [] + + // Check required fields + if (!sample.id) errors << "Missing sample ID" + if (!sample.organism) errors << "Missing organism" + if (!sample.quality) errors << "Missing quality score" + + // Validate data types and ranges + if (sample.quality && (sample.quality < 0 || sample.quality > 50)) { + errors << "Quality score out of range (0-50)" + } + + if (sample.depth && sample.depth < 1000) { + errors << "Sequencing depth too low (<1000)" + } + + return errors +} + +def processSample(Map sample) { + try { + def errors = 
validateSample(sample) + if (errors) { + throw new RuntimeException("Validation failed: ${errors.join(', ')}") + } + + // Simulate processing + println "Processing sample: ${sample.id}" + return [success: true, sample: sample] + + } catch (Exception e) { + println "Error processing ${sample.id}: ${e.message}" + return [ + success: false, + sample: sample, + error: e.message, + // Partial result for recovery + partial_result: [id: sample.id, status: 'failed'] + ] + } +} + +//============================================================================= +// SECTION 5: Essential Groovy Operators and Patterns +//============================================================================= + +def demonstrateGroovyOperators() { + println "\n=== Essential Groovy Operators Demo ===" + + // 5.1: Safe navigation and Elvis operators for robust data handling + def simulateQcMetrics = { hasData -> + if (hasData) { + return [ + summary: [ + before_filtering: [total_reads: 25_000_000], + after_filtering: [q30_rate: 95.2] + ], + adapter_cutting: [adapter_trimmed_reads: 2_500_000] + ] + } else { + return [:] // Empty or incomplete data + } + } + + println "Safe navigation with QC data:" + def goodQc = simulateQcMetrics(true) + def badQc = simulateQcMetrics(false) + + println " Good QC total reads: ${goodQc?.summary?.before_filtering?.total_reads ?: 0}" + println " Bad QC total reads: ${badQc?.summary?.before_filtering?.total_reads ?: 0}" + println " Good QC with Elvis: ${goodQc?.summary?.after_filtering?.q30_rate ?: 'No data'}" + + // 5.2: Groovy Truth in workflow context + println "Groovy Truth for workflow validation:" + def samples = [ + [id: 'sample1', files: ['file1.fastq', 'file2.fastq']], + [id: 'sample2', files: []], + [id: 'sample3', files: null], + [id: 'sample4', organism: 'human', files: ['data.fastq']] + ] + + samples.each { sample -> + // Groovy Truth simplifies validation + def status = sample.files ? 
"Ready (${sample.files.size()} files)" : "No files" + def organism = sample.organism ?: 'unknown' + println " ${sample.id}: ${status}, organism: ${organism}" + } + + // 5.3: Slashy strings for filename parsing in workflows + println "Filename parsing for pipeline routing:" + def sampleFiles = [ + 'Human_Liver_001_R1.fastq', + 'Mouse_Brain_002_R2.fastq', + 'SRR12345678.fastq', + 'invalid_file.txt' + ] + + // Use slashy strings to avoid escaping + def humanPattern = /^(Human)_(\w+)_(\d+)_R([12])\.fastq$/ + def mousePattern = /^(Mouse)_(\w+)_(\d+)_R([12])\.fastq$/ + def sraPattern = /^(SRR\d+)\.fastq$/ + + sampleFiles.each { filename -> + def route = 'unknown' + if (filename =~ humanPattern) route = 'human_pipeline' + else if (filename =~ mousePattern) route = 'mouse_pipeline' + else if (filename =~ sraPattern) route = 'sra_pipeline' + + println " ${filename} -> ${route}" + } + + // 5.4: Numeric literals in bioinformatics context + def evaluateSequencingDepth = { depth -> + def category = 'unknown' + def recommendation = 'unknown' + + if (depth < 1_000_000) { + category = 'very_low' + recommendation = 'Consider resequencing' + } else if (depth < 10_000_000) { + category = 'low' + recommendation = 'Adequate for basic analysis' + } else if (depth < 50_000_000) { + category = 'good' + recommendation = 'Suitable for variant calling' + } else { + category = 'excellent' + recommendation = 'Perfect for comprehensive analysis' + } + + return [category: category, recommendation: recommendation] + } + + println "Depth analysis with readable numbers:" + [500_000, 5_000_000, 25_000_000, 75_000_000].each { depth -> + def analysis = evaluateSequencingDepth(depth) + def formatted = depth.toString().replaceAll(/(\d)(?=(\d{3})+$)/, '$1,') + println " ${formatted} reads -> ${analysis.category}: ${analysis.recommendation}" + } +} + +//============================================================================= +// SECTION 6: Advanced Closures and Functional Programming +//============================================================================= + +def demonstrateAdvancedClosures() { + println "\n=== Advanced Closures Demo ===" + + // 6.1: Named closures for reusability + def extractSampleInfo = { row -> + [ + id: row.sample_id?.toLowerCase() ?: 'unknown', + organism: row.organism ?: 'unknown', + quality: (row.quality_score as Double) ?: 0.0 + ] + } + + def addPriority = { meta -> + meta + [priority: meta.quality > 40 ? 'high' : 'normal'] + } + + println "Named closures example:" + def testRow = [sample_id: 'TEST_001', organism: 'human', quality_score: '42.5'] + def processed = addPriority(extractSampleInfo(testRow)) + println " Processed: ${processed}" + + // 6.2: Function composition with >> + def normalizeId = { meta -> + meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')] + } + + def addQualityCategory = { meta -> + def category = meta.quality > 40 ? 'excellent' : + meta.quality > 30 ? 
'good' : 'acceptable' + meta + [quality_category: category] + } + + def enrichSample = normalizeId >> addQualityCategory + def testSample = [id: 'Sample-001', quality: 42.0] + def enriched = enrichSample(testSample) + println "Function composition: ${enriched}" + + // 6.3: Currying example + def qualityFilter = { threshold, meta -> meta.quality >= threshold } + def highQualityFilter = qualityFilter.curry(40) + def standardQualityFilter = qualityFilter.curry(30) + + println "Currying example:" + println " High quality filter (40+): ${highQualityFilter([quality: 42])}" + println " Standard quality filter (30+): ${standardQualityFilter([quality: 35])}" + + // 6.4: Closure with scope access + def stats = [total: 0, high_quality: 0] + def collectStats = { meta -> + stats.total++ + if (meta.quality > 40) stats.high_quality++ + return meta + } + + // Process some test data + [[quality: 45], [quality: 30], [quality: 42]].each(collectStats) + println "Statistics collected: ${stats}" +} + +//============================================================================= +// SECTION 7: Collection Operations +//============================================================================= + +def demonstrateCollectionOperations() { + println "\n=== Collection Operations Demo ===" + + // Sample data with mixed quality and organisms + def samples = [ + [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']], + [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']], + [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']], + [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']], + [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']] + ] + + // findAll - filter collections based on conditions + def high_quality_samples = samples.findAll { it.quality > 40 } + println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}" + + // groupBy - group samples by organism + def samples_by_organism = samples.groupBy { it.organism } + println "Grouping by organism:" + samples_by_organism.each { organism, sample_list -> + println " ${organism}: ${sample_list.size()} samples" + } + + // unique - get unique organisms + def organisms = samples.collect { it.organism }.unique() + println "Unique organisms: ${organisms.join(', ')}" + + // flatten - flatten nested file lists + def all_files = samples.collect { it.files }.flatten() + println "All files: ${all_files.take(5).join(', ')}... 
(${all_files.size()} total)" + + // sort - sort samples by quality + def sorted_by_quality = samples.sort { it.quality } + println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}" + + // count - count items matching condition + def human_samples = samples.count { it.organism == 'human' } + println "Human samples: ${human_samples} out of ${samples.size()}" + + // any/every - check conditions across collection + def has_high_quality = samples.any { it.quality > 40 } + def all_have_files = samples.every { it.files.size() > 0 } + println "Has high quality samples: ${has_high_quality}" + println "All samples have files: ${all_have_files}" + + // Spread operator demonstration + demonstrateSpreadOperator() +} + +def demonstrateSpreadOperator() { + println "\n=== Spread Operator Demo ===" + + def file_paths = [ + '/data/sample1.fastq', + '/data/sample2.fastq', + '/results/output1.bam', + '/results/output2.bam' + ] + + // Convert to file objects + def files = file_paths.collect { file(it) } + + // Using spread operator - equivalent to files.collect { it.getName() } + def filenames = files*.getName() + println "Filenames: ${filenames.join(', ')}" + + // Get all parent directories + def parent_dirs = files*.getParent()*.getName() + println "Parent directories: ${parent_dirs.unique().join(', ')}" + + // Get all extensions + def extensions = files*.getExtension().unique() + println "File types: ${extensions.join(', ')}" +} + +//============================================================================= +// SAMPLE PROCESSES (for demonstration) +//============================================================================= + +process QUALITY_FILTER { + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("${meta.id}_filtered.fastq") + + script: + // Groovy logic to determine parameters based on metadata + def quality_threshold = meta.organism == 'human' ? 30 : + meta.organism == 'mouse' ? 28 : 25 + def min_length = meta.priority == 'high' ? 75 : 50 + + // Conditional script sections + def extra_qc = meta.priority == 'high' ? 
'--strict-quality' : '' + + """ + echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" + + # Dynamic quality filtering based on sample characteristics + fastp \\ + --in1 ${reads} \\ + --out1 ${meta.id}_filtered.fastq \\ + --qualified_quality_phred ${quality_threshold} \\ + --length_required ${min_length} \\ + ${extra_qc} + + echo "Applied quality threshold: ${quality_threshold}" + echo "Applied length threshold: ${min_length}" + """ + + stub: + """ + echo "STUB: Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" + touch ${meta.id}_filtered.fastq + """ +} + +process JOINT_ANALYSIS { + input: + path sample_files // This will be a list of files + path reference + + output: + path "joint_results.txt" + + script: + // Transform file list into command arguments + def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ') + def sample_names = sample_files.collect { file -> + file.baseName.replaceAll(/\..*$/, '') + }.join(',') + + """ + echo "Processing ${sample_files.size()} samples" + echo "Sample names: ${sample_names}" + + # Use the transformed arguments in the actual command + analysis_tool \\ + ${file_args} \\ + --reference ${reference} \\ + --output joint_results.txt \\ + --samples ${sample_names} + """ + + stub: + """ + echo "STUB: Processing ${sample_files.size()} samples" + echo "STUB: Sample names: ${sample_names}" + touch joint_results.txt + """ +} diff --git a/side-quests/groovy_essentials/nextflow.config b/side-quests/groovy_essentials/nextflow.config new file mode 100644 index 0000000000..7d577dac43 --- /dev/null +++ b/side-quests/groovy_essentials/nextflow.config @@ -0,0 +1,39 @@ +// Nextflow configuration for Groovy Essentials side quest + +// Basic parameters for the tutorial +params { + input = './data/samples.csv' + quality_threshold_min = 25 + quality_threshold_high = 40 +} + +// Simple process defaults +process { + cpus = 2 + memory = '4.GB' + errorStrategy = 'retry' + maxRetries = 2 +} + +// Basic profiles +profiles { + standard { + process.executor = 'local' + } + + test { + params.input = './data/samples.csv' + process.memory = '2.GB' + process.cpus = 1 + } +} + +// Workflow metadata +manifest { + name = 'groovy-essentials' + description = 'Groovy Essentials for Nextflow Developers' + author = 'Nextflow Training' + version = '1.0.0' + homePage = 'https://training.nextflow.io' + nextflowVersion = '>=23.04.0' +} diff --git a/side-quests/groovy_essentials/templates/analysis_script.sh b/side-quests/groovy_essentials/templates/analysis_script.sh new file mode 100644 index 0000000000..5a9c219b7b --- /dev/null +++ b/side-quests/groovy_essentials/templates/analysis_script.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +# Nextflow template file - accessed via template directive in process +# This template has access to all variables from the process input +# Groovy expressions are evaluated at runtime + +echo "Generating report for sample: ${meta.id}" +echo "Organism: ${meta.organism}" +echo "Quality score: ${meta.quality}" + +# Conditional logic in template +<% if (meta.organism == 'human') { %> +echo "Including human-specific quality metrics" +human_qc_script.py --input ${results} --output ${meta.id}_report.html +<% } else { %> +echo "Using standard quality metrics for ${meta.organism}" +generic_qc_script.py --input ${results} --output ${meta.id}_report.html +<% } %> + +# Groovy variables can be used for calculations +<% +def priority_bonus = meta.priority == 'high' ? 
0.1 : 0.0 +def adjusted_score = (meta.quality + priority_bonus).round(2) +%> + +echo "Adjusted quality score: ${adjusted_score}" +echo "Report generation complete" From f5fba6339114c5c4320f83ccc20b55e2a594b3ef Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 14:20:52 +0100 Subject: [PATCH 02/48] Refine section 1.1 --- docs/side_quests/groovy_essentials.md | 366 +++++++++++++---- side-quests/groovy_essentials/main.nf | 554 +------------------------- 2 files changed, 292 insertions(+), 628 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 7b0c394f14..e57ea9b462 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -45,13 +45,13 @@ Before taking on this side quest you should: - Understand basic Nextflow concepts (processes, channels, workflows) - Have basic familiarity with Groovy syntax (variables, maps, lists) -You may also find it helpful to review [Basic Groovy](../basic_training/groovy.md) if you need a refresher on fundamental concepts. +This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns. ### 0.2. Starting Point Let's move into the project directory and explore our working materials. -```bash +```bash title="Navigate to project directory" cd side-quests/groovy_essentials ``` @@ -60,19 +60,21 @@ You'll find a `data` directory with sample files and a main workflow file that w ```console title="Directory contents" > tree . -├── data/ +├── data +│ ├── metadata +│ │ └── analysis_parameters.yaml │ ├── samples.csv -│ ├── sequences/ -│ │ ├── sample_001.fastq -│ │ ├── sample_002.fastq -│ │ └── sample_003.fastq -│ └── metadata/ -│ └── analysis_parameters.yaml -├── templates/ -│ └── analysis_script.sh +│ └── sequences +│ ├── sample_001.fastq +│ ├── sample_002.fastq +│ └── sample_003.fastq ├── main.nf ├── nextflow.config -└── README.md +├── README.md +└── templates + └── analysis_script.sh + +5 directories, 9 files ``` Our sample CSV contains information about biological samples that need different processing based on their characteristics: @@ -92,79 +94,267 @@ We'll use this realistic dataset to explore practical Groovy techniques that you ### 1.1. Identifying What's What -One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's start by examining a typical workflow and identifying the boundaries: +One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see how they work together. + +#### Step 1: Basic Nextflow Workflow + +Start with a simple workflow that just reads the CSV file: ```groovy title="main.nf" linenums="1" workflow { - // Nextflow: Channel factory and operator ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map { row -> - // Groovy: Map operations and string manipulation - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - - // Groovy: Conditional logic and string interpolation - def priority = sample_meta.quality > 40 ? 
'high' : 'normal' - - // Nextflow: Return tuple for channel - return [sample_meta + [priority: priority], file(row.file_path)] - } .view() } ``` -Let's break this down: +The `workflow` block defines our pipeline structure, while `Channel.fromPath()` creates a channel from a file path. The `.splitCsv()` operator processes the CSV file and converts each row into a map data structure. + +Run this workflow to see the raw CSV data: + +```bash title="Test basic workflow" +nextflow run main.nf +``` + +You should see output like: +```console title="Raw CSV data" +[id:sample_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/sample_001.fastq, quality_score:38.5] +[id:sample_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/sample_002.fastq, quality_score:35.2] +[id:sample_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/sample_003.fastq, quality_score:42.1] +``` + +#### Step 2: Adding the Map Operator + +Now let's add the `.map()` operator, which is a **Nextflow channel operator** (not to be confused with the map data structure we'll see below). This operator takes a closure where we can write Groovy code to transform each item. + +A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces `{ }` and can take parameters. They're fundamental to how Nextflow operators work. + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="3-6" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + return row + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="3" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() + ``` + +The `.map { row -> ... }` operator takes a closure where `row` represents each item from the channel. This is a named parameter - you can call it anything you want. For example, you could write `.map { item -> ... }` or `.map { sample -> ... }` and it would work exactly the same way. + +When Nextflow processes each item in the channel, it passes that item to your closure as the parameter you named. So if your channel contains CSV rows, `row` will hold one complete row at a time. + +Apply this change and run the workflow: + +```bash title="Test map operator" +nextflow run main.nf +``` + +You'll see the same output as before, because we're simply returning the input unchanged. This confirms that the map operator is working correctly. Now let's start transforming the data. + +#### Step 3: Creating a Map Data Structure + +Now let's add some Groovy code inside our closure to create a **map data structure** (different from the map operator) to organize our sample metadata. 
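If the map data structure itself is new to you, here is a minimal standalone sketch, using invented values rather than the rows from our CSV, that you can paste into a Groovy console to see how map literals behave:

```groovy title="Map literal basics (standalone sketch)"
// A Groovy map literal: comma-separated key: value pairs inside square brackets
def sample_meta = [id: 'sample_001', organism: 'human', quality: 38.5]

// Values can be read with dot notation or subscript notation
println sample_meta.organism       // human
println sample_meta['quality']     // 38.5

// Assigning to a new key adds an entry
sample_meta.tissue = 'liver'
println sample_meta                // [id:sample_001, organism:human, quality:38.5, tissue:liver]
```

With those basics in mind, here is the change to apply to the workflow: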
+ +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="5-12" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + return sample_meta + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="5" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + return row + } + .view() + ``` + +A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score. + +We use Groovy's string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. + +Apply this change and run the workflow: + +```bash title="Test map data structure" +nextflow run main.nf +``` + +You should see the refined map output like: + +```console title="Transformed metadata" +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1] +``` + +#### Step 4: Adding Conditional Logic + +Now let's add a ternary operator to make decisions based on data values. + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="11-12" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +=== "Before" -**Nextflow constructs:** -- `workflow { }` - Nextflow workflow definition -- `Channel.fromPath()` - Nextflow channel factory -- `.splitCsv()`, `.map()`, `.view()` - Nextflow channel operators -- `file()` - Nextflow file object factory + ```groovy title="main.nf" linenums="2" hl_lines="11" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + return sample_meta + } + .view() + ``` -**Groovy constructs:** -- `def sample_meta = [:]` - Groovy map definition -- `.toLowerCase()`, `.replaceAll()` - Groovy string methods -- `.toInteger()`, `.toDouble()` - Groovy type conversion -- Ternary operator `? :` - Groovy conditional expression -- Map addition `+` operator - Groovy map operations +The ternary operator is a shorthand for an if/else statement that follows the pattern `condition ? value_if_true : value_if_false`. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'". 
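If the shorthand reads strangely at first, this standalone sketch, using an illustrative quality value rather than real sample data, shows the ternary expression next to the equivalent `if`/`else`:

```groovy title="Ternary operator vs if/else (standalone sketch)"
def quality = 42.1

// Ternary form: condition ? value_if_true : value_if_false
def priority = quality > 40 ? 'high' : 'normal'

// Equivalent long form
def priority_long
if (quality > 40) {
    priority_long = 'high'
} else {
    priority_long = 'normal'
}

assert priority == priority_long   // both evaluate to 'high'
```

Both forms produce the same result; the ternary simply keeps a simple two-way choice on one line.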
-Run this workflow to see the processed output: +The map addition operator `+` creates a **new map** rather than modifying the existing one. This line creates a new map that contains all the key-value pairs from `sample_meta` plus the new `priority` key. -```bash title="Test the initial processing" +Apply this change and run the workflow: + +```bash title="Test conditional logic" nextflow run main.nf ``` -```console title="Processed sample data" -N E X T F L O W ~ version 25.04.3 +You should see output like: +```console title="Metadata with priority" +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high] +``` + +#### Step 5: Combining Maps and Returning Results + +Finally, let's return both the metadata and the file path as a tuple, which is the standard Nextflow pattern. + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="12" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + [priority: priority], file(row.file_path)] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="12" + ch_samples = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` -Launching `main.nf` [fervent_darwin] DSL2 - revision: 8a9c4f8e21 +This returns a tuple containing the enriched metadata and the file path, which is the standard pattern for passing data to processes in Nextflow. -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], data/sequences/sample_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], data/sequences/sample_002.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], data/sequences/sample_003.fastq] +Apply this change and run the workflow: + +```bash title="Test complete workflow" +nextflow run main.nf ``` +You should see output like: +```console title="Complete workflow output" +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_002.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_003.fastq] +``` + +!!! note + + **Key Pattern**: Nextflow operators often take closures `{ ... }` as parameters. 
Everything inside these closures is **Groovy code**. This is how Nextflow orchestrates workflows while Groovy handles the data processing logic. + +!!! note + + **Maps and Metadata**: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the [Working with metadata](./metadata.md) side quest. + ### 1.2. The Collect Confusion: Nextflow vs Groovy +!!! warning + + The `collect` operation exists in both Nextflow and Groovy but does completely different things. This is one of the most common sources of confusion for developers. + A perfect example of Nextflow/Groovy confusion is the `collect` operation, which exists in both contexts but does completely different things: **Groovy's `collect`** (transforms each element): -```groovy +```groovy title="Groovy collect example" // Groovy collect - transforms each item in a list +// The { it * it } is a closure (anonymous function) where 'it' refers to each element def numbers = [1, 2, 3, 4] def squared = numbers.collect { it * it } // Result: [1, 4, 9, 16] ``` **Nextflow's `collect`** (gathers all channel elements): -```groovy +```groovy title="Nextflow collect example" // Nextflow collect - gathers all channel items into a list +// This waits for all items to arrive before emitting a single list Channel.of(1, 2, 3, 4) .collect() .view() @@ -176,7 +366,6 @@ Let's demonstrate this with our sample data: === "After" ```groovy title="main.nf" linenums="25" hl_lines="1-15" - // Demonstrate Groovy vs Nextflow collect def sample_ids = ['sample_001', 'sample_002', 'sample_003'] @@ -238,7 +427,6 @@ Let's start with a simple example of extracting sample information from file nam === "After" ```groovy title="main.nf" linenums="40" hl_lines="1-15" - // Pattern matching for sample file names def sample_files = [ 'Human_Liver_001.fastq', @@ -261,14 +449,14 @@ Let's start with a simple example of extracting sample information from file nam === "Before" - ```groovy title="main.nf" linenums="40" + ```groovy title="main.nf" linenums="25" ``` This demonstrates key Groovy string processing concepts: -1. **Regular expression literals** using `~/pattern/` syntax -2. **Pattern matching** with the `=~` operator -3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. +1. **Regular expression literals** using `~/pattern/` syntax - this creates a regex pattern without needing to escape backslashes +2. **Pattern matching** with the `=~` operator - this attempts to match a string against a regex pattern +3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. - `[0]` refers to the entire match, `[1]`, `[2]`, etc. 
refer to captured groups in parentheses Run this to see the pattern matching in action: @@ -324,14 +512,14 @@ Let's create a simple function to parse sample names and return structured metad === "Before" - ```groovy title="main.nf" linenums="60" + ```groovy title="main.nf" linenums="40" ``` This demonstrates key Groovy function patterns: -- **Function definitions** with `def functionName(parameters)` -- **Map creation and return** for structured data -- **Conditional returns** based on pattern matching success +- **Function definitions** with `def functionName(parameters)` - similar to other languages but with dynamic typing +- **Map creation and return** for structured data - maps are Groovy's primary data structure for returning multiple values +- **Conditional returns** based on pattern matching success - functions can return different data structures based on conditions ### 2.3. Dynamic Script Logic in Processes @@ -652,11 +840,11 @@ Now that our pipeline can extract comprehensive sample metadata, we can use this This demonstrates several Groovy patterns commonly used in Nextflow workflows: -- **Numeric literals** with underscores for readability (`10_000_000`) -- **Switch statements** for multi-way branching -- **List concatenation** with `+` operator -- **Elvis operator** `?:` for null handling -- **Map merging** to combine metadata with strategy +- **Numeric literals** with underscores for readability (`10_000_000`) - underscores can be used in numbers to improve readability +- **Switch statements** for multi-way branching - cleaner than multiple if/else statements +- **List concatenation** with `+` operator - combines two lists into one +- **Elvis operator** `?:` for null handling - provides a default value if the left side is null or false +- **Map merging** to combine metadata with strategy - the `+` operator merges two maps, with the right map taking precedence ### 3.2. Conditional Process Execution @@ -1081,8 +1269,15 @@ With our pipeline now handling complex conditional logic, we need to make it mor ### 5.1. Safe Navigation and Elvis Operators in Workflows +!!! note + + **Safe Navigation (`?.`) and Elvis (`?:`) Operators**: These are essential for null-safe programming. Safe navigation returns null instead of throwing an exception if the object is null, while the Elvis operator provides a default value if the left side is null, empty, or false. 
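As a quick standalone illustration, with invented metadata rather than anything from the pipeline, compare plain property access with the null-safe forms:

```groovy title="Safe navigation and Elvis (standalone sketch)"
def complete   = [sample: [id: 'sample_001']]
def incomplete = [:]   // no 'sample' key at all

// Plain chained access would fail here: incomplete.sample is null,
// so incomplete.sample.id throws a NullPointerException.

// Safe navigation short-circuits to null instead of throwing
println complete?.sample?.id     // sample_001
println incomplete?.sample?.id   // null

// Elvis supplies a fallback when the left-hand side is null or otherwise "false"
def sample_id = incomplete?.sample?.id ?: 'unknown'
println sample_id                // unknown
```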
+ The safe navigation operator (`?.`) and Elvis operator (`?:`) are essential for null-safe programming when processing real-world biological data: +- **Safe navigation (`?.`)** - returns null instead of throwing an exception if the object is null +- **Elvis operator (`?:`)** - provides a default value if the left side is null, empty, or false + === "After" ```groovy title="main.nf" linenums="320" hl_lines="1-25" @@ -1136,8 +1331,9 @@ Groovy provides powerful string features for parsing filenames and generating dy workflow { // Demonstrate slashy strings for regex (no need to escape backslashes) def parseFilename = { filename -> - // Slashy string - compare to regular string: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$" - def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/ + // Slashy string - compare to regular string: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$" + // Slashy strings don't require escaping backslashes, making regex patterns much cleaner + def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/ def matcher = filename =~ pattern if (matcher) { @@ -1229,7 +1425,11 @@ In this section, you've learned: - **Safe navigation operator** (`?.`) for null-safe property access - **Elvis operator** (`?:`) for default values and null coalescing -- **Groovy Truth** - how null, empty strings, and empty collections evaluate to false +!!! note + + **Groovy Truth**: In Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts. This is different from many other languages and is essential to understand for proper conditional logic. + +- **Groovy Truth** - how null, empty strings, and empty collections evaluate to false - in Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts - **Slashy strings** (`/pattern/`) for regex patterns without escaping - **Multi-line string interpolation** for command templates - **Numeric literals with underscores** for improved readability @@ -1244,8 +1444,14 @@ Our pipeline now handles missing data gracefully and processes complex input for ### 6.1. Named Closures for Reusability +!!! note + + **Closures**: A closure is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. They're fundamental to Groovy's functional programming capabilities. + So far we've used anonymous closures defined inline within channel operations. When you find yourself repeating the same transformation logic across multiple processes or workflows, named closures can eliminate duplication and improve readability: +A **closure** is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. + === "After" ```groovy title="main.nf" linenums="350" hl_lines="1-30" @@ -1294,6 +1500,8 @@ So far we've used anonymous closures defined inline within channel operations. W Groovy closures can be composed together using the `>>` operator, allowing you to build complex transformations from simple, reusable pieces: +**Function composition** means chaining functions together so the output of one becomes the input of the next. The `>>` operator creates a new closure that applies multiple transformations in sequence. 
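A tiny standalone sketch, with throwaway closures rather than the pipeline's own helpers, makes the order of application explicit: the left-hand closure runs first and its result is passed to the right-hand one:

```groovy title="Closure composition with >> (standalone sketch)"
def addOne   = { x -> x + 1 }
def timesTwo = { x -> x * 2 }

// f >> g means "apply f, then g"
def addThenDouble = addOne >> timesTwo
println addThenDouble(3)   // (3 + 1) * 2 = 8

// << composes in the opposite order: "apply g, then f"
def doubleThenAdd = addOne << timesTwo
println doubleThenAdd(3)   // (3 * 2) + 1 = 7
```

With that reading in mind, here is how the same idea applies to the sample metadata closures: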
+ === "After" ```groovy title="main.nf" linenums="390" hl_lines="1-25" @@ -1340,6 +1548,8 @@ Groovy closures can be composed together using the `>>` operator, allowing you t Currying allows you to create specialized versions of general-purpose closures by fixing some of their parameters: +**Currying** is a technique where you take a function with multiple parameters and create a new function with some of those parameters "fixed" or "pre-filled". This creates specialized versions of general-purpose functions. + === "After" ```groovy title="main.nf" linenums="430" hl_lines="1-20" @@ -1432,6 +1642,8 @@ The functional programming patterns we just learned work beautifully with Groovy When processing large datasets, channel operations often need to organize and analyze sample collections. Groovy's collection methods integrate seamlessly with Nextflow channels to provide powerful data processing capabilities: +Groovy provides many built-in methods for working with collections (lists, maps, etc.) that make data processing much more expressive than traditional loops. + === "After" ```groovy title="main.nf" linenums="500" hl_lines="1-40" @@ -1545,6 +1757,8 @@ Working with file paths is essential in bioinformatics workflows. Groovy provide The spread operator (`*.`) is a powerful Groovy feature for calling methods on all elements in a collection: +The **spread operator** (`*.`) is a shorthand way to call the same method on every element in a collection. It's equivalent to using `.collect { it.methodName() }` but more concise. + === "After" ```groovy title="main.nf" linenums="590" hl_lines="1-20" @@ -1646,7 +1860,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ### Key Concepts Reference - **Language Boundaries** - ```groovy + ```groovy title="Nextflow vs Groovy examples" // Nextflow: workflow orchestration Channel.fromPath('*.fastq').splitCsv(header: true) @@ -1655,7 +1869,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **String Processing** - ```groovy + ```groovy title="String processing examples" // Pattern matching filename =~ ~/^(\w+)_(\w+)_(\d+)\.fastq$/ @@ -1674,7 +1888,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Error Handling** - ```groovy + ```groovy title="Error handling patterns" try { def errors = validateSample(sample) if (errors) throw new RuntimeException("Invalid: ${errors.join(', ')}") @@ -1684,7 +1898,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Essential Groovy Operators** - ```groovy + ```groovy title="Essential operators examples" // Safe navigation and Elvis operators def id = data?.sample?.id ?: 'unknown' if (sample.files) println "Has files" // Groovy Truth @@ -1698,7 +1912,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Advanced Closures** - ```groovy + ```groovy title="Advanced closure patterns" // Named closures and composition def enrichData = normalizeId >> addQualityCategory >> addFlags def processor = generalFunction.curry(fixedParam) @@ -1708,7 +1922,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Collection Operations** - ```groovy + ```groovy title="Collection operations examples" // Filter, group, and organize data def high_quality = samples.findAll { it.quality > 40 } def by_organism = samples.groupBy { it.organism } diff --git a/side-quests/groovy_essentials/main.nf 
b/side-quests/groovy_essentials/main.nf index 0cccf0e04d..31aa794ede 100644 --- a/side-quests/groovy_essentials/main.nf +++ b/side-quests/groovy_essentials/main.nf @@ -1,555 +1,5 @@ -#!/usr/bin/env nextflow - -/* - * Groovy Essentials Demo Workflow - * - * This workflow demonstrates essential Groovy concepts in Nextflow contexts - * Follow along with docs/side_quests/groovy_essentials.md for detailed explanations - */ - -// Basic pipeline parameters -params.input = "./data/samples.csv" -params.quality_threshold_min = 25 -params.quality_threshold_high = 40 - -//============================================================================= -// SECTION 1: Nextflow vs Groovy Boundaries -//============================================================================= - workflow { - println "=== Groovy Essentials Demo ===" - - // 1.1: Basic sample processing (Nextflow + Groovy) - ch_samples = Channel.fromPath(params.input) + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map { row -> - // Groovy: Map operations and string manipulation - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - - // Groovy: Conditional logic and string interpolation - def priority = sample_meta.quality > params.quality_threshold_high ? 'high' : 'normal' - - // Nextflow: Return tuple for channel - return [sample_meta + [priority: priority], file(row.file_path)] - } - - // 1.2: Demonstrate Groovy vs Nextflow collect - demonstrateCollectDifference() - - // 2: String processing and pattern matching - demonstrateStringProcessing() - - // 3: Strategy selection and conditional logic - ch_enriched_samples = ch_samples.map { meta, file_path -> - def strategy = selectAnalysisStrategy(meta) - return [meta + strategy, file_path] - } - - // 4: Validation and error handling - ch_validated_samples = ch_enriched_samples.map { meta, file_path -> - def errors = validateSample(meta) - if (errors) { - log.warn "Sample ${meta.id} has validation issues: ${errors.join(', ')}" - } - return [meta, file_path] - } - - // 5: Essential Groovy operators demonstration - demonstrateGroovyOperators() - - // 6: Advanced closures demonstration - demonstrateAdvancedClosures() - - // 7: Collection operations demonstration - demonstrateCollectionOperations() - - // Display final processed samples - ch_validated_samples.view { meta, file_path -> - "Final: ${meta.id} (${meta.organism}, ${meta.approach}, priority: ${meta.priority})" - } -} - -//============================================================================= -// SECTION 1.2: Collect Confusion Demo -//============================================================================= - -def demonstrateCollectDifference() { - println "\n=== Groovy vs Nextflow Collect ===" - - def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - - // Groovy collect: transform each element - def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') - } - println "Groovy collect result: ${formatted_ids}" - - // Nextflow collect: gather channel elements - ch_collected = Channel.of('sample_001', 'sample_002', 'sample_003') - .collect() - ch_collected.view { "Nextflow collect result: ${it}" } -} - -//============================================================================= -// SECTION 2: String Processing and Pattern Matching 
-//============================================================================= - -def demonstrateStringProcessing() { - println "\n=== String Processing Demo ===" - - // Pattern matching for sample file names - def sample_files = [ - 'Human_Liver_001.fastq', - 'mouse_brain_002.fastq', - 'SRR12345678.fastq' - ] - - // Simple pattern to extract organism and tissue - def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ - - sample_files.each { filename -> - def matcher = filename =~ pattern - if (matcher) { - println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}" - } else { - println "${filename} -> No standard pattern match" - } - } -} - -// Function to extract sample metadata from filename -def parseSampleName(String filename) { - def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ - def matcher = filename =~ pattern - - if (matcher) { - return [ - organism: matcher[0][1].toLowerCase(), - tissue: matcher[0][2].toLowerCase(), - sample_id: matcher[0][3], - valid: true - ] - } else { - return [ - filename: filename, - valid: false - ] - } -} - -//============================================================================= -// SECTION 3: Conditional Logic and Strategy Selection -//============================================================================= - -def selectAnalysisStrategy(Map sample_meta) { - def strategy = [:] - - // Sequencing depth determines processing approach - if (sample_meta.depth < 10_000_000) { - strategy.approach = 'low_depth' - strategy.processes = ['quality_check', 'simple_alignment'] - strategy.sensitivity = 'high' - } else if (sample_meta.depth < 50_000_000) { - strategy.approach = 'standard' - strategy.processes = ['quality_check', 'bwa_alignment', 'variant_calling'] - strategy.sensitivity = 'medium' - } else { - strategy.approach = 'high_depth' - strategy.processes = ['quality_check', 'bwa_alignment', 'variant_calling', 'structural_variants'] - strategy.sensitivity = 'low' - } - - // Organism-specific adjustments - switch(sample_meta.organism) { - case 'human': - strategy.reference = 'GRCh38' - strategy.known_variants = 'dbSNP' - break - case 'mouse': - strategy.reference = 'GRCm39' - strategy.known_variants = 'mgp_variants' - break - default: - strategy.reference = 'custom' - strategy.known_variants = null - } - - // Quality-based modifications - if (sample_meta.quality < 30) { - strategy.extra_qc = true - strategy.processes = ['extensive_qc'] + strategy.processes - } - - return strategy -} - -//============================================================================= -// SECTION 4: Error Handling and Validation -//============================================================================= - -def validateSample(Map sample) { - def errors = [] - - // Check required fields - if (!sample.id) errors << "Missing sample ID" - if (!sample.organism) errors << "Missing organism" - if (!sample.quality) errors << "Missing quality score" - - // Validate data types and ranges - if (sample.quality && (sample.quality < 0 || sample.quality > 50)) { - errors << "Quality score out of range (0-50)" - } - - if (sample.depth && sample.depth < 1000) { - errors << "Sequencing depth too low (<1000)" - } - - return errors -} - -def processSample(Map sample) { - try { - def errors = validateSample(sample) - if (errors) { - throw new RuntimeException("Validation failed: ${errors.join(', ')}") - } - - // Simulate processing - println "Processing sample: ${sample.id}" - return [success: true, sample: sample] - - } catch (Exception e) { - println "Error 
processing ${sample.id}: ${e.message}" - return [ - success: false, - sample: sample, - error: e.message, - // Partial result for recovery - partial_result: [id: sample.id, status: 'failed'] - ] - } -} - -//============================================================================= -// SECTION 5: Essential Groovy Operators and Patterns -//============================================================================= - -def demonstrateGroovyOperators() { - println "\n=== Essential Groovy Operators Demo ===" - - // 5.1: Safe navigation and Elvis operators for robust data handling - def simulateQcMetrics = { hasData -> - if (hasData) { - return [ - summary: [ - before_filtering: [total_reads: 25_000_000], - after_filtering: [q30_rate: 95.2] - ], - adapter_cutting: [adapter_trimmed_reads: 2_500_000] - ] - } else { - return [:] // Empty or incomplete data - } - } - - println "Safe navigation with QC data:" - def goodQc = simulateQcMetrics(true) - def badQc = simulateQcMetrics(false) - - println " Good QC total reads: ${goodQc?.summary?.before_filtering?.total_reads ?: 0}" - println " Bad QC total reads: ${badQc?.summary?.before_filtering?.total_reads ?: 0}" - println " Good QC with Elvis: ${goodQc?.summary?.after_filtering?.q30_rate ?: 'No data'}" - - // 5.2: Groovy Truth in workflow context - println "Groovy Truth for workflow validation:" - def samples = [ - [id: 'sample1', files: ['file1.fastq', 'file2.fastq']], - [id: 'sample2', files: []], - [id: 'sample3', files: null], - [id: 'sample4', organism: 'human', files: ['data.fastq']] - ] - - samples.each { sample -> - // Groovy Truth simplifies validation - def status = sample.files ? "Ready (${sample.files.size()} files)" : "No files" - def organism = sample.organism ?: 'unknown' - println " ${sample.id}: ${status}, organism: ${organism}" - } - - // 5.3: Slashy strings for filename parsing in workflows - println "Filename parsing for pipeline routing:" - def sampleFiles = [ - 'Human_Liver_001_R1.fastq', - 'Mouse_Brain_002_R2.fastq', - 'SRR12345678.fastq', - 'invalid_file.txt' - ] - - // Use slashy strings to avoid escaping - def humanPattern = /^(Human)_(\w+)_(\d+)_R([12])\.fastq$/ - def mousePattern = /^(Mouse)_(\w+)_(\d+)_R([12])\.fastq$/ - def sraPattern = /^(SRR\d+)\.fastq$/ - - sampleFiles.each { filename -> - def route = 'unknown' - if (filename =~ humanPattern) route = 'human_pipeline' - else if (filename =~ mousePattern) route = 'mouse_pipeline' - else if (filename =~ sraPattern) route = 'sra_pipeline' - - println " ${filename} -> ${route}" - } - - // 5.4: Numeric literals in bioinformatics context - def evaluateSequencingDepth = { depth -> - def category = 'unknown' - def recommendation = 'unknown' - - if (depth < 1_000_000) { - category = 'very_low' - recommendation = 'Consider resequencing' - } else if (depth < 10_000_000) { - category = 'low' - recommendation = 'Adequate for basic analysis' - } else if (depth < 50_000_000) { - category = 'good' - recommendation = 'Suitable for variant calling' - } else { - category = 'excellent' - recommendation = 'Perfect for comprehensive analysis' - } - - return [category: category, recommendation: recommendation] - } - - println "Depth analysis with readable numbers:" - [500_000, 5_000_000, 25_000_000, 75_000_000].each { depth -> - def analysis = evaluateSequencingDepth(depth) - def formatted = depth.toString().replaceAll(/(\d)(?=(\d{3})+$)/, '$1,') - println " ${formatted} reads -> ${analysis.category}: ${analysis.recommendation}" - } -} - 
-//============================================================================= -// SECTION 6: Advanced Closures and Functional Programming -//============================================================================= - -def demonstrateAdvancedClosures() { - println "\n=== Advanced Closures Demo ===" - - // 6.1: Named closures for reusability - def extractSampleInfo = { row -> - [ - id: row.sample_id?.toLowerCase() ?: 'unknown', - organism: row.organism ?: 'unknown', - quality: (row.quality_score as Double) ?: 0.0 - ] - } - - def addPriority = { meta -> - meta + [priority: meta.quality > 40 ? 'high' : 'normal'] - } - - println "Named closures example:" - def testRow = [sample_id: 'TEST_001', organism: 'human', quality_score: '42.5'] - def processed = addPriority(extractSampleInfo(testRow)) - println " Processed: ${processed}" - - // 6.2: Function composition with >> - def normalizeId = { meta -> - meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')] - } - - def addQualityCategory = { meta -> - def category = meta.quality > 40 ? 'excellent' : - meta.quality > 30 ? 'good' : 'acceptable' - meta + [quality_category: category] - } - - def enrichSample = normalizeId >> addQualityCategory - def testSample = [id: 'Sample-001', quality: 42.0] - def enriched = enrichSample(testSample) - println "Function composition: ${enriched}" - - // 6.3: Currying example - def qualityFilter = { threshold, meta -> meta.quality >= threshold } - def highQualityFilter = qualityFilter.curry(40) - def standardQualityFilter = qualityFilter.curry(30) - - println "Currying example:" - println " High quality filter (40+): ${highQualityFilter([quality: 42])}" - println " Standard quality filter (30+): ${standardQualityFilter([quality: 35])}" - - // 6.4: Closure with scope access - def stats = [total: 0, high_quality: 0] - def collectStats = { meta -> - stats.total++ - if (meta.quality > 40) stats.high_quality++ - return meta - } - - // Process some test data - [[quality: 45], [quality: 30], [quality: 42]].each(collectStats) - println "Statistics collected: ${stats}" -} - -//============================================================================= -// SECTION 7: Collection Operations -//============================================================================= - -def demonstrateCollectionOperations() { - println "\n=== Collection Operations Demo ===" - - // Sample data with mixed quality and organisms - def samples = [ - [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']], - [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']], - [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']], - [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']], - [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']] - ] - - // findAll - filter collections based on conditions - def high_quality_samples = samples.findAll { it.quality > 40 } - println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}" - - // groupBy - group samples by organism - def samples_by_organism = samples.groupBy { it.organism } - println "Grouping by organism:" - samples_by_organism.each { organism, sample_list -> - println " ${organism}: ${sample_list.size()} samples" - } - - // unique - get unique organisms - def organisms = samples.collect { it.organism }.unique() - println "Unique organisms: ${organisms.join(', ')}" - - // flatten - flatten nested file lists - def all_files = 
samples.collect { it.files }.flatten() - println "All files: ${all_files.take(5).join(', ')}... (${all_files.size()} total)" - - // sort - sort samples by quality - def sorted_by_quality = samples.sort { it.quality } - println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}" - - // count - count items matching condition - def human_samples = samples.count { it.organism == 'human' } - println "Human samples: ${human_samples} out of ${samples.size()}" - - // any/every - check conditions across collection - def has_high_quality = samples.any { it.quality > 40 } - def all_have_files = samples.every { it.files.size() > 0 } - println "Has high quality samples: ${has_high_quality}" - println "All samples have files: ${all_have_files}" - - // Spread operator demonstration - demonstrateSpreadOperator() -} - -def demonstrateSpreadOperator() { - println "\n=== Spread Operator Demo ===" - - def file_paths = [ - '/data/sample1.fastq', - '/data/sample2.fastq', - '/results/output1.bam', - '/results/output2.bam' - ] - - // Convert to file objects - def files = file_paths.collect { file(it) } - - // Using spread operator - equivalent to files.collect { it.getName() } - def filenames = files*.getName() - println "Filenames: ${filenames.join(', ')}" - - // Get all parent directories - def parent_dirs = files*.getParent()*.getName() - println "Parent directories: ${parent_dirs.unique().join(', ')}" - - // Get all extensions - def extensions = files*.getExtension().unique() - println "File types: ${extensions.join(', ')}" -} - -//============================================================================= -// SAMPLE PROCESSES (for demonstration) -//============================================================================= - -process QUALITY_FILTER { - input: - tuple val(meta), path(reads) - - output: - tuple val(meta), path("${meta.id}_filtered.fastq") - - script: - // Groovy logic to determine parameters based on metadata - def quality_threshold = meta.organism == 'human' ? 30 : - meta.organism == 'mouse' ? 28 : 25 - def min_length = meta.priority == 'high' ? 75 : 50 - - // Conditional script sections - def extra_qc = meta.priority == 'high' ? 
'--strict-quality' : '' - - """ - echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" - - # Dynamic quality filtering based on sample characteristics - fastp \\ - --in1 ${reads} \\ - --out1 ${meta.id}_filtered.fastq \\ - --qualified_quality_phred ${quality_threshold} \\ - --length_required ${min_length} \\ - ${extra_qc} - - echo "Applied quality threshold: ${quality_threshold}" - echo "Applied length threshold: ${min_length}" - """ - - stub: - """ - echo "STUB: Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" - touch ${meta.id}_filtered.fastq - """ -} - -process JOINT_ANALYSIS { - input: - path sample_files // This will be a list of files - path reference - - output: - path "joint_results.txt" - - script: - // Transform file list into command arguments - def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ') - def sample_names = sample_files.collect { file -> - file.baseName.replaceAll(/\..*$/, '') - }.join(',') - - """ - echo "Processing ${sample_files.size()} samples" - echo "Sample names: ${sample_names}" - - # Use the transformed arguments in the actual command - analysis_tool \\ - ${file_args} \\ - --reference ${reference} \\ - --output joint_results.txt \\ - --samples ${sample_names} - """ - - stub: - """ - echo "STUB: Processing ${sample_files.size()} samples" - echo "STUB: Sample names: ${sample_names}" - touch joint_results.txt - """ + .view() } From 45013e25dc124baf233df10daf13324d25f43e71 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 14:39:57 +0100 Subject: [PATCH 03/48] Add groovy essentials to nav --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index a7df779070..10baa92462 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -60,6 +60,7 @@ nav: - side_quests/splitting_and_grouping.md - side_quests/workflows_of_workflows.md - side_quests/debugging.md + - side_quests/groovy_essentials.md - side_quests/nf-test.md - side_quests/nf-core.md - side_quests/nf-test.md From 0630074312d5f42fb02f99567d46226b6e874bdf Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 14:41:16 +0100 Subject: [PATCH 04/48] Fix list rendering --- docs/side_quests/groovy_essentials.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index e57ea9b462..6575cc5149 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -11,6 +11,7 @@ Many Nextflow developers struggle with distinguishing when to use Nextflow versu **Our Mission**: Transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. 
Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: + - **Messy data?** We'll add robust parsing and null-safe operators - **Complex file naming schemes?** We'll master regex patterns and string manipulation - **Need intelligent sample routing?** We'll implement conditional logic and strategy selection From e171a4c285ba3635b8d4d0b48c1e45c43174a972 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 14:43:40 +0100 Subject: [PATCH 05/48] Emphasise departure to groovy --- docs/side_quests/groovy_essentials.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 6575cc5149..1ab17b6eb2 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -163,14 +163,15 @@ You'll see the same output as before, because we're simply returning the input u #### Step 3: Creating a Map Data Structure -Now let's add some Groovy code inside our closure to create a **map data structure** (different from the map operator) to organize our sample metadata. +Now we're going to write **pure Groovy code** inside our closure. Everything from this point forward is Groovy syntax and methods, not Nextflow operators. === "After" - ```groovy title="main.nf" linenums="2" hl_lines="5-12" + ```groovy title="main.nf" linenums="2" hl_lines="5-13" ch_samples = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .map { row -> + // This is all Groovy code now! def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -194,7 +195,7 @@ Now let's add some Groovy code inside our closure to create a **map data structu .view() ``` -A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score. +Notice how we've left Nextflow syntax behind and are now writing pure Groovy code. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score. We use Groovy's string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. @@ -214,7 +215,7 @@ You should see the refined map output like: #### Step 4: Adding Conditional Logic -Now let's add a ternary operator to make decisions based on data values. +Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values. === "After" @@ -272,7 +273,7 @@ You should see output like: #### Step 5: Combining Maps and Returning Results -Finally, let's return both the metadata and the file path as a tuple, which is the standard Nextflow pattern. +Finally, let's use Groovy's map addition operator to combine our metadata, then return a tuple that follows Nextflow's standard pattern. 
=== "After" From db4bcdcda4be859fc82778e9365636192fe39a09 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 14:53:29 +0100 Subject: [PATCH 06/48] Shorten intro --- docs/side_quests/groovy_essentials.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 1ab17b6eb2..6c0e25ff51 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -4,13 +4,9 @@ Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. -Think of it like cooking: Nextflow is your kitchen setup - the stove, pans, and organization system that lets you cook efficiently. Groovy is your knife skills, ingredient preparation techniques, and recipe adaptation abilities that make the actual cooking successful. You need both to create great meals, but knowing which tool to reach for when is essential. - Many Nextflow developers struggle with distinguishing when to use Nextflow versus Groovy features, processing file names and configurations, and handling errors gracefully. This side quest will bridge that gap by taking you on a journey from basic workflow concepts to production-ready pipeline mastery. -**Our Mission**: Transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. - -Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: +We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: - **Messy data?** We'll add robust parsing and null-safe operators - **Complex file naming schemes?** We'll master regex patterns and string manipulation @@ -19,8 +15,6 @@ Starting with a basic workflow that processes sample metadata, we'll evolve it s - **Code getting repetitive?** We'll learn functional programming with closures and composition - **Processing thousands of samples?** We'll leverage powerful collection operations -Each section builds on the previous, showing you how Groovy transforms simple workflows into powerful, production-ready pipelines that can handle the complexities of real-world bioinformatics data. - You will learn: - How to distinguish between Nextflow and Groovy constructs in your workflows @@ -32,8 +26,6 @@ You will learn: - Advanced closures and functional programming techniques - Collection operations and file path manipulations -These skills will help you write cleaner, more maintainable workflows that handle different input types appropriately and provide useful feedback when things go wrong. - --- ## 0. 
Warmup From 07dd4572424b6b0b29abf14d116ef3d80008ecc8 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 15:04:40 +0100 Subject: [PATCH 07/48] Shorten intro --- docs/side_quests/groovy_essentials.md | 23 ++++++----------------- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 6c0e25ff51..00d9e39ed3 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -8,23 +8,12 @@ Many Nextflow developers struggle with distinguishing when to use Nextflow versu We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: -- **Messy data?** We'll add robust parsing and null-safe operators -- **Complex file naming schemes?** We'll master regex patterns and string manipulation -- **Need intelligent sample routing?** We'll implement conditional logic and strategy selection -- **Worried about failures?** We'll add comprehensive error handling and validation -- **Code getting repetitive?** We'll learn functional programming with closures and composition -- **Processing thousands of samples?** We'll leverage powerful collection operations - -You will learn: - -- How to distinguish between Nextflow and Groovy constructs in your workflows -- String processing and pattern matching for bioinformatics file names -- Transforming file collections into command-line arguments -- Conditional logic for controlling process execution -- Basic validation and error handling patterns -- Essential Groovy operators: safe navigation, Elvis, and Groovy Truth -- Advanced closures and functional programming techniques -- Collection operations and file path manipulations +- **Messy data?** We'll add robust parsing and null-safe operators, learning to distinguish between Nextflow and Groovy constructs +- **Complex file naming schemes?** We'll master regex patterns and string manipulation for bioinformatics file names +- **Need intelligent sample routing?** We'll implement conditional logic and strategy selection, transforming file collections into command-line arguments +- **Worried about failures?** We'll add comprehensive error handling and validation patterns +- **Code getting repetitive?** We'll learn functional programming with closures and composition, mastering essential Groovy operators like safe navigation and Elvis +- **Processing thousands of samples?** We'll leverage powerful collection operations for file path manipulations --- From 5c92314863a537701df322889cde1a6c78209e61 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 4 Sep 2025 16:24:08 +0100 Subject: [PATCH 08/48] Fix highlight --- docs/side_quests/groovy_essentials.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 00d9e39ed3..a052ad91e8 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -148,7 +148,7 @@ Now we're going to write **pure Groovy code** inside our closure. 
Everything fro === "After" - ```groovy title="main.nf" linenums="2" hl_lines="5-13" + ```groovy title="main.nf" linenums="2" hl_lines="4-12" ch_samples = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .map { row -> @@ -167,7 +167,7 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro === "Before" - ```groovy title="main.nf" linenums="2" hl_lines="5" + ```groovy title="main.nf" linenums="2" hl_lines="4" ch_samples = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .map { row -> From 4ef998614c92840754594ab3033c99311cfa2d7c Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Wed, 10 Sep 2025 18:31:37 +0100 Subject: [PATCH 09/48] Refine section 1.2 --- docs/side_quests/groovy_essentials.md | 88 ++++++++++++------------ side-quests/groovy_essentials/collect.nf | 23 +++++++ 2 files changed, 67 insertions(+), 44 deletions(-) create mode 100644 side-quests/groovy_essentials/collect.nf diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index a052ad91e8..4e317f2c41 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -317,69 +317,69 @@ You should see output like: **Maps and Metadata**: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the [Working with metadata](./metadata.md) side quest. -### 1.2. The Collect Confusion: Nextflow vs Groovy +Our workflow demonstrates the core pattern: **Nextflow constructs** (`workflow`, `Channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate data flow, while **basic Groovy constructs** (maps `[key: value]`, string methods, type conversions, ternary operators) handle the data processing logic inside the `.map()` closure. -!!! warning +### 1.2. Distinguishing Nextflow operators from Groovy functions - The `collect` operation exists in both Nextflow and Groovy but does completely different things. This is one of the most common sources of confusion for developers. +Having a clear understanding of which parts of your code are using basic Groovy is especially important when syntax overlaps between the two languages. -A perfect example of Nextflow/Groovy confusion is the `collect` operation, which exists in both contexts but does completely different things: +A perfect example of this confusion is the `collect` operation, which exists in both contexts but does completely different things. Groovy's `collect` transforms each element, while Nextflow's `collect` gathers all channel elements into a single-item channel. -**Groovy's `collect`** (transforms each element): -```groovy title="Groovy collect example" -// Groovy collect - transforms each item in a list -// The { it * it } is a closure (anonymous function) where 'it' refers to each element -def numbers = [1, 2, 3, 4] -def squared = numbers.collect { it * it } -// Result: [1, 4, 9, 16] -``` - -**Nextflow's `collect`** (gathers all channel elements): -```groovy title="Nextflow collect example" -// Nextflow collect - gathers all channel items into a list -// This waits for all items to arrive before emitting a single list -Channel.of(1, 2, 3, 4) - .collect() - .view() -// Result: [1, 2, 3, 4] (single channel emission) -``` - -Let's demonstrate this with our sample data: - -=== "After" +Let's demonstrate this with some sample data. 
Check out `collect.nf`: - ```groovy title="main.nf" linenums="25" hl_lines="1-15" - // Demonstrate Groovy vs Nextflow collect - def sample_ids = ['sample_001', 'sample_002', 'sample_003'] +```groovy title="collect.nf" linenums="1" +// Demonstrate Groovy vs Nextflow collect +def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Groovy collect: transform each element - def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') - } - println "Groovy collect result: ${formatted_ids}" +println "=== GROOVY COLLECT (transforms each item, keeps same structure) ===" +// Groovy collect: transforms each element but maintains list structure +def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') +} +println "Original list: ${sample_ids}" +println "Groovy collect result: ${formatted_ids}" +println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)" +println "" - // Nextflow collect: gather channel elements - ch_collected = Channel.of('sample_001', 'sample_002', 'sample_003') - .collect() - ch_collected.view { "Nextflow collect result: ${it}" } - ``` +println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ===" +// Nextflow collect: groups channel elements into a single emission +ch_input = Channel.of('sample_001', 'sample_002', 'sample_003') -=== "Before" +// Show individual items before collect +ch_input.view { "Individual channel item: ${it}" } - ```groovy title="main.nf" linenums="25" - ``` +// Collect groups all items into a single emission +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" } +``` Run the workflow to see both collect operations in action: ```bash title="Test collect operations" -nextflow run main.nf +nextflow run collect.nf ``` ```console title="Different collect behaviors" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [silly_bhaskara] DSL2 - revision: 5ef004224c + +=== GROOVY COLLECT (transforms each item, keeps same structure) === +Original list: [sample_001, sample_002, sample_003] Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] -Nextflow collect result: [sample_001, sample_002, sample_003] +Groovy collect maintains structure: 3 items (same as original) + +=== NEXTFLOW COLLECT (groups multiple items into single emission) === +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect result: [sample_001, sample_002, sample_003] (3 items grouped together) ``` +The key difference: **Groovy's `collect`** transforms items but preserves structure (like Nextflow's `map`), while **Nextflow's `collect()`** groups multiple channel emissions into a single list. + +But `collect` really isn't the main point. The key lesson: always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. 
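`collect` is not the only name that lives on both sides of the boundary: `flatten`, `unique`, and `first` also exist both as Groovy collection methods and as Nextflow channel operators. The sketch below is purely illustrative (a standalone script in the same style as `collect.nf`, not one of the course files) and contrasts the two `flatten`s:

```groovy title="flatten_example.nf (illustrative)"
// Groovy flatten(): a List method that returns a new, flattened list immediately
def nested = [['a.fastq', 'b.fastq'], ['c.fastq']]
println "Groovy flatten: ${nested.flatten()}" // [a.fastq, b.fastq, c.fastq]

// Nextflow flatten(): a channel operator that emits every element as a separate item
Channel.of(['a.fastq', 'b.fastq'], ['c.fastq'])
    .flatten()
    .view { "Nextflow flatten emission: ${it}" } // three separate emissions
```

The rule of thumb is always the same: check whether the receiver is a plain Groovy collection or a Nextflow channel before assuming what a shared method name will do.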
+ ### Takeaway In this section, you've learned: diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf new file mode 100644 index 0000000000..10fd3cf692 --- /dev/null +++ b/side-quests/groovy_essentials/collect.nf @@ -0,0 +1,23 @@ +// Demonstrate Groovy vs Nextflow collect +def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + +println "=== GROOVY COLLECT (transforms each item, keeps same structure) ===" +// Groovy collect: transforms each element but maintains list structure +def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') +} +println "Original list: ${sample_ids}" +println "Groovy collect result: ${formatted_ids}" +println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)" +println "" + +println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ===" +// Nextflow collect: groups channel elements into a single emission +ch_input = Channel.of('sample_001', 'sample_002', 'sample_003') + +// Show individual items before collect +ch_input.view { "Individual channel item: ${it}" } + +// Collect groups all items into a single emission +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" } From aa468a2163f5e9c5b8c3f00d86e48e813e4a318a Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Wed, 8 Oct 2025 15:01:47 +0100 Subject: [PATCH 10/48] messy updates up to section 4 --- docs/side_quests/groovy_essentials.md | 1030 ++++++++--------- .../groovy_essentials/modules/fastp.nf | 37 + .../groovy_essentials/modules/fastp.nf.bak | 22 + .../groovy_essentials/modules/trimgalore.nf | 37 + 4 files changed, 557 insertions(+), 569 deletions(-) create mode 100644 side-quests/groovy_essentials/modules/fastp.nf create mode 100644 side-quests/groovy_essentials/modules/fastp.nf.bak create mode 100644 side-quests/groovy_essentials/modules/trimgalore.nf diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 4e317f2c41..11d9540e47 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1,12 +1,21 @@ # Groovy Essentials for Nextflow Developers -Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. While Nextflow provides the workflow orchestration framework, Groovy provides the programming language foundations that make your workflows flexible, maintainable, and powerful. +Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. This foundation gives Nextflow its flexibility and expressiveness, but it also creates a common source of confusion for developers. -Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. +**Here's the challenge:** Most Nextflow tutorials focus on the workflow orchestration - channels, processes, and data flow - but when you need to manipulate data, parse filenames, implement conditional logic, or handle errors gracefully, you're actually writing Groovy code. Many developers don't realize when they've crossed this boundary. 
+ +**Why does this matter?** The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to understanding and leveraging Groovy's powerful features within your Nextflow workflows. -Many Nextflow developers struggle with distinguishing when to use Nextflow versus Groovy features, processing file names and configurations, and handling errors gracefully. This side quest will bridge that gap by taking you on a journey from basic workflow concepts to production-ready pipeline mastery. +**The common struggle:** Most Nextflow developers can write basic workflows, but they hit walls when they need to: +- Process messy, real-world data with missing fields or inconsistent formats +- Extract metadata from complex file naming schemes +- Route samples through different analysis strategies based on their characteristics +- Handle errors gracefully instead of crashing on invalid input +- Build reusable, maintainable code that doesn't repeat the same patterns everywhere -We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: +Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. + +This side quest will bridge that gap by taking you on a hands-on journey from basic concepts to production-ready mastery. We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: - **Messy data?** We'll add robust parsing and null-safe operators, learning to distinguish between Nextflow and Groovy constructs - **Complex file naming schemes?** We'll master regex patterns and string manipulation for bioinformatics file names @@ -23,7 +32,7 @@ We'll transform a simple CSV-reading workflow into a sophisticated, production-r Before taking on this side quest you should: -- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial +- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial or have equivalent experience - Understand basic Nextflow concepts (processes, channels, workflows) - Have basic familiarity with Groovy syntax (variables, maps, lists) @@ -80,7 +89,7 @@ One of the most common sources of confusion for Nextflow developers is understan #### Step 1: Basic Nextflow Workflow -Start with a simple workflow that just reads the CSV file: +Start with a simple workflow that just reads the CSV file (we've already done this for you in `main.nf`): ```groovy title="main.nf" linenums="1" workflow { @@ -107,9 +116,13 @@ You should see output like: #### Step 2: Adding the Map Operator -Now let's add the `.map()` operator, which is a **Nextflow channel operator** (not to be confused with the map data structure we'll see below). This operator takes a closure where we can write Groovy code to transform each item. 
+Now we're going to use some Groovy code to transform the data, using the `.map()` operator you will probably already be familiar with. This operator takes a 'closure' where we can write Groovy code to transform each item. -A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces `{ }` and can take parameters. They're fundamental to how Nextflow operators work. +!!! note + + A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces `{ }` and can take parameters. They're fundamental to how Nextflow operators work and if you've been writing Nextflow for a while, you may already have been using them without realizing it! + +Here's what that map operation looks like: === "After" @@ -130,7 +143,7 @@ A **closure** is a block of code that can be passed around and executed later. T .view() ``` -The `.map { row -> ... }` operator takes a closure where `row` represents each item from the channel. This is a named parameter - you can call it anything you want. For example, you could write `.map { item -> ... }` or `.map { sample -> ... }` and it would work exactly the same way. +The closure we've added is `{ row -> return row }`. We've named the parameter `row`, but it could be called anything, so you could write `.map { item -> ... }` or `.map { sample -> ... }` and it would work exactly the same way. It's also possible not to name the parameter at all and just use the implicit variable `it`, like `.map { return it }`, but naming it makes the code clearer. When Nextflow processes each item in the channel, it passes that item to your closure as the parameter you named. So if your channel contains CSV rows, `row` will hold one complete row at a time. @@ -198,6 +211,8 @@ You should see the refined map output like: Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values. +Make the following change: + === "After" ```groovy title="main.nf" linenums="2" hl_lines="11-12" @@ -235,31 +250,40 @@ Now let's add more Groovy logic - this time using a ternary operator to make dec .view() ``` -The ternary operator is a shorthand for an if/else statement that follows the pattern `condition ? value_if_true : value_if_false`. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'". +The ternary operator is a shorthand for an if/else statement that follows the pattern `condition ? value_if_true : value_if_false`. This line means: "If the quality is greater than 40, use 'high', otherwise use 'normal'". Its cousin, the **Elvis operator** (`?:`), provides default values when something is null or empty - we'll explore that pattern later in this tutorial. The map addition operator `+` creates a **new map** rather than modifying the existing one. This line creates a new map that contains all the key-value pairs from `sample_meta` plus the new `priority` key. -Apply this change and run the workflow: +!!! Note + + Using the addition operator `+` creates a new map rather than modifying the existing one, which is a useful practice to adopt. Never directly modify maps passed into closures, as this can lead to unexpected behavior in Nextflow. 
This is especially important because in Nextflow workflows, the same data often flows through multiple channel operations or gets processed by different processes simultaneously. When multiple operations reference the same map object, modifying it in-place can cause unpredictable side effects - one operation might change data that another operation is still using. By creating new maps instead of modifying existing ones, you ensure that each operation works with its own clean copy of the data, making your workflows more predictable and easier to debug. + +Run the modified workflow: ```bash title="Test conditional logic" nextflow run main.nf ``` You should see output like: + ```console title="Metadata with priority" [id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] [id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal] [id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high] ``` +We've successfully added conditional logic to enrich our metadata with a priority level based on quality scores. + #### Step 5: Combining Maps and Returning Results -Finally, let's use Groovy's map addition operator to combine our metadata, then return a tuple that follows Nextflow's standard pattern. +So far, we've only been returning what Nextflow community calls the 'meta map', and we've been ignoring the files those metadata relate to. But if you're writing Nextflow workflows, you probably want to do something with those files. + +Let's output a channel structure comprising a tuple of 2 elements: the enriched metadata map and the corresponding file path. This is a common pattern in Nextflow for passing data to processes. === "After" ```groovy title="main.nf" linenums="2" hl_lines="12" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -294,8 +318,6 @@ Finally, let's use Groovy's map addition operator to combine our metadata, then .view() ``` -This returns a tuple containing the enriched metadata and the file path, which is the standard pattern for passing data to processes in Nextflow. - Apply this change and run the workflow: ```bash title="Test complete workflow" @@ -309,9 +331,7 @@ You should see output like: [[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_003.fastq] ``` -!!! note - - **Key Pattern**: Nextflow operators often take closures `{ ... }` as parameters. Everything inside these closures is **Groovy code**. This is how Nextflow orchestrates workflows while Groovy handles the data processing logic. +This `[meta, file]` tuple structure is a common pattern in Nextflow for passing both metadata and associated files to processes. !!! note @@ -321,39 +341,25 @@ Our workflow demonstrates the core pattern: **Nextflow constructs** (`workflow`, ### 1.2. Distinguishing Nextflow operators from Groovy functions -Having a clear understanding of which parts of your code are using basic Groovy is especially important when syntax overlaps between the two languages. +So far, so good, we can distinguish between Nextflow constructs and basic Groovy constructs. But what about when the syntax overlaps? A perfect example of this confusion is the `collect` operation, which exists in both contexts but does completely different things. 
Groovy's `collect` transforms each element, while Nextflow's `collect` gathers all channel elements into a single-item channel. -Let's demonstrate this with some sample data. Check out `collect.nf`: +Let's demonstrate this with some sample data, starting by refreshing ourselves on what the Nextflow `collect()` operator does. Check out `collect.nf`: ```groovy title="collect.nf" linenums="1" -// Demonstrate Groovy vs Nextflow collect def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -println "=== GROOVY COLLECT (transforms each item, keeps same structure) ===" -// Groovy collect: transforms each element but maintains list structure -def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') -} -println "Original list: ${sample_ids}" -println "Groovy collect result: ${formatted_ids}" -println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)" -println "" - -println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ===" -// Nextflow collect: groups channel elements into a single emission -ch_input = Channel.of('sample_001', 'sample_002', 'sample_003') - -// Show individual items before collect +// Nextflow collect() - groups multiple channel emissions into one +ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } - -// Collect groups all items into a single emission ch_collected = ch_input.collect() -ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" } +ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } ``` -Run the workflow to see both collect operations in action: +We're using the `fromList()` channel factory to create a channel that emits each sample ID as a separate item, and we use `view()` to print each item as it flows through the channel.Then we apply Nextflow's `collect()` operator to gather all items into a single list and use a second `view()` to print the collected result which appears as a single item containing a list of all sample IDs. We've changed the structure of the channel, but we haven't changed the data itself. + +Run the workflow to confirm this: ```bash title="Test collect operations" nextflow run collect.nf @@ -362,77 +368,137 @@ nextflow run collect.nf ```console title="Different collect behaviors" N E X T F L O W ~ version 25.04.6 -Launching `collect.nf` [silly_bhaskara] DSL2 - revision: 5ef004224c +Launching `collect.nf` [loving_mendel] DSL2 - revision: e8d054a46e + +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +``` + +Now let's see Groovy's `collect` method in action. 
Modify `collect.nf` to apply Groovy's `collect` method to the original list of sample IDs: -=== GROOVY COLLECT (transforms each item, keeps same structure) === -Original list: [sample_001, sample_002, sample_003] -Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] -Groovy collect maintains structure: 3 items (same as original) +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="9-13" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -=== NEXTFLOW COLLECT (groups multiple items into single emission) === + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ``` + +Run the modified workflow: + +```bash title="Test Groovy collect" +nextflow run collect.nf +``` + +```console title="Groovy collect results" hl_lines="9" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 + +Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Nextflow collect result: [sample_001, sample_002, sample_003] (3 items grouped together) +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` -The key difference: **Groovy's `collect`** transforms items but preserves structure (like Nextflow's `map`), while **Nextflow's `collect()`** groups multiple channel emissions into a single list. +This time, have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using Groovy's `collect` method to produce a new list with modified values. This is sort of like using the `map` operator in Nextflow, but it's pure Groovy code operating on a standard Groovy list. -But `collect` really isn't the main point. The key lesson: always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. +`collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. 
### Takeaway In this section, you've learned: -- **Distinguishing Nextflow from Groovy**: How to identify which language construct you're using +- **It takes both Nextflow and Groovy**: Nextflow provides the workflow structure and data flow, while Groovy provides the data manipulation and logic +- **Distinguishing Nextflow from Groovy**: How to identify which language construct you're using given the context - **Context matters**: The same operation name can have completely different behaviors -- **Workflow structure**: Nextflow provides the orchestration, Groovy provides the logic -- **Data transformation patterns**: When to use Groovy methods vs Nextflow operators Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows. -Now that we can distinguish between Nextflow and Groovy constructs, let's enhance our sample processing pipeline with more sophisticated data handling capabilities. +Next we'll dive deeper into Groovy's powerful string processing capabilities, which are essential for handling real-world data. --- ## 2. Advanced String Processing for Bioinformatics -Our basic pipeline processes CSV metadata well, but this is just the beginning. In production bioinformatics, you'll encounter files from different sequencing centers with varying naming conventions, legacy datasets with non-standard formats, and the need to extract critical information from filenames themselves. - The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. Let's transform our pipeline to handle the messy realities of real-world bioinformatics data. ### 2.1. Pattern Matching and Regular Expressions Many bioinformatics workflows encounter files with complex naming conventions that encode important metadata. Let's see how Groovy's pattern matching can extract this information automatically. -Let's start with a simple example of extracting sample information from file names: +We're going to return to our `main.nf` workflow and add some pattern matching logic to extract additional sample information from file names. The FASTQ files in our dataset follow Illumina-style naming conventions with names like `SAMPLE_001_S1_L001_R1_001.fastq.gz`. These might look cryptic, but they actually encode useful metadata like sample ID, lane number, and read direction. We're going to use Groovy's regex capabilities to parse these names. + +Make the following change to your existing `main.nf` workflow: === "After" - ```groovy title="main.nf" linenums="40" hl_lines="1-15" - // Pattern matching for sample file names - def sample_files = [ - 'Human_Liver_001.fastq', - 'mouse_brain_002.fastq', - 'SRR12345678.fastq' - ] + ```groovy title="main.nf" linenums="2" hl_lines="9-19,21" + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) - // Simple pattern to extract organism and tissue - def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? 
[ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] - sample_files.each { filename -> - def matcher = filename =~ pattern - if (matcher) { - println "${filename} -> Organism: ${matcher[0][1]}, Tissue: ${matcher[0][2]}, ID: ${matcher[0][3]}" - } else { - println "${filename} -> No standard pattern match" - } - } + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } ``` === "Before" - ```groovy title="main.nf" linenums="25" + ```groovy title="main.nf" linenums="2" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + [priority: priority], file(row.file_path)] + } + .view() ``` This demonstrates key Groovy string processing concepts: @@ -441,622 +507,448 @@ This demonstrates key Groovy string processing concepts: 2. **Pattern matching** with the `=~` operator - this attempts to match a string against a regex pattern 3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. - `[0]` refers to the entire match, `[1]`, `[2]`, etc. refer to captured groups in parentheses -Run this to see the pattern matching in action: +The regex pattern `^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$` is designed to match the Illumina-style naming convention and capture key components, namely the sample number, lane number, read direction, and chunk number. + +Run the modified workflow: ```bash title="Test pattern matching" nextflow run main.nf ``` -```console title="Pattern matching results" -Human_Liver_001.fastq -> Organism: Human, Tissue: Liver, ID: 001 -mouse_brain_002.fastq -> Organism: mouse, Tissue: brain, ID: 002 -SRR12345678.fastq -> No standard pattern match +You should see output with metadata enriched from the file names, like + +```console title="Metadata with file parsing" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 + +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq.gz] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq.gz] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq.gz] ``` -### 2.2. Creating Reusable Parsing Functions +### 2.2. Creating Reusable closures + +You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions or closures. 
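If it's been a while since you wrote a standalone Groovy closure, the general shape is simply a block of code assigned to a variable, which you can then call like an ordinary function. Here's a minimal sketch (not part of `main.nf`, just for illustration):

```groovy title="Named closure sketch (illustrative)"
// A closure stored in a variable...
def describe_sample = { meta -> "${meta.id} (${meta.organism}, quality ${meta.quality})" }

// ...and called later with ordinary function-call syntax
println describe_sample([id: 'sample_001', organism: 'human', quality: 38.5])
// prints: sample_001 (human, quality 38.5)
```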
-Let's create a simple function to parse sample names and return structured metadata: +To do that we simply define a closure using the assignment operator `=` and the `{}` syntax, within the `workflow{}`. Then we can call that closure by name inside our map operation using standard function call syntax (not the curly braces). + +Make that change like so: === "After" - ```groovy title="main.nf" linenums="60" hl_lines="1-20" + ```groovy title="main.nf" linenums="1" hl_lines="3-23,27" + workflow { - // Function to extract sample metadata from filename - def parseSampleName(String filename) { - def pattern = ~/^(\w+)_(\w+)_(\d+)\.fastq$/ - def matcher = filename =~ pattern + separateMetadata = { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) - if (matcher) { - return [ - organism: matcher[0][1].toLowerCase(), - tissue: matcher[0][2].toLowerCase(), - sample_id: matcher[0][3], - valid: true - ] - } else { - return [ - filename: filename, - valid: false - ] - } - } + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] - // Test the parser - sample_files.each { filename -> - def parsed = parseSampleName(filename) - println "File: ${filename}" - if (parsed.valid) { - println " Organism: ${parsed.organism}, Tissue: ${parsed.tissue}, ID: ${parsed.sample_id}" - } else { - println " Could not parse filename" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + .view() } - } ``` === "Before" - ```groovy title="main.nf" linenums="40" + ```groovy title="main.nf" linenums="1" hl_lines="4-24" + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + .view() + } ``` -This demonstrates key Groovy function patterns: +By doing this we've reduced the actual workflow logic down to something really trivial: -- **Function definitions** with `def functionName(parameters)` - similar to other languages but with dynamic typing -- **Map creation and return** for structured data - maps are Groovy's primary data structure for returning multiple values -- **Conditional returns** based on pattern matching success - functions can return different data structures based on conditions - -### 2.3. Dynamic Script Logic in Processes - -In Nextflow, dynamic behavior comes from using Groovy logic within process script blocks, not generating script strings. 
Here are realistic patterns: +```groovy title="minimal workflow" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + .view() +``` -=== "After" +... which makes the logic much easier to read and understand at a glance. The closure `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. - ```groovy title="main.nf" linenums="115" hl_lines="1-40" +You can run that to make sure it still works: - // Process with conditional script logic - process QUALITY_FILTER { - input: - tuple val(meta), path(reads) +```bash title="Test reusable closure" +nextflow run main.nf +``` - output: - tuple val(meta), path("${meta.id}_filtered.fastq") +```console title="Closure results" + N E X T F L O W ~ version 25.04.6 - script: - // Groovy logic to determine parameters based on metadata - def quality_threshold = meta.organism == 'human' ? 30 : - meta.organism == 'mouse' ? 28 : 25 - def min_length = meta.priority == 'high' ? 75 : 50 - - // Conditional script sections - def extra_qc = meta.priority == 'high' ? '--strict-quality' : '' - - """ - echo "Processing ${meta.id} (${meta.organism}, priority: ${meta.priority})" - - # Dynamic quality filtering based on sample characteristics - fastp \\ - --in1 ${reads} \\ - --out1 ${meta.id}_filtered.fastq \\ - --qualified_quality_phred ${quality_threshold} \\ - --length_required ${min_length} \\ - ${extra_qc} - - echo "Applied quality threshold: ${quality_threshold}" - echo "Applied length threshold: ${min_length}" - """ - } +Launching `main.nf` [tender_archimedes] DSL2 - revision: 8bfb9b2485 - // Process with completely different scripts based on organism - process ALIGN_READS { - input: - tuple val(meta), path(reads) +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq.gz] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq.gz] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq.gz] +``` - output: - tuple val(meta), path("${meta.id}.bam") +### 2.3. Dynamic Script Logic in Processes - script: - if (meta.organism == 'human') { - """ - echo "Using human-specific STAR alignment" - STAR --runMode alignReads \\ - --genomeDir /refs/human/STAR \\ - --readFilesIn ${reads} \\ - --outSAMtype BAM SortedByCoordinate \\ - --outFileNamePrefix ${meta.id} - mv ${meta.id}Aligned.sortedByCoord.out.bam ${meta.id}.bam - """ - } else if (meta.organism == 'mouse') { - """ - echo "Using mouse-specific bowtie2 alignment" - bowtie2 -x /refs/mouse/genome \\ - -U ${reads} \\ - --sensitive \\ - | samtools sort -o ${meta.id}.bam - - """ - } else { - """ - echo "Using generic alignment for ${meta.organism}" - minimap2 -ax sr /refs/generic/genome.fa ${reads} \\ - | samtools sort -o ${meta.id}.bam - - """ - } - } +Another place you'll find it very useful to break out your Groovy toolbox is in process script blocks. 
You can use Groovy logic to make your scripts dynamic and adaptable to different input conditions. - // Using templates (Nextflow's built-in templating) - process GENERATE_REPORT { - input: - tuple val(meta), path(results) +To illustrate what we mean, let's add some processes to our existing `main.nf` workflow that demonstrate common patterns for dynamic script generation. Open `modules/fastp.nf` and take a look: - output: - path("${meta.id}_report.html") +```groovtitle="modules/fastp.nf" linenums="1" +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - script: - template 'report_template.sh' - } - ``` + input: + tuple val(meta), path(reads) -=== "Before" + output: + tuple val(sample_id), path("*_trimmed*.fastq.gz"), emit: reads - ```groovy title="main.nf" linenums="115" - ``` + script: + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ +} +``` -Now let's look at the template file that would go with this: +The process takes FASTQ files as input and runs the `fastp` tool to trim adapters and filter low-quality reads. Unfortunately, the person who wrote this process didn't allow for the single-end reads we have in our example dataset. Let's add it to our workflow and see what happens: === "After" - ```bash title="templates/report_template.sh" linenums="1" hl_lines="1-25" - #!/bin/bash - - # This template has access to all variables from the process input - # Groovy expressions are evaluated at runtime - - echo "Generating report for sample: ${meta.id}" - echo "Organism: ${meta.organism}" - echo "Quality score: ${meta.quality}" - - # Conditional logic in template - <% if (meta.organism == 'human') { %> - echo "Including human-specific quality metrics" - human_qc_script.py --input ${results} --output ${meta.id}_report.html - <% } else { %> - echo "Using standard quality metrics for ${meta.organism}" - generic_qc_script.py --input ${results} --output ${meta.id}_report.html - <% } %> - - # Groovy variables can be used for calculations - <% - def priority_bonus = meta.priority == 'high' ? 0.1 : 0.0 - def adjusted_score = (meta.quality + priority_bonus).round(2) - %> - - echo "Adjusted quality score: ${adjusted_score}" - echo "Report generation complete" - ``` - -=== "Before" + ```groovy title="main.nf" linenums="1" hl_lines="1,31" + include { FASTP } from './modules/fastp.nf' - ```bash title="templates/report_template.sh" - ``` + workflow { -This demonstrates realistic Nextflow patterns: - -- **Conditional script blocks** using Groovy if/else in the script section -- **Variable interpolation** directly in script blocks -- **Template files** with Groovy expressions (using `<% %>` and `${}`) -- **Dynamic parameter calculation** based on metadata - -### 2.4. Transforming File Collections into Command Arguments + separateMetadata = { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) -A particularly powerful pattern is using Groovy logic in the script block to transform collections of files into properly formatted command-line arguments. 
This is essential when tools expect multiple files as separate arguments: + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] -=== "After" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } - ```groovy title="main.nf" linenums="200" hl_lines="1-35" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) - // Process that needs to handle multiple input files - process JOINT_ANALYSIS { - input: - path sample_files // This will be a list of files - path reference + ch_fastp = FASTP(ch_samples) + } + ``` - output: - path "joint_results.txt" +=== "Before" - script: - // Transform file list into command arguments - def file_args = sample_files.collect { file -> "--input ${file}" }.join(' ') - def sample_names = sample_files.collect { file -> - file.baseName.replaceAll(/\..*$/, '') - }.join(',') - - """ - echo "Processing ${sample_files.size()} samples" - echo "Sample names: ${sample_names}" - - # Use the transformed arguments in the actual command - analysis_tool \\ - ${file_args} \\ - --reference ${reference} \\ - --output joint_results.txt \\ - --samples ${sample_names} - """ - } + ```groovy title="main.nf" linenums="1" hl_lines="4-24" + workflow { - // Process that builds complex command based on file characteristics - process VARIABLE_COMMAND { - input: - tuple val(meta), path(files) + separateMetadata = { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) - output: - path "${meta.id}_processed.txt" + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] - script: - // Complex command building based on file types and metadata - def input_flags = files.collect { file -> - def extension = file.getExtension() - switch(extension) { - case 'bam': - return "--bam-input ${file}" - case 'vcf': - return "--vcf-input ${file}" - case 'bed': - return "--intervals ${file}" - default: - return "--data-input ${file}" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] } - }.join(' ') - - // Additional flags based on metadata - def extra_flags = meta.quality > 35 ? 
'--high-quality' : '' - - """ - echo "Building command for ${meta.id}" - variant_caller \\ - ${input_flags} \\ - ${extra_flags} \\ - --output ${meta.id}_processed.txt - """ - } + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + .view() + } ``` -=== "Before" - - ```groovy title="main.nf" linenums="200" - ``` +Run this modified workflow: -Key patterns demonstrated: +```bash title="Test fastp process" +nextflow run main.nf +``` -- **File collection transformation**: Using `.collect{}` to transform each file into a command argument -- **String joining**: Using `.join(' ')` to combine arguments with spaces -- **File name manipulation**: Using `.baseName` and `.replaceAll()` for sample names -- **Conditional argument building**: Using switch statements or conditionals to build different arguments based on file types -- **Multiple transformations**: Building both file arguments and sample name lists from the same collection +You'll see a long error trace with some content like: -### Takeaway +```console title="Process error" +ERROR ~ Error executing process > 'FASTP (3)' -In this section, you've learned: +Caused by: + Process `FASTP (3)` terminated with an error exit status (255) -- **Regular expression patterns** for bioinformatics file name parsing -- **Reusable parsing functions** that return structured metadata -- **Process script logic** with conditional parameter selection -- **File collection transformation** into command-line arguments using `.collect{}` and `.join()` -- **Command building patterns** based on file types and metadata -These string processing techniques form the foundation for handling complex data pipelines that need to adapt to different input formats and generate appropriate commands for bioinformatics tools. +Command executed: -With our pipeline now capable of extracting rich metadata from both CSV files and file names, we can make intelligent decisions about how to process different samples. Let's add conditional logic to route samples through appropriate analysis strategies. + fastp \ + --in1 SAMPLE_003_S3_L001_R1_001.fastq \ + --in2 null \ + --out1 sample_003_trimmed_R1.fastq.gz \ + --out2 sample_003_trimmed_R2.fastq.gz \ + --json sample_003.fastp.json \ + --html sample_003.fastp.html \ + --thread 2 ---- +Command exit status: + 255 -## 3. Conditional Logic and Process Control +Command output: + (empty) +``` -### 3.1. Strategy Selection Based on Sample Characteristics +You can see that the process is trying to run `fastp` with a `null` value for the second input file, which is causing it to fail. This is because our dataset contains single-end reads, but the process is hardcoded to expect paired-end reads (two input files at a time). -Now that our pipeline can extract comprehensive sample metadata, we can use this information to automatically select the most appropriate analysis strategy for each sample. Different organisms, sequencing depths, and quality scores require different processing approaches. +Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` process to handle both single-end and paired-end reads dynamically. We'll use an if/else statement to check how many read files are are present and adjust the command accordingly. 
=== "After" - ```groovy title="main.nf" linenums="175" hl_lines="1-40" - - // Dynamic process selection based on sample characteristics - def selectAnalysisStrategy(Map sample_meta) { - def strategy = [:] - - // Sequencing depth determines processing approach - if (sample_meta.depth < 10_000_000) { - strategy.approach = 'low_depth' - strategy.processes = ['quality_check', 'simple_alignment'] - strategy.sensitivity = 'high' - } else if (sample_meta.depth < 50_000_000) { - strategy.approach = 'standard' - strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling'] - strategy.sensitivity = 'standard' - } else { - strategy.approach = 'high_depth' - strategy.processes = ['quality_check', 'trimming', 'alignment', 'variant_calling', 'structural_variants'] - strategy.sensitivity = 'sensitive' - } - - // Organism-specific adjustments - switch(sample_meta.organism) { - case 'human': - strategy.reference = 'GRCh38' - strategy.known_variants = 'dbSNP' - break - case 'mouse': - strategy.reference = 'GRCm39' - strategy.known_variants = 'mgp_variants' - break - default: - strategy.reference = 'custom' - strategy.known_variants = null - } - - // Quality-based modifications - if (sample_meta.quality < 30) { - strategy.extra_qc = true - strategy.processes = ['extensive_qc'] + strategy.processes - } - - return strategy - } + ```groovy title="main.nf" linenums="11" hl_lines="3,5,15" + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true - // Test strategy selection - ch_samples - .map { meta, file -> - def strategy = selectAnalysisStrategy(meta) - println "\nSample: ${meta.id}" - println " Strategy: ${strategy.approach}" - println " Processes: ${strategy.processes.join(' -> ')}" - println " Reference: ${strategy.reference}" - println " Extra QC: ${strategy.extra_qc ?: false}" - - return [meta + strategy, file] + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + fastp \\ + --in1 ${input_file} \\ + --out1 ${meta.id}_trimmed.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } else { + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ } - .view { meta, file -> "Ready for processing: ${meta.id} (${meta.approach})" } ``` === "Before" - ```groovy title="main.nf" linenums="175" + ```groovy title="main.nf" linenums="11" + script: + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } ``` -This demonstrates several Groovy patterns commonly used in Nextflow workflows: - -- **Numeric literals** with underscores for readability (`10_000_000`) - underscores can be used in numbers to improve readability -- **Switch statements** for multi-way branching - cleaner than multiple if/else statements -- **List concatenation** with `+` operator - combines two lists into one -- **Elvis operator** `?:` for null handling - provides a default value if the left side is null or false -- **Map merging** to combine metadata with strategy - the `+` operator merges two maps, with the right map taking precedence - -### 3.2. 
Conditional Process Execution +Now the workflow can handle both single-end and paired-end reads gracefully. The Groovy logic checks the number of input files and constructs the appropriate command for `fastp`. Let's see if it works: -In Nextflow, you control which processes run for which samples using `when` conditions and channel routing: - -=== "After" - - ```groovy title="main.nf" linenums="225" hl_lines="1-60" +```bash title="Test dynamic fastp" +nextflow run main.nf +``` - // Different processes for different strategies - process BASIC_QC { - input: - tuple val(meta), path(reads) +```console title="Successful run" + N E X T F L O W ~ version 25.04.6 - output: - tuple val(meta), path("${meta.id}_basic_qc.html") +Launching `main.nf` [adoring_rosalind] DSL2 - revision: 04b1cd93e9 - when: - meta.approach == 'low_depth' +executor > local (3) +[31/a8ad4d] process > FASTP (3) [100%] 3 of 3 ✔ +``` - script: - """ - fastqc --quiet ${reads} -o ./ - mv *_fastqc.html ${meta.id}_basic_qc.html - """ - } +Looks good! If we check the actual commands that were run (customise for your task hash): - process COMPREHENSIVE_QC { - input: - tuple val(meta), path(reads) +```console title="Check commands executed" +cat work/31/a8ad4d95749e685a6d842d3007957f/.command.sh +``` - output: - tuple val(meta), path("${meta.id}_comprehensive_qc.html") +We can see that Nextflow correctly picked the right command for single-end reads: - when: - meta.approach in ['standard', 'high_depth'] +```bash title=".command.sh" +#!/bin/bash -ue +fastp \ + --in1 SAMPLE_003_S3_L001_R1_001.fastq \ + --out1 sample_003_trimmed.fastq.gz \ + --json sample_003.fastp.json \ + --html sample_003.fastp.html \ + --thread 2 +``` - script: - def sensitivity = meta.sensitivity == 'high' ? '--strict' : '' - """ - fastqc ${sensitivity} ${reads} -o ./ - # Additional QC for comprehensive analysis - seqtk fqchk ${reads} > sequence_stats.txt - mv *_fastqc.html ${meta.id}_comprehensive_qc.html - """ - } +Another common one can be seen in [the Nextflow for Science Genomics module](../nf4science/genomics/02_joint_calling.md). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: + +```groovy title="command line manipulation for GATK" linenums="1" + script: + def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ') + """ + gatk GenomicsDBImport \ + ${gvcfs_line} \ + -L ${interval_list} \ + --genomicsdb-workspace-path ${cohort_name}_gdb + """ +``` - process SIMPLE_ALIGNMENT { - input: - tuple val(meta), path(reads) +These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. 
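The same collect-and-join idea also works for optional flags. Here's a small sketch of a script block that assembles extra arguments from the meta map — the specific fastp options shown are illustrative choices for this example, not settings our module currently uses:

```groovy title="assembling optional flags (sketch)"
    script:
    // Gather only the options that apply to this sample, then join them
    def opts = []
    if (meta.priority == 'high')  { opts << '--length_required 50' }
    if (meta.organism == 'human') { opts << '--trim_poly_g' }
    def extra_args = opts.join(' ')
    """
    fastp \\
        --in1 ${reads} \\
        --out1 ${meta.id}_trimmed.fastq.gz \\
        ${extra_args}
    """
```

If no options apply, `extra_args` is simply an empty string and the command runs unchanged.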
- output: - tuple val(meta), path("${meta.id}.bam") +### Takeaway - when: - meta.approach == 'low_depth' +In this section, you've learned: - script: - """ - minimap2 -ax sr ${meta.reference} ${reads} \\ - | samtools sort -o ${meta.id}.bam - - """ - } +- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions +- **Reusable closures**: Extracting complex logic into named closures that can be passed to channel operators, making workflows more readable and maintainable +- **Dynamic script generation**: Using Groovy conditional logic within process script blocks to adapt commands based on input characteristics (like single-end vs paired-end reads) +- **Command-line argument construction**: Transforming file collections into properly formatted command arguments using `collect()` and `join()` methods - process SENSITIVE_ALIGNMENT { - input: - tuple val(meta), path(reads) +These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. - output: - tuple val(meta), path("${meta.id}.bam") - when: - meta.approach in ['standard', 'high_depth'] +--- - script: - def params = meta.sensitivity == 'sensitive' ? '--very-sensitive' : '--sensitive' - """ - bowtie2 ${params} -x ${meta.reference} -U ${reads} \\ - | samtools sort -o ${meta.id}.bam - - """ - } +## 3. Conditional Logic and Process Control - // Workflow logic that routes to appropriate processes - workflow ANALYSIS_PIPELINE { - take: - samples_ch +Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. - main: - // All samples go through appropriate QC - basic_qc_results = BASIC_QC(samples_ch) - comprehensive_qc_results = COMPREHENSIVE_QC(samples_ch) +Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, including, many of which take closures as arguments, meanint their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. - // Combine QC results - qc_results = basic_qc_results.mix(comprehensive_qc_results) +For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point). 
- // All samples go through appropriate alignment - simple_alignment_results = SIMPLE_ALIGNMENT(samples_ch) - sensitive_alignment_results = SENSITIVE_ALIGNMENT(samples_ch) +Add a new process for Trimgalore in `modules/trimgalore.nf`: - // Combine alignment results - alignment_results = simple_alignment_results.mix(sensitive_alignment_results) +=== "After" - emit: - qc = qc_results - alignments = alignment_results - } + ```groovy title="main.nf" linenums="1" hl_lines="2" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' ``` === "Before" - ```groovy title="main.nf" linenums="225" + ```groovy title="main.nf" linenums="1" + include { FASTP } from './modules/fastp.nf' ``` -This shows realistic Nextflow patterns: - -- **Separate processes** for different strategies rather than dynamic generation -- **When conditions** to control which processes run for which samples -- **Mix operator** to combine results from different conditional processes -- **Process parameterization** using metadata in script blocks - -### 3.3. Channel-based Workflow Routing - -The realistic way to handle conditional workflow assembly is through channel routing and filtering: +... and then modify your `main.nf` workflow to branch samples based on their metadata and route them through the appropriate trimming process, like this: === "After" - ```groovy title="main.nf" linenums="285" hl_lines="1-50" + ```groovy title="branched workflow" linenums="28" hl_lines="5-12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) - workflow { - // Read and enrich sample data with strategy - ch_samples = Channel.fromPath(params.input) - .splitCsv(header: true) - .map { row -> - def meta = [ - id: row.sample_id, - organism: row.organism, - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - - // Add strategy information using our selectAnalysisStrategy function - def strategy = selectAnalysisStrategy(meta) - - return [meta + strategy, file(row.file_path)] - } - - // Split channel based on strategy requirements - ch_samples - .branch { meta, reads -> - low_depth: meta.approach == 'low_depth' - return [meta, reads] - standard: meta.approach == 'standard' - return [meta, reads] - high_depth: meta.approach == 'high_depth' - return [meta, reads] - } - .set { samples_by_strategy } - - // Route each strategy through appropriate processes - ANALYSIS_PIPELINE(samples_by_strategy.low_depth) - ANALYSIS_PIPELINE(samples_by_strategy.standard) - ANALYSIS_PIPELINE(samples_by_strategy.high_depth) - - // For high-depth samples, also run structural variant calling - high_depth_alignments = ANALYSIS_PIPELINE.out.alignments - .filter { meta, bam -> meta.approach == 'high_depth' } - - STRUCTURAL_VARIANTS(high_depth_alignments) + trim_branches = ch_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } - // Collect and organize all results - all_qc = ANALYSIS_PIPELINE.out.qc.collect() - all_alignments = ANALYSIS_PIPELINE.out.alignments.collect() + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) - // Generate summary report based on what was actually run - all_alignments - .map { alignments -> - def strategies = alignments.collect { meta, bam -> meta.approach }.unique() - def total_samples = alignments.size() +=== "Before" - println "Pipeline Summary:" - println " Total samples processed: ${total_samples}" - println " 
Strategies used: ${strategies.join(', ')}" + ```groovy title="branched workflow" linenums="28" hl_lines="5-12" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) - strategies.each { strategy -> - def count = alignments.count { meta, bam -> meta.approach == strategy } - println " ${strategy}: ${count} samples" - } - } - .view() - } + ch_fastp = FASTP(ch_samples) - // Additional process for high-depth samples - process STRUCTURAL_VARIANTS { - input: - tuple val(meta), path(bam) - output: - tuple val(meta), path("${meta.id}.vcf") +Run this modified workflow: - script: - """ - delly call -g ${meta.reference} ${bam} -o ${meta.id}.vcf - """ - } - ``` +```bash title="Test conditional trimming" +nextflow run main.nf +``` -=== "Before" +```console title="Conditional trimming results" + N E X T F L O W ~ version 25.04.6 - ```groovy title="main.nf" linenums="285" - ``` +Launching `main.nf` [boring_koch] DSL2 - revision: 68a6bc7bd8 -Key Nextflow patterns demonstrated: +executor > local (3) +[3d/bb1e90] process > FASTP (2) [100%] 2 of 2 ✔ +[4c/455334] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +``` -- **Channel branching** with `.branch{}` to split samples by strategy -- **Conditional process execution** using `when:` directives and filtering -- **Channel routing** to send different samples through different processes -- **Result collection** and summary generation -- **Process reuse** - the same workflow processes different sample types +Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. Combined with other closure-taking operators such as `.filter{}`, this allows us to build complex conditional workflows that adapt to the data itself. ### Takeaway -In this section, you've learned: - -- **Strategy selection** using Groovy conditional logic -- **Process control** with `when` conditions and channel routing -- **Workflow branching** using channel operators like `.branch()` and `.filter()` -- **Metadata enrichment** to drive process selection - -These patterns help you write workflows that process different sample types appropriately while keeping your code organized and maintainable. +In this section, you've learned to use Groovy logic to control workflow execution with using the closure interfaces of Nextflow opearators. Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's add validation and error handling to make our pipeline robust enough for real-world use. @@ -1423,7 +1315,7 @@ These patterns make your code more resilient to missing data and easier to read, ## 6. Advanced Closures and Functional Programming -Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. Instead of copy-pasting similar closures across different channel operations, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable. +Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. 
Instead of copy-pasting similar closures across different processes or workflows, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable. ### 6.1. Named Closures for Reusability @@ -1810,7 +1702,7 @@ Here's how we progressively enhanced our pipeline: ### Key Benefits -- **Clearer code**: Understanding when to use Nextflow vs Groovy helps you write more organized workflows +- **Clearer code**: Understanding when to use Nextflow and Groovy helps you write more organized workflows - **Better error handling**: Basic validation and try-catch patterns help your workflows handle problems gracefully - **Flexible processing**: Conditional logic lets your workflows process different sample types appropriately - **Configuration management**: Using defaults and simple validation makes your workflows easier to use diff --git a/side-quests/groovy_essentials/modules/fastp.nf b/side-quests/groovy_essentials/modules/fastp.nf new file mode 100644 index 0000000000..f74e5e772a --- /dev/null +++ b/side-quests/groovy_essentials/modules/fastp.nf @@ -0,0 +1,37 @@ +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta.id), path("*_trimmed*.fastq.gz"), emit: reads + path "*.{json,html}" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + fastp \\ + --in1 ${input_file} \\ + --out1 ${meta.id}_trimmed.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } else { + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } +} diff --git a/side-quests/groovy_essentials/modules/fastp.nf.bak b/side-quests/groovy_essentials/modules/fastp.nf.bak new file mode 100644 index 0000000000..827372b21a --- /dev/null +++ b/side-quests/groovy_essentials/modules/fastp.nf.bak @@ -0,0 +1,22 @@ +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + + output: + tuple val(sample_id), path("*_trimmed*.fastq.gz"), emit: reads + path "*.{json,html}" , emit: reports + + script: + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ +} diff --git a/side-quests/groovy_essentials/modules/trimgalore.nf b/side-quests/groovy_essentials/modules/trimgalore.nf new file mode 100644 index 0000000000..945fef640c --- /dev/null +++ b/side-quests/groovy_essentials/modules/trimgalore.nf @@ -0,0 +1,37 @@ +process TRIMGALORE { + container 'quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_0' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fq"), emit: reads + path "*_trimming_report.txt" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? 
reads[0] : reads + """ + trim_galore \\ + --cores $task.cpus \\ + ${input_file} + + # Rename output to match expected pattern + mv *_trimmed.fq ${meta.id}_trimmed.fq + """ + } else { + """ + trim_galore \\ + --paired \\ + --cores $task.cpus \\ + ${reads[0]} ${reads[1]} + + # Rename outputs to match expected pattern + mv *_val_1.fq ${meta.id}_trimmed_R1.fq + mv *_val_2.fq ${meta.id}_trimmed_R2.fq + """ + } +} From 6be8aa5459e2f7dd1e44e9fcfcfca170fc0e3fa0 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 13:32:50 +0100 Subject: [PATCH 11/48] State after a good chat wiht Claude code --- docs/side_quests/groovy_essentials.md | 1547 +++++++++-------- side-quests/groovy_essentials/collect.nf | 29 +- .../groovy_essentials/data/samples.csv | 6 +- ....fastq => SAMPLE_001_S1_L001_R1_001.fastq} | 6 +- ....fastq => SAMPLE_002_S2_L001_R1_001.fastq} | 6 +- ....fastq => SAMPLE_003_S3_L001_R1_001.fastq} | 6 +- .../modules/generate_report.nf | 16 + side-quests/groovy_essentials/nextflow.config | 38 +- .../solutions/groovy_essentials/collect.nf | 18 + .../solutions/groovy_essentials/main.nf | 64 + .../groovy_essentials/modules/fastp.nf | 37 + .../modules/generate_report.nf | 24 + .../groovy_essentials/modules/trimgalore.nf | 37 + .../groovy_essentials/nextflow.config | 36 + 14 files changed, 1087 insertions(+), 783 deletions(-) rename side-quests/groovy_essentials/data/sequences/{sample_001.fastq => SAMPLE_001_S1_L001_R1_001.fastq} (59%) rename side-quests/groovy_essentials/data/sequences/{sample_002.fastq => SAMPLE_002_S2_L001_R1_001.fastq} (59%) rename side-quests/groovy_essentials/data/sequences/{sample_003.fastq => SAMPLE_003_S3_L001_R1_001.fastq} (59%) create mode 100644 side-quests/groovy_essentials/modules/generate_report.nf create mode 100644 side-quests/solutions/groovy_essentials/collect.nf create mode 100644 side-quests/solutions/groovy_essentials/main.nf create mode 100644 side-quests/solutions/groovy_essentials/modules/fastp.nf create mode 100644 side-quests/solutions/groovy_essentials/modules/generate_report.nf create mode 100644 side-quests/solutions/groovy_essentials/modules/trimgalore.nf create mode 100644 side-quests/solutions/groovy_essentials/nextflow.config diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 11d9540e47..7ea9e25c39 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -15,14 +15,15 @@ Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. -This side quest will bridge that gap by taking you on a hands-on journey from basic concepts to production-ready mastery. We'll transform a simple CSV-reading workflow into a sophisticated, production-ready bioinformatics pipeline that can handle any dataset thrown at it. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: +This side quest will bridge that gap by taking you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline that handles real-world complexity. 
Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: -- **Messy data?** We'll add robust parsing and null-safe operators, learning to distinguish between Nextflow and Groovy constructs -- **Complex file naming schemes?** We'll master regex patterns and string manipulation for bioinformatics file names -- **Need intelligent sample routing?** We'll implement conditional logic and strategy selection, transforming file collections into command-line arguments -- **Worried about failures?** We'll add comprehensive error handling and validation patterns -- **Code getting repetitive?** We'll learn functional programming with closures and composition, mastering essential Groovy operators like safe navigation and Elvis -- **Processing thousands of samples?** We'll leverage powerful collection operations for file path manipulations +- **Understanding boundaries:** Distinguish between Nextflow operators and Groovy methods, and master when to use each +- **Data manipulation:** Extract, transform, and subset maps and collections using Groovy's powerful operators +- **String processing:** Parse complex file naming schemes with regex patterns and master variable interpolation +- **Dynamic logic:** Build processes that adapt to different input types and use closures for dynamic resource allocation +- **Conditional routing:** Intelligently route samples through different processes based on their metadata characteristics +- **Safe operations:** Handle missing data gracefully with null-safe operators and validate inputs with clear error messages +- **Reusable code:** Create maintainable workflows with functions and configuration-based event handlers --- @@ -56,9 +57,9 @@ You'll find a `data` directory with sample files and a main workflow file that w │ │ └── analysis_parameters.yaml │ ├── samples.csv │ └── sequences -│ ├── sample_001.fastq -│ ├── sample_002.fastq -│ └── sample_003.fastq +│ ├── SAMPLE_001_S1_L001_R1_001.fastq +│ ├── SAMPLE_002_S2_L001_R1_001.fastq +│ └── SAMPLE_003_S3_L001_R1_001.fastq ├── main.nf ├── nextflow.config ├── README.md @@ -72,9 +73,9 @@ Our sample CSV contains information about biological samples that need different ```console title="samples.csv" sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score -SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5 -SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2 -SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1 +SAMPLE_001,human,liver,30000000,data/sequences/SAMPLE_001_S1_L001_R1_001.fastq,38.5 +SAMPLE_002,mouse,brain,25000000,data/sequences/SAMPLE_002_S2_L001_R1_001.fastq,35.2 +SAMPLE_003,human,kidney,45000000,data/sequences/SAMPLE_003_S3_L001_R1_001.fastq,42.1 ``` We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows. 
@@ -109,9 +110,9 @@ nextflow run main.nf You should see output like: ```console title="Raw CSV data" -[id:sample_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/sample_001.fastq, quality_score:38.5] -[id:sample_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/sample_002.fastq, quality_score:35.2] -[id:sample_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/sample_003.fastq, quality_score:42.1] +[sample_id:SAMPLE_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/SAMPLE_001_S1_L001_R1_001.fastq, quality_score:38.5] +[sample_id:SAMPLE_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/SAMPLE_002_S2_L001_R1_001.fastq, quality_score:35.2] +[sample_id:SAMPLE_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/SAMPLE_003_S3_L001_R1_001.fastq, quality_score:42.1] ``` #### Step 2: Adding the Map Operator @@ -127,7 +128,7 @@ Here's what that map operation looks like: === "After" ```groovy title="main.nf" linenums="2" hl_lines="3-6" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> return row @@ -138,7 +139,7 @@ Here's what that map operation looks like: === "Before" ```groovy title="main.nf" linenums="2" hl_lines="3" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .view() ``` @@ -162,7 +163,7 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro === "After" ```groovy title="main.nf" linenums="2" hl_lines="4-12" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> // This is all Groovy code now! @@ -181,7 +182,7 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro === "Before" ```groovy title="main.nf" linenums="2" hl_lines="4" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> return row @@ -216,7 +217,7 @@ Make the following change: === "After" ```groovy title="main.nf" linenums="2" hl_lines="11-12" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -235,7 +236,7 @@ Make the following change: === "Before" ```groovy title="main.nf" linenums="2" hl_lines="11" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -274,6 +275,78 @@ You should see output like: We've successfully added conditional logic to enrich our metadata with a priority level based on quality scores. +#### Step 4.5: Subsetting Maps with `.subMap()` + +While the `+` operator adds keys to a map, sometimes you need to do the opposite - extract only specific keys. Groovy's `.subMap()` method is perfect for this. 
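In isolation (a quick sketch with made-up values), it behaves like this:

```groovy title="subMap sketch"
def meta = [id: 'sample_001', organism: 'human', tissue: 'liver', quality: 38.5]

// Keep only the keys we name; keys that don't exist are simply left out
println meta.subMap(['id', 'organism'])       // [id:sample_001, organism:human]
println meta.subMap(['id', 'missing_key'])    // [id:sample_001]
```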
+ +Let's add a line to create a simplified version of our metadata that only contains identification fields: + +=== "After" + + ```groovy title="main.nf" linenums="2" hl_lines="12-13" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + def id_only = sample_meta.subMap(['id', 'organism', 'tissue']) + println "Full metadata: ${sample_meta + [priority: priority]}" + println "ID fields only: ${id_only}" + return [sample_meta + [priority: priority], file(row.file_path)] + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" hl_lines="11" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return sample_meta + [priority: priority] + } + .view() + ``` + +Run the modified workflow: + +```bash title="Test subMap" +nextflow run main.nf +``` + +You should see output showing both the full metadata and the extracted subset: + +```console title="SubMap results" +Full metadata: [id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] +ID fields only: [id:sample_001, organism:human, tissue:liver] +``` + +The `.subMap()` method takes a list of keys and returns a new map containing only those keys. If a key doesn't exist in the original map, it's simply not included in the result. + +This is particularly useful when you need to create different metadata versions for different processes - some might need full metadata while others need only minimal identification fields. + +Now remove those println statements to restore your workflow to its previous state, as we don't need them going forward. + +!!! tip "Map Operations Summary" + + - **Add keys**: `map1 + [new_key: value]` - Creates new map with additional keys + - **Extract keys**: `map1.subMap(['key1', 'key2'])` - Creates new map with only specified keys + - **Both operations create new maps** - Original maps remain unchanged + #### Step 5: Combining Maps and Returning Results So far, we've only been returning what Nextflow community calls the 'meta map', and we've been ignoring the files those metadata relate to. But if you're writing Nextflow workflows, you probably want to do something with those files. 
@@ -302,7 +375,7 @@ Let's output a channel structure comprising a tuple of 2 elements: the enriched === "Before" ```groovy title="main.nf" linenums="2" hl_lines="12" - ch_samples = Channel.fromPath("./data/samplesheet.csv") + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -326,9 +399,9 @@ nextflow run main.nf You should see output like: ```console title="Complete workflow output" -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_002.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/sample_003.fastq] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` This `[meta, file]` tuple structure is a common pattern in Nextflow for passing both metadata and associated files to processes. @@ -430,6 +503,94 @@ This time, have NOT changed the structure of the data, we still have 3 items in `collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. +### 1.3. The Spread Operator (`*.`) - Shorthand for Property Extraction + +Related to Groovy's `collect` is the spread operator (`*.`), which provides a concise way to extract properties from collections. It's essentially syntactic sugar for a common `collect` pattern. 
+ +Let's add a demonstration to our `collect.nf` file: + +=== "After" + + ```groovy title="collect.nf" linenums="1" hl_lines="15-18" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + + // Spread operator - concise property access + def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] + def all_ids = sample_data*.id + println "Spread operator result: ${all_ids}" + ``` + +=== "Before" + + ```groovy title="collect.nf" linenums="1" + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + ``` + +Run the updated workflow: + +```bash title="Test spread operator" +nextflow run collect.nf +``` + +You should see output like: + +```console title="Spread operator output" hl_lines="6" + N E X T F L O W ~ version 25.04.6 + +Launching `collect.nf` [cranky_galileo] DSL2 - revision: 5f3c8b2a91 + +Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +Spread operator result: [s1, s2, s3] +Individual channel item: sample_001 +Individual channel item: sample_002 +Individual channel item: sample_003 +Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +``` + +The spread operator `*.` is a shorthand for a common collect pattern: + +```groovy +// These are equivalent: +def ids = samples*.id +def ids = samples.collect { it.id } + +// Also works with method calls: +def names = files*.getName() +def names = files.collect { it.getName() } +``` + +The spread operator is particularly useful when you need to extract a single property from a list of objects - it's more readable than writing out the full `collect` closure. + +!!! 
tip "When to Use Spread vs Collect" + + - **Use spread (`*.`)** for simple property access: `samples*.id`, `files*.name` + - **Use collect** for transformations: `samples.collect { it.id.toUpperCase() }` + - **Use collect** for complex logic: `samples.collect { [it.id, it.quality > 40] }` + ### Takeaway In this section, you've learned: @@ -522,81 +683,44 @@ You should see output with metadata enriched from the file names, like Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq.gz] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq.gz] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq.gz] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` -### 2.2. Creating Reusable closures +### 2.2. Creating Reusable functions -You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions or closures. +You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions. -To do that we simply define a closure using the assignment operator `=` and the `{}` syntax, within the `workflow{}`. Then we can call that closure by name inside our map operation using standard function call syntax (not the curly braces). +To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`. Make that change like so: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="3-23,27" - workflow { - - separateMetadata = { row -> - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? 
[ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } + ```groovy title="main.nf" linenums="1" hl_lines="1-3,7" + def separateMetadata(row) { + // ... all the metadata processing logic ... + } + workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) - .view() + .map(separateMetadata) + .view() } ``` === "Before" - ```groovy title="main.nf" linenums="1" hl_lines="4-24" + ```groovy title="main.nf" linenums="1" workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map { row -> - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } - .view() + .map { row -> + // ... all the inline metadata processing logic ... + } + .view() } ``` @@ -605,11 +729,11 @@ By doing this we've reduced the actual workflow logic down to something really t ```groovy title="minimal workflow" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) + .map{row -> separateMetadata(row)} .view() ``` -... which makes the logic much easier to read and understand at a glance. The closure `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. +... which makes the logic much easier to read and understand at a glance. The function `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. 
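Because `separateMetadata` is now an ordinary function rather than an inline closure, you can also exercise it on its own while debugging. As a quick, hypothetical spot check (the map below simply mimics one row of `samples.csv` and is not part of the pipeline), you could temporarily add:

```groovy title="hypothetical spot check"
// Fake CSV row used only to sanity-check separateMetadata()
def test_row = [
    sample_id       : 'SAMPLE_999',
    organism        : 'human',
    tissue_type     : 'liver',
    sequencing_depth: '30000000',
    file_path       : 'data/sequences/sample_001.fastq',
    quality_score   : '38.5'
]
println separateMetadata(test_row)   // prints the [metadata map, fastq path] tuple
```

Remove the spot check once you're happy with the output; the workflow itself stays as minimal as shown above.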
You can run that to make sure it still works: @@ -622,9 +746,9 @@ nextflow run main.nf Launching `main.nf` [tender_archimedes] DSL2 - revision: 8bfb9b2485 -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq.gz] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq.gz] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq.gz] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` ### 2.3. Dynamic Script Logic in Processes @@ -659,34 +783,19 @@ process FASTP { The process takes FASTQ files as input and runs the `fastp` tool to trim adapters and filter low-quality reads. Unfortunately, the person who wrote this process didn't allow for the single-end reads we have in our example dataset. Let's add it to our workflow and see what happens: -=== "After" +First, include the module at the very first line of your `main.nf` workflow: - ```groovy title="main.nf" linenums="1" hl_lines="1,31" - include { FASTP } from './modules/fastp.nf' +```groovy title="main.nf" linenums="1" +include { FASTP } from './modules/fastp.nf' +``` - workflow { +Then modify the `workflow` block to connect the `ch_samples` channel to the `FASTP` process: - separateMetadata = { row -> - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] +=== "After" - def priority = sample_meta.quality > 40 ? 
'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } + ```groovy title="main.nf" linenums="30" hl_lines="6" + workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) @@ -698,35 +807,12 @@ The process takes FASTQ files as input and runs the `fastp` tool to trim adapter === "Before" - ```groovy title="main.nf" linenums="1" hl_lines="4-24" + ```groovy title="main.nf" linenums="30" hl_lines="6" workflow { - separateMetadata = { row -> - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } - ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) - .view() } ``` @@ -863,820 +949,848 @@ Another common one can be seen in [the Nextflow for Science Genomics module](../ These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. -### Takeaway - -In this section, you've learned: +### 2.4. Variable Interpolation: Groovy, Bash, and Shell Variables -- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions -- **Reusable closures**: Extracting complex logic into named closures that can be passed to channel operators, making workflows more readable and maintainable -- **Dynamic script generation**: Using Groovy conditional logic within process script blocks to adapt commands based on input characteristics (like single-end vs paired-end reads) -- **Command-line argument construction**: Transforming file collections into properly formatted command arguments using `collect()` and `join()` methods +When writing process scripts, you're actually working with three different types of variables, and using the wrong syntax is a common source of errors. Let's add a process that creates a processing report to demonstrate the differences. -These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. +Create a new process file `modules/generate_report.nf`: +```groovy title="modules/generate_report.nf" linenums="1" +process GENERATE_REPORT { ---- + publishDir 'results/reports', mode: 'copy' -## 3. Conditional Logic and Process Control + input: + tuple val(meta), path(reads) -Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. 
+ output: + path "${meta.id}_report.txt" -Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, including, many of which take closures as arguments, meanint their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ +} +``` -For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point). +This process writes a simple report with the sample ID and filename. Now let's run it to see what happens when we need to mix different types of variables. -Add a new process for Trimgalore in `modules/trimgalore.nf`: +Include the process in your `main.nf` and add it to the workflow: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="2" + ```groovy title="main.nf" linenums="1" hl_lines="3 11" include { FASTP } from './modules/fastp.nf' include { TRIMGALORE } from './modules/trimgalore.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + // ... separateMetadata function ... + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + + GENERATE_REPORT(ch_samples) + + // ... rest of workflow ... + } ``` === "Before" ```groovy title="main.nf" linenums="1" include { FASTP } from './modules/fastp.nf' - ``` + include { TRIMGALORE } from './modules/trimgalore.nf' -... and then modify your `main.nf` workflow to branch samples based on their metadata and route them through the appropriate trimming process, like this: + // ... separateMetadata function ... -=== "After" + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) - ```groovy title="branched workflow" linenums="28" hl_lines="5-12" - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) + // ... rest of workflow ... + } + ``` - trim_branches = ch_samples - .branch { meta, reads -> - fastp: meta.organism == 'human' && meta.depth >= 30000000 - trimgalore: true - } +Now run the workflow and check the generated reports in `results/reports/`. They should contain basic information about each sample. - ch_fastp = FASTP(trim_branches.fastp) - ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) +But what if we want to add information about when and where the processing occurred? 
Let's modify the process to include shell environment variables: + +=== "After" + + ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ + ``` === "Before" - ```groovy title="branched workflow" linenums="28" hl_lines="5-12" - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) + ```groovy title="modules/generate_report.nf" linenums="10" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ + ``` - ch_fastp = FASTP(ch_samples) +If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `${USER}` as a Groovy variable that doesn't exist! We need to escape it so Bash can handle it instead. +Fix this by escaping the shell variables: -Run this modified workflow: +=== "After - Fixed" -```bash title="Test conditional trimming" -nextflow run main.nf -``` + ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: \${USER}" >> ${meta.id}_report.txt + echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt + echo "Date: \$(date)" >> ${meta.id}_report.txt + """ + ``` -```console title="Conditional trimming results" - N E X T F L O W ~ version 25.04.6 +=== "Before - Broken" -Launching `main.nf` [boring_koch] DSL2 - revision: 68a6bc7bd8 + ```groovy title="modules/generate_report.nf" linenums="10" + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ + ``` -executor > local (3) -[3d/bb1e90] process > FASTP (2) [100%] 2 of 2 ✔ -[4c/455334] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +Now it works! The backslash (`\`) tells Nextflow "don't interpret this, pass it through to Bash." + +!!! note "Three Types of Variables in Process Scripts" + + - **Nextflow/Groovy variables**: Use `${variable}` - evaluated before the script runs + - **Shell environment variables**: Use `\${variable}` - passed through to Bash + - **Shell command substitution**: Use `\$(command)` - executed by Bash + +Let's add one more feature - a Groovy variable for the report type: + +```groovy title="modules/generate_report.nf - Complete" linenums="10" +script: +def report_type = meta.priority == 'high' ? 'PRIORITY' : 'STANDARD' +""" +echo "=== ${report_type} SAMPLE REPORT ===" > ${meta.id}_report.txt +echo "Processing ${reads}" >> ${meta.id}_report.txt +echo "Sample: ${meta.id}" >> ${meta.id}_report.txt +echo "Quality: ${meta.quality}" >> ${meta.id}_report.txt +echo "Priority: ${meta.priority}" >> ${meta.id}_report.txt +echo "---" >> ${meta.id}_report.txt +echo "Processed by: \${USER}" >> ${meta.id}_report.txt +echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt +echo "Date: \$(date)" >> ${meta.id}_report.txt +""" ``` -Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. 
Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. Combined with other closure-taking operators such as `.filter{}`, this allows us to build complex conditional workflows that adapt to the data itself. +Now you can see all three types together: +- `${report_type}`, `${meta.id}`, `${meta.quality}`: Groovy variables (no backslash) +- `\${USER}`: Shell environment variable (backslash) +- `\$(hostname)`, `\$(date)`: Shell command substitution (backslash) -### Takeaway +Run the workflow again and check the reports - high-priority samples will have "PRIORITY" in their header! -In this section, you've learned to use Groovy logic to control workflow execution with using the closure interfaces of Nextflow opearators. +### Takeaway -Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's add validation and error handling to make our pipeline robust enough for real-world use. +In this section, you've learned: ---- - -## 4. Error Handling and Validation Patterns +- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions +- **Reusable functions**: Extracting complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable +- **Dynamic script generation**: Using Groovy conditional logic within process script blocks to adapt commands based on input characteristics (like single-end vs paired-end reads) +- **Command-line argument construction**: Transforming file collections into properly formatted command arguments using `collect()` and `join()` methods +- **Variable interpolation**: Understanding the difference between Nextflow/Groovy variables (`${var}`), shell environment variables (`\${var}`), and shell command substitution (`\$(cmd)`) -### 4.1. Basic Input Validation +These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. -Before our pipeline processes samples through complex conditional logic, we should validate that the input data meets our requirements. Let's create validation functions that check sample metadata and provide useful error messages: -=== "After" +--- - ```groovy title="main.nf" linenums="330" hl_lines="1-25" +### 2.5. Dynamic Resource Directives with Closures - // Simple validation function - def validateSample(Map sample) { - def errors = [] +So far we've used Groovy in the `script` block of processes. But Groovy closures are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. - // Check required fields - if (!sample.sample_id) { - errors << "Missing sample_id" - } +Currently, our FASTP process uses default resources. 
Let's make it smarter by allocating more CPUs for high-depth samples: - if (!sample.organism) { - errors << "Missing organism" - } +=== "After" - // Validate organism - def valid_organisms = ['human', 'mouse', 'rat'] - if (sample.organism && !valid_organisms.contains(sample.organism.toLowerCase())) { - errors << "Invalid organism: ${sample.organism}" - } + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="4-5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - // Check sequencing depth is numeric - if (sample.sequencing_depth) { - try { - def depth = sample.sequencing_depth as Integer - if (depth < 1000000) { - errors << "Sequencing depth too low: ${depth}" - } - } catch (NumberFormatException e) { - errors << "Invalid sequencing depth: ${sample.sequencing_depth}" - } - } + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '1 GB' - return errors - } + input: + tuple val(meta), path(reads) - // Test validation - def test_samples = [ - [sample_id: 'SAMPLE_001', organism: 'human', sequencing_depth: '30000000'], - [sample_id: '', organism: 'alien', sequencing_depth: 'invalid'], - [sample_id: 'SAMPLE_003', organism: 'mouse', sequencing_depth: '500000'] - ] - - test_samples.each { sample -> - def errors = validateSample(sample) - if (errors) { - println "Sample ${sample.sample_id}: ${errors.join(', ')}" - } else { - println "Sample ${sample.sample_id}: Valid" - } - } + // ... rest of process ... ``` === "Before" - ```groovy title="main.nf" linenums="330" - ``` - -### 4.2. Try-Catch Error Handling - -Let's implement simple try-catch patterns for handling errors: - -=== "After" - - ```groovy title="main.nf" linenums="370" hl_lines="1-25" - - // Process sample with error handling - def processSample(Map sample) { - try { - // Validate first - def errors = validateSample(sample) - if (errors) { - throw new RuntimeException("Validation failed: ${errors.join(', ')}") - } + ```groovy title="modules/fastp.nf" linenums="1" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - // Simulate processing - def result = [ - id: sample.sample_id, - organism: sample.organism, - processed: true - ] - - println "✓ Successfully processed ${sample.sample_id}" - return result - - } catch (Exception e) { - println "✗ Error processing ${sample.sample_id}: ${e.message}" - - // Return partial result - return [ - id: sample.sample_id ?: 'unknown', - organism: sample.organism ?: 'unknown', - processed: false, - error: e.message - ] - } - } + input: + tuple val(meta), path(reads) - // Test error handling - test_samples.each { sample -> - def result = processSample(sample) - println "Result for ${result.id}: processed = ${result.processed}" - } + // ... rest of process ... ``` -=== "Before" +The closure `{ meta.depth > 40000000 ? 4 : 2 }` is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. - ```groovy title="main.nf" linenums="370" - ``` +!!! note "Accessing Input Variables in Directives" -### 4.3. Setting Defaults and Validation + The closure can access any input variables (like `meta` here) because Nextflow evaluates these closures in the context of each task execution. -Let's create a simple function that provides defaults and validates configuration: +Another powerful pattern is using `task.attempt` for retry strategies. 
Let's add error retry with increasing resources: -=== "After" +```groovy title="modules/fastp.nf - With retry" linenums="1" hl_lines="4-6" +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - ```groovy title="main.nf" linenums="400" hl_lines="1-25" + errorStrategy 'retry' + maxRetries 2 + memory { 512.MB * task.attempt } - // Simple configuration with defaults - def getConfig(Map user_params) { - // Set defaults - def defaults = [ - quality_threshold: 30, - max_cpus: 4, - output_dir: './results' - ] + input: + tuple val(meta), path(reads) + + // ... rest of process ... +``` - // Merge user params with defaults - def config = defaults + user_params +Now if the process fails due to insufficient memory, Nextflow will retry with more memory: +- First attempt: 512 MB (task.attempt = 1) +- Second attempt: 1024 MB (task.attempt = 2) - // Simple validation - if (config.quality_threshold < 0 || config.quality_threshold > 40) { - println "Warning: Quality threshold ${config.quality_threshold} out of range, using default" - config.quality_threshold = defaults.quality_threshold - } +You can combine multiple factors: - if (config.max_cpus < 1) { - println "Warning: Invalid CPU count ${config.max_cpus}, using default" - config.max_cpus = defaults.max_cpus - } +```groovy title="Complex resource allocation" +process QUALITY_CONTROL { - return config + memory { + def base_mem = meta.depth > 30000000 ? 1.GB : 512.MB + base_mem * task.attempt } - // Test configuration - def test_configs = [ - [:], // Empty - should get defaults - [quality_threshold: 35, max_cpus: 8], // Valid values - [quality_threshold: -5, max_cpus: 0] // Invalid values - ] + cpus { + def base_cpus = meta.organism == 'human' ? 4 : 2 + Math.min(base_cpus, 8) // Cap at 8 CPUs for Codespaces + } - test_configs.each { user_config -> - def config = getConfig(user_config) - println "Input: ${user_config} -> Output: ${config}" + time { + meta.priority == 'high' ? '30.m' : '1.h' } - ``` -=== "Before" + // ... rest of process ... +} +``` - ```groovy title="main.nf" linenums="400" - ``` +This demonstrates several advanced patterns: +- Creating intermediate Groovy variables (`base_mem`, `base_cpus`) +- Using Groovy math functions (`Math.min`) to set limits +- Combining metadata with retry logic +- Using Nextflow's duration syntax (`30.m`, `1.h`) ### Takeaway -In this section, you've learned: +Dynamic directives with closures let you: +- Allocate resources based on input characteristics +- Implement automatic retry strategies with increasing resources +- Combine multiple factors (metadata, attempt number, priorities) +- Use Groovy logic for complex resource calculations + +This makes your workflows both more efficient (not over-allocating) and more robust (automatic retry with more resources). -- **Basic validation functions** that check required fields and data types -- **Try-catch error handling** for graceful failure handling -- **Configuration with defaults** using map merging and validation +--- -These patterns help you write workflows that handle invalid input gracefully and provide useful feedback to users. +## 3. Conditional Logic and Process Control -Before diving into advanced closures, let's master some essential Groovy language features that make code more concise and null-safe. These operators and patterns are used throughout production Nextflow workflows and will make your code more robust and readable. 
+Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. ---- +Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, including, many of which take closures as arguments, meanint their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. -## 5. Essential Groovy Operators and Patterns +For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point). -With our pipeline now handling complex conditional logic, we need to make it more robust against missing or malformed data. Bioinformatics workflows often deal with incomplete metadata, optional configuration parameters, and varying input formats. Let's enhance our pipeline with essential Groovy operators that handle these challenges gracefully. +Add a new process for Trimgalore in `modules/trimgalore.nf`: -### 5.1. Safe Navigation and Elvis Operators in Workflows +=== "After" -!!! note + ```groovy title="main.nf" linenums="1" hl_lines="2" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' + ``` - **Safe Navigation (`?.`) and Elvis (`?:`) Operators**: These are essential for null-safe programming. Safe navigation returns null instead of throwing an exception if the object is null, while the Elvis operator provides a default value if the left side is null, empty, or false. +=== "Before" -The safe navigation operator (`?.`) and Elvis operator (`?:`) are essential for null-safe programming when processing real-world biological data: + ```groovy title="main.nf" linenums="1" + include { FASTP } from './modules/fastp.nf' + ``` -- **Safe navigation (`?.`)** - returns null instead of throwing an exception if the object is null -- **Elvis operator (`?:`)** - provides a default value if the left side is null, empty, or false +... and then modify your `main.nf` workflow to branch samples based on their metadata and route them through the appropriate trimming process, like this: === "After" - ```groovy title="main.nf" linenums="320" hl_lines="1-25" - - workflow { - ch_samples = Channel.fromPath(params.input) + ```groovy title="main.nf" linenums="28" hl_lines="5-12" + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map { row -> - // Safe navigation prevents crashes on missing fields - def sample_id = row.sample_id?.toLowerCase() ?: 'unknown_sample' - def organism = row.organism?.toLowerCase() ?: 'unknown' - - // Elvis operator provides defaults - def quality = (row.quality_score as Double) ?: 30.0 - def depth = (row.sequencing_depth as Integer) ?: 1_000_000 - - // Chain operators for conditional defaults - def reference = row.reference ?: (organism == 'human' ? 'GRCh38' : 'custom') - - // Groovy Truth - empty strings and nulls are false - def priority = row.priority ?: (quality > 40 ? 
'high' : 'normal') - - return [ - id: sample_id, - organism: organism, - quality: quality, - depth: depth, - reference: reference, - priority: priority - ] - } - .view { meta -> - "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}" + .map(separateMetadata) + + trim_branches = ch_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true } - } + + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) ``` === "Before" - ```groovy title="main.nf" linenums="320" + ```groovy title="main.nf" linenums="28" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + + ch_fastp = FASTP(ch_samples) ``` -### 5.2. String Patterns and Multi-line Templates +Run this modified workflow: -Groovy provides powerful string features for parsing filenames and generating dynamic commands: +```bash title="Test conditional trimming" +nextflow run main.nf +``` -=== "After" +```console title="Conditional trimming results" + N E X T F L O W ~ version 25.04.6 - ```groovy title="main.nf" linenums="370" hl_lines="1-30" +Launching `main.nf` [boring_koch] DSL2 - revision: 68a6bc7bd8 - workflow { - // Demonstrate slashy strings for regex (no need to escape backslashes) - def parseFilename = { filename -> - // Slashy string - compare to regular string: "^(\\w+)_(\\w+)_(\\d+)\\.fastq$" - // Slashy strings don't require escaping backslashes, making regex patterns much cleaner - def pattern = /^(\w+)_(\w+)_(\d+)\.fastq$/ - def matcher = filename =~ pattern - - if (matcher) { - return [ - organism: matcher[0][1].toLowerCase(), - tissue: matcher[0][2].toLowerCase(), - sample_id: matcher[0][3] - ] - } else { - return [organism: 'unknown', tissue: 'unknown', sample_id: 'unknown'] - } - } +executor > local (3) +[3d/bb1e90] process > FASTP (2) [100%] 2 of 2 ✔ +[4c/455334] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +``` - // Multi-line strings with interpolation for command generation - def generateCommand = { meta -> - def depth_category = meta.depth > 10_000_000 ? 'high' : 'standard' - def db_path = meta.organism == 'human' ? '/db/human' : '/db/other' +Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. - // Multi-line string with variable interpolation - """ - echo "Processing ${meta.organism} sample: ${meta.sample_id}" - analysis_tool \\ - --sample ${meta.sample_id} \\ - --depth-category ${depth_category} \\ - --database ${db_path} \\ - --threads ${params.max_cpus ?: 4} - """ - } +### 3.1. Using `.filter()` with Groovy Truth - // Test the patterns - ch_files = Channel.of('Human_Liver_001.fastq', 'Mouse_Brain_002.fastq') - .map { filename -> - def parsed = parseFilename(filename) - def command = generateCommand([sample_id: parsed.sample_id, organism: parsed.organism, depth: 15_000_000]) - return [parsed, command] - } - .view { parsed, command -> "Parsed: ${parsed}, Command: ${command.split('\n')[0]}..." } - } - ``` +Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Let's add a validation step to filter out samples that don't meet our quality requirements. 
-=== "Before" +Add the following before the branch operation: - ```groovy title="main.nf" linenums="370" - ``` +```groovy title="main.nf - Adding filter" hl_lines="5-9" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) -### 5.3. Combining Operators for Robust Data Handling + // Filter out invalid or low-quality samples + ch_valid_samples = ch_samples + .filter { meta, reads -> + meta.id && meta.organism && meta.depth >= 1000000 + } -Let's combine these operators in a realistic workflow scenario: + trim_branches = ch_valid_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } +``` -=== "After" +This filter uses **Groovy Truth** - Groovy's way of evaluating expressions in boolean contexts: - ```groovy title="main.nf" linenums="420" hl_lines="1-20" +- `null`, empty strings, empty collections, and zero are all "false" +- Non-null values, non-empty strings, and non-zero numbers are "true" - workflow { - ch_samples = Channel.fromPath(params.input) - .splitCsv(header: true) - .map { row -> - // Combine safe navigation and Elvis operators - def meta = [ - id: row.sample_id?.toLowerCase() ?: 'unknown', - organism: row.organism ?: 'unknown', - quality: (row.quality_score as Double) ?: 30.0, - files: row.file_path ? [file(row.file_path)] : [] - ] +So `meta.id && meta.organism` checks that both fields exist and are non-empty, while `meta.depth >= 1000000` ensures we have sufficient sequencing depth. - // Use Groovy Truth for validation - if (meta.files && meta.id != 'unknown') { - return [meta, meta.files] - } else { - log.info "Skipping sample with missing data: ${meta.id}" - return null - } - } - .filter { it != null } // Remove invalid samples using Groovy Truth - .view { meta, files -> - "Valid sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}" - } - } +!!! note "Groovy Truth in Practice" + + The expression `meta.id && meta.organism` is more concise than writing: + ```groovy + meta.id != null && meta.id != '' && meta.organism != null && meta.organism != '' ``` -=== "Before" + This makes filtering logic much cleaner and easier to read. - ```groovy title="main.nf" linenums="420" - ``` +You could also combine `.filter()` with more complex Groovy logic: -### Takeaway +```groovy title="Complex filtering examples" +// Filter using safe navigation and Elvis operators +ch_samples + .filter { meta, reads -> + (meta.quality ?: 0) > 30 && meta.organism?.toLowerCase() in ['human', 'mouse'] + } -In this section, you've learned: +// Filter using regular expressions +ch_samples + .filter { meta, reads -> + meta.id =~ /^SAMPLE_\d+$/ && reads.exists() + } -- **Safe navigation operator** (`?.`) for null-safe property access -- **Elvis operator** (`?:`) for default values and null coalescing -!!! note +// Filter using multiple conditions with Groovy Truth +ch_samples + .filter { meta, reads -> + meta.files // Non-empty file list + && meta.paired // Boolean flag is true + && !meta.failed // Negative check + } +``` - **Groovy Truth**: In Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts. This is different from many other languages and is essential to understand for proper conditional logic. 
+### Takeaway -- **Groovy Truth** - how null, empty strings, and empty collections evaluate to false - in Groovy, null, empty strings, empty collections, and zero are all considered "false" in boolean contexts -- **Slashy strings** (`/pattern/`) for regex patterns without escaping -- **Multi-line string interpolation** for command templates -- **Numeric literals with underscores** for improved readability +In this section, you've learned to use Groovy logic to control workflow execution using the closure interfaces of Nextflow operators like `.branch{}` and `.filter{}`, leveraging Groovy Truth to write concise conditional expressions. -These patterns make your code more resilient to missing data and easier to read, which is essential when processing diverse bioinformatics datasets. +Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's make our workflow robust against missing or null values. --- -## 6. Advanced Closures and Functional Programming +## 4. Safe Navigation and Elvis Operators -Our pipeline now handles missing data gracefully and processes complex input formats robustly. But as our workflow grows more sophisticated, we start seeing repeated patterns in our data transformation code. Instead of copy-pasting similar closures across different processes or workflows, let's learn how to create reusable, composable functions that make our code cleaner and more maintainable. +Our `separateMetadata` function currently assumes all CSV fields are present and valid. But what happens with incomplete data? Let's find out. -### 6.1. Named Closures for Reusability +### 4.1. The Problem: Null Pointer Crashes -!!! note +Add a row with missing data to your `data/samples.csv`: +```csv +SAMPLE_004,,unknown_tissue,20000000,data/sequences/SAMPLE_004_S4_L001_R1_001.fastq, +``` - **Closures**: A closure is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. They're fundamental to Groovy's functional programming capabilities. +Notice the empty organism field and missing quality_score. Now try running the workflow: -So far we've used anonymous closures defined inline within channel operations. When you find yourself repeating the same transformation logic across multiple processes or workflows, named closures can eliminate duplication and improve readability: +```bash +nextflow run main.nf +``` + +It crashes with a NullPointerException! This is where Groovy's safe operators save the day. -A **closure** is a block of code that can be assigned to a variable and executed later. Think of it as a function that can be passed around and reused. +### 4.2. Safe Navigation Operator (`?.`) + +The safe navigation operator (`?.`) returns null instead of throwing an exception. Update your `separateMetadata` function: === "After" - ```groovy title="main.nf" linenums="350" hl_lines="1-30" + ```groovy title="main.nf" linenums="4" hl_lines="6-8" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism?.toLowerCase(), + tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score?.toDouble() + ] + // ... 
rest unchanged + ``` + +=== "Before" - // Define reusable closures for common transformations - def extractSampleInfo = { row -> - [ + ```groovy title="main.nf" linenums="4" + def separateMetadata(row) { + def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, - quality: row.quality_score.toDouble(), - depth: row.sequencing_depth.toInteger() + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() ] - } + // ... rest unchanged + ``` - def addPriority = { meta -> - meta + [priority: meta.quality > 40 ? 'high' : 'normal'] - } +Run again: - def formatForDisplay = { meta, file_path -> - "Sample: ${meta.id} (${meta.organism}) - Quality: ${meta.quality}, Priority: ${meta.priority}" - } +```bash +nextflow run main.nf +``` - workflow { - // Use named closures in channel operations - ch_samples = Channel.fromPath(params.input) - .splitCsv(header: true) - .map(extractSampleInfo) // Named closure - .map(addPriority) // Named closure - .map { meta -> [meta, file("./data/sequences/${meta.id}.fastq")] } - .view(formatForDisplay) // Named closure - - // Reuse the same closures elsewhere - ch_filtered = ch_samples - .filter { meta, file -> meta.quality > 30 } - .map { meta, file -> addPriority(meta) } // Reuse closure - .view(formatForDisplay) // Reuse closure - } - ``` +No crash! But SAMPLE_004 now has `null` values which could cause problems downstream. -=== "Before" +### 4.3. Elvis Operator (`?:`) for Defaults - ```groovy title="main.nf" linenums="350" - ``` +The Elvis operator (`?:`) provides default values. Update again: -### 6.2. Function Composition +=== "After" -Groovy closures can be composed together using the `>>` operator, allowing you to build complex transformations from simple, reusable pieces: + ```groovy title="main.nf" linenums="4" hl_lines="6-8" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism?.toLowerCase() ?: 'unknown', + tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase() ?: 'unknown', + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score?.toDouble() ?: 0.0 + ] + // ... rest unchanged + ``` -**Function composition** means chaining functions together so the output of one becomes the input of the next. The `>>` operator creates a new closure that applies multiple transformations in sequence. +=== "Before" -=== "After" + ```groovy title="main.nf" linenums="4" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism?.toLowerCase(), + tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score?.toDouble() + ] + // ... rest unchanged + ``` - ```groovy title="main.nf" linenums="390" hl_lines="1-25" +Run once more: - // Simple transformation closures - def normalizeId = { meta -> - meta + [id: meta.id.toLowerCase().replaceAll(/[^a-z0-9_]/, '_')] - } +```bash +nextflow run main.nf +``` - def addQualityCategory = { meta -> - def category = meta.quality > 40 ? 'excellent' : - meta.quality > 30 ? 'good' : - meta.quality > 20 ? 'acceptable' : 'poor' - meta + [quality_category: category] - } +Perfect! SAMPLE_004 now has safe defaults: 'unknown' for organism/tissue, 0.0 for quality. - def addProcessingFlags = { meta -> - meta + [ - needs_extra_qc: meta.quality < 30, - high_priority: meta.organism == 'human' && meta.quality > 35 - ] - } +### 4.4. 
Filtering with Safe Operators + +Now let's filter out samples with missing data. Update your workflow: - // Compose transformations using >> operator - def enrichSample = normalizeId >> addQualityCategory >> addProcessingFlags +=== "After" + ```groovy title="main.nf" linenums="28" hl_lines="4-7" workflow { - Channel.fromPath(params.input) + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(extractSampleInfo) - .map(enrichSample) // Apply composed transformation - .view { meta -> - "Processed: ${meta.id} (${meta.quality_category}) - Extra QC: ${meta.needs_extra_qc}" + .map(separateMetadata) + .filter { meta, reads -> + meta.organism != 'unknown' && (meta.quality ?: 0) > 0 } - } + + // ... rest of workflow ``` === "Before" - ```groovy title="main.nf" linenums="390" - ``` + ```groovy title="main.nf" linenums="28" + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) -### 6.3. Currying for Specialized Functions + // ... rest of workflow + ``` -Currying allows you to create specialized versions of general-purpose closures by fixing some of their parameters: +Run the workflow: -**Currying** is a technique where you take a function with multiple parameters and create a new function with some of those parameters "fixed" or "pre-filled". This creates specialized versions of general-purpose functions. +```bash +nextflow run main.nf +``` -=== "After" +SAMPLE_004 is now filtered out! Only valid samples proceed. - ```groovy title="main.nf" linenums="430" hl_lines="1-20" +### Takeaway - // General-purpose filtering closure - def qualityFilter = { threshold, meta -> meta.quality >= threshold } +- **Safe navigation (`?.`)**: Prevents crashes on null values - returns null instead of throwing exception +- **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` +- **Combining**: `value?.method() ?: 'default'` is the common pattern +- **In filters**: Use to handle missing data: `(meta.quality ?: 0) > threshold` - // Create specialized filters using currying - def highQualityFilter = qualityFilter.curry(40) - def standardQualityFilter = qualityFilter.curry(30) +These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. - workflow { - ch_samples = Channel.fromPath(params.input) - .splitCsv(header: true) - .map(extractSampleInfo) +### 4.5. Validation with `error()` and `log.warn` - // Use the specialized filters in different channel operations - ch_high_quality = ch_samples.filter(highQualityFilter) - ch_standard_quality = ch_samples.filter(standardQualityFilter) +Sometimes you need to stop the workflow immediately if input parameters are invalid. Nextflow provides `error()` for this. Let's add validation to our workflow. - // Both channels can be processed differently - ch_high_quality.view { meta -> "High quality: ${meta.id}" } - ch_standard_quality.view { meta -> "Standard quality: ${meta.id}" } - } - ``` +Create a validation function before your workflow block: -=== "Before" +=== "After" - ```groovy title="main.nf" linenums="430" - ``` + ```groovy title="main.nf" linenums="1" hl_lines="3-18" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' -### 6.4. 
Closures Accessing External Variables + def validateInputs() { + // Check CSV file exists + if (!file(params.input ?: './data/samples.csv').exists()) { + error("Input CSV file not found: ${params.input ?: './data/samples.csv'}") + } -Closures can access and modify variables from their defining scope, which is useful for collecting statistics: + // Warn if output directory already exists + if (file(params.outdir ?: 'results').exists()) { + log.warn "Output directory already exists: ${params.outdir ?: 'results'}" + } -=== "After" + // Check for required genome parameter + if (params.run_gatk && !params.genome) { + error("Genome reference required when running GATK. Please provide --genome") + } + } - ```groovy title="main.nf" linenums="480" hl_lines="1-20" + // ... separateMetadata function ... workflow { - // Variable in the workflow scope - def sample_count = 0 - def human_samples = 0 - - // Closure that accesses and modifies external variables - def countSamples = { meta -> - sample_count++ // Modifies external variable - if (meta.organism == 'human') { - human_samples++ // Modifies another external variable - } - return meta // Pass data through unchanged - } + validateInputs() // Call validation first - Channel.fromPath(params.input) + ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(extractSampleInfo) - .map(countSamples) // Closure modifies external variables - .collect() // Wait for all samples to be processed - .view { - "Processing complete: ${sample_count} total samples, ${human_samples} human samples" - } + .map(separateMetadata) + // ... rest of workflow } ``` === "Before" - ```groovy title="main.nf" linenums="480" - ``` + ```groovy title="main.nf" linenums="1" + include { FASTP } from './modules/fastp.nf' + include { TRIMGALORE } from './modules/trimgalore.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' -### Takeaway + // ... separateMetadata function ... -In this section, you've learned: + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + // ... rest of workflow + } + ``` -- **Named closures** for eliminating code duplication and improving readability -- **Function composition** with `>>` operator to build complex transformations -- **Currying** to create specialized versions of general-purpose closures -- **Variable scope access** in closures for collecting statistics and generating reports +Now try running without the CSV file: -These advanced patterns help you write more maintainable, reusable workflows that follow functional programming principles while remaining easy to understand and debug. +```bash +mv data/samples.csv data/samples.csv.bak +nextflow run main.nf +``` -With our pipeline now capable of intelligent routing, robust error handling, and advanced functional programming patterns, we're ready for the final enhancement. As your workflows scale to process hundreds or thousands of samples, you'll need sophisticated data processing capabilities that can organize, filter, and analyze large collections efficiently. +The workflow stops immediately with a clear error message instead of failing mysteriously later! -The functional programming patterns we just learned work beautifully with Groovy's powerful collection methods. Instead of writing loops and conditional logic, you can chain together expressive operations that clearly describe what you want to accomplish. 
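Before moving on, restore the samplesheet you renamed to trigger that error, so the remaining steps have input data to work with:

```bash title="Restore the samplesheet"
mv data/samples.csv.bak data/samples.csv
```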
+You can also add validation within the `separateMetadata` function: ---- +```groovy title="main.nf - Validation in function" +def separateMetadata(row) { + // Validate required fields + if (!row.sample_id) { + error("Missing sample_id in CSV row: ${row}") + } -## 7. Collection Operations and File Path Manipulations + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism?.toLowerCase() ?: 'unknown', + // ... rest of fields + ] -### 7.1. Common Collection Methods in Channel Operations + // Validate data makes sense + if (sample_meta.depth < 1000000) { + log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + } -When processing large datasets, channel operations often need to organize and analyze sample collections. Groovy's collection methods integrate seamlessly with Nextflow channels to provide powerful data processing capabilities: + return [sample_meta, file(row.file_path)] +} +``` -Groovy provides many built-in methods for working with collections (lists, maps, etc.) that make data processing much more expressive than traditional loops. +### Takeaway (Updated) -=== "After" +- **Safe navigation (`?.`)**: Prevents crashes on null values - returns null instead of throwing exception +- **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` +- **`error()`**: Stops workflow immediately with clear message +- **`log.warn`**: Issues warnings without stopping workflow +- **Early validation**: Check inputs before processing to fail fast with helpful errors - ```groovy title="main.nf" linenums="500" hl_lines="1-40" +These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. - // Sample data with mixed quality and organisms - def samples = [ - [id: 'sample_001', organism: 'human', quality: 42, files: ['data1.txt', 'data2.txt']], - [id: 'sample_002', organism: 'mouse', quality: 28, files: ['data3.txt']], - [id: 'sample_003', organism: 'human', quality: 35, files: ['data4.txt', 'data5.txt', 'data6.txt']], - [id: 'sample_004', organism: 'rat', quality: 45, files: ['data7.txt']], - [id: 'sample_005', organism: 'human', quality: 30, files: ['data8.txt', 'data9.txt']] - ] +--- - // findAll - filter collections based on conditions - def high_quality_samples = samples.findAll { it.quality > 40 } - println "High quality samples: ${high_quality_samples.collect { it.id }.join(', ')}" +## 5. Groovy in Configuration: Workflow Event Handlers - // groupBy - group samples by organism - def samples_by_organism = samples.groupBy { it.organism } - println "Grouping by organism:" - samples_by_organism.each { organism, sample_list -> - println " ${organism}: ${sample_list.size()} samples" - } +Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file. - // unique - get unique organisms - def organisms = samples.collect { it.organism }.unique() - println "Unique organisms: ${organisms.join(', ')}" +Event handlers are Groovy closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. - // flatten - flatten nested file lists - def all_files = samples.collect { it.files }.flatten() - println "All files: ${all_files.take(5).join(', ')}... (${all_files.size()} total)" +### 5.1. 
The `onComplete` Handler - // sort - sort samples by quality - def sorted_by_quality = samples.sort { it.quality } - println "Quality range: ${sorted_by_quality.first().quality} to ${sorted_by_quality.last().quality}" +The most commonly used event handler is `onComplete`, which runs when your workflow finishes (whether it succeeded or failed). Let's add one to summarize our pipeline results. - // reverse - reverse the order - def reverse_quality = samples.sort { it.quality }.reverse() - println "Highest quality first: ${reverse_quality.collect { "${it.id}(${it.quality})" }.join(', ')}" +Your `nextflow.config` file already has Docker enabled. Add an event handler after the existing configuration: - // count - count items matching condition - def human_samples = samples.count { it.organism == 'human' } - println "Human samples: ${human_samples} out of ${samples.size()}" +=== "After" - // any/every - check conditions across collection - def has_high_quality = samples.any { it.quality > 40 } - def all_have_files = samples.every { it.files.size() > 0 } - println "Has high quality samples: ${has_high_quality}" - println "All samples have files: ${all_have_files}" + ```groovy title="nextflow.config" linenums="1" hl_lines="5-15" + // Nextflow configuration for Groovy Essentials side quest + + docker.enabled = true + + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + } ``` === "Before" - ```groovy title="main.nf" linenums="500" - ``` + ```groovy title="nextflow.config" linenums="1" + // Nextflow configuration for Groovy Essentials side quest -### 7.2. File Path Manipulations - -Working with file paths is essential in bioinformatics workflows. Groovy provides many useful methods for extracting information from file paths: - -=== "After" + docker.enabled = true + ``` - ```groovy title="main.nf" linenums="550" hl_lines="1-30" +This is a Groovy closure being assigned to `workflow.onComplete`. Inside, you have access to the `workflow` object which provides useful properties about the execution. - // File path manipulation examples - def sample_files = [ - '/path/to/data/patient_001_R1.fastq.gz', - '/path/to/data/patient_001_R2.fastq.gz', - '/path/to/results/patient_002_analysis.bam', - '/path/to/configs/experiment_setup.json' - ] +Run your workflow and you'll see this summary appear at the end! 
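The exact timestamps, duration, and paths are specific to each run, but the tail of the log should look roughly like this:

```console title="Example completion summary (illustrative)"
Pipeline execution summary:
==========================
Completed at: 2025-08-14T18:02:31.118+01:00
Duration    : 1m 12s
Success     : true
workDir     : /workspaces/training/side-quests/groovy_essentials/work
exit status : 0
```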
- sample_files.each { file_path -> - def f = file(file_path) // Create Nextflow file object +Let's make it more useful by adding conditional logic: - println "\nFile: ${file_path}" - println " Name: ${f.getName()}" // Just filename - println " BaseName: ${f.getBaseName()}" // Filename without extension - println " Extension: ${f.getExtension()}" // File extension - println " Parent: ${f.getParent()}" // Parent directory - println " Parent name: ${f.getParent().getName()}" // Just parent directory name +=== "After" - // Extract sample ID from filename - def matcher = f.getName() =~ /^(patient_\d+)/ - if (matcher) { - println " Sample ID: ${matcher[0][1]}" + ```groovy title="nextflow.config" linenums="5" hl_lines="11-18" + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + println "Results are in: ${params.outdir ?: 'results'}" + } else { + println "❌ Pipeline failed!" + println "Error: ${workflow.errorMessage}" } } + ``` - // Group files by sample ID using path manipulation - def files_by_sample = sample_files - .findAll { it.contains('patient') } // Only patient files - .groupBy { file_path -> - def filename = file(file_path).getName() - def matcher = filename =~ /^(patient_\d+)/ - return matcher ? matcher[0][1] : 'unknown' - } +=== "Before" - println "\nFiles grouped by sample:" - files_by_sample.each { sample_id, files -> - println " ${sample_id}: ${files.size()} files" + ```groovy title="nextflow.config" linenums="5" + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" } ``` -=== "Before" +You can also write the summary to a file using Groovy file operations: + +```groovy title="nextflow.config - Writing summary to file" +workflow.onComplete = { + def summary = """ + Pipeline Execution Summary + =========================== + Completed: ${workflow.complete} + Duration : ${workflow.duration} + Success : ${workflow.success} + Command : ${workflow.commandLine} + """ - ```groovy title="main.nf" linenums="550" - ``` + println summary -### 7.3. The Spread Operator + // Write to a log file + def log_file = file("${workflow.launchDir}/pipeline_summary.txt") + log_file.text = summary +} +``` -The spread operator (`*.`) is a powerful Groovy feature for calling methods on all elements in a collection: +### 5.2. Other Useful Event Handlers -The **spread operator** (`*.`) is a shorthand way to call the same method on every element in a collection. It's equivalent to using `.collect { it.methodName() }` but more concise. 
+Besides `onComplete`, there are other event handlers you can use: -=== "After" +**`onStart`** - Runs when the workflow begins: - ```groovy title="main.nf" linenums="590" hl_lines="1-20" +```groovy title="nextflow.config - onStart handler" +workflow.onStart = { + println "="* 50 + println "Starting pipeline: ${workflow.runName}" + println "Project directory: ${workflow.projectDir}" + println "Launch directory: ${workflow.launchDir}" + println "Work directory: ${workflow.workDir}" + println "="* 50 +} +``` - // Spread operator examples - def file_paths = [ - '/data/sample1.fastq', - '/data/sample2.fastq', - '/results/output1.bam', - '/results/output2.bam' - ] +**`onError`** - Runs only if the workflow fails: + +```groovy title="nextflow.config - onError handler" +workflow.onError = { + println "="* 50 + println "Pipeline execution failed!" + println "Error message: ${workflow.errorMessage}" + println "="* 50 + + // Write detailed error log + def error_file = file("${workflow.launchDir}/error.log") + error_file.text = """ + Workflow Error Report + ===================== + Time: ${new Date()} + Error: ${workflow.errorMessage} + Error report: ${workflow.errorReport ?: 'No detailed report available'} + """ - // Convert to file objects - def files = file_paths.collect { file(it) } + println "Error details written to: ${error_file}" +} +``` - // Using spread operator - equivalent to files.collect { it.getName() } - def filenames = files*.getName() - println "Filenames: ${filenames.join(', ')}" +You can use multiple handlers together: - // Get all parent directories - def parent_dirs = files*.getParent()*.getName() - println "Parent directories: ${parent_dirs.unique().join(', ')}" +```groovy title="nextflow.config - Combined handlers" +workflow.onStart = { + println "Starting ${workflow.runName} at ${workflow.start}" +} - // Get all extensions - def extensions = files*.getExtension().unique() - println "File types: ${extensions.join(', ')}" - ``` +workflow.onError = { + println "Workflow failed: ${workflow.errorMessage}" +} -=== "Before" +workflow.onComplete = { + def duration_mins = workflow.duration.toMinutes().round(2) + def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" - ```groovy title="main.nf" linenums="590" - ``` + println """ + Pipeline finished: ${status} + Duration: ${duration_mins} minutes + """ +} +``` ### Takeaway In this section, you've learned: -- **Collection filtering** with `findAll` and conditional logic -- **Grouping and organizing** data with `groupBy` and `sort` -- **File path manipulation** using Nextflow's file object methods -- **Spread operator** (`*.`) for concise collection operations +- **Event handler closures**: Groovy closures in `nextflow.config` that run at different lifecycle points +- **`onComplete` handler**: For execution summaries and result reporting +- **`onStart` handler**: For logging pipeline initialization +- **`onError` handler**: For error handling and logging failures +- **Workflow object properties**: Accessing `workflow.success`, `workflow.duration`, `workflow.errorMessage`, etc. -These patterns help you process and organize complex datasets efficiently, which is essential for handling real-world bioinformatics data. +Event handlers are pure Groovy code running in your config file, demonstrating that Nextflow configuration is actually a Groovy script with access to the full language. --- @@ -1688,34 +1802,30 @@ Here's how we progressively enhanced our pipeline: 1. 
**Nextflow vs Groovy Boundaries**: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial differences between constructs like `collect`. -2. **String Processing**: You learned regular expressions, parsing functions, and file collection transformation for building dynamic command-line arguments. - -3. **Conditional Logic**: You added intelligent routing that automatically selects analysis strategies based on sample characteristics like organism, quality scores, and sequencing depth. - -4. **Error Handling**: You made the pipeline robust by adding validation functions, try-catch error handling, and configuration management with sensible defaults. +2. **Advanced String Processing**: You mastered regular expressions, parsing functions, reusable functions, variable interpolation (Groovy vs Bash vs Shell), dynamic script generation in processes, and dynamic resource directives with closures. -5. **Essential Groovy Operators**: You mastered safe navigation (`?.`), Elvis (`?:`), Groovy Truth, slashy strings, and other key language features that make code more resilient and readable. +3. **Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging Groovy Truth for concise conditional expressions. -6. **Advanced Closures**: You learned functional programming techniques including named closures, function composition, currying, and closures with variable scope access for building reusable, maintainable code. +4. **Safe Navigation and Elvis Operators**: You made the pipeline robust against missing data using `?.` for null-safe property access, `?:` for providing default values, and `error()` for input validation. -7. **Collection Operations**: You added sophisticated data processing capabilities using Groovy collection methods like `findAll`, `groupBy`, `unique`, `flatten`, and the spread operator to handle large-scale sample processing. +5. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete`, `onStart`, `onError`) for logging, notifications, and lifecycle management. ### Key Benefits - **Clearer code**: Understanding when to use Nextflow and Groovy helps you write more organized workflows -- **Better error handling**: Basic validation and try-catch patterns help your workflows handle problems gracefully +- **Robust handling**: Safe navigation and Elvis operators make workflows resilient to missing data - **Flexible processing**: Conditional logic lets your workflows process different sample types appropriately -- **Configuration management**: Using defaults and simple validation makes your workflows easier to use +- **Adaptive resources**: Dynamic directives optimize resource usage based on input characteristics ### From Simple to Sophisticated The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows: 1. **Started simple**: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries -2. **Added intelligence**: Dynamic file name parsing with regex patterns and conditional routing based on sample characteristics -3. **Made it robust**: Null-safe operators, validation, error handling, and graceful failure management -4. **Made it maintainable**: Advanced closure patterns, function composition, and reusable components that eliminate code duplication -5. 
**Scaled it efficiently**: Collection operations for processing hundreds of samples with powerful data filtering and organization +2. **Added intelligence**: Dynamic file name parsing with regex patterns, variable interpolation mastery, and conditional routing based on sample characteristics +3. **Made it efficient**: Dynamic resource allocation with closures in directives and retry strategies +4. **Made it robust**: Safe navigation and Elvis operators for handling missing data gracefully +5. **Added observability**: Workflow event handlers for logging, notifications, and lifecycle management This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows. @@ -1724,11 +1834,14 @@ This progression mirrors the real-world evolution of bioinformatics pipelines - With these Groovy fundamentals mastered, you're ready to: - Write cleaner workflows with proper separation between Nextflow and Groovy logic +- Master variable interpolation to avoid common pitfalls with Groovy, Bash, and shell variables +- Use dynamic resource directives for efficient, adaptive workflows - Transform file collections into properly formatted command-line arguments -- Handle different file naming conventions and input formats gracefully +- Handle different file naming conventions and input formats gracefully using regex and string processing - Build reusable, maintainable code using advanced closure patterns and functional programming - Process and organize complex datasets using collection operations -- Add basic validation and error handling to make your workflows more user-friendly +- Add validation, error handling, and logging to make your workflows production-ready +- Implement workflow lifecycle management with event handlers Continue practicing these patterns in your own workflows, and refer to the [Groovy documentation](http://groovy-lang.org/documentation.html) when you need to explore more advanced features. 
diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf index 10fd3cf692..dbfd5ac79d 100644 --- a/side-quests/groovy_essentials/collect.nf +++ b/side-quests/groovy_essentials/collect.nf @@ -1,23 +1,18 @@ -// Demonstrate Groovy vs Nextflow collect def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -println "=== GROOVY COLLECT (transforms each item, keeps same structure) ===" -// Groovy collect: transforms each element but maintains list structure +// Nextflow collect() - groups multiple channel emissions into one +ch_input = Channel.fromList(sample_ids) +ch_input.view { "Individual channel item: ${it}" } +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + +// Groovy collect - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } -println "Original list: ${sample_ids}" -println "Groovy collect result: ${formatted_ids}" -println "Groovy collect maintains structure: ${formatted_ids.size} items (same as original)" -println "" - -println "\n=== NEXTFLOW COLLECT (groups multiple items into single emission) ===" -// Nextflow collect: groups channel elements into a single emission -ch_input = Channel.of('sample_001', 'sample_002', 'sample_003') +println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" -// Show individual items before collect -ch_input.view { "Individual channel item: ${it}" } - -// Collect groups all items into a single emission -ch_collected = ch_input.collect() -ch_collected.view { "Nextflow collect result: ${it} (${it.size()} items grouped together)" } +// Spread operator - concise property access +def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] +def all_ids = sample_data*.id +println "Spread operator result: ${all_ids}" diff --git a/side-quests/groovy_essentials/data/samples.csv b/side-quests/groovy_essentials/data/samples.csv index 829f791e71..1d12e1384a 100644 --- a/side-quests/groovy_essentials/data/samples.csv +++ b/side-quests/groovy_essentials/data/samples.csv @@ -1,4 +1,4 @@ sample_id,organism,tissue_type,sequencing_depth,file_path,quality_score -SAMPLE_001,human,liver,30000000,data/sequences/sample_001.fastq,38.5 -SAMPLE_002,mouse,brain,25000000,data/sequences/sample_002.fastq,35.2 -SAMPLE_003,human,kidney,45000000,data/sequences/sample_003.fastq,42.1 +SAMPLE_001,human,liver,30000000,data/sequences/SAMPLE_001_S1_L001_R1_001.fastq,38.5 +SAMPLE_002,mouse,brain,25000000,data/sequences/SAMPLE_002_S2_L001_R1_001.fastq,35.2 +SAMPLE_003,human,kidney,45000000,data/sequences/SAMPLE_003_S3_L001_R1_001.fastq,42.1 diff --git a/side-quests/groovy_essentials/data/sequences/sample_001.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq similarity index 59% rename from side-quests/groovy_essentials/data/sequences/sample_001.fastq rename to side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq index 55bf845a27..5dc7a08c8b 100644 --- a/side-quests/groovy_essentials/data/sequences/sample_001.fastq +++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq @@ -1,12 +1,12 @@ @sample_001_read_1 ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC + -IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @sample_001_read_2 
GCATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT + -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @sample_001_read_3 TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA + -JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJH diff --git a/side-quests/groovy_essentials/data/sequences/sample_002.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq similarity index 59% rename from side-quests/groovy_essentials/data/sequences/sample_002.fastq rename to side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq index 71351a97a7..f04cbb4d44 100644 --- a/side-quests/groovy_essentials/data/sequences/sample_002.fastq +++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq @@ -1,12 +1,12 @@ @sample_002_read_1 CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT + -IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @sample_002_read_2 ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG + -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @sample_002_read_3 GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC + -JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/data/sequences/sample_003.fastq b/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq similarity index 59% rename from side-quests/groovy_essentials/data/sequences/sample_003.fastq rename to side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq index 4acdd0f18e..425e92abaa 100644 --- a/side-quests/groovy_essentials/data/sequences/sample_003.fastq +++ b/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq @@ -1,12 +1,12 @@ @sample_003_read_1 GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC + -IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @sample_003_read_2 CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG + -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH +HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH @sample_003_read_3 ATATATATATATATATATATATATATATATATATATATAT + -JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ +JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ diff --git a/side-quests/groovy_essentials/modules/generate_report.nf b/side-quests/groovy_essentials/modules/generate_report.nf new file mode 100644 index 0000000000..4bf7a1d786 --- /dev/null +++ b/side-quests/groovy_essentials/modules/generate_report.nf @@ -0,0 +1,16 @@ +process GENERATE_REPORT { + + publishDir 'results/reports', mode: 'copy' + + input: + tuple val(meta), path(reads) + + output: + path "${meta.id}_report.txt" + + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ +} diff --git a/side-quests/groovy_essentials/nextflow.config b/side-quests/groovy_essentials/nextflow.config index 7d577dac43..b57f70c8c2 100644 --- a/side-quests/groovy_essentials/nextflow.config +++ b/side-quests/groovy_essentials/nextflow.config @@ -1,39 +1,3 @@ // Nextflow configuration for Groovy Essentials side quest -// Basic parameters for the tutorial -params { - input = './data/samples.csv' - quality_threshold_min = 25 - quality_threshold_high = 40 -} - -// Simple process defaults -process { - cpus = 2 - memory = '4.GB' - errorStrategy = 'retry' - maxRetries = 2 -} - -// Basic profiles -profiles { - standard { - process.executor = 'local' - } - - test { - params.input = 
'./data/samples.csv' - process.memory = '2.GB' - process.cpus = 1 - } -} - -// Workflow metadata -manifest { - name = 'groovy-essentials' - description = 'Groovy Essentials for Nextflow Developers' - author = 'Nextflow Training' - version = '1.0.0' - homePage = 'https://training.nextflow.io' - nextflowVersion = '>=23.04.0' -} +docker.enabled = true diff --git a/side-quests/solutions/groovy_essentials/collect.nf b/side-quests/solutions/groovy_essentials/collect.nf new file mode 100644 index 0000000000..dbfd5ac79d --- /dev/null +++ b/side-quests/solutions/groovy_essentials/collect.nf @@ -0,0 +1,18 @@ +def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + +// Nextflow collect() - groups multiple channel emissions into one +ch_input = Channel.fromList(sample_ids) +ch_input.view { "Individual channel item: ${it}" } +ch_collected = ch_input.collect() +ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + +// Groovy collect - transforms each element, preserves structure +def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') +} +println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + +// Spread operator - concise property access +def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] +def all_ids = sample_data*.id +println "Spread operator result: ${all_ids}" diff --git a/side-quests/solutions/groovy_essentials/main.nf b/side-quests/solutions/groovy_essentials/main.nf new file mode 100644 index 0000000000..2ee2a2ae03 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/main.nf @@ -0,0 +1,64 @@ +include { FASTP } from './modules/fastp.nf' +include { TRIMGALORE } from './modules/trimgalore.nf' +include { GENERATE_REPORT } from './modules/generate_report.nf' + +def validateInputs() { + // Check CSV file exists + if (!file(params.input ?: './data/samples.csv').exists()) { + error("Input CSV file not found: ${params.input ?: './data/samples.csv'}") + } + + // Warn if output directory already exists + if (file(params.outdir ?: 'results').exists()) { + log.warn "Output directory already exists: ${params.outdir ?: 'results'}" + } + + // Check for required genome parameter + if (params.run_gatk && !params.genome) { + error("Genome reference required when running GATK. Please provide --genome") + } +} + +def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism?.toLowerCase() ?: 'unknown', + tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase() ?: 'unknown', + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score?.toDouble() ?: 0.0 + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] +} + +workflow { + validateInputs() + + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + .filter { meta, reads -> + meta.organism != 'unknown' && (meta.quality ?: 0) > 0 + } + + GENERATE_REPORT(ch_samples) + + trim_branches = ch_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } + + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) +} diff --git a/side-quests/solutions/groovy_essentials/modules/fastp.nf b/side-quests/solutions/groovy_essentials/modules/fastp.nf new file mode 100644 index 0000000000..f74e5e772a --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/fastp.nf @@ -0,0 +1,37 @@ +process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta.id), path("*_trimmed*.fastq.gz"), emit: reads + path "*.{json,html}" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? reads[0] : reads + """ + fastp \\ + --in1 ${input_file} \\ + --out1 ${meta.id}_trimmed.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } else { + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ + } +} diff --git a/side-quests/solutions/groovy_essentials/modules/generate_report.nf b/side-quests/solutions/groovy_essentials/modules/generate_report.nf new file mode 100644 index 0000000000..91e7af87f6 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/generate_report.nf @@ -0,0 +1,24 @@ +process GENERATE_REPORT { + + publishDir 'results/reports', mode: 'copy' + + input: + tuple val(meta), path(reads) + + output: + path "${meta.id}_report.txt" + + script: + def report_type = meta.priority == 'high' ? 'PRIORITY' : 'STANDARD' + """ + echo "=== ${report_type} SAMPLE REPORT ===" > ${meta.id}_report.txt + echo "Processing ${reads}" >> ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Quality: ${meta.quality}" >> ${meta.id}_report.txt + echo "Priority: ${meta.priority}" >> ${meta.id}_report.txt + echo "---" >> ${meta.id}_report.txt + echo "Processed by: \${USER}" >> ${meta.id}_report.txt + echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt + echo "Date: \$(date)" >> ${meta.id}_report.txt + """ +} diff --git a/side-quests/solutions/groovy_essentials/modules/trimgalore.nf b/side-quests/solutions/groovy_essentials/modules/trimgalore.nf new file mode 100644 index 0000000000..945fef640c --- /dev/null +++ b/side-quests/solutions/groovy_essentials/modules/trimgalore.nf @@ -0,0 +1,37 @@ +process TRIMGALORE { + container 'quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_0' + + input: + tuple val(meta), path(reads) + + output: + tuple val(meta), path("*_trimmed*.fq"), emit: reads + path "*_trimming_report.txt" , emit: reports + + script: + // Simple single-end vs paired-end detection + def is_single = reads instanceof List ? reads.size() == 1 : true + + if (is_single) { + def input_file = reads instanceof List ? 
reads[0] : reads + """ + trim_galore \\ + --cores $task.cpus \\ + ${input_file} + + # Rename output to match expected pattern + mv *_trimmed.fq ${meta.id}_trimmed.fq + """ + } else { + """ + trim_galore \\ + --paired \\ + --cores $task.cpus \\ + ${reads[0]} ${reads[1]} + + # Rename outputs to match expected pattern + mv *_val_1.fq ${meta.id}_trimmed_R1.fq + mv *_val_2.fq ${meta.id}_trimmed_R2.fq + """ + } +} diff --git a/side-quests/solutions/groovy_essentials/nextflow.config b/side-quests/solutions/groovy_essentials/nextflow.config new file mode 100644 index 0000000000..7d2d0ccef0 --- /dev/null +++ b/side-quests/solutions/groovy_essentials/nextflow.config @@ -0,0 +1,36 @@ +// Nextflow configuration for Groovy Essentials side quest + +docker.enabled = true + +workflow.onStart = { + println "Starting ${workflow.runName} at ${workflow.start}" +} + +workflow.onError = { + println "Workflow failed: ${workflow.errorMessage}" +} + +workflow.onComplete = { + def duration_mins = workflow.duration.toMinutes().round(2) + def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" + + def summary = """ + Pipeline Execution Summary + =========================== + Completed: ${workflow.complete} + Duration : ${workflow.duration} + Success : ${workflow.success} + Command : ${workflow.commandLine} + """ + + println summary + + // Write to a log file + def log_file = file("${workflow.launchDir}/pipeline_summary.txt") + log_file.text = summary + + println """ + Pipeline finished: ${status} + Duration: ${duration_mins} minutes + """ +} From 28210b46a3d92fa8d917a5229c9d9b04476b1484 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 13:38:23 +0100 Subject: [PATCH 12/48] Tone down intro --- docs/side_quests/groovy_essentials.md | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 7ea9e25c39..13326e76b2 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1,21 +1,10 @@ # Groovy Essentials for Nextflow Developers -Nextflow is built on Apache Groovy, a powerful dynamic language that runs on the Java Virtual Machine. This foundation gives Nextflow its flexibility and expressiveness, but it also creates a common source of confusion for developers. +Nextflow is built on Groovy, a powerful dynamic language that runs on the Java Virtual Machine. Most Nextflow tutorials focus on workflow orchestration - channels, processes, and data flow - but when you need to manipulate data, parse filenames, or implement conditional logic, you're actually writing Groovy code. -**Here's the challenge:** Most Nextflow tutorials focus on the workflow orchestration - channels, processes, and data flow - but when you need to manipulate data, parse filenames, implement conditional logic, or handle errors gracefully, you're actually writing Groovy code. Many developers don't realize when they've crossed this boundary. +Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, and conditional logic within your workflow scripts. -**Why does this matter?** The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to understanding and leveraging Groovy's powerful features within your Nextflow workflows. 
- -**The common struggle:** Most Nextflow developers can write basic workflows, but they hit walls when they need to: -- Process messy, real-world data with missing fields or inconsistent formats -- Extract metadata from complex file naming schemes -- Route samples through different analysis strategies based on their characteristics -- Handle errors gracefully instead of crashing on invalid input -- Build reusable, maintainable code that doesn't repeat the same patterns everywhere - -Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, conditional logic, and general programming tasks within your workflow scripts. - -This side quest will bridge that gap by taking you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline that handles real-world complexity. Starting with a basic workflow that processes sample metadata, we'll evolve it step-by-step through realistic challenges you'll face in production: +This side quest takes you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline, evolving it step-by-step through realistic challenges: - **Understanding boundaries:** Distinguish between Nextflow operators and Groovy methods, and master when to use each - **Data manipulation:** Extract, transform, and subset maps and collections using Groovy's powerful operators @@ -1368,7 +1357,7 @@ Our `separateMetadata` function currently assumes all CSV fields are present and ### 4.1. 
The Problem: Null Pointer Crashes -Add a row with missing data to your `data/samples.csv`: +Add a row with missing data to your `data/samples.csv`: ```csv SAMPLE_004,,unknown_tissue,20000000,data/sequences/SAMPLE_004_S4_L001_R1_001.fastq, ``` From 51245c8d36b72631cdc01d41308ebbf0237713e0 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:02:51 +0100 Subject: [PATCH 13/48] Reset collect.nf to starting state --- side-quests/groovy_essentials/collect.nf | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf index dbfd5ac79d..aaa5573933 100644 --- a/side-quests/groovy_essentials/collect.nf +++ b/side-quests/groovy_essentials/collect.nf @@ -5,14 +5,3 @@ ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } - -// Groovy collect - transforms each element, preserves structure -def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') -} -println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" - -// Spread operator - concise property access -def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] -def all_ids = sample_data*.id -println "Spread operator result: ${all_ids}" From df11a3297f062e1a9c6f17129a461e2687faca44 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:29:11 +0100 Subject: [PATCH 14/48] Tweaks --- docs/side_quests/groovy_essentials.md | 126 +++++++++++------- side-quests/groovy_essentials/README.md | 80 ----------- side-quests/groovy_essentials/collect.nf | 11 ++ .../data/metadata/analysis_parameters.yaml | 25 ---- side-quests/groovy_essentials/main.nf | 22 +++ .../groovy_essentials/modules/fastp.nf.bak | 22 --- .../templates/analysis_script.sh | 27 ---- 7 files changed, 109 insertions(+), 204 deletions(-) delete mode 100644 side-quests/groovy_essentials/README.md delete mode 100644 side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml delete mode 100644 side-quests/groovy_essentials/modules/fastp.nf.bak delete mode 100644 side-quests/groovy_essentials/templates/analysis_script.sh diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 13326e76b2..62150e54c3 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1,8 +1,8 @@ # Groovy Essentials for Nextflow Developers -Nextflow is built on Groovy, a powerful dynamic language that runs on the Java Virtual Machine. Most Nextflow tutorials focus on workflow orchestration - channels, processes, and data flow - but when you need to manipulate data, parse filenames, or implement conditional logic, you're actually writing Groovy code. +Nextflow is built on Groovy, a powerful dynamic language that runs on the Java Virtual Machine. You can write a lot of Nextflow without ever feeling like you've learned Groovy - many workflows use only basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. -Understanding where Nextflow ends and Groovy begins is crucial for effective workflow development. 
Nextflow provides channels, processes, and workflow orchestration, while Groovy handles data manipulation, string processing, and conditional logic within your workflow scripts. +However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, you're writing Groovy code - and knowing a few key Groovy concepts can dramatically improve your ability to solve real-world problems efficiently. Understanding where Nextflow ends and Groovy begins helps you write clearer, more maintainable workflows. This side quest takes you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline, evolving it step-by-step through realistic challenges: @@ -24,7 +24,7 @@ Before taking on this side quest you should: - Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial or have equivalent experience - Understand basic Nextflow concepts (processes, channels, workflows) -- Have basic familiarity with Groovy syntax (variables, maps, lists) +- Have basic familiarity with common programming constructs used in Groovy syntax (variables, maps, lists) This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns. @@ -41,21 +41,21 @@ You'll find a `data` directory with sample files and a main workflow file that w ```console title="Directory contents" > tree . +├── collect.nf ├── data -│ ├── metadata -│ │ └── analysis_parameters.yaml │ ├── samples.csv │ └── sequences │ ├── SAMPLE_001_S1_L001_R1_001.fastq │ ├── SAMPLE_002_S2_L001_R1_001.fastq │ └── SAMPLE_003_S3_L001_R1_001.fastq ├── main.nf -├── nextflow.config -├── README.md -└── templates - └── analysis_script.sh +├── modules +│ ├── fastp.nf +│ ├── generate_report.nf +│ └── trimgalore.nf +└── nextflow.config -5 directories, 9 files +4 directories, 10 files ``` Our sample CSV contains information about biological samples that need different processing based on their characteristics: @@ -75,7 +75,7 @@ We'll use this realistic dataset to explore practical Groovy techniques that you ### 1.1. Identifying What's What -One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see how they work together. +One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see a common example of how they work together. 
#### Step 1: Basic Nextflow Workflow @@ -98,7 +98,10 @@ nextflow run main.nf ``` You should see output like: + ```console title="Raw CSV data" +Launching `main.nf` [marvelous_tuckerman] DSL2 - revision: 6113e05c17 + [sample_id:SAMPLE_001, organism:human, tissue_type:liver, sequencing_depth:30000000, file_path:data/sequences/SAMPLE_001_S1_L001_R1_001.fastq, quality_score:38.5] [sample_id:SAMPLE_002, organism:mouse, tissue_type:brain, sequencing_depth:25000000, file_path:data/sequences/SAMPLE_002_S2_L001_R1_001.fastq, quality_score:35.2] [sample_id:SAMPLE_003, organism:human, tissue_type:kidney, sequencing_depth:45000000, file_path:data/sequences/SAMPLE_003_S3_L001_R1_001.fastq, quality_score:42.1] @@ -147,7 +150,7 @@ You'll see the same output as before, because we're simply returning the input u #### Step 3: Creating a Map Data Structure -Now we're going to write **pure Groovy code** inside our closure. Everything from this point forward is Groovy syntax and methods, not Nextflow operators. +Now we're going to write **pure Groovy code** inside our closure. Everything from this point forward in this section is Groovy syntax and methods, not Nextflow operators. === "After" @@ -246,7 +249,7 @@ The map addition operator `+` creates a **new map** rather than modifying the ex !!! Note - Using the addition operator `+` creates a new map rather than modifying the existing one, which is a useful practice to adopt. Never directly modify maps passed into closures, as this can lead to unexpected behavior in Nextflow. This is especially important because in Nextflow workflows, the same data often flows through multiple channel operations or gets processed by different processes simultaneously. When multiple operations reference the same map object, modifying it in-place can cause unpredictable side effects - one operation might change data that another operation is still using. By creating new maps instead of modifying existing ones, you ensure that each operation works with its own clean copy of the data, making your workflows more predictable and easier to debug. + Never modify maps passed into closures - always create new ones using `+` (for example). In Nextflow, the same data often flows through multiple operations simultaneously. Modifying a map in-place can cause unpredictable side effects when other operations reference that same object. Creating new maps ensures each operation has its own clean copy. Run the modified workflow: @@ -272,10 +275,11 @@ Let's add a line to create a simplified version of our metadata that only contai === "After" - ```groovy title="main.nf" linenums="2" hl_lines="12-13" + ```groovy title="main.nf" linenums="2" hl_lines="12-15" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> + // This is all Groovy code now! def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -283,21 +287,22 @@ Let's add a line to create a simplified version of our metadata that only contai depth: row.sequencing_depth.toInteger(), quality: row.quality_score.toDouble() ] - def priority = sample_meta.quality > 40 ? 'high' : 'normal' def id_only = sample_meta.subMap(['id', 'organism', 'tissue']) - println "Full metadata: ${sample_meta + [priority: priority]}" println "ID fields only: ${id_only}" - return [sample_meta + [priority: priority], file(row.file_path)] + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return sample_meta + [priority: priority] } .view() ``` === "Before" - ```groovy title="main.nf" linenums="2" hl_lines="11" + ```groovy title="main.nf" linenums="2" hl_lines="12" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> + // This is all Groovy code now! def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -317,11 +322,19 @@ Run the modified workflow: nextflow run main.nf ``` -You should see output showing both the full metadata and the extracted subset: +You should see output showing both the full metadata displayed by the `view()` operation and the extracted subset we printed with `println`: ```console title="SubMap results" -Full metadata: [id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [peaceful_cori] DSL2 - revision: 4cc4a8340f + ID fields only: [id:sample_001, organism:human, tissue:liver] +ID fields only: [id:sample_002, organism:mouse, tissue:brain] +ID fields only: [id:sample_003, organism:human, tissue:kidney] +[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal] +[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal] +[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high] ``` The `.subMap()` method takes a list of keys and returns a new map containing only those keys. If a key doesn't exist in the original map, it's simply not included in the result. @@ -356,7 +369,7 @@ Let's output a channel structure comprising a tuple of 2 elements: the enriched quality: row.quality_score.toDouble() ] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + [priority: priority], file(row.file_path)] + return [sample_meta + [priority: priority], file(row.file_path) ] } .view() ``` @@ -419,7 +432,15 @@ ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } ``` -We're using the `fromList()` channel factory to create a channel that emits each sample ID as a separate item, and we use `view()` to print each item as it flows through the channel.Then we apply Nextflow's `collect()` operator to gather all items into a single list and use a second `view()` to print the collected result which appears as a single item containing a list of all sample IDs. We've changed the structure of the channel, but we haven't changed the data itself. +We are: + +- Defining a (Groovy) list +- Using the `fromList()` channel factory to create a channel that emits each sample ID as a separate item +- Using `view()` to print each item as it flows through the channel +- Applying Nextflow's `collect()` operator to gather all items into a single list +- Using a second `view()` to print the collected result which appears as a single item containing a list of all sample IDs + +We've changed the structure of the channel, but we haven't changed the data itself. Run the workflow to confirm this: @@ -438,37 +459,44 @@ Individual channel item: sample_003 Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` +`view()` returns an output for every channel emission, so we know that this single output contains all 3 original items grouped into one list. + Now let's see Groovy's `collect` method in action. 
Modify `collect.nf` to apply Groovy's `collect` method to the original list of sample IDs: === "After" ```groovy title="main.nf" linenums="1" hl_lines="9-13" - def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } - ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } - // Groovy collect - transforms each element, preserves structure - def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') - } - println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" - ``` + // Groovy collect - transforms each element, preserves structure + def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') + } + println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" +``` === "Before" ```groovy title="main.nf" linenums="1" - def sample_ids = ['sample_001', 'sample_002', 'sample_003'] + def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } - ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } - ``` + // Nextflow collect() - groups multiple channel emissions into one + ch_input = Channel.fromList(sample_ids) + ch_input.view { "Individual channel item: ${it}" } + ch_collected = ch_input.collect() + ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } +``` + +In this new snippet we: + +- Define a new variable `formatted_ids` that uses Groovy's `collect` method to transform each sample ID in the original list +- Print the result using `println` Run the modified workflow: @@ -476,7 +504,7 @@ Run the modified workflow: nextflow run collect.nf ``` -```console title="Groovy collect results" hl_lines="9" +```console title="Groovy collect results" hl_lines="5" N E X T F L O W ~ version 25.04.6 Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 @@ -574,11 +602,10 @@ def names = files.collect { it.getName() } The spread operator is particularly useful when you need to extract a single property from a list of objects - it's more readable than writing out the full `collect` closure. -!!! tip "When to Use Spread vs Collect" +!!! 
tip "When to Use Groovy's Spread vs Collect" - **Use spread (`*.`)** for simple property access: `samples*.id`, `files*.name` - - **Use collect** for transformations: `samples.collect { it.id.toUpperCase() }` - - **Use collect** for complex logic: `samples.collect { [it.id, it.quality > 40] }` + - **Use collect** for transformations or complex logic: `samples.collect { it.id.toUpperCase() }`, `samples.collect { [it.id, it.quality > 40] }` ### Takeaway @@ -608,8 +635,9 @@ Make the following change to your existing `main.nf` workflow: === "After" - ```groovy title="main.nf" linenums="2" hl_lines="9-19,21" + ```groovy title="main.nf" linenums="4" hl_lines="10-22" .map { row -> + // This is all Groovy code now! def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -634,10 +662,9 @@ Make the following change to your existing `main.nf` workflow: === "Before" - ```groovy title="main.nf" linenums="2" - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) + ```groovy title="main.nf" linenums="4" "hl_lines="11" .map { row -> + // This is all Groovy code now! def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -648,7 +675,6 @@ Make the following change to your existing `main.nf` workflow: def priority = sample_meta.quality > 40 ? 'high' : 'normal' return [sample_meta + [priority: priority], file(row.file_path)] } - .view() ``` This demonstrates key Groovy string processing concepts: diff --git a/side-quests/groovy_essentials/README.md b/side-quests/groovy_essentials/README.md deleted file mode 100644 index 1a91d5d1c0..0000000000 --- a/side-quests/groovy_essentials/README.md +++ /dev/null @@ -1,80 +0,0 @@ -# Groovy Essentials for Nextflow Developers - -This directory contains the supporting materials for the [Groovy Essentials side quest](../../docs/side_quests/groovy_essentials.md). - -## Contents - -- `main.nf` - Comprehensive workflow demonstrating all Groovy concepts from the side quest -- `nextflow.config` - Configuration file showcasing Nextflow's parameter system and profiles -- `data/` - Sample input data for the tutorial - - `samples.csv` - Sample metadata CSV file with realistic bioinformatics data - - `sequences/` - Sample FASTQ files for testing workflows - - `metadata/` - Additional metadata files (YAML configuration examples) -- `templates/` - Template scripts demonstrating Nextflow's templating system - -## Quick Start - -To run the demonstration workflow: - -```bash -cd side-quests/groovy_essentials -nextflow run main.nf -``` - -The workflow will demonstrate all the Groovy patterns covered in the side quest: - -1. **Nextflow vs Groovy boundaries** - See how workflow orchestration differs from programming logic -2. **String processing** - Pattern matching and file name parsing examples -3. **Conditional logic** - Dynamic strategy selection based on sample characteristics -4. **Error handling** - Validation and graceful error recovery patterns -5. **Essential Groovy operators** - Safe navigation, Elvis operator, Groovy Truth, and slashy strings -6. **Advanced closures** - Named closures, function composition, currying, and scope access -7. 
**Collection operations** - Advanced data processing with Groovy's collection methods - -## Testing the Workflow - -You can test different aspects of the workflow: - -```bash -# Run with different quality thresholds -nextflow run main.nf --quality_threshold_min 30 --quality_threshold_high 45 - -# Use the testing profile for reduced resource usage -nextflow run main.nf -profile test - -# Run in stub mode to test logic without executing tools -nextflow run main.nf -stub -``` - -## Learning Objectives - -This side quest teaches essential Groovy skills for Nextflow developers: - -- **Language boundaries**: Distinguish between Nextflow workflow orchestration and Groovy programming logic -- **String processing**: Use regular expressions and pattern matching for bioinformatics file names -- **Command building**: Transform file collections into command-line arguments using Groovy methods -- **Conditional logic**: Implement intelligent routing and process selection based on sample characteristics -- **Error handling**: Add validation, try-catch patterns, and graceful failure management -- **Essential operators**: Master safe navigation, Elvis operator, Groovy Truth, and slashy strings for robust code -- **Advanced closures**: Master named closures, function composition, currying, and functional programming patterns -- **Collection operations**: Process and organize large datasets using Groovy's powerful collection methods - -## Progressive Learning - -The `main.nf` file demonstrates a complete sample processing pipeline that evolves from basic metadata handling to a sophisticated, production-ready workflow. Each section builds on the previous, showing how Groovy transforms simple Nextflow workflows into powerful data processing systems. - -Follow the [main documentation](../../docs/side_quests/groovy_essentials.md) for detailed explanations, step-by-step examples, and hands-on exercises that correspond to each section of the demonstration workflow. - -## Next Steps - -After completing this side quest, you'll be ready to: - -- Write cleaner workflows with proper separation between Nextflow and Groovy logic -- Handle complex file naming conventions and input formats gracefully -- Build intelligent pipelines that adapt to different sample types and data characteristics -- Write null-safe, robust code using essential Groovy operators -- Create reusable, maintainable code using advanced closure patterns and functional programming -- Process large-scale datasets efficiently using advanced collection operations -- Add robust error handling for production-ready workflows - -Continue exploring the [other side quests](../README.md) to further develop your Nextflow expertise! 
diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf index aaa5573933..dbfd5ac79d 100644 --- a/side-quests/groovy_essentials/collect.nf +++ b/side-quests/groovy_essentials/collect.nf @@ -5,3 +5,14 @@ ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + +// Groovy collect - transforms each element, preserves structure +def formatted_ids = sample_ids.collect { id -> + id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') +} +println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + +// Spread operator - concise property access +def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] +def all_ids = sample_data*.id +println "Spread operator result: ${all_ids}" diff --git a/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml b/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml deleted file mode 100644 index 115321672d..0000000000 --- a/side-quests/groovy_essentials/data/metadata/analysis_parameters.yaml +++ /dev/null @@ -1,25 +0,0 @@ -analysis: - quality: - min_score: 30 - trim_adapters: true - remove_duplicates: false - - alignment: - reference: "GRCh38" - aligner: "STAR" - max_mismatches: 2 - - quantification: - method: "featureCounts" - feature_type: "exon" - count_overlaps: false - -resources: - max_cpus: 8 - max_memory: "16GB" - temp_dir: "/tmp" - -output: - publish_mode: "copy" - compress: true - formats: ["bam", "counts", "qc_report"] diff --git a/side-quests/groovy_essentials/main.nf b/side-quests/groovy_essentials/main.nf index 31aa794ede..627c6c8f2c 100644 --- a/side-quests/groovy_essentials/main.nf +++ b/side-quests/groovy_essentials/main.nf @@ -1,5 +1,27 @@ workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) + .map { row -> + // This is all Groovy code now! + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } .view() } diff --git a/side-quests/groovy_essentials/modules/fastp.nf.bak b/side-quests/groovy_essentials/modules/fastp.nf.bak deleted file mode 100644 index 827372b21a..0000000000 --- a/side-quests/groovy_essentials/modules/fastp.nf.bak +++ /dev/null @@ -1,22 +0,0 @@ -process FASTP { - container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - - input: - tuple val(meta), path(reads) - - output: - tuple val(sample_id), path("*_trimmed*.fastq.gz"), emit: reads - path "*.{json,html}" , emit: reports - - script: - """ - fastp \\ - --in1 ${reads[0]} \\ - --in2 ${reads[1]} \\ - --out1 ${meta.id}_trimmed_R1.fastq.gz \\ - --out2 ${meta.id}_trimmed_R2.fastq.gz \\ - --json ${meta.id}.fastp.json \\ - --html ${meta.id}.fastp.html \\ - --thread $task.cpus - """ -} diff --git a/side-quests/groovy_essentials/templates/analysis_script.sh b/side-quests/groovy_essentials/templates/analysis_script.sh deleted file mode 100644 index 5a9c219b7b..0000000000 --- a/side-quests/groovy_essentials/templates/analysis_script.sh +++ /dev/null @@ -1,27 +0,0 @@ -#!/bin/bash - -# Nextflow template file - accessed via template directive in process -# This template has access to all variables from the process input -# Groovy expressions are evaluated at runtime - -echo "Generating report for sample: ${meta.id}" -echo "Organism: ${meta.organism}" -echo "Quality score: ${meta.quality}" - -# Conditional logic in template -<% if (meta.organism == 'human') { %> -echo "Including human-specific quality metrics" -human_qc_script.py --input ${results} --output ${meta.id}_report.html -<% } else { %> -echo "Using standard quality metrics for ${meta.organism}" -generic_qc_script.py --input ${results} --output ${meta.id}_report.html -<% } %> - -# Groovy variables can be used for calculations -<% -def priority_bonus = meta.priority == 'high' ? 
0.1 : 0.0 -def adjusted_score = (meta.quality + priority_bonus).round(2) -%> - -echo "Adjusted quality score: ${adjusted_score}" -echo "Report generation complete" From cda7f76caaec4b131a8d0ead0f13f3de6850f16b Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:30:06 +0100 Subject: [PATCH 15/48] Reset collect.nf to starting state --- side-quests/groovy_essentials/collect.nf | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/groovy_essentials/collect.nf index dbfd5ac79d..aaa5573933 100644 --- a/side-quests/groovy_essentials/collect.nf +++ b/side-quests/groovy_essentials/collect.nf @@ -5,14 +5,3 @@ ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } - -// Groovy collect - transforms each element, preserves structure -def formatted_ids = sample_ids.collect { id -> - id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') -} -println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" - -// Spread operator - concise property access -def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] -def all_ids = sample_data*.id -println "Spread operator result: ${all_ids}" From e37d4efdb66b9adbf33cf8901527c436dd4d6f33 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:44:43 +0100 Subject: [PATCH 16/48] Tweaks --- docs/side_quests/groovy_essentials.md | 61 +++++++++++++++++++++------ 1 file changed, 49 insertions(+), 12 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 62150e54c3..1826d0c7f0 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -635,7 +635,7 @@ Make the following change to your existing `main.nf` workflow: === "After" - ```groovy title="main.nf" linenums="4" hl_lines="10-22" + ```groovy title="main.nf" linenums="4" hl_lines="10-21" .map { row -> // This is all Groovy code now! def sample_meta = [ @@ -713,27 +713,64 @@ Make that change like so: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="1-3,7" + ```groovy title="main.nf" linenums="1" hl_lines="1-2 27" def separateMetadata(row) { - // ... all the metadata processing logic ... + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] } workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) + .map{ row -> separateMetadata(row) } .view() } ``` === "Before" - ```groovy title="main.nf" linenums="1" + ```groovy title="main.nf" linenums="1" hl_lines="4-26" workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> - // ... all the inline metadata processing logic ... + // This is all Groovy code now! 
+ def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] } .view() } @@ -752,18 +789,18 @@ By doing this we've reduced the actual workflow logic down to something really t You can run that to make sure it still works: -```bash title="Test reusable closure" +```bash title="Test reusable function" nextflow run main.nf ``` -```console title="Closure results" +```console title="Function results" N E X T F L O W ~ version 25.04.6 -Launching `main.nf` [tender_archimedes] DSL2 - revision: 8bfb9b2485 +Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` ### 2.3. Dynamic Script Logic in Processes From 2c5f0b68edb2d1c270db2992630d98fede2b3b09 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:45:23 +0100 Subject: [PATCH 17/48] Revert main.nf to starting point --- side-quests/groovy_essentials/main.nf | 22 ---------------------- 1 file changed, 22 deletions(-) diff --git a/side-quests/groovy_essentials/main.nf b/side-quests/groovy_essentials/main.nf index 627c6c8f2c..31aa794ede 100644 --- a/side-quests/groovy_essentials/main.nf +++ b/side-quests/groovy_essentials/main.nf @@ -1,27 +1,5 @@ workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map { row -> - // This is all Groovy code now! 
- def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } .view() } From 954ce99d3526f1dff9c3900f0416844b9432914e Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 14:58:41 +0100 Subject: [PATCH 18/48] Tweaks --- docs/side_quests/groovy_essentials.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 1826d0c7f0..e9449718c5 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -803,6 +803,8 @@ Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f [[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` +Hopefully there are no changes to the output, but the workflow is now much cleaner and easier to maintain. + ### 2.3. Dynamic Script Logic in Processes Another place you'll find it very useful to break out your Groovy toolbox is in process script blocks. You can use Groovy logic to make your scripts dynamic and adaptable to different input conditions. From cd45c760c3fe64e8b29992c667df5e02872ca9eb Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 15:05:24 +0100 Subject: [PATCH 19/48] backticks fix? --- docs/side_quests/groovy_essentials.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index e9449718c5..344eb0b930 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -400,6 +400,7 @@ nextflow run main.nf ``` You should see output like: + ```console title="Complete workflow output" [[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] [[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] @@ -479,7 +480,7 @@ Now let's see Groovy's `collect` method in action. Modify `collect.nf` to apply id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" -``` + ``` === "Before" @@ -491,7 +492,7 @@ Now let's see Groovy's `collect` method in action. 
Modify `collect.nf` to apply ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } -``` + ``` In this new snippet we: @@ -845,7 +846,6 @@ include { FASTP } from './modules/fastp.nf' Then modify the `workflow` block to connect the `ch_samples` channel to the `FASTP` process: - === "After" ```groovy title="main.nf" linenums="30" hl_lines="6" @@ -1153,6 +1153,7 @@ echo "Date: \$(date)" >> ${meta.id}_report.txt ``` Now you can see all three types together: + - `${report_type}`, `${meta.id}`, `${meta.quality}`: Groovy variables (no backslash) - `\${USER}`: Shell environment variable (backslash) - `\$(hostname)`, `\$(date)`: Shell command substitution (backslash) @@ -1171,7 +1172,6 @@ In this section, you've learned: These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. - --- ### 2.5. Dynamic Resource Directives with Closures @@ -1230,6 +1230,7 @@ process FASTP { ``` Now if the process fails due to insufficient memory, Nextflow will retry with more memory: + - First attempt: 512 MB (task.attempt = 1) - Second attempt: 1024 MB (task.attempt = 2) @@ -1257,6 +1258,7 @@ process QUALITY_CONTROL { ``` This demonstrates several advanced patterns: + - Creating intermediate Groovy variables (`base_mem`, `base_cpus`) - Using Groovy math functions (`Math.min`) to set limits - Combining metadata with retry logic @@ -1265,6 +1267,7 @@ This demonstrates several advanced patterns: ### Takeaway Dynamic directives with closures let you: + - Allocate resources based on input characteristics - Implement automatic retry strategies with increasing resources - Combine multiple factors (metadata, attempt number, priorities) @@ -1423,6 +1426,7 @@ Our `separateMetadata` function currently assumes all CSV fields are present and ### 4.1. 
The Problem: Null Pointer Crashes Add a row with missing data to your `data/samples.csv`: + ```csv SAMPLE_004,,unknown_tissue,20000000,data/sequences/SAMPLE_004_S4_L001_R1_001.fastq, ``` @@ -1902,6 +1906,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ### Key Concepts Reference - **Language Boundaries** + ```groovy title="Nextflow vs Groovy examples" // Nextflow: workflow orchestration Channel.fromPath('*.fastq').splitCsv(header: true) @@ -1911,6 +1916,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **String Processing** + ```groovy title="String processing examples" // Pattern matching filename =~ ~/^(\w+)_(\w+)_(\d+)\.fastq$/ @@ -1930,6 +1936,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Error Handling** + ```groovy title="Error handling patterns" try { def errors = validateSample(sample) @@ -1940,6 +1947,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Essential Groovy Operators** + ```groovy title="Essential operators examples" // Safe navigation and Elvis operators def id = data?.sample?.id ?: 'unknown' @@ -1954,6 +1962,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ``` - **Advanced Closures** + ```groovy title="Advanced closure patterns" // Named closures and composition def enrichData = normalizeId >> addQualityCategory >> addFlags From c0ec86a14a8f939df445b614fd1c5c9be1109733 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 15:18:08 +0100 Subject: [PATCH 20/48] Reset fastp module to starting point --- .../groovy_essentials/modules/fastp.nf | 38 ++++++------------- 1 file changed, 11 insertions(+), 27 deletions(-) diff --git a/side-quests/groovy_essentials/modules/fastp.nf b/side-quests/groovy_essentials/modules/fastp.nf index f74e5e772a..d478d61ae0 100644 --- a/side-quests/groovy_essentials/modules/fastp.nf +++ b/side-quests/groovy_essentials/modules/fastp.nf @@ -5,33 +5,17 @@ process FASTP { tuple val(meta), path(reads) output: - tuple val(meta.id), path("*_trimmed*.fastq.gz"), emit: reads - path "*.{json,html}" , emit: reports + tuple val(sample_id), path("*_trimmed*.fastq.gz"), emit: reads script: - // Simple single-end vs paired-end detection - def is_single = reads instanceof List ? reads.size() == 1 : true - - if (is_single) { - def input_file = reads instanceof List ? 
reads[0] : reads - """ - fastp \\ - --in1 ${input_file} \\ - --out1 ${meta.id}_trimmed.fastq.gz \\ - --json ${meta.id}.fastp.json \\ - --html ${meta.id}.fastp.html \\ - --thread $task.cpus - """ - } else { - """ - fastp \\ - --in1 ${reads[0]} \\ - --in2 ${reads[1]} \\ - --out1 ${meta.id}_trimmed_R1.fastq.gz \\ - --out2 ${meta.id}_trimmed_R2.fastq.gz \\ - --json ${meta.id}.fastp.json \\ - --html ${meta.id}.fastp.html \\ - --thread $task.cpus - """ - } + """ + fastp \\ + --in1 ${reads[0]} \\ + --in2 ${reads[1]} \\ + --out1 ${meta.id}_trimmed_R1.fastq.gz \\ + --out2 ${meta.id}_trimmed_R2.fastq.gz \\ + --json ${meta.id}.fastp.json \\ + --html ${meta.id}.fastp.html \\ + --thread $task.cpus + """ } From 51e3e1da9bce44c8398780629867a351ea0a2516 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 15:27:22 +0100 Subject: [PATCH 21/48] Fix fastp --- side-quests/groovy_essentials/modules/fastp.nf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/side-quests/groovy_essentials/modules/fastp.nf b/side-quests/groovy_essentials/modules/fastp.nf index d478d61ae0..29e9ef8ca3 100644 --- a/side-quests/groovy_essentials/modules/fastp.nf +++ b/side-quests/groovy_essentials/modules/fastp.nf @@ -5,7 +5,7 @@ process FASTP { tuple val(meta), path(reads) output: - tuple val(sample_id), path("*_trimmed*.fastq.gz"), emit: reads + tuple val(meta), path("*_trimmed*.fastq.gz"), emit: reads script: """ From e2f3681e57ecd08596992f27bac3bd5b8682853c Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Thu, 9 Oct 2025 15:32:28 +0100 Subject: [PATCH 22/48] tweaks --- docs/side_quests/groovy_essentials.md | 41 ++++++++++++++------------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 344eb0b930..ffaa5ebdc5 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -714,7 +714,7 @@ Make that change like so: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="1-2 27" + ```groovy title="main.nf" linenums="1" hl_lines="1-22 26" def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), @@ -782,8 +782,8 @@ By doing this we've reduced the actual workflow logic down to something really t ```groovy title="minimal workflow" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map{row -> separateMetadata(row)} - .view() + .map{row -> separateMetadata(row)} + .view() ``` ... which makes the logic much easier to read and understand at a glance. The function `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. @@ -812,7 +812,7 @@ Another place you'll find it very useful to break out your Groovy toolbox is in To illustrate what we mean, let's add some processes to our existing `main.nf` workflow that demonstrate common patterns for dynamic script generation. 
Open `modules/fastp.nf` and take a look: -```groovtitle="modules/fastp.nf" linenums="1" +```groovy title="modules/fastp.nf" linenums="1" process FASTP { container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' @@ -848,26 +848,27 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS === "After" - ```groovy title="main.nf" linenums="30" hl_lines="6" - workflow { + ```groovy title="main.nf" linenums="25" hl_lines="7" + workflow { - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } - ch_fastp = FASTP(ch_samples) - } + ch_fastp = FASTP(ch_samples) + } ``` === "Before" - ```groovy title="main.nf" linenums="30" hl_lines="6" - workflow { + ```groovy title="main.nf" linenums="25" hl_lines="6" + workflow { - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) - } + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + .view() + } ``` Run this modified workflow: @@ -909,7 +910,7 @@ Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` === "After" - ```groovy title="main.nf" linenums="11" hl_lines="3,5,15" + ```groovy title="main.nf" linenums="10" hl_lines="3 5 15" script: // Simple single-end vs paired-end detection def is_single = reads instanceof List ? reads.size() == 1 : true @@ -940,7 +941,7 @@ Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` === "Before" - ```groovy title="main.nf" linenums="11" + ```groovy title="main.nf" linenums="10" script: """ fastp \\ @@ -988,7 +989,7 @@ fastp \ --thread 2 ``` -Another common one can be seen in [the Nextflow for Science Genomics module](../nf4science/genomics/02_joint_calling.md). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: +Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling.md). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: ```groovy title="command line manipulation for GATK" linenums="1" script: From 5913c17ba7f48109ee85603464a5fb66339afa2c Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 09:23:35 +0100 Subject: [PATCH 23/48] latest tweaks, simplify the variable interpretation part --- docs/side_quests/groovy_essentials.md | 73 +++++++------------ .../modules/generate_report.nf | 7 +- 2 files changed, 26 insertions(+), 54 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index ffaa5ebdc5..f9104b308b 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -989,9 +989,9 @@ fastp \ --thread 2 ``` -Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling.md). 
In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: +Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: -```groovy title="command line manipulation for GATK" linenums="1" +```groovy title="command line manipulation for GATK" linenums="1" hl_lines="2 5" script: def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ') """ @@ -1008,7 +1008,7 @@ These patterns of using Groovy logic in process script blocks are extremely powe When writing process scripts, you're actually working with three different types of variables, and using the wrong syntax is a common source of errors. Let's add a process that creates a processing report to demonstrate the differences. -Create a new process file `modules/generate_report.nf`: +Take a look a the module file `modules/generate_report.nf`: ```groovy title="modules/generate_report.nf" linenums="1" process GENERATE_REPORT { @@ -1035,9 +1035,8 @@ Include the process in your `main.nf` and add it to the workflow: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="3 11" + ```groovy title="main.nf" linenums="1" hl_lines="2 12" include { FASTP } from './modules/fastp.nf' - include { TRIMGALORE } from './modules/trimgalore.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' // ... separateMetadata function ... @@ -1045,34 +1044,32 @@ Include the process in your `main.nf` and add it to the workflow: workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) + .map{ row -> separateMetadata(row) } + ch_fastp = FASTP(ch_samples) GENERATE_REPORT(ch_samples) - - // ... rest of workflow ... } ``` === "Before" - ```groovy title="main.nf" linenums="1" + ```groovy title="main.nf" linenums="1" hl_lines="1 10" include { FASTP } from './modules/fastp.nf' - include { TRIMGALORE } from './modules/trimgalore.nf' // ... separateMetadata function ... workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) + .map{ row -> separateMetadata(row) } - // ... rest of workflow ... + ch_fastp = FASTP(ch_samples) } ``` Now run the workflow and check the generated reports in `results/reports/`. They should contain basic information about each sample. -But what if we want to add information about when and where the processing occurred? Let's modify the process to include shell environment variables: +But what if we want to add information about when and where the processing occurred? Let's modify the process to use **shell** variables and a bit of command substitution to include the current user, hostname, and date in the report: === "After" @@ -1097,9 +1094,22 @@ But what if we want to add information about when and where the processing occur """ ``` -If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `${USER}` as a Groovy variable that doesn't exist! We need to escape it so Bash can handle it instead. 
+If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `$(hostname)` as a Groovy variable that doesn't exist: -Fix this by escaping the shell variables: +```console title="Error with shell variables" +unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException +ERROR ~ Module compilation error +- file : /workspaces/training/side-quests/groovy_essentials/modules/generate_report.nf +- cause: token recognition error at: '(' @ line 16, column 22. + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + ^ + +1 error +``` + + We need to escape it so Bash can handle it instead. + +Fix this by escaping the shell variables and command substitutions with a backslash (`\`): === "After - Fixed" @@ -1129,38 +1139,6 @@ Fix this by escaping the shell variables: Now it works! The backslash (`\`) tells Nextflow "don't interpret this, pass it through to Bash." -!!! note "Three Types of Variables in Process Scripts" - - - **Nextflow/Groovy variables**: Use `${variable}` - evaluated before the script runs - - **Shell environment variables**: Use `\${variable}` - passed through to Bash - - **Shell command substitution**: Use `\$(command)` - executed by Bash - -Let's add one more feature - a Groovy variable for the report type: - -```groovy title="modules/generate_report.nf - Complete" linenums="10" -script: -def report_type = meta.priority == 'high' ? 'PRIORITY' : 'STANDARD' -""" -echo "=== ${report_type} SAMPLE REPORT ===" > ${meta.id}_report.txt -echo "Processing ${reads}" >> ${meta.id}_report.txt -echo "Sample: ${meta.id}" >> ${meta.id}_report.txt -echo "Quality: ${meta.quality}" >> ${meta.id}_report.txt -echo "Priority: ${meta.priority}" >> ${meta.id}_report.txt -echo "---" >> ${meta.id}_report.txt -echo "Processed by: \${USER}" >> ${meta.id}_report.txt -echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt -echo "Date: \$(date)" >> ${meta.id}_report.txt -""" -``` - -Now you can see all three types together: - -- `${report_type}`, `${meta.id}`, `${meta.quality}`: Groovy variables (no backslash) -- `\${USER}`: Shell environment variable (backslash) -- `\$(hostname)`, `\$(date)`: Shell command substitution (backslash) - -Run the workflow again and check the reports - high-priority samples will have "PRIORITY" in their header! - ### Takeaway In this section, you've learned: @@ -1168,7 +1146,6 @@ In this section, you've learned: - **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions - **Reusable functions**: Extracting complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable - **Dynamic script generation**: Using Groovy conditional logic within process script blocks to adapt commands based on input characteristics (like single-end vs paired-end reads) -- **Command-line argument construction**: Transforming file collections into properly formatted command arguments using `collect()` and `join()` methods - **Variable interpolation**: Understanding the difference between Nextflow/Groovy variables (`${var}`), shell environment variables (`\${var}`), and shell command substitution (`\$(cmd)`) These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. 
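The escaping rules discussed in the tutorial text above are easiest to keep straight when all three kinds of values appear side by side. The following sketch is purely illustrative and is not one of the tutorial's module files; it assumes a `meta` map with an `id` key, as used throughout this material, and simply writes one line per variable type so the difference between `${...}`, `\${...}` and `\$(...)` is visible in the output.

```groovy title="Illustrative sketch: three kinds of variables in one script block"
process INTERPOLATION_DEMO {
    input:
    val meta

    output:
    path "${meta.id}_vars.txt"

    script:
    // Groovy variable, resolved by Nextflow before the script is handed to Bash
    def label = meta.id.toUpperCase()
    """
    echo "Groovy value: ${label}"          >  ${meta.id}_vars.txt
    echo "Shell env value: \${USER}"       >> ${meta.id}_vars.txt
    echo "Shell command value: \$(date)"   >> ${meta.id}_vars.txt
    """
}
```

Fed a map such as `[id: 'sample_001']`, only the first echoed value is known before the task runs; the other two are filled in by Bash on the execution host.
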
diff --git a/side-quests/solutions/groovy_essentials/modules/generate_report.nf b/side-quests/solutions/groovy_essentials/modules/generate_report.nf index 91e7af87f6..fca06dad2a 100644 --- a/side-quests/solutions/groovy_essentials/modules/generate_report.nf +++ b/side-quests/solutions/groovy_essentials/modules/generate_report.nf @@ -9,14 +9,9 @@ process GENERATE_REPORT { path "${meta.id}_report.txt" script: - def report_type = meta.priority == 'high' ? 'PRIORITY' : 'STANDARD' """ - echo "=== ${report_type} SAMPLE REPORT ===" > ${meta.id}_report.txt - echo "Processing ${reads}" >> ${meta.id}_report.txt + echo "Processing ${reads}" > ${meta.id}_report.txt echo "Sample: ${meta.id}" >> ${meta.id}_report.txt - echo "Quality: ${meta.quality}" >> ${meta.id}_report.txt - echo "Priority: ${meta.priority}" >> ${meta.id}_report.txt - echo "---" >> ${meta.id}_report.txt echo "Processed by: \${USER}" >> ${meta.id}_report.txt echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt echo "Date: \$(date)" >> ${meta.id}_report.txt From 4bf4d050694b898d4be6943c330e3ab244bcef30 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 09:44:17 +0100 Subject: [PATCH 24/48] Reorg subsections a bit --- docs/side_quests/groovy_essentials.md | 304 ++++++++++++++------------ 1 file changed, 169 insertions(+), 135 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index f9104b308b..ac1f8087b1 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -9,10 +9,11 @@ This side quest takes you on a hands-on journey from basic concepts to productio - **Understanding boundaries:** Distinguish between Nextflow operators and Groovy methods, and master when to use each - **Data manipulation:** Extract, transform, and subset maps and collections using Groovy's powerful operators - **String processing:** Parse complex file naming schemes with regex patterns and master variable interpolation +- **Reusable functions:** Extract complex logic into named functions for cleaner, more maintainable workflows - **Dynamic logic:** Build processes that adapt to different input types and use closures for dynamic resource allocation - **Conditional routing:** Intelligently route samples through different processes based on their metadata characteristics - **Safe operations:** Handle missing data gracefully with null-safe operators and validate inputs with clear error messages -- **Reusable code:** Create maintainable workflows with functions and configuration-based event handlers +- **Configuration-based handlers:** Use workflow event handlers for logging, notifications, and lifecycle management --- @@ -622,9 +623,9 @@ Next we'll dive deeper into Groovy's powerful string processing capabilities, wh --- -## 2. Advanced String Processing for Bioinformatics +## 2. String Processing and Dynamic Script Generation -The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. Let's transform our pipeline to handle the messy realities of real-world bioinformatics data. +The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. 
In this section, we'll explore how to parse complex file names, generate process scripts dynamically based on input characteristics, and properly interpolate variables in different contexts. ### 2.1. Pattern Matching and Regular Expressions @@ -704,111 +705,9 @@ Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 [[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` -### 2.2. Creating Reusable functions +### 2.2. Dynamic Script Generation in Processes -You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions. - -To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`. - -Make that change like so: - -=== "After" - - ```groovy title="main.nf" linenums="1" hl_lines="1-22 26" - def separateMetadata(row) { - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } - - workflow { - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map{ row -> separateMetadata(row) } - .view() - } - ``` - -=== "Before" - - ```groovy title="main.nf" linenums="1" hl_lines="4-26" - workflow { - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map { row -> - // This is all Groovy code now! - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism, - tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), - depth: row.sequencing_depth.toInteger(), - quality: row.quality_score.toDouble() - ] - def fastq_path = file(row.file_path) - - def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) - def file_meta = m ? [ - sample_num: m[0][2].toInteger(), - lane: m[0][3], - read: m[0][4], - chunk: m[0][5] - ] : [:] - - def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] - } - .view() - } - ``` - -By doing this we've reduced the actual workflow logic down to something really trivial: - -```groovy title="minimal workflow" - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map{row -> separateMetadata(row)} - .view() -``` - -... which makes the logic much easier to read and understand at a glance. The function `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. 
- -You can run that to make sure it still works: - -```bash title="Test reusable function" -nextflow run main.nf -``` - -```console title="Function results" - N E X T F L O W ~ version 25.04.6 - -Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f - -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] -``` - -Hopefully there are no changes to the output, but the workflow is now much cleaner and easier to maintain. - -### 2.3. Dynamic Script Logic in Processes - -Another place you'll find it very useful to break out your Groovy toolbox is in process script blocks. You can use Groovy logic to make your scripts dynamic and adaptable to different input conditions. +Process script blocks are essentially multi-line strings that get passed to the shell. You can use Groovy logic to dynamically generate different script strings based on input characteristics, making your processes adaptable to different input conditions. To illustrate what we mean, let's add some processes to our existing `main.nf` workflow that demonstrate common patterns for dynamic script generation. Open `modules/fastp.nf` and take a look: @@ -1004,7 +903,7 @@ Another common usage of dynamic script logic can be seen in [the Nextflow for Sc These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. -### 2.4. Variable Interpolation: Groovy, Bash, and Shell Variables +### 2.3. Variable Interpolation: Groovy, Bash, and Shell Variables When writing process scripts, you're actually working with three different types of variables, and using the wrong syntax is a common source of errors. Let's add a process that creates a processing report to demonstrate the differences. @@ -1144,15 +1043,139 @@ Now it works! 
The backslash (`\`) tells Nextflow "don't interpret this, pass it In this section, you've learned: - **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions -- **Reusable functions**: Extracting complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable -- **Dynamic script generation**: Using Groovy conditional logic within process script blocks to adapt commands based on input characteristics (like single-end vs paired-end reads) +- **Dynamic script generation**: Using Groovy conditional logic to generate different script strings based on input characteristics (like single-end vs paired-end reads) - **Variable interpolation**: Understanding the difference between Nextflow/Groovy variables (`${var}`), shell environment variables (`\${var}`), and shell command substitution (`\$(cmd)`) -These string processing patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. +These string processing and generation patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. --- -### 2.5. Dynamic Resource Directives with Closures +## 3. Creating Reusable Functions + +As your workflow logic becomes more complex, keeping everything inline in channel operators or process definitions can make your code hard to read and maintain. Groovy functions let you extract complex logic into named, reusable components that can be called from anywhere in your workflow. + +You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions. + +To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="3-25 28-32" + include { FASTP } from './modules/fastp.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="1" hl_lines="5-28" + include { FASTP } from './modules/fastp.nf' + include { GENERATE_REPORT } from './modules/generate_report.nf' + + workflow { + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) + } + ``` + +By extracting this logic into a function, we've reduced the actual workflow logic down to something much cleaner: + +```groovy title="minimal workflow" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map{ row -> separateMetadata(row) } + + ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) +``` + +This makes the workflow logic much easier to read and understand at a glance. The function `separateMetadata` encapsulates all the complex logic for parsing and enriching metadata, making it reusable and testable. + +Run the workflow to make sure it still works: + +```bash title="Test reusable function" +nextflow run main.nf +``` + +```console title="Function results" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f + +executor > local (6) +[8c/2e3f91] process > FASTP (3) [100%] 3 of 3 ✔ +[7a/1b4c92] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ +``` + +The output should show both processes completing successfully. The workflow is now much cleaner and easier to maintain, with all the complex metadata processing logic encapsulated in the `separateMetadata` function. + +### Takeaway + +In this section, you've learned: + +- **Extracting functions**: Moving complex logic from inline closures into named functions +- **Function scope**: Functions defined at the script level can be called from anywhere in your workflow +- **Cleaner workflows**: Using functions makes your workflow blocks more concise and readable + +Next, we'll explore how to use Groovy closures in process directives for dynamic resource allocation. + +--- + +## 4. Dynamic Resource Directives with Closures So far we've used Groovy in the `script` block of processes. But Groovy closures are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. @@ -1255,12 +1278,14 @@ This makes your workflows both more efficient (not over-allocating) and more rob --- -## 3. Conditional Logic and Process Control +## 5. 
Conditional Logic and Process Control Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, including, many of which take closures as arguments, meanint their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. +### 5.1. Routing with `.branch()` + For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point). Add a new process for Trimgalore in `modules/trimgalore.nf`: @@ -1325,7 +1350,7 @@ executor > local (3) Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. -### 3.1. Using `.filter()` with Groovy Truth +### 5.2. Using `.filter()` with Groovy Truth Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Let's add a validation step to filter out samples that don't meet our quality requirements. @@ -1397,11 +1422,11 @@ Our pipeline now intelligently routes samples through appropriate processes, but --- -## 4. Safe Navigation and Elvis Operators +## 6. Safe Navigation and Elvis Operators Our `separateMetadata` function currently assumes all CSV fields are present and valid. But what happens with incomplete data? Let's find out. -### 4.1. The Problem: Null Pointer Crashes +### 6.1. The Problem: Null Pointer Crashes Add a row with missing data to your `data/samples.csv`: @@ -1417,7 +1442,7 @@ nextflow run main.nf It crashes with a NullPointerException! This is where Groovy's safe operators save the day. -### 4.2. Safe Navigation Operator (`?.`) +### 6.2. Safe Navigation Operator (`?.`) The safe navigation operator (`?.`) returns null instead of throwing an exception. Update your `separateMetadata` function: @@ -1457,7 +1482,7 @@ nextflow run main.nf No crash! But SAMPLE_004 now has `null` values which could cause problems downstream. -### 4.3. Elvis Operator (`?:`) for Defaults +### 6.3. Elvis Operator (`?:`) for Defaults The Elvis operator (`?:`) provides default values. Update again: @@ -1497,7 +1522,7 @@ nextflow run main.nf Perfect! SAMPLE_004 now has safe defaults: 'unknown' for organism/tissue, 0.0 for quality. -### 4.4. Filtering with Safe Operators +### 6.4. Filtering with Safe Operators Now let's filter out samples with missing data. Update your workflow: @@ -1543,7 +1568,9 @@ SAMPLE_004 is now filtered out! Only valid samples proceed. These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. -### 4.5. Validation with `error()` and `log.warn` +--- + +## 7. Validation with `error()` and `log.warn` Sometimes you need to stop the workflow immediately if input parameters are invalid. 
Nextflow provides `error()` for this. Let's add validation to our workflow. @@ -1635,25 +1662,24 @@ def separateMetadata(row) { } ``` -### Takeaway (Updated) +### Takeaway -- **Safe navigation (`?.`)**: Prevents crashes on null values - returns null instead of throwing exception -- **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` - **`error()`**: Stops workflow immediately with clear message - **`log.warn`**: Issues warnings without stopping workflow - **Early validation**: Check inputs before processing to fail fast with helpful errors +- **Validation functions**: Create reusable validation logic that can be called at workflow start -These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. +Proper validation makes workflows more robust and user-friendly by catching problems early with clear error messages. --- -## 5. Groovy in Configuration: Workflow Event Handlers +## 8. Groovy in Configuration: Workflow Event Handlers Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file. Event handlers are Groovy closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. -### 5.1. The `onComplete` Handler +### 8.1. The `onComplete` Handler The most commonly used event handler is `onComplete`, which runs when your workflow finishes (whether it succeeded or failed). Let's add one to summarize our pipeline results. @@ -1754,7 +1780,7 @@ workflow.onComplete = { } ``` -### 5.2. Other Useful Event Handlers +### 8.2. Other Useful Event Handlers Besides `onComplete`, there are other event handlers you can use: @@ -1838,13 +1864,19 @@ Here's how we progressively enhanced our pipeline: 1. **Nextflow vs Groovy Boundaries**: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial differences between constructs like `collect`. -2. **Advanced String Processing**: You mastered regular expressions, parsing functions, reusable functions, variable interpolation (Groovy vs Bash vs Shell), dynamic script generation in processes, and dynamic resource directives with closures. +2. **Advanced String Processing**: You mastered regular expressions for parsing file names, dynamic script generation in processes, and variable interpolation (Groovy vs Bash vs Shell). + +3. **Creating Reusable Functions**: You learned to extract complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable. + +4. **Dynamic Resource Directives with Closures**: You explored using Groovy closures in process directives for adaptive resource allocation based on input characteristics. + +5. **Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging Groovy Truth for concise conditional expressions. -3. **Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging Groovy Truth for concise conditional expressions. +6. **Safe Navigation and Elvis Operators**: You made the pipeline robust against missing data using `?.` for null-safe property access and `?:` for providing default values. -4. 
**Safe Navigation and Elvis Operators**: You made the pipeline robust against missing data using `?.` for null-safe property access, `?:` for providing default values, and `error()` for input validation. +7. **Validation with error() and log.warn**: You learned to validate inputs early and fail fast with clear error messages. -5. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete`, `onStart`, `onError`) for logging, notifications, and lifecycle management. +8. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete`, `onStart`, `onError`) for logging, notifications, and lifecycle management. ### Key Benefits @@ -1858,10 +1890,12 @@ Here's how we progressively enhanced our pipeline: The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows: 1. **Started simple**: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries -2. **Added intelligence**: Dynamic file name parsing with regex patterns, variable interpolation mastery, and conditional routing based on sample characteristics -3. **Made it efficient**: Dynamic resource allocation with closures in directives and retry strategies -4. **Made it robust**: Safe navigation and Elvis operators for handling missing data gracefully -5. **Added observability**: Workflow event handlers for logging, notifications, and lifecycle management +2. **Added intelligence**: Dynamic file name parsing with regex patterns, variable interpolation mastery, and dynamic script generation based on input types +3. **Made it maintainable**: Extracted complex logic into reusable functions for cleaner, more testable code +4. **Made it efficient**: Dynamic resource allocation with closures in directives and retry strategies +5. **Added routing**: Conditional logic to route samples through appropriate processes based on their characteristics +6. **Made it robust**: Safe navigation and Elvis operators for handling missing data gracefully, plus validation for early error detection +7. **Added observability**: Workflow event handlers for logging, notifications, and lifecycle management This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows. 
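The summary above lists the null-safety and validation patterns in prose; they are easier to retain with one small, self-contained illustration. The snippet below is a sketch, not part of the tutorial's workflow files: the `rows` literal is an invented stand-in for what `splitCsv(header: true)` would emit, and it is meant to be read inside a Nextflow script, where `error()` is available.

```groovy title="Sketch: safe navigation, Elvis defaults, Groovy Truth and error() together"
// Invented rows standing in for splitCsv(header: true) output
def rows = [
    [sample_id: 'SAMPLE_001', organism: 'human', quality_score: '38.5'],
    [sample_id: 'SAMPLE_004', organism: null, quality_score: null]
]

def cleaned = rows.collect { row ->
    [
        id      : row.sample_id?.toLowerCase() ?: 'unknown',   // safe navigation plus Elvis default
        organism: row.organism ?: 'unknown',
        quality : row.quality_score?.toDouble() ?: 0.0
    ]
}

// it.quality relies on Groovy Truth: 0.0 evaluates to false, any non-zero score to true
def valid = cleaned.findAll { it.organism != 'unknown' && it.quality }

if (!valid) {
    error("No valid samples found - check the samplesheet")
}

println "Keeping ${valid.size()} of ${cleaned.size()} samples"
```

Run against the two invented rows, this keeps SAMPLE_001 and drops SAMPLE_004 without ever throwing a NullPointerException, which is exactly the behaviour the safe operators are there to provide.
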
From 4232b7c9e712df634e31484d9985089bbf9f841f Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 10:27:08 +0100 Subject: [PATCH 25/48] Fix highlights --- docs/side_quests/groovy_essentials.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index ac1f8087b1..3ddc53895f 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1060,7 +1060,7 @@ To illustrate what that looks like with our existing workflow, make the modifica === "After" - ```groovy title="main.nf" linenums="1" hl_lines="3-25 28-32" + ```groovy title="main.nf" linenums="1" hl_lines="4-24 29" include { FASTP } from './modules/fastp.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' @@ -1098,7 +1098,7 @@ To illustrate what that looks like with our existing workflow, make the modifica === "Before" - ```groovy title="main.nf" linenums="1" hl_lines="5-28" + ```groovy title="main.nf" linenums="1" hl_lines="7-27" include { FASTP } from './modules/fastp.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' From 93198e504942dd3c25602b324538213f89b8396c Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 11:00:08 +0100 Subject: [PATCH 26/48] Fix up dynamic resources --- docs/side_quests/groovy_essentials.md | 147 +++++++++++++++++++------- 1 file changed, 106 insertions(+), 41 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 3ddc53895f..38ba3879d9 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1179,7 +1179,7 @@ Next, we'll explore how to use Groovy closures in process directives for dynamic So far we've used Groovy in the `script` block of processes. But Groovy closures are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. -Currently, our FASTP process uses default resources. Let's make it smarter by allocating more CPUs for high-depth samples: +Currently, our FASTP process uses default resources. Let's make it smarter by allocating more CPUs for high-depth samples. Edit `modules/fastp.nf` to include a dynamic `cpus` directive and a static `memory` directive: === "After" @@ -1188,12 +1188,10 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' cpus { meta.depth > 40000000 ? 4 : 2 } - memory '1 GB' + memory '2 GB' input: tuple val(meta), path(reads) - - // ... rest of process ... ``` === "Before" @@ -1204,8 +1202,6 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al input: tuple val(meta), path(reads) - - // ... rest of process ... ``` The closure `{ meta.depth > 40000000 ? 4 : 2 }` is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. @@ -1214,60 +1210,129 @@ The closure `{ meta.depth > 40000000 ? 4 : 2 }` is evaluated for each task, allo The closure can access any input variables (like `meta` here) because Nextflow evaluates these closures in the context of each task execution. -Another powerful pattern is using `task.attempt` for retry strategies. 
Let's add error retry with increasing resources: +Run the workflow again: -```groovy title="modules/fastp.nf - With retry" linenums="1" hl_lines="4-6" -process FASTP { - container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' +```bash title="Test resource allocation" +nextflow run main.nf -no-ansi-log +``` - errorStrategy 'retry' - maxRetries 2 - memory { 512.MB * task.attempt } +We're using the `-no-ansi-log` option to make it easier to see the task hashes. + +```console title="Resource allocation output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [fervent_albattani] DSL2 - revision: fa8f249759 +[bd/ff3d41] Submitted process > FASTP (2) +[a4/a3aab2] Submitted process > FASTP (1) +[48/6db0c9] Submitted process > FASTP (3) +[ec/83439d] Submitted process > GENERATE_REPORT (3) +[bd/15d7cc] Submitted process > GENERATE_REPORT (2) +[42/699357] Submitted process > GENERATE_REPORT (1) +``` - input: - tuple val(meta), path(reads) +You can check the exact `docker` command that was run to see the CPU allocation for any given task: - // ... rest of process ... +```console title="Check docker command" +cat work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.run | grep "docker run" ``` -Now if the process fails due to insufficient memory, Nextflow will retry with more memory: +You should see something like: -- First attempt: 512 MB (task.attempt = 1) -- Second attempt: 1024 MB (task.attempt = 2) +```bash title="docker command" + docker run -i --cpu-shares 4096 --memory 2048m -e "NXF_TASK_WORKDIR" -v /workspaces/training/side-quests/groovy_essentials:/workspaces/training/side-quests/groovy_essentials -w "$NXF_TASK_WORKDIR" --name $NXF_BOXID community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690 /bin/bash -ue /workspaces/training/side-quests/groovy_essentials/work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.sh +``` -You can combine multiple factors: +In this example we've chosen an example that requested 4 CPUs (`--cpu-shares 4096`), because it was a high-depth sample, but you should see different CPU allocations depending on the sample depth. Try this for the other tasks as well. -```groovy title="Complex resource allocation" -process QUALITY_CONTROL { +Another powerful pattern is using `task.attempt` for retry strategies. To show why this is useful, we're going to start by reducing the memory allocation to FASTP to less than it needs. Change the `memory` directive in `modules/fastp.nf` to `1.GB`: - memory { - def base_mem = meta.depth > 30000000 ? 1.GB : 512.MB - base_mem * task.attempt - } +=== "After" - cpus { - def base_cpus = meta.organism == 'human' ? 4 : 2 - Math.min(base_cpus, 8) // Cap at 8 CPUs for Codespaces - } + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="4-5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - time { - meta.priority == 'high' ? '30.m' : '1.h' - } + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '1 GB' - // ... rest of process ... -} + input: + tuple val(meta), path(reads) + ``` + +=== "Before" + + ```groovy title="modules/fastp.nf" linenums="1" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '2 GB' + + input: + tuple val(meta), path(reads) + ``` + +... 
and run the workflow again: + +```bash title="Test insufficient memory" +nextflow run main.nf +``` + +You'll see an error indicating that the process was killed for exceeding memory limits: + +```console title="Memory error output" hl_lines="2 11" +Command exit status: + 137 + +Command output: + (empty) + +Command error: + Detecting adapter sequence for read1... + No adapter detected for read1 + + .command.sh: line 7: 101 Killed fastp --in1 SAMPLE_002_S2_L001_R1_001.fastq --out1 sample_002_trimmed.fastq.gz --json sample_002.fastp.json --html sample_002.fastp.html --thread 2 ``` -This demonstrates several advanced patterns: +This is a very common scenario in real-world workflows - sometimes you just don't know how much memory a task will need until you run it. To make our workflow more robust, we can implement a retry strategy that increases memory allocation on each attempt, once again using a Groovy closure. Modify the `memory` directive to multiply the base memory by `task.attempt`, and add `errorStrategy 'retry'` and `maxRetries 2` directives: + +=== "After" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5-7" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory { '1 GB' * task.attempt } + errorStrategy 'retry' + maxRetries 2 + + input: + tuple val(meta), path(reads) + ``` + +=== "Before" + + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5" + process FASTP { + container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' + + cpus { meta.depth > 40000000 ? 4 : 2 } + memory '2 GB' + + input: + tuple val(meta), path(reads) + ``` + +Now if the process fails due to insufficient memory, Nextflow will retry with more memory: + +- First attempt: 1 GB (task.attempt = 1) +- Second attempt: 2 GB (task.attempt = 2) -- Creating intermediate Groovy variables (`base_mem`, `base_cpus`) -- Using Groovy math functions (`Math.min`) to set limits -- Combining metadata with retry logic -- Using Nextflow's duration syntax (`30.m`, `1.h`) +... and so on, up to the `maxRetries` limit. ### Takeaway -Dynamic directives with closures let you: +Dynamic directives with Groovy closures let you: - Allocate resources based on input characteristics - Implement automatic retry strategies with increasing resources From 7e708a4e3a883d8b1d6b66d5dab831e9b4a7c04d Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 11:18:01 +0100 Subject: [PATCH 27/48] Fix up dynamic routing --- docs/side_quests/groovy_essentials.md | 114 +++++++++++++++----------- 1 file changed, 65 insertions(+), 49 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 38ba3879d9..3e0a12941b 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -518,7 +518,7 @@ Individual channel item: sample_003 Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` -This time, have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using Groovy's `collect` method to produce a new list with modified values. This is sort of like using the `map` operator in Nextflow, but it's pure Groovy code operating on a standard Groovy list. +This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using Groovy's `collect` method to produce a new list with modified values. 
This is sort of like using the `map` operator in Nextflow, but it's pure Groovy code operating on a standard Groovy list. `collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. @@ -805,7 +805,7 @@ Command output: You can see that the process is trying to run `fastp` with a `null` value for the second input file, which is causing it to fail. This is because our dataset contains single-end reads, but the process is hardcoded to expect paired-end reads (two input files at a time). -Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` process to handle both single-end and paired-end reads dynamically. We'll use an if/else statement to check how many read files are are present and adjust the command accordingly. +Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` process to handle both single-end and paired-end reads dynamically. We'll use an if/else statement to check how many read files are present and adjust the command accordingly. === "After" @@ -1247,7 +1247,7 @@ Another powerful pattern is using `task.attempt` for retry strategies. To show w === "After" - ```groovy title="modules/fastp.nf" linenums="1" hl_lines="4-5" + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5" process FASTP { container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' @@ -1260,7 +1260,7 @@ Another powerful pattern is using `task.attempt` for retry strategies. To show w === "Before" - ```groovy title="modules/fastp.nf" linenums="1" + ```groovy title="modules/fastp.nf" linenums="1" hl_lines="5" process FASTP { container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' @@ -1347,13 +1347,15 @@ This makes your workflows both more efficient (not over-allocating) and more rob Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. -Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, including, many of which take closures as arguments, meanint their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. +Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, many of which take closures as arguments, meaning their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. ### 5.1. Routing with `.branch()` For example, let's pretend that our sequencing samples need to be trimmed with FASTP only if they're human samples with a coverage above a certain threshold. Mouse samples or low-coverage samples should be run with Trimgalore instead (this is a contrived example, but it illustrates the point). -Add a new process for Trimgalore in `modules/trimgalore.nf`: +We've provided a simple Trimgalore process in `modules/trimgalore.nf`, take a look if you like, but the details aren't important for this exercise. 
The key point is that we want to route samples based on their metadata. + +Include the new from in `modules/trimgalore.nf`: === "After" @@ -1385,16 +1387,18 @@ Add a new process for Trimgalore in `modules/trimgalore.nf`: ch_fastp = FASTP(trim_branches.fastp) ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) ``` === "Before" - ```groovy title="main.nf" linenums="28" + ```groovy title="main.nf" linenums="28" hl_lines="5" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) ch_fastp = FASTP(ch_samples) + GENERATE_REPORT(ch_samples) ``` Run this modified workflow: @@ -1406,11 +1410,12 @@ nextflow run main.nf ```console title="Conditional trimming results" N E X T F L O W ~ version 25.04.6 -Launching `main.nf` [boring_koch] DSL2 - revision: 68a6bc7bd8 +Launching `main.nf` [adoring_galileo] DSL2 - revision: c9e83aaef1 -executor > local (3) -[3d/bb1e90] process > FASTP (2) [100%] 2 of 2 ✔ -[4c/455334] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +executor > local (6) +[1d/0747ac] process > FASTP (2) [100%] 2 of 2 ✔ +[cc/c44caf] process > TRIMGALORE (1) [100%] 1 of 1 ✔ +[34/bd5a9f] process > GENERATE_REPORT (1) [100%] 3 of 3 ✔ ``` Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. @@ -1421,22 +1426,57 @@ Another powerful pattern for controlling workflow execution is the `.filter()` o Add the following before the branch operation: -```groovy title="main.nf - Adding filter" hl_lines="5-9" - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) +=== "After" + + ```groovy title="main.nf" linenums="28" hl_lines="11" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) .map(separateMetadata) - // Filter out invalid or low-quality samples - ch_valid_samples = ch_samples - .filter { meta, reads -> - meta.id && meta.organism && meta.depth >= 1000000 - } + // Filter out invalid or low-quality samples + ch_valid_samples = ch_samples + .filter { meta, reads -> + meta.id && meta.organism && meta.depth >= 25000000 + } - trim_branches = ch_valid_samples - .branch { meta, reads -> - fastp: meta.organism == 'human' && meta.depth >= 30000000 - trimgalore: true - } + trim_branches = ch_valid_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="28" hl_lines="5" + ch_samples = Channel.fromPath("./data/samples.csv") + .splitCsv(header: true) + .map(separateMetadata) + + trim_branches = ch_samples + .branch { meta, reads -> + fastp: meta.organism == 'human' && meta.depth >= 30000000 + trimgalore: true + } + ``` + +Run the workflow again: + +```bash title="Test filtering samples" +nextflow run main.nf +``` + +Because we've chosen a filter that excludes some samples, you should see fewer tasks executed: + +```console title="Filtered samples results" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [deadly_woese] DSL2 - revision: 9a6044a969 + +executor > local (5) +[01/7b1483] process > FASTP (2) [100%] 2 of 2 ✔ +[- ] process > TRIMGALORE - +[07/ef53af] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ ``` This filter uses **Groovy Truth** - Groovy's way of evaluating expressions in boolean contexts: @@ -1444,7 +1484,7 @@ This filter uses **Groovy Truth** - Groovy's way of evaluating 
expressions in bo - `null`, empty strings, empty collections, and zero are all "false" - Non-null values, non-empty strings, and non-zero numbers are "true" -So `meta.id && meta.organism` checks that both fields exist and are non-empty, while `meta.depth >= 1000000` ensures we have sufficient sequencing depth. +So `meta.id && meta.organism` checks that both fields exist and are non-empty, while `meta.depth >= 25000000` ensures we have sufficient sequencing depth. !!! note "Groovy Truth in Practice" @@ -1455,30 +1495,6 @@ So `meta.id && meta.organism` checks that both fields exist and are non-empty, w This makes filtering logic much cleaner and easier to read. -You could also combine `.filter()` with more complex Groovy logic: - -```groovy title="Complex filtering examples" -// Filter using safe navigation and Elvis operators -ch_samples - .filter { meta, reads -> - (meta.quality ?: 0) > 30 && meta.organism?.toLowerCase() in ['human', 'mouse'] - } - -// Filter using regular expressions -ch_samples - .filter { meta, reads -> - meta.id =~ /^SAMPLE_\d+$/ && reads.exists() - } - -// Filter using multiple conditions with Groovy Truth -ch_samples - .filter { meta, reads -> - meta.files // Non-empty file list - && meta.paired // Boolean flag is true - && !meta.failed // Negative check - } -``` - ### Takeaway In this section, you've learned to use Groovy logic to control workflow execution using the closure interfaces of Nextflow operators like `.branch{}` and `.filter{}`, leveraging Groovy Truth to write concise conditional expressions. From a9e3c285209a7d40e4682d86bcdc824cdf5b8f7d Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 12:18:54 +0100 Subject: [PATCH 28/48] Final tweaks --- docs/side_quests/groovy_essentials.md | 307 ++++++++++++------ .../solutions/groovy_essentials/main.nf | 45 +-- .../groovy_essentials/nextflow.config | 47 +-- 3 files changed, 247 insertions(+), 152 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 3e0a12941b..b972f12a54 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1507,43 +1507,82 @@ Our pipeline now intelligently routes samples through appropriate processes, but Our `separateMetadata` function currently assumes all CSV fields are present and valid. But what happens with incomplete data? Let's find out. -### 6.1. The Problem: Null Pointer Crashes +### 6.1. The Problem: Accessing Properties That Don't Exist -Add a row with missing data to your `data/samples.csv`: +Let's say we want to add support for optional sequencing run information. In some labs, samples might have an additional field for the sequencing run ID or batch number, but our current CSV doesn't have this column. Let's try to access it anyway. 
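
If you want to see the underlying behaviour in isolation, here is a minimal plain-Groovy sketch, independent of the workflow (the `run_id` key is purely illustrative):

```groovy title="Missing map keys return null (plain Groovy)"
// A row parsed by splitCsv(header: true) behaves like a Groovy map
def row = [sample_id: 'SAMPLE_001', organism: 'human']   // no run_id column

// Reading a key that isn't there simply returns null...
assert row.run_id == null

// ...it's calling a method on that null that fails:
// row.run_id.toUpperCase()   // Cannot invoke method toUpperCase() on null object
```
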
-```csv -SAMPLE_004,,unknown_tissue,20000000,data/sequences/SAMPLE_004_S4_L001_R1_001.fastq, -``` +Modify the `separateMetadata` function to include a run_id field: + +=== "After" + + ```groovy title="main.nf" linenums="5" hl_lines="9" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def run_id = row.run_id.toUpperCase() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="5" + def separateMetadata(row) { + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + ``` -Notice the empty organism field and missing quality_score. Now try running the workflow: +Now run the workflow: ```bash nextflow run main.nf ``` -It crashes with a NullPointerException! This is where Groovy's safe operators save the day. +It crashes with a NullPointerException: + +```console title="Null pointer error" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [trusting_torvalds] DSL2 - revision: b56fbfbce2 + +ERROR ~ Cannot invoke method toUpperCase() on null object + + -- Check script 'main.nf' at line: 13 or see '.nextflow.log' file for more details +``` + +The problem is that `row.run_id` returns `null` because the `run_id` column doesn't exist in our CSV. When we try to call `.toUpperCase()` on `null`, it crashes. This is where Groovy's safe navigation operator saves the day. ### 6.2. Safe Navigation Operator (`?.`) -The safe navigation operator (`?.`) returns null instead of throwing an exception. Update your `separateMetadata` function: +The safe navigation operator (`?.`) returns `null` instead of throwing an exception when called on a `null` value. If the object before `?.` is `null`, the entire expression evaluates to `null` without executing the method. + +Update the function to use safe navigation: === "After" - ```groovy title="main.nf" linenums="4" hl_lines="6-8" + ```groovy title="main.nf" linenums="4" hl_lines="9" def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), - organism: row.organism?.toLowerCase(), - tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), depth: row.sequencing_depth.toInteger(), - quality: row.quality_score?.toDouble() + quality: row.quality_score.toDouble() ] - // ... rest unchanged + def run_id = row.run_id?.toUpperCase() ``` === "Before" - ```groovy title="main.nf" linenums="4" + ```groovy title="main.nf" linenums="4" hl_lines="9" def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), @@ -1552,7 +1591,7 @@ The safe navigation operator (`?.`) returns null instead of throwing an exceptio depth: row.sequencing_depth.toInteger(), quality: row.quality_score.toDouble() ] - // ... rest unchanged + def run_id = row.run_id.toUpperCase() ``` Run again: @@ -1561,91 +1600,97 @@ Run again: nextflow run main.nf ``` -No crash! But SAMPLE_004 now has `null` values which could cause problems downstream. +No crash! The workflow now handles the missing field gracefully. When `row.run_id` is `null`, the `?.` operator prevents the `.toUpperCase()` call, and `run_id` becomes `null` instead of causing an exception. ### 6.3. 
Elvis Operator (`?:`) for Defaults -The Elvis operator (`?:`) provides default values. Update again: +The Elvis operator (`?:`) provides default values when the left side is `null` (or empty, in Groovy's "truth" evaluation). It's named after Elvis Presley because `?:` looks like his famous hair and eyes when viewed sideways! + +Now that we're using safe navigation, `run_id` will be `null` for samples without that field. Let's use the Elvis operator to provide a default value and add it to our `sample_meta` map: === "After" - ```groovy title="main.nf" linenums="4" hl_lines="6-8" + ```groovy title="main.nf" linenums="5" hl_lines="9-10" def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), - organism: row.organism?.toLowerCase() ?: 'unknown', - tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase() ?: 'unknown', + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), depth: row.sequencing_depth.toInteger(), - quality: row.quality_score?.toDouble() ?: 0.0 + quality: row.quality_score.toDouble() ] - // ... rest unchanged + def run_id = row.run_id?.toUpperCase() ?: 'UNSPECIFIED' + sample_meta.run = run_id ``` === "Before" - ```groovy title="main.nf" linenums="4" + ```groovy title="main.nf" linenums="5" hl_lines="9" def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), - organism: row.organism?.toLowerCase(), - tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), depth: row.sequencing_depth.toInteger(), - quality: row.quality_score?.toDouble() + quality: row.quality_score.toDouble() ] - // ... rest unchanged + def run_id = row.run_id?.toUpperCase() ``` -Run once more: - -```bash -nextflow run main.nf -``` - -Perfect! SAMPLE_004 now has safe defaults: 'unknown' for organism/tissue, 0.0 for quality. - -### 6.4. Filtering with Safe Operators - -Now let's filter out samples with missing data. Update your workflow: +Also add a `view()` operator in the workflow to see the results: === "After" - ```groovy title="main.nf" linenums="28" hl_lines="4-7" - workflow { + ```groovy title="main.nf" linenums="30" hl_lines="4" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) - .filter { meta, reads -> - meta.organism != 'unknown' && (meta.quality ?: 0) > 0 - } - - // ... rest of workflow + .map{ row -> separateMetadata(row) } + .view() ``` === "Before" - ```groovy title="main.nf" linenums="28" - workflow { + ```groovy title="main.nf" linenums="30" ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map(separateMetadata) - - // ... rest of workflow + .map{ row -> separateMetadata(row) } ``` -Run the workflow: + + and run the workflow: ```bash nextflow run main.nf ``` -SAMPLE_004 is now filtered out! Only valid samples proceed. 
+You'll see output like this: + +```console title="View output with run field" +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, run:UNSPECIFIED, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, run:UNSPECIFIED, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, run:UNSPECIFIED, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +``` + +Perfect! Now all samples have a `run` field with either their actual run ID (in uppercase) or the default value 'UNSPECIFIED'. The combination of `?.` and `?:` provides both safety (no crashes) and sensible defaults. + +Take out the `.view()` operator now that we've confirmed it works. + +!!! tip "Combining Safe Navigation and Elvis" + + The pattern `value?.method() ?: 'default'` is extremely common in production Nextflow: + + - `value?.method()` - Safely call method, returns `null` if `value` is `null` + - `?: 'default'` - Provide fallback if the result is `null` + + This makes your code resilient to missing or incomplete data. + +Get in the habit of using these operators everywhere you write Groovy code - in functions, operator closures (`.map{}`, `.filter{}`), process scripts, and config files. They're lightweight and will save you from countless crashes when handling real-world data. ### Takeaway - **Safe navigation (`?.`)**: Prevents crashes on null values - returns null instead of throwing exception - **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` - **Combining**: `value?.method() ?: 'default'` is the common pattern -- **In filters**: Use to handle missing data: `(meta.quality ?: 0) > threshold` These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. @@ -1653,44 +1698,32 @@ These operators make workflows resilient to incomplete data - essential for real ## 7. Validation with `error()` and `log.warn` -Sometimes you need to stop the workflow immediately if input parameters are invalid. Nextflow provides `error()` for this. Let's add validation to our workflow. +Sometimes you need to stop the workflow immediately if input parameters are invalid. Nextflow provides `error()` and `log.warn` for this, but the validation logic itself is pure Groovy - using conditionals, string operations, and file checking methods. Let's add validation to our workflow. -Create a validation function before your workflow block: +Create a validation function before your workflow block, call it from the workflow, and change the channel creation to use a parameter for the CSV file path. If the parameter is missing or the file doesn't exist, call `error()` to stop execution with a clear message. 
=== "After" - ```groovy title="main.nf" linenums="1" hl_lines="3-18" + ```groovy title="main.nf" linenums="1" hl_lines="5-20 23-24" include { FASTP } from './modules/fastp.nf' include { TRIMGALORE } from './modules/trimgalore.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' def validateInputs() { - // Check CSV file exists - if (!file(params.input ?: './data/samples.csv').exists()) { - error("Input CSV file not found: ${params.input ?: './data/samples.csv'}") + // Check input parameter is provided + if (!params.input) { + error("Input CSV file path not provided. Please specify --input ") } - // Warn if output directory already exists - if (file(params.outdir ?: 'results').exists()) { - log.warn "Output directory already exists: ${params.outdir ?: 'results'}" - } - - // Check for required genome parameter - if (params.run_gatk && !params.genome) { - error("Genome reference required when running GATK. Please provide --genome") + // Check CSV file exists + if (!file(params.input).exists()) { + error("Input CSV file not found: ${params.input}") } } - - // ... separateMetadata function ... - + ... workflow { - validateInputs() // Call validation first - - ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) - // ... rest of workflow - } + validateInputs() + ch_samples = Channel.fromPath(params.input) ``` === "Before" @@ -1700,47 +1733,71 @@ Create a validation function before your workflow block: include { TRIMGALORE } from './modules/trimgalore.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' - // ... separateMetadata function ... - + ... workflow { ch_samples = Channel.fromPath("./data/samples.csv") - .splitCsv(header: true) - .map(separateMetadata) - // ... rest of workflow - } ``` Now try running without the CSV file: ```bash -mv data/samples.csv data/samples.csv.bak nextflow run main.nf ``` The workflow stops immediately with a clear error message instead of failing mysteriously later! -You can also add validation within the `separateMetadata` function: +```console title="Validation error output" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [confident_coulomb] DSL2 - revision: 07059399ed + +WARN: Access to undefined parameter `input` -- Initialise it to a default value eg. `params.input = some_value` +Input CSV file path not provided. Please specify --input +``` -```groovy title="main.nf - Validation in function" -def separateMetadata(row) { - // Validate required fields - if (!row.sample_id) { - error("Missing sample_id in CSV row: ${row}") +You can also add validation within the `separateMetadata` function. Let's use the non-fatal `log.warn` to issue warnings for samples with low sequencing depth, but still allow the workflow to continue: + +=== "After" + + ```groovy title="main.nf" linenums="1" hl_lines="3-6" + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + // Validate data makes sense + if (sample_meta.depth < 30000000) { + log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + } + + return [sample_meta + file_meta + [priority: priority], fastq_path] } + ``` + +=== "Before" - def sample_meta = [ - id: row.sample_id.toLowerCase(), - organism: row.organism?.toLowerCase() ?: 'unknown', - // ... rest of fields - ] + ```groovy title="main.nf" linenums="1" + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' - // Validate data makes sense - if (sample_meta.depth < 1000000) { - log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + return [sample_meta + file_meta + [priority: priority], fastq_path] } + ``` - return [sample_meta, file(row.file_path)] -} +Run the workflow again with the original CSV: + +```bash +nextflow run main.nf --input ./data/samples.csv +``` + +... and you'll see a warning about low sequencing depth for one of the samples: + +```console title="Warning output" + N E X T F L O W ~ version 25.04.6 + +Launching `main.nf` [awesome_goldwasser] DSL2 - revision: a31662a7c1 + +executor > local (5) +[ce/df5eeb] process > FASTP (2) [100%] 2 of 2 ✔ +[- ] process > TRIMGALORE - +[d1/7d2b4b] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ +WARN: Low sequencing depth for sample_002: 25000000 ``` ### Takeaway @@ -1756,7 +1813,7 @@ Proper validation makes workflows more robust and user-friendly by catching prob ## 8. Groovy in Configuration: Workflow Event Handlers -Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file. +Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file (or other places you write configuration). Event handlers are Groovy closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. @@ -1798,6 +1855,29 @@ This is a Groovy closure being assigned to `workflow.onComplete`. Inside, you ha Run your workflow and you'll see this summary appear at the end! 
+```bash title="Run with onComplete handler" +nextflow run main.nf --input ./data/samples.csv -no-ansi-log +``` + +```console title="onComplete output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [marvelous_boltzmann] DSL2 - revision: a31662a7c1 +WARN: Low sequencing depth for sample_002: 25000000 +[9b/d48e40] Submitted process > FASTP (2) +[6a/73867a] Submitted process > GENERATE_REPORT (2) +[79/ad0ac5] Submitted process > GENERATE_REPORT (1) +[f3/bda6cb] Submitted process > FASTP (1) +[34/d5b52f] Submitted process > GENERATE_REPORT (3) + +Pipeline execution summary: +========================== +Completed at: 2025-10-10T12:14:24.885384+01:00 +Duration : 2.9s +Success : true +workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +exit status : 0 +``` + Let's make it more useful by adding conditional logic: === "After" @@ -1840,6 +1920,29 @@ Let's make it more useful by adding conditional logic: } ``` +Now we get an even more informative summary, including a success/failure message and the output directory if specified: + +```console title="Enhanced onComplete output" +N E X T F L O W ~ version 25.04.6 +Launching `main.nf` [boring_linnaeus] DSL2 - revision: a31662a7c1 +WARN: Low sequencing depth for sample_002: 25000000 +[e5/242efc] Submitted process > FASTP (2) +[3b/74047c] Submitted process > GENERATE_REPORT (3) +[8a/7a57e6] Submitted process > GENERATE_REPORT (1) +[a8/b1a31f] Submitted process > GENERATE_REPORT (2) +[40/648429] Submitted process > FASTP (1) + +Pipeline execution summary: +========================== +Completed at: 2025-10-10T12:16:00.522569+01:00 +Duration : 3.6s +Success : true +workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +exit status : 0 + +✅ Pipeline completed successfully! +``` + You can also write the summary to a file using Groovy file operations: ```groovy title="nextflow.config - Writing summary to file" diff --git a/side-quests/solutions/groovy_essentials/main.nf b/side-quests/solutions/groovy_essentials/main.nf index 2ee2a2ae03..ab83315d71 100644 --- a/side-quests/solutions/groovy_essentials/main.nf +++ b/side-quests/solutions/groovy_essentials/main.nf @@ -3,30 +3,27 @@ include { TRIMGALORE } from './modules/trimgalore.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' def validateInputs() { - // Check CSV file exists - if (!file(params.input ?: './data/samples.csv').exists()) { - error("Input CSV file not found: ${params.input ?: './data/samples.csv'}") - } - - // Warn if output directory already exists - if (file(params.outdir ?: 'results').exists()) { - log.warn "Output directory already exists: ${params.outdir ?: 'results'}" + // Check input parameter is provided + if (!params.input) { + error("Input CSV file path not provided. Please specify --input ") } - // Check for required genome parameter - if (params.run_gatk && !params.genome) { - error("Genome reference required when running GATK. 
Please provide --genome") + // Check CSV file exists + if (!file(params.input).exists()) { + error("Input CSV file not found: ${params.input}") } } def separateMetadata(row) { def sample_meta = [ id: row.sample_id.toLowerCase(), - organism: row.organism?.toLowerCase() ?: 'unknown', - tissue: row.tissue_type?.replaceAll('_', ' ')?.toLowerCase() ?: 'unknown', + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), depth: row.sequencing_depth.toInteger(), - quality: row.quality_score?.toDouble() ?: 0.0 + quality: row.quality_score?.toDouble() ] + def run_id = row.run_id?.toUpperCase() ?: 'UNSPECIFIED' + sample_meta.run = run_id def fastq_path = file(row.file_path) def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) @@ -38,22 +35,29 @@ def separateMetadata(row) { ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' + + // Validate data makes sense + if (sample_meta.depth < 30000000) { + log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" + } + return [sample_meta + file_meta + [priority: priority], fastq_path] } workflow { validateInputs() - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = Channel.fromPath(params.input) .splitCsv(header: true) - .map(separateMetadata) + .map{ row -> separateMetadata(row) } + + // Filter out invalid or low-quality samples + ch_valid_samples = ch_samples .filter { meta, reads -> - meta.organism != 'unknown' && (meta.quality ?: 0) > 0 + meta.id && meta.organism && meta.depth > 25000000 } - GENERATE_REPORT(ch_samples) - - trim_branches = ch_samples + trim_branches = ch_valid_samples .branch { meta, reads -> fastp: meta.organism == 'human' && meta.depth >= 30000000 trimgalore: true @@ -61,4 +65,5 @@ workflow { ch_fastp = FASTP(trim_branches.fastp) ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) } diff --git a/side-quests/solutions/groovy_essentials/nextflow.config b/side-quests/solutions/groovy_essentials/nextflow.config index 7d2d0ccef0..80c74b85f4 100644 --- a/side-quests/solutions/groovy_essentials/nextflow.config +++ b/side-quests/solutions/groovy_essentials/nextflow.config @@ -2,35 +2,22 @@ docker.enabled = true -workflow.onStart = { - println "Starting ${workflow.runName} at ${workflow.start}" -} - -workflow.onError = { - println "Workflow failed: ${workflow.errorMessage}" -} - workflow.onComplete = { - def duration_mins = workflow.duration.toMinutes().round(2) - def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" - - def summary = """ - Pipeline Execution Summary - =========================== - Completed: ${workflow.complete} - Duration : ${workflow.duration} - Success : ${workflow.success} - Command : ${workflow.commandLine} - """ - - println summary - - // Write to a log file - def log_file = file("${workflow.launchDir}/pipeline_summary.txt") - log_file.text = summary - - println """ - Pipeline finished: ${status} - Duration: ${duration_mins} minutes - """ + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + println "Results are in: ${params.outdir ?: 'results'}" + } else { + println "❌ Pipeline failed!" 
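+        // workflow.errorMessage holds the reason for failure (null on success)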
+ println "Error: ${workflow.errorMessage}" + } } From ef9124c5ec7f6c873bf6cf280151e68c592d3ce2 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 12:50:21 +0100 Subject: [PATCH 29/48] Language/ clarity improvements --- docs/side_quests/groovy_essentials.md | 130 +++++++++++++++----------- 1 file changed, 76 insertions(+), 54 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index b972f12a54..2087f4323d 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -31,13 +31,13 @@ This tutorial will explain Groovy concepts as we encounter them, so you don't ne ### 0.2. Starting Point -Let's move into the project directory and explore our working materials. +Navigate to the project directory: ```bash title="Navigate to project directory" cd side-quests/groovy_essentials ``` -You'll find a `data` directory with sample files and a main workflow file that we'll evolve throughout this tutorial. +The `data` directory contains sample files and a main workflow file we'll evolve throughout. ```console title="Directory contents" > tree @@ -76,7 +76,7 @@ We'll use this realistic dataset to explore practical Groovy techniques that you ### 1.1. Identifying What's What -One of the most common sources of confusion for Nextflow developers is understanding when they're working with Nextflow constructs versus Groovy language features. Let's build a workflow step by step to see a common example of how they work together. +Nextflow developers often confuse Nextflow constructs with Groovy language features. Let's build a workflow demonstrating how they work together. #### Step 1: Basic Nextflow Workflow @@ -137,9 +137,11 @@ Here's what that map operation looks like: .view() ``` -The closure we've added is `{ row -> return row }`. We've named the parameter `row`, but it could be called anything, so you could write `.map { item -> ... }` or `.map { sample -> ... }` and it would work exactly the same way. It's also possible not to name the parameter at all and just use the implicit variable `it`, like `.map { return it }`, but naming it makes the code clearer. +This is our first **Groovy closure**—an anonymous function you can pass as an argument. Closures are a core Groovy concept (similar to lambdas in Python or arrow functions in JavaScript) and are essential for working with Nextflow operators. -When Nextflow processes each item in the channel, it passes that item to your closure as the parameter you named. So if your channel contains CSV rows, `row` will hold one complete row at a time. +The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). You can also use the implicit variable `it` instead: `.map { return it }`, though naming parameters improves clarity. + +When Nextflow processes each channel item, it passes that item to your closure. Here, `row` holds one CSV row at a time. Apply this change and run the workflow: @@ -183,7 +185,7 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro .view() ``` -Notice how we've left Nextflow syntax behind and are now writing pure Groovy code. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. It lets us store related pieces of information together. In this map, we're storing the sample ID, organism, tissue type, sequencing depth, and quality score. +This is pure Groovy code. 
The `sample_meta` map is a key-value data structure (like dictionaries in Python, objects in JavaScript, or hashes in Ruby) storing related information: sample ID, organism, tissue type, sequencing depth, and quality score. We use Groovy's string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. @@ -434,13 +436,13 @@ ch_collected = ch_input.collect() ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } ``` -We are: +Steps: -- Defining a (Groovy) list -- Using the `fromList()` channel factory to create a channel that emits each sample ID as a separate item -- Using `view()` to print each item as it flows through the channel -- Applying Nextflow's `collect()` operator to gather all items into a single list -- Using a second `view()` to print the collected result which appears as a single item containing a list of all sample IDs +- Define a Groovy list +- Create a channel with `fromList()` that emits each sample ID separately +- Print each item with `view()` as it flows through +- Gather all items into a single list with Nextflow's `collect()` operator +- Print the collected result (single item containing all sample IDs) with a second `view()` We've changed the structure of the channel, but we haven't changed the data itself. @@ -625,11 +627,11 @@ Next we'll dive deeper into Groovy's powerful string processing capabilities, wh ## 2. String Processing and Dynamic Script Generation -The difference between a brittle workflow that breaks on unexpected input and a robust pipeline that adapts gracefully often comes down to mastering Groovy's string processing capabilities. In this section, we'll explore how to parse complex file names, generate process scripts dynamically based on input characteristics, and properly interpolate variables in different contexts. +Mastering Groovy's string processing separates brittle workflows from robust pipelines. This section covers parsing complex file names, dynamic script generation, and variable interpolation. ### 2.1. Pattern Matching and Regular Expressions -Many bioinformatics workflows encounter files with complex naming conventions that encode important metadata. Let's see how Groovy's pattern matching can extract this information automatically. +Bioinformatics files often have complex naming conventions encoding metadata. Let's extract this automatically with Groovy's pattern matching. We're going to return to our `main.nf` workflow and add some pattern matching logic to extract additional sample information from file names. The FASTQ files in our dataset follow Illumina-style naming conventions with names like `SAMPLE_001_S1_L001_R1_001.fastq.gz`. These might look cryptic, but they actually encode useful metadata like sample ID, lane number, and read direction. We're going to use Groovy's regex capabilities to parse these names. @@ -679,13 +681,24 @@ Make the following change to your existing `main.nf` workflow: } ``` -This demonstrates key Groovy string processing concepts: +This demonstrates key **Groovy string processing concepts**: 1. **Regular expression literals** using `~/pattern/` syntax - this creates a regex pattern without needing to escape backslashes 2. **Pattern matching** with the `=~` operator - this attempts to match a string against a regex pattern 3. **Matcher objects** that capture groups with `[0][1]`, `[0][2]`, etc. 
- `[0]` refers to the entire match, `[1]`, `[2]`, etc. refer to captured groups in parentheses -The regex pattern `^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$` is designed to match the Illumina-style naming convention and capture key components, namely the sample number, lane number, read direction, and chunk number. +Let's break down the regex pattern `^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$`: + +| Pattern | Matches | Captures | +|---------|---------|----------| +| `^(.+)` | Sample name from start | Group 1: sample name | +| `_S(\d+)` | Sample number `_S1`, `_S2`, etc. | Group 2: sample number | +| `_L(\d{3})` | Lane number `_L001` | Group 3: lane (3 digits) | +| `_(R[12])` | Read direction `_R1` or `_R2` | Group 4: read direction | +| `_(\d{3})` | Chunk number `_001` | Group 5: chunk (3 digits) | +| `\.fastq(?:\.gz)?$` | File extension `.fastq` or `.fastq.gz` | Not captured (?: is non-capturing) | + +This parses Illumina-style naming conventions to extract metadata automatically. Run the modified workflow: @@ -707,9 +720,9 @@ Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 ### 2.2. Dynamic Script Generation in Processes -Process script blocks are essentially multi-line strings that get passed to the shell. You can use Groovy logic to dynamically generate different script strings based on input characteristics, making your processes adaptable to different input conditions. +Process script blocks are essentially multi-line strings that get passed to the shell. You can use **Groovy conditional logic** (if/else, ternary operators) to dynamically generate different script strings based on input characteristics. This is essential for handling diverse input types—like single-end vs paired-end sequencing reads—without duplicating process definitions. -To illustrate what we mean, let's add some processes to our existing `main.nf` workflow that demonstrate common patterns for dynamic script generation. Open `modules/fastp.nf` and take a look: +Let's add a process to our workflow that demonstrates this pattern. Open `modules/fastp.nf` and take a look: ```groovy title="modules/fastp.nf" linenums="1" process FASTP { @@ -805,7 +818,7 @@ Command output: You can see that the process is trying to run `fastp` with a `null` value for the second input file, which is causing it to fail. This is because our dataset contains single-end reads, but the process is hardcoded to expect paired-end reads (two input files at a time). -Let's fix this by adding some Groovy logic to the `script:` block of the `FASTP` process to handle both single-end and paired-end reads dynamically. We'll use an if/else statement to check how many read files are present and adjust the command accordingly. +Fix this by adding Groovy logic to the `FASTP` process `script:` block. An if/else statement checks read file count and adjusts the command accordingly. === "After" @@ -905,7 +918,7 @@ These patterns of using Groovy logic in process script blocks are extremely powe ### 2.3. Variable Interpolation: Groovy, Bash, and Shell Variables -When writing process scripts, you're actually working with three different types of variables, and using the wrong syntax is a common source of errors. Let's add a process that creates a processing report to demonstrate the differences. +Process scripts mix Nextflow variables, shell variables, and command substitutions, each with different interpolation syntax. Using the wrong syntax causes errors. 
Let's explore these with a process that creates a processing report. Take a look a the module file `modules/generate_report.nf`: @@ -1040,11 +1053,14 @@ Now it works! The backslash (`\`) tells Nextflow "don't interpret this, pass it ### Takeaway -In this section, you've learned: +In this section, you've learned **Groovy string processing** techniques: -- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns to extract metadata from complex bioinformatics file naming conventions -- **Dynamic script generation**: Using Groovy conditional logic to generate different script strings based on input characteristics (like single-end vs paired-end reads) -- **Variable interpolation**: Understanding the difference between Nextflow/Groovy variables (`${var}`), shell environment variables (`\${var}`), and shell command substitution (`\$(cmd)`) +- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns (`~/pattern/`) to extract metadata from complex file naming conventions +- **Dynamic script generation**: Using Groovy conditional logic (if/else, ternary operators) to generate different script strings based on input characteristics +- **Variable interpolation**: Understanding when Nextflow interprets strings vs when the shell does + - `${var}` - Groovy/Nextflow variables (interpolated by Nextflow at workflow compile time) + - `\${var}` - Shell environment variables (escaped, passed to bash at runtime) + - `\$(cmd)` - Shell command substitution (escaped, executed by bash at runtime) These string processing and generation patterns are essential for handling the diverse file formats and naming conventions you'll encounter in real-world bioinformatics workflows. @@ -1052,9 +1068,9 @@ These string processing and generation patterns are essential for handling the d ## 3. Creating Reusable Functions -As your workflow logic becomes more complex, keeping everything inline in channel operators or process definitions can make your code hard to read and maintain. Groovy functions let you extract complex logic into named, reusable components that can be called from anywhere in your workflow. +Complex workflow logic inline in channel operators or process definitions reduces readability and maintainability. **Groovy functions** let you extract this logic into named, reusable components—this is core Groovy programming, not Nextflow-specific syntax. -You may have noticed that the content of our map operation is getting quite long and complex. To keep our workflow maintainable, it's a good idea to break out complex logic into reusable functions. +Our map operation has grown long and complex. Let's extract it into a reusable Groovy function using the `def` keyword. To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`: @@ -1165,11 +1181,12 @@ The output should show both processes completing successfully. 
The workflow is n ### Takeaway -In this section, you've learned: +In this section, you've learned core **Groovy programming concepts**: -- **Extracting functions**: Moving complex logic from inline closures into named functions -- **Function scope**: Functions defined at the script level can be called from anywhere in your workflow -- **Cleaner workflows**: Using functions makes your workflow blocks more concise and readable +- **Defining functions with `def`**: Groovy's keyword for creating named functions (like `def` in Python or `function` in JavaScript) +- **Function scope**: Functions defined at the script level are accessible throughout your Nextflow workflow +- **Return values**: Functions automatically return the last expression, or use explicit `return` +- **Cleaner code**: Extracting complex logic into functions is a fundamental software engineering practice in any language, including Groovy Next, we'll explore how to use Groovy closures in process directives for dynamic resource allocation. @@ -1177,7 +1194,7 @@ Next, we'll explore how to use Groovy closures in process directives for dynamic ## 4. Dynamic Resource Directives with Closures -So far we've used Groovy in the `script` block of processes. But Groovy closures are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. +So far we've used Groovy in the `script` block of processes. But **Groovy closures** (introduced in Section 1.1) are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. Currently, our FASTP process uses default resources. Let's make it smarter by allocating more CPUs for high-depth samples. Edit `modules/fastp.nf` to include a dynamic `cpus` directive and a static `memory` directive: @@ -1204,7 +1221,7 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al tuple val(meta), path(reads) ``` -The closure `{ meta.depth > 40000000 ? 4 : 2 }` is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. +The closure `{ meta.depth > 40000000 ? 4 : 2 }` uses the **Groovy ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. !!! note "Accessing Input Variables in Directives" @@ -1345,9 +1362,9 @@ This makes your workflows both more efficient (not over-allocating) and more rob ## 5. Conditional Logic and Process Control -Earlier on, we discussed how to use the `.map()` operator to use snippets of Groovy code to transform data flowing through channels. The counterpart to that is using Groovy to not just transform data, but to control which processes get executed based on the data itself. This is essential for building flexible workflows that can adapt to different sample types and analysis requirements. +Previously, we used `.map()` with Groovy to transform channel data. Now we'll use Groovy to control which processes execute based on data—essential for flexible workflows adapting to different sample types. 
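
Before wiring any operators in, it can help to see that the decision itself is just a Groovy boolean expression. Here is a small standalone sketch (plain Groovy, not Nextflow; the threshold mirrors the one we use below):

```groovy title="A routing decision as a Groovy closure"
// Closure that answers: should this sample go through FASTP?
def useFastp = { meta -> meta.organism == 'human' && meta.depth >= 30000000 }

assert useFastp([organism: 'human', depth: 45000000])    // e.g. sample_003
assert !useFastp([organism: 'mouse', depth: 25000000])   // e.g. sample_002
```

The same kind of expression is exactly what we will place inside `.branch{}` in a moment.
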
-Nextflow has several [operators](https://www.nextflow.io/docs/latest/reference/operator.html) that control process flow, many of which take closures as arguments, meaning their content is evaluated at run time, allowing us to use Groovy logic to drive workflow decisions based on channel content. +Nextflow's [flow control operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling Groovy logic to drive workflow decisions based on channel content. ### 5.1. Routing with `.branch()` @@ -1422,7 +1439,14 @@ Here, we've used small but mighty Groovy expressions inside the `.branch{}` oper ### 5.2. Using `.filter()` with Groovy Truth -Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Let's add a validation step to filter out samples that don't meet our quality requirements. +Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Inside the filter closure, you'll write **Groovy boolean expressions** that decide which items pass through. + +Groovy has a concept called **"Groovy Truth"** that determines what values evaluate to `true` or `false` in boolean contexts: + +- **Truthy**: Non-null values, non-empty strings, non-zero numbers, non-empty collections +- **Falsy**: `null`, empty strings `""`, zero `0`, empty collections `[]` or `[:]`, `false` + +This means `meta.id` alone (without explicit `!= null`) checks if the ID exists and isn't empty. Let's use this to filter out samples that don't meet our quality requirements. Add the following before the branch operation: @@ -1479,12 +1503,10 @@ executor > local (5) [07/ef53af] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ ``` -This filter uses **Groovy Truth** - Groovy's way of evaluating expressions in boolean contexts: - -- `null`, empty strings, empty collections, and zero are all "false" -- Non-null values, non-empty strings, and non-zero numbers are "true" +The filter expression `meta.id && meta.organism && meta.depth >= 25000000` combines Groovy Truth with explicit comparisons: -So `meta.id && meta.organism` checks that both fields exist and are non-empty, while `meta.depth >= 25000000` ensures we have sufficient sequencing depth. +- `meta.id && meta.organism` checks that both fields exist and are non-empty (using Groovy Truth) +- `meta.depth >= 25000000` ensures sufficient sequencing depth with an explicit comparison !!! note "Groovy Truth in Practice" @@ -1677,14 +1699,14 @@ Take out the `.view()` operator now that we've confirmed it works. !!! tip "Combining Safe Navigation and Elvis" - The pattern `value?.method() ?: 'default'` is extremely common in production Nextflow: + The pattern `value?.method() ?: 'default'` is common in production Nextflow: - - `value?.method()` - Safely call method, returns `null` if `value` is `null` - - `?: 'default'` - Provide fallback if the result is `null` + - `value?.method()` - Safely calls method, returns `null` if `value` is `null` + - `?: 'default'` - Provides fallback if result is `null` - This makes your code resilient to missing or incomplete data. + This pattern handles missing/incomplete data gracefully. -Get in the habit of using these operators everywhere you write Groovy code - in functions, operator closures (`.map{}`, `.filter{}`), process scripts, and config files. 
They're lightweight and will save you from countless crashes when handling real-world data. +Use these operators consistently in functions, operator closures (`.map{}`, `.filter{}`), process scripts, and config files. They prevent crashes when handling real-world data. ### Takeaway @@ -1698,7 +1720,7 @@ These operators make workflows resilient to incomplete data - essential for real ## 7. Validation with `error()` and `log.warn` -Sometimes you need to stop the workflow immediately if input parameters are invalid. Nextflow provides `error()` and `log.warn` for this, but the validation logic itself is pure Groovy - using conditionals, string operations, and file checking methods. Let's add validation to our workflow. +Sometimes you need to stop the workflow immediately if input parameters are invalid. While `error()` and `log.warn` are Nextflow-provided functions, the **validation logic itself is pure Groovy**—using conditionals (`if`, `!`), boolean logic, and methods like `.exists()`. Let's add validation to our workflow. Create a validation function before your workflow block, call it from the workflow, and change the channel creation to use a parameter for the CSV file path. If the parameter is missing or the file doesn't exist, call `error()` to stop execution with a clear message. @@ -2071,15 +2093,15 @@ Here's how we progressively enhanced our pipeline: ### From Simple to Sophisticated -The pipeline journey you completed demonstrates the evolution from basic data processing to production-ready bioinformatics workflows: +This pipeline evolved from basic data processing to production-ready workflows: -1. **Started simple**: Basic CSV processing and metadata extraction with clear Nextflow vs Groovy boundaries -2. **Added intelligence**: Dynamic file name parsing with regex patterns, variable interpolation mastery, and dynamic script generation based on input types -3. **Made it maintainable**: Extracted complex logic into reusable functions for cleaner, more testable code -4. **Made it efficient**: Dynamic resource allocation with closures in directives and retry strategies -5. **Added routing**: Conditional logic to route samples through appropriate processes based on their characteristics -6. **Made it robust**: Safe navigation and Elvis operators for handling missing data gracefully, plus validation for early error detection -7. **Added observability**: Workflow event handlers for logging, notifications, and lifecycle management +1. **Simple**: CSV processing and metadata extraction (Nextflow vs Groovy boundaries) +2. **Intelligent**: Regex parsing, variable interpolation, dynamic script generation +3. **Maintainable**: Reusable functions for cleaner, testable code +4. **Efficient**: Dynamic resource allocation and retry strategies +5. **Adaptive**: Conditional routing based on sample characteristics +6. **Robust**: Safe navigation, Elvis operators, early validation +7. **Observable**: Event handlers for logging and lifecycle management This progression mirrors the real-world evolution of bioinformatics pipelines - from research prototypes handling a few samples to production systems processing thousands of samples across laboratories and institutions. Every challenge you solved and pattern you learned reflects actual problems developers face when scaling Nextflow workflows. 
From 9f1fa177c544687f4bf4739db04495bc7edfb830 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 12:57:31 +0100 Subject: [PATCH 30/48] Prettier --- docs/side_quests/groovy_essentials.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 2087f4323d..8b12b38c29 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -689,13 +689,13 @@ This demonstrates key **Groovy string processing concepts**: Let's break down the regex pattern `^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$`: -| Pattern | Matches | Captures | -|---------|---------|----------| -| `^(.+)` | Sample name from start | Group 1: sample name | -| `_S(\d+)` | Sample number `_S1`, `_S2`, etc. | Group 2: sample number | -| `_L(\d{3})` | Lane number `_L001` | Group 3: lane (3 digits) | -| `_(R[12])` | Read direction `_R1` or `_R2` | Group 4: read direction | -| `_(\d{3})` | Chunk number `_001` | Group 5: chunk (3 digits) | +| Pattern | Matches | Captures | +| ------------------- | -------------------------------------- | ---------------------------------- | +| `^(.+)` | Sample name from start | Group 1: sample name | +| `_S(\d+)` | Sample number `_S1`, `_S2`, etc. | Group 2: sample number | +| `_L(\d{3})` | Lane number `_L001` | Group 3: lane (3 digits) | +| `_(R[12])` | Read direction `_R1` or `_R2` | Group 4: read direction | +| `_(\d{3})` | Chunk number `_001` | Group 5: chunk (3 digits) | | `\.fastq(?:\.gz)?$` | File extension `.fastq` or `.fastq.gz` | Not captured (?: is non-capturing) | This parses Illumina-style naming conventions to extract metadata automatically. @@ -1019,7 +1019,7 @@ ERROR ~ Module compilation error 1 error ``` - We need to escape it so Bash can handle it instead. +We need to escape it so Bash can handle it instead. Fix this by escaping the shell variables and command substitutions with a backslash (`\`): @@ -1678,8 +1678,7 @@ Also add a `view()` operator in the workflow to see the results: .map{ row -> separateMetadata(row) } ``` - - and run the workflow: +and run the workflow: ```bash nextflow run main.nf From c4a6eaf8265ad86cc61413482065d221a4b6377d Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Fri, 10 Oct 2025 14:32:44 +0100 Subject: [PATCH 31/48] Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/side_quests/groovy_essentials.md | 2 +- side-quests/solutions/groovy_essentials/modules/fastp.nf | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 8b12b38c29..7a0c483394 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -666,7 +666,7 @@ Make the following change to your existing `main.nf` workflow: === "Before" - ```groovy title="main.nf" linenums="4" "hl_lines="11" + ```groovy title="main.nf" linenums="4" hl_lines="11" .map { row -> // This is all Groovy code now! 
def sample_meta = [ diff --git a/side-quests/solutions/groovy_essentials/modules/fastp.nf b/side-quests/solutions/groovy_essentials/modules/fastp.nf index f74e5e772a..2aa70e6c81 100644 --- a/side-quests/solutions/groovy_essentials/modules/fastp.nf +++ b/side-quests/solutions/groovy_essentials/modules/fastp.nf @@ -5,7 +5,7 @@ process FASTP { tuple val(meta), path(reads) output: - tuple val(meta.id), path("*_trimmed*.fastq.gz"), emit: reads + tuple val(meta), path("*_trimmed*.fastq.gz"), emit: reads path "*.{json,html}" , emit: reports script: From 4d42b2ad0c5049c66ed1663e0ea50f5ec4d2e2c6 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Sat, 11 Oct 2025 16:37:58 +0100 Subject: [PATCH 32/48] Add teach time estimate for groovy --- docs/side_quests/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 128b3ae0e4..5a988a6b71 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -31,7 +31,7 @@ Otherwise, select a side quest from the table below. | Side Quest | Time Estimate for Teaching | | ----------------------------------------------------------------- | -------------------------- | | [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | -| [Groovy Essentials for Nextflow](./groovy_essentials.md) | | +| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 2 hours | | [Introduction to nf-core](./nf-core.md) | - | | [Metadata in workflows](./metadata.md) | 45 mins | | [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | From 89141504b82c81c1118ba049c4bada271f178aa4 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Sat, 11 Oct 2025 16:38:39 +0100 Subject: [PATCH 33/48] Fix formatting --- docs/side_quests/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 5a988a6b71..2e4537e1b4 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -31,7 +31,7 @@ Otherwise, select a side quest from the table below. 
| Side Quest | Time Estimate for Teaching | | ----------------------------------------------------------------- | -------------------------- | | [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | -| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 2 hours | +| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 2 hours | | [Introduction to nf-core](./nf-core.md) | - | | [Metadata in workflows](./metadata.md) | 45 mins | | [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | From 97bb70729b0a2ae93a6d11a92d2d8a76c733d698 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Mon, 13 Oct 2025 18:06:04 +0100 Subject: [PATCH 34/48] Some section 2 fixes --- docs/side_quests/groovy_essentials.md | 168 +++++++++++++++++++------- 1 file changed, 122 insertions(+), 46 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 7a0c483394..6ac3689722 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -760,12 +760,32 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS === "After" - ```groovy title="main.nf" linenums="25" hl_lines="7" + ```groovy title="main.nf" linenums="25" hl_lines="7-26" workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map{ row -> separateMetadata(row) } + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } ch_fastp = FASTP(ch_samples) } @@ -773,12 +793,32 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS === "Before" - ```groovy title="main.nf" linenums="25" hl_lines="6" + ```groovy title="main.nf" linenums="25" hl_lines="6-25" workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map{ row -> separateMetadata(row) } + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], file(row.file_path)] + } .view() } ``` @@ -822,7 +862,7 @@ Fix this by adding Groovy logic to the `FASTP` process `script:` block. An if/el === "After" - ```groovy title="main.nf" linenums="10" hl_lines="3 5 15" + ```groovy title="main.nf" linenums="10" hl_lines="3-26" script: // Simple single-end vs paired-end detection def is_single = reads instanceof List ? reads.size() == 1 : true @@ -853,7 +893,7 @@ Fix this by adding Groovy logic to the `FASTP` process `script:` block. 
An if/el === "Before" - ```groovy title="main.nf" linenums="10" + ```groovy title="main.nf" linenums="10" hl_lines="2-11" script: """ fastp \\ @@ -916,7 +956,7 @@ Another common usage of dynamic script logic can be seen in [the Nextflow for Sc These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. -### 2.3. Variable Interpolation: Groovy, Bash, and Shell Variables +### 2.3. Variable Interpolation: Groovy and Shell Variables Process scripts mix Nextflow variables, shell variables, and command substitutions, each with different interpolation syntax. Using the wrong syntax causes errors. Let's explore these with a process that creates a processing report. @@ -947,16 +987,34 @@ Include the process in your `main.nf` and add it to the workflow: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="2 12" + ```groovy title="main.nf" linenums="1" hl_lines="2 12-31" include { FASTP } from './modules/fastp.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' - // ... separateMetadata function ... - workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map{ row -> separateMetadata(row) } + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } ch_fastp = FASTP(ch_samples) GENERATE_REPORT(ch_samples) @@ -965,15 +1023,33 @@ Include the process in your `main.nf` and add it to the workflow: === "Before" - ```groovy title="main.nf" linenums="1" hl_lines="1 10" + ```groovy title="main.nf" linenums="1" hl_lines="1 10-29" include { FASTP } from './modules/fastp.nf' - // ... separateMetadata function ... - workflow { ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) - .map{ row -> separateMetadata(row) } + .map { row -> + def sample_meta = [ + id: row.sample_id.toLowerCase(), + organism: row.organism, + tissue: row.tissue_type.replaceAll('_', ' ').toLowerCase(), + depth: row.sequencing_depth.toInteger(), + quality: row.quality_score.toDouble() + ] + def fastq_path = file(row.file_path) + + def m = (fastq_path.name =~ /^(.+)_S(\d+)_L(\d{3})_(R[12])_(\d{3})\.fastq(?:\.gz)?$/) + def file_meta = m ? [ + sample_num: m[0][2].toInteger(), + lane: m[0][3], + read: m[0][4], + chunk: m[0][5] + ] : [:] + + def priority = sample_meta.quality > 40 ? 
'high' : 'normal' + return [sample_meta + file_meta + [priority: priority], fastq_path] + } ch_fastp = FASTP(ch_samples) } @@ -986,24 +1062,24 @@ But what if we want to add information about when and where the processing occur === "After" ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" - script: - """ - echo "Processing ${reads}" > ${meta.id}_report.txt - echo "Sample: ${meta.id}" >> ${meta.id}_report.txt - echo "Processed by: ${USER}" >> ${meta.id}_report.txt - echo "Hostname: $(hostname)" >> ${meta.id}_report.txt - echo "Date: $(date)" >> ${meta.id}_report.txt - """ + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ ``` === "Before" ```groovy title="modules/generate_report.nf" linenums="10" - script: - """ - echo "Processing ${reads}" > ${meta.id}_report.txt - echo "Sample: ${meta.id}" >> ${meta.id}_report.txt - """ + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + """ ``` If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `$(hostname)` as a Groovy variable that doesn't exist: @@ -1023,30 +1099,30 @@ We need to escape it so Bash can handle it instead. Fix this by escaping the shell variables and command substitutions with a backslash (`\`): -=== "After - Fixed" +=== "After" ```groovy title="modules/generate_report.nf" linenums="10" hl_lines="5-7" - script: - """ - echo "Processing ${reads}" > ${meta.id}_report.txt - echo "Sample: ${meta.id}" >> ${meta.id}_report.txt - echo "Processed by: \${USER}" >> ${meta.id}_report.txt - echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt - echo "Date: \$(date)" >> ${meta.id}_report.txt - """ + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: \${USER}" >> ${meta.id}_report.txt + echo "Hostname: \$(hostname)" >> ${meta.id}_report.txt + echo "Date: \$(date)" >> ${meta.id}_report.txt + """ ``` -=== "Before - Broken" +=== "Before" ```groovy title="modules/generate_report.nf" linenums="10" - script: - """ - echo "Processing ${reads}" > ${meta.id}_report.txt - echo "Sample: ${meta.id}" >> ${meta.id}_report.txt - echo "Processed by: ${USER}" >> ${meta.id}_report.txt - echo "Hostname: $(hostname)" >> ${meta.id}_report.txt - echo "Date: $(date)" >> ${meta.id}_report.txt - """ + script: + """ + echo "Processing ${reads}" > ${meta.id}_report.txt + echo "Sample: ${meta.id}" >> ${meta.id}_report.txt + echo "Processed by: ${USER}" >> ${meta.id}_report.txt + echo "Hostname: $(hostname)" >> ${meta.id}_report.txt + echo "Date: $(date)" >> ${meta.id}_report.txt + """ ``` Now it works! The backslash (`\`) tells Nextflow "don't interpret this, pass it through to Bash." 
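For quick reference, the interpolation rules from this section fit into one small process. This is an illustrative sketch rather than part of the patches; the process name and the `priority` field are assumptions:

```groovy title="Illustrative sketch: who resolves which variable"
process INTERPOLATION_DEMO {
    input:
    tuple val(meta), path(reads)

    output:
    path "${meta.id}_demo.txt"

    script:
    // Resolved by Nextflow before the script is handed to Bash
    def label = meta.priority ?: 'normal'
    """
    echo "Sample: ${meta.id} (priority: ${label})" > ${meta.id}_demo.txt
    # Escaped with a backslash, so Bash resolves these at run time
    echo "Run by: \${USER} on \$(hostname) at \$(date)" >> ${meta.id}_demo.txt
    """
}
```

Anything written as plain `${...}` is substituted by Nextflow; anything escaped as `\${...}` or `\$(...)` reaches the Bash script unchanged.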
From 0ef30c636fb7d18aef182791f9035e5a90bee1e4 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 10:45:21 +0100 Subject: [PATCH 35/48] Fix highlights --- docs/side_quests/groovy_essentials.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 6ac3689722..3985556d53 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -760,7 +760,7 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS === "After" - ```groovy title="main.nf" linenums="25" hl_lines="7-26" + ```groovy title="main.nf" linenums="25" hl_lines="27" workflow { ch_samples = Channel.fromPath("./data/samples.csv") @@ -793,7 +793,7 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS === "Before" - ```groovy title="main.nf" linenums="25" hl_lines="6-25" + ```groovy title="main.nf" linenums="25" hl_lines="26" workflow { ch_samples = Channel.fromPath("./data/samples.csv") @@ -987,7 +987,7 @@ Include the process in your `main.nf` and add it to the workflow: === "After" - ```groovy title="main.nf" linenums="1" hl_lines="2 12-31" + ```groovy title="main.nf" linenums="1" hl_lines="2 30" include { FASTP } from './modules/fastp.nf' include { GENERATE_REPORT } from './modules/generate_report.nf' From fa8c04db9fe4cc668eec17b082fdc2152a641d1e Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 11:01:41 +0100 Subject: [PATCH 36/48] Fix tiny issues --- docs/side_quests/groovy_essentials.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index 3985556d53..e2f0038440 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -1280,8 +1280,8 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al process FASTP { container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' - cpus { meta.depth > 40000000 ? 4 : 2 } - memory '2 GB' + cpus { meta.depth > 40000000 ? 2 : 1 } + memory 2.GB input: tuple val(meta), path(reads) @@ -1297,7 +1297,7 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al tuple val(meta), path(reads) ``` -The closure `{ meta.depth > 40000000 ? 4 : 2 }` uses the **Groovy ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 4 CPUs, while others get 2 CPUs. +The closure `{ meta.depth > 40000000 ? 2 : 1 }` uses the **Groovy ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 2 CPUs, while others get 1 CPU. !!! note "Accessing Input Variables in Directives" @@ -1306,10 +1306,10 @@ The closure `{ meta.depth > 40000000 ? 4 : 2 }` uses the **Groovy ternary operat Run the workflow again: ```bash title="Test resource allocation" -nextflow run main.nf -no-ansi-log +nextflow run main.nf -ansi-log false ``` -We're using the `-no-ansi-log` option to make it easier to see the task hashes. +We're using the `-ansi-log false` option to make it easier to see the task hashes. ```console title="Resource allocation output" N E X T F L O W ~ version 25.04.6 @@ -1345,7 +1345,7 @@ Another powerful pattern is using `task.attempt` for retry strategies. 
To show w container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' cpus { meta.depth > 40000000 ? 4 : 2 } - memory '1 GB' + memory 1.GB input: tuple val(meta), path(reads) @@ -1358,7 +1358,7 @@ Another powerful pattern is using `task.attempt` for retry strategies. To show w container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' cpus { meta.depth > 40000000 ? 4 : 2 } - memory '2 GB' + memory 2.GB input: tuple val(meta), path(reads) @@ -1395,7 +1395,7 @@ This is a very common scenario in real-world workflows - sometimes you just don' container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' cpus { meta.depth > 40000000 ? 4 : 2 } - memory { '1 GB' * task.attempt } + memory { 1.GB * task.attempt } errorStrategy 'retry' maxRetries 2 @@ -1410,7 +1410,7 @@ This is a very common scenario in real-world workflows - sometimes you just don' container 'community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690' cpus { meta.depth > 40000000 ? 4 : 2 } - memory '2 GB' + memory 2.GB input: tuple val(meta), path(reads) @@ -1419,7 +1419,7 @@ This is a very common scenario in real-world workflows - sometimes you just don' Now if the process fails due to insufficient memory, Nextflow will retry with more memory: - First attempt: 1 GB (task.attempt = 1) -- Second attempt: 2 GB (task.attempt = 2) +- Second attempt: 2.GB (task.attempt = 2) ... and so on, up to the `maxRetries` limit. @@ -1953,7 +1953,7 @@ This is a Groovy closure being assigned to `workflow.onComplete`. Inside, you ha Run your workflow and you'll see this summary appear at the end! ```bash title="Run with onComplete handler" -nextflow run main.nf --input ./data/samples.csv -no-ansi-log +nextflow run main.nf --input ./data/samples.csv -ansi-log false ``` ```console title="onComplete output" From b9cd15a78a069fc5cf095935216ff12d2c78bafb Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 11:39:53 +0100 Subject: [PATCH 37/48] More Groovy fixes --- docs/side_quests/groovy_essentials.md | 80 ++++++++++++++------------- 1 file changed, 43 insertions(+), 37 deletions(-) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/groovy_essentials.md index e2f0038440..6b1b9e26cc 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/groovy_essentials.md @@ -328,7 +328,7 @@ nextflow run main.nf You should see output showing both the full metadata displayed by the `view()` operation and the extracted subset we printed with `println`: ```console title="SubMap results" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [peaceful_cori] DSL2 - revision: 4cc4a8340f @@ -453,7 +453,7 @@ nextflow run collect.nf ``` ```console title="Different collect behaviors" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `collect.nf` [loving_mendel] DSL2 - revision: e8d054a46e @@ -509,7 +509,7 @@ nextflow run collect.nf ``` ```console title="Groovy collect results" hl_lines="5" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 @@ -580,7 +580,7 @@ nextflow run collect.nf You should see output like: ```console title="Spread operator output" hl_lines="6" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `collect.nf` [cranky_galileo] DSL2 - revision: 5f3c8b2a91 @@ -709,7 +709,7 @@ nextflow run main.nf You should see output with metadata enriched from the file names, like 
```console title="Metadata with file parsing" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 @@ -915,7 +915,7 @@ nextflow run main.nf ``` ```console title="Successful run" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [adoring_rosalind] DSL2 - revision: 04b1cd93e9 @@ -1244,7 +1244,7 @@ nextflow run main.nf ``` ```console title="Function results" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [admiring_panini] DSL2 - revision: 8cc832e32f @@ -1312,7 +1312,7 @@ nextflow run main.nf -ansi-log false We're using the `-ansi-log false` option to make it easier to see the task hashes. ```console title="Resource allocation output" -N E X T F L O W ~ version 25.04.6 +N E X T F L O W ~ version 25.04.3 Launching `main.nf` [fervent_albattani] DSL2 - revision: fa8f249759 [bd/ff3d41] Submitted process > FASTP (2) [a4/a3aab2] Submitted process > FASTP (1) @@ -1501,7 +1501,7 @@ nextflow run main.nf ``` ```console title="Conditional trimming results" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [adoring_galileo] DSL2 - revision: c9e83aaef1 @@ -1569,7 +1569,7 @@ nextflow run main.nf Because we've chosen a filter that excludes some samples, you should see fewer tasks executed: ```console title="Filtered samples results" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [deadly_woese] DSL2 - revision: 9a6044a969 @@ -1647,7 +1647,7 @@ nextflow run main.nf It crashes with a NullPointerException: ```console title="Null pointer error" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [trusting_torvalds] DSL2 - revision: b56fbfbce2 @@ -1844,7 +1844,7 @@ nextflow run main.nf The workflow stops immediately with a clear error message instead of failing mysteriously later! ```console title="Validation error output" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [confident_coulomb] DSL2 - revision: 07059399ed @@ -1852,6 +1852,30 @@ WARN: Access to undefined parameter `input` -- Initialise it to a default value Input CSV file path not provided. Please specify --input ``` +Now run it with a non-existent file: + +```bash +nextflow run main.nf --input ./data/nonexistent.csv +``` + +Observe the error: + +```console title="File not found error output" + N E X T F L O W ~ version 25.04.3 + +Launching `main.nf` [cranky_gates] DSL2 - revision: 26839ae3eb + +Input CSV file not found: ./data/nonexistent.csv +``` + +Finally, run it with the correct file: + +```bash +nextflow run main.nf --input ./data/samples.csv +``` + +This time it runs successfully. + You can also add validation within the `separateMetadata` function. Let's use the non-fatal `log.warn` to issue warnings for samples with low sequencing depth, but still allow the workflow to continue: === "After" @@ -1886,7 +1910,7 @@ nextflow run main.nf --input ./data/samples.csv ... 
and you'll see a warning about low sequencing depth for one of the samples: ```console title="Warning output" - N E X T F L O W ~ version 25.04.6 + N E X T F L O W ~ version 25.04.3 Launching `main.nf` [awesome_goldwasser] DSL2 - revision: a31662a7c1 @@ -1957,7 +1981,7 @@ nextflow run main.nf --input ./data/samples.csv -ansi-log false ``` ```console title="onComplete output" -N E X T F L O W ~ version 25.04.6 +N E X T F L O W ~ version 25.04.3 Launching `main.nf` [marvelous_boltzmann] DSL2 - revision: a31662a7c1 WARN: Low sequencing depth for sample_002: 25000000 [9b/d48e40] Submitted process > FASTP (2) @@ -1979,7 +2003,7 @@ Let's make it more useful by adding conditional logic: === "After" - ```groovy title="nextflow.config" linenums="5" hl_lines="11-18" + ```groovy title="nextflow.config" linenums="5" hl_lines="12-18" workflow.onComplete = { println "" println "Pipeline execution summary:" @@ -2020,7 +2044,7 @@ Let's make it more useful by adding conditional logic: Now we get an even more informative summary, including a success/failure message and the output directory if specified: ```console title="Enhanced onComplete output" -N E X T F L O W ~ version 25.04.6 +N E X T F L O W ~ version 25.04.3 Launching `main.nf` [boring_linnaeus] DSL2 - revision: a31662a7c1 WARN: Low sequencing depth for sample_002: 25000000 [e5/242efc] Submitted process > FASTP (2) @@ -2056,7 +2080,7 @@ workflow.onComplete = { println summary // Write to a log file - def log_file = file("${workflow.launchDir}/pipeline_summary.txt") + def log_file = new File("${workflow.launchDir}/pipeline_summary.txt") log_file.text = summary } ``` @@ -2065,19 +2089,6 @@ workflow.onComplete = { Besides `onComplete`, there are other event handlers you can use: -**`onStart`** - Runs when the workflow begins: - -```groovy title="nextflow.config - onStart handler" -workflow.onStart = { - println "="* 50 - println "Starting pipeline: ${workflow.runName}" - println "Project directory: ${workflow.projectDir}" - println "Launch directory: ${workflow.launchDir}" - println "Work directory: ${workflow.workDir}" - println "="* 50 -} -``` - **`onError`** - Runs only if the workflow fails: ```groovy title="nextflow.config - onError handler" @@ -2088,7 +2099,7 @@ workflow.onError = { println "="* 50 // Write detailed error log - def error_file = file("${workflow.launchDir}/error.log") + def error_file = new File("${workflow.launchDir}/error.log") error_file.text = """ Workflow Error Report ===================== @@ -2104,10 +2115,6 @@ workflow.onError = { You can use multiple handlers together: ```groovy title="nextflow.config - Combined handlers" -workflow.onStart = { - println "Starting ${workflow.runName} at ${workflow.start}" -} - workflow.onError = { println "Workflow failed: ${workflow.errorMessage}" } @@ -2129,7 +2136,6 @@ In this section, you've learned: - **Event handler closures**: Groovy closures in `nextflow.config` that run at different lifecycle points - **`onComplete` handler**: For execution summaries and result reporting -- **`onStart` handler**: For logging pipeline initialization - **`onError` handler**: For error handling and logging failures - **Workflow object properties**: Accessing `workflow.success`, `workflow.duration`, `workflow.errorMessage`, etc. @@ -2157,7 +2163,7 @@ Here's how we progressively enhanced our pipeline: 7. **Validation with error() and log.warn**: You learned to validate inputs early and fail fast with clear error messages. -8. 
**Groovy in Configuration**: You learned to use workflow event handlers (`onComplete`, `onStart`, `onError`) for logging, notifications, and lifecycle management. +8. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete` and `onError`) for logging, notifications, and lifecycle management. ### Key Benefits From b8e0c80938046542fe07234680ddebea91241f85 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 11:41:21 +0100 Subject: [PATCH 38/48] update time estimate --- docs/side_quests/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 2e4537e1b4..ad6048571d 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -31,7 +31,7 @@ Otherwise, select a side quest from the table below. | Side Quest | Time Estimate for Teaching | | ----------------------------------------------------------------- | -------------------------- | | [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | -| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 2 hours | +| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 90 mins | | [Introduction to nf-core](./nf-core.md) | - | | [Metadata in workflows](./metadata.md) | 45 mins | | [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | From 3b5f11a18cc4e7bd49b67f3992c2f961474b7241 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 12:38:58 +0100 Subject: [PATCH 39/48] Refactor Groovy Essentials to Essential Scripting Patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address reviewer feedback to reframe content away from "Nextflow vs Groovy" and toward presenting Nextflow as a standalone language. 
Major changes: - Renamed from groovy_essentials to essential_scripting_patterns - Updated terminology: "dataflow vs scripting" instead of "Nextflow vs Groovy" - Reframed collect example as Channel vs Iterable types in Nextflow standard library - Added historical context about Groovy while emphasizing Nextflow language specification - Updated all code comments and examples to reference Nextflow standard library - Prioritized Nextflow documentation in resources section 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- ...als.md => essential_scripting_patterns.md} | 279 +++++++++--------- docs/side_quests/index.md | 2 +- mkdocs.yml | 2 +- .../collect.nf | 4 +- .../data/samples.csv | 0 .../sequences/SAMPLE_001_S1_L001_R1_001.fastq | 0 .../sequences/SAMPLE_002_S2_L001_R1_001.fastq | 0 .../sequences/SAMPLE_003_S3_L001_R1_001.fastq | 0 .../main.nf | 0 .../modules/fastp.nf | 0 .../modules/generate_report.nf | 0 .../modules/trimgalore.nf | 0 .../nextflow.config | 0 .../collect.nf | 8 +- .../main.nf | 0 .../modules/fastp.nf | 0 .../modules/generate_report.nf | 0 .../modules/trimgalore.nf | 0 .../nextflow.config | 0 19 files changed, 148 insertions(+), 147 deletions(-) rename docs/side_quests/{groovy_essentials.md => essential_scripting_patterns.md} (81%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/collect.nf (53%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/data/samples.csv (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/main.nf (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/modules/fastp.nf (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/modules/generate_report.nf (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/modules/trimgalore.nf (100%) rename side-quests/{groovy_essentials => essential_scripting_patterns}/nextflow.config (100%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/collect.nf (59%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/main.nf (100%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/modules/fastp.nf (100%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/modules/generate_report.nf (100%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/modules/trimgalore.nf (100%) rename side-quests/solutions/{groovy_essentials => essential_scripting_patterns}/nextflow.config (100%) diff --git a/docs/side_quests/groovy_essentials.md b/docs/side_quests/essential_scripting_patterns.md similarity index 81% rename from docs/side_quests/groovy_essentials.md rename to docs/side_quests/essential_scripting_patterns.md index 6b1b9e26cc..70f898743e 100644 --- a/docs/side_quests/groovy_essentials.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -1,13 +1,15 @@ -# Groovy Essentials for Nextflow Developers +# Essential Programming Patterns for Nextflow Developers -Nextflow is built on Groovy, a powerful dynamic language that runs on the Java Virtual 
Machine. You can write a lot of Nextflow without ever feeling like you've learned Groovy - many workflows use only basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. +Nextflow is a programming language that runs on the Java Virtual Machine. While Nextflow was originally built on Groovy and shares much of its syntax, the [Nextflow language specification](https://nextflow.io/docs/latest/reference/index.html) defines Nextflow as a standalone language—making it easier to understand Nextflow on its own terms rather than as "Groovy with extensions." -However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, you're writing Groovy code - and knowing a few key Groovy concepts can dramatically improve your ability to solve real-world problems efficiently. Understanding where Nextflow ends and Groovy begins helps you write clearer, more maintainable workflows. +You can write a lot of Nextflow without venturing beyond basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. + +However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, it helps to think about two distinct aspects of your code: **dataflow** (channels, operators, processes, and workflows—the constructs that control how data moves through your pipeline) and **scripting** (the code you write inside closures, functions, and process scripts to manipulate data, generate commands, allocate resources, and more). While this distinction is somewhat arbitrary—it's all Nextflow code—it provides a useful mental model for understanding when you're orchestrating your pipeline versus when you're using the language's programming features. Mastering both aspects dramatically improves your ability to solve real-world problems efficiently and write clearer, more maintainable workflows. This side quest takes you on a hands-on journey from basic concepts to production-ready patterns. 
We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline, evolving it step-by-step through realistic challenges: -- **Understanding boundaries:** Distinguish between Nextflow operators and Groovy methods, and master when to use each -- **Data manipulation:** Extract, transform, and subset maps and collections using Groovy's powerful operators +- **Understanding boundaries:** Distinguish between dataflow operations and scripting, and understand how they work together +- **Data manipulation:** Extract, transform, and subset maps and collections using powerful operators - **String processing:** Parse complex file naming schemes with regex patterns and master variable interpolation - **Reusable functions:** Extract complex logic into named functions for cleaner, more maintainable workflows - **Dynamic logic:** Build processes that adapt to different input types and use closures for dynamic resource allocation @@ -34,7 +36,7 @@ This tutorial will explain Groovy concepts as we encounter them, so you don't ne Navigate to the project directory: ```bash title="Navigate to project directory" -cd side-quests/groovy_essentials +cd side-quests/essential_scripting_patterns ``` The `data` directory contains sample files and a main workflow file we'll evolve throughout. @@ -68,15 +70,15 @@ SAMPLE_002,mouse,brain,25000000,data/sequences/SAMPLE_002_S2_L001_R1_001.fastq,3 SAMPLE_003,human,kidney,45000000,data/sequences/SAMPLE_003_S3_L001_R1_001.fastq,42.1 ``` -We'll use this realistic dataset to explore practical Groovy techniques that you'll encounter in real bioinformatics workflows. +We'll use this realistic dataset to explore practical programming techniques that you'll encounter in real bioinformatics workflows. --- -## 1. Nextflow vs Groovy: Understanding the Boundaries +## 1. Dataflow vs Scripting: Understanding the Boundaries ### 1.1. Identifying What's What -Nextflow developers often confuse Nextflow constructs with Groovy language features. Let's build a workflow demonstrating how they work together. +When writing Nextflow workflows, it's important to distinguish between **dataflow** (how data moves through channels and processes) and **scripting** (the code that manipulates data and makes decisions). Let's build a workflow demonstrating how they work together. #### Step 1: Basic Nextflow Workflow @@ -110,11 +112,11 @@ Launching `main.nf` [marvelous_tuckerman] DSL2 - revision: 6113e05c17 #### Step 2: Adding the Map Operator -Now we're going to use some Groovy code to transform the data, using the `.map()` operator you will probably already be familiar with. This operator takes a 'closure' where we can write Groovy code to transform each item. +Now we're going to add scripting to transform the data, using the `.map()` operator you will probably already be familiar with. This operator takes a 'closure' where we can write code to transform each item. !!! note - A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. In Groovy, closures are written with curly braces `{ }` and can take parameters. They're fundamental to how Nextflow operators work and if you've been writing Nextflow for a while, you may already have been using them without realizing it! + A **closure** is a block of code that can be passed around and executed later. Think of it as a function that you define inline. Closures are written with curly braces `{ }` and can take parameters. 
They're fundamental to how Nextflow operators work and if you've been writing Nextflow for a while, you may already have been using them without realizing it! Here's what that map operation looks like: @@ -137,11 +139,11 @@ Here's what that map operation looks like: .view() ``` -This is our first **Groovy closure**—an anonymous function you can pass as an argument. Closures are a core Groovy concept (similar to lambdas in Python or arrow functions in JavaScript) and are essential for working with Nextflow operators. +This is our first **closure**—an anonymous function you can pass as an argument (similar to lambdas in Python or arrow functions in JavaScript). Closures are essential for working with Nextflow operators. The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). You can also use the implicit variable `it` instead: `.map { return it }`, though naming parameters improves clarity. -When Nextflow processes each channel item, it passes that item to your closure. Here, `row` holds one CSV row at a time. +When the `.map()` operator processes each channel item, it passes that item to your closure. Here, `row` holds one CSV row at a time. Apply this change and run the workflow: @@ -153,7 +155,7 @@ You'll see the same output as before, because we're simply returning the input u #### Step 3: Creating a Map Data Structure -Now we're going to write **pure Groovy code** inside our closure. Everything from this point forward in this section is Groovy syntax and methods, not Nextflow operators. +Now we're going to write **scripting** inside our closure to transform each row of data. This is where we process individual data items rather than orchestrating data flow. === "After" @@ -161,7 +163,7 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> - // This is all Groovy code now! + // Scripting for data transformation def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -185,9 +187,9 @@ Now we're going to write **pure Groovy code** inside our closure. Everything fro .view() ``` -This is pure Groovy code. The `sample_meta` map is a key-value data structure (like dictionaries in Python, objects in JavaScript, or hashes in Ruby) storing related information: sample ID, organism, tissue type, sequencing depth, and quality score. +The `sample_meta` map is a key-value data structure (like dictionaries in Python, objects in JavaScript, or hashes in Ruby) storing related information: sample ID, organism, tissue type, sequencing depth, and quality score. -We use Groovy's string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. +We use string manipulation methods like `.toLowerCase()` and `.replaceAll()` to clean up our data, and type conversion methods like `.toInteger()` and `.toDouble()` to convert string data from the CSV into the appropriate numeric types. Apply this change and run the workflow: @@ -205,7 +207,7 @@ You should see the refined map output like: #### Step 4: Adding Conditional Logic -Now let's add more Groovy logic - this time using a ternary operator to make decisions based on data values. +Now let's add more scripting - this time using a ternary operator to make decisions based on data values. 
Make the following change: @@ -272,7 +274,7 @@ We've successfully added conditional logic to enrich our metadata with a priorit #### Step 4.5: Subsetting Maps with `.subMap()` -While the `+` operator adds keys to a map, sometimes you need to do the opposite - extract only specific keys. Groovy's `.subMap()` method is perfect for this. +While the `+` operator adds keys to a map, sometimes you need to do the opposite - extract only specific keys. The `.subMap()` method is perfect for this. Let's add a line to create a simplified version of our metadata that only contains identification fields: @@ -282,7 +284,7 @@ Let's add a line to create a simplified version of our metadata that only contai ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> - // This is all Groovy code now! + // Scripting for data transformation def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -305,7 +307,7 @@ Let's add a line to create a simplified version of our metadata that only contai ch_samples = Channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> - // This is all Groovy code now! + // Scripting for data transformation def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -405,9 +407,9 @@ nextflow run main.nf You should see output like: ```console title="Complete workflow output" -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, priority:high], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` This `[meta, file]` tuple structure is a common pattern in Nextflow for passing both metadata and associated files to processes. @@ -416,32 +418,32 @@ This `[meta, file]` tuple structure is a common pattern in Nextflow for passing **Maps and Metadata**: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the [Working with metadata](./metadata.md) side quest. -Our workflow demonstrates the core pattern: **Nextflow constructs** (`workflow`, `Channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate data flow, while **basic Groovy constructs** (maps `[key: value]`, string methods, type conversions, ternary operators) handle the data processing logic inside the `.map()` closure. 
+Our workflow demonstrates the core pattern: **dataflow operations** (`workflow`, `Channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate how data moves through the pipeline, while **scripting** (maps `[key: value]`, string methods, type conversions, ternary operators) inside the `.map()` closure handles the transformation of individual data items. -### 1.2. Distinguishing Nextflow operators from Groovy functions +### 1.2. Understanding Different Types: Channel vs Iterable -So far, so good, we can distinguish between Nextflow constructs and basic Groovy constructs. But what about when the syntax overlaps? +So far, so good, we can distinguish between dataflow operations and scripting. But what about when the same method name exists in both contexts? -A perfect example of this confusion is the `collect` operation, which exists in both contexts but does completely different things. Groovy's `collect` transforms each element, while Nextflow's `collect` gathers all channel elements into a single-item channel. +A perfect example is the `collect` method, which exists for both Channel types and Iterable types (like Lists) in the Nextflow standard library. The `collect()` method on an Iterable transforms each element, while the `collect()` operator on a Channel gathers all channel emissions into a single-item channel. -Let's demonstrate this with some sample data, starting by refreshing ourselves on what the Nextflow `collect()` operator does. Check out `collect.nf`: +Let's demonstrate this with some sample data, starting by refreshing ourselves on what the Channel `collect()` operator does. Check out `collect.nf`: ```groovy title="collect.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Nextflow collect() - groups multiple channel emissions into one +// Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() -ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } +ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } ``` Steps: -- Define a Groovy list +- Define a List of sample IDs - Create a channel with `fromList()` that emits each sample ID separately - Print each item with `view()` as it flows through -- Gather all items into a single list with Nextflow's `collect()` operator +- Gather all items into a single list with the Channel's `collect()` operator - Print the collected result (single item containing all sample IDs) with a second `view()` We've changed the structure of the channel, but we haven't changed the data itself. @@ -460,29 +462,29 @@ Launching `collect.nf` [loving_mendel] DSL2 - revision: e8d054a46e Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` `view()` returns an output for every channel emission, so we know that this single output contains all 3 original items grouped into one list. -Now let's see Groovy's `collect` method in action. Modify `collect.nf` to apply Groovy's `collect` method to the original list of sample IDs: +Now let's see the `collect` method on an Iterable type in action. 
Modify `collect.nf` to apply the Iterable's `collect` method to the original list of sample IDs: === "After" ```groovy title="main.nf" linenums="1" hl_lines="9-13" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one + // Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } - // Groovy collect - transforms each element, preserves structure + // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" ``` === "Before" @@ -490,43 +492,43 @@ Now let's see Groovy's `collect` method in action. Modify `collect.nf` to apply ```groovy title="main.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one + // Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } ``` In this new snippet we: -- Define a new variable `formatted_ids` that uses Groovy's `collect` method to transform each sample ID in the original list +- Define a new variable `formatted_ids` that uses the List's `collect` method to transform each sample ID in the original list - Print the result using `println` Run the modified workflow: -```bash title="Test Groovy collect" +```bash title="Test Iterable collect" nextflow run collect.nf ``` -```console title="Groovy collect results" hl_lines="5" +```console title="Iterable collect results" hl_lines="5" N E X T F L O W ~ version 25.04.3 Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 -Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +Iterable.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` -This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using Groovy's `collect` method to produce a new list with modified values. This is sort of like using the `map` operator in Nextflow, but it's pure Groovy code operating on a standard Groovy list. +This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using the Iterable's `collect` method to produce a new list with modified values. 
This is similar to using the `map` operator on a Channel, but it's operating on a List data structure rather than a channel. -`collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows always distinguish between **Groovy constructs** (data structures) and **Nextflow constructs** (channels/workflows). Operations can share names but behave completely differently. +`collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows, always distinguish between **data structures** (Lists, Maps, etc.) and **channels** (dataflow constructs). Operations can share names but behave completely differently depending on the type they're called on. ### 1.3. The Spread Operator (`*.`) - Shorthand for Property Extraction -Related to Groovy's `collect` is the spread operator (`*.`), which provides a concise way to extract properties from collections. It's essentially syntactic sugar for a common `collect` pattern. +Related to the Iterable's `collect` method is the spread operator (`*.`), which provides a concise way to extract properties from collections. It's essentially syntactic sugar for a common `collect` pattern. Let's add a demonstration to our `collect.nf` file: @@ -535,17 +537,17 @@ Let's add a demonstration to our `collect.nf` file: ```groovy title="collect.nf" linenums="1" hl_lines="15-18" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one + // Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } - // Groovy collect - transforms each element, preserves structure + // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" // Spread operator - concise property access def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] @@ -558,17 +560,17 @@ Let's add a demonstration to our `collect.nf` file: ```groovy title="collect.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Nextflow collect() - groups multiple channel emissions into one + // Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() - ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } - // Groovy collect - transforms each element, preserves structure + // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println 
"Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" ``` Run the updated workflow: @@ -584,12 +586,12 @@ You should see output like: Launching `collect.nf` [cranky_galileo] DSL2 - revision: 5f3c8b2a91 -Groovy collect result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +Iterable.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) Spread operator result: [s1, s2, s3] Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Nextflow collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` The spread operator `*.` is a shorthand for a common collect pattern: @@ -606,7 +608,7 @@ def names = files.collect { it.getName() } The spread operator is particularly useful when you need to extract a single property from a list of objects - it's more readable than writing out the full `collect` closure. -!!! tip "When to Use Groovy's Spread vs Collect" +!!! tip "When to Use Spread vs Collect" - **Use spread (`*.`)** for simple property access: `samples*.id`, `files*.name` - **Use collect** for transformations or complex logic: `samples.collect { it.id.toUpperCase() }`, `samples.collect { [it.id, it.quality > 40] }` @@ -615,25 +617,25 @@ The spread operator is particularly useful when you need to extract a single pro In this section, you've learned: -- **It takes both Nextflow and Groovy**: Nextflow provides the workflow structure and data flow, while Groovy provides the data manipulation and logic -- **Distinguishing Nextflow from Groovy**: How to identify which language construct you're using given the context -- **Context matters**: The same operation name can have completely different behaviors +- **Dataflow vs scripting**: Channel operators orchestrate how data flows through your pipeline, while scripting transforms individual data items +- **Understanding types**: The same method name (like `collect`) can behave differently depending on the type it's called on (Channel vs Iterable) +- **Context matters**: Always be aware of whether you're working with channels (dataflow) or data structures (scripting) Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows. -Next we'll dive deeper into Groovy's powerful string processing capabilities, which are essential for handling real-world data. +Next we'll dive deeper into string processing capabilities, which are essential for handling real-world data. --- ## 2. String Processing and Dynamic Script Generation -Mastering Groovy's string processing separates brittle workflows from robust pipelines. This section covers parsing complex file names, dynamic script generation, and variable interpolation. +Mastering string processing separates brittle workflows from robust pipelines. This section covers parsing complex file names, dynamic script generation, and variable interpolation. ### 2.1. Pattern Matching and Regular Expressions -Bioinformatics files often have complex naming conventions encoding metadata. Let's extract this automatically with Groovy's pattern matching. +Bioinformatics files often have complex naming conventions encoding metadata. Let's extract this automatically using pattern matching with regular expressions. 
-We're going to return to our `main.nf` workflow and add some pattern matching logic to extract additional sample information from file names. The FASTQ files in our dataset follow Illumina-style naming conventions with names like `SAMPLE_001_S1_L001_R1_001.fastq.gz`. These might look cryptic, but they actually encode useful metadata like sample ID, lane number, and read direction. We're going to use Groovy's regex capabilities to parse these names. +We're going to return to our `main.nf` workflow and add some pattern matching logic to extract additional sample information from file names. The FASTQ files in our dataset follow Illumina-style naming conventions with names like `SAMPLE_001_S1_L001_R1_001.fastq.gz`. These might look cryptic, but they actually encode useful metadata like sample ID, lane number, and read direction. We're going to use regex capabilities to parse these names. Make the following change to your existing `main.nf` workflow: @@ -641,7 +643,7 @@ Make the following change to your existing `main.nf` workflow: ```groovy title="main.nf" linenums="4" hl_lines="10-21" .map { row -> - // This is all Groovy code now! + // Scripting for data transformation def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -668,7 +670,7 @@ Make the following change to your existing `main.nf` workflow: ```groovy title="main.nf" linenums="4" hl_lines="11" .map { row -> - // This is all Groovy code now! + // Scripting for data transformation def sample_meta = [ id: row.sample_id.toLowerCase(), organism: row.organism, @@ -681,7 +683,7 @@ Make the following change to your existing `main.nf` workflow: } ``` -This demonstrates key **Groovy string processing concepts**: +This demonstrates key **string processing concepts**: 1. **Regular expression literals** using `~/pattern/` syntax - this creates a regex pattern without needing to escape backslashes 2. 
**Pattern matching** with the `=~` operator - this attempts to match a string against a regex pattern @@ -713,14 +715,14 @@ You should see output with metadata enriched from the file names, like Launching `main.nf` [clever_pauling] DSL2 - revision: 605d2058b4 -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` ### 2.2. Dynamic Script Generation in Processes -Process script blocks are essentially multi-line strings that get passed to the shell. You can use **Groovy conditional logic** (if/else, ternary operators) to dynamically generate different script strings based on input characteristics. This is essential for handling diverse input types—like single-end vs paired-end sequencing reads—without duplicating process definitions. +Process script blocks are essentially multi-line strings that get passed to the shell. You can use **conditional logic** (if/else, ternary operators) to dynamically generate different script strings based on input characteristics. This is essential for handling diverse input types—like single-end vs paired-end sequencing reads—without duplicating process definitions. Let's add a process to our workflow that demonstrates this pattern. Open `modules/fastp.nf` and take a look: @@ -858,7 +860,7 @@ Command output: You can see that the process is trying to run `fastp` with a `null` value for the second input file, which is causing it to fail. This is because our dataset contains single-end reads, but the process is hardcoded to expect paired-end reads (two input files at a time). -Fix this by adding Groovy logic to the `FASTP` process `script:` block. An if/else statement checks read file count and adjusts the command accordingly. +Fix this by adding conditional logic to the `FASTP` process `script:` block. An if/else statement checks read file count and adjusts the command accordingly. === "After" @@ -908,7 +910,7 @@ Fix this by adding Groovy logic to the `FASTP` process `script:` block. An if/el } ``` -Now the workflow can handle both single-end and paired-end reads gracefully. The Groovy logic checks the number of input files and constructs the appropriate command for `fastp`. 
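For comparison only, here is a sketch of the same branching written with a ternary expression instead of if/else. This is not the code used in this tutorial — the `reads instanceof List` check and the output file names are illustrative assumptions — but it shows how compact the pattern can become when the two cases differ only in a handful of arguments:

```groovy title="Compact ternary sketch (illustrative)"
script:
// Assumes the same inputs as above: a meta map and a 'reads' path input.
// A multi-file path input typically arrives as a list; a single file as a lone path.
def paired      = reads instanceof List && reads.size() == 2
def input_args  = paired ? "--in1 ${reads[0]} --in2 ${reads[1]}" : "--in1 ${reads}"
def output_args = paired ? "--out1 ${meta.id}_R1_trimmed.fastq --out2 ${meta.id}_R2_trimmed.fastq" : "--out1 ${meta.id}_trimmed.fastq"
"""
fastp ${input_args} ${output_args} --json ${meta.id}.fastp.json
"""
```

Either style works; the if/else version above is what the workflow keeps, so that is what we will run next.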
Let's see if it works: +Now the workflow can handle both single-end and paired-end reads gracefully. The conditional logic checks the number of input files and constructs the appropriate command for `fastp`. Let's see if it works: ```bash title="Test dynamic fastp" nextflow run main.nf @@ -941,7 +943,7 @@ fastp \ --thread 2 ``` -Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses Groovy logic to transform a collection of input files (`all_gvcfs`) into the correct command arguments: +Another common usage of dynamic script logic can be seen in [the Nextflow for Science Genomics module](../../nf4science/genomics/02_joint_calling). In that module, the GATK process being called can take multiple input files, but each must be prefixed with `-V` to form a correct command line. The process uses scripting to transform a collection of input files (`all_gvcfs`) into the correct command arguments: ```groovy title="command line manipulation for GATK" linenums="1" hl_lines="2 5" script: @@ -954,9 +956,9 @@ Another common usage of dynamic script logic can be seen in [the Nextflow for Sc """ ``` -These patterns of using Groovy logic in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. +These patterns of using scripting in process script blocks are extremely powerful and can be applied in many scenarios - from handling variable input types to building complex command-line arguments from file collections, making your processes truly adaptable to the diverse requirements of real-world data. -### 2.3. Variable Interpolation: Groovy and Shell Variables +### 2.3. Variable Interpolation: Nextflow and Shell Variables Process scripts mix Nextflow variables, shell variables, and command substitutions, each with different interpolation syntax. Using the wrong syntax causes errors. Let's explore these with a process that creates a processing report. @@ -1087,7 +1089,7 @@ If you run this, you'll notice an error or unexpected behavior - Nextflow tries ```console title="Error with shell variables" unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException ERROR ~ Module compilation error -- file : /workspaces/training/side-quests/groovy_essentials/modules/generate_report.nf +- file : /workspaces/training/side-quests/essential_scripting_patterns/modules/generate_report.nf - cause: token recognition error at: '(' @ line 16, column 22. echo "Hostname: $(hostname)" >> ${meta.id}_report.txt ^ @@ -1129,12 +1131,12 @@ Now it works! 
The backslash (`\`) tells Nextflow "don't interpret this, pass it ### Takeaway -In this section, you've learned **Groovy string processing** techniques: +In this section, you've learned **string processing** techniques: -- **Regular expressions for file parsing**: Using Groovy's `=~` operator and regex patterns (`~/pattern/`) to extract metadata from complex file naming conventions -- **Dynamic script generation**: Using Groovy conditional logic (if/else, ternary operators) to generate different script strings based on input characteristics +- **Regular expressions for file parsing**: Using the `=~` operator and regex patterns (`~/pattern/`) to extract metadata from complex file naming conventions +- **Dynamic script generation**: Using conditional logic (if/else, ternary operators) to generate different script strings based on input characteristics - **Variable interpolation**: Understanding when Nextflow interprets strings vs when the shell does - - `${var}` - Groovy/Nextflow variables (interpolated by Nextflow at workflow compile time) + - `${var}` - Nextflow variables (interpolated by Nextflow at workflow compile time) - `\${var}` - Shell environment variables (escaped, passed to bash at runtime) - `\$(cmd)` - Shell command substitution (escaped, executed by bash at runtime) @@ -1144,9 +1146,9 @@ These string processing and generation patterns are essential for handling the d ## 3. Creating Reusable Functions -Complex workflow logic inline in channel operators or process definitions reduces readability and maintainability. **Groovy functions** let you extract this logic into named, reusable components—this is core Groovy programming, not Nextflow-specific syntax. +Complex workflow logic inline in channel operators or process definitions reduces readability and maintainability. **Functions** let you extract this logic into named, reusable components. -Our map operation has grown long and complex. Let's extract it into a reusable Groovy function using the `def` keyword. +Our map operation has grown long and complex. Let's extract it into a reusable function using the `def` keyword. To illustrate what that looks like with our existing workflow, make the modification below, using `def` to define a reusable function called `separateMetadata`: @@ -1257,20 +1259,20 @@ The output should show both processes completing successfully. The workflow is n ### Takeaway -In this section, you've learned core **Groovy programming concepts**: +In this section, you've learned **function creation**: -- **Defining functions with `def`**: Groovy's keyword for creating named functions (like `def` in Python or `function` in JavaScript) +- **Defining functions with `def`**: The keyword for creating named functions (like `def` in Python or `function` in JavaScript) - **Function scope**: Functions defined at the script level are accessible throughout your Nextflow workflow - **Return values**: Functions automatically return the last expression, or use explicit `return` -- **Cleaner code**: Extracting complex logic into functions is a fundamental software engineering practice in any language, including Groovy +- **Cleaner code**: Extracting complex logic into functions is a fundamental software engineering practice in any language -Next, we'll explore how to use Groovy closures in process directives for dynamic resource allocation. +Next, we'll explore how to use closures in process directives for dynamic resource allocation. --- ## 4. 
Dynamic Resource Directives with Closures -So far we've used Groovy in the `script` block of processes. But **Groovy closures** (introduced in Section 1.1) are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. +So far we've used scripting in the `script` block of processes. But **closures** (introduced in Section 1.1) are also incredibly useful in process directives, especially for dynamic resource allocation. Let's add resource directives to our FASTP process that adapt based on the sample characteristics. Currently, our FASTP process uses default resources. Let's make it smarter by allocating more CPUs for high-depth samples. Edit `modules/fastp.nf` to include a dynamic `cpus` directive and a static `memory` directive: @@ -1297,7 +1299,7 @@ Currently, our FASTP process uses default resources. Let's make it smarter by al tuple val(meta), path(reads) ``` -The closure `{ meta.depth > 40000000 ? 2 : 1 }` uses the **Groovy ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 2 CPUs, while others get 1 CPU. +The closure `{ meta.depth > 40000000 ? 2 : 1 }` uses the **ternary operator** (covered in Section 1.1) and is evaluated for each task, allowing per-sample resource allocation. High-depth samples (>40M reads) get 2 CPUs, while others get 1 CPU. !!! note "Accessing Input Variables in Directives" @@ -1331,7 +1333,7 @@ cat work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.run | grep "docker run" You should see something like: ```bash title="docker command" - docker run -i --cpu-shares 4096 --memory 2048m -e "NXF_TASK_WORKDIR" -v /workspaces/training/side-quests/groovy_essentials:/workspaces/training/side-quests/groovy_essentials -w "$NXF_TASK_WORKDIR" --name $NXF_BOXID community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690 /bin/bash -ue /workspaces/training/side-quests/groovy_essentials/work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.sh + docker run -i --cpu-shares 4096 --memory 2048m -e "NXF_TASK_WORKDIR" -v /workspaces/training/side-quests/essential_scripting_patterns:/workspaces/training/side-quests/essential_scripting_patterns -w "$NXF_TASK_WORKDIR" --name $NXF_BOXID community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690 /bin/bash -ue /workspaces/training/side-quests/essential_scripting_patterns/work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.sh ``` In this example we've chosen an example that requested 4 CPUs (`--cpu-shares 4096`), because it was a high-depth sample, but you should see different CPU allocations depending on the sample depth. Try this for the other tasks as well. @@ -1425,12 +1427,12 @@ Now if the process fails due to insufficient memory, Nextflow will retry with mo ### Takeaway -Dynamic directives with Groovy closures let you: +Dynamic directives with closures let you: - Allocate resources based on input characteristics - Implement automatic retry strategies with increasing resources - Combine multiple factors (metadata, attempt number, priorities) -- Use Groovy logic for complex resource calculations +- Use conditional logic for complex resource calculations This makes your workflows both more efficient (not over-allocating) and more robust (automatic retry with more resources). @@ -1438,9 +1440,9 @@ This makes your workflows both more efficient (not over-allocating) and more rob ## 5. 
Conditional Logic and Process Control -Previously, we used `.map()` with Groovy to transform channel data. Now we'll use Groovy to control which processes execute based on data—essential for flexible workflows adapting to different sample types. +Previously, we used `.map()` with scripting to transform channel data. Now we'll use conditional logic to control which processes execute based on data—essential for flexible workflows adapting to different sample types. -Nextflow's [flow control operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling Groovy logic to drive workflow decisions based on channel content. +Nextflow's [flow control operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling conditional logic to drive workflow decisions based on channel content. ### 5.1. Routing with `.branch()` @@ -1511,13 +1513,13 @@ executor > local (6) [34/bd5a9f] process > GENERATE_REPORT (1) [100%] 3 of 3 ✔ ``` -Here, we've used small but mighty Groovy expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. +Here, we've used small but mighty conditional expressions inside the `.branch{}` operator to route samples based on their metadata. Human samples with high coverage go through `FASTP`, while all other samples go through `TRIMGALORE`. -### 5.2. Using `.filter()` with Groovy Truth +### 5.2. Using `.filter()` with Truthiness -Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Inside the filter closure, you'll write **Groovy boolean expressions** that decide which items pass through. +Another powerful pattern for controlling workflow execution is the `.filter()` operator, which uses a closure to determine which items should continue down the pipeline. Inside the filter closure, you'll write **boolean expressions** that decide which items pass through. -Groovy has a concept called **"Groovy Truth"** that determines what values evaluate to `true` or `false` in boolean contexts: +Nextflow (like many dynamic languages) has a concept of **"truthiness"** that determines what values evaluate to `true` or `false` in boolean contexts: - **Truthy**: Non-null values, non-empty strings, non-zero numbers, non-empty collections - **Falsy**: `null`, empty strings `""`, zero `0`, empty collections `[]` or `[:]`, `false` @@ -1579,12 +1581,12 @@ executor > local (5) [07/ef53af] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ ``` -The filter expression `meta.id && meta.organism && meta.depth >= 25000000` combines Groovy Truth with explicit comparisons: +The filter expression `meta.id && meta.organism && meta.depth >= 25000000` combines truthiness with explicit comparisons: -- `meta.id && meta.organism` checks that both fields exist and are non-empty (using Groovy Truth) +- `meta.id && meta.organism` checks that both fields exist and are non-empty (using truthiness) - `meta.depth >= 25000000` ensures sufficient sequencing depth with an explicit comparison -!!! note "Groovy Truth in Practice" +!!! 
note "Truthiness in Practice" The expression `meta.id && meta.organism` is more concise than writing: ```groovy @@ -1595,7 +1597,7 @@ The filter expression `meta.id && meta.organism && meta.depth >= 25000000` combi ### Takeaway -In this section, you've learned to use Groovy logic to control workflow execution using the closure interfaces of Nextflow operators like `.branch{}` and `.filter{}`, leveraging Groovy Truth to write concise conditional expressions. +In this section, you've learned to use conditional logic to control workflow execution using the closure interfaces of Nextflow operators like `.branch{}` and `.filter{}`, leveraging truthiness to write concise conditional expressions. Our pipeline now intelligently routes samples through appropriate processes, but production workflows need to handle invalid data gracefully. Let's make our workflow robust against missing or null values. @@ -1763,9 +1765,9 @@ nextflow run main.nf You'll see output like this: ```console title="View output with run field" -[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, run:UNSPECIFIED, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] -[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, run:UNSPECIFIED, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] -[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, run:UNSPECIFIED, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] +[[id:sample_001, organism:human, tissue:liver, depth:30000000, quality:38.5, run:UNSPECIFIED, sample_num:1, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq] +[[id:sample_002, organism:mouse, tissue:brain, depth:25000000, quality:35.2, run:UNSPECIFIED, sample_num:2, lane:001, read:R1, chunk:001, priority:normal], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq] +[[id:sample_003, organism:human, tissue:kidney, depth:45000000, quality:42.1, run:UNSPECIFIED, sample_num:3, lane:001, read:R1, chunk:001, priority:high], /workspaces/training/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq] ``` Perfect! Now all samples have a `run` field with either their actual run ID (in uppercase) or the default value 'UNSPECIFIED'. The combination of `?.` and `?:` provides both safety (no crashes) and sensible defaults. @@ -1774,7 +1776,7 @@ Take out the `.view()` operator now that we've confirmed it works. !!! 
tip "Combining Safe Navigation and Elvis" - The pattern `value?.method() ?: 'default'` is common in production Nextflow: + The pattern `value?.method() ?: 'default'` is common in production workflows: - `value?.method()` - Safely calls method, returns `null` if `value` is `null` - `?: 'default'` - Provides fallback if result is `null` @@ -1789,13 +1791,13 @@ Use these operators consistently in functions, operator closures (`.map{}`, `.fi - **Elvis operator (`?:`)**: Provides defaults - `value ?: 'default'` - **Combining**: `value?.method() ?: 'default'` is the common pattern -These operators make workflows resilient to incomplete data - essential for real-world bioinformatics. +These operators make workflows resilient to incomplete data - essential for real-world work. --- ## 7. Validation with `error()` and `log.warn` -Sometimes you need to stop the workflow immediately if input parameters are invalid. While `error()` and `log.warn` are Nextflow-provided functions, the **validation logic itself is pure Groovy**—using conditionals (`if`, `!`), boolean logic, and methods like `.exists()`. Let's add validation to our workflow. +Sometimes you need to stop the workflow immediately if input parameters are invalid. While `error()` and `log.warn` are Nextflow-provided functions, the **validation logic itself uses standard programming constructs**—conditionals (`if`, `!`), boolean logic, and methods like `.exists()`. Let's add validation to our workflow. Create a validation function before your workflow block, call it from the workflow, and change the channel creation to use a parameter for the CSV file path. If the parameter is missing or the file doesn't exist, call `error()` to stop execution with a clear message. @@ -1932,11 +1934,11 @@ Proper validation makes workflows more robust and user-friendly by catching prob --- -## 8. Groovy in Configuration: Workflow Event Handlers +## 8. Configuration and Workflow Event Handlers -Up until now, we've been writing Groovy code in our workflow scripts and process definitions. But there's one more important place where Groovy is essential: workflow event handlers in your `nextflow.config` file (or other places you write configuration). +Up until now, we've been writing code in our workflow scripts and process definitions. But there's one more important place where you can add logic: workflow event handlers in your `nextflow.config` file (or other places you write configuration). -Event handlers are Groovy closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. +Event handlers are closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. ### 8.1. The `onComplete` Handler @@ -1972,7 +1974,7 @@ Your `nextflow.config` file already has Docker enabled. Add an event handler aft docker.enabled = true ``` -This is a Groovy closure being assigned to `workflow.onComplete`. Inside, you have access to the `workflow` object which provides useful properties about the execution. +This is a closure being assigned to `workflow.onComplete`. Inside, you have access to the `workflow` object which provides useful properties about the execution. Run your workflow and you'll see this summary appear at the end! 
@@ -1995,7 +1997,7 @@ Pipeline execution summary: Completed at: 2025-10-10T12:14:24.885384+01:00 Duration : 2.9s Success : true -workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +workDir : /Users/jonathan.manning/projects/training/side-quests/essential_scripting_patterns/work exit status : 0 ``` @@ -2058,13 +2060,13 @@ Pipeline execution summary: Completed at: 2025-10-10T12:16:00.522569+01:00 Duration : 3.6s Success : true -workDir : /Users/jonathan.manning/projects/training/side-quests/groovy_essentials/work +workDir : /Users/jonathan.manning/projects/training/side-quests/essential_scripting_patterns/work exit status : 0 ✅ Pipeline completed successfully! ``` -You can also write the summary to a file using Groovy file operations: +You can also write the summary to a file using file operations: ```groovy title="nextflow.config - Writing summary to file" workflow.onComplete = { @@ -2134,40 +2136,40 @@ workflow.onComplete = { In this section, you've learned: -- **Event handler closures**: Groovy closures in `nextflow.config` that run at different lifecycle points +- **Event handler closures**: Closures in `nextflow.config` that run at different lifecycle points - **`onComplete` handler**: For execution summaries and result reporting - **`onError` handler**: For error handling and logging failures - **Workflow object properties**: Accessing `workflow.success`, `workflow.duration`, `workflow.errorMessage`, etc. -Event handlers are pure Groovy code running in your config file, demonstrating that Nextflow configuration is actually a Groovy script with access to the full language. +Event handlers show how you can use the full power of the Nextflow language within your config files to add sophisticated logging and notification capabilities. --- ## Summary -Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how Groovy transforms simple Nextflow workflows into powerful data processing systems. +Throughout this side quest, you've built a comprehensive sample processing pipeline that evolved from basic metadata handling to a sophisticated, production-ready workflow. Each section built upon the previous, demonstrating how programming constructs transform simple workflows into powerful data processing systems. Here's how we progressively enhanced our pipeline: -1. **Nextflow vs Groovy Boundaries**: You learned to distinguish between workflow orchestration (Nextflow) and programming logic (Groovy), including the crucial differences between constructs like `collect`. +1. **Dataflow vs Scripting**: You learned to distinguish between dataflow operations (channel orchestration) and scripting (code that manipulates data), including the crucial differences between operations on different types like `collect` on Channel vs Iterable. -2. **Advanced String Processing**: You mastered regular expressions for parsing file names, dynamic script generation in processes, and variable interpolation (Groovy vs Bash vs Shell). +2. **Advanced String Processing**: You mastered regular expressions for parsing file names, dynamic script generation in processes, and variable interpolation (Nextflow vs Bash vs Shell). 3. **Creating Reusable Functions**: You learned to extract complex logic into named functions that can be called from channel operators, making workflows more readable and maintainable. -4. 
**Dynamic Resource Directives with Closures**: You explored using Groovy closures in process directives for adaptive resource allocation based on input characteristics. +4. **Dynamic Resource Directives with Closures**: You explored using closures in process directives for adaptive resource allocation based on input characteristics. -5. **Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging Groovy Truth for concise conditional expressions. +5. **Conditional Logic and Process Control**: You added intelligent routing using `.branch()` and `.filter()` operators, leveraging truthiness for concise conditional expressions. 6. **Safe Navigation and Elvis Operators**: You made the pipeline robust against missing data using `?.` for null-safe property access and `?:` for providing default values. 7. **Validation with error() and log.warn**: You learned to validate inputs early and fail fast with clear error messages. -8. **Groovy in Configuration**: You learned to use workflow event handlers (`onComplete` and `onError`) for logging, notifications, and lifecycle management. +8. **Configuration Event Handlers**: You learned to use workflow event handlers (`onComplete` and `onError`) for logging, notifications, and lifecycle management. ### Key Benefits -- **Clearer code**: Understanding when to use Nextflow and Groovy helps you write more organized workflows +- **Clearer code**: Understanding dataflow vs scripting helps you write more organized workflows - **Robust handling**: Safe navigation and Elvis operators make workflows resilient to missing data - **Flexible processing**: Conditional logic lets your workflows process different sample types appropriately - **Adaptive resources**: Dynamic directives optimize resource usage based on input characteristics @@ -2176,7 +2178,7 @@ Here's how we progressively enhanced our pipeline: This pipeline evolved from basic data processing to production-ready workflows: -1. **Simple**: CSV processing and metadata extraction (Nextflow vs Groovy boundaries) +1. **Simple**: CSV processing and metadata extraction (dataflow vs scripting boundaries) 2. **Intelligent**: Regex parsing, variable interpolation, dynamic script generation 3. **Maintainable**: Reusable functions for cleaner, testable code 4. 
**Efficient**: Dynamic resource allocation and retry strategies @@ -2188,10 +2190,10 @@ This progression mirrors the real-world evolution of bioinformatics pipelines - ### Next Steps -With these Groovy fundamentals mastered, you're ready to: +With these fundamentals mastered, you're ready to: -- Write cleaner workflows with proper separation between Nextflow and Groovy logic -- Master variable interpolation to avoid common pitfalls with Groovy, Bash, and shell variables +- Write cleaner workflows with proper separation between dataflow and scripting +- Master variable interpolation to avoid common pitfalls with Nextflow, Bash, and shell variables - Use dynamic resource directives for efficient, adaptive workflows - Transform file collections into properly formatted command-line arguments - Handle different file naming conventions and input formats gracefully using regex and string processing @@ -2200,17 +2202,17 @@ With these Groovy fundamentals mastered, you're ready to: - Add validation, error handling, and logging to make your workflows production-ready - Implement workflow lifecycle management with event handlers -Continue practicing these patterns in your own workflows, and refer to the [Groovy documentation](http://groovy-lang.org/documentation.html) when you need to explore more advanced features. +Continue practicing these patterns in your own workflows, and refer to the [Nextflow language reference](https://nextflow.io/docs/latest/reference/index.html) when you need to explore more advanced features. ### Key Concepts Reference - **Language Boundaries** - ```groovy title="Nextflow vs Groovy examples" - // Nextflow: workflow orchestration + ```groovy title="Dataflow vs Scripting examples" + // Dataflow: channel orchestration Channel.fromPath('*.fastq').splitCsv(header: true) - // Groovy: data processing + // Scripting: data processing on collections sample_data.collect { it.toUpperCase() } ``` @@ -2282,8 +2284,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Groo ## Resources -- [Groovy Documentation](http://groovy-lang.org/documentation.html) +- [Nextflow Language Reference](https://nextflow.io/docs/latest/reference/index.html) - [Nextflow Operators](https://www.nextflow.io/docs/latest/operator.html) -- [Regular Expressions in Groovy](https://groovy-lang.org/syntax.html#_regular_expression_operators) -- [JSON Processing](https://groovy-lang.org/json.html) -- [XML Processing](https://groovy-lang.org/processing-xml.html) +- [Nextflow Script Syntax](https://www.nextflow.io/docs/latest/script.html) +- [Groovy Documentation](http://groovy-lang.org/documentation.html) (for deeper understanding of underlying language features) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index ad6048571d..898f047b2c 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -31,7 +31,7 @@ Otherwise, select a side quest from the table below. 
| Side Quest | Time Estimate for Teaching | | ----------------------------------------------------------------- | -------------------------- | | [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | -| [Groovy Essentials for Nextflow](./groovy_essentials.md) | 90 mins | +| [Essential Scripting Patterns for Nextflow](./essential_scripting_patterns.md) | 90 mins | | [Introduction to nf-core](./nf-core.md) | - | | [Metadata in workflows](./metadata.md) | 45 mins | | [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | diff --git a/mkdocs.yml b/mkdocs.yml index 6252ec1b7d..306ed4f8d7 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -65,7 +65,7 @@ nav: - side_quests/splitting_and_grouping.md - side_quests/workflows_of_workflows.md - side_quests/debugging.md - - side_quests/groovy_essentials.md + - side_quests/essential_scripting_patterns.md - side_quests/nf-test.md - side_quests/nf-core.md - Archive: diff --git a/side-quests/groovy_essentials/collect.nf b/side-quests/essential_scripting_patterns/collect.nf similarity index 53% rename from side-quests/groovy_essentials/collect.nf rename to side-quests/essential_scripting_patterns/collect.nf index aaa5573933..0c8598de40 100644 --- a/side-quests/groovy_essentials/collect.nf +++ b/side-quests/essential_scripting_patterns/collect.nf @@ -1,7 +1,7 @@ def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Nextflow collect() - groups multiple channel emissions into one +// Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() -ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } +ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } diff --git a/side-quests/groovy_essentials/data/samples.csv b/side-quests/essential_scripting_patterns/data/samples.csv similarity index 100% rename from side-quests/groovy_essentials/data/samples.csv rename to side-quests/essential_scripting_patterns/data/samples.csv diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq b/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq similarity index 100% rename from side-quests/groovy_essentials/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq rename to side-quests/essential_scripting_patterns/data/sequences/SAMPLE_001_S1_L001_R1_001.fastq diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq b/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq similarity index 100% rename from side-quests/groovy_essentials/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq rename to side-quests/essential_scripting_patterns/data/sequences/SAMPLE_002_S2_L001_R1_001.fastq diff --git a/side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq b/side-quests/essential_scripting_patterns/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq similarity index 100% rename from side-quests/groovy_essentials/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq rename to side-quests/essential_scripting_patterns/data/sequences/SAMPLE_003_S3_L001_R1_001.fastq diff --git a/side-quests/groovy_essentials/main.nf b/side-quests/essential_scripting_patterns/main.nf similarity index 100% rename from side-quests/groovy_essentials/main.nf rename to side-quests/essential_scripting_patterns/main.nf diff --git 
a/side-quests/groovy_essentials/modules/fastp.nf b/side-quests/essential_scripting_patterns/modules/fastp.nf similarity index 100% rename from side-quests/groovy_essentials/modules/fastp.nf rename to side-quests/essential_scripting_patterns/modules/fastp.nf diff --git a/side-quests/groovy_essentials/modules/generate_report.nf b/side-quests/essential_scripting_patterns/modules/generate_report.nf similarity index 100% rename from side-quests/groovy_essentials/modules/generate_report.nf rename to side-quests/essential_scripting_patterns/modules/generate_report.nf diff --git a/side-quests/groovy_essentials/modules/trimgalore.nf b/side-quests/essential_scripting_patterns/modules/trimgalore.nf similarity index 100% rename from side-quests/groovy_essentials/modules/trimgalore.nf rename to side-quests/essential_scripting_patterns/modules/trimgalore.nf diff --git a/side-quests/groovy_essentials/nextflow.config b/side-quests/essential_scripting_patterns/nextflow.config similarity index 100% rename from side-quests/groovy_essentials/nextflow.config rename to side-quests/essential_scripting_patterns/nextflow.config diff --git a/side-quests/solutions/groovy_essentials/collect.nf b/side-quests/solutions/essential_scripting_patterns/collect.nf similarity index 59% rename from side-quests/solutions/groovy_essentials/collect.nf rename to side-quests/solutions/essential_scripting_patterns/collect.nf index dbfd5ac79d..2fe2685119 100644 --- a/side-quests/solutions/groovy_essentials/collect.nf +++ b/side-quests/solutions/essential_scripting_patterns/collect.nf @@ -1,16 +1,16 @@ def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Nextflow collect() - groups multiple channel emissions into one +// Channel.collect() - groups multiple channel emissions into one ch_input = Channel.fromList(sample_ids) ch_input.view { "Individual channel item: ${it}" } ch_collected = ch_input.collect() -ch_collected.view { "Nextflow collect() result: ${it} (${it.size()} items grouped into 1)" } +ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } -// Groovy collect - transforms each element, preserves structure +// Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } -println "Groovy collect result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" +println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" // Spread operator - concise property access def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] diff --git a/side-quests/solutions/groovy_essentials/main.nf b/side-quests/solutions/essential_scripting_patterns/main.nf similarity index 100% rename from side-quests/solutions/groovy_essentials/main.nf rename to side-quests/solutions/essential_scripting_patterns/main.nf diff --git a/side-quests/solutions/groovy_essentials/modules/fastp.nf b/side-quests/solutions/essential_scripting_patterns/modules/fastp.nf similarity index 100% rename from side-quests/solutions/groovy_essentials/modules/fastp.nf rename to side-quests/solutions/essential_scripting_patterns/modules/fastp.nf diff --git a/side-quests/solutions/groovy_essentials/modules/generate_report.nf b/side-quests/solutions/essential_scripting_patterns/modules/generate_report.nf similarity index 100% rename from 
side-quests/solutions/groovy_essentials/modules/generate_report.nf rename to side-quests/solutions/essential_scripting_patterns/modules/generate_report.nf diff --git a/side-quests/solutions/groovy_essentials/modules/trimgalore.nf b/side-quests/solutions/essential_scripting_patterns/modules/trimgalore.nf similarity index 100% rename from side-quests/solutions/groovy_essentials/modules/trimgalore.nf rename to side-quests/solutions/essential_scripting_patterns/modules/trimgalore.nf diff --git a/side-quests/solutions/groovy_essentials/nextflow.config b/side-quests/solutions/essential_scripting_patterns/nextflow.config similarity index 100% rename from side-quests/solutions/groovy_essentials/nextflow.config rename to side-quests/solutions/essential_scripting_patterns/nextflow.config From 9c466b1096e6b9ef79e214974ede09da53bfdc22 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 12:42:08 +0100 Subject: [PATCH 40/48] Prettier --- docs/side_quests/index.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index 898f047b2c..d374f486d7 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -28,16 +28,16 @@ Otherwise, select a side quest from the table below. ## Side Quests -| Side Quest | Time Estimate for Teaching | -| ----------------------------------------------------------------- | -------------------------- | -| [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | +| Side Quest | Time Estimate for Teaching | +| ------------------------------------------------------------------------------ | -------------------------- | +| [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | | [Essential Scripting Patterns for Nextflow](./essential_scripting_patterns.md) | 90 mins | -| [Introduction to nf-core](./nf-core.md) | - | -| [Metadata in workflows](./metadata.md) | 45 mins | -| [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | -| [Testing with nf-test](./nf-test.md) | 1 hour | -| [Workflows of workflows](./workflows_of_workflows.md) | 30 mins | -| [Working with files](./working_with_files.md) | 45 mins | -| [Debugging workflows](./debugging.md) | - | +| [Introduction to nf-core](./nf-core.md) | - | +| [Metadata in workflows](./metadata.md) | 45 mins | +| [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | +| [Testing with nf-test](./nf-test.md) | 1 hour | +| [Workflows of workflows](./workflows_of_workflows.md) | 30 mins | +| [Working with files](./working_with_files.md) | 45 mins | +| [Debugging workflows](./debugging.md) | - | Let us know what other domains and use cases you'd like to see covered here by posting in the [Training section](https://community.seqera.io/c/training/) of the community forum. 
From 57b250505e0b41409b1e0cec7f4f2159ed943221 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 12:48:10 +0100 Subject: [PATCH 41/48] Update title to Essential Nextflow Scripting Patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../essential_scripting_patterns.md | 2 +- docs/side_quests/index.md | 22 +++++++++---------- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index 70f898743e..9516f689d4 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -1,4 +1,4 @@ -# Essential Programming Patterns for Nextflow Developers +# Essential Nextflow Scripting Patterns Nextflow is a programming language that runs on the Java Virtual Machine. While Nextflow was originally built on Groovy and shares much of its syntax, the [Nextflow language specification](https://nextflow.io/docs/latest/reference/index.html) defines Nextflow as a standalone language—making it easier to understand Nextflow on its own terms rather than as "Groovy with extensions." diff --git a/docs/side_quests/index.md b/docs/side_quests/index.md index d374f486d7..62f6b80dc7 100644 --- a/docs/side_quests/index.md +++ b/docs/side_quests/index.md @@ -28,16 +28,16 @@ Otherwise, select a side quest from the table below. ## Side Quests -| Side Quest | Time Estimate for Teaching | -| ------------------------------------------------------------------------------ | -------------------------- | -| [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | -| [Essential Scripting Patterns for Nextflow](./essential_scripting_patterns.md) | 90 mins | -| [Introduction to nf-core](./nf-core.md) | - | -| [Metadata in workflows](./metadata.md) | 45 mins | -| [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | -| [Testing with nf-test](./nf-test.md) | 1 hour | -| [Workflows of workflows](./workflows_of_workflows.md) | 30 mins | -| [Working with files](./working_with_files.md) | 45 mins | -| [Debugging workflows](./debugging.md) | - | +| Side Quest | Time Estimate for Teaching | +| -------------------------------------------------------------------------- | -------------------------- | +| [Nextflow development environment walkthrough](./ide_features.md) | 45 mins | +| [Essential Nextflow Scripting Patterns](./essential_scripting_patterns.md) | 90 mins | +| [Introduction to nf-core](./nf-core.md) | - | +| [Metadata in workflows](./metadata.md) | 45 mins | +| [Splitting and Grouping](./splitting_and_grouping.md) | 45 mins | +| [Testing with nf-test](./nf-test.md) | 1 hour | +| [Workflows of workflows](./workflows_of_workflows.md) | 30 mins | +| [Working with files](./working_with_files.md) | 45 mins | +| [Debugging workflows](./debugging.md) | - | Let us know what other domains and use cases you'd like to see covered here by posting in the [Training section](https://community.seqera.io/c/training/) of the community forum. 
From 9f332c5f78ae50f7e62c86495afc28f429ca804e Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 18:51:40 +0100 Subject: [PATCH 42/48] Apply suggestions from code review Co-authored-by: Ben Sherman --- .../essential_scripting_patterns.md | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index 9516f689d4..c794a97952 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -1,6 +1,6 @@ # Essential Nextflow Scripting Patterns -Nextflow is a programming language that runs on the Java Virtual Machine. While Nextflow was originally built on Groovy and shares much of its syntax, the [Nextflow language specification](https://nextflow.io/docs/latest/reference/index.html) defines Nextflow as a standalone language—making it easier to understand Nextflow on its own terms rather than as "Groovy with extensions." +Nextflow is a programming language that runs on the Java Virtual Machine. While Nextflow is built on [Groovy](http://groovy-lang.org/) and shares much of its syntax, Nextflow is more than just "Groovy with extensions" -- it is a standalone language with a fully-specified [syntax](https://nextflow.io/docs/latest/reference/syntax.html) and [standard library](https://nextflow.io/docs/latest/reference/stdlib.html). You can write a lot of Nextflow without venturing beyond basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. @@ -27,9 +27,9 @@ Before taking on this side quest you should: - Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial or have equivalent experience - Understand basic Nextflow concepts (processes, channels, workflows) -- Have basic familiarity with common programming constructs used in Groovy syntax (variables, maps, lists) +- Have basic familiarity with common programming constructs (variables, maps, lists) -This tutorial will explain Groovy concepts as we encounter them, so you don't need extensive prior Groovy knowledge. We'll start with fundamental concepts and build up to advanced patterns. +This tutorial will explain programming concepts as we encounter them, so you don't need extensive programming experience. We'll start with fundamental concepts and build up to advanced patterns. ### 0.2. Starting Point @@ -86,7 +86,7 @@ Start with a simple workflow that just reads the CSV file (we've already done th ```groovy title="main.nf" linenums="1" workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .view() } @@ -141,7 +141,7 @@ Here's what that map operation looks like: This is our first **closure**—an anonymous function you can pass as an argument (similar to lambdas in Python or arrow functions in JavaScript). Closures are essential for working with Nextflow operators. -The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). You can also use the implicit variable `it` instead: `.map { return it }`, though naming parameters improves clarity. +The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). When the `.map()` operator processes each channel item, it passes that item to your closure. Here, `row` holds one CSV row at a time. 
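As a quick aside, it can help to see a closure on its own, outside of any operator call. The sketch below is illustrative only — the variable name and value are made up for this example — but it shows that a closure is simply a value you can store, call like a function, and pass around by name:

```groovy title="A closure as a value (illustrative)"
// A closure stored in a variable; 'text' is the closure's parameter
def shout = { text -> text.toUpperCase() }

// Call it directly, like a function
println shout('sample_001')   // prints SAMPLE_001
```

Because a closure is just a value, it can also be handed to an operator by name — `.map(shout)` would behave exactly like writing `.map { text -> text.toUpperCase() }` inline.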
@@ -155,7 +155,7 @@ You'll see the same output as before, because we're simply returning the input u #### Step 3: Creating a Map Data Structure -Now we're going to write **scripting** inside our closure to transform each row of data. This is where we process individual data items rather than orchestrating data flow. +Now we're going to write **scripting** logic inside our closure to transform each row of data. This is where we process individual data items rather than orchestrating data flow. === "After" @@ -356,7 +356,7 @@ Now remove those println statements to restore your workflow to its previous sta #### Step 5: Combining Maps and Returning Results -So far, we've only been returning what Nextflow community calls the 'meta map', and we've been ignoring the files those metadata relate to. But if you're writing Nextflow workflows, you probably want to do something with those files. +So far, we've only been returning what the Nextflow community calls the 'meta map', and we've been ignoring the files those metadata relate to. But if you're writing Nextflow workflows, you probably want to do something with those files. Let's output a channel structure comprising a tuple of 2 elements: the enriched metadata map and the corresponding file path. This is a common pattern in Nextflow for passing data to processes. @@ -374,7 +374,7 @@ Let's output a channel structure comprising a tuple of 2 elements: the enriched quality: row.quality_score.toDouble() ] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + [priority: priority], file(row.file_path) ] + return tuple( sample_meta + [priority: priority], file(row.file_path) ) } .view() ``` @@ -1084,7 +1084,7 @@ But what if we want to add information about when and where the processing occur """ ``` -If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `$(hostname)` as a Groovy variable that doesn't exist: +If you run this, you'll notice an error or unexpected behavior - Nextflow tries to interpret `$(hostname)` as a Nextflow variable that doesn't exist: ```console title="Error with shell variables" unknown recognition error type: groovyjarjarantlr4.v4.runtime.LexerNoViableAltException @@ -1442,7 +1442,7 @@ This makes your workflows both more efficient (not over-allocating) and more rob Previously, we used `.map()` with scripting to transform channel data. Now we'll use conditional logic to control which processes execute based on data—essential for flexible workflows adapting to different sample types. -Nextflow's [flow control operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling conditional logic to drive workflow decisions based on channel content. +Nextflow's [dataflow operators](https://www.nextflow.io/docs/latest/reference/operator.html) take closures evaluated at runtime, enabling conditional logic to drive workflow decisions based on channel content. ### 5.1. Routing with `.branch()` @@ -1658,7 +1658,7 @@ ERROR ~ Cannot invoke method toUpperCase() on null object -- Check script 'main.nf' at line: 13 or see '.nextflow.log' file for more details ``` -The problem is that `row.run_id` returns `null` because the `run_id` column doesn't exist in our CSV. When we try to call `.toUpperCase()` on `null`, it crashes. This is where Groovy's safe navigation operator saves the day. +The problem is that `row.run_id` returns `null` because the `run_id` column doesn't exist in our CSV. 
When we try to call `.toUpperCase()` on `null`, it crashes. This is where the safe navigation operator saves the day. ### 6.2. Safe Navigation Operator (`?.`) @@ -1704,7 +1704,7 @@ No crash! The workflow now handles the missing field gracefully. When `row.run_i ### 6.3. Elvis Operator (`?:`) for Defaults -The Elvis operator (`?:`) provides default values when the left side is `null` (or empty, in Groovy's "truth" evaluation). It's named after Elvis Presley because `?:` looks like his famous hair and eyes when viewed sideways! +The Elvis operator (`?:`) provides default values when the left side is "falsy" (as explained previously). It's named after Elvis Presley because `?:` looks like his famous hair and eyes when viewed sideways! Now that we're using safe navigation, `run_id` will be `null` for samples without that field. Let's use the Elvis operator to provide a default value and add it to our `sample_meta` map: @@ -1797,7 +1797,7 @@ These operators make workflows resilient to incomplete data - essential for real ## 7. Validation with `error()` and `log.warn` -Sometimes you need to stop the workflow immediately if input parameters are invalid. While `error()` and `log.warn` are Nextflow-provided functions, the **validation logic itself uses standard programming constructs**—conditionals (`if`, `!`), boolean logic, and methods like `.exists()`. Let's add validation to our workflow. +Sometimes you need to stop the workflow immediately if input parameters are invalid. In Nextflow, you can use built-in functions like `error()` and `log.warn`, as well as standard programming constructs like `if` statements and boolean logic, to implement validation logic. Let's add validation to our workflow. Create a validation function before your workflow block, call it from the workflow, and change the channel creation to use a parameter for the CSV file path. If the parameter is missing or the file doesn't exist, call `error()` to stop execution with a clear message. 
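If it helps to picture the goal before writing your own, here is one possible shape for such a function. The parameter name (`params.input`) and the messages are assumptions made for illustration; the version you build in this tutorial may differ:

```groovy title="Validation sketch (one possible shape)"
def validateInputs() {
    // Fail fast if the required parameter is missing entirely
    if (!params.input) {
        error("No input CSV provided. Run with: nextflow run main.nf --input data/samples.csv")
    }
    // Fail fast if a path was given but the file does not exist
    def samplesheet = file(params.input)
    if (!samplesheet.exists()) {
        error("Input CSV not found: ${params.input}")
    }
    // Non-fatal issues can be surfaced with log.warn instead of stopping the run
    if (!params.input.endsWith('.csv')) {
        log.warn("Input file does not end in .csv: ${params.input}")
    }
}
```

Calling a function like this at the top of the workflow block means bad inputs are reported before any tasks are submitted.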
@@ -2247,7 +2247,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Next } ``` -- **Essential Groovy Operators** +- **Essential Operators** ```groovy title="Essential operators examples" // Safe navigation and Elvis operators @@ -2284,7 +2284,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Next ## Resources -- [Nextflow Language Reference](https://nextflow.io/docs/latest/reference/index.html) +- [Nextflow Language Reference](https://nextflow.io/docs/latest/reference/syntax.html) - [Nextflow Operators](https://www.nextflow.io/docs/latest/operator.html) - [Nextflow Script Syntax](https://www.nextflow.io/docs/latest/script.html) -- [Groovy Documentation](http://groovy-lang.org/documentation.html) (for deeper understanding of underlying language features) +- [Nextflow Standard Library](https://nextflow.io/docs/latest/reference/stdlib.html) From 2462885f070bd3314c0e98391eb1df3c2ab29b12 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 19:19:12 +0100 Subject: [PATCH 43/48] Easy fixes for Ben --- .../essential_scripting_patterns.md | 120 +++++++++--------- .../essential_scripting_patterns/collect.nf | 6 +- .../essential_scripting_patterns/main.nf | 2 +- .../essential_scripting_patterns/collect.nf | 8 +- .../essential_scripting_patterns/main.nf | 4 +- 5 files changed, 70 insertions(+), 70 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index c794a97952..d6fd8e61fc 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -92,7 +92,7 @@ workflow { } ``` -The `workflow` block defines our pipeline structure, while `Channel.fromPath()` creates a channel from a file path. The `.splitCsv()` operator processes the CSV file and converts each row into a map data structure. +The `workflow` block defines our pipeline structure, while `channel.fromPath()` creates a channel from a file path. The `.splitCsv()` operator processes the CSV file and converts each row into a map data structure. 
Run this workflow to see the raw CSV data: @@ -123,7 +123,7 @@ Here's what that map operation looks like: === "After" ```groovy title="main.nf" linenums="2" hl_lines="3-6" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> return row @@ -134,7 +134,7 @@ Here's what that map operation looks like: === "Before" ```groovy title="main.nf" linenums="2" hl_lines="3" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .view() ``` @@ -160,7 +160,7 @@ Now we're going to write **scripting** logic inside our closure to transform eac === "After" ```groovy title="main.nf" linenums="2" hl_lines="4-12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> // Scripting for data transformation @@ -179,7 +179,7 @@ Now we're going to write **scripting** logic inside our closure to transform eac === "Before" ```groovy title="main.nf" linenums="2" hl_lines="4" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> return row @@ -214,7 +214,7 @@ Make the following change: === "After" ```groovy title="main.nf" linenums="2" hl_lines="11-12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -233,7 +233,7 @@ Make the following change: === "Before" ```groovy title="main.nf" linenums="2" hl_lines="11" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -281,7 +281,7 @@ Let's add a line to create a simplified version of our metadata that only contai === "After" ```groovy title="main.nf" linenums="2" hl_lines="12-15" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> // Scripting for data transformation @@ -304,7 +304,7 @@ Let's add a line to create a simplified version of our metadata that only contai === "Before" ```groovy title="main.nf" linenums="2" hl_lines="12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> // Scripting for data transformation @@ -363,7 +363,7 @@ Let's output a channel structure comprising a tuple of 2 elements: the enriched === "After" ```groovy title="main.nf" linenums="2" hl_lines="12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -382,7 +382,7 @@ Let's output a channel structure comprising a tuple of 2 elements: the enriched === "Before" ```groovy title="main.nf" linenums="2" hl_lines="12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -418,7 +418,7 @@ This `[meta, file]` tuple structure is a common pattern in Nextflow for passing **Maps and Metadata**: Maps are fundamental to working with metadata in Nextflow. For a more detailed explanation of working with metadata maps, see the [Working with metadata](./metadata.md) side quest. 
-Our workflow demonstrates the core pattern: **dataflow operations** (`workflow`, `Channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate how data moves through the pipeline, while **scripting** (maps `[key: value]`, string methods, type conversions, ternary operators) inside the `.map()` closure handles the transformation of individual data items. +Our workflow demonstrates the core pattern: **dataflow operations** (`workflow`, `channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate how data moves through the pipeline, while **scripting** (maps `[key: value]`, string methods, type conversions, ternary operators) inside the `.map()` closure handles the transformation of individual data items. ### 1.2. Understanding Different Types: Channel vs Iterable @@ -431,11 +431,11 @@ Let's demonstrate this with some sample data, starting by refreshing ourselves o ```groovy title="collect.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Channel.collect() - groups multiple channel emissions into one -ch_input = Channel.fromList(sample_ids) -ch_input.view { "Individual channel item: ${it}" } +// channel.collect() - groups multiple channel emissions into one +ch_input = channel.fromList(sample_ids) +ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() -ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } +ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } ``` Steps: @@ -462,7 +462,7 @@ Launching `collect.nf` [loving_mendel] DSL2 - revision: e8d054a46e Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` `view()` returns an output for every channel emission, so we know that this single output contains all 3 original items grouped into one list. @@ -474,11 +474,11 @@ Now let's see the `collect` method on an Iterable type in action. Modify `collec ```groovy title="main.nf" linenums="1" hl_lines="9-13" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Channel.collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } + // channel.collect() - groups multiple channel emissions into one + ch_input = channel.fromList(sample_ids) + ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() - ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> @@ -492,11 +492,11 @@ Now let's see the `collect` method on an Iterable type in action. 
Modify `collec ```groovy title="main.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Channel.collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } + // channel.collect() - groups multiple channel emissions into one + ch_input = channel.fromList(sample_ids) + ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() - ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } ``` In this new snippet we: @@ -519,7 +519,7 @@ Iterable.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items t Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using the Iterable's `collect` method to produce a new list with modified values. This is similar to using the `map` operator on a Channel, but it's operating on a List data structure rather than a channel. @@ -537,11 +537,11 @@ Let's add a demonstration to our `collect.nf` file: ```groovy title="collect.nf" linenums="1" hl_lines="15-18" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Channel.collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } + // channel.collect() - groups multiple channel emissions into one + ch_input = channel.fromList(sample_ids) + ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() - ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> @@ -560,11 +560,11 @@ Let's add a demonstration to our `collect.nf` file: ```groovy title="collect.nf" linenums="1" def sample_ids = ['sample_001', 'sample_002', 'sample_003'] - // Channel.collect() - groups multiple channel emissions into one - ch_input = Channel.fromList(sample_ids) - ch_input.view { "Individual channel item: ${it}" } + // channel.collect() - groups multiple channel emissions into one + ch_input = channel.fromList(sample_ids) + ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() - ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } + ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> @@ -591,7 +591,7 @@ Spread operator result: [s1, s2, s3] Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 -Channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) +channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` The spread 
operator `*.` is a shorthand for a common collect pattern: @@ -662,7 +662,7 @@ Make the following change to your existing `main.nf` workflow: ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ``` @@ -679,7 +679,7 @@ Make the following change to your existing `main.nf` workflow: quality: row.quality_score.toDouble() ] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + [priority: priority], file(row.file_path)] + return tuple(sample_meta + [priority: priority], file(row.file_path)) } ``` @@ -765,7 +765,7 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS ```groovy title="main.nf" linenums="25" hl_lines="27" workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -786,7 +786,7 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ch_fastp = FASTP(ch_samples) @@ -798,7 +798,7 @@ Then modify the `workflow` block to connect the `ch_samples` channel to the `FAS ```groovy title="main.nf" linenums="25" hl_lines="26" workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -994,7 +994,7 @@ Include the process in your `main.nf` and add it to the workflow: include { GENERATE_REPORT } from './modules/generate_report.nf' workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -1015,7 +1015,7 @@ Include the process in your `main.nf` and add it to the workflow: ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ch_fastp = FASTP(ch_samples) @@ -1029,7 +1029,7 @@ Include the process in your `main.nf` and add it to the workflow: include { FASTP } from './modules/fastp.nf' workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -1050,7 +1050,7 @@ Include the process in your `main.nf` and add it to the workflow: ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ch_fastp = FASTP(ch_samples) @@ -1177,11 +1177,11 @@ To illustrate what that looks like with our existing workflow, make the modifica ] : [:] def priority = sample_meta.quality > 40 ? 
'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map{ row -> separateMetadata(row) } @@ -1197,7 +1197,7 @@ To illustrate what that looks like with our existing workflow, make the modifica include { GENERATE_REPORT } from './modules/generate_report.nf' workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map { row -> def sample_meta = [ @@ -1218,7 +1218,7 @@ To illustrate what that looks like with our existing workflow, make the modifica ] : [:] def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ch_fastp = FASTP(ch_samples) @@ -1229,7 +1229,7 @@ To illustrate what that looks like with our existing workflow, make the modifica By extracting this logic into a function, we've reduced the actual workflow logic down to something much cleaner: ```groovy title="minimal workflow" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map{ row -> separateMetadata(row) } @@ -1470,7 +1470,7 @@ Include the new from in `modules/trimgalore.nf`: === "After" ```groovy title="main.nf" linenums="28" hl_lines="5-12" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) @@ -1488,7 +1488,7 @@ Include the new from in `modules/trimgalore.nf`: === "Before" ```groovy title="main.nf" linenums="28" hl_lines="5" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) @@ -1531,7 +1531,7 @@ Add the following before the branch operation: === "After" ```groovy title="main.nf" linenums="28" hl_lines="11" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) @@ -1551,7 +1551,7 @@ Add the following before the branch operation: === "Before" ```groovy title="main.nf" linenums="28" hl_lines="5" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) @@ -1742,7 +1742,7 @@ Also add a `view()` operator in the workflow to see the results: === "After" ```groovy title="main.nf" linenums="30" hl_lines="4" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map{ row -> separateMetadata(row) } .view() @@ -1751,7 +1751,7 @@ Also add a `view()` operator in the workflow to see the results: === "Before" ```groovy title="main.nf" linenums="30" - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map{ row -> separateMetadata(row) } ``` @@ -1822,7 +1822,7 @@ Create a validation function before your workflow block, call it from the workfl ... 
workflow { validateInputs() - ch_samples = Channel.fromPath(params.input) + ch_samples = channel.fromPath(params.input) ``` === "Before" @@ -1834,7 +1834,7 @@ Create a validation function before your workflow block, call it from the workfl ... workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") ``` Now try running without the CSV file: @@ -1890,7 +1890,7 @@ You can also add validation within the `separateMetadata` function. Let's use th log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" } - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ``` @@ -1899,7 +1899,7 @@ You can also add validation within the `separateMetadata` function. Let's use th ```groovy title="main.nf" linenums="1" def priority = sample_meta.quality > 40 ? 'high' : 'normal' - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } ``` @@ -2210,7 +2210,7 @@ Continue practicing these patterns in your own workflows, and refer to the [Next ```groovy title="Dataflow vs Scripting examples" // Dataflow: channel orchestration - Channel.fromPath('*.fastq').splitCsv(header: true) + channel.fromPath('*.fastq').splitCsv(header: true) // Scripting: data processing on collections sample_data.collect { it.toUpperCase() } diff --git a/side-quests/essential_scripting_patterns/collect.nf b/side-quests/essential_scripting_patterns/collect.nf index 0c8598de40..2202dbbb12 100644 --- a/side-quests/essential_scripting_patterns/collect.nf +++ b/side-quests/essential_scripting_patterns/collect.nf @@ -1,7 +1,7 @@ def sample_ids = ['sample_001', 'sample_002', 'sample_003'] // Channel.collect() - groups multiple channel emissions into one -ch_input = Channel.fromList(sample_ids) -ch_input.view { "Individual channel item: ${it}" } +ch_input = channel.fromList(sample_ids) +ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() -ch_collected.view { "Channel.collect() result: ${it} (${it.size()} items grouped into 1)" } +ch_collected.view { list -> "Channel.collect() result: ${list} (${list.size()} items grouped into 1)" } diff --git a/side-quests/essential_scripting_patterns/main.nf b/side-quests/essential_scripting_patterns/main.nf index 31aa794ede..a228d2dcbf 100644 --- a/side-quests/essential_scripting_patterns/main.nf +++ b/side-quests/essential_scripting_patterns/main.nf @@ -1,5 +1,5 @@ workflow { - ch_samples = Channel.fromPath("./data/samples.csv") + ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .view() } diff --git a/side-quests/solutions/essential_scripting_patterns/collect.nf b/side-quests/solutions/essential_scripting_patterns/collect.nf index 2fe2685119..0ffcfa87a2 100644 --- a/side-quests/solutions/essential_scripting_patterns/collect.nf +++ b/side-quests/solutions/essential_scripting_patterns/collect.nf @@ -1,10 +1,10 @@ def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Channel.collect() - groups multiple channel emissions into one -ch_input = Channel.fromList(sample_ids) -ch_input.view { "Individual channel item: ${it}" } +// channel.collect() - groups multiple channel emissions into one +ch_input = channel.fromList(sample_ids) +ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() -ch_collected.view { "Channel.collect() result: ${it} 
(${it.size()} items grouped into 1)" } +ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } // Iterable.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> diff --git a/side-quests/solutions/essential_scripting_patterns/main.nf b/side-quests/solutions/essential_scripting_patterns/main.nf index ab83315d71..136764dc31 100644 --- a/side-quests/solutions/essential_scripting_patterns/main.nf +++ b/side-quests/solutions/essential_scripting_patterns/main.nf @@ -41,13 +41,13 @@ def separateMetadata(row) { log.warn "Low sequencing depth for ${sample_meta.id}: ${sample_meta.depth}" } - return [sample_meta + file_meta + [priority: priority], fastq_path] + return tuple(sample_meta + file_meta + [priority: priority], fastq_path) } workflow { validateInputs() - ch_samples = Channel.fromPath(params.input) + ch_samples = channel.fromPath(params.input) .splitCsv(header: true) .map{ row -> separateMetadata(row) } From 4793e8dbbdcb9df07af8dc2277359ec526b169f5 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 19:25:58 +0100 Subject: [PATCH 44/48] Iterable -> List --- .../essential_scripting_patterns.md | 34 +++++++++---------- .../essential_scripting_patterns/collect.nf | 4 +-- 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index d6fd8e61fc..32f5e76161 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -420,11 +420,11 @@ This `[meta, file]` tuple structure is a common pattern in Nextflow for passing Our workflow demonstrates the core pattern: **dataflow operations** (`workflow`, `channel.fromPath()`, `.splitCsv()`, `.map()`, `.view()`) orchestrate how data moves through the pipeline, while **scripting** (maps `[key: value]`, string methods, type conversions, ternary operators) inside the `.map()` closure handles the transformation of individual data items. -### 1.2. Understanding Different Types: Channel vs Iterable +### 1.2. Understanding Different Types: Channel vs List So far, so good, we can distinguish between dataflow operations and scripting. But what about when the same method name exists in both contexts? -A perfect example is the `collect` method, which exists for both Channel types and Iterable types (like Lists) in the Nextflow standard library. The `collect()` method on an Iterable transforms each element, while the `collect()` operator on a Channel gathers all channel emissions into a single-item channel. +A perfect example is the `collect` method, which exists for both Channel types and List types in the Nextflow standard library. The `collect()` method on a List transforms each element, while the `collect()` operator on a Channel gathers all channel emissions into a single-item channel. Let's demonstrate this with some sample data, starting by refreshing ourselves on what the Channel `collect()` operator does. Check out `collect.nf`: @@ -467,7 +467,7 @@ channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped `view()` returns an output for every channel emission, so we know that this single output contains all 3 original items grouped into one list. -Now let's see the `collect` method on an Iterable type in action. Modify `collect.nf` to apply the Iterable's `collect` method to the original list of sample IDs: +Now let's see the `collect` method on a List in action. 
Modify `collect.nf` to apply the List's `collect` method to the original list of sample IDs: === "After" @@ -480,11 +480,11 @@ Now let's see the `collect` method on an Iterable type in action. Modify `collec ch_collected = ch_input.collect() ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } - // Iterable.collect() - transforms each element, preserves structure + // List.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println "List.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" ``` === "Before" @@ -506,29 +506,29 @@ In this new snippet we: Run the modified workflow: -```bash title="Test Iterable collect" +```bash title="Test List collect" nextflow run collect.nf ``` -```console title="Iterable collect results" hl_lines="5" +```console title="List collect results" hl_lines="5" N E X T F L O W ~ version 25.04.3 Launching `collect.nf` [cheeky_stonebraker] DSL2 - revision: 2d5039fb47 -Iterable.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +List.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) Individual channel item: sample_001 Individual channel item: sample_002 Individual channel item: sample_003 channel.collect() result: [sample_001, sample_002, sample_003] (3 items grouped into 1) ``` -This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using the Iterable's `collect` method to produce a new list with modified values. This is similar to using the `map` operator on a Channel, but it's operating on a List data structure rather than a channel. +This time, we have NOT changed the structure of the data, we still have 3 items in the list, but we HAVE transformed each item using the List's `collect` method to produce a new list with modified values. This is similar to using the `map` operator on a Channel, but it's operating on a List data structure rather than a channel. `collect` is an extreme case we're using here to make a point. The key lesson is that when you're writing workflows, always distinguish between **data structures** (Lists, Maps, etc.) and **channels** (dataflow constructs). Operations can share names but behave completely differently depending on the type they're called on. ### 1.3. The Spread Operator (`*.`) - Shorthand for Property Extraction -Related to the Iterable's `collect` method is the spread operator (`*.`), which provides a concise way to extract properties from collections. It's essentially syntactic sugar for a common `collect` pattern. +Related to the List's `collect` method is the spread operator (`*.`), which provides a concise way to extract properties from collections. It's essentially syntactic sugar for a common `collect` pattern. 
Let's add a demonstration to our `collect.nf` file: @@ -543,11 +543,11 @@ Let's add a demonstration to our `collect.nf` file: ch_collected = ch_input.collect() ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } - // Iterable.collect() - transforms each element, preserves structure + // List.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println "List.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" // Spread operator - concise property access def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] @@ -566,11 +566,11 @@ Let's add a demonstration to our `collect.nf` file: ch_collected = ch_input.collect() ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } - // Iterable.collect() - transforms each element, preserves structure + // List.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } - println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" + println "List.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" ``` Run the updated workflow: @@ -586,7 +586,7 @@ You should see output like: Launching `collect.nf` [cranky_galileo] DSL2 - revision: 5f3c8b2a91 -Iterable.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) +List.collect() result: [SPECIMEN_001, SPECIMEN_002, SPECIMEN_003] (3 items transformed into 3) Spread operator result: [s1, s2, s3] Individual channel item: sample_001 Individual channel item: sample_002 @@ -618,7 +618,7 @@ The spread operator is particularly useful when you need to extract a single pro In this section, you've learned: - **Dataflow vs scripting**: Channel operators orchestrate how data flows through your pipeline, while scripting transforms individual data items -- **Understanding types**: The same method name (like `collect`) can behave differently depending on the type it's called on (Channel vs Iterable) +- **Understanding types**: The same method name (like `collect`) can behave differently depending on the type it's called on (Channel vs List) - **Context matters**: Always be aware of whether you're working with channels (dataflow) or data structures (scripting) Understanding these boundaries is essential for debugging, documentation, and writing maintainable workflows. @@ -2151,7 +2151,7 @@ Throughout this side quest, you've built a comprehensive sample processing pipel Here's how we progressively enhanced our pipeline: -1. **Dataflow vs Scripting**: You learned to distinguish between dataflow operations (channel orchestration) and scripting (code that manipulates data), including the crucial differences between operations on different types like `collect` on Channel vs Iterable. +1. **Dataflow vs Scripting**: You learned to distinguish between dataflow operations (channel orchestration) and scripting (code that manipulates data), including the crucial differences between operations on different types like `collect` on Channel vs List. 2. 
**Advanced String Processing**: You mastered regular expressions for parsing file names, dynamic script generation in processes, and variable interpolation (Nextflow vs Bash vs Shell). diff --git a/side-quests/solutions/essential_scripting_patterns/collect.nf b/side-quests/solutions/essential_scripting_patterns/collect.nf index 0ffcfa87a2..9242babfbf 100644 --- a/side-quests/solutions/essential_scripting_patterns/collect.nf +++ b/side-quests/solutions/essential_scripting_patterns/collect.nf @@ -6,11 +6,11 @@ ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" } -// Iterable.collect() - transforms each element, preserves structure +// List.collect() - transforms each element, preserves structure def formatted_ids = sample_ids.collect { id -> id.toUpperCase().replace('SAMPLE_', 'SPECIMEN_') } -println "Iterable.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" +println "List.collect() result: ${formatted_ids} (${sample_ids.size()} items transformed into ${formatted_ids.size()})" // Spread operator - concise property access def sample_data = [[id: 's1', quality: 38.5], [id: 's2', quality: 42.1], [id: 's3', quality: 35.2]] From c6bc005f6853db721341e0fadabe4b130f320115 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 20:09:10 +0100 Subject: [PATCH 45/48] Fixes up to handlers --- .../essential_scripting_patterns.md | 227 ++++++++++-------- .../essential_scripting_patterns/main.nf | 19 ++ .../nextflow.config | 20 -- 3 files changed, 143 insertions(+), 123 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index 32f5e76161..aa7b7beb6e 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -1934,47 +1934,49 @@ Proper validation makes workflows more robust and user-friendly by catching prob --- -## 8. Configuration and Workflow Event Handlers +## 8. Workflow Event Handlers -Up until now, we've been writing code in our workflow scripts and process definitions. But there's one more important place where you can add logic: workflow event handlers in your `nextflow.config` file (or other places you write configuration). +Up until now, we've been writing code in our workflow scripts and process definitions. But there's one more important feature you should know about: workflow event handlers. -Event handlers are closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations without cluttering your main workflow code. +Event handlers are closures that run at specific points in your workflow's lifecycle. They're perfect for adding logging, notifications, or cleanup operations. These handlers should be defined in your workflow script alongside your workflow definition. ### 8.1. The `onComplete` Handler The most commonly used event handler is `onComplete`, which runs when your workflow finishes (whether it succeeded or failed). Let's add one to summarize our pipeline results. -Your `nextflow.config` file already has Docker enabled. 
Add an event handler after the existing configuration: +Add the event handler to your `main.nf` file, inside your workflow definition: === "After" - ```groovy title="nextflow.config" linenums="1" hl_lines="5-15" - // Nextflow configuration for Groovy Essentials side quest - - docker.enabled = true + ```groovy title="main.nf" linenums="66" hl_lines="5-16" + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) - workflow.onComplete = { - println "" - println "Pipeline execution summary:" - println "==========================" - println "Completed at: ${workflow.complete}" - println "Duration : ${workflow.duration}" - println "Success : ${workflow.success}" - println "workDir : ${workflow.workDir}" - println "exit status : ${workflow.exitStatus}" - println "" + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + } } ``` === "Before" - ```groovy title="nextflow.config" linenums="1" - // Nextflow configuration for Groovy Essentials side quest - - docker.enabled = true + ```groovy title="main.nf" linenums="66" hl_lines="4" + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) + } ``` -This is a closure being assigned to `workflow.onComplete`. Inside, you have access to the `workflow` object which provides useful properties about the execution. +This closure runs when the workflow completes. Inside, you have access to the `workflow` object which provides useful properties about the execution. Run your workflow and you'll see this summary appear at the end! @@ -1997,7 +1999,7 @@ Pipeline execution summary: Completed at: 2025-10-10T12:14:24.885384+01:00 Duration : 2.9s Success : true -workDir : /Users/jonathan.manning/projects/training/side-quests/essential_scripting_patterns/work +workDir : /workspaces/training/side-quests/essential_scripting_patterns/work exit status : 0 ``` @@ -2005,41 +2007,50 @@ Let's make it more useful by adding conditional logic: === "After" - ```groovy title="nextflow.config" linenums="5" hl_lines="12-18" - workflow.onComplete = { - println "" - println "Pipeline execution summary:" - println "==========================" - println "Completed at: ${workflow.complete}" - println "Duration : ${workflow.duration}" - println "Success : ${workflow.success}" - println "workDir : ${workflow.workDir}" - println "exit status : ${workflow.exitStatus}" - println "" - - if (workflow.success) { - println "✅ Pipeline completed successfully!" - println "Results are in: ${params.outdir ?: 'results'}" - } else { - println "❌ Pipeline failed!" 
- println "Error: ${workflow.errorMessage}" + ```groovy title="main.nf" linenums="66" hl_lines="5-22" + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) + + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + } else { + println "❌ Pipeline failed!" + println "Error: ${workflow.errorMessage}" + } } } ``` === "Before" - ```groovy title="nextflow.config" linenums="5" - workflow.onComplete = { - println "" - println "Pipeline execution summary:" - println "==========================" - println "Completed at: ${workflow.complete}" - println "Duration : ${workflow.duration}" - println "Success : ${workflow.success}" - println "workDir : ${workflow.workDir}" - println "exit status : ${workflow.exitStatus}" - println "" + ```groovy title="main.nf" linenums="66" hl_lines="5-16" + ch_fastp = FASTP(trim_branches.fastp) + ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) + GENERATE_REPORT(ch_samples) + + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + } } ``` @@ -2060,7 +2071,7 @@ Pipeline execution summary: Completed at: 2025-10-10T12:16:00.522569+01:00 Duration : 3.6s Success : true -workDir : /Users/jonathan.manning/projects/training/side-quests/essential_scripting_patterns/work +workDir : /workspaces/training/side-quests/essential_scripting_patterns/work exit status : 0 ✅ Pipeline completed successfully! @@ -2068,67 +2079,77 @@ exit status : 0 You can also write the summary to a file using file operations: -```groovy title="nextflow.config - Writing summary to file" -workflow.onComplete = { - def summary = """ - Pipeline Execution Summary - =========================== - Completed: ${workflow.complete} - Duration : ${workflow.duration} - Success : ${workflow.success} - Command : ${workflow.commandLine} - """ +```groovy title="main.nf - Writing summary to file" +workflow { + // ... your workflow code ... - println summary + workflow.onComplete = { + def summary = """ + Pipeline Execution Summary + =========================== + Completed: ${workflow.complete} + Duration : ${workflow.duration} + Success : ${workflow.success} + Command : ${workflow.commandLine} + """ - // Write to a log file - def log_file = new File("${workflow.launchDir}/pipeline_summary.txt") - log_file.text = summary + println summary + + // Write to a log file + def log_file = file("${workflow.launchDir}/pipeline_summary.txt") + log_file.text = summary + } } ``` -### 8.2. Other Useful Event Handlers - -Besides `onComplete`, there are other event handlers you can use: +### 8.2. 
The `onError` Handler -**`onError`** - Runs only if the workflow fails: +Besides `onComplete`, there is one other event handler you can use: `onError`, which runs only if the workflow fails: -```groovy title="nextflow.config - onError handler" -workflow.onError = { - println "="* 50 - println "Pipeline execution failed!" - println "Error message: ${workflow.errorMessage}" - println "="* 50 - - // Write detailed error log - def error_file = new File("${workflow.launchDir}/error.log") - error_file.text = """ - Workflow Error Report - ===================== - Time: ${new Date()} - Error: ${workflow.errorMessage} - Error report: ${workflow.errorReport ?: 'No detailed report available'} - """ +```groovy title="main.nf - onError handler" +workflow { + // ... your workflow code ... + + workflow.onError = { + println "="* 50 + println "Pipeline execution failed!" + println "Error message: ${workflow.errorMessage}" + println "="* 50 + + // Write detailed error log + def error_file = file("${workflow.launchDir}/error.log") + error_file.text = """ + Workflow Error Report + ===================== + Time: ${new Date()} + Error: ${workflow.errorMessage} + Error report: ${workflow.errorReport ?: 'No detailed report available'} + """ - println "Error details written to: ${error_file}" + println "Error details written to: ${error_file}" + } } ``` -You can use multiple handlers together: +You can use multiple handlers together in your workflow script: -```groovy title="nextflow.config - Combined handlers" -workflow.onError = { - println "Workflow failed: ${workflow.errorMessage}" -} +```groovy title="main.nf - Combined handlers" +workflow { + // ... your workflow code ... -workflow.onComplete = { - def duration_mins = workflow.duration.toMinutes().round(2) - def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" + workflow.onError = { + println "Workflow failed: ${workflow.errorMessage}" + } - println """ - Pipeline finished: ${status} - Duration: ${duration_mins} minutes - """ + workflow.onComplete = { + def duration_mins = workflow.duration.toMinutes().round(2) + def status = workflow.success ? "SUCCESS ✅" : "FAILED ❌" + + println """ + Pipeline finished: ${status} + Duration: ${duration_mins} minutes + """ + } } ``` @@ -2136,12 +2157,12 @@ workflow.onComplete = { In this section, you've learned: -- **Event handler closures**: Closures in `nextflow.config` that run at different lifecycle points +- **Event handler closures**: Closures in your workflow script that run at different lifecycle points - **`onComplete` handler**: For execution summaries and result reporting - **`onError` handler**: For error handling and logging failures - **Workflow object properties**: Accessing `workflow.success`, `workflow.duration`, `workflow.errorMessage`, etc. -Event handlers show how you can use the full power of the Nextflow language within your config files to add sophisticated logging and notification capabilities. +Event handlers show how you can use the full power of the Nextflow language within your workflow scripts to add sophisticated logging and notification capabilities. 
--- diff --git a/side-quests/solutions/essential_scripting_patterns/main.nf b/side-quests/solutions/essential_scripting_patterns/main.nf index 136764dc31..0cfab9b9fe 100644 --- a/side-quests/solutions/essential_scripting_patterns/main.nf +++ b/side-quests/solutions/essential_scripting_patterns/main.nf @@ -66,4 +66,23 @@ workflow { ch_fastp = FASTP(trim_branches.fastp) ch_trimgalore = TRIMGALORE(trim_branches.trimgalore) GENERATE_REPORT(ch_samples) + + workflow.onComplete = { + println "" + println "Pipeline execution summary:" + println "==========================" + println "Completed at: ${workflow.complete}" + println "Duration : ${workflow.duration}" + println "Success : ${workflow.success}" + println "workDir : ${workflow.workDir}" + println "exit status : ${workflow.exitStatus}" + println "" + + if (workflow.success) { + println "✅ Pipeline completed successfully!" + } else { + println "❌ Pipeline failed!" + println "Error: ${workflow.errorMessage}" + } + } } diff --git a/side-quests/solutions/essential_scripting_patterns/nextflow.config b/side-quests/solutions/essential_scripting_patterns/nextflow.config index 80c74b85f4..b57f70c8c2 100644 --- a/side-quests/solutions/essential_scripting_patterns/nextflow.config +++ b/side-quests/solutions/essential_scripting_patterns/nextflow.config @@ -1,23 +1,3 @@ // Nextflow configuration for Groovy Essentials side quest docker.enabled = true - -workflow.onComplete = { - println "" - println "Pipeline execution summary:" - println "==========================" - println "Completed at: ${workflow.complete}" - println "Duration : ${workflow.duration}" - println "Success : ${workflow.success}" - println "workDir : ${workflow.workDir}" - println "exit status : ${workflow.exitStatus}" - println "" - - if (workflow.success) { - println "✅ Pipeline completed successfully!" - println "Results are in: ${params.outdir ?: 'results'}" - } else { - println "❌ Pipeline failed!" - println "Error: ${workflow.errorMessage}" - } -} From 5d530c99777c24a200b22bb5619e45726f873ce9 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 20:13:20 +0100 Subject: [PATCH 46/48] Fix highlight --- docs/side_quests/essential_scripting_patterns.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index aa7b7beb6e..4d320f8012 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -864,7 +864,7 @@ Fix this by adding conditional logic to the `FASTP` process `script:` block. An === "After" - ```groovy title="main.nf" linenums="10" hl_lines="3-26" + ```groovy title="main.nf" linenums="10" hl_lines="3-27" script: // Simple single-end vs paired-end detection def is_single = reads instanceof List ? reads.size() == 1 : true From abc25eb61e0d3d8973fef003e06fd6a498546605 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 20:15:55 +0100 Subject: [PATCH 47/48] Tighten intro --- docs/side_quests/essential_scripting_patterns.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index 4d320f8012..26e33b9beb 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -4,7 +4,7 @@ Nextflow is a programming language that runs on the Java Virtual Machine. 
While You can write a lot of Nextflow without venturing beyond basic syntax for variables, maps, and lists. Most Nextflow tutorials focus on workflow orchestration (channels, processes, and data flow), and you can go surprisingly far with just that. -However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, it helps to think about two distinct aspects of your code: **dataflow** (channels, operators, processes, and workflows—the constructs that control how data moves through your pipeline) and **scripting** (the code you write inside closures, functions, and process scripts to manipulate data, generate commands, allocate resources, and more). While this distinction is somewhat arbitrary—it's all Nextflow code—it provides a useful mental model for understanding when you're orchestrating your pipeline versus when you're using the language's programming features. Mastering both aspects dramatically improves your ability to solve real-world problems efficiently and write clearer, more maintainable workflows. +However, when you need to manipulate data, parse complex filenames, implement conditional logic, or build robust production workflows, it helps to think about two distinct aspects of your code: **dataflow** (channels, operators, processes, and workflows) and **scripting** (the code inside closures, functions, and process scripts). While this distinction is somewhat arbitrary—it's all Nextflow code—it provides a useful mental model for understanding when you're orchestrating your pipeline versus when you're manipulating data. Mastering both dramatically improves your ability to write clear, maintainable workflows. This side quest takes you on a hands-on journey from basic concepts to production-ready patterns. We'll transform a simple CSV-reading workflow into a sophisticated bioinformatics pipeline, evolving it step-by-step through realistic challenges: From 0d6022aeb35216f0621fdda84cf4b2be72050b76 Mon Sep 17 00:00:00 2001 From: Jonathan Manning Date: Tue, 14 Oct 2025 20:58:58 +0100 Subject: [PATCH 48/48] Tiny fixes --- .../essential_scripting_patterns.md | 30 +++++++++++-------- .../essential_scripting_patterns/collect.nf | 4 +-- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/docs/side_quests/essential_scripting_patterns.md b/docs/side_quests/essential_scripting_patterns.md index 26e33b9beb..6c5c46ce0a 100644 --- a/docs/side_quests/essential_scripting_patterns.md +++ b/docs/side_quests/essential_scripting_patterns.md @@ -58,7 +58,7 @@ The `data` directory contains sample files and a main workflow file we'll evolve │ └── trimgalore.nf └── nextflow.config -4 directories, 10 files +3 directories, 10 files ``` Our sample CSV contains information about biological samples that need different processing based on their characteristics: @@ -139,7 +139,7 @@ Here's what that map operation looks like: .view() ``` -This is our first **closure**—an anonymous function you can pass as an argument (similar to lambdas in Python or arrow functions in JavaScript). Closures are essential for working with Nextflow operators. +This is our first **closure** - an anonymous function you can pass as an argument (similar to lambdas in Python or arrow functions in JavaScript). Closures are essential for working with Nextflow operators. The closure `{ row -> return row }` takes a parameter `row` (could be any name: `item`, `sample`, etc.). 
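If closures are new to you, a tiny standalone illustration may help. This snippet is not part of the course files — it simply shows that a closure is a block of code you can store in a variable and invoke like a function:

```groovy
// Illustration only: a closure assigned to a variable and called directly
def to_upper = { text -> text.toUpperCase() }
println to_upper('sample_001')   // prints SAMPLE_001
```

Passing that same block to an operator such as `.map()` is no different — the operator just calls the closure once per channel item.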
@@ -668,7 +668,7 @@ Make the following change to your existing `main.nf` workflow: === "Before" - ```groovy title="main.nf" linenums="4" hl_lines="11" + ```groovy title="main.nf" linenums="4" hl_lines="10-11" .map { row -> // Scripting for data transformation def sample_meta = [ @@ -1336,7 +1336,7 @@ You should see something like: docker run -i --cpu-shares 4096 --memory 2048m -e "NXF_TASK_WORKDIR" -v /workspaces/training/side-quests/essential_scripting_patterns:/workspaces/training/side-quests/essential_scripting_patterns -w "$NXF_TASK_WORKDIR" --name $NXF_BOXID community.wave.seqera.io/library/fastp:0.24.0--62c97b06e8447690 /bin/bash -ue /workspaces/training/side-quests/essential_scripting_patterns/work/48/6db0c9e9d8aa65e4bb4936cd3bd59e/.command.sh ``` -In this example we've chosen an example that requested 4 CPUs (`--cpu-shares 4096`), because it was a high-depth sample, but you should see different CPU allocations depending on the sample depth. Try this for the other tasks as well. +In this example we've chosen an example that requested 2 CPUs (`--cpu-shares 2048`), because it was a high-depth sample, but you should see different CPU allocations depending on the sample depth. Try this for the other tasks as well. Another powerful pattern is using `task.attempt` for retry strategies. To show why this is useful, we're going to start by reducing the memory allocation to FASTP to less than it needs. Change the `memory` directive in `modules/fastp.nf` to `1.GB`: @@ -1530,7 +1530,7 @@ Add the following before the branch operation: === "After" - ```groovy title="main.nf" linenums="28" hl_lines="11" + ```groovy title="main.nf" linenums="28" hl_lines="5-11" ch_samples = channel.fromPath("./data/samples.csv") .splitCsv(header: true) .map(separateMetadata) @@ -1571,14 +1571,18 @@ nextflow run main.nf Because we've chosen a filter that excludes some samples, you should see fewer tasks executed: ```console title="Filtered samples results" - N E X T F L O W ~ version 25.04.3 - -Launching `main.nf` [deadly_woese] DSL2 - revision: 9a6044a969 - -executor > local (5) -[01/7b1483] process > FASTP (2) [100%] 2 of 2 ✔ -[- ] process > TRIMGALORE - -[07/ef53af] process > GENERATE_REPORT (3) [100%] 3 of 3 ✔ +N E X T F L O W ~ version 25.04.3 +Launching `main.nf` [lonely_williams] DSL2 - revision: d0b3f121ec +[94/b48eac] Submitted process > FASTP (2) +[2c/d2b28f] Submitted process > GENERATE_REPORT (2) +[65/2e3be4] Submitted process > GENERATE_REPORT (1) +[94/b48eac] NOTE: Process `FASTP (2)` terminated with an error exit status (137) -- Execution is retried (1) +[3e/0d8664] Submitted process > TRIMGALORE (1) +[6a/9137b0] Submitted process > FASTP (1) +[6a/9137b0] NOTE: Process `FASTP (1)` terminated with an error exit status (137) -- Execution is retried (1) +[83/577ac0] Submitted process > GENERATE_REPORT (3) +[a2/5117de] Re-submitted process > FASTP (1) +[1f/a1a4ca] Re-submitted process > FASTP (2) ``` The filter expression `meta.id && meta.organism && meta.depth >= 25000000` combines truthiness with explicit comparisons: diff --git a/side-quests/essential_scripting_patterns/collect.nf b/side-quests/essential_scripting_patterns/collect.nf index 2202dbbb12..1c31e8b7d1 100644 --- a/side-quests/essential_scripting_patterns/collect.nf +++ b/side-quests/essential_scripting_patterns/collect.nf @@ -1,7 +1,7 @@ def sample_ids = ['sample_001', 'sample_002', 'sample_003'] -// Channel.collect() - groups multiple channel emissions into one +// channel.collect() - groups multiple channel emissions into one ch_input = 
channel.fromList(sample_ids) ch_input.view { sample -> "Individual channel item: ${sample}" } ch_collected = ch_input.collect() -ch_collected.view { list -> "Channel.collect() result: ${list} (${list.size()} items grouped into 1)" } +ch_collected.view { list -> "channel.collect() result: ${list} (${list.size()} items grouped into 1)" }