Name: KHUMBO PUTE
Student ID: MB300-0004/2024
Course: PUB 3127 - Computing for Biologists
Date Started: 11/20/2025
Date Completed: 11/22/2025
This project analyzes genomic scaffold data and protein database records using bash command-line tools. The analysis includes:
- Counting and filtering DNA sequences from FASTA files
- Identifying high-quality scaffolds based on length and coverage criteria
- Extracting and analyzing protein information from database files
- Automating bioinformatics workflows with bash scripts
└── bash_miniproject
    ├── ASSIGNMENT.md
    ├── Data
    │   ├── IP-004_S38_L001_scaffolds.fasta
    │   └── humchrx.txt
    ├── README.md
    ├── results
    │   ├── analysis_summary.txt
    │   ├── filtered_sequences.txt
    │   ├── gene_names_sorted.txt
    │   ├── high_quality_scaffolds.txt
    │   ├── longest_sequence.txt
    │   ├── protein_count.txt
    │   ├── protein_search_results.txt
    │   ├── sequence_count.txt
    │   └── sequence_ids.txt
    └── scripts
        ├── extract_genes.sh
        ├── extract_headers.sh
        ├── filter_by_length.sh
        ├── high_quality_scaffolds.sh
        ├── longest_sequence.sh
        ├── protein_entry_count.sh
        ├── run_analysis.sh
        ├── search_proteins.sh
        └── sequence_count.sh
- Bash shell (Linux, macOS, or WSL on Windows)
- Git and GitHub account
- Basic Unix tools: grep, cut, sort, uniq, wc, head, tail
- Text editor (nano, vim, VS Code, etc.)
The Data/ directory contains:
- IP-004_S38_L001_scaffolds.fasta (~10 MB) - Genomic scaffolds from sequencing assembly
- humchrx.txt (~152 KB) - UniProt protein entries for human chromosome X
- Clone this repository:
  git clone https://github.com/KhumboPute/bash_miniproject.git
  cd bash_miniproject
- Verify data files are present:
  ls -lh Data/
./scripts/extract_headers.sh

Purpose: Extracts all sequence headers from the FASTA file
Output: results/sequence_ids.txt - List of NODE identifiers
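
The header extraction can be sketched roughly as follows. This is a minimal illustration, not the script's actual contents: the function name `extract_headers` is hypothetical, and it assumes only that header lines start with `>` as in standard FASTA.

```shell
# Hypothetical helper: print every FASTA header without the leading '>'.
# Assumes standard FASTA, where each record starts with a '>' line.
extract_headers() {
    grep '^>' "$1" | cut -c2-
}

# Example usage (paths taken from this README):
# extract_headers Data/IP-004_S38_L001_scaffolds.fasta > results/sequence_ids.txt
```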
./scripts/longest_sequence.sh

Purpose: Identifies the scaffold with the longest sequence
Output: results/longest_sequence.txt - Details of the longest scaffold
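
One possible way to find the longest scaffold with grep, sort and head is sketched below. It assumes SPAdes-style headers such as `>NODE_1_length_5000_cov_12.5`, where the length is the fourth underscore-separated field; that header layout and the function name `longest_sequence` are assumptions, not taken from the script itself.

```shell
# Hypothetical helper: print the header of the longest scaffold.
# Assumes headers look like >NODE_<n>_length_<len>_cov_<cov>,
# so the length is field 4 when splitting on '_'.
longest_sequence() {
    grep '^>' "$1" | sort -t_ -k4,4nr | head -n 1
}

# Example usage:
# longest_sequence Data/IP-004_S38_L001_scaffolds.fasta > results/longest_sequence.txt
```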
./scripts/filter_by_length.sh 5000

Purpose: Filters scaffolds with length >= the specified minimum
Output: results/filtered_sequences.txt - Filtered scaffold headers
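
A length filter of this kind might look like the sketch below. It again assumes SPAdes-style headers (`NODE_<n>_length_<len>_cov_<cov>`), uses a hypothetical function name `filter_by_length`, and reads headers with a `while read` loop, which may differ from the loop used in the actual script.

```shell
# Hypothetical helper: print headers whose length is >= a minimum.
# $1 = FASTA file, $2 = minimum length (defaults to 5000, as in the example above).
# Assumes the length is field 4 of the header when splitting on '_'.
filter_by_length() {
    min_len=${2:-5000}
    grep '^>' "$1" | cut -c2- | while IFS= read -r header; do
        len=$(printf '%s\n' "$header" | cut -d_ -f4)
        if [ "$len" -ge "$min_len" ]; then
            printf '%s\n' "$header"
        fi
    done
}

# Example usage:
# filter_by_length Data/IP-004_S38_L001_scaffolds.fasta 5000 > results/filtered_sequences.txt
```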
./scripts/high_quality_scaffolds.sh

Purpose: Identifies scaffolds meeting both length and coverage criteria
Output: results/high_quality_scaffolds.txt - High-quality scaffold list
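
Combining a length threshold with a decimal coverage threshold could be sketched as below. This version swaps in awk for the comparison because the shell's `[ ... -ge ... ]` test only handles integers; the function name `high_quality_scaffolds`, the SPAdes-style header layout, and the default thresholds (10000 and 5.0) are assumptions for illustration.

```shell
# Hypothetical helper: print headers that pass both thresholds.
# $1 = FASTA file, $2 = minimum length, $3 = minimum coverage.
# Assumes headers look like NODE_<n>_length_<len>_cov_<cov>:
# field 4 is the length and field 6 is the coverage.
high_quality_scaffolds() {
    grep '^>' "$1" | cut -c2- |
        awk -F_ -v min_len="${2:-10000}" -v min_cov="${3:-5.0}" \
            '$4 + 0 >= min_len + 0 && $6 + 0 >= min_cov + 0'
}

# Example usage:
# high_quality_scaffolds Data/IP-004_S38_L001_scaffolds.fasta > results/high_quality_scaffolds.txt
```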
./scripts/extract_genes.sh

Purpose: Extracts unique gene names from the protein database
Output: results/gene_names_sorted.txt - Sorted unique gene names
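
A rough sketch of the gene extraction with cut, sort and uniq follows. It assumes that every protein line in humchrx.txt contains a UniProt entry name ending in `_HUMAN` and that the gene name is the first space-delimited column; both points, and the function name `extract_genes`, are assumptions about the file layout rather than facts from the script.

```shell
# Hypothetical helper: list unique gene names in sorted order.
# Assumes entry lines contain '_HUMAN' and start with the gene name,
# so header and separator lines are skipped by the grep.
extract_genes() {
    grep '_HUMAN' "$1" | cut -d' ' -f1 | sort | uniq
}

# Example usage:
# extract_genes Data/humchrx.txt > results/gene_names_sorted.txt
```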
./scripts/search_proteins.sh "kinase"

Purpose: Searches for proteins matching a keyword
Output: results/protein_search_results.txt - Matching protein entries
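
The keyword search is essentially a grep over the protein file. The sketch below uses a hypothetical function name `search_proteins` and assumes case-insensitive matching (`-i`); whether the actual script ignores case is not stated here.

```shell
# Hypothetical helper: print lines of a protein file that match a keyword.
# $1 = keyword, $2 = protein file. The '--' stops a keyword that starts
# with '-' from being read as a grep option.
search_proteins() {
    grep -i -- "$1" "$2"
}

# Example usage:
# search_proteins "kinase" Data/humchrx.txt > results/protein_search_results.txt
```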
./scripts/protein_entry_count.sh

Purpose: Counts the number of proteins in the protein database
Output: results/protein_count.txt - Number of proteins in the database

./scripts/sequence_count.sh

Purpose: Counts the number of sequences in the FASTA file
Output: results/sequence_count.txt - Number of DNA sequences in the FASTA file
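
Both counting steps can be sketched with `grep -c`, which prints the number of matching lines. The function names are hypothetical, and the protein count assumes that each entry line in humchrx.txt contains an entry name ending in `_HUMAN`, which is an assumption about the file layout.

```shell
# Hypothetical helper: count FASTA records (one '>' header line per sequence).
count_sequences() {
    grep -c '^>' "$1"
}

# Hypothetical helper: count protein entries.
# Assumes every entry line contains a UniProt entry name ending in '_HUMAN'.
count_proteins() {
    grep -c '_HUMAN' "$1"
}

# Example usage:
# count_sequences Data/IP-004_S38_L001_scaffolds.fasta > results/sequence_count.txt
# count_proteins Data/humchrx.txt > results/protein_count.txt
```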
./scripts/run_analysis.sh ./Data/

Purpose: Runs all analyses in sequence and generates a comprehensive summary
Output:
- All result files from individual scripts
- results/analysis_summary.txt - Summary of all analyses with counts and timestamp
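
The summary step of a master script could be sketched as below. This is only an illustration of writing a timestamp plus per-file counts: the function name `write_summary` and the line-count format are assumptions, and the real run_analysis.sh may assemble its summary differently.

```shell
# Hypothetical helper: print a timestamp and the line count of each result file.
# $1 = results directory. The summary file itself is skipped so the
# output can be redirected into that same directory.
write_summary() {
    results_dir=$1
    printf 'Analysis summary generated on %s\n' "$(date '+%Y-%m-%d %H:%M:%S')"
    for f in "$results_dir"/*.txt; do
        [ -f "$f" ] || continue
        if [ "$(basename "$f")" = "analysis_summary.txt" ]; then
            continue
        fi
        printf '%s: %s lines\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
    done
}

# Example usage, after the individual scripts have run:
# write_summary results > results/analysis_summary.txt
```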
- Total number of sequences: 35079
- Longest sequence: NODE_1
- Number of sequences with length >= 5000: 283
- Number of high-quality scaffolds (length >= 10000, coverage >= 5.0): 33
- Total protein entries: 890
- Number of unique genes: 888
- Number of protein entries matching an example search (keyword "kinase"): 38
- How quickly the code completed the intended task: the scripts took a long time to write, but they executed in seconds
- How to read through complicated information to extract what you want
- Breaking down huge problems into smaller ones to accomplish the task
| Script Name | Purpose | Key Commands Used |
|---|---|---|
| extract_headers.sh | Extract NODE identifiers from FASTA file | grep, cut |
| longest_sequence.sh | Find the scaffold with the longest sequence | grep, sort, head |
| filter_by_length.sh | Filter scaffolds by minimum length | grep, cut, for, if |
| high_quality_scaffolds.sh | Identify high-quality scaffolds | grep, cut, for, if |
| extract_genes.sh | Extract unique gene names from protein file | cut, sort, uniq |
| search_proteins.sh | Search for proteins by keyword | grep |
| run_analysis.sh | Master script that runs all analyses | [calls all other scripts] |
Challenge 1: I didn't know what kind of backup file my editor (nano) creates.
Solution: I looked it up online and found that when the backup option is enabled on opening nano (the -B flag), it saves a backup copy named after the file with a trailing tilde (filename~), so I added that pattern to the .gitignore file.
Challenge 2: After I pushed my repository, GitHub did not show the results and scripts folders.
Solution: I searched online and learned that Git does not track empty directories. The suggested fix was to create a .gitkeep file in each folder. I also used the -f flag (git add -f) to override the .gitignore rule for the results directory.
Challenge 3: Comparing floating-point numbers, which works differently from comparing integers in bash.
Solution: I looked up the syntax for working with floats so that I could compare coverage values.
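
To illustrate the float issue: the shell's `[ 5.2 -ge 5 ]` test fails with an "integer expression expected" style error because `-ge` only accepts integers. One common workaround, sketched here with a hypothetical helper named `float_ge`, is to let awk do the comparison and report the result through its exit status. This is one option among several (bc is another) and not necessarily the syntax used in the project.

```shell
# Hypothetical helper: succeed (exit status 0) when $1 >= $2, treating both as decimals.
# awk evaluates the comparison; '!' flips true (1) into exit status 0.
float_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 >= b + 0) }'
}

# Example usage with a coverage value:
coverage=7.25
if float_ge "$coverage" 5.0; then
    echo "coverage $coverage passes the 5.0 threshold"
fi
```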
Challenge 4: Handling errors using if statements.
Solution: I looked up the general structure online and then replaced the variables with the ones I needed.
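
A typical shape for this kind of check is sketched below: verify that an argument was given and that the file exists before doing any work. The function name `check_input` and the messages are hypothetical; the project's scripts may word or structure their checks differently.

```shell
# Hypothetical helper: validate a script's input file argument.
# Returns 0 when a readable regular file was given, 1 otherwise,
# and writes error messages to stderr.
check_input() {
    if [ "$#" -lt 1 ]; then
        echo "Usage: check_input <input_file>" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "Error: file not found: $1" >&2
        return 1
    fi
    return 0
}

# Example usage at the top of a script:
# check_input "$1" || exit 1
```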
- New bash commands or concepts learned: using command-line arguments in scripts
- How command-line tools can be useful for bioinformatics: they make it practical to work with large files
- Insights about version control with Git: I can change my files while keeping the original versions
- How this project relates to my research interests: I will probably work with files downloaded from databases
- Introduction to Linux Lectures
- Course materials: PUB 3127 - Computing for Biologists
- Bash manual: man bash
- FASTA format: https://en.wikipedia.org/wiki/FASTA_format
This project is for educational purposes as part of PUB 3127 coursework.
- Instructor: Dr. Kibet
- Institution: Pan African University of Science, Technology and Innovation
- Data sources: [Sequencing data and UniProt database]