The code contained in this repository enables the reproduction of the results of:
Castro-Rivadeneyra et al. 2022. Found in translation: Microproteins are a new class of host cell impurity in mAb drug products
The publication is freely available here: xxxxxxx
Abstract:
Mass spectrometry (MS) has emerged as a powerful approach for the detection of Chinese hamster ovary (CHO) cell protein impurities in antibody drug products. The incomplete annotation of the Chinese hamster genome, however, limits the coverage of MS-based host cell protein (HCP) analysis.
In this study, we performed ribosome footprint profiling (Ribo-seq) of translation initiation and elongation to refine the Chinese hamster genome annotation. Analysis of these data resulted in the identification of thousands of previously uncharacterised non-canonical proteoforms in CHO cells, such as N-terminally extended proteins and short open reading frames (sORFs) predicted to encode microproteins. MS-based HCP analysis of adalimumab and trastuzumab with the extended protein sequence database resulted in the detection of CHO cell microprotein impurities in mAb drug product for the first time. Further analysis revealed that the CHO cell microprotein population is altered over the course of cell culture and in response to a change in cell culture temperature. The annotation of non-canonical Chinese hamster proteoforms permits a more comprehensive characterisation of HCPs in antibody drug products using MS.
To be completed when data is uploaded to SRA and ENA
# change to working directory
cd ribosome_footprint_profiling
# download data
./scripts/get_raw_data.sh
Download the PICR-H reference genome from NCBI and create a STAR index for mapping
# create a directory
mkdir -p reference_genome
# NCBI url
url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/668/045
# get the sequence, feature table and GTF, GFF annotation files
wget "$url/GCF_003668045.3_CriGri-PICRH-1.0/GCF_003668045.3_CriGri-PICRH-1.0_genomic.fna.gz" \
-P reference_genome
wget "$url/GCF_003668045.3_CriGri-PICRH-1.0/GCF_003668045.3_CriGri-PICRH-1.0_genomic.gtf.gz" \
-P reference_genome
wget "$url/GCF_003668045.3_CriGri-PICRH-1.0/GCF_003668045.3_CriGri-PICRH-1.0_feature_table.txt.gz" \
-P reference_genome
wget "$url/GCF_003668045.3_CriGri-PICRH-1.0/GCF_003668045.3_CriGri-PICRH-1.0_genomic.gff.gz" \
-P reference_genome
# unzip
gunzip reference_genome/*.gz
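Before building the index, it can help to confirm that all four files were downloaded and decompressed. The following is an optional check and not part of the repository scripts; the function name `check_reference_files` is hypothetical, and the file names are taken from the download commands above.

```shell
# Optional sanity check (not part of the repository scripts).
# check_reference_files is a hypothetical helper name.
check_reference_files() {
  local dir=$1
  local prefix=GCF_003668045.3_CriGri-PICRH-1.0
  local missing=0
  local suffix
  for suffix in genomic.fna genomic.gtf feature_table.txt genomic.gff; do
    # -s is true only if the file exists and is not empty
    if [ ! -s "$dir/${prefix}_${suffix}" ]; then
      echo "missing or empty: $dir/${prefix}_${suffix}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# usage, after the download and gunzip steps:
# check_reference_files reference_genome && echo "reference files OK"
```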
A STAR index is created to map the Ribo-seq and RNA-seq data
# create a directory
mkdir reference_genome/star_index_ncbi
# set the path to STAR
star_path=../bin/STAR-2.7.8a/bin/Linux_x86_64
# build the index
$star_path/STAR --runThreadN 16 \
--runMode genomeGenerate \
--sjdbOverhang 74 \
--genomeChrBinNbits 16 \
--genomeDir reference_genome/star_index_ncbi \
--genomeFastaFiles reference_genome/GCF_003668045.3_CriGri-PICRH-1.0_genomic.fna \
--sjdbGTFfile reference_genome/GCF_003668045.3_CriGri-PICRH-1.0_genomic.gtf
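The STAR manual recommends setting `--sjdbOverhang` to the read length minus one, so the value of 74 used above would correspond to 75 nt reads (the read length is an inference, it is not stated in this README). If your reads differ, the sketch below shows one way to derive the value from a FASTQ file; `max_read_length` is a hypothetical helper and the FASTQ path in the usage line is a placeholder.

```shell
# Sketch: derive a --sjdbOverhang value as (longest read length - 1).
# Reads FASTQ on stdin; sequence lines are every 4th line starting at line 2.
# max_read_length is a hypothetical helper, not a repository script.
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
       END { print max + 0 }'
}

# usage (the FASTQ path is a placeholder):
# overhang=$(( $(gunzip -c path/to/rnaseq_sample.fastq.gz | max_read_length) - 1 ))
```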
This script preprocesses the raw sequencing data. For all data types, adapters and low-quality bases are removed. For Ribo-seq data, contaminating RNA species (rRNA, tRNA and snoRNA) are removed by mapping to individual indexes; the remaining reads are then filtered by length, and only those within the expected RPF range (28-31 nt) are retained. Finally, the reads from all replicates are merged.
# preprocess
./scripts/preprocess_reads.sh
# count the reads removed by filtering as well as the final RPFs
./scripts/fastq_read_count.sh
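To illustrate the RPF length filter described above, here is a minimal sketch that keeps only FASTQ records with a sequence length inside a given range. It is not taken from `preprocess_reads.sh`, which may use a different tool for this step; the function name `filter_fastq_by_length` and the file names in the usage line are hypothetical.

```shell
# Sketch of the RPF length filter: keep FASTQ records whose sequence is
# between min and max nt long (inclusive). Reads FASTQ on stdin.
# filter_fastq_by_length is a hypothetical name.
filter_fastq_by_length() {
  local min=$1 max=$2
  awk -v min="$min" -v max="$max" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      n = length(seq)
      if (n >= min && n <= max) {
        print header; print seq; print plus; print $0
      }
    }'
}

# usage (file names are placeholders):
# filter_fastq_by_length 28 31 < sample.trimmed.fastq > sample.rpf.fastq
```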
Calculation of the P-site offset and analysis of triplet periodicity of RPFs for the merged and individual samples.
./scripts/identify_RPF_psite.sh
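Triplet periodicity means that, once the P-site offset is applied, most RPFs fall in the same reading frame as the annotated CDS. The sketch below counts P-sites per frame. It assumes a hypothetical one-column input of P-site positions relative to the first nucleotide of the start codon (position 0), which is not necessarily the format produced by `identify_RPF_psite.sh`.

```shell
# Sketch: count P-sites in frames 0, 1 and 2.
# Input on stdin: one integer per line, the P-site position relative to the
# first nucleotide of the start codon (hypothetical format).
# Output: three tab-separated counts for frame 0, 1 and 2.
frame_counts() {
  awk '{ f = (($1 % 3) + 3) % 3; c[f]++ }
       END { printf "%d\t%d\t%d\n", c[0], c[1], c[2] }'
}

# usage (the input file name is a placeholder):
# frame_counts < psite_positions.txt
```

A strong excess of counts in frame 0 relative to frames 1 and 2 suggests that the chosen offset is reasonable.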
We have built a Docker image with ORF-RATER and the required packages to ensure future compatibility. The merged BAM files for the harringtonine, cycloheximide and no-drug Ribo-seq, as well as the Chinese hamster reference, are used with ORF-RATER to identify ORFs.
# get docker image
docker pull clarkelab/orfrater:final
# run ORF-RATER
./scripts/identify_ORFs.sh
Filter the ORF-RATER output to remove:
- ORFs < 5aa & ORF-RATER score < 0.05
- Truncations and internal ORFs
- Overlapping ORFs that share a stop codon (only the longest is retained)
A list of ORFs in non-coding RNAs is created for downstream differential expression analysis
# create a directory to store ids for amino acid analysis and plastid quantitation
mkdir orf_lists
# mkdir to store results
mkdir results/section_2.2
# filter the ORF-RATER output
Rscript ./scripts/filter_ORFs.R
./scripts/get_ORF_amino_acid_sequences.sh
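The last filtering rule listed above (keep only the longest ORF among those sharing a stop codon) can be sketched as follows. This is an illustration only: the actual filtering is done in `filter_ORFs.R`, and the four-column table layout and the function name `keep_longest_per_stop` are assumptions made for this example.

```shell
# Sketch: for ORFs that share a stop codon, keep only the longest.
# Input on stdin, tab-separated (hypothetical layout):
#   1 = ORF id, 2 = transcript id, 3 = stop codon position, 4 = length in aa
# If two ORFs tie on length, the first one seen is kept.
keep_longest_per_stop() {
  awk -F '\t' '
    {
      key = $2 FS $3
      if (!(key in best_len) || $4 + 0 > best_len[key]) {
        best_len[key] = $4 + 0
        best_row[key] = $0
      }
    }
    END { for (key in best_row) print best_row[key] }' | sort
}

# usage (the file names are placeholders):
# keep_longest_per_stop < orfs.tsv > orfs.longest.tsv
```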
Here we extract the amino acid sequences for short ORFs and combine them with the UniProt Chinese hamster proteins. The resulting database can be used for MS-based HCP analysis.
mkdir proteomics_db
# create the protein sequence database
./scripts/create_ms_fasta.sh
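As an illustration of selecting short ORF sequences from a protein FASTA file, the sketch below keeps records at or below a length cutoff. It is not the logic of `create_ms_fasta.sh`; the function name `fasta_max_length`, the cutoff value and the file names in the usage lines are all hypothetical.

```shell
# Sketch: print FASTA records whose sequence has at most $1 residues.
# Reads FASTA on stdin; multi-line sequences are joined onto one line.
# fasta_max_length is a hypothetical helper name.
fasta_max_length() {
  awk -v max="$1" '
    function emit() {
      if (header != "" && length(seq) <= max) { print header; print seq }
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }'
}

# usage (cutoff and file names are placeholders):
# fasta_max_length 100 < orf_proteins.fasta > short_orfs.fasta
# cat uniprot_chinese_hamster.fasta short_orfs.fasta > combined_db.fasta
```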
Here a Plastid reference is created to enable the determination of the RPKM of transcripts and gene-level counting. A mask is created to eliminate the first 5 and last 15 codons of ORFs > 100aa; for ORFs < 100aa, the first and last codons are excluded from the counting process.
mkdir plastid_reference
# make the plastid reference
./scripts/make_plastid_reference.sh
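The effect of the mask on the number of codons that contribute to counting can be worked through with a small helper. This is arithmetic only and does not reproduce `make_plastid_reference.sh`. Two assumptions are made: the stop codon is ignored, and ORFs of exactly 100 aa (which the description above does not cover) are treated like short ORFs. The name `counted_codons` is hypothetical.

```shell
# Sketch: number of codons left for counting after masking.
#   ORFs > 100 aa: first 5 and last 15 codons are masked
#   other ORFs:    first and last codon are masked
# Assumptions: stop codon ignored; ORFs of exactly 100 aa treated as short.
counted_codons() {
  local len_aa=$1
  if [ "$len_aa" -gt 100 ]; then
    echo $(( len_aa - 5 - 15 ))
  else
    echo $(( len_aa - 2 ))
  fi
}

# usage:
# counted_codons 150   # a 150 aa ORF
# counted_codons 50    # a 50 aa sORF
```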
The RPKM is calculated for each annotated transcript
mkdir -p quantitation/transcript_cds_rpkm
# quantitate
./scripts/calculate_rpkm.sh
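For reference, RPKM is the read count in a region scaled by the region length in kilobases and the library size in millions of mapped reads. The sketch below implements this standard definition. How `calculate_rpkm.sh` chooses the library total is not stated here, so treat the third argument as an assumption; the function name `rpkm` is hypothetical.

```shell
# Sketch: reads per kilobase per million mapped reads.
#   $1 = reads counted in the region
#   $2 = region length in nt (after masking)
#   $3 = total mapped reads in the library (assumed normalisation total)
rpkm() {
  awk -v c="$1" -v l="$2" -v t="$3" \
    'BEGIN { printf "%.2f\n", c * 1e9 / (l * t) }'
}

# usage: 100 reads in a 1000 nt region, library of 1,000,000 mapped reads
# rpkm 100 1000 1000000
```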
To enable the identification of differential translation between the NTS and TS conditions, the mapped Ribo-seq and RNA-seq CDS counts are determined. For the Ribo-seq
mkdir quantitation/gene_cds_counts
# count
./scripts/calculate_gene_cds_counts.sh
First we count the RPFs and RNA-seq reads mapping to CDS regions using Plastid
# count
./scripts/calculate_gene_cds_count.sh
Then DESeq2 is used to calculate differential expression from the Ribo-seq and RNA-seq reads separately, before differential translation analysis is carried out.
# the mouse annotation is used to replace missing CGR gene symbols
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_feature_table.txt.gz \
-P reference_genome
gunzip reference_genome/GCF_000001635.27_GRCm39_feature_table.txt.gz
# determine differential expression and translation
Rscript ./scripts/run_deseq2.R
Here we make the required alignment tracks for figures from the genome and transcriptome BAMs using individual replicates and the merged data
./scripts/make_coverage_tracks.sh
The following R notebooks allow the reproduction of figures and tables in the manuscript
results/r_scripts/section_2_1.Rmd
Outputs of the ORF-RATER algorithm for the Chinese hamster genome
results/r_scripts/section_2_2.Rmd
Analysis of the global effect of uORFs at the transcript level
results/r_scripts/section_2_3.Rmd
results/r_scripts/section_2_4.Rmd
2.5 The translation efficiency of sORFs found in non-coding RNA genes is altered in response to mild hypothermia in CHO cells
DESeq2 analysis of ORFs identified in the Chinese hamster ncRNA
results/r_scripts/section_2_5.Rmd
2.6 Microproteins are differentially expressed between the exponential and stationary phases of CHO cell culture
results/r_scripts/section_2_6.Rmd
results/r_scripts/Supplementary_Results.Rmd