PDB Processor with PySpark & Docker

This program processes large-scale Protein Data Bank (PDB) datasets on a local cluster using PySpark, the Python API for Apache Spark. The PDB dataset spans approximately 4–5 TB, so processing it sequentially is slow and inefficient. PySpark distributes the work across multiple cores or nodes, which makes the processing faster and scalable.
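
As a rough illustration of the distributed approach described above, the sketch below spreads a list of PDB files over Spark partitions and applies a per-file function to each one. The function names (`count_atom_records`, `process_file`, `run_with_spark`), the `data` directory, the `*.pdb` file pattern, and the `local[*]` master are assumptions made for this example; they are not taken from the repository's scripts.

```python
import glob
import os


def count_atom_records(pdb_text):
    """Count ATOM/HETATM lines in the text of one PDB file."""
    return sum(
        1 for line in pdb_text.splitlines()
        if line.startswith(("ATOM  ", "HETATM"))
    )


def process_file(path):
    """Per-file work run on a Spark worker: read one file and summarize it."""
    with open(path, "r") as handle:
        return os.path.basename(path), count_atom_records(handle.read())


def run_with_spark(data_dir="data", master="local[*]"):
    """Distribute per-file processing over the available cores (assumed layout)."""
    # Imported lazily so the helpers above stay usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName("pdb-sketch").getOrCreate()
    try:
        paths = sorted(glob.glob(os.path.join(data_dir, "*.pdb")))
        if not paths:
            return []
        # One slice per file, capped at 64 slices, so work is spread evenly.
        rdd = spark.sparkContext.parallelize(paths, numSlices=min(len(paths), 64))
        return rdd.map(process_file).collect()
    finally:
        spark.stop()
```

Calling `run_with_spark()` would return a list of `(file name, atom count)` pairs. Distributing file paths rather than file contents keeps the driver light, since each worker reads only the files assigned to it.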

The processed PDB data is split into two categories:

Atomic Coordinate Data (CSV format):

Contains detailed 3D atomic coordinates of biomolecular structures. Primarily used for structural analysis, computational modeling, and machine learning applications in structural bioinformatics.

Experimental Metadata (JSON format):

Includes key information such as resolution, R-free, R-factor, deposition date, and experimental method. Designed for efficient querying, facilitating large-scale statistical analysis, metadata filtering, and integration with external biological databases.
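
To make the Atomic Coordinate Data category concrete, here is a minimal sketch of turning the fixed-width ATOM/HETATM records of a PDB file into CSV rows. The column offsets follow the standard PDB file format; the CSV schema in `FIELDS` and the function names `parse_atom_line` and `atoms_to_csv` are assumptions for illustration and may differ from what `pdb_coordinate_data_extraction.py` actually produces.

```python
import csv
import io

# Column names for the CSV output (an assumed schema, not the repository's).
FIELDS = ["serial", "atom_name", "res_name", "chain_id", "res_seq",
          "x", "y", "z", "element"]


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record; return None for other lines."""
    if not line.startswith(("ATOM  ", "HETATM")):
        return None
    # Pad short lines so slicing the trailing element columns is safe.
    line = line.rstrip("\n").ljust(80)
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21].strip(),
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


def atoms_to_csv(pdb_text):
    """Return CSV text (with a header row) for every atom record in pdb_text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in pdb_text.splitlines():
        row = parse_atom_line(line)
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()
```

In a Spark job, a function like `atoms_to_csv` could be applied once per file, for example over the `(path, content)` pairs returned by `SparkContext.wholeTextFiles`.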

By preprocessing and structuring PDB data in this format, the system enhances data accessibility, supports downstream ML model training, and enables rapid querying of experimental conditions. This architecture is particularly suited for high-throughput structural bioinformatics pipelines and large-scale machine learning applications in computational biology and cheminformatics.
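
The metadata side can be sketched in the same spirit. The example below pulls the fields named above out of the HEADER, EXPDTA, REMARK 2, and REMARK 3 records of a PDB file, serializes them as JSON, and filters records by resolution. The JSON keys, the function names `extract_metadata` and `filter_by_resolution`, and the choice to read the R-factor from the "R VALUE (WORKING SET)" line are assumptions for illustration; `pdb_metadata_extraction.py` may work differently.

```python
import json
import re

# Patterns for numeric values in REMARK 2 / REMARK 3 (keys are assumed names).
PATTERNS = {
    "resolution": re.compile(r"^REMARK\s+2\s+RESOLUTION\.\s+(\d+\.\d+)\s+ANGSTROMS"),
    "r_factor": re.compile(r"^REMARK\s+3\s+R VALUE\s+\(WORKING SET\)\s*:\s*(\d+\.\d+)"),
    "r_free": re.compile(r"^REMARK\s+3\s+FREE R VALUE\s*:\s*(\d+\.\d+)"),
}


def extract_metadata(pdb_text):
    """Collect experimental metadata from one PDB file into a dict."""
    meta = {"pdb_id": None, "deposition_date": None,
            "experimental_method": None,
            "resolution": None, "r_factor": None, "r_free": None}
    for line in pdb_text.splitlines():
        if line.startswith("HEADER"):
            # Fixed columns: deposition date in 51-59, ID code in 63-66.
            meta["deposition_date"] = line[50:59].strip() or None
            meta["pdb_id"] = line[62:66].strip() or None
        elif line.startswith("EXPDTA") and meta["experimental_method"] is None:
            meta["experimental_method"] = line[10:79].strip() or None
        elif line.startswith("REMARK"):
            for key, pattern in PATTERNS.items():
                match = pattern.match(line)
                # Keep the first occurrence; values such as NULL never match.
                if match and meta[key] is None:
                    meta[key] = float(match.group(1))
    return meta


def filter_by_resolution(records, max_resolution):
    """Keep records whose resolution is known and at most max_resolution."""
    return [r for r in records
            if r["resolution"] is not None and r["resolution"] <= max_resolution]


def to_json(meta):
    """Serialize one metadata record as a JSON string."""
    return json.dumps(meta, sort_keys=True)
```

Fields that are absent or not numeric in a given entry (for example, the resolution of an NMR structure) stay `None`, which serializes to JSON `null` and is skipped by the resolution filter.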

Installation

Follow these steps to set up and run the project.

1. Clone the Repository

git clone git@github.com:SanketSatishShelke/PDB_Processor_PySpark.git
cd PDB_Processor_PySpark

2. Install Dependencies

pip install -r requirements.txt

Usage

1. Run without Docker

python pdb_coordinate_data_extraction.py
python pdb_metadata_extraction.py

2. Run with Docker

docker build -t pdb_extract .
docker run -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output pdb_extract

