This script efficiently processes a large number of images from public URLs listed in a CSV file. It downloads each image and uploads it directly to a specified AWS S3 bucket.
The script is robust for high-volume tasks, featuring concurrent processing, automatic resumption after failure, and detailed error logging.
- Reliable Download & Upload: Downloads images from public URLs and streams them directly to an S3 bucket without saving them to a local disk.
- Concurrent Processing: Uses a `ThreadPoolExecutor` to process multiple images simultaneously, dramatically reducing the time required for large datasets (a minimal sketch of this flow follows this list).
- Automatic Resumability: If the script is stopped for any reason, it can be restarted and will automatically skip any images that were already successfully processed.
- Robust Error Logging: Any URLs that fail to process are logged to a separate `failed_urls.log` file with a corresponding error message for easy review and debugging.
- Secure Credential Management: Uses a `.env` file to manage AWS credentials, keeping them separate from the source code and out of version control.
- Live Progress Bar: Provides a real-time progress bar using `tqdm` so you can monitor the status and estimate the time to completion.
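Put together, the download-and-upload flow looks roughly like the sketch below. This is a minimal illustration rather than the actual script: the names `process_row`, `BUCKET`, and `PREFIX`, the key-naming scheme, and the sample URLs are assumptions made for the example; the real configuration constants are covered later in this README.

```python
import concurrent.futures
import os

import boto3
import requests
from tqdm import tqdm

# Illustrative values only; the real script takes these from its CONFIGURATION block.
BUCKET = "your-s3-bucket-name"
PREFIX = "your/s3/folder"
MAX_WORKERS = 30

s3 = boto3.client("s3")

def process_row(index, url):
    """Download one image and stream it straight into S3, never touching local disk."""
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # 404/403 surface here as HTTPError
    key = f"{PREFIX}/{os.path.basename(url.split('?')[0])}"  # hypothetical key scheme
    s3.upload_fileobj(response.raw, BUCKET, key)  # reads from the open HTTP stream
    return index

# In the real script the (index, url) pairs come from the CSV file.
rows = [(0, "https://example.com/cat.jpg"), (1, "https://example.com/dog.jpg")]

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    futures = [pool.submit(process_row, i, url) for i, url in rows]
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        future.result()  # re-raises any download or upload error
```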
The script maintains state and logs errors using two key files:
- `processed_indices.log`: Every time an image is successfully downloaded and uploaded, the script writes the corresponding row index from the CSV file into this log. On startup, the script reads this file to know which images to skip.
- `failed_urls.log`: If an error occurs (e.g., the URL is broken, a network error occurs, or an upload fails), the script writes the problematic URL and the error details to this file. This creates a clean list of all images that require manual investigation.
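One plausible implementation of this bookkeeping is sketched below; the helper names `load_processed_indices`, `mark_processed`, and `mark_failed` are illustrative, not taken from the script.

```python
PROCESSED_LOG = "processed_indices.log"
FAILED_LOG = "failed_urls.log"

def load_processed_indices():
    """Return the set of CSV row indices that were already uploaded."""
    try:
        with open(PROCESSED_LOG) as f:
            return {int(line) for line in f if line.strip()}
    except FileNotFoundError:
        return set()  # first run: nothing processed yet

def mark_processed(index):
    # Append-only, so an interrupted run never corrupts earlier entries.
    with open(PROCESSED_LOG, "a") as f:
        f.write(f"{index}\n")

def mark_failed(url, error):
    with open(FAILED_LOG, "a") as f:
        f.write(f"{url}\t{error}\n")

# On startup: skip anything already recorded in processed_indices.log.
done = load_processed_indices()
# rows_to_process = [(i, url) for i, url in enumerate(urls) if i not in done]
```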
```
/s3-image-importer/
│
├── .env                    # Your secret AWS credentials (not committed to Git)
├── .gitignore              # Ensures .env and venv/ are not committed to Git
├── your_image_urls.csv     # Your input data file with image URLs
├── s3_image_importer.py    # The main Python script
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── venv/                   # The isolated Python virtual environment (not committed)
│
│   # Files generated after running the script:
│
├── processed_indices.log   # Tracks successful copies
└── failed_urls.log         # Logs any errors
```
Follow these steps to create a clean, isolated environment for the project.
- Python 3.8 or newer installed
- Git installed
Clone this repository to your local machine and navigate into the project directory:
```bash
git clone https://github.com/JamesonCodes/s3-image-importer.git
cd s3-image-importer
```
This step creates an isolated environment for this project's dependencies.
```bash
# Create the virtual environment (this creates the venv/ folder)
python3 -m venv venv

# Activate the environment (you must do this in every new terminal session)
source venv/bin/activate
```
Your terminal prompt should now be prefixed with `(venv)`.
Place your CSV data file (e.g., `your_image_urls.csv`) in the project folder. This file should contain the image URLs you want to process.
With the virtual environment active, install all required Python libraries using the `requirements.txt` file:
```bash
pip install -r requirements.txt
```
Create a file named `.env` in the project folder and add your AWS credentials:
```
AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_AWS_ACCESS_KEY_HERE
AWS_DEFAULT_REGION=us-east-1
```
The `.gitignore` file is already configured to ignore `.env` and `venv/`, keeping your secrets and environment files out of version control.
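For reference, a script typically picks these variables up with the `python-dotenv` package and hands them to `boto3`. The snippet below is a minimal sketch of that pattern, not a copy of `s3_image_importer.py` (boto3 would also read the same environment variables on its own):

```python
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
)
```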
Before running the script, you must configure the constants at the top of the `s3_image_importer.py` file to match your needs:
```python
# --- CONFIGURATION ---
CSV_FILE_PATH = 'your_image_urls.csv'  # Your CSV filename
URL_COLUMN_NAME = 'URL'                # Column with image URLs (update if your CSV uses a different column name)
DEST_S3_BUCKET = 'your-s3-bucket-name' # Your destination S3 bucket
DEST_S3_FOLDER = 'your/s3/folder'      # The folder (prefix) within the bucket (e.g., 'Batch1' or 'images/')
MAX_WORKERS = 30                       # Number of concurrent downloads
```
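To make the constants concrete, here is an illustrative example of how they might drive the CSV read and the destination keys. The key-naming scheme shown is hypothetical, since the README does not spell out how the script names uploaded objects:

```python
import csv
import os

CSV_FILE_PATH = "your_image_urls.csv"
URL_COLUMN_NAME = "URL"
DEST_S3_BUCKET = "your-s3-bucket-name"
DEST_S3_FOLDER = "your/s3/folder"

with open(CSV_FILE_PATH, newline="") as f:
    for index, row in enumerate(csv.DictReader(f)):
        url = row[URL_COLUMN_NAME]  # a KeyError here means URL_COLUMN_NAME is wrong
        # Hypothetical naming scheme: keep the original filename under the configured prefix.
        key = f"{DEST_S3_FOLDER.rstrip('/')}/{os.path.basename(url.split('?')[0])}"
        print(index, url, "->", f"s3://{DEST_S3_BUCKET}/{key}")
```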
You must activate the virtual environment every time you open a new terminal to work on this project.
```bash
source venv/bin/activate
```
Run the script:
```bash
python s3_image_importer.py
```
The script will start, report how many already-processed images it is skipping, and show a progress bar for the remaining images.
When you are finished, you can leave the virtual environment by typing:
```bash
deactivate
```
- `error: externally-managed-environment`: You forgot to activate the virtual environment. Stop the command and run `source venv/bin/activate` first.
- `ModuleNotFoundError`: You either forgot to activate the virtual environment or you haven't installed the dependencies yet. Run `source venv/bin/activate` and then `pip install -r requirements.txt`.
- `FileNotFoundError`: The script cannot find your CSV file. Double-check that `CSV_FILE_PATH` in the script configuration exactly matches your filename.
- `NoCredentialsError`: Boto3 cannot find your AWS credentials. Ensure your `.env` file is present, correctly named, and filled out.
- `requests.exceptions.RequestException`: A network error occurred while downloading an image (e.g., timeout, DNS issue). The specific URL will be logged in `failed_urls.log`.
- `HTTPError: 404 Not Found`: A URL in your CSV points to a location where there is no image.
- `HTTPError: 403 Forbidden`: You do not have permission to access the image at a specific URL. The server hosting the image may be blocking automated requests.
- `ClientError: AccessDenied`: This error comes from AWS S3. It means your credentials do not grant `s3:PutObject` permission on the `DEST_S3_BUCKET`.
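If you want to confirm that `AccessDenied` is an IAM problem rather than a script problem, a quick standalone check is to attempt a tiny upload with boto3 directly. This snippet is not part of the script, the bucket and key below are placeholders, and the cleanup call additionally needs `s3:DeleteObject`:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # Write and immediately delete a 0-byte object to confirm s3:PutObject works.
    s3.put_object(Bucket="your-s3-bucket-name", Key="your/s3/folder/_permission_check", Body=b"")
    s3.delete_object(Bucket="your-s3-bucket-name", Key="your/s3/folder/_permission_check")
    print("PutObject permission looks fine.")
except ClientError as e:
    print("S3 rejected the request:", e.response["Error"]["Code"])
```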