Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 126 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -399,7 +399,105 @@ The server is currently running on an EC2 instance through Docker. The ports act

**Note:** While we provide a Dockerfile, local deployment of the site using the docker will only properly work on AWS. Local deployment should be done with a local install using the steps above.

## 0. Setting up a new server on AWS EC2

This section covers provisioning a fresh EC2 instance. The app container is built on the host and is not published to any registry — it must be built locally. Source code is mounted into the running container at runtime, so routine code updates do not require a container rebuild; only changes to `requirements.txt` or the `Dockerfile` do.

### i. Provision the EC2 instance

- **AMI**: Ubuntu Server 22.04 LTS
- **Instance type**: choose based on expected load (t3.medium or larger recommended)
- **Security group**: open inbound ports for SSH (22), HTTP (80), HTTPS (443), and the app port defined by `AMPLICON_ENV_PORT` in `config.sh`; Neo4j ports 7474 and 7687 should be restricted to internal access only
- After the instance is running, register it with the appropriate AWS Application Load Balancer target group:
- Production: `ampliconrepo-https`
- Dev: `dev-ampliconrepository-org`

### ii. Install Docker

SSH into the new instance and install Docker Engine:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" \
| sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker ubuntu
# Log out and back in for the group membership to take effect
```

### iii. Create the two working checkouts

Two separate git checkouts are maintained to allow container rebuilds without interrupting a running server:

- **Running server**: `/home/ubuntu/AmpliconRepository-<ENV>` — the live checkout whose source is mounted directly into the running container (`dev` or `prod`)
- **Build directory**: `/home/ubuntu/ampRepo_for_docker_build/AmpliconRepository` — used only for `docker build`; changes here do not affect the running server

```bash
# Running server checkout (use AmpliconRepository-dev for the dev environment)
git clone https://github.com/AmpliconSuite/AmpliconRepository.git /home/ubuntu/AmpliconRepository-prod

# Build-only checkout
mkdir -p /home/ubuntu/ampRepo_for_docker_build
git clone https://github.com/AmpliconSuite/AmpliconRepository.git /home/ubuntu/ampRepo_for_docker_build/AmpliconRepository
```

### iv. Create required directories and place config.sh

```bash
cd /home/ubuntu/AmpliconRepository-prod # or AmpliconRepository-dev
mkdir -p logs tmp neo4j/conf neo4j/data
```

Place `config.sh` at `caper/config.sh` inside the running-server checkout. The checkout root is mounted as `/srv/` inside the container, so the file will be read from `/srv/caper/config.sh` at runtime. `config.sh` contains all environment-specific secrets and settings — database connection strings, OAuth keys, S3 bucket names, and so on. Obtain it from another team member. **Do not commit it to git.**

MongoDB connectivity uses the existing DocumentDB cluster; the connection string is provided via the `DB_URI_SECRET` variable in `config.sh`.

### v. Configure AWS credentials

The container mounts `/home/ubuntu/.aws` so that the application can reach S3. Use one of:

- **IAM instance role** (preferred for EC2): attach a role with the required S3 permissions to the instance — no credential file is needed
- **Access keys**: place credentials in `/home/ubuntu/.aws/credentials` in the standard AWS shared credentials format

### vi. Build the Docker image

Build from the build-only checkout so the running server is not affected. The image packages the Python virtual environment and system dependencies; source code is supplied at runtime via volume mount and does not need to be re-baked on every code change.

```bash
cd /home/ubuntu/ampRepo_for_docker_build/AmpliconRepository/caper
source caper/config.sh # sets AMPLICON_ENV and other variables
docker build -t genepattern/amplicon-repo:${AMPLICON_ENV} .
```

### vii. Start Neo4j

Neo4j runs as a Docker container. Run the start script from the running-server checkout root (where `caper/config.sh` is visible at the relative path the script expects):

```bash
cd /home/ubuntu/AmpliconRepository-prod # or AmpliconRepository-dev
./start-neo4j-container.sh
```

Neo4j data and configuration are persisted in `neo4j/data/` and `neo4j/conf/` under the checkout root. To stop Neo4j:

```bash
./stop-neo4j-container.sh
```

### viii. Start the application server

```bash
cd /home/ubuntu/AmpliconRepository-prod # or AmpliconRepository-dev
./start-server.sh
```

This launches a gunicorn container named `amplicon-${AMPLICON_ENV}` with the source checkout mounted at `/srv/`. Confirm it is running with `docker ps`. Logs are written to the `logs/` directory in the checkout root.

## 1. How to start the server
- SSH into the EC2 instance (called `ampliconrepo-ubuntu-20.04`)
Expand Down Expand Up @@ -440,9 +538,33 @@ The server is currently running on an EC2 instance through Docker. The ports act
> `git fetch`
> `git pull`
> `git checkout tags/<release tag in github>`
- if any static (css, js, images) have changed, run sync_static_to_s3.sh to copy the files into the S3 bucket
- Restart the server
> `./stop-server.sh`
> `./start-server.sh`
- Restart the server - use the same script as the nightly cron to stop and restart the docker container. During server startup it automatically syncs static files to S3
> `/home/ubuntu/stop-and-start-repo.sh`

## 4. Manual disk space cleanup

The server writes temporary working directories under `./tmp/` (relative to the project root) during project creation and editing. These are cleaned up automatically by the server, but if disk space runs low you can reclaim it manually. **Always check that no project creation or editing is in progress before deleting anything.** First, SSH into the EC2 instance and check for active background tasks:

```bash
cd /home/ubuntu/AmpliconRepository-prod/caper
python manage.py shell -c \
"from caper.background_tasks import get_background_task_status; import json; print(json.dumps(get_background_task_status(), indent=2))"
```

If `active_count` is 0, it is safe to remove all temp directories. To see how much space they are using and then remove them:

```bash
du -sh ./tmp/*/ # check sizes
rm -rf ./tmp/*/ # remove all temp dirs
```

If you want to be cautious and leave any directories that may belong to a recently-started task (for example, if you are unsure whether `active_count` reflects all workers), limit deletion to directories that have not been written to in at least two hours:

```bash
find ./tmp -mindepth 1 -maxdepth 1 -type d -mmin +120 -exec rm -rf {} +
```

Note that restarting the server (`/home/ubuntu/stop-and-start-repo.sh`) also triggers an automatic cleanup of any orphaned temp directories left by previous runs.



11 changes: 11 additions & 0 deletions caper/caper/apps.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,17 @@ def ready(self):
except Exception as e:
logger.warning(f"Could not determine AmpliconSuiteAggregator version: {e}")

# Remove orphaned temp dirs left by any previous run that crashed or
# was killed before it could clean up after itself.
try:
from .background_tasks import cleanup_stale_temp_dirs
# Use a short min-age (5 min) at startup: no tasks are running yet,
# so anything older than a few minutes is safely orphaned.
removed = cleanup_stale_temp_dirs(min_age_seconds=300)
logger.info(f"Startup temp dir cleanup: removed {removed} orphaned dir(s)")
except Exception as e:
logger.warning(f"Startup temp dir cleanup failed: {e}")

# Start the S3 sync in a background thread (only when S3 static files are enabled)
if os.environ.get('S3_STATIC_FILES') == 'TRUE':
sync_thread = threading.Thread(target=self.sync_static_to_s3, daemon=True)
Expand Down
156 changes: 144 additions & 12 deletions caper/caper/background_tasks.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@
import uuid
import datetime
import os
import shutil
import time
from concurrent.futures import ThreadPoolExecutor


Expand All @@ -32,6 +34,87 @@ def _get_tasks_collection():
# How long before a 'running' task is considered stale/dead (seconds).
_STALE_THRESHOLD_SECONDS = 20 * 60 # 20 minutes

# Root directory where project temp dirs are created.
_DEFAULT_TMP_ROOT = os.getenv('CAPER_TMP_ROOT', './tmp')

# Periodic cleanup interval: every 6 hours.
_CLEANUP_INTERVAL_SECONDS = 6 * 60 * 60

# Minimum dir age before the cleanup daemon will touch it.
# 3× the stale threshold (60 min) gives active tasks a generous safety buffer.
_CLEANUP_MIN_AGE_SECONDS = _STALE_THRESHOLD_SECONDS * 3


def _remove_temp_dir(path: str) -> None:
"""Remove a temp directory if it exists, logging the outcome."""
if not path:
return
try:
if os.path.exists(path):
shutil.rmtree(path, ignore_errors=True)
logging.info(f"Removed temp directory: {path}")
except Exception:
logging.exception(f"Failed to remove temp directory: {path}")


def cleanup_stale_temp_dirs(
tmp_root: str = _DEFAULT_TMP_ROOT,
min_age_seconds: int = _CLEANUP_MIN_AGE_SECONDS,
) -> int:
"""
Remove directories in tmp_root that have no corresponding running task in
MongoDB and whose mtime is older than min_age_seconds.

The mtime-based age check provides a secondary safety net: a directory
that an active aggregation is still writing to will have a recent mtime
and will not be removed even if its MongoDB record is temporarily invisible.

Returns the number of directories removed.
"""
tmp_root = os.path.abspath(tmp_root)
if not os.path.isdir(tmp_root):
return 0

# Collect the absolute temp_dir paths of all currently-running tasks.
active_temp_dirs: set[str] = set()
try:
col = _get_tasks_collection()
for doc in col.find({'state': 'running', 'temp_dir': {'$exists': True, '$ne': None}}):
td = doc.get('temp_dir')
if td:
active_temp_dirs.add(os.path.abspath(td))
except Exception:
logging.exception("cleanup_stale_temp_dirs: failed to query MongoDB — skipping cleanup")
return 0

cutoff = time.time() - min_age_seconds
removed = 0

try:
with os.scandir(tmp_root) as entries:
for entry in entries:
if not entry.is_dir(follow_symlinks=False):
continue
try:
if entry.stat(follow_symlinks=False).st_mtime > cutoff:
continue # Too recent — still potentially active
abs_path = os.path.abspath(entry.path)
if abs_path in active_temp_dirs:
continue # Claimed by a running task
shutil.rmtree(abs_path, ignore_errors=True)
logging.info(f"cleanup_stale_temp_dirs: removed orphaned temp dir {abs_path}")
removed += 1
except Exception:
logging.exception(f"cleanup_stale_temp_dirs: error processing {entry.path}")
except OSError:
logging.exception(f"cleanup_stale_temp_dirs: cannot scan {tmp_root}")

if removed:
logging.info(f"cleanup_stale_temp_dirs: removed {removed} orphaned dir(s) from {tmp_root}")
else:
logging.debug(f"cleanup_stale_temp_dirs: no orphaned dirs found in {tmp_root}")
return removed


class BackgroundTaskTracker:
"""
Expand All @@ -40,13 +123,22 @@ class BackgroundTaskTracker:
the same status.
"""

def __init__(self, max_workers: int = 4, thread_name_prefix: str = 'caper_worker'):
def __init__(
self,
max_workers: int = 4,
thread_name_prefix: str = 'caper_worker',
cleanup_interval_seconds: int = _CLEANUP_INTERVAL_SECONDS,
tmp_root: str = _DEFAULT_TMP_ROOT,
):
self._executor = ThreadPoolExecutor(
max_workers=max_workers,
thread_name_prefix=thread_name_prefix,
)
self._max_workers = max_workers
self._col = None # lazy
self._cleanup_interval = cleanup_interval_seconds
self._tmp_root = tmp_root
self._start_cleanup_daemon()

def _collection(self):
if self._col is None:
Expand All @@ -62,36 +154,56 @@ def _collection(self):
# Public interface
# ------------------------------------------------------------------

def submit(self, fn, *args, task_label: str = None, **kwargs):
"""Submit *fn* to the thread pool and record it in MongoDB."""
def submit(self, fn, *args, task_label: str = None, temp_dir: str = None, **kwargs):
"""Submit *fn* to the thread pool and record it in MongoDB.

temp_dir: if provided, its absolute path is stored in the task record
so the periodic cleanup daemon can skip directories that belong to
running tasks. The directory is also removed in a finally block after
fn returns, catching the success path which the existing code does not
clean up.
"""
if task_label is None:
task_label = getattr(fn, '__name__', str(fn))

task_id = uuid.uuid4().hex
now = datetime.datetime.utcnow()

doc = {
'_id': task_id,
'label': task_label,
'state': 'running',
'started_at': now.isoformat(timespec='seconds'),
'updated_at': now,
'worker_pid': os.getpid(),
}
if temp_dir is not None:
doc['temp_dir'] = os.path.abspath(temp_dir)

# Write to MongoDB first so it's visible immediately
col = self._collection()
if col is not None:
try:
col.insert_one({
'_id': task_id,
'label': task_label,
'state': 'running',
'started_at': now.isoformat(timespec='seconds'),
'updated_at': now,
'worker_pid': os.getpid(),
})
col.insert_one(doc)
except Exception:
logging.exception("Failed to insert background task record")

abs_temp_dir = os.path.abspath(temp_dir) if temp_dir is not None else None

def _wrapped():
try:
fn(*args, **kwargs)
self._mark_task(task_id, 'completed')
except Exception as exc:
except Exception:
self._mark_task(task_id, 'failed')
raise
finally:
# Belt-and-suspenders: the success path in
# _process_and_aggregate_files does not call rmtree, so this
# finally block is the primary cleanup for the happy path.
# Error paths that already called rmtree are harmless repeats.
if abs_temp_dir is not None:
_remove_temp_dir(abs_temp_dir)

future = self._executor.submit(_wrapped)
return future
Expand Down Expand Up @@ -163,6 +275,26 @@ def _mark_task(self, task_id: str, state: str):
except Exception:
logging.exception(f"Failed to mark background task {task_id} as {state}")

def _start_cleanup_daemon(self):
"""Start a daemon thread that sweeps tmp_root every cleanup_interval seconds."""
interval = self._cleanup_interval
tmp_root = self._tmp_root

def _loop():
while True:
time.sleep(interval)
try:
cleanup_stale_temp_dirs(tmp_root)
except Exception:
logging.exception("Periodic temp dir cleanup failed")

t = threading.Thread(target=_loop, daemon=True, name='caper_tmp_cleanup')
t.start()
logging.info(
f"Started temp dir cleanup daemon "
f"(interval={interval}s, root={os.path.abspath(tmp_root)})"
)


# ---------------------------------------------------------------------------
# Module-level singleton – import this in views.py and views_apis.py
Expand Down