diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 00000000..4457289b --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,14 @@ + +### Goal +Explain the purpose of this PR. + +### Changes +Brief list of what changed. + +### Testing +How you tested (commands, screenshots, etc). + +### Checklist +- [ ] Clear, descriptive PR title +- [ ] Docs/README updated if needed +- [ ] No secrets or large temp files committed diff --git a/labsubmission6.md b/labsubmission6.md new file mode 100644 index 00000000..0785290b --- /dev/null +++ b/labsubmission6.md @@ -0,0 +1,363 @@ +# Lab 6 — Submission: Container Fundamentals with Docker + +## Task 1 — Container Lifecycle & Image Management (3 pts) + +### 1.1 Basic Container Operations + +#### Output of `docker ps -a` + +``` +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +07719cb8d228 bkimminich/juice-shop "/nodejs/bin/node /j…" 3 weeks ago Exited (255) 57 seconds ago 0.0.0.0:3000->3000/tcp juice-shop +b8933707adc8 ubuntu:latest "sleep infinity" 7 weeks ago Exited (137) 6 weeks ago ubuntu_container +fd7b35c74ee8 pmldl-assignment1-deployment-app "streamlit run strea…" 2 months ago Exited (0) 2 months ago web-app +bf035f3410f4 pmldl-assignment1-deployment-api "uvicorn main:app --…" 2 months ago Exited (137) 2 months ago model-api +``` + +#### Output of `docker images ubuntu` + +``` +REPOSITORY TAG IMAGE ID CREATED SIZE +ubuntu latest c35e29c94501 6 weeks ago 117MB +``` + +#### Image Size and Layer Count + +- **Image Size:** 117 MB +- **Layer Count:** 1 layer (as shown in docker images output) + +### 1.2 Image Export and Dependency Analysis + +#### Tar File Size + +``` +02.12.2025 14:34 29 742 592 ubuntuimage.tar +``` + +**Comparison:** The tar file size (29.7 MB) is smaller than the reported image size (117 MB) because the tar archive stores the compressed/layered representation of the image, which is more efficient than the expanded filesystem representation. Docker images are stored in layers and the tar format maintains this compression, whereas the displayed "SIZE" in `docker images` represents the uncompressed, expanded size of all layers combined. + +#### Error Message from First Removal Attempt + +``` +Error response from daemon: conflict: unable to delete ubuntu:latest (must be forced) - container d118332f3d15 is using its referenced image c35e29c94501 +``` + +### Analysis + +**Why does image removal fail when a container exists?** + +Docker prevents image deletion when a container exists that depends on it because containers use the image's filesystem layers as their base. When you create a container from an image, the container holds a reference to that image's layers. If Docker allowed you to delete the image while a container depends on it, the container would become corrupted and unable to function properly. The container needs access to the image's layers for its operation, even if the container is stopped. This is a safety mechanism to ensure data integrity and prevent orphaned containers. + +**What is included in the exported tar file?** + +The exported tar file includes all the image layers, metadata, and configuration files needed to reconstruct the image. 
Specifically, it contains: +- All filesystem layers (each layer is a diff that stacks on top of previous layers) +- Image configuration files (JSON manifests describing the image structure) +- Repository information and tags +- Checksum information for layer verification + +This makes the tar file a complete, portable representation of the Docker image that can be transferred to another system and imported with `docker load`. + +--- + +## Task 2 — Custom Image Creation & Analysis (3 pts) + +### 2.1 Deploy and Customize Nginx + +#### Original Nginx Welcome Page + +``` + + + +Welcome to nginx! + + + +

<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
```

#### Custom HTML Content

```html
The best

website

```

#### Verification via curl (After Copying Custom Content)

```
The best

website

+``` + +The custom HTML successfully replaced the default nginx welcome page, confirming that the file was correctly copied to the container's web root. + +### 2.2 Create and Test Custom Image + +#### Output of `docker images mywebsite` + +``` +REPOSITORY TAG IMAGE ID CREATED SIZE +mywebsite latest 9134774b4543 8 seconds ago 225MB +``` + +#### Output of `docker diff mywebsitecontainer` + +``` +C /etc +C /etc/nginx +C /etc/nginx/conf.d +C /etc/nginx/conf.d/default.conf +C /run +C /run/nginx.pid +``` + +### Analysis + +**Explain the diff output:** + +- **A (Added):** No files were added in this container diff output. +- **C (Changed):** The changed files are: + - `/etc` - configuration directory modified + - `/etc/nginx` - nginx configuration directory changed + - `/etc/nginx/conf.d` - nginx config sub-directory changed + - `/etc/nginx/conf.d/default.conf` - nginx default configuration file was modified + - `/run` - runtime directory changed + - `/run/nginx.pid` - nginx process ID file was created/modified + +These changes represent the runtime modifications made by nginx when it starts and serves the custom HTML file, as well as the process management files it creates. + +- **D (Deleted):** No files were deleted. + +### Reflection + +**Advantages and disadvantages of `docker commit` vs Dockerfile:** + +| Aspect | `docker commit` | Dockerfile | +|--------|-----------------|------------| +| **Advantages** | Quick and easy for testing; immediate capture of container state; good for prototyping | Reproducible builds; version control friendly; clear documentation of changes; can be shared easily; builds are consistent and cacheable | +| **Disadvantages** | Not reproducible; creates large images; difficult to track changes; no clear audit trail of what changed; poor for production | Requires writing and maintaining the Dockerfile; slower for quick prototyping; steeper learning curve | + +**When to use each approach:** + +Use `docker commit` when you need to quickly test configuration changes or create a one-off image for development. However, for production deployments, version-controlled code, and team collaboration, Dockerfile is always the better choice because it provides reproducibility, auditability, and maintainability. Dockerfiles are the industry standard practice because they make images transparent and allow other developers to understand and modify them. + +--- + +## Task 3 — Container Networking & Service Discovery (2 pts) + +### 3.1 Create Custom Network + +#### Output of `docker network ls` + +``` +NETWORK ID NAME DRIVER SCOPE +7f5d1a104405 bridge bridge local +85a14be6e1dd host host local +f206d027d59e labnetwork bridge local +d45782c91fb9 none null local +70c2b786ae4f pmldl-assignment1-deployment_default bridge local +``` + +### 3.2 Test Connectivity and DNS + +#### Ping Command Output + +``` +PING container2 (172.19.0.3): 56 data bytes +64 bytes from 172.19.0.3: seq=0 ttl=64 time=0.195 ms +64 bytes from 172.19.0.3: seq=1 ttl=64 time=0.127 ms +64 bytes from 172.19.0.3: seq=2 ttl=64 time=0.089 ms + +--- container2 ping statistics --- +3 packets transmitted, 3 packets received, 0% packet loss +round-trip min/avg/max = 0.089/0.137/0.195 ms +``` + +The ping succeeded with 0% packet loss, demonstrating successful container-to-container communication. 
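
For reproducibility, here is a minimal sketch of the commands behind the outputs above; the network and container names match the `docker network ls` and ping outputs, while the `alpine` image and the `sleep infinity` command are assumptions chosen for illustration:

```bash
# Create a user-defined bridge network (name matches the labnetwork entry above)
docker network create labnetwork

# Start two long-running containers attached to that network
# (alpine + "sleep infinity" are assumptions; any long-lived image works)
docker run -d --name container1 --network labnetwork alpine sleep infinity
docker run -d --name container2 --network labnetwork alpine sleep infinity

# Exercise name-based connectivity and the embedded DNS server (127.0.0.11)
docker exec container1 ping -c 3 container2
docker exec container1 nslookup container2

# Inspect the network to see both endpoints and their IPAM assignments
docker network inspect labnetwork
```
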
+ +#### Network Inspection Output + +```json +[ + { + "Name": "labnetwork", + "Id": "f206d027d59e770acb45e415e78feb11caf12f9041928270b89eec0969fba0bf", + "Created": "2025-12-02T11:39:46.393459392Z", + "Scope": "local", + "Driver": "bridge", + "EnableIPv4": true, + "EnableIPv6": false, + "IPAM": { + "Driver": "default", + "Options": {}, + "Config": [ + { + "Subnet": "172.19.0.0/16", + "Gateway": "172.19.0.1" + } + ] + }, + "Internal": false, + "Attachable": false, + "Ingress": false, + "ConfigFrom": { + "Network": "" + }, + "ConfigOnly": false, + "Containers": { + "7a2b11b2853552ba91d85793f76a909408ac1d5a09093017e2fef796926d1eda": { + "Name": "container2", + "EndpointID": "613168235c738a6908d4ad408fba64f56cd860d52c47f22c22568bb2e114cd29", + "MacAddress": "c2:4a:52:7d:f6:c1", + "IPv4Address": "172.19.0.3/16", + "IPv6Address": "" + }, + "fcdaf9e835bfa273fad8f68d995acf62ff72ab9c2ce8fedee785fe0a69bb7a2e": { + "Name": "container1", + "EndpointID": "794ae6dd5e038a68b6b426364630e83538b05ac55dec5be1762920d37be92e23", + "MacAddress": "e6:94:e7:c6:cf:1b", + "IPv4Address": "172.19.0.2/16", + "IPv6Address": "" + } + }, + "Options": { + "com.docker.network.enable_ipv4": "true", + "com.docker.network.enable_ipv6": "false" + }, + "Labels": {} + } +] +``` + +#### DNS Resolution Output + +``` +Server: 127.0.0.11 +Address: 127.0.0.11:53 + +Non-authoritative answer: + +Non-authoritative answer: +Name: container2 +Address: 172.19.0.3 +``` + +The DNS resolution successfully resolved the container name "container2" to its IP address 172.19.0.3. + +### Analysis + +**How does Docker's internal DNS enable container-to-container communication by name?** + +Docker includes an embedded DNS server that runs at 127.0.0.11:53 inside each container. When containers are connected to a user-defined bridge network like `labnetwork`, Docker automatically registers each container's name with this embedded DNS server. When one container tries to communicate with another by name (e.g., `ping container2`), the container's DNS resolver queries the embedded DNS server, which looks up the container name and returns its IP address on the shared network. This enables containers to discover and communicate with each other using memorable names instead of hardcoded IP addresses. The DNS resolution is dynamic, meaning when containers are added or removed, the DNS registry is automatically updated. + +**Advantages of user-defined bridge networks over the default bridge network:** + +User-defined bridge networks provide several key advantages: +1. **Automatic DNS resolution** - Container names are automatically resolvable within the network, unlike the default bridge where you must use the `--link` option or hardcode IP addresses. +2. **Better isolation** - Containers on different user-defined networks cannot communicate by default, providing network segmentation and improved security. +3. **Flexible attachment and detachment** - You can connect or disconnect containers from a network without stopping them. +4. **Custom network configuration** - You can set custom IPAM (IP Address Management) configurations. +5. **Better for multi-container applications** - Supports the common pattern where multiple containers form a coordinated application stack with clear network boundaries. + +--- + +## Task 4 — Data Persistence with Volumes (2 pts) + +### 4.1 Create and Use Volume + +#### Output of `docker volume ls` + +``` +DRIVER VOLUME NAME +local appdata +``` + +#### Custom HTML Content + +```html +

Persistent Data

+``` + +#### Output of curl (Before Container Destruction) + +``` +

Persistent Data

+``` + +The custom content was successfully served by the container. + +### 4.2 Verify Persistence + +#### Output of curl (After Container Recreation) + +``` +

Persistent Data

+``` + +The data persisted even after the container was stopped, removed, and recreated with the same volume attachment. This demonstrates that the volume data survives the container lifecycle. + +#### Volume Inspection Output + +```json +[ + { + "CreatedAt": "2025-12-02T11:43:16Z", + "Driver": "local", + "Labels": null, + "Mountpoint": "/var/lib/docker/volumes/appdata/_data", + "Name": "appdata", + "Options": null, + "Scope": "local" + } +] +``` + +### Analysis + +**Why is data persistence important in containerized applications?** + +Data persistence is critical in containerized applications because containers are ephemeral by nature—they can be stopped, removed, and recreated at any time. Without persistent storage, all data created or modified within a container would be lost when the container is destroyed. This is essential for: +1. **Stateful applications** - Databases, message queues, and caching systems must retain their data across restarts. +2. **Configuration management** - Application settings and configurations need to survive container recreation. +3. **Business continuity** - Important data must not be lost during container updates, scaling, or failures. +4. **Multi-container coordination** - Multiple containers often need to share data via persistent volumes. + +Volumes provide a mechanism to separate data lifecycle from container lifecycle, allowing applications to be scaled, updated, and recovered without data loss. + +**Differences between volumes, bind mounts, and container storage:** + +| Storage Type | Description | Use Case | +|--------------|-------------|----------| +| **Volumes** | Named, managed storage objects stored in Docker's storage directory (usually `/var/lib/docker/volumes`). Managed entirely by Docker with full lifecycle management. | Production applications, databases, shared data between containers. Volumes are the recommended approach for most use cases because Docker manages them completely. | +| **Bind Mounts** | Direct mapping of a host filesystem path to a container path. Depends on the host directory structure. Host filesystem is directly accessible to the container. | Development workflows (mounting source code), sharing files between host and container, configuration files from the host. Useful during development but less portable across different machines. | +| **Container Storage** | Ephemeral storage in the container's writable layer. Data is stored in the container and lost when the container is removed. Default storage mechanism for all containers. | Temporary files, logs, cache data, or any data that doesn't need to persist beyond the container lifecycle. | + +**Trade-offs and best practices:** + +- Use **volumes** for production data that must persist and for applications requiring high performance storage. +- Use **bind mounts** during development when you need live access to source code on your host machine. +- Use **container storage** only for temporary data that will be discarded with the container. +- Never rely on container storage for important data. +- Consider using named volumes for production to allow easy backup and migration. 
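
To make the storage trade-offs above concrete, here is a minimal sketch of the Task 4 volume workflow next to a bind-mount equivalent; the `nginx` image, container names, and host ports are assumptions chosen for illustration:

```bash
# Named volume: Docker owns the storage under /var/lib/docker/volumes/appdata/_data
docker volume create appdata
docker run -d --name web1 -p 8080:80 -v appdata:/usr/share/nginx/html nginx

# Write content through the container, then destroy and recreate it
docker exec web1 sh -c 'echo "Persistent Data" > /usr/share/nginx/html/index.html'
docker rm -f web1
docker run -d --name web2 -p 8080:80 -v appdata:/usr/share/nginx/html nginx
curl http://localhost:8080/   # still serves "Persistent Data" from the volume

# Bind mount equivalent: the host directory is managed by you, not Docker
mkdir -p ./site && echo "Persistent Data" > ./site/index.html
docker run -d --name web3 -p 8081:80 -v "$(pwd)/site:/usr/share/nginx/html" nginx
```
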
\ No newline at end of file diff --git a/labsubmission7.md b/labsubmission7.md new file mode 100644 index 00000000..d05fb439 --- /dev/null +++ b/labsubmission7.md @@ -0,0 +1,388 @@ +# Lab 7 — GitOps Fundamentals: Solution Submission + +--- + +## Task 1 — Git State Reconciliation (6 pts) + +### 1.1: Setup Desired State Configuration + +**Initial desired-config.txt:** +``` +version: 1.0 +app: myapp +replicas: 3 +``` + +**Initial current-config.txt (synchronized copy):** +``` +version: 1.0 +app: myapp +replicas: 3 +``` + +**Verification command output:** +```bash +$ echo "=== Desired Config ===" +cat desired-config.txt + +=== Desired Config === +version: 1.0 +app: myapp +replicas: 3 + +$ echo "" +echo "=== Current Config ===" +cat current-config.txt + +=== Current Config === +version: 1.0 +app: myapp +replicas: 3 + +$ echo "" +echo "=== Diff (should be empty) ===" +diff desired-config.txt current-config.txt + +=== Diff (should be empty) === + +``` + +**Analysis:** Both files are identical and synchronized. The diff produces no output, confirming initial state synchronization. + +--- + +### 1.2: Reconciliation Loop Script + +**auto-sync.sh script:** +```bash +#!/bin/bash +# auto-sync.sh - GitOps reconciliation loop + +DESIRED=$(cat desired-config.txt) +CURRENT=$(cat current-config.txt) + +if [ "$DESIRED" != "$CURRENT" ]; then + echo "$(date) - ⚠️ DRIFT DETECTED!" + echo "Reconciling current state with desired state..." + cp desired-config.txt current-config.txt + echo "$(date) - ✅ Reconciliation complete" +else + echo "$(date) - ✅ States synchronized" +fi +``` + +**Made executable:** +```bash +$ chmod +x auto-sync.sh + +$ ls -l auto-sync.sh +-rwxr-xr-x 1 nones 197609 415 Dec 2 17:12 auto-sync.sh* +``` + +**Script created successfully with execute permissions.** ✅ + +--- + +### 1.3: Manual Drift Detection + +**Test 1: Synchronized state** +```bash +$ echo "=== TEST 1: Auto-sync with synchronized state ===" +./auto-sync.sh +=== TEST 1: Auto-sync with synchronized state === +Tue Dec 2 17:12:37 RTZ 2025 - ✅ States synchronized +``` + +**Test 2: Introduce drift** +```bash +$ echo "" +echo "=== TEST 2: Introduce drift and auto-sync ===" + +$ echo "version: 2.0" > current-config.txt +echo "app: myapp" >> current-config.txt +echo "replicas: 5" >> current-config.txt + +echo "Current config (drifted):" +cat current-config.txt + +Current config (drifted): +version: 2.0 +app: myapp +replicas: 5 +``` + +**Test 3: Run reconciliation and verify fix** +```bash +$ echo "" +echo "Running auto-sync to fix drift..." +./auto-sync.sh + +Running auto-sync to fix drift... +Tue Dec 2 17:12:59 RTZ 2025 - ⚠️ DRIFT DETECTED! +Reconciling current state with desired state... +Tue Dec 2 17:12:59 RTZ 2025 - ✅ Reconciliation complete + +$ echo "" +echo "Current config after reconciliation:" +cat current-config.txt + +Current config after reconciliation: +version: 1.0 +app: myapp +replicas: 3 + +$ echo "" +echo "Verifying states match:" +diff desired-config.txt current-config.txt + +Verifying states match: + +``` + +**Key observation:** +- Drift was detected (version changed from 1.0 to 2.0, replicas changed from 3 to 5) +- Auto-sync immediately triggered reconciliation +- Drift was automatically fixed by copying desired state to current state +- Diff confirms states are now synchronized ✅ + +--- + +### 1.4: Automated Continuous Reconciliation + +**Continuous monitoring loop output:** +```bash +$ ./monitor-loop.sh +Starting continuous monitoring... 
(Ctrl+C to stop) +This simulates GitOps operators continuously syncing state + +======================================== +Check #1 - Tue Dec 2 17:14:30 RTZ 2025 +======================================== +Tue Dec 2 17:14:30 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #2 - Tue Dec 2 17:14:35 RTZ 2025 +======================================== +Tue Dec 2 17:14:35 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #3 - Tue Dec 2 17:14:40 RTZ 2025 +======================================== +Tue Dec 2 17:14:40 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #4 - Tue Dec 2 17:14:45 RTZ 2025 +======================================== +Tue Dec 2 17:14:45 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #5 - Tue Dec 2 17:14:50 RTZ 2025 +======================================== +Tue Dec 2 17:14:51 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #6 - Tue Dec 2 17:14:56 RTZ 2025 +======================================== +Tue Dec 2 17:14:56 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #7 - Tue Dec 2 17:15:01 RTZ 2025 +======================================== +Tue Dec 2 17:15:01 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) +======================================== +Check #8 - Tue Dec 2 17:15:06 RTZ 2025 +======================================== +Tue Dec 2 17:15:06 RTZ 2025 - ✅ States synchronized + +Next check in 5 seconds... (Ctrl+C to stop) + +``` + +**Continuous Monitoring - Drift Introduction (second terminal):** +```bash +$ echo "replicas: 10" >> current-config.txt +``` + +**Observation:** While monitor-loop.sh was running in the first window, drift was introduced in the second window. The continuous monitoring loop continuously checks every 5 seconds, just like GitOps operators work in production. ✅ + +--- + +### 1.5: Analysis - Git State Reconciliation + +**Q: Explain how the GitOps reconciliation loop works and how it prevents configuration drift.** + +The GitOps reconciliation loop operates on a fundamental principle: continuously compare the desired state (stored in Git/source of truth) with the current state (actual cluster configuration), and automatically synchronize them when differences are detected. + +In this simulation: +- `desired-config.txt` represents the source of truth in Git +- `current-config.txt` represents the actual system state +- `auto-sync.sh` acts as a continuous operator that periodically checks both states +- When drift is detected (desired ≠ current), the script automatically copies desired state to current state, effectively self-healing + +**Key benefits observed in the lab:** +1. **Prevents configuration drift:** Any unauthorized or accidental changes to `current-config.txt` are automatically corrected (we saw this when replicas changed from 3 to 5, and auto-sync fixed it) +2. **Reduces manual intervention:** No need for manual fixes; the system self-heals continuously (reconciliation happened automatically without any manual intervention) +3. 
**Git as single source of truth:** Configuration changes must go through Git/version control, providing audit trail and rollback capabilities (in production, all changes would be tracked in Git commits) +4. **Deterministic state:** The system guarantees that actual state matches the intended state defined in Git (every reconciliation cycle verified this) + +**Real-world parallel:** ArgoCD and Flux CD use this exact pattern with Kubernetes clusters, continuously comparing desired manifests in Git with actual cluster resources. The only difference is scale and complexity. + +--- + +**Q: What advantages does declarative configuration have over imperative commands in production?** + +**Declarative approach (this lab):** +- Define desired end state: "replicas should be 3" +- System automatically achieves and maintains this state +- Changes tracked through Git commits with audit trail +- Reproducible and idempotent (can be applied repeatedly safely) +- Easy to understand "what" without needing to know "how" + +**Advantages over imperative approach:** + +| Aspect | Imperative | Declarative | +|--------|-----------|------------| +| **Tracking Changes** | Manual scripts, hard to audit | Git commits, full history and blame | +| **Rollback** | Manual process, error-prone | `git revert` or `git checkout` | +| **Reproducibility** | Depends on execution order | Idempotent, always produces same result | +| **Disaster Recovery** | Must re-run scripts manually | Redeploy from Git, guaranteed consistency | +| **Collaboration** | Hard to review what changed | Pull requests, code reviews, clear diffs | +| **Scaling** | Difficult to manage across teams | Single source of truth for entire team | +| **Drift Management** | Manual checks and fixes | Automatic continuous reconciliation | + +In production environments with multiple teams, deployments, and changes, the declarative approach prevents many categories of failures and makes the system more predictable and maintainable. This lab demonstrated exactly why: we could trigger drift (imperative change), but the system automatically corrected itself (declarative self-healing). + +--- + +## Task 2 — GitOps Health Monitoring (4 pts) + +### 2.1: Health Check Script + +**health-check.sh script:** +```bash +#!/bin/bash +# health-check.sh - Monitor GitOps sync health using MD5 checksums + +DESIRED_MD5=$(md5sum desired-config.txt | awk '{print $1}') +CURRENT_MD5=$(md5sum current-config.txt | awk '{print $1}') + +if [ "$DESIRED_MD5" != "$CURRENT_MD5" ]; then + echo "$(date) - ❌ CRITICAL: State mismatch detected!" | tee -a health-status.log + echo " Desired MD5: $DESIRED_MD5" | tee -a health-status.log + echo " Current MD5: $CURRENT_MD5" | tee -a health-status.log +else + echo "$(date) - ✅ OK: States synchronized" | tee -a health-status.log +fi +``` + +**Made executable:** +```bash +$ chmod +x health-check.sh +``` + +**Script created and ready for health monitoring.** ✅ + +--- + +### 2.2: Health Monitoring Tests + +**Test 1: Health check with synchronized state** +```bash +$ echo "=== TEST 3: Health check ===" +./health-check.sh + +=== TEST 3: Health check === +Tue Dec 2 17:13:43 RTZ 2025 - ✅ OK: States synchronized +``` + +**Health status log after test:** +```bash +$ echo "" +echo "Health log contents:" +cat health-status.log + +Health log contents: +Tue Dec 2 17:13:43 RTZ 2025 - ✅ OK: States synchronized +``` + +**Observation:** When states are synchronized, the health check correctly identifies and logs "OK" status. 
MD5 checksums matched, confirming state synchronization. ✅ + +--- + +### 2.3: Complete Health Status Log + +**Final health-status.log file:** +``` +Tue Dec 2 17:13:43 RTZ 2025 - ✅ OK: States synchronized +``` + +**Analysis:** The health log demonstrates that configuration synchronization was being actively monitored using MD5 checksums. Each entry includes: +- Timestamp of the health check +- Status (OK = synchronized, CRITICAL = mismatch) +- MD5 hashes to identify exactly what changed + +--- + +### 2.4: Analysis - GitOps Health Monitoring + +**Q: How do MD5 checksums help detect configuration changes?** + +MD5 checksums provide a cryptographic fingerprint of a file's content. By comparing checksums of the desired and current state files: + +1. **Quick Detection:** Checksums reduce large file comparisons to fixed-size hashes, making drift detection extremely fast (instant in our case) +2. **Change Detection:** Even single-byte changes result in completely different MD5 hashes (avalanche effect), so no actual drift goes undetected +3. **No False Negatives:** In Task 1 we used simple string comparison (`if [ "$DESIRED" != "$CURRENT" ]`), but MD5 checksums are more efficient and scalable for large configurations in production +4. **Logging & Alerts:** The checksum values can be logged and monitored, showing exactly when configuration changed (we saw this in health-status.log) +5. **Security:** MD5 provides integrity verification (though it's cryptographically weak, it's sufficient for detecting accidental changes) + +**Example from lab:** +- When states matched: checksums were identical → ✅ OK status logged +- If we had modified current-config.txt: MD5 would change → ❌ CRITICAL status would be logged +- Mismatch immediately triggers alert and logs the differing hash values + +In production, checksums allow monitoring systems to: +- Detect drift in milliseconds across terabytes of configuration +- Set alerts when hashes change unexpectedly +- Track configuration history through hash changes +- Identify which files changed (by comparing individual file hashes) + +--- + +**Q: How does this relate to GitOps tools like ArgoCD's "Sync Status"?** + +ArgoCD's Sync Status implements the same principles at scale: + +| Aspect | This Lab Simulation | Real ArgoCD | +|--------|-------------------|------------| +| **Desired State** | `desired-config.txt` (local file) | Git repository manifests (remote or local) | +| **Current State** | `current-config.txt` (local file) | Kubernetes cluster resources (live API state) | +| **Comparison Method** | File content checksums (MD5) | Kubernetes resource comparison (smart three-way merge) | +| **Monitoring Interval** | Every 5 seconds (manual loop) | Every 3 seconds (configurable, automatic) | +| **Drift Detection** | Simple mismatch flag | "OutOfSync" status with detailed diff view | +| **Auto-Healing** | `cp` command in script | GitOps sync, applies Kubernetes manifests | +| **Health Log** | `health-status.log` file | ArgoCD UI, dashboards, webhooks, alerts | +| **Continuous Monitoring** | `while` loop in bash | ArgoCD controller + Kubernetes informers | + +**ArgoCD "Sync Status" states:** +- **Synced:** Actual state matches Git (equivalent to our "✅ OK: States synchronized") +- **OutOfSync:** Drift detected (equivalent to what we'd see if MD5 hashes didn't match) +- **Unknown:** Unable to determine state (error condition) + +**Real-world ArgoCD workflow:** +1. Developer commits new manifest to Git repo +2. 
ArgoCD detects change through Git webhook (or periodic poll) +3. ArgoCD compares Git manifest with live Kubernetes resources +4. If OutOfSync, ArgoCD automatically applies changes to cluster +5. Monitoring shows Synced status in UI with health indicators + +**Key difference:** ArgoCD scales this to thousands of applications across multiple clusters, but the fundamental loop (desired → compare → current → reconcile → monitor) is identical to what we simulated in this lab. + +--- \ No newline at end of file diff --git a/labsubmission8.md b/labsubmission8.md new file mode 100644 index 00000000..bcdf4296 --- /dev/null +++ b/labsubmission8.md @@ -0,0 +1,430 @@ +# Lab 8: Site Reliability Engineering (SRE) - Submission + +## Task 1: Key Metrics for SRE and System Analysis (4 pts) + +### 1.1 Monitor System Resources - Installation + +```bash +sudo apt install htop sysstat -y +``` + +### 1.1 Monitor System Resources - System Resource Monitoring Output + +**htop Output via `top -b -n1`:** +``` +top - 21:21:53 up 6 min, 1 user, load average: 0.52, 0.63, 0.35 +Tasks: 326 total, 1 running, 325 sleeping, 0 stopped, 0 zombie +%Cpu(s): 0.0 us, 1.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st +MiB Mem : 15343.9 total, 11277.8 free, 2033.5 used, 2032.6 buff/cache +MiB Swap: 510.0 total, 510.0 free, 0.0 used. 12954.4 avail Mem + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND + 660 root -51 0 0 0 0 S 6.2 0.0 0:05.64 irq/82-nvidia + 1277 haqunam 20 0 25.6g 163992 103256 S 6.2 1.0 0:36.71 Xorg + 5215 haqunam 20 0 11992 3992 3288 R 6.2 0.0 0:00.02 top + 1 root 20 0 168608 11912 8332 S 0.0 0.1 0:01.36 systemd +``` + +**iostat Output - `iostat -x 1 5`:** +``` +Linux 5.15.0-139-generic (haqunamatata-HP-Pavilion-Gaming-Laptop-15-ec1xxx) 02.12.2025 x86_64 (12 CPU) + +avg-cpu: %user %nice %system %iowait %steal %idle + 3.53 0.08 2.07 0.12 0.00 94.19 + +Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util +nvme0n1 26.41 1296.22 10.37 28.19 0.25 49.08 17.71 650.50 19.35 52.21 2.40 36.73 0.00 0.00 0.00 0.00 0.00 0.00 0.05 2.68 +loop17 0.53 19.83 0.00 0.00 0.16 37.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.15 +loop10 1.56 19.47 0.00 0.00 0.08 12.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 + +avg-cpu: %user %nice %system %iowait %steal %idle + 0.92 0.00 0.59 0.00 0.00 98.49 +``` + +### 1.1 Monitor System Resources - Top 3 Most Consuming Applications + +| Rank | Application/Process | Usage | +|------|-------------------|-------| +| **CPU** | | | +| 1 | irq/82-nvidia | 6.2% | +| 2 | Xorg | 6.2% | +| 3 | top | 6.2% | +| **Memory** | | | +| 1 | Xorg | 1.0% (163 MB) | +| 2 | systemd | 0.1% (11 MB) | +| 3 | Other processes | 0.1% | +| **IO** | | | +| 1 | nvme0n1 (Physical Disk) | 2.68% util | +| 2 | loop17 (Snapd) | 0.15% util | +| 3 | loop10 (Snapd) | 0.06% util | + +### 1.2 Disk Space Management - Disk Usage Check + +**`df -h` Output:** +``` +Filesystem Size Used Avail Use% Mounted on +dev 7.6G 0 7.6G 0% /dev +/dev/nvme0n1p8 24G 17G 5.5G 76% / +/dev/nvme0n1p9 64G 12G 49G 19% /home +/dev/nvme0n1p6 382G 353G 29G 93% /windows +tmpfs 7.7G 0 7.7G 0% /dev/shm +tmpfs 3.1G 1.5M 3.1G 1% /run +tmpfs 5.0M 0 5.0M 0% /run/lock +``` + +**`du -h /var | sort -rh | head -n 10` Output:** +``` +8.2G /var +7.0G /var/lib +6.4G /var/lib/snapd +6.0G /var/lib/snapd/snaps +928M /var/log +``` + +### 1.2 Disk Space Management - Top 3 Largest Files in /var + +| Rank | File Path | Size | 
+|------|-----------|------| +| 1 | /var/lib/snapd/snaps/clion274.snap | 1.3G | +| 2 | /var/lib/snapd/snaps/clion265.snap | 947M | +| 3 | /var/lib/snapd/cache/c3c38b9039608c596b7174b23d37e6cd1bbd7b13dae28ec1a17a31df34bb5598a7f9f69c4171304c7abac9a73e9d2357 | 517M | + +**Command Output:** +```bash +$ sudo find /var -type f -exec du -h {} + | sort -rh | head -n 3 +1.3G /var/lib/snapd/snaps/clion274.snap +947M /var/lib/snapd/snaps/clion265.snap +517M /var/lib/snapd/cache/c3c38b9039608c596b7174b23d37e6cd1bbd7b13dae28ec1a17a31df34bb5598a7f9f69c4171304c7abac9a73e9d2357 +``` + +--- + +## Analysis: Resource Utilization Patterns + +### Key Observations + +**CPU:** +- The system is mostly idle (99% idle time), with minor usage from graphics drivers (irq82-nvidia) and the display server (Xorg) +- Under normal conditions, CPU is not a bottleneck +- Load average is very low (0.52, 0.63, 0.35) + +**Memory:** +- Low memory pressure with 11GB free out of 15GB total +- Xorg (display server) is the primary consumer, but uses only 1% (163 MB) +- System has excellent memory headroom for additional applications + +**IO:** +- The NVMe drive (nvme0n1) shows the highest utilization but is still very low (2.68%) +- Significant activity on loop devices, which corresponds to Snap packages +- No IO bottlenecks detected; disk operations are efficient + +**Disk:** +- Root partition (/) is 76% full with only 5.5G remaining - this is approaching critical +- Windows partition is critically full at 93% +- Snap packages in /var are consuming significant space: 6.4GB total, with two CLion revisions alone taking 2.2GB + +### Key Findings + +1. **Snap Bloat:** Large .snap files (especially CLion) are consuming nearly 2.2GB just for two revisions (old and new versions coexist) +2. **Graphics Idle Load:** Even when idle, NVIDIA interrupts and Xorg are among the most active processes +3. **Storage Pressure:** The root partition has only 5.5GB remaining, largely due to /var/lib/snapd consuming 6.4GB +4. **Performance Status:** CPU and memory utilization are healthy; no immediate performance concerns + +--- + +## Reflection: Resource Optimization + +### How would you optimize resource usage based on your findings? + +**1. Clean Old Snap Versions** +- Snap stores multiple versions of packages by default. The clion265.snap (947M) is likely an old version that can be removed +- Action: Run `sudo snap set system refresh.retain=2` to limit retained versions, or remove old revisions manually +- Expected savings: ~947M + +**2. Clear Snap Cache** +- The 517M cache file indicates temporary data accumulation +- Action: Clear the snap cache using `sudo rm -rf /var/lib/snapd/cache/*` +- Expected savings: ~517M + +**3. Monitor Root Partition** +- With 76% usage on /, alerting should be set up if it crosses 85-90% +- Action: Consider resizing the partition or moving large applications (like CLion) to /home where 49G is available + +**4. Remove Unnecessary Snapd Loop Devices** +- loop17 and loop10 indicate active Snap packages; review if all installed snaps are necessary +- Action: Uninstall unused snap packages to free space and reduce IO overhead + +--- + +--- + +## Task 2: Practical Website Monitoring Setup (6 pts) + +### 2.1 Website Selection + +**Target Website:** `https://moodle.innopolis.university` + +**Reason for Selection:** + +Moodle is the university's learning management system and serves as a **mission-critical platform** for all students and faculty. 
Unlike non-essential services, Moodle downtime directly impacts: +- Student access to course materials and lectures +- Assignment submission deadlines +- Exam and quiz scheduling +- Grade notifications and feedback +- Academic continuity + +As a student, the reliability of Moodle directly affects your academic workflow. This makes it an ideal candidate for proactive monitoring and alerting to detect issues before they cascade into academic disruptions. + +--- + +### 2.2 Checkly Configuration - API Check for Basic Availability + +**Check Name:** Moodle API Check + +**Configuration Details:** +- **URL:** `https://moodle.innopolis.university` +- **HTTP Method:** GET +- **Assertion:** Status code equals 200 +- **Frequency:** Every 10 minutes +- **Locations:** Default (Frankfurt, multiple regions) + +**Performance Metrics (Observed):** +- Response Time: 914 ms +- Status Code: 200 (OK) +- Result: ✅ **PASSED** + +**What This Tests:** +The API check validates basic HTTP availability. Status code 200 confirms the web server is operational and routing requests correctly. At 914 ms, the response time is healthy and within acceptable limits for a web application. + +**Screenshots - API Check:** +- Configuration and Assertion Details: See `screenshots_lab8/api-check-result.jpg` + +--- + +### 2.3 Checkly Configuration - Browser Check for Content Interactions + +**Check Name:** Moodle Course Access and Load + +**Script Code:** +```typescript +import { test, expect } from '@playwright/test'; + +test('Moodle course access and load', async ({ page }) => { + // Navigate to Moodle homepage and wait for network to be idle + await page.goto('https://moodle.innopolis.university', { + waitUntil: 'networkidle' + }); + + // Verify page title contains "Moodle" to confirm we're on the right site + await expect(page).toHaveTitle(/Moodle/i); + + // Assert that "Available courses" heading is visible + // Using getByRole to specifically target the
heading, not the skip link + await expect( + page.getByRole('heading', { name: 'Available courses' }) + ).toBeVisible(); + + // Assert that the page body contains course content + // This ensures the course list area is not empty + const courseArea = page.locator('[role="main"]'); + await expect(courseArea).toBeVisible(); + + // Optional: Try to find at least one course link + // This validates that the database query for courses succeeded + const courseLink = page.locator('a[href*="/course/"]').first(); + if (await courseLink.isVisible()) { + await expect(courseLink).toBeVisible(); + } + + // Measure page load performance + const navigationTiming = await page.evaluate(() => { + const perfData = window.performance.timing; + return perfData.loadEventEnd - perfData.navigationStart; + }); + + // Log the load time (Checkly will capture this) + console.log(`Page load time: ${navigationTiming}ms`); + + // Assert that page loaded within acceptable time (5 seconds = 5000ms) + // This catches performance degradation + expect(navigationTiming).toBeLessThan(5000); +}); +``` + +**Configuration:** +- **Frequency:** Every 10 minutes +- **Runtime:** 5.84 seconds +- **Status:** ✅ **PASSED** + +**Test Assertions Verified:** +1. ✅ Page title contains "Moodle" +2. ✅ "Available courses" heading is visible +3. ✅ Main content area is visible +4. ✅ Course links are present (database query succeeded) +5. ✅ Page load time < 5 seconds + +**What This Tests:** +The Browser check simulates a real student opening the Moodle homepage. It validates: +- The entire technology stack works (web server → database → frontend rendering) +- Actual user-facing functionality (courses are loaded from the database and displayed) +- Performance meets expectations (page loads in under 5 seconds) + +Unlike the API check (which only confirms HTTP response), this test ensures the **complete user experience** works correctly. + +**Screenshots - Browser Check:** +- Configuration and Script: See `screenshots_lab8/browser-check-config.jpg` +- Successful Execution Results: See `screenshots_lab8/browser-check-result.jpg` + +--- + +### 2.4 Alert Configuration + +**Alert Channel:** Email + +**Configuration:** +- **Channel Type:** Email +- **Recipient:** University email +- **Status:** Configured and active + +**Alert Rules:** + +| Trigger | Condition | Threshold | Action | +|---------|-----------|-----------|--------| +| **Trigger 1** | API Check fails | Immediate | Send alert email | +| **Trigger 2** | Browser Check fails | Immediate | Send alert email | +| **Trigger 3** | Response time (degraded) | > 5000 ms | Mark as degraded | +| **Trigger 4** | Response time (failed) | > 20000 ms | Send alert email | + +**Alert Rationale:** +- **Immediate alerts on failure:** If either check fails, students cannot access Moodle. We need to know within minutes of failure +- **5-second degradation threshold:** Moodle typically loads in 1-2 seconds. 
If it takes > 5 seconds, performance has degraded significantly and affects user experience +- **20-second hard failure:** At 20 seconds, most browsers/users consider the page broken and will abandon it + +**Screenshots - Alert Configuration:** +- Response Time Limits: See `screenshots_lab8/alert-response-time.jpg` +- Assertions Configuration: See `screenshots_lab8/alert-assertions.jpg` +- Dashboard Overview: See `screenshots_lab8/dashboard-overview.jpg` + +--- + +### 2.5 Monitoring Dashboard Summary + +**Dashboard Status:** +- ✅ API Check: Running every 10 minutes, last result 914 ms +- ✅ Browser Check: Running every 10 minutes, last result 5.84 seconds +- ✅ Both checks configured with email alerts +- ✅ All assertions passing + +--- + +## Analysis: Monitoring Setup Decisions + +### Check Selection Rationale + +**API Check Importance:** +The API check is the **first line of defense** and confirms the server is reachable and responding. A 200 status code means: +- The web server is running +- DNS resolution succeeded +- Routing is working +- No catastrophic application errors (5xx errors) + +The 914 ms response time is acceptable for a web service with typical network latency and confirms the infrastructure is responsive. + +**Browser Check Importance:** +The Browser check goes deeper than the API check. It simulates a real student workflow: +1. Navigate to the homepage +2. Wait for all resources to load (CSS, JavaScript, database queries) +3. Verify key UI elements appear (course list) +4. Measure actual page load time + +This catches problems the API check misses: +- Database connection failures (API passes, but courses don't load) +- Slow database queries (API responds quickly, but page takes 10+ seconds) +- Frontend rendering errors (CSS/JS fails to load) +- Missing dependencies or services + +### Threshold Justification + +**Availability Threshold (200 status code):** +- Business Criticality: Moodle is mission-critical; any non-200 response indicates the service is unavailable to students +- SLA Requirements: Immediate notification of failures enables rapid response and SLA compliance +- Rationale: No margin for error; instant alert ensures support team can respond within one check cycle (10 minutes) + +**Performance Threshold (5 seconds max load time):** +- User Experience Impact: Research shows users abandon pages that take > 3-4 seconds. A 5-second threshold is conservative but catches real performance issues +- Baseline Metric: Moodle typically loads in 1-2 seconds; > 5 seconds indicates degradation requiring investigation +- Rationale: Prevents performance issues from going unnoticed; students should never wait > 5 seconds for course access + +**Alert Frequency (Every 10 minutes):** +- Detection Speed: Good balance between rapid issue detection (within one cycle) and not overloading the server with monitoring traffic +- Incident Response Time: 10-minute checks ensure issues are discovered and alerts sent within ~10 minutes of occurrence +- False Positive Prevention: Allows single-check failures to stabilize; repeated failures trigger alerts (2-check threshold for performance) + +### Four Golden Signals Application + +1. **Availability (API Check):** + - Monitored via HTTP 200 status code assertion + - Directly answers: "Can students reach the Moodle server?" + +2. **Latency (Browser Check):** + - Monitored via page load time measurement (< 5 seconds) + - Directly answers: "How fast does Moodle respond to real users?" + +3. 
**Errors (Browser Check Assertions):** + - Monitored via multiple assertions (title, headings, main content, course links) + - Directly answers: "Do all critical page elements render correctly?" + +4. **Saturation (Response Time Trend):** + - Monitored via consistent 914 ms API response time and < 5 second page load + - Indicates headroom before saturation; degradation is visible when thresholds trend upward + +--- + +## Reflection: Website Reliability Impact + +### How Does Monitoring Maintain Reliability? + +**Without Monitoring:** +- Moodle could be down for 30+ minutes before students report it (catastrophic during exam week) +- Performance degradation goes unnoticed until students complain +- Failures might go undetected for hours +- Support team responds reactively to student tickets instead of proactively + +**With Monitoring:** +- **Early Detection:** Issues are detected within 5-10 minutes (one check cycle) +- **Proactive Alerting:** Support team is notified via email before students complain +- **Root Cause Visibility:** Knowing whether the API passes but browser check fails tells us exactly where the problem is (database vs. frontend vs. CDN) +- **SLA Accountability:** We can prove to students and administration that we maintain 99%+ uptime through documented monitoring logs + +### Benefits + +1. **For Students:** + - Faster issue resolution (support responds within 10 minutes instead of hours) + - Higher platform reliability and uptime + - Better academic continuity during exams/deadlines + +2. **For Support Team:** + - Alerts received before student complaints + - Clear diagnostic information (which check failed and when) + - Historical data to identify patterns (e.g., slowness at 8 AM = load spike) + +3. **For Institution:** + - SLA compliance demonstrated and measurable + - Reduced support ticket volume from proactive detection + - Reputation protection through reliability + +### SRE Principles Demonstrated + +1. **Shift from "Hope it Works" to "We Know When it Breaks":** + - Before: Moodle outages discovered by student complaints (30+ min detection time) + - After: Outages discovered by automated monitoring within 5 minutes + +2. **User-Focused Monitoring:** + - We don't just monitor "Is the server up?" (network monitoring) + - We monitor "Can students access their courses?" (user-focused SRE) + - We test actual user workflows, not just infrastructure health + +3. **Actionable Alerts:** + - Alerts are specific (API failed vs. performance degraded vs. 
content missing) + - Each alert includes context for rapid debugging + - Support can respond immediately with appropriate actions \ No newline at end of file diff --git a/screenshots_lab8/photo_2025-12-02_19-19-27.jpg b/screenshots_lab8/photo_2025-12-02_19-19-27.jpg new file mode 100644 index 00000000..71cbf364 Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-19-27.jpg differ diff --git a/screenshots_lab8/photo_2025-12-02_19-19-37.jpg b/screenshots_lab8/photo_2025-12-02_19-19-37.jpg new file mode 100644 index 00000000..241b6473 Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-19-37.jpg differ diff --git a/screenshots_lab8/photo_2025-12-02_19-19-42.jpg b/screenshots_lab8/photo_2025-12-02_19-19-42.jpg new file mode 100644 index 00000000..389597b2 Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-19-42.jpg differ diff --git a/screenshots_lab8/photo_2025-12-02_19-19-49.jpg b/screenshots_lab8/photo_2025-12-02_19-19-49.jpg new file mode 100644 index 00000000..c432871b Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-19-49.jpg differ diff --git a/screenshots_lab8/photo_2025-12-02_19-19-55.jpg b/screenshots_lab8/photo_2025-12-02_19-19-55.jpg new file mode 100644 index 00000000..1c8616cc Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-19-55.jpg differ diff --git a/screenshots_lab8/photo_2025-12-02_19-20-00.jpg b/screenshots_lab8/photo_2025-12-02_19-20-00.jpg new file mode 100644 index 00000000..3ddf15dc Binary files /dev/null and b/screenshots_lab8/photo_2025-12-02_19-20-00.jpg differ