Releases: finos/htc-grid
v0.4.3
What's changed
This release fixes all the relevant security issues in the current code base, as detected by cfn_lint, trivy, checkov, and ScoutSuite.
Terraform State:
- Encrypt and secure the init_grid state and Lambda buckets.
- Limit the scope of the KMS key policy for the state buckets.
- Remove AccessControls and use a BucketPolicy to keep the bucket private.
- Configure all Makefiles to use encrypted S3 buckets for TF state, use non-root Dockerfiles, fix `HTCGRID_ECR_REPO`, name the CloudFormation stack outputs, and support updating an existing init_grid stack.
- Improve the init_grid Makefile to better handle the initial-creation and deletion cases.
- Add support for cleaning up S3 object versions and standardize bucket variable naming.
HTC Grid Containers:
- Configure all Dockerfiles to run non-root containers and fix builds.
- Configure all HTC K8S resources to run with `runAsNonRoot`, a default `seccompProfile`, and `allowPrivilegeEscalation` disabled.
- Rename components, add `readOnlyFileSystem` and a seccomp profile to the HTC Agent, and fix and clean up code.
- Remove file system write dependencies for the agent.
- Harden the K8S manifests and enforce further checkov rules.
- Configure the Grafana Ingress to drop invalid HTTP header fields.
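Taken together, the container hardening above maps to a small set of `securityContext` fields. A minimal sketch via the Terraform kubernetes provider, with illustrative resource and image names (the actual charts and manifests in the repo may differ):

```hcl
# Sketch only: names and image are illustrative, not the repo's actual values.
resource "kubernetes_deployment_v1" "htc_agent" {
  metadata {
    name = "htc-agent"
  }

  spec {
    selector {
      match_labels = { app = "htc-agent" }
    }

    template {
      metadata {
        labels = { app = "htc-agent" }
      }

      spec {
        container {
          name  = "agent"
          image = "example.local/htc-agent:latest"

          security_context {
            run_as_non_root            = true  # refuse to start as UID 0
            allow_privilege_escalation = false # no setuid-style escalation
            read_only_root_filesystem  = true  # relies on removing FS writes, as above
            seccomp_profile {
              type = "RuntimeDefault" # runtime's default syscall filter
            }
          }
        }
      }
    }
  }
}
```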
HTC Grid Control Plane:
- Configure CMK KMS key encryption for VPC Flow Logs, ECR repositories, SQS, DynamoDB, S3, the EKS cluster, EKS MNG EBS volumes, and all CloudWatch Logs.
- Add encrypted CloudWatch logging for API Gateway.
- Create S3 via a TF module, add encryption support for the S3 data plane in the agent, and fix AWS partition and DNS suffix usage.
- Simplify code and move all Lambdas and auth to the control_plane.
- Configure and consolidate least-privilege permissions on the KMS, Lambda, and Agent IAM policies.
- Add KMS Decrypt and GenerateDataKey permissions to the Lambda and Agent policies.
- Move the installation of jq onto the Lambda images and fix the bootstrap script.
- Convert ElastiCache Redis to a single-replica cluster mode and add encryption.
- Add AUTH for the ElastiCache Redis cluster.
- Enable X-Ray tracing for the Lambda functions and adjust the Redis config.
- Add an explicit ASG service-linked role declaration to enable KMS support for ASG EBS volumes.
- Handle cases where AWSServiceRoleForAutoScaling already exists.
- Add S3 and SQS resource policies to enforce HTTPS, and create a separate CMK KMS key for the DLQ of each SQS queue.
- Configure the DLQs to be used with their respective SQS queues and fix naming/references.
- Add security group and ACL controls where possible.
- Configure the securityContext for OpenAPI.
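The HTTPS enforcement mentioned above is typically a Deny statement keyed on `aws:SecureTransport`. A sketch of such a bucket policy in Terraform, assuming a hypothetical `aws_s3_bucket.data` resource:

```hcl
# Sketch only: the bucket reference is hypothetical.
resource "aws_s3_bucket_policy" "https_only" {
  bucket = aws_s3_bucket.data.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyInsecureTransport"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.data.arn,        # the bucket itself
        "${aws_s3_bucket.data.arn}/*", # all objects in it
      ]
      Condition = {
        Bool = { "aws:SecureTransport" = "false" } # any non-TLS request
      }
    }]
  })
}
```

The analogous SQS policy uses the same condition key in an `aws_sqs_queue_policy`.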
General:
- Add GitHub workflows for cfn_lint, trivy, and checkov.
- Standardize, fix, and simplify the tests.
- Standardize the naming of TF resources.
- Fix the docs and `random_password` to align with the pipelines.
- Add auto deploy & destroy stages for images.
- Update all copyright notices to the current year (2024).
Cloud9:
- Fix the Cloud9 deployment script to target the correct instances.
- Fix the Cloud9 bootstrap race condition and adjust it for WS.
- Force a reinstall at bootstrap time to fix virtualenv issues.
- Add support for specifying a Git repo/branch for HTCGridSource.
- Remove the Admin role from the KMS admins, as it doesn't exist in WS.
Full Changelog: v0.4.2...v0.4.3
v0.4.2
- Remove CDK as IaC for deploying HTC Grid.
- Remove any hardcoded dependency on `urllib3`.
- Migrate the Lambda function runtime from Python 3.7 to Python 3.11.
v0.4.1
- Move the deployment of the Helm charts outside of the EKS Blueprints Addons module to native TF resource(s), to better handle the resource dependencies on those addons and simplify code.
- Switch the Grafana ingress to use the new `ingressClassName` spec field instead of the deprecated `kubernetes.io/ingress.class` annotation.
- Switch to using the `kubernetes_annotations` TF resource to manage the Cognito annotations for the Grafana Ingress.
- Adjust the workshop notes on creating a Cognito user for a user pool with sign-up disabled.
- Add the ability to always use the latest released tag in the Cloud9 instance deployment.
- Fix the Private API Gateway and resource policy race condition/dependency.
- Fix `image_repository` destroy issues introduced since adding explicit region flags to the ECR commands.
- Fix a missing comma in `state_table_dynamodb.py`.
- Add an explicit region flag when listing ECR repos in the workshop.
- Clean up and adjust workshop notes, code, comments, and other docs (i.e. the FSI Whitepaper link).
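For reference, the `ingressClassName` switch replaces an annotation with a first-class spec field. A minimal sketch using the Terraform kubernetes provider, where the class, service name, and port are assumptions rather than the repo's actual values:

```hcl
# Sketch only: class, host, service, and port are illustrative.
resource "kubernetes_ingress_v1" "grafana" {
  metadata {
    name = "grafana"
    # previously: annotations = { "kubernetes.io/ingress.class" = "alb" }
  }

  spec {
    ingress_class_name = "alb" # first-class field, replaces the annotation

    rule {
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = "grafana"
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}
```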
v0.4.0
EKS Cluster & Nodes:
- Change to using terraform-aws-modules/eks for managing and deploying the EKS cluster as well as related resources, such as: node IAM roles & policies, node defaults (incl. instance types), security groups, and the AWS Auth ConfigMap.
- Change to using EKS Managed Node Groups for all of the Core and Worker node groups.
- Configure the Cluster Autoscaler to manage the scaling and lifecycle of the EKS Managed Node Groups.
- Disable the AWS Node Termination Handler, as it shouldn't be used in conjunction with EKS Managed Node Groups.
- Simplify and standardize VPC Endpoint creation. Add an EKS private VPC endpoint to allow internal communication from the private subnet with the EKS control plane.
- Change node taints from `grid/type: Operator` to `htc/node-type: core` and `htc/node-type: worker`. Add those as labels and tags as well, to simplify operations and cluster visibility via kubectl and other monitoring solutions.
- Adjust the default instance types for the Core and Worker node groups to allow for better diversification and deployment, both for OnDemand and Spot workloads.
- Change to using `cluster_name` instead of `eks_cluster_id` everywhere, in line with the new module changes.
- Add the ability to specify the EBS volume type and size for the EKS nodes.
EKS AddOns:
- Change to eks-blueprints-addons for managing and deploying all of the EKS Blueprints AddOns and OSS Helm releases, such as: CoreDNS, kube-proxy, VPC CNI, FluentBit, Cluster Autoscaler, AWS Load Balancer Controller, CloudWatch Metrics, KEDA, InfluxDB, Prometheus & Grafana, as well as all the relevant configuration.
- Add implicit and explicit dependencies to fix the race conditions where the AWS Load Balancer Controller may get deleted before being able to clean up the AWS resources that it manages. The new dependency order guarantees a proper cleanup of those resources before the AWS Load Balancer Controller is destroyed during unprovisioning.
- Fix the explicit and implicit dependencies between the Kubernetes data sources and the underlying resources created by the EKS Blueprints Addons module.
- Move ingress and dashboard creation for Grafana to be handled via the Helm chart, and clean up the unneeded additional Terraform resources. Add the Grafana Ingress URL as a Terraform output for the module.
- Adjust the image and repo configuration to pull the correct version of the Cluster Autoscaler and other components.
- Adjust the node selectors for the FluentBit and CloudWatch agent DaemonSets to deploy to all nodes.
- Switch to using the new Go-based high-performance FluentBit logger for CloudWatch.
- Disable the Grafana Live server (as it requires WebSockets).
- Add cookie-based session stickiness to the Grafana ingress to allow the ALB Controller and the Grafana HA deployment to handle auth properly.
- Fix the FluentBit-based Container Insights logs.
- Extend the CoreDNS creation timeout to 25 minutes to allow the control plane to self-heal in case of issues.
HTC-Grid:
- Change to using eks-blueprints-addon for deploying the HTC-Grid Helm chart as well as creating the respective IRSA role.
- Adjust IAM policies & permissions (ensuring CloudWatch Log Group lifecycle handling is done via Terraform), as well as formatting and naming, to ensure consistency for all the Lambdas.
- Split the control plane Lambda definitions into individual TF files, simplifying configuration, visibility, and grouping for the resources created.
Terraform & Helm:
- Adjust all of the Terraform Registry modules to use `~>` version pinning, allowing any new non-major versions to be used (any minor and patch updates are allowed), simplifying dependency version updates and ensuring consistency.
- Upgrade all of the Terraform modules from the Terraform Registry to the current latest versions.
- Upgrade all of the Terraform providers to the latest available versions, with major version pinning using the `~>` operator.
- Upgrade all of the Helm charts and container images to the current latest versions for all of the components.
- Remove image-level pinning of Helm AddOn components, pinning only via the Helm release versions.
- Remove unneeded explicit `depends_on` statements, which cause slowness and cyclic dependencies or failures on plan (by not allowing data sources to be computed before an apply).
- Fix a cyclic dependency and remove the need for running targeted applies for the IAM policies for the EKS Pull Through Cache and Agent permissions in the `apply`/`auto-apply` stages.
- Move to using `aws_api_gateway_rest_api_policy` instead of a direct policy attachment of a generic policy for `OpenAPI Private`, which showed changes on every `terraform apply` due to the wildcard allow policy.
- Configure the AWS CloudWatch Metrics and AWS for FluentBit deployments to run on the Core nodes.
- Configure Grafana to start two replicas and spread them across different nodes for high availability.
- Clean up the Helm chart `values.yaml` files, removing any unneeded and unrequired config and simplifying the deployments. Consolidate the Helm chart versions into a single variable for ease of change and visibility.
- Remove unneeded data sources and use module outputs as required, to also enforce consistent implicit dependencies in Terraform.
- Simplify and consolidate the variable definitions, usage, and functions across all of the resources and modules.
- Adjust output and variable descriptions, types, and values to reflect the required information and ensure consistency.
- Adjust provider configurations to ensure correct credential retrieval and handling.
- Use `aws_htc_ecr` consistently across all of the Helm charts as the ECR source repository for pulling internal and pull-through images.
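The `~>` (pessimistic) constraint mentioned above permits upgrades of the rightmost version component only. A small sketch with hypothetical pins:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, never 6.0 (hypothetical pin)
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1" # >= 5.1.0 and < 6.0.0 (hypothetical pin)
}
```

A three-part pin such as `~> 5.1.0` would instead allow only patch releases (`>= 5.1.0, < 5.2.0`).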
New Features:
- Upgrade ElastiCache to version 7 and start using AWS Graviton3 based `cache.r7g.large` instance(s) for the Redis cluster.
- Add the ability to do in-place upgrades of the ElastiCache clusters by versioning the Parameter Groups created/used.
- Add a `watch_htc.sh` script, which can be used to monitor the status of a Kubernetes job running tasks on HTC-Grid, as well as the status of the overall compute plane, including the HPA, Deployment, Nodes, and Job Completion statuses as well as durations. The script takes two arguments: the namespace to be watched and the name of the Kubernetes job.
- Add support for correct handling of the AWS Partition as well as the AWS Partition DNS Suffix.
- Add the ability to automatically manage the lifecycle of the self-signed ALB certificates via the deployment process (any certs about to expire get automatically updated and rolled out without any downtime).
- Migrate to using AWS Certificate Manager instead of IAM Server Certificates for the ALB certs.
- Increase the self-signed ALB cert validity to 1 year, with auto-renew if run within 6 months of the expiration time.
- Add the ability to automatically create, update, and destroy an `admin` Cognito user via the deployment, to be used for Grafana authentication, reducing the need for manual steps during the setup as well as the workshop.
- Add user cleanup on `destroy` for the `admin` Cognito user (created for use with Grafana), as well as the relevant Cognito config for the Grafana Ingress.
- Switch to creating the Cognito user for Grafana using TF-native resources.
- Switch the `grafana_admin_password` variable to be sensitive everywhere.
- Add a template file and generation for submitting a batch of multi-session tasks, instead of copying/replacing at runtime during the workshop. Adjust the docs/workshop accordingly.
Lambda Runtimes:
- Unify all of the `lambda_runtimes` into a single Dockerfile, driving behavior via build-time arguments.
- Add package updates at build time (incl. cache clearing post-update), to ensure the latest updates are always included in the runtime images.
- Migrate all build runtimes to use the ECR Pull Through Cache for the build images.
- Simplify and consolidate the Lambda runtime build-and-push Terraform resources into a single map of resources.
- Fix the Lambda runtimes Dockerfile to handle a different entrypoint source script for the provided runtime.
ECR & Image Builds:
- Change all container images to use the ECR Pull Through Cache where possible.
- Add a new pull-through-cache config for `registry.k8s.io`, to allow pulling any cluster components automatically, e.g. the `cluster-autoscaler`.
- Add a flag (`REBUILD_RUNTIMES`) which allows re-creating the local images for all the runtimes (without using the cache) and pushing them to ECR.
- Clean up `image_repository`, keeping the minimum number of required external dependencies (those not available via an ECR Pull Through Cache) to be manually copied over to the local ECR repositories.
- Add the ability to clean up the ECR Pull Through Cache repositories upon running `destroy-images`.
- Add image scanning on push/upload for all of the ECR repositories.
- Move to using `for_each` instead of `count` for the ECR repositories, ensuring they don't get destroyed by a simple order change in the JSON config.
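The `for_each` change above matters because `count` addresses resources by list index, so reordering the config re-indexes and therefore destroys/recreates repositories, whereas `for_each` keys them by name. A sketch with illustrative repository names:

```hcl
variable "repositories" {
  type    = set(string)
  default = ["lambda-runtime", "htc-agent"] # illustrative names only
}

resource "aws_ecr_repository" "this" {
  # Addressed as aws_ecr_repository.this["lambda-runtime"], etc.,
  # which is stable under reordering of the input.
  for_each = var.repositories
  name     = each.key

  image_scanning_configuration {
    scan_on_push = true # scan every pushed image
  }
}
```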
Cloud9:
- Fix all of the Cloud9 bootstrap errors, handle different packages, correctly install and upgrade all the components, and improve the bootstrap logging to increase visibility into the success or issues of the Cloud9 deployment.
- Update default versions for all pre-requisites for the Clo...
v0.3.6
- Adding support for Java based Lambda Workers #64
- Adding automated Bandit security checks for pull requests #55
- DynamoDB degrading state refactoring #52
- Fixing instance profile association in the context of Config rule #51
- Fix: automatically added timestamp upon task completion into DDB #43
- Fixing Cloud9 deployment outside of EventEngine #46
- Adding CDK as a deployment tool for the HTC Grid #39
- demo update 2215871
- feat: migration tentative to EKS blueprint d65abca
- Adding Java runtime for Worker Lambdas + QuantLib example 9444a17
v0.3.5
Merge pull request #38 from ruecarlo/main: fixed issue in Cloud9 environment.