## Description

This module performs the following tasks:

- create an instance template from which execute points will be created
- create a managed instance group (MIG) for execute points
- create a Toolkit runner to configure the autoscaler to scale the MIG

It is expected to be used with the htcondor-install and htcondor-configure modules.

## Known limitations

This module may be used exactly once or twice in a blueprint to create sets of execute points in an HTCondor pool. If one set is used, it may use either Spot or on-demand pricing. If two sets are used, one must use Spot pricing and the other must use on-demand pricing. If this constraint is not followed, `terraform apply` will likely fail with an error similar to the one shown below. Support for more than 2 sets of VM configurations, including all pricing options, is planned for future development.

```text
│     │ var.runners is list of map of string with 7 elements
│
│ All startup-script runners must have a unique destination.
│
│ This was checked by the validation rule at modules/startup-script/variables.tf:72,3-13.
```

## How to run HTCondor jobs on Spot VMs

HTCondor access points provisioned by the Toolkit are specially configured to add an attribute named `RequireSpot` to each job ClassAd. When this value is true, the job's requirements are automatically updated to require that it run on a Spot VM; when false, the requirements are updated so that it runs only on on-demand VMs. The default value of this attribute is false. A job submit file may override the default as shown below.

```
universe       = vanilla
executable     = /bin/echo
arguments      = "Hello, World!"
output         = out.$(ClusterId).$(ProcId)
error          = err.$(ClusterId).$(ProcId)
log            = log.$(ClusterId).$(ProcId)
request_cpus   = 1
request_memory = 100MB
+RequireSpot   = true
queue
```
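
Assuming the submit description above is saved to a file (here hypothetically named `spot-job.sub`), it can be submitted from an access point with the standard HTCondor tools:

```shell
# Submit the job; +RequireSpot = true restricts it to Spot execute points
condor_submit spot-job.sub

# Monitor the job as it is matched and runs
condor_q
```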

## Example

A full example can be found in the examples README.

The following code snippet creates a pool with two sets of HTCondor execute points, one using on-demand pricing and the other using Spot pricing. Both use a startup script and network created in earlier steps.

```yaml
- id: htcondor_execute_point
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_configure_execute_point
  settings:
    service_account:
      email: $(htcondor_configure.execute_point_service_account)
      scopes:
      - cloud-platform

- id: htcondor_execute_point_spot
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_configure_execute_point
  settings:
    spot: true
    service_account:
      email: $(htcondor_configure.execute_point_service_account)
      scopes:
      - cloud-platform

- id: htcondor_startup_access_point
  source: modules/scripts/startup-script
  settings:
    runners:
    - $(htcondor_install.install_htcondor_runner)
    - $(htcondor_install.install_autoscaler_deps_runner)
    - $(htcondor_install.install_autoscaler_runner)
    - $(htcondor_configure.access_point_runner)
    - $(htcondor_execute_point.configure_autoscaler_runner)
    - $(htcondor_execute_point_spot.configure_autoscaler_runner)

- id: htcondor_access
  source: modules/compute/vm-instance
  use:
  - network1
  - htcondor_startup_access_point
  settings:
    name_prefix: access-point
    machine_type: c2-standard-4
    service_account:
      email: $(htcondor_configure.access_point_service_account)
      scopes:
      - cloud-platform
```

## Support

HTCondor is maintained by the Center for High Throughput Computing at the University of Wisconsin-Madison. Support for HTCondor is available via the community mailing lists and other resources listed on the HTCondor website, https://htcondor.org.

## Known Issues

When using OS Login with "external users" (users from outside the Google Cloud organization), Docker universe jobs will fail and cause the Docker daemon to crash. This stems from the use of POSIX user IDs (UIDs) outside the range supported by Docker. If this atypical situation applies, consider disabling OS Login as shown below.

```yaml
vars:
  # add setting below to existing deployment variables
  enable_oslogin: DISABLE
```

## License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.13.0 |

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| execute_point_instance_template | terraform-google-modules/vm/google//modules/instance_template | ~> 8.0 |
| mig | terraform-google-modules/vm/google//modules/mig | ~> 8.0 |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| `deployment_name` | HPC Toolkit deployment name. HTCondor cloud resource names will include this value. | `string` | n/a | yes |
| `disk_size_gb` | Boot disk size in GB | `number` | `100` | no |
| `enable_oslogin` | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | `string` | `"ENABLE"` | no |
| `image` | HTCondor execute point VM image | `object({ family = string, project = string })` | `{ "family": "hpc-centos-7", "project": "cloud-hpc-image-public" }` | no |
| `labels` | Labels to add to HTCondor execute points | `map(string)` | n/a | yes |
| `machine_type` | Machine type to use for HTCondor execute points | `string` | `"n2-standard-4"` | no |
| `max_size` | Maximum size of the HTCondor execute point pool | `number` | `100` | no |
| `metadata` | Metadata to add to HTCondor execute points | `map(string)` | `{}` | no |
| `min_idle` | Minimum number of idle VMs in the HTCondor pool (if the pool reaches `var.max_size`, this minimum is not guaranteed); set to ensure jobs begin running more quickly | `number` | `0` | no |
| `network_self_link` | The self link of the network HTCondor execute points will join | `string` | `"default"` | no |
| `network_storage` | An array of network attached storage mounts to be configured | `list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string, client_install_runner = map(string), mount_runner = map(string) }))` | `[]` | no |
| `project_id` | Project in which the HTCondor execute points will be created | `string` | n/a | yes |
| `region` | The region in which HTCondor execute points will be created | `string` | n/a | yes |
| `service_account` | Service account to attach to HTCondor execute points | `object({ email = string, scopes = set(string) })` | `{ "email": null, "scopes": ["https://www.googleapis.com/auth/cloud-platform"] }` | no |
| `spot` | Provision VMs using discounted Spot pricing, allowing for preemption | `bool` | `false` | no |
| `startup_script` | Startup script to run at boot-time for HTCondor execute points | `string` | `null` | no |
| `subnetwork_self_link` | The self link of the subnetwork HTCondor execute points will join | `string` | `null` | no |
| `target_size` | Initial size of the HTCondor execute point pool; set to `null` (default) to avoid Terraform management of size | `number` | `null` | no |
| `zone` | The default zone in which resources will be created | `string` | n/a | yes |
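
As an illustrative sketch of how several of these inputs combine (all values here are assumptions, not recommendations), a blueprint might create a Spot-based set of execute points that keeps 2 idle VMs warm and scales up to 50 VMs:

```yaml
- id: htcondor_execute_point_spot
  source: community/modules/compute/htcondor-execute-point
  use:
  - network1
  - htcondor_configure_execute_point
  settings:
    spot: true                  # provision Spot VMs (may be preempted)
    machine_type: n2-standard-8 # example machine type
    min_idle: 2                 # keep 2 idle VMs so new jobs start quickly
    max_size: 50                # autoscaler will not grow the MIG past 50 VMs
```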

## Outputs

| Name | Description |
|------|-------------|
| `configure_autoscaler_runner` | Toolkit runner to configure the HTCondor autoscaler |