diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000000..a5649d90d0d --- /dev/null +++ b/docs/README.md @@ -0,0 +1,34 @@ +# Docs + +This folder contains the documentation for the Armada project. The documentation is written in markdown and is rendered as webpages on [armadaproject.io](https://armadaproject.io). + +It's accessible from the IDE, GitHub, and the website. + +## For Developers + +See [website.md](./developer/website.md) + +## Overview + +Docs added to this the `docs/` folder are automatically copied into [armadaproject.io](https://armadaproject.io). + +For example, if you wanted to document bananas, and you added `bananas.md`, +once committed to master that would be published at +`https://armadaproject.io/bananas/`. + +> [!NOTE] +> All files in `docs/` folder are rendered as webpage except this `README.md` file. + +## Pages with assets + +If you'd like to add a more complex page, such as one with images or other +linked assets, you have to be careful to ensure links will work both +for people viewing in GitHub and for those viewing via [armadaproject.io](https://armadaproject.io). + +The easiest way to accomplish this is by using page bundles. Assets should be located inside the `docs/` folder and +used in the markdown file with relative paths. + +## Removing pages + +Any page that is removed from the `docs/` folder will be removed from the website automatically. The `docs/` folder is +the source of truth for the website's content. diff --git a/docs/consistency.md b/docs/consistency.md index 7b1644e7568..5d39cb7157d 100644 --- a/docs/consistency.md +++ b/docs/consistency.md @@ -7,7 +7,7 @@ Armada stores its state across several databases. Whenever Armada receives an AP There are three commonly used approaches to address this issue: * Store all state in a single database with support for transactions. Changes are submitted atomically and are rolled back in case of failure; there are no partial failures. -* Distributed transaction frameworks (e.g., X/Open XA), which extend the notation of transactions to operations involving several databases. +* Distributed transaction frameworks (e.g., X/Open XA), which extend the notation of transactions to operations involving several databases. * Ordered idempotent updates. The first approach results in tight coupling between components and would limit us to a single database technology. Adding a new component (e.g., a new dashboard) could break existing component since all operations part of the transaction are rolled back if one fails. The second approach allows us to use multiple databases (as long as they support the distributed transaction framework), but components are still tightly coupled since they have to be part of the same transaction. Further, there are performance concerns associated with these options, since transactions may not be easily scalable. Hence, we use the third approach, which we explain next. diff --git a/docs/demo.md b/docs/demo.md new file mode 100644 index 00000000000..424dc80bab0 --- /dev/null +++ b/docs/demo.md @@ -0,0 +1,144 @@ +# Armada Demo + +
+ +
+ +> This video demonstrates the use of Armadactl, Armada Lookout UI, and Apache Airflow. + +This guide will show you how to take a quick test drive of an Armada +instance already deployed to AWS EKS. + +## EKS + +The Armada UI (lookout) can be found at this URL: + +- [https://ui.demo.armadaproject.io](https://ui.demo.armadaproject.io) + +## Local prerequisites + +- Git +- Go 1.20 + +## Obtain the armada source +Clone [this](https://github.com/armadaproject/armada) repository: + +```bash +git clone https://github.com/armadaproject/armada.git +cd armada +``` + +All commands are intended to be run from the root of the repository. + +## Setup an easy-to-use alias +If you are on a Windows System, use a linux-supported terminal to run this command, for example [Git Bash](https://git-scm.com/downloads) or [Hyper](https://hyper.is/) +```bash +alias armadactl='go run cmd/armadactl/main.go --armadaUrl armada.demo.armadaproject.io:443' +``` + +## Create queues and jobs +Create queues, submit some jobs, and monitor progress: + +### Queue Creation +Use a unique name for the queue. Make sure you remember it for the next steps. +```bash +armadactl create queue $QUEUE_NAME --priorityFactor 1 +armadactl create queue $QUEUE_NAME --priorityFactor 2 +``` + +For queues created in this way, user and group owners of the queue have permissions to: +- submit jobs +- cancel jobs +- reprioritize jobs +- watch queue + +For more control, queues can be created via `armadactl create`, which allows for setting specific permission; see the following example. + +```bash +armadactl create -f ./docs/quickstart/queue-a.yaml +armadactl create -f ./docs/quickstart/queue-b.yaml +``` + +Make sure to manually edit both of these `yaml` files using a code or text editor before running the commands above. + +``` +name: $QUEUE_NAME +``` + +### Job Submission +``` +armadactl submit ./docs/quickstart/job-queue-a.yaml +armadactl submit ./docs/quickstart/job-queue-b.yaml +``` + +Make sure to manually edit both of these `yaml` files using a code or text editor before running the commands above. +``` +queue: $QUEUE_NAME +``` + +### Monitor Job Progress + +```bash +armadactl watch $QUEUE_NAME job-set-1 +``` +```bash +armadactl watch $QUEUE_NAME job-set-1 +``` + +Try submitting lots of jobs and see queues get built and processed: + +#### Windows (using Git Bash): + +Use a text editor of your choice. +Copy and paste the following lines into the text editor: +``` +#!/bin/bash + +for i in {1..50} +do + armadactl submit ./docs/quickstart/job-queue-a.yaml + armadactl submit ./docs/quickstart/job-queue-b.yaml +done +``` +Save the file with a ".sh" extension (e.g., myscript.sh) in the root directory of the project. +Open Git Bash, navigate to the project's directory using the 'cd' command, and then run the script by typing ./myscript.sh and pressing Enter. + +#### Linux: + +Open a text editor (e.g., Nano or Vim) in the terminal and create a new file by running: nano myscript.sh (replace "nano" with your preferred text editor if needed). +Copy and paste the script content from above into the text editor. +Save the file and exit the text editor. +Make the script file executable by running: chmod +x myscript.sh. +Run the script by typing ./myscript.sh in the terminal and pressing Enter. + +#### macOS: + +Follow the same steps as for Linux, as macOS uses the Bash shell by default. +With this approach, you create a shell script file that contains your multi-line script, and you can run it as a whole by executing the script file in the terminal. 
+ +## Observing job progress + +CLI: + +```bash +$ armadactl watch queue-a job-set-1 +Watching job set job-set-1 +Nov 4 11:43:36 | Queued: 0, Leased: 0, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobSubmittedEvent, job id: 01drv3mey2mzmayf50631tzp9m +Nov 4 11:43:36 | Queued: 1, Leased: 0, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobQueuedEvent, job id: 01drv3mey2mzmayf50631tzp9m +Nov 4 11:43:36 | Queued: 1, Leased: 0, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobSubmittedEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:43:36 | Queued: 2, Leased: 0, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobQueuedEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:43:38 | Queued: 1, Leased: 1, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobLeasedEvent, job id: 01drv3mey2mzmayf50631tzp9m +Nov 4 11:43:38 | Queued: 0, Leased: 2, Pending: 0, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobLeasedEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:43:38 | Queued: 0, Leased: 1, Pending: 1, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobPendingEvent, job id: 01drv3mey2mzmayf50631tzp9m +Nov 4 11:43:38 | Queued: 0, Leased: 0, Pending: 2, Running: 0, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobPendingEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:43:41 | Queued: 0, Leased: 0, Pending: 1, Running: 1, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobRunningEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:43:41 | Queued: 0, Leased: 0, Pending: 0, Running: 2, Succeeded: 0, Failed: 0, Cancelled: 0 | event: *api.JobRunningEvent, job id: 01drv3mey2mzmayf50631tzp9m +Nov 4 11:44:17 | Queued: 0, Leased: 0, Pending: 0, Running: 1, Succeeded: 1, Failed: 0, Cancelled: 0 | event: *api.JobSucceededEvent, job id: 01drv3mf7b6fd1rraeq1f554fn +Nov 4 11:44:26 | Queued: 0, Leased: 0, Pending: 0, Running: 0, Succeeded: 2, Failed: 0, Cancelled: 0 | event: *api.JobSucceededEvent, job id: 01drv3mey2mzmayf50631tzp9m +``` + +Web UI: + +Open [https://ui.demo.armadaproject.io](https://ui.demo.armadaproject.io) in your browser. + +![Lookout UI](./quickstart/img/lookout.png "Lookout UI") diff --git a/docs/design/README.md b/docs/design/README.md new file mode 100644 index 00000000000..590bfb4c8e7 --- /dev/null +++ b/docs/design/README.md @@ -0,0 +1,76 @@ +# System overview + +This document is meant to be an overview of Armada for new users. We cover the architecture of Armada, show how jobs are represented, and explain how jobs are queued and scheduled. + +If you just want to learn how to submit jobs to Armada, see: + +- [User guide](../user.md) + +If you want to see a quick overview of Armadas components, see: + +- [Relationships diagram](./relationships_diagram.md) + +## Architecture + +Armada consists of two main components: +- The Armada server, which is responsible for accepting jobs from users and deciding in what order, and on which Kubernetes cluster, jobs should run. Users submit jobs to the Armada server through the `armadactl` command-line utility or via a gRPC or REST API. +- The Armada executor, of which there is one instance running in each Kubernetes cluster that Armada is connected to. Each Armada executor instance regularly notifies the server of how much spare capacity it has available and requests jobs to run. Users of Armada never interact with the executor directly. 
+ +All state relating to the Armada server is stored in [Redis](https://redis.io/), which may use replication combined with failover for redundancy. Hence, the Armada server is itself stateless and is easily replicated by running multiple independent instances. Both the server and the executors are intended to be run in Kubernetes pods. We show a diagram of the architecture below. + +![How Armada works](./batch-api.svg) + +### Job leasing + +To avoid jobs being lost if a cluster or its executor becomes unavailable, each job assigned to an executor has an associated timeout. Armada executors are required to check in with the server regularly and if an executor responsible for running a particular job fails to check in within that timeout, the server will re-schedule the job on another cluster. + +## Jobs and job sets + +A job is the most basic unit of work in Armada, and is represented by a Kubernetes pod specification (podspec) with additional metadata specific to Armada. Armada handles creating, running, and removing containers as necessary for each job. Hence, Armada is essentially a system for managing the life cycle of a set of containerised applications representing a batch job. + +The Armada workflow is: + +1. Create a job specification, which is a Kubernetes podspec with a few additional metadata fields. +2. Submit the job specification to one of Armada's job queues using the `armadactl` CLI utility or through the Armada gRPC or REST API. + +For example, a job that sleeps for 60 seconds could be represented by the following yaml file. + +```yaml +queue: test +jobSetId: set1 +jobs: + - priority: 0 + podSpecs: + - terminationGracePeriodSeconds: 0 + restartPolicy: Never + containers: + - name: sleep + imagePullPolicy: IfNotPresent + image: busybox:latest + args: + - sleep + - 60s + resources: + limits: + memory: 64Mi + cpu: 150m + requests: + memory: 64Mi + cpu: 150m +``` + +In the above yaml snippet, `podSpec` is a Kubernetes podspec, which consists of one or more containers that contain the user code to be run. In addition, the job specification (jobspec) contains metadata fields specific to Armada: + +- `queue`: which of the available job queues the job should be submitted to. +- `priority`: the job priority (lower values indicate higher priority). +- `jobSetId`: jobs with the same `jobSetId` can be followed and cancelled in a single operation. The `jobSetId` has no impact on scheduling. + +Queues and scheduling is explained in more detail below. + +For more examples, see the [user guide](../user.md). + +### Job events + +A job event is generated whenever the state of a job changes (e.g., when changing from submitted to running or from running to completed) and is a timestamped message containing event-specific information (e.g., an exit code for a completed job). All events generated by jobs part of the same job set are grouped together and published via a [Redis stream](https://redis.io/topics/streams-intro). There are unique streams for each job set to facilitate subscribing only to events generated by jobs in a particular set, which can be done via the Armada API. + +Armada records all events necessary to reconstruct the state of each job and, after a job has been completed, the only information retained about the job is the events generated by it. 
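+For example, assuming the queue and job set from the YAML above and an `armadactl` configured to point at your Armada server, you can follow all events for that job set from the command line:
+
+```bash
+armadactl watch test set1
+```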
diff --git a/docs/design/architecture.md b/docs/design/architecture.md new file mode 100644 index 00000000000..ecc0f02af22 --- /dev/null +++ b/docs/design/architecture.md @@ -0,0 +1,80 @@ +# Architecture + +Armada is designed to manage millions of batch jobs across compute clusters made up of potentially hundreds of thousands of nodes, while providing near-constant uptime. Hence, the Architecture of Armada must be highly resilient and scalable. The current architecture was chosen in early 2022 to achieve these goals while also ensuring new features, e.g., advanced scheduling techniques, can be delivered. + +At a high level, Armada is a so-called data stream system (sometimes referred to as an event sourcing system), for which there are two components responsible for tracking the state of the system: + +* A log-based message broker that stores state transitions durably in order, referred to throughout this document simply as "the log". +* A set of databases, each deriving its state from the log (but are otherwise mutually independent) and storing a different so-called materialised view of the state of the system. + +The log is a publish-subscribe system consisting of multiple topics to which messages can be published. Those messages are eventually delivered to all subscribers of the topic. Important properties of the log are: + +* Durability: Messages published to the log are stored in a durable manner, i.e., they are not lost in case of up to x node failures, where x is a tuneable parameter. +* Ordering: All subscribers of a topic see messages in the same order as they were published in, and replaying messages on a topic always results in the same message order. Further, all messages on the same topic are annotated with a message id that is monotonically increasing within each topic. + +In Armada, the log is implemented using Apache Pulsar. + +In a data stream system, the log is the source of truth and the databases an optimisation to simplify querying – since the databases can be re-constructed by replaying messages from the log, if the log was replayed for each query, although highly unpractical, the databases could be omitted. For example, in Armada there are separate PostgreSQL databases for storing jobs to be scheduled and the jobs to be shown in the UI, Lookout. Both of these derive their state from the log but are otherwise independent. + +To change the state of the system, a message (e.g., corresponding to a job being submitted) is published to the log. Later, that message is picked up by a log processor, which updates some database accordingly (in the case of a job being submitted, by storing the new job in the database). Hence, the log serialises state transitions and the database is a materialised view of part of the state of the system, as derived from the state transitions submitted to the log. In effect, a data stream system is a bespoke distributed database with the log acting as the transaction log. + +This approach has several benefits: + +* Resiliency towards bursts of high load: Because the log buffers state transitions, the components reading from the log and acting on those transitions are not directly affected by incoming requests. +* Simplicity and extensibility: Adding new materialised views (e.g., for a new dashboard) can be accomplished by adding a new subscriber to the log. 
This new subscriber has the same source of truth as all others (i.e., the log) but is loosely coupled to those components; adding or removing views does not affect other components of the system.
+* Consistency: When storing state across several independent databases, those databases are guaranteed to eventually become consistent; there is no failure scenario in which the different databases become permanently inconsistent, thus requiring a human to manually reconcile them (assuming acting on state transitions is idempotent).
+
+However, the approach also has some drawbacks:
+
+* Eventual consistency: Because each database is updated from the log independently, they do not necessarily represent the state of the system at the same point in time. For example, a job may be written to the scheduler database (thus making it eligible for scheduling) before it shows up in the UI.
+* Timeliness: Because databases are updated from the log asynchronously, there may be a lag between a message being published and the system being updated to reflect the change (e.g., a submitted job may not show up in the UI immediately).
+
+## System overview
+
+Besides the log, Armada consists of the following components:
+
+* Submit API: Clients (i.e., users) connect to this API to request state transitions (e.g., submitting jobs or updating job priorities), and each such state transition is communicated to the rest of the system by writing to the log (more detail on this below).
+* Streams API: Clients connect to this API to subscribe to log messages for a particular set of jobs. Armada components can receive messages either via this API or directly from the log, but users have to go via the streams API to isolate them from internal messages.
+* Scheduler: A log processor responsible for maintaining a global view of the system and preempting and scheduling jobs. Preemption and scheduling decisions are communicated to the rest of the system by writing to the log.
+* Executors: Each executor is responsible for one Kubernetes worker cluster and is the component that communicates between the Armada scheduler and the Kubernetes API of the cluster it is responsible for.
+* Lookout: The web UI showing the current state of the system. Lookout maintains its views by reading log messages to populate its database.
+
+### Job submission logic
+
+Here, we outline the sequence of actions resulting from submitting a job.
+
+1. A client submits a job, composed of a Kubernetes podspec and some Armada-specific metadata (e.g., the priority of the job), to the submit API.
+2. The submit API authenticates and authorizes the user, validates the submitted job, and, if valid, submits the job spec to the log. The submit API annotates each job with a randomly generated UUID that uniquely identifies the job. This UUID is returned to the user.
+3. The scheduler receives the job spec and stores it in-memory (discarding any data it doesn't need, such as the pod spec). The scheduler runs periodically, at which point it schedules queued jobs. At the start of each scheduling run, the scheduler queries each executor for its available resources and uses this information when making scheduling decisions. When the scheduler assigns a job to an executor, it submits a message to the log indicating this state transition. It also updates its in-memory storage immediately to reflect the change (to avoid scheduling the same job twice).
+4. 
A log processor receives the message indicating the job was scheduled, and writes this decision to a database acting as the interface between the scheduler and the executor.
+5. Periodically, each executor queries the database for the list of jobs it should be running. It compares that list with the list of jobs it is actually running and makes the changes necessary to reconcile any differences.
+6. When a job has finished, the executor responsible for running the job informs the scheduler, which submits a "job finished" message to the log on its behalf. The same log processor as in step 4 updates its database to reflect that the job has finished.
+
+### Streams API
+
+Armada does not maintain a user-queryable database of the current state of the system. This is by design to avoid overloading the system with connections. For example, say there are one million active jobs in the system and that there are clients who want to track the state of all of those jobs. With a current-state-of-the-world database, those clients would need to resort to polling that database to catch any updates, thus opening a total of one million connections to the database, which, while not impossible to manage, would pose significant challenges.
+
+Instead, users are expected to be notified of updates to their jobs via an event stream (i.e., the streams API), where a client opens a single connection for all jobs in a so-called job set, over which all state transitions are streamed as they happen. This approach is highly scalable, since data is only sent when something happens and since a single connection can carry updates for thousands of jobs. Users who want to maintain a view of their jobs are thus responsible for maintaining that view themselves by subscribing to events.
+
+## Notes on consistency
+
+The data stream approach taken by Armada is not the only way to maintain consistency across views. Here, we compare this approach with the two other possible solutions.
+
+Armada stores its state across several databases. Whenever Armada receives an API call to update its state, all those databases need to be updated. However, if each database were to be updated independently, it is possible for some of those updates to succeed while others fail, leading to an inconsistent application state. It would require complex logic to detect and correct for such partial failures. However, even with such logic we could not guarantee that the application state is consistent; if Armada crashes before it has had time to correct for the partial failure, the application may remain in an inconsistent state.
+
+There are three commonly used approaches to address this issue:
+
+* Store all state in a single database with support for transactions. Changes are submitted atomically and are rolled back in case of failure; there are no partial failures.
+* Distributed transaction frameworks (e.g., X/Open XA), which extend the notion of transactions to operations involving several databases.
+* Ordered idempotent updates.
+
+The first approach results in tight coupling between components and would limit us to a single database technology. Adding a new component (e.g., a new dashboard) could break existing components, since all operations that are part of the transaction are rolled back if one fails. The second approach allows us to use multiple databases (as long as they support the distributed transaction framework), but components are still tightly coupled since they have to be part of the same transaction.
Further, there are performance concerns associated with these options, since transactions may not be easily scalable. Hence, we use the third approach, which we explain next.
+
+First, note that if we can replay the sequence of state transitions that led to the current state, in case of a crash we can recover the correct state by truncating the database and replaying all transitions from the beginning of time. Because operations are ordered, this always results in the same end state. If we also, for each database, store the id of the most recent transition successfully applied to that database, we only need to replay transitions more recent than that. This saves us from having to start over from a clean database; because we know where we left off, we can keep going from there. For this to work, we need transactions but not distributed transactions. Essentially, applying a transition already written to the database results in a no-op, i.e., the updates are idempotent (meaning that applying the same update twice has the same effect as applying it once).
+
+The two principal drawbacks of this approach are:
+
+* Eventual consistency: Whereas the first two approaches result in a system that is always consistent, with the third approach, because databases are updated independently, there will be some replication lag during which some part of the state may be inconsistent.
+* Timeliness: There is some delay between submitting a change and that change being reflected in the application state.
+
+Working around eventual consistency requires some care, but is not impossible. For example, it is fine for the UI to show a job as "running" for a few seconds after the job has finished before showing "completed". Regarding timeliness, it is not a problem if there is a delay of a few seconds between a job being submitted and the job being considered for queueing. However, poor timeliness may lead to clients (i.e., the entity submitting jobs to the system) not being able to read their own writes for some time, which may lead to confusion (i.e., there may be some delay between a client submitting a job and that job showing as "pending"). This issue can be worked around by storing the set of submitted jobs in-memory, either at the client or at the API endpoint.
diff --git a/docs/batch-api.svg b/docs/design/batch-api.svg
similarity index 100%
rename from docs/batch-api.svg
rename to docs/design/batch-api.svg
diff --git a/docs/design/database_interfaces.md b/docs/design/database_interfaces.md
new file mode 100644
index 00000000000..0003e7b00e7
--- /dev/null
+++ b/docs/design/database_interfaces.md
@@ -0,0 +1,187 @@
+# Armada Database Interfaces
+
+## Problem Description
+
+Open source projects should not be hard-coded to a particular database. Armada currently only allows users to use Postgres. This project is to build interfaces around our connections to Postgres so that we can support other databases.
+
+## Solution
+
+1. Introduce base common database interfaces that can be shared and reused by all components (Lookout, Scheduler, Scheduler Ingester).
+2. Add interfaces that abstract the hardcoded Postgres configuration.
+3. Add interfaces around `pgx` structs.
+
+### Functional Specification (API Description)
+
+#### Database Connection
+
+Most of the components (Lookout, Scheduler, Scheduler Ingester) rely on [PostgresConfig](https://github.com/armadaproject/armada/blob/master/internal/armada/configuration/types.go#L294) to connect to external databases. We can avoid hardcoding the configuration of those components to `PostgresConfig` by defining a generic `DatabaseConfig` interface that, when implemented, provides those components with the details necessary to connect to their database.
+
+    /**
+    Components configuration (e.g. LookoutConfiguration) can now make use of this interface instead of hardcoding PostgresConfig.
+    */
+    type DatabaseConfig interface {
+        GetMaxOpenConns() int
+        GetMaxIdleConns() int
+        GetConnMaxLifetime() time.Duration
+        GetConnectionString() string
+    }
+
+    type DatabaseConnection interface {
+        GetConnection() (*sql.DB, error)
+        GetConfig() DatabaseConfig
+    }
+
+The existing configurations can then be tweaked to use the new generic `DatabaseConfig` interface instead of hardcoding `PostgresConfig`:
+
+    type LookoutConfiguration struct {
+        Postgres PostgresConfig // this can be replaced with the new Database property
+        Database DatabaseConfig // new property
+    }
+
+#### Database Communication
+
+Currently, most of the Armada components make use of the `github.com/jackc/pgx` Postgres client, which provides APIs for interacting exclusively with Postgres databases. This makes Armada tightly coupled to Postgres and makes it impossible to use other SQL dialects (e.g. MySQL).
+
+A way to fix this would be to design database-agnostic interfaces that can abstract away the existing Postgres core implementation (pgx), and then implement adapters around `pgx` that implement those interfaces. This allows for a high-level abstraction API for interacting with databases while maintaining the existing Postgres core implementation.
+To accomplish this, we will need to define interfaces for the following features:
+
+1. Connection Handler
+
+    // DatabaseConn represents a connection handler interface that provides methods for managing the open connection, executing queries, and starting transactions.
+    type DatabaseConn interface {
+        // Close closes the database connection. It returns any error encountered during the closing operation.
+        Close(context.Context) error
+
+        // Ping pings the database to check the connection. It returns any error encountered during the ping operation.
+        Ping(context.Context) error
+
+        // Exec executes a query that doesn't return rows. It returns any error encountered.
+        Exec(context.Context, string, ...any) (any, error)
+
+        // Query executes a query that returns multiple rows. It returns a DatabaseRows interface that allows you to iterate over the result set, and any error encountered.
+        Query(context.Context, string, ...any) (DatabaseRows, error)
+
+        // QueryRow executes a query that returns one row. It returns a DatabaseRow interface representing the result row, and any error encountered.
+        QueryRow(context.Context, string, ...any) DatabaseRow
+
+        // BeginTx starts a transaction with the given DatabaseTxOptions, or returns an error if any occurred.
+        BeginTx(context.Context, DatabaseTxOptions) (DatabaseTx, error)
+
+        // BeginTxFunc starts a transaction and executes the given function within the transaction. If the function runs successfully, BeginTxFunc commits the transaction; otherwise it rolls back and returns an error.
+        BeginTxFunc(context.Context, DatabaseTxOptions, func(DatabaseTx) error) error
+    }
+
+2. Connection Pool
+
+    // DatabasePool represents a database connection pool interface that provides methods for acquiring and managing database connections.
+    type DatabasePool interface {
+        // Acquire acquires a database connection from the pool. It takes a context and returns a DatabaseConn representing the acquired connection and any encountered error.
+        Acquire(context.Context) (DatabaseConn, error)
+
+        // Ping pings the database to check the connection. It returns any error encountered during the ping operation.
+        Ping(context.Context) error
+
+        // Close closes the connection pool.
+        Close()
+
+        // Exec executes a query that doesn't return rows. It returns any error encountered.
+        Exec(context.Context, string, ...any) (any, error)
+
+        // Query executes a query that returns multiple rows. It returns a DatabaseRows interface that allows you to iterate over the result set, and any error encountered.
+        Query(context.Context, string, ...any) (DatabaseRows, error)
+
+        // BeginTx starts a transaction with the given DatabaseTxOptions, or returns an error if any occurred.
+        BeginTx(context.Context, DatabaseTxOptions) (DatabaseTx, error)
+
+        // BeginTxFunc starts a transaction and executes the given function within the transaction. If the function runs successfully, BeginTxFunc commits the transaction; otherwise it rolls back and returns an error.
+        BeginTxFunc(context.Context, DatabaseTxOptions, func(DatabaseTx) error) error
+    }
+
+3. Transaction
+
+    // DatabaseTx represents a database transaction interface that provides methods for executing queries, managing transactions, and performing bulk insertions.
+    type DatabaseTx interface {
+        // Exec executes a query that doesn't return rows. It returns any error encountered.
+        Exec(context.Context, string, ...any) (any, error)
+
+        // Query executes a query that returns multiple rows. It returns a DatabaseRows interface that allows you to iterate over the result set, and any error encountered.
+        Query(context.Context, string, ...any) (DatabaseRows, error)
+
+        // QueryRow executes a query that returns one row. It returns a DatabaseRow interface representing the result row, and any error encountered.
+        QueryRow(context.Context, string, ...any) DatabaseRow
+
+        // CopyFrom performs a bulk insertion of data into a specified table. It accepts the table name, column names, and a slice of rows representing the data to be inserted. It returns the number of rows inserted and any error encountered.
+        CopyFrom(ctx context.Context, tableName string, columnNames []string, rows [][]any) (int64, error)
+
+        // Commit commits the transaction. It returns any error encountered during the commit operation.
+        Commit(context.Context) error
+
+        // Rollback rolls back the transaction. It returns any error encountered during the rollback operation.
+        Rollback(context.Context) error
+    }
+
+4. Result Row
+
+    // DatabaseRow represents a single row in a result set.
+    type DatabaseRow interface {
+        // Scan reads the values from the current row into dest values positionally. It returns an error if any occurred during the read operation.
+        Scan(dest ...any) error
+    }
+
+5. Result Set
+
+    // DatabaseRows represents an iterator over a result set.
+    type DatabaseRows interface {
+        // Close closes the result set.
+        Close() error
+
+        // Next moves the iterator to the next row in the result set. It returns false if the result set is exhausted, otherwise true.
+        Next() bool
+
+        // Err returns the error, if any, encountered during iteration over the result set.
+        Err() error
+
+        // Scan reads the values from the current row into dest values positionally. It returns an error if any occurred during the read operation.
+        Scan(dest ...any) error
+    }
+
+### Implementation Plan
+
+Designing interfaces that can remove the coupling between Armada and Postgres while maintaining the existing core Postgres implementation is a requirement.
+
+To fulfill this requirement, we can implement adapters around the `pgx` client so that it also implements the interfaces defined above.
+
+For example, an adapter can be implemented for `pgxpool.Pool` so that it can be used with `DatabasePool`:
+
+    type PostgresPoolAdapter struct {
+        *pgxpool.Pool
+    }
+
+    func (p PostgresPoolAdapter) Exec(ctx context.Context, sql string, args ...any) (any, error) {
+        // Forward the variadic arguments to pgx.
+        return p.Pool.Exec(ctx, sql, args...)
+    }
+
+    func (p PostgresPoolAdapter) BeginTxFunc(ctx context.Context, opts dbtypes.DatabaseTxOptions, action func(dbtypes.DatabaseTx) error) error {
+        tx, err := p.Pool.BeginTx(ctx, pgx.TxOptions{
+            IsoLevel:       pgx.TxIsoLevel(opts.Isolation),
+            DeferrableMode: opts.DeferrableMode,
+            AccessMode:     pgx.TxAccessMode(opts.AccessMode),
+        })
+
+        if err != nil {
+            return err
+        }
+
+        // PostgresTrxAdapter is the Postgres adapter for the DatabaseTx interface
+        if err := action(PostgresTrxAdapter{Tx: tx}); err != nil {
+            // Roll back on failure, but return the error from the callback rather than the rollback result.
+            _ = tx.Rollback(ctx)
+            return err
+        }
+
+        return tx.Commit(ctx)
+    }
+
+The example above showcases a Postgres connection pool adapter that implements the `DatabasePool` interface (the rest of the methods can be implemented similarly to `Exec` and `BeginTxFunc`).
+
+This allows the components that make use of `pgxpool.Pool` (e.g. Lookout) to switch to using `DatabasePool`, which underneath can make use of `pgxpool.Pool` (or any other `DatabasePool` implementation), without making any changes to the core Postgres implementation.
+
+To support new SQL dialects, we can simply introduce adapters that implement the interfaces, as well as introduce some level of flexibility into the configuration of components to allow choosing which dialect we want to use.
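+
+As an illustration, here is a minimal usage sketch (not part of the proposal above): the function name and the `job` table are hypothetical, and error handling is kept deliberately simple. It shows how component code written against these interfaces works with any adapter, Postgres or otherwise.
+
+    // countJobs works with any DatabasePool implementation (the Postgres adapter
+    // above, or an adapter for another dialect), because it only depends on the
+    // interfaces defined in this document.
+    func countJobs(ctx context.Context, pool DatabasePool) (int64, error) {
+        conn, err := pool.Acquire(ctx)
+        if err != nil {
+            return 0, err
+        }
+        defer conn.Close(ctx)
+
+        var count int64
+        // QueryRow returns a DatabaseRow; Scan copies the single result column into count.
+        if err := conn.QueryRow(ctx, "SELECT count(*) FROM job").Scan(&count); err != nil {
+            return 0, err
+        }
+        return count, nil
+    }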
diff --git a/docs/design/diagrams/relationships/README.md b/docs/design/diagrams/relationships/README.md new file mode 100644 index 00000000000..1360c3f904a --- /dev/null +++ b/docs/design/diagrams/relationships/README.md @@ -0,0 +1,25 @@ +# Diagrams of Armada Architecture + +## Generating the Diagram + +To generate this diagram, you can use the following command: + +```bash +# install graphviz +sudo apt-get install graphviz + +# then install diagrams from pip +pip install diagrams + +# then run the following command to generate the diagram +python3 generate.py +``` + +To find out more about the diagrams library, see https://diagrams.mingrammer.com/ +To find out more about the graphviz library, see https://graphviz.org/ + + + + + + diff --git a/docs/design/diagrams/relationships/armada_system.png b/docs/design/diagrams/relationships/armada_system.png new file mode 100644 index 00000000000..f6c95340361 Binary files /dev/null and b/docs/design/diagrams/relationships/armada_system.png differ diff --git a/docs/design/diagrams/relationships/generate.py b/docs/design/diagrams/relationships/generate.py new file mode 100644 index 00000000000..af339454b92 --- /dev/null +++ b/docs/design/diagrams/relationships/generate.py @@ -0,0 +1,147 @@ +from diagrams import Cluster, Diagram, Edge +from diagrams.onprem.database import PostgreSQL +from diagrams.onprem.inmemory import Redis +from diagrams.k8s.controlplane import API +from diagrams.custom import Custom + +graph_attr = { + "concentrate": "false", + "splines": "ortho", + "pad": "2", + "nodesep": "0.30", + "ranksep": "1.5", + "fontsize": "20", +} + +node_attr = { + # decrease image size + "fixedsize": "true", + "width": "1", + "height": "1", + "fontsize": "15", +} + +edge_attr = { + "minlen": "1", +} + +cluster_attr_common = { + "margin": "20", + "fontsize": "15", +} + +cluster_attr_server = { + "labelloc": "b", + "bgcolor": "#c7ffd5", +} +cluster_attr_server = {**cluster_attr_common, **cluster_attr_server} + +cluster_attr_exec = { + "labelloc": "t", + "bgcolor": "#c7ffd5", +} + +cluster_attr_exec = {**cluster_attr_common, **cluster_attr_exec} + +armada_logo = "../files/armada.png" +pulsar_logo = "../files/pulsar.png" +browser_logo = "../files/browser.png" + +with Diagram( + name="Armada Systems Diagram", + show=False, + direction="LR", + graph_attr=graph_attr, + edge_attr=edge_attr, + node_attr=node_attr, + filename="out/armada_systems_diagram", +): + pulsar = Custom("Pulsar", pulsar_logo) + + # Databases + postgres_lookout = PostgreSQL("Postgres (Lookout)") + postgres_scheduler = PostgreSQL("Postgres (Scheduler)") + redis_events = Redis("Redis (Events)") + + # Components + server = Custom("Server", armada_logo) + client = Custom("Client", armada_logo) + scheduler = Custom("Scheduler", armada_logo) + + # Lookout Parts + lookout_api = Custom("Lookout API", armada_logo) + lookoutUI = Custom("Lookout UI", armada_logo) + + # Ingesters + lookout_ingester = Custom("Lookout Ingester", armada_logo) + scheduler_ingester = Custom("Scheduler Ingester", armada_logo) + event_ingerster = Custom("Event Ingester", armada_logo) + + with Cluster("Executor Cluster", graph_attr=cluster_attr_server): + executor = Custom("Executor", armada_logo) + k8s_api = API("K8s API") + binoculars = Custom("Binoculars", armada_logo) + + with Cluster("Executor Cluster 2", graph_attr=cluster_attr_server): + executor2 = Custom("Executor 2", armada_logo) + k8s_api2 = API("K8s API 2") + binoculars2 = Custom("Binoculars", armada_logo) + + # Relationships + + # client sends requests to the 
server + client >> Edge(color="black") >> server + + # submit api talks to Pulsar + server >> Edge(color="red") >> pulsar + + # Pulsar talks to each of the ingesters + pulsar >> Edge(color="red") >> lookout_ingester + pulsar >> Edge(color="red") >> scheduler_ingester + pulsar >> Edge(color="red") >> event_ingerster + + # make Postgres blue, redis orange + # lookout and scheduler ingesters talk to postgres + # the other ingesters talk to redis + lookout_ingester >> Edge(color="blue") >> postgres_lookout + scheduler_ingester >> Edge(color="blue") >> postgres_scheduler + + event_ingerster >> Edge(color="orange") >> redis_events + + # the Postgres scheduler talks to the scheduler and executor api + postgres_scheduler >> Edge(color="blue") >> scheduler + + # the scheduler talks to Pulsar + scheduler >> Edge(color="red") >> pulsar + + executor >> Edge(color="blue") >> k8s_api + k8s_api >> Edge(color="blue") >> executor + + executor2 >> Edge(color="blue") >> k8s_api2 + k8s_api2 >> Edge(color="blue") >> executor2 + + # The binoculars in every cluster talks to k8s, and + # then talks directly to the lookout UI + k8s_api >> Edge(color="blue") >> binoculars + binoculars >> Edge(color="black") >> lookoutUI + + k8s_api2 >> Edge(color="blue") >> binoculars2 + binoculars2 >> Edge(color="black") >> lookoutUI + + # Lookout API gets its data from postgres + # and passes it to the lookout UI + postgres_lookout >> Edge(color="blue") >> lookout_api + lookout_api >> Edge(color="black") >> lookoutUI + + # The scheduler talks to the executor api + scheduler >> Edge(color="blue") >> executor + scheduler >> Edge(color="blue") >> executor2 + + # Pulsar talks to the server + pulsar >> Edge(color="red") >> server + + # redis events are given back to the server + redis_events >> Edge(color="orange") >> server + + # and passed to the client + server >> Edge(color="black") >> client diff --git a/docs/design/diagrams/relationships/images/armada.png b/docs/design/diagrams/relationships/images/armada.png new file mode 100644 index 00000000000..f4c86ed5cbe Binary files /dev/null and b/docs/design/diagrams/relationships/images/armada.png differ diff --git a/docs/design/diagrams/relationships/images/browser.png b/docs/design/diagrams/relationships/images/browser.png new file mode 100644 index 00000000000..a02b3a98eed Binary files /dev/null and b/docs/design/diagrams/relationships/images/browser.png differ diff --git a/docs/design/diagrams/relationships/images/pulsar.png b/docs/design/diagrams/relationships/images/pulsar.png new file mode 100644 index 00000000000..36de6bb18ff Binary files /dev/null and b/docs/design/diagrams/relationships/images/pulsar.png differ diff --git a/docs/design/jobservice/airflow-sequence.pml b/docs/design/jobservice/airflow-sequence.pml new file mode 100644 index 00000000000..cf2c6e44051 --- /dev/null +++ b/docs/design/jobservice/airflow-sequence.pml @@ -0,0 +1,8 @@ +@startuml +User -> Airflow : Creates a dag +Airflow -> AirflowOperator : Specify ArmadaPythonClient and JobServiceClient and pod definitions +AirflowOperator -> ArmadaPythonClient : Submits pod spec to Armada +AirflowOperator -> JobServiceClient : Polls GetJobStatus rpc call for given job id +AirflowOperator <- JobServiceClient : Wait for finished event and returns state, message +Airflow <- AirflowOperator : Airflow moves on to new task in schedule +@enduml \ No newline at end of file diff --git a/docs/design/jobservice/airflow-sequence.svg b/docs/design/jobservice/airflow-sequence.svg new file mode 100644 index 00000000000..894e02b708a 
--- /dev/null +++ b/docs/design/jobservice/airflow-sequence.svg @@ -0,0 +1,18 @@ +UserUserAirflowAirflowAirflowOperatorAirflowOperatorArmadaPythonClientArmadaPythonClientJobServiceClientJobServiceClientCreates a dagSpecify ArmadaPythonClient and JobServiceClient and pod definitionsSubmits pod spec to ArmadaPolls GetJobStatus rpc call for given job idWait for finished event and returns state, messageAirflow moves on to new task in schedule \ No newline at end of file diff --git a/docs/design/jobservice/job-service.md b/docs/design/jobservice/job-service.md new file mode 100644 index 00000000000..737d6670427 --- /dev/null +++ b/docs/design/jobservice/job-service.md @@ -0,0 +1,114 @@ +# Armada Job Service + +## Deprecation Warning + +The Job Service is being deprecated in favor of the new Query API as of May 2024. Users are encouraged to use the new API, and this component will be removed from future versions of Armada. + +## Problem Description +Armada’s API is event driven, preventing it from integrating with tools, such as Apache Airflow, written with the expectation that it can easily fetch status of a running job. It is not scalable to have Airflow subscribe to the event stream to observe status, so we must implement a caching layer which will expose a friendlier API for individual job querying. + +## Proposed Change +### Notes +- Add an optional caching API and service to Armada +- Caches job_id:(job_status, message) relationship for subscribed (queue,job_set) tuples +- Service is written in Go for performance and to reuse code from armadactl + +### Proposed Airflow Operator flow +1. Create the job_set +2. [do the work to schedule the job] +3. Status polling loop that talks to job service + +## Alternative Options + +### Change Armada API +Armada could expose a direct endpoint allowing access to status of a running job. +A previous iteration of Armada did provide an endpoint to get status of a running job. This was found to be a bottleneck for scaling to large number of jobs and/or users. The switch to an event API was used to alleviate this performance issue. + +### Change Airflow DAG API +Airflow could be modified to allow alternate forms of integration which work better with event-based systems. +This is impractical because we do not have Airflow contributors on staff, and the timeline required to get such a change proposed, approved, and merged upstream is much too long and includes lots of risk. + +## Data Access +- Service will need to insert job-id and state for a given queue and job-set +- Service will need to delete all jobs for a given queue and job-set. +- Service will access by job-id for polling loop. +- We will use in an in memory cache while doing subscription and then write to a persistent DB periodically. +- We will delete data after a configuration amount of time without an update. +## Data Model + - The Job Service will contain a table of queue, job-set, job-id, state and timestamp. + - What database should store this? + - We will use SQLLite. + - A in memory database will be used to get job-sets and then we will write in batches to our database for persistence. +### SQLLite +Pros + - Lightweight + - In memory db + - Part of service + - Persists database to a file + - SQL operations for inserting and deleting are simple. + +Cons + - Writing is sequential and blocks. + - Meant for small amount of concurrent users. + - Difficult to scale with Kubernetes. + - Scaling is only possible by increasing the number of job services + - Logic for deleting is more complicated. 
+ - Writing to virtualized file volume will be slow in Kubernetes. + +## API (impact/changes?) +- What should be the API between Armada cache <-> Airflow? + - The proto file above will generate a python client where they can call get_job_status with job_id, job_set_id and queue specified. All of these are known by the Airflow Operator. + - [API Definition](https://github.com/armadaproject/armada/blob/master/pkg/api/jobservice/jobservice.proto) +- JobSet subscription will happen automatically for all tasks in a dag. + +## Security Impact + +The cache should use the same security as our armadactl. Airflow does not currently support multitenancy. + +## Documentation Impact +- Update dev and quickstart guides +- Update production deployment guides + +## Use Cases + +### Airflow Operator +1) User creates a dag and assigns a job-set. +2) Dag setup includes ArmadaPythonClient and JobServiceClient +3) Airflow operator takes both ArmadaPythonClient and JobServiceClient +4) Airflow operator submits job via ArmadaPythonClient +5) Airflow operator polls JobServiceClient via GetJobStatus +6) Once Armada has a terminal event, the airflow task is complete. + +### Implementation Plan + +I have a PR that implements this [plan](https://github.com/armadaproject/armada/pull/1122). +- Created a jobservice proto definition +- Generated GRPC service for the correspond proto definition +- Created a jobservice cmd that takes an configuration object +- JobService starts a GRPC server +- Added ApiConnection and GRPC configuration parameters + + +### Subscription + +The logic for this service is as follows: + +- When a request comes in, check if we already have a subscription to that jobset. +If we don't have a subscription, create one using the armada go client (grpc api). +- Have a function connected to this subscription that updates the jobId key in the local cache with the new state for all jobs in the jobset (even those nobody has asked for yet). +- The local redis should just store jobId -> state mappings. Any messages you get that don't correspond to states we care about (ingresses, unableToSchedule) just ignore. +- Return the latest state. If we just subscribed then it's probably "not found" +The armada operator just polls for the job state. The first poll for a given jobset will cause a subscription to be made for that jobset. + +### Airflow Sequence Diagram + +![AirflowSequence](./airflow-sequence.svg) + + +### JobService Server Diagram + +![JobService](./job-service.svg) + +- EventClient is the GRPC public GRPC client for watching events + +- The JobService deployment consists of a GRPC Go Server and a database. 
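+
+The subscription logic described above can be sketched roughly as follows. This is an illustrative simplification only, not the Job Service implementation: the type and field names are hypothetical, and the actual event subscription via the Armada Go client is abstracted behind a callback.
+
+```go
+package jobservice
+
+import "sync"
+
+// jobSetKey identifies a subscribed (queue, job set) pair.
+type jobSetKey struct {
+	queue  string
+	jobSet string
+}
+
+// jobStatusCache caches jobId -> state mappings and subscribes to a job set
+// the first time any of its jobs is requested.
+type jobStatusCache struct {
+	mu         sync.Mutex
+	subscribed map[jobSetKey]bool
+	jobState   map[string]string          // jobId -> last known state
+	subscribe  func(queue, jobSet string) // starts an event subscription that updates jobState (holding mu)
+}
+
+func newJobStatusCache(subscribe func(queue, jobSet string)) *jobStatusCache {
+	return &jobStatusCache{
+		subscribed: map[jobSetKey]bool{},
+		jobState:   map[string]string{},
+		subscribe:  subscribe,
+	}
+}
+
+// GetJobStatus returns the cached state for a job, subscribing to its job set
+// on the first request. Right after subscribing the state is typically "not found".
+func (c *jobStatusCache) GetJobStatus(queue, jobSet, jobID string) string {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+
+	key := jobSetKey{queue: queue, jobSet: jobSet}
+	if !c.subscribed[key] {
+		c.subscribed[key] = true
+		go c.subscribe(queue, jobSet) // e.g. via the Armada Go events client
+	}
+	if state, ok := c.jobState[jobID]; ok {
+		return state
+	}
+	return "not found"
+}
+```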
diff --git a/docs/design/jobservice/job-service.svg b/docs/design/jobservice/job-service.svg new file mode 100644 index 00000000000..795a12f733e --- /dev/null +++ b/docs/design/jobservice/job-service.svg @@ -0,0 +1,15 @@ +JobServiceJobServiceEventClientEventClientDatabaseDatabaseJobServiceClientJobServiceClientCalls Event client with job-set and queueStores all job status and message for a given job-set in the databaseCalls Database via rpc call to retrieve status and message for given id \ No newline at end of file diff --git a/docs/design/jobservice/jobservice.pml b/docs/design/jobservice/jobservice.pml new file mode 100644 index 00000000000..4d45e346bec --- /dev/null +++ b/docs/design/jobservice/jobservice.pml @@ -0,0 +1,5 @@ +@startuml +JobService -> EventClient : Calls Event client with job-set and queue +EventClient -> JobService : Stores all job status and message for a given job-set in the database +Database -> JobServiceClient : Calls Database via rpc call to retrieve status and message for given id +@enduml \ No newline at end of file diff --git a/docs/design/priority.md b/docs/design/priority.md new file mode 100644 index 00000000000..f7a8bfab7e9 --- /dev/null +++ b/docs/design/priority.md @@ -0,0 +1,52 @@ +# Armada priority + +This document describes priority calculation algorithm in detail. + +## How is priority calculated + +### Resource usage +Armada schedules jobs which can use multiple types of resources (CPU, memory, GPU, ...). +To get one number which represents the share of a resource by a particular queue, Armada firstly calculates how much of particular +resource is available for one CPU `resource factor`. +Then queue usage can be calculated as `usage = # of CPU + # GPU / GPU factor + # memory / memory factor + ...` + +In example: +If our cluster has 10 CPUs, 20Gb of memory and 5 GPUs.
+The GPU factor will be `0.5` and the memory factor `2`.
+Queue using 5 CPUs, 2 Gb memory and 1 GPU will have usage `5 + 2 / 2 + 1 / 0.5 = 8` . + +### Queue priority +Queue priority is calculated based on current resource usage; if a particular queue usage is constant, the queue priority will approach this number and eventually stabilize on this value. +Armada allows configuration of `priorityHalftime` which influences how quickly queue priority approaches resource usage. + +The formula for priority update is as follows (inspired by Condor priority calculation): + +`priority = priority (1 - beta) + resourceUsage * beta` + +`beta = 0.5 ^ (timeChange / priorityHalftime)` + +### Priority factor +Each queue has a priority factor, this is a multiplicative constant which is applied to the priority. The lower this number is the more resources a queue will be allocated in scheduling. + +`effectivePriority = priority * priorityFactor` + +## Scheduling resources +Available resources are divided between non empty queues based on queue priority. The share allocated to the queue is proportional to inverse of its priority. + +For example if queue `A` has priority `1` and queue `B` priority `2`, `A` will get `2/3` and `B` `1/3` of the resources. + +There are 2 approaches Armada uses to schedule jobs: + +### Slices of resources +When the Executor requests new jobs with information about available resources, resources are divided into slices according to the inverse priority. + +Armada iterates through queues and allocates jobs up to the slice size for each queue. + +Whatever resources remain after this round are scheduled using probabilistic slicing. + +This round is skipped if Armada Server is configured with the option `scheduling.useProbabilisticSchedulingForAllResources = true`. + +### Probabilistic scheduling +To schedule any remaining resources Armada randomly selects a non-empty queue with probability distribution corresponding to the remainders of queue slices. One job from this queue is scheduled, and the queue slice is reduced. This continues until there is no resource available, queues are empty or the scheduling time is up. + +This way there is a chance than one queue will get allocated more than it is entitled to in the scheduling round. However as we are concerned with fair share over the time, rather than in a moment, this does not matter much. Queue priority will compensate for this in the future. diff --git a/docs/design/relationships_diagram.md b/docs/design/relationships_diagram.md new file mode 100644 index 00000000000..b58cfe82cf8 --- /dev/null +++ b/docs/design/relationships_diagram.md @@ -0,0 +1,43 @@ +## Relationships Diagram + +![Systems Diagram](./diagrams/relationships/armada_system.png) + +This diagram shows the high-level relationships between components of Armada and third-party softwares. + +For a more detailed view of Armada, see the [Scheduler Architecture Doc](./architecture.md). + +### Armada Client + +This is the comonent that is used by users to submit jobs to Armada, using gRPC. Current languages supported are: +- Go +- Python +- C# + +### Ingester Loops + +All data-flows in armada are controlled by Pulsar. This means that all data is first written to Pulsar, and then ingested into the appropriate database. The ingester loops are the components that read data from Pulsar and write it to the appropriate database. + +There are 3 ingester loops: +- **Event Ingester**: This ingests data from Pulsar into Redis. +- **Lookout Ingester**: This ingests data from Pulsar into Postgres. 
+- **Scheduler Ingester**: This ingests data from Pulsar into Postgres. + +### Scheduler + +The [scheduler](./scheduler.md) is the component that is responsible for scheduling jobs. + +It receives data from the ingester loops, and then uses that data to schedule jobs. Its decisions are then fed back to Pulsar, allowing the process to repeat. + +### Armada Executor Components + +These are the components that run on each k8s cluster that executes jobs. + +It includes: +- **Armada Executor**: The main component of the executor. It is responsible for the execution of jobs on the cluster. +- **Binoculars**: A component that reads logs from the k8s API. + +### Lookout + +Lookout is made of 2 components: +- **Lookout API**: This is the component that acts as a gateway to the lookout database. It is a gRPC API. +- **Lookout UI**: This is the component that is used by users to query the state of jobs. It is a web UI. diff --git a/docs/scheduler.md b/docs/design/scheduler.md similarity index 100% rename from docs/scheduler.md rename to docs/design/scheduler.md diff --git a/docs/developer.md b/docs/developer/README.md similarity index 91% rename from docs/developer.md rename to docs/developer/README.md index 315f7e5bfa7..1573cc18f07 100644 --- a/docs/developer.md +++ b/docs/developer/README.md @@ -32,22 +32,22 @@ Feel free to create a ticket if you encounter any issues, and link them to the r Please see these documents for more information about Armadas Design: -* [Armada Components Diagram](./design/relationships_diagram.md) -* [Armada Architecture](./design/architecture.md) -* [Armada Design](./design/index.md) -* [How Priority Functions](./design/priority.md) -* [Armada Scheduler Design](./design/scheduler.md) +* [Armada Components Diagram](../design/relationships_diagram.md) +* [Armada Architecture](../design/architecture.md) +* [Armada Design](../design/README.md) +* [How Priority Functions](../design/priority.md) +* [Armada Scheduler Design](../design/scheduler.md) ## Other Useful Developer Docs -* [Armada API](./developer/api.md) -* [Running Armada in an EC2 Instance](./developer/aws-ec2.md) -* [Armada UI](./developer/ui.md) -* [Usage Metrics](./developer/usage_metrics.md) -* [Using OIDC with Armada](./developer/oidc.md) -* [Building the Website](./developer/website.md) -* [Using Localdev Manually](./developer/manual-localdev.md) -* [Inspecting and Debugging etcd in Localdev setup](./developer/etc-localdev.md) +* [Armada API](./api.md) +* [Running Armada in an EC2 Instance](./aws-ec2.md) +* [Armada UI](./ui.md) +* [Usage Metrics](./usage_metrics.md) +* [Using OIDC with Armada](./oidc.md) +* [Building the Website](./website.md) +* [Using Localdev Manually](./manual-localdev.md) +* [Inspecting and Debugging etcd in Localdev setup](./etc-localdev.md) ## Pre-requisites @@ -129,7 +129,7 @@ go run cmd/testsuite/main.go test --tests "testsuite/testcases/basic/*" --junit In LocalDev, the UI is built seperately with `mage ui`. To access it, open http://localhost:8089 in your browser. -For more information see the [UI Developer Guide](./developer/ui.md). +For more information see the [UI Developer Guide](./ui.md). ### Choosing components to run @@ -172,7 +172,7 @@ It supports the following commands: ### VSCode Debugging After running `mage debug vscode`, you can attach to the running processes using VSCode. 
-The launch.json file can be found [Here](../developer/debug/launch.json) +The launch.json file can be found [Here](../../developer/debug/launch.json) For using VSCode debugging, see the [VSCode Debugging Guide](https://code.visualstudio.com/docs/editor/debugging). @@ -256,4 +256,4 @@ For required enviromental variables, please see [The Enviromental Variables Guid ## Finer-Grain Control If you would like to run the individual mage targets yourself, you can do so. -See the [Manually Running LocalDev](./developer/manual-localdev.md) guide for more information. +See the [Manually Running LocalDev](./manual-localdev.md) guide for more information. diff --git a/docs/developer/api.md b/docs/developer/api.md index 94dea8ec631..f35a19a5d53 100644 --- a/docs/developer/api.md +++ b/docs/developer/api.md @@ -1,3 +1,7 @@ +--- +permalink: /api +--- + # Armada API Armada exposes an API via gRPC or REST. diff --git a/docs/developer/aws-ec2.md b/docs/developer/aws-ec2.md new file mode 100644 index 00000000000..bbe4be3ed07 --- /dev/null +++ b/docs/developer/aws-ec2.md @@ -0,0 +1,236 @@ +# EC2 Developer Setup + +## Background + +For development, you might want to set up an Amazon EC2 instance as the resource requirements for Armada are substantial. A typical Armada installation requires a system with at least 16GB of memory to perform well. Running Armada on a laptop made before ~2017 will typically eat battery life and result in a slower UI. + +Note: As of June 2022, not all Armada dependencies reliably build on a Mac M1 using standard package management. So if you have an M1 Mac, working on EC2 or another external server is your best bet. + +## Instructions + +- We suggest a t3.xlarge instance from aws ec2 with AmazonLinux as the OS. 16 GB of memory is suggested. +- During selection of instance, Add a large volume to your ec2 instance. 100 gb of storage is recommended. +- When selecting the instance, you will have the opportunity to choose a security group. You may need to make a new one. Be sure to add a rule allowing inbound communication on port 22 so that you can access your server via SSH. We recommend that you restrict access to the IP address from which you access the Internet, or a small CIDR block containing it. + +If you want to use your browser to access Armada Lookout UI or other web-based interfaces, you will also need to grant access to their respective ports. For added security, consider using an [SSH tunnel](https://www.ssh.com/academy/ssh/tunneling/example) from your local machine to your development server instead of opening those ports. You can add LocalForward to your ssh config: `LocalForward 4000 localhost:3000` + +- ### Install [Docker](https://www.cyberciti.biz/faq/how-to-install-docker-on-amazon-linux-2/) + +The procedure to install Docker on AMI 2 (Amazon Linux 2) running on either EC2 or Lightsail instance is as follows: + +1. Login into remote AWS server using the ssh command: + +``` +ssh ec2-user@ec2-ip-address-dns-name-here +``` + +2. Apply pending updates using the yum command: + +``` +sudo yum update +``` + +3. Search for Docker package: + +``` +sudo yum search docker +``` + +4. Get version information: + +``` +sudo yum info docker +``` +

+ +

+ +5. Install docker, run: + +``` +sudo yum install docker +``` + +6. Add group membership for the default ec2-user so you can run all docker commands without using the sudo command: + +``` +sudo usermod -a -G docker ec2-user +id ec2-user +# Reload a Linux user's group assignments to docker w/o logout +newgrp docker +``` + + +- ### Install [docker-compose](https://www.cyberciti.biz/faq/how-to-install-docker-on-amazon-linux-2/) + +```bash +$ cd $HOME/.docker +$ mkdir cli-plugins +$ cd cli-plugins +$ curl -SL https://github.com/docker/compose/releases/download/v2.17.3/docker-compose-linux-x86_64 -o docker-compose +$ chmod 755 docker-compose +``` + +Then verify it with: + +```bash +docker-compose version +``` + +- ### Getting the [Docker Compose Plugin](https://docs.docker.com/compose/install/linux/#install-the-plugin-manually) + +Armadas setup assumes You have the docker compose plugin installed. If you do not have it installed, you can use the following guide: + +* https://docs.docker.com/compose/install/linux/#install-the-plugin-manually + +Then test it with: + +```bash +docker compose version +``` + + +- ### Install [Go](https://go.dev/doc/install) + +ssh into your EC2 instance, become root and download the go package from [golang.org](https://go.dev/doc/install). + +1. Extract the archive you downloaded into /usr/local, creating a Go tree in /usr/local/go with the following command: + +``` +rm -rf /usr/local/go && tar -C /usr/local -xzf go1.20.1.linux-amd64.tar.gz +``` + +2. Configure .bashrc + +Switch back to ec2-user and add the following line to your ~/.bashrc file + +``` +export PATH=$PATH:/usr/local/go/bin +``` + +3. Configure go Environment + +Add the following lines to your ~/.bashrc file as well, also create a golang folder under /home/ec2-user. + +``` +# Go envs +export GOVERSION=go1.20.1 +export GO_INSTALL_DIR=/usr/local/go +export GOROOT=$GO_INSTALL_DIR +export GOPATH=/home/ec2-user/golang +export PATH=$GOROOT/bin:$GOPATH/bin:$PATH +export GO111MODULE="on" +export GOSUMDB=off +``` + +4. Test go + +Verify that you’ve installed Go by opening a command prompt and typing the following command: + +``` +go version +go version go1.20.1 linux/amd64 +``` + +- ### Install [Kind](https://dev.to/rajitpaul_savesoil/setup-kind-kubernetes-in-docker-on-linux-3kbd) + +1. Install Kind + +``` +go install sigs.k8s.io/kind@v0.11.1 +# You can replace v0.11.1 with the latest stable kind version +``` + +2. Move the KinD Binary to /usr/local/bin + +``` +- You can find the kind binary inside the directory go/bin +- Move it to /usr/local/bin - mv go/bin/kind /usr/local/bin +- Make sure you have a path setup for /usr/local/bin +``` + +- ### Install [kubectl](https://dev.to/rajitpaul_savesoil/setup-kind-kubernetes-in-docker-on-linux-3kbd) + +1. Install Latest Version of Kubectl: + +``` +curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +chmod +x kubectl +mv kubectl /usr/local/bin +``` + +- ### Install [helm](https://helm.sh/docs/intro/install/) + +1. Install helm: + +``` +curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash +``` + +2. We can verify the version + +``` +helm version --short +``` + +- ### Install [python3 (>= 3.7)](https://www.geeksforgeeks.org/how-to-install-python3-on-aws-ec2/) + +1. Check if Python is already installed or not on our AWS EC2. + +``` +python --version +``` + +

+ +

+
+2. First, update the package index by running the following command.
+
+```
+sudo apt update
+```
+
+

+ +

+
+3. If Python3 is not installed on your AWS EC2 instance, install it using the following command.
+
+```
+sudo apt-get install python3.7
+```
+
+4. To verify that Python3 was installed successfully, run the following command.
+
+```
+python3 --version
+```
+
+- ### Install .NET for [Linux](https://docs.microsoft.com/en-us/dotnet/core/install/linux-centos)
+
+1. Before you install .NET, run the following command to add the Microsoft package signing key to your list of trusted keys and to add the Microsoft package repository. Open a terminal and run:
+
+```
+sudo rpm -Uvh https://packages.microsoft.com/config/centos/7/packages-microsoft-prod.rpm
+```
+
+2. Install the SDK:
+
+```
+sudo yum install dotnet-sdk-7.0
+```
+
+3. Install the runtime:
+
+```
+sudo yum install aspnetcore-runtime-7.0
+```
+
+- ### We suggest using the [Remote - SSH extension](https://code.visualstudio.com/docs/remote/ssh) for VS Code if that is your IDE of choice.
+
+

+ +

+
+- ### Please see [Our Developer Docs](./README.md) for more information on how to get started with the codebase.
diff --git a/docs/developer/manual-localdev.md b/docs/developer/manual-localdev.md
new file mode 100644
index 00000000000..236995857c7
--- /dev/null
+++ b/docs/developer/manual-localdev.md
@@ -0,0 +1,75 @@
+# Manual Local Development
+
+Here, we give an overview of a development setup for Armada that gives users full control over the Armada components and dependencies.
+
+Before starting, please ensure you have installed [Go](https://go.dev/doc/install) (version 1.20 or later), gcc (for Windows, see, e.g., [tdm-gcc](https://jmeubank.github.io/tdm-gcc/)), [mage](https://magefile.org/), [docker](https://docs.docker.com/get-docker/), [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl), and, if you need to compile `.proto` files, [protoc](https://github.com/protocolbuffers/protobuf/releases).
+
+For a full list of mage commands, run `mage -l`.
+
+## Setup
+
+### Note for Arm/M1 Mac Users
+
+You will need to set the `PULSAR_IMAGE` environment variable to an arm64 image.
+
+We provide an optimised image for this purpose:
+
+```bash
+export PULSAR_IMAGE=richgross/pulsar:2.11.0
+```
+
+```bash
+# Download Go dependencies.
+go mod tidy
+
+# Install necessary tooling.
+mage BootstrapTools
+
+# Compile .pb.go files from .proto files
+# (only necessary after changing a .proto file).
+mage proto
+make dotnet
+
+# Build the Docker images containing all Armada components.
+# Only the main "bundle" is needed for quickly testing Armada.
+mage buildDockers "bundle,lookout-bundle,jobservice"
+
+# Set up a kind (i.e., Kubernetes-in-Docker) cluster; see
+# https://kind.sigs.k8s.io/ for details.
+mage Kind
+
+# Start necessary dependencies.
+# Verify that dependencies started successfully
+# (check that Pulsar has fully started as it is quite slow (~ 1min )).
+mage StartDependencies && mage checkForPulsarRunning
+
+# Start the Armada server and executor.
+# Alternatively, run the Armada server and executor directly on the host,
+# e.g., through your IDE; see below for details.
+docker compose up -d server executor
+
+# Wait for Armada to come online
+mage checkForArmadaRunning
+```
+
+Run the Armada test suite against the local environment to verify that it is working correctly.
+```bash
+# Create an Armada queue to submit jobs to.
+go run cmd/armadactl/main.go create queue e2e-test-queue
+
+# To allow Ingress tests to pass
+export ARMADA_EXECUTOR_INGRESS_URL="http://localhost"
+export ARMADA_EXECUTOR_INGRESS_PORT=5001
+
+# Run the Armada test suite against the local environment.
+go run cmd/testsuite/main.go test --tests "testsuite/testcases/basic/*" --junit junit.xml
+```
+
+Tear down the local environment using the following:
+```bash
+# Stop Armada components and dependencies.
+docker compose down
+
+# Tear down the kind cluster.
+mage KindTeardown
+```
diff --git a/docs/developer/ubuntu-setup.md b/docs/developer/ubuntu-setup.md
new file mode 100644
index 00000000000..6aa02e39a7c
--- /dev/null
+++ b/docs/developer/ubuntu-setup.md
@@ -0,0 +1,164 @@
+# Setting up an Ubuntu Linux instance for Armada development
+
+## Introduction
+
+This document lists the steps, packages, and tweaks needed to get an Ubuntu Linux
+instance running with all the tools needed for Armada development and testing.
+
+The packages and steps were verified on an AWS EC2 instance (type t3.xlarge, 4 vCPU, 16GB RAM,
+150GB EBS disk), but should be essentially the same on any comparable hardware system.
+
+### Install Ubuntu Linux
+
+Install Ubuntu Linux 22.04 (later versions may work as well). The default package set should
+work. If you are setting up a new AWS EC2 instance, the default Ubuntu 22.04 image works well.
+
+When installing, ensure that the network configuration allows:
+- SSH traffic from your client IP(s)
+- HTTP traffic
+- HTTPS traffic
+
+Apply all recent updates:
+```
+$ sudo apt update
+$ sudo apt upgrade
+```
+You will likely need to reboot after applying the updates:
+```
+$ sudo shutdown -r now
+```
+After logging in, clean up any old, unused packages:
+```
+$ sudo apt autoremove
+```
+
+AWS usually creates new EC2 instances with a very small root partition (8GB), which will quickly
+fill up when using containers or doing any serious development. Creating a new, large EBS volume and
+attaching it to the instance will give a system usable for container work.
+
+First, provision an EBS volume in the AWS Console - of at least 150GB, or more - and attach it to
+the instance. You will need to create the EBS volume in the same availability zone as the EC2
+instance - you can find the latter's AZ by clicking on the 'Networking' tab in the details page
+for the instance, where you should see the Availability Zone listed. Once you've created
+the volume, attach it to the instance.
+
+Then, format a filesystem on the volume and mount it. First, determine which block device the
+new volume is on by running the `lsblk` command. There should be a line where the TYPE is 'disk'
+and the size matches the size you specified when creating the volume - e.g.
+```
+nvme1n1 259:4 0 150G 0 disk
+```
+Create a filesystem on that device by running `mkfs`:
+```
+$ sudo mkfs -t ext4 /dev/nvme1n1
+```
+Then set a label on the new filesystem - here, we will give it a label of 'VOL1':
+```
+$ sudo e2label /dev/nvme1n1 VOL1
+```
+Create the mount-point directory:
+```
+$ sudo mkdir /vol1
+```
+Add the following line to the end of `/etc/fstab`, so it will be mounted upon reboot:
+```
+LABEL=VOL1 /vol1 ext4 defaults 0 2
+```
+Then mount it by running `sudo mount -a`, and confirm the available space by running `df -h` - the `/vol1`
+filesystem should be listed.
+
+### Install Language/Tool Packages
+
+Install several development packages that aren't installed by default in the base system:
+```
+$ sudo apt install gcc make unzip
+```
+
+### Install Go, Protocol Buffers, and kubectl tools
+Install the Go compiler and associated tools. Currently, the latest version is 1.20.5, but there may
+be newer versions:
+
+```
+$ curl --location -O https://go.dev/dl/go1.20.5.linux-amd64.tar.gz
+$ sudo tar -C /usr/local -xzvf go1.20.5.linux-amd64.tar.gz
+$ echo 'export PATH=$PATH:/usr/local/go/bin' > go.sh
+$ sudo cp go.sh /etc/profile.d/
+```
+Then log out and back in again, and run `go version` to verify your path is now correct.
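+
+If the path is set up correctly, you should see output similar to the following (the version string shown here is illustrative; it will match the tarball you actually downloaded):
+```
+$ go version
+go version go1.20.5 linux/amd64
+```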
+ +Install protoc: +``` +$ curl -O --location https://github.com/protocolbuffers/protobuf/releases/download/v23.3/protoc-23.3-linux-x86_64.zip +$ cd /usr/local && sudo unzip ~/protoc-23.3-linux-x86_64.zip +$ cd ~ +$ type protoc +``` + +Install kubectl: +``` +$ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +$ sudo cp kubectl /usr/local/bin +$ sudo chmod 755 /usr/local/bin/kubectl +$ kubectl version +``` + +### Install Docker + +Warning: do not install Docker as provided by the `docker.io` and other packages in the Ubuntu base +packages repository - the version of Docker they provide is out-of-date. + +Instead, follow the instructions for installing Docker on Ubuntu at https://docs.docker.com/engine/install/ubuntu/ . +Specifically, follow the listed steps for installing using an apt repository, and install the latest Docker version. + +### Relocate Docker storage directory to secondary volume + +Since Docker can use a lot of filesystem space, the directory where it stores container images, logs, +and other datafiles should be relocated to the separate, larger non-root volume on the system, so that +the root filesystem does not fill up. + +Stop the Docker daemon(s) and copy the existing data directory to the new location: +``` +$ sudo systemctl stop docker +$ ps ax | grep -i docker # no Docker processes should be shown + +$ sudo rsync -av /var/lib/docker /vol1/ +$ sudo rm -rf /var/lib/docker +$ sudo ln -s /vol1/docker /var/lib/docker +``` +Then restart Docker and verify that it's working again: +``` +$ sudo systemctl start docker +$ sudo docker ps +$ sudo docker run hello-world +``` + +### Create user accounts, verify docker access + +First, make a home directory parent in the new larger filesystem: +``` +$ sudo mkdir /vol1/home +``` +Then, for each user to be added, run the following steps - we will be using the account named 'testuser' here. +First, create the account and their home directory. +``` +$ sudo adduser --shell /bin/bash --gecos 'Test User' --home /vol1/home/testuser testuser +``` +Set up their $HOME/.ssh directory and add their SSH public-key: +``` +$ sudo mkdir /vol1/home/testuser/.ssh +$ sudo vim /vol1/home/testuser/.ssh/authorized_keys +# In the editor, add the SSH public key string that the user has given you, save the file and exit +$ sudo chmod 600 /vol1/home/testuser/.ssh/authorized_keys +$ sudo chmod 700 /vol1/home/testuser/.ssh +$ sudo chown -R testuser:testuser /vol1/home/testuser/.ssh +``` +Finally, add them to the `docker` group so they can run Docker commands without `sudo` access: +``` +$ sudo gpasswd -a testuser docker +``` +**sudo Access (OPTIONAL)** + +If you want to give the new user `sudo` privileges, run the following command: +``` +$ sudo gpasswd -a testuser sudo +``` diff --git a/docs/development_guide.md b/docs/development_guide.md new file mode 100644 index 00000000000..b76a6194282 --- /dev/null +++ b/docs/development_guide.md @@ -0,0 +1,119 @@ +# Development guide + +Here, we give an overview of a development setup for Armada that is closely aligned with how Armada is built and tested in CI. 
+ +Before starting, please ensure you have installed [Go](https://go.dev/doc/install) (version 1.20 or later), gcc (for Windows, see, e.g., [tdm-gcc](https://jmeubank.github.io/tdm-gcc/)), [mage](https://magefile.org/), [docker](https://docs.docker.com/get-docker/), [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl), and, if you need to compile `.proto` files, [protoc](https://github.com/protocolbuffers/protobuf/releases). + +Then, use the following commands to setup a local Armada system. +```bash +# Download Go dependencies. +go mod tidy + +# Install necessary tooling. +mage BootstrapTools + +# Compile .pb.go files from .proto files +# (only necessary after changing a .proto file). +mage proto +make dotnet + +# Build a Docker image containing all Armada components. +mage buildDockers "bundle" + +# Setup up a kind (i.e., Kubernetes-in-Docker) cluster; see +# https://kind.sigs.k8s.io/ for details. +mage Kind + +# Start necessary dependencies. +# Verify that dependencies started successfully +# (check that Pulsar has fully started as it is quite slow (~ 1min )). +mage StartDependencies && mage checkForPulsarRunning + +# Start the Armada server and executor. +# Alternatively, run the Armada server and executor directly on the host, +# e.g., through your IDE; see below for details. +docker-compose up -d server executor +``` + +**Note: the components take ~15 seconds to start up.** + +Run the Armada test suite against the local environment to verify that it is working correctly. +```bash +# Create an Armada queue to submit jobs to. +go run cmd/armadactl/main.go create queue e2e-test-queue + +# To allow Ingress tests to pass +export ARMADA_EXECUTOR_INGRESS_URL="http://localhost" +export ARMADA_EXECUTOR_INGRESS_PORT=5001 + +# Run the Armada test suite against the local environment. +go run cmd/testsuite/main.go test --tests "testsuite/testcases/basic/*" --junit junit.xml +``` + +Tear down the local environment using the following: +```bash +# Stop Armada components and dependencies. +docker-compose down + +# Tear down the kind cluster. +mage KindTeardown +``` + + +## Running the Armada server and executor in Visual Studio Code + +To run the Armada server and executor from Visual Studio Code for debugging purposes, add, e.g., the following config to `.vscode/launch.json` and start both from the "Run and Debug" menu (see the Visual Studio Code [documentation](https://code.visualstudio.com/docs/editor/debugging) for more information). + +```json +{ + // Use IntelliSense to learn about possible attributes. + // Hover to view descriptions of existing attributes. 
+ // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387 + "version": "0.2.0", + "configurations": [ + { + "name": "server", + "type": "go", + "request": "launch", + "mode": "auto", + "env": { + "CGO_ENABLED": "0", + "ARMADA_REDIS_ADDRS": "localhost:6379", + "ARMADA_EVENTSAPIREDIS_ADDRS": "localhost:6379", + "ARMADA_EVENTAPI_POSTGRES_CONNECTION_HOST": "localhost", + "ARMADA_POSTGRES_CONNECTION_HOST": "localhost", + "ARMADA_PULSAR_URL": "pulsar://localhost:6650" + }, + "cwd": "${workspaceFolder}/", + "program": "${workspaceFolder}/cmd/armada/main.go", + "args": [ + "--config", "${workspaceFolder}/localdev/config/armada/config.yaml" + ] + }, + { + "name": "executor", + "type": "go", + "request": "launch", + "mode": "auto", + "env": { + "CGO_ENABLED": "0", + "ARMADA_HTTPPORT": "8081", + "ARMADA_APICONNECTION_ARMADAURL": "localhost:50051", + "KUBECONFIG": "${workspaceFolder}/.kube/external/config" + }, + "cwd": "${workspaceFolder}/", + "program": "${workspaceFolder}/cmd/executor/main.go", + "args": [ + "--config", "${workspaceFolder}/localdev/config/executor/config.yaml" + ] + } + ], + "compounds": [ + { + "name": "server/executor", + "configurations": ["server", "executor"], + "stopAll": true + } + ] +} +``` \ No newline at end of file diff --git a/docs/docs-readme.md b/docs/docs-readme.md deleted file mode 100644 index 43c5808bfcd..00000000000 --- a/docs/docs-readme.md +++ /dev/null @@ -1,31 +0,0 @@ -# Docs Readme - -## For Developers - -See [website.md](./developer/website.md) - -## Overview -Docs added to this directory are automatically copied into armadaproject.io. - -For example, if you wanted to document bananas, and you added `bananas.md`, -once committed to master that would be published at -`https://armadaproject.io/bananas/`. - -## Complex pages with assets -If you'd like to add a more complex page, such as one with images or other -linked assets, you have to be very careful to ensure links will work both -for people viewing in github and for those viewing via armadaproject.io. - -The easiest way to accomplish this is by using page bundles. See quickstart -as as example: quickstart/index.md is the actual content, with links to -various images using relative pathing; e.g. `./my-image.png`. This is -considered a page bundle by jekyll (github pages) and are rendered as a -single page at `quickstart/`. - -In order to get this page bundle pushed to gh-pages branch, you'll need -to adjust the github workflow in `.github/workflows/pages.yml` to add your -new page bundle as well. - -## Removing pages -If you put a commit here to remove a page, you will need to also commit -to the gh-pages branch to remove that page. diff --git a/docs/production-install.md b/docs/production-install.md new file mode 100644 index 00000000000..2719f46b9f6 --- /dev/null +++ b/docs/production-install.md @@ -0,0 +1,256 @@ +--- +permalink: /production +--- + +# Production Installation + +### Prerequisites + +* At least one running Kubernetes cluster + +### Installing Armada Server + +For production it is assumed that the server component runs inside a Kubernetes cluster. + +The below sections will cover how to install the component into Kubernetes. 
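+
+Before you start, it is worth confirming that your local `kubectl` and `helm` are pointed at the cluster you intend to install into. A quick sanity check (assuming your kubeconfig context is already set up) might look like this:
+
+```bash
+# Confirm the kubeconfig context points at the intended cluster.
+kubectl config current-context
+kubectl get nodes
+
+# Confirm Helm is installed.
+helm version --short
+```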
+
+#### Recommended prerequisites
+
+
+* Cert manager installed: [https://cert-manager.io/docs/installation/helm/#installing-with-helm](https://cert-manager.io/docs/installation/helm/#installing-with-helm)
+* A gRPC-compatible ingress controller installed for gRPC ingress, such as [https://github.com/kubernetes/ingress-nginx](https://github.com/kubernetes/ingress-nginx)
+* Redis installed: [https://github.com/helm/charts/tree/master/stable/redis-ha](https://github.com/helm/charts/tree/master/stable/redis-ha)
+* Optionally, the NATS Streaming server Helm chart installed: [https://github.com/nats-io/k8s/tree/main/helm/charts/stan](https://github.com/nats-io/k8s/tree/main/helm/charts/stan); additional docs: [https://docs.nats.io/running-a-nats-service/nats-kubernetes](https://docs.nats.io/running-a-nats-service/nats-kubernetes)
+
+
+Set the `ARMADA_VERSION` environment variable and clone [this repository](https://github.com/armadaproject/armada.git) with the same version tag as you are installing. For example, to install version `v1.2.3`:
+```bash
+export ARMADA_VERSION=v1.2.3
+git clone https://github.com/armadaproject/armada.git --branch $ARMADA_VERSION
+```
+
+#### Installing the server component
+
+To install the server component, we will use Helm.
+
+You'll need to provide custom config via the values file; below is a minimal template that you can fill in:
+
+```yaml
+ingressClass: "nginx"
+clusterIssuer: "letsencrypt-prod"
+hostnames:
+  - "server.component.url.com"
+replicas: 3
+
+applicationConfig:
+  redis:
+    masterName: "mymaster"
+    addrs:
+      - "redis-ha-announce-0.default.svc.cluster.local:26379"
+      - "redis-ha-announce-1.default.svc.cluster.local:26379"
+      - "redis-ha-announce-2.default.svc.cluster.local:26379"
+    poolSize: 1000
+  eventsRedis:
+    masterName: "mymaster"
+    addrs:
+      - "redis-ha-announce-0.default.svc.cluster.local:26379"
+      - "redis-ha-announce-1.default.svc.cluster.local:26379"
+      - "redis-ha-announce-2.default.svc.cluster.local:26379"
+    poolSize: 1000
+
+basicAuth:
+  users:
+    "user1": "password1"
+```
+
+For all configuration options you can specify in your values file, see the [server Helm docs](https://armadaproject.io/helm#server-helm-chat).
+
+Fill in the appropriate values in the above template and save it as `server-values.yaml`.
+
+Then run:
+
+```bash
+helm install ./deployment/armada --set image.tag=$ARMADA_VERSION -f ./server-values.yaml
+```
+
+#### Using NATS Streaming
+You can optionally set up Armada to route all job events through a persistent NATS Streaming subject before saving them to Redis. This is useful if an additional application needs to consume events from Armada, as the NATS subject contains job events from all job sets.
+
+The required additional server configuration is:
+
+```yaml
+eventsNats:
+  servers:
+    - "armada-nats-0.default.svc.cluster.local:4222"
+    - "armada-nats-1.default.svc.cluster.local:4222"
+    - "armada-nats-2.default.svc.cluster.local:4222"
+  clusterID: "nats-cluster-ID"
+  subject: "ArmadaEvents"
+  queueGroup: "ArmadaEventsRedisProcessor"
+```
+
+### Installing Armada Executor
+
+For production, the executor component should run inside the cluster it is "managing".
+
+To install the executor into a cluster, we will use Helm.
+
+You'll need to provide custom config via the values file; below is a minimal template that you can fill in:
+
+```yaml
+applicationConfig:
+  application:
+    clusterId: "clustername"
+  apiConnection:
+    armadaUrl: "server.component.url.com:443"
+    basicAuth:
+      username: "user1"
+      password: "password1"
+```
+
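+If you plan to scrape executor metrics with Prometheus (see the Metrics section at the end of this guide), the same values file can also carry that setting instead of passing it with `--set` at install time. A minimal sketch:
+
+```yaml
+prometheus:
+  enabled: true
+```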
+
+##### Moving Executor off the control plane
+
+By default, the executor runs on the control plane.
+
+If that isn't an option, for example because you are using a managed Kubernetes service where you cannot access the master nodes, add the following to your values file:
+
+```yaml
+nodeSelector: null
+tolerations: []
+```
+
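+Conversely, if you would rather pin the executor to a dedicated node pool, you can point `nodeSelector` and `tolerations` at your own labels and taints instead. The `armada/role` label and taint below are placeholders for whatever your cluster actually uses:
+
+```yaml
+nodeSelector:
+  armada/role: executor
+tolerations:
+  - key: armada/role
+    operator: Equal
+    value: executor
+    effect: NoSchedule
+```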
+
+
+For other node configurations and all other executor options you can specify in your values file, see the [executor Helm docs](https://armadaproject.io/helm#Executor-helm-chart).
+
+Fill in the appropriate values in the above template and save it as `executor-values.yaml`.
+
+Then run:
+
+```bash
+helm install ./deployment/armada-executor --set image.tag=$ARMADA_VERSION -f ./executor-values.yaml
+```
+# Interacting with Armada
+
+Once you have the Armada components running, you can interact with them via the command-line tool called `armadactl`.
+
+## Setting up armadactl
+
+`armadactl` connects to `localhost:50051` by default with no authentication.
+
+For authentication, please create a config file as described below.
+
+#### Config file
+
+By default, config is loaded from `$HOME/.armadactl.yaml`.
+
+You can also set the location of the config file using a command-line argument:
+
+```bash
+armadactl command --config=/config/location/config.yaml
+```
+
+The config is a simple YAML file:
+
+```yaml
+armadaUrl: "server.component.url.com:443"
+basicAuth:
+  username: "user1"
+  password: "password1"
+```
+
+For an OpenID-protected server, armadactl will perform a PKCE flow, opening a web browser.
+The config file should look like this:
+```yaml
+armadaUrl: "server.component.url.com:443"
+openIdConnect:
+  providerUrl: "https://myproviderurl.com"
+  clientId: "***"
+  localPort: 26354
+  useAccessToken: true
+  scopes: []
+```
+
+To invoke an external program to generate an access token, the config file should be as follows:
+```yaml
+armadaUrl: "server.component.url.com:443"
+execAuth:
+  # Command to run. Needs to be on the path and should write only a token to stdout. Required.
+  cmd: some-command
+  # Arguments to pass when executing the command. Optional.
+  args:
+    - "arg1"
+  # Environment variables to set when executing the command. Optional.
+  env:
+    - name: "FOO"
+      value: "bar"
+  # Whether the command requires user input. Optional.
+  interactive: true
+```
+
+For Kerberos authentication, the config file should contain this:
+```
+KerberosAuth:
+  enabled: true
+```
+
+#### Environment variables
+
+ --- TBC ---
+
+## Submitting Test Jobs
+
+For more information about usage, please see the [User Guide](./user.md).
+
+Specify the jobs to be submitted in a YAML file:
+```yaml
+queue: test
+jobSetId: job-set-1
+jobs:
+  - priority: 0
+    podSpec:
+      terminationGracePeriodSeconds: 0
+      restartPolicy: Never
+      containers:
+        ... any Kubernetes pod spec ...
+
+```
+
+Use the `armadactl` command-line utility to submit jobs to the Armada server:
+```bash
+# create a queue:
+armadactl create queue test --priorityFactor 1
+
+# submit jobs in yaml file:
+armadactl submit ./example/jobs.yaml
+
+# watch job events:
+armadactl watch test job-set-1
+
+```
+
+**Note: Job resource request and limit should be equal. Armada does not support limit > request currently.**
+
+## Metrics
+
+All Armada components provide a `/metrics` endpoint exposing metrics relevant to the running of the system.
+
+We actively support Prometheus through our Helm charts (see below for details); however, any metrics solution that can scrape an endpoint will work.
+
+### Component metrics
+
+#### Server
+
+The server component provides metrics on the `:9000/metrics` endpoint.
+
+You can enable Prometheus components when installing with Helm by setting `prometheus.enabled=true`.
+
+#### Executor
+
+The executor component provides metrics on the `:9001/metrics` endpoint.
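+
+To spot-check that metrics are being served, you can port-forward to an executor pod and query the endpoint directly. A quick sketch (the pod name below is a placeholder; look it up with `kubectl get pods` in the namespace you installed the executor into):
+
+```bash
+kubectl port-forward pod/armada-executor-0 9001:9001 &
+curl -s localhost:9001/metrics | head
+```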
+
+You can enable Prometheus components when installing with Helm by setting `prometheus.enabled=true`.
+
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 00000000000..3419c9816db
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,7 @@
+# Quickstart
+
+The easiest way to install **Armada** is by using the **Armada Operator**, a Kubernetes operator that manages the lifecycle of **Armada components**.
+
+The operator is available in the [Armada Operator repository](https://github.com/armadaproject/armada-operator).
+
+Follow the [Quickstart](https://github.com/armadaproject/armada-operator?tab=readme-ov-file#quickstart) to create your first Armada cluster.
\ No newline at end of file
diff --git a/docs/user.md b/docs/user.md
index 1c6d6763fb6..6eb505f75f0 100644
--- a/docs/user.md
+++ b/docs/user.md
@@ -7,7 +7,7 @@ This document is meant to be a guide for new users of how to create and submit J
 
 For more information about the design of Armada (e.g., how jobs are prioritised), see:
 
-- [System overview](./design.md)
+- [System overview](./design/README.md)
 
 The Armada workflow is: