# artcodix

## Preface

This document describes a possible environment setup for a pre-production or minimal production deployment.
In general, hardware requirements vary greatly from environment to environment; this guide is neither
a hardware sizing guide nor the best service placement for every setup. It intends to
provide a starting point for a hardware-based deployment of the SCS-IaaS reference implementation based on OSISM.

## Node type definitions

### Control Node

A control node runs all or most of the OpenStack services that provide the API services and the corresponding
runtimes. These nodes are necessary for any user to interact with the cloud and to keep the cloud in a managed state.
However, these nodes usually do **not** run user virtual machines.
Hence it is advisable to replicate the control nodes. Three nodes are a good starting point for a RAFT quorum.
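
To illustrate why three nodes are a sensible minimum: quorum-based services on the control plane (RAFT-style consensus, database and message queue clustering) need a majority of members to stay available. A minimal sketch of the arithmetic:

```python
# Majority quorum size and tolerated failures for a replicated control plane.

def quorum(members: int) -> int:
    """Smallest majority of the cluster."""
    return members // 2 + 1

for members in (1, 2, 3, 5):
    print(f"{members} nodes -> quorum {quorum(members)}, "
          f"tolerates {members - quorum(members)} failure(s)")

# 3 nodes -> quorum 2, tolerating 1 failure; 2 nodes tolerate none,
# which is why three control nodes are the practical starting point.
```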

### Compute Node (HCI/no HCI)

#### Not Hyperconverged Infrastructure (no HCI)

Non-HCI compute nodes exclusively run user virtual machines. They run no API services, no storage daemons
and no network routers, except for the network infrastructure necessary to connect virtual machines.

#### Hyperconverged Infrastructure (HCI)

HCI nodes generally run at least user virtual machines and storage daemons. It is possible to place networking services
here as well, but that is not considered good practice.

#### No HCI vs. HCI

Whether to use HCI nodes or not is in general not an easy question. For a getting-started (pre-production/smallest possible production)
environment, however, it is the most cost-efficient option. Therefore we will continue with HCI nodes (compute + storage).

### Storage Node

A dedicated storage node runs only storage daemons. This can be necessary in larger deployments to protect the storage daemons from
resource starvation by user workloads.

Not used in this setup.

### Network Node

A dedicated network node runs the routing infrastructure for user virtual machines that connects these machines with provider / external
networks. In larger deployments these can be useful to enhance scaling and improve network performance.

Not used in this setup.

## Nodes in this deployment example

As mentioned before, we are running three dedicated control nodes. To be able to fully test an OpenStack environment, it is
recommended to run three compute nodes (HCI) as well. Technically you can get a setup running with just one compute node.
See the following chapter (Use cases and validation) for more information.

### Use cases and validation

The setup described allows for the following use cases / test cases:

- Highly available control plane
  - Control plane failure toleration test (database, RabbitMQ, Ceph MONs, routers)
- Highly available user virtual clusters (e.g. Kubernetes clusters)
  - Compute host failure simulation (see the sketch below)
- Host aggregates / compute node grouping
- Host-based storage replication (instead of OSD-based)
  - Fully replicated storage / storage high availability test
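
For the compute host failure simulation, a small script against the OpenStack API can confirm that the control plane still reports the expected service states. The following is a minimal sketch using the openstacksdk library; the cloud name `mycloud` is an assumed entry in your `clouds.yaml`, not something this guide prescribes:

```python
# Sketch: list nova compute service states after a simulated host failure.
# Assumes a clouds.yaml entry named "mycloud"; adjust to your environment.
import openstack

conn = openstack.connect(cloud="mycloud")

for svc in conn.compute.services():
    # state is "up"/"down", status is "enabled"/"disabled"
    print(f"{svc.binary} on {svc.host}: state={svc.state}, status={svc.status}")
```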

### Control Node

#### General requirements

The control nodes do not run any user workloads. This means they are usually not sized as large as the compute nodes.
Relevant metrics for control nodes are:

- Fast and sufficiently large disks. At least SATA SSDs are recommended; NVMe drives will greatly improve the overall responsiveness.
- A rather large amount of memory to house all the caches for databases and queues.
- CPU performance should be average. A good compromise between number of cores and clock speed should be used. However, this is
  the least important requirement on the list.

#### Hardware recommendation

The following server specs are just a starting point and can vary greatly between environments.

Example:
3x Dell R630/R640/R650 1U server

- Dual 8-core 3.0 GHz Intel/AMD
- 128 GB RAM
- 2x 3.84 TB NVMe in (software) RAID 1
- 2x 10/25/40 Gbit dual-port SFP+/QSFP network cards

### Compute Node (HCI)

The compute nodes in this scenario run all the user virtual workloads **and** the storage infrastructure. To make sure
we don't starve these nodes, they should be of decent size.

> This setup takes local storage tests into consideration. The SCS standards require certain flavors with very fast disk speed
> to house customer Kubernetes control planes (etcd). These speeds are usually not achievable with shared storage. If you don't
> intend to test this scenario, you can skip the NVMe disks.

#### Hardware recommendation

The following server specs are just a starting point and can vary greatly between environments. The sizing of the nodes needs to fit
the expected workloads (customer VMs).

Example:
3x Dell R730(xd)/R740(xd)/R750(xd)
or
3x Supermicro

- Dual 16-core 2.8 GHz Intel/AMD
- 512 GB RAM
- 2x 3.84 TB NVMe in (software) RAID 1 if you want to have local storage available (optional)

For hyperconverged Ceph OSDs (see the capacity sketch below):

- 4x 10 TB HDD -> this leads to ~30 TB of available HDD storage (optional)
- 4x 7.68 TB SSD -> this leads to ~25 TB of available SSD storage (optional)
- 2x 10/25/40 Gbit dual-port SFP+/QSFP network cards
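
The usable capacity figures above can be reproduced with a rough back-of-the-envelope calculation. This is a minimal sketch assuming three HCI nodes, a Ceph replica count of 3 and a conservative fill margin; both the replica count and the margin are assumptions, not prescribed by this guide:

```python
# Rough usable-capacity estimate for a replicated Ceph pool spread over
# the three HCI nodes described above.

def usable_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float,
                       replicas: int = 3, fill_margin: float = 0.85) -> float:
    """Approximate usable capacity in TB after replication and headroom."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replicas * fill_margin

print(usable_capacity_tb(3, 4, 10.0))   # HDD pool: ~34 TB (guide estimate: ~30 TB)
print(usable_capacity_tb(3, 4, 7.68))   # SSD pool: ~26 TB (guide estimate: ~25 TB)
```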

## Network

The network infrastructure can vary a lot from setup to setup. This guide does not intend to define the best networking solution
for every cluster but rather gives two possible scenarios.

### Scenario A: Not recommended for production

The smallest possible setup is just a single switch physically connected to all the nodes on one interface. The switch has to be
VLAN capable. OpenStack recommends multiple isolated networks, but at least the following should be split:

- Out-of-band network
- Management networks
- Storage backend network
- Public / external network for virtual machines

If there is only one switch, these networks should all be defined as separate VLANs. One of the networks can run in the untagged default
VLAN 1.
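
Writing the single-switch layout down as a small VLAN plan helps keep the networks from colliding. This is an illustrative sketch only; the VLAN IDs are examples and not a recommendation of this guide:

```python
# Example VLAN plan for Scenario A (IDs are illustrative only).
vlan_plan = {
    "out-of-band": 1,        # runs in the untagged default VLAN
    "management": 100,
    "storage-backend": 200,
    "public-external": 300,
}

# Sanity check: every network gets its own VLAN ID.
assert len(set(vlan_plan.values())) == len(vlan_plan), "duplicate VLAN IDs"

for network, vlan_id in vlan_plan.items():
    print(f"{network}: VLAN {vlan_id}")
```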

### Scenario B: Minimum recommended setup for small production environments

The recommended setup uses two stacked switches connected in a LAG and at least three different physical network ports on each node.

- Physical network 1: VLANs for the public / external network for virtual machines and the management networks
- Physical network 2: Storage backend network
- Physical network 3: Out-of-band network

### Network adapters

The out-of-band network usually does not need a lot of bandwidth. Most modern servers come with 1 Gbit/s adapters, which are sufficient.
For small test clusters, it might also be sufficient to use 1 Gbit/s networks for the other two physical networks.
For a minimum production cluster it is recommended to use the following:

- Out-of-band network: 1 Gbit/s
- VLANs for the public / external network for virtual machines and the management networks: 10 / 25 Gbit/s
- Storage backend network: 10 / 25 / 40 Gbit/s

Whether you need higher throughput for your storage backend services depends on your expected storage load. The faster the network,
the faster storage data can be replicated between nodes. This usually leads to improved performance and better/faster fault tolerance.
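
To get a feel for what the link speed means in practice, here is a rough, illustrative estimate of how long it takes to move a given amount of replication data over the storage backend network. It assumes the full line rate is available for replication traffic, which in practice it will not be:

```python
# Back-of-the-envelope: hours needed to transfer a given amount of data
# over the storage backend network at a given line rate.

def transfer_hours(data_tb: float, link_gbit_s: float) -> float:
    data_bits = data_tb * 1e12 * 8            # decimal TB -> bits
    return data_bits / (link_gbit_s * 1e9) / 3600

for speed in (10, 25, 40):
    print(f"{speed} Gbit/s: ~{transfer_hours(10, speed):.1f} h for 10 TB")

# 10 Gbit/s: ~2.2 h, 25 Gbit/s: ~0.9 h, 40 Gbit/s: ~0.6 h
```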

## How to continue

Once the example hardware described above is in place, you can continue with the [deployment guide](https://docs.scs.community/docs/iaas/guides/deploy-guide/).