From 949649ba61ee328f08eed821503cde7c707daaeb Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Mon, 8 May 2023 01:34:19 -0700 Subject: [PATCH 1/7] Doc updates --- ocfweb/docs/docs/services/hpc.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/ocfweb/docs/docs/services/hpc.md b/ocfweb/docs/docs/services/hpc.md index 5f8abb7eb..265c78d15 100644 --- a/ocfweb/docs/docs/services/hpc.md +++ b/ocfweb/docs/docs/services/hpc.md @@ -39,16 +39,15 @@ where you can ask questions and talk to us about anything HPC. ## The Cluster -As of Fall 2018, the OCF HPC cluster is composed of one server, with the +As of Spring 2023, the OCF HPC cluster is composed of one server, with the following specifications: -* 2 Intel Xeon [E5-2640v4][corruption-cpu] CPUs (10c/20t @ 2.4GHz) -* 4 NVIDIA 1080Ti GPUs -* 256GB ECC DDR4-2400 RAM +* 2 Intel Xeon Platinum [8352Y][corruption-cpu] CPUs (32c/64t @ 2.2GHz) +* 4 NVIDIA RTX A6000 GPUs +* 256GB ECC DDR4-3200 RAM We have plans to expand the cluster with additional nodes of comparable -specifications as funding becomes available. The current hardware was -generously funded by a series of grants from the [Student Tech Fund][stf]. +specifications as funding becomes available. ## Slurm From 46338dde236e2762b404f4660b570bbb8bf6eacd Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Mon, 8 May 2023 01:37:32 -0700 Subject: [PATCH 2/7] Update php.md --- ocfweb/docs/docs/services/web/php.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ocfweb/docs/docs/services/web/php.md b/ocfweb/docs/docs/services/web/php.md index 064f22760..de3ed98e2 100644 --- a/ocfweb/docs/docs/services/web/php.md +++ b/ocfweb/docs/docs/services/web/php.md @@ -1,6 +1,6 @@ [[!meta title="PHP"]] -`death`, the OCF webserver, currently runs PHP 7.0 with the following +`death`, the OCF webserver, currently runs PHP 7.4 with the following non-standard packages installed: * [APCu](https://www.php.net/manual/en/book.apcu.php) (opcode caching) From 8669e4b584a55504554e92bc4c2ce70e5d86c950 Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Tue, 9 May 2023 08:14:15 -0700 Subject: [PATCH 3/7] documentation: zfs-based backups --- ocfweb/docs/docs/staff/backend/backups.md | 134 +++++++++------------- 1 file changed, 53 insertions(+), 81 deletions(-) diff --git a/ocfweb/docs/docs/staff/backend/backups.md b/ocfweb/docs/docs/staff/backend/backups.md index 6494daef4..328bdcd19 100644 --- a/ocfweb/docs/docs/staff/backend/backups.md +++ b/ocfweb/docs/docs/staff/backend/backups.md @@ -1,89 +1,60 @@ [[!meta title="Backups"]] ## Backup Storage -We currently store our on-site backups across a couple drives on `hal`: +We currently store our on-site backups across a RAID mirror on `hal`: -* `hal:/opt/backups` (6 TiB usable; 2x 6-TiB Seagate drives in RAID 1 in an LVM - volume group) +* `hal:/backup` (16 TB usable; 2x 16 TB WD drives in ZFS mirror) - This volume group provides `/dev/vg-backups/backups-live` which contains - recent daily, weekly, and monthly backups, and - `/dev/vg-backups/backups-scratch`, which is scratch space for holding - compressed and encrypted backups which we then upload to off-site storage. +Backups are stored as ZFS snapshots. ZFS snapshots have the advantage of being +immutable, browsable, and they can be sent to other ZFS pools for off-site +backups. ## Off-Site Backups -Our main off-site backup location is [Box][box]. Students automatically get an -"unlimited" plan, so it provides a nice and free location to store encrypted -backups. 
We currently have a weekly cronjob that [makes an encrypted
-backup][create-encrypted-backup] using GPG keys and then [uploads it to
-Box.com][upload-to-box]. This takes about 20 hours combined to make and upload,
-and will probably take even longer in the future as backups grow. An email is
-sent out once the backup files are uploaded, and the link provided is shared
-with only OCF officers to make sure the backups are kept as secure as possible,
-since they contain all of the OCF's important data. The backups are already
-encrypted, but it doesn't hurt to add a little extra security to that.
-
-### Retention
-
-Off-site backups older than six months (180 days) are permanently deleted by a
-[daily cronjob][prune-old-backups].
+Todo: new off-site backup documentation.
 
 ## Restoring Backups
 
 The easiest way to restore from a backup is to look at how it is made and
 reverse it. If it is a directory specified in rsnapshot, then likely all that
 needs to be done is to take that directory from the backup and put it onto the
-server to restore onto. Some backups, such as mysql, ldap, and kerberos are
-more complicated, and need to be restored using `mysqlimport` or `ldapadd` for
+server to restore onto. Some backups, such as mysql, ldap, and kerberos are more
+complicated, and need to be restored using `mysqlimport` or `ldapadd` for
 instance.
 
 ### Onsite
 
-Onsite backups are pretty simple, all that needs to be done is to go to `hal`
-and find the backup to restore from in `/opt/backups/live`. All backups of
-recent data are found in either `rsnapshot` (for daily backups) or `misc` (for
-any incidents or one-off backups). Within `rsnapshot`, the backups are
-organized into directories dependings on how long ago the backup was made. To
-see when each backup was created just use `ls -l` to show the last modified
-time of each directory.
+Compared to the old setup, onsite backups are a little harder to find. They
+are located at `/backup/encrypted/rsnapshot` on `hal`. In addition, we have a
+dataset for each top-level user directory, such as `/home/a/`, which is stored
+as the `backup/encrypted/rsnapshot/.sync/nfs/opt/homes/home/a` dataset.
+
+The ZFS snapshots are stored in the `.zfs/snapshot` directory of each dataset.
+The `.zfs` folder is hidden and will not show up even with the `ls -a` command,
+so you will need to manually `cd` into the directory. The snapshots are
+timestamped, so you can find the snapshot you want to restore from by looking
+at the date string in the snapshot name. For example, if you wanted to restore
+the `public_html` directory of user `foo` with the backup from 2023-05-01, you
+would enter the
+```
+/backup/encrypted/rsnapshot/.sync/nfs/opt/homes/services/http/users/f/.zfs
+```
+directory, and then go inside the `snapshot/` folder. From there, you enter the
+`zfs-auto-snap_after-backup-2023-05-01-1133/` directory (note that the time is
+UTC), and then you can copy the `foo/` directory to the user's home directory.
+
+For large directories, please use `/backup/encrypted/scratch` as a temporary
+working area for compressing the files and other operations. Please note that
+this dataset will not be automatically snapshotted.
+
+MySQL backups are stored in the `/backup/encrypted/rsnapshot/mysql/` directory,
+and the snapshots can be accessed at
+`/backup/encrypted/rsnapshot/mysql/.zfs/snapshot/`. Inside a snapshot, the
+individual databases are stored as `.sql` files inside the `.sync/` directory.
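+
+As a concrete sketch of the restore walkthrough above (the user `foo`, the
+snapshot name, and the destination under the scratch dataset are just examples;
+check the real snapshot names with `ls`):
+
+```
+# on hal: browse the snapshots for the relevant dataset
+cd /backup/encrypted/rsnapshot/.sync/nfs/opt/homes/services/http/users/f/.zfs/snapshot
+ls    # pick a snapshot by the date string in its name
+cd zfs-auto-snap_after-backup-2023-05-01-1133
+
+# copy the recovered directory into scratch before handing it back to the user
+cp -a foo /backup/encrypted/scratch/foo-public_html-restore
+```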
### Offsite -Offsite backups are more complicated because the backup files first need to be -downloaded, stuck together into a single file, decrypted, extracted, and then -put into LVM to get back the whole backup archive that would normally be found -onsite. This essentially just means that the -[create-encrypted-backup][create-encrypted-backup] script needs to be reversed -to restore once the backup files are downloaded. Here are the general steps to -take to restore from an offsite backup: - -1. Download all the backup pieces from Box.com. This is generally easiest with - a command line tool like `cadaver`, which can just use a `mget *` to download - all the files (albeit sequentially). If more speed is needed, open multiple - `cadaver` connections and download multiple groups of files at once. - -2. Put together all the backup pieces into a single file. This can be done by - running `cat .img.gz.gpg.part* > .img.gz.gpg`. - -3. Decrypt the backup using `gpg`. This requires your key pair to be imported - into `gpg` first using `gpg --import public_key.gpg` and - `gpg --allow-secret-key-import --import private_key.gpg`, then you can - decrypt the backup with - `gpg --output .img.gz --decrypt .img.gz.gpg`. Be careful to - keep your private key secure by setting good permissions on it so that nobody - else can read it, and delete it after the backup is imported. The keys can be - deleted with `gpg --delete-secret-keys ""` and - `gpg --delete-key ""`, where your name is whatever name it shows when - you run `gpg --list-keys`. - -4. Extract the backup with `gunzip .img.gz`. - -5. Put the backup image into a LVM logical volume. First find the size that the - volume should be by running `ls -l .img`, and copy the number of - bytes that outputs. Then create the LV with - `sudo lvcreate -L B -n /dev/` where the volume - group has enough space to store the entire backup (2+ TiB). +Todo: add instructions for restoring offsite backups using zfs send/receive. ## Backup Contents @@ -93,33 +64,34 @@ Backups currently include: * User home and web directories * Cronjobs on supported servers (tsunami, supernova, biohazard, etc.) * MySQL databases (including user databases, stats, RT, print quotas, IRC data) -* Everything on GitHub (probably very unnecessary) +* A few OCF repositories on GitHub (probably very unnecessary) * LDAP and Kerberos data * A [smattering of random files on random servers][backed-up-files] ## Backup Procedures Backups are currently made daily via a cronjob on `hal` which calls `rsnapshot`. -The current settings are to retain 7 daily backups, 4 weekly backups, and 6 -monthly backups, but we might adjust this as it takes more space or we get -larger backup drives. -We use `rsnapshot` to make incremental backups. Typically, each new backup -takes an additional ~3GiB of space (but this will vary based on how many files -actually changed). A full backup is about ~2TiB of space and growing. +We use `rsnapshot` and ZFS snapshots to make incremental backups. Typically, +each new backup takes an additional ~20GiB of space (but this will vary based on +how many files actually changed). A full backup is about ~4TiB of space and +growing. -(The incremental file backups are only about ~300 MiB, but since mysqldump -files can't be incrementally backed up, those take a whole ~2 GiB each time, so -the total backup grows by ~3GiB each time. However, an old backup is discarded -each time too, so it approximately breaks even.) 
+(The incremental file backups are only about ~1-5 GiB, but since MySQL and +Postgres files can't be incrementally backed up, those take a whole ~ 15 GiB +each time, so the total backup grows by ~20GiB each time.) ## Ideas for backup improvements 1. Automate backup testing, so have some system for periodically checking that backups can be restored from, whether they are offsite or onsite. -[box]: https://www.box.com -[create-encrypted-backup]: https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/create-encrypted-backup -[upload-to-box]: https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/upload-to-box -[backed-up-files]: https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf_backups/files/rsnapshot.conf#L53 -[prune-old-backups]: https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/prune-old-backups +[rsyncnet]: https://www.rsync.net +[create-encrypted-backup]: + https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/create-encrypted-backup +[upload-to-box]: + https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/upload-to-box +[backed-up-files]: + https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf_backups/files/rsnapshot.conf#L53 +[prune-old-backups]: + https://github.com/ocf/puppet/blob/master/modules/ocf_backups/files/prune-old-backups From 14055e37c8fa7ce976ac9a3fd43742650f00706d Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Tue, 9 May 2023 08:19:03 -0700 Subject: [PATCH 4/7] styling: line wrapped --- ocfweb/docs/docs/staff/backend/firewall.md | 61 +++++----- ocfweb/docs/docs/staff/backend/git.md | 38 +++--- .../docs/staff/backend/internal-firewalls.md | 11 +- ocfweb/docs/docs/staff/backend/jenkins.md | 29 +++-- ocfweb/docs/docs/staff/backend/kerberos.md | 104 ++++++++-------- ocfweb/docs/docs/staff/backend/kubernetes.md | 68 ++++++----- ocfweb/docs/docs/staff/backend/ldap.md | 14 +-- ocfweb/docs/docs/staff/backend/libvirt.md | 16 +-- ocfweb/docs/docs/staff/backend/mail.md | 12 +- ocfweb/docs/docs/staff/backend/mail/vhost.md | 14 +-- ocfweb/docs/docs/staff/backend/munin.md | 30 +++-- ocfweb/docs/docs/staff/backend/printhost.md | 76 ++++++------ ocfweb/docs/docs/staff/backend/prometheus.md | 115 +++++++++++++----- ocfweb/docs/docs/staff/backend/rt.md | 66 +++++----- ocfweb/docs/docs/staff/backend/switch.md | 61 ++++++---- 15 files changed, 394 insertions(+), 321 deletions(-) diff --git a/ocfweb/docs/docs/staff/backend/firewall.md b/ocfweb/docs/docs/staff/backend/firewall.md index 4d32a0ad8..2e315e7d2 100644 --- a/ocfweb/docs/docs/staff/backend/firewall.md +++ b/ocfweb/docs/docs/staff/backend/firewall.md @@ -3,50 +3,49 @@ We use a Palo Alto Networks (PAN) firewall provided by IST. We have one network port in the server room which is activated and behind the firewall; we have another network port activated in the lab behind the television which is also -behind the firewall. All the ports the desktops use are also behind the -firewall since they are routed through the switch in the server room. +behind the firewall. All the ports the desktops use are also behind the firewall +since they are routed through the switch in the server room. ## Administering the firewall ### Accessing the interface Administration of the firewall is done through the [web interface][panorama], -and must be done from an on-campus IP address (for instance through the -[library VPN][library-vpn] or SOCKS proxying through an OCF host). 
**Remember -to specify https when loading the firewall admin page**, as it does not have a -redirect from http to https. If you are having connection issues with the -firewall admin page loading indefinitely, it is likely because you are trying -to use http or trying to access it from an off-campus IP. To quickly set up a -SOCKS proxy, run `ssh -D 8000 -N supernova` from any off-campus host and then -set up the SOCKS proxy (through your OS or through your browser's settings) to -use the proxy on `localhost` and port `8000`. +and must be done from an on-campus IP address (for instance through the [library +VPN][library-vpn] or SOCKS proxying through an OCF host). **Remember to specify +https when loading the firewall admin page**, as it does not have a redirect +from http to https. If you are having connection issues with the firewall admin +page loading indefinitely, it is likely because you are trying to use http or +trying to access it from an off-campus IP. To quickly set up a SOCKS proxy, run +`ssh -D 8000 -N supernova` from any off-campus host and then set up the SOCKS +proxy (through your OS or through your browser's settings) to use the proxy on +`localhost` and port `8000`. [panorama]: https://panorama.net.berkeley.edu [library-vpn]: https://www.lib.berkeley.edu/using-the-libraries/vpn -To sign in to administer the firewall, make sure to use the single sign-on -(SSO) option, and it will ask for CalNet authentication. +To sign in to administer the firewall, make sure to use the single sign-on (SSO) +option, and it will ask for CalNet authentication. ### Policies -All our current policies are located in the "Pre Rules" section under -"Security" in the policies tab. This option should be right at the top in the -box on the left side of the page. It contains all our rules since we are only -blocking traffic (either outgoing or incoming) before it goes through the -firewall, so all we need are pre rules. - -In general the interface is pretty self-explanatory. Each rule has a custom -name and a description that describes what kind of traffic it should be -blocking or letting through, as well as the source and destination addresses -(or groups of addresses), application (identified by the firewall), service -(port), and whether it is allowed or blocked. Each rule has a dropdown next to -the rule name if you hover over it that leads to the log viewer, where you can -see what kind of traffic matched each rule and when the traffic was -allowed/blocked. - -Any changes made to the firewall policies need to be committed and pushed to -the firewall using the commit button and then the push button (or the commit -and push button to do it in one step) located in the top right. +All our current policies are located in the "Pre Rules" section under "Security" +in the policies tab. This option should be right at the top in the box on the +left side of the page. It contains all our rules since we are only blocking +traffic (either outgoing or incoming) before it goes through the firewall, so +all we need are pre rules. + +In general the interface is pretty self-explanatory. Each rule has a custom name +and a description that describes what kind of traffic it should be blocking or +letting through, as well as the source and destination addresses (or groups of +addresses), application (identified by the firewall), service (port), and +whether it is allowed or blocked. 
Each rule has a dropdown next to the rule name +if you hover over it that leads to the log viewer, where you can see what kind +of traffic matched each rule and when the traffic was allowed/blocked. + +Any changes made to the firewall policies need to be committed and pushed to the +firewall using the commit button and then the push button (or the commit and +push button to do it in one step) located in the top right. ### Syslog diff --git a/ocfweb/docs/docs/staff/backend/git.md b/ocfweb/docs/docs/staff/backend/git.md index 9ed9deef2..3574b61e9 100644 --- a/ocfweb/docs/docs/staff/backend/git.md +++ b/ocfweb/docs/docs/staff/backend/git.md @@ -8,29 +8,28 @@ distributed). ## Workflow Although Git is a great tool for large-scale distributed development, for us a -Subversion-like workflow with a "central repository" (where you clone/fetch -from and push to) and linear history makes more sense. The instructions below -assume that development is happening in a single branch. +Subversion-like workflow with a "central repository" (where you clone/fetch from +and push to) and linear history makes more sense. The instructions below assume +that development is happening in a single branch. **Only commit your own, original work**. You may commit another staff member's work if you have permission and change the author appropriately (e.g., `--author="Guest User "`). When committing, `git config user.name` should be your name and `git config user.email` should be your OCF -email address -- this should be taken care of by [[LDAP|doc -staff/backend/ldap]] and `/etc/mailname` on OCF machines. +email address -- this should be taken care of by [[LDAP|doc staff/backend/ldap]] +and `/etc/mailname` on OCF machines. ### To "update" -Get the latest commits from the central repository and update your working -tree. +Get the latest commits from the central repository and update your working tree. git pull --rebase -This will `git fetch` (update your local copy of the remote repository) and -`git rebase` (rewrite current branch in terms of tracked branch). The rebase -prevents unnecessary merge commits by moving your local commits on top of the -latest remote commit (`FETCH_HEAD`). This is a good thing if you have any local -commits which have not yet been pushed to the remote repository. +This will `git fetch` (update your local copy of the remote repository) and `git +rebase` (rewrite current branch in terms of tracked branch). The rebase prevents +unnecessary merge commits by moving your local commits on top of the latest +remote commit (`FETCH_HEAD`). This is a good thing if you have any local commits +which have not yet been pushed to the remote repository. If you have "dirty" uncommitted changes, you'll need to commit them or stash them before rebasing (`git stash`). 
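+
+A typical update when you do have local uncommitted work might look like this
+(just a sketch; skip the stash steps if your working tree is already clean):
+
+    git stash            # set aside uncommitted changes
+    git pull --rebase    # fetch and replay local commits on top of FETCH_HEAD
+    git stash pop        # bring the uncommitted changes back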
@@ -104,8 +103,8 @@ Advanced: * line of changes in a repository, default branch is `master` * fast-forward * advance branch forward in a linear sequence - * this is usually what we want: the new commit builds directly on the - previous commit + * this is usually what we want: the new commit builds directly on the previous + commit * hooks * optional scripts that can be executed during git operations * for example, validate syntax before accepting a commit or deploy code to a @@ -114,8 +113,8 @@ Advanced: * files that are ready to be stored in your next commit * references (aka refs) * SHA-1 hashes that identify commits - * `HEAD` points to the latest commit ref in the current branch (`HEAD^` to - the one before it) + * `HEAD` points to the latest commit ref in the current branch (`HEAD^` to the + one before it) * remote * upstream repository that you can `git fetch` from or `git push` to, default is `origin` @@ -123,11 +122,12 @@ Advanced: `origin/master`) * working tree (aka workspace or working directory) * directory that checked out files reside - * this includes the current branch and any "dirty" uncommitted changes - (staged or not) + * this includes the current branch and any "dirty" uncommitted changes (staged + or not) ## Recommended reading * [A Visual Git Reference](https://marklodato.github.io/visual-git-guide/) * [Git Immersion](http://www.gitimmersion.com/) -* [The Case for Git Rebase](http://darwinweb.net/articles/the-case-for-git-rebase) +* [The Case for Git + Rebase](http://darwinweb.net/articles/the-case-for-git-rebase) diff --git a/ocfweb/docs/docs/staff/backend/internal-firewalls.md b/ocfweb/docs/docs/staff/backend/internal-firewalls.md index 0d720d852..919324ab4 100644 --- a/ocfweb/docs/docs/staff/backend/internal-firewalls.md +++ b/ocfweb/docs/docs/staff/backend/internal-firewalls.md @@ -74,7 +74,8 @@ Firewall rules are added by using `firewall_multi` and since such resources wouldn't be subject to the [ordering constraints generally placed on firewall resources][ordering]. -[ordering]: https://github.com/ocf/puppet/blob/f3fdd5912a5dc5eafd9995412a9c5e85874dee31/manifests/site.pp#L50-L58 +[ordering]: + https://github.com/ocf/puppet/blob/f3fdd5912a5dc5eafd9995412a9c5e85874dee31/manifests/site.pp#L50-L58 [puppetlabs-firewall]: https://forge.puppet.com/puppetlabs/firewall @@ -113,10 +114,10 @@ For IPv6 firewall rules, you need to use the `ip6tables` command instead. The invocation is the same as for `iptables`. Iptables rules are not automatically persisted across reboots. In order for your -changes to iptables to be preserved across reboots, you need to additionally -run `service netfilter-persistent save`. This is done automatically after -every Puppet run which results in iptables rules being modified, but if you -manually fiddle with iptables you may need to run it yourself. +changes to iptables to be preserved across reboots, you need to additionally run +`service netfilter-persistent save`. This is done automatically after every +Puppet run which results in iptables rules being modified, but if you manually +fiddle with iptables you may need to run it yourself. 
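+
+For example, a one-off manual change might look roughly like the following (the
+address and port are placeholders, not one of our real rules):
+
+    # allow an extra host to reach port 8080, then persist the rule set
+    sudo iptables -A INPUT -p tcp -s 198.51.100.7 --dport 8080 -j ACCEPT
+    sudo iptables -L INPUT -n --line-numbers   # double-check what was added
+    sudo service netfilter-persistent save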
### Disabling firewalls through hiera diff --git a/ocfweb/docs/docs/staff/backend/jenkins.md b/ocfweb/docs/docs/staff/backend/jenkins.md index b69483895..8c38e65c9 100644 --- a/ocfweb/docs/docs/staff/backend/jenkins.md +++ b/ocfweb/docs/docs/staff/backend/jenkins.md @@ -1,16 +1,16 @@ [[!meta title="Jenkins"]] [Jenkins](https://jenkins.ocf.berkeley.edu/) is the tool we use for continuous -integration and continuous delivery (TM) at OCF. All that means is that when -you push code, +integration and continuous delivery (TM) at OCF. All that means is that when you +push code, * Jenkins will test that code, * Jenkins will build that code (if applicable), * and then Jenkins will deploy that code. -Ideally all projects at OCF will go through this pipeline of being tested -before deployed, though currently some don't (or some only use some portion, -such as deploying without any tests). +Ideally all projects at OCF will go through this pipeline of being tested before +deployed, though currently some don't (or some only use some portion, such as +deploying without any tests). ## Making changes to Jenkins @@ -34,13 +34,13 @@ There are three users configured on the Jenkins server (`reaper`): Jenkins master but not for performing any work. * `jenkins-slave`, a user we create. It is used for running build jobs with - potentially untrusted code. **However,** it's not secure enough to run - totally untrusted code, since all jobs run under this user. + potentially untrusted code. **However,** it's not secure enough to run totally + untrusted code, since all jobs run under this user. * `jenkins-deploy`, a user we create. It is used for running build jobs tagged - `deploy`, whose only purpose is intended to be *deploying* code which has - been built or tested in a previous step. The user has a Kerberos keytab for - the `ocfdeploy` user and our PyPI key in its home directory. Jobs such as + `deploy`, whose only purpose is intended to be *deploying* code which has been + built or tested in a previous step. The user has a Kerberos keytab for the + `ocfdeploy` user and our PyPI key in its home directory. Jobs such as `upload-deb` or `puppet-trigger` fall under this user. Within Jenkins, we configure two "slaves" which are really on the same server, @@ -56,9 +56,8 @@ have to worry that anybody who can get some code built can become ocfdeploy, which is a privileged user account) and protects Jenkins somewhat against bad jobs that might e.g. delete files or crash processes. -Of course, in many cases once code builds successfully, we ship it off -somewhere where it gets effectively run as root anyway. But this feels a little -safer. +Of course, in many cases once code builds successfully, we ship it off somewhere +where it gets effectively run as root anyway. But this feels a little safer. ## Jenkins for GitHub projects @@ -115,8 +114,8 @@ testing pull requests. For example, `puppet-test-pr`. there are any). Add "ocf" as the only line to the "List of organizations" box. -7. On GitHub, under "Settings" and "Webhooks & services", add a new webhook - with payload URL `https://jenkins.ocf.berkeley.edu/ghprbhook/`, content type +7. On GitHub, under "Settings" and "Webhooks & services", add a new webhook with + payload URL `https://jenkins.ocf.berkeley.edu/ghprbhook/`, content type `application/json`, and the secret (it's in `supernova:/opt/passwords`). 
Choose to trigger only on certain events: diff --git a/ocfweb/docs/docs/staff/backend/kerberos.md b/ocfweb/docs/docs/staff/backend/kerberos.md index d6df568f2..f186d3c60 100644 --- a/ocfweb/docs/docs/staff/backend/kerberos.md +++ b/ocfweb/docs/docs/staff/backend/kerberos.md @@ -23,38 +23,33 @@ servers with a key that only that machine can read. ### Usability advantages -Kerberos makes passwordless login easy, since after the first password is -input, a ticket can be used for future logins instead of having to type the -same password again and go through the whole authentication process a second -time. Keep in mind that all of the authentication will have to be done every 10 -hours, as tickets do expire, but passwords have to be typed far less with -Kerberos in place. Tickets are invalidated on logout, so that makes sure that -someone can't steal a ticket and use it after you have left, as a little added -security. +Kerberos makes passwordless login easy, since after the first password is input, +a ticket can be used for future logins instead of having to type the same +password again and go through the whole authentication process a second time. +Keep in mind that all of the authentication will have to be done every 10 hours, +as tickets do expire, but passwords have to be typed far less with Kerberos in +place. Tickets are invalidated on logout, so that makes sure that someone can't +steal a ticket and use it after you have left, as a little added security. ## Versions There are two major free versions of Kerberos: MIT and Heimdal Kerberos. At the -OCF, we use Heimdal Kerberos; if you look up documentation, it might instead -be for the MIT version, so be careful to make sure the commands work. Kerberos -also has 2 main versions that are still used: version 4 and version 5. Version -5 fixes a lot of the security and design flaws of version 4, so we use version -5 of the protocol. +OCF, we use Heimdal Kerberos; if you look up documentation, it might instead be +for the MIT version, so be careful to make sure the commands work. ## Terminology Unfortunately, Kerberos is a complicated protocol that involves a lot of -technical jargon. Here's a bunch of different terms that you might run into -when reading about or working on Kerberos and an attempt to explain what they -mean: +technical jargon. Here's a bunch of different terms that you might run into when +reading about or working on Kerberos and an attempt to explain what they mean: - **KDC** (**K**ey **D**istribution **C**enter): The central server that issues tickets for Kerberos communication and stores all users' keys. If the KDC is compromised, you are going to have a very bad time and [will not go to space - today][xkcd-space]. Our current KDC is firestorm, but that could change in - the future, as servers are moved around or rebuilt. + today][xkcd-space]. Our current KDC is firestorm, but that could change in the + future, as servers are moved around or rebuilt. - **Realm**: A kerberos domain, usually identified with the domain name in all caps (e.g. `OCF.BERKELEY.EDU`). Two hosts are in the same realm if they share @@ -70,8 +65,8 @@ mean: `@OCF.BERKELEY.EDU` since it is the realm the OCF uses. - **User**: `[user]` or `[user]/[instance]` e.g. `jvperrin` or - `mattmcal/root`. Used for user logins or for user privileges such as - editing LDAP or running commands with `sudo`. + `mattmcal/root`. Used for user logins or for user privileges such as editing + LDAP or running commands with `sudo`. - **Host**: `host/[hostname]` e.g. 
`host/supernova.ocf.berkeley.edu`. Used by Kerberos to allow clients to verify they are communicating with the correct @@ -84,14 +79,14 @@ mean: particular host, such as `http`, which (for instance) enables logins to RT, or `smtp`, which allows email authentication. -- **Ticket**: Tickets are issued by the TGS (see below) to clients. Tickets - have an expiration time, which is set to the default of 10 hours after being +- **Ticket**: Tickets are issued by the TGS (see below) to clients. Tickets have + an expiration time, which is set to the default of 10 hours after being issued. -- **Keytab**: A keytab is essentially the equivalent of a password, but one - that can be used easily by a script. If someone has read access to a keytab, - they can retrieve all the keys in it, so be very careful what permissions are - set on keytabs. +- **Keytab**: A keytab is essentially the equivalent of a password, but one that + can be used easily by a script. If someone has read access to a keytab, they + can retrieve all the keys in it, so be very careful what permissions are set + on keytabs. - **TGT** (**T**icket **G**ranting **T**icket): A special ticket that is used for communication between the client machine and the KDC. @@ -100,13 +95,14 @@ mean: the job of the TGS is to grant tickets (see above) for different network services. -- **GSS-API**: The API used by different applications to be able to - authenticate with Kerberos. +- **GSS-API**: The API used by different applications to be able to authenticate + with Kerberos. - **SASL**: An authentication layer that many different applications can use. [xkcd-space]: https://xkcd.com/1133/ -[kdc-location]: https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf/files/auth/krb5.conf#L27 +[kdc-location]: + https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf/files/auth/krb5.conf#L27 ## Commands @@ -126,8 +122,8 @@ All conveniently prefixed with the letter `k`. - `kadmin`: Administration utility for Kerberos to make changes to the Kerberos database, either locally (with `-l`), or remotely by connecting to the KDC. - Can retrieve information about principals, modify principal attributes, - change principal passwords, show privileges allowed, etc. + Can retrieve information about principals, modify principal attributes, change + principal passwords, show privileges allowed, etc. - `kdestroy`: Remove a principal or ticket file. This is essentially the opposite of `kinit`, so it invalidates tickets you have, logging you out from @@ -157,23 +153,24 @@ will have to enter it every time they want to edit LDAP or run commands with when running `sudo` commands and for changing user passwords, whereas the `[user]/admin` principal is used mainly for modifying LDAP. -Next, to give the principal actual privileges, add the principals and -privileges assigned to the [kadmind.acl file][2] used by Puppet. Notice that -the `all` privilege does not actually give *all* privileges, since the -`get-keys` privilege is separate. The `get-keys` privilege is used to fetch -principals' keys, which is equivalent to knowing the password hash in other -authentication systems, so it is not a privilege to be handed out lightly. +Next, to give the principal actual privileges, add the principals and privileges +assigned to the [kadmind.acl file][2] used by Puppet. Notice that the `all` +privilege does not actually give *all* privileges, since the `get-keys` +privilege is separate. 
The `get-keys` privilege is used to fetch principals' +keys, which is equivalent to knowing the password hash in other authentication +systems, so it is not a privilege to be handed out lightly. -[2]: https://github.com/ocf/puppet/blob/master/modules/ocf_kerberos/files/kadmind.acl +[2]: + https://github.com/ocf/puppet/blob/master/modules/ocf_kerberos/files/kadmind.acl ## How does it actually work? Kerberos is pretty complicated, so explaining exactly how it works gets messy -very quickly, but here are the main steps that are taken by Kerberos when a -user logs in to their machine. A great guide on these steps is [Lynn Root's -_Explain it like I'm 5: Kerberos_][eli5], and explains it better and in more -depth than the rather cursory overview found here: +very quickly, but here are the main steps that are taken by Kerberos when a user +logs in to their machine. A great guide on these steps is [Lynn Root's _Explain +it like I'm 5: Kerberos_][eli5], and explains it better and in more depth than +the rather cursory overview found here: 1. The user enters their username. Their login is sent to the KDC to receieve a ticket. @@ -184,9 +181,9 @@ depth than the rather cursory overview found here: KDC). 3. The client gets the encrypted TGT and decrypts it with the user's entered - password. Note the user's password was never directly sent across the - network at any stage in the process. Then the TGT is stored in the cache on - the client machine until it expires, when it is requested again if needed. + password. Note the user's password was never directly sent across the network + at any stage in the process. Then the TGT is stored in the cache on the + client machine until it expires, when it is requested again if needed. 4. The user can then use this TGT to make requests for service tickets from the KDC. @@ -198,11 +195,11 @@ otherwise the key in the TGT could just be cracked offline by an attacker using a dictionary attack. This preauthentication typically takes the form of something like the current time encrypted with the user's key. If an attacker intercepts this communication, they do not have the exact timestamp or the -user's key to attempt to decrypt it. We require pre-authentication at the OCF -by specifying `require-preauth = true` in [/var/lib/heimdal-kdc/kdc.conf][kdc]. +user's key to attempt to decrypt it. We require pre-authentication at the OCF by +specifying `require-preauth = true` in [/var/lib/heimdal-kdc/kdc.conf][kdc]. -Then, if the user wants to communicate with other services or hosts, like SSH -or a HTTP Kerberos login, then they make more requests to the KDC: +Then, if the user wants to communicate with other services or hosts, like SSH or +a HTTP Kerberos login, then they make more requests to the KDC: 1. The client will request a service or host principal from the TGS (Ticket Granting Service) using the TGT received before. The TGS in our case is the @@ -211,9 +208,10 @@ or a HTTP Kerberos login, then they make more requests to the KDC: contacting a service and authenticating until the service ticket expires. 2. The client can then use this service ticket to send with requests to - Kerberos-enabled services, like SSH, as user authentication. The service - will verify the ticket with the KDC when used, to make sure it is valid for - the user issuing the request. + Kerberos-enabled services, like SSH, as user authentication. The service will + verify the ticket with the KDC when used, to make sure it is valid for the + user issuing the request. 
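+
+From the command line, the whole exchange above is mostly invisible; a typical
+session looks something like this (a rough sketch, assuming GSSAPI-enabled SSH
+and using `supernova` purely as an example host):
+
+    kinit           # request a TGT from the KDC; prompts for your password
+    klist           # list cached tickets and their expiry times
+    ssh supernova   # ssh requests a host/service ticket using the cached TGT
+    kdestroy        # discard cached tickets when you are done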
[eli5]: https://www.roguelynn.com/words/explain-like-im-5-kerberos/ -[kdc]: https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf_kerberos/files/kdc.conf#L13 +[kdc]: + https://github.com/ocf/puppet/blob/17bc94b395e254529d97c84fb044f76931439fd7/modules/ocf_kerberos/files/kdc.conf#L13 diff --git a/ocfweb/docs/docs/staff/backend/kubernetes.md b/ocfweb/docs/docs/staff/backend/kubernetes.md index b85496f7e..591a43a2e 100644 --- a/ocfweb/docs/docs/staff/backend/kubernetes.md +++ b/ocfweb/docs/docs/staff/backend/kubernetes.md @@ -1,13 +1,14 @@ [[!meta title="Kubernetes"]] +**NOTE:** This is the documentation for the oldk8s cluster at the OCF. + At the OCF we have fully migrated all services from Mesos/Marathon to [Kubernetes][kubernetes]. In this document we will explain the design of our Kubernetes cluster while also touching briefly on relevant core concepts. This -page is _not_ a `HOWTO` for deploying services or troubleshooting a bad -cluster. Rather, it is meant to explain architectural considerations such that -current work can be built upon. Although, reading this document will help you -both deploy services in the OCF Kubernetes cluster and debug issues when they -arise. +page is _not_ a `HOWTO` for deploying services or troubleshooting a bad cluster. +Rather, it is meant to explain architectural considerations such that current +work can be built upon. Although, reading this document will help you both +deploy services in the OCF Kubernetes cluster and debug issues when they arise. ## Kubernetes @@ -30,8 +31,8 @@ cluster has three masters. ### Masters Kubernetes masters share state via [etcd][etcd-io], a distributed key-value -store (KVS) implementing the [Raft][raft] protocol. The three main goals of -Raft are: +store (KVS) implementing the [Raft][raft] protocol. The three main goals of Raft +are: 1. Leader elections in case of failure. 2. Log replication across all masters. @@ -56,11 +57,11 @@ read more [here][failure-tolerance]. Workers are the brawn in the Kubernetes cluster. While master nodes are constantly sharing data, managing the control plane (routing inside the -Kubernetes cluster), and scheduling services, workers primarily run -[pods][pod]. `kubelet` is the service that executes pods as dictated by the -control plane, performs health checks, and recovers from pod failures should -they occur. Workers also run an instance of `kube-proxy`, which forwards -control plane traffic to the correct `kubelet`. +Kubernetes cluster), and scheduling services, workers primarily run [pods][pod]. +`kubelet` is the service that executes pods as dictated by the control plane, +performs health checks, and recovers from pod failures should they occur. +Workers also run an instance of `kube-proxy`, which forwards control plane +traffic to the correct `kubelet`. ### Pods @@ -71,10 +72,9 @@ composed of several containers—the web container, static container, and worker container. In Kubernetes, together these three containers form one pod, and it is pods that can be scaled up or down. A failure in any of these containers indicates a failure in the entire pod. An astute reader might wonder: _if pods -can be broken down into containers, how can pods possibly be the smallest -unit?_ Do note that if one wished to deploy a singleton container, it would -still need to be wrapped in the pod abstraction for Kubernetes to do anything -with it. 
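+
+You can see the pod structure described above directly with `kubectl` (a quick
+sketch; substitute a real pod name from the first command's output):
+
+    kubectl get pods                  # every row here is one pod
+    kubectl describe pod <pod-name>   # lists the containers inside that pod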
+can be broken down into containers, how can pods possibly be the smallest unit?_ +Do note that if one wished to deploy a singleton container, it would still need +to be wrapped in the pod abstraction for Kubernetes to do anything with it. While pods are essential for understanding Kubernetes, when writing services we don't actually deal in pods but one further abstraction, @@ -108,19 +108,19 @@ re-run of [kubetool][puppetlabs-kubetool] to generate new `etcd server` and ### OCF Kubernetes Configuration -Currently, the OCF has three Kubernetes masters: (1) `deadlock`, (2) `coup`, -and (3) `autocrat`. A Container Networking Interface (`cni`) is the last piece +Currently, the OCF has three Kubernetes masters: (1) `deadlock`, (2) `coup`, and +(3) `autocrat`. A Container Networking Interface (`cni`) is the last piece required for a working cluster. The `cni`'s purpose is to faciltate intra-pod communication. `puppetlabs-kubernetes` supports two choices: `weave` and -`flannel`. Both solutions work out-the-box, and we've had success with -`flannel` thus far so we've stuck with it. +`flannel`. Both solutions work out-the-box, and we've had success with `flannel` +thus far so we've stuck with it. ## Getting traffic into the cluster One of the challenges with running Kubernetes on bare-metal is getting traffic into the cluster. Kubernetes is commonly deployed on `AWS`, `GCP`, or `Azure`, -so Kubernetes has native support for ingress on these providers. Since we are -on bare-metal, we designed our own scheme for ingressing traffic. +so Kubernetes has native support for ingress on these providers. Since we are on +bare-metal, we designed our own scheme for ingressing traffic. The figure below demonstrates a request made for `templates.ocf.berkeley.edu`. For the purpose of simplicity, we assume `deadlock` is the current `keepalived` @@ -158,16 +158,16 @@ Nginx][ingress-nginx]. Right now ingress is running as a [NodePort][nodeport] service on all workers (Note: we can easily change this to be a subset of workers if our cluster scales such that this is no longer feasible). The ingress worker will inspect the `Host` header and forward the request on to the -appropriate pod where the request is finally processed. _Do note that the -target pod is not necessarily on the same worker that routed the traffic_. +appropriate pod where the request is finally processed. _Do note that the target +pod is not necessarily on the same worker that routed the traffic_. ### Why didn't we use MetalLB? `MetalLB` was created so a bare-metal Kubernetes cluster could use `Type: LoadBalancer` in Service definitions. The problem is, in `L2` mode, it takes a -pool of IPs and puts your service on a random IP in that pool. How one makes -DNS work in this configuration is completely unspecified. We would need to +pool of IPs and puts your service on a random IP in that pool. How one makes DNS +work in this configuration is completely unspecified. We would need to dynamically update our DNS, which sounds like a myriad of outages waiting to happen. `L3` mode would require the OCF dedicating a router to Kubernetes. @@ -179,8 +179,8 @@ balancer and traffic coming into that port is routed accordingly. First, in Kubernetes we would emulate this behavior using `NodePort` services, and all Kubernetes documentation discourages this. Second, it's ugly. Every time we add a new service we need to modify the load balancer configuration in Puppet. With -our Kubernetes configuration we can add unlimited HTTP services without -touching Puppet. 
+our Kubernetes configuration we can add unlimited HTTP services without touching +Puppet. But wait! The Kubernetes documentation says not to use `NodePort` services in production, and you just said that above too! True, but we only run _one_ @@ -193,14 +193,18 @@ in production][soundcloud-nodeport]. [cncf]: https://cncf.io [etcd-io]: https://github.com/etcd-io/etcd [raft]: https://raft.github.io/raft.pdf -[failure-tolerance]: https://coreos.com/etcd/docs/latest/faq.html#what-is-failure-tolerance +[failure-tolerance]: + https://coreos.com/etcd/docs/latest/faq.html#what-is-failure-tolerance [pod]: https://kubernetes.io/docs/concepts/workloads/pods/pod/ [ocfweb]: https://github.com/ocf/ocfweb/tree/master/services -[deployment]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ +[deployment]: + https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ [kubernetes-module]: https://github.com/puppetlabs/puppetlabs-kubernetes [kubernetes-pki]: https://kubernetes.io/docs/setup/certificates [puppetlabs-kubetool]: https://github.com/puppetlabs/puppetlabs-kubernetes#Setup [nginx]: https://nginx.org/ [ingress-nginx]: https://github.com/kubernetes/ingress-nginx -[nodeport]: https://kubernetes.io/docs/concepts/services-networking/service/#nodeport -[soundcloud-nodeport]: https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic +[nodeport]: + https://kubernetes.io/docs/concepts/services-networking/service/#nodeport +[soundcloud-nodeport]: + https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic diff --git a/ocfweb/docs/docs/staff/backend/ldap.md b/ocfweb/docs/docs/staff/backend/ldap.md index 05f3936d4..ec3d5eff2 100644 --- a/ocfweb/docs/docs/staff/backend/ldap.md +++ b/ocfweb/docs/docs/staff/backend/ldap.md @@ -54,9 +54,9 @@ Attributes that define a POSIX group: ### `ldapsearch` For most staff, their primary interface to LDAP will be `ldapsearch`. -`ldapsearch` is a powerful program that allows queries of the LDAP database. -For most usage, you want to type in `-x`, which skips authentication. After -that you provide a search filter (in this case UID). +`ldapsearch` is a powerful program that allows queries of the LDAP database. For +most usage, you want to type in `-x`, which skips authentication. After that you +provide a search filter (in this case UID). Searching for an account: @@ -107,8 +107,8 @@ Kerberos and then run `ldapvi` all in one step. loginShell: /bin/bash calnetUid: 872544 -Now if you make changes to some attributes (say, change the shell to `tcsh`) -and try to save the temporary file which has been opened in a text editor: +Now if you make changes to some attributes (say, change the shell to `tcsh`) and +try to save the temporary file which has been opened in a text editor: 1 entry read add: 0, rename: 0, modify: 1, delete: 0 @@ -121,8 +121,8 @@ You can enter `v` to view the LDIF change record (or `?` for help). replace: loginShell loginShell: /bin/tcsh -You can enter `y` to apply changes, `q` to save the LDIF change record as a -file in your current directory, or `Q` to discard. +You can enter `y` to apply changes, `q` to save the LDIF change record as a file +in your current directory, or `Q` to discard. 
### `ldapadd` diff --git a/ocfweb/docs/docs/staff/backend/libvirt.md b/ocfweb/docs/docs/staff/backend/libvirt.md index 61fa0aad3..93809d0f5 100644 --- a/ocfweb/docs/docs/staff/backend/libvirt.md +++ b/ocfweb/docs/docs/staff/backend/libvirt.md @@ -39,11 +39,11 @@ On the hypervisor, run `sudo virsh start `. ### How do I turn off a VM? -You can SSH into the VM and run the `shutdown` command, or you can run -`sudo virsh shutdown ` on the hypervisor which hosts it. +You can SSH into the VM and run the `shutdown` command, or you can run `sudo +virsh shutdown ` on the hypervisor which hosts it. -If it's a public-facing VM (e.g. tsunami), remember to give a positive amount -of time to the shutdown command, so users have adequate warning. +If it's a public-facing VM (e.g. tsunami), remember to give a positive amount of +time to the shutdown command, so users have adequate warning. ### How do I make a VM automatically turn on when the hypervisor boots? @@ -91,8 +91,8 @@ On the hypervisor: or delete it. You may want to also dump the contents of the disk to a file, compressing it, and placing that file in `/opt/backups/live/misc/servers` on the server which contains backups (which is `hal` at the time of this - writing). You may also want to save the VM's XML definition by running - `sudo virsh dumpxml [vm-name] > [vm-name].xml` and placing it in the same + writing). You may also want to save the VM's XML definition by running `sudo + virsh dumpxml [vm-name] > [vm-name].xml` and placing it in the same aforementioned directory. ### How do I move a VM from one host to another? @@ -123,8 +123,8 @@ won't work if you're trying to diagnose boot problems. On the hypervisor, run `sudo virsh edit ` to edit the VM's XML definition. -To query and modify virtual hardware state for your vm, use the following commands, -for RAM: +To query and modify virtual hardware state for your vm, use the following +commands, for RAM: virsh dommemstat [[--config] [--live] | [--current]] virsh setmaxmem [[--config] [--live] | [--current]] diff --git a/ocfweb/docs/docs/staff/backend/mail.md b/ocfweb/docs/docs/staff/backend/mail.md index 52d49d728..300913ec2 100644 --- a/ocfweb/docs/docs/staff/backend/mail.md +++ b/ocfweb/docs/docs/staff/backend/mail.md @@ -5,14 +5,14 @@ mail, excluding staff use of Google Apps. Mail sent by users, websites, virtual hosts, or basically anything else goes through here. Received mail to @ocf.berkeley.edu is forwarded to the address in the `mail` -attribute of the LDAP account entry (or the [aliases table](https://github.com/ocf/puppet/blob/master/modules/ocf_mail/files/site_ocf/aliases)) +attribute of the LDAP account entry (or the [aliases +table](https://github.com/ocf/puppet/blob/master/modules/ocf_mail/files/site_ocf/aliases)) or rejected; nothing is stored. -Received virtual host mail is forwarded to the address stored in a MySQL -table. Outgoing virtual host mail is also via anthrax, which uses SMTP -authentication (passwords checked against `crypt(3)`'d passwords in a MySQL -table). [[There's a whole page with more details about vhost mail.|doc -staff/backend/mail/vhost]] +Received virtual host mail is forwarded to the address stored in a MySQL table. +Outgoing virtual host mail is also via anthrax, which uses SMTP authentication +(passwords checked against `crypt(3)`'d passwords in a MySQL table). [[There's a +whole page with more details about vhost mail.|doc staff/backend/mail/vhost]] Mail originating anywhere inside the OCF relays through anthrax. 
diff --git a/ocfweb/docs/docs/staff/backend/mail/vhost.md b/ocfweb/docs/docs/staff/backend/mail/vhost.md index 097e4333a..445b081ef 100644 --- a/ocfweb/docs/docs/staff/backend/mail/vhost.md +++ b/ocfweb/docs/docs/staff/backend/mail/vhost.md @@ -1,7 +1,7 @@ [[!meta title="Virtual hosted mail"]] -**Note: This page is designed for OCF staffers and is a technical description -of the service. For information or help using it, see [[our page about it|doc +**Note: This page is designed for OCF staffers and is a technical description of +the service. For information or help using it, see [[our page about it|doc services/vhost/mail]].** Virtual hosting mail allows groups to receive mail at `@group.b.e` addresses, @@ -24,13 +24,13 @@ and send from those same addresses. It complements our web hosting nicely. ## Technical implementation There is a database on our MySQL host for storing email vhost information. It -has one table, `addresses`, with columns for the incoming address, password, -and forwarding addresses (among others). +has one table, `addresses`, with columns for the incoming address, password, and +forwarding addresses (among others). It has one view, `domains`, which is generated from the `addresses` table. This -is only used to make the queries Postfix makes simpler. In particular, you -never need to update MySQL to add forwarding to a domain; it's entirely based -on `~staff/vhost/vhost-mail.conf`. +is only used to make the queries Postfix makes simpler. In particular, you never +need to update MySQL to add forwarding to a domain; it's entirely based on +`~staff/vhost/vhost-mail.conf`. ocflib has simple functions for interacting with this database (see `pydoc3 ocflib.vhost.mail`). diff --git a/ocfweb/docs/docs/staff/backend/munin.md b/ocfweb/docs/docs/staff/backend/munin.md index 907361b76..808cf4530 100644 --- a/ocfweb/docs/docs/staff/backend/munin.md +++ b/ocfweb/docs/docs/staff/backend/munin.md @@ -1,6 +1,8 @@ [[!meta title="Munin"]] -**NOTE:** We are currently in the process of migrating many of our monitoring services to Prometheus. For more information, visit the documentation page for Prometheus [[here|doc staff/backend/prometheus]]. +**NOTE:** Munin has been deprecated at the OCF in favor of Prometheus. For more +information, visit the documentation page for Prometheus [[here|doc +staff/backend/prometheus]]. We use [Munin](https://munin.ocf.berkeley.edu) to provide real-time monitoring of our hardware. The master is [[dementors|doc staff/backend/servers]] which @@ -13,10 +15,10 @@ Additionally, we don't receive email alerts for staff VMs. ## Automated alerts -Munin sends mail to root whenever certain stats run out of bounds for a -machine, e.g. if disk usage goes above 92%. Some plugins have configurable -warning and critical levels for each field, which are usually set in the node -config like so: +Munin sends mail to root whenever certain stats run out of bounds for a machine, +e.g. if disk usage goes above 92%. Some plugins have configurable warning and +critical levels for each field, which are usually set in the node config like +so: ``` [pluginname] @@ -30,11 +32,10 @@ underscores, the display name for a variable's warning levels takes the form `fieldname.warning` or `fieldname.critical`. When `munin-limits` finds a variable in warning or critical range, it pipes the -alert text to [another script][mail-munin-alert] which filters out -uninteresting or noisy messages and emails the rest to root. 
Munin itself isn't -very flexible about disabling alerts from plugins, so, if there is a noisy -variable you want to ignore alerts for, you can add it to the list of -`IGNORED_WARNINGS`. +alert text to [another script][mail-munin-alert] which filters out uninteresting +or noisy messages and emails the rest to root. Munin itself isn't very flexible +about disabling alerts from plugins, so, if there is a noisy variable you want +to ignore alerts for, you can add it to the list of `IGNORED_WARNINGS`. ## Custom plugins @@ -62,6 +63,9 @@ field1.warning min:max ... ``` -[gen-munin-nodes]: https://github.com/ocf/puppet/blob/master/modules/ocf_munin/files/gen-munin-nodes -[mail-munin-alert]: https://github.com/ocf/puppet/blob/master/modules/ocf_munin/templates/mail-munin-alert.erb -[ocf_munin_plugin]: https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/munin/plugin.pp +[gen-munin-nodes]: + https://github.com/ocf/puppet/blob/master/modules/ocf_munin/files/gen-munin-nodes +[mail-munin-alert]: + https://github.com/ocf/puppet/blob/master/modules/ocf_munin/templates/mail-munin-alert.erb +[ocf_munin_plugin]: + https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/munin/plugin.pp diff --git a/ocfweb/docs/docs/staff/backend/printhost.md b/ocfweb/docs/docs/staff/backend/printhost.md index 07b2a8640..a2b93413f 100644 --- a/ocfweb/docs/docs/staff/backend/printhost.md +++ b/ocfweb/docs/docs/staff/backend/printhost.md @@ -5,13 +5,12 @@ The OCF's print server is based around two components: [CUPS][cups], the standard UNIX print server, and a custom print accounting system contained in the ocflib API. CUPS is responsible for receiving print jobs over the network, -converting documents to a printer-friendly format, and delivering processed -jobs to one of the available printers. The OCF's print accounting system, -nicknamed enforcer after one of the scripts, plugs into CUPS as a hook that -looks at jobs before and after going to the printer. It records jobs in a -database that keeps track of how many pages each user has printed, rejecting -jobs that go over quota. The high level flow of data through the print system -looks like this: +converting documents to a printer-friendly format, and delivering processed jobs +to one of the available printers. The OCF's print accounting system, nicknamed +enforcer after one of the scripts, plugs into CUPS as a hook that looks at jobs +before and after going to the printer. It records jobs in a database that keeps +track of how many pages each user has printed, rejecting jobs that go over +quota. The high level flow of data through the print system looks like this: ``` [Application] @@ -64,8 +63,8 @@ becomes available to print it. The document is converted into a more printer-friendly format before it actually reaches the printer. Once it's ready to print, it is sent to the printer via some backend such as IPP. -Finally, the printer accepts a PostScript document as raw data and prints it -out (some also support raster formats). This part of the process is largely +Finally, the printer accepts a PostScript document as raw data and prints it out +(some also support raster formats). This part of the process is largely controlled by the printer's onboard configuration, which can be modified by visiting the printer's IP over the web (e.g. `https://papercut/`). 
In the OCF's case, security is provided by an access control list (ACL) which accepts print @@ -82,12 +81,12 @@ special call format, plus CUPS-specific environment variables, and converts files from one format to another while adding special formatting options like duplex mode. -CUPS uses not just one, but potentially several filters to get the document -into its final format. For example, a PDF file might go through `pdftops` to -convert it to PostScript, then `pstops` to insert print job options such as -duplexing, then, finally, a device-specific filter such as `hpcups`. Each -filter is associated with an internal "cost", and CUPS picks the path with the -least total cost to print the document. +CUPS uses not just one, but potentially several filters to get the document into +its final format. For example, a PDF file might go through `pdftops` to convert +it to PostScript, then `pstops` to insert print job options such as duplexing, +then, finally, a device-specific filter such as `hpcups`. Each filter is +associated with an internal "cost", and CUPS picks the path with the least total +cost to print the document. At the OCF, print jobs are all processed by a single filter, [ocfps][ocfps], which converts raw PDFs to rasterized, printable PostScript. It calls on a @@ -96,21 +95,22 @@ the result and the rest of the arguments to standard CUPS filters. So far, this has given us the fewest headaches in terms of malformatted output and printer errors. -[ocfps]: https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/files/ocfps +[ocfps]: + https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/files/ocfps ### Drivers -In order to know what job options are available for a particular printer and -how to convert documents to a printable format, CUPS requires large config -files called PostScript Printer Drivers (PPDs). The OCF uses a modified HP PPD -for the [M806][m806]. There are two versions of it: one which only allows -double-sided printing and one which only allows single-sided. This is how we -implement the "double" and "single" classes. The PPDs tell CUPS to use `ocfps` -to convert documents to PostScript, plus they turn on economode so we can -afford the toner. +In order to know what job options are available for a particular printer and how +to convert documents to a printable format, CUPS requires large config files +called PostScript Printer Drivers (PPDs). The OCF uses a modified HP PPD for the +[M806][m806]. There are two versions of it: one which only allows double-sided +printing and one which only allows single-sided. This is how we implement the +"double" and "single" classes. The PPDs tell CUPS to use `ocfps` to convert +documents to PostScript, plus they turn on economode so we can afford the toner. -[m806]: https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/templates/cups/ppd/m806.ppd.epp +[m806]: + https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/templates/cups/ppd/m806.ppd.epp ## Print accounting @@ -131,7 +131,8 @@ that cancels the job. Otherwise, it logs successful print jobs in the database and emails users in the case a job fails. [Tea4CUPS]: https://wiki.debian.org/Tea4CUPS -[enforcer]: https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/files/enforcer +[enforcer]: + https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/files/enforcer [ocflib.printing]: https://github.com/ocf/ocflib/tree/master/ocflib/printing @@ -139,20 +140,20 @@ and emails users in the case a job fails. 
After printing a document from a desktop, lab visitors are notified when pages are subtracted from their quota by a little popup notification. This is done by -a short daemon script, [notify script][notify], which starts upon login and -runs the [[paper command|doc staff/scripts/paper]] every minute to see if the -quota has changed. +a short daemon script, [notify script][notify], which starts upon login and runs +the [[paper command|doc staff/scripts/paper]] every minute to see if the quota +has changed. In the future, it would be nice to have a more robust notification system where enforcer pushes notifications to desktops while a job is printing. This would -allow for richer notifications to be displayed; namely, alerts to show when -a job has started or finished printing, whether the job printed successfully, -and whether it went over quota. Current thinking is that this could be -implemented by broadcasting notifications to the whole network, or just the -desktops, and modifying the notify script to listen for messages about the -current user. +allow for richer notifications to be displayed; namely, alerts to show when a +job has started or finished printing, whether the job printed successfully, and +whether it went over quota. Current thinking is that this could be implemented +by broadcasting notifications to the whole network, or just the desktops, and +modifying the notify script to listen for messages about the current user. -[notify]: https://github.com/ocf/puppet/blob/master/modules/ocf_desktop/files/xsession/notify +[notify]: + https://github.com/ocf/puppet/blob/master/modules/ocf_desktop/files/xsession/notify ## See also @@ -164,4 +165,5 @@ current user. CUPS info as well) [ocf_printhost]: https://github.com/ocf/puppet/tree/master/modules/ocf_printhost -[cups-samba]: https://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/CUPS-printing.html +[cups-samba]: + https://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/CUPS-printing.html diff --git a/ocfweb/docs/docs/staff/backend/prometheus.md b/ocfweb/docs/docs/staff/backend/prometheus.md index 55f488bba..d49b978d6 100644 --- a/ocfweb/docs/docs/staff/backend/prometheus.md +++ b/ocfweb/docs/docs/staff/backend/prometheus.md @@ -1,73 +1,124 @@ [[!meta title="Prometheus"]] -We use Prometheus to provide real-time monitoring of our [[hardware|doc staff/backend]]. The master is [[dementors|doc staff/backend/servers]] which +We use Prometheus to provide real-time monitoring of our [[hardware|doc +staff/backend]]. The master is [[dementors|doc staff/backend/servers]] which uses the Node Exporter to collect data from other servers. We monitor servers, desktops, and staff VMs, but not the hozer boxes. -Additionally, we don't receive email alerts for staff VMs. Monitoring for the networking switch, blackhole, is currently under development. +Additionally, we don't receive email alerts for staff VMs. Monitoring for the +networking switch, blackhole, is currently under development. ## Alerts -Alerts can be viewed at [prometheus.ocf.berkeley.edu/alerts](https://prometheus.ocf.berkeley.edu/alerts). They are configured at [this folder][prometheus-puppet] in the Puppet configs. +Alerts can be viewed at +[prometheus.ocf.berkeley.edu/alerts](https://prometheus.ocf.berkeley.edu/alerts). +They are configured at [this folder][prometheus-puppet] in the Puppet configs. -Alerts can additionally be configured using the [alert manager](prometheus.ocf.berkeley.edu/alertmanager). 
Alertmanager handles notifications for alerts via communication through email and Slack. Alerts can be inhibited or silenced. Alertmanager documentation can be found [here](https://prometheus.io/docs/alerting/alertmanager/).
+Alerts can additionally be configured using the [alert
+manager](https://prometheus.ocf.berkeley.edu/alertmanager). Alertmanager handles
+notifications for alerts via communication through email and Slack. Alerts can
+be inhibited or silenced. Alertmanager documentation can be found
+[here](https://prometheus.io/docs/alerting/alertmanager/).
 
 Alerts are currently under development and may not be fully comprehensive.
 
 ## Metrics
 
-Prometheus uses [metrics](https://prometheus.io/docs/concepts/metric_types/) to collect and visualize different types of data.
+Prometheus uses [metrics](https://prometheus.io/docs/concepts/metric_types/) to
+collect and visualize different types of data.
 
-The main way Prometheus collects metrics in the OCF is [Node Exporter](https://github.com/prometheus/node_exporter). Another important exporter we use is the [SNMP Exporter](https://github.com/prometheus/snmp_exporter) which monitors information from printers, and possibly in the future, network switches.
+The main way Prometheus collects metrics in the OCF is [Node
+Exporter](https://github.com/prometheus/node_exporter). Another important
+exporter we use is the [SNMP
+Exporter](https://github.com/prometheus/snmp_exporter) which monitors
+information from printers, and possibly in the future, network switches.
 
-A full list of exporters is available in the [Prometheus documentation](https://prometheus.io/docs/instrumenting/exporters/). In order to take advantage of these exporters, we define them in the [Puppet config for the Prometheus server][puppet-config].
+A full list of exporters is available in the [Prometheus
+documentation](https://prometheus.io/docs/instrumenting/exporters/). In order to
+take advantage of these exporters, we define them in the [Puppet config for the
+Prometheus server][puppet-config].
 
 ### Custom Metrics
 
 There are three main ways to generate custom metrics:
 
-1. If metrics can be generated from a VM, run a script on a cronjob that writes to `/srv/prometheus`. These automatically get bundled into Node Exporter. We do this for CUPS monitoring - [here is an example of this in practice](https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/manifests/monitor.pp).
-2. Run a metrics server over HTTP and add them manually to the Puppet config. This is the most ideal method of using a prewritten exporter, like the Apache or Postfix exporters, both of which we use. An example of this is in the [Prometheus server config][puppet-config].
-3. Run your exporter in Kubernetes if it doesn't matter which host it runs on. This is how we run the SNMP exporter. Again, this is done in the [Prometheus server config][puppet-config].
+1. If metrics can be generated from a VM, run a script on a cronjob that writes
+   to `/srv/prometheus`. These automatically get bundled into Node Exporter. We
+   do this for CUPS monitoring - [here is an example of this in
+   practice](https://github.com/ocf/puppet/blob/master/modules/ocf_printhost/manifests/monitor.pp).
+2. Run a metrics server over HTTP and add them manually to the Puppet config.
+   This is the most ideal method of using a prewritten exporter, like the Apache
+   or Postfix exporters, both of which we use. An example of this is in the
+   [Prometheus server config][puppet-config].
+3. 
Run your exporter in Kubernetes if it doesn't matter which host it runs on.
+   This is how we run the SNMP exporter. Again, this is done in the [Prometheus
+   server config][puppet-config].
 
 ## Custom Queries
 
-Prometheus supports querying a wide variety of metrics. (For a full list, go to [Prometheus](https://prometheus.ocf.berkeley.edu) and use the "insert metric at cursor" dropdown.) A basic query comes in the form:
+Prometheus supports querying a wide variety of metrics. (For a full list, go to
+[Prometheus](https://prometheus.ocf.berkeley.edu) and use the "insert metric at
+cursor" dropdown.) A basic query comes in the form:
 
 ```
 metric{label="value", label2="value2", ...}
 ```
 
 Some labels used frequently are:
- - **instance:** The name of the device that the data was collected from. Some examples are `papercut`, `avalanche`, or `supernova`.
- - **host_type:** The type of device that is being queried. Valid types are `desktop`, `server`, and `staffvm`.
- - **job:** The name of the job/exporter that collected the data. Some examples are `node`, `printer`, and `slurm`.
+ - **instance:** The name of the device that the data was collected from. Some
+   examples are `papercut`, `avalanche`, or `supernova`.
+ - **host_type:** The type of device that is being queried. Valid types are
+   `desktop`, `server`, and `staffvm`.
+ - **job:** The name of the job/exporter that collected the data. Some examples
+   are `node`, `printer`, and `slurm`.
 
-For example, if you would like to view the total RAM installed on each of the [[servers|doc staff/backend/servers]] you can query `node_memory_Active_bytes{host_type="server"}`.
+For example, if you would like to view the total RAM installed on each of the
+[[servers|doc staff/backend/servers]] you can query
+`node_memory_MemTotal_bytes{host_type="server"}`.
 
 To view the per-second rate of a metric, use
 ```
 rate(metric{label="value",...})
 ```
-For example, the data sent in bytes/second over the past 5 minutes by `fallingrocks` can be retrieved using `rate(node_network_transmit_bytes_total{instance="fallingrocks"}`.
+For example, the data sent in bytes/second over the past 5 minutes by
+`fallingrocks` can be retrieved using
+`rate(node_network_transmit_bytes_total{instance="fallingrocks"}[5m])`.
 
-For more info about querying, see the [official documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).
+For more info about querying, see the [official
+documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).
 
-Queries are best used in conjunction with Grafana, as to produce more readable results and save them for future reference. The next section will give more details on how to do this.
+Queries are best used in conjunction with Grafana, to produce more readable
+results and save them for future reference. The next section will give more
+details on how to do this.
 
 ## Grafana
 
-The frontend for Prometheus is [Grafana][grafana], which displays statistics collected by Prometheus in a user-friendly manner. Some of the more useful dashboards available are: - **[Servers](https://ocf.io/serverstats):** Displays usage information for the physical servers and hypervisors (fallingrocks, riptide, etc). - **[Desktops](https://ocf.io/desktopstats):** Displays usage information for lab computers (cyclone, acid, etc). - **[Printers](https://ocf.io/printerstats):** Displays printer usage and resource information. - **[Mirrors](https://ocf.io/mirrorstats):** Displays information about mirror staleness. 
- **[HPC](hhttps://ocf.io/hpcstats):** Displays usage information for the [[HPC cluster|doc services/hpc]].
-
-There are more dashboards available, which can be accessed by clicking the dropdown arrow on the top left of the Grafana page.
-
-Configuring Grafana dashboards does not require editing Puppet configs. Simply go to [Grafana][grafana], login using your OCF account, and click the plus icon on the left toolbar to begin visually creating a custom dashboard. Grafana uses [Prometheus queries](https://prometheus.io/docs/prometheus/latest/querying/basics/) to fetch data to be displayed.
-
-
-[prometheus-puppet]: https://github.com/ocf/puppet/tree/master/modules/ocf_prometheus/files/rules.d
+The frontend for Prometheus is [Grafana][grafana], which displays statistics
+collected by Prometheus in a user-friendly manner. Some of the more useful
+dashboards available are:
+ - **[Servers](https://ocf.io/serverstats):** Displays usage information for the
+   physical servers and hypervisors (fallingrocks, riptide, etc).
+ - **[Desktops](https://ocf.io/desktopstats):** Displays usage information for
+   lab computers (cyclone, acid, etc).
+ - **[Printers](https://ocf.io/printerstats):** Displays printer usage and
+   resource information.
+ - **[Mirrors](https://ocf.io/mirrorstats):** Displays information about mirror
+   staleness.
+ - **[HPC](https://ocf.io/hpcstats):** Displays usage information for the [[HPC
+   cluster|doc services/hpc]].
+
+There are more dashboards available, which can be accessed by clicking the
+dropdown arrow on the top left of the Grafana page.
+
+Configuring Grafana dashboards does not require editing Puppet configs. Simply
+go to [Grafana][grafana], log in using your OCF account, and click the plus icon
+on the left toolbar to begin visually creating a custom dashboard. Grafana uses
+[Prometheus
+queries](https://prometheus.io/docs/prometheus/latest/querying/basics/) to fetch
+data to be displayed.
+
+
+[prometheus-puppet]:
+  https://github.com/ocf/puppet/tree/master/modules/ocf_prometheus/files/rules.d
 [grafana]: https://grafana.ocf.berkeley.edu
-[puppet-config]: https://github.com/ocf/puppet/blob/master/modules/ocf_prometheus/manifests/server.pp
+[puppet-config]:
+  https://github.com/ocf/puppet/blob/master/modules/ocf_prometheus/manifests/server.pp
diff --git a/ocfweb/docs/docs/staff/backend/rt.md b/ocfweb/docs/docs/staff/backend/rt.md
index 2535b506c..713a70e41 100644
--- a/ocfweb/docs/docs/staff/backend/rt.md
+++ b/ocfweb/docs/docs/staff/backend/rt.md
@@ -1,14 +1,14 @@
 [[!meta title="Request Tracker"]]
 
-[**Request Tracker**](https://rt.ocf.berkeley.edu/) is the ticketing system
-used by the OCF. It is the main way of keeping track of OCF-related activity.
-Some tickets are automatically created when emails are received at the queue's
-name (e.g. help@, devnull@, etc.). Staff can also create tickets by logging in
+[**Request Tracker**](https://rt.ocf.berkeley.edu/) is the ticketing system used
+by the OCF. It is the main way of keeping track of OCF-related activity. Some
+tickets are automatically created when emails are received at the queue's name
+(e.g. help@, devnull@, etc.). Staff can also create tickets by logging in
 directly to the web UI.
 
 ## Queues
 
-Tickets are assigned to queues, or organized boards. Manually-created
-tickets are found under:
+Tickets are assigned to queues, or organized boards. 
Manually-created tickets +are found under: - *bod* for meeting topics - *bureaucracy* for officer-related issues - *operations* for Operations Strategist work (opstaff) @@ -18,30 +18,36 @@ tickets are found under: ## Tickets ### Comment vs Reply -Much like the issues between Reply and Reply-All, the difference between Comment and Reply has led -to some mishaps. In the RT interface, *Reply* directly communicates with the poster, so look for the -last communication with the ticket opener. *Comment* doesn't directly communicate and is generally for -internal discussion. This can also be done through email, as RT defaults the reply-to field with the -queue mailing lists. Be careful here though: to comment through email, send the email to {queue}-comment -(i.e. help@ vs help-comment@). Also make sure that your reply does not include any of the comments, as in -make sure the trimmed comment is all the information you want released. +Much like the issues between Reply and Reply-All, the difference between Comment +and Reply has led to some mishaps. In the RT interface, *Reply* directly +communicates with the poster, so look for the last communication with the ticket +opener. *Comment* doesn't directly communicate and is generally for internal +discussion. This can also be done through email, as RT defaults the reply-to +field with the queue mailing lists. Be careful here though: to comment through +email, send the email to {queue}-comment (i.e. help@ vs help-comment@). Also +make sure that your reply does not include any of the comments, as in make sure +the trimmed comment is all the information you want released. ### Creation -They can be manually created through the *New Ticket in* button on the top right of the page. If doing -so to communicate to people outside of the OCF, add their email to the requestors field of the ticket and -leave the body blank. Afterwards, reply to the ticket to actually communicate with the person as the ticket -creation doesn't send emails to the requestor but does to staff. -Staff mailing lists are attached to the queue, so they usually don't have to be CC'd (i.e. *help* to help@). -You can set people to be owner, allowing people to keep track of assignments better. +They can be manually created through the *New Ticket in* button on the top right +of the page. If doing so to communicate to people outside of the OCF, add their +email to the requestors field of the ticket and leave the body blank. +Afterwards, reply to the ticket to actually communicate with the person as the +ticket creation doesn't send emails to the requestor but does to staff. Staff +mailing lists are attached to the queue, so they usually don't have to be CC'd +(i.e. *help* to help@). You can set people to be owner, allowing people to keep +track of assignments better. ### Modification -With any created ticket, it can be modified further. For queues like *bod*, some tickets should be discussed -more urgently than others. In the individual ticket page, one can change a ticket's priority value ([-10, 100] -recommended) by clicking on *The Basics*. Ownership is modified through *Reminders* and mailing list settings -can be modified through *People*. +With any created ticket, it can be modified further. For queues like *bod*, some +tickets should be discussed more urgently than others. In the individual ticket +page, one can change a ticket's priority value ([-10, 100] recommended) by +clicking on *The Basics*. 
Ownership is modified through *Reminders* and mailing
+list settings can be modified through *People*.
 
-Tickets may reference each other or there may be redundant tickets. If so, ticket relationships and merging
-can be done under *Links* for the ticket you want to keep/set relations for.
+Tickets may reference each other or there may be redundant tickets. If so,
+ticket relationships and merging can be done under *Links* for the ticket you
+want to keep/set relations for.
 
 #### Statuses
 - *new*: New tickets without staff responses
 - *open*: Tickets which have been replied to / are in progress
 - *stalled*: Tickets where action is blocked/waiting on something else; opening the ticket (e.g. by replying) will set it to open
 - *resolved*: Tickets which have been handled or need no further action
 - *rejected*: Tickets which we will not handle (i.e. personal support requests we can't do anything about)
 - *deleted*: Use sparingly, and generally used on obvious spam.
 
 ### Searching
-By default, the queues only show *open* or *new* tickets. To see other tickets, either search the ticket number in the
-top right or use *New Search* to do more advanced searching. If using the latter, don't forget to press either *Add these
-terms and Search* or *Update formate and Search*. Search arguments can also be saved for later use, as seen with the
-ocfstaff saved searches (bother a staff member to see these searches).
+By default, the queues only show *open* or *new* tickets. To see other tickets,
+either search the ticket number in the top right or use *New Search* to do more
+advanced searching. If using the latter, don't forget to press either *Add these
+terms and Search* or *Update format and Search*. Search arguments can also be
+saved for later use, as seen with the ocfstaff saved searches (bother a staff
+member to see these searches).
diff --git a/ocfweb/docs/docs/staff/backend/switch.md b/ocfweb/docs/docs/staff/backend/switch.md
index f36440195..0357207f1 100644
--- a/ocfweb/docs/docs/staff/backend/switch.md
+++ b/ocfweb/docs/docs/staff/backend/switch.md
@@ -2,25 +2,27 @@
 
 We use an [Arista 7050SX-64][primary-switch] 10GbE switch as a primary switch
 for our servers, and two [Arista 7048T-A][secondary-switch] 1GbE switches for
-desktops and management respectively. These devices were donated to us by
-Arista Networks in Fall 2018. Each device in the lab connects first to the back
-of a patch panel, and then from a port on the patch panel to a port on one of
-the 7048T-As via Cat6. The servers connect directly to the 7050SX via SFP+ DACs,
-as does our uplink to IST, through a Cat6 to 1GbE SFP+ converter. The 7048T's
+desktops and management respectively. These devices were donated to us by Arista
+Networks in Fall 2018. Each device in the lab connects first to the back of a
+patch panel, and then from a port on the patch panel to a port on one of the
+7048T-As via Cat6. The servers connect directly to the 7050SX via SFP+ DACs, as
+does our uplink to IST, through a Cat6 to 1GbE SFP+ converter. The 7048T's
 connect to the 7050SX through their SFP+ uplink ports.
 
-We do not currently use many of the managed features of the switches, mostly using
-them to provide layer 2 connectivity. Our previous switch, a Cisco Catalyst 2960S,
-was used for some time to drop [Spanning-tree protocol BPDUs][stp] and [IPv6 Router Advertisements][ipv6-ra]
-on all ports, as they caused network configuration problems on our end (creating loops
-with IST, or hosts autoconfiguring themselves via SLAAC).
-
-The one advanced feature that we do use on our primary switch is LACP. All of our
-hypervisors use Solarflare SFN8522-R2 dual-port 10GbE SFP+ NICs. Both ports are plugged
-into the switch, with each hypervisor occupying a vertical pair of switch ports. 
Each -vertical pair is configured into a channel-group and port-channel, numbered according -to the index of the pair, e.g. ports Ethernet 31 and Ethernet 32 are aggregated into -port-channel 16. The hypervisors are then configured to bond the two interfaces in LACP mode. +We do not currently use many of the managed features of the switches, mostly +using them to provide layer 2 connectivity. Our previous switch, a Cisco +Catalyst 2960S, was used for some time to drop [Spanning-tree protocol +BPDUs][stp] and [IPv6 Router Advertisements][ipv6-ra] on all ports, as they +caused network configuration problems on our end (creating loops with IST, or +hosts autoconfiguring themselves via SLAAC). + +The one advanced feature that we do use on our primary switch is LACP. All of +our hypervisors use Solarflare SFN8522-R2 dual-port 10GbE SFP+ NICs. Both ports +are plugged into the switch, with each hypervisor occupying a vertical pair of +switch ports. Each vertical pair is configured into a channel-group and +port-channel, numbered according to the index of the pair, e.g. ports Ethernet +31 and Ethernet 32 are aggregated into port-channel 16. The hypervisors are then +configured to bond the two interfaces in LACP mode. In the future, we'd like to make use of some of the more advanced features available on our switches, such as Port Security, to do things like preventing @@ -38,11 +40,11 @@ Password: blackhole.ocf.berkeley.edu> ``` -The switches can also be administered directly by connecting to their console port -with a USB serial console cable. +The switches can also be administered directly by connecting to their console +port with a USB serial console cable. -After logging in, one can enter an advanced configuration mode by typing "`enable`", -and then, before configuring specific interfaces, type "`config`". +After logging in, one can enter an advanced configuration mode by typing +"`enable`", and then, before configuring specific interfaces, type "`config`". ``` blackhole.ocf.berkeley.edu> enable @@ -53,8 +55,9 @@ blackhole.ocf.berkeley.edu(config-if-Et31-32)# ### Configuring LACP -After identifying which interfaces need to be aggregated into an LACP group on the -switch and calculating the group index, enter config mode and do the following: +After identifying which interfaces need to be aggregated into an LACP group on +the switch and calculating the group index, enter config mode and do the +following: ``` blackhole.ocf.berkeley.edu(config)# interface Ethernet 31-32 @@ -81,12 +84,16 @@ Port Channel Port-Channel7: ``` -More details can be found on the EOS guide online, in the [Port Channel section][lacp-guide]. +More details can be found on the EOS guide online, in the [Port Channel +section][lacp-guide]. -LACP also needs to be configured on the [[host side | doc staff/procedures/setting-up-lacp]]. +LACP also needs to be configured on the [[host side | doc +staff/procedures/setting-up-lacp]]. 
-[primary-switch]: https://www.arista.com/assets/data/pdf/Datasheets/7050SX-128_64_Datasheet.pdf -[secondary-switch]: https://www.arista.com/assets/data/pdf/Datasheets/7048T-A_DataSheet.pdf +[primary-switch]: + https://www.arista.com/assets/data/pdf/Datasheets/7050SX-128_64_Datasheet.pdf +[secondary-switch]: + https://www.arista.com/assets/data/pdf/Datasheets/7048T-A_DataSheet.pdf [stp]: https://en.wikipedia.org/wiki/Bridge_Protocol_Data_Unit [ipv6-ra]: https://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol [bsecure]: https://bsecure.berkeley.edu From e099cb43bb1bc5aff54869e3b13f37b28142d252 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Tue, 9 May 2023 15:20:24 +0000 Subject: [PATCH 5/7] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- ocfweb/docs/docs/staff/backend/backups.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ocfweb/docs/docs/staff/backend/backups.md b/ocfweb/docs/docs/staff/backend/backups.md index 328bdcd19..e7d803895 100644 --- a/ocfweb/docs/docs/staff/backend/backups.md +++ b/ocfweb/docs/docs/staff/backend/backups.md @@ -35,7 +35,7 @@ so you will need to manually `cd` into the directory. The snapshots are time- stamped, so you can find the snapshot you want to restore from by looking at the date string in the snapshot name. For example, if you wanted to restore the `public_html` directory of user `foo` with the backup from 2023-05-01, you -should enter the +should enter the ``` /backup/encrypted/rsnapshot/.sync/nfs/opt/homes/services/http/users/f/.zfs ``` From aacc093370a58e8e6318e61521e90025150d352b Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Tue, 9 May 2023 08:21:22 -0700 Subject: [PATCH 6/7] styling: line wrapped --- .../docs/staff/procedures/editing-docs.md | 18 ++++---- .../staff/procedures/granting-privileges.md | 43 ++++++++++--------- ocfweb/docs/docs/staff/procedures/hpc.md | 9 ++-- ocfweb/docs/docs/staff/procedures/new-host.md | 42 +++++++++--------- 4 files changed, 58 insertions(+), 54 deletions(-) diff --git a/ocfweb/docs/docs/staff/procedures/editing-docs.md b/ocfweb/docs/docs/staff/procedures/editing-docs.md index ed0b850a7..a12840d92 100644 --- a/ocfweb/docs/docs/staff/procedures/editing-docs.md +++ b/ocfweb/docs/docs/staff/procedures/editing-docs.md @@ -7,11 +7,11 @@ users and documentation for fellow staff. ## Overview Docs is currently a part of the OCF's main website, known as [ocfweb][ocfweb]. -Markdown syntax is parsed by [Mistune][mistune] with syntax highlighting done -by [Pygments][pygments]. +Markdown syntax is parsed by [Mistune][mistune] with syntax highlighting done by +[Pygments][pygments]. -We use a wiki-like syntax for making links within documentation and the -website, e.g. from [[Virtual Hosting|doc services/vhost#h4_hosting-badge]]: +We use a wiki-like syntax for making links within documentation and the website, +e.g. from [[Virtual Hosting|doc services/vhost#h4_hosting-badge]]: All virtual hosts on the OCF must include an [[OCF banner|doc services/vhost/badges]] on the front page that links to the [[OCF home page|home]]. @@ -29,17 +29,17 @@ GitHub. The editing process is like our other Git workflows: 4. Make a pull request. Once you make a pull request, it will automatically be tested by -[Jenkins][jenkins], the build server. Jenkins will also deploy your changes -once they have been merged. +[Jenkins][jenkins], the build server. 
Jenkins will also deploy your changes once +they have been merged. For simple changes, you can just click "Edit this Page" in the sidebar. This will open the file in GitHub, and walk you through the steps for either commiting on master or making a pull request. For more complicated ones, the repository's readme file has instructions for -testing and building the website so you can preview your edits before making -the commit. Also see [[our page on Git|doc staff/backend/git]] for further info -on working with OCF repos. +testing and building the website so you can preview your edits before making the +commit. Also see [[our page on Git|doc staff/backend/git]] for further info on +working with OCF repos. [markdown]: https://daringfireball.net/projects/markdown/syntax diff --git a/ocfweb/docs/docs/staff/procedures/granting-privileges.md b/ocfweb/docs/docs/staff/procedures/granting-privileges.md index 1b9c9b6b9..44bc3eeba 100644 --- a/ocfweb/docs/docs/staff/procedures/granting-privileges.md +++ b/ocfweb/docs/docs/staff/procedures/granting-privileges.md @@ -17,14 +17,14 @@ Then add or remove the appropriate `memberUid` attribute. ### `ocfroot` -Before giving anyone root privileges, make sure to obtain authorization from -the SM. +Before giving anyone root privileges, make sure to obtain authorization from the +SM. -Adding or removing people from `ocfroot` is similar to modifying -`ocfstaff`. However, if you are adding someone to root staff, in addition to -modifying LDAP, you will also have to create their `/root` and `/admin` -principals (if those don't already exist). For example, to create the -`/admin` principal, you would do: +Adding or removing people from `ocfroot` is similar to modifying `ocfstaff`. +However, if you are adding someone to root staff, in addition to modifying LDAP, +you will also have to create their `/root` and `/admin` principals (if those +don't already exist). For example, to create the `/admin` principal, you would +do: ``` $ kadmin @@ -42,14 +42,15 @@ Verify password - otherstaffer/admin@OCF.BERKELEY.EDU's Password: At the very first prompt, you are prompted for your password. It's safe to accept the defaults for the next few prompts. The last two prompts should be -filled in by the new root staffer; it will become the password for their -`/root` or `/admin` principal. +filled in by the new root staffer; it will become the password for their `/root` +or `/admin` principal. After you've created these principals, you'll need to grant them powers in the -[Kerberos ACL file in Puppet](https://github.com/ocf/puppet/blob/master/modules/ocf_kerberos/files/kadmind.acl). +[Kerberos ACL file in +Puppet](https://github.com/ocf/puppet/blob/master/modules/ocf_kerberos/files/kadmind.acl). -Also add the new root staffer to the Admin team in our GitHub org and grant -them RT admin privileges. +Also add the new root staffer to the Admin team in our GitHub org and grant them +RT admin privileges. ## Granting IRC chanop status @@ -59,12 +60,12 @@ TODO ## Granting firewall access -In order to gain access to the firewall, it is necessary to email someone -from the ASUC Student Union to ask them to fill out the Telecom Shopping -Cart on your behalf. Send them an email with the CalNet IDs of the people -you want to add to the firewall, and have an existing firewall administrator -authorize the request. 
As of Fall 2017, the -[Facilities Coordinator](https://studentunion.berkeley.edu/our-team/) has -worked to get new people added to the firewall, although it is likely that -this process will change in Spring/Fall 2018 when the firewall is changed as -part of the [bSecure](https://bsecure.berkeley.edu) project. +In order to gain access to the firewall, it is necessary to email someone from +the ASUC Student Union to ask them to fill out the Telecom Shopping Cart on your +behalf. Send them an email with the CalNet IDs of the people you want to add to +the firewall, and have an existing firewall administrator authorize the request. +As of Fall 2017, the [Facilities +Coordinator](https://studentunion.berkeley.edu/our-team/) has worked to get new +people added to the firewall, although it is likely that this process will +change in Spring/Fall 2018 when the firewall is changed as part of the +[bSecure](https://bsecure.berkeley.edu) project. diff --git a/ocfweb/docs/docs/staff/procedures/hpc.md b/ocfweb/docs/docs/staff/procedures/hpc.md index c8e1788eb..5f8bf5427 100644 --- a/ocfweb/docs/docs/staff/procedures/hpc.md +++ b/ocfweb/docs/docs/staff/procedures/hpc.md @@ -2,9 +2,9 @@ Access to the OCF's HPC [[cluster | doc services/hpc]] is controlled by means of an LDAP group named `ocfhpc`. If a user requests access to the cluster and -meets the basic access criteria, namely that they have specified what they -want to use the cluster for, simply run the following commands to add the user -to the LDAP group: +meets the basic access criteria, namely that they have specified what they want +to use the cluster for, simply run the following commands to add the user to the +LDAP group: abizer@supernova $ kinit abizer/admin ldapvi cn=ocfhpc ... @@ -13,6 +13,7 @@ to the LDAP group: Add another line to the list in the form of `memberUid: ` -Save and quit from your `$EDITOR`, and then reply to the request email with [this][hpc] template. +Save and quit from your `$EDITOR`, and then reply to the request email with +[this][hpc] template. [hpc]: https://templates.ocf.berkeley.edu/#hpc-new-user diff --git a/ocfweb/docs/docs/staff/procedures/new-host.md b/ocfweb/docs/docs/staff/procedures/new-host.md index a748340a8..9a61d0146 100644 --- a/ocfweb/docs/docs/staff/procedures/new-host.md +++ b/ocfweb/docs/docs/staff/procedures/new-host.md @@ -19,7 +19,8 @@ although this may not always be up to date. Hostnames must be based on (un)natural disasters; check out `~staff/server_name_ideas` if you're having trouble thinking of one. -[github-ip-list]: https://github.com/ocf/dns/blob/master/etc/zones/db.226.229.169.in-addr.arpa +[github-ip-list]: + https://github.com/ocf/dns/blob/master/etc/zones/db.226.229.169.in-addr.arpa [ips-sheet]: https://ocf.io/s/ips @@ -41,8 +42,8 @@ for the address `169.229.226.42`. If setting up a desktop, add a final argument ### Step 1.2. Add the DNS record -Clone the [DNS repo][github-dns] from GitHub, run `make`, and push a commit -with the new records. +Clone the [DNS repo][github-dns] from GitHub, run `make`, and push a commit with +the new records. [github-dns]: https://github.com/ocf/dns @@ -74,8 +75,8 @@ We have a handy script, `makevm`, that: * Waits for the Debian installer to finish * SSHs to the new server and sets its IP -To use it, log on to the target physical server (`riptide`, `hal`, `pandemic`, or `jaws`), -and run `makevm --help`. 
A typical invocation looks something like: +To use it, log on to the target physical server (`riptide`, `hal`, `pandemic`, +or `jaws`), and run `makevm --help`. A typical invocation looks something like: makevm -m 4096 -c 2 -s 15 arsenic 169.229.226.47 @@ -83,8 +84,8 @@ and run `makevm --help`. A typical invocation looks something like: ### Physical hosts All you need to do to run the Debian installer is PXE boot. On desktops, you -sometimes need to enable this in the BIOS before you can select it from the -boot menu. +sometimes need to enable this in the BIOS before you can select it from the boot +menu. Be warned that the default action (automated install) happens after 5 seconds. So don't PXE-boot your laptop and walk away! @@ -114,14 +115,14 @@ should run: 1. Edit `/etc/hostname` so it has the desired hostname instead of dhcp-_whatever_. 2. Run `hostname -F /etc/hostname`. - 3. Find out what the ethernet interface's name and current IP address is - by running `ip addr`. The ethernet interface should be named something - like `eno1` or `enp4s2`. (In the following instructions, substitute - `eno1` with the correct name.) - 4. Remove the incorrect IP addresses with `ip addr del $WRONG_ADDRESS - dev eno1`. - 5. Add the correct IP addresses with `ip addr add $CORRECT_ADDRESS - dev eno1`. Make sure that $CORRECT_ADDRESS includes the netmask. + 3. Find out what the ethernet interface's name and current IP address is by + running `ip addr`. The ethernet interface should be named something like + `eno1` or `enp4s2`. (In the following instructions, substitute `eno1` with + the correct name.) + 4. Remove the incorrect IP addresses with `ip addr del $WRONG_ADDRESS dev + eno1`. + 5. Add the correct IP addresses with `ip addr add $CORRECT_ADDRESS dev eno1`. + Make sure that $CORRECT_ADDRESS includes the netmask. 3. `puppet agent --test` @@ -129,14 +130,15 @@ should run: ## Step 4. Sign the Puppet cert and run Puppet On the puppetmaster, `sudo puppetserver ca list` to see pending requests. When -you see yours, use `sudo puppetserver ca sign --certname hostname.ocf.berkeley.edu`. +you see yours, use `sudo puppetserver ca sign --certname +hostname.ocf.berkeley.edu`. -Log back into the host and run `puppet agent --test` to start the Puppet -run. You may need to repeat this once or twice until the run converges. +Log back into the host and run `puppet agent --test` to start the Puppet run. +You may need to repeat this once or twice until the run converges. ### Step 4.1. Upgrade packages The first Puppet run and various other things may be broken if one or more -packages are out of date, e.g. Puppet. Remedy this with an `apt update && -apt upgrade`. +packages are out of date, e.g. Puppet. Remedy this with an `apt update && apt +upgrade`. From c1b0683b0857d4e8eb8feaccce324330b2fd8569 Mon Sep 17 00:00:00 2001 From: Jonathan Zhang Date: Tue, 9 May 2023 08:28:26 -0700 Subject: [PATCH 7/7] zfs --- ocfweb/docs/docs/staff/backend/backups.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ocfweb/docs/docs/staff/backend/backups.md b/ocfweb/docs/docs/staff/backend/backups.md index e7d803895..8f991c8da 100644 --- a/ocfweb/docs/docs/staff/backend/backups.md +++ b/ocfweb/docs/docs/staff/backend/backups.md @@ -1,7 +1,7 @@ [[!meta title="Backups"]] ## Backup Storage -We currently store our on-site backups across a RAID mirror on `hal`: +We currently store our on-site backups across a ZFS RAID1 mirror on `hal`: * `hal:/backup` (16 TB usable; 2x 16 TB WD drives in ZFS mirror)