Upgrade Effort March 6th to March 9th 2016 Hal Cluster

tatarsky edited this page Mar 10, 2016 · 8 revisions

Here are the highlights of the effort to update the cluster. I kept notes and will add details here as I have time. This is a work in progress.

Move the Torque Moab Scheduler to Hal-Sched1

Progress: Completed. User Facing Changes: Believed Minor

The Torque and Moab server was moved from the smaller Dell R320 system known as mskcc-fe1 to a larger R620 system named hal-sched1 better suited for the role. Configuration was NOT changed. Versions remain the same. The main user facing change would be that the job "full name" now contains the hostname hal-sched1.local at the end of it instead of mskcc-fe1.local.
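Scripts that parse job names may want to avoid hard-coding the server hostname. The following is a minimal sketch, assuming the usual Torque form `<sequence>.<server>` for the full job name. The function name `split_job_id` and the sample job numbers are illustrative and not taken from this cluster.

```python
def split_job_id(full_name):
    """Split a Torque-style full job name into (sequence, server).

    Assumes the form '<sequence>.<server>', e.g. '12345.hal-sched1.local'.
    The sequence part may carry an array suffix such as '12345[7]'.
    """
    # Split on the first dot only; the server name itself contains dots.
    sequence, sep, server = full_name.partition(".")
    if not sep or not sequence or not server:
        raise ValueError(f"not a full job name: {full_name!r}")
    return sequence, server


# Hypothetical job names, before and after the scheduler move.
old = split_job_id("12345.mskcc-fe1.local")
new = split_job_id("12345.hal-sched1.local")
print(old)
print(new)
```

Comparing only the sequence part keeps such scripts working regardless of which host runs the scheduler.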

Update all servers and nodes to CBIO synchronized repo CentOS 6.7

Progress: Completed. User Facing Changes: Believed Minor

The environment had been running essentially the same distro version (CentOS 6.4) as when SDSC originally deployed it almost three years ago. All existing Dell and Exxact nodes were brought up to the latest release of the CentOS 6.X line (6.7). These updates are now synchronized to the CBIO group repo system to provide the same versions of systems going forward.

User facing changes are believed to be minor. Almost all software in this environment comes from the source build modules. While I have not tested every single module, I did not see any compelling version changes in the updated system RPMs. Please report any such problems and we'll take care of them.
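One way to look for version changes is to compare package listings captured before and after an update. This is a minimal sketch, assuming each listing holds one `name version-release` pair per line (for example, output of `rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n'`). The function names and the sample data are invented for illustration and are not from this cluster.

```python
def parse_listing(text):
    """Map package name -> version string from 'name version' lines.

    Lines that do not have exactly two fields are skipped. Packages
    installed in several versions at once (e.g. kernel) keep only the
    last version seen in this simple sketch.
    """
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def changed_packages(before, after):
    """Return {name: (old, new)} for packages whose version differs.

    A package missing from one side is reported with None on that side.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Invented sample listings, not real data from this cluster.
before = parse_listing("pkg-a 1.0-1.el6\npkg-b 2.3-4.el6\n")
after = parse_listing("pkg-a 1.0-1.el6\npkg-b 2.3-5.el6\npkg-c 0.9-1.el6\n")
for name, (old, new) in changed_packages(before, after).items():
    print(name, old, "->", new)
```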

Update GPFS version to last of the 3.5 release

Progress: Completed. User Facing Changes: None

A minor patch release for GPFS itself was deployed to all servers and clients.

Update per Dell recommendation a set of Toshiba firmware on drives in arrays

User Facing Changes: None

Dell has recommended that the Toshiba firmware release that came on many of our replacement drives be moved to a higher version. The reason involves a very unlikely situation during a sudden power outage. Not all drives had the older version, but because full replication is disabled, array updates require taking the filesystem down. Drives containing the older firmware were upgraded.

Update all Dell server firmware levels to current per Dell support requirements

User Facing Changes: None

Systems running for > 590 days tend to get behind on Dell firmware revisions. During the downtime, critical or important firmware updates took place on all Dell equipment. This is a slow, time-consuming process, and for safety reasons I updated only one server at a time in case I needed to deal with a failure.

Redo entire network physically and add dual links to all servers

Progress: Completed. User Facing Changes: None directly, but this allows expansion beyond the limitations of the older switches and across racks

This is a very large item which I will more fully detail shortly. Basically the entire pre-existing physical network of the Hal cluster was replaced with one built around a core switch, to allow for growth and the attachment of several new nodes.

Addition of several new nodes and two new head nodes

Progress: Still being worked on. User Facing Changes: There will be some additional queues and nodes shortly

Work on the nodes purchased by two groups continues. The process is being entirely automated via a new system for long term growth and maintenance.

Completely remove ROCKS from environment

Progress: Completed (I think). User Facing Changes: None

The management environment known as ROCKS, originally deployed by SDSC, has been removed and replaced with Puppet. Some remnants involving network services were still in partial use.

New deployment system for new nodes with CBIO management

Resolve some node hardware issues

Progress: One matter open. User Facing Changes: None

A few older nodes continue to have periodic problems. One was repaired. One requires a replacement motherboard. The vendor has been notified.

A few of the newer nodes had some memory seating problems and bad DIMMs. Those were handled by a vendor technician.

Internal Matters Not Documented Here

Progress: Completed. User Facing Changes: None

There are a few items involving network and configuration that I will not be elaborating on in the public Git. But be assured that I and others spent time on those matters to improve things in terms of the ongoing direction of this cluster. If you would like some of the details, I can arrange a private summary.