Upgrade Effort March 6th to March 9th 2016 Hal Cluster
Here are the highlights of the effort to update the cluster. They will go here as I have time. I kept notes. This is a work in progress.
Progress: Completed. User Facing Changes: Believed Minor
The Torque and Moab server was moved from the smaller Dell R320 system known as mskcc-fe1 to a larger R620 system named hal-sched1, which is better suited for the role. Configuration was NOT changed. Versions remain the same. The main user facing change is that the job "full name" now ends with the hostname hal-sched1.local instead of mskcc-fe1.local.
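For scripts that parse or compare job full names, the hostname change may matter. The following is a minimal sketch, not part of the cluster tooling, assuming the full name has the usual Torque form of a sequence number, a dot, and then the server hostname. The function name `split_job_full_name` and the job number 12345 are hypothetical.

```python
def split_job_full_name(full_name):
    """Split a job full name into (sequence, server).

    Assumes the form '<sequence>.<server hostname>'. The split happens at
    the first dot only, so dots inside the hostname are preserved.
    """
    sequence, _, server = full_name.partition(".")
    return sequence, server


if __name__ == "__main__":
    # Hypothetical job number, shown with the old and the new server suffix.
    for name in ("12345.mskcc-fe1.local", "12345.hal-sched1.local"):
        sequence, server = split_job_full_name(name)
        print(sequence, server)
```

Keying on the sequence part alone, rather than on the whole string, avoids hard-coding either hostname in user scripts.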
Progress: Completed. User Facing Changes: Believed Minor
The environment had been running pretty much the same distro version (CentOS 6.4) as when SDSC originally deployed it almost three years ago. All existing Dell and Exxact nodes were brought up to the latest release of the CentOS 6.X line (6.7). These updates are now synchronized to the CBIO group repo system to provide the same versions of systems going forward.
User facing changes are believed to be minor. Almost all software in this environment comes from the source build modules. While I have not tested every single module, I did not see any compelling version changes in the updated system RPMs. Please report any such problems and we'll take care of them.
Progress: Completed. User Facing Changes: None
A minor patch release for GPFS itself was deployed to all servers and clients.
User Facing Changes: None
Dell has recommended that the firmware release that came on many of our replacement drives from Toshiba be moved to a higher version. The reason involves a very unlikely situation during a sudden power outage. Not all drives had the older version, but because full replication is disabled, array updates require taking the filesystem down. Drives containing the older firmware were upgraded.
User Facing Changes: None
Systems running for > 590 days tend to get behind on Dell firmware revisions. During the downtime, critical or important firmware updates were applied to all Dell equipment. This is a slow, time-consuming process, and for safety reasons I updated only one server at a time in case I needed to deal with a failure.
Progress: Completed. User Facing Changes: None directly, but allows expansion to occur beyond the limitations of the older switches and across racks
This is a very large item which I will detail more fully shortly. Basically, the entire pre-existing physical network of the Hal cluster was replaced, with a core switch, to allow for growth and the attachment of the several new nodes.
Progress: Still being worked on. User Facing Changes: There will be some additional queues and nodes shortly
Work on the nodes purchased by two groups continues. The process is being entirely automated via a new system for long term growth and maintenance.
Progress: Completed (I think). User Facing Changes: None
The management environment known as ROCKS, originally deployed by SDSC, has been removed and replaced with Puppet. Some remnants involving network services were still in partial use.
Progress: One matter open. User Facing Changes: None
A few older nodes continue to have periodic problems. One was repaired. One requires a replacement motherboard. The vendor has been notified.
A few of the newer nodes had some memory seating problems and bad DIMMs. Those were handled by a vendor technician.
Progress: Completed. User Facing Changes: None
There are a few items involving network and configuration that I will not be elaborating on in the public Git. Be assured that others and I spent time on those matters to improve the ongoing direction of this cluster. If you would like some of the details, I can arrange a private summary.