From 94ec82be5c6a5ac8458859f3d48d0c4c0dd7a802 Mon Sep 17 00:00:00 2001
From: Ernst Bablick
Date: Sun, 29 Dec 2024 20:33:46 +0100
Subject: [PATCH] EH: CS-903: auto (un)installation not covered by the installation guide

---
 .../manual/installation-guide/02_download.md  |  58 ++++
 ...{02_installation.md => 03_installation.md} | 313 +++++++++++++-----
 .../04_backup_and_restore.md                  |  35 ++
 .../manual/installation-guide/05_upgrade.md   | 184 ++++++++++
 .../installation-guide/06_troubleshooting.md  |  74 +++++
 .../dist/util/install_modules/inst_execd.sh   |   5 +-
 6 files changed, 588 insertions(+), 81 deletions(-)
 create mode 100644 doc/markdown/manual/installation-guide/02_download.md
 rename doc/markdown/manual/installation-guide/{02_installation.md => 03_installation.md} (68%)
 create mode 100644 doc/markdown/manual/installation-guide/04_backup_and_restore.md
 create mode 100644 doc/markdown/manual/installation-guide/05_upgrade.md
 create mode 100644 doc/markdown/manual/installation-guide/06_troubleshooting.md

diff --git a/doc/markdown/manual/installation-guide/02_download.md b/doc/markdown/manual/installation-guide/02_download.md
new file mode 100644
index 000000000..6bf080004
--- /dev/null
+++ b/doc/markdown/manual/installation-guide/02_download.md
@@ -0,0 +1,58 @@
# Download Product Packages

For clusters intended for production environments, it is highly recommended to use the pre-built packages from xxQS_COMPANY_NAMExx. xxQS_COMPANY_NAMExx ensures that all source code components used to build the packages are compatible with each other. The packages are built and carefully tested.

xxQS_COMPANY_NAMExx offers patch releases for pre-built packages, along with support services to ensure that production clusters receive the latest fixes and security enhancements. Professional engineers are available to provide assistance in case of any questions.

Additionally, the packages from xxQS_COMPANY_NAMExx contain product enhancements that are not available in packages that you build yourself.

To receive a quote, please contact us at [xxQS_COMPANY_MAILxx](mailto:xxQS_COMPANY_MAILxx) or fill in and send the following [Questionnaire](https://www.hpc-gridware.com/quote/).

The core xxQS_NAMExx code is available on GitHub. You can clone the required repositories and build the core product yourself, or use the nightly build. Please note that we do not provide support for these packages. It is not recommended to use the nightly build for production systems as it contains untested code that is still in development.

The download of the pre-built packages is available at [xxQS_COMPANY_NAMExx Downloads](https://www.hpc-gridware.com/download-main).

For a product installation you need a set of *tar.gz* files. Required are:

* the common package containing architecture-independent files (file name *gcs-`<version>`-common.\**, e.g. *gcs-9.0.0-common.tar.gz*)

* one architecture-specific package for each supported compute platform (file names *gcs-`<version>`-bin-`<os>`-`<arch>`.\**, e.g. *gcs-9.0.0-bin-lx-amd64.tar.gz*)

* the gcs-`<version>`-md5sum.txt file

Additionally, you will also find product documentation, release notes and other packages for product extensions on the download page.

Once you have downloaded all packages, you can test and install them at the designated installation location. Please note that in the instructions below the placeholder `<inst_dir>` refers to the absolute path of the installation directory, while `<download_dir>` refers to the directory containing the downloaded files.
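The numbered steps below describe the procedure in detail. As a compact sketch of the whole sequence on a Linux host, assuming GNU coreutils (where the checksum tool is `md5sum` rather than `md5`), a checksum file in `md5sum` format, and the hypothetical paths `~/Downloads` and `/opt/gcs` standing in for `<download_dir>` and `<inst_dir>`:

```
# Sketch only: example paths, verify against the detailed steps below.
DOWNLOAD_DIR=~/Downloads        # stands in for <download_dir>
INST_DIR=/opt/gcs               # stands in for <inst_dir>

cp "$DOWNLOAD_DIR"/gcs-* "$INST_DIR"
cd "$INST_DIR"

# With a checksum file in md5sum format, -c verifies all packages in one go.
md5sum -c gcs-9.0.0-md5sum.txt

# The remaining commands must be executed as root (see step 3 below).
tar xfz gcs-*.tar.gz
export SGE_ROOT="$INST_DIR"
util/setfileperm.sh "$SGE_ROOT"
```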
1. Copy the packages from your download location into the installation directory.

   ```
   % cp <download_dir>/gcs-* <inst_dir>
   ```

2. Check that the files were downloaded correctly by calculating their MD5 checksums.

   ```
   % cd <inst_dir>
   % md5 gcs-*
   ...
   % cat gcs-9.0.0-md5sum.txt
   ...
   ```

   Compare the output of the `md5` command with that of the `cat` command. If one or more checksums are not correct then re-download the faulty files and repeat the previous steps, otherwise continue.

3. As user root, unpack the packages, set the SGE_ROOT variable manually, and execute the script *util/setfileperm.sh* to verify and adapt ownership and file permissions of the unpacked files.

   ```
   % su
   # cd <inst_dir>
   # tar xfz gcs-*.tar.gz
   # SGE_ROOT=<inst_dir>
   # util/setfileperm.sh $SGE_ROOT
   ```

4. If your `<inst_dir>` is located on a shared filesystem available on all hosts in the cluster then you can start the installation process.

[//]: # (Each file has to end with two empty lines)

diff --git a/doc/markdown/manual/installation-guide/02_installation.md b/doc/markdown/manual/installation-guide/03_installation.md
similarity index 68%
rename from doc/markdown/manual/installation-guide/02_installation.md
rename to doc/markdown/manual/installation-guide/03_installation.md
index 36a551ab0..01f2068c6 100644
--- a/doc/markdown/manual/installation-guide/02_installation.md
+++ b/doc/markdown/manual/installation-guide/03_installation.md
@@ -3,79 +3,9 @@

 Once you have gathered the necessary information as outlined in previous chapters, you may proceed with the installation
 process for xxQS_NAMExx.

-## Download Product Packages
-
-For clusters intended for production environments, it is highly recommended to use pre-built packages by xxQS_COMPANY_NAMExx.
-xxQS_COMPANY_NAMExx ensures that all source code components used to build the packages are compatible with each other.
-The packages are built and carefully tested.
-
-xxQS_COMPANY_NAMExx offers patch releases for pre-built packages, along with support services to ensure that productive
-clusters receive the latest fixes and security enhancements. Professional engineers are available to provide
-assistance in case of any questions.
-
-Additionally, the packages from xxQS_COMPANY_NAMExx contain product enhancements that would not be available in packages
-that you built yourself.
-
-To receive a quote, please contact us at [xxQS_COMPANY_MAILxx](mailto:xxQS_COMPANY_MAILxx) or fill and send following
-[Questionnaire](https://www.hpc-gridware.com/quote/).
-
-The core xxQS_NAMExx code is available on GitHub. You can clone the required repositories and build the core product
-yourself, or use the nightly build. Please note that we do not provide support for these packages. It is not
-recommended to use the nightly build for production systems as it contains untested code that is still in development.
-
-The download of the pre-built packages is available at [xxQS_COMPANY_NAMExx Downloads](https://www.hpc-gridware.com/download-main).
-
-For a product installation you need a set of *tar.gz* files. Required are:
-
-* the common package containing architecture independent files (the file names *gcs-``-common.\** e.g. *gcs-9.0.0-common.tar.gz*)
-
-* one architecture specific package for each supported compute platform (files with the names *gcs-``-bin-``-``.\** e.g. *gcs-9.0.0-bin-lx-amd64.tar.gz*)
-
-* the gcs-``-md5sum.txt file
-
-Additionally, you will also find product documentation, release notes and other packages for product extensions on the
-download page.
-
-Once you have downloaded all packages, you can test and install them at the designated installation location.
-Please note in the instructions below the placeholder `` refers to the absolute path of the
-installation directory, while `` refers to the directory containing the downloaded files.
-
-1. Copy the packages from your download location into the installation directory
-
-    ```
-    % cp /gcs-*
-    ```
-
-2. Check if the downloaded files where downloaded correctly by calculating the MD5 checksum.
-
-    ```
-    % cd
-    % md5 gcs-*
-    ...
-    % cat gcs-9.0.0-md5sum.txt
-    ...
-    ```
-
-   Compare the output of the md5 command with that of the cat command. If one or more checksums are not correct then
-   re-download the faulty files and repeat the previous steps, otherwise continue.
-
-3. Unpack the packages as root and set the SGE_ROOT variable manually and execute the script *util/setfileperm.sh* to verify and adapt ownership and file permissions of the unpacked files.
-
-    ```
-    % su
-    # cd
-    # tar xfz gcs-*.tar.gz
-    # SGE_ROOT=
-    # util/setfileperm.sh $SGE_ROOT
-    ```
-
-4. If your `` is located on a shared filesystem available on all hosts in the cluster then you can start the installation process.

## Manual Installation

This section covers the manual installation process on the command line. Note that the prerequisites outlined in the previous chapters must be met. If the hostname setup, usernames and service configuration are correct for all hosts that you intend to include in your cluster, then you can continue with the installation of the master service.

### Installation of the Master Service

@@ -613,19 +543,248 @@ Here are the steps required to complete the installation.

 You are reaching the end of the manual installation.

### Installation of the Execution Service

During the execution host installation procedure the following steps are performed:

* It is tested that the master service is running and that the execution host is able to communicate with the master service.

* An appropriate directory hierarchy is created as required by the `sge_execd` service.

* The `sge_execd` service is started and basic tests of its functionality are executed.

* The host is added to a default queue (optional).

Here are the steps required to complete the installation.

1. Log in as user root on an execution host.

2. Source the settings file that was created during the master service installation or set the SGE_ROOT environment variable manually. This Installation Guide assumes that the installation directory is available on all hosts in the same location.

   ```
   # . <inst_dir>/<cell>/common/settings.sh
   # cd $SGE_ROOT
   ```

3. Verify that the execution host has been declared an administrative host. Do this by executing the following `qconf` command on the master machine. The host list should contain the hostname of the new execution host. If it is missing, add the hostname to the list of administrative hosts by executing `qconf -ah <hostname>` on the master machine.

   ```
   # qconf -sh
   ...
   ```

4. Start the installation process by executing the `install_execd` script, then read and follow the given instructions.
+ + ``` + # ./install_execd + Welcome to the Cluster Scheduler execution host installation + ------------------------------------------------------------ + + If you haven't installed the Cluster Scheduler qmaster host yet, you must execute + this step (with >install_qmaster<) prior the execution host installation. + + For a successful installation you need a running Cluster Scheduler qmaster. It is + also necessary that this host is an administrative host. + + You can verify your current list of administrative hosts with + the command: + + # qconf -sh + + You can add an administrative host with the command: + + # qconf -ah + + The execution host installation will take approximately 5 minutes. + + Hit to continue >> + ``` + +6. Confirm the installation directory. The suggested default is the directory you set in the master service installation. + + ``` + Checking $SGE_ROOT directory + ---------------------------- + + The Cluster Scheduler root directory is: + + $SGE_ROOT = + + If this directory is not correct (e.g. it may contain an automounter + prefix) enter the correct path to this directory or hit + to use default [] >> + ``` + +7. Confirm the cell directory. The suggested default is the directory you set in the master service installation. You can enter a different cell name if you intend to start the execution service in a different cell. + + ``` + Cluster Scheduler cells + ----------------------- + + Please enter cell name which you used for the qmaster + installation or press to use [default] >> + ``` + +8. Confirm the detected execution daemon TCP/IP port number. + + ``` + Cluster Scheduler TCP/IP communication service + ---------------------------------------------- + + The port for sge_execd is set as service. + + sge_execd service set to port 6445 + + Hit to continue >> + ``` + +9. The installer does verify the local hostname resolution and if the current host is an administrative host. + + ``` + Checking hostname resolving + --------------------------- + + This hostname is known at qmaster as an administrative host. + + Hit to continue >> + +10. Specify the spooling directory for execution hosts + + ``` + Execd spool directory configuration + ----------------------------------- + + You defined a global spool directory when you installed the master host. + You can use that directory for spooling jobs from this execution host + or you can define a different spool directory for this execution host. + + ATTENTION: For most operating systems, the spool directory does not have to + be located on a local disk. The spool directory can be located on a + network-accessible drive. However, using a local spool directory provides + better performance. + + The spool directory is currently set to: + <</default/spool/>> + + Do you want to configure a different spool directory + for this host (y/n) [n] >> + ``` + +11. The installer will create a local configuration for the execution host. + + ``` + Creating local configuration + ---------------------------- + @ added "" to configuration list + Local configuration for host >< created. + + Hit to continue >> + ``` + +12. Now specify if you want to start the execution service automatically. + + ``` + execd startup script + -------------------- + + We can install the startup script that will + start execd at machine boot (y/n) [y] >> + ``` + +13. The execution service is started. + + ``` + Cluster Scheduler execution daemon startup + ------------------------------------------ + + Starting execution daemon. Please wait ... 
    starting sge_execd

    Hit <RETURN> to continue >>
    ```

13. Specify a queue for the new host.

    ```
    Adding a queue for this host
    ----------------------------

    We can now add a queue instance for this host:

       - it is added to the >allhosts< host group
       - the queue provides 32 slot(s) for jobs in all queues
         referencing the >allhosts< host group

    You do not need to add this host now, but before running jobs on this host
    it must be added to at least one queue.

    Do you want to add a default queue instance for this host (y/n) [y] >>
    ```

## Automatic Installation

The automatic installation is based on the manual installation process: the installer reads a configuration file that contains predefined answers to the questions that would normally be asked during an interactive installation. For an automatic installation this configuration file has to be prepared, and it has to be passed to the installation script as the argument of the `-auto` option.

The auto installation is also able to install services on remote hosts if passwordless `ssh` or `rsh` access is configured for the root user on the master machine.

1. Log in as root on the system where you intend to install a service.

2. Make a copy of the configuration template file and fill in the answers to the questions that are usually asked during the manual installation process (a short excerpt is sketched after these steps). If the root user has no write permission in $SGE_ROOT then choose a different path, but make sure that you preserve the file for the uninstallation process.

   ```
   $ cp $SGE_ROOT/util/install_modules/inst_template.conf $SGE_ROOT/my_template.conf
   $ vi $SGE_ROOT/my_template.conf
   ...
   ```

3. On the master machine start the master installation.

   ```
   cd $SGE_ROOT
   ./inst_sge -m -auto $SGE_ROOT/my_template.conf
   ```

4. If you have a list of hosts specified as the EXEC_HOST_LIST parameter in the configuration file and passwordless `ssh` or `rsh` access to those hosts, then you can install the execution service on those hosts remotely from the master machine.

   ```
   cd $SGE_ROOT
   ./inst_sge -x -auto $SGE_ROOT/my_template.conf
   ```

   If you have no passwordless `ssh` or `rsh` access to those hosts then you have to log in to each host and start the installation process manually for each host individually.

5. On shadow hosts install the shadow service.

   ```
   cd $SGE_ROOT
   ./inst_sge -sm -auto $SGE_ROOT/my_template.conf
   ```
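To give an impression of what the prepared file looks like, here is a hypothetical excerpt. The parameter names below follow the conventions of `inst_template.conf`, but the authoritative list of parameters and their meaning is the commented template itself, so verify every name against your copy:

```
# Hypothetical excerpt from my_template.conf; check the names against
# $SGE_ROOT/util/install_modules/inst_template.conf before use.
SGE_ROOT="/opt/gcs"                  # installation directory
CELL_NAME="default"                  # cell directory name
SGE_QMASTER_PORT="6444"              # TCP port of the master service
SGE_EXECD_PORT="6445"                # TCP port of the execution service
EXEC_HOST_LIST="host1 host2 host3"   # hosts for remote execution host installation
```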
## Uninstallation

The uninstallation of the xxQS_NAMExx software can be done manually or automatically using the configuration template created for the auto installation. If you uninstall an execution host then make sure that there are no running jobs on that host. Also make sure that all execution hosts are uninstalled before you uninstall the master host or other services.

1. Log in as root on the system where you installed a service.

2. Automatically uninstall the execution service on execution hosts.

   ```
   cd $SGE_ROOT
   ./inst_sge -ux -auto $SGE_ROOT/my_template.conf
   ```

3. Manually uninstall the execution service.

   ```
   cd $SGE_ROOT
   ./inst_sge -ux
   ```

4. Qmaster, shadow master and other services can be uninstalled the same way. To uninstall the qmaster service use the `-um` switch, for the shadow master service use the `-usm` switch. For an automatic uninstallation add the `-auto` switch with the configuration template.

   ```
   cd $SGE_ROOT
   ./inst_sge ...
   ```

[//]: # (Each file has to end with two empty lines)

diff --git a/doc/markdown/manual/installation-guide/04_backup_and_restore.md b/doc/markdown/manual/installation-guide/04_backup_and_restore.md
new file mode 100644
index 000000000..8655d9ec1
--- /dev/null
+++ b/doc/markdown/manual/installation-guide/04_backup_and_restore.md
@@ -0,0 +1,35 @@
# Backup and Restore

The backup and restore of the xxQS_NAMExx configuration can be done manually. Unattended or periodic backups can be scheduled using the cron daemon (see the sketch at the end of this section).

1. Log in as root on the master host.

2. Make a copy of the backup configuration template and adjust it to your needs.

   ```
   $ cp $SGE_ROOT/util/install_modules/backup_template.conf $SGE_ROOT/my_backup.conf
   $ vi $SGE_ROOT/my_backup.conf
   ...
   ```

3. Start the backup process.

   ```
   cd $SGE_ROOT
   ./inst_sge -bup -auto $SGE_ROOT/my_backup.conf
   ```

To restore a backup, follow these steps:

1. Log in as root on the master host.

2. Trigger the restore process.

   ```
   ./inst_sge -rst
   ```

   You will be asked several questions during the restore process (e.g. the location of SGE_ROOT, the name of the default cell directory, the location of the backup files, etc.).
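As mentioned above, unattended periodic backups can be scheduled with the cron daemon. A minimal sketch of a root crontab, assuming the configuration file prepared in step 2 above and the hypothetical installation directory `/opt/gcs`:

```
# Hypothetical root crontab (edit with: crontab -e); /opt/gcs stands in for
# the real installation directory. cron provides a minimal environment, so
# absolute paths are used throughout.
SGE_ROOT=/opt/gcs

# Run the automatic backup every night at 02:15 with the prepared config file.
15 2 * * * cd /opt/gcs && ./inst_sge -bup -auto /opt/gcs/my_backup.conf
```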
[//]: # (Each file has to end with two empty lines)

diff --git a/doc/markdown/manual/installation-guide/05_upgrade.md b/doc/markdown/manual/installation-guide/05_upgrade.md
new file mode 100644
index 000000000..98fc735ca
--- /dev/null
+++ b/doc/markdown/manual/installation-guide/05_upgrade.md
@@ -0,0 +1,184 @@
# Upgrade

The version string of the xxQS_NAMExx software has three parts: the major version, the minor version, and the patch level (e.g. in v8.1.17, 8 is the major version, 1 is the minor version, and 17 is the patch level).

The major and/or minor version is increased when there are incompatible changes that require an upgrade procedure which adapts configuration files or the database schema before the new version can be used. Alternatively, the software can be reinstalled.

If only the patch level is increased then there are usually no incompatible changes.

There are exceptions to those rules. Always check the release notes of the new version for details and back up your existing installation before starting the upgrade process.

Please note that you cannot upgrade from Sun Grid Engine, Oracle Grid Engine, Son of Grid Engine, Univa Grid Engine or Altair Grid Engine to xxQS_NAMExx. However, the installation steps of these products are almost identical to those of xxQS_NAMExx, and a re-installation of the software is strongly recommended even if a side-by-side upgrade is at least partially possible.

If you have questions then please contact our support team directly or send a mail to [support@hpc-gridware.com](mailto:support@hpc-gridware.com).

## Patch Installation

A patch installation is normally done by downloading the packages from the xxQS_COMPANY_NAMExx download page and unpacking them into the installation directory during a cluster downtime, but it is also possible to install a patch with minimal downtime by following the steps below.

1. Back up your cluster.

2. Download the patch packages from xxQS_COMPANY_NAMExx and read the release notes. If there are no specific steps mentioned in the release notes, you can keep all jobs and services running, but be aware that no new jobs can be submitted and no services can be restarted until the next three steps have been completed.

3. Move but do *NOT* remove the binaries and libraries. It is important that the files remain on the same filesystem.

   ```
   $ cd $SGE_ROOT
   $ mv bin bin.old
   $ mv lib lib.old
   $ mv utilbin utilbin.old
   ```

   This ensures that running processes can still find the binaries and libraries if they need them.

4. Unpack the new packages in the $SGE_ROOT directory.

   ```
   $ tar xfz gcs-*.tar.gz
   ```

5. Trigger a restart of the services like `sge_qmaster`, `sge_execd`, and `sge_shadowd`.

   Now the new binaries and libraries are used by the services and new jobs can be submitted again. Old jobs that are still running are not affected by the patch installation.

6. Create a new backup of the cluster.

7. After the last old job has finished you can remove the old binaries and libraries.

   ```
   $ rm -rf bin.old lib.old utilbin.old
   ```

## Side-by-Side Upgrade

The least disruptive way to install a new minor or major version of xxQS_NAMExx is to install the new version side by side with the old version, using the configuration information of the old cluster to set up the new cluster. This way you can test the new version without affecting the old one.

* Old jobs can finish in the old cluster and new jobs can be submitted to the old cluster during the upgrade process.
* You can test the new installation without affecting the old cluster.
* You can switch back to the old version at any time in case of problems.

Please note that the side-by-side upgrade does not move or clone your jobs or advance reservations from the old cluster to the new cluster.

The upgrade is done with the following steps:

1. Back up your cluster.

2. Download the new version of the software from the xxQS_COMPANY_NAMExx download page and read the release notes. If there are no additional or different steps mentioned in the release notes then continue with these instructions.

3. The following settings will conflict with your old installation. You will need to decide on new values before starting the upgrade process:

   - Installation location ($SGE_ROOT)
   - Cell name ($SGE_CELL)
   - Cluster name ($SGE_CLUSTER_NAME)
   - Port numbers for the master and execution services (\$SGE_QMASTER_PORT, \$SGE_EXECD_PORT)
   - Spool directories for the master and execution services
   - Group ID range used for job tracking

4. Log in to your master machine as root and save all configuration files and objects of the old cluster by executing the following commands:

   ```
   $ cd $SGE_ROOT
   $ ./util/upgrade_modules/save_sge_config.sh <backup_dir>
   ```

   The specified directory will contain a snapshot of your cluster configuration. Changes made to the old cluster after this point in time will not be part of the new cluster setup.

5. Unpack the new version of the software into the new $SGE_ROOT directory.

6. Start the upgrade process by running the following command in the new cluster:

   ```
   $ cd $SGE_ROOT
   $ ./inst_sge -upd
   ```

   The upgrade procedure will ask you several questions about the new configuration settings defined in step 3. It will also ask you for the location of the old configuration files and objects created in step 4.

   At the end of this step your new `sge_qmaster` process will be active.
7. Update your execution environments. On each execution node you will need to source the new settings file and then trigger the initialization of the execution daemon's spooling area and the startup scripts.

   ```
   $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
   $ $SGE_ROOT/inst_sge -upd-execd
   $ $SGE_ROOT/inst_sge -upd-rc
   ```

   You can now start the execution daemon on the execution hosts, but please be aware that the resources of the machines may become oversubscribed if you also immediately allow new jobs to be submitted to the new cluster.

8. Check your new cluster.

   * Submit some test jobs and check that they are running as expected.
   * Make sure you do not have user-generated scripts (JSV, Prolog, Epilog, PE-start/stop, starter/suspend/resume-method, ...) in the old $SGE_ROOT directory that are still used by the new cluster. Move them to a new location outside the old and new $SGE_ROOT directories and reconfigure your cluster to use the new location.

9. If you are satisfied with the new cluster then you can switch over to the new cluster.

10. Once the last old job has finished you can shut down the old daemons and remove the old $SGE_ROOT directory.

## In-Place Upgrade

The in-place upgrade allows you to upgrade the software without changing the installation location or other key configuration parameters. The downside of this upgrade method is that you have to:

* Empty the cluster by removing all jobs and disabling submission of new jobs
* Stop all services during the upgrade
* Replace all binaries and libraries with the new version

Compared to the side-by-side upgrade, the in-place upgrade is more disruptive because all jobs have to be removed and the cluster is unavailable during the upgrade process.

Here are the steps required to complete the in-place upgrade:

1. Back up your cluster.

2. Download the new version of the software from the xxQS_COMPANY_NAMExx download page and read the release notes. If there are no additional or different steps mentioned in the release notes then proceed with these instructions.

3. Disable the cluster, e.g. by configuring a server JSV that rejects all jobs (a sketch of such a script follows this list).

4. Wait for all jobs to complete or delete all jobs.

5. Save the configuration of the old cluster.

   ```
   $ cd $SGE_ROOT
   $ ./util/upgrade_modules/save_sge_config.sh <backup_dir>
   ```

6. Shut down the execution, shadow and master services on all machines.

   ```
   $ qconf -ke all
   $ $SGE_ROOT/$SGE_CELL/common/sgemaster -shadowd stop
   $ qconf -km
   ```

7. Delete the old subdirectories in \$SGE_ROOT except for your \$SGE_CELL directory.

   Make sure there are no user-generated scripts (JSV, Prolog, Epilog, PE-start/stop, starter/suspend/resume-method, ...) in the $SGE_ROOT directory that are still needed.

8. Extract the new software packages in $SGE_ROOT.

9. Start the upgrade process.

   ```
   $ cd $SGE_ROOT
   $ ./inst_sge -upd
   ```

   The upgrade process will ask you several questions about configuration settings.

   At the end of this step your new `sge_qmaster` process is active.

10. Upgrade your execution environments. On each execution node you need to source the new settings file and then trigger the initialization of the execution daemon's spooling area and the startup scripts.

    ```
    $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
    $ $SGE_ROOT/inst_sge -upd-execd
    $ $SGE_ROOT/inst_sge -upd-rc
    ```

    You can now start the execution daemon on the execution hosts.

11. Continue with the post-installation steps.
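For step 3 of the in-place upgrade, a server JSV that rejects every submission could look roughly like the following. This is a minimal sketch that assumes the shell JSV helper library shipped with the product in `$SGE_ROOT/util/resources/jsv/jsv_include.sh`; verify the path and function names against your installation and the JSV documentation:

```
#!/bin/sh
# Sketch of a server JSV that rejects all job submissions during an upgrade.

jsv_on_start()
{
   # Nothing to prepare; no job parameters are needed to reject.
   return
}

jsv_on_verify()
{
   # Reject every job with an explanatory message for the submitter.
   jsv_reject "Cluster is being upgraded - job submission is temporarily disabled."
   return
}

# Pull in the helper functions and enter the JSV protocol loop.
. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"
jsv_main
```

The script would be activated by pointing the `jsv_url` parameter of the global cluster configuration at it, and deactivated again by resetting `jsv_url` once the upgrade is complete.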
+ +[//]: # (Eeach file has to end with two emty lines) + diff --git a/doc/markdown/manual/installation-guide/06_troubleshooting.md b/doc/markdown/manual/installation-guide/06_troubleshooting.md new file mode 100644 index 000000000..fb1642eae --- /dev/null +++ b/doc/markdown/manual/installation-guide/06_troubleshooting.md @@ -0,0 +1,74 @@ +# Troubleshooting + +## Auto-installation fails because \$SGE_ROOT/\$SGE_CELL already exists + +It is an intended behaviour that the automatic installation fails if the \$SGE_ROOT/\$SGE_CELL directory already exists. This is to prevent accidental overwriting of an existing installation. If you wish to overwrite an existing installation then you must manually remove the \$SGE_ROOT/\$SGE_CELL directory before you start the automatic installation. + +## Execution services are not running after automatic installation + +Make sure that the passwordless ssh/rsh access to that remote host is configured correctly. The installer will try to start the execution service but on error it will continue with the installation of others. + +To solve this you can either fix the ssh/rsh problem and reinstall or start the execution service manually. + +## Communication issues due to incorrect setup of hostnames + +All xxQS_NAMExx services must be able to resolve the hostnames of all other machines part in the same cluster otherwise communication between the services or between clients and services will fail. + +If you have a hostname resolution service (such as DNS, NIS, NIS+, LDAP, ...) in your network then make sure that all hostnames are correctly registered there and that all hosts are using it. If you do not have such a service then you have to make sure that all hostnames are correctly registered in the `/etc/hosts` file on all machines. + +Also make sure that hostnames are not 'mapped' to loopback addresses (such as 127.0.0.1 or ::1 if IPv6 is enabled). Some systems have such default mappings in the `/etc/hosts` file which leads to communication issues even if you have a central hostname resolution service. + +To find and fix problems with hostname resolution you can use two utilities that are part of the xxQS_NAMExx software: `gethostname` and `gethostbyname`. Both utilities are located in the `$SGE_ROOT/utilbin/$ARCH` directory and, when started using the `-all` option they will display the primary hostname, alias names and IP addresses of the local machine or the machine with the specified hostname. They will also display the primary hostname as seen by xxQS_NAMExx components. + +Here are examples for a correct setup: + +``` +$ $SGE_ROOT/utilbin/$ARCH/gethostname -all +Hostname: master_host.hpc-gridware.com +SGE name: master_host.hpc-gridware.com +Aliases: master_host +Host Address(es): 10.1.1.1 +``` + +and + +``` +$ $SGE_ROOT/utilbin/$ARCH/gethostbyname -all master_host +Hostname: master_host.hpc-gridware.com +SGE name: master_host.hpc-gridware.com +Aliases: master_host +Host Address(es): 10.1.1.1 +``` + +If you are experiencing communication problems then check the output of those commands on the master machine and the machine having the communication problem. 
Here are some of the more common problems:

* The hostname is not correctly assigned to the IP address
* The hostname is registered with a loopback address
* Primary hostname and aliases do not have the same sequence on all hosts
* Hostnames are assigned to different IP addresses on different hosts
* A host has multiple NICs with different IP addresses

Most problems can be solved by correctly registering the hostnames in a directory service or in the `/etc/hosts` file. For setups where hosts have multiple NICs with different IP addresses, you need to make sure that xxQS_NAMExx knows about all IP addresses. This is done by defining a `host_aliases` file (see the next section and sge_host_aliases(5)).

## IP Multipathing, Load Balancing or Bonding

If you are using IP multipathing, network load balancing, or certain bonding configurations on your master and/or execution nodes, you must ensure that the master and execution services are aware of the primary and additional IP addresses, otherwise communication between the services, or between clients and services, may fail completely or sporadically, depending on the network load.

Suppose you have a master host named *master_host* with the main network interface `eth0` and two additional network interfaces `eth1` and `eth2`, each with an assigned IP address. Upon installation, the master service will recognise the master host's main IP address and use it for communication. If you configure the underlying host system to use the additional NICs to achieve load balancing, then you must make xxQS_NAMExx aware of the additional IP addresses, otherwise communication from the unknown interfaces (`eth1` and `eth2`) will fail.

To solve this you need to define a `host_aliases` file (see sge_host_aliases(5)) to tell xxQS_NAMExx that the interfaces `eth0`, `eth1` and `eth2` are all part of the same host. As a first step, make the additional addresses resolvable by specifying them in the `/etc/hosts` file or by defining the corresponding hostnames in DNS/NIS or another directory service.

```
10.1.1.1 master_host
10.1.1.2 master_host_eth1
10.1.1.3 master_host_eth2
```

As a second step, define the `host_aliases` file to tell xxQS_NAMExx that these names all belong to the same host; the first name on the line is the primary name of the host.

```
master_host master_host_eth1 master_host_eth2
```

[//]: # (Each file has to end with two empty lines)

diff --git a/source/dist/util/install_modules/inst_execd.sh b/source/dist/util/install_modules/inst_execd.sh
index 6a6322bc7..f02e4b33c 100644
--- a/source/dist/util/install_modules/inst_execd.sh
+++ b/source/dist/util/install_modules/inst_execd.sh
@@ -534,10 +534,7 @@ GetLocalExecdSpoolDir()
          "\n\nATTENTION: For most operating systems, the spool directory does not have to" \
          "\nbe located on a local disk. The spool directory can be located on a " \
          "\nnetwork-accessible drive. However, using a local spool directory provides " \
-         "\nbetter performance.\n\nFOR WINDOWS USERS: On Windows systems, the spool directory " \
-         "MUST be located\non a local disk. If you install an execution daemon on a " \
-         "Windows system\nwithout a local spool directory, the execution host is unusable." \
-         "\n\nThe spool directory is currently set to:\n<<$GLOBAL_EXECD_SPOOL>>\n"
+         "\nbetter performance.\n\nThe spool directory is currently set to:\n<<$GLOBAL_EXECD_SPOOL>>\n"

   $INFOTEXT -n -auto $AUTO -ask "y" "n" -def "n" "Do you want to configure a different spool directory\n for this host (y/n) [n] >> "
   ret=$?