Releases: cerebis/shap
Zenodo release
Simple High-throughput Annotation Pipeline (SHAP)
Installation Notes
Pre-requisites
Building
- Java SDK 1.6 (developed with 1.6.0_17)
- Git (developed with 1.7.3)
- Maven build manager (developed with 2.0.4)
Runtime
- Java SDK 1.6 (tested with 1.6.0_17)
- PostgreSQL (tested with 8.3.6) (server build only)
- Tomcat (tested with 6.0.10) (server build only)
Discussion
The SHAP codebase is comprised of a web application for accessing analysis results and a server-side command-line based analysis system. The web application can be deployed to a Servlet container such as Tomcat or, if the embedded build is chosen, an embedded instance of Jetty will be used. The analysis system is invoked, as you would expect, from the command-line.
The system is built using the Maven build manager. Maven resolves dependencies using remote repositories and eliminates the need to bundle supporting libraries as part of the SHAP project. Therefore if building from source, you will need to have a working installation of Maven on your system. The first time the system is built and depending on your local repository, Maven may need to fetch many dependent libraries.
Once completed, you will find the built system in the form of a WAR file in the "target" folder. This WAR file contains both the web application and the server-side system. Currently, the two modes of operation have not been made into separate codebases.
The WAR file can be extracted to your filesystem and treated as the executable installation of the server-side analysis system. In the case of an embedded build, you will be able to start the web application in userspace. For the server build, you will be required to deploy the WAR file to an application server of your choice.
Only the web application makes use of user accounts. The server-side analysis is accessible to whoever has permission to run the elements of the system. Attention should be paid to who has access or data loss could occur.
Differences between Unix platforms
Our development and production environments run on CentOS 5. CentOS is a redistribution of Redhat Enterprise Linux (RHEL). You may come across conflicting environmental details that we have overlooked. The common issues are package management differences, application paths and some types of system commands.
Switching users
In Red Hat based distributions, switching users is accomplished with the command "su"; the command "sudo" is not enabled by default. To invoke commands as another user, switch to that user with "su -" (which starts a login shell), do your work, and remember to log out when finished.
su - postgres
...
logout
In Debian based distributions, sudo is typically enabled by default and commands can be run inline:
sudo -u postgres [your-command-here]
On Mac OS X, "sudo" is also the usual choice.
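The two approaches above can be wrapped in one helper. The following is a minimal sketch, not part of SHAP: the function name as_postgres and the DRY_RUN switch are hypothetical, and it assumes that sudo, when installed, is actually configured for your account.

```shell
#!/bin/sh
# Sketch: run one command as the "postgres" system user.
# Uses sudo when it is installed, otherwise falls back to su.
# Set DRY_RUN=1 to print the command line instead of running it.
as_postgres() {
    if command -v sudo >/dev/null 2>&1; then
        set -- sudo -u postgres "$@"
    else
        # su takes the whole command as a single string via -c
        set -- su - postgres -c "$*"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, "as_postgres createdb -O shapuser shap" would create the SHAP database without leaving your own shell session.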
Application paths
PostgreSQL base location
Redhat
/var/lib/pgsql
Debian
/etc/postgresql/{version}/
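If you are unsure which layout a machine uses, the locations can be probed. This sketch checks the two configuration directories referred to later in these notes (/var/lib/pgsql/data and /etc/postgresql/{version}/main). The function name and the optional root prefix argument are hypothetical. On Red Hat systems the data directory is usually readable only by the postgres user or root, so run it with suitable privileges.

```shell
#!/bin/sh
# Print the first PostgreSQL configuration directory that exists.
# $1 is an optional root prefix (hypothetical; handy for chroots or testing).
find_pg_conf_dir() {
    root="${1:-}"
    for dir in "$root"/var/lib/pgsql/data "$root"/etc/postgresql/*/main; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no PostgreSQL configuration directory found" >&2
    return 1
}
```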
Source Distribution
Building from source
- Obtain the source tree from SourceForge
git clone git://git.code.sf.net/p/shap/git shap
- Go into the SHAP folder
cd shap
- We need to add a few missing dependencies to your local .m2 Maven repository. The following command will install BioJava v3, Apache CLI v2 and the Sun DRMAA libraries.
bin/prep_repo.sh
- Now start the Maven build of SHAP.
By default, the server version will be built. This requires PostgreSQL 8.x and Tomcat 6.x to be installed and configured. An embedded version of SHAP is also available by specifying the "embedded" profile. This version will be built using embedded Derby for relational storage and Jetty as a servlet container. The embedded version should require very little configuration by the end user.
As four additional remote repositories are required to satisfy all the dependencies, the fallback process within Maven delays things considerably. This can take more than 10 minutes if you do not have any of the dependent libraries.
Though the embedded system is extremely easy to deploy, it does not support concurrent access to the database. Users should be cautious when attempting to analyze samples in a highly concurrent manner.
To build the server system
mvn install
or
mvn -Pserver install
To build the embedded system
mvn -Pembedded install
- Once completed, there will be a directory "target/" which contains the built WAR file.
shap_{profile}-{version}.war
This can be deployed to an application server such as Tomcat for web access to analysis results or used immediately if you selected the embedded version.
- The server-side analysis tools are contained in the WAR file. Extract this file within the SHAP folder as follows:
unzip -q target/shap_{profile}-{version}.war -d war
You should now have a folder "war/" containing the contents of the WAR file. All the helper scripts in "bin/" have been written assuming this path.
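The last two steps can be scripted so that the WAR name is derived from the profile and version rather than typed by hand. This is a sketch only; the function names are hypothetical, and the version passed in must match whatever your build actually produced.

```shell
#!/bin/sh
# Build the WAR file name for a given profile ($1) and version ($2),
# following the shap_{profile}-{version}.war pattern.
war_name() {
    printf 'shap_%s-%s.war\n' "$1" "$2"
}

# Extract the built WAR into ./war, the path the scripts in bin/ assume.
# Must be run from the top of the SHAP folder.
extract_war() {
    war="target/$(war_name "$1" "$2")"
    if [ ! -f "$war" ]; then
        echo "WAR not found: $war (has the build finished?)" >&2
        return 1
    fi
    unzip -q "$war" -d war
}
```

For an embedded build this might be used as "mvn -Pembedded install && extract_war embedded 1.1.0", where 1.1.0 stands in for your release version.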
Binary Distribution
- Untar the binary distribution
tar xzf shap-{version}.tar.gz
- Go into the SHAP folder
cd shap
Follow the steps in the source distribution, starting from step 5.
SHAP-basic Distribution (shap-basic)
A fully self-contained distribution of SHAP is available. This uses the embedded system and includes a set of external analyzers: blastall, hmmpfam, metagene and aragorn. The system comes with scripts to initialize the system for these analyzers and includes some simple reference databases to make it possible to test an included example data set. Users will find it easy to add analyzers for larger reference databases such as Refseq, NR, PFAM etc.
Virtual Appliance (Amazon EC2 AMI)
To aid potential users in trialling the application, we have made a virtual appliance available in the form of an Amazon Machine Image (AMI). This publicly available AMI can be found in the AWS EC2 Community section by the name shap-{version}, where {version} is replaced by the release version number.
Currently, we only provide an AMI for the latest release, version 1.1.0. Therefore you should filter the long list of community AMIs with "shap-1.1.0".
- Log into or create an account on the Amazon Web Service.
- Once in the system, select the "EC2" tab and click "Launch Instance"
- Select the "Community AMIs" tab and filter the list for "shap-1.1.0"
- Click "Select" for the appropriate AMI.
- Configure your instance, as per the Amazon documentation. You can use a micro instance to be eligible for the free tier.
- Once the instance is created and running, you can log in. The ec2-user has a pre-configured installation of SHAP available in its home directory. The web interface should already be up and running with example data.
Configuring Server System
Database Setup
SHAP uses a relational database to store its analysis results. Development has used PostgreSQL; other SQL-compliant databases with concurrent transactional support are a possibility. However, inevitable differences in storage types and SQL implementations mean that, for now, this is not offered out of the box and will not be discussed here.
Note: The system account used to invoke these commands will require superuser authority in PostgreSQL. This may be most easily accomplished using the "postgres" user (distribution dependent, see above).
- Create a database user with full privileges to the SHAP database.
createuser -SDRPE shapuser
The predefined password for this user is simply "shap01". If a stronger password is used here, remember to update the shap.properties file. You will need this password for step 3. To improve data security, PostgreSQL client authentication (pg_hba.conf) can be used. In addition, the web application for SHAP needs write access only to the Users and UserRoles tables, so a second PostgreSQL user with read-only (select) access to all other tables could be employed for it. The analysis pipeline, however, still requires full access to all tables, so a single restricted user is not an option.
- Create a database for SHAP.
createdb -O shapuser shap
- In the extracted SHAP folder from the earlier source or binary installation section, carry over any changes you made to the user in the shap.properties file (war/WEB-INF/classes/shap.properties).
The lines should read:
database.username=shapuser
database.password={chosen password}
For application servers, if you changed the password, user or database name you will need to update this file post deployment.
- Make sure the PostgreSQL server has been configured to listen for TCP connections. SHAP connects to the database server by TCP, whether it is hosted on the same system or not. Without this feature being enabled, any attempt by SHAP to connect to the database will fail.
For Redhat based distributions
/var/lib/pgsql/data/postgresql.conf
On Debian based distributions, this becomes
/etc/postgresql/{version}/main/postgresql.conf
Uncomment the "listen_addresses" line. If the web application, annotation pipeline and database all reside on the same physical server, you need only listen on the localhost address.
listen_addresses = 'localhost'
- PostgreSQL provides fine-grained control of client authentication. The SHAP user needs permission to authenticate by password, which is commonly not part of PostgreSQL's default configuration. Client authentication is defined in the following file.
On Redhat based distributions
/var/lib/pgsql/data/pg_hba.conf
On Debian based distributions, this becomes
/etc/postgresql/{version}/main/pg_hba.conf
The order of rules is im...
Complete application release.
Simple High-throughput Annotation Pipeline (SHAP)
Installation Notes
Pre-requisites
Building
- Java SDK 1.6 (developed with 1.6.0_17)
- Git (developed with 1.7.3)
- Maven build manager (developed with 2.0.4)
Runtime
- Java SDK 1.6 (tested with 1.6.0_17)
- PostgreSQL (tested with 8.3.6)
- Tomcat (tested with 6.0.10)
Discussion
The SHAP codebase is comprised of a web application for accessing analysis results and a server-side command-line based analysis system. The web application is deployed to a Servlet container such as Tomcat. The analysis system is invoked, as you would expect, from the command-line.
The system is built using the Maven build manager. Maven resolves dependencies using remote repositories and eliminates the need to bundle supporting libraries as part of the SHAP project. Therefore if building from source, you will need to have a working installation of Maven on your system. The first time the system is built and depending on your local repository, Maven may need to fetch many dependent libraries.
Once completed, you will find the deployable WAR file in the "target" folder. This WAR file contains both the web application and the server-side system. Currently, the two modes of operation have not been made into separate codebases.
The WAR file can be extracted to your filesystem and treated as the executable installation of the server-side analysis system.
Only the web application makes use of user accounts. The server-side analysis is accessible to whoever has permission to run the elements of the system. Attention should be paid to who has access or data loss could occur.
Differences between Unix platforms
Our development and production environments run on CentOS 5. CentOS is a redistribution of Redhat Enterprise Linux (RHEL). You may come across conflicting environmental details that we have overlooked. The common issues are package management differences, application paths and some types of system commands.
Switching users
In Red Hat based distributions, switching users is accomplished with the command "su"; the command "sudo" is not enabled by default. To invoke commands as another user, switch to that user with "su -" (which starts a login shell), do your work, and remember to log out when finished.
su - postgres
...
logout
In Debian based distributions, sudo is typically enabled by default and commands can be run inline:
sudo -u postgres [your-command-here]
On Mac OS X, "sudo" is also the usual choice.
Application paths
PostgreSQL base location
Redhat
/var/lib/pgsql
Debian
/etc/postgresql/{version}/
Source distribution
Building from source
- Obtain the source tree from SourceForge
git clone git://git.code.sf.net/p/shap/git shap
- Go into the SHAP folder
cd shap
- We need to add a few missing dependencies to your local .m2 maven repository. The following command will install BioJava v3, Apache CLI v2 and the Sun DRMAA libraries.
bin/prep_repo.sh
- Now launch the maven build of SHAP. As four additional remote repositories are required to satisfy all the dependencies, the fallback process within Maven delays things considerably. This can take more than 10 minutes if you do not have any of the dependent libraries.
mvn install
- Once completed, there will be a directory "target/" which contains the built WAR file.
shap-{version}.war
This can be deployed to an application server such as Tomcat for web access to analysis results.
- The server-side analysis tools are contained in the WAR file. Extract this file within the SHAP folder as follows:
unzip -q target/shap-{version}.war -d war
You should now have a folder "war/" containing the contents of the WAR file. All the helper scripts in "bin/" have been written assuming this path.
Binary Distribution
- Untar the binary distribution
tar xzf shap-{version}.tar.gz
- Go into the SHAP folder
cd shap
Follow the steps in the source distribution, starting from step 5.
Virtual Appliance (Amazon EC2 AMI)
To aid potential users in trialling the application, we have made a virtual appliance available in the form of an Amazon Machine Image (AMI). This publicly available AMI can be found in the AWS EC2 Community section by the name shap-{version}, where {version} is replaced by the release version number.
Currently, we only provide an AMI for the latest release, version 1.1.0. Therefore you should filter the long list of community AMIs with "shap-1.1.0".
- Log into or create an account on the Amazon Web Service.
- Once in the system, select the "EC2" tab and click "Launch Instance"
- Select the "Community AMIs" tab and filter the list for "shap-1.1.0"
- Click "Select" for the appropriate AMI.
- Configure your instance, as per the Amazon documentation. You can use a micro instance to be eligible for the free tier.
- Once the instance is created and running, you can log in. The ec2-user has a pre-configured installation of SHAP available in its home directory. The web interface should already be up and running with example data.
Database Setup
SHAP uses a relational database to store its analysis results. Development has used PostgreSQL; other SQL-compliant databases with concurrent transactional support are a possibility. However, inevitable differences in storage types and SQL implementations mean that, for now, this is not offered out of the box and will not be discussed here.
Note: The system account used to invoke these commands will require superuser authority in PostgreSQL. This may be most easily accomplished using the "postgres" user (distribution dependent, see above).
- Create a database user with full privileges to the SHAP database.
createuser -SDRPE shapuser
The predefined password for this user is simply "shap01". If a stronger password is used here, remember to update the shap.properties file. You will need this password for step 3. To improve data security, PostgreSQL client authentication (pg_hba.conf) can be used. In addition, the web application for SHAP needs write access only to the Users and UserRoles tables, so a second PostgreSQL user with read-only (select) access to all other tables could be employed for it. The analysis pipeline, however, still requires full access to all tables, so a single restricted user is not an option.
- Create a database for SHAP.
createdb -O shapuser shap
- In the extracted SHAP folder from the earlier source or binary installation section, carry over any changes you made to the user in the shap.properties file (war/WEB-INF/classes/shap.properties).
The lines should read:
database.username=shapuser
database.password={chosen password}
For application servers, if you changed the password, user or database name you will need to update this file post deployment.
- Make sure the PostgreSQL server has been configured to listen for TCP connections. SHAP connects to the database server by TCP, whether it is hosted on the same system or not. Without this feature being enabled, any attempt by SHAP to connect to the database will fail.
For Redhat based distributions
/var/lib/pgsql/data/postgresql.conf
On Debian based distributions, this becomes
/etc/postgresql/{version}/main/postgresql.conf
Uncomment the "listen_addresses" line. If the web application, annotation pipeline and database all reside on the same physical server, you need only listen on the localhost address.
listen_addresses = 'localhost'
- PostgreSQL provides fine-grained control of client authentication. The SHAP user needs permission to authenticate by password, which is commonly not part of PostgreSQL's default configuration. Client authentication is defined in the following file.
On Redhat based distributions
/var/lib/pgsql/data/pg_hba.conf
On Debian based distributions, this becomes
/etc/postgresql/{version}/main/pg_hba.conf
The order of rules is important. The more explicit the rule, the earlier it should come. It is recommended to place the SHAP rules before the default rules. Add the following lines to permit Unix-domain socket and TCP/IP connections to the SHAP database with password authentication from the localhost.
local shap shapuser md5
host shap shapuser 127.0.0.1/32 md5
- For the changes to take effect, PostgreSQL will need to be restarted.
On systems with sysvconfig, with root authority invoke the following command.
service postgresql restart
- On first invocation, SHAP will automatically create its table structure.
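Before the first invocation it can help to confirm that the configuration files really contain what the steps above describe. The following is a minimal sketch with hypothetical function names. It only inspects files and does not contact the server. Note that PostgreSQL spells the listen parameter "listen_addresses". The key passed to prop_value is used as a regular expression, which is adequate for the simple keys in shap.properties.

```shell
#!/bin/sh
# Succeeds if postgresql.conf ($1) has an uncommented listen_addresses line.
has_listen_addresses() {
    grep -Eq '^[[:space:]]*listen_addresses[[:space:]]*=' "$1"
}

# Succeeds if pg_hba.conf ($1) has md5 rules for shapuser on the shap
# database, for both local (socket) and host (TCP) connections.
has_shap_hba_rules() {
    grep -Eq '^[[:space:]]*local[[:space:]]+shap[[:space:]]+shapuser[[:space:]]+md5' "$1" &&
    grep -Eq '^[[:space:]]*host[[:space:]]+shap[[:space:]]+shapuser[[:space:]]+127\.0\.0\.1/32[[:space:]]+md5' "$1"
}

# Print the value of key $2 from the properties file $1.
prop_value() {
    sed -n "s/^$2=//p" "$1"
}
```

Once PostgreSQL has been restarted, "psql -h localhost -U shapuser -d shap -c 'SELECT 1'" should prompt for the password and return a single row if password authentication over TCP is working.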
Setup Analyzers
A tool has been written to help configure an initial set of analyzers. Since these definitions are highly system dependent, it is expected that users of SHAP will want to modify the example configuration before running the tool. Analyzer names must be unique, so repeated invocations may run into naming conflicts. To aid in experimenting with an initial setup, you have the option to purge the previous database. WARNING: This will purge all data from the database, not just analyzer configurations, so use it wisely.
An example configuration file can be found at:
helpers/analyzer-config.xml
This file follows the Spring bean definition schema. A few detectors and annotators have been defined. All defined analyzers mentioned in the "configuration" bean will be created.
Once you are ready, run the tool
bin/configSetup.sh <analyzer XML file>
A more user-friendly approach to analyzer definition is planned. This is an obvious need now that SHAP has been released to the public.
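Because the example file is meant to be modified, one cautious workflow is to edit a copy and leave the original untouched. This is a sketch under that assumption; the function name and the working file name my-analyzers.xml are hypothetical.

```shell
#!/bin/sh
# Copy the example analyzer configuration to a working file for editing.
# $1 source (defaults to the shipped example), $2 destination.
# Refuses to overwrite an existing working copy.
prepare_analyzer_config() {
    src="${1:-helpers/analyzer-config.xml}"
    dest="${2:-my-analyzers.xml}"   # hypothetical file name
    if [ ! -r "$src" ]; then
        echo "cannot read example configuration: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing file: $dest" >&2
        return 1
    fi
    cp "$src" "$dest" && printf '%s\n' "$dest"
}
```

After editing the copy, it would be passed to the tool in the usual way: bin/configSetup.sh my-analyzers.xml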
Note The scratch path must be read/write accessible to all machines which will participate in ana...
Self-contained demo release
SHAP Basic-System Quick Start
The SHAP basic-system is completely self-contained and intended to get SHAP up and running quickly and easily. Our intention is that end-users can assess the utility of the software with a minimum of trouble. To make that possible, the database provider and web server have been embedded within the application itself. End users should be able to run SHAP with little to no configuration changes.
As an alternative, SHAP can also be tested using our configured Amazon Machine Image on EC2. Please consult our Sourceforge project for further details.
In providing this basic system, we have made choices to keep the configuration simple. In the long run not all of these choices may be suitable for your day-to-day system. The installation includes the external tools: blastall, hmmpfam, aragorn and metagene, sufficient to get a sense of using SHAP for your work.
It is important to note that the version of hmmpfam supplied has been specially modified to write XML output. It has therefore been renamed hmmpfam_xml to avoid the possibility of clashing with a preexisting vanilla installation of HMMER.
Blastall, aragorn and metagene are standard vanilla binaries and could be replaced with those you might already have installed on your system.
You could also extend the supplied analyzer configuration to make use of larger reference databases (Refseq, NR, TIGRFAM, etc) than we have included for testing purposes.
Though we have configured the system for single-threaded analysis, the embedded system still supports the concurrent model, and it is therefore possible to make use of multiple CPU cores. However, running multiple simultaneous jobs at the command line should be avoided, as it may lead to database access violations.
Please refer to the full documentation on the SourceForge project page if you wish to make a complete installation and tailor it to your system.
Step 1: Shell environment setup
The basic installation of SHAP uses some environment variables to set up paths to the included executables and reference databases.
To make this easy and quick, end-users should only need to run the Bash script shap-settings.sh to configure the system.
In a Bash shell execute,
. shap-settings.sh
This can also be placed in your bashrc or bash_profile.
NOTE
You must set the SHAP_HOME variable within this script before it can be successfully executed.
Set the SHAP_HOME environment variable
Assuming you extracted the shap-basic tarball to your home directory, SHAP_HOME would be set as follows.
export SHAP_HOME=${HOME}/shap-basic
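A forgotten or mistyped SHAP_HOME is easy to catch right after sourcing the settings script. The check below is a sketch only and the function name is hypothetical.

```shell
#!/bin/sh
# Verify that SHAP_HOME is set and points at an existing directory.
check_shap_home() {
    if [ -z "${SHAP_HOME:-}" ]; then
        echo "SHAP_HOME is not set" >&2
        return 1
    fi
    if [ ! -d "$SHAP_HOME" ]; then
        echo "SHAP_HOME is not a directory: $SHAP_HOME" >&2
        return 1
    fi
}
```

It could be run as ". shap-settings.sh && check_shap_home" before continuing to Step 2.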
Step 2: Initialization and analysis
Initialize a basic analyzer setup
Run configSetup.sh, answering "yes" to both questions.
configSetup.sh basic-config.xml
Populate the database with an example dataset
./populate.sh
This script creates a project and sample and a single-genome sized metagenomic example dataset.
Analyze the example dataset
Run feature detection on the example dataset as follows
jobControl --submit plans/gsb-detect.xml
Once this has completed, you should now run the annotation. This will take longer as the basic system has not been configured to make use of concurrency. It is possible to enable concurrency in shap.properties if you wish.
jobControl --submit plans/gsb-annotate.xml
When complete, you may view the results either with the web application described below or your favourite SQL database client. The URI is "jdbc:derby:shapdb".
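The two submissions can be chained so that annotation only starts if detection succeeded. This sketch rests on assumptions that are not confirmed by these notes: that jobControl is on your PATH after sourcing shap-settings.sh, that it returns only once the submitted plan has finished, and that it exits non-zero on failure. The function name is hypothetical.

```shell
#!/bin/sh
# Run feature detection, then annotation, stopping at the first failure.
# Assumes jobControl blocks until the plan completes (unverified).
run_example_analysis() {
    for plan in plans/gsb-detect.xml plans/gsb-annotate.xml; do
        echo "submitting $plan" >&2
        jobControl --submit "$plan" || {
            echo "jobControl failed for $plan" >&2
            return 1
        }
    done
}
```

If jobControl turns out to return immediately after submission, run the two commands by hand as described above instead.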
Step 3: Using the Web Application
Indexing
The web application makes use of Apache Lucene for search capability. Lucene in turn makes use of an index, which is stored on the filesystem. For shap-basic, the index is stored within the application directory. Make sure this path is writable by whoever wishes to create the index.
At present, the index must be manually updated when new data has been added or analyzed. This mass indexing is performed using the index.sh command. Answer yes to reindexing.
index.sh
Starting the local web application
The embedded web application can be started as follows,
web-server.sh
Once start-up is complete, users can access the analysis results with their preferred web browser.
By default, the URL is,
http://localhost:8090/shap
Only a superuser 'admin' exists to begin with, with password 'shap01'. The web application only provides read access to data, but it is good practice to use non-privileged users for regular access.
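Since start-up takes a moment, a script can poll the URL until the application answers. This is a minimal sketch assuming curl is installed; the function name, the number of attempts and the delay are hypothetical defaults.

```shell
#!/bin/sh
# Poll a URL until it answers successfully or the attempts run out.
# $1 URL, $2 number of attempts, $3 seconds to wait between attempts
wait_for_shap() {
    url="${1:-http://localhost:8090/shap}"
    tries="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f: fail on HTTP errors, -L: follow redirects, -s: no progress meter
        if curl -fsL -o /dev/null --max-time 5 "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no response from $url after $tries attempts" >&2
    return 1
}
```

For example, after launching web-server.sh in another terminal, "wait_for_shap && echo ready" reports when the interface can be opened.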
Done!
Please refer to the complete documentation and the project website for more information.
Matthew DeMaere