overton-group/NetNC

##
## README for Network Neighbourhood Clustering (NetNC) software
##

## Licensed under the GNU General Public License (GPL) version 3
## you should have received a copy of the GNU General Public License
## with the NetNC software in the file LICENSE.txt
## if not, see <http://www.gnu.org/licenses/>.

##
## QUICK START GUIDE
##

1) Ensure the dependencies are installed:
a. Perl - Math::Pari
b. R
c. Python - networkx 1.8 and numpy

2) Edit the paths at the top of NetNC_v2pt2.pl (the relevant lines have comments # EDIT THIS ...)

3) Run the NetNC_v2pt2.pl script with the relevant options; the two main modes are given below.
a. 'FTI' analysis mode:
NetNC_v2pt2.pl -n MyNetwork.txt -i NodeList.txt -o /path/to/my/outdir/fileprefix -F

The final output (an edge list) will be in the file: /path/to/my/outdir/fileprefix.FDRthresholded_mincutDensity0pt306_noDoublets.txt

b. 'FBT' analysis mode:
NetNC_v2pt2.pl -n MyNetwork.txt -i NodeList.txt -o /path/to/my/outdir/fileprefix -M

The final output as a node list will be located at: /path/to/my/outdir/fileprefix_NFCS-mixturemodel.coherentNodes
Additionally, a network of the coherent nodes in the above file, in which edges have
Bonferroni-corrected -log(p)<=0.05, is output to: /path/to/my/outdir/fileprefix_NFCS-mixturemodel.coherentNet.txt

Please see further details below.

##
## INTRODUCTION
##

In this README, the usage and testing sections provide examples of 
some ways to run NetNC and its components. The testing section also 
includes description of output files. 
The final two sections summarise contributors and how to cite NetNC.

The NetNC distribution includes three main components:
1) Network Neighbourhood Clustering - NetNC_v2pt2.pl

This program runs Network Neighbourhood Clustering and optionally
calls the components below according to the command-line options
supplied. Requires the perl module Math::Pari

After installation (see below), run 'perl NetNC_v2pt2.pl -h' for 
further details.

Files in the modules/ directory provide functions for NetNC_v2pt2.pl

2) Iterative minimum cut - itercut.py (see directory mincut/ )
The script mincut/itercut.py iteratively cuts edges using a weighted min-cut max-flow
algorithm with a graph density stopping criterion.

Further details are given in mincut/MinCut_README.txt

3) Node centric analysis with Gaussian mixture modelling
The script FCS.pl generates Node Functional Coherence Scores (NFCS),
which are the sum of a node's edge weight(s) normalised by degree.

The script mixtureModelling/NetNCmixmodel.R runs Gaussian mixture modelling on
NFCS values, then processes the results to infer functionally coherent nodes
and estimate the proportion of noise; mixtureModelling/netNCmix.R
provides functions for the NetNCmixmodel.R script. A little more detail is given
in the usage section, below.



##
## INSTALLATION
##

1. Network Neighbourhood clustering - NetNC_v2pt2.pl

a) The arbitrary precision calculations require the perl module Math::Pari.
Install this from cpan (www.cpan.org). For example from the command line:
cpan
install Math::Pari

b) In NetNC_v2pt2.pl, edit the 'use lib' statements to include:
i) the path to Math::Pari (typically this will be where cpan installs
perl modules) and ii) the full path to the NetNC modules/ (this is the 'modules' 
directory created when you unzipped the NetNC distribution).

Also edit the $path_to_R and $NetNC_home variables to specify 
the path to your R installation and the NetNC home directory (this is the location of 
NetNC_v2pt2.pl and FCS.pl scripts)

The lines to edit in NetNC_v2pt2.pl are indicated with comments near to the start of the 
script. NetNC has been tested with perl v5.16, v5.18, v5.20 and v5.26

2. Iterative Minimum cut - itercut.py

Install the dependencies: networkx 1.8 and numpy 

For example: easy_install networkx==1.8
	     easy_install numpy

(itercut.py is verified to work with networkx 1.8. Earlier versions are 
not compatible)

More information is given in mincut/MinCut_README.txt

3. The Gaussian mixture modelling requires R to be installed and has been 
tested with R versions 3.0.2, 3.2.2, 3.3.0, 3.3.1, 3.6.3 on Linux, but is expected 
to work on Mac OSX, Windows and with other recent versions of R.



##
## USAGE
##

1. NetNC

NetNC can be run with many possible combinations of options; here are some
illustrative examples:
-- Functional Target Identification mode (calls itercut.py):
NetNC_v2pt2.pl -n MyNetwork.txt -i NodeList.txt -o /path/to/my/outdir/fileprefix -F

-- Pathway Identification mode (calls itercut.py) and Node-Centric Analysis
(including Gaussian Mixture Modelling) also specifying a background list for 
resampling (e.g. detected genes from a microarray experiment):
NetNC_v2pt2.pl -n MyNetwork.txt -i NodeList.txt -o /path/to/my/outdir/fileprefix -E -M -l /path/to/backgroundGenelist.txt

-- Node centric analysis (including Gaussian Mixture Modelling) and using background 
negative log p-values generated from a previous run of NetNC:
NetNC_v2pt2.pl -n MyNetwork.txt -i NodeList.txt -o /path/to/my/outdir/fileprefix -M -p /path/to/my/outdir/fileprefix.BG.nlPonly.txt -z [resample_number]

Where -z [resample_number] indicates the number of resamples used in generating 
the list of -log p-values provided using the -p option (with suffix '.BG.nlPonly.txt')

A help message is produced by running:
NetNC_v2pt2.pl -h


2. Iterative minimum cut

The iterative minimum cut is (optionally) called by NetNC_v2pt2.pl
If you wish to run the iterative minimum cut as a stand-alone program, the basic
usage is:
python mincut/itercut.py -i graphfile.txt -o cutgraph.txt -t 0.3

Note that -t gives the density stopping criterion; a value other than 0.3 may
be used. The itercut.py help message is produced by running:
python mincut/itercut.py
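
For orientation, the density of an undirected graph with n nodes and m edges is
commonly defined as 2m / (n(n-1)). The Python sketch below computes that value from
edge-list lines of the form 'nodeID1 nodeID2 -log(p-value)'. It is an assumption that
the -t threshold refers to exactly this definition (please check
mincut/MinCut_README.txt); the function name graph_density is illustrative and is
not part of NetNC.

```python
def graph_density(edge_lines):
    """Density 2m / (n(n-1)) of the undirected graph given as edge-list lines."""
    nodes = set()
    edges = set()
    for line in edge_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        a, b = fields[0], fields[1]
        if a == b:
            continue  # ignore self-loops
        nodes.update((a, b))
        edges.add(frozenset((a, b)))  # undirected: A-B is the same edge as B-A
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2.0 * len(edges) / (n * (n - 1))


# Hypothetical four-node example: 4 edges out of 6 possible.
example = ["A B 3.2", "B C 1.5", "A C 2.0", "C D 4.1"]
print(graph_density(example))
```

Only nodes that appear in at least one edge are counted, so isolated nodes do not
lower the value returned by this sketch.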


3. Node-centric analysis with Gaussian mixture modelling

Node-centric analysis is intended to be (optionally) called by NetNC_v2pt2.pl
This subsection describes components of the analysis that may be called separately. 

The script FCS.pl takes as input the genelist and the file written by NetNC with 
the suffix ".FG.id1_id2_nlp.txt". For example:
FCS.pl input_genelist NetNC_results.FG.id1_id2_nlp.txt FCS_outputFile.txt

FCS.pl output has the format: Node edgePvalueSum degree degree-normalised_edgePvalueSum
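
To illustrate the score itself, the Python sketch below recomputes the three output
columns from a gene list and edge-list lines in the 'nodeID1 nodeID2 -log(p-value)'
format. It is not a replacement for FCS.pl. Two assumptions are made here:
'normalised by degree' is taken to mean division by the node degree, and nodes in the
gene list that have no edges are given NFCS = 0. The function name nfcs_table is
illustrative.

```python
from collections import defaultdict


def nfcs_table(genelist, edge_lines):
    """Return {node: (edge_sum, degree, nfcs)} from 'id1 id2 -log(p)' lines."""
    edge_sum = defaultdict(float)
    degree = defaultdict(int)
    for line in edge_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        id1, id2, nlp = fields[0], fields[1], float(fields[2])
        for node in (id1, id2):
            edge_sum[node] += nlp
            degree[node] += 1
    table = {}
    for node in genelist:
        d = degree.get(node, 0)
        s = edge_sum.get(node, 0.0)
        # Assumption: nodes without edges receive NFCS = 0.
        table[node] = (s, d, s / d if d else 0.0)
    return table


# Hypothetical input: g1 has two edges, g4 has none.
table = nfcs_table(["g1", "g2", "g3", "g4"], ["g1 g2 4.0", "g1 g3 2.0"])
print(table["g1"])  # (6.0, 2, 3.0)
print(table["g4"])  # (0.0, 0, 0.0)
```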

Gaussian mixture modelling can be run with the NetNCmixmodel.R script as follows:

R --slave --no-save --args FCS_output.txt output_file NetNC_home_directory < NetNC_home_directory/mixtureModelling/NetNCmixmodel.R

The 'NetNC_home_directory' is the directory where NetNC_v2pt2.pl and FCS.pl are located. 

NetNCmixmodel.R runs mixture modelling on the NFCS scores written by FCS.pl
and seeks to identify an NFCS threshold to define 'noise' nodes. If a unimodal
model is returned, a tiny Gaussian noise component is added to NFCS=0 values
and mixture modelling is rerun (the .log file details which analyses were run).
The output file with suffix '.coherentNodes' includes nodes scoring above the
NFCS threshold value determined by this procedure (please see the manuscript methods
section for further details). Functions in netNCmix.R may also be
called directly.

4. A note on networks for use with NetNC
NetNC requires a network as input, provided using the -n option. This network defines the 
context for the analysis, providing the structure that enables NetNC to discover relationships
within the input node list. Network nodes may be genes and the input list should be a subset 
of the nodes in the network. NetNC might be usefully applied to any network and input list 
across various complexity science application domains (biology, economics, social sciences, 
telecommunications, geography etc.). An example network for use with NetNC is DroFN, which is
available from: https://www.ebi.ac.uk/biostudies/files/S-BSST460/DroFN.zip . DroFN is described
in the citation at the end of this README. A further example network is described in Overton et 
al. BMC Systems Biology 5, Article number: 68 (2011) and is available from the link below: 
https://static-content.springer.com/esm/art%3A10.1186%2F1752-0509-5-68/MediaObjects/12918_2010_685_MOESM3_ESM.ZIP


##
## TESTING THE INSTALLATION AND EXAMPLE OUTPUT
##

The directory 'test/' includes data to enable you to test that NetNC
is working correctly. 

1. The tests may be run as follows (from the netNC/ directory):

a) Network Neighbourhood Clustering (NNC)

NetNC_v2pt2.pl -i test/test_genelist.txt -n test/network/test_net.txt -o test/output/NNConly/NNCz10 -z 10

This usually runs in less than one minute.

Please note: the above command runs 10 resamples (-z 10); however, at 
least 100 resamples are recommended for general use (more samples enable
greater precision for estimation of pFDR).

b) Node-centric Analysis mode:

NetNC_v2pt2.pl -n test/network/test_net.txt -i test/test_genelist.txt -o test/output/NodeCentric/NodeCent_z10 -M -p test/output/NNConly/NNCz10.BG.nlPonly.txt -z 10

Typically takes less than three minutes to complete.

This command is also an example of using the background p-value distribution generated
in test a) above, with the -p option; however it is not necessary to do so for 
Node-centric analysis. If using the -p option it is also important to specify the 
number of resamples taken (-z  option) so that pFDR estimation is correct; if -z is not 
specified, the default number of resamples (100) is assumed and a warning message is 
issued.

c) Functional Target Identification mode:
NetNC_v2pt2.pl -n test/network/test_net.txt -i test/test_genelist.txt -o test/output/FTI/FTIz10 -z 10 -F

Runtime is around 15 minutes. This mode calculates NNC and runs the iterative minimum
cut, applying pFDR and network density thresholds optimised for the Functional Target 
Identification (FTI) task.

d) Pathway Identification mode and node-centric analysis:
NetNC_v2pt2.pl -n test/network/test_net.txt -i test/test_genelist.txt -o test/output/PID/PID_NodeCent_z10 -z 10 -E -M -l test/test_background_genelist.txt

Usually takes about 15 minutes to compute. This mode calculates NNC and runs the iterative
minimum cut, applying pFDR and network density thresholds optimised for the Pathway 
Identification (PID) task. This command also includes an example of using the -l option
to specify a background node list for the resampling procedure (e.g. detected genes from a 
microarray experiment). 
Please note that to ensure correct estimation of pFDR the node list given to NetNC with
the -i option should be a subset of the background list provided in the -l option.
At present, no warning is given by NetNC if the -i (input) node list is not a subset of
the -l (background) node list.
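
Because NetNC does not check this itself, the subset relationship can be verified
before a run. The Python sketch below is one way to do so; it assumes that both node
lists hold one identifier per line (the README does not state the file format), and
the function names are illustrative rather than part of NetNC.

```python
def parse_node_list(lines):
    """Collect node identifiers from an iterable of lines, ignoring blanks."""
    return {line.strip() for line in lines if line.strip()}


def missing_from_background(input_nodes, background_nodes):
    """Return the sorted input nodes that are absent from the background list."""
    return sorted(set(input_nodes) - set(background_nodes))


# Hypothetical in-memory lists; with real files, pass open file handles to
# parse_node_list, e.g. parse_node_list(open("test/test_genelist.txt")).
input_nodes = parse_node_list(["g1\n", "g2\n", "g9\n"])
background = parse_node_list(["g1\n", "g2\n", "g3\n"])
print(missing_from_background(input_nodes, background))  # ['g9']
```

An empty list means that every input node is present in the background list.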

e) Iterative minimum cut in standalone mode:

python mincut/itercut.py -i test/output/NNConly/NNCz10.FDRthresholded_pairs.txt -o test/output/mincutOnly/NNCz10_FDR0pt1_mincutThresh0pt1.txt -t 0.1

Runtime is usually 10 minutes. This command depends on test a) above having
successfully completed.

2. Example output for tests a) through e), above, is given in subdirectories
of test/exampleOutput. Please note that there might be differences between the
example output and the output of your own test runs, due to variation introduced by
the resampling procedure. The subsections below summarise the files provided in the example
output directories:

a) test/exampleOutput/NNConly

NNCz10.FG.id1_id2_nlp.txt - The Network Neighbourhood Clustering p-values for the
			    subnetwork induced by test_genelist.txt in test_net.txt
			    Format: nodeID1 nodeID2 -log(p-value)

NNCz10_10_resamples.txt  -  Resampled nodes taken for empirical estimation of pFDR.
			    Each resample begins with the following header:
			    '### [resample number]'
			    Where [resample number] is an integer.
			    The -z 10 option will produce 10 headers (numbered from
			    0 to 9). The list of resampled nodes is given
			    below each header.
			     
NNCz10.BG.nlPonly.txt   -   List of -log p-values for the networks induced by
			    all resampling from test_net.txt, which provides
			    the empirical null distribution for pFDR estimation.
			    (No headers are given in this file).

NNCz10.minSignif_nlP.log -  Reports the NNC -log(p-value) estimated to match
			    the pFDR threshold. Also records the number of resamples 
			    taken for pFDR estimation.

NNCz10.FDRthresholded_pairs.txt - Network of edges that meet the pFDR threshold (default 
				  0.1). Format: nodeID1 nodeID2 -log(p-value)

NNCz10.FDR.txt		 -  The mapping between NNC score and pFDR.
			    Format: NNC_score pFDR
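
The resamples file described above (NNCz10_10_resamples.txt) can be read back with a
short parser. The Python sketch below groups node identifiers under each
'### [resample number]' header. It assumes that node identifiers are separated by
whitespace or newlines and that the header holds only the integer after '###'; the
function name parse_resamples is illustrative.

```python
def parse_resamples(lines):
    """Map resample number -> list of node IDs from '### N' delimited lines."""
    resamples = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("###"):
            current = int(line[3:].strip())  # header: '### [resample number]'
            resamples[current] = []
        elif current is not None:
            resamples[current].extend(line.split())
    return resamples


# Hypothetical two-resample example.
example = ["### 0", "g1", "g2", "### 1", "g3", "g4"]
print(parse_resamples(example))  # {0: ['g1', 'g2'], 1: ['g3', 'g4']}
```

With -z 10 the returned dictionary would be expected to have the keys 0 to 9.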
		

b) test/exampleOutput/NodeCentric

NodeCent_z10.FG.id1_id2_nlp.txt - The Network Neighbourhood Clustering p-values for the
			    	  subnetwork induced by test_genelist.txt in test_net.txt
			    	  Format: nodeID1 nodeID2 -log(p-value)

NodeCent_z10.minSignif_nlP.log - Reports the NNC -log(p-value) estimated to 
				 match the pFDR threshold. Also records the number of 
			    	 resamples taken for pFDR estimation.

NodeCent_z10.FDRthresholded_pairs.txt - Network of edges that meet the pFDR threshold 
					(default 0.1). 
					Format: nodeID1 nodeID2 -log(p-value)

NodeCent_z10.FDR.txt     	-  The mapping between NNC score and pFDR.
			    	   Format: NNC_score pFDR
				   
NodeCent_z10.FCS.txt		- Node Functional Coherence Score (NFCS) values
				  calculated by FCS.pl.
				  Format: Node edgePvalueSum degree NFCS
				  (NFCS is the sum of -log(p-values) normalised by degree)
				  
NodeCent_z10_NFCS-mixturemodel.txt - Describes the Gaussian mixture model fitted to the 
				     NFCS distribution using Expectation-Maximisation
				     and model selection with Bayesian Information 
				     Criterion regularisation.
				    
NodeCent_z10_NFCS-mixturemodel.log - Indicates the NFCS threshold determined from analysis
				     of the Gaussian mixture modelling, and summarises the
				     steps taken in determining the threshold value.
				     
NodeCent_z10_NFCS-mixturemodel.coherentNodes - List of nodes that pass the NFCS threshold
					       and so are classed as functionally coherent.
					       
NodeCent_z10_NFCS-mixturemodel_coherentNet.txt - Network where nodes are coherent in 
						the mixture modelling analysis and edges 
						have Bonferroni-corrected -log(p)<=0.05
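
As an illustration of how the 'NNC_score pFDR' mapping in the .FDR.txt file might be
queried, the Python sketch below returns the smallest NNC score whose pFDR is at or
below a chosen threshold. This is only one plausible reading of the file and is not
NetNC's own procedure; the value NetNC reports in the .minSignif_nlP.log file should be
treated as authoritative. The function name min_score_for_pfdr is illustrative.

```python
def min_score_for_pfdr(fdr_lines, threshold=0.1):
    """Smallest NNC score with pFDR <= threshold, or None if no line passes."""
    passing = []
    for line in fdr_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            score, pfdr = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip any non-numeric line, such as a header
        if pfdr <= threshold:
            passing.append(score)
    return min(passing) if passing else None


# Hypothetical mapping: the first score meeting pFDR <= 0.1 is 4.0.
example = ["1.0 0.8", "2.5 0.3", "4.0 0.09", "6.0 0.01"]
print(min_score_for_pfdr(example))  # 4.0
```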
					       

c) test/exampleOutput/FTI/
					       
FTIz10.FG.id1_id2_nlp.txt - The Network Neighbourhood Clustering p-values for the
			    subnetwork induced by test_genelist.txt in test_net.txt
			    Format: nodeID1 nodeID2 -log(p-value)

FTIz10_10_resamples.txt  -  Resampled nodes taken for estimation of pFDR.
			    Each resample begins with the header:
			    '### [resample number]'
			    Where [resample number] is an integer.
			    The -z 10 option will produce 10 headers (numbered from
			    0 to 9). The list of resampled nodes is given
			    below each header.
			     
FTIz10.BG.nlPonly.txt   -   List of -log p-values for the networks induced by
			    all resampling from test_net.txt, which provides
			    the null distribution for empirical estimation of pFDR. 
			    There are no headers in this file.

FTIz10.minSignif_nlP.log -  Reports the NNC -log(p-value) estimated to match
			    the pFDR threshold. Also records the number of resamples 
			    taken for pFDR estimation.

FTIz10.FDRthresholded_pairs.txt - Network of edges that meet the pFDR threshold 
				  (default 0.1). Format: nodeID1 nodeID2 -log(p-value)

FTIz10.FDR.txt		 -  The mapping between NNC score and pFDR.
			    Format: NNC_score pFDR
			    
FTIz10.FDRthresholded_mincutDensity0pt306.txt - Edges passing the pFDR and minimum cut
						density thresholds for Functional Target 
						Identification.	
						Format: nodeID1 nodeID2 -log(p-value)		    

FTIz10.FDRthresholded_mincutDensity0pt306_noDoublets.txt - Final NetNC output corresponding
							   to the data used in benchmarking.
							   Edges pass the pFDR and minimum
							   cut density thresholds for FTI
							   and any two-node components are
							   deleted.
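
The 'noDoublets' step can be pictured as removing every connected component that
consists of exactly two nodes. The Python sketch below applies that idea to a list of
(nodeID1, nodeID2, -log(p-value)) tuples. It assumes an undirected network without
self-loops and is not taken from the NetNC source; the function name remove_doublets
is illustrative.

```python
from collections import defaultdict


def remove_doublets(edges):
    """Drop edges that form connected components of exactly two nodes."""
    neighbours = defaultdict(set)
    for a, b, _ in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    kept = []
    for a, b, weight in edges:
        # A doublet is an edge whose two endpoints have no other neighbours.
        if neighbours[a] == {b} and neighbours[b] == {a}:
            continue
        kept.append((a, b, weight))
    return kept


# Hypothetical network: A-B is an isolated pair, C-D-E is a three-node chain.
edges = [("A", "B", 3.1), ("C", "D", 2.0), ("D", "E", 1.2)]
print(remove_doublets(edges))  # [('C', 'D', 2.0), ('D', 'E', 1.2)]
```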

d) test/exampleOutput/PID/

PID_NodeCent_z10.FG.id1_id2_nlp.txt - Network Neighbourhood Clustering p-values for the
			   	      subnetwork induced by test_genelist.txt in 
				      test_net.txt.
			    	      Format: nodeID1 nodeID2 -log(p-value)
				  
PID_NodeCent_z10_10_resamples.txt  -  Resampled nodes taken for estimation of pFDR.
			    	      Each resample begins with the header:
			    	      '### [resample number]'
			    	      Where [resample number] is an integer.
			    	      The -z 10 option will produce 10 headers (numbered 
				      from 0 to 9). The list of resampled nodes is given
	  			      below each header.
			     
PID_NodeCent_z10.BG.nlPonly.txt   -   List of -log p-values for the networks induced by
			    	      all resampling from test_net.txt, which provides
			    	      the null distribution for empirical estimation of
				      pFDR. There are no headers in this file.
			    
PID_NodeCent_z10.minSignif_nlP.log - Reports the -log(p-value) estimated to match the 
				     NNC pFDR threshold. Also records the number 
				     of resamples taken for pFDR estimation.

PID_NodeCent_z10.FDRthresholded_pairs.txt - Network of edges that meet the pFDR threshold 
					    (default 0.1). 
					    Format: nodeID1 nodeID2 -log(p-value)

PID_NodeCent_z10.FDR.txt     	-  The mapping between NNC score and pFDR.
			    	   Format: NNC_score pFDR
				   
PID_NodeCent_z10.FCS.txt	- Node Functional Coherence Score (NFCS) values
				  calculated by FCS.pl.
				  Format: Node edgePvalueSum degree NFCS
				  (NFCS is the sum of -log(p-values) normalised by degree)
				  
PID_NodeCent_z10_NFCS-mixturemodel.txt - Describes the Gaussian mixture model fitted to the 
				         NFCS distribution using Expectation-Maximisation
				         and model selection with Bayesian Information 
				         Criterion regularisation.
				    
PID_NodeCent_z10_NFCS-mixturemodel.log - Indicates the NFCS threshold determined from 
					 analysis of the Gaussian mixture modelling, and 
					 summarises the steps taken in determining the 
					 threshold value.
				     
PID_NodeCent_z10_NFCS-mixturemodel.coherentNodes - List of nodes that pass the 
						   NFCS threshold and so are classed as
						   functionally coherent.
						   
PID_NodeCent_z10_NFCS-mixturemodel_coherentNet.txt - Network where nodes are coherent 
						     in the mixture modelling analysis 
						     and edges have Bonferroni-corrected
						     -log(p)<=0.05					       
							   
					       
PID_NodeCent_z10.FDRthresholded_mincutDensity0pt5044.txt - Edges passing the pFDR and 
							   minimum cut density thresholds 
							   for Pathway Identification.	
							   File format: 
							   nodeID1 nodeID2 -log(p-value)		    

PID_NodeCent_z10.FDRthresholded_mincutDensity0pt5044_noDoublets.txt - Edges passing the pFDR and
	                                                              minimum cut density thresholds
            	                                                      for Pathway Identification;
								      two-node components are deleted.
                                                          	      File format: 
	                                                              nodeID1 nodeID2 -log(p-value)

e) test/exampleOutput/mincutOnly/ 					       		

NNCz10_FDR0pt1_mincutThresh0pt1.txt - The network of edges passing the minimum cut density
				      stopping criterion (0.1 in this example). These edges
				      also pass the NNC pFDR threshold of 0.1 which was
				      applied by NetNC prior to running the minimum cut 
				      algorithm in stand-alone mode.
				      Format: nodeID1 nodeID2 -log(p-value)
				      (-log(p-value) previously calculated by NetNC).


##
## Contributors and contact
##

The NetNC software distribution was developed by Ian Overton, Jeremy Owen (iterative 
minimum cut) and Alex Lubbock (Gaussian Mixture Modelling).

NetNC is maintained by Ian Overton, who can be reached at: first_name_initial* dot overton at qub dot ac dot uk
* substitute with i  
Also see: go.qub.ac.uk/IanOverton

##
## Citing NetNC
##

If you use any of the code in this NetNC software distribution please cite:
Overton IM, Sims A, Owen JA, Heale B, Ford M, Lubbock ALR, Pairo-Castineira E, Essafi E (2020). 'Functional
Transcription Factor Target Networks Illuminate Control of Epithelial Remodelling'. Cancers 12, 2823
DOI: https://doi.org/10.3390/cancers12102823
