Commit 621b574

Author: Rajiv Narayan
Minor tutorial updates and re-org
1 parent 4245bb9, commit 621b574

6 files changed

+49
-47
lines changed

README.md

Lines changed: 0 additions & 1 deletion
@@ -18,7 +18,6 @@ setup
 Contents
 --------
 * [Working with CMap data formats](docs/Formats.md)
-* [Working with annotated matrices in GCTX and GCT formats](docs/gctx_tutorial.html)
 * [L1000 data-processing pipeline](docs/DataPipeline.md)
 
 Software Requirements

docs/Formats.md

Lines changed: 3 additions & 2 deletions
@@ -10,8 +10,9 @@ Handling common data formats used in CMapM
 
 Datasets
 --
-The majority of data generated via the Connectivity Map
-tab-delimited text GCT files with the file extension .gct and binary equivalent files called GCTX with file extension .gctx
+The majority of data generated via the Connectivity Map are supplied as tab-delimited text GCT files with the file extension .gct and binary equivalent files called GCTX with file extension .gctx
+
+* [Working with annotated matrices in GCTX and GCT formats](gctx_tutorial.html)
 
 Lists
 -----

docs/gctx_tutorial.html

Lines changed: 33 additions & 37 deletions
@@ -6,7 +6,7 @@
 <!--
 This HTML was auto-generated from MATLAB code.
 To make changes, update the MATLAB code and republish this document.
---><title>Working with annotated matrices using the GCT and GCTX data formats in MATLAB</title><meta name="generator" content="MATLAB 8.4"><link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"><meta name="DC.date" content="2017-11-22"><meta name="DC.source" content="gctx_tutorial.m"><style type="text/css">
+--><title>Working with annotated matrices using the GCT and GCTX data formats in MATLAB</title><meta name="generator" content="MATLAB 8.4"><link rel="schema.DC" href="http://purl.org/dc/elements/1.1/"><meta name="DC.date" content="2017-11-27"><meta name="DC.source" content="gctx_tutorial.m"><style type="text/css">
 html,body,div,span,applet,object,iframe,h1,h2,h3,h4,h5,h6,p,blockquote,pre,a,abbr,acronym,address,big,cite,code,del,dfn,em,font,img,ins,kbd,q,s,samp,small,strike,strong,sub,sup,tt,var,b,u,i,center,dl,dt,dd,ol,ul,li,fieldset,form,label,legend,table,caption,tbody,tfoot,thead,tr,th,td{margin:0;padding:0;border:0;outline:0;font-size:100%;vertical-align:baseline;background:transparent}body{line-height:1}ol,ul{list-style:none}blockquote,q{quotes:none}blockquote:before,blockquote:after,q:before,q:after{content:'';content:none}:focus{outine:0}ins{text-decoration:none}del{text-decoration:line-through}table{border-collapse:collapse;border-spacing:0}
 
 html { min-height:100%; margin-bottom:1px; }
@@ -66,7 +66,7 @@
 
 
 
-</style></head><body><div class="content"><h1>Working with annotated matrices using the GCT and GCTX data formats in MATLAB</h1><!--introduction--><p>Script used to generate this tutorial: <a href="gctx_tutorial.m">gctx_tutorial.m</a></p><!--/introduction--><h2>Contents</h2><div><ul><li><a href="#1">Reading a GCT or GCTX file</a></li><li><a href="#2">GCT data representation</a></li><li><a href="#3">Layout of the GCT structure</a></li><li><a href="#4">For large files, it can be useful to read just the metadata</a></li><li><a href="#5">Extracting a subset of data from a GCTX file</a></li><li><a href="#6">Working with metadata</a></li><li><a href="#7">List all available row metadata fields</a></li><li><a href="#8">Read all row metadata into a structure</a></li><li><a href="#9">Annotate a dataset from a structure</a></li><li><a href="#10">Read contents of metadata field</a></li><li><a href="#11">Add metadata fields from cell arrays</a></li><li><a href="#12">Remove metadata fields</a></li><li><a href="#13">Merging GCT/x files</a></li><li><a href="#14">Slicing GCT/x files</a></li><li><a href="#15">Transpose a GCT/x</a></li><li><a href="#16">Writing GCT/x files</a></li><li><a href="#17">Compute correlations</a></li><li><a href="#18">Clean-up</a></li></ul></div><h2>Reading a GCT or GCTX file<a name="1"></a></h2><p>GCT and GCTx files can be read in the same way. We'll use the same two files throughout this tutorial.</p><pre class="codeinput">gct_file_location = fullfile(cmapmpath, <span class="string">'resources'</span>, <span class="string">'example.gct'</span>);
+</style></head><body><div class="content"><h1>Working with annotated matrices using the GCT and GCTX data formats in MATLAB</h1><!--introduction--><p>Script used to generate this tutorial: <a href="gctx_tutorial.m">gctx_tutorial.m</a></p><!--/introduction--><h2>Contents</h2><div><ul><li><a href="#1">Reading a GCT or GCTX file</a></li><li><a href="#2">GCT data representation</a></li><li><a href="#3">Layout of the GCT structure</a></li><li><a href="#4">For large files, it can be useful to read just the metadata</a></li><li><a href="#5">Extracting a subset of data from a GCTX file</a></li><li><a href="#6">Working with metadata</a></li><li><a href="#7">List all available row metadata fields</a></li><li><a href="#8">Read all row metadata into a structure</a></li><li><a href="#9">Annotate a dataset from a structure</a></li><li><a href="#10">Read contents of a metadata field</a></li><li><a href="#11">Add metadata fields from cell arrays</a></li><li><a href="#12">Remove metadata fields</a></li><li><a href="#13">Merging GCT/x files</a></li><li><a href="#14">Slicing GCT/x files</a></li><li><a href="#15">Transpose a GCT/x</a></li><li><a href="#16">Writing GCT/x files</a></li><li><a href="#17">Compute correlations</a></li><li><a href="#18">Clean-up</a></li></ul></div><h2>Reading a GCT or GCTX file<a name="1"></a></h2><p>GCT and GCTx files can be read in the same way. We'll use the same two files throughout this tutorial.</p><pre class="codeinput">gct_file_location = fullfile(cmapmpath, <span class="string">'resources'</span>, <span class="string">'example.gct'</span>);
 gctx_file_location = fullfile(cmapmpath, <span class="string">'resources'</span>, <span class="string">'example.gctx'</span>);
 ds1 = cmapm.Pipeline.parse_gctx(gct_file_location);
 ds2 = cmapm.Pipeline.parse_gctx(gctx_file_location);
@@ -76,7 +76,7 @@
 Done.
 
 Reading /Users/narayan/workspace/cmapM/resources/example.gctx [978x1476]
-Done [0.77 s].
+Done [0.78 s].
 </pre><h2>GCT data representation<a name="2"></a></h2><p>GCT and GCTx files are both represented in memory as structures.</p><pre class="codeinput">disp(class(ds1));
 disp(class(ds2));
 </pre><pre class="codeoutput">struct
@@ -99,7 +99,7 @@
 </pre><h2>For large files, it can be useful to read just the metadata<a name="4"></a></h2><pre class="codeinput">ds_with_only_meta = cmapm.Pipeline.parse_gctx(gctx_file_location, <span class="string">'annot_only'</span>, true);
 disp(ds_with_only_meta);
 <span class="comment">% Note that the mat field is empty, but the metadata is the same as above</span>
-</pre><pre class="codeoutput">Reading /Users/narayan/workspace/cmapM/resources/example.gctx Done [0.76 s].
+</pre><pre class="codeoutput">Reading /Users/narayan/workspace/cmapM/resources/example.gctx Done [0.73 s].
 mat: []
 rid: {978x1 cell}
 rhd: {11x1 cell}
@@ -121,7 +121,7 @@
 ds_subset = cmapm.Pipeline.parse_gctx(gctx_file_location, <span class="string">'rid'</span>, my_rids, <span class="string">'cid'</span>, my_cids);
 </pre><pre class="codeoutput">Reading /Users/narayan/workspace/cmapM/resources/example.gctx [3x1]
 Performing 3 hyperslab selections
-Done [0.77 s].
+Done [0.76 s].
 </pre><h2>Working with metadata<a name="6"></a></h2><p>We provide several convenience functions to operate on the metadata in a dataset.</p><p>Note that while you can modify the attributes of a dataset object directly, it is not recommended since it could affect the integrity of the data structure.</p><h2>List all available row metadata fields<a name="7"></a></h2><pre class="codeinput">row_fields = ds_subset.rhd;
 col_fields = ds_subset.chd;
 
@@ -195,7 +195,7 @@
 ds_subset = cmapm.Pipeline.ds_set_annotations(ds_subset, new_meta, <span class="string">'dim'</span>, <span class="string">'row'</span>);
 <span class="comment">% verify if the new fields have been added</span>
 assert(all(ismember({<span class="string">'new_field1'</span>, <span class="string">'new_field2'</span>}, ds_subset.rhd)));
-</pre><h2>Read contents of metadata field<a name="10"></a></h2><pre class="codeinput">gene_symbol = cmapm.Pipeline.ds_get_meta(ds_subset, <span class="string">'row'</span>, <span class="string">'pr_gene_symbol'</span>);
+</pre><h2>Read contents of a metadata field<a name="10"></a></h2><pre class="codeinput">gene_symbol = cmapm.Pipeline.ds_get_meta(ds_subset, <span class="string">'row'</span>, <span class="string">'pr_gene_symbol'</span>);
 disp(gene_symbol);
 </pre><pre class="codeoutput"> 'VDAC1'
 'SORBS3'
@@ -218,29 +218,18 @@
 beadset_ids = cmapm.Pipeline.ds_get_meta(ds, <span class="string">'row'</span>, <span class="string">'pr_bset_id'</span>);
 dp52_bool_array = strcmp(<span class="string">'dp52'</span>, beadset_ids);
 dp52_rids = ds.rid(dp52_bool_array);
-length(dp52_rids)
 
 <span class="comment">% Get cids corresponding to DMSO samples.</span>
 pert_inames = cmapm.Pipeline.ds_get_meta(ds, <span class="string">'column'</span>, <span class="string">'pert_iname'</span>);
 dmso_bool_array = strcmp(<span class="string">'DMSO'</span>, pert_inames);
 dmso_cids = ds.cid(dmso_bool_array);
-length(dmso_cids)
 
-<span class="comment">% Confirm that the size of sliced is correct: 489 probes x 100 samples.</span>
+<span class="comment">% Confirm that the dimensions of sliced is correct: 489 probes x 100 samples.</span>
 sliced = cmapm.Pipeline.ds_slice(ds, <span class="string">'rid'</span>, dp52_rids, <span class="string">'cid'</span>, dmso_cids);
+assert(isequal(size(sliced.mat), [length(dp52_rids), length(dmso_cids)]), <span class="string">'Dimension mismatch'</span>);
 disp(size(sliced.mat));
 </pre><pre class="codeoutput">Reading /Users/narayan/workspace/cmapM/resources/example.gctx [978x1476]
-Done [0.76 s].
-
-ans =
-
-489
-
-
-ans =
-
-100
-
+Done [0.75 s].
 489 100
 
 </pre><h2>Transpose a GCT/x<a name="15"></a></h2><pre class="codeinput">transposed = cmapm.Pipeline.ds_transpose(ds);
@@ -250,7 +239,7 @@
 out_gctx = cmapm.Pipeline.mkgctx(<span class="string">'example_out.gctx'</span>, ds);
 
 <span class="comment">% Note that the same dataset object can be written out as either a GCT or GCTx.</span>
-<span class="comment">% Note alsa that for convenience the dimensions of the matrix is automatically appended to</span>
+<span class="comment">% Note also that for convenience the dimensions of the matrix is automatically appended to</span>
 <span class="comment">% the filename, and the columns go first.</span>
 </pre><pre class="codeoutput">Saving file to example_out_n1476x978.gct
 Dimensions of matrix: [978x1476]
@@ -262,16 +251,17 @@
 done [0.26s].
 </pre><h2>Compute correlations<a name="17"></a></h2><p>Compute pairwise spearman correlations between columns of dataset</p><pre class="codeinput">cc = cmapm.Pipeline.ds_corr(ds);
 
-<span class="comment">% cc is itself a GCT structure</span>
+<span class="comment">% cc is a square and symmetric GCT structure</span>
+assert(isequal(size(cc.mat), [size(ds.mat, 2), size(ds.mat, 2)]), <span class="string">'CC is not square'</span>);
+assert(isequal(cc.mat, cc.mat'), <span class="string">'CC is not symmetric'</span>);
+
 <span class="comment">% Examine its contents</span>
-disp(cc.mat(1:5, 1:5));
-</pre><pre class="codeoutput"> 1.0000 0.9042 0.8794 0.8476 0.8184
-0.9042 1.0000 0.9022 0.8620 0.8363
-0.8794 0.9022 1.0000 0.8636 0.8455
-0.8476 0.8620 0.8636 1.0000 0.8187
-0.8184 0.8363 0.8455 0.8187 1.0000
-
-</pre><h2>Clean-up<a name="18"></a></h2><pre class="codeinput">delete(out_gct)
+imagesc(cc.mat(1:20, 1:20));
+colorbar
+caxis([0.5, 1]);
+axis <span class="string">square</span>
+title(<span class="string">'Pairwise Spearman Correlation'</span>);
+</pre><img vspace="5" hspace="5" src="gctx_tutorial_01.png" alt=""> <h2>Clean-up<a name="18"></a></h2><pre class="codeinput">delete(out_gct)
 delete(out_gctx)
 </pre><p class="footer"><br><a href="http://www.mathworks.com/products/matlab/">Published with MATLAB&reg; R2014b</a><br></p></div><!--
 ##### SOURCE BEGIN #####
@@ -347,7 +337,7 @@
 % verify if the new fields have been added
 assert(all(ismember({'new_field1', 'new_field2'}, ds_subset.rhd)));
 
-%% Read contents of metadata field
+%% Read contents of a metadata field
 gene_symbol = cmapm.Pipeline.ds_get_meta(ds_subset, 'row', 'pr_gene_symbol');
 disp(gene_symbol);
 %% Add metadata fields from cell arrays
@@ -376,16 +366,15 @@
 beadset_ids = cmapm.Pipeline.ds_get_meta(ds, 'row', 'pr_bset_id');
 dp52_bool_array = strcmp('dp52', beadset_ids);
 dp52_rids = ds.rid(dp52_bool_array);
-length(dp52_rids)
 
 % Get cids corresponding to DMSO samples.
 pert_inames = cmapm.Pipeline.ds_get_meta(ds, 'column', 'pert_iname');
 dmso_bool_array = strcmp('DMSO', pert_inames);
 dmso_cids = ds.cid(dmso_bool_array);
-length(dmso_cids)
 
-% Confirm that the size of sliced is correct: 489 probes x 100 samples.
+% Confirm that the dimensions of sliced is correct: 489 probes x 100 samples.
 sliced = cmapm.Pipeline.ds_slice(ds, 'rid', dp52_rids, 'cid', dmso_cids);
+assert(isequal(size(sliced.mat), [length(dp52_rids), length(dmso_cids)]), 'Dimension mismatch');
 disp(size(sliced.mat));
 
 %% Transpose a GCT/x
@@ -397,16 +386,23 @@
 out_gctx = cmapm.Pipeline.mkgctx('example_out.gctx', ds);
 
 % Note that the same dataset object can be written out as either a GCT or GCTx.
-% Note alsa that for convenience the dimensions of the matrix is automatically appended to
+% Note also that for convenience the dimensions of the matrix is automatically appended to
 % the filename, and the columns go first.
 
 %% Compute correlations
 % Compute pairwise spearman correlations between columns of dataset
 cc = cmapm.Pipeline.ds_corr(ds);
 
-% cc is itself a GCT structure
+% cc is a square and symmetric GCT structure
+assert(isequal(size(cc.mat), [size(ds.mat, 2), size(ds.mat, 2)]), 'CC is not square');
+assert(isequal(cc.mat, cc.mat'), 'CC is not symmetric');
+
 % Examine its contents
-disp(cc.mat(1:5, 1:5));
+imagesc(cc.mat(1:20, 1:20));
+colorbar
+caxis([0.5, 1]);
+axis square
+title('Pairwise Spearman Correlation');
 
 %% Clean-up
 delete(out_gct)
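
Note on the filename comment in the hunk above: the tutorial output shows 'example_out.gct' being saved as example_out_n1476x978.gct for a [978x1476] matrix, i.e. the column count precedes the row count. A minimal sketch of how such a name could be assembled follows. It is illustrative only and is not the actual mkgct/mkgctx implementation; the variables mat, requested, and out_name are hypothetical.

```matlab
% Illustrative only: NOT the cmapM mkgct/mkgctx implementation.
mat = zeros(978, 1476);          % hypothetical stand-in for ds.mat (rows x columns)
requested = 'example_out.gct';   % name passed by the caller

% Split the requested name so the suffix can go before the extension.
[folder, base, ext] = fileparts(requested);
[n_row, n_col] = size(mat);

% Columns first, then rows, as in the tutorial output.
out_name = fullfile(folder, sprintf('%s_n%dx%d%s', base, n_col, n_row, ext));
disp(out_name);
```

This is why the tutorial captures the return values of mkgct and mkgctx (out_gct, out_gctx) and passes those to delete in the clean-up step, rather than the names originally requested.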

docs/gctx_tutorial.m

Lines changed: 13 additions & 7 deletions
@@ -70,7 +70,7 @@
 % verify if the new fields have been added
 assert(all(ismember({'new_field1', 'new_field2'}, ds_subset.rhd)));
 
-%% Read contents of metadata field
+%% Read contents of a metadata field
 gene_symbol = cmapm.Pipeline.ds_get_meta(ds_subset, 'row', 'pr_gene_symbol');
 disp(gene_symbol);
 %% Add metadata fields from cell arrays
@@ -99,16 +99,15 @@
 beadset_ids = cmapm.Pipeline.ds_get_meta(ds, 'row', 'pr_bset_id');
 dp52_bool_array = strcmp('dp52', beadset_ids);
 dp52_rids = ds.rid(dp52_bool_array);
-length(dp52_rids)
 
 % Get cids corresponding to DMSO samples.
 pert_inames = cmapm.Pipeline.ds_get_meta(ds, 'column', 'pert_iname');
 dmso_bool_array = strcmp('DMSO', pert_inames);
 dmso_cids = ds.cid(dmso_bool_array);
-length(dmso_cids)
 
-% Confirm that the size of sliced is correct: 489 probes x 100 samples.
+% Confirm that the dimensions of sliced is correct: 489 probes x 100 samples.
 sliced = cmapm.Pipeline.ds_slice(ds, 'rid', dp52_rids, 'cid', dmso_cids);
+assert(isequal(size(sliced.mat), [length(dp52_rids), length(dmso_cids)]), 'Dimension mismatch');
 disp(size(sliced.mat));
 
 %% Transpose a GCT/x
@@ -120,16 +119,23 @@
 out_gctx = cmapm.Pipeline.mkgctx('example_out.gctx', ds);
 
 % Note that the same dataset object can be written out as either a GCT or GCTx.
-% Note alsa that for convenience the dimensions of the matrix is automatically appended to
+% Note also that for convenience the dimensions of the matrix is automatically appended to
 % the filename, and the columns go first.
 
 %% Compute correlations
 % Compute pairwise spearman correlations between columns of dataset
 cc = cmapm.Pipeline.ds_corr(ds);
 
-% cc is itself a GCT structure
+% cc is a square and symmetric GCT structure
+assert(isequal(size(cc.mat), [size(ds.mat, 2), size(ds.mat, 2)]), 'CC is not square');
+assert(isequal(cc.mat, cc.mat'), 'CC is not symmetric');
+
 % Examine its contents
-disp(cc.mat(1:5, 1:5));
+imagesc(cc.mat(1:20, 1:20));
+colorbar
+caxis([0.5, 1]);
+axis square
+title('Pairwise Spearman Correlation');
 
 %% Clean-up
 delete(out_gct)
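
Note on the new symmetry assertion: isequal(cc.mat, cc.mat') requires the matrix and its transpose to match exactly. Whether ds_corr guarantees an exactly symmetric result is not shown in this commit; if it does not, a tolerance-based comparison is one possible alternative. The sketch below uses synthetic data and base MATLAB corrcoef (Pearson, not Spearman) purely as a stand-in for ds_corr; the names mat, cc_mat, and tol are hypothetical.

```matlab
% Synthetic stand-in for ds.mat: 50 features x 8 samples (assumption, not cmapM data).
rng(0);
mat = randn(50, 8);

% corrcoef is base MATLAB and returns Pearson correlations between columns.
% It is used here only to obtain a correlation matrix to check.
cc_mat = corrcoef(mat);

% Square check, same form as the assertion added in the commit.
n_col = size(mat, 2);
assert(isequal(size(cc_mat), [n_col, n_col]), 'CC is not square');

% Symmetry check with a small absolute tolerance instead of exact equality.
tol = 1e-12;
assert(max(max(abs(cc_mat - cc_mat'))) <= tol, 'CC is not symmetric');
```

The tolerance of 1e-12 is an arbitrary illustrative choice; an appropriate value would depend on the data type and scale of the correlation matrix.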

docs/gctx_tutorial.png

5.89 KB

docs/gctx_tutorial_01.png

22.1 KB
