<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rsa="http://www.w3.org/ns/auth/rsa#"
xmlns:cert="http://www.w3.org/ns/auth/cert#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:v="http://www.w3.org/2006/vcard/ns#"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:dbr="http://dbpedia.org/resource/"
xmlns:dbp="http://dbpedia.org/property/"
xmlns:sioc="http://rdfs.org/sioc/ns#"
xmlns:wgs="http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
xmlns:org="http://www.w3.org/ns/org#"
xmlns:biblio="http://purl.org/net/biblio#"
xmlns:book="http://purl.org/NET/book/vocab#"
xmlns:ov="http://open.vocab.org/terms/"
xmlns:this="http://csarven.ca/publishing-statistical-linked-data"
xml:lang="en">
<head>
<meta charset="UTF-8"/>
<title>Publishing statistical Linked Data</title>
<meta name="description" content=""/>
<link rel="stylesheet" type="text/css" media="all" href="theme/base/css/display.css"/>
</head>
<body about="[this:]" typeof="foaf:Document sioc:Post biblio:Paper" class="hfeed journal">
<div id="wrap">
<div class="hentry">
<h1 property="dcterms:title" class="entry-title">Publishing statistical Linked Data</h1>
<p class="entry-subtitle">Case studies: World Bank, Eurostat, Irish Census</p>
<div id="authors">
<dl>
<dt>Authors</dt>
<dd class="entry-author">
<p><a rel="dcterms:creator" href="http://csarven.ca/#i" class="author_name">Sarven Capadisli</a></p>
<p><a about="http://csarven.ca/#i" rel="org:memberOf" href="http://deri.ie/" class="author_org">Digital Enterprise Research Institute</a>, <a class="author_org" href="http://nuigalway.ie/">National University of Ireland, Galway</a></p>
<p><a about="http://csarven.ca/#i" rel="foaf:mbox" href="mailto:info@csarven.ca" class="author_email">info@csarven.ca</a></p>
</dd>
</dl>
</div>
<div id="abstract" class="entry-summary">
<h2>Abstract</h2>
<p property="dcterms:abstract" datatype="">TODO</p>
</div>
<div property="dcterms:description" id="content" class="entry-content">
<div id="document-identifier">
<h2>Document Identifier</h2>
<p><code>http://csarven.ca/publishing-statistical-linked-data</code></p>
</div>
<div id="categories-and-subject-descriptors">
<h2>Categories and Subject Descriptors</h2>
<ul about="[this:]">
<li><a rel="dcterms:subject" href="http://www.acm.org/about/class/ccs98-html#H.4">H.4</a> [<strong>Information Systems Applications</strong>]: Linked Data</li>
<li><a rel="dcterms:subject" href="http://www.acm.org/about/class/ccs98-html#D.2">D.2</a> [<strong>Software Engineering</strong>]: Semantic Web</li>
</ul>
</div>
<div id="keywords">
<h2>Keywords</h2>
<ul about="[this:]" rel="dcterms:subject">
<li><a href="http://dbpedia.org/resource/Data_modeling">Data modeling</a></li>
<li><a href="http://dbpedia.org/resource/Knowledge_management">Knowledge management</a></li>
<li><a href="http://dbpedia.org/resource/Linked_Data">Linked Data</a></li>
<li><a href="http://dbpedia.org/resource/World_Bank">World Bank</a></li>
<li><a href="http://dbpedia.org/resource/Eurostat">Eurostat</a></li>
<li><a href="http://dbpedia.org/resource/Census">Census</a></li>
</ul>
</div>
<h2 id="introduction">Introduction</h2>
<p>TODO</p>
<h2 id="linked-statistics">Linked statistics</h2>
<p>TODO</p>
<h2 id="original-data">Original data</h2>
<p>TODO: Blurb about the datasets we are about to discuss, .. in what shape they are and how are they published...</p>
<h3 id="original-data_world-bank">World Bank</h3>
<p>The World Bank Group provides access to a comprehensive set of data about development in countries around the globe. The publicly available statistical data is compiled from officially-recognized international sources; to name a few, the data consists of development indicators, financial statements, climate change, projects and operations.</p>
<h3 id="original-data_eurostat">Eurostat</h3>
<p>TODO</p>
<h3 id="original-data_irish-census">Irish Census</h3>
<p>TODO</p>
<h2 id="case-studies">Case Studies</h2>
<p>TODO</p>
<h3 id="data-sources">Data sources</h3>
<h4 id="data-source_world-bank">World Bank</h4>
<p>The World Bank Group provides free and open access to several datasets in its <a href="http://data.worldbank.org/data-catalog">data catalog</a>. In our use case, we look at the datasets which comprise the bulk of the available data: <a href="http://data.worldbank.org/data-catalog/world-development-indicators">World Development Indicators</a> (<dfn><abbr title="World Development Indicators">WDI</abbr></dfn>), <a href="https://finances.worldbank.org/">World Bank Finances</a> (<dfn><abbr title="World Bank Finances">WBF</abbr></dfn>), <a href="http://data.worldbank.org/data-catalog/projects-portfolio">World Bank Projects and Operations</a> (<dfn><abbr title="World Bank Projects and Operations">WBPO</abbr></dfn>), and <a href="http://data.worldbank.org/developers/climate-data-api">World Bank Climate Change</a> (<dfn><abbr title="World Bank Climate Change">WBCC</abbr></dfn>). These datasets are available in XML and JSON formats via the World Bank API. Here is a brief summary of the datasets:</p>
<ul>
<li><em>World Development Indicators</em> are available as a single data cube containing indicators, countries, and time as dimension values, and a measured value.</li>
<li><em>World Bank Finances</em> comes in several sub-datasets with different structures, i.e., the observations in the datasets contain different dimensions, measures, and attributes.</li>
<li><em>World Bank Projects and Operations</em> are offered as information on project documents as opposed to statistical observations.</li>
<li><em>World Bank Climate Change</em> contains sub-datasets for different historical and future observations. They primarily include data on reference areas, time periods, statistical types (averages and anomalies), measured variables and derived statistics, and global circulation models.</li>
</ul>
<h4 id="data-source_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="data-source_irish-census">Irish Census</h4>
<p>TODO</p>
<h3 id="data-retrieval">Data retrieval</h3>
<h4 id="data-retrieval_world-bank">World Bank</h4>
<p>The World Bank datasets were collected by making requests to the World Bank API endpoints using the XML output format. XML was preferred over JSON so that the data could be easily transformed to an RDF/XML serialization with XSLT. Table [<a href="#data-retrieval_world-bank_table">1</a>] provides retrieval information about the datasets.</p>
<table id="data-retrieval_world-bank_table">
<caption><strong>Table 1.</strong> World Bank data retrieval</caption>
<thead><tr><th>Datasets</th><th>URLs</th><th>Format</th><th>Number of requests</th><th>Data size</th></tr></thead>
<tfoot><tr><td colspan="5">The metadata for these datasets are retrieved from <code>https://finances.worldbank.org/api/views.xml</code>, <code>http://api.worldbank.org/{lang}/{sources, topics, regions, incomeLevels, lendingTypes, countries, or indicators}?format=xml</code>, and compiled manually for some information e.g., <code>http://www.currency-iso.org/dl_iso_table_a1.xml</code>.</td></tr></tfoot>
<tbody>
<tr>
<th><a href="http://data.worldbank.org/developers/climate-data-api">World Bank Climate Change</a></th>
<td><code>http://climatedataapi.worldbank.org/climateweb/rest/v1/{country or basin}/{type}/{var}/{start}/{end}/{ISO3 or basinID}.xml</code> (basic requests)</td>
<td>XML</td>
<td>140666</td>
<td>983M</td>
</tr>
<tr>
<th><a href="https://finances.worldbank.org/">World Bank Finances</a></th>
<td><code>https://finances.worldbank.org/api/views/{id}/rows.xml</code></td>
<td>XML</td>
<td>32</td>
<td>365M</td>
</tr>
<tr>
<th><a href="http://www.worldbank.org/projects">World Bank Projects and Operations</a></th>
<td><code>http://search.worldbank.org/api/projects/all.xml</code></td>
<td>XML</td>
<td>1</td>
<td>171M</td>
</tr>
<tr>
<th><a href="http://data.worldbank.org/developers">World Development Indicators</a></th>
<td><code>http://api.worldbank.org/{lang}/countries/all/indicators/{id}?format=xml</code></td>
<td>XML</td>
<td>7090</td>
<td>18G</td>
</tr>
</tbody>
</table>
<p>The data is retrieved from the World Bank API endpoints at irregular intervals, i.e., several times a month. Retrieval is partly triggered by new dataset announcements in World Bank mailing lists. Although the data retrieval and transformation phases are conducted by independent scripts, the commitment to retrieve and store the data depends on whether a meaningful and useful transformation of the data to an RDF serialization is ready. Hence, Java and Bash scripts are executed manually to retrieve the data in order to:</p>
<ul>
<li>closely monitor abnormalities in the responses; and</li>
<li>account for necessary changes in the data transformations.</li>
</ul>
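<p>As a rough illustration of how one of the URL templates in Table 1 expands into a concrete request, the following Python sketch fills in the WDI template. The function name, the default language code, and the indicator code used in the example are assumptions made for illustration; the retrieval scripts described above were written in Java and Bash and are not reproduced here.</p>

```python
# Sketch: expand the WDI URL template from Table 1 for one indicator.
# The template string is copied from the table; the function name and the
# default language code are illustrative assumptions.
WDI_TEMPLATE = "http://api.worldbank.org/{lang}/countries/all/indicators/{id}?format=xml"


def wdi_request_url(indicator_id, lang="en"):
    """Return the request URL for one indicator in the XML output format."""
    return WDI_TEMPLATE.format(lang=lang, id=indicator_id)


if __name__ == "__main__":
    # "SP.POP.TOTL" is used here only as an example indicator code.
    print(wdi_request_url("SP.POP.TOTL"))
```

<p>Keeping the template as a single constant makes it easier to notice when the API changes its URL layout, which is one of the abnormalities the manual runs are meant to catch.</p>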
<h4 id="data-retrieval_eurostat">Eurostat</h4>
<p>TODO.</p>
<h4 id="data-retrieval_irish-census">Irish Census</h4>
<p>TODO. Table [<a href="#data-retrieval_irish-census_table">3</a>] provides retrieval information about the 2006 Irish Census datasets.</p>
<table id="data-retrieval_irish-census_table">
<caption><strong>Table 3.</strong> Irish Census (2006) data retrieval</caption>
<thead><tr><th>Datasets</th><th>URLs</th><th>Format</th><th>Number of requests</th><th>Data size</th></tr></thead>
<tfoot><tr><td colspan="5">These datasets are retrieved manually using the interactive application by clicking on the access URLs.</td></tr></tfoot>
<tbody>
<tr>
<th><a href="http://census.cso.ie/census/ReportFolders/ReportFolders.aspx">Census Interactive Tables for 2006</a></th>
<td><code>http://census.cso.ie/census/ReportFolders/ReportFolders.aspx</code></td>
<td>CSV</td>
<td>14</td>
<td>8M</td>
</tr>
</tbody>
</table>
<h3 id="data-review">Dataset review and decisions</h3>
<p>In this section, we cover some of the observed <em>abnormalities</em> in the original datasets, and the decisions which were made in order to later achieve reasonable RDF serializations. The information in this section is meant to illustrate some of the recurring challenges and is not an exhaustive list.</p>
<h4 id="data-review_world-bank">World Bank</h4>
<p>In order to arrive at a proper and useful Linked Data representation, some of the following problems were solved either with a script or by manual updates, and others were brought to the attention of the World Bank team for investigation.</p>
<p id="data-review_world-bank_missing-units"><em>Missing units</em>: The statistics of the World Development Indicators consist of various indicators in different measurement units. At the time of writing, these measurements in the source data are only provided as part of the string of the indicator name, as opposed to an explicit XML node [].</p>
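<p>One possible way to recover a unit from such an indicator name is to split off a trailing parenthesised group. The Python sketch below illustrates the idea under that assumption; it is not necessarily what was done for this dataset, and not every parenthesised suffix in an indicator name is a unit, so the result would still need review.</p>

```python
import re

# Matches a trailing parenthesised group such as "(current US$)".
_UNIT_PATTERN = re.compile(r"\(([^()]*)\)\s*$")


def split_indicator_name(name):
    """Return (label, unit); unit is None when there is no trailing group."""
    match = _UNIT_PATTERN.search(name)
    if match is None:
        return name.strip(), None
    label = name[:match.start()].strip()
    unit = match.group(1).strip()
    return label, unit


if __name__ == "__main__":
    print(split_indicator_name("GDP (current US$)"))
    print(split_indicator_name("Population, total"))
```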
<p id="data-review_world-bank_missing-values"><em>Missing values</em>: Some of the observations in the World Development Indicators dataset do not have measured data. The nodes for the values were given in the API response; however, they contained no numerical values []. Hence, in order to keep the RDF-ized version of the dataset lean, these observations were excluded in the data transformation phase.</p>
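<p>The exclusion rule can be sketched as a simple filter. In this Python illustration the observations are plain dictionaries with a hypothetical <code>value</code> key; the actual filtering was part of the XSLT transformation, and the sample records are invented for the example.</p>

```python
def has_measured_value(observation):
    """True when the observation carries a non-empty measured value."""
    value = observation.get("value")
    return value is not None and str(value).strip() != ""


def drop_empty_observations(observations):
    """Keep only the observations that have a measured value."""
    return [obs for obs in observations if has_measured_value(obs)]


if __name__ == "__main__":
    # Invented sample records; the keys are assumptions for illustration.
    sample = [
        {"country": "CA", "year": "2009", "value": "123.4"},
        {"country": "CA", "year": "2010", "value": ""},
        {"country": "CA", "year": "2011", "value": None},
    ]
    print(drop_empty_observations(sample))
```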
<p id="data-review_world-bank_aggregated-data"><em>Aggregated data</em>: While the World Bank API provides endpoints that compile aggregate data for the <em>WDI</em>, some of these calls were left out in the retrieval process.</p>
<p id="data-review_world-bank_most-recent-values"><em>Most recent values</em>: Most recent values were incorrectly introduced to non-date API calls [] in <em>WDI</em>. These observation nodes were excluded in the transformation phase since the data already contained observations with corresponding reference periods.</p>
<p id="data-review_world-bank_naming-patterns"><em>Naming patterns</em>: Different naming patterns were identified across World Bank datasets. Some of these are as follows:</p>
<p id="data-review_world-bank_region-names"><em>Region names</em>: Region names as used in <em>WDI</em> and <em>WBF</em> datasets differed []: although they essentially conveyed the same meaning, the labels did not match exactly. In order to have URIs for the region labels in the <em>WBF</em> observations, and to simplify the linking process, unique region names from the <em>WBF</em> observations were added as alternative labels to region resources in <em>WDI</em>. During the XSLT process, the alternative labels were matched with the labels in the observations themselves to arrive at their canonical representations.</p>
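<p>The label matching can be pictured as a lookup from normalized alternative labels to a canonical region URI. The Python sketch below is only an illustration: the labels and the region identifier are hypothetical, and the real matching was done in XSLT.</p>

```python
# Hypothetical alternative labels and region identifier, for illustration only.
ALT_LABELS = {
    "east asia and pacific": "example-region-a",
    "east asia pacific": "example-region-a",
}

# Concepts live under their code list namespace, here the region code list.
REGION_BASE = "http://worldbank.270a.info/classification/region/"


def normalize(label):
    """Lower-case a label and collapse runs of white-space."""
    return " ".join(label.lower().split())


def region_uri(label, alt_labels=ALT_LABELS):
    """Return the canonical region URI for a label, or None if unknown."""
    region_id = alt_labels.get(normalize(label))
    if region_id is None:
        return None
    return REGION_BASE + region_id


if __name__ == "__main__":
    print(region_uri("East  Asia Pacific"))
```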
<p id="data-review_world-bank_credit-loan"><em>Credit and Loan names</em>: Based on a private discussion with the WFI team, it was determined that the vocabulary terms <em>Credit Status</em> and <em>Loan Status</em>, as well as <em>Credit Number</em> and <em>Loan Number</em>, were used interchangeably. Thus, the canonical representation for the Linked Data URI pattern uses one term from each pair: <em>Loan Status</em> and <em>Loan Number</em>.</p>
<p id="data-review_world-bank_missing-countries"><em>Missing countries</em>: Some country codes were identified in the <em>WDI</em> observations that were not defined in the <em>WDI</em> country code list. These were later added to the original data [] [].</p>
<h3 id="data-modeling">Data modeling</h3>
<p>TODO</p>
<h4 id="vocabularies">Vocabularies</h4>
<p>TODO</p>
<h5 id="vocabularies_world-bank">World Bank</h5>
<p>Besides RDF, RDFS, XSD, OWL, the most common vocabularies in these datasets are: <a href="http://www.w3.org/TR/vocab-data-cube/">RDF Data Cube</a> to describe multi-dimensional statistical data, <a href="http://purl.org/linked-data/sdmx">SDMX</a> for the statistical information model, British reference periods (<a href="http://reference.data.gov.uk/doc/year">Year</a>, <a href="http://reference.data.gov.uk/id/gregorian-interval/">Gregorian Interval</a>), <a href="http://www.w3.org/2004/02/skos/core">SKOS</a> to describe the concepts in the observations, and <a href="http://purl.org/dc/terms/">DC Terms</a> for general purpose meta-data relations. Where appropriate, properties and classifications were created to represent World Bank Linked Data. The <em><a href="#uri-patterns">URI patterns</a></em> section gives a further breakdown of this.</p>
<p>Properties that happen to be semantically the same, yet syntactically different in the source data, were collapsed into a single namespace in order to have a canonical name across the datasets. This keeps the vocabulary slim and potentially easier to reuse.</p>
<p>In the case of country codes, <a href="http://www.iso.org/iso/country_codes/background_on_iso_3166/iso_3166-2.htm">ISO 3166-1 alpha-2</a> codes are used as the primary representation for countries. For example, the URI <code><a href="http://worldbank.270a.info/classification/country/CA">http://worldbank.270a.info/classification/country/CA</a></code> identifies the country Canada in the datasets. It contains a <code>skos:exactMatch</code> relation to its alpha-3 counterpart <code><a href="http://worldbank.270a.info/classification/country/CAN">http://worldbank.270a.info/classification/country/CAN</a></code> and vice-versa.</p>
<h4 id="uri-patterns">URI patterns</h4>
<p>TODO</p>
<h5 id="uri-patterns_world-bank">World Bank</h5>
<p>Given the statistical nature of the majority of the World Bank data, URI spaces are created for concepts, properties, observations, and datasets. New URIs for classifications and properties were created because the majority of the properties and concepts did not already exist in the wild, and in cases where they did, they did not fully correspond with the World Bank's. For instance, the country codes in the World Bank data are not only composed of concepts of countries, but also other geopolitical areas and income levels. While these concepts are distinguishable from one another given additional indicators, they still belong to the same code list. As a widely recommended practice, these World Bank resources were interlinked and mapped to external resources.</p>
<p>Names in the URIs are kept as close to the names as they occur in the original data as possible in order to minimize misinterpretations and to stay semantically close to the source. Terms are lower-cased, and delimited with <code>-</code> for consistent readability across all of the datasets.</p>
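<p>The naming convention can be sketched as a small helper that lower-cases a source name and joins its words with <code>-</code>. This Python illustration assumes that punctuation and white-space both act as word separators, which the text above does not specify; the function name is hypothetical.</p>

```python
import re


def uri_term(name):
    """Lower-case a source name and join its words with '-'."""
    # Assumption: any run of characters other than a-z and 0-9 separates words.
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "-".join(words)


if __name__ == "__main__":
    print(uri_term("Lending Type"))
    print(uri_term("Global Circulation Model"))
```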
<p>Observation URIs in WDI and WBCC datasets follow the same pattern. That is, the dimension values are used as the terms in the URI space and are delimited with a slash. Due to the wide range of observation types in WBF datasets, its URI pattern is simpler in comparison to WDI and WBCC, such that, each observation is instead given a unique numerical row identifier as found in the original data source.</p>
<p><em>Slash URIs</em> are used throughout the schema and data resources. The reason for this is to keep the URI patterns consistent and to make sure that all important resources, when dereferenced, return information that is particular to the resource. Since the content size of the responses for statistical data may be heavy, the <em>slash URIs</em> approach appeared to be preferable to <em>hash URIs</em>, as the latter would not allow distinct requests in the majority of the deployments on the Web. This is of course independent of accessing these resources via SPARQL endpoints.</p>
<p>The general URI space is as follows:</p>
<p id="uri-patterns_world-bank_classification"><em>Classifications</em> are composed of code lists for various concepts that are used in the World Bank datasets. The concepts are compiled by using the accompanying meta-data from the World Bank, and are of type <code>skos:Concept</code>. Each code list is of type <code>skos:CodeList</code> and has a URI pattern of <code>http://worldbank.270a.info/classification/{id}</code>, where id is one of: <a href="http://worldbank.270a.info/classification/country">country</a>, <a href="http://worldbank.270a.info/classification/income-level">income-level</a>, <a href="http://worldbank.270a.info/classification/indicator">indicator</a>, <a href="http://worldbank.270a.info/classification/lending-type">lending-type</a>, <a href="http://worldbank.270a.info/classification/region">region</a>, <a href="http://worldbank.270a.info/classification/source">source</a>, <a href="http://worldbank.270a.info/classification/topic">topic</a>, <a href="http://worldbank.270a.info/classification/project">project</a>, <a href="http://worldbank.270a.info/classification/currency">currency</a>, <a href="http://worldbank.270a.info/classification/loan-type">loan-type</a>, <a href="http://worldbank.270a.info/classification/loan-status">loan-status</a>, <a href="http://worldbank.270a.info/classification/variable">variable</a>, <a href="http://worldbank.270a.info/classification/global-circulation-model">global-circulation-model</a>, <a href="http://worldbank.270a.info/classification/scenario">scenario</a>, <a href="http://worldbank.270a.info/classification/basin">basin</a>. Each concept is defined under the code list namespace hierarchy e.g., <code>http://worldbank.270a.info/classification/country/CA</code> is the concept for country <em>Canada</em>.</p>
<p id="uri-patterns_world-bank_properties"><em>Properties</em> have the URI pattern <code>http://worldbank.270a.info/property/{id}</code>.</p>
<p id="uri-patterns_world-bank_datasets"><em>Data Cube datasets</em> use the URI pattern <code>http://worldbank.270a.info/dataset/{id}</code>, where id is one of: <a href="http://worldbank.270a.info/dataset/world-development-indicators">world-development-indicators</a>, <a href="http://worldbank.270a.info/dataset/world-bank-finances">world-bank-finances</a>, <a href="http://worldbank.270a.info/dataset/world-bank-climates">world-bank-climates</a>.</p>
<p id="uri-patterns_world-bank_named-graphs"><em>Named graphs in RDF store</em> are placed in <code>http://worldbank.270a.info/graph/{id}</code>, where id is one of: <code>meta</code>, <code>world-development-indicators</code>, <code>world-bank-finances</code>, <code>world-bank-climates</code>, <code>world-bank-projects-and-operations</code>.</p>
<p id="uri-patterns_world-bank_world-development-indicators"><em>World Development Indicators</em> observations are within <code>http://worldbank.270a.info/dataset/world-development-indicators/{id}/{country}/{year}</code>, where id is an <a href="http://worldbank.270a.info/classification/indicator">indicator code</a>, country is a <a href="http://worldbank.270a.info/classification/country">country code</a>, and year is in YYYY format.</p>
<p id="uri-patterns_world-bank_world-bank-finances"><em>World Bank Finances</em> observations are within <code>http://worldbank.270a.info/dataset/world-bank-finances/{id}/{rowid}</code>, where id is a <a href="http://worldbank.270a.info/dataset/world-bank-finances">financial dataset code</a>, and rowid is a positive integer.</p>
<p id="uri-patterns_world-bank_world-bank-climates"><em>World Bank Climate Change</em> observations are within <code>http://worldbank.270a.info/dataset/world-bank-climates/{id}/{various patterns separated by slash}</code>, where id is a <a href="http://worldbank.270a.info/dataset/world-bank-climates">climate change dataset code</a>.</p>
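<p>To make the observation patterns concrete, the following Python sketch builds WDI and WBF observation URIs from their parts. The function names and the example codes are assumptions for illustration; only the URI templates themselves come from the patterns listed above.</p>

```python
DATASET_BASE = "http://worldbank.270a.info/dataset"


def wdi_observation_uri(indicator, country, year):
    """Build a WDI observation URI from its dimension values."""
    return "{}/world-development-indicators/{}/{}/{:04d}".format(
        DATASET_BASE, indicator, country, int(year)
    )


def wbf_observation_uri(dataset_code, row_id):
    """Build a WBF observation URI from a dataset code and a row identifier."""
    row = int(row_id)
    if row > 0:
        return "{}/world-bank-finances/{}/{}".format(DATASET_BASE, dataset_code, row)
    raise ValueError("rowid must be a positive integer")


if __name__ == "__main__":
    # Example values only; "example-code" is a hypothetical dataset code.
    print(wdi_observation_uri("SP.POP.TOTL", "CA", 2010))
    print(wbf_observation_uri("example-code", 42))
```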
<h4 id="blank-nodes">Blank nodes</h4>
<p>TODO</p>
<h5 id="blank-nodes_world-bank">World Bank</h5>
<p>By and large, the datasets do not contain blank nodes (bnodes), with the exception of unavoidable ones in the Projects and Operations code list. Given the (beta) state of the <em>WBPO</em> API, the decision at this time was not to create arbitrary URIs, since maintaining resolvable URIs has a cost.</p>
<h5 id="blank-nodes_eurostat">Eurostat</h5>
<p>TODO</p>
<h5 id="blank-nodes_irish-census">Irish Census</h5>
<p>TODO</p>
<h4 id="data-modeling_normalization">Normalization</h4>
<p>Data was only altered by removing white-space at the start and end of text content. Some of the dates in the data were transformed into equivalent representations in ISO 8601 format.</p>
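<p>Both normalization steps are small enough to sketch in a few lines of Python. The source date layout used below (<code>%m/%d/%Y</code>) is an assumption chosen for the example, since the original date formats are not listed here; the function names are likewise hypothetical.</p>

```python
from datetime import datetime


def normalize_text(value):
    """Remove white-space at the start and end of text content."""
    return value.strip()


def to_iso_date(value, source_format="%m/%d/%Y"):
    """Convert a date string to its ISO 8601 calendar date form."""
    # Assumption: source_format describes how the date appears in the source.
    parsed = datetime.strptime(value.strip(), source_format)
    return parsed.date().isoformat()


if __name__ == "__main__":
    print(normalize_text("  Canada \n"))
    print(to_iso_date(" 06/30/2012 "))
```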
<h4 id="data-interlinking">Data interlinking</h4>
<!--
What can you link on (codelists probably.. but not observations.. geography is the obvious connection points)
Reused some.. and mapped to existing stuff
Highlight some of the interesting questions that came up along the way
-->
<p>The dataset is interlinked (~380 links) with <a href="http://dbpedia.org/">DBpedia</a> for countries and currencies, and (~216 links) with <a href="http://eurostat.linked-statistics.org/">Eurostat</a> for countries using <a href="http://aksw.org/Projects/limes">LInk discovery framework for MEtric Spaces</a> (LIMES). With respect to some of the concepts for code lists, they were manually matched with corresponding <code>skos:exactMatch</code> or <code>skos:closeMatch</code> links to DBpedia.</p>
<p>Additional interlinking was done by adding links to resources with corresponding <code>foaf:homepage</code>s on the World Bank site, as well as links to referenced documents.</p>
<h5 id="data-enrichment">Data enrichment</h5>
<p>A code list for currencies was created based on <a href="http://www.currency-iso.org/dl_iso_table_a1.xml">currency and funds code list</a> to represent the SDMX attributes for the amount measurements in the World Bank Finances datasets. They were also linked to each country which officially uses that currency.</p>
<p>Given that some of the codes in the World Bank country code list are not considered to be countries e.g., <code>1W</code> representing <em>World</em>, only the resources that represent a real country are additionally given the <code>rdf:type</code> <code>dbo:Country</code>.</p>
<h4 id="data-provenance">Data provenance</h4>
<p>TODO</p>
<h5 id="data-provenance_world-bank">World Bank</h5>
<p>As part of data enrichment, triples pertaining to provenance were added in order to partially provide meta-data for resources like code lists and datasets. They particularly address the following information:</p>
<ul>
<li><em>Defining source</em> using <code>rdfs:isDefinedBy</code></li>
<li><em>License</em> using <code>dcterms:license</code></li>
<li><em>Source location</em> using <code>dcterms:source</code></li>
<li><em>Related resource</em> using <code>dcterms:hasPart</code> and <code>dcterms:isPartOf</code></li>
<li><em>Creator of the data</em> using <code>dcterms:creator</code></li>
<li><em>Publisher of the data</em> using <code>dcterms:publisher</code></li>
<li><em>Creation date</em> using <code>dcterms:created</code></li>
<li><em>Issued date</em> using <code>dcterms:issued</code></li>
<li><em>Modified date</em> using <code>dcterms:modified</code></li>
</ul>
<h4 id="data-structure-definition">Data structure definitions</h4>
<p>TODO</p>
<h5 id="data-structure-definition_world-bank">World Bank</h5>
<p>TODO</p>
<h5 id="data-structure-definition_eurostat">Eurostat</h5>
<p>TODO</p>
<h5 id="data-structure-definition_irish-census">Irish Census</h5>
<p>TODO</p>
<h3 id="data-conversion">Data conversion</h3>
<p>TODO</p>
<h4 id="data-conversion_world-bank">World Bank</h4>
<p>XSLT 2.0 transformations are applied on the source XML files to arrive at the target RDF/XML serialization. Saxon's command-line XSLT and XQuery Processor tool was used for the transformations, and employed as part of Bash scripts to iterate through all the files in the datasets. The conversion step from the command-line was preferred over Java's SAXTransformerFactory as it was significantly faster in preliminary tests.</p>
<p>In order to import this data into the RDF store efficiently, the <a href="http://librdf.org/raptor/rapper.html">rapper</a> RDF parser utility was used to first re-serialize each RDF/XML file as N-Triples; the output was appended to a single file at run-time before importing.</p>
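<p>The two conversion steps can be sketched as command invocations. The original pipeline used Bash scripts; the Python illustration below only assembles equivalent command lines for Saxon and rapper and, optionally, runs rapper with its output appended to a single file. The jar file name, the file paths, and the function names are assumptions.</p>

```python
import subprocess


def saxon_command(source_xml, stylesheet, output_rdf, saxon_jar="saxon9he.jar"):
    """Command line for an XSLT 2.0 transformation with Saxon."""
    # Assumption: saxon_jar points at a local Saxon-HE jar file.
    return [
        "java", "-jar", saxon_jar,
        "-s:" + source_xml,
        "-xsl:" + stylesheet,
        "-o:" + output_rdf,
    ]


def rapper_command(rdfxml_file):
    """Command line to re-serialize an RDF/XML file as N-Triples."""
    return ["rapper", "-i", "rdfxml", "-o", "ntriples", rdfxml_file]


def append_as_ntriples(rdfxml_file, target_path):
    """Run rapper and append its N-Triples output to one target file."""
    with open(target_path, "ab") as target:
        subprocess.run(rapper_command(rdfxml_file), stdout=target, check=True)


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, only printed.
    print(" ".join(saxon_command("indicator.xml", "wdi.xsl", "indicator.rdf")))
    print(" ".join(rapper_command("indicator.rdf")))
```

<p>Appending everything to one N-Triples file works because N-Triples is line-based, so concatenated outputs remain a valid document.</p>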
<h4 id="data-conversion_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="data-conversion_irish-census">Irish Census</h4>
<p>TODO</p>
<h3 id="linked-datasets">Linked datasets</h3>
<h4 id="linked-datasets_world-bank">World Bank</h4>
<p>There is a <a href="http://worldbank.270a.info/.well-known/void">VoID</a> file which contains metadata for the datasets. It includes, among other things: locations of the RDF data dumps, named graphs that are used in the SPARQL endpoint, vocabularies used, and dataset sizes. Statistics for the VoID file are generated using <a href="http://aksw.org/projects/LODStats">LODStats</a>. The <a href="http://worldbank.270a.info/data/">data dumps</a> are available either as individual RDF/XML files or in compressed gzip format.</p>
<table id="linked-dataset_world-bank_table">
<caption>World Bank Linked Data</caption>
<thead><tr><th></th><th>Format</th><th>Number of triples</th><th>Size</th></tr></thead>
<tfoot><tr><td colspan="4">The size of the dataset is in rounded number of triples. See VoID file for exact numbers.</td></tr></tfoot>
<tbody>
<tr><th>World Bank Climate Change</th><td>RDF/XML</td><td>78 million</td><td>10G</td></tr>
<tr><th>World Bank Finances</th><td>RDF/XML</td><td>7 million</td><td>827M</td></tr>
<tr><th>World Bank Projects and Operations</th><td>RDF/XML</td><td>1 million</td><td>93M</td></tr>
<tr><th>World Development Indicators</th><td>RDF/XML</td><td>79 million</td><td>8.4G</td></tr>
</tbody>
</table>
<h4 id="linked-datasets_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="linked-datasets_irish-census">Irish Census</h4>
<p>TODO</p>
<table id="linked-dataset_irish-census_table">
<caption>Irish Census Linked Data</caption>
<thead><tr><th></th><th>Format</th><th>Number of triples</th><th>Size</th></tr></thead>
<tfoot><tr><td colspan="4">The size of the dataset is in rounded number of triples. See VoID file for exact numbers.</td></tr></tfoot>
<tbody>
<tr><th>2006 Irish Census</th><td>Turtle</td><td>12 million</td><td>776M</td></tr>
</tbody>
</table>
<h3 id="data-license">Data license</h3>
<h4 id="data-license_world-bank">World Bank</h4>
<p>In addition to adhering to <a href="http://go.worldbank.org/OJC02YMLA0">World Bank's terms of use</a>, the RDF data is licensed under <a href="http://creativecommons.org/publicdomain/zero/1.0/">CC0 1.0 Universal (CC0 1.0) Public Domain Dedication</a>.</p>
<h4 id="data-license_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="data-license_irish-census">Irish Census</h4>
<p>TODO</p>
<h3 id="publication">Publication</h3>
<p>TODO</p>
<h4 id="publication_world-bank">World Bank</h4>
<p>The HTML pages are generated and published by the <a href="https://github.com/csarven/linked-data-pages">Linked Data Pages</a> framework, where <a href="http://code.google.com/p/moriarty/">Moriarty</a>, <a href="http://code.google.com/p/paget/">Paget</a>, and <a href="https://github.com/semsol/arc2">ARC2</a> do the heavy lifting. Linked Data Pages is used to invoke unique SPARQL queries based on the requested URI. The results are output in corresponding HTML templates. Links to alternate RDF formats as well as JSON are handled by content negotiation. Given the nature of the invoked SPARQL query, alternate formats may contain additional triples, like labels for the vocabulary terms, that are not in the RDF dumps. This minor difference is mentioned for the users on the site.</p>
<h4 id="publication_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="publication_irish-census">Irish Census</h4>
<p>The publication approach for the 2006 Irish Census Linked Data is the same as the World Bank's, using Linked Data Pages. For some resources, e.g., cities, <a href="https://google-developers.appspot.com/chart/">Google Charts Tools</a> is used to create various visualizations in place of the statistical tabular data.</p>
<h3 id="sparql-endpoint">SPARQL Endpoint</h3>
<p>TODO</p>
<h4 id="sparql-endpoint_world-bank">World Bank</h4>
<p>Apache Jena’s <a href="http://incubator.apache.org/jena/documentation/tdb/">TDB</a> storage system and <a href="http://incubator.apache.org/jena/documentation/serving_data/index.html">Fuseki</a> are used to run the SPARQL server. A public <a href="http://worldbank.270a.info/sparql">SPARQL endpoint</a> is available and accepts SPARQL 1.1 queries. The endpoint allows access to the full schema and datasets, and uses named graphs.</p>
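<p>To illustrate how the endpoint might be queried, here is a minimal Python sketch that encodes a query as a SPARQL 1.1 Protocol GET request. The query itself (listing a few named graphs), the helper names, and the request for JSON results are assumptions for illustration and not part of the published setup.</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://worldbank.270a.info/sparql"  # endpoint named in the text

# Illustrative query: list up to ten named graphs in the store.
QUERY = "SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Encode a query as a SPARQL 1.1 Protocol GET request (not yet sent)."""
    url = endpoint + "?" + urlencode({"query": query})
    # Asking for JSON results is an assumption about what the server offers.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(endpoint, query, timeout=30):
    """Send the request and return the result bindings as a list of dicts."""
    with urlopen(build_sparql_request(endpoint, query), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

<p>Calling <code>run_select(ENDPOINT, QUERY)</code> would perform the network request. It is kept separate from the request construction so that the encoding step can be inspected without contacting the server.</p>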
<h4 id="sparql-endpoint_eurostat">Eurostat</h4>
<p>Same setup as World Bank, with its own <a href="http://eurostat.linked-statistics.org/sparql">SPARQL endpoint</a>. This endpoint includes only the schema and excludes the data (~533GB RDF/XML) because of limited available resources and for performance reasons.</p>
<h4 id="sparql-endpoint_irish-census">Irish Census</h4>
<p>Same setup as World Bank with its own <a href="http://data-gov.ie/sparql">SPARQL endpoint</a>. The endpoint allows access to the full schema and datasets, and uses named graphs.</p>
<h3 id="data-dumps">Data dumps</h3>
<p>TODO</p>
<h4 id="data-dumps_world-bank">World Bank</h4>
<p>The RDF <a href="http://worldbank.270a.info/data/">data dumps</a> can be retrieved in several ways: they are available as <em>Gzip</em> archives of the schema and datasets, along with individual RDF/XML files. For automatic processing with tools like <a href="http://github.com/csarven/graphpusher">GraphPusher</a>, the location of the data dumps can be discovered via <a href="http://worldbank.270a.info/.well-known/void">World Bank's VoID</a>.</p>
<h4 id="data-dumps_eurostat">Eurostat</h4>
<p>TODO</p>
<h4 id="data-dumps_irish-census">Irish Census</h4>
<p>TODO</p>
<h3 id="dataset-advertisement">Advertising the dataset</h3>
<p>TODO</p>
<h4 id="dataset-advertisement_world-bank">World Bank</h4>
<p>The dataset is registered in the Data Hub with ID: <a href="http://thedatahub.org/dataset/world-bank-linked-data">world-bank-linked-data</a>. The dataset <a href="http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=world-bank-linked-data">has level 3</a> in the CKAN Validator. It is a candidate for the lodcloud group.</p>
<h4 id="dataset-advertisement_eurostat">Eurostat</h4>
<p>The dataset is registered in the Data Hub with ID: <a href="http://thedatahub.org/dataset/eurostat-linked-data">eurostat-linked-data</a>. The dataset <a href="http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=eurostat-linked-data">has level 1</a> in the CKAN Validator. It is a candidate for the lodcloud group. It is also available in the LATC project's <a href="http://latc-project.eu/datasets/">EU data cloud</a>.</p>
<h4 id="dataset-advertisement_irish-census">Irish Census</h4>
<p>The dataset is registered in the Data Hub with ID: <a href="http://thedatahub.org/dataset/data-gov-ie">data-gov-ie</a>. The dataset <a href="http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=data-gov-ie">has level 4</a> in the CKAN Validator. It is in the lodcloud group.</p>
<h3 id="code">About the code</h3>
<h4 id="code_world-bank">World Bank</h4>
<p>The code that retrieves the World Bank data, transforms it to RDF serializations, and imports it into the TDB triple store can be found at <a href="http://github.com/csarven/worldbank-linkeddata">GitHub</a>. It is released under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License 2.0</a>.</p>
<h4 id="code_eurostat">Eurostat</h4>
<p>The code which transforms the source data to RDF serializations can be found at <a href="https://github.com/LATC/EU-data-cloud">GitHub</a>.</p>
<h4 id="code_irish-census">Irish Census</h4>
<p>The code which transforms the source data to RDF serializations can be found at <a href="http://github.com/data-gov-ie/cso2rdf">GitHub</a>.</p>
<!--
XXX: If it makes sense, we might *briefly* talk about applications or "interesting" queries e.g., https://docs.google.com/spreadsheet/ccc?key=0AnqqglUUJZt8dF9kdC1OeS10aTg2bWhoME10VmZlZXc
<h2 id="application">Application</h2>
<p>The application for the WBLD is viewed in the form of chart visualizations. A custom API is built to pull the necessary data out of the application. The parameters for the API are:</p>
<ul>
<li><code>indicator</code>, which accepts a single indicator code (<code>skos:notation</code> of the indicator URI)</li>
<li><code>country</code>, which accepts multiple country codes (<code>skos:notation</code> of the country URI)</li>
<li><code>year</code>, which accepts a year in YYYY format</li>
</ul>
<p>The <code>indicator</code> parameter is required, as one of the dimensions in the observation needs to be known. The other required dimension is either <code>country</code> or <code>year</code>.</p>
<p>Two API calls are made because of the modular design: the first call gets the metadata about the indicator, whereas the second call collects either all of the observations for the countries with that indicator, or all of the observations for a given reference period with that indicator. The response data from the API is requested in JSON format so that it can be passed on to the JavaScript library which handles the visualizations.</p>
<h3 id="chart-visualization">Chart visualizations</h3>
<p>The <a href="http://worldbank.270a.info/view">Tools</a> section on the site uses <a href="https://google-developers.appspot.com/chart/">Google Charts Tools</a> to create the visualizations.</p>
<h4 id="visualization_world-development-indicators">Visualizing World Development Indicators</h4>
<p>Depending on the user selections, and the corresponding API call, two possible charts are generated:</p>
<h5 id="visualization_world-development-indicators_motion-chart">Motion chart</h5>
<p>It consists of three different views: a bubble chart, a bar chart, and a line chart. This chart is intended for observation values in countries over a time period for an indicator. A unique colour is assigned to each country so that countries are easy to tell apart. The reference period runs on the x-axis, whereas the measured values run on the y-axis.</p>
<h5 id="visualization_world-development-indicators_geo-chart">Geo chart</h5>
<p>It consists of a world map view where countries are separated by their official borders. This chart is used to view observation values for a time period for all the countries in the world. The legend is a colour spectrum running from the lowest to the highest measured values. The corresponding colours are assigned to each country on the map.</p>
-->
<h2 id="related-work">Related work</h2>
<p>TODO</p>
<h2 id="conclusions">Conclusions</h2>
<p>TODO</p>
<div id="acknowledgements">
<h2>Acknowledgements</h2>
<p>TODO</p>
</div>
<div id="references">
<h2>References</h2>
<p>TODO</p>
<ol about="[this:]">
<li id="r_1">Franklin, M., Halevy, A., Maier, D.: <em>From databases to dataspaces: a new abstraction for information management</em>, SIGMOD Record 34(4), 27–33 (2005), <a rel="dcterms:references" href="http://dl.acm.org/citation.cfm?id=1107502">http://dl.acm.org/citation.cfm?id=1107502</a></li>
<!--
<li><a href="https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/sFLaE9iumms">https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/sFLaE9iumms</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/DL4PlDLcLU0">https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/DL4PlDLcLU0</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/SDcl8KkhTXs/VsDdRUezGqEJ">https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/SDcl8KkhTXs/VsDdRUezGqEJ</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/JpAhRiOusNk">https://groups.google.com/forum/?fromgroups#!topic/world-bank-api/JpAhRiOusNk</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/ZXAahsuREgM/vUWVhxxtLSQJ">https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/ZXAahsuREgM/vUWVhxxtLSQJ</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/ZXAahsuREgM/VhMMkUIxXsMJ">https://groups.google.com/forum/?fromgroups#!msg/world-bank-api/ZXAahsuREgM/VhMMkUIxXsMJ</a></li>
-->
</ol>
</div>
</div>
</div>
</body>
</html>