Commit c2eb19c

Minor edits before posting.

fperez committed Apr 20, 2013
1 parent 8963d87
Showing 1 changed file with 13 additions and 16 deletions: 130418-Data-driven journalism.ipynb

"source": [
"The computing community has for decades known about the [\"literate programming\"](http://en.wikipedia.org/wiki/Literate_programming) paradigm introduced by [Don Knuth](http://www-cs-faculty.stanford.edu/~uno/lp.html) in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either an actual code that can be compiled/executed by the computer, or a narrative document that *explains* the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, less programming errors, more meaningful descriptions than mere comments embedded in the code, etc).\n",
"\n",
"I don't take any issue with this approach per se, but I don't personally use it because it's not very suited to the kinds of workflows that I need in practice, that require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code that will be run. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to the creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.\n",
"I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.\n",
"\n",
"As [Hamming](http://en.wikipedia.org/wiki/Richard_Hamming) famously said in 1962, *\"The purpose of computing is insight, not numbers.\"*. IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to *capture* this process, and this is where we join back with the discussion on literate programming (LP): while LP focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the *act of computing* occupies the center stage. \n",
"As [Hamming](http://en.wikipedia.org/wiki/Richard_Hamming) famously said in 1962, *\"The purpose of computing is insight, not numbers.\"*. IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to *capture* this process, and this is where we join back with the discussion above: while LP focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the *act of computing* occupies the center stage. \n",
" \n",
"From this perspective, we therefore refer to the worfklow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others), as \"literate computing\": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for \"traditional\" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.\n",
" \n",
Expand All @@ -61,11 +61,11 @@
"source": [
"In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the center pieces of this debate was a paper form Harvard economists C. Reinhart and K. Rogoff, later turned into a [best-selling book](http://www.reinhartandrogoff.com), that argued that beyond 90% debt ratios, economic growth would plummet precipitously.\n",
"\n",
"This argument was used (amongst others) by many politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years, and it remains the subject of much debate. On April 15, a team of researchers from U. Massachusetts [published a re-analysis](http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566) of the original data where they showed how Rienhart and Rogoff had made both fairly obvious coding errors in their orignal Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) finally replicated the Reinhart and Rogoff analysis and published all their scripts in R so that others could inspect their calculations.\n",
"This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts [published a re-analysis](http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566) of the original data where they showed how Rienhart and Rogoff had made both fairly obvious coding errors in their orignal Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.\n",
"\n",
"Two posts from [the Economist](http://www.economist.com/news/finance-and-economics/21576362-seminal-analysis-relationship-between-debt-and-growth-comes-under) and [the Roosevelt Institute](http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems) nicely summarize the story with a more informed policy and economics discussion than I can make. And James Kwak has a [series of posts](http://baselinescenario.com/2013/04/19/fatal-sensitivity) that dive into [technical detail](http://baselinescenario.com/2013/04/18/are-reinhart-and-rogoff-right-anyway/) and question the horrible choice of [using Excel](http://baselinescenario.com/2013/04/18/more-bad-excel/), a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote [an excellent new post](http://blog.stodden.net/2013/04/19/what-the-reinhart-rogoff-debacle-really-shows-verifying-empirical-results-needs-to-be-routine/) on the broader issues of reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.\n",
"Two posts from [the Economist](http://www.economist.com/news/finance-and-economics/21576362-seminal-analysis-relationship-between-debt-and-growth-comes-under) and [the Roosevelt Institute](http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems) nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a [series of posts](http://baselinescenario.com/2013/04/19/fatal-sensitivity) that dive into [technical detail](http://baselinescenario.com/2013/04/18/are-reinhart-and-rogoff-right-anyway/) and question the horrible choice of [using Excel](http://baselinescenario.com/2013/04/18/more-bad-excel/), a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote [an excellent new post](http://blog.stodden.net/2013/04/19/what-the-reinhart-rogoff-debacle-really-shows-verifying-empirical-results-needs-to-be-routine/) with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.\n",
"\n",
"As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, \"all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel.\" To that I would add the obvious: this should *never* have happened in the first place, as we should have been able to inspect that code and data from the start.\n",
"As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, \"all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel.\" To that I would add the obvious: this should *never* have happened in the first place, as we should have been able to inspect that code and data from the start.\n",
"\n",
"Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:"
]
Expand All @@ -82,7 +82,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than *three hours* for [Vincent Arel-Bundock](http://www-personal.umich.edu/~varel), a PhD Student in Political Science at U. Michigan, to come through with a solution!"
"It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than *three hours* for [Vincent Arel-Bundock](http://www-personal.umich.edu/~varel), a PhD Student in Political Science at U. Michigan, to come through with a solution:"
]
},
{
Expand Down Expand Up @@ -112,9 +112,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we have a [full IPython notebook](http://nbviewer.ipython.org/urls/raw.github.com/vincentarelbundock/Reinhart-Rogoff/master/reinhart-rogoff.ipynb) that anyone can contribute to. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.\n",
"\n",
"I can only hope that with our work, we'll help those interested in this kind of quantitatively-grounded discussion work and collaborate more fluidly."
"So now we have a [full IPython notebook](http://nbviewer.ipython.org/urls/raw.github.com/vincentarelbundock/Reinhart-Rogoff/master/reinhart-rogoff.ipynb), kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way."
]
"The post links to a gorgeous, animated infographic that summarizes the results that [NASA's Kepler spacecraft](http://kepler.nasa.gov) has obtained so far, and which accompanies a full [article](http://www.nytimes.com/2013/04/19/science/space/2-new-planets-are-most-earth-like-yet-scientists-say.html?pagewanted=all) at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a mere 1200 light-years hop from us.\n",
"The post links to a gorgeous, [animated infographic](http://www.nytimes.com/interactive/science/space/keplers-tally-of-planets.html?smid=tw-share) that summarizes the results that [NASA's Kepler spacecraft](http://kepler.nasa.gov) has obtained so far, and which accompanies a full [article](http://www.nytimes.com/2013/04/19/science/space/2-new-planets-are-most-earth-like-yet-scientists-say.html?pagewanted=all) at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.\n",
"\n",
"Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the `--script` flag would give him a `.py` file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:\n",
"\n",
Expand Down Expand Up @@ -173,18 +171,17 @@
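(The snippet itself lies outside this hunk; what follows is my own rough sketch of the idea, assuming the nbformat-3 JSON layout of that era, in which a notebook holds a `worksheets` list and each code cell keeps its source lines under `input`.)

```python
import json

def run_notebook(path):
    """Execute a notebook's code cells in order, as if it were a script."""
    with open(path) as f:
        nb = json.load(f)
    ns = {}  # one shared namespace, playing the role of a script's globals
    for worksheet in nb["worksheets"]:        # nbformat-3 layout (assumed)
        for cell in worksheet["cells"]:
            if cell["cell_type"] == "code":
                exec("".join(cell["input"]), ns)  # "input": list of source lines
    return ns

# Hypothetical usage:
# results = run_notebook("analysis.ipynb")
```
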
"cell_type": "markdown",
"metadata": {},
"source": [
"Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks in contexts that have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.\n",
"Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.\n",
"\n",
"Coincidentally, UC Berkeley will be hosting on May 4 a [symposium on data and journalism](http://multimedia.journalism.berkeley.edu/blog/2013/mar/20/our-first-ever-data-journalism-symposium), and in recent days I've had very productive interactions with folks in this space on campus. [Cathryn Carson](http://history.berkeley.edu/people/cathryn-carson) currently directs the newly formed [D-Lab](http://dlab.berkeley.edu), whose focus is precisely the use of quantitative and datamethods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in [Raymond Yee's](http://www.ischool.berkeley.edu/people/faculty/raymondyee) course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here."
]
},
**Note:** as usual, this post is available as an [IPython notebook](http://nbviewer.ipython.org/urls/raw.github.com/fperez/blog/master/130418-Data-driven%20journalism.ipynb) in my [blog repo](https://github.com/fperez/blog).
