diff --git a/old/090907-cse09-siam-news.rst b/old/090907-cse09-siam-news.rst new file mode 100644 index 0000000..0465799 --- /dev/null +++ b/old/090907-cse09-siam-news.rst @@ -0,0 +1,44 @@ +============================= + CSE'09 article in SIAM News +============================= + + +This isn't fresh-off-the-oven news, but it's still on the main SIAM News page, +so I'll mention it as I am trying to do a better job of keeping up with this +blog... + +As I mentioned on a `previous post`_, our three_ part_ minisymposium_ on Python +at the SIAM `CSE09 conference`_ was fairly well received. But even better, +SIAM asked us to write up an article for publication in the `SIAM News +bulletin`_. I was really pleased with this, as SIAM News reaches a very broad +international audience and is actually read by people. I often find very +interesting material in it, as the publication hits a very good balance of +quality and not being overly specialized (as we all drown in work, we tend to +focus only on very narrow publication lists for everyday reading, +unfortunately). + +Randy, Hans-Petter and I drafted up an article, and we received great feedback +from all of the presenters at the minisymposium, including figure contributions +by `John Hunter`_ and `Hank Childs`_ (who was at LLNL at the time, and I'm +delighted to see is just up the hill from me at LBL). + +Our article is now available in HTML `online at the SIAM site`_, and I also +have a `PDF version`_ that is more suitable for printing in case you are +interested (note that my version is missing the very last corrections from the +SIAM editor, but the differences should be minor). + +.. Links + +.. _previous post: http://fdoperez.blogspot.com/2009/03/python-at-siam-cse09-meeting.html +.. _three: http://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=8044 +.. _part: http://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=8045 +.. 
_minisymposium: http://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=8046 +.. _CSE09 conference: http://www.siam.org/meetings/cse09/ + +.. _SIAM News bulletin: http://www.siam.org/news/news.php +.. _John Hunter: http://matplotlib.sf.net +.. _Hank Childs: http://www.lbl.gov/cs/CSnews/CSnews073109c.html + +.. _online at the SIAM site: http://www.siam.org/news/news.php?id=1595 +.. _PDF version: https://cirl.berkeley.edu/fperez/papers/siam_news_py4science.pdf + diff --git a/old/091105-guido-visit.rst b/old/091105-guido-visit.rst new file mode 100644 index 0000000..e1bf232 --- /dev/null +++ b/old/091105-guido-visit.rst @@ -0,0 +1,141 @@ +============================================== + Guido van Rossum at UC Berkeley's Py4Science +============================================== + +**Update:** Quick links + + * Video_ of the entire session. The talks cover the first 50 minutes, and + Guido's part starts at the 54 minute mark. + * A `page with all the slides`_ on my site. + * Guido also blogged_ his impressions. + * Blog posts by `Jarrod Millman`_ and `Matthew Brett`_. + +.. _page with all the slides: http://fperez.org/py4science/2009_guido_ucb +.. _blogged: http://neopythonic.blogspot.com/2009/11/python-in-scientific-world.html +.. _Jarrod Millman: http://jarrodmillman.blogspot.com/2009/11/visit-from-guido-van-rossum.html +.. _Matthew Brett: http://nipyworld.blogspot.com/2009/11/guido-van-rossum-talks-about-python-3.html + +On November 4 2009, we had a special session of our informal Py4Science_ +seminar where Guido van Rossum visited for an open discussion regarding the +uses of the Python language in scientific research. Guido had expressed his +interest in discussing the work that various scientists do with Python but +mentioned that instead of a formal talk, he would prefer a format that allowed +for more interaction with the audience. 
We agreed that a good plan would be +for us to present a rapid-fire sequence of very short talks highlighting +multiple projects so that he could get a good "high altitude" view of the +scientific python landscape, leaving then ample time for discussions with the +audience. + +.. _Py4Science: https://cirl.berkeley.edu/view/Py4Science + +Guido has already posted `his impressions`_ of the visit, and so have my +colleagues `Jarrod Millman`_ and `Matthew Brett`_, so I'll try to provide a +complementary view here without too much repetition. + +.. _his impressions: http://neopythonic.blogspot.com/2009/11/python-in-scientific-world.html +.. _Jarrod Millman: http://jarrodmillman.blogspot.com/2009/11/visit-from-guido-van-rossum.html +.. _Matthew Brett: http://nipyworld.blogspot.com/2009/11/guido-van-rossum-talks-about-python-3.html + +We gathered for lunch first with a small group and had a very interesting +discussion on various topics; we had a chance to talk in some detail about the +transition to Python 3 for Numpy, something a number of people have started to +think about seriously. Numpy is pretty much a 'root dependency' for so many +scientific projects, that until it makes the jump it will be very difficult for +anyone else in science to seriously consider Python 3. Understandably, Guido +would like to see some movement from our community in this direction, and he +offered useful guidance. In particular, he said that in the core Python-dev +team there might be enough interest that if we ask for help there, we might +find developers willing to pitch in and provide some assistance. He also +expressed some disappointment that `PEP 3118`_, which was accepted with our +interests in mind, still hadn't been fully implemented. Limited manpower is +the simple reason for this situation, but fortunately Jarrod mentioned that +there's a good plan to address this in the near future. 
+ +I had a chance to quiz Guido about something I've wondered for a while: Python +has unusually good number types in its core (arbitrary length integers, +extended precision decimals and complex numbers), but the integers divide +either into the integers (the floor-division behavior of Python 2.x) or into the +floats (in 3.x). While the 3.x division is an improvement, I would have really +liked to see Python go to native rationals; for one thing, this would make the +Sage 'language' (by which I mean the extensions Sage makes to pure Python) +behave like Python in algorithms involving integers, eliminating a recurring +source of confusion between the two. I also happen to think it would be a +'better' behavior, though there are also valid reasons for someone to expect a +more 'calculator-like' answer to divisions like 1/2, and to be annoyed if +they get 1/2 back instead of 0.5. While obviously such changes will not be on +the table for a long while (and I should say here that I am very happy with the +planned moratorium on core language changes, as I hope that will allow a more +focused effort on the needs of the standard library), it was interesting to +hear Guido's approach to this question as one that could be handled via +overloadable literals rather than a change of integer semantics. I'd never +thought of that, but it's an intriguing idea... Something to think about for +when we start looking at Python 4000 :) + +.. _PEP 3118: http://www.python.org/dev/peps/pep-3118 + +We then headed over to the official presentation, where we managed to cram 14 +talks in 50 minutes and then had a full hour of open conversation with Guido, +where the audience asked him questions on a number of topics. You can see a +complete video_ of the entire session: after 50 minutes of talks we have a +transition, and Guido's section starts at the 54 minute mark. On my website I +have posted_ a page with all the slides for these mini-talks. 
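As a rough illustration of the integer division question from the lunch discussion above, here is a minimal sketch assuming Python 3 and the standard library ``fractions`` module. The variable names are only illustrative, and this says nothing about how overloadable literals might eventually look; it just contrasts the float, integer and rational answers to 1/2.

```python
from fractions import Fraction

# In Python 3, true division of two ints leaves the integers and gives a float.
float_result = 1 / 2

# Floor division stays in the integers (this is what 1/2 did in Python 2.x).
floor_result = 1 // 2

# Exact rationals exist in the standard library, but you must opt in explicitly.
rational_result = Fraction(1, 2)
third = Fraction(1, 3)

# Floats carry rounding error; rationals stay exact.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
rational_sum_is_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))
thirds_sum_to_one = (third + third + third == 1)

print(float_result, floor_result, rational_result)
print(float_sum_is_exact, rational_sum_is_exact, thirds_sum_to_one)
```

With native rationals, the first expression would behave like the third one without the explicit ``Fraction`` call, which is roughly the behavior Sage provides through its preparser.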
+ +I presented an overview_ introduction and material on behalf of 4 others who +were not present locally, but coincidentally, `William Stein`_ of Sage_ fame +was on campus to give a talk in the same building almost at the same time, and +he could present the Sage slides directly. `Ondrej Certik`_ from SymPy_ was +able to make the trip from Reno, completing our out-of-town speakers. The +other 7 presentations were from a number of local speakers (from various +departments at UC Berkeley and Lawrence Berkeley National Laboratory, just up +the hill from us). + +.. _overview: http://fperez.org/py4science/2009_guido_ucb/fperez_py4science_overview.pdf +.. _video: http://www.archive.org/details/ucb_py4science_2009_11_04_Guido_van_Rossum +.. _posted: http://fperez.org/py4science/2009_guido_ucb/index.html + +.. _William Stein: http://wstein.org +.. _sage: http://sagemath.org +.. _Ondrej Certik: http://ondrejcertik.blogspot.com/ +.. _sympy: http://code.google.com/p/sympy + +I have received very good feedback from several people, and I am really +thankful to all the speakers for being so attentive to the time constraints, +which let us pack a lot of material while leaving ample time for the discussion +with Guido. My intention with this was to really provide Guido with a broad +understanding of how significant Python's penetration has been in scientific +computing, where many different projects from disciplines ranging from computer +science to astronomy are relying heavily on his creation. I wanted both to +thank him for creating and shepherding such a high-quality language for us +scientists, and to establish a good line of communication with him (and +indirectly the core python development group) so that he can understand better +what are some of the use patterns, concerns and questions we may have regarding +the language. 
+ +I have the impression that in this we were successful, especially as we had +time after the open presentations for a more detailed discussion of how we use +and develop our tools. Most of us in scientific computing end up spending an +enormous amount of time with open interpreter sessions, typically IPython_ ones +(I started the project in the first place because I wanted a *very good* +interactive environment, beyond Python's default one), and in this work mode +the key source of understanding for code are good docstrings. This is an area +where I've always been unhappy about the standard library, whose docstrings are +typically not very good (and often they are non-existent). We showed Guido the +fabulous Numpy/Scipy `docstring editor`_ by Pauli Virtanen and Emmanuelle +Gouillart, as well as the fact that Numpy has an actual `docstring standard`_ +that is easy to read yet fairly complete. I hope that this may lead in the +future to an increase in the quality of the Python docstrings, and perhaps even +to the adoption of a more detailed docstring standard as part of `PEP 8`_, +which I think would be very beneficial to the community at large. + +In the end, putting all this together took me a lot more time than I'd +originally planned (I think I've had this same problem before...), but I am +very pleased with the results. Python has become a central tool for the work +many of us do, and I am really happy to establish a good dialogue with Guido +(and hopefully other core developers), which I'm sure will have benefits in +both directions. + +.. _IPython: http://ipython.scipy.org +.. _docstring editor: http://docs.scipy.org/numpy/Front%20Page/ +.. _docstring standard: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard +.. 
_pep 8: http://www.python.org/dev/peps/pep-0008 + diff --git a/old/091215-claremont-py4science.rst b/old/091215-claremont-py4science.rst new file mode 100644 index 0000000..80c6f56 --- /dev/null +++ b/old/091215-claremont-py4science.rst @@ -0,0 +1,71 @@ +===================== + Py4science teaching +===================== + +It's pretty clear to me that Python is rapidly growing in acceptance as a +computational platform at universities everywhere. I recently heard from `Josh +Bloom`_ at UC Berkeley astronomy that his proposal for a short 'boot camp' +course at the beginning of the Fall semester was approved. This is excellent +news, last year I taught `something similar`_ for neuroscience students and +postdocs, and I'm glad to see the campus adopting python further as a key +component of the computational training the science students receive. + +.. _Josh Bloom: http://astro.berkeley.edu/~jbloom/ +.. _something similar: http://fperez.org/py4science/workshop_berkeley_2008.html + + + +John Hunter and I just completed a few days ago teaching another such workshop +at the Claremont Colleges, supported by an NSF grant that `John Milton`_, +(J. Hunter's PhD advisor) has for exposing undergraduates to a number of +research-related experiences. This grant supports summer research internships +where two undergrads visit together a research lab elsewhere to work +independently on a project, which has already resulted in some really neat +projects, as well as our teaching of scientific python tools (we were also +there in 2008 and hopefully will continue next year). + +.. _John Milton: http://faculty.jsd.claremont.edu/jmilton/ + +I have to say that I really enjoy teaching this type of workshop, especially +now that the core tools are fairly mature, installation isn't the problem it +used to be, and we have a chance of presenting interesting examples to the +students from the very beginning. 
John Hunter is a phenomenal lecturer, with a +real knack for illustrating interesting concepts on the fly, in a very natural +manner that is honestly similar to how exploratory work is *actually* done in +real research, where you run a bit of code, plot some results, print a few +things, code some more, etc. In the picture above, he was working through some +of the `matplotlib image tutorial +`_, which gave us +an opportunity to find and fix a small bug in how the luminance histogram was +being computed (the current form is correct). Every one of these students +probably has a digital camera these days (if nothing else, in a cell phone), so +an example like this is a great way to connect something they are used to with +mathematical and programming concepts. + + +This one is from a great illustration of random numbers and simple statistics. +John built up interactively a simulation of random walks, working up from a +single (one-dimensional) walker to a group of them, comparing the simulation +results with the analytical predictions for various quantities (like mean and +variance), and also explaining to the students how these squiggly lines could +be considered a model of price fluctuations over time. In this manner, he +connected the somewhat abstract statistical concepts to something the students +could relate to, namely the risk of an investment making a profit or a loss +after a certain period. + +We also talked about FFTs, dynamical systems, error analysis, data formats, and +a few other things. It was very encouraging to have most of the students +return on the second day, considering how this was a completely optional +activity for them, covering an entire weekend (with morning lectures both days) +right before they had to start diving into their finals. But they were a smart +and enthusiastic bunch, and I hope that the workshop gave them some useful +starting points they can then develop on their own as they get involved in +research projects. 
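For readers who want to try the random walk exercise described above, here is a minimal sketch. It is not John's actual workshop code: it uses only the standard library rather than numpy and matplotlib, and the function name, walker count and step count are assumptions chosen for illustration. Each walker takes unit steps of +1 or -1 with equal probability, so theory predicts a mean final position of 0 and a variance equal to the number of steps.

```python
import random
import statistics


def simulate_walkers(n_walkers, n_steps, seed=0):
    """Return the final positions of n_walkers independent 1-D random walks."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    finals = []
    for _ in range(n_walkers):
        position = 0
        for _ in range(n_steps):
            position += rng.choice((-1, 1))  # one unit step left or right
        finals.append(position)
    return finals


n_walkers, n_steps = 2000, 100
finals = simulate_walkers(n_walkers, n_steps)

# Compare the simulation with the analytical predictions.
sample_mean = statistics.mean(finals)
sample_var = statistics.pvariance(finals)
print(f"mean:     {sample_mean:.2f} (theory: 0)")
print(f"variance: {sample_var:.1f} (theory: {n_steps})")
```

Reading each walk as a toy price history, the spread of the final positions gives a feel for the risk of ending with a profit or a loss after a fixed period.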
+ +These are just two examples of how we are now seeing Python's acceptance in +university computing growing. We have a lot of work still ahead of us, but +it's really encouraging to see how far we've come from the days when building +scipy was a black art, IPython was little more than a different prompt for the +python shell and matplotlib could only do basic line plots on top of libgd. We +now have tools that provide a complete computational environment for teaching +and research, and things are only getting better... diff --git a/old/091215-vision-hpc.rst b/old/091215-vision-hpc.rst new file mode 100644 index 0000000..1dfe2d6 --- /dev/null +++ b/old/091215-vision-hpc.rst @@ -0,0 +1,36 @@ + + +http://www.abhishek-tiwari.com/2009/08/visual-high-performance-computing-in.html + +From jose, re the python bof at SC 2009 + +The BoF was very well attended -- standing room only. There were probably about +100 people in the room. Spotz from Sandia Labs was there and it was a very +interesting presentation regarding NVIDIA "copperhead" python-to-GPU +programming. It's still very early in development, but looks very promising. + +No word on a corresponding workshop, though. The SC09 committee surveys the +attendance of each of these BoF's and uses that as input for subsequent related +events -- tutorials, workshops, etc. There is a lot of interest and some of the +topics you brought up with your meeting with Guido surfaced -- especially GIL. + +Anyway, given the strength of the attendance, hopefully this will result in +subsequent Python for HPC material at the next conference, which is in New +Orleans. + +Just parenthetically, I also attended the high-level languages series, which +included Matlab. Outside of the ability to distribute an array on multiple +nodes, I really didn't see anything that was light years ahead of what you all +have already implemented in IPython. 
In fact, during the question session, some +professor brought up the fact that it was too expensive for his institution to +continue to teach parallel programming using Matlab. Obviously, they weren't +too happy to discuss that publicly. + + +### +http://www.paratools.com/ptoolsrte.php + +The CentOS build (which includes associated binaries) has IPython in it. + +Hopefully, ptoolsrte will become part of the standard software installation for HPCMP and will thereby have IPython in it by default. That would be huge: ~6000 DoD users at four major computational centers around the country would have this. + diff --git a/old/100703-scipy2010.rst b/old/100703-scipy2010.rst new file mode 100644 index 0000000..18c151a --- /dev/null +++ b/old/100703-scipy2010.rst @@ -0,0 +1,175 @@ +====================== + SciPy 2010 in Austin +====================== + +A lot of scientific Python events back-to-back: I've just wrapped up +co-teaching a workshop on Python for computational chemistry `in Barcelona`_, +our `Open Computational Research`_ Python mini-event in the context of `Sage +Days 22`_ at Berkeley's MSRI_, and now the 2010 edition of the `SciPy +conference`_, which this year for the first time took place in Austin, TX. A +few notes about the conference while they are fresh in my mind (I'll try to +catch up on the other topics later). + +Since its inception in 2002, the conference had been held at Caltech (a great +location for all those occasions), but this year Enthought took a more direct +logistical role and hosted it at the UT Austin AT&T conference center. They +did a really great job: the venue was comfortable [#]_, the wifi excellent, +there was space for discussions and the setup for the sprints was very good. +Throughout the days the catering was perfect: instead of bringing coffee/snacks +only during narrow time windows, they kept them available all day, so you +didn't need to worry about missing the refreshments break. 
+ +Python is now a grown-up scientific language +============================================ + +It was an interesting SciPy, with big changes in what's happening in the +scientific Python world becoming readily apparent. Already last year I had the +feeling that we were finally solidifying enough the core tools layer for a wave +of new projects and ideas to start taking shape, but now that process is in +full swing. There were great talks both on computational 'infrastructure' +projects like Theano_ or Stefan's lightning one on the `image processing +scikit`_, and also great applied work where Python is 'just the perfect tool +for the job', like Chris Colbert's `robotic vision`_ presentation (sadly marred +by problems with the video system). Since I worry not only about IPython_ but +also about the greater tool ecosystem (numpy_, scipy_, matplotlib_, mayavi_, +sympy_, etc) , I find it extremely useful to see what kinds of things people +are doing with these tools. I get a better understanding of our current +capabilities and more importantly, of our limitations. + +A few randomly-remembered talks in no particular order: + +- Theano_: we may get really high-level ways to express our ideas and still be + able to have efficient low-level execution on multicore/gpu systems. This is + not quite a full production system, but it does some very elegant things with + code to be able to emit as efficient as possible of a low-level code. + +- The visualizations in Kristopher Overholt's `Numerical pyromaniacs`_ talk. + He showed demos of matplotlib/PIL plots overlaid on live video, to show + comparisons between the real videos of fire experiments and the model + predictions on flame height or other parameters. It's both appealing and + very informative. + +- Brian's ØMQ_ talk, though I admit I'm heavily biased here since we're now + using ØMQ for so much of the new IPython work and a lot of our enthusiasm + comes from our initial prototyping work together. 
Even with this disclosure, + it was a very good presentation. + +- The statistically oriented presentations, Pandas_ by Wes McKinney and + statsmodels_ by Skipper Seabold, were excellent. I'm thrilled to see the + growth of the statistical functionality in Python. While R is an + extraordinary collection of statistical functionality, as a general + programming language it has a number of limitations. I think that Python has + all the right bases to be a great tool for statistical computing, and these + are projects making strides in the right direction. + +.. _pandas: http://code.google.com/p/pandas/ + +.. _statsmodels: http://conference.scipy.org/scipy2010/slides/skipper_seabold_statsmodels.pdf + +.. _robotic vision: http://conference.scipy.org/scipy2010/slides/sccolbert_robot_vision.pdf + +.. _theano: http://conference.scipy.org/scipy2010/slides/james_bergstra_theano.pdf +.. _numerical pyromaniacs: http://conference.scipy.org/scipy2010/slides/kris_overholt_pyromaniacs.pdf + +.. _ØMQ: http://conference.scipy.org/scipy2010/slides/brian_granger_pyzmq.pdf + +There was a marked increase of interest from industry this year: Microsoft, +Dell, AQR Capital, D.E. Shaw and others offered support and had a presence at +the meeting. I was particularly pleased to see fellow Bay Area developers from +a financial firm participating in the sprints. It's great to see industry +involvement not just in terms of financial contributions but also of direct +hands-on participation. + +Numpy on .Net +============= + +A big announcement_ was made by Travis Oliphant in his keynote about a +partnership between Enthought and Microsoft to refactor Numpy so that its core +has fewer/no CPython dependencies and then wrap it for access by IronPython, +Microsoft's .NET implementation of Python, with a subsequent port of Scipy as +well. I've never been a fan of Microsoft Windows as a working environment, but +I'm actually pretty excited about this possibility. 
Windows is very widely +used, and I've heard very good things about the technical foundations of the +.NET Common Language Runtime system as well as the availability of many good +libraries. The prospect of being able to run all of our C- and Fortran-based +numerical tools on top of the .NET machinery opens up lots of interesting +opportunities for Python to be useful in environments that have already +invested heavily in the .NET system. + +.. _announcement: http://www.flickr.com/photos/pivanov/4752318542 + +I hope they will succeed, because it's a technically challenging (though very +interesting and fun) project. Travis mentioned a fairly tight timeline of a +few months for the port, so we should know soon enough what the outcome is. + +Here is the official `press release`_ from Enthought. + +.. _press release: http://www.enthought.com/media/SciPyNumPyDotNet.pdf + +The tutorials +============= + +This year I wasn't involved with the tutorial organization, but Brian Granger +took over and from the feedback I received, he did a terrific job (I +unfortunately missed both tutorial days). Many people specifically mentioned +to me the quality of his own tutorial session, the HPC presentation. I read +over the materials, and it's a very compact and efficient overview of +performance analysis and parallelization tools and strategies available to +Python developers. + +.. [#] except for a sub-par video system in the main auditorium which was + finicky and had horrific contrast, so speakers with slightly dark slides + ended up with big black boxes on the screen. + +.. _in Barcelona: http://www.xrqtc.cat/index.php/en/home-escuela-4-2010 +.. _Open Computational Research: XXX +.. _Sage Days 22: XXX +.. _MSRI: http://msri.org +.. _SciPy conference: http://conference.scipy.org +.. _Theano: XXX +.. _image processing scikit: http://scikits.appspot.com/ XXX +.. _IPython: http://ipython.scipy.org +.. _numpy: http://numpy.scipy.org +.. _scipy: http://www.scipy.org +.. 
_matplotlib: http://matplotlib.sf.net +.. _mayavi: http:// XXX +.. _sympy: http://sympy.org + + +Arrays with labeled axes +======================== + +We had a great Birds of a Feather (BoF) session discussing the various +approaches to arrays with labeled axes, arbitrary tick labels along the axes +(so you can index with something other than integers in the ``0..n`` range), +and similar ideas. At SciPy'09 I started playing with this idea and since, a +group of us had made some progress on a prototype code, especially thanks to +Mike Trumpis' recent work. But in the meantime a multitude of related projects +have been developed, indicating that there's a genuine need for something like +this. Our approach has been to keep a very simple and generic code that +directly subclasses numpy's ndarray, in the hope that it can go directly into +numpy itself. More sophisticated systems that provide even richer semantics, +like larry_ and pandas_, hopefully would then be able to reuse this common +layer. + +After returning to Berkeley, we've had a design discussion and a sprint as part +of the `Py4Science series`_ I organize on campus. It was particularly useful +that Keith Goodman (the author of Larry_), has been very actively working with +us. Currently our code is available `on github`_ and we have also pushed an +`early proto-release to PyPI`_ as well as keeping up to date documentation_. +We hope this will serve as a good base to understand how to best design such an +object for eventually improving Numpy. + +I want to thank everyone who participated in the discussion, and at the same +time apologize for not having been sufficiently active on the list and merging +some pull requests that have already been made (sorry Rob!). But I hope that +with Keith and a few more participants on board, my own time limitations will +be less of a bottleneck and we'll have some great improvements to numpy +available for real-world use very soon. + +.. 
_larry: http://github.com/kwgoodman/la +.. _pandas: http://code.google.com/p/pandas +.. _on github: http://github.com/fperez/datarray +.. _documentation: http://fperez.github.com/datarray-doc/ +.. _py4science series: https://cirl.berkeley.edu/view/Py4Science +.. _early proto-release to PyPI: http://pypi.python.org/pypi/datarray diff --git a/old/101223-scipy.in-2010.rst b/old/101223-scipy.in-2010.rst new file mode 100644 index 0000000..373420f --- /dev/null +++ b/old/101223-scipy.in-2010.rst @@ -0,0 +1,79 @@ +=============================== + SciPy India 2010 in Hyderabad +=============================== + +I have just returned to the USA after an intense ~10 days in India, where I had +the opportunity to participate in the `Scipy India 2010 conference`_ thanks to +a generous invitation by `Prabhu Ramachandran`_ and his `FOSSEE.in`_ project. +Fossee.in (Free and Open Source Software in Engineering and Education in India) +is a remarkable effort to systematically develop tools for the technical +education and research environments in India based around an open source +ecosystem, and Python plays a central role there (though other tools like LaTeX +and Scilab are also used). Considering that India, per Prabhu's numbers, +graduates approximately 800,000 technical and engineering students per year, +the potential long-term impact of this effort is absolutely enormous, and +Prabhu has assembled a team full of both energy and talent. I've been very +excited about this since I first heard the news of its funding, and I'm glad to +have had the chance of working with several members of the team for a few days. + +.. _Scipy India 2010 conference: http://scipy.in +.. _Prabhu Ramachandran: +.. _fossee.in: + +The project brought to the conference a group of invited speakers from outside +India: Perry Greenfield, John Hunter, Stefan van der Walt, Satrajit Ghosh, +Jarrod Millman and myself. 
Prabhu and his wife Kadambari (KD), also a member +of the fossee team, did a wonderful job in hosting us, and the hospitality of +the whole team was remarkable; I can't thank them enough for the time and +energy they put into making all of us feel comfortable and enjoy our visit +(which for most of us was our first foray into India). + +The conference +============== + +The conference itself followed a similar format (with minor order alterations) +to the US SciPy ones: two days of regular talks, then a mix of introductory +tutorials and development sprints. A brief summary of the (longer) invited +talks and a sampling of contributed talks follows. + +Perry presented a beautiful opening keynote that combined the history of +Python's growth in scientific computing through the eyes of the astronomy +world, where he has played a leading role for years, with examples of the +scientific problems in modern astronomy they have faced (and sometimes solved +using Python). John gave an overview of matplotlib_ that included both +historical aspects and an overview of its current capabilities, highlighting +some points that many people might not be aware of, like the new animation +functionality. Stefan's talk was about the social aspects of participating in +open source communities as part of an academic career, a topic which I think +was very fitting for the audience, since he showed how his contributions to +Numpy and other Python tools have opened many scientific and professional doors +for him. Given how the goal of the Fossee project is to involve India's +scientists and engineers with this community, I think this was probably one of +the most useful talks of the event, and it was delivered with Stefan's unique +combination of flair and self-deprecating humor that makes him such a great +speaker. 
Satra and Jarrod both spoke about projects centered around +neuroscience; Satra offered the audience (not composed of neuroscientists) an +excellent introduction to the relevant scientific questions and an overview of +the problems of modern neuroimaging data analysis and how the Nipype_ package +offers solutions for some of them. Jarrod spoke about the genesis of the +umbrella Nipy_ effort that Nipype is a part of, and which now encompasses a +number of related Python packages, each targeted at a different aspect of +modern neuroimaging analysis. + +.. _nipy: http://nipy.org +.. _nipype: http://nipy.org/nipype +.. _matplotlib: http://matplotlib.sf.net + + + +IPython +======= + + + +.. _larry: http://github.com/kwgoodman/la +.. _pandas: http://code.google.com/p/pandas +.. _on github: http://github.com/fperez/datarray +.. _documentation: http://fperez.github.com/datarray-doc/ +.. _py4science series: https://cirl.berkeley.edu/view/Py4Science +.. _an early proto-release to PyPI: http://pypi.python.org/pypi/datarray diff --git a/old/110219-aaas-reproducible-research.rst b/old/110219-aaas-reproducible-research.rst new file mode 100644 index 0000000..9b528f3 --- /dev/null +++ b/old/110219-aaas-reproducible-research.rst @@ -0,0 +1,187 @@ +================================================================== + Reproducible Research at the AAAS 2011 meeting in Washington, DC +================================================================== + +**Update:** added links to other related posts, significantly expanded the +section on Git and Github for scientific work. + +At this year's AAAS meeting, currently taking place in DC (in unseasonably warm +and sunny weather), `Victoria Stodden`_ from the statistics department at +Columbia, organized a symposium_ titled *The Digitization of Science: +Reproducibility and Interdisciplinary Knowledge Transfer* that was very well +attended. She has put together a page_ with the full abstracts as well as +links to the individual slides. 
On this site, I've posted the slides_ for my +talk, as well as an `extended abstract`_. + +Lessons from the Open Source software world +=========================================== + +I have tagged this post with "Python" because my take on the matter was to +contrast the world of classic research/academic publishing with the practices +of open source software development, and what little I know about that (as well +as some specific tools I mentioned, like Sphinx_), I picked up from the world +of open source scientific Python projects I'm involved with, from IPython_ +onwards. My argument is that the tools and practices from the open source +community in fact come much closer to the scientific ideals of reproducibility +than much of what is published in scientific journals today. + +.. _victoria stodden: http://www.stanford.edu/~vcs +.. _symposium: http://aaas.confex.com/aaas/2011/webprogram/Session3166.html +.. _page: http://stanford.edu/~vcs/AAAS2011 +.. _sphinx: http://sphinx.pocoo.org +.. _ipython: http://ipython.scipy.org + +The OSS world is basically forced to do this, because people across the world +are collaborating on developing a project from different computing +environments, operating systems, library versions, compilers, etc. Without +very strong systems for provenance tracking (aka version control), automatic +testing and good quality documentation, this task would be simply impossible. +But many of these tools can be adapted for use in everyday scientific work; for +some use cases they work extremely well, for others there's still room for +improvement, but overall we can and should take these lessons into everyday +scientific practice. 
+ +In my talk, I spent a fair amount of time discussing the Git version control +system, not in terms of its technical details, but rather trying to point out +how it should not be viewed just as a tool for software development, but +instead as something that can be an integral part of all aspects of the +research process. Git is a powerful and sophisticated system for provenance +tracking that automatically validates data integrity by design: Linus Torvalds +wanted to ensure that every commit operation is signed with a hash of its +contents *plus* the hash of its dependencies (for details on this, his +sometimes abrasive `Google Tech Talk about Git`_ is an excellent reference). +This simple idea ensures that a single byte change *anywhere* in the entire +repository can be detected automatically. + +I use Git for just about all my activities at the computer that require +manually creating content, with repositories not only for research projects +that involve writing standalone libraries, but also for papers, grant +proposals, data analysis research, and even teaching. Its distributed nature +(every copy of the repository has all the project's history) makes it +automatically much more resilient to failures than a more limited legacy tool +like Subversion, and its strong branching and merging capabilities make it great +for exploratory work (something that is painful to achieve with SVN). Git is +also the perfect way to collaborate on projects: all members have full +versioning control, can commit work as they need it, and can make visible to +collaborators only what they deem ready for sharing (this is impossible to do +with SVN). 
Writing a multiauthor paper or grant proposal with Git is a far +saner, more efficient and less error prone process than the common madness of +emailing dozens or hundreds of attachments every which way between multiple +people (for those who think Dropbox suffices for collaborative writing\: it's +like using a wood saw for brain surgery; Dropbox is great for many things and I +love it, but it's not the tool for this problem). I have also used Git for +teaching, by creating a public repository for all course content and individual +repositories for each student that only the student, the teaching assistants +and I can access. This enables students to fetch all new class content +with a simple:: + + git pull + +instead of clicking on tens of files in some web-based interface (the typical +system used by many universities). A single clone operation can reconstruct +the entire class repository on another computer if they need to use it in more +than one place or lose their old copy. And when it's time to submit +homework, instead of emailing or uploading anything, all they need to do is:: + + git push + +and the TAs have immediate access to all their work, including its development +history. In this manner, not only is the process vastly smoother and simpler +for all involved, but the students learn to use version control as a natural +tool that is simply part of their daily workflow. + +.. _google tech talk about git: http://www.youtube.com/watch?v=4XpnKHJAok8 + +I also tried to highlight the role played by the GitHub_ service as an enabler +of collaboration. While Git can be used (and it is extremely useful in this +mode) on a single computer without any server support, the moment several +people want to share their repositories for collaborative work, some kind of +persistent server is the best solution. 
GitHub, a service that is free for +Open Source projects and that offers paid plans for non-public work, has a +number of brilliant features that make the collaboration process amazingly +useful. GitHub makes it trivial for new contributors to begin participating in +a project by *forking* it (i.e. getting their personal copy to work on), and if +they want their work to be incorporated into the project, they can make a *pull +request*. The original authors then review the proposed changes, comment on +them (including making line-specific comments with a single click), and once +all are satisfied with the outcome, integrate them. This is effectively a +public peer review system that, while created for software development, can be +equally useful for collaborative authorship of a research project. + +.. _github: http://github.com + +I should add, however, that I think there's still room for improvement +regarding Git as a tool for pervasive use in the scientific workflow. As much +as I absolutely love Git, it's a tool tailored for *source code* tracking and +its atomic unit of change is the line of code. As such, it doesn't work as +conveniently when tracking, for example, changes in a paper (even if written in +TeX), where a small change can reflow a whole paragraph, showing a diff that is +much larger than the real change. In this case, the "track changes" features +of word processors actually work better at showing the specific changes made +(despite the fact that I think they make for a horrible interface for the +overall workflow) [*Note*: in the comments below, a reader indicates that the +``--word-diff`` option solves this problem, though I think it requires a very +new version of Git, 1.7.2 at least. But it's fantastic to see this kind of +improvement being already available]. And for tracking changes to binary +files, there's simply no meaningful diff available. 
It would be interesting to +see new ideas for improving something like Git for these kinds of use cases. + +I wrapped things up with a short mention of the new `Open Research +Computation`_ journal, where Victoria and I are members of the editorial board, +as well as several well-known contributors to the scientific Python ecosystem, +including Titus Brown, Hans-Petter Langtangen, Jarrod Millman, Prabhu +Ramachandran and Gaël Varoquaux. + +.. _open research computation: http://www.openresearchcomputation.com +.. _slides: http://fperez.org/talks/1102_aaas_reproducibility_fperez_slides.pdf +.. _extended abstract: http://fperez.org/talks/1102_aaas_reproducibility_fperez_extabs.pdf + + +Other presentations +=================== + +I spoke after `Keith Baggerly`_ and Victoria. Keith presented an amazing +dissection of the (ongoing) scandal with the Duke University `cancer clinical +trials`_ that has seen extensive media coverage. This case is a bone-chilling +example of the worst that can happen when unreproducible research is used as +the base for decisions that impact the health and lives of human beings. Yet, +despite the rather dark subject, Keith's talk was one of the most lively and +entertaining presentations I've seen at a conference in a long time. Victoria +discussed the legal framework in which we can begin considering the problem of +reproducible computational research; she was instrumental in the NSF's new +grant guidelines now having a mandatory data management plan section. She has +the unique combination of both a computational and a legal background, which is +very necessary to tackle this problem in a meaningful way (since licensing, +copyright and other legal issues are central to the discussion). 
+ +Afterwards, Michael Reich from the Broad Institute presented the GenePattern_ +project, an impressive genomic analysis platform that includes provenance +tracking and workflow execution, as well as a plug-in for Microsoft Word to +connect documents with the execution engine. While the Word graphical user +interface would likely not be my environment of use, the GenePattern system +seems to be very well thought out and useful. The last three talks were by +Robert Gentleman of BioConductor_ fame, David Donoho (Victoria's PhD advisor +and a pioneer in posing the problem of reproducibility in computational work, +together with Jon Claerbout), and finally Mark Liberman of U. Penn (see `Mark's +blog`_ for his take on the symposium). + +.. _keith baggerly: http://odin.mdacc.tmc.edu/~kabaggerly +.. _cancer clinical trials: http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All +.. _genepattern: http://genepattern.org +.. _bioconductor: http://www.bioconductor.org +.. _mark's blog: http://languagelog.ldc.upenn.edu/nll/?p=2976 + +I think the symposium went very well; there was lively discussion with the +audience and good attendance. A journalist made a good point on how +improvements on the reproducibility front are important for them, when they are +trying to do their job of reporting to a sometimes skeptical public the results +of scientific work. If our work is made available with strong, credible +guarantees of reproducibility, it will be that much more easily presented to a +society which ultimately decides whether to support the scientific endeavor (or +not). + +There is a lot of room for improvement, as Keith Baggerly's talk painfully +reminded us. But I think that finally the climate is changing, and in this +case in the right direction: the tools are improving, people are interested, +funding agencies are modifying their policies and so are journals.