JOSS paper preparation #1249
jules32
left a comment
Hi! Great work on this Danny! A few commits and some suggestions to consider.
Co-authored-by: Julia Stewart Lowndes <julia@openscapes.org>
We could symlink this in to our docs!
@mfisher87, want to create an issue for it?
I would say let's not wait. We've demonstrated impact and I think that matters more. Alternatively, let's just go 1.0.0 in the short term and be OK with quickly moving to a 2.0.0 release with breaking changes. I think both are fine, but the latter sets more of a precedent of maintainers taking the user impact of breaking changes too lightly.
Co-authored-by: Matt Fisher <3608264+mfisher87@users.noreply.github.com>
I'm fine with either too. I also think the decision could be on hold until one of two things, (i) co-author reviews/revisions or (ii) development for v1.0.0, is completely ready to go.
Co-authored-by: Amy Steiker <47193922+asteiker@users.noreply.github.com> Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com>
Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com> Co-authored-by: Amy Steiker <47193922+asteiker@users.noreply.github.com>
Hey all, it's been a couple of weeks since there was any activity here, so I'm pinging to keep this moving. It would be great to have a complete draft ready to submit before Northern Hemisphere summer! If there's not a specific note next to your username, a general read-through and comments are welcome: @andypbarrett
jules32
left a comment
Thank you, @danielfromearth! I've added the award number. Thanks for leading this!
| **Peer-reviewed publications.** `earthaccess` has been used in published research, | ||
| including studies on multi-sensor drought observations in forested environments | ||
| [@andreadis2024] and tidal bore detection using SWOT satellite data [@arildsen2025]. |
Co-authored-by: Daniel Kaufman <114174502+danielfromearth@users.noreply.github.com> Co-authored-by: Julia Stewart Lowndes <julia@openscapes.org> Co-authored-by: Amy Steiker <47193922+asteiker@users.noreply.github.com>
Friendly ping for co-authors who haven't yet had a chance to review (or at least, approve): @andypbarrett @jhkennedy @jrbourbeau @battistowx @Sherwin-14 @betolink @chuckwondo Things have been coming together and I think we are getting close to a complete draft that's ready. Would be great to have everyone's eyes on it, even briefly, before we finalize. Could you each take a look in the next week or two? In particular, please confirm your name, affiliation, and ORCID are correct in the author list. And of course, all other comments welcome. If timing doesn't work, just comment as such so we know where things stand. Thanks!
Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com>
Related: https://earthaccess.zulipchat.com/#narrow/channel/480557-general/topic/JOSS.3F/with/590557057
We're considering (and planning on) going for pyOpenSci review first, which would give us a stronger review and expedite the JOSS acceptance process if the package is accepted by pyOpenSci. Thanks @sampottinger for sharing this with me :)
betolink
left a comment
I left some comments and suggestions, but nothing major. I think this is a good draft, so I'm approving as is. Thanks for leading the effort, @danielfromearth!
| 3. **Access**: Attempts to detect at runtime whether the process is running within AWS `us-west-2` | ||
| and automatically selects the optimal access path -- direct S3 reads for in-region | ||
| access or HTTPS downloads otherwise. Users can manually specify an access path if needed. Files can be opened as `fsspec`-compatible |
I like the concise way of presenting this. Maybe we can add that being format-agnostic and compatible with Python file-like objects makes the library interoperable with the rest of the scientific Python ecosystem (aka PyData/Pangeo).
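To make the access-path choice in the excerpt above concrete, here is a minimal sketch. It is not the earthaccess implementation: the function name `choose_access_path`, its parameters, and the `"direct"`/`"https"` labels are assumptions made up for illustration, and whether the process is in-region is supplied by the caller rather than detected.

```python
from typing import Optional

# Hypothetical labels for illustration; not taken from the earthaccess API.
VALID_PATHS = ("direct", "https")


def choose_access_path(in_region: bool, override: Optional[str] = None) -> str:
    """Pick an access path: an explicit override wins, else S3 in-region, else HTTPS."""
    if override is not None:
        if override not in VALID_PATHS:
            raise ValueError(f"unknown access path: {override!r}")
        return override
    # Without an override, prefer direct S3 reads only when running in-region.
    return "direct" if in_region else "https"


default_in_region = choose_access_path(in_region=True)
default_elsewhere = choose_access_path(in_region=False)
forced_https = choose_access_path(in_region=True, override="https")
```

Keeping the in-region flag as an explicit input means the manual override described in the manuscript text stays the authoritative choice.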
| open-source tools -- `python-cmr` for search, `fsspec` and `s3fs` for file I/O, | ||
| VirtualiZarr and kerchunk for virtual datasets -- rather than reimplementing their | ||
| functionality. The library's unique contribution is the NASA-specific integration | ||
| layer that binds these tools together. |
This is the awesomeness: integrating and simplifying the steps a scientist usually takes when working with NASA data. Maybe add an example of the reduction in time to science, both in lines of code and in speed from performance optimizations via fsspec and VirtualiZarr. TEMPO or ICESat-2 could be used for this: before, N minutes, now N seconds; before, 10 lines of code, now 1.
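As a starting point for such an example, a rough sketch of the condensed workflow might look like the following. It assumes `earthaccess` is installed and Earthdata Login credentials are available. The ICESat-2 `ATL06` short name, the date range, and `count` are illustrative choices, and the helper name `open_icesat2_sample` is hypothetical. No timings or line counts are claimed here.

```python
def open_icesat2_sample(count: int = 2):
    """Search for a few granules and open them as file-like objects."""
    # Imported lazily so this module loads even where the dependency is absent.
    import earthaccess

    # Tries the available credential sources (e.g. environment variables,
    # a .netrc file, or an interactive prompt).
    earthaccess.login()

    results = earthaccess.search_data(
        short_name="ATL06",                      # illustrative dataset choice
        temporal=("2020-01-01", "2020-01-31"),   # illustrative date range
        count=count,                             # cap the number of granules returned
    )

    # Returns a list of fsspec-compatible file-like objects.
    return earthaccess.open(results)
```

The returned objects can be passed to readers that accept file-like objects, such as `h5py.File`.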
| [@andreadis2024] and tidal bore detection using SWOT satellite data [@arildsen2025]. | ||
| **Community adoption.** The library is a dependency of 230 public GitHub | ||
| repositories (as of 5 March 2026), spanning data analysis workflows, Jupyter-based tutorials, and |
Let's mention machine learning projects here. Some of the projects using earthaccess run AI or ML workflows, even at production scale.
Each of these projects didn't have to reinvent the wheel to access NASA Earth data.
Oh, never mind, it's mentioned below.
Thanks for the ping @danielfromearth! I've been traveling the last few weeks and will finally make it back into the office on Monday. I'll have a look ASAP, but I suspect it's already in good shape judging from my quick glance here.
Co-authored-by: Jessica Scheick <JessicaS11@users.noreply.github.com>
jhkennedy
left a comment
Well, it turns out I do have a bit of feedback 😊 . I think it's in a very good place, and really would be fine to submit with or without my feedback.
Other than the specific things discussed below, I have a pretty big concern about publishing this with a discussion of the automatic cloud-detection logic. That's something we know is technically infeasible to do reliably, and we've decided to rip it out:
https://github.com/earthaccess-dev/earthaccess/blob/main/docs/governance/decisions/231-aws-us-west-2-checking-method.md
So I'd like to either not mention it or abstract that away in the manuscript language.
Since I have a lot of feedback, I could open a PR into this PR with how I'd resolve my comments, if that's easier. Just let me know.
| - name: "Booz Allen Hamilton, Inc., McLean, VA, USA" | ||
| index: 8 | ||
| ror: 051rcp357 | ||
| - name: "University of Alaska Fairbanks, Fairbanks, AK, USA" |
| - name: "University of Alaska Fairbanks, Fairbanks, AK, USA" | |
| - name: "Alaska Satellite Facility, Geophysical Institute, University of Alaska Fairbanks, Fairbanks, AK, USA" |
| must now contend with two possible access paradigms, traditional HTTPS downloads and S3-based | ||
| access. These both may even occur within a single analysis workflow. During workshops organized by NASA | ||
| Openscapes [@nasa_openscapes; @lowndes2019], the need for simpler tools became evident. | ||
| `earthaccess` was created to address this gap: it provides uniform access to NASA |
| `earthaccess` was created to address this gap: it provides uniform access to NASA | |
| `earthaccess` is a community project that was created to address this gap: it provides uniform access to NASA |
We don't mention community at all until the end of the Software Design section, and we don't talk about the community aspect of developing this library at all, which I think is pretty integral to its success and would be nice to represent somewhere in the introduction.
| error, and DAAC-specific configurations further compound the challenge. | ||
| NASA's ongoing migration to the Earthdata Cloud adds further complexity, as researchers | ||
| must now contend with two possible access paradigms, traditional HTTPS downloads and S3-based |
I think this sentence should be moved up into the previous paragraph before (5) and (6), or made part of a standalone paragraph with (5) and (6).
| and decision-makers globally [@nasa_esds_data_metrics]. However, the complexity of the underlying data infrastructure | ||
| presents a significant barrier to scientific productivity. A typical data access workflow | ||
| requires a researcher to: (1) authenticate with NASA Earthdata Login; (2) discover | ||
| relevant datasets and granules through the CMR API; (3) parse metadata to obtain download |
| relevant datasets and granules through the CMR API; (3) parse metadata to obtain download | |
| relevant datasets and granules through the CMR API; (3) parse metadata to obtain access |
This is true for downloading, in-place HTTP access, or S3 "direct" access.
| URLs; (4) manage HTTP sessions with tokens and redirect handling; (5) determine whether | ||
| data are hosted on-premises or in the Earthdata Cloud; and (6) obtain temporary AWS S3 | ||
| credentials when accessing cloud-hosted data. Each step introduces opportunities for |
You only need to do (5) if you're doing (6)...
| - **python-cmr** [@python_cmr] provides a Python wrapper around the CMR API for dataset | ||
| and granule queries. `earthaccess` builds on `python-cmr`, extending it with | ||
| DAAC-aware provider resolution, cloud-hosting filters, and rich result objects that | ||
| encapsulate metadata. However, `python-cmr` does not handle authentication, data | ||
| download, or cloud access -- the areas where researchers face many workflow difficulties. |
We should also call out asf_search. It sits in between python-cmr and earthaccess: it is focused on search and discovery but also handles auth, etc. It is, however, primarily focused on SAR data, so it has domain-specific tools and functionality added to it.
It was started 2 months before earthaccess and came out of the same needs and problems, but with a different focus.
| - **earthdatalogin** [@earthdatalogin_r] provides similar authentication and access | ||
| functionality for the R programming ecosystem. The two projects share a common motivation and | ||
| serve as complementary tools for their respective language communities. |
🤔 are there other R/Julia things we should call out?
| NASA's Earth science data archive is one of the largest and most diverse collections of | ||
| Earth observation data in the world, used by over ten million researchers, educators, | ||
| and decision-makers globally [@nasa_esds_data_metrics]. However, the complexity of the underlying data infrastructure | ||
| presents a significant barrier to scientific productivity. A typical data access workflow |
I'm not sure I like how we've ordered the "data access workflow". Right now we have:
- auth
- "discover"
- parse metadata
- sessions + redirects
- is cloud?
- S3 credentials
I think (1) and (4) should be combined and indeed that's how we discuss it on L196
https://github.com/earthaccess-dev/earthaccess/pull/1249/changes#diff-e504eb580b095a7e65428b098183a581e475f0fb316db95287eacd7d4f344424R196
Similarly, (5) and (6) are optional: they are needed only for in-place cloud access with performance constraints, or if you want to use S3-aware tools. Really, they fit into (1) and (4) as well, which is also how they're discussed on L196.
I also think (2) is better described as "search", and (2) + (3) together is what I would call discovery. At least for me, I am always parsing metadata as part of what I'd call discovery: typically searching broadly and then refining by sensor/bands/variable/etc., so that I end up with the actual set of granules I want to use in my workflow. I don't really see why getting the access URLs is special compared to getting any of the other metadata along the way.
We don't talk about data preparation at all, except as features of Harmony and icepyx, which seems like a missed opportunity.
I think I'd restructure this like:
- Discovery
- Auth (EDL, S3, Sessions + redirects)
- Access
- Data prep (includes virtual datasets and transformations)
which is similar to the Software Design section. Note that I've put auth after discovery since you generally only need it to access data, unless you're trying to discover restricted datasets. So it could go before or after discovery; I think it just flows a little better narratively after, but 🤷.
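To illustrate the "search broadly, then refine" view of discovery described above, here is a minimal sketch. The `Granule` record, its fields, the sample values, and the `refine` helper are all hypothetical stand-ins; real CMR metadata is far richer and this is not how earthaccess represents results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Granule:
    # Hypothetical, simplified metadata record for illustration only.
    granule_id: str
    sensor: str
    url: str


def refine(granules: List[Granule], sensor: str) -> List[Granule]:
    """Narrow a broad search result to the granules actually wanted."""
    return [g for g in granules if g.sensor == sensor]


# Stand-in for a broad search result (made-up values).
broad = [
    Granule("g1", "sensor-a", "https://example.invalid/g1"),
    Granule("g2", "sensor-b", "https://example.invalid/g2"),
    Granule("g3", "sensor-a", "https://example.invalid/g3"),
]

selected = refine(broad, sensor="sensor-a")
# The access URL is read like any other metadata field during discovery.
access_urls = [g.url for g in selected]
```

In this framing, obtaining access URLs is not a separate workflow step; it falls out of the same metadata parsing used to refine the search.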
| 3. **Access**: Attempts to detect at runtime whether the process is running within AWS `us-west-2` | ||
| and automatically selects the optimal access path -- direct S3 reads for in-region | ||
| access or HTTPS downloads otherwise. Users can manually specify an access path if needed. Files can be opened as `fsspec`-compatible |
| access or HTTPS downloads otherwise. Users can manually specify an access path if needed. Files can be opened as `fsspec`-compatible | |
| access or HTTPS access otherwise. Users can manually specify an access path if needed. Files can be opened as `fsspec`-compatible |
You can download or stream via HTTPS
| # AI usage disclosure | ||
| No generative AI tools were used in the development of the `earthaccess` software; all architectural and design decisions were made exclusively by the authors and contributors. |
Hmm, is this true anymore? @betolink have you been using Claude for the VirtualiZarr work?
I wonder if we need to adopt an AI policy and say something like "...developers may use AI tools but are responsible for their contributions...".
Manuscript draft
This PR is intended for revisions and improvements to the manuscript draft being prepared for submission to the Journal of Open Source Software (JOSS).
Paper format: The manuscript is prepared as a Markdown (`paper.md`) file with references in a `paper.bib` file, following the JOSS formatting guidelines.
For a PDF preview: With Docker installed locally, a PDF preview of the draft manuscript can be generated by running the following from the earthaccess root directory (as described in the Docker section of the JOSS guidelines):
📚 Documentation preview 📚: https://earthaccess--1249.org.readthedocs.build/en/1249/