Commit 89123d1

committed
BOSC poster
0 parents  commit 89123d1

15 files changed

+7878
-0
lines changed

01.intro.md

+5
@@ -0,0 +1,5 @@
# Intro

Python has a mature ecosystem for extensions using C/C++, with Cython being part of the standard toolset for scientific programming. Even so, C/C++ still have many drawbacks, ranging from smaller annoyances (like library packaging, versioning and build systems) to serious ones like buffer overflows and undefined behavior leading to security issues. Rust is a systems programming language that tries to avoid many of the C/C++ pitfalls, on top of providing a good development workflow and memory safety guarantees. This work presents a way to write extensions in Rust and use them in Python, using sourmash [Brown and Irber, 2016] as an example.

sourmash implements MinHash [Broder, 1997], a method for estimating the similarity of two or more datasets, and expands on the work pioneered by Mash [Ondov et al., 2016]. It is available as a CLI and a Python library.
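
As a rough illustration of the MinHash idea, the following Rust code builds a bottom-k sketch over the k-mers of a sequence and estimates the Jaccard similarity of two sketches. It is a minimal sketch under stated assumptions, not the sourmash implementation: the function names (`hash_kmer`, `sketch`, `similarity`) are hypothetical, and the standard library's `DefaultHasher` is only a stand-in for the hash function a real tool would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer to a 64-bit value.
/// Assumption: DefaultHasher is a stand-in chosen for illustration only.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Build a bottom-k sketch: the `num` smallest distinct k-mer hashes.
fn sketch(seq: &[u8], ksize: usize, num: usize) -> BTreeSet<u64> {
    let mut mins = BTreeSet::new();
    if ksize == 0 || seq.len() < ksize {
        return mins;
    }
    for kmer in seq.windows(ksize) {
        mins.insert(hash_kmer(kmer));
        if mins.len() > num {
            // Keep only the `num` smallest hashes by dropping the largest.
            let largest = *mins.iter().next_back().unwrap();
            mins.remove(&largest);
        }
    }
    mins
}

/// Estimate Jaccard similarity from two bottom-k sketches:
/// take the `num` smallest hashes of the union and count how many
/// of them appear in both sketches.
fn similarity(a: &BTreeSet<u64>, b: &BTreeSet<u64>, num: usize) -> f64 {
    // BTreeSet::union yields values in ascending order.
    let union: Vec<u64> = a.union(b).copied().take(num).collect();
    if union.is_empty() {
        return 0.0;
    }
    let shared = union
        .iter()
        .filter(|h| a.contains(*h) && b.contains(*h))
        .count();
    shared as f64 / union.len() as f64
}
```

Calling `sketch` on two sequences with the same `ksize` and `num`, then passing both results to `similarity`, gives an estimate between 0.0 (no shared hashes) and 1.0 (identical sketches).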

02.why.md

+7
@@ -0,0 +1,7 @@
# Why Rust?

While Rust doesn't aim at being a scientific language, its focus on being a general-purpose language allows a phenomenon similar to what happened with Python, where people from many areas pushed the language in different directions (system scripting, web development, numerical programming...), creating an environment where developers can combine it all in their systems.

Rust brings many best practices to the default experience, such as integrated package management with Cargo, which also supports documentation, testing and benchmarking. Some of these practices are not viable in C/C++ because of the widespread adoption of both languages and their backward compatibility guarantees. At the same time, because Rust was initially developed to be integrated incrementally into the Firefox browser engine, it tries to keep as much compatibility as possible with C/C++.

Rust also has a minimal runtime (like C, unlike Python), making it a good candidate for embedding into other software or even for situations where strict control of resources is required (microcontrollers and embedded systems, for example).

03.current_impl.md

+20
@@ -0,0 +1,20 @@
# Current implementation

[![](poster/figures/arch_cpp.png)](poster/figures/arch_cpp.svg)

## Pros

- Cython is a superset of Python
- Mature codebases for example usage and best practices
- Lower overhead to call C/C++ code
- NumPy integration
- Nice gradual path to migrate performance-intensive code from Python to C/C++

## Cons

- Cython C++ integration has some corner cases and missing features
- Need to rewrite header declarations (pxd file)
- Errors can be cryptic (do they happen at the C/C++, Cython or Python level?)
- Many C/C++ build system combinations
- Vendored dependencies (no package mgmt)
- One wheel per OS and Python version

04.rust_impl.md

+20
@@ -0,0 +1,20 @@
# Rust implementation

[![](poster/figures/arch_rust.png)](poster/figures/arch_rust.svg)

## Pros

- Cargo and crates.io for package management
- FFI interface is reusable in other languages
- Auto-generated C header (cbindgen) and low level bindings (CFFI)
- Works for PyPy too
- One wheel per OS (universal)

## Cons

- Fewer projects using Rust extensions
- FFI overhead when calling C code
- No gradual transition from Python to Rust code
- Fewer bioinformatics libraries available
- No NumPy integration
- Low level abstraction ("what C can represent")
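
To make the "what C can represent" point concrete, here is a minimal sketch of the kind of C-compatible interface a Rust library can export. The `KmerCounter` type and the `kmercounter_*` functions are hypothetical names invented for this illustration and are not part of sourmash. The assumption is that a tool like cbindgen can generate a C header from `extern "C"` functions of this shape, and that CFFI can then load the compiled library from Python.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical Rust-side state, handed to callers as an opaque pointer.
pub struct KmerCounter {
    ksize: u32,
    total: u64,
}

/// Allocate a counter on the heap and return ownership to the caller.
#[no_mangle]
pub extern "C" fn kmercounter_new(ksize: u32) -> *mut KmerCounter {
    Box::into_raw(Box::new(KmerCounter { ksize, total: 0 }))
}

/// Add the number of k-mers in a NUL-terminated sequence and return
/// the running total. Null pointers are tolerated and return 0.
///
/// Safety: `ptr` must come from `kmercounter_new` and `seq` must point
/// to a valid NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn kmercounter_add_sequence(
    ptr: *mut KmerCounter,
    seq: *const c_char,
) -> u64 {
    if ptr.is_null() || seq.is_null() {
        return 0;
    }
    let counter = &mut *ptr;
    let len = CStr::from_ptr(seq).to_bytes().len() as u64;
    let k = u64::from(counter.ksize);
    if k > 0 && len >= k {
        // A sequence of length `len` contains `len - k + 1` k-mers.
        counter.total += len - k + 1;
    }
    counter.total
}

/// Release a counter previously returned by `kmercounter_new`.
///
/// Safety: `ptr` must not be used again after this call.
#[no_mangle]
pub unsafe extern "C" fn kmercounter_free(ptr: *mut KmerCounter) {
    if !ptr.is_null() {
        drop(Box::from_raw(ptr));
    }
}
```

Only integers and raw pointers cross the boundary, which is why the same interface can be reused from other languages, and also why richer Rust types have to be hidden behind opaque pointers.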

05.future.md

+7
@@ -0,0 +1,7 @@
# Future work

This proof of concept focused on replacing the C++ parts with Rust, but while all the sourmash tests are passing there are many improvements to be made. The performance in most benchmarks is very close to the C++ implementation, but since this wasn't the initial goal of the experiment there are many opportunities to make it faster.

Another goal is to be able to use the core functionality of sourmash in browsers. A previous experiment focused on implementing a compatible package in JavaScript, but it led to split codebases and an increased maintenance burden. The Rust implementation makes it possible to target WebAssembly and generate a JavaScript package wrapping it, with the added benefit of avoiding some JavaScript shortcomings (such as limited support for 64-bit integers).
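
The 64-bit integer point matters because MinHash values are 64-bit hashes, while a JavaScript `Number` is an IEEE 754 double that represents every integer exactly only up to 2^53. The small check below illustrates the arithmetic; `is_exact_as_f64` is a hypothetical helper written for this example.

```rust
/// Return true if `value` survives a round trip through a 64-bit float,
/// which is the representation behind a JavaScript `Number`.
fn is_exact_as_f64(value: u64) -> bool {
    // Compare in u128 so that a float rounded up to 2^64 is not
    // saturated back to u64::MAX by the cast.
    (value as f64) as u128 == u128::from(value)
}
```

Small values and 2^53 itself pass this check, while 2^53 + 1 and `u64::MAX` do not, so hashes kept as plain JavaScript numbers could silently change value.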

The Rust library implements basic compatibility with Finch sketches [Bovee and Greenfield, 2018], allowing data to be shared between both MinHash implementations. Many of the other sourmash methods (search, gather) are not available in Rust yet, but this already allows using other MinHash sketches with them.

LICENSE

+29
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) 2018, Luiz Irber
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

+43
@@ -0,0 +1,43 @@
# Oxidizing Python: writing extensions in Rust

[Luiz Carlos Irber Júnior](https://github.com/luizirber)

- Department of Population Health and Reproduction, University of California, Davis, USA

Poster presented at [GCCBOSC 2018][1].

## Abstract

Python has a mature ecosystem for extensions using C/C++,
with Cython being part of the standard toolset for scientific programming.
Even so, C/C++ still have many drawbacks,
ranging from smaller annoyances (like library packaging, versioning and build systems)
to serious ones like buffer overflows and undefined behavior leading to security issues.

Rust is a systems programming language that tries to avoid many of the C/C++ pitfalls,
on top of providing a good development workflow and memory safety guarantees.

This work presents a way to write extensions in Rust and use them in Python,
using sourmash as an example.

## Table of Contents

- [Introduction](01.intro.md)
- [Why Rust?](02.why.md)
- [Current implementation](03.current_impl.md)
- [Rust implementation](04.rust_impl.md)
- [Future work](05.future.md)
- [References](#references)
- Appendices
  - [Submitted abstract](abstract.md)
  - [The final poster](poster/poster.pdf)

## References

- Broder, Andrei Z. 1997. “On the Resemblance and Containment of Documents.” In Compression and Complexity of Sequences 1997. Proceedings, 21–29. IEEE. http://ieeexplore.ieee.org/abstract/document/666900/
- Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. 2016. “Mash: Fast Genome and Metagenome Distance Estimation Using MinHash.” Genome Biology 17: 132. https://dx.doi.org/10.1186/s13059-016-0997-x
- Bovee, Roderick, and Nick Greenfield. 2018. “Finch: A Tool Adding Dynamic Abundance Filtering to Genomic MinHashing.” The Journal of Open Source Software. https://dx.doi.org/10.21105/joss.00505
- Brown, C. Titus, and Luiz Irber. 2016. “sourmash: A Library for MinHash Sketching of DNA.” The Journal of Open Source Software 1 (5). https://dx.doi.org/10.21105/joss.00027


[1]: https://gccbosc2018.sched.com/

abstract.md

+13
@@ -0,0 +1,13 @@
# Oxidizing Python: writing extensions in Rust

Python has a mature ecosystem for extensions using C/C++,
with Cython being part of the standard toolset for scientific programming.
Even so, C/C++ still have many drawbacks,
ranging from smaller annoyances (like library packaging, versioning and build systems)
to serious ones like buffer overflows and undefined behavior leading to security issues.

Rust is a systems programming language that tries to avoid many of the C/C++ pitfalls,
on top of providing a good development workflow and memory safety guarantees.

This work presents a way to write extensions in Rust and use them in Python,
using sourmash as an example.
