Proceedings of the 19th Python in Science Conference

doi:10.25080/Majora-342d178e-02b

Posters and Slides

Accepted Paper Slides

Treating gridded geospatial data as point data to simplify analytics

Gridded geospatial remote sensing (satellite) data has traditionally been stored in file-based multidimensional arrays to preserve the locality of data. Measurements from locations that are physically next to each other on Earth remain next to each other in the arrays. Maintaining this locality is useful when running calculations like reprojection, but unnecessary for many other calculations. This talk will go through a real-world example of a tool redesign at the Goddard Earth Sciences Data and Information Services Center (GES DISC), showing the advantages of using the data frame model for calculating summary statistics, where measurement proximity is unimportant.

Christine Smit, Hailiang Zhang, Mahabaleshwara Hegde, Faith Giguere, Long Pham
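
The idea in the abstract above can be illustrated with a minimal sketch. This is not the GES DISC tool: the grid values, the coordinate lists, and the helper names `grid_to_points` and `region_mean` are all hypothetical, and plain Python lists stand in for a data frame. It only shows that once each grid cell is a (lat, lon, value) record, a summary statistic needs no knowledge of which cells are neighbors.

```python
from statistics import mean

# Hypothetical 3x4 grid of measurements: rows are latitudes, columns are longitudes.
lats = [10.0, 20.0, 30.0]
lons = [100.0, 110.0, 120.0, 130.0]
grid = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, None, 7.0, 8.0],  # None marks a missing cell (an assumption of this sketch)
    [9.0, 10.0, 11.0, 12.0],
]


def grid_to_points(lats, lons, grid):
    """Flatten a 2-D grid into (lat, lon, value) records, dropping missing cells."""
    return [
        (lat, lon, value)
        for lat, row in zip(lats, grid)
        for lon, value in zip(lons, row)
        if value is not None
    ]


def region_mean(points, lat_range, lon_range):
    """Mean of all point values inside a bounding box (inclusive bounds)."""
    values = [
        value
        for lat, lon, value in points
        if lat_range[0] <= lat <= lat_range[1]
        and lon_range[0] <= lon <= lon_range[1]
    ]
    return mean(values)


points = grid_to_points(lats, lons, grid)
box_mean = region_mean(points, (10.0, 20.0), (100.0, 120.0))
```

Because the records are independent, the same filter-then-aggregate step could be split across workers or expressed as a data frame query without preserving the array layout.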

Arkouda: Terascale Data Science at Interactive Rates

This talk describes Arkouda, a Python package that we have developed for doing exploratory analysis on massive data sets at interactive rates. Arkouda’s API is based on NumPy/Pandas, yet its arrays can be transparently distributed across the compute nodes of a cluster or supercomputer to support large-scale analytics. In our work, we have run Arkouda operations from Jupyter notebooks on TB-sized data sets in seconds to small numbers of minutes—achieving scalability and performance that we have not observed with competing technologies.

Benjamin Albrecht, Michael Merrill, William Reus, Brad Chamberlain

Boost-histogram: High-Performance Histograms as Objects

Boost-histogram is a new Python library that provides histograms that can be filled, manipulated, sliced, and projected as objects.

Henry Schreiner, Hans Dembinski, Jim Pivarski, Shuo Liu

Open-source bioimage analysis software to accelerate drug discovery

Anne Carpenter

cuSignal - GPU Accelerating SciPy Signal with Numba and CuPy

cuSignal is a GPU accelerated signal processing library built around a SciPy Signal-like API, CuPy, and custom Numba and CuPy CUDA kernels. cuSignal is written exclusively in Python and demonstrates GPU speeds without a C++ software layer.

Adam Thompson, Matt Nicely, Graham Markall, Brad Rees

Frictionless Data for Reproducible Biology

This talk discusses how biologists can make their data more reproducible using Frictionless Data’s open source Python libraries.

Lilly Winfree

Interactive Supercomputing with Jupyter at the National Energy Research Scientific Computing Center

At the National Energy Research Scientific Computing (NERSC) Center, interactive access to high-performance computing and data through Jupyter is a priority. We will discuss the nuts and bolts of how Jupyter is deployed at NERSC, and how we’ve adapted to engage the Jupyter ecosystem and open-source community to deliver this key capability to our users. Jupyter is a major component in our Superfacility initiative, which aims to connect experimental and observational big data facilities (telescopes, microscopes, genome sequencers, light sources, etc.) with next-generation supercomputing and data capabilities at NERSC.

Rollin Thomas, Shane Canon, Shreyas Cholia, Matt Henderson, Kelly Rowland, Jon Hays, William Krinsman, Justin Ley, Labanya Mukhopadhyay, Trevor Slaton

Project Mjolnir: A Modular, Open-source Platform for Developing Scientific IoT Sensor Networks

From a humble beginning as a side effort using a Raspberry Pi to talk to lightning instruments, Project Mjolnir is evolving into a modular, open source client-server platform for developing scientific IoT sensor networks. Its goal is to enable scientists of many disciplines to employ low-cost hardware to robustly ingest, log and uplink periodic and on-demand science and engineering data and commands, controlled either autonomously or centrally, all with little or no bespoke code. The talk will discuss Mjolnir’s development and future, present examples of current projects built on it, and explore how to leverage it for new applications.

C.A.M. Gerlach

Pandera: Statistical Data Validation of Pandas Dataframes

This talk introduces pandera, an open source Python package for pandas data validation. It covers data validation in theory and practice, and goes through a case study analysis of the Fatal Encounters dataset to demonstrate how pandera can be used to make data analysis and machine learning more reproducible, robust, and reliable.

Niels Bantilan

Molecular infrastructure for modeling viruses with pythonic-mediated packages: pyF4all

We model full viruses by coupling short, highly detailed molecular dynamics simulations with lower-resolution (but faster) continuum electrostatic models. Such a multiscale approach makes it possible to model a full virus on desktop or small-cluster infrastructure, which is available to most researchers. Here, we propose a first interface between Pythonic packages in a multiscale approach that automates access to state-of-the-art biomolecular simulations via Jupyter Notebooks.

Horacio V. Guzman

pyhf: a pure Python statistical fitting library with tensors and autograd

pyhf is a pure-Python implementation of the HistFactory statistical model for multi-bin histogram-based analysis with asymptotic interval estimation, and part of the Scikit-HEP project ecosystem. pyhf supports modern computational graph libraries as computational backends in order to make use of features such as auto-differentiation and GPU acceleration. Additionally, the statistical models are defined in a declarative JSON schema, readily enabling preservation and distribution through services such as the Durham High-Energy Physics Database (HEPData).

Matthew Feickert

Bringing GPU Support to Datashader: A RAPIDS Case Study

A case study on using RAPIDS technologies to add GPU support to the Datashader Python library.

Jon Mease

Learning from evolving data streams

A brief introduction to machine learning for evolving data streams. In this field, data is assumed to be infinite and can change over time. scikit-multiflow, a package for stream learning in Python, is also presented.

Jacob Montiel

Spatial Algorithms at Scale with spatialpandas

How do you analyze 1 trillion rows of geospatial point data? We recently solved this problem using spatialpandas, Dask, and the Parquet file format to efficiently build and execute spatial algorithms at scale. We compare the performance of the spatialpandas solution with other cases, and discuss the tradeoffs among the various approaches.

Dharhas Pothina, Kim Pevey, Adam Lewis

Accepted Posters

Decentralized, Deterministic Robot Swarm Control using Blob Methods for PDEs

A Jupyter notebook about robot swarm control, simulation, digital experiments, and computational considerations.

SciPy Tools Plenaries

HoloViz: What’s new and what’s next

Updates and roadmaps for Panel, hvPlot, HoloViews, GeoViews, Datashader, Param, and Colorcet. The HoloViz suite of tools together form a unified approach for visualization from exploration to sharing applications and dashboards, building on the SciPy ecosystem to support easy visualization of large multidimensional or columnar datasets.

SciPy Tools Plenary on Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. This presentation summarizes changes over the past year, new features, and future plans.

SciPy Tools Plenary on Numba

Numba is a just-in-time compiler for a subset of Python. This is a short presentation of Numba updates for 2019-2020.

Lightning Talks

Building an AutoML System for Fun and Non-profit

This talk introduces metalearn, a MetaRL-based AutoML system that learns to learn how to propose hyperparameter selections that produce high validation scores on meta-test datasets.