Proceedings of the 22nd Python in Science Conference

doi:10.25080/gerudo-f2bc6f59-039

Posters and Slides

Accepted Paper Slides

Fast Exploration of the Milky Way (or any other n-dimensional dataset)

A presentation about how to efficiently explore n-dimensional datasets

Francesc Alted

Better (Open Source) Homes and Gardens with Project Pythia

Tending Your Open Source Garden session, SciPy 2023

Drew Camron, Kevin Tyle

Accessibility best practices for authoring Jupyter notebooks

Jupyter notebooks seem like they are for everyone, but how a notebook gets written can greatly impact how usable it is for people with disabilities. Using staple accessibility frameworks, this talk dives into what it means to make a notebook’s content accessible and provides actionable guidance on how authors can improve their notebooks.

Isabela Presedo-Floyd, Stephannie Jimenez Gacha

Gammapy: a Python package for gamma-ray astronomy

In this contribution we presented the first stable version, v1.0, of Gammapy, an openly developed Python package for gamma-ray astronomy. Gammapy provides methods for the analysis of astronomical gamma-ray data, such as measurement of spectra, images and light curves. By relying on standardized data formats and a joint likelihood framework, it allows astronomers to combine data from multiple instruments and constrain underlying astrophysical emission processes across large parts of the electromagnetic spectrum. Finally, we shared lessons learned during the journey towards v1.0 for an openly developed scientific Python package.

Axel Donath, The Gammapy Developer Team (https://gammapy.org/team.html)

Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem

The array API standard (https://data-apis.org/array-api/) is a common specification for Python array libraries, such as NumPy, PyTorch, CuPy, Dask, and JAX. This standard will make it straightforward for array-consuming libraries, like scikit-learn and SciPy, to write code that uniformly supports all of these libraries. This will allow, for instance, running the same code on the CPU and GPU. This talk covers the scope of the array API standard, supporting tooling which includes a library-independent test suite and compatibility layer, what work has been completed so far, and the plans going forward.

Aaron Meurer
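
As a minimal sketch of what the array API talk above means by library-independent code, the function below asks its input for an array namespace instead of importing a specific library. The helper names `get_namespace` and `rescale` are hypothetical, and the fallback to NumPy for arrays without `__array_namespace__` is an assumption made so the sketch also runs on older NumPy releases; it is not the standard's own compatibility layer.

```python
import numpy as np


def get_namespace(x):
    """Return the array API namespace of x (hypothetical helper)."""
    # Arrays that implement the standard expose __array_namespace__.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    # Assumption: fall back to NumPy for arrays that lack the method.
    return np


def rescale(x):
    """Rescale x to the range [0, 1] using only namespace functions."""
    xp = get_namespace(x)
    lo = xp.min(x)
    hi = xp.max(x)
    return (x - lo) / (hi - lo)


scaled = rescale(np.asarray([2.0, 4.0, 6.0]))
print(scaled)
```

Because `rescale` never names NumPy directly, the same body could in principle accept arrays from any library that implements the standard.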

bayes_mapvar: Bayesian Statistics with Python, No Resampling Necessary

bayes_mapvar is a Python package that provides tools for Maximum A Posteriori (MAP) estimation and posterior variance estimation. TensorFlow Probability and SciPy are leveraged to allow fast and efficient estimation for Bayesian models without using resampling methods.

Charles Lindsey
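
The general idea of MAP estimation with a curvature-based variance estimate, as named in the bayes_mapvar abstract above, can be sketched with SciPy alone. This is not the bayes_mapvar API, which the abstract does not describe. The model (normal data with known noise, normal prior on the mean), the synthetic data, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic data (assumption): 50 draws with known noise scale sigma.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)
sigma = 1.0
prior_mu, prior_sd = 0.0, 10.0  # hypothetical normal prior on the mean


def neg_log_post(mu):
    """Negative log posterior of the mean, up to an additive constant."""
    log_lik = -0.5 * np.sum((data - mu) ** 2) / sigma**2
    log_prior = -0.5 * (mu - prior_mu) ** 2 / prior_sd**2
    return -(log_lik + log_prior)


# MAP estimate: the minimizer of the negative log posterior.
mu_map = minimize_scalar(neg_log_post).x

# Posterior variance estimate: inverse curvature at the MAP point,
# computed here with a central finite difference.
h = 1e-3
curvature = (
    neg_log_post(mu_map + h) - 2.0 * neg_log_post(mu_map) + neg_log_post(mu_map - h)
) / h**2
post_var = 1.0 / curvature

print(mu_map, post_var)
```

No resampling is involved: one optimization and one curvature evaluation yield both the point estimate and an approximate variance. For this conjugate model the result can be checked against the closed-form posterior.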

New CUDA Toolkit packages for Conda

A presentation on the new CUDA package layout for Conda (as included in conda-forge).

Rick Ratzel, Thomson Comer, John Kirkham

Taming Black Swans: Long-tailed distributions in the natural and engineered world

Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these ‘black swans’, they can be disastrous.

But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events.

In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.

Allen B. Downey
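
To illustrate the claim in the abstract above that extreme values are more common under a long-tailed distribution, the sketch below compares tail probabilities of a normal distribution and a Student's t distribution using `scipy.stats`. The choice of Student's t with 3 degrees of freedom and the threshold of 5 are assumptions for illustration and are not taken from the talk.

```python
from scipy.stats import norm, t

threshold = 5.0  # how far into the tail we look (illustrative choice)

# Probability of exceeding the threshold under a short-tailed model.
p_normal = norm.sf(threshold)

# Same probability under a heavy-tailed model (Student's t, 3 d.o.f.;
# an arbitrary choice for this sketch).
p_heavy = t.sf(threshold, df=3)

ratio = p_heavy / p_normal
print(f"normal: {p_normal:.3g}, heavy-tailed: {p_heavy:.3g}, ratio: {ratio:.3g}")
```

The ratio shows how badly a short-tailed model can underestimate the chance of a rare event when the data are in fact heavy-tailed.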

In-Process Analytical Data Management with DuckDB

DuckDB is a novel analytical data management system available under the MIT license. DuckDB supports complex queries in SQL or a relational API, has no external dependencies, and is deeply integrated into the Python ecosystem (reading and writing NumPy, Pandas, and PyArrow objects). DuckDB can also analyze datasets that are too large to fit in main memory.

Alexander Monahan, Hannes Mühleisen, Mark Raasveldt, Pedro Holanda

DataJoint: Bringing databases back into data science

Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data supporting data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows.

Raphael Guzman, Dimitri Yatsenko

Accelerating the Use of Public Geophysical Data for Recharging California’s Groundwater

Recharging ground aquifers is an urgent task for improving groundwater sustainability in California. Geophysical data can provide a capability to image the subsurface where there are major data gaps. However, neither the data nor the analytic tools required to derive subsurface information are readily accessible. We present an interactive web application that utilizes a curated, public database and GIS capabilities, and directly integrates Jupyter Notebooks and Python packages from researchers to guide recharge site location. Our project showcases a unique combination of open-source tools to help turn research knowledge into actionable insights for practitioners to improve groundwater recharge in California.

Seogi Kang, Steve Purves

Interactive Exploration of Large-Scale Datasets with Jupyter-Scatter

Jupyter Scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, Jupyter Scatter can compose multiple scatter plots and synchronize their views and selections. In this presentation, Fritz introduces Jupyter Scatter’s API and demonstrates how the widget can be used for exploring large-scale datasets using real-world examples from biology, machine learning, and geospatial data.

Fritz Lekschas

vak: a neural network framework for researchers studying animal acoustic communication

Research on animal acoustic communication is being revolutionized by deep learning. In this talk we present vak, a framework that allows researchers in this area to easily benchmark deep neural network models and apply them to their own data. We’ll demonstrate how research groups are using vak through examples with TweetyNet, a model that automates annotation of birdsong by segmenting spectrograms. Then we’ll show how adopting Lightning as a backend in version 1.0 has allowed us to incorporate more models and features, building on the foundation we put in place with help from the scientific Python stack.

David Nicholson, Yarden Cohen

Open Force Field: next-generation force fields with open data, open software, and open science

The Open Force Field (OpenFF) initiative was formed to produce open and extensible infrastructure to build a new generation of MD force fields. We have now developed many software packages for constructing, applying, and benchmarking force fields. We have also generated several high-quality quantum chemistry datasets. Everything is available freely on GitHub, Zenodo, and the MolSSI QCArchive server. This work has been successfully used to investigate potential improvements to force fields, as well as simplify many previously difficult aspects of preparing MD systems.

Jeffrey Wagner

Pandera: Going Beyond Pandas DataFrame Validation

This talk is about how Pandera has evolved to provide a standard schema interface for easily extending and supporting validation backends for arbitrary statistical data containers. Attendees will learn not only about data testing principles such as run-time validation and property-based testing, they will also learn about the challenges of maintaining and evolving an open source project that many people rely on as a critical piece of their data infrastructure. The high-level goal for this talk is to highlight lessons learned from Pandera’s particular journey from supporting only Pandas as a backend to supporting a whole suite of data objects.

Niels Bantilan

Tidy geospatial data cubes

Borrowing from the tidy data principles developed for tabular datasets (Wickham, 2014), this presentation imagines ‘tidy’ principles for n-dimensional array data represented by Xarray objects, with a specific focus on geospatial datasets.

Emma Marshall, Deepak Cherian, Scott Henderson

Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage

A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (cf. https://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python.

Sanket Verma, Josh Moore, John Kirkham

Accepted Posters

Unleashing the Power of Modern Portfolio Theory: Maximizing Returns while Managing Risk

Modern portfolio theory is a mathematical approach that helps in creating an investment portfolio by considering both the potential risks and returns. Our sample portfolio comprises eight diverse assets, representing exposure to various sectors in the Global Industry Classification Standard (GICS).

Kalyan Prasad
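
The two quantities that modern portfolio theory trades off, expected return and risk, can be sketched in a few lines of NumPy. The three assets, their expected returns, the covariance matrix, and the weights below are made-up numbers for illustration; they are not the eight-asset portfolio from the poster above.

```python
import numpy as np

# Hypothetical expected annual returns for three assets.
mu = np.array([0.08, 0.05, 0.12])

# Hypothetical covariance matrix of annual returns.
cov = np.array([
    [0.040, 0.006, 0.010],
    [0.006, 0.010, 0.004],
    [0.010, 0.004, 0.090],
])

# Portfolio weights (fractions of capital), summing to 1.
w = np.array([0.4, 0.4, 0.2])

# Expected portfolio return: weighted average of asset returns.
port_return = w @ mu

# Portfolio risk: standard deviation implied by the covariance matrix.
port_vol = np.sqrt(w @ cov @ w)

print(port_return, port_vol)
```

Because the assets are not perfectly correlated, the portfolio volatility comes out lower than the weighted average of the individual volatilities, which is the diversification effect the theory exploits.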

Data engineering and analytics for photolithography manufacturing process at DuPont - A practical approach from lab to fab

With smaller chips, requirements on chemical suppliers to control material parameters used in semiconductor manufacturing are stricter. DuPont Electronics & Industrial is working to improve the quality of photolithography products. Our goal is to pre-emptively identify failures and minimize defects using data science and statistics. This requires good data engineering practices in a challenging enterprise IT environment with multiple systems. We present a practical approach to overcome these challenges, and to collect and organize data using open-source Python tools to help domain experts and practitioners. We describe approaches to scale this effort, which should be relevant to chemistry and manufacturing practitioners.

Avishek Panigrahi, Stefan J Caporale, Abhishek Shrivastava, Sumanth Sekar

EEG-to-fMRI Neuroimaging Cross Modal Synthesis in Python

A Python package that supports EEG-to-fMRI synthesis.

David Calhas

Hamilton: Scalable, Portable, and Self-Documenting Dataflows in Python

Poster presented at the 2023 SciPy Conference. It describes the Hamilton project, which is a Python library for creating dataflows that are scalable, portable, and self-documenting. It uses the Naturf project as a case study to show before and after Hamilton.

Stefan Krawczyk, Elijah ben Izzy, Levi Sweet-Breu, Emily Rexer, Chris Vernon, Melissa Allen-Dumas

itk-elastix: Medical image registration in Python

This SciPy 2023 poster provides an overview of the open-source medical image registration package itk-elastix.

Konstantinos Ntatsis, Niels Dekker, Viktor van der Valk, Tom Birdsong, Dženan Zukić, Stefan Klein, Marius Staring, Matthew McCormick

Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit

The Likeness toolkit utilizes state-of-the-art spatial microsimulation and activity allocation methods to generate synthetic populations within the US, at the metropolitan scale. These ‘parallel universes’ of agents are attributed with hundreds of demographic variables, probable nighttime and daytime locations, and plausible routing by commute mode between locations. This functionality is demonstrated for Tallahassee, FL and vicinity, a mid-sized metropolitan area.

Joseph V. Tuccillo, James D. Gaboardi

Matchmaker: A Toolkit for Combining Satellite Observations from Multiple Sensors

Matchmaker constructs multi-sensor datasets that enable scientists to directly compare observations for validation or monitoring purposes, or to fuse measurements from complementary instruments to improve geophysical understanding. Matchmaker leverages SciPy ecosystem tools to perform each of its primary tasks: orbital simulation, geometric collocation of individual observations, and alignment / aggregation of sensor data arrays.

Greg Quinn

Patterns and Anti-Patterns when Measuring Diversity in Open Source

If we fundamentally believe that ‘Open source is for everyone’, how do we know we are actually bringing everyone in, meeting them where they are, and fostering a diverse and inclusive open source ecosystem? Our open source team has evolved our practices for measuring open source communities, and the impact we have on them. This poster presents patterns and anti-patterns we have learned about measuring diversity in global open source communities.

amanda casari

PyQtGraph - High Performance Visualization for All Platforms

Discuss and showcase the PyQtGraph plotting library. Special attention will be given to highlight PyQtGraph’s primary objectives, its performance, cross-platform support and interactivity and how the library achieves those objectives.

Ognyan Moore, Nathan Jessurun, Nils Nemitz, Martin Chase, Luke Campagnola

PyVista

Let’s plot 3D Pythonic visualization

Tetsuo Koyama

OpenCRUMS: Open Classification of Regimes in the Southeast USA

The U.S. Department of Energy AI for Earth System Predictability program is interested in exploring how machine learning can be used to characterize aerosol conditions over the Atmospheric Radiation Measurement (ARM) facility’s measurement sites. In this poster we provide links to cookbooks that show how to use TensorFlow and Keras to produce explainable classifications of aerosol conditions over the Houston region where a recent ARM field campaign, TRacking Aerosol Convection intERactions (TRACER), was conducted. We show that a CNN-based classifier of the EPA PM2.5 Air Quality Index is able to capture the diurnal cycle of aerosols over Houston as well as the influence of dust from the Sahara Desert.

Robert Jackson, Maria Zawadowicz, Die Wang, Chongai Kuang, Minnie Park, Michael Jensen, Scott Collis

Analyse the uncertainty of your system: Sensitivity Analysis in Python with scipy.stats.sobol_indices

Use Sobol’ indices to measure the uncertainty in your system. Starting with SciPy 1.11, you can use scipy.stats.sobol_indices, which provides a simple yet powerful API.

Pamphile T. Roy

aPhyloGeo-Covid: A Web Interface for Reproducible Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake

The gene sequencing data, along with the associated lineage tracing and research data generated throughout the Coronavirus disease 2019 (COVID-19) pandemic, constitute invaluable resources that profoundly empower phylogeography research. To optimize the utilization of these resources, we have developed an interactive analysis platform called aPhyloGeo-Covid, leveraging the capabilities of Neo4j, Snakemake, and Python. This platform enables researchers to explore and visualize diverse data sources specifically relevant to SARS-CoV-2 for phylogeographic analysis. The integrated Neo4j database acts as a comprehensive repository, consolidating COVID-19 pandemic-related sequence information, climate data, and demographic data obtained from public databases, facilitating efficient filtering and organization of input data for phylogeographical studies. Presently, the database encompasses over 113,774 nodes and 194,381 relationships. Additionally, aPhyloGeo-Covid provides a scalable and reproducible phylogeographic workflow for investigating the intricate relationship between geographic features and the patterns of variation in diverse SARS-CoV-2 variants. The code repository of the platform is publicly accessible on GitHub (https://github.com/tahiri-lab/iPhyloGeo/tree/iPhylooGeo-neo4j), providing researchers with a valuable tool to analyze and explore the intricate dynamics of SARS-CoV-2 within a phylogeographic context.

Wanlin Li, Nadia Tahiri

Moving the Earth with thermodynamics and python

This poster describes some of the challenges of coupling thermodynamic models to geodynamic simulations and introduces a new Python-based tool, ThermoCodegen (TCG), to deal with them. TCG uses SymPy to symbolically represent thermodynamic models and automatically generate interfaces to a set of consistent thermodynamic parameters for use in geodynamic models. It can be used quite generally but has been particularly designed for reactive disequilibrium problems in Earth science, and we present some examples here.

Cian Wilson, Marc Spiegelman, Owen Evans, Mark Ghiorso, Lucy Tweed

TUG-RSE: Pulling Students into Research Software Engineering

Research Software Engineering (RSE) is a rapidly growing profession within the Scientific Python community, but it can present significant challenges for newcomers, particularly students who may lack the necessary skills or knowledge of the field. The aim of this poster presentation is to discuss the current challenges faced by newcomers to research software engineering, the potential solutions to make the community more inclusive, and an introduction to ‘The Undergraduate’s Guide To Research Software Engineering’ (TUG-RSE).

Aman Goel

Yori: a new, highly customizable tool for Level-3 data production

Yori is a highly customizable, sensor-agnostic software, developed to support the NASA Atmosphere Science Teams, to spatially/temporally resample geophysical variables from satellite measurements. Out of the box this software outputs common statistics, such as mean, standard deviation and pixel count, of each geophysical variable read, for every grid cell. Additionally, Yori allows users to easily produce additional statistics (e.g. histograms, min/max, median) and filter data to create custom outputs while maintaining a CF-compliant format. Yori has been designed to be easily scalable on a distributed computing environment.

Paolo Veglio, Robert Holz, Liam Gumley, Steve Dutcher, Greg Quinn, Bruce Flynn

SciPy Tools Plenaries

SciPy Tools Plenary on Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. This presentation summarizes changes over the past year, new features, and future plans.

Elliott Sales de Andrade

SciPy Tools Plenary on SciPy

2023 updates in SciPy

Pamphile T. Roy

Zarr Updates for SciPy 2023

SciPy tools plenary session updates for Zarr

Josh Moore

Lightning Talks

NumFOCUS Academic Consortium and Open Source Pledge

Announcement of the NumFOCUS Academic Consortium and the Academic Data Science Alliance (ADSA) and NumFOCUS Open Source Pledge.

Arliss Collins

Hamilton: drop procedural scripts in favor of declarative functions

Lightning talk on using Hamilton instead of procedural scripts

Stefan Krawczyk