
Posters and Slides
Accepted Paper Slides
Fast Exploration of the Milky Way (or any other n-dimensional dataset)
A presentation about how to efficiently explore n-dimensional datasets
Francesc Alted
Better (Open Source) Homes and Gardens with Project Pythia
Tending Your Open Source Garden session, SciPy 2023
Drew Camron, Kevin Tyle
Accessibility best practices for authoring Jupyter notebooks
Jupyter notebooks seem like they are for everyone, but how a notebook gets written can greatly impact how usable it is for people with disabilities. Using staple accessibility frameworks, this talk dives into what it means to make a notebook’s content accessible and provides actionable guidance on how authors can improve their notebooks.
Isabela Presedo-Floyd, Stephannie Jimenez Gacha
Gammapy: a Python package for gamma-ray astronomy
In this contribution we presented the first stable version, v1.0, of Gammapy, an openly developed Python package for gamma-ray astronomy. Gammapy provides methods for the analysis of astronomical gamma-ray data, such as measurement of spectra, images and light curves. By relying on standardized data formats and a joint likelihood framework, it allows astronomers to combine data from multiple instruments and constrain underlying astrophysical emission processes across large parts of the electromagnetic spectrum. Finally, we shared lessons learned during the journey towards v1.0 for an openly developed scientific Python package.
Axel Donath, The Gammapy Developer Team (https://gammapy.org/team.html)
Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem
The array API standard (https://
Aaron Meurer
bayes_mapvar: Bayesian Statistics with Python, No Resampling Necessary
bayes_mapvar is a Python package that provides tools for Maximum A Posteriori (MAP) estimation and posterior variance estimation. It leverages TensorFlow Probability and SciPy to allow fast and efficient estimation for Bayesian models without using resampling methods.
Charles Lindsey
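As a rough illustration of the idea in the bayes_mapvar abstract (a point estimate plus a variance, with no resampling), here is a minimal sketch for a Beta-Binomial model. It uses only the Python standard library and does not call bayes_mapvar, TensorFlow Probability, or SciPy. The curvature-based variance is one common resampling-free approach, and it is an assumption that it resembles what the package does. The function name and the example counts are hypothetical.

```python
def map_and_variance(successes, trials, prior_a=2.0, prior_b=2.0):
    """MAP estimate and curvature-based variance for a binomial rate.

    With a Beta(prior_a, prior_b) prior, the posterior is
    Beta(prior_a + successes, prior_b + trials - successes).
    """
    a = prior_a + successes
    b = prior_b + trials - successes
    if a <= 1 or b <= 1:
        raise ValueError("posterior mode lies on the boundary; "
                         "the curvature approximation does not apply")
    # Mode of a Beta(a, b) density.
    theta_map = (a - 1) / (a + b - 2)
    # Negative second derivative of the log posterior, evaluated at the mode.
    curvature = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map) ** 2
    # The inverse curvature approximates the posterior variance.
    return theta_map, 1.0 / curvature


# Hypothetical data: 7 successes in 20 trials.
theta_hat, var_hat = map_and_variance(successes=7, trials=20)
print(f"MAP estimate: {theta_hat:.4f}, approximate variance: {var_hat:.5f}")
```

Both quantities come from a single optimisation-style calculation, which is why no sampling loop appears anywhere in the sketch.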
New CUDA Toolkit packages for Conda
A presentation on the new CUDA package layout for Conda (as included in conda-forge).
Rick Ratzel, Thomson Comer, John Kirkham
Taming Black Swans: Long-tailed distributions in the natural and engineered world
Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these ‘black swans’, they can be disastrous.
But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events.
In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.
Allen B. Downey
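The long-tail abstract above mentions estimating distribution parameters and predicting rare events. The sketch below shows one such calculation under stated assumptions: synthetic Pareto data, a known scale of 1.0, and the standard maximum-likelihood estimator for the tail exponent. It is not taken from the talk; all function names are hypothetical and only the standard library is used.

```python
import math
import random
import statistics


def pareto_alpha_mle(samples, x_min):
    """Maximum-likelihood estimate of the Pareto tail exponent."""
    tail = [x for x in samples if x >= x_min]
    return len(tail) / sum(math.log(x / x_min) for x in tail)


def pareto_tail_prob(x, x_min, alpha):
    """P(X > x) for a Pareto distribution with scale x_min."""
    return (x_min / x) ** alpha


def normal_tail_prob(x, mean, std):
    """P(X > x) for a Gaussian with the given mean and standard deviation."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))


# Synthetic long-tailed data with a true exponent of 2.5 (an assumption).
rng = random.Random(0)
samples = [rng.paretovariate(2.5) for _ in range(20_000)]

alpha_hat = pareto_alpha_mle(samples, x_min=1.0)

# Compare the probability of an extreme value under both models.
threshold = 50.0
p_pareto = pareto_tail_prob(threshold, 1.0, alpha_hat)
p_normal = normal_tail_prob(
    threshold, statistics.mean(samples), statistics.pstdev(samples)
)
print(f"alpha estimate: {alpha_hat:.3f}")
print(f"P(X > {threshold}): Pareto {p_pareto:.2e}, Gaussian {p_normal:.2e}")
```

A Gaussian fitted to the same data assigns a far smaller probability to the extreme value, which is the sense in which a short-tailed model leaves one unprepared for rare events.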
In-Process Analytical Data Management with DuckDB
DuckDB is a novel analytical data management system available under the MIT license. DuckDB supports complex queries in SQL or a relational API, has no external dependencies, and is deeply integrated into the Python ecosystem (reading and writing NumPy, Pandas, and PyArrow objects). DuckDB can also analyze datasets that are too large to fit in main memory.
Alexander Monahan, Hannes Mühleisen, Mark Raasveldt, Pedro Holanda
DataJoint: Bringing databases back into data science
Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data supporting data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows.
Raphael Guzman, Dimitri Yatsenko
Accelerating the Use of Public Geophysical Data for Recharging California’s Groundwater
Recharging ground aquifers is an urgent task for improving groundwater sustainability in California. Geophysical data can image the subsurface where there are major data gaps. However, neither the data nor the analytic tools required to derive subsurface information are readily accessible. We present an interactive web application that uses a curated public database and GIS capabilities, and directly integrates Jupyter Notebooks and Python packages from researchers, to guide recharge site location. Our project showcases a unique combination of open-source tools to help turn research knowledge into actionable insights for practitioners to improve groundwater recharge in California.
Seogi Kang, Steve Purves
Interactive Exploration of Large-Scale Datasets with Jupyter-Scatter
Jupyter Scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, Jupyter Scatter can compose multiple scatter plots and synchronize their views and selections. In this presentation, Fritz introduces Jupyter Scatter’s API and demonstrates how the widget can be used for exploring large-scale datasets using real-world examples from biology, machine learning, and geospatial data.
Fritz Lekschas
vak: a neural network framework for researchers studying animal acoustic communication
Research on animal acoustic communication is being revolutionized by deep learning. In this talk we present vak, a framework that allows researchers in this area to easily benchmark deep neural network models and apply them to their own data. We’ll demonstrate how research groups are using vak through examples with TweetyNet, a model that automates annotation of birdsong by segmenting spectrograms. Then we’ll show how adopting Lightning as a backend in version 1.0 has allowed us to incorporate more models and features, building on the foundation we put in place with help from the scientific Python stack.
David Nicholson, Yarden Cohen
Open Force Field: next-generation force fields with open data, open software, and open science
The Open Force Field (OpenFF) initiative was formed to produce open and extensible infrastructure to build a new generation of MD force fields. We have now developed many software packages for constructing, applying, and benchmarking force fields. We have also generated several high-quality quantum chemistry datasets. Everything is available freely on GitHub, Zenodo, and the MolSSI QCArchive server. This work has been successfully used to investigate potential improvements to force fields, as well as simplify many previously difficult aspects of preparing MD systems.
Jeffrey Wagner
Pandera: Going Beyond Pandas DataFrame Validation
This talk is about how Pandera has evolved to provide a standard schema interface for easily extending and supporting validation backends for arbitrary statistical data containers. Attendees will learn not only about data testing principles such as run-time validation and property-based testing, but also about the challenges of maintaining and evolving an open source project that many people rely on as a critical piece of their data infrastructure. The high-level goal for this talk is to highlight lessons learned from Pandera’s particular journey from supporting only Pandas as a backend to supporting a whole suite of data objects.
Niels Bantilan
Tidy geospatial data cubes
Borrowing from the tidy data principles developed for tabular datasets (Wickham, 2014), this presentation imagines ‘tidy’ principles for n-dimensional array data represented by Xarray objects, with a specific focus on geospatial datasets.
Emma Marshall, Deepak Cherian, Scott Henderson
Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage
A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together ( Cf. https://
Sanket Verma, Josh Moore, John Kirkham
Accepted Posters
Unleashing the Power of Modern Portfolio Theory: Maximizing Returns while Managing Risk
Modern portfolio theory is a mathematical approach that helps in creating an investment portfolio by considering both the potential risks and returns. Our sample portfolio comprises eight diverse assets, representing exposure to various sectors in the Global Industry Classification Standard (GICS).
Kalyan Prasad
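The risk and return trade-off described in the portfolio theory abstract can be sketched with the standard textbook formulas: expected portfolio return is the weighted sum of asset returns, and portfolio variance is the quadratic form of the weights with the covariance matrix. The example below uses two assets rather than the poster's eight, and the returns and covariances are made-up numbers chosen only for illustration. All function names are hypothetical.

```python
def portfolio_return(weights, expected_returns):
    """Expected return: weighted sum of the asset returns."""
    return sum(w * r for w, r in zip(weights, expected_returns))


def portfolio_variance(weights, cov):
    """Portfolio variance: sum over i, j of w_i * cov_ij * w_j."""
    n = len(weights)
    return sum(
        weights[i] * cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )


def min_variance_weights_two_assets(cov):
    """Closed-form minimum-variance weights for a fully invested
    portfolio of exactly two assets."""
    var1, var2, cov12 = cov[0][0], cov[1][1], cov[0][1]
    w1 = (var2 - cov12) / (var1 + var2 - 2 * cov12)
    return [w1, 1 - w1]


# Hypothetical inputs, not taken from the poster.
expected_returns = [0.08, 0.12]
cov = [[0.04, 0.01],
       [0.01, 0.09]]

weights = min_variance_weights_two_assets(cov)
min_var = portfolio_variance(weights, cov)
equal_var = portfolio_variance([0.5, 0.5], cov)

print("weights:", [round(w, 4) for w in weights])
print("expected return:", round(portfolio_return(weights, expected_returns), 4))
print("variance (min-variance vs equal weights):",
      round(min_var, 5), round(equal_var, 5))
```

With more than two assets the same variance formula applies, but the weights are usually found with a numerical optimiser rather than a closed form.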
Data engineering and analytics for photolithography manufacturing process at DuPont - A practical approach from lab to fab
As chips shrink, chemical suppliers face stricter requirements for controlling the material parameters used in semiconductor manufacturing. DuPont Electronics & Industrial is working to improve the quality of photolithography products. Our goal is to pre-emptively identify failures and minimize defects using data science and statistics. This requires good data engineering practices in a challenging enterprise IT environment with multiple systems. We present a practical approach to overcoming these challenges and to collecting and organizing data using open-source Python tools to help domain experts and practitioners. We describe approaches to scale this effort, which should be relevant to chemistry and manufacturing practitioners.
Avishek Panigrahi, Stefan J Caporale, Abhishek Shrivastava, Sumanth Sekar
EEG-to-fMRI Neuroimaging Cross Modal Synthesis in Python
A Python package that supports EEG-to-fMRI synthesis.
David Calhas
Hamilton: Scalable, Portable, and Self-Documenting Dataflows in Python
Poster presented at the 2023 SciPy Conference. It describes the Hamilton project, which is a Python library for creating dataflows that are scalable, portable, and self-documenting. It uses the Naturf project as a case study to show before and after Hamilton.
Stefan Krawczyk, Elijah ben Izzy, Levi Sweet-Breu, Emily Rexer, Chris Vernon, Melissa Allen-Dumas
itk-elastix: Medical image registration in Python
This SciPy 2023 poster provides an overview of the open-source medical image registration package itk-elastix.
Konstantinos Ntatsis, Niels Dekker, Viktor van der Valk, Tom Birdsong, Dženan Zukić, Stefan Klein, Marius Staring, Matthew McCormick
Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit
The Likeness toolkit utilizes state-of-the-art spatial microsimulation and activity allocation methods to generate synthetic populations within the US, at the metropolitan scale. These ‘parallel universes’ of agents are attributed with hundreds of demographic variables, probable nighttime and daytime locations, and plausible routing by commute mode between locations. This functionality is demonstrated for Tallahassee, FL and vicinity, a mid-sized metropolitan area.
Joseph V. Tuccillo, James D. Gaboardi
Matchmaker: A Toolkit for Combining Satellite Observations from Multiple Sensors
Matchmaker constructs multi-sensor datasets that enable scientists to directly compare observations for validation or monitoring purposes, or to fuse measurements from complementary instruments to improve geophysical understanding. Matchmaker leverages SciPy ecosystem tools to perform each of its primary tasks: orbital simulation, geometric collocation of individual observations, and alignment / aggregation of sensor data arrays.
Greg Quinn
Patterns and Anti-Patterns when Measuring Diversity in Open Source
If we fundamentally believe that ‘Open source is for everyone’, how do we know we are actually bringing everyone in, meeting them where they are, and fostering a diverse and inclusive open source ecosystem? Our open source team has evolved our practices for measuring open source communities, and the impact we have on them. This poster presents patterns and anti-patterns we have learned about measuring diversity in global open source communities.
amanda casari
PyQtGraph - High Performance Visualization for All Platforms
This poster discusses and showcases the PyQtGraph plotting library. Special attention is given to PyQtGraph’s primary objectives, its performance, cross-platform support, and interactivity, and to how the library achieves those objectives.
Ognyan Moore, Nathan Jessurun, Nils Nemitz, Martin Chase, Luke Campagnola
PyVista
Let’s plot 3D visualizations the Pythonic way
Tetsuo Koyama
OpenCRUMS: Open Classification of Regimes in the Southeast USA
The U.S. Department of Energy AI for Earth System Predictability program is interested in exploring how machine learning can be used to characterize aerosol conditions over the Atmospheric Radiation Measurement (ARM) facility’s measurement sites. In this poster we provide links to cookbooks that show how to use TensorFlow and Keras to produce explainable classifications of aerosol conditions over the Houston region, where a recent ARM field campaign, TRacking Aerosol Convection intERactions (TRACER), was conducted. We show that a CNN-based classifier of the EPA PM2.5 Air Quality Index is able to capture the diurnal cycle of aerosols over Houston as well as the influence of dust from the Sahara Desert.
Robert Jackson, Maria Zawadowicz, Die Wang, Chongai Kuang, Minnie Park, Michael Jensen, Scott Collis
Analyse the uncertainty of your system: Sensitivity Analysis in Python with scipy.stats.sobol_indices
Use the indices of Sobol’ to measure the uncertainty in your system. Starting with SciPy 1.11, you can now use scipy.stats.sobol_indices which provides a simple yet powerful API.
Pamphile T. Roy
aPhyloGeo-Covid: A Web Interface for Reproducible Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake
Gene sequencing data, along with the associated lineage tracing and research data generated throughout the Coronavirus disease 2019 (COVID-19) pandemic, are invaluable resources for phylogeographic research. To make better use of these resources, we have developed an interactive analysis platform called aPhyloGeo-Covid, built on Neo4j, Snakemake, and Python. This platform enables researchers to explore and visualize diverse data sources relevant to SARS-CoV-2 for phylogeographic analysis. The integrated Neo4j database acts as a comprehensive repository, consolidating COVID-19 pandemic-related sequence information, climate data, and demographic data obtained from public databases, which facilitates efficient filtering and organization of input data for phylogeographic studies. Presently, the database holds more than 113,774 nodes and 194,381 relationships. Additionally, aPhyloGeo-Covid provides a scalable and reproducible phylogeographic workflow for investigating the relationship between geographic features and the patterns of variation in diverse SARS-CoV-2 variants. The platform's code repository is publicly accessible on GitHub (https://
Wanlin Li, Nadia Tahiri
Moving the Earth with thermodynamics and Python
This poster describes some of the challenges of coupling thermodynamic models to geodynamic simulations and introduces a new Python-based tool, ThermoCodegen (TCG), to deal with them. TCG uses SymPy to symbolically represent thermodynamic models and automatically generate interfaces to a set of consistent thermodynamic parameters for use in geodynamic models. It can be used quite generally but has been designed particularly for reactive disequilibrium problems in Earth science, and we present some examples here.
Cian Wilson, Marc Spiegelman, Owen Evans, Mark Ghiorso, Lucy Tweed
TUG-RSE: Pulling Students into Research Software Engineering
Research Software Engineering (RSE) is a rapidly growing profession within the Scientific Python community, but it can present significant challenges for newcomers, particularly students who may lack the necessary skills or knowledge of the field. The aim of this poster presentation is to discuss the current challenges faced by newcomers to research software engineering, the potential solutions to make the community more inclusive, and an introduction to ‘The Undergraduate’s Guide To Research Software Engineering’ (TUG-RSE).
Aman Goel
Yori: a new, highly customizable tool for Level-3 data production
Yori is highly customizable, sensor-agnostic software developed to support the NASA Atmosphere Science Teams. It spatially and temporally resamples geophysical variables from satellite measurements. Out of the box, it outputs common statistics, such as mean, standard deviation and pixel count, for each geophysical variable read and for every grid cell. Additionally, Yori allows users to easily produce additional statistics (e.g. histograms, min/max, median) and filter data to create custom outputs while maintaining a CF-compliant format. Yori has been designed to scale easily in a distributed computing environment.
Paolo Veglio, Robert Holz, Liam Gumley, Steve Dutcher, Greg Quinn, Bruce Flynn
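The per-cell statistics named in the Yori abstract (mean, standard deviation, and pixel count for every grid cell) can be sketched as a simple binning step. The code below is a standard-library illustration of that idea only; it is not Yori's API or configuration format, and the function name, the one-degree default cell size, and the sample observations are all hypothetical.

```python
import math
from collections import defaultdict


def grid_statistics(observations, cell_deg=1.0):
    """Aggregate (lat, lon, value) tuples onto a regular lat/lon grid.

    Returns a dict keyed by (row, col) grid indices, where each entry holds
    the pixel count, mean, and population standard deviation for that cell.
    """
    # Running accumulators per cell: count, sum, sum of squares.
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for lat, lon, value in observations:
        cell = (
            math.floor((lat + 90.0) / cell_deg),
            math.floor((lon + 180.0) / cell_deg),
        )
        acc = sums[cell]
        acc[0] += 1
        acc[1] += value
        acc[2] += value * value

    stats = {}
    for cell, (count, total, total_sq) in sums.items():
        mean = total / count
        # Clamp tiny negative values caused by floating-point rounding.
        variance = max(total_sq / count - mean * mean, 0.0)
        stats[cell] = {"count": count, "mean": mean, "std": math.sqrt(variance)}
    return stats


# Hypothetical observations: two fall in one cell, one in a neighbouring cell.
observations = [(10.2, 20.1, 1.0), (10.8, 20.9, 3.0), (11.5, 20.5, 5.0)]
stats = grid_statistics(observations)
for cell, cell_stats in sorted(stats.items()):
    print(cell, cell_stats)
```

Keeping only running sums per cell means the observations can be streamed in chunks, which is one reason this style of aggregation distributes well.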
SciPy Tools Plenaries
SciPy Tools Plenary on Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. This presentation summarizes changes over the past year, new features, and future plans.
Elliott Sales de Andrade
SciPy Tools Plenary on SciPy
2023 updates in SciPy
Pamphile T. Roy
Zarr Updates for SciPy 2023
SciPy tools plenary session updates for Zarr
Josh Moore
Lightning Talks
NumFOCUS Academic Consortium and Open Source Pledge
Announcement of the NumFOCUS Academic Consortium and the Academic Data Science Alliance (ADSA) and NumFOCUS Open Source Pledge.
Arliss Collins
Hamilton: drop procedural scripts in favor of declarative functions
Lightning talk on using Hamilton's declarative functions instead of procedural scripts
Stefan Krawczyk
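To show what "declarative functions" can mean in the Hamilton lightning talk above, here is a standard-library sketch of the general pattern: each function's name is the value it produces, and its parameter names are the values it depends on. This is not the Hamilton API. The `resolve` helper and the two node functions are hypothetical and exist only to illustrate the idea.

```python
import inspect


# Each function is one node in the dataflow: its name is the output it
# produces, and its parameter names are the inputs it needs.
def total(a, b):
    return a + b


def doubled_total(total):
    return 2 * total


def resolve(name, functions, inputs, cache=None):
    """Compute `name` by recursively resolving its parameters by name."""
    cache = {} if cache is None else cache
    if name in inputs:
        return inputs[name]
    if name not in cache:
        func = functions[name]
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(func).parameters
        }
        cache[name] = func(**kwargs)
    return cache[name]


functions = {f.__name__: f for f in (total, doubled_total)}
result = resolve("doubled_total", functions, {"a": 1, "b": 2})
print(result)
```

The caller asks for an output by name and the execution order falls out of the function signatures, in contrast to a procedural script where the order is spelled out line by line.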