Proceedings of the 22nd Python in Science Conference
SciPy 2023, Austin, Texas, July 10 - July 16
Content License: Creative Commons Attribution 3.0 Unported (CC-BY-3.0). Credit must be given to the creator.
July 10, 2023
https://doi.org/10.25080/gerudo-f2bc6f59-039

Posters and Slides

Accepted Paper Slides

Fast Exploration of the Milky Way (or any other n-dimensional dataset)
A presentation about how to efficiently explore n-dimensional datasets
Francesc Alted
https://doi.org/10.25080/gerudo-f2bc6f59-025

Better (Open Source) Homes and Gardens with Project Pythia
Tending Your Open Source Garden session, SciPy 2023
Drew Camron, Kevin Tyle
https://doi.org/10.25080/gerudo-f2bc6f59-026

Accessibility best practices for authoring Jupyter notebooks
Jupyter notebooks seem like they are for everyone, but how a notebook gets written can greatly impact how usable it is for people with disabilities. Using staple accessibility frameworks, this talk dives into what it means to make a notebook's content accessible and provides actionable guidance on how authors can improve their notebooks.
Isabela Presedo-Floyd, Stephannie Jimenez Gacha
https://doi.org/10.25080/gerudo-f2bc6f59-027

Gammapy: a Python package for gamma-ray astronomy
In this contribution we presented the first stable version v1.0 of Gammapy, an openly developed Python package for gamma-ray astronomy. Gammapy provides methods for the analysis of astronomical gamma-ray data, such as measurement of spectra, images and light curves.
By relying on standardized data formats and a joint likelihood framework, it allows astronomers to combine data from multiple instruments and constrain underlying astrophysical emission processes across large parts of the electromagnetic spectrum. Finally, we shared lessons learned during the journey towards version v1.0 for an openly developed scientific Python package.
Axel Donath, The Gammapy Developer Team (https://gammapy.org/team.html)
https://doi.org/10.25080/gerudo-f2bc6f59-028

Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem
The array API standard (https://data-apis.org/array-api/) is a common specification for Python array libraries, such as NumPy, PyTorch, CuPy, Dask, and JAX. This standard will make it straightforward for array-consuming libraries, like scikit-learn and SciPy, to write code that uniformly supports all of these libraries. This will allow, for instance, running the same code on the CPU and GPU. This talk covers the scope of the array API standard, supporting tooling which includes a library-independent test suite and compatibility layer, what work has been completed so far, and the plans going forward.
Aaron Meurer
https://doi.org/10.25080/gerudo-f2bc6f59-029

bayes_mapvar: Bayesian Statistics with Python, No Resampling Necessary
bayes_mapvar is a Python package that provides tools for Maximum A Posteriori (MAP) estimation and posterior variance estimation.
TensorFlow Probability and SciPy are leveraged to allow fast and efficient estimation for Bayesian models without using resampling methods.
Charles Lindsey
https://doi.org/10.25080/gerudo-f2bc6f59-02a

New CUDA Toolkit packages for Conda
A presentation on the new CUDA package layout for Conda (as included in conda-forge).
Rick Ratzel, Thomson Comer, John Kirkham
https://doi.org/10.25080/gerudo-f2bc6f59-02b

Taming Black Swans: Long-tailed distributions in the natural and engineered world
Long-tailed distributions are common in natural and engineered systems; as a result, we encounter extreme values more often than we would expect from a short-tailed distribution. If we are not prepared for these 'black swans', they can be disastrous. But we have statistical tools for identifying long-tailed distributions, estimating their parameters, and making better predictions about rare events. In this talk, I present evidence of long-tailed distributions in a variety of datasets -- including earthquakes, asteroids, and stock market crashes -- discuss statistical methods for dealing with them, and show implementations using scientific Python libraries.
Allen B. Downey
https://doi.org/10.25080/gerudo-f2bc6f59-02c

In-Process Analytical Data Management with DuckDB
DuckDB is a novel analytical data management system available under the MIT license. DuckDB supports complex queries in SQL or a relational API, has no external dependencies, and is deeply integrated into the Python ecosystem (reading and writing NumPy, Pandas, and PyArrow objects).
DuckDB can also analyze datasets that are too large to fit in main memory.
Alexander Monahan, Hannes Mühleisen, Mark Raasveldt, +1
https://doi.org/10.25080/gerudo-f2bc6f59-02d

DataJoint: Bringing databases back into data science
Relational databases manage structured data and facilitate queries in collaborative repositories, but using SQL from a scientific programming language is awkward. DataJoint is an open-source framework for managing scientific data supporting data definition, diagramming, and queries. DataJoint makes computation a native part of its data model, bridging the gap between databases and numerical analysis in automated workflows.
Raphael Guzman, Dimitri Yatsenko
https://doi.org/10.25080/gerudo-f2bc6f59-02e

Accelerating the Use of Public Geophysical Data for Recharging California’s Groundwater
Recharging ground aquifers is an urgent task for improving groundwater sustainability in California. Geophysical data can provide a capability to image the subsurface where there are major data gaps. However, neither the data nor the analytic tools required to derive subsurface information are readily accessible. We present an interactive web application that utilizes a curated, public database and GIS capabilities, and directly integrates Jupyter Notebooks and Python packages from researchers to guide recharge site location.
Our project showcases a unique combination of open-source tools to help turn research knowledge into actionable insights for practitioners to improve groundwater recharge in California.
Seogi Kang, Steve Purves
https://doi.org/10.25080/gerudo-f2bc6f59-02f

Interactive Exploration of Large-Scale Datasets with Jupyter-Scatter
Jupyter Scatter is a scalable, interactive, and interlinked scatter plot widget for exploring datasets with up to several million data points. It focuses on data-driven visual encodings and offers two-way pan+zoom and lasso interactions. Beyond a single instance, Jupyter Scatter can compose multiple scatter plots and synchronize their views and selections. In this presentation, Fritz introduces Jupyter Scatter's API and demonstrates how the widget can be used for exploring large-scale datasets using real-world examples from biology, machine learning, and geospatial data.
Fritz Lekschas
https://doi.org/10.25080/gerudo-f2bc6f59-030

vak: a neural network framework for researchers studying animal acoustic communication
Research on animal acoustic communication is being revolutionized by deep learning. In this talk we present vak, a framework that allows researchers in this area to easily benchmark deep neural network models and apply them to their own data. We'll demonstrate how research groups are using vak through examples with TweetyNet, a model that automates annotation of birdsong by segmenting spectrograms.
Then we'll show how adopting Lightning as a backend in version 1.0 has allowed us to incorporate more models and features, building on the foundation we put in place with help from the scientific Python stack.
David Nicholson, Yarden Cohen
https://doi.org/10.25080/gerudo-f2bc6f59-031

Open Force Field: next-generation force fields with open data, open software, and open science
The Open Force Field (OpenFF) initiative was formed to produce open and extensible infrastructure to build a new generation of MD force fields. We have now developed many software packages for constructing, applying, and benchmarking force fields. We have also generated several high-quality quantum chemistry datasets. Everything is available freely on GitHub, Zenodo, and the MolSSI QCArchive server. This work has been successfully used to investigate potential improvements to force fields, as well as simplify many previously difficult aspects of preparing MD systems.
Jeffrey Wagner
https://doi.org/10.25080/gerudo-f2bc6f59-032

Pandera: Going Beyond Pandas DataFrame Validation
This talk is about how Pandera has evolved to provide a standard schema interface for easily extending and supporting validation backends for arbitrary statistical data containers. Attendees will learn not only about data testing principles such as run-time validation and property-based testing, but also about the challenges of maintaining and evolving an open source project that many people rely on as a critical piece of their data infrastructure.
The high-level goal for this talk is to highlight lessons learned from Pandera’s particular journey from supporting only Pandas as a backend to supporting a whole suite of data objects.
Niels Bantilan
https://doi.org/10.25080/gerudo-f2bc6f59-033

Tidy geospatial data cubes
Borrowing from the tidy data principles developed for tabular datasets [Wickham, 2014](https://vita.had.co.nz/papers/tidy-data.pdf), this presentation imagines 'tidy' principles for n-dimensional array data represented by Xarray objects with a specific focus on geospatial datasets.
Emma Marshall, Deepak Cherian, Scott Henderson
https://doi.org/10.25080/gerudo-f2bc6f59-034

Zarr: Community specification of large, cloud-optimised, N-dimensional, typed array storage
A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (cf. https://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud.
Implementations exist in C++, C, Java, JavaScript, Julia, and Python.
Sanket Verma, Josh Moore, John Kirkham
https://doi.org/10.25080/gerudo-f2bc6f59-035

Accepted Posters

Unleashing the Power of Modern Portfolio Theory: Maximizing Returns while Managing Risk
Modern portfolio theory is a mathematical approach that helps in creating an investment portfolio by considering both the potential risks and returns. Our sample portfolio comprises eight diverse assets, representing exposure to various sectors in the Global Industry Classification Standard (GICS).
Kalyan Prasad
https://doi.org/10.25080/gerudo-f2bc6f59-015

Data engineering and analytics for photolithography manufacturing process at DuPont - A practical approach from lab to fab
With smaller chips, requirements on chemical suppliers to control material parameters used in semiconductor manufacturing are stricter. DuPont Electronics & Industrial is working to improve the quality of photolithography products. Our goal is to pre-emptively identify failures and minimize defects using data science and statistics. This requires good data engineering practices in a challenging enterprise IT environment with multiple systems. We present a practical approach to overcome these challenges and to collect and organize data using open-source Python tools to help domain experts and practitioners.
We describe approaches to scale this effort, which should be relevant to chemistry and manufacturing practitioners.
Avishek Panigrahi, Stefan J Caporale, Abhishek Shrivastava, +1
https://doi.org/10.25080/gerudo-f2bc6f59-016

EEG-to-fMRI Neuroimaging Cross Modal Synthesis in Python
A Python package that supports EEG-to-fMRI synthesis
David Calhas
https://doi.org/10.25080/gerudo-f2bc6f59-017

Hamilton: Scalable, Portable, and Self-Documenting Dataflows in Python
Poster presented at the 2023 SciPy Conference. It describes the Hamilton project, which is a Python library for creating dataflows that are scalable, portable, and self-documenting. It uses the Naturf project as a case study to show before and after Hamilton.
Stefan Krawczyk, Elijah ben Izzy, Levi Sweet-Breu, +3
https://doi.org/10.25080/gerudo-f2bc6f59-018

itk-elastix: Medical image registration in Python
This SciPy 2023 poster provides an overview of the open-source medical image registration package itk-elastix.
Konstantinos Ntatsis, Niels Dekker, Viktor van der Valk, +5
https://doi.org/10.25080/gerudo-f2bc6f59-019

Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit
The Likeness toolkit utilizes state-of-the-art spatial microsimulation and activity allocation methods to generate synthetic populations within the US, at the metropolitan scale. These 'parallel universes' of agents are attributed with hundreds of demographic variables, probable nighttime and daytime locations, and plausible routing by commute mode between locations. This functionality is demonstrated for Tallahassee, FL and vicinity, a mid-sized metropolitan area.
Joseph V. Tuccillo, James D. Gaboardi
https://doi.org/10.25080/gerudo-f2bc6f59-01a

Matchmaker: A Toolkit for Combining Satellite Observations from Multiple Sensors
Matchmaker constructs multi-sensor datasets that enable scientists to directly compare observations for validation or monitoring purposes, or to fuse measurements from complementary instruments to improve geophysical understanding. Matchmaker leverages SciPy ecosystem tools to perform each of its primary tasks: orbital simulation, geometric collocation of individual observations, and alignment / aggregation of sensor data arrays.
Greg Quinn
https://doi.org/10.25080/gerudo-f2bc6f59-01b

Patterns and Anti-Patterns when Measuring Diversity in Open Source
If we fundamentally believe that 'Open source is for everyone', how do we know we are actually bringing everyone in, meeting them where they are, and fostering a diverse and inclusive open source ecosystem? Our open source team has evolved our practices for measuring open source communities, and the impact we have on them. This poster presents patterns and anti-patterns we have learned about measuring diversity in global open source communities.
amanda casari
https://doi.org/10.25080/gerudo-f2bc6f59-01c

PyQtGraph - High Performance Visualization for All Platforms
This poster discusses and showcases the PyQtGraph plotting library.
Special attention is given to PyQtGraph's primary objectives, its performance, cross-platform support, and interactivity, and to how the library achieves those objectives.
Ognyan Moore, Nathan Jessurun, Nils Nemitz, +2
https://doi.org/10.25080/gerudo-f2bc6f59-01d

PyVista
Let's plot 3D Pythonic visualization
Tetsuo Koyama
https://doi.org/10.25080/gerudo-f2bc6f59-01e

OpenCRUMS: Open Classification of Regimes in the Southeast USA
The U.S. Department of Energy AI for Earth System Predictability program is interested in exploring how machine learning can be used to characterize aerosol conditions over the Atmospheric Radiation Measurement (ARM) facility's measurement sites. In this poster we provide links to cookbooks that show how to use TensorFlow and Keras to produce explainable classifications of aerosol conditions over the Houston region where a recent ARM field campaign, TRacking Aerosol Convection intERactions (TRACER), was conducted. We show that a CNN-based classifier of the EPA PM2.5 Air Quality Index is able to capture the diurnal cycle of aerosols over Houston as well as the influence of dust from the Saharan Desert.
Robert Jackson, Maria Zawadowicz, Die Wang, +4
https://doi.org/10.25080/gerudo-f2bc6f59-01f

Analyse the uncertainty of your system: Sensitivity Analysis in Python with scipy.stats.sobol_indices
Use the indices of Sobol' to measure the uncertainty in your system. Starting with SciPy 1.11, you can now use scipy.stats.sobol_indices, which provides a simple yet powerful API.
Pamphile T. Roy
https://doi.org/10.25080/gerudo-f2bc6f59-020

aPhyloGeo-Covid: A Web Interface for Reproducible Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake
The gene sequencing data, along with the associated lineage tracing and research data generated throughout the Coronavirus disease 2019 (COVID-19) pandemic, constitute invaluable resources that profoundly empower phylogeography research. To optimize the utilization of these resources, we have developed an interactive analysis platform called aPhyloGeo-Covid, leveraging the capabilities of Neo4j, Snakemake, and Python. This platform enables researchers to explore and visualize diverse data sources specifically relevant to SARS-CoV-2 for phylogeographic analysis. The integrated Neo4j database acts as a comprehensive repository, consolidating COVID-19 pandemic-related sequence information, climate data, and demographic data obtained from public databases, facilitating efficient filtering and organization of input data for phylogeographical studies. Presently, the database encompasses over 113,774 nodes and 194,381 relationships. Additionally, aPhyloGeo-Covid provides a scalable and reproducible phylogeographic workflow for investigating the intricate relationship between geographic features and the patterns of variation in diverse SARS-CoV-2 variants.
The code repository of the platform is publicly accessible on GitHub (https://github.com/tahiri-lab/iPhyloGeo/tree/iPhylooGeo-neo4j), providing researchers with a valuable tool to analyze and explore the intricate dynamics of SARS-CoV-2 within a phylogeographic context.
Wanlin Li, Nadia Tahiri
https://doi.org/10.25080/gerudo-f2bc6f59-021

Moving the Earth with thermodynamics and python
This poster describes some of the challenges of coupling thermodynamic models to geodynamic simulations and introduces a new Python-based tool, ThermoCodegen (TCG), to deal with them. TCG uses SymPy to symbolically represent thermodynamic models and automatically generate interfaces to a set of consistent thermodynamic parameters for use in geodynamic models. It can be used quite generally but has been particularly designed for reactive disequilibrium problems in Earth science, and we present some examples here.
Cian Wilson, Marc Spiegelman, Owen Evans, +2
https://doi.org/10.25080/gerudo-f2bc6f59-022

TUG-RSE: Pulling Students into Research Software Engineering
Research Software Engineering (RSE) is a rapidly growing profession within the Scientific Python community, but it can present significant challenges for newcomers, particularly students who may lack the necessary skills or knowledge of the field.
The aim of this poster presentation is to discuss the current challenges faced by newcomers to research software engineering, the potential solutions to make the community more inclusive, and an introduction to 'The Undergraduate's Guide To Research Software Engineering' (TUG-RSE).
Aman Goel
https://doi.org/10.25080/gerudo-f2bc6f59-023

Yori: a new, highly customizable tool for Level-3 data production
Yori is a highly customizable, sensor-agnostic software, developed to support the NASA Atmosphere Science Teams, to spatially/temporally resample geophysical variables from satellite measurements. Out of the box this software outputs common statistics, such as mean, standard deviation and pixel count, of each geophysical variable read, for every grid cell. Additionally, Yori allows users to easily produce additional statistics (e.g. histograms, min/max, median) and filter data to create custom outputs while maintaining a CF-compliant format. Yori has been designed to be easily scalable on a distributed computing environment.
Paolo Veglio, Robert Holz, Liam Gumley, +3
https://doi.org/10.25080/gerudo-f2bc6f59-024

SciPy Tools Plenaries

SciPy Tools Plenary on Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. This presentation summarizes changes over the past year, new features, and future plans.
Elliott Sales de Andrade
https://doi.org/10.25080/gerudo-f2bc6f59-036

SciPy Tools Plenary on SciPy
2023 updates in SciPy
Pamphile T. Roy
https://doi.org/10.25080/gerudo-f2bc6f59-037

Zarr Updates for SciPy 2023
SciPy tools plenary session updates for Zarr
Josh Moore
https://doi.org/10.25080/gerudo-f2bc6f59-038

Lightning Talks

NumFOCUS Academic Consortium and Open Source Pledge
Announcement of the NumFOCUS Academic Consortium and the Academic Data Science Alliance (ADSA) and NumFOCUS Open Source Pledge.
Arliss Collins
https://doi.org/10.25080/gerudo-f2bc6f59-013

Hamilton: drop procedural scripts in favor of declarative functions
Lightning Talk on using Hamilton in favor of procedural scripts
Stefan Krawczyk
https://doi.org/10.25080/gerudo-f2bc6f59-014