Proceedings of SciPy 2023

Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem

The Python array API standard specifies standardized application programming interfaces and behaviors for array and tensor objects and operations. The establishment and subsequent adoption of the standard aim to reduce ecosystem fragmentation and facilitate array library interoperability.

Aaron Meurer, Athan Reines, Ralf Gommers, +15

A Modified Strassen Algorithm to Accelerate Numpy Large Matrix Multiplication with Integer Entries

We present a Strassen-type algorithm for multiplying large matrices with integer entries. The algorithm is the standard Strassen divide-and-conquer algorithm, but it crosses over to NumPy when either the row or column dimension of one of the matrices drops below 128.

Anthony Breitzman
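
The crossover scheme described in the entry above can be sketched as follows. This is a minimal illustration and not the author's implementation: the function name strassen, the CROSSOVER constant, and the zero-padding of odd dimensions are assumptions made for the example. Only the threshold of 128 and the hand-off to NumPy come from the abstract.

```python
import numpy as np

CROSSOVER = 128  # threshold quoted in the abstract


def strassen(a, b, crossover=CROSSOVER):
    """Multiply 2-D integer arrays a (n x k) and b (k x m).

    Hypothetical sketch: recurse with Strassen's seven products and hand
    off to NumPy's matmul once any dimension is below `crossover`.
    """
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions do not match")

    # Cross over to NumPy when any dimension is small.
    if min(n, k, m) < crossover:
        return a @ b

    # Assumption: pad odd dimensions with one zero row/column so that
    # every matrix splits into four equally sized blocks.
    pn, pk, pm = n % 2, k % 2, m % 2
    if pn or pk or pm:
        a = np.pad(a, ((0, pn), (0, pk)))
        b = np.pad(b, ((0, pk), (0, pm)))

    hn, hk, hm = a.shape[0] // 2, a.shape[1] // 2, b.shape[1] // 2
    a11, a12, a21, a22 = a[:hn, :hk], a[:hn, hk:], a[hn:, :hk], a[hn:, hk:]
    b11, b12, b21, b22 = b[:hk, :hm], b[:hk, hm:], b[hk:, :hm], b[hk:, hm:]

    # Strassen's seven recursive products.
    m1 = strassen(a11 + a22, b11 + b22, crossover)
    m2 = strassen(a21 + a22, b11, crossover)
    m3 = strassen(a11, b12 - b22, crossover)
    m4 = strassen(a22, b21 - b11, crossover)
    m5 = strassen(a11 + a12, b22, crossover)
    m6 = strassen(a21 - a11, b11 + b12, crossover)
    m7 = strassen(a12 - a22, b21 + b22, crossover)

    # Assemble the four blocks of the product.
    c = np.empty((2 * hn, 2 * hm), dtype=m1.dtype)
    c[:hn, :hm] = m1 + m4 - m5 + m7
    c[:hn, hm:] = m3 + m5
    c[hn:, :hm] = m2 + m4
    c[hn:, hm:] = m1 - m2 + m3 + m6

    # Drop any padding that was added above.
    return c[:n, :m]
```

With integer inputs every step is exact, so the result can be compared directly against a @ b, provided the entries are small enough not to overflow the integer dtype.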

An Accessible Python based Author Identification Process

Author identification, also known as ‘author attribution’ and more recently ‘forensic linguistics’, involves identifying the true authors of anonymous texts. In this paper we replicate the analysis but in a much more accessible way using modern text mining methods and Python.

Anthony Breitzman

Biomolecular Crystallographic Computing with Jupyter

To further advance this use of Jupyter, we developed a collection of code fragments that use the vast Computational Crystallography Toolbox (cctbx) library for novel analyses. We made versions of this library for use in JupyterLab and Colab.

Blaine H. M. Mooers

Bayesian Statistics with Python, No Resampling Necessary

TensorFlow Probability is a powerful library for statistical analysis in Python. Using TensorFlow Probability’s implementation of Bayesian methods, modelers can incorporate prior information and obtain parameter estimates and a quantified degree of belief in the results.

Charles Lindsey

Using Numba for GPU acceleration of Neutron Beamline Digital Twins

Digital twins of neutron instruments using Monte Carlo ray tracing have proven useful in analyzing neutron data and verifying instrument and sample designs. In this paper, we present a GPU-accelerated version of MCViNE using Python and Numba to balance user extensibility with performance.

Coleman J. Kendrick, Jiao Y. Y. Lin, Garrett E. Granroth

EEG-to-fMRI Neuroimaging Cross Modal Synthesis in Python

Electroencephalography and functional magnetic resonance imaging are two ways of recording brain activity. We developed a Python package, EEG-to-fMRI, which provides cross-modal neuroimaging synthesis functionalities.

David Calhas

vak: a neural network framework for researchers studying animal acoustic communication

The study of acoustic communication is being revolutionized by deep neural network models. To address this need, we developed vak, a neural network framework designed for acoustic communication researchers.

David Nicholson, Yarden Cohen

Emukit: A Python toolkit for decision making under uncertainty

Emukit is a highly flexible Python toolkit for enriching decision making under uncertainty with statistical emulation. It is particularly pertinent to complex processes and simulations where data are scarce or difficult to acquire.

Andrei Paleyes, Maren Mahsereci, Neil D. Lawrence

Using Blosc2 NDim As A Fast Explorer Of The Milky Way (Or Any Other NDim Dataset)

Large multidimensional datasets are widely used in various engineering and scientific applications. We have added support for large dimensional datasets to Blosc2, a compression and format library.

Project Blosc, Francesc Alted, Marta Iborra, +3

MDAKits: A Framework for FAIR-Compliant Molecular Simulation Analysis

The reproducibility and transparency of scientific findings are widely recognized as crucial for promoting scientific progress. The MDAKits framework provides a cookiecutter template, best practices documentation, and a continually validated registry.

Irfan Alibay, Lily Wang, Fiona Naughton, +4

The Pandata Scalable Open-Source Analysis Stack

As the scale of scientific data analysis continues to grow, traditional domain-specific tools often struggle with data of increasing size and complexity. We introduce the Pandata open-source software stack as a solution, emphasizing the use of domain-independent tools at critical stages of the data life cycle, without compromising the depth of domain-specific analyses.

James A. Bednar, Martin Durant

Spatial Microsimulation and Activity Allocation in Python: An Update on the Likeness Toolkit

Understanding human security and social equity issues within human systems requires large-scale models of population dynamics that simulate high-fidelity representations of individuals and access to essential activities (work/school, social, errands, health). Likeness is a Python toolkit for spatial microsimulation and activity allocation.

Joseph V. Tuccillo, James D. Gaboardi

itk-elastix: Medical image registration in Python

Image registration plays a vital role in understanding changes that occur in 2D and 3D scientific imaging datasets. In this paper, we introduce itk-elastix, a user-friendly Python wrapping of the mature elastix registration toolbox.

Konstantinos Ntatsis, Niels Dekker, Viktor van der Valk, +5

PyQtGraph - High Performance Visualization for All Platforms

PyQtGraph is a plotting library with high performance, cross-platform support and interactivity as its primary objectives. These goals are achieved by connecting the Qt GUI framework and the scientific Python ecosystem.

Ognyan Moore, Nathan Jessurun, Martin Chase, +2

Pandera: Going Beyond Pandas Data Validation

Data quality remains a core concern for practitioners in machine learning, data science, and data engineering, and many specialized packages have emerged to meet the need for validating and monitoring data and models. This paper outlines pandera’s motivation and the challenges that took it from being a pandas-only data validation framework to one that is extensible to other non-pandas-compliant dataframe-like libraries.

Niels Bantilan

libyt: a Tool for Parallel In Situ Analysis with yt

In the era of exascale computing, storage and analysis of large-scale data have become more important and difficult. We present libyt, an open-source C++ library that allows researchers to analyze and visualize data using yt or other Python packages in parallel during simulation runtime.

Shin-Rong Tsai, Hsi-Yu Schive, Matthew J. Turk

Data Reduction Network

Multidimensional categorical data is widespread but not easily visualized using standard methods. For example, questionnaire data generally consists of questions with categorical responses. Popular methods of handling categorical data include one-hot encoding and enumeration, which applies an unwarranted and potentially misleading notional order to the data. To address this, we introduce a novel visualization method named Data Reduction Network.

Haoyin Xu, Haw-minn Lu, José Unpingco
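
The two conventional encodings mentioned in the entry above can be contrasted in a short sketch. It does not show the Data Reduction Network method itself. The questionnaire responses and variable names are hypothetical, and the sketch reads the abstract as attributing the notional order to enumeration.

```python
import numpy as np

# Hypothetical questionnaire responses for a single question.
responses = ["agree", "neutral", "disagree", "agree"]

# Enumeration: map each category to an integer code. The codes imply an
# order and a distance between categories that the data do not have.
categories = sorted(set(responses))
index = {category: code for code, category in enumerate(categories)}
enumerated = np.array([index[r] for r in responses])

# One-hot encoding: one column per category, exactly one 1 per row.
# No order is implied, but the width grows with the number of categories.
one_hot = np.eye(len(categories), dtype=int)[enumerated]
```

Here enumerated holds one integer per response, while one_hot has one row per response and one column per category.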

aPhyloGeo-Covid: A Web Interface for Reproducible Phylogeographic Analysis of SARS-CoV-2 Variation using Neo4j and Snakemake

The gene sequencing data, along with the associated lineage tracing and research data generated throughout the Coronavirus disease 2019 (COVID-19) pandemic, constitute invaluable resources that profoundly empower phylogeography research. To optimize the utilization of these resources, we have developed an interactive analysis platform called aPhyloGeo-Covid.

Wanlin Li, Nadia Tahiri