Proceedings of SciPy 2021

SciPy 2021, the 20th annual Scientific Computing with Python conference, was held virtually July 12-18, 2021. Twenty peer-reviewed articles were published in the conference proceedings.

PyBMRB: Data visualization tool for BioMagResBank

The Biological Magnetic Resonance Data Bank (BioMagResBank or BMRB https://bmrb.io), founded in 1988, is the international, open archive for data generated by Nuclear Magnetic Resonance (NMR) spectroscopy of biological systems.
Kumaran Baskaran, Jonathan R Wedell, Eldon L. Ulrich, +2
https://doi.org/10.25080/majora-1b6fd038-00a

Social Media Analysis using Natural Language Processing Techniques

Social media is used every day, and the content people view and post influences others around the world in a variety of ways. Social media platforms such as YouTube see heavy daily activity in video posting, watching, and commenting.
Jyotika Singh
https://doi.org/10.25080/majora-1b6fd038-009

PyCID: A Python Library for Causal Influence Diagrams

Why did a decision maker select a certain decision? What behaviour does a certain objective incentivise? How can we improve this behaviour and ensure that a decision-maker chooses decisions with safer or fairer consequences? This paper introduces the Python package PyCID, built upon pgmpy, that implements (causal) influence diagrams, a widely used graphical modelling framework for decision-making problems.
James Fox, Tom Everitt, Ryan Carey, +3
https://doi.org/10.25080/majora-1b6fd038-008

CLAIMED, a visual and scalable component library for Trusted AI

CLAIMED is a component library for artificial intelligence, machine learning, "extract, transform, load" processes and data science. The goal is to enable low-code/no-code rapid prototyping by providing ready-made components for various business domains, supporting various computer languages, working on various data flow editors and running on diverse execution engines.
Romeo Kienzler, Ivan Nesic
https://doi.org/10.25080/majora-1b6fd038-007

Natural Language Processing with Pandas DataFrames

Most areas of Python data science have standardized on using Pandas DataFrames for representing and manipulating structured data in memory. Natural Language Processing (NLP), not so much. We believe that Pandas has the potential to serve as a universal data structure for NLP data.
Frederick Reiss, Bryan Cutler, Zachary Eichenberger
https://doi.org/10.25080/majora-1b6fd038-006

MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package

Molecular dynamics (MD) computer simulations help elucidate details of the molecular processes in complex biological systems, from protein dynamics to drug discovery. One major issue is that these MD simulation files are now commonly terabytes in size, which means analyzing the data from these files becomes a painstakingly expensive task.
Edis Jakupovic, Oliver Beckstein
https://doi.org/10.25080/majora-1b6fd038-005

Accelerating Spectroscopic Data Processing Using Python and GPUs on NERSC Supercomputers

The Dark Energy Spectroscopic Instrument (DESI) will create the most detailed 3D map of the Universe to date by measuring redshifts in light spectra of over 30 million galaxies. The extraction of 1D spectra from 2D spectrograph traces in the instrument output is one of the main computational bottlenecks of the DESI data processing pipeline, which is predominantly implemented in Python.
Daniel Margala, Laurie Stephey, Rollin Thomas, +1
https://doi.org/10.25080/majora-1b6fd038-004

signac: Data Management and Workflows for Computational Researchers

The signac data management framework (https://signac.io) helps researchers execute reproducible computational studies, scales workflows from laptops to supercomputers, and emphasizes portability and fast prototyping.
Bradley D. Dice, Brandon L. Butler, Vyas Ramasubramani, +7
https://doi.org/10.25080/majora-1b6fd038-003

Modernizing computing by structural biologists with Jupyter and Colab

Protein crystallography produces most of the protein structures used in structure-based drug design. The process of protein structure determination is computationally intensive and error-prone because many software packages are involved.
Blaine H. M. Mooers
https://doi.org/10.25080/majora-1b6fd038-002

Using Python for Analysis and Verification of Mixed-mode Signal Chains

Any application involving sensitive measurements of the physical world starts with an accurate, precise, and low-noise signal chain. Modern, highly integrated data acquisition devices can often be directly connected to sensor outputs, performing analog signal conditioning, digitization, and digital filtering on a single silicon device, greatly simplifying system electronics.
Mark Thoren, Cristina Suteu
https://doi.org/10.25080/majora-1b6fd038-001

How PDFrw and fillable forms improves throughput at a Covid-19 Vaccine Clinic

PDFrw was used to prepopulate Covid-19 vaccination forms to improve the efficiency and integrity of the vaccination process in terms of federal and state privacy requirements. We will describe the vaccination process from the initial appointment, through the vaccination delivery, to the creation of subsequent required documentation.
Haw-minn Lu, José Unpingco
https://doi.org/10.25080/majora-1b6fd038-000

Cell Tracking in 3D using deep learning segmentations

Live-cell imaging is a highly used technique to study cell migration and dynamics over time. Although many computational tools have been developed during the past years to automatically detect and track cells, they are optimized to detect cell nuclei with similar shapes and/or cells not clustering together.
Varun Kapoor, Claudia Carabaña
https://doi.org/10.25080/majora-1b6fd038-014

CNN Based ToF Image Processing

In this paper, a Time of Flight (ToF) camera-specific data processing pipeline is presented, followed by real-life applications using artificial intelligence. These applications include use cases such as gesture recognition, movement direction estimation, or physical exercise monitoring.
Marian-Leontin Pop, Szilard Molnar, Alexandru Pop, +3
https://doi.org/10.25080/majora-1b6fd038-013

Multithreaded parallel Python through OpenMP support in Numba

A modern CPU delivers performance through parallelism. A program that exploits the performance available from a CPU must run in parallel on multiple cores. This is usually best done through multithreading.
Todd Anderson, Tim Mattson
https://doi.org/10.25080/majora-1b6fd038-012

Training machine learning models faster with Dask

Machine learning (ML) relies on stochastic algorithms, all of which rely on gradient approximations with "batch size" examples. Growing the batch size as the optimization proceeds is a simple and usable method to reduce the training time, provided that the number of workers grows with the batch size.
Joesph Holt, Scott Sievert
https://doi.org/10.25080/majora-1b6fd038-011

Monitoring Scientific Python Usage on a Supercomputer

In 2021, more than 30% of users at the National Energy Research Scientific Computing Center (NERSC) used Python on the Cori supercomputer. To determine this we have developed and open-sourced a simple, minimally invasive monitoring framework that leverages standard Python features to capture Python imports and other job data via a package called "Customs".
Rollin Thomas, Laurie Stephey, Annette Greiner, +1
https://doi.org/10.25080/majora-1b6fd038-010

Classification of Diffuse Subcellular Morphologies

Characterizing dynamic sub-cellular morphologies in response to perturbation remains a challenging and important problem. Many organelles are anisotropic and difficult to segment, and few methods exist for quantifying the shape, size, and quantity of these organelles.
Neelima Pulagam, Marcus Hill, Mojtaba Fazli, +6
https://doi.org/10.25080/majora-1b6fd038-00f

PyRSB: Portable Performance on Multithreaded Sparse BLAS Operations

This article introduces PyRSB, a Python interface to the LIBRSB library. LIBRSB is a portable performance library offering so-called Sparse BLAS (Sparse Basic Linear Algebra Subprograms) operations for modern multicore CPUs.
Michele Martone, Simone Bacchio
https://doi.org/10.25080/majora-1b6fd038-00e

Programmatically Identifying Cognitive Biases Present in Software Development

Mitigating bias in AI-enabled systems is a topic of great concern within the research community. While efforts are underway to increase model interpretability and de-bias datasets, little attention has been given to identifying biases that are introduced by developers as part of the software engineering process.
Amanda E. Kraft, Matthew Widjaja, Trevor M. Sands, +1
https://doi.org/10.25080/majora-1b6fd038-00c

Conformal Mappings with SymPy: Towards Python-driven Analytical Modeling in Physics

This contribution shows how the symbolic computing Python library SymPy can be used to improve flow force modeling due to a Couette-type flow, i.e. a flow of viscous fluid in the region between two bodies, where one body is in tangential motion relative to the other.
Zoufiné Lauer-Baré, Erich Gaertig
https://doi.org/10.25080/majora-1b6fd038-00b