Proceedings of SciPy 2024
The 23rd annual SciPy conference was held in Tacoma, WA at the Tacoma Convention Center, July 8-14, 2024.
SciPy brings together attendees from industry, academia and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.
Full proceedings, posters and slides, and organizing committee can be found at https://
The quest for more efficient and faster deep learning models has led to the development of various alternatives to Transformers, one of which is the Mamba model. This paper provides a comprehensive comparison between Mamba models and Transformers, focusing on their architectural differences, performance metrics, and underlying mechanisms.
While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows.
Presenting a model or algorithm as a GUI application is a common need in the scientific and engineering community. Funix was created to automatically launch apps from existing Python functions, selecting widgets based on the types of each function's arguments and return values according to the type-to-widget mapping defined in a theme.
Multivariate interpolation is a fundamental tool in scientific computing used to approximate the values of a function between known data points in multiple dimensions. Despite its importance, the Python ecosystem offers a fragmented landscape of specialized tools for this task; the multinterp package was developed to address this challenge.
Discover how scikit-build-core revolutionizes Python extension building with its seamless integration of CMake and Python packaging standards. Learn about its enhanced features for cross-compilation, multi-platform support, and simplified configuration, which enable writing binary extensions with pybind11, Nanobind, Fortran, Cython, C++, and more.
Harmful algal blooms pose major health risks to human and aquatic life. CyFi is an open-source Python package that enables detection of cyanobacteria in inland water bodies using 10-30m Sentinel-2 imagery and a computationally efficient tree-based machine learning model.
Feature selection is crucial for reducing data dimensionality as well as enhancing model interpretability and performance in machine learning tasks. This study explores the possibility of performing feature selection on a subset of data to reduce the computational burden.
We present mandala, a Python library that largely eliminates the accidental complexity of scientific data management and incremental computing. While most traditional and/or popular data management solutions are based on logging, mandala takes a fundamentally different approach, using memoization of function calls as the fundamental unit of saving, loading, querying and deleting computational artifacts.
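The idea of treating a memoized function call as the unit of storage can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not mandala's API: the `memoize` decorator, the in-memory `STORE` dictionary, and the `square` function are all hypothetical names introduced here.

```python
import hashlib
import pickle
from functools import wraps

# Hypothetical in-memory store; a real tool would persist results durably.
STORE = {}


def memoize(func):
    """Cache results keyed by the function name and a hash of its arguments."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Serialize the call (name plus arguments) and hash it to get a key.
        payload = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        if key not in STORE:
            # First time this exact call is seen: compute and save the result.
            STORE[key] = func(*args, **kwargs)
        return STORE[key]

    return wrapper


CALLS = []  # records which inputs were actually computed


@memoize
def square(x):
    CALLS.append(x)
    return x * x


first = square(4)   # computed and stored
second = square(4)  # loaded from STORE, function body not re-run
third = square(5)   # new arguments, so computed and stored
```

Because each stored artifact is addressed by the call that produced it, re-running the same script only recomputes calls whose inputs have changed, which is what makes this style suited to incremental computing.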
Identifying the sources of generated content is vital for generative AI models, like ChatGPT and Bard, due to concerns about copyright infringement and plagiarism. In this paper, we explore text watermarking as a potential solution. We investigate techniques including physical watermarking and logical watermarking.
The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. This paper will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos.
The peracarid taxon Cumacea is an essential indicator of benthic quality in marine ecosystems. This study investigated the influence of environmental (i.e., biological or ecosystemic), climatic (i.e., meteorological or atmospheric), and spatial (i.e., geographic or regional) variables on their genetic variability and adaptability in the Northern North Atlantic, focusing on Icelandic waters.
Histopathological images, which are digitized images of human or animal tissue, contain insights into disease state. We present PredX-Tools, a suite of simple, easy-to-use Python GUI applications that facilitate analysis of histopathological images and provide a no-code platform for data scientists and researchers to perform analysis on raw and transformed data.
Understanding neighborhood context is critical for social science research, public policy analysis, and urban planning. We introduce geosnap, the Geospatial Neighborhood Analysis Package, a suite of tools for exploring, modeling, and visualizing the social context and spatial extent of neighborhoods and regions over time.
Jupyter Widgets enable interactive code and data visualization in notebooks, but creating and distributing widgets across the Jupyter ecosystem is challenging. The anywidget project introduces a standard and toolset for portable, web-based widgets in various computing environments, simplifying development and extending compatibility beyond Jupyter. Its approach has fostered a rich widget ecosystem, driving the creation of new widgets and adoption of the standard by multiple platforms.
Understanding cilia behavior is essential in diagnosing and treating diseases linked to ciliary dysfunction, but the task of automatically analyzing cilia is often labor- and time-intensive. In this work we overcome this bottleneck by developing a robust, self-supervised framework exploiting the visual similarity of normal and dysfunctional cilia.
Jupyter is a popular platform for writing interactive computational narratives that contain computer code and its output interleaved with prose that describes the code and the output. It is possible to use one’s voice to interact with Jupyter notebooks.
In recent years, WebAssembly has emerged as a widely supported technology that offers high performance, compact binary size, support for multiple languages, hardware independence, security, and universal platform support. ITK-Wasm brings WebAssembly’s capabilities to scientific computing by combining the Insight Toolkit (ITK) and WebAssembly to enable high-performance spatial analysis across programming languages and hardware architectures.
Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. We introduce Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation.
Tradespace datasets are the result of large parameter sweeps run over numerous design options and can consist of thousands or even millions of design configurations and the corresponding performance metrics. THEIA has been developed for visualizing this complex tradespace data related to the acquisitions process.
With the influx of large data from multiple instruments and experiments, scientists are wrangling complex data pipelines that are context-dependent and non-reproducible. Echodataflow provides transparent, reproducible pipelines that can be edited with text "recipes", scaled, and monitored.
Science requires new mediums to compose ideas and ways to share research findings iteratively, as early as possible and connected directly to software and data. In this paper we discuss two tools for scientific authoring and publishing, MyST Markdown and Curvenote, and illustrate examples of improving metadata, reimagining the reading experience, including computational content, and transforming publishing practices for individuals and societies through automation and continuous practices.
In recent years, leveraging satellite imagery with deep learning architectures has become an effective approach for environmental monitoring tasks, including forest wildfire detection. This paper presents a Python-based methodology for gathering and using a labeled high-resolution satellite imagery dataset for forest wildfire detection.
Evaluating probabilistic forecasts is complex and essential across various domains, yet no comprehensive software framework exists to simplify this task. Despite extensive literature on evaluation methodologies, current practices are fragmented and often lack reproducibility. To address this gap, we introduce a reproducible experimental workflow for evaluating probabilistic forecasting algorithms using the sktime package.
Machine learning is revolutionizing a wide range of research areas and industries, but many ML projects never progress past the proof-of-concept stage. To address this problem, we introduce Model Share AI, a platform designed to streamline collaborative model development, model provenance tracking, and model deployment.
Interactive visualizations are invaluable tools for building intuition and supporting rapid exploration of datasets and models. This paper explains the benefits of IPyVuetify, including the ability to arbitrarily overlay widgets and plots on top of one another to support more flexible details-on-demand techniques.
The increasing volume of research data in fields such as astronomy, biology, and engineering necessitates efficient distributed data management. This paper presents the Librarian, a custom framework for data transfer in large academic collaborations, designed for the Simons Observatory.
This article demonstrates practical approaches to fully type-hinting generic NumPy arrays and StaticFrame DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
Rough path theory is a branch of mathematics arising out of stochastic analysis. One of the main tools of rough path analysis is the signature, which captures the evolution of an unparametrised path including the order in which events occur. RoughPy is our new Python package that aims to change the way we think about sequential streamed data.
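To make the order-sensitivity of the signature concrete, the first two signature levels of a piecewise-linear path can be computed in plain Python. This is a minimal sketch that does not use RoughPy; the function name `signature_depth2` and the two example paths are assumptions made for illustration, and only truncation depth 2 is handled.

```python
def signature_depth2(points):
    """Return (level1, level2) of the signature of a piecewise-linear path.

    level1[i] is the total increment in coordinate i.
    level2[i][j] is the iterated integral of coordinate i against coordinate j.
    """
    dim = len(points[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for start, end in zip(points, points[1:]):
        step = [b - a for a, b in zip(start, end)]
        for i in range(dim):
            for j in range(dim):
                # Chen's identity for appending one straight segment:
                # a cross term with the path so far, plus the segment's own term.
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        # Update the running increment only after level 2 has used the old value.
        level1 = [s + d for s, d in zip(level1, step)]
    return level1, level2


# Two paths with the same endpoints that take their steps in opposite order.
right_then_up = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
up_then_right = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

s1_a, s2_a = signature_depth2(right_then_up)
s1_b, s2_b = signature_depth2(up_then_right)
```

Both paths have the same level-1 term, since they end at the same point, but their level-2 terms differ: the off-diagonal entries record which coordinate moved first.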