How the Scientific Python ecosystem helps answer fundamental questions of the Universe
Abstract
The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. The prevalence of Python in scientific computing motivated ATLAS to adopt it for its data analysis workflows while enhancing users’ experience. This paper describes to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos. Through a simplified example of the renowned Higgs boson discovery, readers will gain insights into how Python libraries are used to discriminate a signal from immense background noise, through tasks such as data cleaning, feature engineering, statistical interpretation, and visualization at scale.
1 Introduction
The field of high energy physics (HEP) is devoted to the study of the fundamental forces of Nature and their interactions with matter. Studying the structure of the Universe on the smallest scales requires the highest energy density environments possible — similar to those of the early Universe. These extreme energy density environments are created at the CERN international laboratory in Geneva, Switzerland, using the Large Hadron Collider (LHC) to collide “bunches” of billions of protons at a center-of-mass energy of 13 TeV. The resulting collisions are recorded with building-sized particle detectors positioned around the LHC’s ring that are designed to measure subatomic particle properties. Given the rarity of the subatomic phenomena of interest, the rate of the bunch crossings is a tremendous 40 MHz, to maximize the number of high quality collisions that can be captured and read out by the detectors. Even with real-time onboard processing (“triggering”) of the experiment detector readout to save only the most interesting collisions, detectors like the ATLAS experiment [ATLAS Collaboration, 2008] still produce multiple petabytes of data per year. These data are then further filtered through selection criteria on the topology and kinematic quantities of the particle collision “events” and recorded into specialized datasets for different kinds of physics analysis. The final datasets that physicists use in their physics analyses in ATLAS are still on the order of hundreds of terabytes, which poses challenges of compute scale and of analyst time to use them efficiently while maximizing physics value.
Traditionally, ATLAS and the other LHC experiments have created experiment-specific custom C++ frameworks to handle all stages of the data processing pipeline, from the initial construction of high-level physics objects from the raw data to the final statistical analyses.
Motivated by the broad success of the Scientific Python ecosystem across many domains of science, the rise of the Scikit-HEP ecosystem of Pythonic tooling for particle physics [Rodrigues & others, 2020; Schreiner et al., 2022], and the community tools produced by the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) [Elmer et al., 2017; Albrecht & others, 2019], there has been a broad community-driven shift in HEP towards use of the Scientific Python ecosystem for analysis of physics data — a PyHEP ecosystem [HEP Software Foundation, 2024].
The use of dataframes and array programming for data analysis has enhanced the user experience while providing efficient computations without the need to hand-code optimized low-level routines.
The ATLAS collaboration is further extending this ecosystem of tooling to include high-level custom Python bindings to the low-level C++ frameworks using nanobind [Jakob, 2022].
Collectively, these tools are modernizing the methods by which researchers engage in data analysis at large scale and are providing a novel end-to-end analysis ecosystem for the ATLAS collaboration.
2 Employing the Scientific Python ecosystem
The multiple stages of physics data processing and analysis map onto different parts of the Scientific Python ecosystem.
This begins with the highly-structured but jagged nature of the event data in HEP.
The data structure of each event consists of variable length lists of physics objects (e.g. electrons, collections of tracks from charged objects).
To study the properties of the physics objects in a statistical manner, a fixed event analysis procedure is repeated over billions of events.
This has traditionally motivated the use of “event loops” that implicitly construct event-level quantities of interest and leverage the C++ compiler to produce efficient iterative code.
This precedent made it difficult to take advantage of the array programming paradigms that are common in Scientific Python, given NumPy [Harris et al., 2020] vector operations.
The Scikit-HEP library Awkward Array [Pivarski et al., 2018] offers a path forward by providing NumPy-like idioms for nested, variable-sized (JSON-like) data and also brings analysts into an array programming paradigm [Hartmann et al., 2021].
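As a minimal illustration of this idiom (the field names and values below are invented for this example), the kind of per-event selection that would otherwise require an explicit loop can be expressed as whole-array operations:

```python
import awkward as ak

# Three events with a variable number of electrons each (illustrative values)
events = ak.Array(
    [
        {"electron_pt": [57.2, 31.8], "electron_eta": [0.3, -1.1]},
        {"electron_pt": [12.4], "electron_eta": [2.0]},
        {"electron_pt": [44.1, 22.7, 18.9], "electron_eta": [-0.4, 1.6, 0.8]},
    ]
)

# NumPy-like expressions over jagged data instead of an explicit event loop
central = abs(events.electron_eta) < 1.5   # jagged boolean mask
good_pt = events.electron_pt[central]      # per-event filtered pT lists
n_good = ak.num(good_pt, axis=1)           # electrons passing, per event: [2, 0, 2]
leading = ak.max(events.electron_pt, axis=1)  # leading-electron pT per event
```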
With the ability to operate on HEP data structures in an array programming — or “columnar” — approach, the next step is to be able to read and write the HEP domain-specific ROOT [Brun & Rademakers, 1997] file format, which has given the particle physics community columnar data structures with efficient compression since 1997 [Pivarski et al., 2020].
This is accomplished with the uproot library [Pivarski et al., 2017], which allows for efficient transformation of ROOT data to NumPy or Awkward arrays.
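For example, a ROOT file can be opened and selected branches materialized directly as arrays (the file, tree, and branch names here are placeholders):

```python
import uproot

# File, tree, and branch names are placeholders for illustration
with uproot.open("events.root") as infile:
    tree = infile["CollectionTree"]

    # Read selected branches into an Awkward Array (jagged structure preserved)
    arrays = tree.arrays(["electron_pt", "electron_eta"], library="ak")

    # Or into NumPy (best suited to branches that are not jagged)
    event_number = tree["eventNumber"].array(library="np")
```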
The data is then filtered through kinematic and physics signature motivated selections using Awkward manipulations and queries to create array collections that contain the passing events.
Through intense detector characterization and calibration efforts, the ATLAS collaboration has developed robust methods and tooling to apply corrections to the data and evaluate systematic uncertainties.
For instance, corrections are applied to the signal collected by a specific calorimeter subsystem, along with systematic uncertainties due to the imperfect knowledge of the subsystem.
Given the custom nature of the detector and correction implementations, these corrections are implemented in custom C++ libraries in the ATLAS software framework, Athena [ATLAS Collaboration, 2021].
To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in Figure 1.
To contend with the extreme data volume, efficient distributed computing is an essential requirement.
Given the success of Dask [Dask Development Team, 2016] in the Scientific Python ecosystem, and its ability to be deployed across both traditional batch systems and cloud-based infrastructure with Kubernetes, the Scikit-HEP ecosystem has built extensions to Dask that allow for native Dask collections of Awkward arrays [dask-awkward Development Team, 2022] and for computing multidimensional boost-histogram objects [Schreiner et al., 2018] with Dask collections [dask-histogram Development Team, 2021].
Using Dask and these extensions, the data selection and systematic correction workflow can be horizontally scaled out across ATLAS collaboration compute resources to provide the data throughput necessary to make analysis feasible.
This is often achieved through use of the high-level coffea columnar analysis framework [Gray et al., 2019], which was designed to integrate with Dask and these HEP-specific Dask extensions.
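A rough sketch of how these pieces compose is shown below, assuming recent releases of uproot, dask-awkward, and hist that provide the Dask-backed interfaces; the file, tree, and branch names are placeholders:

```python
import dask
import dask_awkward as dak
import hist.dask as hda
import uproot

# Lazily map ROOT files into a dask-awkward collection
# (file, tree, and branch names are placeholders)
events = uproot.dask(
    {"data_part1.root": "CollectionTree", "data_part2.root": "CollectionTree"}
)

# Build up a task graph: a jagged selection feeding a Dask-backed histogram
selected_pt = events.electron_pt[abs(events.electron_eta) < 2.5]
h = (
    hda.Hist.new.Reg(50, 0, 200, name="pt", label="Electron pT [GeV]")
    .Double()
    .fill(dak.flatten(selected_pt))
)

# Nothing has been read yet; this triggers the (possibly distributed) computation
(result,) = dask.compute(h)
```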
The resulting data objects that are returned to analysts are histograms of physics quantity distributions — such as the reconstructed invariant-mass of a collection of particles or particle momentum.
Using the hist library [Schreiner et al., n.d.] for higher-level data exploration and manipulation, physicists are then able to efficiently manipulate the data distributions further using tooling from the broader Scientific Python ecosystem and to create domain-centric visualizations using the mplhep [Novak et al., 2020] extension of Matplotlib [Hunter, 2007].
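For instance, a histogram built with hist can be drawn in a HEP-style figure with mplhep (the binning and fill values below are invented for illustration):

```python
import hist
import matplotlib.pyplot as plt
import mplhep

# Apply the ATLAS-inspired plotting style shipped with mplhep
plt.style.use(mplhep.style.ATLAS)

# A one-dimensional histogram of a four-lepton invariant mass (illustrative values)
h = hist.Hist.new.Reg(24, 80, 170, name="m4l", label="$m_{4\\ell}$ [GeV]").Double()
h.fill(m4l=[91.2, 124.8, 125.3, 132.5, 90.7, 118.2])

fig, ax = plt.subplots()
mplhep.histplot(h, ax=ax, histtype="errorbar", color="black", label="Data")
ax.set_ylabel("Events")
ax.legend()
fig.savefig("m4l.png")
```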
From these high level data representations of the relevant physics, physicists are then able to serialize the distributions and use them for the final stages of data analysis and statistical modeling and inference.
3 Uncovering the Higgs boson
The most famous and revolutionary discovery in particle physics this century is the discovery of the Higgs boson — the particle corresponding to the quantum field that gives mass to fundamental particles through the Brout-Englert-Higgs mechanism — by the ATLAS and CMS experimental collaborations in 2012 [ATLAS Collaboration, 2012; CMS Collaboration, 2012].
This discovery work was done using large amounts of customized C++ software, but in the following decade the state of the PyHEP community has advanced enough that the workflow can now be done using community Python tooling.
To provide an overview of the Pythonic tooling and functionality, this section summarizes a simplified analysis workflow [Held et al., 2023] of a Higgs boson “decay” to two intermediate Z bosons that in turn decay to charged leptons (i.e. electrons (e) and muons (μ)), H → ZZ* → 4ℓ, performed on ATLAS open data [ATLAS Collaboration, 2020].
3.1 Loading data
Given the size of the data, the files used in a real analysis will usually be cached at a national level “analysis facility” where the analysis code will run.
Using coffea, uproot, and Dask, these files can then be efficiently read and the tree structure of the data can be used to populate Awkward arrays.
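A rough sketch of this step is shown below, assuming a recent coffea release with its Dask-backed NanoEvents interface; the file name, tree name, and choice of schema are placeholders and the exact arguments are version-dependent:

```python
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

# File and tree names are placeholders; the schema class must match the
# input format (PHYSLITE is ATLAS's reduced analysis format)
events = NanoEventsFactory.from_root(
    {"higgs_candidates.root": "CollectionTree"},
    schemaclass=PHYSLITESchema,
).events()

# `events` is a lazy, Dask-backed Awkward Array of structured event records;
# no data is read from disk until a computation is requested
print(events.fields)
```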
3.2 Cleaning and selecting data
Once the data is in Awkward arrays, additional selections need to be applied before it can be analyzed. Only physics objects of adequate quality are kept for further analysis, and those should reconstruct the topology of interest. In this particular case, due to the decay of the Higgs boson to two Z bosons, which in turn decay to charged leptons, the selected data contain four charged leptons grouped in two opposite-charge lepton pairs (so that the total charge is zero, as the Higgs and the Z bosons are electrically neutral). Additionally, in order to compare various kinds of simulated data, the events need to be normalized/weighted according to their relative abundance in reality and the amount of actual data collected by the experiment.
These selections and weightings can then be implemented in an analysis-specific coffea processor, and the processor can then be executed using a Dask executor to horizontally scale the analysis selection out across the available compute.
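A minimal sketch of the kind of selection logic such a processor might contain is shown below, written on in-memory Awkward arrays; the branch names, thresholds, and weight handling are illustrative assumptions rather than the analysis implementation:

```python
import awkward as ak
import numpy as np


def select_four_lepton_events(events, mc_weight=1.0):
    # Branch names (lep_pt, lep_eta, lep_charge) and cut values are illustrative only
    leptons = ak.zip(
        {"pt": events.lep_pt, "eta": events.lep_eta, "charge": events.lep_charge}
    )

    # Per-lepton quality requirements
    leptons = leptons[(leptons.pt > 7.0) & (abs(leptons.eta) < 2.5)]

    # Event-level selection: exactly four leptons whose charges sum to zero
    keep = (ak.num(leptons, axis=1) == 4) & (ak.sum(leptons.charge, axis=1) == 0)

    # One weight per selected event, normalizing simulation to the collected data
    weights = np.full(int(ak.sum(keep)), mc_weight)
    return events[keep], weights
```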
3.3 Feature engineering: The invariant mass
In order to discriminate the events of interest, i.e. candidates of the Higgs boson decay, from the vast background which has the same experimental signature, a discriminating feature is constructed. The example shown uses a simple, physics-inspired discriminant, the “invariant mass”, but the same methods support more complex feature engineering, including machine learning techniques that calculate more efficient discriminants. The invariant mass is the mass of a system that remains constant regardless of the system’s motion or the reference frame in which it is measured. Invariant mass is derived from the energy and momentum of a system of particles and is a fundamental property of the system:

$$m = \sqrt{E^2 - \|\vec{p}\,\|^2},$$

where $E$ and $\vec{p}$ are the total energy and momentum of the particles, respectively (in natural units with $c = 1$).
By detecting and measuring the energies and momenta of the particles observed in the experiment, we can reconstruct the invariant mass of the decay system. Particle systems originating from the decay of the Higgs boson will have a characteristic value of the invariant mass, which after the discovery in 2012 we know to be about 125 GeV. This is the quantity that will allow us to discriminate Higgs boson candidates from particle systems that originate from background processes.
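One way to compute this quantity with Scikit-HEP tooling is via the vector library, which registers Lorentz-vector behaviors on Awkward arrays; the library is not named in the text above, and the branch names and GeV units in this sketch are assumptions:

```python
import awkward as ak
import vector

vector.register_awkward()  # enable Lorentz-vector behaviors on Awkward records


def four_lepton_invariant_mass(events):
    # Branch names are placeholders and kinematic quantities are assumed in GeV
    leptons = ak.zip(
        {
            "pt": events.lep_pt,
            "eta": events.lep_eta,
            "phi": events.lep_phi,
            "energy": events.lep_energy,
        },
        with_name="Momentum4D",
    )

    # Sum the four leptons in each selected event; the invariant mass of the
    # system peaks near 125 GeV for Higgs boson candidates
    p4 = leptons[:, 0] + leptons[:, 1] + leptons[:, 2] + leptons[:, 3]
    return p4.mass
```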
3.4 Measurement uncertainties
One of the most expensive operations that happens during the event selections is the computation of systematic variations of the events, to account for imperfect knowledge of the detector systems.
In practice this requires applying complex, experiment-specific corrections to each event, using algorithms implemented in C++.
Historically these tools were implemented for an event loop processing paradigm, but with recent tooling additions, as shown in Figure 1, efficient on-the-fly systematic corrections can be computed for array programming paradigms.
The following provides an example of the high-level Python APIs that give handles to these tools for use in the workflows described so far. These tools are efficient enough to apply multiple systematic variations in analysis workflows, as seen in Figure 2.
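The original listing is not reproduced here; purely to illustrate the calling pattern, a hypothetical columnar correction tool exposed through such bindings might be used as below. The module, class, argument, and variation names are all invented for illustration and do not correspond to the actual ATLAS API:

```python
# Hypothetical sketch only: module, class, and argument names are invented
# and are not the actual ATLAS Python bindings.
from atlas_cp_bindings import MuonCalibrationTool  # hypothetical nanobind module

tool = MuonCalibrationTool(calibration_tag="example-tag")  # hypothetical configuration

# Apply the correction to whole arrays of muons at once (columnar, not per event)
pt_nominal = tool.calibrate(events.muon_pt, events.muon_eta)

# Evaluate a systematic variation of the same correction on the fly
pt_scale_up = tool.calibrate(
    events.muon_pt, events.muon_eta, variation="MUON_SCALE__1up"
)
```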
3.5 The “discovery” plot
After running the coffea processors, the resulting data from the selections are accumulated into boost-histogram objects, as seen visualized in Figure 3.
These histograms are then serialized into files with uproot and used by the statistical modeling and inference libraries pyhf [Heinrich et al., n.d.; Heinrich et al., 2021] and cabinetry [Held & Feickert, n.d.] to build binned statistical models and to efficiently fit the models to the observed data using vectorized computations and the optimization library iminuit [Dembinski et al., n.d.] for full uncertainties on all model parameters.
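As a compact illustration of the fitting step (with invented yields and a deliberately tiny model rather than the actual analysis model), pyhf can build and fit a binned statistical model with iminuit providing parameter uncertainties:

```python
import pyhf

# A tiny binned model: one signal and one background component with an
# uncorrelated per-bin background uncertainty (all numbers are invented)
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
observations = [53.0, 65.0] + model.config.auxdata

# Use the iminuit-backed optimizer to obtain uncertainties on all model parameters
pyhf.set_backend("numpy", "minuit")
best_fit, uncertainties = pyhf.infer.mle.fit(
    observations, model, return_uncertainties=True
).T

print(best_fit)        # best-fit parameter values (signal strength and nuisances)
print(uncertainties)   # corresponding parameter uncertainties
```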
The resulting best-fit model parameters — such as the scale factor on the signal component of the model corresponding to the normalization on the Higgs contributions — are visualized in Figure 4, where good agreement between the model predictions and the data is observed.
The signal component, clearly visible above the additional “background” components of the model, corresponds to Higgs boson events, with an observed count in agreement with theoretical expectations.
4 Conclusions
When the Higgs boson was discovered in 2012, the idea of being able to perform real Pythonic data analysis in HEP, let alone efficient analysis, was viewed as unfeasible. Through investment in the broader Scientific Python ecosystem, and development of the domain-specific pieces in the Scikit-HEP organization, the field of particle physics has successfully created a PyHEP ecosystem of robust tooling. Further investment by the ATLAS collaboration has resulted in new performant tooling for complex systematic corrections that will allow more complete and complex operations to be performed entirely within a Python workflow, helping to further reduce the time to insight for physics analysts.
5 Acknowledgements
Matthew Feickert, Alexander Held, and Gordon Watts are supported by the U.S. National Science Foundation (NSF) under Cooperative Agreement OAC-1836650 and PHY-2323298 (IRIS-HEP). Lukas Heinrich and Vangelis Kourlitis are supported by the Excellence Cluster ORIGINS, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC-2094-390783311 and by the German Federal Ministry of Education and Research Project 05H2021 (ErUM-FSP T02) - “Run 3 von ATLAS am LHC: Analysis Infrastructure for the ATLAS Detektor at the LHC”.
- ATLAS Collaboration. (2008). The ATLAS Experiment at the CERN Large Hadron Collider. JINST, 3, S08003. 10.1088/1748-0221/3/08/S08003
- Rodrigues, E., & others. (2020). The Scikit HEP Project – overview and prospects. EPJ Web Conf., 245, 06028. 10.1051/epjconf/202024506028
- Schreiner, H., Pivarski, J., & Rodrigues, E. (2022). Awkward Packaging: building Scikit-HEP. In M. Agarwal, C. Calloway, D. Niederhut, & D. Shupe (Eds.), Proceedings of the 21st Python in Science Conference (pp. 115–120). 10.25080/majora-212e5952-012
- Elmer, P., Neubauer, M., & Sokoloff, M. D. (2017). Strategic Plan for a Scientific Software Innovation Institute (S2I2) for High Energy Physics.
- Albrecht, J., & others. (2019). A Roadmap for HEP Software and Computing R&D for the 2020s. Comput. Softw. Big Sci., 3(1), 7. 10.1007/s41781-018-0018-8