Data driven advances dominate the applied sciences landscape, with quantum chemistry being no exception to the rule. Dataset biases and human error are key bottlenecks in the development of reproducible and generalized insights. At a computational level, we demonstrate how changing the granularity of the abstractions employed in data generation from simulations can aid in reproducible work. In particular, we introduce wailord (https://wailord.xyz), a free-and-open-source python library to shorten the gap between data-analysis and computational chemistry, with a focus on the ORCA suite binaries. A two level hierarchy and exhaustive unit-testing ensure the ability to reproducibly describe and analyze “computational experiments.” wailord offers both input generation, with enhanced analysis, and raw output analysis, for traditionally executed ORCA runs. The design focuses on treating output and input generation in terms of a mini domain specific language instead of more imperative approaches, and we demonstrate how this abstraction facilitates chemical insights.

Keywords:quantum chemistryparsersreproducible reportscomputational inference


The use of computational methods for chemistry is ubiquitous and few modern chemists retain the initial skepticism of the field Kohn, 1999Schaefer, 1986. Machine learning has been further earmarked Meyer et al., 2019Dral, 2020Schütt et al., 2019 as an effective accelerator for computational chemistry at every level, from DFT Gao et al., 2016 to alchemical searches De et al., 2016 and saddle point searches Ásgeirsson & Jónsson, 2018. However, these methods trade technical rigor for vast amounts of data, and so the ability to reproduce results becomes increasingly more important. Independently, the ability to reproduce results Peng, 2011Sandve et al., 2013 in all fields of computational research, and has spawned a veritable flock of methodological and programmatic advances Community et al., 2019, including the sophisticated provenance tracking of AiiDA Pizzi et al., 2016Huber et al., 2020.

Dataset bias

Dataset bias Engstrom et al., 2020Blum & Stangl, 2019Rahaman et al., 2019 has gained prominence in the machine learning literature, but has not yet percolated through to the chemical sciences community. At its core, the argument for dataset biases in generic machine learning problems of image and text classification, can be linked to the difficulty in obtaining labeled results for training purposes. This is not an issue in the computational physical sciences at all, as the training data can often be labeled without human intervention. This is especially true when simulations are carried out at varying levels of accuracy. However, this also leads to a heavy reliance on high accuracy calculations on “benchmark” datasets and results Hoja et al., 2021Senior et al., 2019.

Compute is expensive, and the reproduction of data which is openly available is often hard to justify as a valid scientific endeavor. Rather than focus on the observable outputs of calculations, instead we assert that it is best to be able to have reproducible confidence in the elements of the workflow. In the following sections, we will outline wailord, a library which implements a two level structure for interacting with ORCA Neese, 2012 to implement an end-to-end workflow to analyze and prepare datasets. Our focus on ORCA is due to its rapid and responsive development cycles, that it is free to use (but not open source) and also because of its large repertoire of computational chemistry calculations. Notably, the black-box nature of ORCA (in that the source is not available) mirrors that of many other packages (which are not free) like VASP Hafner, 2008. Using ORCA then, allows us to design a workflow which is best suited for working with many software suites in the community.

We shall understand this wailord from the lens of what is often known as a design pattern in the practice of computational science and engineering. That is, a template or description to solve commonly occurring problems in the design of programs.

Structure and Implementation

Python has grown to become the lingua-franca for much of the scientific community Oliphant, 2007Millman & Aivazis, 2011, in no small part because of its interactive nature. In particular, the REPL (read-evaluate-print-loop) structure which has been prioritized (from IPython to Jupyter) is one of the prime motivations for the use of Python as an exploratory tool. Additionally, PyPI, the python package index, accelerates the widespread disambiguation of software packages. Thus wailord is implemented as a free and open source python library.


Data generation involves set of known configurations (say, xyz inputs) and a series of common calculations whose outputs are required. Computational chemistry packages tend to be focused on acceleration and setup details on a per-job scale. wailord, in contrast, considers the outputs of simulations to form a tree, where the actual run and its inputs are the leaves, and each layer of the tree structure holds information which is collated into a single dataframe which is presented to the user.

Downstream tasks for simulations of chemical systems involve questions phrased as queries or comparative measures. With that in mind, wailord generates pandas dataframes which are indistinguishable from standard machine learning information sources, to trivialize the data-munging and preparation process. The outputs of wailord represent concrete information and it is not meant to store runs like the ASE database Larsen et al., 2017 , nor run a process to manage discrete workflows like AiiDA Huber et al., 2020.

By construction, it differs also from existing “interchange” formats as those favored by the materials data repositories like the QCArchive project Smith et al., 2021 and is partially close in spirit to the cclib endeavor O'boyle et al., 2008.


Two classes form the backbone of the data-harvesting process. The intended point of interface with a user is the orcaExp class which collects information from multiple ORCA outputs and produces dataframes which include relevant metadata (theory, basis, system, etc.) along with the requested results (energy surfaces, energies, angles, geometries, frequencies, etc.). A lower level “orca visitor” class is meant to parse each individual ORCA output. Until the release of ORCA 5 which promises structured property files, the outputs are necessarily parsed with regular expressions, but validated extensively. The focus on ORCA has allowed for more exotic helper functions, like the calculation of rate constants from orcaVis files. However, beyond this functionality offered by the quantum chemistry software (ORCA), a computational chemistry workflow requires data to be more malleable. To this end, the plain-text or binary outputs of quantum chemistry software must be further worked on (post-processed) to gain insights. This means for example, that the outputs may be entered into a spreadsheet, or into a plain text note, or a lab notebook, but in practice, programming languages are a good level of abstraction. Of the programming languages, Python as a general purpose programming language with a high rate of community adoption is a good starting place.

Python has a rich set of structures implemented in the standard library, which have been liberally used for structuring outputs. Furthermore, there have been efforts to convert the grammar of graphics Wilkinson & Wills, 2005 and tidy-data Wickham et al., 2019 approaches to the pandas package which have also been adapted internally, including strict unit adherence using the pint library. The user is not burdened by these implementation details and is instead ensured a pandas data-frame for all operations, both at the orcaVis level, and the orcaExp level.

Software industry practices have been followed throughout the development process. In particular, the entire package is written in a test-driven-development (TDD) fashion which has been proven many times over for academia Desai et al., 2008 and industry Bhat & Nagappan, 2006. In essence, each feature is accompanied by a test-case. This is meant to ensure that once the end-user is able to run the test-suite, they are guaranteed the features promised by the software. Additionally, this means that potential bugs can be submitted as a test case which helps isolate errors for fixes. Furthermore, software testing allows for coverage metrics, thereby enhancing user and development confidence in different components of any large code-base.

User Interface

The core user interface is depicted in Figure 1. The test suites cover standard usage and serve as ad-hoc tutorials. Additionally, jupyter notebooks are also able to effectively run wailord which facilitates its use over SSH connections to high-performance-computing (HPC) clusters. The user is able to describe the nature of calculations required in a simple YAML file format. A command line interface can then be used to generate inputs, or another YAML file may be passed to describe the paths needed. A very basic harness script for submissions is also generated which can be rate limited to ensure optimal runs on an HPC cluster.

Some implemented workflows including the two input YML files. VPT2 stands for second-order vibrational perturbation theory and Orca_vis objects are part of wailord’s class structure. PES stands for potential energy surface.

Figure 1:Some implemented workflows including the two input YML files. VPT2 stands for second-order vibrational perturbation theory and Orca_vis objects are part of wailord’s class structure. PES stands for potential energy surface.

Design and Usage

A simulation study can be broken into:

  • Inputs + Configuration for runs + Data for structures
  • Outputs per run
  • Post-processing and aggregation

From a software design perspective, it is important to recognize the right level of abstraction for the given problem. An object-oriented pattern is seen to be the correct design paradigm. However, though combining test driven development and object oriented design is robust and extensible, the design of wailord is meant to tackle the problem at the level of a domain specific language. Recall from formal language theory Aho & Aho, 2007 the fact that a grammar is essentially meant to specify the entire possible set of inputs and outputs for a given language. A grammar can be expressed as a series of tokens (terminal symbols) and non-terminal (syntactic variables) symbols along with rules defining valid combinations of these.

It may appear that there is little but splitting hairs between parsing data line by line as is traditionally done in libraries, compared to defining the exact structural relations between allowed symbols. However, this design, apart from disallowing invalid inputs, also makes sense from a pedagogical perspective.

For example, of the inputs, structured data like configurations (XYZ formats) are best handled by concrete grammars, where each rule is followed in order:

grammar_xyz = Grammar(
    meta = natoms ws coord_block ws?
    natoms = number
    coord_block = (aline ws)+
    aline = (atype ws cline)
    atype = ~"[a-zA-Z]" / ~"[0-9]"
    cline = (float ws float ws float)
    float = pm number "." number
    pm              = ~"[+-]?"
    number          = ~"\\d+"
    ws              = ~"\\s*"

This definition maps neatly into the exact specification of an xyz file:


H   -2.8   2.8   0.1
H   -3.2   3.4   0.2

Where we recognize that the overarching structure is of the number of atoms, followed by multiple coordinate blocks followed by optional whitespace. We move on to define each coordinate block as a line of one or many aline constructs, each of which is an atype with whitespace and three float values representing coordinates. Finally we define the positive, negative, numeric and whitespace symbols to round out the grammar. This is the exact form of every valid xyz file. The parsimonious library allows handling grammatical constructs in a Pythonic manner.

However, the generation of inputs is facilitated through the use of generalized templates for “experiments” controlled by cookiecutter. This allows for validations on the workflow during setup itself.

For the purposes of the simulation study, one “experiment” consists of multiple single-shot runs; each of which can take a long time.

Concretely, the top-level “experiment” is controlled by a YAML file:

project_slug: methylene
project_name: singlet_triplet_methylene
outdir: './lab6'
desc: An experiment to calculate singlet and triplet states differences at a QCISD(T) level
author: Rohit
year: '2020'
license: MIT
orca_root: '/home/orca/'
orca_yml: 'orcaST_meth.yml'
inp_xyz: 'ch2_631ppg88_trip.xyz'

Where each run is then controlled individually.

  active: True
  style: ['UHF', 'QCISD', 'QCISD(T)']
  calculations: ['OPT']
    - 6-311++G**
xyz: 'inp.xyz'
  - '0 1' # Singlet
  - '0 3' # Triplet
extra: '!NUMGRAD'
  molden: True
  chemcraft: True
jobscript: 'basejob.sh'

Usage is then facilitated by a high-level call.


The resulting directory tree can be sent to a High Performance Computing Cluster (HPC), and once executed via the generated run-script helper; locally analysis can proceed.

mdat = waio.orca.genEBASet(Path("buildOuts") / \
caption="CH2 energies and angles \
at various levels of theory, with NUMGRAD"))

In certain situations, ordering may be relevant as well (e.g. for generating curves of varying density functional theoretic complexity). This can be handled as well.

For the outputs, similar to the key ideas across signac, nix, spack and other tools, control is largely taken away from the user in terms of the auto-generated directory structure. The outputs of each run is largely collected through regular expressions, due to the ever changing nature of the outputs of closed source software.

Importantly, for a code which is meant to confer insights, the concept of units is key. wailord with ORCA has first class support for units using pint.

Dissociation of H2

As a concrete example, we demonstrate a popular pedagogical exercise, namely to obtain the binding energy curves of the H2 molecule at varying basis sets and for the Hartree Fock, along with the results of Kolos & Wolniewicz (1968). We first recognize, that even for a moderate 9 basis sets with 33 points, we expect around 1814 data points. Where each basis set requires a separate run, this is easily expected to be tedious.

Naively, this would require modifying and generating ORCA input files.


    R = 0.4, 2.0, 33 # x-axis of H1

*xyz 0 1
H    0.00   0.0000000    0.0000000
H    {R}    0.0000000    0.0000000

We can formulate the requirement imperatively as:

  active: True
  style: ['UHF', 'QCISD', 'QCISD(T)']
  calculations: ['ENERGY'] # Same as single point or SP
    - 3-21G
    - 6-31G
    - 6-311G
    - 6-311G*
    - 6-311G**
    - 6-311++G**
    - 6-311++G(2d,2p)
    - 6-311++G(2df,2pd)
    - 6-311++G(3df,3pd)
xyz: 'inp.xyz'
  - '0 1'
  - name: R
    range: [0.4, 2.00]
    points: 33
      xyz: True
      atype: 'H'
      anum: 1 # Start from 0
      axis: 'x'
extra: Null
jobscript: 'basejob.sh'

This run configuration is coupled with an experiment setup file, similar to the one in the previous section. With this in place, generating a data-set of all the required data is fairly trivial.

kolos = pd.read_csv(
    names=["bond_length", "Actual Energy"],
    sep=" ",

expt = waio.orca.orcaExp(expfolder=Path("buildOuts") / "h2")
h2dat = expt.get_energy_surface()

Finally, the resulting data can be plotted using tidy principles.

imgname = "images/plotH2A.png"
p1a = (
        data=h2dat, mapping=p9.aes(x="bond_length",
        y="Actual Energy",
    + p9.geom_point()
    + p9.geom_point(mapping=p9.aes(x="bond_length",
      y="SCF Energy"),
      color="black", alpha=0.1,
      shape='*', show_legend=True)
    + p9.geom_point(mapping=p9.aes(x="bond_length",
      y="Actual Energy",
    + p9.scales.scale_y_continuous(breaks
      = np.arange( h2dat["Actual Energy"].min(),
      h2dat["Actual Energy"].max(), 0.05) )
    + p9.ggtitle("Scan of an H2 \
      bond length (dark stars are SCF energies)")
    + p9.labels.xlab("Bond length in Angstrom")
    + p9.labels.ylab("Actual Energy (Hatree)")
    + p9.facet_wrap("basis")
p1a.save(imgname, width=10, height=10, dpi=300)

Which gives rise to the concise representation Figure 2 from which all required inference can be drawn.

Plots generated from tidy principles for post-processing wailord parsed outputs.

Figure 2:Plots generated from tidy principles for post-processing wailord parsed outputs.

In this particular case, it is possible to see the deviations from the experimental results at varying levels of theory for different basis sets.


We have discussed wailord in the context of generating, in a reproducible manner the structured inputs and output datasets which facilitate chemical insight. The formulation of bespoke datasets tailored to the study of specific properties across a wide range of materials at varying levels of theory has been shown. The test-driven-development approach is a robust methodology for interacting with closed source software. The design patterns expressed, of which the wailord library is a concrete implementation, is expected to be augmented with more workflows, in particular, with a focus on nudged elastic band. The methodology here has been applied to ORCA, however, the two level structure has generalizations to most quantum chemistry codes as well.

Importantly, we note that the ideas expressed form a design pattern for interacting with a plethora of computational tools in a reproducible manner. By defining appropriate scopes for our structured parsers, generating deterministic directory trees, along with a judicious use of regular expressions for output data harvesting, we are able to leverage tidy-data principles to analyze the results of a large number of single-shot runs.

Taken together, this tool-set and methodology can be used to generate elegant reports combining code and concepts together in a seamless whole. Beyond this, the interpretation of each computational experiment in terms of a concrete domain specific language is expected to reduce the requirement of having to re-run benchmark calculations.


R Goswami thanks H. Jónsson and V. Ásgeirsson for discussions on the design of computational experiments for inference in computation chemistry. This work was partially supported by the Icelandic Research Fund, grant number 217436052.

  1. Kohn, W. (1999). Nobel Lecture: Electronic Structure of Matter—Wave Functions and Density Functionals. Reviews of Modern Physics, 71(5), 1253–1266. 10.1103/RevModPhys.71.1253
  2. Schaefer, H. F. (1986). Methylene: A Paradigm for Computational Quantum Chemistry. Science, 231(4742), 1100–1107. 10.1126/science.231.4742.1100
  3. Meyer, R., Schmuck, K. S., & Hauser, A. W. (2019). Machine Learning in Computational Chemistry: An Evaluation of Method Performance for Nudged Elastic Band Calculations. Journal of Chemical Theory and Computation, 15(11), 6513–6523. 10.1021/acs.jctc.9b00708
  4. Dral, P. O. (2020). Quantum Chemistry in the Age of Machine Learning. The Journal of Physical Chemistry Letters, 11(6), 2336–2347. 10.1021/acs.jpclett.9b03664
  5. Schütt, K. T., Gastegger, M., Tkatchenko, A., Müller, K.-R., & Maurer, R. J. (2019). Unifying Machine Learning and Quantum Chemistry with a Deep Neural Network for Molecular Wavefunctions. Nature Communications, 10(1), 5024. 10.1038/s41467-019-12875-2