Abstract

The pandas library has become the de facto standard for data wrangling in the Python programming language. However, inconsistencies in the pandas application programming interface (API), though idiomatic through historical use, prevent the use of expressive, fluent programming idioms that enable self-documenting pandas code. Here, we introduce pyjanitor, an open source Python package that extends the pandas API with such idioms. We describe the design and implementation of the package, provide usage examples from a variety of domains, and discuss the ways that the pyjanitor project has enabled the inclusion of first-time contributors to open source projects.

Keywords: data engineering, data science, data cleaning