Abstract

Most areas of Python data science have standardized on using Pandas DataFrames for representing and manipulating structured data in memory. Natural Language Processing (NLP), not so much.

We believe that Pandas has the potential to serve as a universal data structure for NLP data. DataFrames could make every phase of NLP easier, from creating new models, to evaluating their effectiveness, to building applications that integrate those models. However, Pandas currently lacks important data types and operations for representing and manipulating crucial types of data in many of these NLP tasks.

This paper describes Text Extensions for Pandas, a library of extensions to Pandas that make it possible to build end-to-end NLP applications while representing all of the applications’ internal data with DataFrames. We leverage the extension points built into the Pandas library to add new data types, and we provide important NLP-specific operations over these data types, as well as integrations with popular NLP libraries and data formats.

Keywords: natural language processing, Pandas, DataFrames