In this paper, we present an analysis of 59,000 OkCupid user profiles that examines online self-presentation by combining natural language processing (NLP) with machine learning. We analyze word usage patterns by self-reported sex and drug usage status. In doing so, we review standard NLP techniques, cover several ways to represent text data, and explain topic modeling. We find that individuals in particular demographic groups self-present in consistent ways. Our results also suggest that users may unintentionally reveal demographic attributes in their online profiles.

Keywords:natural language processingmachine learningsupervised learningunsupervised learningtopic modelingokcupidonline dating