Using Natural Language Processing (NLP) Technologies to Understand Consumer Search Behavior

Search is a fundamental topic in economics. Consumers need to search in order to gather information about products available in the markets; individuals also need to search for potential job vacancies to find work.

Traditionally, without the ability to observe either consumers' search process or the set of products they consider when making a purchase (the “consideration set”), economists have been limited to making inferences from the distribution of equilibrium prices and quantities across products, while maintaining strong assumptions about the search process consumers follow (Hong and Shum 2006). Some recent work has explored using the higher-order curvature of consumer demand to recover what demand would look like without search frictions, in turn backing out the impact of search (Abaluck and Compiani 2020). However, strong restrictions remain on what the econometrician needs to know about product characteristics as well as on consumer search behavior.

With rapid mass digitization, one important recent development in consumer search is the availability of large amounts of digitized search and textual data, along with the development and use of natural language processing (NLP) technologies in search algorithms.

Over a few blog posts, I will give an overview of recent advances in how economists have used digital textual data and attempted to capture the impact of NLP technologies.

Today's post provides an overview. First, what is distinctive about using digitized text as data? For one, any representation of text is inherently high-dimensional. Imagine this blog post having n words drawn from a universe of N possible words. Unless we use the context in which each word occurs to rule some sequences out, there are N^n possible blog posts. “A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.” (Gentzkow, Kelly, and Taddy 2019)
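To make the N^n count concrete, here is a minimal sketch of the arithmetic behind the quoted example (N = 1,000 words, n = 30 words per message). The function name is my own, chosen for illustration; the only assumption is that a document is treated as an ordered sequence of words with repetition allowed.

```python
def count_possible_documents(vocab_size: int, doc_length: int) -> int:
    """Count ordered word sequences of length doc_length drawn from
    a vocabulary of vocab_size words (repetition allowed): N^n."""
    return vocab_size ** doc_length


# Values taken from the quoted example: thirty-word messages
# built from the one thousand most common English words.
N = 1000
n = 30

total = count_possible_documents(N, n)

# Python integers have arbitrary precision, so the count is exact.
# The order of magnitude is the number of decimal digits minus one.
order_of_magnitude = len(str(total)) - 1

print(f"{N}^{n} = 10^{order_of_magnitude}")
```

Because 1,000 = 10^3, the count is 10^90. Any method that tried to treat each possible message as its own category would run out of data long before it ran out of categories.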

(1) Reducing the dimensionality of textual data and (2) using methods that can deal with high-dimensional data are thus crucial. Economic studies have thus far focused on causal analyses that use predictions generated by these two steps. For instance, Scott and Varian (2015) condense Google search data to “nowcast” important economic variables such as the unemployment rate using Bayesian time-series models; many studies in finance predict short-term movements in stock prices from text data such as news media content (Tetlock 2007) or Twitter feeds; and Baker and Fradkin (2017) use Google search data as a proxy for job search intensity in their study of the impact of unemployment insurance generosity on job search activity.
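As a rough illustration of step (1), one common way to shrink the representation is a bag-of-words count: ignore word order and record only how often each vocabulary word appears. The sketch below is not the method of any study cited above; the function name and the toy search queries are made up for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Map each document to a vector of word counts.

    Word order is discarded, so each document is described by one
    number per vocabulary word rather than by its full sequence.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct word seen, in a fixed (sorted) order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical search queries, invented for this example.
queries = [
    "jobs near me",
    "unemployment benefits",
    "jobs jobs hiring near me",
]

vocab, X = bag_of_words(queries)
print(vocab)
for row in X:
    print(row)
```

Each query is now a row with one column per vocabulary word, which is the kind of input that the high-dimensional methods in step (2) can take. The cost is that "jobs near me" and "me near jobs" become indistinguishable.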

In the next blog post, we will zoom in on each of the two steps in several studies to better understand the role of textual data.



Abaluck, Jason, and Giovanni Compiani. 2020. “A Method to Estimate Discrete Choice Models That Is Robust to Consumer Search.” Working Paper Series. National Bureau of Economic Research.

Baker, Scott R., and Andrey Fradkin. 2017. “The Impact of Unemployment Insurance on Job Search: Evidence from Google Search Data.” The Review of Economics and Statistics 99 (5): 756–68.

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.

Hong, Han, and Matthew Shum. 2006. “Using Price Distributions to Estimate Search Costs.” The Rand Journal of Economics 37 (2): 257–75.

Scott, Steven L., and Hal R. Varian. 2015. “Bayesian Variable Selection for Nowcasting Economic Time Series.” In Economic Analysis of the Digital Economy, chap. 4.

Tetlock, Paul C. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” The Journal of Finance 62 (3): 1139–68.
