"Big data is not a substitute for common sense, economic theory, or the need for careful research designs. Nonetheless, there is little doubt in our own minds that it will change the landscape of economic research." (Einav and Levin, 2014, Science)
Human cognition has developed through a continuing race between data and model, and so has economic research, as Tom Sargent explained at the Annual Conference of Luohan Academy in 2019. On the one hand, data collect observations and facts about the world; on the other hand, models and theories structure our understanding of, and reflections on, the world. We reflect according to observations, while observing under the guidance of reflections. In this way we keep refining our knowledge and intelligence. The past decades have brought ever more data, along with faster and cheaper computation and storage. But what about the other side of the race? How do our economic minds respond to the big data revolution?
More work on data than pure theory
A quick and obvious response by economists has been a shift toward more data-related work, resulting in a rise of empirical economic studies. Daniel Hamermesh, a U.S. economist at Columbia University, reviewed the publications from 1963 to 2011 in top economic journals. Until the mid-1980s, the majority of papers were theoretical; since then, the share of empirical papers in top journals has climbed to more than 70%. More recently, three economists from Princeton University, Janet Currie, Esmee Zwiers and Henrik Kleven, found a similar trend by examining NBER working papers from 1980 to 2018 and "Top 5" economic journal papers from 2004 to 2019: the fraction of applied microeconomics papers alone has increased from 55-60% to about 75%.
Big data in the eyes and hands of economists
This shift probably mirrors the expansion of available data, as noted by Liran Einav and Jonathan Levin, two economists from Stanford University "who happen to live and work in the epicenter of the data revolution, Silicon Valley". They describe what is new about big data from the perspective of economists.
The three V's of big data mean that economists can get data faster, with greater coverage and scope, and with far higher frequency, dimensionality and granularity, at levels that previously were not even measurable or observable. More data and new data can be good news: economists get more information with which to test their understanding and guesses, or, put formally, their theories and hypotheses. Most straightforwardly, large numbers of observations make the statistical power of empirical tests much less of a concern.
But more importantly, the novel and detailed data allow economists to zoom in to decode the micro-level "black boxes" of how individuals, firms and markets operate, and to zoom out to provide more timely and versatile pictures of the macroeconomy and aggregate economic activity. Granular data, such as personal communications, search logs, geolocations, and consumer behavior before and after purchase, open the door to exploring issues that economists have long viewed as important but had no good way to study empirically: for instance, the role of social connections and geographic proximity in shaping preferences, or the detailed behavior that reveals consumers' decision-making processes. The Billion Prices Project (BPP) at MIT coordinates with Internet retailers to aggregate daily prices and detailed product attributes on hundreds of thousands of products, and produces a daily price index that closely replicates the monthly published CPI series.
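The BPP's published methodology is more involved, but the core idea of chaining a daily index from matched item-level price relatives can be sketched in a few lines. This is a minimal illustration, not the BPP's actual procedure: the function name, the data layout, and the choice of a Jevons-style geometric mean of price relatives are all assumptions made here for clarity.

```python
from math import exp, log

def chained_daily_index(prices_by_day):
    """Chain a daily price index from item-level prices.

    prices_by_day: list of dicts mapping item id -> price, one dict per day.
    Each day-to-day step is the geometric mean of price relatives
    (a Jevons-type index) over items observed on both days.
    """
    index = [100.0]  # normalize the first day to 100
    for prev, curr in zip(prices_by_day, prices_by_day[1:]):
        common = [i for i in curr if i in prev and prev[i] > 0]
        if not common:
            index.append(index[-1])  # no matched items: carry the index forward
            continue
        step = exp(sum(log(curr[i] / prev[i]) for i in common) / len(common))
        index.append(index[-1] * step)
    return index

# Hypothetical scraped prices for two products over three days.
days = [
    {"milk": 1.00, "bread": 2.00},
    {"milk": 1.10, "bread": 2.00},  # milk up 10%
    {"milk": 1.10, "bread": 2.20},  # bread up 10%
]
print(chained_daily_index(days))
```

In practice the matched-item step is what lets scraped online prices, with products constantly entering and exiting, be aggregated into a continuous daily series.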
The universal coverage of big data also increases the representativeness of research samples and naturally resolves classical empirical issues such as selection bias and limited variation in the variables of interest. Sometimes economists can literally work on a population instead of a sample, because administrative data such as tax records, hospital admission records and credit bureau records may cover all individuals of interest. For example, Thomas Piketty and Emmanuel Saez, two economists focusing on inequality, are able to use tax records to calculate income and wealth shares for the very top of the income distribution, which was traditionally problematic because of small survey sample sizes, under-reporting of high incomes, and so on. The availability of population data also makes random sampling much easier when samples are needed.
Despite the various benefits, big data also brings economists as many challenges, if not more, though these may be regarded as opportunities from an optimistic perspective. Economists may get lost in the information overload: a single sequence of events or economic behaviors can be described in an enormous number of ways, from which an almost unlimited number of variables can be created. Figuring out how to organize and reduce the dimensionality of large-scale data is becoming a crucial challenge. Traditionally, a few summary statistics and several graphs were enough to give us some sense of what a data set looks like. We were comfortable living with limited snapshots, despite the parable of the blind men and the elephant and the allegory of Plato's Cave, because we believed that was the best we could do. In the world of big data, however, we can take unlimited snapshots from various positions and angles. The problem is how to organize and analyze them.
Liran Einav and Jonathan Levin give the example of Internet browsing history. It contains a great deal of information about a person's interests and beliefs, and how they evolve over time. But how can we extract this information? A related issue, the unstructured or complexly structured nature of big data, exacerbates the problem. The workhorse of empirical economists is still the traditional econometric toolbox of 15-20 years ago, which works with "rectangular" data sets: N observations, K<<N variables per observation, and a relatively simple dependence structure between the observations. Now economists have to deal with much higher dimensions and less clear, more complex structures. For example, individuals in a social network may be interconnected in highly complex ways, and the point of econometric modeling is to uncover exactly what the key features of this dependence structure are, even before, or at the same time as, any other meaningful exploration is carried out.
The two economists from Stanford University stress the role of economic theory in dealing with this problem. They believe one way to deal with high dimensionality and complexity is to seek insights from simpler organizing frameworks, i.e., economic models or theories. Their solution is probably not new, as theories were created as simplifications of complex real-world phenomena in the first place. The importance of economic theory, as they pointed out, has already been seen in some applied settings. For example, running online auction markets requires an understanding of both big data predictive modeling and economic theory about sophisticated auction mechanisms. Many large tech companies have built economic teams along with statisticians and computer scientists. Indeed, "the richer the data, the more important it becomes to have an organizing theory to make any progress."
Trends in empirical economic methods, seen through text mining
Though economists today may still routinely analyze big data, or more accurately, large data sets drawn from big data, with the same econometric methods used 15-20 years ago, these methods are adapting to the big data revolution, sooner or later and at least to some extent. The three economists from Princeton University mentioned earlier tried to depict the trend by analyzing a data set constructed by mining the text of NBER working papers and publications in top economic journals.
Overall, they find a trend in economic methods towards greater credibility and transparency, echoing the Credibility Revolution in empirical economics described by Joshua Angrist and Jörn-Steffen Pischke in 2010. Back in the 1980s, several famous economists, such as Edward Leamer, David Hendry, Robert Lucas and Christopher Sims, doubted the reliability of the data analysis of the time. Initiatives to "take the con out of econometrics" have been an urge and a pursuit ever since.
Thanks to such pursuit and to the availability of big data, empirical microeconomics has probably been experiencing a credibility revolution, with a particular focus on the quality of empirical research designs. The doubts about empirical studies in the old days were mainly about the credibility of seemingly significant causal claims drawn from simple statistical inference, which underscores the importance of cleanly identifying the effect of one variable on another. According to the text mining study of economic papers, the fraction of papers explicitly referring to "identification" has risen from 4% in 1980 to 50% in 2018. Specific identification concerns, such as omitted variables, selection bias and reverse causality, are also trending up.
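The paper-counting exercise behind numbers like these is conceptually simple: for each year, compute the share of papers whose text mentions a given phrase. The sketch below uses a tiny hypothetical corpus, not the authors' actual data or code.

```python
import re
from collections import defaultdict

def phrase_trend(corpus, phrase):
    """Fraction of papers per year whose text mentions a phrase.

    corpus: iterable of (year, full_text) pairs.
    Returns {year: fraction of that year's papers containing the phrase}.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    totals, hits = defaultdict(int), defaultdict(int)
    for year, text in corpus:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {y: hits[y] / totals[y] for y in sorted(totals)}

# Hypothetical mini-corpus of paper texts.
papers = [
    (1980, "We estimate a reduced-form model of labor supply."),
    (1980, "Our identification strategy exploits a policy change."),
    (2018, "Identification comes from a regression discontinuity."),
    (2018, "The identification assumptions are discussed below."),
]
print(phrase_trend(papers, "identification"))  # {1980: 0.5, 2018: 1.0}
```

The real study works with tens of thousands of full texts and curated phrase lists, but the trend statistic itself is just this per-year share.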
Requiring cleaner identification leads to a rise in experimental and quasi-experimental methods, such as Randomized Controlled Trials (RCTs), lab experiments, difference-in-differences, regression discontinuity and event studies. The use of experiments generates a new critique of external validity: it is questionable whether results from experiments still hold beyond the experimental setup. To deal with this issue, economists supplement the treatment effect estimates of experimental methods with evidence on mechanisms; that is, they need to explain and show through which channels the causal effects identified in experiments operate. Correspondingly, the fraction of papers mentioning "external validity" and discussing "mechanisms" rises sharply. Most strikingly, the fraction of top journal papers with mechanism discussions exceeds 70% in 2019.
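For the simplest of these designs, the canonical two-group, two-period difference-in-differences estimate is just a difference of mean changes: the treated group's pre-to-post change minus the control group's, which nets out the common time trend. A minimal sketch with hypothetical data and function names; real applications use regressions with controls and clustered standard errors.

```python
def did_estimate(outcomes):
    """Canonical 2x2 difference-in-differences: the treated group's
    pre-to-post change in mean outcomes minus the control group's."""
    def mean(xs):
        return sum(xs) / len(xs)
    treated_change = (mean(outcomes[("treated", "post")])
                      - mean(outcomes[("treated", "pre")]))
    control_change = (mean(outcomes[("control", "post")])
                      - mean(outcomes[("control", "pre")]))
    return treated_change - control_change

# Hypothetical outcomes, keyed by (group, period).
data = {
    ("treated", "pre"): [10, 12], ("treated", "post"): [15, 17],
    ("control", "pre"): [8, 10], ("control", "post"): [9, 11],
}
print(did_estimate(data))  # 4.0: a 5-unit treated change minus a 1-unit control trend
```

The identifying assumption, parallel trends, is exactly the kind of claim that the mechanism and external-validity discussions described above are meant to defend.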
There is some trace of big data in the text mining study as well, though limited and only recent. The fraction of papers using binscatter plots to visualize big data, or mentioning "machine learning" and "text analysis", started to increase around 2012 but still remained below 1-2 percent by 2019. However, techniques such as binscatter plots, machine learning and text analysis are certainly not the core of how economists are expected to empower economic methods with big data. Rather, we expect a revolutionary combination of economic theories, econometric techniques and big data processing skills that interact with and complement each other.
Theories, statistics, and computer science technologies: perhaps it is the nexus of the three that will shape both economics and big data, as well as the application of big data in many other fields. As in every step of the progress of human wisdom, we believe, the solutions for making better use of big data with the help of economic theories are, and will be, on their way.
Further reading
A review of the challenges in accessing and using these new data, and a discussion of how new data sets may change the statistical methods used by economists and the types of questions posed in empirical research.
An analysis of methodological changes in applied microeconomics by plotting the time series of methods-related words and phrases since 1980 (for NBER working papers) and 2004 (for top-five papers).
An introduction to non-traditional information sources and data analysis methods.
A resource introducing how machine learning earns its place in the econometric toolbox.
A detailed description of the work of the Billion Prices Project at MIT, and the key lessons for both inflation measurement and some fundamental research questions in macro and international economics.
A case study constructing long-run time series of top income shares for more than 20 countries using income tax statistics.
An analysis of all full-length articles in the three top general economic journals for one year in each decade from the 1960s to the 2010s, and the changing patterns of co-authorship, age structure and methodology.
Nine trends in the publication of top economic journals since 1970.
Angrist, Joshua D., and Jörn-Steffen Pischke. "The credibility revolution in empirical economics: How better research design is taking the con out of econometrics." Journal of Economic Perspectives 24.2 (2010): 3-30.
A review of the key factors that have improved empirical work, and of how research design has moved front and center in much of empirical microeconomics.