Social media traces: the end of opinion polls?

In 2008 Chris Anderson published the article “The end of theory”, which caused great agitation in the scientific community. In it, Anderson stated that the enormity of the data available today has made the scientific method of hypothesising, modelling and testing obsolete. His main argument was that with enough data – that is, with the quantity of digital data available today – the numbers speak for themselves. His article welcomed a new era of theory-free science, in which finding patterns in data – correlations between variables – would be the new way of doing science. Since Anderson, numerous voices have proclaimed the data deluge a turning point for science.

In the field of public opinion research, voices claiming the obsolescence of surveys have not been absent. In their article “The Coming Crisis of Empirical Sociology”, Savage and Burrows (2007) stress the historicity of sample surveys. They explain that the relative pre-eminence this research tool enjoyed in the second half of the 20th century was due to the fact that, given the technology of the time, statistical inference was the most cost-efficient method. Indeed, at a moment when the storage, manipulation and processing of data were extremely expensive, statistical inference from a representative sample-based survey was the only way to predict characteristics of an entire population without conducting a census. In today’s world, however, surveys are losing their cost-effectiveness, for two reasons. First, society is no longer the same: response rates to surveys have been falling, and it is consequently becoming increasingly expensive to make statistical inferences. Second, with the ever-increasing digitalisation of human existence, the enormity of available data documenting so much of our lives offers a much cheaper alternative to surveys. Why invest in designing questionnaires and asking people questions – why even produce new data – when there are enormous quantities of data giving far more detail about people’s behaviours, interests and opinions than any questionnaire could ever collect?

As promising as these massive quantities of data may seem, the complete replacement of surveys by Big Data analytics nonetheless seems quite unlikely, at least in the coming years. This article examines the challenges that the analysis of social media traces poses to the study of opinion, arguing that the end of polls is not yet in sight.

A new society demands new methods

When the Internet was a novelty, as Dillman and colleagues (2014) explain, great optimism was shared among surveyors: reaching people would become much easier, faster and cheaper, and response rates would consequently increase. In the field of market research in particular, surveyors started very early to digitalise their methods, and surveys began to be conducted online.

Something in these predictions, however, went wrong. Response rates to survey questionnaires have not increased as expected, which compromises the capacity of these data sources to support statistical inferences about the entire population being studied. What went wrong was a failure to grasp the true nature of the changes introduced by the Internet, and thus to foresee the full impact it would have on society.

To explain the profound impact of the Internet, Boris Beaude (2012) argues that its invention must be understood as the creation of a new place. The Internet made it possible, for the first time in the history of humankind, for all of us to interact in the same, common place. The significance of this accomplishment is such that Beaude introduces a new term to refer to this process. If what humanity achieved with radio and television was to share information at the same moment in time – which we call synchronisation (from ‘chronos’, the Greek word for time) – what the Internet enabled, the creation of a common place to interact, Beaude names “synchorisation”, from ‘chôra’, the Greek word for space. The difficulty in comprehending the Internet as a place is that we share a materialistic conception of space. The notion of ‘chôra’, by contrast, understands space as a matter of distances – a relativistic and relational conception of space. By creating a common place to interact, the Internet reduces the relative distances between places that are geographically apart.

By reducing distances, the Internet has redefined the way we interact and, ultimately, transformed our social relations. Indeed, synchorisation has made possible interactions that were simply unimaginable in the past. And, as Beaude (2012) points out, such transformations in the way we relate to one another end up transforming our entire society:

“To change space is to touch what is most intimate in the social: the relation. To change space is to change society.” (Beaude, 2012, translated from the French)

These societal changes have had such profound consequences that the way public opinion has traditionally been studied no longer seems adapted to the task. By reducing distances, the Internet has accelerated the rhythm of life. As Dillman and colleagues (2014) explain, interactions have become much faster and shorter. Moreover, they argue, if the new technology has facilitated communication, this has in turn increased the amount of correspondence and, ultimately, made it socially acceptable to ignore much of this mass of mail. The consequence: falling response rates and growing limitations on making inferences.

The study of opinion in the Internet era requires a transformation of research methods. Some would argue that opinion polls are witnessing the end of their days. Let us review the promising opportunities that using social media traces, instead of survey data, might bring to the study of society.

Opportunities in social media traces

An increasing part of our immaterial existence is happening online, and is thus being documented in the form of data. Social media websites and apps such as Facebook, Twitter, Instagram or Reddit have become new meeting points where more and more of our social interactions take place. These platforms record our demographics, pictures, social networks, the events we attend, the likes we give, the public figures we follow. This enormous amount of natively digital data, permanently created as a by-product of our activity online, presents unprecedented opportunities for the field of public opinion research.

A first enormous advantage of social media traces is that they are data that already exist and that, in many cases, can be accessed for free. Richard Rogers (2013) speaks of repurposing digital traces for the study of society. He presents an example of how data from social media could be used to infer political preferences, by means of postdemographics. He defines this as the study of the data in social networking platforms, “in which the interest has shifted from the traditional demographics of race, ethnicity, age, income, and educational level – or derivations thereof such as class – to tastes, interests, favourites, groups, accepted invitations, installed apps, and other information that comprises an online profile and its accompanying baggage” (p. 154). As an example of this kind of study, in 2008 and 2009 Rogers and colleagues compared the profile content of Barack Obama’s and John McCain’s friends on the social media platform MySpace. The aim was to determine whether the friends of the two presidential candidates had distinctive profiles. They found common patterns in the movies, music, books and heroes of each of the two groups. In Rogers’s words, “Here one may begin a practice of inferring political preference from TV shows and other ‘favorite’ media” (p. 158). This example shows how digital traces make it possible to conduct opinion research without large budgets.

A second characteristic of social media traces is that they “offer near-real-time records of phenomena, and [are] highly granulated temporally” (Japec et al., 2015: 848). In the same way that Google Flu Trends aimed to monitor cases of flu with Google search data, numerous researchers now claim that Twitter data can be used to measure political support (e.g. Tumasjan et al., 2010). Daniel Gayo-Avello (2011) uses the phrase “predicting the present” for the idea that the analysis of social media could offer “a glimpse on the collective psyche” almost in real time. Lazer and Radford (2017) call this research technique “nowcasting”. As these authors point out, many contemporary institutions have run permanent survey programmes to monitor social phenomena. Compared with surveys, the enhanced temporal granularity of nowcasting with social media traces promises important improvements in timeliness and cost for these monitoring purposes.

A third very important feature of social media traces is their quantity. The number of cases is so much higher that analyses can reach a level of detail unattainable with traditional data sources such as surveys. So many cases allow researchers, for example, to study the tails of a distribution, or selected subgroups, which in traditional data sources tend to be too rare for any analysis to be possible (Japec et al., 2015: 848). Twitter, for example, has made possible the study of traditionally underrepresented populations, such as people suffering from PTSD, suicidal ideation or depression (Lazer & Radford, 2017).

A last, but no less important, characteristic of social media traces is that they are behavioural data. This implies a big improvement over the way behavioural data are collected in surveys: via self-reports. Social media traces may outperform survey data, first, because they do not depend on limited human memory, and second, because they are free from the influence of the survey situation, such as the tendency to give socially desirable answers. As Lazer and Radford (2017) point out, certain types of information are inaccessible via self-reports, simply because people would systematically lie about them. Social media traces thus present an important opportunity for the study of actual behaviour.

Limitations of social media traces

Although it is undeniable that social media traces present unprecedented opportunities for the study of public opinion and of society in general, a number of aspects of these data require our attention. A review of these aspects will make clear that social media traces are still far from being able to completely replace surveys in the role they have played over the last decades.

Access and format of data

Let us start with the more practical problems. A first point is that access to social media data is not as straightforward as it may seem. To begin with, there are multiple social media platforms, each with its own terms and conditions for accessing its data. There seems to be a predilection among researchers for Twitter data. Indeed, as Lazer and Radford point out, “Twitter has become to social media scholars what the fruit fly is to biologists – a model organism” (2017: 29). This platform is chosen over others precisely because its data are relatively easy to access.

However, studying public opinion on the basis of Twitter data alone may be problematic, to the extent that patterns of behaviour differ across platforms. It seems likely that people use more than one platform at the same time precisely because each platform is linked to a different logic of interaction, and/or to a different group of people to interact with. If combining data from different platforms seems the path to follow, a new obstacle arises: the heterogeneity of data formats. Likes? Dislikes? Tweets? Shares? Pictures? Videos? How to combine them all?

Analytical methods

Another practical problem concerns the analytic tools used to study these massive sources of data. Although data from social media come in a broad variety of formats, most analysis is done on textual data, such as tweets. A popular method for analysing these data is text mining. A very common use of lexicon-based approaches is to classify tweets according to the sentiment, positive or negative, they evoke. These algorithms rely on machine learning and natural-language-processing technology. Despite their sophistication, the accuracy and reliability of these technologies still require further development. Some examples of the inaccuracies these kinds of algorithms can produce are wonderfully displayed by David Auerbach in his article “If Only AI Could Save Us from Ourselves” (MIT Technology Review, December 13, 2016).
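To make concrete what a lexicon-based classifier does, and why it is fragile, the following is a minimal sketch using a toy word list of my own (not any real sentiment lexicon): each word of a text is checked against a positive and a negative list, and the sign of the tally decides the label. The negation example shows one well-known failure mode of this approach.

```python
# Minimal lexicon-based sentiment scoring. The word lists are toy examples,
# invented for illustration; real lexicons contain thousands of weighted entries.
POSITIVE = {"good", "great", "love", "win", "support"}
NEGATIVE = {"bad", "terrible", "hate", "lose", "fail"}

def sentiment(text: str) -> str:
    """Label a short text by counting lexicon hits: positives minus negatives."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("great win for the candidate"))   # positive
print(sentiment("I hate this terrible debate"))   # negative
# Classic failure: negation is invisible to pure word counting.
print(sentiment("not good at all"))               # positive (wrongly)
```

The last call illustrates the kind of inaccuracy discussed above: since the method sees only isolated words, “not good” scores as positive, and irony or sarcasm is entirely out of reach.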

A critical problem with the use of Big Data is that analysing them largely consists in finding patterns in data, or looking for correlations between variables. As Chris Anderson (2008) put it, when enough data are available, “numbers speak for themselves”. This assertion, however, can be very wrong. Japec et al. (2015) explain how, because of their massive size and dimensionality, Big Data analytics can lead to problematic discoveries. If the analysis is conducted on a great number of variables at the same time, the most salient common characteristics are very likely to be drowned out by all the other differences observed across the remaining variables. This is what Japec et al. (2015) call the problem of “noise accumulation”. Likewise, when looking for correlations between variables, the massive size of the dataset means that some correlations will be found simply by chance, resulting in false discoveries, or “spurious correlations”. These are just some of the problems that may occur when following a theory-free approach to data analysis.

Representativity and statistical inference

Who are social media users? Beyond the part of the population that does not use the Internet at all, not every online user is active on social media. According to Murphy and colleagues (2014), the rate of social media use differs most significantly by age group: among 18-29-year-olds, nine out of ten use social-networking sites, whereas less than half of the 65+ population does. The possibility of using these data to represent the general population is still remote.

Moreover, these data sources very often contain repeated information about particular individuals. Lazer and Radford use the following analogy: “these near-census projects are like sending millions of people out into the streets to count the population of the United States. You would count a large number of people, but you could not know the kinds of people who are counted twice or not counted at all, and therefore you could not know what kinds of people are over- or underrepresented and to what extent.” (2017: 29)

Another problem arises from the ideal-user assumption (Lazer & Radford, 2017). Much Big Data comes from social media networks, and researchers assume that all the data come from prototypical users. Very often, however, these “users” deviate from this ideal type. Many are not even humans, but organisations or even robots. Others hide their true identity and are not the persons they claim to be. Moreover, many people have multiple accounts for different purposes. In the end, it is unclear who is truly represented in these data sources.

The impossibility of knowing the coverage of these data has important consequences for the quality of predictions. Quantity of data is not everything. An excellent lesson from the first days of survey research comes from the US presidential election of 1936. The Literary Digest, a magazine, conducted a survey of 2.4 million people. Despite the enormous sample, its predictions turned out to be wrong. George Gallup, using a much smaller sample, forecast the results with great precision. As Tim Harford (2014) explains, Gallup’s results were so much more accurate because his sample was unbiased. The Literary Digest, on the contrary, had surveyed a sample of disproportionately prosperous persons, who were, in addition, more likely to mail back their questionnaires. This early experience taught pollsters that selection bias could be a much bigger problem than sampling error – in other words, that the representativity of a sample often matters more than its size.
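The Literary Digest lesson is easy to reproduce in a stylised simulation. The numbers below are invented for illustration (they are not the historical 1936 figures): a population where a prosperous minority supports the candidate far less than everyone else, a huge sample drawn only from that minority, and a small random sample drawn from everyone.

```python
# Selection bias vs sample size: a stylised re-run of the 1936 episode.
# All proportions are hypothetical, chosen only to make the mechanism visible.
import random

random.seed(1936)

# Population: 30% "prosperous" voters supporting candidate A at 35%,
# 70% others supporting A at 70% (true support is therefore about 0.595).
population = []
for _ in range(200_000):
    prosperous = random.random() < 0.30
    supports = random.random() < (0.35 if prosperous else 0.70)
    population.append((prosperous, supports))

true_share = sum(s for _, s in population) / len(population)

# Huge but biased sample: drawn only from the prosperous frame (the Digest's error).
prosperous_pool = [s for p, s in population if p]
biased_sample = random.sample(prosperous_pool, 50_000)
biased_share = sum(biased_sample) / len(biased_sample)

# Small but unbiased sample: 1,000 people drawn at random from everyone (Gallup's logic).
small_sample = random.sample(population, 1_000)
small_share = sum(s for _, s in small_sample) / len(small_sample)

print(f"true: {true_share:.3f}  biased n=50,000: {biased_share:.3f}  "
      f"random n=1,000: {small_share:.3f}")
```

The 50,000-person biased sample misses the true share by over twenty percentage points, while the 1,000-person random sample lands within a couple of points: representativity beats size.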

Operationalisation and validity of concepts

The last, and perhaps most important, problem with using social media traces arises from their nature as “found” or “organic” data sources, in contrast to the “designed” data produced by surveys. Big Data are by-products of some kind of interaction that are registered as variables and stored. Much of the information contained in these data sources is therefore not what social scientists would have chosen, or designed, for their specific research question. The most important question researchers using digital traces should ask themselves is: what are these data actually measuring? Franco Moretti (2013) stresses the importance of operationalising – of building a bridge from concepts to measurements. Many variables can only be used as a proxy for the actual measurement objective, and many others are simply of no use. Jungherr and colleagues (2017) found evidence that common ways of operationalising political support with Twitter data rather measure another concept: attention toward politics.

The end of opinion polls?

With the rise of Big Data analytics, the positivist approach gained ground, welcoming a science free of theory. This paradigm is based on the idea that science must enable prediction; the causality of phenomena would be out of the scope of science. From the positivist point of view, as long as correlations enable prediction, Big Data analytics are succeeding.

The replacement of opinion polls by social media traces in the field of public opinion research would follow the same positivist logic: if social media traces enable prediction, surveys would no longer be needed. This article, however, has exposed a series of problems inherent to the analysis of social media traces that are likely to limit the reliability of predictions made from these data.

The falling response rates of surveys are a clear signal that traditional survey methods have to adapt to the new society of the Internet era. But rather than a replacement of these traditional methods by social media traces, a combination of old and new methods seems to be occurring.

Kitchin’s (2014) ideas go in the same direction. Rather than a return to empiricism, he postulates that a new paradigm is emerging: data-driven science. Unlike positivism, this new paradigm does not claim the end of theory. Far from it, it constitutes a hybrid combination of abductive, inductive and deductive approaches. The idea is to use Big Data analytics (induction) as a first step, in order to formulate new hypotheses, and then, in a second step, to follow the deductive approach to test them. Data-driven science thus still relies on theory.

A similar approach can be imagined for the future of opinion research. As Lazer and colleagues point out, “Big data offer enormous possibilities for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables” (2014: 1205). To this, Murphy and colleagues (2014) add that one of the major advantages of social media data is the possibility of studying individuals’ social networks. Social media traces undoubtedly represent an excellent new source of knowledge for studying our societies. Nevertheless, these data sources cannot replace opinion polls, at least not in the near future.

References

Anderson, Chris. 2008. “The end of theory”, in: Wired, June 23, 2008.

Auerbach, David. 2016. “If Only AI Could Save Us from Ourselves”, in: MIT Technology Review, December 13.

Beaude, B. 2012. Internet. Changer l’espace, changer la société (introduction, conclusion and introductions to chapters). Limoges : FYP. Available online: http://www.beaude.net/icecs/

Burrows, Roger, & Mike Savage. 2014. “After the crisis? Big data and the methodological challenges of empirical sociology”, in: Big data and Society, 1(1), 1-6.

Dillman, Don A., Jolene D. Smyth, & Leah Melani Christian. 2014. Internet, phone, mail, and mixed-mode surveys. The tailored design method. New York: Wiley.

Gayo-Avello, Daniel, Panagiotis T. Metaxas, & Eni Mustafaraj. 2011. “Limits of Electoral Predictions using Social Media Data”, in: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.

Harford, Tim. 2014. “Big data: are we making a big mistake?”, in: Financial Times Magazine, March 28, 2014.

Japec, Lilli, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O’Neil, & Abe Usher. 2015. “Big data in survey research. AAPOR task force report”, in: Public Opinion Quarterly, 79(4), 839-880.

Jungherr, Andreas, Harald Schoen, Oliver Posegga, & Pascal Jürgens. 2017. “Digital Trace Data in the Study of Public Opinion: An Indicator of Attention Toward Politics Rather Than Political Support”, in: Social Science Computer Review, 35 (3), 336–56.

Kitchin, Rob. 2014. “Big data, new epistemologies and paradigm shifts”, in: Big Data and Society, 1(1), 1-12.

Lazer, David, Ryan Kennedy, Gary King, & Alessandro Vespignani. 2014. “Big data. The parable of Google Flu: traps in big data analysis”, in Science, 343, 6176, 1203–1205. 

Lazer, David, & Jason Radford. 2017. “Data ex machina. Introduction to Big Data”, in: Annual Review of Sociology, 43, 19-39.

Moretti, Franco. 2013. “’Operationalizing’ : or, the function of measurement in modern literary theory”, in: New Left Review, 84 (Nov/Dec 2013), 103-119.

Murphy, Joe, Michael W. Link, Jennifer Hunter Childs, Casey Langer Tesfaye, Elizabeth Dean, Michael Stern, Josh Pasek, Jon Cohen, Mario Callegaro, & Paul Harwood. 2014. “Social media in public opinion research: Executive summary of the AAPOR task force on emerging technologies in public opinion research”, in: Public Opinion Quarterly, 78, 788–794.

Rogers, Richard. 2013. Digital Methods. MIT Press. 

Savage, Mike & Roger Burrows. 2007. “The Coming Crisis of Empirical Sociology”, in: Sociology, 41 (5), 885-899.

Tumasjan, Andranik, Timm O. Sprenger, Philipp G. Sandner, & Isabell M. Welpe. 2010. “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment”, in: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media.

Jimena Sobrino Piazza