Detecting the bias in media with statistical learning methods
The international media system plays a crucial role both in reflecting public opinion and events, and in shaping them. Understanding the workings of this complex system is of crucial importance for sociology, anthropology, communication sciences and many other disciplines. Traditionally, the analysis of media content has been performed by social scientists, by searching databases of news for specific keywords or keyphrases of interest. Statisticians have been involved in what is called "content analysis", a task essentially based on "pattern matching": counting the occurrences of specific patterns (typically keywords). The availability of modern pattern analysis technology and of news in digital format allows for a radically new approach to the classic statistical task of content analysis. Automatic pattern discovery methods working on massive amounts of data can monitor and analyze even subtle patterns in the media system. We believe that the social sciences can benefit from automatic pattern analysis technology as much as the life sciences have in recent years. In this chapter we present an application of statistical learning algorithms to the analysis of patterns in media content: a complete case study of how this can be done automatically by exploiting modern algorithms and web-based news sources. We cover the entire pipeline, from story extraction to the analysis of story content. Rather than focusing on classical aspects such as topic extraction or named entity identification, here we explore the analysis of two very elusive patterns: the presence of a systematic bias in the choice of words, and in the choice of topics, when reporting. Support Vector Machines and other kernel methods are used at various stages to extract from the web a set of news stories that represent different versions of the same events. A significant statistical bias is found, correlating the choice of terms with the source reporting the news.
The fact that all this is done automatically and in a very general way makes the approach easily scalable. An automatic system based on learning algorithms has been used to create a corpus of news that appeared in the online versions of 4 international news sources between 31st March 2005 and 14th April 2006. The sources used are CNN, the English version of Al Jazeera (AJ), International Herald Tribune (IHT) and Detroit News (DN). We have performed three experiments on this dataset, aimed at extracting patterns from the news content that relate to a bias in lexical choice when reporting the same events, or a bias in choosing the events to cover. The first experiment, using Support Vector Machines and limited to CNN and AJ, demonstrates how it is possible to predict the source of a story based on its content, and identifies the terms that are most helpful in this discrimination. The second experiment, using Canonical Correlation Analysis, identifies topics in the CNN/AJ part of the corpus, and then identifies words that are discriminative for the two sources in each topic. Finally, we have generated "maps" reflecting the distance separating the 4 sources, based both on topic-choice and on lexical-choice features. In order to separate the two effects (choice of topics and of lexicon) we developed an algorithm to identify corresponding news items in different sources (based on a combination of date and bag-of-words similarity). This means that any patterns in lexical difference we identify are obtained by comparing different versions of the same stories. For the first two experiments, we constructed a paired corpus of news, where each pair is formed by one article from AJ and one article from CNN, reporting on the same story. The corpus was created by extracting the text of each story from HTML pages, using a support vector machine, and the stories were then paired using an algorithm developed for this purpose.
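The pairing step, which combines date and bag-of-words similarity, can be illustrated with a minimal sketch. The details here are assumptions made for illustration and are not taken from the chapter: the tokenizer, the use of cosine similarity, the greedy one-to-one matching, and the hypothetical thresholds `max_days` and `min_sim`.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lowercased word counts (a simple stand-in tokenizer, assumed here)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_stories(source_a, source_b, max_days=1, min_sim=0.3):
    """Greedily pair (date, text) stories from two sources.

    A pair is a candidate only if the publication dates differ by at
    most `max_days` and the cosine similarity is at least `min_sim`
    (both thresholds are hypothetical). Candidates are accepted from
    most to least similar, and each story is used at most once.
    Returns a list of (index_a, index_b, similarity) tuples.
    """
    bows_a = [bag_of_words(text) for _, text in source_a]
    bows_b = [bag_of_words(text) for _, text in source_b]
    candidates = []
    for i, (date_a, _) in enumerate(source_a):
        for j, (date_b, _) in enumerate(source_b):
            if abs((date_a - date_b).days) <= max_days:
                sim = cosine(bows_a[i], bows_b[j])
                if sim >= min_sim:
                    candidates.append((sim, i, j))
    candidates.sort(reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for sim, i, j in candidates:
        if i not in used_a and j not in used_b:
            pairs.append((i, j, sim))
            used_a.add(i)
            used_b.add(j)
    return pairs
```

Restricting candidates by date first keeps unrelated stories with similar vocabulary from being matched, so that the remaining lexical differences within a pair can be attributed to the source rather than to the topic.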
Starting from 918 stories gathered over a period of 13 months in 2005 and 2006 from those two news sources, 816 pairs were so obtained, most of which turned out to be related to Middle East politics and events. Furthermore, it has been possible to isolate a subset of words that are crucial in informing the decision about which source a story comes from. These are words that are used in different ways by the two sources. In other words, the choice of terms is biased in the two sources, and these keywords are the most polarized ones. This includes a preference for terms such as 'insurgency', 'militants' and 'terrorists' in CNN when describing the same stories in which Al Jazeera prefers the words 'resistance', 'fighters' and 'rebels'. For the last set of experiments, involving the generation of maps, we have used the full corpus. Obtained with the same techniques and for the same time interval, it contains 21552 news items: 2142 for AJ, 6840 for CNN, 2929 for DN and 9641 for IHT. The two news sources with a more regional focus (AJ and DN) have the smallest sets of news, as well as the smallest intersection, as expected; as a result, few stories are covered by all 4 sources. Most stories that were covered by all four news sources were related to the Middle East. More generally, this chapter aims at demonstrating the potential of modern pattern analysis technology, coupled with text engineering and information retrieval methods, in the social sciences. Modern pattern analysis algorithms now play a crucial role in the life sciences, but virtually no transfer of technology has yet taken place to the social sciences. One of the purposes of this study is to demonstrate the potential for this interaction.
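The idea of reading polarized terms off a source classifier can be sketched as follows. This is an illustrative assumption, not the chapter's implementation: it trains a linear SVM on raw word counts with a Pegasos-style subgradient method (cycling through the stories in order rather than sampling them), and the names `train_linear_svm`, `polarized_terms`, `lam` and `epochs` are hypothetical. Labels are +1 for one source and -1 for the other.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercased word counts (a simple stand-in tokenizer, assumed here)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_linear_svm(docs, labels, lam=0.1, epochs=50):
    """Pegasos-style training of a linear SVM on bag-of-words features.

    docs: list of strings; labels: list of +1 / -1 source labels.
    Returns a dict mapping each term to its learned weight.
    """
    features = [bag_of_words(doc) for doc in docs]
    weights = {}
    step = 0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = y * sum(weights.get(t, 0.0) * c for t, c in x.items())
            # Regularization: shrink all weights by (1 - eta * lam) = 1 - 1/step.
            shrink = 1.0 - 1.0 / step
            for term in weights:
                weights[term] *= shrink
            # Hinge-loss subgradient: update only on margin violations.
            if margin < 1.0:
                for term, count in x.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * count
    return weights


def polarized_terms(weights, k=5):
    """Return the k most positive and the k most negative terms."""
    ranked = sorted(weights, key=weights.get)
    negative = [t for t in ranked[:k] if weights[t] < 0]
    positive = [t for t in reversed(ranked[-k:]) if weights[t] > 0]
    return positive, negative
```

In a linear model the sign of a term's weight indicates which source it points to and the magnitude indicates how strongly, which is why the highest-magnitude weights are natural candidates for the most polarized terms. On paired stories, words shared by both versions tend to receive weights near zero, leaving the source-specific vocabulary at the extremes.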