he Hollywood film Moneyball (2011) is about the Oakland Athletics baseball team and the attempt by its manager to put together a competitive team on a lean budget using data and computer analytics rather than depending on mere biases to recruit new players. The film stands out for focussing the spotlight on data science by showing that the art of data science is more about asking the right questions than just having the data.
It is difficult to imagine the great volume of data we supply to different agencies in our everyday actions, bit by bit through surfing the Internet, posting on social media, using credit and debit cards, making online purchases, and other things where we share information about our identity. It is believed that social media and networking service companies such as Facebook may already have more data than they are leveraging. There are infinite ways to slice and dice data, which itself is quite daunting as at every step, there is potential to make huge mistakes.
Careful data mining from Big Data might help understand our behaviour in order to facilitate planning. But there are examples of blunders being made with a load of information at one’s fingertips. The problem with so much information is that there is a much larger haystack now in which one has to search for the needle.
The Google project
Here is an example. In 2008, Google was excited about “Big Data hubris” and launched its much-hyped Google Flu Trends (GFT) based on online searches on Google for flu-related information with the aim of “providing almost instant signals” of overall flu prevalence weeks earlier than data out out by the Centers for Disease Control and Prevention (CDC), the leading national public health institute in the U.S. But much of it went wrong; GFT missed the 2009 swine flu pandemic, and was wrong for 100 out of 108 weeks since August 2011; it even missed the peak of the 2013 flu season by 140%. Google tried to identify “flu” with the search pattern. Usually, about 80-90% of those visiting a doctor for “flu” don’t really have it. The CDC tracks these visits under “influenza-like illness”. Understandably, the Net search history of these people might also be an unreliable source of information. While GFT was promoted as the poster project of Big Data, it eventually became the poster child of the foibles of Big Data. In the end, it focussed on the need for correctly using the limitless potential of Big Data through efficient data mining.
Data blunders often arise out of bias, low-quality data, unreliable sources, technical glitches, an improper understanding of the larger picture, and lack of proper statistical tools and resources to analyse large volumes of data. Moreover, Big Data invariably exhibits fake statistical relationships among different variables, which are technically called “spurious correlations” or “nonsense correlations”. Relying too heavily on a particular model is also a common mistake in Big Data analyses. Therefore, the model should be wisely and carefully chosen according to the situation.