Scientists can be passionately critical of different approaches to a given problem. Over the years, I have found myself crossing swords frequently with epidemiologists with whom I am otherwise good friends ‘off the pitch'. To some extent, this difference in approach seems to me to be grounded in the logistics of gathering or generating data, and the analysis thereof. The epidemiological approach is often incredibly labour-intensive with regard to data gathering—consider the interview approach and dietary studies that require subjects to recall meals eaten—and therefore requires careful thought and planning to achieve a meaningful outcome with limited resources. By contrast, many areas of molecular biology generate an enormous amount of data easily and quickly, and therefore lend themselves to a different way of thinking and to heated debate about the ‘correct way to do science'. Such arguments are futile at best and possibly damaging to science at worst. What is regarded as the correct approach might be, in part, due to limitations in generating data, but it might not be necessary or correct in the future, as the history of science has shown.
The accepted scientific method consists of formulating a hypothesis and then testing it by experimentation which, at least in theory, attempts to disprove the hypothesis. Experiments generate data that can be analysed by various means to test the hypothesis. This is a common approach in epidemiological studies—for example, for identifying a risk factor for a given disease. This approach necessitates that one proposes a risk factor for the condition or disease—alcohol consumption or dietary deficiency—and then collects data pertaining to that factor. Of course, studies of social or nutritional risk factors often simultaneously collect data for testing multiple parameters.
The main risk of such studies, in my opinion, is that the researcher must first predict or guess the risk factor and then collect data pertaining to it. Of course, previous work might have suggested or even identified a risk factor in a separate study. Nevertheless, I suggest that such an approach can be fundamentally deficient, as the researcher approaches the study with his or her inherent limitations of knowledge and experiential bias, merely by choosing one possible risk factor out of perhaps hundreds of possibilities. There is a good chance that the risk factor will be confirmed as such and give rise to a publication, as a result of the bias in publishing positive studies and, particularly, studies that verify previous findings.
By contrast, the new ‘omics' technologies allow us to generate massive quantities of data rapidly and thereby enable us to take a far less biased approach to tackle a given problem. There are already about one million transcriptome data sets available (Baker, 2012), and single-nucleotide polymorphism (SNP) chips can analyse one million SNPs each. One might easily use these to investigate a large number of individuals, generating one billion separate pieces of information per 1,000 individuals, with no prejudice or preconceived ideas.
Equipped with such huge data sets, we can perform data mining in an objective way. For some purists, this approach to data acquisition is anathema, as it is not ‘hypothesis-driven'. However, I submit that it is. In this case, the original hypothesis is broad or generic—we generate data, assess it and probably find something useful for elucidating our research problem. The broad hypothesis states that we use the results to generate models that identify differences, for example between experimental subjects and controls, without specifying what those differences are and without collecting specific and limited data sets. The ‘old-style stickler' might find this approach unacceptable; however, it might be the best way to avoid bias. Contrary to what some have suggested to me, this approach is not simply playing with data to generate a hypothesis, which would violate the principle that one should not look for a primary hypothesis in results. The hypothesis is that one will design an algorithm and find a pattern, which allows us to distinguish between cases and controls.
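As a sketch of this broad hypothesis, the following simulation (all numbers and feature counts are invented purely for illustration) generates a high-dimensional case/control dataset in which only a few of many features truly differ between groups, then lets a simple nearest-centroid classifier find a separating pattern without any feature being specified in advance:

```python
# Hypothetical illustration: find a case/control pattern without pre-specifying
# which of the many measured features carry the signal.
import random

random.seed(42)
N_FEATURES = 500      # e.g. expression levels or SNP dosages (illustrative)
N_INFORMATIVE = 10    # only a handful of features truly differ
N_PER_GROUP = 100

def subject(is_case):
    # informative features are shifted in cases; the rest are pure noise
    return [random.gauss(1.5 if (is_case and i < N_INFORMATIVE) else 0.0, 1.0)
            for i in range(N_FEATURES)]

cases = [subject(True) for _ in range(N_PER_GROUP)]
controls = [subject(False) for _ in range(N_PER_GROUP)]

# fit centroids on half the subjects, evaluate on the held-out half
train_cases, test_cases = cases[:50], cases[50:]
train_controls, test_controls = controls[:50], controls[50:]

def centroid(group):
    return [sum(s[i] for s in group) / len(group) for i in range(N_FEATURES)]

c_case, c_control = centroid(train_cases), centroid(train_controls)

def predict(s):
    d_case = sum((a - b) ** 2 for a, b in zip(s, c_case))
    d_control = sum((a - b) ** 2 for a, b in zip(s, c_control))
    return "case" if d_case < d_control else "control"

correct = sum(predict(s) == "case" for s in test_cases) + \
          sum(predict(s) == "control" for s in test_controls)
accuracy = correct / (len(test_cases) + len(test_controls))
print(f"held-out accuracy: {accuracy:.2f}")
```

The broad hypothesis is confirmed or refuted by the held-out accuracy alone; which features drive the separation is only examined afterwards, as a lead for follow-up study.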
By using this approach, the examination of large data sets might generate useful and specific leads for further study and validation. These follow-up studies can be governed by the traditional hypothesis-driven approach: ‘biomarker X' is a risk factor for ‘condition Y'. Such a combination of data-gathering and hypothesis-driven approaches might be the only way to understand complex diseases, even infectious diseases, in which the invading pathogen might be necessary for disease, but in itself is inadequate as a single risk factor.
Of course, if we examine large data sets to find interesting patterns or biomarkers that might correlate with a given condition, we will probably identify false positives at a rate of at least 1:20, simply by chance. This is why replication studies that use independent sample sets are important. As long as we have proper controls in place and use statistics appropriately, this approach to science should yield wonderful new results and massively increase our knowledge of the world, instead of merely proving or refuting notions that we already suspect.
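The 1-in-20 figure, and the value of replication, can be illustrated with a toy simulation (entirely fabricated null data): screen 1,000 candidate ‘biomarkers' that truly have no effect at p < 0.05, then re-test the hits on an independent sample:

```python
# Toy illustration of chance findings under multiple testing: every "marker"
# here is pure noise, yet roughly 5% pass the p < 0.05 screen.
import random, math

random.seed(1)
ALPHA = 0.05
N_MARKERS, N_PER_GROUP = 1000, 50

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value():
    # two-sample z-test on noise (the true effect is exactly zero)
    cases = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    controls = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    z = (sum(cases) - sum(controls)) / N_PER_GROUP / math.sqrt(2.0 / N_PER_GROUP)
    return 2.0 * (1.0 - normal_cdf(abs(z)))

hits = [m for m in range(N_MARKERS) if p_value() < ALPHA]
replicated = [m for m in hits if p_value() < ALPHA]  # independent re-test
print(f"initial hits: {len(hits)} / {N_MARKERS}")   # roughly 5% by chance
print(f"surviving replication: {len(replicated)}")
```

Around one in twenty null markers passes the first screen by chance alone, and the independent re-test eliminates most of those survivors, which is exactly why replication in independent sample sets matters.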
The author declares that he has no conflict of interest.
- Baker M (2012) Nature 487: 282–283
Professor Hans Rosling certainly is a remarkable figure, and I recommend watching his performances; the BBC's "Joy of Stats" in particular is exemplary. Rosling sells passion for data, visual clarity and a great deal of comedy. He represents the data-driven paradigm in science. What is it? And is it as exciting and promising as the documentary suggests?
Data-driven scientists (data miners) such as Rosling believe that data can tell a story, that observation equals information, and that the best way towards scientific progress is to collect data, visualize them and analyze them (data miners are not specific about what "analyze" means exactly). When you listen to Rosling carefully, he sometimes makes data equivalent to statistics: a scientist collects statistics. He also claims that "if we can uncover the patterns in the data then we can understand". I know this attitude: there are massive initiatives to mobilize and integrate data, there are methods for data assimilation and data mining, and there is an enormous field of scientific data visualization. Data-driven scientists sometimes call themselves informaticians or data scientists. And they are all excited about big data: the larger the number of observations (N), the better.
Rosling is right that data are important and that science uses statistics to deal with them. But he completely ignores the second component of statistics: the hypothesis (here equivalent to a model or theory). There are two ways to define statistics, and both require data as well as hypotheses: (1) Frequentist statistics makes probabilistic statements about the data, given the hypothesis. (2) Bayesian statistics works the other way round: it makes probabilistic statements about the hypothesis, given the data. Frequentist statistics prevailed as the major discourse because it used to be computationally simpler. However, it is also less consistent with the way we think - we are nearly always ultimately curious about the Bayesian probability of the hypothesis (i.e. "how probable is it that things work a certain way, given what we see") rather than the frequentist probability of the data (i.e. "how likely is it that we would see this if we repeated the experiment again and again and again").
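The two kinds of probabilistic statement can be made concrete with a deliberately simple coin-tossing sketch (the numbers and the crude three-point prior are my own illustrative assumptions):

```python
# Toy contrast: observe 8 heads in 10 tosses and ask about the coin's bias theta.
from math import comb

n, k = 10, 8

def binom_pmf(n, k, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Frequentist: probability of data at least this extreme, GIVEN the hypothesis
# that the coin is fair (theta = 0.5) - a one-sided p-value.
p_value = sum(binom_pmf(n, j, 0.5) for j in range(k, n + 1))

# Bayesian: probability of each hypothesis, GIVEN the data, under a uniform
# prior over three candidate biases (an assumed, deliberately crude prior).
thetas = [0.3, 0.5, 0.7]
prior = {t: 1 / 3 for t in thetas}
likelihood = {t: binom_pmf(n, k, t) for t in thetas}
evidence = sum(prior[t] * likelihood[t] for t in thetas)
posterior = {t: prior[t] * likelihood[t] / evidence for t in thetas}

print(f"P(>=8 heads | fair coin) = {p_value:.3f}")
print(f"P(theta = 0.7 | 8 heads) = {posterior[0.7]:.3f}")
```

The first number is a statement about the data given a hypothesis; the second is a statement about a hypothesis given the data - and it is the second kind we are usually curious about.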
In any case, data and hypotheses are two fundamental parts of both Bayesian and frequentist statistics. Emphasising data at the expense of hypotheses means that we ignore the actual thinking, and we end up with trivial or arbitrary statements, spurious relationships emerging by chance, maybe even plenty of publications, but no real understanding. This is the ultimate and unfortunate fate of all data miners. I shall note that the opposite is similarly dangerous: putting emphasis on hypotheses (the extreme case of hypothesis-driven science) can lead to lunatic abstractions disconnected from what we observe. Good science keeps in mind both the empirical observations (data) and the theory (hypotheses, models).
Is it any good to have large data (high N)? In other words, does a high number of observations lead to better science? It does not. Data have value only when confronted with a useful theory. Theories can get strong and robust support even from relatively small data (Fig. 1a, b). Hypotheses and relationships that need very large data to be demonstrated (Fig. 1c, d) are weak hypotheses and weak relationships. Testing simple theories is more of a hassle with very large data than with small data, especially in the computationally intensive Bayesian framework. Finally, the collection, storage and handling of very large data cost a lot of effort, time and money.
Figure 1. Strong effects (the slope of the linear model y = f(x)) can get strong support even from small data (a). Collecting more data does not increase the support very much (b) and is just a waste of time, effort, storage space and money. Weak effects will find no support in small data (c) and will be supported only by very large datasets (d). In the case of (d), there is such a large amount of unexplained variability and the effect is so weak that the hypothesis y = f(x) does not seem very interesting - there is probably some not-yet-imagined cause of the variability.
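The contrast between the panels can be mimicked in a few lines of simulation (the slopes, sample sizes and noise levels are invented for illustration): ordinary least squares gives a strong effect a large t-ratio even at N = 30, while a weak effect drowning in noise only reaches a comparable t-ratio at enormous N:

```python
# Illustrative simulation of Figure 1: strong effects need little data,
# weak effects need huge data - and remain weak even when "significant".
import random, math

random.seed(7)

def fit_slope(n, true_slope, noise_sd=1.0):
    """OLS slope estimate and its standard error for y = true_slope*x + noise."""
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [true_slope * x + random.gauss(0, noise_sd) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    resid_var = sum((y - (my + slope * (x - mx))) ** 2
                    for x, y in zip(xs, ys)) / (n - 2)
    return slope, math.sqrt(resid_var / sxx)

strong_small = fit_slope(n=30, true_slope=5.0)      # like panel (a)
weak_small = fit_slope(n=30, true_slope=0.1)        # like panel (c)
weak_large = fit_slope(n=100_000, true_slope=0.1)   # like panel (d)

for label, (b, se) in [("strong, N=30", strong_small),
                       ("weak, N=30", weak_small),
                       ("weak, N=100000", weak_large)]:
    print(f"{label:>16}: slope = {b:+.3f}, t = {b / se:.1f}")
```

The weak effect in the large sample is "detectable", but its slope is still tiny relative to the residual noise, which is precisely the point of panel (d).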
My final argument is that data are not always an accurate representation of what they try to measure. Especially in the life sciences and social sciences (the "messy" fields), data are regularly contaminated by measurement errors, subjective biases, incomplete coverage, non-independence, detectability problems, aggregation problems, poor metadata, nomenclature problems and so on. Collecting more data may amplify such problems and can lead to spurious patterns. If, on the other hand, the theory-driven approach is adopted, these biases can be made an integral part of the model, fitted to the data, and accounted for. What is then visualized are not the raw, biased data but the (hopefully) unbiased model predictions of the real process of interest.
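One way to see what "making the bias an integral part of the model" can mean is a fabricated two-observer scenario (entirely invented, not a real dataset): one observer's instrument adds a constant offset, and the two observers also happened to sample different ranges of x, so pooling the raw data distorts the estimated trend, whereas a model with a separate intercept per observer recovers it:

```python
# Fabricated scenario: the same true relationship y = 2x measured by two
# observers, one of whom has a constant instrument offset.
import random

random.seed(3)
TRUE_SLOPE, OFFSET_B = 2.0, 3.0

data = []  # (observer, x, y)
for _ in range(200):
    x = random.uniform(0, 1)               # observer A covers low x
    data.append(("A", x, TRUE_SLOPE * x + random.gauss(0, 0.3)))
for _ in range(200):
    x = random.uniform(1, 2)               # observer B covers high x
    data.append(("B", x, TRUE_SLOPE * x + OFFSET_B + random.gauss(0, 0.3)))

def ols_slope(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sum((x - mx) * (y - my) for x, y in pairs) / sxx

# naive: pool the raw data and ignore who measured what
naive = ols_slope([(x, y) for _, x, y in data])

# bias-aware: shared slope with a separate intercept per observer,
# obtained here by centering x and y within each observer's own data
centered = []
for obs in ("A", "B"):
    sub = [(x, y) for o, x, y in data if o == obs]
    mx = sum(x for x, _ in sub) / len(sub)
    my = sum(y for _, y in sub) / len(sub)
    centered += [(x - mx, y - my) for x, y in sub]
slope_modelled = ols_slope(centered)

print(f"naive pooled slope: {naive:.2f}")         # inflated by the offset
print(f"bias-aware slope:   {slope_modelled:.2f}")  # close to the true 2.0
```

The within-observer centering is the simplest stand-in for a fixed observer effect; richer models (detectability, coverage, non-independence) follow the same logic of estimating the nuisance process alongside the process of interest.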
So why do many scientists find data-driven research and large data exciting? It has nothing to do with science. The desire to have datasets as large as possible and to create giant data mines is driven by our instinctive craving for plenty (richness), and by the boyish tendency to have a "bigger" toy (car, gun, house, pirate ship, database) than anyone else. And whoever guards the vaults of data holds power over all of the other scientists who crave those data.
But most importantly, data-driven science is less intellectually demanding than hypothesis-driven science. Data mining is sweet; anyone can do it. Plotting multivariate data, maps, "relationships" and colorful visualizations is hip and catchy, and everybody can understand it. By contrast, thinking about theory can be a pain, and it requires a rare commodity: imagination.