The Cons of Big Data in Science and Medicine

“In a big-data world, by contrast, we won’t have to be fixated on causality; instead we can discover patterns and correlations in the data that offer us novel and invaluable insights. The correlations may not tell us precisely why something is happening, but they alert us that it is happening.

And in many situations this is good enough. If millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission, then the exact cause for the improvement in health may be less important than the fact that they lived.” (page 14)

Matters of science and discovery shouldn’t be considered solved just because a correlation turned up while mining through millions or billions of pieces of data. They are solved only once we reach an actual understanding.

This passage refers to one of the advantages of using big data to answer questions about the world we live in: we don’t need to know what causes something to happen. It allows researchers to link two events or actions without needing to know the why or how in the middle. However, it can also lead to poor medical practices through spurious correlations, much like old-fashioned medieval medical cures.

A quick Google search on old medical cures yields some surprising results, such as placing a tuft of grass on your stomach to cure stomach pains, or making a child eat a rotten mouse to stop them from wetting the bed. To us, all these cures sound ridiculous; however, the doctors of the time wouldn’t have used these “cures” if they themselves didn’t believe them to have an effect on their patients’ well-being. These old cures likely came about in a similar way to the proposed orange juice and aspirin cancer cure: a doctor tried it, found it effective, and stuck with it. It’s the same idea, but on a smaller scale.

However, these methods are vulnerable to spurious correlations between two variables. A spurious correlation occurs when two variables appear to be related when in reality neither causes the other. This can be due to chance, or to a hidden third factor that connects them.
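To get a feel for the "due to chance" case, here is a minimal Python sketch. It is only an illustration under assumed conditions, not an analysis of any real data set: every series is random noise, and the function name strongest_chance_correlation, the seed, and the sample sizes are all made up for this example. It compares one random "target" series against more and more unrelated random series and reports the strongest correlation it stumbles on.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Return the strongest correlation between one random target series
    and n_series other random series. All series are independent noise,
    so any correlation found here is spurious by construction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # Ten data points per series, e.g. ten yearly measurements (hypothetical).
    for n_series in (1, 100, 10_000):
        r = strongest_chance_correlation(n_series, n_points=10)
        print(f"{n_series:>6} unrelated series searched: strongest r = {r:+.2f}")
```

The more unrelated series the search covers, the stronger the best match tends to look, even though nothing here is related to anything else.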

For example, there is a correlation between ice cream consumption and drowning. Ice cream consumption does not cause drowning; the correlation is due to the fact that ice cream consumption increases during the summer, as does the popularity of swimming.
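The ice cream example can be sketched in a few lines of Python. This is a toy simulation, not real data: the temperature range, the coefficients, and the "drowning risk" index are all invented for illustration. Temperature drives both ice cream sales and drowning risk, and the two never influence each other directly. The sketch compares the correlation across all days with the correlation among days that have nearly the same temperature.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_days(n_days, seed=0):
    """Generate (temperature, ice_cream, drowning_risk) tuples.
    Both outcomes depend on temperature plus independent noise;
    all numbers are hypothetical."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        temperature = rng.uniform(0, 35)                    # assumed daily high, Celsius
        ice_cream = 2.0 * temperature + rng.gauss(0, 5)     # made-up sales index
        drowning_risk = 0.1 * temperature + rng.gauss(0, 0.5)  # made-up risk index
        days.append((temperature, ice_cream, drowning_risk))
    return days


def correlation_ice_cream_vs_drowning(days):
    """Correlation between the ice cream and drowning columns."""
    return pearson([d[1] for d in days], [d[2] for d in days])


if __name__ == "__main__":
    days = simulate_days(5000)
    # Hold the hidden factor roughly constant by keeping a narrow temperature band.
    similar_days = [d for d in days if 20 <= d[0] < 22]
    print("all days:        r =", round(correlation_ice_cream_vs_drowning(days), 2))
    print("20-22 C days only: r =", round(correlation_ice_cream_vs_drowning(similar_days), 2))
```

In this setup the correlation is strong across all days and close to zero once temperature is held roughly fixed, which is what you would expect if the hidden factor, rather than ice cream, is doing the work.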

Cancer remission may have nothing to do with aspirin and orange juice, and everything to do with something else shared by the patients who regularly consume them. If we cure cancer but don’t know how the cure works, then we’re not really moving ahead; we’re falling behind.


2 thoughts on “The Cons of Big Data in Science and Medicine”

  1. Although some spurious results can arise when analyzing data, the whole idea of big data is to look at so much that these misinterpretations will be weeded out. By using big data with modern digital technology, these correlations will be found. Some may not be causally linked, but knowing there is a link between a certain diet or medicine and curing cancer may lead doctors down the path to a cure. At the same time, millions of people can be recommended this treatment and possibly have their cancer cured.

    1. On the contrary, having so much data makes it possible to find these spurious results.

      This website has dozens of charts of data that just happen to correlate. For example, there is a fairly strong correlation between US egg consumption per capita and the number of drivers killed in noncollision transport accidents. Just because this correlation happens to exist doesn’t mean we should stop eating eggs. Big data, like any statistical tool, has its applications, but it has to be used carefully, or the results will be meaningless.
