Why shouldn’t we trust models that are entirely based on data?

October 01, 2020 Blog by Cassotis Consulting

The so-called fourth industrial revolution, also known as Industry 4.0, is becoming a reality in several companies around the world. Among the main characteristics of this revolution is the control and monitoring of a large amount of data and information: the so-called Big Data.

When this practice began, so did a race to develop algorithms and statistical analyses. Studies to identify correlations between monitored variables and to create predictive models for important process features began to be conducted intensively.

With this large amount of data, we often identify correlations between variables that process specialists don't even monitor. Widening the scope of observation and potentially discovering new important process variables can be positive, but applying such predictive models and correlations to decision-making can become dangerous.

One of the main reasons for this is the fact that correlation does not imply causality. Let’s look at the concepts: in statistics, correlation is a measure of the relation between two variables. A positive correlation between two variables indicates that both tend to move in the same direction, and a negative correlation indicates that they tend to move in opposite directions. Causality, on the other hand, means that a change in one variable causes a change in another variable.
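As a minimal sketch of what "measuring the relation" can look like in practice, the snippet below computes the Pearson correlation coefficient, one common measure of linear correlation (the text above does not name a specific measure, so this choice is an assumption). The readings are hypothetical values invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical readings, invented for illustration only.
temperature = [20, 22, 24, 26, 28, 30]
output_up = [101, 104, 108, 109, 113, 116]  # tends to rise with temperature
output_down = [95, 92, 90, 85, 83, 80]      # tends to fall as temperature rises

r_pos = pearson(temperature, output_up)
r_neg = pearson(temperature, output_down)
print(f"positive example: r = {r_pos:.2f}")
print(f"negative example: r = {r_neg:.2f}")
```

A coefficient close to +1 or -1 only says that the two series move together or in opposite directions. It says nothing about which variable, if any, drives the other.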

Saying that correlation does not imply causality means that, although two variables are correlated, this does not imply that one causes the other.

In fact, it may be that variable A causes variable B, but it may also be that variable B actually causes variable A. Or it may be that other factors cause both A and B.

Additionally, maybe the variables impact each other and, therefore, A causes B and B causes A. Finally, the correlation between A and B may be simply a coincidence, or pure chance.
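The case of a hidden factor that causes both A and B can be sketched with a small simulation. Everything here is an assumption made for illustration: the variable C, the coefficients 2.0 and 1.5, and the noise levels are invented, and A and B are built so that neither ever influences the other. The partial correlation used at the end is one standard way to remove the linear effect of C.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(0)
n = 2000

# Hypothetical model: C drives both A and B; A and B never affect each other.
c = [random.gauss(0, 1) for _ in range(n)]
a = [2.0 * ci + random.gauss(0, 1) for ci in c]
b = [1.5 * ci + random.gauss(0, 1) for ci in c]

r_ab = pearson(a, b)
r_ac = pearson(a, c)
r_bc = pearson(b, c)

# Partial correlation of A and B after removing the linear effect of C.
r_ab_given_c = (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

print(f"corr(A, B)           = {r_ab:.2f}")
print(f"corr(A, B | C fixed) = {r_ab_given_c:.2f}")
```

In this sketch A and B appear strongly correlated, yet the relation largely disappears once C is taken into account. In a real process the difficulty is that C may be a variable nobody is measuring.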

There are a few simple examples for these situations.

We may notice a correlation between a rooster’s crow (A) and the sunrise (B). However, A does not cause B. It is the exact opposite: B causes A.
There is a high positive correlation between the number of popsicles sold (A) and the number of cases of drowning on beaches (B). We can definitely say that A does not cause B, nor does B cause A. But, from experience, we know that there are other factors: the heat (C) makes many people go to beaches (D). A large number of people on beaches implies a higher consumption of popsicles and, at the same time, a higher number of drownings.
There is a correlation between having yellow teeth (A) and developing lung cancer (B). However, scientific studies show that A does not cause B and vice-versa. There is a factor (C), smoking heavily, that causes both A and B.
There is a negative correlation between the number of pirates (A) and global warming (B). Historically, as the number of pirates decreased, the temperature on Earth increased. This correlation clearly results from countless other factors. Otherwise, it would be enough to encourage more people to become pirates, and the global warming issue would be solved!
There is a strong correlation between the consumption of mozzarella cheese (A) and the number of people with a PhD in civil engineering (B). There is no scientific evidence that A causes B or vice-versa, only that it is a great coincidence.

This type of correlation is called a spurious correlation. Other examples include the correlation between the number of movies Nicolas Cage was in and the number of cases of drowning in pools, and the correlation between the divorce rate in Maine, USA, and the per capita consumption of butter. More correlations like these can be found in [1].
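A rough sketch of why coincidences like these are almost guaranteed when many variables are compared: the snippet generates series of pure random noise, with no relation between them by construction, and then searches all pairs for the strongest correlation. The number of series (200) and their length (10 points) are arbitrary choices made for illustration.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(1)
n_series, n_points = 200, 10  # arbitrary sizes, chosen for illustration

# Independent noise: no series has any real relation to any other.
series = [[random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# Search every pair and keep the one with the largest absolute correlation.
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in combinations(range(n_series), 2)
)

n_pairs = n_series * (n_series - 1) // 2
print(f"pairs compared: {n_pairs}")
print(f"strongest |r| found: {best_r:.2f} between series {best_pair}")
```

With short series and thousands of pairs to choose from, some pair will look strongly related purely by chance. The same risk applies when an algorithm scans a large set of monitored variables for correlations.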

Another reason why using data indiscriminately may lead to incorrect decisions is known as Simpson’s Paradox. This paradox occurs when a trend observed in the data as a whole is reversed once the data is split into certain smaller groups.

There are several examples of this paradox. One of the most famous occurred at the University of California, Berkeley.[2] Data from admissions to the graduate program suggested that men had a much higher admission rate than women. The difference was so great (44% vs 35%) that it nearly resulted in a lawsuit against the university. However, when the data was broken down by department, the opposite turned out to be true: in most departments, women had higher admission rates than men. It was later concluded that the confusion arose because more women applied to the most competitive departments, which lowered the overall female admission rate, while men applied in higher numbers to less competitive departments.
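The reversal can be reproduced with a few lines of arithmetic. The figures below are hypothetical numbers invented to show the mechanism with two departments. They are not the actual Berkeley admission data.

```python
# (applicants, admitted) per department and group.
# Hypothetical numbers for illustration, NOT the real Berkeley figures.
applications = {
    "less competitive department": {"men": (100, 80), "women": (20, 18)},
    "more competitive department": {"men": (20, 4), "women": (100, 30)},
}

def rate(applied, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied

# Admission rate of each group inside each department.
per_department = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate of each group with all departments pooled together.
overall = {}
for group in ("men", "women"):
    applied = sum(groups[group][0] for groups in applications.values())
    admitted = sum(groups[group][1] for groups in applications.values())
    overall[group] = rate(applied, admitted)

for dept, rates in per_department.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

In this sketch women have the higher admission rate in both departments, yet the lower rate overall, because most women applied to the department that rejects most applicants.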

In addition to these two issues, it is always necessary to be careful about how data is obtained. How the data is collected, how often it is sampled, and how accurate it is may all have a major impact on the correlations obtained. For example, the effects of corrective process changes may only show up in the data after a certain period of time, leading to incorrect correlations.
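The effect of a delayed response can be sketched as follows. This is a hypothetical process, invented for illustration, in which a change in x only shows up in y three samples later. The delay of 3, the noise level, and the range of lags tested are all assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(2)
n, true_lag = 500, 3  # hypothetical: y responds to x three samples later

x = [random.gauss(0, 1) for _ in range(n)]
y = [
    (x[t - true_lag] if t >= true_lag else 0.0) + 0.3 * random.gauss(0, 1)
    for t in range(n)
]

def lagged_r(lag):
    """Correlation between x at time t and y at time t + lag."""
    return pearson(x[: n - lag], y[lag:])

r_by_lag = {lag: lagged_r(lag) for lag in range(7)}
best_lag = max(r_by_lag, key=lambda lag: abs(r_by_lag[lag]))

for lag, r in r_by_lag.items():
    print(f"lag {lag}: r = {r:+.2f}")
print(f"strongest relation at lag {best_lag}")
```

Comparing the two variables sample by sample (lag 0) suggests that they are unrelated, although y is almost entirely determined by x. Knowing the process well enough to expect a delay is what tells the analyst to look at other lags.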

Therefore, considering all the reasons presented, it is possible to conclude that fully trusting models based purely on data may lead us to incorrect decisions. It is always necessary to weigh the data against people’s experience and the existing technical and theoretical knowledge.

This is one of Cassotis’ principles and is reflected in its work: we always use the theoretical knowledge and practical experience of our consultants and our customers to guide the data and correlations used in our optimization models. We believe that the analysis of a large mass of data can greatly contribute to the search for correlations between variables, provided that this work is conducted by experts who validate the relations found and thus strengthen the optimizations made, freeing decisions from the effects of coincidence and chance.

References:

[1] VIGEN, Tyler. Spurious Correlations.

[2] DEXTER, Shawn. How UC Berkeley Almost Got Sued For SEX Discrimination….LYING Data?

Cassiano Vinhas de Lima - Consultant at Cassotis Consulting