Monday | Jul 25 | 2022

Pitfalls in the Data Science Process: What Could Go Wrong and How To Fix It


Before you invest in data science, you need to be aware of the pitfalls that can occur, so that you don’t undertake a data science project just for the sake of doing so. Avoiding these pitfalls is a common challenge for businesses that try to implement data science. If you notice that your data science project shows no return on investment, you have probably fallen victim to one of them.

Here we present you with a list of potential data science pitfalls and how to solve them.

Data Leakage


Data leakage happens when the data you use to predict something is not fair to use, that is, when it carries information about the answer that would not be available at prediction time. An example of leakage occurred during the INFORMS competition, which was concerned with predicting pneumonia admissions within hospitals. At the end of the competition, it turned out that a logistic regression that included the number of diagnosis codes as a numeric value did not perform as well as one that included the diagnosis codes as an absolute value.

In the figure above, each row represents a different patient and each column a separate diagnosis; the diagnosis code for pneumonia is 486. If the pneumonia diagnosis showed up in a record, its code was replaced with the following diagnosis code, or with a ‘-1’ if no further diagnoses were made. A ‘-1’ means that there is nothing for that specific entry.

All the other diagnoses after the pneumonia were then moved to the left, and the ‘-1’s were kept on the right. The problem is that if a row has only ‘-1’s, you know the patient had only pneumonia.

On the other hand, if a row contains no ‘-1’s, you know the patient did not have pneumonia. This discovery alone was enough to win the INFORMS competition. Data mining competitions like this one often face the problem of data leakage.
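
The mechanism described above can be reproduced on toy data. The following is a minimal sketch, not the competition's actual preprocessing: the four-column layout, the helper names `mask_pneumonia` and `leaky_guess`, and every diagnosis code other than 486 are made-up placeholders for illustration.

```python
PNEUMONIA = 486  # diagnosis code for pneumonia, as given in the text
EMPTY = -1       # filler meaning "nothing for this entry"
WIDTH = 4        # hypothetical number of diagnosis columns per patient


def mask_pneumonia(codes):
    """Drop the pneumonia code, shift the rest left, pad with -1 on the right."""
    kept = [c for c in codes if c not in (PNEUMONIA, EMPTY)]
    return kept + [EMPTY] * (WIDTH - len(kept))


def leaky_guess(row):
    """Guess the label from the padding alone.

    Returns True or False when the padding gives the answer away,
    and None when the padding is inconclusive.
    """
    if all(c == EMPTY for c in row):
        return True   # nothing left: the only diagnosis was pneumonia
    if EMPTY not in row:
        return False  # no slot was freed: no pneumonia code was removed
    return None


# Toy records; codes other than 486 are arbitrary placeholders.
raw_records = [
    [486, EMPTY, EMPTY, EMPTY],  # pneumonia only
    [250, 486, 401, 530],        # pneumonia among other diagnoses
    [250, 401, 530, 715],        # full record, no pneumonia
    [250, 401, EMPTY, EMPTY],    # short record, no pneumonia
]
labels = [PNEUMONIA in record for record in raw_records]
masked = [mask_pneumonia(record) for record in raw_records]
guesses = [leaky_guess(row) for row in masked]

for row, label, guess in zip(masked, labels, guesses):
    print(row, "label:", label, "guess from padding:", guess)
```

In this toy setup, every row where the padding yields a guess is guessed correctly without looking at a single medical fact, which is exactly what makes the feature unfair to use.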


So How Can You Avoid Leakage?

There is always a risk of data leakage as you prepare your data, deal with missing values, remove outliers, or do any data operations. Even if you create a model that works well on a clean dataset, it may not perform well in a real-world scenario.

To avoid data leakage, always start from scratch with pure raw data, and know how the data was produced.
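
One common safeguard when dealing with missing values is to compute any fill-in statistic from the training split only and then reuse it on held-out data. The sketch below illustrates that idea under stated assumptions: the function names `fit_imputer` and `apply_imputer` and the numbers are hypothetical, and mean imputation is just one possible choice.

```python
from statistics import mean


def fit_imputer(train_values):
    """Learn the fill value from the training split only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def apply_imputer(values, fill):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical raw splits; None marks a missing value.
train = [1.0, None, 3.0, 5.0]
test = [None, 100.0]

fill = fit_imputer(train)               # uses training data only
train_clean = apply_imputer(train, fill)
test_clean = apply_imputer(test, fill)  # test data never influences `fill`

# The leaky alternative: the fill value is computed from train and test together,
# so information from the held-out data seeps into the training set.
leaky_fill = fit_imputer(train + test)

print("train-only fill:", fill)
print("leaky fill:", leaky_fill)
```

The two fill values differ substantially here because the held-out split contains an extreme value, which shows how a statistic computed on all the data quietly carries test information into training.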

Evaluating Models

When you evaluate your model’s performance, don’t look just at standard modelling evaluation measures like misclassification rate, area under the curve (AUC), or mean squared error (MSE). Broaden your view to the real world and look at actual business impact metrics, too, such as profit.
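
To make the difference between a modelling metric and a business metric concrete, here is a small sketch. The payoff figures (`gain_tp`, `cost_fp`), the labels, and the two sets of predictions are invented for illustration; a real profit calculation would depend on your own business.

```python
def misclassification_rate(y_true, y_pred):
    """Share of predictions that differ from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def profit(y_true, y_pred, gain_tp, cost_fp):
    """Hypothetical profit: earn gain_tp per true positive, pay cost_fp per false positive."""
    total = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            total += gain_tp
        elif p == 1 and t == 0:
            total -= cost_fp
    return total


# Invented labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0]  # misses one positive
model_b = [1, 1, 1, 0, 0, 0]  # raises one false alarm

for name, preds in [("A", model_a), ("B", model_b)]:
    rate = misclassification_rate(y_true, preds)
    money = profit(y_true, preds, gain_tp=100, cost_fp=10)
    print(f"model {name}: error rate {rate:.3f}, profit {money}")
```

Both models make exactly one mistake, so their misclassification rates are identical, yet under these assumed payoffs model B earns almost twice as much as model A.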

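Two big problems you should look for when you evaluate your models are overfitting and underfitting. An overfit model is too complex: it fits the training data closely but performs worse on new data, so the fix is to reduce the complexity of the model. An underfit model is too simple to capture the pattern even in the training data, so the fix is to increase the complexity of the model.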
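
A common way to spot both problems is to compare the error on the training data with the error on held-out validation data. The sketch below does this for three models of increasing complexity. Everything in it is an assumption made for illustration: the hand-picked data points, the three model types, and the helper names.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predict the average training target."""
    avg = sum(ys) / len(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Too complex: return the target of the nearest training point."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


# Hand-picked toy data: roughly y = 2x, with noise in the training targets.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
valid_x = [0.4, 1.4, 2.4, 3.4, 4.4]
valid_y = [0.8, 2.8, 4.8, 6.8, 8.8]

results = {}
for name, fit in [("constant", fit_constant),
                  ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, valid_x, valid_y))
    print(f"{name:10s} train MSE {results[name][0]:.3f}  "
          f"validation MSE {results[name][1]:.3f}")
```

On this toy data the constant model has a high error on both sets (underfitting), the memorizer has zero training error but a clearly higher validation error than the line (overfitting), and the line sits in between in complexity and does best on validation data.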

In the image below, you have a simple visualization of an underfit (left), an overfit (right), and a good model fit (middle).

Feature Construction

Data scientists often make assumptions to build features that have high predictive power.

To avoid pitfalls when you construct features:
  1. Keep a record of which features were generated and which ones were collected.
  2. Remember how they were generated and what assumptions were made.
  3. Ask if the data used to generate the features will be available long-term.
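
The three points above can be turned into a lightweight feature registry. This is only a sketch of one possible way to keep such records: the `FeatureRecord` class, its fields, the `needs_review` rule, and the example features are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureRecord:
    name: str
    generated: bool                     # False means the feature was collected as-is
    derivation: str = ""                # how a generated feature was computed
    assumptions: List[str] = field(default_factory=list)
    inputs_available_long_term: bool = True


def needs_review(record):
    """Flag generated features that are undocumented or rely on short-lived inputs."""
    if not record.generated:
        return False
    return not record.derivation or not record.inputs_available_long_term


# Hypothetical registry entries.
registry = [
    FeatureRecord("age", generated=False),
    FeatureRecord(
        "visits_per_month",
        generated=True,
        derivation="visit_count / months_active",
        assumptions=["months_active is never zero"],
    ),
    FeatureRecord(
        "promo_response_rate",
        generated=True,
        derivation="responses / promos_sent",
        assumptions=["the promotion log is complete"],
        inputs_available_long_term=False,
    ),
    FeatureRecord("risk_bucket", generated=True),  # nobody wrote down how it was built
]

flagged = [record.name for record in registry if needs_review(record)]
print("features to review:", flagged)
```

Keeping the record next to the code that builds the features makes it easier to answer, months later, where a feature came from and whether it can still be produced.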

It does not matter what sector your business is in: technology, transportation, healthcare, or any other industry can take advantage of the automated decisions that become possible if you implement a data science project. At the same time, this growth opportunity comes with many opportunities for mistakes.

Keep an eye on the possible pitfalls explained above, and you will increase your chances of reaping the benefits of data science. You’ll surely help your business grow.
