Here we will do a study on Machine Learning Classification using completely hypothetical data.


In this study, we create the data ourselves, which means that we know exactly which patterns it contains. The ML model is used to reverse engineer the characteristics that we built in. Since we know the truth of the data, we can explore the value of the ML model: its predictions and statistics.

For example, we know that gender makes a genuine difference in the accuracy of the predictions. When we ask our code to give us statistics (e.g. Accuracy, Feature Importance) we can interpret these numbers with full understanding of “the truth” to know that gender is an important feature.

In normal research, we do not know the truth of the data. All we have are the results of the ML model. We need to be able to accurately and fairly interpret the model so that value is added.

In this study, students should learn:

  • How to create and analyze a Classification model

  • That accuracy can be misleading in a multitude of ways

  • That calculating accuracy is not the stopping point of the research

  • That graphing a model’s predictions reveals important information & insight

  • That Feature Importance reflects how much a feature is used in making a model’s decisions, but that does not mean the feature is, in practice, (un)important in a Fair & Ethical ML model

  • That with limited data, a model’s accuracy may be “wrong”

  • That using a randomized test set won’t always reveal that the model is overly complex or convoluted

Data Overview

In this hypothetical example, we know the height and gender of about 2000 people who are between the ages of 22 and 32. We also know whether they play in the NBA (or WNBA). Let’s see if a DecisionTreeClassifier can accurately predict whether they play in the NBA or not based solely on their gender and height. Note that for simplicity, we will only consider male/female, and we will often say just NBA when we also mean to include the WNBA.

We will break up our study into two primary sections. Each section uses completely contrived and made-up data. This allows us to understand the model’s results. In both cases, height impacts the results (playing for the NBA), but the specifics of how it impacts the results are different in each study.

In our first dataset, the threshold for automatically being in the NBA is 81+ inches tall for a male, while a woman must be only 75+ inches tall.

The two data types are:

  1. Predictable Data : a person is in the NBA based entirely on their height.

  2. Randomized Data : the taller a person is, the more likely they are in the NBA.

Example Data

A person is in the NBA when they are tall enough. Nothing else matters. On the right, you can see the people who are tall enough. Men need to be 81 inches tall. Women need to be 75 inches tall.
Predictable Data

In this data, your height determines the chances that you’re in the NBA. There will be people who are tall enough but are not in the NBA, and there will be relatively short people who are in the NBA. But the really tall people have a greater chance of being in the NBA.
Randomized Data

The code

There are several parameterized methods used to create the data, models and graphs. Seeing the code can help the student understand the results as well as to reproduce similar studies.
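As a rough idea of what such data-creation methods could look like, here is a minimal sketch of both data types. The function names, the height distributions, and the probability rule are all assumptions made for illustration; only the 81 and 75 inch cutoffs and the name of the `basis` parameter come from this lesson.

```python
import numpy as np

# Cutoffs from the text: men need 81+ inches, women need 75+ inches.
MALE_CUTOFF, FEMALE_CUTOFF = 81, 75


def make_people(n=2000, seed=0):
    """Create gender and height for n made-up people (assumed distributions)."""
    rng = np.random.default_rng(seed)
    male = rng.integers(0, 2, size=n)  # 1 = male, 0 = female
    mean_height = np.where(male == 1, 70.0, 65.0)  # assumed means, in inches
    height = np.round(rng.normal(mean_height, 5.0)).astype(int)  # assumed spread
    return male, height


def predictable_labels(male, height):
    """Predictable Data: in the NBA exactly when tall enough for your gender."""
    cutoff = np.where(male == 1, MALE_CUTOFF, FEMALE_CUTOFF)
    return (height >= cutoff).astype(int)


def randomized_labels(male, height, basis=0.01, seed=1):
    """Randomized Data: being taller raises the chance of being in the NBA."""
    rng = np.random.default_rng(seed)
    cutoff = np.where(male == 1, MALE_CUTOFF, FEMALE_CUTOFF)
    # Assumed rule: everyone has at least `basis` chance, and the chance
    # climbs toward 1 as height passes the cutoff.
    chance = basis + (1.0 - basis) / (1.0 + np.exp(-(height - cutoff)))
    return (rng.random(len(height)) < chance).astype(int)


male, height = make_people()
print("Predictable NBA count:", predictable_labels(male, height).sum())
print("Randomized NBA count:", randomized_labels(male, height).sum())
```

Fixing the seeds makes a run repeatable; changing them gives a different made-up population each time.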

Data 1: Predictable Data Study

In this contrived dataset, we say that whether a person is in the NBA is completely determined by their height. We use the in_nba_by_height method (in the Code dropdown above). We see that the accuracy is very, very high, nearly 100%! This makes sense because we generated the data to be predictable.

Belonging to the NBA in this Predictable Data Study is very simple: if you’re tall enough for your gender, you’re in! When we look at the Feature Importance we see that height has an importance of 60% while gender is ~40%.
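To make the two statistics concrete, here is a minimal sketch of training a DecisionTreeClassifier on predictable data and printing its accuracy and Feature Importance. The data generation (height means and spread, the 30% test split) is an assumption for illustration, so the exact numbers will not match the ones quoted above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the height distributions are assumptions,
# the 81/75 inch cutoffs are from the text.
rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=2000)
height = np.round(rng.normal(np.where(male == 1, 70.0, 65.0), 5.0)).astype(int)
in_nba = (height >= np.where(male == 1, 81, 75)).astype(int)

X = np.column_stack([male, height])
feature_names = ["male", "height"]
X_train, X_test, y_train, y_test = train_test_split(
    X, in_nba, test_size=0.3, stratify=in_nba, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of correct test predictions

print(f"Accuracy: {accuracy:.2%}")
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"Feature: {name}, Importance: {importance:.2%}")
```

The importance values are normalized by scikit-learn, so they add up to 100% across the features.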

Model Predictions Drawn
In this graph we see both how the actual data is distributed and how the predictions are made. The predictions are shown with the larger, blue and orange dots. On top of the predictions are smaller, red and green dots that represent the actual data. It shows that the model does a great job! The picture says a lot and you should find it useful in concluding that the model is sound.

Simple Data Predictions

So what? Where do we go from here?

Let’s present a few questions:

  • How important is gender in the predictions?

  • What does the 40% mean?

We will explore this by removing gender from the set of features available to the model when making its predictions.

Make your prediction

Take a minute to think and make some predictions about what will happen when we remove gender from the feature set.

An Aside

When the percentage of actual women in WNBA drops to less than 1% (say, 2/500 women), then the Feature Importance drops to just ~6%. What does that say about Feature Importance?

Read on and hopefully it will all make sense.

Gender Removed

Here are the results when gender is removed. I find the results fascinating! First off, the accuracy of the model is 99.5%!! Were you expecting that?

How in the world does the model predict with 99.5% accuracy when we ignore gender which has a Feature Importance of 40%?

What we see in the graph is that the model doesn’t predict NBA until one is 80 inches tall. In this particular dataset, one woman happened to be that tall while 12 others in the WNBA were not that tall.
No Gender Model Results
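The experiment itself is small: train once with both features and once with height alone, then compare. Below is a minimal sketch under assumed data (the height distributions and the split are made up for illustration, and `fit_and_score` is a hypothetical helper), so expect numbers close to, but not the same as, the 99.5% above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up predictable data: assumed height distributions, cutoffs from the text.
rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=2000)
height = np.round(rng.normal(np.where(male == 1, 70.0, 65.0), 5.0)).astype(int)
in_nba = (height >= np.where(male == 1, 81, 75)).astype(int)

X = np.column_stack([male, height])  # column 0 = male, column 1 = height
X_train, X_test, y_train, y_test = train_test_split(
    X, in_nba, test_size=0.3, stratify=in_nba, random_state=0
)


def fit_and_score(columns):
    """Train on the chosen feature columns and return the test accuracy."""
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[:, columns], y_train)
    return tree.score(X_test[:, columns], y_test)


acc_with_gender = fit_and_score([0, 1])  # male + height
acc_height_only = fit_and_score([1])     # gender removed
print(f"With gender:    {acc_with_gender:.2%}")
print(f"Without gender: {acc_height_only:.2%}")
```

Both models see exactly the same people in the train and test sets, so any difference in accuracy comes only from dropping the gender column.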


We are left to reflect about the meaning of Accuracy when it completely clobbers a segment of the population. Without knowing the error rates, the Accuracy number alone loses significance.

We must also question what the Feature Importance really means. In this scenario we can predict with nearly 100% accuracy who is in the NBA even when excluding a feature with 40% importance.

In this study, the count of people at various heights is random, and the training set is randomly selected. This leads to getting different numbers each time we run the code. The greatest variance was in the Feature Importance of gender. The accuracy was always very close to 100%.
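One way to see this run-to-run variation for yourself is to repeat the whole experiment with different random seeds. This is a minimal sketch with assumed data settings; `gender_importance` is a hypothetical helper, not part of the lesson’s code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def gender_importance(seed):
    """Regenerate the made-up data, retrain, return (accuracy, gender importance)."""
    rng = np.random.default_rng(seed)
    male = rng.integers(0, 2, size=2000)
    # Assumed height distributions; the 81/75 inch cutoffs are from the text.
    height = np.round(rng.normal(np.where(male == 1, 70.0, 65.0), 5.0)).astype(int)
    in_nba = (height >= np.where(male == 1, 81, 75)).astype(int)
    X = np.column_stack([male, height])
    X_train, X_test, y_train, y_test = train_test_split(
        X, in_nba, test_size=0.3, stratify=in_nba, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    return tree.score(X_test, y_test), tree.feature_importances_[0]


results = [gender_importance(seed) for seed in range(5)]
for seed, (accuracy, importance) in enumerate(results):
    print(f"seed={seed}  accuracy={accuracy:.2%}  gender importance={importance:.2%}")
```

Comparing the two columns across seeds shows which number is stable and which one moves around.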

So, is gender an important feature to have in the model? The answer depends on whether you want to be Fair or not. If all you care about is overall accuracy, you could ignore gender which simplifies your model. But, then, you’re not fair! Let’s examine Fairness more.


First, you should know that fairness is covered in Lesson 28. What is especially important is to know the definitions of fairness and how we can discuss fairness scientifically once we define it.

Let’s say that in our new model we choose to ignore gender because we’ve seen how overall Accuracy is still very high. What happens?

We see that the model’s predictions for men are still very good, and it basically ignores that women are shorter on average. This means that we have bad Fairness as measured by Equal Opportunity: there are too many False Negatives and few, if any, True Positives.

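In other words, the True-Positive rate (the ability to correctly identify WNBA players) for women is near 0%. This rate is also called recall or sensitivity. It is not the same as PPV (Positive Predictive Value), which is the share of predicted players who really are players. Video Reference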

To be Fair, you want FNR and FPR to be relatively close across all the groups. If you look at these two graphs, you’ll see that FNR for Females is 92% while Males is at 0%!! This leads us to conclude that the model is NOT FAIR when we exclude gender from the features.

Female Fairness

Male Fairness
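The rates behind these graphs are simple counts. The sketch below computes FNR and FPR per group with a hypothetical `error_rates` helper. The 13 WNBA players with only 1 predicted correctly come from the text above; the number of non-players and all of the male counts are made-up filler. Equal Opportunity is commonly taken to compare True-Positive rates (equivalently FNR) across groups, and Predictive Equality to compare FPR.

```python
import numpy as np


def error_rates(y_true, y_pred):
    """Return (FNR, FPR) for 0/1 arrays of actual and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # players the model missed
    tp = np.sum((y_true == 1) & (y_pred == 1))  # players the model found
    fp = np.sum((y_true == 0) & (y_pred == 1))  # non-players flagged as players
    tn = np.sum((y_true == 0) & (y_pred == 0))  # non-players correctly rejected
    fnr = fn / (fn + tp)  # missed players / all actual players
    fpr = fp / (fp + tn)  # wrongly flagged / all actual non-players
    return fnr, fpr


# Women: 13 actual players, only 1 predicted (from the text);
# the 287 non-players are an assumed filler count.
female_true = np.array([1] * 13 + [0] * 287)
female_pred = np.array([1] * 1 + [0] * 12 + [0] * 287)

# Men: hypothetical counts in which every player is predicted correctly.
male_true = np.array([1] * 10 + [0] * 290)
male_pred = male_true.copy()

female_fnr, female_fpr = error_rates(female_true, female_pred)
male_fnr, male_fpr = error_rates(male_true, male_pred)
print(f"Female FNR: {female_fnr:.0%}, FPR: {female_fpr:.0%}")
print(f"Male   FNR: {male_fnr:.0%}, FPR: {male_fpr:.0%}")
```

With 12 of the 13 women missed, the female FNR is 12/13, or about 92%, while the male FNR is 0%.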


Feature Importance : Each feature is given a percentage amount that reflects how much that feature is used in the Decision Tree. The sum of all the importance values will total 100%.

Feature Importance does NOT necessarily reflect how accurate the model will be if that feature is removed from consideration.

Plotting the model’s predictions provides a lot of information.

Calculating Fairness values is important when evaluating a model.

Data 2: Randomized Data Study

In this sub-study, we will create data that is more random. Surely not everyone who is very tall is in the NBA; they are simply more likely to be in the NBA.

In this trial:

  • We set basis=0.01 for a slightly greater chance of being in the NBA.

  • We got 49 people to be in the NBA.

  • We let the max_depth=6 for the Decision Tree.

  • We allowed gender to be considered.
    You can see that there are very few small, green dots (representing True NBA players) that are incorrectly categorized. There are a few randomly scattered large, blue dots (representing predicted NBA players). These scattered dots reflect the model’s attempt to learn and predict the inherent randomness in the data.
    Random #1 Predictions

Accuracy: 97.67%
Feature: male, Importance: 2.38%
Feature: height, Importance: 97.62%

The Fairness comparison for Male/Female shows that the model appears to be fair, meaning that the False-Negative-Rates across genders are not disparately out of whack.


Equal Opportunity

Predictive Equality







In this trial:

  • We keep basis=0.01 for a slightly greater chance of being in the NBA.

  • We got 42 people to be in the NBA.

  • We let the max_depth=3 for the Decision Tree; this is to reduce overfitting.

  • We allowed gender to be considered.
    You can see that randomness in the prediction went away, which was expected since we set max_depth=3. What is unexpected though is that we still have ~98% accuracy, and the Feature Importance for gender dropped to zero!
    Random #2 Predictions

Accuracy: 97.67%
Feature: male, Importance: 0.00%
Feature: height, Importance: 100.00%

The Fairness comparison for Male/Female shows that the model appears to be fair, meaning that the False-Negative-Rates across genders are not disparately out of whack. However, it might be better to say that the model is equally UNfair because the False Rates are pretty high for both genders. Effectively, the high accuracy rate occurs because there are so few people in the NBA.


Equal Opportunity

Predictive Equality
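The claim that high accuracy comes from how few people are in the NBA can be checked with simple arithmetic. The sketch below uses the counts from this trial (about 2000 people, 42 in the NBA) and asks how accurate a “model” would be if it always answered “not in the NBA”. Treat it as approximate: the total is rounded, and the reported accuracy is measured on a test split rather than on everyone.

```python
# Counts from the text for this trial.
total_people = 2000  # "about 2000" in the text, so treat as approximate
nba_players = 42

# Always answering "not in the NBA" is right for everyone except the players.
baseline_accuracy = (total_people - nba_players) / total_people
print(f"Always-predict-not-NBA accuracy: {baseline_accuracy:.2%}")
```

That baseline lands at roughly 98%, right next to the accuracy reported above, which is why accuracy alone says so little here.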







In this trial:

  • We set basis=0.03 for a significantly greater chance of being in the NBA.

  • We got 102 people to be in the NBA.

  • We let the max_depth=6 for the Decision Tree; we may overfit.

  • We allowed gender to be considered.
    You can see that randomness in the prediction went away, which was UNexpected since we set max_depth=6. We still have a high accuracy (~96%).
    Random #3 Predictions

Accuracy: 96.17%
Feature: male, Importance: 37.17%
Feature: height, Importance: 62.83%

The Fairness comparison for Male/Female shows it to be quite (un)fair:


Equal Opportunity

Predictive Equality







This image of the Decision Tree is small and hard to read. The blue boxes predict NBA while the others predict not. You can see that virtually every path leads to False (not in NBA). Seems like a pretty dumb tree to me.

Decision Tree for NBA
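When the picture of a tree is too small to read, printing the tree as text is one alternative. Here is a minimal sketch using scikit-learn’s `export_text`. The data, the probability rule, and the choice of max_depth=3 (picked only to keep the printout short) are assumptions for illustration, so this is not the exact tree shown above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up randomized data: assumed height distributions and probability rule.
rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=2000)
height = np.round(rng.normal(np.where(male == 1, 70.0, 65.0), 5.0)).astype(int)
cutoff = np.where(male == 1, 81, 75)  # cutoffs from the text
chance = 0.03 + 0.97 / (1.0 + np.exp(-(height - cutoff)))  # assumed rule
in_nba = (rng.random(2000) < chance).astype(int)

X = np.column_stack([male, height])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, in_nba)

# Each indented line is one split; "class: 1" leaves predict "in the NBA".
tree_text = export_text(tree, feature_names=["male", "height"])
print(tree_text)
```

Reading the printout from top to bottom shows which feature each split uses and how many paths end in “not in the NBA”.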

Did we land a good model? Perhaps it is the best we can do, but the inability to accurately predict players in the NBA is horrible. It fails to predict 74% of the NBA women and 88% of the NBA men. That’s a lot!

This output is generated from the model_acc code in the prior tab. It is a bit surprising to think that model achieved ~99% accuracy while predicting that every male will

Feature: male, Importance: 0.00%
Feature: height, Importance: 100.00%

This graphic shows that our tree is relatively simple. It always looks at height, and it predicts “in the NBA” in only one small case.
NBA Model