How DataRobot SOLVED Machine Learning

Keenan Moukarzel
7 min read · Aug 7, 2021

Quick Intro

Machine Learning has been solved! We can all celebrate and move on to the next thing. Don’t believe me? I will show you why and how DataRobot unlocked all of Machine Learning’s potential, and why we can finally focus on the problem at hand instead of being distracted by technicalities like “which model did you use” and “how did you optimize your hyperparameters”. The questions will now center on “how did you frame the problem and the solution” and “what data did you use”.

Also, before we start, it’s important to note that I am not affiliated with DataRobot other than the fact that I am an avid user and fan of the product. I’ve been using it for a couple of years now and it’s done so much for my career and overall business capabilities.

Let’s get to it!

What DataRobot Did

Okay, there are actually many things DataRobot did that simply accelerate all aspects of Machine Learning and model creation. Let’s list the main ones.

1. Automated Exploratory Data Analysis (EDA)

EDA is the first critical step when working on any dataset (many skip it but please don’t…). When you drag and drop your dataset into DataRobot, it will perform automated EDA and highlight a few important things (a quick API sketch follows the list):

  • Assign data types to each column, along with high-level stats (number of observations, number of nulls, mean, median, etc.). If you disagree with a data type, you can simply change it
  • Identify outliers or issues in your data, such as too many zeros or nulls in a column
  • Identify duplicate or redundant columns that may cause multicollinearity issues
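
Everything EDA surfaces in the UI is also reachable from Python. Here’s a minimal sketch using the datarobot client (the endpoint, token, and file name are placeholders, and the exact Feature attributes can vary by client version):

```python
import datarobot as dr

# Connect to DataRobot (endpoint and token are placeholders)
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Uploading a dataset kicks off automated EDA
project = dr.Project.create(sourcedata="houses.csv", project_name="House Prices")

# Inspect what EDA inferred about each column:
# data type, null count, and so on
for feature in project.get_features():
    print(feature.name, feature.feature_type, feature.na_count)
```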

2. Create dozens of models and rank them based on performance

This is the part called “AutoML”, which stands for Automated Machine Learning. It’s also the part you will benefit from the absolute most! What are the benefits of running AutoML on your dataset? (A code sketch follows the list.)

  • Gives you confidence that you are working with the most performant model for your dataset, versus whatever you may just be comfortable with or used to. You can even pick which metric to rank your models by (see the dropdown in the screenshot). It reduces your technical bias many times over.
  • It optimizes your hyperparameters thanks to its automated hyperparameter optimization. How did they get so good at it? Well, number one, the company was founded by many Kaggle grandmasters. But also, they perform machine learning on machine learning! Meaning they learn from the millions of models people have run on DataRobot and learn what works best for each type of dataset. You can even access the hyperparameter values in the “describe” tab.
  • It saves you a TON of time. Think about how much time it takes to create just one model (no matter how good you are). Now multiply that by 60+, which is how many models an Autopilot run will create for you.
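
Continuing the sketch above, starting Autopilot and reading the Leaderboard from Python looks roughly like this (the target name is an assumption for illustration):

```python
# Pick a target and launch full Autopilot; model selection and
# hyperparameter tuning happen server-side from here
project.set_target(target="SalePrice", mode=dr.AUTOPILOT_MODE.FULL_AUTO)
project.wait_for_autopilot()  # blocks until the run finishes

# get_models() returns the Leaderboard, best-ranked model first
for model in project.get_models()[:5]:
    print(model.model_type, model.metrics[project.metric]["validation"])
```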

3. You can deploy your models FAST and monitor them

When you create a model manually (say using scikit-learn or TensorFlow), you need to deploy it manually too, meaning you have to create the API yourself, define the data schema, etc., which again takes so much time and effort. DataRobot has simply automated this process with one click of a button! It even tells you how to call the API (providing you the Python code and all).
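
To give you a flavor, here is a hedged sketch of what calling a deployed model over HTTP looks like (the URL pattern, headers, and file name below are placeholders modeled on the snippet DataRobot generates for you):

```python
import requests

# Placeholders: your prediction server host, deployment ID, and keys
URL = "https://example.datarobot.com/predApi/v1.0/deployments/DEPLOYMENT_ID/predictions"
HEADERS = {
    "Content-Type": "text/csv; charset=UTF-8",
    "Authorization": "Bearer YOUR_API_TOKEN",
    "DataRobot-Key": "YOUR_DATAROBOT_KEY",
}

# Send the rows to score as CSV; the response carries the predictions
with open("new_houses.csv", "rb") as f:
    response = requests.post(URL, data=f, headers=HEADERS)
print(response.json())
```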

Not only that, but you can also track how many calls are made to your model and, most importantly, track any data drift, which occurs when the data you are scoring differs too much from your original training dataset.
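
Those monitoring numbers are reachable from Python too. A sketch, assuming a recent client version where Deployment exposes get_service_stats and get_feature_drift (the deployment ID is a placeholder):

```python
import datarobot as dr

deployment = dr.Deployment.get(deployment_id="DEPLOYMENT_ID")

# Service health: total predictions served, error rates, etc.
stats = deployment.get_service_stats()
print(stats.metrics)

# Per-feature drift between training data and what you're scoring now
for drift in deployment.get_feature_drift():
    print(drift.name, drift.drift_score)
```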

They have really thought this through!

Telling the story of your model

This is another super important aspect of DataRobot. Now that you’ve created your perfect model, how do you sell it to your stakeholders? There are 2 critical components to this.

1. Feature Impact

The Feature Impact component shows which of the features, or data elements, used to create your model have the most impact on your target prediction.

For example, say you want to predict house prices using many aspects of each house, such as square footage, distance to school, access to highways, etc. How do you know which element is the most impactful in predicting house prices? This is where Feature Impact comes into play.

Based on the above chart (from my previous article on neural network feature impact), you can tell that # of Rooms has the biggest effect on house prices, which makes sense! Feature Impact will help you validate that the logic of your model makes business and logical sense.

DataRobot generates Feature Impact automatically for each model it creates (you may have to run it yourself for models lower on the Leaderboard).
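
If you prefer code, pulling Feature Impact for any Leaderboard model (say model = project.get_models()[0]) is a single call. A sketch; the record keys below follow the client docs:

```python
# Compute (or fetch, if already computed) Feature Impact
impact = model.get_or_request_feature_impact()

# Each record pairs a feature with its normalized impact score
for row in sorted(impact, key=lambda r: r["impactNormalized"], reverse=True):
    print(row["featureName"], round(row["impactNormalized"], 3))
```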

2. Feature Effect

Feature Effect takes Feature Impact to the next level. Not only does it tell you which features are most predictive, but it dives into each individual feature, and tells you how every value of each feature affects your final outcome.

In our previous example, we saw how # of Rooms is very predictive of house prices. But how exactly does it affect them? Here is what the Feature Effect graph shows for this feature:

The red line measures the variation in Median House Price (y-axis) when varying “# of Rooms” (x-axis). The red line clearly shows that the higher the number of rooms, the higher the house price. In fact, it looks like the Median House Price can go from a modest $100,000 for a 1-room home to a whopping $360,000 for a 9-room house, everything else held constant, which is a $260,000 swing. Pretty impressive for just one variable!

The light blue distribution plot helps you see the actual distribution of # of Rooms in the Boston dataset. This gives you perspective as to how often each value occurs. It looks like # of Rooms follows a normal distribution, with 6 being the most common value. Please note that # of Rooms in our case means the number of rooms in a house that are not bathrooms. For example, if someone has a Living/Dining room (1) + a Kitchen (1) + a Master Bedroom (1), that is a 3-room apartment.

Once again, DataRobot allows you to automatically generate Feature Effects for any model and all the variables used by that model. Crazy!
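
The partial-dependence data behind that chart is also available through the client. A sketch, assuming the FeatureEffects structure described in the client docs (field names may differ across versions):

```python
# Compute (or fetch) Feature Effects on the validation partition
effects = model.get_or_request_feature_effect(source="validation")

# Each entry carries the partial-dependence curve for one feature;
# the dictionary keys here are assumptions based on the client docs
for fe in effects.feature_effects:
    print(fe["feature_name"])
    for point in fe["partial_dependence"]["data"]:
        print("  ", point["label"], point["dependence"])
```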

Automating AutoML with the DataRobot API

This is the cherry on the cake. The DataRobot API allows you to use ALL of DataRobot’s features mentioned above from the comfort of your Jupyter Notebook… meaning you can create your AutoML project, start Autopilot, select any model you want, get Feature Impact and Feature Effects, deploy your model, etc., all with very simple Python commands!

Why is this even more powerful?

The DataRobot API allows you to Automate AutoML

That means you can continuously run AutoML on any dataset, deploy your model, monitor it, and do it all over again, in an automated fashion using Python code. I will be writing a full article on how to implement this in more detail, so stay tuned!
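
To make that concrete before the full article lands, here is a minimal end-to-end sketch (file name, target, and label are placeholders; on the managed cloud you also pass a prediction server when deploying):

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# 1. Create a project and run Autopilot end to end
project = dr.Project.create(sourcedata="houses.csv", project_name="Nightly run")
project.set_target(target="SalePrice", mode=dr.AUTOPILOT_MODE.FULL_AUTO)
project.wait_for_autopilot()

# 2. Deploy the top Leaderboard model
best_model = project.get_models()[0]
server = dr.PredictionServer.list()[0]
deployment = dr.Deployment.create_from_learning_model(
    model_id=best_model.id,
    label="house-prices-nightly",
    default_prediction_server_id=server.id,
)

# 3. From here: score against the deployment's API, watch drift,
#    and rerun this script on a schedule (cron, Airflow, etc.)
print("Deployed:", deployment.id)
```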

Conclusion

I have only shown you the tip of the iceberg. I am an avid user of the tool and it is part of my daily life. I highly recommend exploring the tool for your professional needs if you haven’t done so.

I hope you enjoyed reading this article as much as I enjoyed writing it! I teach data science on the side at www.thepythonacademy.com, so if you’d like further training, or even want to learn it all from scratch, feel free to contact us on the website. I also plan to publish many articles on Machine Learning and AI on here, so feel free to follow me as well. Please share, like, connect, and comment, as I always love hearing from you. Thank you!
