GLOBAL RESEARCH SYNDICATE
Disadvantages Of Random Forest Regression

January 17, 2021

Jakub Czakon, senior data scientist building experiment tracking tools for ML projects at https://neptune.ai

In this article, we’ll look at a major problem with using Random Forest for regression: extrapolation.

We’ll cover the following items:

  • Random Forest Regression vs Linear Regression
  • Random Forest Regression Extrapolation Problem
  • Potential solutions
  • Should you use Random Forest for Regression?

Let’s dive in. 

Random Forest Regression vs Linear Regression

Random Forest Regression is quite a robust algorithm; however, the question is whether you should use it for regression at all.

Why not use Linear Regression instead? The function in a Linear Regression can easily be written as y = mx + c, while a complex Random Forest Regression seems like a black box that can’t easily be represented as a function.

Generally, Random Forests produce better results, work well on large datasets, and can handle missing data by creating estimates for it. However, they pose a major challenge: they cannot extrapolate outside the range of target values seen in the training data. We’ll dive deeper into this challenge in a minute.

Decision Tree Regression

Decision Trees are great at capturing non-linear relationships between input features and the target variable.

The inner working of a Decision Tree can be thought of as a bunch of if-else conditions.
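As a toy illustration (this is not code from the article), a fitted regression tree really does boil down to nested if-else checks. The first three thresholds below are borrowed from the tree example discussed later in the article; the other leaf values are made-up placeholders:

```python
# Toy hand-written "regression tree": nested if-else conditions.
# Thresholds on depth, x, and carat come from the article's example tree;
# the 4500.0 / 9000.0 / 12000.0 leaf values are purely hypothetical.
def toy_tree_predict(depth: float, x: float, carat: float) -> float:
    if depth <= 62.75:
        if x <= 5.545:
            if carat <= 0.905:
                return 2775.75  # mean of the training samples in this leaf
            return 4500.0       # hypothetical leaf mean
        return 9000.0           # hypothetical leaf mean
    return 12000.0              # hypothetical leaf mean

print(toy_tree_predict(depth=60.0, x=5.0, carat=0.8))  # lands in the first leaf
```

Whatever the inputs, the prediction is always one of the fixed leaf values; a real fitted tree works the same way, just with learned thresholds.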

It starts at the very top with a single node, which splits into a left and a right decision node. These nodes in turn split into their own left and right nodes, and so on. The bottom-most nodes are called leaves or terminal nodes.

The value in a leaf is the mean of the training observations that fall within that region. For instance, in the right-most leaf node below, 552.889 is the average of the 5 samples that reached it.

How far this splitting goes is known as the depth of the tree, and it is one of the hyperparameters that can be tuned. The maximum depth is specified to prevent the tree from becoming too deep, a scenario that leads to overfitting.
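As a minimal sketch (assuming scikit-learn; the data here is synthetic, not the article's), capping `max_depth` limits how far the splitting goes, and every prediction is a leaf mean, so predictions never leave the range of the training target:

```python
# Fit a depth-limited regression tree on a synthetic non-linear target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()  # non-linear relationship

# max_depth bounds how deep the if-else splitting can go.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

preds = tree.predict(X)
# Each prediction is the mean of a leaf's training samples,
# so no prediction can fall outside [y.min(), y.max()].
print(tree.get_depth(), preds.min() >= y.min(), preds.max() <= y.max())
```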

Random Forest Regression

Random forest is an ensemble of decision trees. This is to say that many trees, constructed in a certain “random” way form a Random Forest. 

  • Each tree is created from a different sample of rows and at each node, a different sample of features is selected for splitting.
  • Each of the trees makes its own individual prediction.
  • These predictions are then averaged to produce a single result. 
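The three steps above can be checked directly (a sketch assuming scikit-learn, with illustrative synthetic data): the forest's prediction is exactly the average of its individual trees' predictions:

```python
# Verify that a Random Forest prediction is the mean of its trees' predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = X.ravel() ** 2

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

x_new = np.array([[5.0]])
# Collect each individual tree's prediction...
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
# ...their average matches the forest's own prediction.
print(np.isclose(per_tree.mean(), forest.predict(x_new)[0]))
```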

(Image source: Wikimedia)

The averaging makes a Random Forest better than a single Decision Tree: it improves accuracy and reduces overfitting.

A prediction from the Random Forest Regressor is an average of the predictions produced by the trees in the forest. 

Example of trained Linear Regression and Random Forest

To dive in further, let’s look at an example of a Linear Regression and a Random Forest Regression. We’ll apply both to the same dataset and compare the results.

Let’s take this example dataset, where we predict the price of diamonds from features like carat, depth, table, x, y, and z. If we look at the distribution of price below:

We can see that the price ranges from 326 to 18823.

Let’s train the Linear Regression model and run predictions on the validation set.

The distribution of predicted prices is the following:

Predicted prices are clearly outside the range of values of “price” seen in the training dataset. 

A Linear Regression model, just like the name suggests, fits a linear model to the data. A simple way to think about it is y = mx + c. Because it fits a line, it can produce predictions outside the range of target values seen during training: it is able to extrapolate from the data.

Let’s now look at the results obtained from a Random Forest Regressor using the same dataset.

These values are clearly within the range of 326 to 18823, just like in our training set. There are no values outside that range. Random Forest cannot extrapolate.
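The same contrast shows up on a tiny synthetic dataset (a sketch assuming scikit-learn; the diamonds data is not reproduced here). On a clean linear trend, Linear Regression follows the trend past the training range, while the Random Forest's prediction stays capped at the training maximum:

```python
# Linear Regression vs Random Forest on a clean linear trend.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X_train = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 7.0  # y ranges from 7 to 154

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

X_test = np.array([[100.0]])  # well outside the training range
print(lin.predict(X_test)[0])  # extrapolates the trend past 154
print(rf.predict(X_test)[0])   # stays capped at the training maximum
```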

Extrapolation Problem 

As you have seen above, when using a Random Forest Regressor, the predicted values are never outside the training set values for the target variable.

If you look at the prediction values, they will look like this:

Hengl, Tomislav et al., “Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables”, PeerJ 6:e5518, doi:10.7717/peerj.5518

Wondering why?

Let’s explore that phenomenon here. The data used above has the following columns for predicting the price: carat, depth, table, x, y, and z.

The diagram below shows one decision tree from the Random Forest Regressor. 

Let’s zoom in on a smaller section of this tree. For example, there are 4 samples with depth <= 62.75, x <= 5.545, carat <= 0.905, and z <= 3.915. The price predicted for these is 2775.75, which is the mean of those four samples. Therefore, any test-set sample that falls into this leaf will be predicted as 2775.75.

This is to say that when the Random Forest Regressor is asked to predict values it has not seen before, it will always predict an average of values seen previously, and the average of a sample can never fall outside the sample’s highest and lowest values.

The Random Forest Regressor is therefore unable to discover trends that would enable it to extrapolate to values outside the training set. When faced with such a scenario, its predictions plateau near the maximum value seen in training. Figure 1 above illustrates that.

Potential Solutions

Ok, so how can you deal with this extrapolation problem?

There are a couple of options:

  • Use a linear model, such as SVM regression or Linear Regression
  • Build a deep learning model, because neural nets are able to extrapolate (they are basically stacked linear regression models on steroids)
  • Combine predictors using stacking; for example, you can create a stacking regressor from a linear model and a Random Forest Regressor
  • Use modified versions of Random Forest
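The stacking option can be sketched as follows (assuming scikit-learn's `StackingRegressor`; the data and model choices here are illustrative): a linear base learner that can extrapolate is blended with a Random Forest that captures non-linearity:

```python
# Stack a Linear Regression (can extrapolate) with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(0, 100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 5.0

stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=LinearRegression(),  # learns how to blend the two
)
stack.fit(X, y)
print(stack.predict(np.array([[150.0]]))[0])  # a point outside the training range
```

How far such a stack can actually extrapolate depends on how the final estimator weighs the linear component, so treat this as a starting point rather than a guaranteed fix.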

One such modified version is Regression-Enhanced Random Forests (RERFs). The authors of this paper propose a technique that borrows the strengths of penalized parametric regression to give better results on extrapolation problems.

Specifically there are two steps to the process:

  • run Lasso before Random Forest, 
  • train a Random Forest on the residuals from Lasso. 
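The two steps above can be sketched like this (a minimal illustration assuming scikit-learn, not the paper's reference implementation; the data and `alpha` are made up): the Lasso captures the global linear trend, and the forest models only the residuals, so the combined prediction can extend beyond the training range:

```python
# Regression-Enhanced Random Forest sketch: Lasso trend + forest residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 4.0 * X[:, 0] + np.sin(X[:, 1])  # linear trend plus a non-linear part

# Step 1: penalized linear regression captures the global trend.
lasso = Lasso(alpha=0.1).fit(X, y)
residuals = y - lasso.predict(X)

# Step 2: the forest models what the linear part missed.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, residuals)

def rerf_predict(X_new):
    # Final prediction = linear trend + forest correction.
    return lasso.predict(X_new) + forest.predict(X_new)

X_out = np.array([[20.0, 1.0]])  # outside the training range of the first feature
print(rerf_predict(X_out)[0])   # follows the linear trend past the training maximum
```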

Since Random Forest is a fully nonparametric predictive algorithm, it may not efficiently incorporate known relationships between the response and the predictors (the response values being the observed values Y1, . . . , Yn from the training data). RERFs can incorporate such known relationships, which is another benefit of using them for regression problems.

Haozhe Zhang et al., 2019, “Regression-Enhanced Random Forests”, https://arxiv.org/abs/1904.10416

Final Thoughts

At this point, you might be wondering whether or not you should use a Random Forest for regression problems.

Let’s look at that. 

When to use it

  • When the data has a non-linear trend and extrapolation outside the training data is not important.

When not to use it

  • When your data is in time series form. Time series problems require identifying a growing or decreasing trend, which a Random Forest Regressor cannot capture.

Hopefully, this article gave you some background into the inner workings of Random Forest Regression.

This article was originally written by Derrick Mwiti posted on the Neptune blog. You can find more in-depth articles for machine learning practitioners there.

You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.

Copyright © 2024 Globalresearchsyndicate.com
