How To Code Linear Regression Models With R

Regression is one of the most common problems in data science, and it is widely applied in artificial intelligence and machine learning. Regression techniques are used in machine learning to predict continuous values, such as salaries, ages, or profits. Linear regression is the type of regression in which the relationship between the dependent variable and the independent variables can be represented by a linear function.

In this article, we will build a template for three commonly used linear regression models in ML:

Simple Linear Regression
Multiple Linear Regression
Support Vector Regression

Simple Linear Regression

Simple linear regression is the simplest regression model of all. The model is used when there are only two factors, one dependent and one independent.

The model can predict an employee's salary from their age or experience. Given a dataset with two columns, age (or experience in years) and salary, the model can be trained to formulate a relationship between the two. Based on the derived formula, the model can then predict the salary for any given age or experience.

Here’s The Code:

Simple linear regression is handled by the built-in function ‘lm’ in R.

Creating the Linear Regression Model and fitting it with training_set

regressor = lm(formula = Y ~ X, data = training_set)

This line creates a regressor and provides it with the data set to train.

* formula : Separates the dependent variable (left of ‘~’) from the independent variable(s) (right of ‘~’). With multiple independent variables, the variables are joined using the ‘+’ symbol, e.g. Y ~ X1 + X2 + X3 + …

* X : Independent variable or factor. The column label is specified.

* Y : Dependent variable. The column label is specified.

* data : The data the model trains on, training_set.
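
To make the call above concrete, here is a minimal sketch on synthetic data. The column names Experience and Salary, the simulated numbers, and the 40/10 split are assumptions made purely for illustration; they do not come from a real dataset.

```r
# Hypothetical data: salary grows roughly linearly with experience
set.seed(42)
dataset <- data.frame(Experience = runif(50, min = 1, max = 10))
dataset$Salary <- 30000 + 9000 * dataset$Experience + rnorm(50, sd = 3000)

# Hold out 10 of the 50 rows as a test set
train_rows <- sample(nrow(dataset), size = 40)
training_set <- dataset[train_rows, ]
test_set <- dataset[-train_rows, ]

# Fit Salary as a linear function of Experience
regressor <- lm(formula = Salary ~ Experience, data = training_set)

# Estimated intercept and slope
print(coef(regressor))
```

The two printed coefficients are the intercept and the slope of the fitted line; with real data, Salary and Experience would be replaced by your own column labels.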

Predicting the values for test set

predicted_Y = predict(regressor, newdata = test_set)

This line predicts the values of the dependent factor for the new values of the independent factor.

* regressor : the regressor model that was previously created for training.

* newdata : the new set of observations that you want to predict Y for.
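
As a small illustration of how newdata is used, the sketch below trains on made-up, noise-free data and predicts for two unseen values. The data and the name new_observations are hypothetical. The point to notice is that predict() matches predictors by column name, so newdata must be a data frame containing a column named like the one in the formula.

```r
# Hypothetical training data lying exactly on the line Y = 2 * X + 1
training_set <- data.frame(X = 1:10)
training_set$Y <- 2 * training_set$X + 1

regressor <- lm(formula = Y ~ X, data = training_set)

# newdata needs a column named X, matching the predictor in the formula
new_observations <- data.frame(X = c(11, 12))
predicted_Y <- predict(regressor, newdata = new_observations)

print(predicted_Y)  # approximately 23 and 25, since the data are noise-free
```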

Visualizing training set predictions

install.packages('ggplot2') # install once
library(ggplot2) # importing the library
ggplot() +
  geom_point(aes(x = training_set$X, y = training_set$Y), colour = 'black') +
  geom_line(aes(x = training_set$X, y = predict(regressor, newdata = training_set)), colour = 'red') +
  ggtitle('Y vs X (Training Set)') +
  xlab('X') +
  ylab('Y')

Visualizing test set predictions

ggplot() +
  geom_point(aes(x = test_set$X, y = test_set$Y), colour = 'blue') +
  # the regression line comes from the same fitted model, so it is still drawn over training_set
  geom_line(aes(x = training_set$X, y = predict(regressor, newdata = training_set)), colour = 'red') +
  ggtitle('Y vs X (Test Set)') +
  xlab('X') +
  ylab('Y')

These two blocks of code plot the dataset on a graph. The ggplot2 library is used for plotting the data points and the regression line.

The first block plots the training_set points and the second block the test_set points, each against the fitted regression line.

* geom_point() : Draws all data points as a scatter plot on a two-dimensional graph

* geom_line() : Draws the regression line on the graph

* ggtitle() : Sets the title of the graph

* xlab() : Labels the X-axis

* ylab() : Labels the Y-axis

Replace all X and Y with the independent and dependent factors (column labels) respectively.

Multiple Linear Regression

Multiple linear regression is another simple regression model, used when multiple independent factors are involved. Unlike simple linear regression, more than one independent factor contributes to the dependent factor. It is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical (encoded as dummy variables).
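
To illustrate how a categorical variable enters the model, here is a minimal sketch with made-up data. The columns Experience, Department and Salary and all values are hypothetical. When a column is stored as a factor, lm creates the dummy variables itself, using the first level as the reference category.

```r
# Hypothetical data with one numeric and one categorical predictor
dataset <- data.frame(
  Experience = c(1, 3, 5, 7, 9, 2, 4, 6, 8, 10),
  Department = factor(rep(c("Sales", "Tech"), each = 5)),
  Salary = c(35, 44, 52, 61, 70, 48, 57, 66, 74, 83)
)

regressor <- lm(formula = Salary ~ Experience + Department, data = dataset)

# One coefficient for the numeric predictor, plus one dummy
# for the non-reference level of Department ("Tech")
print(names(coef(regressor)))
# names are "(Intercept)", "Experience", "DepartmentTech"
```

The DepartmentTech coefficient is the estimated difference in Salary between the Tech and Sales rows at the same level of Experience.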

Unlike simple linear regression, where we had only one independent variable, having several independent variables raises another challenge: identifying which of them are actually related to the dependent variable. Backward Elimination is one method that can help us identify the independent variables with the strongest relation to the dependent variable. In this method, a significance level is chosen, most commonly 0.05. The fitted regressor returns a P value for each independent variable. If the largest P value is greater than the chosen significance level, that variable is removed, the model is refitted, and the P values are updated. The process is repeated until every remaining variable has a P value at or below the significance level.

This model can be used to predict the salary of an employee from multiple factors such as experience and employee_score.
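
The P values that drive this procedure can be read from the coefficient table of a fitted model. The sketch below uses simulated data in which, by construction, only X1 influences Y. The column names, the sample size and the 0.05 threshold are assumptions for illustration.

```r
# Hypothetical data: three candidate predictors, only X1 drives Y
set.seed(1)
n <- 100
dataset <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dataset$Y <- 5 + 3 * dataset$X1 + rnorm(n)

regressor <- lm(formula = Y ~ ., data = dataset)

# Coefficient table: one row per term, P values in the "Pr(>|t|)" column.
# Row 1 is the intercept, so it is left out.
p_values <- coef(summary(regressor))[-1, "Pr(>|t|)"]
print(p_values)

# Predictors whose P value exceeds the significance level
# would be candidates for removal
removal_candidates <- names(p_values)[p_values > 0.05]
print(removal_candidates)
```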

Here’s The Code:

The Multiple Linear Regression is also handled by the function lm.

Creating the Multiple Linear Regressor and fitting it with Training Set

regressor = lm(Y ~ ., data = training_set)

The expression ‘Y ~ .’ takes all variables except Y in the training_set as independent variables.
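
A quick way to convince yourself of what the dot does is to compare it with the spelled-out formula. The tiny dataset below is made up for this check.

```r
# Hypothetical data with two predictors
dataset <- data.frame(
  X1 = c(1, 2, 3, 4, 5, 6),
  X2 = c(2, 1, 4, 3, 6, 5),
  Y  = c(3.1, 2.9, 7.2, 6.8, 11.1, 10.9)
)

fit_dot      <- lm(formula = Y ~ ., data = dataset)        # all columns except Y
fit_explicit <- lm(formula = Y ~ X1 + X2, data = dataset)  # same predictors, listed

# Both formulas describe the same model, so the coefficients agree
print(all.equal(coef(fit_dot), coef(fit_explicit)))  # TRUE
```

The dot is convenient, but it also pulls in any identifier or helper columns left in the data frame, so it is worth dropping those before fitting.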

Predicting the values for test set

predicted_Y = predict(regressor, newdata = test_set)

Using Backward Elimination to Find the most significant Factors

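backwardElimination <- function(x, sl) {
  # Assumes x holds only numeric predictors plus the dependent column Y
  for (i in seq_len(ncol(x))) {
    regressor = lm(formula = Y ~ ., data = x)
    # P values of the predictors (row 1, the intercept, is skipped)
    pValues = coef(summary(regressor))[-1, "Pr(>|t|)", drop = FALSE]
    # Stop once every predictor is significant or only one is left
    if (max(pValues) <= sl || nrow(pValues) == 1) break
    # Drop the predictor with the largest P value, then refit
    worst = rownames(pValues)[which.max(pValues)]
    x = x[, names(x) != worst, drop = FALSE]
  }
  return(summary(regressor))
}
SL = 0.05
# Keep only the columns to test plus Y, e.g.
# dataset = dataset[, c(indexes of independent factors and Y, separated by a comma)]
backwardElimination(dataset, SL)

This block applies the Backward Elimination method. At each step, the independent variable with the largest P value is removed if that P value exceeds the chosen significance level, and the model is refitted. The loop stops when only significant variables remain, and the summary of the final model is returned.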

Support Vector Regression

Support Vector Regression (SVR) is the regression counterpart of the Support Vector Machine (SVM), which is best known as a classification model. Whereas an SVM classifier predicts categories, SVR uses the same principle to predict continuous values.

Here’s The Code:

The package e1071 is used for handling Support Vector Regression in R.

Installing and Importing the Library

install.packages('e1071') # install once
library(e1071) # importing the library

Creating the Support Vector Regressor and fitting it with Training Set

svr_regressor = svm(formula = Y ~ ., data = training_set, type = 'eps-regression')

This line creates a Support Vector Regressor and provides the data to train.

* type : Selects the kind of SVM to fit. ‘eps-regression’ denotes that this is a regression problem; e1071 also offers ‘nu-regression’ as well as several classification types.

Predicting the values for test set

predicted_Y = predict(svr_regressor, newdata = test_set)
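
Putting the SVR steps together, here is a minimal end-to-end sketch. It assumes the e1071 package is already installed, and the sine-shaped data, the 45/15 split, and the use of root mean squared error as a quick check are all illustrative choices rather than part of the template above.

```r
library(e1071)  # assumes install.packages('e1071') has been run once

# Hypothetical non-linear data: a noisy sine curve
set.seed(7)
dataset <- data.frame(X = seq(0, 10, length.out = 60))
dataset$Y <- sin(dataset$X) + rnorm(60, sd = 0.1)

# Hold out 15 of the 60 rows as a test set
train_rows <- sample(nrow(dataset), size = 45)
training_set <- dataset[train_rows, ]
test_set <- dataset[-train_rows, ]

# Fit the support vector regressor on the training rows
svr_regressor <- svm(formula = Y ~ ., data = training_set, type = 'eps-regression')

# Predict the held-out rows and measure the typical error
predicted_Y <- predict(svr_regressor, newdata = test_set)
rmse <- sqrt(mean((test_set$Y - predicted_Y)^2))
print(rmse)
```

By default, svm uses a radial kernel, which is why it can follow a curved relationship like this one; passing kernel = 'linear' restricts the model to a linear fit.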

Outlook

The R programming language has been gaining popularity in the ever-growing field of AI and Machine Learning. The language has libraries and extensive packages tailored to solve real-world problems and has thus proven to be a strong alternative to its competitor Python. Linear regression models are the perfect starter pack for machine learning enthusiasts. This tutorial has given you a template for creating three of the most common regression models in R that you can apply to any regression dataset.