GLOBAL RESEARCH SYNDICATE

How To Choose The Best Machine Learning Algorithm For A Particular Problem?

by globalresearchsyndicate
October 18, 2020
in Data Analysis


How do you know which machine learning algorithm to choose for your problem? One option is to try every algorithm, or at least those we expect to give good accuracy, but applying each and every algorithm takes a lot of time. It is better to apply a technique that identifies which algorithm is suitable.

Choosing the right algorithm depends on the problem statement, and getting it right can save both money and time. So it is important to know what type of problem we are dealing with.

In this article, we will discuss the key techniques that can be used to choose the right machine learning algorithm for a particular task. We will see how plotting the properties of a dataset can guide the choice of model, and how the size of the dataset can be a significant factor in choosing a machine learning algorithm.



Getting the First Dataset

The dataset is taken from Kaggle. It has information about diabetic patients and whether or not each patient will have an onset of diabetes. It has 9 columns and 767 rows. Each row represents a patient and each column a detail about that patient.

Techniques to choose the right machine learning algorithm

1. Visualization of Data

Practical Implementation:

First, we import the required libraries.

#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

Next, we read the CSV file.

df = pd.read_csv("diabetes.csv")
df.head(5)
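Before plotting, it can help to confirm the shape of the frame and the balance of the target column. The sketch below is only an illustration: it builds a tiny hand-made frame (the values are invented, not taken from diabetes.csv) that reuses the `Glucose`, `BloodPressure` and `Outcome` column names, and the names `toy`, `n_rows`, `n_cols` and `class_share` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for diabetes.csv: the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 66, 40, 74],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
class_share = toy["Outcome"].value_counts(normalize=True)  # fraction of rows per class

print(f"{n_rows} rows, {n_cols} columns")
print(class_share)
```

The same two calls, `df.shape` and `df["Outcome"].value_counts(normalize=True)`, should work on the real frame once it is loaded.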

Pair Plot Method

A pair plot shows how the classes are distributed across pairs of features, which helps us understand which algorithm to choose.

#PairPlot to choose right algorithm
sb.pairplot(data=df[['Glucose' ,'BloodPressure','SkinThickness', 'Outcome']], hue='Outcome', dropna=True, height=3)

From the plot, we can see that there is a lot of overlap between the data points of the two classes. KNN should be preferred here, as it classifies a point from its nearest neighbours, measured by Euclidean distance. If KNN does not perform as expected, we can use the Decision Tree or Random Forest algorithm instead.
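To make the Euclidean-distance idea concrete, here is a minimal pure-Python sketch of a nearest-neighbour vote. It is not the scikit-learn implementation; the function name `knn_predict` and the toy readings are assumptions made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy (glucose, blood pressure) readings; the numbers are invented for illustration.
points = [(85, 66), (89, 66), (95, 70), (160, 80), (170, 85), (183, 64)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(90, 68)))   # query near the low-glucose group
print(knn_predict(points, labels, query=(175, 82)))  # query near the high-glucose group
```

Because the vote depends only on nearby points, overlapping classes influence the prediction locally rather than forcing one global boundary.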

Decision Tree and Random Forest are non-linear classifiers. We can use them when some of the data points overlap with each other.

Many algorithms work on the assumption that the classes can be separated by a straight line. In such cases, Logistic Regression or a Support Vector Machine should be preferred, since each separates the data points by drawing a line that divides the target classes. Linear regression algorithms likewise assume that data trends follow a straight line. These algorithms perform well for the present case.
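One way to test the straight-line assumption is to compare a linear and a non-linear model on held-out data. The sketch below assumes synthetic data from scikit-learn's `make_circles`, where one class forms a ring around the other; it is an illustration, not a result on the diabetes dataset.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one class forms a ring around the other,
# so no single straight line separates them.
X, y = make_circles(n_samples=600, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit a linear and a non-linear classifier and score both on the held-out split
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic regression accuracy: {linear_acc:.2f}")
print(f"Decision tree accuracy:       {tree_acc:.2f}")
```

If the linear model scores about as well as the tree on your own data, the straight-line assumption is probably reasonable; a large gap suggests a non-linear model.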

2. Size of Training Data & Training Time

Import the classifiers whose training time we want to compare on a small and a large dataset.

#Import Sklearn Libraries
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
#Store independent and dependent variable
feature = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
X = df[feature] # Features
y = df["Outcome"]

Split the data into training and test sets. We can then apply the Decision Tree, Logistic Regression, Random Forest and Support Vector Machine algorithms and check the training time of each for a classification problem.

#Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Now, we will fit several machine learning models on this dataset and check the training time taken by these models.


Decision Tree

# Create Decision Tree classifier object
import time
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
start = time.time()
clf = clf.fit(X_train,y_train)
stop = time.time()
print(f"Training time: {stop - start}s")

Logistic Regression

#Import sklearn library
from sklearn.linear_model import LogisticRegression
import time
clf = LogisticRegression(random_state = 0) 
start = time.time()
clf.fit(X_train,y_train) 
stop = time.time()
print(f"Training time: {stop - start}s")

Random Forest

#Create a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
start = time.time()
#Train the model using the training set
clf.fit(X_train,y_train)
stop = time.time()
print(f"Training time: {stop - start}s")

Support Vector Machine 

# Support Vector Classifier
from sklearn.svm import SVC  
clf = SVC(kernel='linear') 
start = time.time() 
# fitting x samples and y classes 
clf.fit(X_train,y_train)
stop = time.time()
print(f"Training time: {stop - start}s")
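The timing pattern repeated in the snippets above can also be wrapped in a small helper. This is a sketch under assumptions: `time_fit` is a hypothetical helper name, the data is synthetic (`make_classification`) rather than diabetes.csv, and `max_iter=1000` is set only to give the solver room to converge. Timings vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()  # monotonic clock intended for measuring intervals
    model.fit(X, y)
    return time.perf_counter() - start

# Synthetic stand-in for a small tabular dataset (8 features, 500 rows).
X_small, y_small = make_classification(n_samples=500, n_features=8, random_state=1)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM (linear)": SVC(kernel="linear"),
}

# Time every model on the same data and print the fastest first
timings = {name: time_fit(model, X_small, y_small) for name, model in models.items()}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f}s")
```

Keeping the models in one dictionary makes it easy to add or remove a candidate without copying the timing code.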

From the above results, we can conclude that the Decision Tree takes much less time to train than the other algorithms on a small dataset. Hence, for a small dataset it is recommended to use a low bias/high variance classifier such as a decision tree.
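To see what "low bias/high variance" means for a decision tree, one can compare a fully grown tree with a depth-limited one. The sketch below assumes a small synthetic dataset with some label noise (`flip_y=0.1`); the variable names `deep` and `shallow` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset; flip_y adds label noise that a fully grown tree can memorise.
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Unlimited depth: the tree keeps splitting until every training leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth capped at 3: a simpler, more constrained model
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unlimited depth", deep), ("max_depth=3", shallow)]:
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```

A gap between the training and test scores of the unlimited tree is the variance the text refers to; `max_depth` is one way to trade some of it for bias.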

Getting the Second Dataset

The dataset is taken from Kaggle. It has information about credit card fraud that occurred over two days. The feature Class is the target variable; it takes the value 1 in case of fraud and 0 otherwise. It has 284807 rows and 31 columns.

#Read the csv file
df = pd.read_csv("creditcard.csv") 
df.head(5) 
X=df.iloc[:,0:-1] 
y=df.iloc[:,-1]

#Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
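Fraud labels are often rare compared with normal transactions, so it may be worth keeping the class ratio equal in both splits. The sketch below assumes a synthetic target with 2% positives (an invented ratio, not the real fraud rate in creditcard.csv) and uses the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 positives out of 5000 rows (2%, an invented ratio).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = np.zeros(5000, dtype=int)
y[:100] = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1,
    stratify=y,  # keep the share of each class the same in both splits
)

print(f"Positive share in train: {y_train.mean():.4f}")
print(f"Positive share in test:  {y_test.mean():.4f}")
```

Without `stratify`, a random split of a very imbalanced target can leave the test set with noticeably more or fewer positives than the training set.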

Now we will fit the same machine learning models on this second dataset and check the training time taken by each.

Decision Tree

# Create Decision Tree classifier object
import time
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier 
start = time.time() 
clf = clf.fit(X_train,y_train) 
stop = time.time() 
print(f"Training time: {stop - start}s")

Logistic Regression

#Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
import time
classifier = LogisticRegression(random_state = 0) 
start = time.time()
classifier.fit(X_train,y_train) 
stop = time.time()
print(f"Training time: {stop - start}s")

Random Forest

#Create a RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
start = time.time()
#Train the model using the training set
clf.fit(X_train,y_train)
stop = time.time()
print(f"Training time: {stop - start}s")

Support Vector Machine

#Support Vector Classifier 
from sklearn.svm import SVC 
clf = SVC(kernel='linear') 
start = time.time() 
# fitting x samples and y classes 
clf.fit(X_train,y_train)
stop = time.time()
print(f"Training time: {stop - start}s") 

As the dataset grows, the depth of the Decision Tree grows as well: the tree amounts to a longer chain of nested if-else statements, which increases complexity and training time. Random Forest and XGBoost are both built on decision trees, so they also take more time. The results show that Logistic Regression outperforms the others on training time.
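The link between dataset size and tree depth can be checked directly with `get_depth()`. The sketch below assumes noisy synthetic data from `make_classification` rather than creditcard.csv, and the `depths` dictionary is an illustrative name; the exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# One noisy synthetic dataset; the small set is simply its first 200 rows.
X, y = make_classification(n_samples=20000, n_features=10, flip_y=0.1, random_state=0)

depths = {}
for n_rows in (200, 20000):
    tree = DecisionTreeClassifier(random_state=0).fit(X[:n_rows], y[:n_rows])
    depths[n_rows] = tree.get_depth()  # number of if-else levels on the longest path
    print(f"{n_rows} rows -> depth {depths[n_rows]}, {tree.get_n_leaves()} leaves")
```

Setting `max_depth` or `min_samples_leaf` is a common way to stop this growth when training time matters.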

Final Thoughts

This concludes my analysis of how to select the right machine learning algorithm. It is always advisable to try at least two algorithms on the problem statement, so that each provides a reference point for the other.
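As a sketch of the two-algorithm advice, cross-validation can score two candidates on the same folds. The data here is synthetic and the `candidates` dictionary is an illustrative name; this is one possible way to compare models, not the method used for the timings above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small tabular classification problem.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```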


Copyright © 2024 Globalresearchsyndicate.com
