Supervised learning explained | InfoWorld

November 23, 2019

Machine learning is a branch of artificial intelligence that includes algorithms for automatically creating models from data. At a high level, there are four kinds of machine learning: supervised learning, unsupervised learning, reinforcement learning, and active machine learning. Since reinforcement learning and active machine learning are relatively new, they are sometimes omitted from lists of this kind. You could also add semi-supervised learning to the list, and not be wrong.

What is supervised learning?

Supervised learning starts with training data that are tagged with the correct answers (target values). After the learning process, you wind up with a model with a tuned set of weights, which can predict answers for similar data that haven’t already been tagged.
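
To make that loop concrete, here is a minimal sketch in Python using scikit-learn (the model and dataset are illustrative choices, not part of the original article): fit a model on tagged examples, then ask it for predictions.

```python
# Minimal supervised-learning loop: fit on tagged data, then predict.
# Illustrative sketch; assumes scikit-learn and its bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # features plus correct answers (tags)
model = LogisticRegression(max_iter=1000)  # fit() tunes the model's weights
model.fit(X, y)
print(model.predict(X[:5]))                # predicted target values
```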

You want to train a model that has high accuracy without overfitting or underfitting. High accuracy generally means that you have successfully minimized the loss function. In the context of classification problems, accuracy is the proportion of examples for which the model produces the correct output.

Overfitting means that the model is so closely tied to the data it has seen that it doesn’t generalize to data it hasn’t seen. Underfitting means that the model is not complex enough to capture the underlying trends in the data.

The loss function is chosen to reflect the “badness” of the model; you minimize the loss to find the best model. For numerical (regression) problems, the loss function is often the mean squared error (MSE), also formulated as the root mean squared error (RMSE) or root mean squared deviation (RMSD). This corresponds to the Euclidean distance between the data points and the model curve. For classification (non-numerical) problems, the loss function may be based on one of a handful of measures, including the area under the ROC curve (AUC), average accuracy, precision and recall, and log loss. (More on the AUC and ROC curve below.)
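
As a worked example with made-up numbers, here is MSE and RMSE computed directly with NumPy:

```python
# Worked example with made-up numbers: MSE and RMSE as regression losses.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # tagged target values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # model predictions

mse = np.mean((y_true - y_pred) ** 2)    # mean squared error: 0.875
rmse = np.sqrt(mse)                      # root mean squared error: ~0.935
print(mse, rmse)
```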

To avoid overfitting, you often divide the tagged data into two sets, the majority for training and the minority for validation or testing. The validation set loss is usually higher than the training set loss, but it’s the one you care about, because it estimates how well the model generalizes to data it hasn’t seen.
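
A typical split looks like the following sketch (assumes scikit-learn; the 80/20 ratio is just a common convention):

```python
# Hold out a minority of the tagged data to measure generalization.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # 80/20, class-balanced

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_train, y_train))  # training accuracy
print(model.score(X_val, y_val))      # validation accuracy: the one you care about
```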

For small data sets, using a fixed holdout set for validation can leave too few examples to produce statistically reliable estimates. One way around this is to use a cross-validation scheme, in which different folds (data subsets) take turns being the holdout set across multiple training runs.
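
A sketch of k-fold cross-validation with scikit-learn (the model and dataset are placeholders):

```python
# 5-fold cross-validation: each fold takes a turn as the holdout set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average score and its spread across folds
```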

I mentioned that AUC is the area under the ROC curve. ROC is the receiver operating characteristic curve; the term comes from radio signal analysis, but essentially the ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate. A high area under the ROC curve is good, so when you use it as the basis for a loss function you actually want to maximize the AUC.
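
A small sketch of computing AUC from a trained classifier’s predicted probabilities, assuming scikit-learn and synthetic data:

```python
# Compute AUC from a classifier's predicted probabilities on held-out data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # synthetic binary data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # probability of the positive class
print(roc_auc_score(y_te, probs))        # 1.0 is perfect; 0.5 is a coin flip
```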

Data cleaning for machine learning

There is no such thing as clean data in the wild. To be useful for machine learning, data must be aggressively filtered. For example, you’ll want to:

  1. Look at the data and exclude any columns that have a lot of missing data.
  2. Look at the data again and pick the columns you want to use (feature selection) for your prediction. Feature selection is something you may want to vary when you iterate.
  3. Exclude any rows that still have missing data in the remaining columns.
  4. Correct obvious typos and merge equivalent answers. For example, U.S., US, USA, and America should be merged into a single category.
  5. Exclude rows that have data that are out of range. For example, if you’re analyzing taxi trips within New York City, you’ll want to filter out rows with pickup or drop-off latitudes and longitudes that are outside the bounding box of the metropolitan area.

There is a lot more you can do, but it will depend on the data collected. This can be tedious, but if you set up a data-cleaning step in your machine learning pipeline you can modify and repeat it at will.
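
As one hedged sketch of such a pipeline step in pandas, here is a reusable cleaning function; the column names, thresholds, and bounding-box values are hypothetical stand-ins for the steps above:

```python
# Reusable cleaning step; column names and thresholds are hypothetical.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: drop columns that are mostly missing (here, over half)
    df = df.loc[:, df.isna().mean() < 0.5]
    # (Step 2, feature selection, would also happen around here.)
    # Step 3: drop rows still missing data in the remaining columns
    df = df.dropna()
    # Step 4: merge equivalent answers into a single category
    df["country"] = df["country"].replace(
        {"U.S.": "USA", "US": "USA", "America": "USA"})
    # Step 5: filter out-of-range rows (rough NYC bounding box, made-up values)
    df = df[df["pickup_lat"].between(40.5, 41.0)
            & df["pickup_lon"].between(-74.3, -73.6)]
    return df
```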

Data encoding and normalization for machine learning

To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.

One is label encoding, which means that each text label value is replaced with a number. The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered.
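
Here is a brief sketch of both encodings using scikit-learn (illustrative only; the sparse_output argument assumes scikit-learn 1.2 or later):

```python
# Label encoding vs. one-hot encoding of the same text labels.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["red", "green", "blue", "green"])

print(LabelEncoder().fit_transform(labels))  # [2 1 0 1]: implies a spurious order
onehot = OneHotEncoder(sparse_output=False).fit_transform(labels.reshape(-1, 1))
print(onehot)                                # one binary column per category
```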

To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges might tend to dominate the Euclidean distance between feature vectors, their effects could be magnified at the expense of the other fields, and the steepest descent optimization might have difficulty converging. There are a number of ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
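
A minimal sketch of two of those transforms, assuming scikit-learn:

```python
# Two common feature-scaling transforms on a feature with a large range.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

print(MinMaxScaler().fit_transform(X))    # min-max normalization to [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit variance
```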

Feature engineering for machine learning

A feature is an individual measurable property or characteristic of a phenomenon being observed. The concept of a “feature” is related to that of an explanatory variable, which is used in statistical techniques such as linear regression. Feature vectors combine all the features for a single row into a numerical vector.

Part of the art of choosing features is to pick a minimum set of independent variables that explain the problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis (PCA) to convert correlated variables into a set of linearly uncorrelated variables.

Some of the transformations that people use to construct new features or reduce the dimensionality of feature vectors are simple. For example, subtract Year of Birth from Year of Death and you construct Age at Death, which is a prime independent variable for lifetime and mortality analysis. In other cases, feature construction may not be so obvious.
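
A small sketch of both ideas, with hypothetical column names and values: constructing Age at Death, then using PCA to decorrelate the original, highly correlated year columns:

```python
# A constructed feature, then PCA over two correlated columns.
# Column names and values are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"year_of_birth": [1900, 1910, 1920],
                   "year_of_death": [1970, 1995, 2001]})
df["age_at_death"] = df["year_of_death"] - df["year_of_birth"]  # derived feature

# PCA converts the correlated year columns into uncorrelated components
components = PCA(n_components=2).fit_transform(df[["year_of_birth", "year_of_death"]])
print(df["age_at_death"].tolist())
print(components)
```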

Common machine learning algorithms

There are dozens of machine learning algorithms, ranging in complexity from linear regression and logistic regression to deep neural networks and ensembles (combinations of other models). Some of the most common algorithms include the following (a short comparison sketch appears after the list):

  • Linear regression, aka least squares regression (for numeric data)
  • Logistic regression (for binary classification)
  • Linear discriminant analysis (for multi-category classification)
  • Decision trees (for both classification and regression)
  • Naïve Bayes (for classification)
  • K-nearest neighbors, aka KNN (for both classification and regression)
  • Learning vector quantization, aka LVQ (for classification)
  • Support vector machines, aka SVM (for binary classification)
  • Random forests, a type of “bagging” (bootstrap aggregation) ensemble algorithm (for both classification and regression)
  • Boosting methods, including AdaBoost and XGBoost, are ensemble algorithms that create a series of models where each incremental model tries to correct errors from the previous model (for both classification and regression)
  • Neural networks (for both classification and regression)
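
As a hedged illustration of trying several of these algorithms on the same data (the specific models and dataset are my choices, not the article’s):

```python
# Compare a few of the listed algorithms on the same data via cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```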

Hyperparameter tuning

Hyperparameters are free variables that control the training process, as distinct from the weights being tuned within the machine learning model itself. They vary from algorithm to algorithm, but often include the learning rate, which controls the size of the correction applied after the errors have been calculated for a batch.

Several production machine learning platforms now offer automatic hyperparameter tuning. Essentially, you tell the system what hyperparameters you want to vary, and possibly what metric you want to optimize, and the system sweeps those hyperparameters over as many runs as you allow. (Google Cloud Machine Learning Engine’s hyperparameter tuning extracts the appropriate metric from the TensorFlow model, so you don’t have to specify it.)

There are three major search algorithms for sweeping hyperparameters: Bayesian optimization, grid search, and random search. Bayesian optimization tends to be the most efficient. You can easily implement your own hyperparameter sweeps in code, even if that isn’t automated by the platform you are using.
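
For example, a hand-rolled sweep via scikit-learn’s GridSearchCV might look like this sketch (the estimator and parameter grid are arbitrary choices for illustration):

```python
# A hand-rolled grid search over two hyperparameters.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # best settings and their CV score
```

scikit-learn’s RandomizedSearchCV implements random search in the same style, sampling from the parameter space instead of enumerating it.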

To summarize, supervised learning turns labeled training data into a tuned predictive model. Along the way, you need to clean and normalize the data, engineer a set of linearly uncorrelated features, and try multiple algorithms to find the best model.
