Are you stuck in the past? A case against data sampling Part I

by globalresearchsyndicate
November 13, 2019

In a world of cloud computing, where storage and elastic compute resources have become more flexible and affordable than ever, more and more organizations are asking why they limit themselves to architectures that have been dated for decades. Today, data is abundant, with more diverse information generated every day. If large data sets let us gain complete insights about our customers or competitors, why would we limit ourselves to exploring only small segments of the data?

Sample-based data preparation techniques – in which a random, smaller subset of the entire data set is selected to infer general rules about the shape and quality of the full data – have their roots in statistics but are now creeping into all sorts of data projects. Like a small lens that shows only part of a larger object, this paradigm is limited, flawed and not grounded in reality.
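As a quick, hedged illustration of that flaw (a purely hypothetical table, sketched in pandas), a random sample can entirely miss a rare data-quality problem that a scan of the full data would always catch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# One million rows, of which only 50 carry a malformed country code.
n = 1_000_000
df = pd.DataFrame({"country": ["US"] * n})
bad_rows = rng.choice(n, size=50, replace=False)
df.loc[bad_rows, "country"] = "U$"

# Profile a 1% random sample vs. the full data.
sample = df.sample(frac=0.01, random_state=0)
print("bad codes in the sample:", (sample["country"] == "U$").sum())  # frequently 0
print("bad codes in full data: ", (df["country"] == "U$").sum())      # always 50
```

With roughly a one-in-two chance of the sample containing zero bad rows, a sample-based profile would happily report the column as clean.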

More on that later, but let’s first study where the concept of sampling started.

Sampling and its roots in statistics and machine learning

Researchers and scientists rarely, if ever, work with the entire population of the data. Instead, they conduct studies using samples to make generalizations about the entire data population. Take medical researchers: they rely on a sample of patients to study the effect of a treatment on a disease. In pharmaceuticals, clinical trials are performed on a subset of subjects. Marketing specialists build campaigns based on surveys conducted across a subset of their entire customer base. And the list goes on.

Yes, this methodology – i.e. using a sample of data to model predictive outcomes or calculate risks and exposures on new data – was introduced decades ago, primarily because of two major limitations:

  1. Data was never available in its entirety.
  2. The processing power and computational resources could not handle larger datasets.

Over the last several decades, statistical tooling was dominated by desktop applications, which inherently have limited capacity for data storage and compute. As a result, statisticians resorted to samples.

Data sampling problems in statistical and machine learning projects

Even though sampling is common practice in data science projects, organizations continue to seek ways to overcome the errors it introduces.

The general recommendation is that the sample size should be sufficiently large. But how large is large enough? It really depends on the population: when the population is skewed, the sample needs to be large, but when the population is symmetric, smaller samples can suffice.
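Here is a minimal sketch of why skew matters (NumPy only, with synthetic stand-in populations): at the same sample size, the sample mean of a skewed population strays further, in relative terms, from the true mean than that of a symmetric one, so the skewed population demands a larger sample:

```python
import numpy as np

rng = np.random.default_rng(0)

symmetric = rng.normal(10, 2, size=1_000_000)   # symmetric population
skewed = rng.lognormal(2, 1, size=1_000_000)    # heavily right-skewed population

def relative_mean_error(pop, sample_size, trials=1_000):
    """Average |sample mean - population mean|, relative to the population mean."""
    true_mean = pop.mean()
    errs = [abs(rng.choice(pop, sample_size).mean() - true_mean)
            for _ in range(trials)]
    return np.mean(errs) / true_mean

for size in (30, 300, 3_000):
    print(size,
          round(relative_mean_error(symmetric, size), 4),
          round(relative_mean_error(skewed, size), 4))
```

At every sample size, the skewed population's estimate is several times less accurate, so it takes many more rows to reach the same confidence.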

The size of the data sample also depends on the type of model selected. Large amounts of training data play a critical role in making deep learning models successful, whereas traditional machine learning models (e.g. linear regression) don’t require as much.

However, as this article points out, large data impacts model performance in traditional machine learning and in more advanced deep learning in much the same way. The graph below, referenced in that article, shows how the performance of both linear and nonlinear (e.g. deep learning) algorithms continues to improve as you give them more data.

[Figure: model performance improves with the amount of data for both linear algorithms and deep learning methods. Source / CC BY 2.0]

Scientists agree that with smaller samples, the room for bias (the difference between predicted and actual values) and variance (the difference in performance between training and test data) is high. As we increase the number of data points (i.e. the size of the sample), we capture the true distribution more faithfully. More data helps the model uncover the true relationship between data elements.
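As a small illustration of that point (synthetic data with a known true slope of 3), a least-squares estimate of the relationship stabilizes around the truth as the number of data points grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a known true relationship: y = 3x + noise.
def draw(n):
    x = rng.uniform(0, 10, n)
    y = 3 * x + rng.normal(0, 5, n)
    return x, y

# A least-squares slope estimate converges on the true slope (3)
# as the number of data points grows.
for n in (10, 100, 10_000):
    x, y = draw(n)
    slope = np.polyfit(x, y, 1)[0]
    print(n, round(slope, 3))
```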

However, keep in mind that in machine learning, although larger samples and more input data improve model performance, selecting the entire population of the data, even when technically and physically achievable, is not always recommended. The model can overfit: it learns so much of the detail and noise in the training data that its performance and predictions degrade when applied to new data. After all, we don’t want machine learning models to pick up the noise or random fluctuations in the training data and learn them as concepts.
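A toy sketch of that noise-learning behavior (NumPy only, synthetic quadratic data): an overly flexible model drives its training error toward zero while, typically, its error on unseen test data gets worse:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy observations of a simple quadratic, y = x^2 + noise.
def draw(n):
    x = rng.uniform(-1, 1, n)
    return x, x**2 + rng.normal(0, 0.1, n)

x_train, y_train = draw(20)
x_test, y_test = draw(1_000)

# A degree-2 fit matches the true shape; a degree-12 fit chases the noise.
for degree in (2, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 4), round(test_mse, 4))
```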

Nevertheless, it is well established that the more data available for training, the better the model performs. This is so critical that many data scientists develop procedures to overcome the problems of sampling in order to draw more accurate conclusions about the population.

Because sample-based findings are likely to differ from what you would find using the entire population, statisticians apply mathematical principles to determine whether what you see in the sample is what you would see in the population. These techniques include generalizations and inferential statistics, such as regression analysis or analysis of variance, that test whether the sample is a good representation of the whole data, helping data scientists distinguish signal (real differences) from noise (random error). Plenty of articles are available to learn more about this.
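As one hedged example of such a check (SciPy’s two-sample Kolmogorov–Smirnov test, with synthetic data standing in for a real column), we can test whether a sample’s distribution is consistent with the full data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
population = rng.lognormal(3, 1, size=100_000)   # stand-in for a full data column

random_sample = rng.choice(population, 1_000)    # drawn at random
biased_sample = np.sort(population)[:1_000]      # the 1,000 smallest values

# A small p-value indicates the sample's distribution differs
# from the population's.
print(ks_2samp(population, random_sample).pvalue)   # large: looks representative
print(ks_2samp(population, biased_sample).pvalue)   # ~0: badly biased
```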

Machine learning aside: How samples can undermine business decisions

To recap: even if a sample is created as a representation of the entire data, sample sizes should be large enough to support good model performance, and statistical techniques should be applied so the insights are not skewed or misrepresented by the sample.

However, it should be profoundly clear that these techniques fall within the purview and expertise of SMEs in statistics or data science. So what about other types of data projects – analytics and reporting, creating a single view of customers or suppliers, application migrations and consolidations – where the data practitioner is not a statistician or data science expert?

Let’s take this a step further. Knowing that there are dozens of data sampling techniques, how would we expect a general business or data analyst, or even a SQL developer, to know which one to apply? Common examples include (see the code sketch after this list):

  • Simple random sampling randomly selects subjects from the whole population.
  • Stratified sampling divides the data set or population into subgroups based on a common factor, and samples are randomly collected from each subgroup.
  • Cluster sampling divides a larger data set into subsets (clusters) based on a defined factor, then a random sample of clusters is analyzed.
  • Multistage sampling is a more complicated form of cluster sampling: the larger population is divided into a number of clusters, second-stage clusters are then broken out based on a secondary factor, and those clusters are sampled and analyzed. This staging can continue, with multiple subsets identified, clustered and analyzed.
  • Systematic sampling sets an interval at which to extract data from the larger population – for example, selecting every 10th row.
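
For concreteness, here is a minimal pandas sketch (against a hypothetical customer table) of three of these techniques:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical customer table with a grouping column.
df = pd.DataFrame({
    "region": rng.choice(["NA", "EU", "APAC"], size=10_000),
    "spend": rng.gamma(2.0, 50.0, size=10_000),
})

# Simple random sampling: pick rows uniformly at random.
simple = df.sample(n=500, random_state=0)

# Stratified sampling: a random 5% from each region subgroup.
stratified = df.groupby("region", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0)
)

# Systematic sampling: every 20th row.
systematic = df.iloc[::20]

print(len(simple), len(stratified), len(systematic))
```

Even in this toy form, each technique makes different assumptions about the data, and picking the wrong one quietly biases everything downstream.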

As you can see, sampling only works when it is put in the hands of data science specialists. But what about general business users, or data and business analysts, who don’t have the expertise or the programming mindset, or who have not grown up in the school of mathematics? While they all want to prepare, shape, and clean data on their own, the reality is they have only three choices:

  1. Leave data cleanup and shaping to their technical counterparts and hope for the best.
  2. Use randomly generated samples to glean insights and work out how the data should be shaped, cleaned and prepared, then hand those findings to their technical counterparts, who clean the full data based on the sample-based guidance.
  3. Leverage modern technology to prepare and shape the data on their own using the entire body of data, not just samples.

For those who want to throw in the towel and call it a day (a.k.a. option #1), we understand. But for those with a data-driven mindset and a self-service attitude – those who would naturally pick option 2 or 3 – there is hope.

In Part II of this series, we will explain how to avoid problems with samples outside the context of machine learning and statistics and delve into how modern technologies can free you from samples.
