What Is Data Science?

The field of data science seems to get bigger and more popular every day. According to LinkedIn, data science was one of the fastest-growing job fields in 2017, and in 2020 Glassdoor ranked data scientist as one of the three best jobs in the United States. Given the growing popularity of data science, it’s no surprise that more people are getting interested in the field. Yet what exactly is data science?

Let’s get acquainted with data science by defining the term, exploring how big data and artificial intelligence are changing the field, learning about some common data science tools, and examining some examples of data science in practice.

Defining Data Science

Before we can explore any data science tools or examples, we’ll want to get a concise definition of data science.

Defining “data science” is actually a little tricky, because the term is applied to many different tasks and methods of inquiry and analysis. We can begin by reminding ourselves of what the term “science” means. Science is the systematic study of the physical and natural world through observation and experimentation, aiming to advance human understanding of natural processes. The important words in that definition are “observation” and “understanding”.

If data science is the process of understanding the world from patterns in data, then the responsibility of a data scientist is to transform data, analyze data, and extract patterns from data. In other words, a data scientist is provided with data and they use a number of different tools and techniques to preprocess the data (get it ready for analysis) and then analyze the data for meaningful patterns.

The role of a data scientist is similar to the role of a traditional scientist. Both are concerned with the analysis of data to support or reject hypotheses about how the world operates, trying to make sense of patterns in the data to improve our understanding of the world. Data scientists make use of the same scientific methods that a traditional scientist does. A data scientist starts by gathering observations about some phenomenon they would like to study. They then formulate a hypothesis about the phenomenon in question and try to find data that nullifies their hypothesis in some way.

If the hypothesis isn’t contradicted by the data, they might be able to construct a theory, or model, about how the phenomenon works, which they can go on to test again and again by seeing if it holds true for other similar datasets. If a model is sufficiently robust, if it explains patterns well and isn’t nullified during other tests, it can even be used to predict future occurrences of that phenomenon.

A data scientist typically won’t gather their own data through an experiment. They usually won’t design experiments with controls and double-blind trials to rule out confounding variables that might interfere with a hypothesis. Most data analyzed by a data scientist will be data gained through observational studies and systems. This is one way in which the job of a data scientist differs from the job of a traditional scientist, who tends to perform more experiments.

That said, a data scientist might be called on to do a form of experimentation called A/B testing, where a tweak is made to a system that gathers data and the data patterns from the original and tweaked versions are compared.
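As a rough illustration of how an A/B test might be evaluated, the sketch below compares the conversion rates of two versions of a system using a two-proportion z-test. The function name, the visitor counts, and the choice of this particular test are assumptions made for the example; the article does not prescribe a specific method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for version A (the original).
    conv_b / n_b: conversions and visitors for version B (the tweak).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 1,000 visitors per version.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the change in the data pattern is unlikely to be due to chance alone, although a real analysis would also consider sample size planning and how long the test ran.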

Regardless of the techniques and tools used, data science ultimately aims to improve our understanding of the world by making sense out of data, and data is gained through observation and experimentation. Data science is the process of using algorithms, statistical principles, and various tools and machines to draw insights out of data, insights that help us understand patterns in the world around us.

What Do Data Scientists Do?

You may have noticed that any activity involving the analysis of data in a scientific manner can be called data science, which is part of what makes defining data science so hard. To make things clearer, let’s explore some of the activities that a data scientist might do on a daily basis.

Data science brings many different disciplines and specialties together. Photo: Calvin Andrus via Wikimedia Commons, CC BY SA 3.0 (https://commons.wikimedia.org/wiki/File:DataScienceDisciplines.png)

On any given day, a data scientist might be asked to: create data storage and retrieval schema, create data ETL (extract, transform, load) pipelines and clean up data, employ statistical methods, craft data visualizations and dashboards, implement artificial intelligence and machine learning algorithms, and make recommendations for actions based on the data.

Let’s break the tasks listed above down a little.

Data Storage, Retrieval, ETL, and Cleanup

A data scientist may be required to handle the installation of technologies needed to store and retrieve data, paying attention to both hardware and software. The person responsible for these tasks may instead hold the title of “data engineer”, although some companies fold these responsibilities into the data scientist role. A data scientist may also need to create, or assist in the creation of, ETL pipelines. Data very rarely comes formatted just as a data scientist needs it. Instead, the data must be received in raw form from the data source, transformed into a usable format, and preprocessed (things like standardizing the data, dropping redundancies, and removing corrupted data).
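The extract, transform, and load steps described above can be sketched with nothing more than the Python standard library. Everything here is a made-up illustration: the sample records, the column names, and the in-memory list standing in for a data store are assumptions, and real pipelines typically rely on dedicated tools.

```python
import csv
import io

# Hypothetical raw export with inconsistent casing, stray spaces,
# a duplicate record, a missing age, and a corrupted age.
RAW = """name,age,city
Alice,34,Boston
alice ,34,boston
Bob,,Chicago
Carol,abc,Denver
Dan,41, Austin
"""


def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize fields, drop corrupted rows and duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        try:
            age = int(row["age"])
        except ValueError:
            continue  # missing or corrupted age: drop the row
        key = (name, age, city)
        if key in seen:
            continue  # redundant record: drop it
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned


def load(rows, store):
    """Load: append cleaned rows to a destination (here, a plain list)."""
    store.extend(rows)
    return store


warehouse = load(transform(extract(RAW)), [])
for record in warehouse:
    print(record)
```

Keeping the three stages as separate functions makes each one easy to test and replace, for example by swapping the list for a database writer.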

Statistical Methods

The application of statistics is what turns simply looking at data and interpreting it into an actual science. Statistical methods are used to extract relevant patterns from datasets, and a data scientist needs to be well versed in statistical concepts. They need to be able to distinguish meaningful correlations from spurious correlations by controlling for confounding variables. They also need to know the right tools to use to determine which features in the dataset are important to their model, that is, which features have predictive power. A data scientist needs to know when to use a regression approach vs. a classification approach, and when to care about the mean of a sample vs. the median of a sample. A data scientist just wouldn’t be a scientist without these crucial skills.
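To see why the choice between mean and median matters, consider a small sample with one extreme value. The salary figures below are invented purely for illustration.

```python
import statistics

# Hypothetical salaries: five typical values and one extreme outlier.
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 1_000_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
```

The single outlier pulls the mean far above what most people in the sample earn, while the median stays close to the typical value. For skewed data like this, the median is often the more representative summary.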

Data Visualization

A crucial part of a data scientist’s job is communicating their findings to others. If a data scientist can’t communicate their findings effectively, then the implications of those findings don’t matter. A data scientist should therefore be an effective storyteller as well. This means producing visualizations that communicate relevant points about the dataset and the patterns discovered within it. There are many different data visualization tools that a data scientist might use, and they may visualize data for the purposes of initial, basic exploration (exploratory data analysis) or visualize the results that a model produces.

Recommendations and Business Applications

A data scientist needs to have some intuition for the requirements and goals of their organization or business. This understanding tells them which variables and features they should be analyzing, so that they explore patterns that will help their organization achieve its goals. Data scientists also need to be aware of the constraints they are operating under and the assumptions that the organization’s leadership is making.

Machine Learning and AI

Machine learning and other artificial intelligence algorithms and models are tools used by data scientists to analyze data, identify patterns within data, discern relationships between variables, and make predictions about future events.
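As a minimal sketch of what “discerning a relationship between variables and making a prediction” can look like, the example below fits a straight line to a handful of points using ordinary least squares and then uses it to forecast a new value. The data, the variable names, and the choice of simple linear regression are assumptions for illustration; real projects usually use a library and far more data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict y for a new x from a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations: advertising spend vs. units sold.
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

model = fit_line(spend, sales)
forecast = predict(model, 6)
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, forecast={forecast:.2f}")
```

The fitted slope describes the relationship between the two variables, and the forecast extends that relationship to an input the model has not seen.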

Traditional Data Science vs. Big Data Science

As data collection methods have gotten more sophisticated and databases larger, a difference has arisen between traditional data science and “big data” science.

Traditional data analytics and data science are done with descriptive and exploratory analytics, aiming to find patterns and analyze the performance results of projects. Traditional data analytics methods often focus only on past and current data. Data analysts often deal with data that has already been cleaned and standardized, while data scientists often deal with complex and dirty data. More advanced data analytics and data science techniques might be used to predict future behavior, although this is more often done with big data, as predictive models often need large amounts of data to be constructed reliably.

“Big data” refers to data that is too large and complex to be handled with traditional data analytics and data science techniques and tools. Big data is often collected through online platforms, and advanced data transformation tools are used to make the large volumes of data ready for inspection by data scientists. As more data is collected all the time, more of a data scientist’s job involves the analysis of big data.

Data Science Tools

Common data science tools include tools to store data, carry out exploratory data analysis, model data, carry out ETL, and visualize data. Platforms like Amazon Web Services, Microsoft Azure, and Google Cloud all offer tools to help data scientists store, transform, analyze, and model data. There are also standalone data science tools like Airflow (data infrastructure) and Tableau (data visualization and analytics).

The machine learning and artificial intelligence algorithms used to model data are often provided through data science libraries and platforms like TensorFlow, PyTorch, and Azure Machine Learning Studio. These platforms let data scientists make edits to their datasets, compose machine learning architectures, and train machine learning models.

Other common data science tools and libraries include SAS (for statistical modeling), Apache Spark (for large-scale data processing, including the analysis of streaming data), D3.js (for interactive visualizations in the browser), and Jupyter (for interactive, shareable code blocks and visualizations).

Photo: Seonjae Jo via Flickr, CC BY SA 2.0 (https://www.flickr.com/photos/130860834@N02/19786840570)

Examples of Data Science

Examples of data science and its applications are everywhere. Data science has applications in everything from food delivery and sports to traffic and health. Wherever data is collected, data science can be applied.

In terms of food, Uber is investing in Uber Eats, an expansion of its ride-sharing system focused on food delivery. Uber Eats needs to get people their food in a timely fashion, while it is still hot and fresh. For this to happen, the company’s data scientists need to use statistical modeling that takes into account factors like distance from restaurants to delivery points, holiday rushes, cooking time, and even weather conditions, all with the goal of optimizing delivery times.

Sports statistics are used by team managers to determine who the best players are and form strong, reliable teams that will win games. One notable example is the data science documented by Michael Lewis in the book Moneyball, where the general manager of the Oakland Athletics team analyzed a variety of statistics to identify quality players that could be signed to the team at relatively low cost.

The analysis of traffic patterns is critical for the creation of self-driving vehicles. Self-driving vehicles must be able to predict the activity around them and respond to changes in road conditions, like the increased stopping distance required when it is raining, as well as the presence of more cars on the road during rush hour. Beyond self-driving vehicles, apps like Google Maps analyze traffic patterns to tell commuters how long it will take them to get to their destination using various routes and forms of transportation.

In terms of health data science, computer vision is often combined with machine learning and other AI techniques to create image classifiers capable of examining things like X-rays, fMRI scans, and ultrasounds to see if any potential medical issues show up in the scan. These algorithms can be used to help clinicians diagnose disease.

Ultimately, data science covers numerous activities and brings together aspects of different disciplines. However, data science is always concerned with telling compelling, interesting stories from data, and with using data to better understand the world.