
What is the difference between labeled and unlabeled data?

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. This question does not appear to be about programming within the scope defined in the help center. Closed 5 days ago. The community reviewed whether to reopen this question 5 days ago and left it closed: original close reason(s) were not resolved.

In this video, Sebastian Thrun says that supervised learning works with "labeled" data and unsupervised learning works with "unlabeled" data. What does he mean by this? Googling "labeled vs unlabeled data" returns a bunch of scholarly papers on the topic. I just want to know the basic difference.

I’m voting to close this question because it's about definitions for a specific data science topic, not about programming.
I’m voting to close this question because it is about machine learning theory. Such questions are on-topic for Artificial Intelligence, Cross Validated or Computer Science.

Riccardo

Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.
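The train-then-predict workflow described above can be sketched in a few lines. This is only an illustration, not the method any particular library uses: the feature values (a made-up height and weight per animal), the names `fit_centroids` and `predict`, and the choice of a nearest-centroid rule are all assumptions picked for brevity.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled dataset: each sample is (features, label).
# The features are invented (height_cm, weight_kg) measurements.
labeled = [
    ((150.0, 400.0), "horse"),
    ((160.0, 450.0), "horse"),
    ((140.0, 700.0), "cow"),
    ((145.0, 750.0), "cow"),
]

def fit_centroids(samples):
    """Average the feature vectors of each label (nearest-centroid training)."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Guess the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(labeled)

# A new, unlabeled sample: only features, no label attached.
new_unlabeled = (155.0, 420.0)
predicted_label = predict(centroids, new_unlabeled)
print(predicted_label)
```

The labels are only needed at training time. Once the centroids exist, the model can guess a label for any new feature vector.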

There are many active areas of research in machine learning that aim to integrate unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, datasets where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.
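One simple way to combine the two kinds of data is self-training: fit a model on the few labeled points, assign "pseudo-labels" to the unlabeled points the model is confident about, and repeat. The sketch below is a toy version under stated assumptions: the 1-D data, the `max_distance` confidence threshold, and the function names are all hypothetical, and real semi-supervised methods are considerably more careful.

```python
# Hypothetical 1-D data: two labeled points and several unlabeled ones.
labeled = [(1.0, "a"), (9.0, "b")]
unlabeled = [1.5, 2.0, 3.0, 7.0, 8.0, 8.5]

def centroids_of(samples):
    """Mean feature value per label."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def self_train(labeled, unlabeled, max_distance=1.6):
    """Repeatedly pseudo-label unlabeled points that lie close to a centroid."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        centroids = centroids_of(labeled)
        confident = []
        for x in remaining:
            label = min(centroids, key=lambda l: abs(centroids[l] - x))
            if abs(centroids[label] - x) <= max_distance:
                confident.append((x, label))
        if not confident:
            break  # nothing is close enough; stop rather than guess
        labeled.extend(confident)
        pseudo_labeled = {x for x, _ in confident}
        remaining = [x for x in remaining if x not in pseudo_labeled]
    return labeled, remaining

final_labeled, leftover = self_train(labeled, unlabeled)
print(final_labeled)
print(leftover)
```

In this toy run the points 3.0 and 7.0 are too far from either centroid in the first pass and only receive a pseudo-label after the centroids have shifted toward them, which is the effect the unlabeled data contributes.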


Nava Bogatee

Labeled data, which is used in supervised learning, adds meaningful tags, labels, or classes to the observations (or rows). These tags can come from the observations themselves or from asking people or specialists about the data.

Classification and regression are the supervised learning techniques that can be applied to labeled datasets.

https://i.stack.imgur.com/4WE6N.png

https://i.stack.imgur.com/xqnJr.png

Clustering is considered one of the most popular unsupervised machine learning techniques. It groups data points or objects that are somehow similar.
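As a rough illustration of clustering on unlabeled data, here is a bare-bones k-means sketch. The sample points, the function name `k_means`, and the naive "first k points" initialisation are assumptions made to keep the example short and deterministic. Production implementations use better initialisation and convergence checks.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means: no labels are used, only distances between points."""
    centroids = list(points[:k])  # naive, deterministic initialisation
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return assignments, centroids

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 9.0)]
assignments, centroids = k_means(points, k=2)
print(assignments)
```

The cluster numbers that come out are arbitrary group ids. Nothing in the data says what each group means, which is the practical difference from a labeled dataset.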

Unsupervised learning has fewer models, and fewer evaluation methods that can be used to ensure that the outcome of the model is accurate. As such, unsupervised learning creates a less controllable environment as the machine is creating outcomes for us.

Picture courtesy of Coursera: Machine Learning with Python


John Greenall

There are many different problems in machine learning, so I'll pick classification as a case in point. In classification, labelled data typically consists of a bag of multidimensional feature vectors (normally called X) and, for each vector, a label Y, which is often just an integer corresponding to a category, e.g. (face=1, non-face=-1). Unlabelled data lacks the Y component. There are many scenarios where unlabelled data is plentiful and easily obtained, but labelled data often requires a human or expert to annotate.
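In code, the X/Y distinction might look like the following. The feature values are invented for illustration and `count_per_label` is a hypothetical helper. Only the face=1 / non-face=-1 convention comes from the answer above.

```python
# Hypothetical feature vectors (X); the numbers are made up for illustration.
X = [
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.5],
    [0.7, 0.3, 0.6],
]
# One label per vector (Y), using the convention face=1, non-face=-1.
Y = [1, -1, 1]

labeled_data = list(zip(X, Y))  # each item is (feature_vector, label)
unlabeled_data = X              # the same vectors with the Y component missing

def count_per_label(pairs):
    """Count how many samples carry each label."""
    counts = {}
    for _, label in pairs:
        counts[label] = counts.get(label, 0) + 1
    return counts

print(count_per_label(labeled_data))
```

A question such as "how many faces are in the dataset?" can only be answered from `labeled_data`; `unlabeled_data` holds the same measurements but no category to count.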


Souravi Sinha

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. For example, labels might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether the dot in an x-ray is a tumor, etc.


Shashwat Pandey

We can say that labeled data is data that is well defined, e.g. emails, IP addresses, etc., whereas unlabeled data is something that is not properly defined, e.g. patterns in nature, migration patterns of birds, etc. Unlabeled data alone does not make much sense, but labeled data can be understood on its own.


Muhammad Waqas Dilawar

To better answer your question, let's first define training data: "Training data just means the prepared data that's used to create a model."

Now let's define labeled data, or supervised learning: "The value you want to predict is actually in the training data." This means that each record in the training data contains all the necessary information (the features as well as the target value).

Unlabeled data, or unsupervised learning: "The value you want to predict is not in the training data."

Side note: Both approaches are used, but it's fair to say that the most common approach is supervised learning.


Krishna Gannamaneni

In unlabeled data, there is no target value (dependent variable). We use unsupervised machine learning models to generate a target/dependent variable, which is basically grouping similar data together as clusters.

