
How do I split a custom dataset into training and test datasets?

import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)

This is what I could manage to do by using references from other repositories. However, I want to split this dataset into train and test.

How can I do that inside this class? Or do I need to make a separate class to do that?


Fábio Perez

Starting in PyTorch 0.4.1 you can use random_split:

import torch

# full_dataset is an instance of your Dataset (e.g. CustomDatasetFromCSV)
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
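If you also want the split to be reproducible and wrapped in loaders, here is a minimal sketch (the generator argument of random_split exists only in newer PyTorch versions; full_dataset and the batch size are placeholders):

import torch
from torch.utils.data import DataLoader, random_split

# Seed a generator so the split is the same on every run (newer PyTorch versions).
generator = torch.Generator().manual_seed(42)
train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size], generator=generator)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)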

I followed your answer and got this problem while iterating through the split train_loader stackoverflow.com/questions/53916594/…
AttributeError: 'Subset' object has no attribute 'targets'. How can I access the targets of only one of the subsets? I want to print something like this for the train and test data separately: {0: 111, 1: 722, 2: 813, 3: 175, 4: 283, 5: 2846, 6: 290, 7: 106}
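One hedged way to get those per-class counts from a Subset is to go through its indices into the parent dataset; this sketch assumes the underlying dataset exposes a targets attribute (as the torchvision datasets do):

from collections import Counter

def subset_class_counts(subset):
    # Subset keeps a reference to the parent dataset and the chosen indices.
    targets = subset.dataset.targets
    return dict(Counter(int(targets[i]) for i in subset.indices))

print(subset_class_counts(train_dataset))
print(subset_class_counts(test_dataset))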
For others: If you're getting TypeError 'DataLoader' object is not subscriptable you probably also want to look at stackoverflow.com/a/60150673/12068941
Any way to include a spatial resampling strategy?
Community

Using PyTorch's SubsetRandomSampler:

import torch
import numpy as np
import pandas as pd
import cv2
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).to_numpy()  # as_matrix() was removed in newer pandas
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label 
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]

        return face, label

    def __len__(self):
        return len(self.labels)


dataset = CustomDatasetFromCSV(my_path)
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed= 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage Example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:   
    for batch_index, (faces, labels) in enumerate(train_loader):
        # training step goes here
        pass

What is num_train?
My bad, it has been renamed appropriately (dataset_size).
Also, when I put this into the model, the forward function receives a 5D tensor of shape (32L, 35887L, 48L, 48L, 1L): 32 is the batch size, next is the length of the dataset, and then the image height, width and channel.
Dataset.__getitem__() should return a single sample and label, not the whole dataset. I edited my post to give you an example how it should look.
@AnaClaudia: batch_size defines the number of samples stacked together into a mini-batch passed to the neural network each training iteration. See Dataloader documentation or this Cross-Validated thread for more info.
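A quick sketch to check what one mini-batch looks like once __getitem__ returns a single sample at a time (the shapes shown are illustrative, not guaranteed):

# Inspect one mini-batch coming out of the loader.
faces, labels = next(iter(train_loader))
print(faces.shape)    # (batch_size, height, width), e.g. torch.Size([16, 48, 48])
print(labels.shape)   # (batch_size, number of label columns)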
Shital Shah

The current answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In this case, a random split may produce an imbalance between classes (one digit ending up with more training data than the others). So you want to make sure each digit has exactly 30 samples. This is called stratified sampling.

One way to do this is to use the sampler interface in PyTorch, and sample code is here.
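A minimal sketch of that sampler-based route, using scikit-learn only to generate stratified indices and then feeding them to SubsetRandomSampler (it assumes labels is an array of class ids for the whole dataset):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data.sampler import SubsetRandomSampler

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.zeros(len(labels)), labels))

train_sampler = SubsetRandomSampler(train_idx)
test_sampler = SubsetRandomSampler(test_idx)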

Another way to do this is just to hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.

import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        c = label.item()
        class_counts[c] = class_counts.get(c, 0) + 1
        # The first k samples seen for each class go to the training set,
        # everything else goes to the test set.
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.unsqueeze(label, 0))
        else:
            test_data.append(data)
            test_label.append(torch.unsqueeze(label, 0))
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)

    return (TensorDataset(train_data, train_label), 
        TensorDataset(test_data, test_label))

You can use this function like this:

from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)
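The returned TensorDatasets can then be wrapped in DataLoaders as usual (a short sketch; the batch size is arbitrary):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=32)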

Eric

If you would like to ensure your splits have balanced classes, you can use train_test_split from sklearn.

Assuming you have wrapped your data in a custom Dataset object:

from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.1
BATCH_SIZE = 64
SEED = 42

# generate indices: instead of the actual data we pass in integers
train_indices, test_indices, _, _ = train_test_split(
    range(len(data)),
    data.targets,
    stratify=data.targets,
    test_size=TEST_SIZE,
    random_state=SEED
)

# generate subset based on indices
train_split = Subset(data, train_indices)
test_split = Subset(data, test_indices)

# create batches
train_batches = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)
test_batches = DataLoader(test_split, batch_size=BATCH_SIZE)
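If your custom Dataset does not expose a targets attribute, one hedged workaround is to collect the labels once up front and pass that list to stratify (this assumes __getitem__ returns a (sample, label) pair):

# Collect the labels once; this iterates the whole dataset, so it can be slow for large data.
targets = [data[i][1] for i in range(len(data))]

train_indices, test_indices = train_test_split(
    range(len(data)),
    stratify=targets,
    test_size=TEST_SIZE,
    random_state=SEED
)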

prosti

The screenshot below shows the PyTorch Subset class together with the random_split method that returns Subset objects. Note that random_split is an alternative to the SubsetRandomSampler approach.

https://i.stack.imgur.com/K9D0z.png

For MNIST if we use random_split:

loader = DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.5,), (0.5,))
                             ])),
  batch_size=16, shuffle=False)

print(loader.dataset.data.shape)
test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
print(test_ds, valid_ds)
print(test_ds.indices, valid_ds.indices)
print(test_ds.indices.shape, valid_ds.indices.shape)

We get:

torch.Size([60000, 28, 28])
<torch.utils.data.dataset.Subset object at 0x0000020FD1880B00> <torch.utils.data.dataset.Subset object at 0x0000020FD1880C50>
tensor([ 1520,  4155, 45472,  ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067,  ...,  3950, 37306, 31400])
torch.Size([50000]) torch.Size([10000])

Our test_ds.indices and valid_ds.indices will be random indices from the range (0, 60000). But if I would like to get a contiguous sequence of indices, say (0, 49999) and (50000, 59999), random_split cannot do that at the moment, unfortunately, except by building the Subset objects manually, as sketched below.
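A minimal sketch of such a deterministic split with contiguous indices, assuming the standard MNIST ordering (first 50,000 samples for training, last 10,000 for validation):

from torch.utils.data import Subset

test_ds = Subset(loader.dataset, range(0, 50000))
valid_ds = Subset(loader.dataset, range(50000, 60000))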

Handy in case you run the MNIST benchmark, where it is predefined what should be the test and what should be the validation dataset.


clearly the easiest way to go
Is there a reason why the code is a screenshot? Please avoid that.
prosti

Bear in mind that most canonical examples are already split. For instance, on this page you will find MNIST. One common belief is that it has 60,000 images. Bang! Wrong! It has 70,000 images: 60,000 training and 10,000 validation (test) images.

So for the canonical datasets, the flavor of PyTorch is to provide you with already split datasets.

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

bs=512

t = transforms.Compose([
                       transforms.ToTensor(),
                       # (0.0,) and (1.0,) are one-element tuples; a bare (0) is just the int 0
                       transforms.Normalize(mean=(0.0,), std=(1.0,))]
                       )

dl_train = DataLoader( torchvision.datasets.MNIST('/data/mnist', download=True, train=True, transform=t), 
                batch_size=bs, drop_last=True, shuffle=True)
dl_valid = DataLoader( torchvision.datasets.MNIST('/data/mnist', download=True, train=False, transform=t), 
                batch_size=bs, drop_last=True, shuffle=True)

It seems to me the pipeline should be: load data, split, then transform. In particular, in your case you've hardcoded the inputs to Normalize. In general, these should be determined only from the training dataset, but with PyTorch the transforms always seem to be applied to the full dataset.
From the data you have, you ideally create train, validation and test datasets (TRAVALTES): train for training, validation to check whether you are overfitting/underfitting. You calculate an accuracy score, or some other score (F1, ...), to get some clues, and ideally create a confusion matrix if you have a classification problem. So this post of mine sucks. I will improve it later today.
Yeah, my comment is more about how most of the canonical PyTorch examples seem to hard-code the mean/std of the features as an input to the transform, usually with pre-split test/validation data. This seems a little circular, as in reality you'd want to split the data and compute the transformer parameters from the train set, then apply them to the validation (and/or test) set. But the Dataset/transform design doesn't make this as easy as, say, sklearn (see the sketch after these comments). Sometimes I wonder if the scaling should be performed by an nn layer and hence be learnable parameters, but I guess that could affect convergence.
I updated the article. Most of the time, set the mean to 0 and the std to 1 if you train from scratch. For pretrained models, just follow the normalization parameters provided with the model. You use the same normalization transform for the train and test sets (transforms.Normalize). @DavidWaterworth, yeah, I know some practitioners use a BN layer at the very start of the model to do normalization.
With mean=(0.0,), std=(1.0,) as in the example above, I am getting 99.3% validation accuracy with a simple hand-made ResNet. It would be the same for mean=(0.5,), std=(0.5,), as we often see in examples.
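As raised in the comments above, here is a minimal sketch of computing the normalization statistics from the training split only and reusing them for the validation transform (single-channel MNIST assumed; loading every image into memory is fine for MNIST but not for large datasets):

# Load the training split once without normalization, just to measure it.
raw_train = torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                                       transform=transforms.ToTensor())
pixels = torch.stack([raw_train[i][0] for i in range(len(raw_train))])   # shape (60000, 1, 28, 28)
mean, std = pixels.mean().item(), pixels.std().item()

# Reuse the train-set statistics for both the train and validation transforms.
t = transforms.Compose([transforms.ToTensor(),
                        transforms.Normalize(mean=(mean,), std=(std,))])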
eyal RnD

In case you want up to X samples per class in the train dataset you can use this code:

import collections
from torch.utils.data import Dataset, Subset

def stratify_split(dataset: Dataset, train_samples_per_class: int):
    train_indices = []
    val_indices = []
    target_counter = collections.Counter()
    # Assumes each dataset item is a dict with a 'target' key holding the class label.
    for idx, data in enumerate(dataset):
        target = data['target']
        target_counter[target] += 1
        # The first `train_samples_per_class` samples of each class go to train,
        # the rest go to validation.
        if target_counter[target] <= train_samples_per_class:
            train_indices.append(idx)
        else:
            val_indices.append(idx)
    train_dataset = Subset(dataset, train_indices)
    val_dataset = Subset(dataset, val_indices)
    return train_dataset, val_dataset
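Hypothetical usage (my_dataset is a placeholder; each item is assumed to be a dict with a 'target' key):

from torch.utils.data import DataLoader

train_ds, val_ds = stratify_split(my_dataset, train_samples_per_class=10)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)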