Pytorch 数据加载器源码分析

Learning Notes

Pytorch

Publish Date: 2020-02-28

Update Date: 2022-09-21

Word Count: 1.2k

Read Times: 5 Min

最近在优化程序的代码，于是借此机会加深对框架底层代码的认识。

1DataLoader介绍

1.1定义

官方解释：“数据加载由数据集和采样器组成，基于python的单、多进程的iterators来处理数据。”

DataLoader()是Pytorch中数据读取的一个重要接口。该接口在dataloader.py文件中，通常用Pytorch来训练模型基本都会用到，像tensorflow、mxnet等框架类似。该接口的目的：将自定义的dataset根据batch size、是否shuffle等封装成一个batch sieze的Tensor用于训练模型。通俗地讲：数据喂给模型的接口，我们设置好如何喂、喂多少等，批量封装成一个用于训练的特有东西。

1.2 数据加载流程

创建一个Dataset对象
创建一个DataLoader对象
循环这个DataLoader对象，将img, lable加载到模型中训练

dataset = MyDataset()
dataloader = DataLoader(dataset)
for epoch in range(num_epochs):
  for img, label in dataloader:
    ......

2源码

class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: 1).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: False).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with batch_size, shuffle,
            sampler, and drop_last.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: 0)
        collate_fn (callable, optional): merges a list of samples to form a mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy tensors
            into CUDA pinned memory before returning them.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: False)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: 0)
        worker_init_fn (callable, optional): If not None, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: None)

    .. note:: By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long generated
              by main process using its RNG. However, seeds for other libraies
              may be duplicated upon initializing workers (w.g., NumPy), causing
              each worker to return identical random numbers. (See
              :ref:`dataloader-workers-random-seed` section in FAQ.) You may
              use ``torch.initial_seed()`` to access the PyTorch seed for each
              worker in :attr:`worker_init_fn`, and use it to set other seeds
              before data loading.

    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                 unpicklable object, e.g., a lambda function.
    """

    __initialized = False

    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None,
                 num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False,
                 timeout=0, worker_init_fn=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last
        self.timeout = timeout
        self.worker_init_fn = worker_init_fn

        if timeout < 0:
            raise ValueError('timeout option should be non-negative')

        if batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            self.batch_size = None
            self.drop_last = None

        if sampler is not None and shuffle:
            raise ValueError('sampler option is mutually exclusive with '
                             'shuffle')

        if self.num_workers < 0:
            raise ValueError('num_workers option cannot be negative; '
                             'use num_workers=0 to disable multiprocessing.')

        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    sampler = RandomSampler(dataset)  //将list打乱
                else:
                    sampler = SequentialSampler(dataset)
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)

        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.__initialized = True

    def __setattr__(self, attr, val):
        if self.__initialized and attr in ('batch_size', 'sampler', 'drop_last'):
            raise ValueError('{} attribute should not be set after {} is '
                             'initialized'.format(attr, self.__class__.__name__))

        super(DataLoader, self).__setattr__(attr, val)

    def __iter__(self):
        return _DataLoaderIter(self)

    def __len__(self):
        return len(self.batch_sampler)

Arguments:
    dataset(Dataset): 传入的数据集
    batch_size(int, optional): 每个batch有多少个样本
    shuffle(bool, optional): 在每个epoch开始的时候，对数据进行重新排序
    sampler(Sampler, optional): 自定义从数据集中取样本的策略，如果指定这个参数，那么shuffle必须为False
    batch_sampler(Sampler, optional): 与sampler类似，但是一次只返回一个batch的indices（索引），需要注意的是，一旦指定了这个参数，那么batch_size,shuffle,sampler,drop_last就不能再制定了（互斥——Mutually exclusive）
    num_workers (int, optional): 这个参数决定了有几个进程来处理data loading。0意味着所有的数据都会被load进主进程。（默认为0）
    collate_fn (callable, optional): 将一个list的sample组成一个mini-batch的函数
    pin_memory (bool, optional)： 如果设置为True，那么data loader将会在返回它们之前，将tensors拷贝到CUDA中的固定内存（CUDA pinned memory）中.
    drop_last (bool, optional): 如果设置为True：这个是对最后的未完成的batch来说的，比如你的batch_size设置为64，而一个epoch只有100个样本，那么训练的时候后面的36个就被扔掉了…如果为False（默认），那么会继续正常执行，只是最后的batch_size会小一点。
    timeout(numeric, optional): 如果是正数，表明等待从worker进程中收集一个batch等待的时间，若超出设定的时间还没有收集到，那就不收集这个内容了。这个numeric应总是大于等于0。默认为0
    worker_init_fn (callable, optional): 每个worker初始化函数,默认为None。

Jeremy Zhang

http://sqzhang-jeremy.github.io/posts/2004.html

Leave the message below or email me sqzhang.jeremy@gmail.com when you have any problem. I would be happy if you repost or make suggestions for this article.

Learning Notes

现代医学和公共卫生对人类的影响

整理笔记、谈谈自己的看法和顺便练习Markdown语法

2020-03-04 Lecture

Talk

Tag and Category的区别

标签和分类的区别？标签最显著的作用：一是传统意义上分类的作用，类似分类名称；二是对文章内容进行一定程度的描述，类似于关键词。标签和分类还有一些细微的区别：同一篇文章标签可以用多个，但通常只能属于一个分类；标签一般是在写作完

2020-02-21 blog

hexo

Pytorch 数据加载器源码分析

1DataLoader介绍

1.1定义

1.2 数据加载流程

2源码

你的赏识是我前进的动力