When I first started using PyTorch to implement recurrent neural networks (RNN), I ran into a small issue when trying to use DataLoader in conjunction with variable-length sequences. What I specifically wanted was to automate the process of distributing training data among multiple graphics cards. Even though there are numerous examples online showing how to do the actual padding, I couldn’t find any concrete example of using DataLoader in conjunction with padding, and my months-old question on their forum is still unanswered!
The standard way of working with inputs of variable lengths is to pad all the sequences with zeros so that their lengths equal the length of the longest sequence. This padding is done with the pad_sequence function. PyTorch’s RNN modules (LSTM, GRU, etc.) are capable of working with such padded sequences (once packed, as shown later) and intelligently ignore the zero padding.
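Here is a minimal sketch of what pad_sequence does, using a few made-up 1-D sequences:

import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.tensor([1, 2, 3, 4, 5])
b = torch.tensor([6, 7, 8])
c = torch.tensor([9, 10])

# All sequences are padded with zeros to the length of the longest one (5)
padded = pad_sequence([a, b, c], batch_first=True)
print(padded)
# tensor([[ 1,  2,  3,  4,  5],
#         [ 6,  7,  8,  0,  0],
#         [ 9, 10,  0,  0,  0]])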
If the goal is to train with mini-batches, one needs to pad the sequences in each batch. In other words, given a mini-batch of size N in which the longest sequence has length L, every shorter sequence must be padded with zeros so that its length becomes L. Moreover, it is important that the sequences in the batch are in descending order of length.
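To make this concrete, here is a small sketch, with randomly generated sequences and made-up sizes, of sorting a mini-batch by length and padding it to the longest length:

import torch
from torch.nn.utils.rnn import pad_sequence

# A mini-batch of N = 3 sequences, each with 4 features per time step
batch = [torch.randn(2, 4), torch.randn(5, 4), torch.randn(3, 4)]

# Sort by sequence length, longest first
batch = sorted(batch, key=lambda t: t.shape[0], reverse=True)
lengths = torch.LongTensor([t.shape[0] for t in batch])   # tensor([5, 3, 2])

# Pad everything to L = 5; the result has shape (N, L, 4)
padded = pad_sequence(batch, batch_first=True)
print(padded.shape)   # torch.Size([3, 5, 4])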
To do proper padding with DataLoader, we can use the collate_fn argument to specify a class that performs the collation operation, which in our case is zero padding. The following is a minimal example of a collation class that does the padding we need:
import torch

class PadSequence:
    def __call__(self, batch):
        # Let's assume that each element in "batch" is a tuple (data, label).
        # Sort the batch in descending order of sequence length
        sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
        # Get each sequence and pad it
        sequences = [x[0] for x in sorted_batch]
        sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
        # Also need to store the length of each sequence
        # This is later needed in order to unpad the sequences
        lengths = torch.LongTensor([len(x) for x in sequences])
        # Don't forget to grab the labels of the *sorted* batch
        labels = torch.LongTensor([x[1] for x in sorted_batch])
        return sequences_padded, lengths, labels
Note the importance of batch_first=True in my code above. By default, DataLoader assumes that the first dimension of the data is the batch dimension, whereas PyTorch’s RNN modules, by default, put the batch in the second dimension (which I absolutely hate). Fortunately, this behavior can be changed for both the RNN modules and the DataLoader, and I personally always prefer to have the batch be the first dimension of the data.
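As a quick shape check, here is the difference between the two conventions (the layer sizes are arbitrary):

import torch
import torch.nn as nn

# Default: input shape is (seq_len, batch, input_size)
gru_default = nn.GRU(input_size=10, hidden_size=20)
out, _ = gru_default(torch.randn(5, 3, 10))   # 5 time steps, batch of 3
print(out.shape)   # torch.Size([5, 3, 20])

# With batch_first=True: input shape is (batch, seq_len, input_size)
gru_bf = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
out, _ = gru_bf(torch.randn(3, 5, 10))        # batch of 3, 5 time steps
print(out.shape)   # torch.Size([3, 5, 20])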
With the code above, the DataLoader instance is created as follows:
torch.utils.data.DataLoader(dataset=dataset, ... more arguments ..., collate_fn=PadSequence())
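For a complete, self-contained sketch, here is how the collator might be wired to a toy dataset (ToySequenceDataset and its sizes are assumptions for illustration, not part of the original post; PadSequence is the collation class defined above):

import torch
from torch.utils.data import Dataset, DataLoader

class ToySequenceDataset(Dataset):
    """A made-up dataset of variable-length sequences with 10 features per step."""
    def __init__(self):
        self.samples = [(torch.randn(torch.randint(2, 8, (1,)).item(), 10), i % 2)
                        for i in range(16)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

loader = DataLoader(ToySequenceDataset(), batch_size=4, shuffle=True,
                    collate_fn=PadSequence())

for sequences_padded, lengths, labels in loader:
    # sequences_padded: (4, longest length in this batch, 10); lengths sorted descending
    print(sequences_padded.shape, lengths, labels)
    break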
The last remaining step here is to pass each batch to the RNN module during training/inference. This can be done by using the pack_padded_sequence function as follows:
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence as PACK

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Note that "batch_first" is set to "True"
        self.gru = nn.GRU(10, 20, 2, batch_first=True)

    def forward(self, batch):
        x, x_lengths, _ = batch
        x_pack = PACK(x, x_lengths, batch_first=True)
        output, hidden = self.gru(x_pack)
        return output, hidden
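Note that the output returned by the GRU here is a PackedSequence; downstream layers that expect a regular tensor need it unpacked with pad_packed_sequence. Here is a small, self-contained sketch of the round trip (the sizes are made up for illustration):

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = nn.GRU(10, 20, 2, batch_first=True)
x = torch.randn(3, 5, 10)                  # padded batch: 3 sequences, max length 5, 10 features
x_lengths = torch.LongTensor([5, 3, 2])    # true lengths, in descending order
x_pack = pack_padded_sequence(x, x_lengths, batch_first=True)
packed_output, hidden = gru(x_pack)
# pad_packed_sequence restores a regular zero-padded tensor plus the original lengths
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape)     # torch.Size([3, 5, 20])
print(output_lengths)   # tensor([5, 3, 2])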
7 comments
Hello, I’m getting an error like this:
Traceback (most recent call last):
  File "/home/sanjay/anaconda3/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/sanjay/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'PadSequence.__call__..'
Thanks a lot for the help, Mehran!
Author
Happy to help!
Thank you very much, it really helps.
Would you mind providing an unpadding example as well?
def forward(self, inputs, input_lengths, state):
# inputs is of shape batch_size, num_steps(sequence length which is the length of
# longest text sequence). Each row of inputs is 1d LongTensor array of length
# num_steps containing word index. Using the embedding layer we want to convert
# each word index to its corresponding word vector of dimension emb_dim
batch_size = inputs.size(0)
num_steps = inputs.size(1)
# embeds is of shape batch_size * num_steps * emb_dim and is the input to lstm layer
embeds = self.emb_layer(inputs)
# pack_padded_sequence before feeding into LSTM. This is required so pytorch knows
# which elements of the sequence are padded ones and ignore them in computation.
# This step is done only after the embedding step
embeds_pack = pack_padded_sequence(embeds, input_lengths, batch_first=True)
# lstm_out is of shape batch_size * num_steps * hidden_size and contains the output
# features (h_t) from the last layer of LSTM for each t
# h_n is of shape num_layers * batch_size * hidden_size and contains the final hidden
# state for each element in the batch i.e. hidden state at t_end
# same for c_n as h_n except that it is the final cell state
lstm_out_pack, (h_n, c_n) = self.lstm_layer(embeds_pack)
# unpack the output
lstm_out, lstm_out_len = pad_packed_sequence(lstm_out_pack, batch_first=True)
# tensor flattening works only if tensor is contiguous
# https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107/2
# flatten lstm_out from 3d to 2d with shape (batch_size * num_steps, hidden_dim)
lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
# regularize lstm output by applying dropout
out = self.dropout(lstm_out)
# The output of the LSTM (after dropout) has shape (batch_size * num_steps, hidden_dim).
# It is then fed to the output fully connected linear layer, which produces the
# prediction with shape (batch_size * num_steps, output_dim).
output = self.linear(out)
# reshape output to batch_size, num_steps, output_dim
output = output.view(batch_size, -1, self.output_dim)
# reshape output again to batch_size, output_dim. The last element of middle dimension
# i.e. num_steps is taken i.e. for each item in the batch the output is the hidden state
# from the last layer of LSTM for t = t_end
output = output[:, -1, :]
output = self.act(output)
return output, (h_n, c_n)
Thank you. But I am getting this error: TypeError: expected Tensor as element 0 in argument 0, but got csr_matrix
Can you please help with this?
Hi, I have a multivariate time-series dataset with different sequence lengths, and I am trying to train my model for a conditioned autoregressive problem. I used a method similar to yours, but the results don’t look good. My guess is that my dataset is not normalized. Do you have any suggestions for how to normalize this type of data with variable lengths?