Data exploration and construction of the DataLoader for the SemEval-2019 Task 3 dataset (contextual emotion detection in text)

Data exploration

Let us first take a look at the data using Pandas. Note that we use the preprocessed, cleaned data from PhilippMaxx.

open_data[source]

open_data(path)

returns a Pandas DataFrame containing the SemEval data stored at path
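
The linked source is authoritative; a minimal sketch of what this helper might look like, assuming the cleaned file is tab-separated with columns id, turn1, turn2, turn3, and label, is:

import pandas as pd

def open_data(path):
    # sketch only: read the cleaned SemEval file (assumed tab-separated) and
    # use the id column as the DataFrame index
    return pd.read_csv(path, sep='\t', index_col='id')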

path = 'data/clean_train.txt'
df = open_data(path)
df.head()
turn1 turn2 turn3 label
id
0 do not worry i am girl hmm how do i know if you are what ' s ur name ? others
1 when did i ? saw many times i think shame no . i never saw you angry
2 by by google chrome where you live others
3 u r ridiculous i might be ridiculous but i am telling the tru... u little disgusting whore angry
4 just for time pass wt do u do 4 a living then maybe others
df.shape
(30160, 4)

As you can see, we are dealing with very informal language, typos, and bad grammar. Moreover, the dataset is unbalanced:

df['label'].value_counts()
others    14948
angry      5506
sad        5463
happy      4243
Name: label, dtype: int64

Data Transforms

We now tokenize the examples into features and create attention masks according to the (Distil)Bert standard. The following image from Jay Alammar's great blog post A Visual Guide to Using BERT for the First Time visualizes the tokenization step.

Tokenization explained visually
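
For a concrete feel of this step, here is how a single utterance is encoded (a usage example; the ids match the first turn of the first conversation shown further below):

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# encode() wraps the word-piece ids in the special [CLS] (101) and [SEP] (102) tokens
tokenizer.encode('do not worry i am girl')
[101, 2079, 2025, 4737, 1045, 2572, 2611, 102]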

transform_data[source]

transform_data(df, max_seq_len)

returns the padded input ids and attention masks according to the DistilBert tokenizer
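
Again, the linked source is authoritative; a rough sketch of the idea, assuming the same Hugging Face tokenizer (the packaged implementation may differ in details such as truncation handling), could be:

import torch
from transformers import DistilBertTokenizer

def transform_data(df, max_seq_len):
    # sketch only: tokenize each of the three turns, truncate and zero-pad to
    # max_seq_len, and build a 0/1 attention mask alongside the input ids
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    padded, attention_mask = [], []
    for _, row in df[['turn1', 'turn2', 'turn3']].iterrows():
        ids, masks = [], []
        for utterance in row:
            tokens = tokenizer.encode(utterance)[:max_seq_len]
            pad = max_seq_len - len(tokens)
            ids.append(tokens + [0] * pad)
            masks.append([1] * len(tokens) + [0] * pad)
        padded.append(ids)
        attention_mask.append(masks)
    return torch.tensor(padded), torch.tensor(attention_mask)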

max_seq_len = 10
padded, attention_mask = transform_data(df, max_seq_len)
assert padded.shape == attention_mask.shape == df[['turn1','turn2','turn3']].shape + (max_seq_len,)

Let's digest the output for the first two conversations.

df[:2]
turn1 turn2 turn3 label
id
0 do not worry i am girl hmm how do i know if you are what ' s ur name ? others
1 when did i ? saw many times i think shame no . i never saw you angry

The utterances are transformed according to the tokenizer vocabulary...

vocab = np.array(list(DistilBertTokenizer.from_pretrained('distilbert-base-uncased').vocab.items()))
vocab[padded[:2]].reshape(2,3,-1)
array([[['[CLS]', '101', 'do', '2079', 'not', '2025', 'worry', '4737',
         'i', '1045', 'am', '2572', 'girl', '2611', '[SEP]', '102',
         '[PAD]', '0', '[PAD]', '0'],
        ['[CLS]', '101', 'hmm', '17012', 'how', '2129', 'do', '2079',
         'i', '1045', 'know', '2113', 'if', '2065', 'you', '2017',
         'are', '2024', '[SEP]', '102'],
        ['[CLS]', '101', 'what', '2054', "'", '1005', 's', '1055', 'ur',
         '24471', 'name', '2171', '?', '1029', '[SEP]', '102', '[PAD]',
         '0', '[PAD]', '0']],

       [['[CLS]', '101', 'when', '2043', 'did', '2106', 'i', '1045',
         '?', '1029', '[SEP]', '102', '[PAD]', '0', '[PAD]', '0',
         '[PAD]', '0', '[PAD]', '0'],
        ['[CLS]', '101', 'saw', '2387', 'many', '2116', 'times', '2335',
         'i', '1045', 'think', '2228', 'shame', '9467', '[SEP]', '102',
         '[PAD]', '0', '[PAD]', '0'],
        ['[CLS]', '101', 'no', '2053', '.', '1012', 'i', '1045',
         'never', '2196', 'saw', '2387', 'you', '2017', '[SEP]', '102',
         '[PAD]', '0', '[PAD]', '0']]], dtype='<U18')

...resulting in the corresponding input ids (padded with zeros to max_seq_len)

padded[:2]
tensor([[[  101,  2079,  2025,  4737,  1045,  2572,  2611,   102,     0,     0],
         [  101, 17012,  2129,  2079,  1045,  2113,  2065,  2017,  2024,   102],
         [  101,  2054,  1005,  1055, 24471,  2171,  1029,   102,     0,     0]],

        [[  101,  2043,  2106,  1045,  1029,   102,     0,     0,     0,     0],
         [  101,  2387,  2116,  2335,  1045,  2228,  9467,   102,     0,     0],
         [  101,  2053,  1012,  1045,  2196,  2387,  2017,   102,     0,     0]]])

...and attention masks for the self-attention layers, marking real tokens with 1 and padding with 0:

attention_mask[:2]
tensor([[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]],

        [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]])

We also transform the labels to integers according to a given dictionary emo_dict.

get_labels[source]

get_labels(df, emo_dict)

returns the labels according to the emotion dictionary
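
A minimal sketch of such a mapping (assuming emo_dict covers every label occurring in the data) might be:

import torch

def get_labels(df, emo_dict):
    # sketch only: map each label string to its integer id and return a tensor
    return torch.tensor([emo_dict[label] for label in df['label']])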

emo_dict = {'others': 0, 'sad': 1, 'angry': 2, 'happy': 3}
labels = get_labels(df, emo_dict)

Let us look at the result for our two conversations from above:

labels[:2]
tensor([0, 2])


PyTorch DataLoader

Finally, we aggregate all of the functions above into a PyTorch DataLoader (with optional support for distributed training).

dataloader[source]

dataloader(path, max_seq_len, batch_size, emo_dict, use_ddp=False, labels=True)

Transforms the data stored in path into DistilBert features and returns it as a DataLoader
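
Conceptually, it could be assembled from the helpers above roughly as follows (the DistributedSampler used for use_ddp is an assumption about how distributed loading is handled):

from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def dataloader(path, max_seq_len, batch_size, emo_dict, use_ddp=False, labels=True):
    # sketch only: combine open_data, transform_data, and get_labels into a DataLoader
    df = open_data(path)
    padded, attention_mask = transform_data(df, max_seq_len)
    tensors = [padded, attention_mask]
    if labels:
        tensors.append(get_labels(df, emo_dict))
    dataset = TensorDataset(*tensors)
    # with distributed training, each process only sees its own shard of the data
    sampler = DistributedSampler(dataset) if use_ddp else None
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=(sampler is None), sampler=sampler)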

batch_size = 5
loader = dataloader(path, max_seq_len, batch_size, emo_dict)

A batch consists of batch_size examples of input ids, attention masks, and, optionally, the corresponding labels.

batch = next(iter(loader))
batch
[tensor([[[  101,  1045,  2572,  2643,   102,     0,     0,     0,     0,     0],
          [  101,  2025,  2074,  2023,  5304,  1012,  2017,  2024,  4242,   102],
          [  101,  2025,  2428,   102,     0,     0,     0,     0,     0,     0]],
 
         [[  101,  2748,  1012,   102,     0,     0,     0,     0,     0,     0],
          [  101,  1030,  2068,   102,     0,     0,     0,     0,     0,     0],
          [  101,  2425,  2033,  2242,  2008,  2097,  3037,  2033,  1012,   102]],
 
         [[  101,  2821,   102,     0,     0,     0,     0,     0,     0,     0],
          [  101,  1997, 17876,  5262,   999,  7653,  2227,   102,     0,     0],
          [  101,  2024,  2017,  2183,  2000,  3637,   102,     0,     0,     0]],
 
         [[  101,  2079,  2025,  2131,  2046,  1996,  4751,   102,     0,     0],
          [  101,  2023,  2003,  5024,  6040,  1012,   102,     0,     0,     0],
          [  101,  2748,  1045,  2064,  2025,  5454,   102,     0,     0,     0]],
 
         [[  101,  1045,  2031,  2070,  2147,  2000,  2079,   102,     0,     0],
          [  101,  2168,   102,     0,     0,     0,     0,     0,     0,     0],
          [  101,  2061,  2175,  2079,  2115,  2147,   102,     0,     0,     0]]]),
 tensor([[[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]],
 
         [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 
         [[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]],
 
         [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]],
 
         [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
          [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]]),
 tensor([0, 0, 0, 0, 0])]

We can, for instance, find the first conversation of this batch among the input ids computed above.

conversation = batch[0][0]
# check that this conversation appears somewhere among the padded input ids from above
assert torch.any(torch.all(torch.all(conversation == padded, dim=2), dim=1))