Lightning Module¶
Defining the Lightning module is now straightforward (see also the documentation). The default hyperparameter choices were motivated by this paper.
Further references for PyTorch Lightning, including its use for multi-GPU training and hyperparameter search, can be found in the blog posts by William Falcon.
Hyperparameter Search Argument Parser¶
Next we define the HyperOptArgumentParser, including options for distributed training (see also the documentation) and debugging.
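The actual parser is implemented in emotion_transformer.lightning; the following is only a minimal sketch (an assumption about its contents) of how such a get_args function could combine plain arguments with tunable ones via test-tube's HyperOptArgumentParser. The flags --mode, --gpus and --fast_dev_run mirror the hparams attributes used later in this notebook, while the learning-rate option is purely illustrative.
from test_tube import HyperOptArgumentParser

def get_args(model_class):
    # minimal sketch, not the actual get_args from emotion_transformer.lightning
    parser = HyperOptArgumentParser(strategy='random_search')
    # plain options (names mirror the hparams attributes used in this notebook)
    parser.add_argument('--mode', default='default', type=str,
                        help="one of 'default', 'test' or 'hparams_search'")
    parser.add_argument('--gpus', default=None, type=str,
                        help="space-separated GPU ids, e.g. '0 1'")
    parser.add_argument('--fast_dev_run', action='store_true',
                        help='run only a single batch for debugging')
    # an illustrative tunable hyperparameter for the search
    parser.opt_list('--lr', default=2e-5, type=float,
                    options=[1e-5, 2e-5, 5e-5], tunable=True)
    # model_class could additionally register model-specific arguments here
    return parser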
Let us take a look at the different attributes of hparams.
hparams = get_args(EmotionModel)       # build the HyperOptArgumentParser
hparams = hparams.parse_args(args=[])  # parse with defaults (no CLI arguments inside the notebook)
vars(hparams)                          # show all hyperparameters as a dictionary
Trainer¶
Next we define a function calling the Lightning trainer using the settings specified in hparams.
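The implementation of main also lives in emotion_transformer.lightning; the following is only a rough sketch of such a function, assuming an older pytorch-lightning Trainer API and that hparams carries the gpus and fast_dev_run attributes from the parser above.
import pytorch_lightning as pl
from emotion_transformer.lightning import EmotionModel

def main(hparams):
    # rough sketch, not the actual main from emotion_transformer.lightning
    model = EmotionModel(hparams)
    # hparams.gpus is assumed to be a space-separated string such as '0 1'
    gpus = [int(i) for i in hparams.gpus.split(' ')] if hparams.gpus else None
    trainer = pl.Trainer(gpus=gpus,
                         distributed_backend='dp' if gpus and len(gpus) > 1 else None,
                         fast_dev_run=hparams.fast_dev_run)
    trainer.fit(model)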
Let us check the model by running a quick development run.
hparams.fast_dev_run = True  # run only a single batch through training and validation
main(hparams)
We also create a Python file for automatic hyperparameter optimization across multiple GPUs or CPUs:
%%writefile main.py
from emotion_transformer.lightning import EmotionModel, get_args, main

if __name__ == '__main__':
    hparams = get_args(EmotionModel)
    hparams = hparams.parse_args()

    if hparams.mode in ['test', 'default']:
        main(hparams)
    elif hparams.mode == 'hparams_search':
        if hparams.gpus:
            hparams.optimize_parallel_gpu(main, max_nb_trials=20,
                                          gpu_ids=hparams.gpus.split(' '))
        else:
            hparams.optimize_parallel_cpu(main, nb_trials=20, nb_workers=4)
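Assuming get_args exposes the mode and gpus attributes as command-line flags of the same name, the script can then be launched from a terminal, e.g. as python main.py --mode hparams_search, together with a --gpus argument for GPU training.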
Background Information¶
For the interested reader we provide some background information on the (distributed) training loop:
- one epoch consists of m = ceil(30160/batchsize) batches for training and an additional n = ceil(2755/batchsize) batches for validation.
- dp case:
  - the batch of size batchsize is split and each GPU receives (up to rounding) a batch of size batchsize/num_gpus
  - in the validation steps each GPU computes its own scores for each of the n batches (of size batchsize/num_gpus), i.e. each GPU calls the validation_step method
  - the output which is passed to the validation_end method consists of a list of dictionaries containing the concatenated scores from the different GPUs (see also the aggregation sketch after this list), i.e.

    output = [ {first_metric: [first_gpu_batch_1, ..., last_gpu_batch_1], ...,
                last_metric: [first_gpu_batch_1, ..., last_gpu_batch_1]}, ...,
               {first_metric: [first_gpu_batch_n, ..., last_gpu_batch_n], ...,
                last_metric: [first_gpu_batch_n, ..., last_gpu_batch_n]} ]
- ddp case (does not work from Jupyter notebooks):
  - each GPU receives its own (disjoint) samples of size batchsize and trains in its own process, but the processes communicate and average their gradients (thus the resulting models on each GPU have the same weights)
  - each GPU calls its own validation_end method with its own list of dictionaries, i.e.

    output_first_gpu = [ {first_metric: batch_1, ..., last_metric: batch_1}, ...,
                         {first_metric: batch_n, ..., last_metric: batch_n} ]

    output_last_gpu = [ {first_metric: batch_1, ..., last_metric: batch_1}, ...,
                        {first_metric: batch_n, ..., last_metric: batch_n} ]
- ddp2 case (does not work from Jupyter notebooks):
  - on each node we have the dp case, but the nodes communicate analogously to the ddp case
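To make the dp output structure above more concrete, the following hypothetical helper (not the validation_end method of EmotionModel) shows one way to reduce such a list of dictionaries to a single value per metric, first averaging over the per-GPU entries of each batch and then over the n validation batches.
import torch

def aggregate_dp_output(outputs):
    # hypothetical illustration, not the actual validation_end of EmotionModel;
    # outputs: one dict per validation batch, where in the dp case each value
    # holds the scores of all GPUs for that batch
    return {metric: torch.stack([torch.as_tensor(batch[metric]).float().mean()
                                 for batch in outputs]).mean()
            for metric in outputs[0]}
Applied to the output list sketched above, aggregate_dp_output(output) would return one averaged score per metric.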