Trade-off: memory versus performance
Augmenting the data over the full dataset at once, which requires materializing the three-index tensor $\theta_{ik}\phi_{kj}$ (one entry for every combination of $i$, $k$, and $j$), is 5-6x faster than doing the augmentation sample by sample. However, it requires an exceedingly large amount of memory. In contrast, using `augment_reduce` to augment sample by sample requires essentially no extra memory. Let us find a middle ground: process the data in batches to regain some of the speed, while avoiding large swaths of memory. To split the dataset into equally sized batches, we will probably need to pad it.
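As a minimal sketch of this batched middle ground, assume a Poisson-factorization-style augmentation in which each count $y_{ij}$ is allocated across $K$ components with probabilities proportional to $\theta_{ik}\phi_{kj}$; the function name `augment_batched` and the arguments `y`, `theta`, and `phi` are illustrative, not the actual `augment_reduce` API. Each batch materializes only a $(B, K, J)$ slice of the tensor, and the rows are zero-padded so every batch has an identical shape:

```python
import numpy as np

def augment_batched(y, theta, phi, batch_size, seed=None):
    """Batched multinomial augmentation (illustrative sketch).

    y     : (I, J) integer count matrix
    theta : (I, K) row factors
    phi   : (K, J) column factors
    """
    rng = np.random.default_rng(seed)
    I, J = y.shape
    K = theta.shape[1]

    # Pad the row dimension to a multiple of batch_size so every batch
    # has the same shape. Padded rows carry zero counts, so they draw
    # zero latent counts and do not affect the sufficient statistics;
    # padded theta rows are set to 1 just to keep the probabilities
    # well defined.
    n_pad = (-I) % batch_size
    y_pad = np.vstack([y, np.zeros((n_pad, J), dtype=y.dtype)])
    theta_pad = np.vstack([theta, np.ones((n_pad, K))])

    z_theta = np.zeros((I + n_pad, K))  # per-row sufficient statistics
    z_phi = np.zeros((K, J))            # per-column sufficient statistics

    for start in range(0, I + n_pad, batch_size):
        sl = slice(start, start + batch_size)
        # Only a (B, K, J) slice of the full (I, K, J) tensor is alive.
        rate = theta_pad[sl, :, None] * phi[None, :, :]
        p = rate / rate.sum(axis=1, keepdims=True)   # normalize over k
        # Allocate each y_ij across the K components; Generator.multinomial
        # broadcasts over leading axes when categories sit on the last axis.
        z = rng.multinomial(y_pad[sl].astype(np.int64),
                            np.moveaxis(p, 1, 2))    # (B, J, K)
        z_theta[sl] += z.sum(axis=1)                 # sum over j
        z_phi += z.sum(axis=0).T                     # sum over rows
    return z_theta[:I], z_phi
```

The batch size is the knob that trades memory for speed: a larger `batch_size` amortizes more of the per-batch overhead but grows the live $(B, K, J)$ tensor proportionally.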