Abstract: Transformer networks have outperformed recurrent neural networks and convolutional neural networks in various sequential tasks. However, scaling transformer networks for long sequences has ...