Deep Learning (DL) has been widely applied in both academia and industry, and system innovations can continue to squeeze more efficiency out of modern hardware. Systems such as TensorFlow, MXNet, and PyTorch have emerged to help researchers train their models at large scale. However, obtaining performant execution for different DL jobs on heterogeneous hardware platforms is notoriously difficult. We found that current solutions scale poorly and run inefficiently when training neural networks on heterogeneous clusters because of stragglers and low resource utilization. Furthermore, existing strategies either require significant engineering effort to develop hardware-specific optimizations or deliver suboptimal parallelization performance. This thesis presents our efforts to build an efficient and scalable deep learning system for training DL jobs in heterogeneous environments. The goal is a parallel computing framework that provides (1) efficient parameter synchronization, (2) efficient resource management, and (3) scalable data and model parallelism in heterogeneous environments.
In this thesis, we implement robust synchronization, efficient resource provisioning, and asynchronous collective communication operators that extend popular learning frameworks to achieve efficient and scalable DL training. First, to avoid the "long-tail effect" of parallel tasks, we design a decentralized, relaxed, and randomized sampling approach that implements a partial AllReduce operation to synchronize DL models. Second, to improve GPU memory utilization, we implement an efficient GPU memory management scheme for training nonlinear DNNs that applies graph analysis and exploits layered dependency structures. Third, to train wider and deeper Deep Learning Recommendation Models (DLRMs) in heterogeneous environments, we propose an efficient collective communication operator that supports hybrid embedding table placement on heterogeneous resources, together with a finer-grained pipeline execution scheme that improves parallel training throughput by overlapping communication with computation. We implement the proposed methods in several open-source learning frameworks and evaluate them on physical clusters with a range of practical DL benchmarks.
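To give a flavor of the first contribution, the sketch below shows one simple way a randomized, partial AllReduce could look on top of torch.distributed: each step, every worker derives the same random subset of ranks from a shared seed and only that subset averages its gradients, so stragglers outside the subset do not block the step. This is a minimal illustration under assumptions of this example (the `fraction` parameter, the shared-seed trick, and building a fresh group per step), not the thesis' actual sampling scheme.

```python
import random
import torch
import torch.distributed as dist

def partial_allreduce(tensor, step, fraction=0.5, seed=0):
    """Average `tensor` over a randomly sampled subset of workers.

    Illustrative sketch only: `fraction` and the per-step shared seed
    are assumptions made for this example, and real systems would cache
    process groups rather than create one per step.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Every worker seeds with the step number, so all workers agree on
    # the sampled subset without a central coordinator.
    rng = random.Random(seed + step)
    k = max(2, int(world_size * fraction))
    ranks = sorted(rng.sample(range(world_size), k))

    # new_group() must be called collectively with identical ranks.
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        tensor /= k
    return tensor
```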
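Likewise, for the third contribution, the following sketch illustrates the general idea of overlapping communication with computation in DLRM training: the embedding all-to-all is launched asynchronously while the dense bottom MLP runs. The names (`bottom_mlp`, `local_embedding_output`) and shapes are assumptions for this example; the thesis' pipeline splits work at a finer granularity than shown here.

```python
import torch
import torch.distributed as dist

def overlapped_step(dense_features, local_embedding_output, bottom_mlp):
    """One forward step that overlaps the embedding exchange with the
    dense computation. Minimal sketch under assumed names and shapes."""
    # Launch the all-to-all that redistributes embedding lookups across
    # workers without blocking the host thread.
    exchanged = torch.empty_like(local_embedding_output)
    work = dist.all_to_all_single(exchanged, local_embedding_output,
                                  async_op=True)

    # While the exchange is in flight, run the dense bottom MLP.
    dense_out = bottom_mlp(dense_features)

    # Wait for the embeddings before the feature-interaction stage.
    work.wait()
    return dense_out, exchanged
```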