Gao Yanjie
Senior Researcher, Microsoft Research Asia
Yanjie Gao is a senior researcher at Microsoft Research Asia. His research interests include deep learning tools and platforms and big data systems. He has published several papers at top software engineering and systems conferences, including ICSE, ESEC/FSE, SoCC, and CLUSTER. He has also written several technical books and contributed to open-source systems and educational communities such as OpenPAI and AI-System.
Topic
Deep Learning Job GPU Utilization Analysis and Enhancement
Deep learning plays a key role in numerous intelligent software applications. Enterprise developers submit and run deep learning jobs on shared, multi-tenant GPU platforms to train and test models efficiently. However, some jobs make poor use of their allocated GPUs, which wastes significant resources and reduces development efficiency. We conduct a comprehensive empirical study of the low GPU utilization problem using real jobs collected from deep learning platforms. We identify common root causes, propose corresponding non-invasive, generic fixes that improve performance, and design artificial intelligence models to model and predict GPU utilization.