As the computation and memory requirements of deep neural network models grow, multi-GPU and even multi-node training and inference have become essential. Modern deep learning frameworks such as PyTorch and TensorFlow support multi-node execution of deep neural network models. However, using this support still requires sophisticated knowledge of the underlying system architecture, which hinders users from adapting their code to run on a cluster system. To solve this problem, we develop SnuDL, a deep learning framework that presents the cluster to the user as a single GPU image. SnuDL works transparently underneath existing deep learning frameworks, so users can run their single-GPU deep learning programs on a heterogeneous cluster system without any source code modification.
Recent advances in deep learning suggest that the predictive power of models benefits from their enormous number of parameters. However, deploying these large models on devices with limited computation and memory resources is a challenge. Quantization is a practical way to reduce model size and speed up inference. Here, we study the quantization of deep learning models using integer-based representations. We focus on improving the generalization ability of the models by introducing stochasticity into the quantization scale and by adaptively using a mixture of precisions during training. Our techniques aim to improve model accuracy at inference time.
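As a rough illustration of the idea above, the following minimal sketch shows symmetric integer quantization with a randomly perturbed scale. It is not the method used in this project. The function names, the multiplicative uniform noise model, the `scale_jitter` parameter, and the random choice between 4 and 8 bits (a stand-in for an adaptive precision policy) are all assumptions made for illustration.

```python
import random

def quantize(values, num_bits=8, scale_jitter=0.0, rng=None):
    """Symmetric integer quantization with an optionally perturbed scale.

    scale_jitter is a hypothetical knob: the scale is multiplied by a
    factor drawn uniformly from [1 - scale_jitter, 1 + scale_jitter].
    """
    rng = rng or random.Random()
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    scale *= 1.0 + rng.uniform(-scale_jitter, scale_jitter)
    # Round to the nearest integer level and clip to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [x * scale for x in q]

weights = [0.42, -1.30, 0.07, 0.95]
rng = random.Random(0)

# Training-time pass: random precision (assumed stand-in for an adaptive
# policy) and a jittered scale.
train_bits = rng.choice([4, 8])
q_train, s_train = quantize(weights, num_bits=train_bits, scale_jitter=0.1, rng=rng)

# Inference-time pass: fixed 8-bit precision, deterministic scale.
q_infer, s_infer = quantize(weights, num_bits=8)
approx = dequantize(q_infer, s_infer)
```

In this sketch the noise is applied only during the training-time pass, so the inference path stays deterministic.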
Conventional deep learning (DL) frameworks, such as TensorFlow and PyTorch, rely heavily on libraries with fixed implementations (e.g., NVIDIA cuDNN). Thus, it is hard for them to handle diverse DNN models and GPU architectures. We propose a deep learning optimization framework for diverse GPU workloads. Unlike conventional DL frameworks, it generates GPU kernels optimized for the target DL workload and the target hardware architecture. We implement this idea using CUDA and OpenCL. Our deep learning optimization framework, DeepCuts, achieves state-of-the-art performance compared to other DL optimization frameworks, such as TVM, TensorRT, and TensorFlow XLA.
We are working on a framework called SnuRHAC that provides an illusion of a single GPU for multiple GPUs in a cluster. SnuRHAC automatically distributes workload and manages data across the nodes. We propose several optimization techniques such as prefetching to maximize the performance. SnuRHAC aims to achieve both ease-of-programming and performance for GPU programming.
We are developing a deep-learning-based tokenizer for Korean natural language processing (NLP). Tokenization is the process of breaking text down into tokens that computer programs or NLP models can process. Korean tokenization involves both decomposing words into morphemes ('형태소') and assigning part-of-speech ('품사') tags to the decomposed morphemes. We are using large-scale Transformer-based models, such as GPT-2, for this task. Our goal is to develop an accurate and efficient tokenizer suited to Korean NLP tasks.
Recently, huge transformer models with more than 100 billion parameters (e.g., GPT-3) have been developed at a rapid pace. However, training these models is not easy because conventional DL frameworks do not support out-of-GPU-memory (OoGM) scale models well. We are working on a cluster-targeting DL framework, called SnuTRFM, that supports OoGM-scale model training. It utilizes storage systems, namely NVMe SSDs and HDDs, to break the GPU memory wall. It also uses a performance model to find the optimal parallelization strategy. We are training a GPT-3 model from scratch on a small-scale heterogeneous cluster that is specifically customized for GPT-3. We are also training SnuTRFM for the Korean language by gathering appropriate Korean training data sets. The tokenizer used in this process is SnuTOK.
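To give a sense of the scale of the GPU memory wall mentioned above, the following back-of-the-envelope sketch estimates the memory needed for model state alone. The figures are assumptions, not measurements from SnuTRFM: 175 billion parameters (the published size of the largest GPT-3 model), 16 bytes of state per parameter (a common estimate for mixed-precision training with Adam), and a hypothetical GPU with 80 GB of memory.

```python
def training_state_bytes(num_params, bytes_per_param=16):
    """Bytes needed for weights, gradients, and optimizer state.

    The default of 16 bytes per parameter is an assumed estimate for
    mixed-precision Adam: 2 (fp16 weights) + 2 (fp16 gradients)
    + 12 (fp32 master weights, momentum, and variance).
    """
    return num_params * bytes_per_param

NUM_PARAMS = 175_000_000_000        # assumed: GPT-3 175B
GPU_MEMORY_BYTES = 80 * 10**9       # assumed: a hypothetical 80 GB GPU

state_bytes = training_state_bytes(NUM_PARAMS)

# Ceiling division: GPUs needed just to hold the model state, ignoring
# activations, buffers, and fragmentation.
min_gpus = -(-state_bytes // GPU_MEMORY_BYTES)

print(f"model state: {state_bytes / 10**12:.1f} TB")
print(f"GPUs needed for state alone: {min_gpus}")
```

Under these assumptions the model state does not fit in the memory of a small number of GPUs, which is the kind of gap that offloading to storage aims to close.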
As energy efficiency becomes one of the most important issues, FPGAs are emerging as promising accelerators for HPC systems. However, FPGAs are still programmed primarily with low-level hardware description languages and compilation flows built on dated vendor-provided tools, which require considerable expertise to use. This programming wall prevents FPGAs from being widely adopted as accelerators. To overcome this obstacle, we are constructing a full-stack framework for FPGAs. As the first step, we presented SOFF, a high-level synthesis framework that targets FPGAs from OpenCL/CUDA. Our framework achieves high performance without requiring any explicit user annotations. We are also developing a Neural Processing Unit (NPU) and a deep learning framework for FPGAs that automatically generates an optimal circuit for a given deep learning model. With direct communication technology between FPGAs, we plan to extend the framework to cluster systems.
Quantum computing is a candidate for the next generation of computing. We are developing an end-to-end software stack for quantum computing, including a compiler, a runtime system, and a quantum circuit simulator that runs on classical hardware. Currently, we focus on high-performance and scalable simulators for quantum circuit simulation and tensor network contraction. We successfully simulated the 42-qubit Quantum Supremacy circuit using a workstation-level computer system equipped with NVMe SSDs and HDDs. The current limit of workstation-level full-state quantum circuit simulation is around 34 qubits. Because each additional qubit doubles the size of the state vector, our simulator handles a state 256 (2^8) times larger than conventional full-state quantum circuit simulators. We are also building a small-scale cluster specialized with NVMe SSDs and HDDs to simulate supercomputer-level quantum circuits with our full-state simulation method.
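The factor of 256 follows directly from the size of a full state vector, which holds 2^n complex amplitudes for n qubits. The short sketch below works through the arithmetic. The function name is hypothetical, and the 8 bytes per amplitude (single-precision complex) is an assumption, since the text does not state the precision used.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector of 2**num_qubits amplitudes.

    8 bytes per amplitude assumes single-precision complex numbers;
    double precision would need 16 bytes per amplitude.
    """
    return (2 ** num_qubits) * bytes_per_amplitude

GIB = 2 ** 30
TIB = 2 ** 40

conventional = state_vector_bytes(34)   # workstation-level limit in the text
extended = state_vector_bytes(42)       # 42-qubit circuit in the text

# The ratio does not depend on the assumed amplitude precision.
ratio = extended // conventional

print(f"34 qubits: {conventional // GIB} GiB")
print(f"42 qubits: {extended // TIB} TiB")
print(f"ratio: {ratio}")
```

Under the single-precision assumption, the 42-qubit state is far larger than the main memory of a typical workstation, which is consistent with the use of NVMe SSDs and HDDs described above.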
An under-display camera (UDC) is a new imaging system that mounts the camera under the display. This allows a larger screen-to-body ratio and better eye tracking; however, the image quality of the camera is seriously degraded by reduced light transmittance and diffraction effects. Existing image restoration models cannot be applied to real-world applications on mobile devices because their restoration speed and quality are insufficient. We are developing a real-time image restoration DNN model that overcomes the limitations of mobile architectures, using methods such as model quantization and vectorization.
The latest mobile devices are equipped with high-performance cameras, increasing the workload involved in image processing. However, the digital signal processors (DSPs) mounted on mobile devices can only be used to accelerate specific image processing algorithms. To run general image processing algorithms and improve image processing performance on a variety of systems, it is essential to develop technologies that utilize the accelerators, especially GPUs, residing in a mobile platform. We use OpenCL, a standard parallel programming model, to accelerate image processing, and we analyze mobile architectures and application characteristics to apply various optimization techniques.
Enabled by massively parallel computing, deep learning has produced state-of-the-art results in many important tasks, including natural language processing, object recognition, and speech recognition. However, existing DNNs with millions of parameters are computationally expensive and prone to overfitting the training data. We develop new regularization methods and metrics to detect overfitting in large-scale DNNs. We also investigate how information can flow properly through various networks without co-adaptation of feature detectors.
In semiconductor manufacturing, wafer map analysis is essential to the yield improvement process. However, clustering wafer maps is becoming more difficult as the maps form more complex patterns and the data become higher-dimensional. Various machine learning methods have been proposed for industrial applications, but most of them are limited to supervised training with an available set of clean labels. To address these issues, we employ modern unsupervised image clustering models, especially convolutional autoencoders, deep clustering models, and vision transformers, for unsupervised wafer map clustering.