Abstract
Optimizing scientific application performance in HPC environments is a complicated task that has motivated the development of many performance analysis tools over the past decades. These tools were designed to analyze the performance of a single parallel code using common approaches such as message passing (MPI), multithreading (OpenMP), acceleration (CUDA), or a hybrid of these. However, current trends in HPC, such as the push to exascale, convergence with Big Data, and the growing complexity of HPC applications and scientific workflows, have created gaps that these performance tools do not cover, particularly end-to-end data movement through a holistic HPC workflow comprising multiple codes, paradigms, or platforms.
To address this performance monitoring gap, we define a new metric called Workflow Critical Path (WCP), a data-oriented critical path metric for Holistic HPC Workflows. Using cloud-based technologies, we implement a prototype called Crux, a distributed analysis tool for calculating and visualizing WCP. Crux takes a novel, data-oriented approach by constructing program activity graphs (PAGs) using data states as vertices and data mutations as edges. Our experiments with a workflow simulator on Amazon Web Services show that Crux is scalable and capable of calculating WCP for common holistic HPC workflow patterns. We discuss how Crux and WCP could be used with production HPC applications.
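To make the data-oriented PAG idea concrete, the following is a minimal illustrative sketch (not the authors' Crux implementation): a tiny program activity graph with data states as vertices and data mutations as weighted edges, plus a longest-path computation that mirrors the intuition behind the WCP metric. The workflow stages and timings are hypothetical.

```python
from collections import defaultdict

def workflow_critical_path(edges):
    """edges: list of (src_state, dst_state, mutation_seconds).
    Returns (total_seconds, path) for the longest path in the DAG."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm produces a topological order of the DAG.
    order = []
    queue = [n for n in nodes if indegree[n] == 0]
    while queue:
        n = queue.pop()
        order.append(n)
        for v, _ in graph[n]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Relax edges in topological order to find the longest (critical) path.
    dist = {n: 0.0 for n in nodes}
    prev = {}
    for u in order:
        for v, w in graph[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u
    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    return dist[end], path[::-1]

# Hypothetical three-stage workflow: simulate -> analyze -> visualize,
# with a slower side branch through a checkpoint write.
edges = [
    ("raw", "sim_out", 40.0),       # simulation mutates raw input
    ("sim_out", "analysis", 15.0),  # analysis consumes simulation output
    ("sim_out", "ckpt", 25.0),      # checkpoint written in parallel
    ("ckpt", "analysis", 5.0),      # analysis also waits on the checkpoint
    ("analysis", "viz", 10.0),      # visualization stage
]
cost, path = workflow_critical_path(edges)
# The critical path runs through the checkpoint branch: 40 + 25 + 5 + 10 = 80s.
```

Here each vertex is a named data state and each edge weight is the time a code spent mutating one state into the next, so the longest path through the graph identifies which sequence of data mutations bounds end-to-end workflow time.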





