• Young Scholar Commentary •

PERFORMANCE ENGINEERING PROBLEM IN HIGH PERFORMANCE COMPUTING

Tan Guangming

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100086, China
  • Received: 2022-06-10  Online: 2022-12-14  Published: 2022-12-08
  • About the author: Tan Guangming is a professor, Ph.D. supervisor, and director of the High Performance Computer Research Center at the Institute of Computing Technology, Chinese Academy of Sciences. He is a recipient of the National Science Fund for Distinguished Young Scholars and took part in developing the Dawning (Sugon) series of high performance computer systems. He has published more than 100 academic papers, including at CCF Class-A conferences (SC, PPoPP, PLDI) and in Nature family journals, has served on the editorial board of IEEE TPDS and on the program committees of several international conferences, and has received the Second Prize of the National Science and Technology Progress Award, the Lu Jiaxi Young Talent Award, and the national honorary title of "全国向上向善好青年" (Good Youth of China).
  • Funding:
    Supported by the National Natural Science Foundation of China (61972377, 62032023, T2125013)

Tan Guangming. PERFORMANCE ENGINEERING PROBLEM IN HIGH PERFORMANCE COMPUTING[J]. Journal on Numerical Methods and Computer Applications, 2022, 43(4): 343-362.

The core goal of high performance computing is the pursuit of ultimate computational performance. This paper summarizes the key technical challenges that must be overcome in the three stages of high performance computing: hardware engineering, software engineering, and performance engineering. Focusing on the performance-portability challenge of efficiently matching complex application workloads to diverse heterogeneous systems under the trend toward exascale computing, it explains the concepts behind performance engineering and why it matters. Finally, it discusses the three key technologies that performance engineering currently involves: pattern-driven performance modeling, an input-aware intelligent tuning engine, and software/hardware code generation based on a unified abstraction.
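Of the three key technologies named above, pattern-driven performance modeling lends itself to a quick illustration. The sketch below (Python, not taken from the paper) shows the basic calculation behind the widely used Roofline model: given a kernel's arithmetic intensity, it predicts an attainable performance ceiling and whether the kernel is compute- or memory-bound. The peak compute and bandwidth figures are hypothetical machine parameters, and the SpMV traffic estimate is a rough assumption.

    # Minimal Roofline-style bound: an illustrative sketch, not the paper's model.
    def roofline_bound(flops, bytes_moved, peak_gflops=1500.0, peak_bw_gbs=900.0):
        """Return the attainable GFLOP/s for a kernel and its limiting resource."""
        intensity = flops / bytes_moved                 # FLOPs per byte of memory traffic
        attainable = min(peak_gflops, peak_bw_gbs * intensity)
        limit = "compute-bound" if attainable >= peak_gflops else "memory-bound"
        return attainable, limit

    # Example: CSR sparse matrix-vector multiplication does about 2 FLOPs per
    # nonzero while streaming at least ~12 bytes per nonzero for the matrix data
    # alone (8-byte value + 4-byte column index), so its intensity is low and the
    # model predicts a bandwidth-limited ceiling.
    nnz = 10_000_000
    perf, limit = roofline_bound(flops=2 * nnz, bytes_moved=12 * nnz)
    print(f"predicted ceiling: {perf:.0f} GFLOP/s ({limit})")

On the hypothetical machine above this yields roughly 150 GFLOP/s against a 1500 GFLOP/s compute peak, which is one reason input-aware autotuning and format selection, the second technology named in the abstract, matter so much for sparse kernels.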
