Youth Review
Tan Guangming (谭光明)
Tan Guangming. Performance Engineering Problem in High Performance Computing (高性能计算中的性能工程问题)[J]. Journal on Numerical Methods and Computer Applications (数值计算与计算机应用), 2022, 43(4): 343-362.