• 论文 • 上一篇    

基于GPU架构的两层并行块Jacobi SVD算法

黄荣锋1,2, 赵永华1, 于天禹1,2, 刘世芳1,2   

  1. 1. 中国科学院计算机网络信息中心, 北京 100190;
    2. 中国科学院大学, 北京 100049
  • 收稿日期:2021-05-31 发布日期:2022-12-08
  • 基金资助:
    国家重点研发计划(2017YFB0202202)和中国科学院战略性先导科技专项 (XDC05000000).

黄荣锋, 赵永华, 于天禹, 刘世芳. 基于GPU架构的两层并行块Jacobi SVD算法[J]. 数值计算与计算机应用, 2022, 43(4): 380-399.

Huang Rongfeng, Zhao Yonghua, Yu Tianyu, Liu Shifang. A PARALLEL TWO-TIER BLOCKED JACOBI SVD ALGORITHM ON GPU[J]. Journal on Numerica Methods and Computer Applications, 2022, 43(4): 380-399.

A PARALLEL TWO-TIER BLOCKED JACOBI SVD ALGORITHM ON GPU

Huang Rongfeng1,2, Zhao Yonghua1, Yu Tianyu1,2, Liu Shifang1,2   

  1. 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2021-05-31 Published:2022-12-08
SVD(singular value decomposition)广泛应用于图像处理、人脸识别、信号降噪等领域.本文基于单边Jacobi SVD算法给出了块间和块内两层并行的块Jacobi SVD GPU算法.为了更好地利用GPU的共享内存,块间并行通过存储矩阵列块之间的内积解决了共享内存不足的问题.此外,块间并行还通过矩阵块操作技术提高数据利用率及数据预取技术实现数据访问和数据计算的重叠.块内并行通过直接更新矩阵列块之间的内积替代了更新矩阵列块以及更新矩阵列块之后计算矩阵列块之间内积的归约操作,增加了GPU线程的利用率.另一方面,块内并行将需要多次访问的数据存储于共享内存或寄存器,减少了对全局内存的访问从而提升了算法实现性能.在NVIDIA Tesla V100 GPU上的数值实验结果表明,本文的算法较Cusolver库有1.8×倍的加速,较MAGMA库中最快的算法加速达2.5×倍.
SVD (singular value decomposition) is wildly used in image processing, face recognition, signal processing, and other fields. In this paper, a parallel two-tier blocked Jacobi SVD GPU algorithm based on the one-sided Jacobi SVD algorithm and its effective implementation is presented. The parallel two-tier algorithm is composed of an inter-block parallel level and an intra-block parallel level. In the inter-block parallel level, the problem that the shared memory is too small to hold the matrix panels is overcome by storing the inner product of matrix panels on the shared memory instead. Besides, the matrix computation makes full use of the block operation technique to improve data reuse and the data prefetching technique to overlap the time of loading data and computing data. In the inner-block parallel level, for increasing the utilization of GPU threads, the computation of the inner product of matrix columns is avoided by updating the inner product of matrix columns parallelly. By storing data that can be reused many times on the shared memory or register files, the iterative process of intra-block parallelism level can reduce the access of the global memory, which improves the performance of our implementation. Numerical experiments on an NVIDIA Tesla V100 GPU show that the implementation of this paper is 1.8× and 2.5× times faster than the Cusolver and MAGMA libraries respectively.

MR(2010)主题分类: 

()
[1] Ranade A, Mahabalarao S S, Kale S. A variation on SVD based image compression[J]. Image & Vision Computing, 2007, 25(6):771-777.
[2] Singh N, Joshi S, Birla S, et al. Suitability of singular value decomposition for image watermarking[C]//the 6th International Conference on Signal Processing and Integrated Networks (SPIN). Piscataway, NJ:IEEE, 2019:983-986.
[3] Azadeh M, Ahmad M A. SSVD:Structural SVD-based image quality assessment[J]. Signal Processing:Image Communication, 2019, 5(75):54-63.
[4] Zhao L N, Hu W B, Cui L H. Face recognition feature comparison based SVD and FFT[J]. Journal of Signal and Information Processing, 2012, 3(2):259-262.
[5] Wang H H, Wang X, Cheng Y, et al. Research on noise reduction method of RDTS using D-SVD[J]. Optical Fiber Technology, 2019, 5(48):151-158.
[6] Golub G H, Kahan W. Calculating the singular values and pseudo-inverse of a matrix[J]. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis, 1965, 2(2):205-224.
[7] Gu M, Eisenstat S C. A divide-and-conquer algorithm for the bidiagonal SVD[J]. SIAM Journal on Matrix Analysis and Applications, 1995, 16(1):79-92.
[8] Li S, Gu M, CHENG L, et al. An accelerated divide-and-conquer algorithm for the bidiagonal SVD problem[J]. SIAM Journal on Matrix Analysis and Applications, 2014, 35(3):1038-1057.
[9] Kogbetliantz E G. Solution of linear equations by diagonalization of coefficients matrix[J]. Quarterly of Applied Mathematics, 1955, 13(2):123-132.
[10] Hestenes M R. Inversion of matrices by biorthogonalization and related results[J]. Journal of the Society for Industrial and Applied Mathematics, 1958, 6(1):51-90.
[11] Bischof C H. Computing the singular value decomposition on a distributed system of vector processors[J].Parallel Computing, 1989, 11(2):171-186.
[12] Arbenz P, Slapnicar I. An analysis of parallel implementations of the block -Jacobi algorithm for computing the SVD. In:Proc. 3rd Int. Conf. Industrial and Applied Mathematics (ICIAM 95), (Goetz Alefeld, Oskar Mahrenholtz, and Reinhard Mennicken, eds.), Hamburg, July 1995. Zeitschrift fur Angewandte Mathematik und Mechanik (Special issue). Berlin:Akademie Verlag, vol. I,pp. 343-344(1996).
[13] Hari V. Accelerating the SVD block-Jacobi method[J]. Computing, 2005, 75(1):27-53.
[14] Drmac Z, Veselic K. New fast and accurate Jacobi SVD algorithm.I[J]. SIAM Journal on Matrix Analysis and Applications, 2008, 29(4):1322-1342.
[15] Drmac Z, Veselic K. New fast and accurate Jacobi SVD algorithm.II[J]. SIAM Journal on Matrix Analysis and Applications, 2008, 29(4):1343-1362.
[16] Li H, Kluger Y,Tygert M. Randomized algorithms for distributed computation of principal component analysis and singular value decomposition[J]. Advances in Computational Mathematics, 2018, 44(5):1651-1672.
[17] Andrei T, Pantelimon G. A fast singular value decomposition algorithm of general k-tridiagonal matrices[J]. Journal of Computational Science, 2019, 31(2):1-5.
[18] Tugay R, Oguducu S G. Ranky. An approach to solve distributed svd on large sparse matrices[C]//2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), Kuala Lumpur:IEEE, 2018:1-6.
[19] Lahabar S, Narayanan P J. Singular value decomposition on GPU using CUDA[C]//2009 IEEE International Symposium on Parallel & Distributed Processing, Piscataway, NJ:IEEE, 2009:1-10.
[20] http://icl.utk.edu/magma/
[21] Novakovic V. A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units[J]. SIAM Journal on Scientific Computing, 2015, 37(1):1-30.
[22] https://docs.nvidia.com/CUDA/cusolver/index.html
[23] Drmac Z. Implementation of Jacobi rotations for accurate singular value computation in floating point arithmetic[J]. SIAM Journal on Scientific Computing, 1997, 18(4):1200-1222.
[24] Brent R P, Luk F T. The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays[J]. SIAM Journal on Scientific and Statistical Computing, 1985, 6(1):69-84.
[25] Eberlein P J. On one-sided Jacobi methods for parallel computation[J]. SIAM Journal on Algebraic Discrete Methods, 1987, 8(4):790-796.
[26] Luk F T, Park H. On parallel Jacobi orderings[J]. SIAM Journal on Scientific and Statistical Computing, 1989, 10(1):18-26.
[27] Luk F T, Park H. A proof of convergence for two parallel Jacobi SVD algorithms[J]. IEEE Transactions on Computers, 1989, 38(6):806-811.
[28] https://developer.nvidia.com/cublas
[29] Rivera C, Chen J R, Xiong N, et al. TSM2X:High-performance tall-and-skinny matrix-matrix multiplication on GPU s[J]. Journal of Parallel and Distributed Computing, 2021, 151(5):70-85.
[1] 谭光明. 高性能计算中的性能工程问题[J]. 数值计算与计算机应用, 2022, 43(4): 343-362.
[2] 兰林, 朱春钢. 基于Newton迭代法的最小二乘渐进迭代逼近[J]. 数值计算与计算机应用, 2022, 43(1): 88-111.
[3] 刘伟峰. 高可扩展、高性能和高实用的稀疏矩阵计算研究进展与挑战[J]. 数值计算与计算机应用, 2020, 41(4): 259-281.
[4] 于天禹, 赵永华, 赵莲. 基于神威太湖之光架构的LOBPCG并行算法研究[J]. 数值计算与计算机应用, 2019, 40(4): 291-309.
[5] 徐小文. 并行代数多重网格算法:大规模计算应用现状与挑战[J]. 数值计算与计算机应用, 2019, 40(4): 243-260.
[6] 谢力, 王武, 冯仰德. 基于多层半可分结构矩阵的快速算法与并行实现[J]. 数值计算与计算机应用, 2017, 38(1): 37-48.
[7] 李政, 冯春生, 张晨松. 一种针对GPU上的油藏数值模拟的高效SpMV[J]. 数值计算与计算机应用, 2016, 37(4): 315-324.
[8] 郑汉垣, 宋安平, 张武. 基于MIC的GaBP并行算法[J]. 数值计算与计算机应用, 2015, 36(1): 31-41.
[9] 胡伟. 常微分方程初值问题的完全三阶并行块算法及实验阶研究[J]. 数值计算与计算机应用, 2014, 35(3): 163-170.
[10] 刘新, 张慧娟, 孙彬彬, 王艳超. 基于角点检测和奇异值分解的多重数字水印算法[J]. 数值计算与计算机应用, 2013, 34(4): 279-285.
[11] 王玉柱, 姜金荣, 蔡长青, 迟学斌, 岳天祥. 三维变分资料同化系统并行算法设计与实现[J]. 数值计算与计算机应用, 2013, 34(3): 231-240.
[12] 吴洋, 赵永华, 纪国良. 一类大规模稀疏矩阵特征问题求解的并行算法[J]. 数值计算与计算机应用, 2013, 34(2): 136-146.
[13] 乔海军, 李会元. 二维各向同性湍流直接数值模拟的六边形谱方法及其GPU实现和优化[J]. 数值计算与计算机应用, 2013, 34(2): 147-160.
[14] 张学波, 李晓梅. 快速求解一类Toeplitz循环三对角线性方程组的分布式并行算法[J]. 数值计算与计算机应用, 2009, 30(3): 161-169.
[15] 姚国柱, 廖安平, 段雪峰. 矩阵方程$AXB=C$的最小二乘Hamilton解[J]. 数值计算与计算机应用, 2009, 30(1): 48-57.
阅读次数
全文


摘要