Sparse Matrix-Vector Multiplication on Processors with Small Memory Footprints

IP.com Number IPCOM000184518D
thumb 01 thumb 02 thumb 03 thumb 04
Scaled page rendering of the first four pages
Dated Jun 29, 2009 UTC
Size 5 page(s) (32.5 KB)
 
Disclosed by IBM-IPCOM

Publication Summary

We present a sparse matrix-dense vector multiplication procedure which is suitable for a subclass of sparse matrices (which we call block-sparse matrices), and which effectively harnesses the compute power of the Cell/B.E. processor. This technique could be mapped to GPUs, Larrabee, or any low-memory footprint microprocessor. .
Country
Language English (United States)

About this Publication

This document was submitted to IP.com's Prior Art Database and this preview is designed to provide you with information regarding the contents of this document by displaying up to the first four pages of the document as scaled page renderings and displaying a limited amount of text which was extracted from the document on the Text Preview Tab.

To find out more on how to obtain the entire document, click the Download tab. There is a charge for downloading some Prior Art Database documents; please examine carefully whether you believe this document fills your needs before purchasing.

For more information about the Prior Art Database, visit the Learn section of this website. Thank you for visiting IP.com's Prior Art Database! You may wish to check out our Intellectual Property Library website before you leave.

Continue to Text Preview →

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 24% of the total text.
This text was extracted from a PDF file.

Page 1 of 5

Sparse Matrix-Vector Multiplication on Processors with Small Memory Footprints

In this invention we describe an implementation of the 3x3 block sparse matrix multiply operation that leverages the architectural elements fast local storage, fast element interconnect bus (EIB), and high-

performance SIMD units to distribute, synchronize, and balance the

workload across multiple SPEs to achieve the highest performance

To make more effective use of the Cell/B.E.'s 4-way SIMD operations and avoid the sum-across operation, it would be convenient to calculate four elements of the result vector at a time. This can be done as follows. Assuming that the matrix

A

                               can be stored in column-major form, we load the first element of each of the first four rows into a vector register, replicate ("splat") b 0 into the four elements of a second vector register, multiply the registers together, and store the product into an accumulator. We then move to the next column of the original matrix, loading the first four elements into one vector register, replicating the corresponding vector element into another vector register, and performing a multiply-add of the registers to the accumulator. When all of
the columns have been processed in this manner, we have the first four elements of the result vector in the accumulator. This calculation is repeated to process the remaining rows, four at a time.

The soft-body dynamics application with which we are concerned gives us a block-sparse matrix with 3x3 subblocks. We represent this matrix as a row-major ordered list of 3x3 subblocks; within each subblock, the matrix elements are stored in column-major order. In this scenario, each SPE can copy some number of block rows into its local storage and perform the matrix-vector multiplication for that subset of the entire sparse matrix. SPEs can then work in

parallel and asynchronously, computing parts of the result vector and copying them out to main

storage via DMA. By computing on three rows (one block row) at a time we waste one of the four positions in the SIMD registers, but we consider this a reasonable price to pay for relatively straightforward code.

Implementing sparse matrix-vector multiplication on the Cell/B.E.™ presents a number of challenges. While the operation itself parallelizes readily (each block row of the matrix combines with the input vector to produce three elements of the output vector, which are not affected by any other block rows of the matrix), the limited amount of local store on the SPEs and the alignment requirements for efficient DMA of the matrix and vector data require some planning.

Workload Parallelization

In order to divide the workload between the SPEs we used a static approach in which each SPE works on a pre-determined set of rows (which is a multiple of the 32-block-row work unit). The assignment of rows to the SPEs is done before matrix multiplication begins and it is not dynamically m...

Download This Document →

 

Copyright © 2004-2010 IP.com. All Rights Reserved.

Privacy Policy   |   About IP.com   |   Contact Us