



### This is an ACCEPTED VERSION of the following published document:

Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, and Torsten Hoefler. 2023. VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). Association for Computing Machinery, New York, NY, USA, Article 72, 1–14. https://doi.org/10.1145/3581784.3607087

Link to published version: https://doi.org/10.1145/3581784.3607087

### **General rights**:

© 2023 Authors | ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in *International Conference for High Performance Computing, Networking, Storage and Analysis*, https://doi.org/10.1145/3581784.3607087

Roberto L. Castro\* roberto.lopez.castro@udc.es CITIC Universidade da Coruña A Coruña, Spain

Tal Ben-Nun talbn@inf.ethz.ch Department of Computer Science ETH Zürich Zürich, Switzerland

Andrei Ivanov anivanov@inf.ethz.ch Department of Computer Science ETH Zürich Zürich, Switzerland

Basilio B. Fraguela basilio.fraguela@udc.es CITIC Universidade da Coruña A Coruña, Spain

Diego Andrade diego.andrade@udc.es CITIC Universidade da Coruña A Coruña, Spain

Torsten Hoefler htor@inf.ethz.ch Department of Computer Science ETH Zürich Zürich, Switzerland

### ABSTRACT

The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize the hardware support of specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2× speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to a $37\times$ speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.

### **KEYWORDS**

Sparse Tensor Cores, GPU, Pruning, Sparsification, CUDA

### **1** INTRODUCTION

The rapid progress of Deep Learning (DL) is revolutionizing Artificial Intelligence (AI) in areas such as Natural Language Processing (NLP). Large Language Models (LLMs) are at the forefront of modern NLP systems [7, 34]; however, their massive growth has led to unprecedented computational requirements [1, 2, 14, 18]. As a result, training transformers has become a dominant task in DL, with costs reaching millions of dollars and significant energy and carbon emissions [32]. Therefore, it is critical to improve their inference and training performance. One of the most widely used techniques for this purpose is network pruning [15], which removes the less significant weights to produce simpler and compressed, yet accurate models.

There is a plethora of pruning algorithms and sparse formats focused on accelerating tensor operations such as matrix-matrix multiplications (MMMs) by means of specialized hardware like Tensor Core Units (TCUs) [37]. While these algorithms and formats reduce the number of arithmetic operations and memory usage compared to their dense counterparts, achieving significant speedup on these accelerators while maintaining model accuracy is challenging [16]. Semi-structured pruning can yield practical speedups at moderate sparsity levels (e.g., 80-90%) [4, 5, 25]. However, the irregularity of the sparse input matrices still limits performance and makes it difficult to reach the theoretical peak implied by the reduction in the number of arithmetic operations [11].

Recent generations of NVIDIA GPUs include Sparse Tensor Cores (SPTCs) that are specifically designed for sparse computation [27]. SPTCs promise to accelerate math operations by up to 2× at 50% sparsity. The data layout required to use SPTCs imposes strict constraints (i.e., the 2:4 format, where every group of 4 consecutive elements has 2 nonzero values), but it reduces the irregularity of the sparse input w.r.t. other performance-aware sparse formats (e.g., vector-wise, block-wise). This makes the N:M format well suited to execution on GPUs, since it favors key aspects of the execution of tensor operations such as inter- and intra-warp load balance. However, there is an important limitation related to the usage of SPTCs and the 2:4 format: recent models like LLMs commonly have hundreds of millions to trillions of parameters, making it feasible to prune them to higher sparsity ratios with little or no loss in accuracy [21]. Unfortunately, there is currently no hardware support for executing arbitrary N:M formats with higher compression ratios, which limits the total achievable speedup.

Recent research has explored the N:M format [6, 8]. However, these investigations have been limited to a theoretical perspective, such as network pruning, or have relied on CPU implementations due to a lack of hardware support for alternative N:M patterns on GPUs. To address these limitations, we propose the Vectorized N:M format, which we refer to as V:N:M<sup>1</sup>. This format introduces an abstraction layer over SPTCs, enabling the execution of alternative N:M formats and arbitrary sparsity ratios. The vectorization aspect is derived from the selection of vertical vectors of elements that are stacked together to provide the row-wise N:M pattern. This approach enables the conversion from generic N:M formats to the 2:4 format that is accepted by SPTCs. To efficiently exploit the benefits of the V:N:M format, we propose Spatha<sup>2</sup>, a template-based library dedicated to general matrix-matrix multiplication in half precision where one of the operands is sparse (SpMM). Spatha serves as an open-source alternative to cuSparseLt [28] and removes its 2:4 restriction. The main contributions of this paper are:

<sup>\*</sup>Corresponding author: Roberto L. Castro (roberto.lopez.castro@udc.es), Universidade da Coruña, CITIC, Computer Architecture Group, 15071 A Coruña, Spain

<sup>1</sup>Pronounced "venom"

- A new sparse matrix format V:N:M which enables arbitrary N:M patterns on SPTCs.
- Highly optimized SpMM kernels to efficiently exploit the V:N:M format. Specifically, we propose a template-based implementation that can be tuned depending on the input dynamics, such as GEMM size or the V:N:M format configuration.
- A second-order pruning technique tailored for the V:N:M format and scalable to the dimensionality of LLMs. This technique allows the sparsification to high sparsity ratios with little to no loss in accuracy (e.g., ~ 2% drop in BERT F1 score on the SQuAD dataset with 2:16 sparsity), which is required for the full exploitation of the V:N:M format.
- Spatha achieves unprecedented speedups w.r.t. its dense counterparts (e.g., cuBLAS), yielding up to 37× faster MMMs on matrices extracted from real-world DL models. Furthermore, the Spatha implementation provides speedups of up to 1.38× over the vendor library for 2:4 sparsity, cuSparseLt.
- For end-to-end sparse LLM inference, Spatha shows a GEMM time reduction of 11× at 2:32 sparsity on real-world models such as GPT-3.

The source code of VENOM is available at https://github.com/UDC-GAC/venom.

### 2 BACKGROUND

This section presents the technical background of the paper, covering network pruning techniques and the Sparse Tensor Cores of NVIDIA GPUs.

### 2.1 Network pruning

In DL, pruning is a technique used to reduce memory usage, which can also reduce the computational load when combined with compressed storage formats and efficient sparse kernels. Pruning techniques can be categorized based on various criteria, such as the pruning strategy employed, or the granularity of the pruning.

Pruning schemes are often based on weight saliency metrics, which directly correlate with the expected impact on accuracy when those weights are removed from the network. Various methods exist to select the candidate weights for removal, including magnitude pruning [20], which selects weights with lower absolute values, and gradient-based methods that use the gradient applied to each weight to identify those that are trending toward zero faster. Within the gradient-based methods, we can find first-order techniques based on first-derivative information [31, 38], and second-order ones [9, 21, 23], which seek the set of weights whose removal will generate a minimum loss increase in the network. Second-order methods have proven to be effective in pruning convolutional networks in the past, and they have recently been optimized for Large Language Models (LLMs) [21].

As for the granularity of the pruning, unstructured methods [13] remove weights individually, with gradual magnitude pruning (GMP) being the most commonly used variant [10]. On the other end of the granularity spectrum, structured methods [26, 35] prune complete components such as layers or, in the case of transformer networks, heads [36]. In between, semi-structured methods prune groups of weights. These latter methods aim to balance performance and accuracy by defining specific formats that allow the underlying hardware to be exploited more efficiently. These methods often imply the usage of tailored compressed storage formats and custom kernels [11, 22]. The N:M format, which enables the use of Sparse Tensor Cores (SPTCs) in NVIDIA GPUs, can be classified in this last group.

### 2.2 Sparse Tensor Cores of NVIDIA GPUs

The CUDA programming model organizes GPU kernels into three granularity levels: thread-blocks, warps, and threads. A thread block is composed of a set of warps, with warps being the basic scheduling unit in CUDA. Each warp consists of 32 threads.

NVIDIA GPUs consist of an array of Streaming Multiprocessors (SMs), with all SMs sharing the L2 cache, and a DRAM memory, also called Global Memory (GMEM). Each SM is divided into processing blocks, each one having a Register File (RF), a warp scheduler, and an L0 instruction cache. All the processing blocks within an SM share an L1 cache, which is partially used as Shared Memory (SMEM). Each processing block is also equipped with four types of units: Floating-Point Units (FPU), Tensor-Core Units (TCU), Int Units (ALU) and Special Function Units (SFU).



Figure 1: The 2:4 format and its mapping to SPTCs

Recent generations of NVIDIA GPUs have extended their TCUs to also handle row-wise 2:4 sparsity. These updated TCUs include hardware support for sparse computation, and are referred to as Sparse Tensor Cores (SPTCs). To exploit SPTCs, the first argument in tensor operations must be stored in NVIDIA's N:M sparse format, where *N* represents the maximum number of non-zero elements in a block of *M* values. Figure 1 illustrates this format. The left side of the figure shows an uncompressed sparse matrix following the row-wise 2:4 pattern. The compression of that $R \times K$ matrix requires two structures: (1) an $R \times K/2$ matrix representing the values of the non-zero elements, and (2) a metadata structure which contains the position of each nonzero value within each group of 4 values. Finally, Figure 1, right side, illustrates the mapping of a 2:4 sparse operation onto SPTCs. Notice that the metadata structure is also used by the hardware to select the corresponding elements in the dense matrix *B* and perform the Matrix Multiply-Accumulate (MMA) operation.

<sup>2</sup>SParse linear Algebra rouTines for High-performance Applications. The name is motivated by the analogy with the Cutlass library, with the accent on sparse computation - a sharp and efficient tool to cut through the complexity of sparse routines.
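The two structures described above can be illustrated with a minimal Python sketch. It is not the hardware layout or any library API: the helper names `compress_2_4` and `sparse_dot` are hypothetical, and groups with fewer than two non-zeros are padded with explicit zeros as an assumption, so that every group stores exactly two entries.

```python
def compress_2_4(row):
    """Compress one row that follows the 2:4 pattern.

    Returns (values, indices): two stored values per group of 4 and, for
    each stored value, its position (0..3) inside the group. The positions
    play the role of the 2-bit metadata.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzero = [i for i, x in enumerate(group) if x != 0]
        if len(nonzero) > 2:
            raise ValueError("group has more than 2 non-zeros")
        # Assumption: pad with unused positions so each group stores 2 entries.
        spare = [i for i in range(4) if i not in nonzero]
        kept = sorted(nonzero + spare[:2 - len(nonzero)])
        for i in kept:
            values.append(group[i])
            indices.append(i)
    return values, indices


def sparse_dot(values, indices, b_col):
    """Dot product of a compressed 2:4 row with a dense column of B.

    The metadata selects which element of each group of 4 in b_col is used.
    """
    total = 0
    for j, (v, idx) in enumerate(zip(values, indices)):
        group = j // 2  # two stored values per group of 4
        total += v * b_col[4 * group + idx]
    return total


row = [0, 5, 0, 7, 1, 0, 0, 2]
b_col = [1, 2, 3, 4, 5, 6, 7, 8]
values, indices = compress_2_4(row)
result = sparse_dot(values, indices, b_col)
```

The compressed row holds half as many values as the original, and the product only touches the elements of `b_col` that the metadata points to, which mirrors the selection the hardware performs on *B*.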

| Precision   | Format | Supported shapes          |
|-------------|--------|---------------------------|
| fp32        | 1:2    | k8, k16                   |
| half (fp16) | 2:4    | k32, k16                  |
| uint8       | 2:4    | k32, k64                  |
| uint4       | 2:4    | k64, k128                 |

Table 1: Matrix Shapes for *mma.sp* on SPTCs. M and N dimensions are fixed to 16 and 8, respectively (m16n8)

SPTCs can be accessed in CUDA using the NVPTX API, which includes the *mma.sp* instruction. SPTCs support various shapes of this instruction depending on the data precision (Table 1). This instruction multiplies an $m \times k$ matrix by a $k \times n$ matrix, where $m = 16$ and $n = 8$ are fixed dimensions, and $k$ represents the sparsified dimension, which can vary in size. This paper focuses on half-precision kernels. Instruction shapes define the sizes of the left-hand-side (LHS) and the right-hand-side (RHS) operands as inputs to TCUs. For example, $k = 32$ implies that the LHS operand has a shape of $m \times k = 16 \times 32$ while the RHS is $k \times n = 32 \times 8$. It is important to note that the LHS is 50% sparse, meaning that its stored size is $16 \times 16$ (i.e., $32/2$ columns). NVIDIA's notation for this instruction is m16n8k32.

### **3 THE V:N:M FORMAT**

This section presents the new V:N:M format, which enables pruning to arbitrary N:M ratios while retaining the use of SPTCs, which natively support only 2:4 patterns.



Figure 2: The V:N:M pruning procedure

Sparse compression formats are of great significance in many HPC areas other than DL. However, the characteristics of the sparse matrices in DL workloads differ from those in other areas in several aspects [11]: (1) the sparsity level is generally much lower, (2) the number of non-zeros per row is higher and (3) the load imbalance is more pronounced. To address these challenges, ad-hoc solutions for DL workloads have been developed in two different planes: compression formats and pruning techniques, often interlinked. They seek the efficient exploitation of the hardware during the execution of tensor operations in DL workloads.

A new area of research is focused on enhancing control over the distribution of non-zero elements in sparse matrices. This involves, for example, selecting 2D dense groups with size $v \times v$ (Figure 2, **①**) or 1D groups of length $v$, either row-wise or column-wise (**②**). The aim is to create sparse matrices that are more regular, making them more suitable for efficient execution on GPUs. Block-based pruning techniques (**①** and **②**) are particularly useful for improving data reuse in the L1 cache or registers during the multiplication of sparse matrices. Furthermore, optimized sparse formats, which compress their data, can be designed to facilitate traversal for the access patterns that arise during matrix multiplication [25, 30].

On the one hand, **①** can be overly aggressive in dropping blocks of elements, leading to a significant reduction in accuracy as the sparsity level increases. On the other hand, **②** offers more flexibility and enables higher sparsification ratios. However, it requires small vector lengths (e.g., $v \le 8$) to prevent accuracy loss, which is a limiting factor. Furthermore, in these approaches, the varying number of elements per row can cause load imbalance and the associated negative effects, such as thread divergence, inefficient memory transactions, and low occupancy.

The N:M format ③ provides an alternative that overcomes most of the weaknesses of other performance-aware methods. Moreover, NVIDIA GPUs recently included hardware support for this format, but it is limited to 2:4. This paper introduces the new V:N:M format ④ which combines block-wise storage, and vector-wise and N:M pruning to enable the exploitation of SPTCs for arbitrary N:M patterns, leveraging higher compression ratios and reducing further the number of arithmetic operations required in MMMs.



Figure 3: The V:N:M compression format

Figure 2 illustrates how this approach starts by partitioning the original dense matrix into blocks of $V \times M$ elements (block-wise). Then, the four most significant columns of each block are selected (vector-wise pruning), and for each row of four elements in a block, the two most meaningful weights are kept (2:4 pruning). These two levels of pruning (vector-wise and N:M) enable the exploitation of SPTCs for matrices with arbitrary levels of sparsity, as the vector-wise pruning stage diversifies the sparsity level, and N:M pruning imposes the restrictions required later by SPTCs. That is, in ④, the SPTC vector is 2:4, but it belongs to a 6-column row, where 2 columns were fully pruned. It is actually an implementation of a 2:6 sparsity pattern that is mapped onto SPTCs as the required 2:4.
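The two pruning levels can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pruning method evaluated in the paper: the function name `prune_vnm` is hypothetical, weights are ranked by magnitude, and a column's significance inside a block is taken to be the sum of the absolute values of its V entries, a criterion the text above does not fix.

```python
def prune_vnm(matrix, V, M, N=2):
    """Magnitude-based V:N:M pruning of a dense matrix given as a list of rows.

    Assumption: column significance is the sum of absolute values of the
    V entries of that column inside the block.
    """
    R, K = len(matrix), len(matrix[0])
    if R % V or K % M:
        raise ValueError("R must be divisible by V and K by M")
    pruned = [[0] * K for _ in range(R)]
    for r0 in range(0, R, V):
        for c0 in range(0, K, M):
            # Vector-wise stage: keep the 4 most significant columns of the block.
            def column_score(c):
                return sum(abs(matrix[r][c]) for r in range(r0, r0 + V))
            cols = sorted(range(c0, c0 + M), key=column_score, reverse=True)[:4]
            # N:M stage: in every row keep the N largest of the 4 surviving weights.
            for r in range(r0, r0 + V):
                keep = sorted(cols, key=lambda c: abs(matrix[r][c]), reverse=True)[:N]
                for c in keep:
                    pruned[r][c] = matrix[r][c]
    return pruned


# One 2 x 6 block, i.e. the 2:6 pattern mentioned in the text with V = 2.
dense = [
    [9, 1, 8, 0, 7, 2],
    [1, 6, 3, 0, 5, 9],
]
sparse = prune_vnm(dense, V=2, M=6)
```

In this example columns 1 and 3 are dropped for the whole block, so the weight 6 in the second row is removed even though it is large within its row. That is the cost of sharing the column selection across V rows.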

Finally, the data is represented using a new block-wise compression format shown in Figure 3. As for the NVIDIA 2:4 layout (Figure 1), the format requires an array with the non-zero values, and a 2-bit metadata index per non-zero (*m-indices*). Notice that now, each 2-bit metadata index refers to one of the 4 columns that we have selected in each block and not to each column of the original dense input matrix (see **④** in Figure 2). Furthermore, the size of these two structures depends on the M value, more specifically their shape now is  $R \times K/M \times 2$ . This format requires a third structure *column-loc* of size  $R/V \times K/M \times 4$ , that indicates which 4 columns (out of *M*) of each block were selected in the vector-wise pruning stage.
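As a hedged illustration of the three structures, the following Python sketch compresses a matrix that already follows a V:2:M pattern. The function name `compress_vnm`, the nested-list layout, and the zero padding for blocks or rows with fewer non-zeros than allowed are assumptions made for readability; the actual storage order used by Spatha is different and is discussed later in the paper.

```python
def compress_vnm(pruned, V, M):
    """Compress a matrix already pruned to V:2:M (requires M >= 4).

    values[r][b]     : the 2 stored values of row r in column-block b
    m_indices[r][b]  : their positions (0..3) among the 4 selected columns
    column_loc[g][b] : the 4 selected columns (0..M-1) of row-group g, block b
    """
    R, K = len(pruned), len(pruned[0])
    n_blocks, n_groups = K // M, R // V
    values = [[None] * n_blocks for _ in range(R)]
    m_indices = [[None] * n_blocks for _ in range(R)]
    column_loc = [[None] * n_blocks for _ in range(n_groups)]
    for g in range(n_groups):
        rows = range(g * V, (g + 1) * V)
        for b in range(n_blocks):
            # Columns of this block that hold at least one non-zero.
            used = [c for c in range(M)
                    if any(pruned[r][b * M + c] != 0 for r in rows)]
            if len(used) > 4:
                raise ValueError("block uses more than 4 columns")
            # Assumption: pad with unused columns so exactly 4 are recorded.
            spare = [c for c in range(M) if c not in used]
            cols = sorted(used + spare[:4 - len(used)])
            column_loc[g][b] = cols
            for r in rows:
                nz = [i for i, c in enumerate(cols) if pruned[r][b * M + c] != 0]
                if len(nz) > 2:
                    raise ValueError("row keeps more than 2 values per block")
                pad = [i for i in range(4) if i not in nz][:2 - len(nz)]
                kept = sorted(nz + pad)
                values[r][b] = [pruned[r][b * M + cols[i]] for i in kept]
                m_indices[r][b] = kept
    return values, m_indices, column_loc


pruned = [
    [9, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 5, 9],
]
values, m_indices, column_loc = compress_vnm(pruned, V=2, M=6)
```

Here `values` and `m_indices` have shape $R \times K/M \times 2$ and `column_loc` has shape $R/V \times K/M \times 4$, matching the sizes given above.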

### 4 SPATHA: A HIGH-PERFORMANCE SPARSE LIBRARY FOR SPARSE MMM

This section provides an in-depth description of Spatha, the sparse kernel implementation associated with the V:N:M format. Sparse Matrix-Matrix multiplication (SpMM) is an important workload in DL that serves as the sparse counterpart to Matrix-Matrix Multiplication (MMM). This routine is widely used in various components of modern DL models. For instance, in the forward pass of a pruned model, the sparse weight matrix is multiplied by a dense activation matrix. Similarly, in transformers, the self-attention operation is performed by multiplying a sparse attention weight matrix by a dense one. Thus, optimizing this routine is crucial to improve the efficiency and performance of these models.



Figure 4: Mapping a 4:2:8 format onto Sparse Tensor Cores (which natively support only the 2:4 format)

Figure 4 shows an example of how the new V:N:M format (4:2:8 in the figure) is mapped onto SPTCs, which natively support only the 2:4 format. It shows how the SPTC is fed with the appropriate values from a row of the sparse matrix and a column of the dense matrix. The LHS operand is an $R \times K/4$ dense matrix after having been pruned to a sparsity of 75% (2:8). This pruning reduces the required multiply-and-add operations by a factor of 4 (from 16 to 4), and it also halves the rows loaded from the dense matrix B (selected by the values contained in *column-loc*).
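To make the role of *column-loc* and the *m-indices* concrete, here is a plain Python reference of the product, written as a sketch rather than as the Spatha kernel. The function name `spmm_vnm` and the nested-list layout of the three structures are assumptions for illustration; no tiling, pipelining, or SPTC instruction is modeled.

```python
def spmm_vnm(values, m_indices, column_loc, B, V, M):
    """Reference product C = A @ B with A stored in V:N:M structures.

    values[r][b], m_indices[r][b] : stored values of row r in block b and
                                    their positions among the 4 selected columns
    column_loc[g][b]              : the 4 columns (0..M-1) kept in row-group g, block b
    """
    n_rows, n_cols = len(values), len(B[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        g = r // V  # all V rows of a group share the same column selection
        for b in range(len(values[r])):
            cols = column_loc[g][b]  # only these 4 rows of B are needed
            for v, idx in zip(values[r][b], m_indices[r][b]):
                k = b * M + cols[idx]  # row of B selected through the metadata
                for c in range(n_cols):
                    out[r][c] += v * B[k][c]
    return out


# A single 2 x 8 block (V = 2, M = 8) with hand-written structures.
column_loc = [[[1, 3, 4, 6]]]
values = [[[2, 3]], [[4, 5]]]
m_indices = [[[0, 2]], [[1, 3]]]
B = [[k + 1, 2 * (k + 1)] for k in range(8)]
C = spmm_vnm(values, m_indices, column_loc, B, V=2, M=8)
```

Each output element is built from 2 multiply-adds instead of 8, and only rows 1, 3, 4 and 6 of `B` are ever read for this block, which corresponds to the operation and traffic reductions described above.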

### 4.1 Kernel design

The design of an efficient CUDA kernel mostly depends on **three main stages**: (1) the efficient loading of the data to the top levels of the memory hierarchy (i.e., GMEM->SMEM->RF), (2) the computation, and (3) the storage of the results (i.e., RF->SMEM->GMEM). Figure 5 covers **stage 1**, particularly the data movement from GMEM to RF, which is divided into three steps (1.1 to 1.3). Figure 6 focuses on **stage 2**, and shows how the data in the RF is mapped onto SPTCs in three steps (2.1 to 2.3). Finally, Figure 8 illustrates how **stage 3** is performed (steps 3.1 and 3.2).



Figure 5: Thread-Block Tile and Warp tile view (stage 1)

Spatha is designed as a template-based library, where several parameters can be tuned depending on the input properties. Considering a  $R \times K \times C$  GEMM problem, these parameters are: the thread-block tile size ( $BS_r \times BS_k \times BS_c$ ), the warp tile size ( $WS_r \times WS_k \times WS_c$ ), the mma instruction shape ( $MMA_r \times MMA_k \times MMA_c$ ) and the level of memory pipelining (*batchSize*).

4.1.1 Stage 1-Data loading. Figure 5 shows the Spatha procedure to load the operands from GMEM onto RF. There are two dimensions to be taken into account: the data location (i.e., GMEM, SMEM, and RF), and the scope of this data from the NVIDIA programming model perspective (i.e., thread-block and warp). Step 1.1 loads the *column-loc* structure from GMEM to SMEM with a two-level prefetching strategy. Note that the *column-loc* information is used to select the rows of B to be loaded from GMEM (Figure 5, left side) to SMEM (step 1.2). Prefetching this information breaks the data dependency with the activation matrix. Furthermore, *column-loc* is small, so it is convenient to load the information of multiple tiles together to maximize memory bandwidth. Next, step 1.2 loads the corresponding A and B tiles from GMEM to SMEM. Each thread-block is responsible for an output block of size $BS_r \times BS_c$. More specifically, $BS_r = V$, so each thread-block will load **only** the rows of B selected by the *column-loc* structure. To avoid memory stalls due to data dependencies with the next steps, we pipeline step 1.2 with step 1.3 and stage 2 (computation), taking advantage of CUDA asynchronous copies. The pipelining degree depends on the *batchSize* parameter mentioned previously. Finally, in step 1.3, each warp is responsible for an output block of size $WS_r \times WS_c$, so the corresponding tiles are loaded from SMEM to RF. Note that all the previously mentioned memory transactions have been optimized to use 128-bit instructions. At this point, we also load the *m-indices* information directly into the RF.

4.1.2 Stage 2-Computation. When all the data is loaded in the RF, stage 2 starts, which performs the Matrix Multiply-Accumulate (*mma.sp*) on this data using SPTCs. Figure 6 shows a detailed view of stage 2, depicting how the data in the RF is mapped onto SPTCs to be executed. Each warp has to break down the warp tile into instruction tiles, which depend on the instruction shapes available on SPTCs, in this example m16n8k32. The first step selects $MMA_k = 16$ elements from the warp tile and maps this data to SPTCs following the LHS layout shown in Figure 6. This layout represents the LHS fragment of the *mma.sp* instruction. That means that, if $WS_r = 32$, we will need to iterate twice over the rows of A's warp tile. Similarly, the next step maps B's warp tile information onto SPTCs following the RHS layout shown in Figure 6, which represents the RHS fragment of the *mma.sp* instruction. At this point, the *mma.sp* instruction is executed.



Figure 6: SPTCs view

Storage order. Related to stages 1 and 2, we propose a specific order to store the non-zero values and the *m-indices* structure of the V:N:M format, which merges, once again, the block-wise and the N:M principles. This order is represented in Figure 7, and it seeks to optimize the data traversal during data loading and computation. In this representation, half of the non-zero structure shows the access pattern followed to store the data, while the other half shows how the second half-warp is mapped into this structure. This storage order enables 128-bit memory transactions, ensures memory coalescing, and can dispense with the *ldmatrix* instruction, which is known to cause bank conflicts and can require more Shared Memory transactions to sequentially serve the memory access [33].



Figure 7: Storage order

4.1.3 Stage 3-Result storage. Once the product is calculated, we have to write the output tiles back to GMEM (stage 3). This requires storing the intermediate partial results in SMEM. On NVIDIA GPUs, shared memory is partitioned into banks, each 32 bits wide. Each bank can only serve one address at a time, so if the threads of a quarter-warp (with 128-bit instructions) access the same bank, the instruction is serialized. This effect is known as a bank conflict. An example of thread mapping to SMEM with $BS_c = 64$ is shown in Figure 8. The left side of the figure shows how the threads in a warp are mapped to SMEM banks during the storage of their partial results (step 3.1). These stores are performed with 128-bit instructions. Padding elements have been added to avoid bank conflicts. In this specific example, each thread has accumulated 8 partial results ($BS_c/MMA_c = 64/8$), so the thread mapping is repeated 8 times, meaning that each thread needs 8 iterations to store its partial results. Each color represents a quarter-warp, so we can see that each group of 8 consecutive threads accesses a different memory bank in the same iteration.



Figure 8: Conflict-free accesses for output tiles on SMEM

The right side of Figure 8 shows step 3.2, that is, the SMEM thread mapping designed to read the previously stored intermediate results and, finally, write them back to GMEM. The loads from SMEM and the stores to GMEM are performed with 128-bit instructions. Once again, each thread needs to access SMEM 8 times to read all the data. We have colored the accesses related to the first quarter-warp, which depict a conflict-free layout.

Ablation study - Spatha performance and column-loc overhead. In Figure 9, we present the results of a microbenchmark study on matrices of fixed outer dimensions (corresponding to the size of one BERT<sub>large</sub> linear layer weight), but varying the inner (sparsified) dimension, K (1024 $\times$ K $\times$ 4096). The study was conducted using different sparsity levels, specified by different N:M combinations (from 2:10 to 2:100), while the vector size V was kept constant at 128. Furthermore, to measure the effect of using the *column-loc* mechanism, we tested the performance with and without this structure. In the latter case, we used fixed indexes to simulate an ideal situation with no memory accesses. These experiments were performed on an NVIDIA RTX 3090 GPU, equipped with SPTCs. The results show that Spatha achieves speedups for sparse computation that approach the theoretical peak for a given sparsity level, considering the operation count reduction w.r.t. the dense counterpart. This effect becomes more pronounced as the GEMM problem size increases, as larger problems tend to have higher arithmetic intensity.




Figure 9: Ablation study of column-loc with different sizes of the inner K dimension and different V:N:M formats (BERT<sub>large</sub>)



Figure 10: Scaling study of wide shared memory stores for different V:N:M configurations

For instance, at a sparsity level of 80% (2:10 format), the speedup is approximately 4.5×, where 5× is the ideal scenario. Then, the speedups reported are 8.5×, 17.5×, and 37× for sparsity levels of 90% (2:20), 95% (2:40) and 98% (2:100), whose theoretical caps are 10×, 20× and 50×, respectively. It can be observed that, for every sparsity ratio, the *column-loc* structure's overhead has a negligible effect on the overall time, despite being a software approach to support arbitrary N:M ratios. However, the impact of *column-loc* becomes slightly more noticeable when dealing with 2:100 sparsity, which is not practical for DL applications in real-world scenarios.
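The theoretical caps quoted above follow directly from the operation count: an N:M pattern keeps N out of every M weights, so the sparsity is $1 - N/M$ and the ideal speedup over dense is $M/N$. The short Python sketch below reproduces these caps and relates them to the speedups reported in the text; the helper names are hypothetical.

```python
def sparsity(n, m):
    """Fraction of weights removed by an N:M pattern."""
    return 1 - n / m


def ideal_speedup(n, m):
    """Upper bound on speedup from the reduction in arithmetic operations."""
    return m / n


# Measured speedups as reported in the text for V = 128.
reported = {(2, 10): 4.5, (2, 20): 8.5, (2, 40): 17.5, (2, 100): 37.0}

for (n, m), measured in reported.items():
    cap = ideal_speedup(n, m)
    print(f"{n}:{m}  sparsity={sparsity(n, m):.0%}  cap={cap:.0f}x  "
          f"measured={measured}x  fraction_of_cap={measured / cap:.2f}")
```

The fraction of the cap that is reached stays high up to 2:40 and drops for 2:100, consistent with the observation that the *column-loc* overhead becomes more noticeable at extreme sparsity.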

Scaling study - Impact of V and output layout format. The V variable in our V:N:M format can be used to define trade-offs between performance and accuracy, in the same way as the block size in block-wise pruning, for example. To study this, we performed a second ablation study on one matrix from $BERT_{large}$ (size $1024 \times 4096 \times 4096$). Figure 10 shows the performance results of Spatha on this matrix using three different vector lengths: 32, 64 and 128. This test is conducted for different sparsity levels; in practice, it explores different configurations of the V:N:M values. Furthermore, to study the impact of the previously proposed layout for writing back results (Figure 8), we compare the effect of using this layout, which enables 128-bit SMEM stores, against using 32-bit ones. As we can see in Figure 10, the difference in speedup between the three selected vector lengths is noticeable, with the choice of V being conditioned by the accuracy loss. The effect of using 128-bit stores instead of 32-bit ones is noticeable for this problem size, bringing up to a $2\times$ difference in the final speedup. We performed a similar ablation test for a matrix of a GPT-3 model (size $36864 \times 12288 \times 4096$) and the effect of using 128-bit stores was attenuated, as the weight of the output phase in the total execution time is smaller.

Figure 11: Energy evaluation study on the V:N:M format

### 5 ENERGY EVALUATION OF V:N:M

DL pruning techniques aim to achieve the highest possible sparsity levels in the pruned models while ensuring little to no loss in accuracy. This becomes especially challenging when the target sparse format requires a specific pruning scheme, and when high sparsity levels are targeted. In these scenarios, the percentage of non-zero values is low, and their location is heavily influenced by the format. Therefore, it is crucial to demonstrate the effectiveness of new sparse formats, to ensure their applicability with minimal or no impact on accuracy.

The energy evaluation metric is employed to measure the flexibility of a format by comparing the total magnitude of the model (sum of the individual weights) before and after pruning to a specific format. Let us assume a well-optimized dense model  $w^* \in \mathbb{R}^d$ , where *d* is the total number of weights. We wish to prune  $w^*$  to a target sparsity  $s \in (0, 1]$  by zeroing out  $s \times d$  weights. The result is a sparse model  $w \in \mathbb{R}^{s \times d}$ . The energy metric is defined as follows:

$$energy = \frac{\sum_{i=0}^{s \times d} |w_i|}{\sum_{i=0}^{d} |w_i^*|}$$

This metric yields a normalized score between 0 and 1; higher is better.
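As a small worked illustration of the metric, the Python sketch below computes the energy retained by unstructured magnitude pruning on a toy weight vector. The function names and the toy weights are made up for the example, and the pruned model is kept at full length with zeros in the removed positions, which gives the same ratio because zeros contribute nothing to the numerator.

```python
def energy(dense, pruned):
    """Ratio of total weight magnitude after pruning to the magnitude before."""
    total = sum(abs(x) for x in dense)
    if total == 0:
        raise ValueError("dense model has zero total magnitude")
    return sum(abs(x) for x in pruned) / total


def magnitude_prune(weights, s):
    """Unstructured pruning: zero out the round(s * d) smallest-magnitude weights."""
    d = len(weights)
    n_drop = round(s * d)
    order = sorted(range(d), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_drop])
    return [0 if i in dropped else w for i, w in enumerate(weights)]


w_star = [0.9, -0.1, 0.4, -0.6, 0.05, 0.2, -0.3, 0.8]  # toy dense model
w = magnitude_prune(w_star, s=0.5)                      # 50% sparsity
score = energy(w_star, w)
```

A format-constrained policy such as V:N:M can be scored in the same way by replacing `magnitude_prune` with the constrained selection; since unstructured magnitude pruning keeps the largest weights without restrictions, it gives the upper bound for any given sparsity.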

Figure 11 presents the energy evaluation study for a weight tensor extracted from an encoder layer of  $BERT_{base}$ . This figure compares three weight selection policies: unstructured (ideal), V:N:M with different V values, and vector-wise pruning with several vector lengths l ( $vw_l$ ). The evaluation is done for different sparsity levels, whose value in the V:N:M format is controlled by the N:M ratio.

Unstructured pruning represents the ideal non-zero selection policy, as it does not impose any restrictions on the location of nonzero values. Vector-wise pruning can accelerate sparse routines on GPUs. However, if the vector length is greater than 8, it can significantly reduce the accuracy [4, 5, 25]. The results demonstrate that the V:N:M format occupies an intermediate position between unstructured and vector-wise pruning. Moreover, it is highly robust to changes in the vector length, allowing the usage of $V = 128$ while consistently preserving more energy than $vw_8$ and $vw_4$.

In addition to the previous conclusions, and independently of the selected pruning method, we can also see the large impact that magnitude-based weight selection policies have on the energy. At 50% sparsity, unstructured pruning has already lost 20% of the original dense matrix energy. At the other extreme, at 95% sparsity, only 20% of the original energy remains in the pruned matrix. Thus, we can conclude that, in order to achieve moderate to high sparsity ratios in models with the dimensionality of BERT, more sophisticated pruning methods must be used. Second-order pruning offers an alternative that addresses these problems.

### 6 SECOND-ORDER PRUNING

Magnitude-based pruning techniques provide a straightforward approach to reducing the size of our models without requiring model evaluation for weight selection. However, while magnitude pruning can be effective at moderate sparsity levels, it becomes more challenging to select the "least significant" weights to remove when aiming for high sparsity ratios, and this can significantly impact network accuracy.

In contrast, second-order pruning methods offer a more sophisticated approach to selecting weight candidates for removal, by considering the difference in loss relative to the current model. Hence, they aim to find the set of weights whose removal generates the minimum loss increase. In this context, the Hessian matrix is a key component of second-order pruning methods. It is the matrix of second-order derivatives of the loss function w.r.t. the weights, mathematically expressed as  $H = \nabla_{w}^{2}L$ , for a twice-differentiable loss L. The Fisher matrix is very similar to the Hessian matrix but in the probabilistic setting, where it is used to estimate the curvature of the loss function around the current weights. As a result, this approximation makes it possible to identify the weights that have the least impact on the loss function and are therefore candidates for pruning [15].

### 6.1 The V:N:M format in 2nd order methods

This section introduces a new second-order pruning method based on [21] and tailored for the V:N:M format. This type of approach yields state-of-the-art results in LLMs for unstructured and semistructured (block) compression.

Let us assume we have a well-optimized dense model  $w^* \in \mathbb{R}^d$ , where d is the total number of weights. Our target is to identify a set of weights Q that we can prune with a minimum loss increase. The following saliency score function is defined to rank groups of weights [21]:

$$\rho_Q = \frac{1}{2} (E_Q w^*)^T (E_Q \widehat{F}^{-1} (w^*) E_Q^T)^{-1} E_Q w^*$$

where:

- $\widehat{F}^{-1}(w) \in \mathbb{R}^{d \times d}$ is the inverse of the Fisher matrix.
- $E_Q \in \mathbb{R}^{|Q| \times d}$ is a matrix composed of the corresponding canonical basis vectors for a set of Q weights.
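The saliency score can be evaluated directly once the inverse Fisher matrix is available. The sketch below is illustrative only: `saliency` is a hypothetical helper, `f_inv` stands for $\widehat{F}^{-1}(w^*)$ and is set to the identity as a toy assumption, and $E_Q$ is applied implicitly by indexing the entries that belong to Q.

```python
import numpy as np

def saliency(w, f_inv, q):
    """rho_Q = 1/2 (E_Q w)^T (E_Q F^-1 E_Q^T)^-1 (E_Q w) for index set q."""
    q = np.asarray(q)
    w_q = w[q]                       # E_Q w: the weights selected by Q
    f_inv_qq = f_inv[np.ix_(q, q)]   # E_Q F^-1 E_Q^T: the matching sub-block
    # Solve the small |Q| x |Q| system instead of forming an explicit inverse.
    return 0.5 * float(w_q @ np.linalg.solve(f_inv_qq, w_q))

w = np.array([0.5, -1.0, 2.0, 0.1])
f_inv = np.eye(4)                    # toy inverse Fisher (assumption)

# With an identity inverse Fisher the score reduces to half the squared norm.
rho_pair = saliency(w, f_inv, [0, 3])   # 0.5 * (0.25 + 0.01) = 0.13
```

Only a $|Q| \times |Q|$ system is solved per candidate set, so the cost of one score depends on the group size rather than on d.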

Thus, the set of canonical basis vectors  $E_Q$  depends on the specific sparse format we are using. For instance, in 2:4 sparsity, the canonical vectors are:

$$E_Q = [[1,1,0,0], [1,0,1,0], [1,0,0,1], [0,1,1,0], [0,1,0,1], [0,0,1,1]]$$

As observed,  $E_Q$  encompasses all possible correlations between 2 weights, in a set of 4 elements. In general, for an N:M format, this approach requires evaluating  $\binom{M}{N}$  combinations to determine the best one, which can turn into an intractable combinatorial problem. Furthermore, in the V:N:M format, the addition of a new dimension V amplifies the complexity as it requires finding the optimal set of  $V \times N$  weights, leading to a combinatorial explosion.
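The growth in the number of candidates is easy to check numerically. The following sketch enumerates the candidate patterns for an N:M group and counts them for a few ratios; `nm_candidates` is a hypothetical helper, and the ratios other than 2:4 are chosen only for illustration.

```python
from itertools import combinations
from math import comb

def nm_candidates(n, m):
    """All 0/1 indicator vectors with exactly n ones among m positions."""
    return [[1 if i in chosen else 0 for i in range(m)]
            for chosen in combinations(range(m), n)]

# Reproduces the six 2:4 vectors listed above, in the same order.
candidates_2_4 = nm_candidates(2, 4)

# Number of candidates per 1 x M group for several N:M ratios.
counts = {(n, m): comb(m, n) for n, m in [(2, 4), (2, 8), (2, 16), (8, 16)]}
```

For a fixed small N the count grows polynomially in M, but for N close to M/2 it grows much faster (12870 candidates for 8:16), which is the regime where an exhaustive search per group becomes impractical.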

To address these challenges, we adopt an approach similar to that of [21] between sets of Q elements, which involves disregarding correlations between rows within  $V \times M$  blocks. This simplification reduces the number of combinations to evaluate. Additionally, to mitigate combinatorial issues that may still arise within  $1 \times M$  groups, we propose a pair-wise approach where correlations are calculated between pairs of elements, that is:

$$E_Q = [[1, 0], [0, 1], [1, 1]]$$

Depending on the N and M values, we can modulate the complexity of the problem to be solved by dynamically selecting the m-combinatorial or the pair-wise approach.

6.1.1 Gradual pruning definition. The N:M format prunes a model to a target sparsity  $s \in (0, 1]$ . Typically, the  $s \times d$  weights are removed in one step (one-shot pruning). For 50% (2:4) sparsity, this approach can be applied in most cases and the models still recover the original accuracy. However, for higher sparsity ratios, one-shot pruning severely reduces the model performance and makes it hard to recover the original accuracy with additional fine-tuning steps. This negative effect on accuracy also appears in second-order methods, where one-shot pruning can result in worse Taylor approximations of the loss function. We propose a structure decay scheduler for the V:N:M format, which performs N:M pruning across  $\beta$  steps with increasing sparsity levels. This scheduler starts with a high initial value  $N_0 \gg N_\beta$  (lower sparsity), where  $N_{\beta}$  is our target N value, and gradually decreases N (thereby increasing sparsity) until it reaches the target N value. This gradual pruning approach mitigates the adverse effects on network accuracy and improves the recovery of the accuracy in subsequent fine-tuning.
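A structure decay schedule of this kind can be sketched in a few lines. The text does not specify how N decreases between $N_0$ and $N_\beta$, so the linear interpolation below is an assumption, and `structure_decay_schedule` and its parameters are hypothetical names rather than part of Spatha.

```python
def structure_decay_schedule(n_start, n_target, m, steps):
    """Return (N, sparsity) pairs that decay N from n_start to n_target."""
    if steps < 2 or n_start < n_target:
        raise ValueError("need steps >= 2 and n_start >= n_target")
    schedule = []
    for step in range(steps):
        # Linear interpolation rounded to an integer N (assumed decay rule).
        frac = step / (steps - 1)
        n = round(n_start + frac * (n_target - n_start))
        schedule.append((n, 1.0 - n / m))
    return schedule

# Example: reach 2:16 (87.5% sparsity) in four steps, starting from 8:16.
schedule = structure_decay_schedule(n_start=8, n_target=2, m=16, steps=4)
```

In this example the model would be pruned to 8:16, 6:16, 4:16 and finally 2:16, with fine-tuning between steps.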

### 7 EVALUATION

We evaluate the performance on an NVIDIA RTX 3090 GPU of the Ampere architecture equipped with SPTCs. We compare the performance of Spatha with different sparse libraries (cuSparseLt, CLASP, Sputnik) and also with a dense counterpart version (cuBLAS). We build our benchmarks on matrices from real-world LLMs. In addition to these microbenchmarks, we conduct a case study on real-world applications, in which we demonstrate the proposed second-order pruning technique and benchmark the end-to-end performance of Spatha on different LLMs (BERT, GPT-2, and GPT-3).

### 7.1 Comparison with existing dense and sparse libraries

Firstly, we evaluate our baseline implementation for 1:2:4 sparsity (50%). Since higher N:M ratios depend on this baseline's performance, it is crucial to obtain good speedups in this configuration. We selected cuBLAS GEMM as our dense counterpart, and for exploiting the 2:4 format on SPTCs, we used the cuSparseLt SpMM implementation, which is the reference library for this format. Our experiments involve varying sizes of a  $R \times K \times C$ GEMM problem, where R and C are predetermined values taken from two of BERT's linear-layer weights (768 and 4096 for BERT<sub>base</sub>, 1024 and 4096 for BERT<sub>large</sub>). The inner dimension K of the product, which is the sparsified one, is variable in these experiments. Note that the inner dimension is usually scaled up to enhance the network accuracy. For instance, GPT-3 uses a hidden size of 12288 [3]. Figure 12 reports the performance of the three contending implementations (cuBLAS, cuSparseLt and Spatha) and the speedups of the selected sparse libraries w.r.t. cuBLAS. The results show that the performance of the sparse implementations improves with the GEMM size, as larger GEMMs tend to have larger arithmetic intensity. In these microbenchmarks, BERT<sub>large</sub> matrices (right side) increase the arithmetic intensity w.r.t. BERT<sub>base</sub> (left). Notably, for larger GEMM sizes, the performance of cuSparseLt and Spatha is similar, while our implementation shows better performance on smaller sizes. This is an interesting feature, since it suggests that Spatha can cover a wider variety of network architectures. Overall, Spatha achieves up to 1.38× speedup over the vendor library for 2:4 sparsity, cuSparseLt.
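The link between GEMM size and arithmetic intensity can be illustrated with a back-of-the-envelope estimate. This sketch is an idealized model, not a measurement: it assumes a dense half-precision product in which each operand and the output cross the memory interface exactly once, and it ignores caching, tiling and sparse metadata. The function name is hypothetical.

```python
def gemm_arithmetic_intensity(r, k, c, bytes_per_element=2):
    """FLOPs per byte for an (r x k) by (k x c) product under ideal reuse."""
    flops = 2 * r * k * c                                   # one multiply and one add per term
    bytes_moved = bytes_per_element * (r * k + k * c + r * c)
    return flops / bytes_moved

# Intensity for R=1024, C=4096 and a growing inner dimension K.
intensities = {k: gemm_arithmetic_intensity(1024, k, 4096)
               for k in (1024, 4096, 12288)}
```

Under this model the intensity grows monotonically with K, which is consistent with the observation that larger problems make better use of the GPU.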





Figure 13 compares the performance of Spatha to other dense and sparse libraries for higher levels of sparsity. The benchmarks are built using sparse matrices from weight-pruned linear layers extracted from BERT, with sparsity levels ranging from 50% to 98%. In this context, the cuSparseLt SpMM implementation is the reference library to exploit the 2:4 format on SPTCs. Since there are no SpMM GPU implementations for arbitrary N:M sparsity levels, we have considered in the evaluation the following third-party libraries that support half-precision: Sputnik [11], and CLASP [4], which extends vectorSparse [5] to the latest generations of NVIDIA GPU architectures. While [11] has been designed for non-structured sparse matrices, [4] is focused on semi-structured sparse input matrices following the column-vector sparse format, which supports vector lengths l = 2, 4 and 8. These configurations are referenced in the columns of Figure 13 with the notation  $vw_l$ .

The first row of Figure 13 shows the speedup results on sparse matrices extracted from  $BERT_{base}$  while the second one reports that performance on BERT<sub>large</sub>. The y-axis is represented in a logarithmic scale to make the results more readable. First of all, existing implementations for sparse computation are usually able to outperform the dense counterpart version (e.g., cuBLAS) at sparsity levels above 80%. However, the speedup they can achieve is usually up to  $\sim 3 \times$ . Furthermore, these implementations are usually designed considering as a reference sparse matrices extracted from small models (e.g., ResNet) where the left operand can be a tiny matrix (e.g.,  $64 \times 64$ ) [12]. That influences the SpMM design, since they can afford to load the data directly into registers, for example. But when we evaluate these implementations on medium or big matrices extracted from larger models (e.g., LLMs), the performance is even worse, and they only outperform cuBLAS at sparsity levels above 90%.

The fact that Spatha reaches  $2 \times$  speedup at 50% sparsity enables the achievement of high speedups as the sparsity increases, yielding up to  $37 \times$  in BERT-like matrices. We can also observe that the best performance of our implementation is reached as the arithmetic intensity increases, peaking for BERT<sub>large</sub> with batch size 16.
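For reference, the N:M patterns listed in the caption of Figure 13 map to sparsity levels as $1 - N/M$. The short sketch below tabulates this mapping together with the factor M/N by which the number of stored non-zeros, and hence the arithmetic work, shrinks. That factor is only a reduction in work under the assumption that zeros are skipped entirely; it is not a predicted speedup.

```python
from fractions import Fraction

def nm_sparsity(n, m):
    """Fraction of pruned weights in an N:M pattern."""
    return 1 - Fraction(n, m)

patterns = [(2, 4), (2, 7), (2, 8), (2, 10), (2, 20), (2, 40), (2, 100)]

# (pattern, sparsity, reduction in non-zeros relative to dense)
table = [(f"{n}:{m}", float(nm_sparsity(n, m)), m / n) for n, m in patterns]
```

The patterns span 50% (2:4) to 98% (2:100) sparsity, with 2:7 sitting at roughly 71%.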

### 7.2 Case study: sparse LLMs

LLMs have revolutionized the NLP field with their unrivaled performance in various domains. Nowadays, these models are widely used in everyday technologies, such as ChatGPT. Transformer LLMs typically consist of multiple transformer layers with self-attention.

There are two major sub-components inside a transformer architecture: the multi-head attention (MHA), and the fully connected feed-forward network (FFN). At a higher level, the model size is determined by different configurable components, such as the head dimension, the number of heads and the number of layers, depending on the specific architecture used.
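As a rough illustration of how these components determine the model size, the sketch below counts the weights in the linear layers of an encoder stack. It is a simplification that ignores biases, layer normalization and embeddings. The BERT<sub>base</sub> values used (12 heads of dimension 64, giving a hidden size of 768, an FFN inner size of 3072, and 12 layers) are the standard published configuration, and the function name is hypothetical.

```python
def encoder_linear_params(hidden, ffn_inner, layers):
    """Weights in the linear layers of a transformer encoder stack."""
    mha = 4 * hidden * hidden      # query, key, value and output projections
    ffn = 2 * hidden * ffn_inner   # the two feed-forward projections
    return layers * (mha + ffn)

heads, head_dim = 12, 64
bert_base = encoder_linear_params(hidden=heads * head_dim,
                                  ffn_inner=3072, layers=12)
```

The result is about 85M weights, which are the tensors that weight pruning targets.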

This case study focuses on weight pruning and explores the computational speedups achievable with Spatha. In LLMs, weight tensors are present in linear layers, which can be found in both the MHA and the FFN sub-components. Figure 14 illustrates a pruned MHA where four GEMM instructions are converted to SpMMs by sparsifying the corresponding weight tensors. In this study we demonstrate the efficiency of Spatha on different LLMs. However, it is important to note that without an efficient implementation of the SpMM instruction, the final performance of the pruned model can significantly decrease compared to the dense counterpart version.

Figure 13: Speedup results on BERT<sub>base</sub> and BERT<sub>large</sub> with sequence length=512. The notation V:N:M represents the vector length V used on Spatha, while  $vw_l$  represents the vector length l used on CLASP. The N:M patterns corresponding to the considered sparsity levels are, in ascending order of sparsity: 2:4, 2:7, 2:8, 2:10, 2:20, 2:40 and 2:100.

Figure 14: Simplified view of a pruned MHA

7.2.1 Second-order pruning at LLMs scale. We used our 2nd order pruning approach following the V:N:M format to demonstrate its applicability to the size of LLMs. Specifically, we focused on BERT<sub>base</sub>, one of the most commonly used LLMs, which comprises 12 transformer layers with 110M parameters. As per community standards [21], we pruned the encoder's weights of the model (85M). We evaluate the performance on the SQuAD v1.1 task, which is a widely-used benchmark to measure model compression. Table 2 shows the F1 score metric for different pruning techniques including: traditional N:M format (1:N:M), V:N:M format with V size of 64 and 128, and vector-wise pruning with dense vertical vectors of size 8 ($vw_8$).

LLMs have been shown to be susceptible to minor model perturbations that can cause model collapse [19]. However, in these experiments we considered 75% and 87.5% sparsity levels, represented by 2:8 and 2:16 ratios, respectively, to demonstrate that our pruning approach produces robust results on this kind of network.

| Sparsity     | 1:N:M | 64:N:M | 128:N:M | vw_8  |
|--------------|-------|--------|---------|-------|
| 75% (2:8)    | 88.61 | 88.47  | 87.94   | 88.55 |
| 87.5% (2:16) | 87.73 | 86.50  | 85.01   | 86.90 |

Table 2: F1 score of BERT<sub>base</sub> on SQuAD v1.1. Dense model F1=88.43

As we can see, 1:N:M, 64:N:M and $vw_8$ slightly improve the original model accuracy at 2:8 sparsity, while the 128:N:M format presents an accuracy loss of about 0.5%. For 2:16 sparsity, the four methods suffer a slight accuracy loss. Specifically, the plain 2:16 format is able to recover 99% of the original accuracy, while 64:2:16 and $vw_8$ pruning recover 98%. In these terms, the 128:2:16 approach is slightly more restrictive, but is still able to recover 96% of the original accuracy.

7.2.2 Integration with PyTorch. In order to perform the end-to-end evaluations, we have streamlined the adoption of Spatha into the PyTorch training pipeline by integrating it with the STen library [17]. This integration allows for easy addition of sparsity to existing models such as BERT and GPT with just a few lines of code. Users can specify a list of weights to be made sparse in their custom models, making the process straightforward. To facilitate this, we have defined a VNMSparsifier class that performs pruning while adhering to the V:N:M format constraints. Additionally, we have introduced a VNMTensor class that serves as a container for tensors in the V:N:M format. When using SpMM with VNMTensor, STen automatically dispatches it to the efficient implementation in Spatha. A pseudocode example of this integration is shown in Listing 1.

```
import torch
import sten
import spatha

# Tell STen how to turn a dense torch.Tensor into a VNMTensor
# when a VNMSparsifier is applied to it.
@sten.register_sparsifier_implementation(
    sparsifier=spatha.VNMSparsifier,
    inp=torch.Tensor, out=spatha.VNMTensor)
def torch_tensor_to_vnm(sparsifier, tensor, grad_fmt):
    return sten.SparseTensorWrapper.wrapped_from_dense(
        spatha.vnm_sparsifier(
            sparsifier.n, sparsifier.m,
            sparsifier.v, tensor),
        tensor, grad_fmt)

# Drop-in replacement for a linear layer whose weight is already sparsified.
class Spmm(torch.nn.Module):
    def __init__(self, original: torch.nn.Linear):
        super().__init__()
        self.bias = original.bias
        w = original.weight.wrapped_tensor
        self.values = w.values
        self.columns = w.columns
        self.metadata = w.metadata

    def forward(self, input):
        # Remaining arguments are elided, as in the original listing.
        return spatha.spmm(self.values, self.columns,
            self.metadata, input, self.bias, ...)
```

Listing 1: Pseudocode example of using Spatha and the V:N:M sparsifier
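To make the constraints enforced by a V:N:M sparsifier more concrete, the following is a minimal NumPy sketch of a magnitude-based mask. It is not the implementation behind `spatha.VNMSparsifier`. It assumes the V:N:M layout in which each $V \times M$ block keeps four columns and each row then keeps N = 2 of those four values, and the function `vnm_mask` and its parameters are hypothetical.

```python
import numpy as np

def vnm_mask(w, v, m, n=2, kept_cols=4):
    """Boolean mask with n kept values per row inside each V x M block."""
    rows, cols = w.shape
    assert rows % v == 0 and cols % m == 0
    mag = np.abs(w)
    mask = np.zeros(w.shape, dtype=bool)
    for r0 in range(0, rows, v):
        for c0 in range(0, cols, m):
            block = mag[r0:r0 + v, c0:c0 + m]
            # Step 1: keep the columns with the largest total magnitude.
            cols_kept = np.sort(np.argsort(block.sum(axis=0))[-kept_cols:])
            sub = block[:, cols_kept]
            # Step 2: in each row, keep the n largest of the kept columns.
            top = np.argsort(sub, axis=1)[:, -n:]
            row_idx = np.repeat(np.arange(v), n)
            mask[r0 + row_idx, c0 + cols_kept[top.ravel()]] = True
    return mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 32))   # toy weight matrix
mask = vnm_mask(w, v=64, m=8)        # 64:2:8 pattern, 75% sparsity
```

Every row keeps exactly two values per group of eight, and all rows of a block draw them from the same four columns, which is what allows the kept columns to be packed into a 2:4 layout.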

7.2.3 Sparse Inference. We benchmark the end-to-end performance of Spatha on the inference task for different real-world LLMs: BERT (336M), GPT2-large (774M), and GPT-3 (175B), obtained from HuggingFace. Since GPT-3 is not a publicly available trained model, we have created a model with the same configuration as this LLM. The target of this experiment is measuring time performance, so we initialized the weights of the GPT-3 model with random values. The time results on BERT and GPT2-large have been obtained over the inference of the entire model, while the results of GPT-3 were obtained by measuring the inference time of a single encoder so that it fits on a single GPU.

Figure 15 shows the end-to-end evaluation results on the inference of these models. As we have seen in the previous microbenchmark experiments, increasing the arithmetic intensity of the MMMs improves the utilization of the GPU resources, and also the final performance of the SpMM. We configured the three models with the largest configuration possible before running into out-of-memory issues. In the case of BERT<sub>large</sub>, this implied the selection of a batch size (bs) of 32. For GPT2-large, the bs is 8, and in the case of GPT-3, it is 1. However, bs only affects the C dimension of the GEMM problem  $(R \times K \times C)$ , while the other two, *R* and *K*, depend on the model characteristics. Regarding these sizes, BERT has smaller weight tensors (the ones sparsified) than GPT2-large, while GPT-3 is formed by weight tensors much larger than those of the other two models. For these reasons, the best performance is obtained in the case of GPT-3, where the GEMM computation contributes to around 80% of the total execution time.

Figure 15: Latency of LLMs inference using Spatha

In the case of BERT, tensor contraction time is improved by up to 9.95×, while the end-to-end latency of the whole model is improved by up to 72%. For GPT2-large, the GEMM time is improved by 10.84×, since some weight tensors are slightly bigger, but GEMMs account for only around 50% of the total time, so the overall improvement is limited by this factor. However, when we move to GPT-3, the tensor contraction time is improved by up to  $11\times$  and the GEMM time represents a much higher percentage, yielding a reduction of up to  $3.20\times$  in the total execution time of a GPT-3 encoder.
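The way the GEMM share of the runtime bounds the end-to-end gain can be illustrated with Amdahl's law. The sketch below is an idealized estimate using the approximate GEMM fractions quoted in the text (around 50% for GPT2-large and around 80% for GPT-3). It ignores every other overhead, so measured speedups need not match it, and the function name is hypothetical.

```python
def end_to_end_speedup(gemm_fraction, gemm_speedup):
    """Amdahl's law: only the GEMM share of the runtime is accelerated."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

# GPT2-large: about half of the time is GEMM, accelerated by 10.84x.
gpt2_estimate = end_to_end_speedup(0.50, 10.84)

# GPT-3 encoder: about 80% of the time is GEMM, accelerated by 11x.
gpt3_estimate = end_to_end_speedup(0.80, 11.0)
```

Even with an infinitely fast GEMM, a 50% GEMM share caps the end-to-end speedup at 2×, whereas an 80% share raises that cap to 5×. This is why the larger model benefits the most.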

### 8 RELATED WORK

Semi-structured pruning techniques are a hot research topic. The column-vector-sparse-encoding [5] seeks to accelerate sparse kernels, and it achieves a speedup between 1.71× and 7.19× over cuSPARSE without exploiting SPTCs, and limited to the Volta architecture. The same authors target the SPTCs in [6] proposing DFSS, a dynamic N:M sparse attention mechanism and a tailored implementation of the sparse kernels, but limited to the 2:4 format. The unaligned group-level pruning proposed in [24] increases the accuracy of this kind of semi-structured pruning techniques by providing additional flexibility.

NVIDIA cuSparse [29] is a library that implements several linear algebra routines for sparse matrices stored in different compressed formats (COO, CSR and Blocked-Ellpack). It was originally created to target scientific applications. The cuSparseLt [28] library from NVIDIA adds support for the exploitation of Sparse Tensor Cores (SPTCs) following the N:M format, supporting the 1:2 and 2:4 sparsity patterns (50% sparsity).

The Sputnik [11] library has been specifically designed for DL workloads. It uses only the CSR compressed format, and it focuses on gaining flexibility in the scheduling of workloads by defining a one-dimensional tiling scheme. This library evolved into Vector-Sparse [5], adding support for the exploitation of Tensor-Core Units. It is based on semi-structured 1D pruning and a special compressed format called Column-Vector Sparse Encoding. As a continuation, *CLASP* [4] offers an SpMM implementation which extends the support of Vector-Sparse to the Ampere architecture.

In the same line, *Magicube* [25] is an implementation of the SpMM and SDDMM routines for quantized sparse matrices. The kernels are complemented with an efficient online method to transpose the dense matrix.

### 9 DISCUSSION

a) *Spatha application to other tasks.* The integration of the Spatha library into STen, and the implementation of a specific 2nd order pruning technique to exploit the V:N:M format, enables distributed sparse training as a direct application of the previously mentioned contributions. Furthermore, notice that the Spatha library is a tool to perform general Sparse Matrix-Matrix Multiplications, so it can be extended to domains other than DL.

b) *Distributed deep learning systems*. In this work, we have focused on large-scale models based on LLMs. However, the Spatha library represents a generic tool for sparse MMMs. To achieve efficient large-scale DL on distributed systems, data, operator, and pipeline parallelism are often combined. In this context, Spatha can serve as a third-party implementation to accelerate the execution of these operators in the backend, and mitigate the computation bottleneck on these systems.

### **10 CONCLUSION**

This paper opens the possibility of using Sparse Tensor Cores (SPTCs) for arbitrary sparsity levels and N:M patterns. In order to do so, we defined a new sparse format (V:N:M), a new library to efficiently exploit the proposed kernel (Spatha), and a second-order pruning technique that demonstrated the applicability of the proposed format on real-world deep learning models. The experiments show that this three-fold approach yields up to a 37× speedup over cuBLAS at the kernel level. Furthermore, the proposed pruning technique offers a solution scalable to the dimensionality of LLMs, and is able to achieve high sparsity ratios with minimum impact on accuracy (~ 2% at 2:16 sparsity on BERT models). Finally, we demonstrate the end-to-end performance with sparsity, achieving speedups on a GPT-3 encoder of up to  $3.20 \times$  at 2:32 sparsity, which corresponds to a tensor contraction improvement of up to  $11 \times$ .

### ACKNOWLEDGMENTS

This research was supported by grants PID2019-104184RB-I00 and PID2022-136435NB-I00, funded by MCIN/AEI/ 10.13039/501100011033, PID2022 also funded by "ERDF A way of making Europe", EU; the Ministry of Education (predoctoral grant of Roberto L. Castro, FPU19/03974), by Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (ED431C 2021/30), and ERC grant PSAP, no. 101002047. We also acknowledge the support from CITIC, funded by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ED431G 2019/01). Finally, we thank the Swiss National Supercomputing Center (CSCS) and the Centro de Supercomputación de Galicia (CESGA) for the use of their computers.

### REFERENCES

- [1] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners.
- [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper\_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
- [4] Roberto L. Castro, Diego Andrade, and Basilio B. Fraguela. 2023. Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM Routine on Ampere GPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (Chicago, Illinois) (PACT '22). Association for Computing Machinery, New York, NY, USA, 135–147. https://doi.org/10.1145/3559009.3569691
- [5] Zhaodong Chen, Zheng Qu, Liu Liu, Yufei Ding, and Yuan Xie. 2021. Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 78, 14 pages. https://doi.org/10.1145/3458817.3476182
- [6] Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, and Yuan Xie. 2023. Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP '23). Association for Computing Machinery, New York, NY, USA, 369–379. https://doi.org/10.1145/3572848.3577500
- [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
- [8] Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774 [cs.LG]
- [9] Elias Frantar, Eldar Kurtic, and Dan Alistarh. 2021. M-FAC: Efficient Matrix-Free Approximations of Second-Order Information. arXiv:2107.03356 [cs.LG]
- [10] Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
- [11] Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. Sparse GPU Kernels for Deep Learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14. https://doi.org/10.1109/SC41405.2020.00021
- [12] Google Research. 2020. Deep Learning Matrix Collection. Retrieved March 26, 2023 from https://github.com/google-research/google-research/tree/master/sgk
- [13] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs.CV]
- [14] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409 [cs.LG]
- [15] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. 2021. Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks. J. Mach. Learn. Res. 22, 1, Article 241 (jan 2021), 124 pages.
- [16] Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2021. Data Movement Is All You Need: A Case Study on Optimizing Transformers. In Proceedings of Machine Learning and Systems 3 (MLSys 2021).
- [17] Andrei Ivanov, Nikoli Dryden, and Torsten Hoefler. 2022. STen: An Interface for Efficient Sparsity in PyTorch. https://github.com/spcl/sten. In Sparsity in Neural Networks workshop.
- [18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]
- [19] Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky. 2021. BERT Busters: Outlier Dimensions that Disrupt Transformers. arXiv:2105.06990 [cs.CL]
- [20] Eldar Kurtic and Dan Alistarh. 2022. GMP\*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods. arXiv:2210.06384 [cs.CL]
- [21] Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. 2022. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models. https://doi.org/10.48550/ARXIV.2203.07259

- [22] François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. 2021. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838 (2021).
- [23] Yann LeCun, John S. Denker, and Sara A. Solla. 1989. Optimal Brain Damage. In NIPS.
- [24] Kwangbae Lee, Hoseung Kim, Hayun Lee, and Dongkun Shin. 2020. Flexible group-level pruning of deep neural networks for on-device machine learning. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 79–84.
- [25] Shigang Li, Kazuki Osawa, and Torsten Hoefler. 2022. Efficient Quantized Sparse Matrix Operations on Tensor Cores. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Dallas, Texas) (SC '22). IEEE Press, Article 37, 15 pages.
- [26] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One?. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper\_files/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf
- [27] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021. Accelerating Sparse Deep Neural Networks. arXiv:2104.08378 [cs.LG]
- [28] NVIDIA. 2020. Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt. https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/
- [29] NVIDIA. 2023. The cuSparse Library. https://docs.nvidia.com/cuda/cusparse/index.html. Accessed: 2023-04-03.
- [30] Ali Pinar and Michael T. Heath. 1999. Improving Performance of Sparse Matrix-Vector Multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (Portland, Oregon, USA) (SC '99). Association for Computing Machinery, New York, NY, USA, 30–es. https://doi.org/10.1145/331532.331562
- [31] Victor Sanh, Thomas Wolf, and Alexander Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems 33 (2020), 20378-20389.

- [32] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3645–3650. https://doi.org/10.18653/v1/P19-1355
- [33] Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2023. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. *IEEE Transactions on Parallel and Distributed Systems* 34, 1 (jan 2023), 246–261. https://doi.org/10.1109/tpds.2022.3217824
- [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper\_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
- [35] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv:1905.09418 [cs.CL]
- [36] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured pruning of large language models. arXiv preprint arXiv:1910.04732 (2019).
- [37] Da Yan, Wei Wang, and Xiaowen Chu. 2020. Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 634–643. https://doi.org/10.1109/IPDPS47924.2020.00071
- [38] Qingru Zhang, Simiao Zuo, Chen Liang, Alexander Bukharin, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2022. PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 26809–26823. https://proceedings.mlr.press/v162/zhang22ao.html