Software performance engineering (SPE) is the science and art of making code run fast or otherwise limiting its consumption of resources such as energy, memory footprint, network utilization, and response time. Performance engineering encompasses parallel computing, but it also includes other techniques, such as caching, vectorization, algorithm design, bit tricks, loop unrolling, compiler-switch selection, tailoring code to the architecture, exploiting sparsity, changing data representations, and metaprogramming. I will explain why the end of Moore's Law makes software performance engineering a critical technical skill for the future. I will also argue that the PPoPP community is ideally positioned to show leadership in SPE and that it would be wise to change the meaning of its acronym to "Principles and Practice of Performance Programming."
Online GNN inference is widely used in applications such as online recommendation and financial fraud detection, where even minor delays can result in significant financial impact. Real-time dynamic graph sampling allows online GNN inference to reflect the latest updates to real-world graphs. However, online GNN inference typically demands millisecond-level latency Service Level Objectives (SLOs) as its performance guarantees, which poses great challenges for existing dynamic graph sampling approaches based on graph databases. The issues mainly arise from two aspects: long tail latency due to imbalanced, data-dependent sampling, and large communication overhead incurred by distributed sampling. To address these issues, we propose Helios, an efficient distributed dynamic graph sampling service that meets stringent latency SLOs. The key ideas of Helios are 1) pre-sampling the dynamic graph in an event-driven manner, and 2) maintaining a query-aware sample cache to build complete K-hop sampling results locally for inference requests. Experiments on multiple datasets show that Helios achieves up to 67× higher serving throughput and up to 32× lower P99 query latency than the baselines.
Recent GPUs have introduced Sparse Tensor Cores (SPTC) to accelerate computations on sparse matrices that meet the N:M sparse patterns. Software tools extend this support to more general V:N:M patterns. Graphs in Graph Neural Networks (GNNs) are typically sparse, but the sparsity is often irregular and does not conform to the required V:N:M sparse patterns. This paper proposes a novel graph reordering algorithm that transforms irregular graph data into the required sparse patterns so that GNNs can benefit from SPTC. The optimization is lossless, maintaining the accuracy of GNNs. At the same time, it preserves the symmetry of the graphs' adjacency matrices so that the same matrices remain compatible with many symmetry-based graph algorithms. The optimization removes 98-100% of the violations of the N:M sparse patterns at the vector level and increases the portion of conforming graphs in the SuiteSparse collection from 5-9% to 88.7-93.5%. On A100 GPUs, the optimization accelerates Sparse Matrix-Matrix Multiplication (SpMM) by up to 43X (a geomean speedup of 2.3X - 7.5X) over cuSPARSE and speeds up the key graph operations in GNNs on real graphs by as much as 8.6X (3.5X on average).
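For readers unfamiliar with N:M sparsity, the sketch below (ours, not the paper's code) checks whether a matrix row conforms to an N:M pattern, i.e., whether every group of M consecutive entries holds at most N nonzeros; N=2, M=4 is the pattern Ampere-class Sparse Tensor Cores accelerate.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Checks whether every group of M consecutive entries in a row holds at most
// N nonzeros -- the conformance condition Sparse Tensor Cores require
// (N=2, M=4 on Ampere-class GPUs). Illustrative only; the paper's reordering
// operates on whole adjacency matrices, not single rows.
bool conforms_to_nm(const std::vector<float>& row, std::size_t N, std::size_t M) {
    for (std::size_t g = 0; g < row.size(); g += M) {
        std::size_t nnz = 0;
        for (std::size_t i = g; i < std::min(g + M, row.size()); ++i)
            if (row[i] != 0.0f) ++nnz;
        if (nnz > N) return false;
    }
    return true;
}

int main() {
    std::vector<float> ok  = {1, 0, 0, 2,  0, 3, 4, 0};   // 2 nonzeros in each group of 4
    std::vector<float> bad = {1, 2, 3, 0,  0, 0, 0, 1};   // first group violates 2:4
    std::cout << conforms_to_nm(ok, 2, 4) << " " << conforms_to_nm(bad, 2, 4) << "\n";
}
```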
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that there is no consistent winner (i.e., the strategy with the shortest running time), and the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution time and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run different strategies. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://github.com/kaihaoma/APT.
The n-body problem involves calculating the effect of bodies on each other. n-body simulations are ubiquitous in physics and astronomy and notoriously computationally expensive. The naïve algorithm for n-body simulations has a prohibitive O(n²) time complexity. The tree-based Barnes-Hut algorithm reduces the time complexity to O(n log n) by approximating the effect of bodies beyond a certain threshold distance. Beyond algorithmic improvements, extensive research has gone into accelerating n-body simulations on GPUs and multi-core systems. However, Barnes-Hut is a tree-traversal algorithm, which makes it a poor target for acceleration using traditional GPU shader cores. In contrast, recent work shows that, for tree-based computations, GPU ray-tracing (RT) cores substantially outperform shader cores. In this work, we reformulate the Barnes-Hut algorithm as a ray-tracing problem and implement it with NVIDIA OptiX. Our evaluation shows that the resulting system, RT-BarnesHut, outperforms current state-of-the-art GPU-based implementations.
Amid conflicting demands for ever-improving performance and maximal energy savings, it is important to have a tool that automatically identifies opportunities to save power and energy at runtime without compromising performance. GPUs in particular present challenges due to (1) the reduced savings available from memory-bound applications, and (2) the limited availability of low-overhead performance counters. Thus, a successful tool must address these issues while still tackling the challenges of dynamic application characterization, versatility across processors from different vendors, and effectiveness at making the right power-performance trade-offs for the desired energy savings.
We propose Everest, a tool that automatically finds energy-saving opportunities on GPUs at runtime. Specifically, Everest finds two unique avenues for saving energy using DVFS on GPUs, in addition to the traditional method of lowering the core clock for memory-bound phases. Everest has very low overhead and works across different GPUs because it relies on the minimum possible set of performance events for the needed characterization. Everest operates at the finer granularity of individual application phases and uses built-in performance estimation to provide the desired performance guarantees, yielding an effective solution that outperforms existing solutions on the latest NVIDIA and AMD GPUs.
GPU-based fast Fourier transform (FFT) is extremely important for scientific computing and signal processing. However, we find that existing FFT libraries are inefficient and lack fault tolerance against soft errors. To address these issues, we introduce TurboFFT, a new FFT prototype co-designed for high performance and online fault tolerance. For FFT, we propose an architecture-aware, padding-free, and template-based prototype to maximize hardware resource utilization, achieving competitive or superior performance compared to the state-of-the-art closed-source library, cuFFT. For fault tolerance, we 1) explore algorithm-based fault tolerance (ABFT) at the thread and threadblock levels to reduce the additional memory footprint, 2) address error propagation by introducing a two-side ABFT with location encoding, and 3) further modify the threadblock-level FFT from one transaction to multiple transactions in order to expose more parallelism for ABFT. Our two-side strategy enables online correction without additional global memory, while our multi-transaction design amortizes the expensive threadblock-level reduction in ABFT with zero additional operations. Experimental results on an NVIDIA A100 server GPU and a Tesla Turing T4 GPU demonstrate that TurboFFT without fault tolerance is comparable to or up to 300% faster than cuFFT and outperforms VkFFT. TurboFFT with fault tolerance maintains an overhead of 7% to 15%, even under tens of error injections per minute for both FP32 and FP64.
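As background for how ABFT applies to FFTs (our illustration of the general principle; the paper's two-side encoding with location information is more elaborate), checksum-style checks exploit the linearity of the transform:

```latex
% Linearity of the DFT, the basis of checksum-style ABFT for FFTs:
\[
  \mathrm{FFT}\!\Bigl(\sum_{i} w_i\, x_i\Bigr) \;=\; \sum_{i} w_i\, \mathrm{FFT}(x_i),
\]
% so transforming the weighted checksum vector once and comparing it against the
% weighted sum of the individual outputs detects a corrupted result in the batch;
% a second, differently weighted checksum can additionally encode the error
% location so that online correction becomes possible.
```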
We present Reciprocating Locks, a novel mutual exclusion locking algorithm, targeting cache-coherent (CC) shared memory, that enjoys a number of desirable properties. The doorway arrival phase and the Release operation both run in constant time. Waiting threads use local spinning, and only a single waiting element is required per thread, regardless of the number of locks a thread might hold at a given time. While our lock does not provide strict FIFO admission, it bounds bypass and has strong anti-starvation properties. The lock is compact and space efficient, and it has been intentionally designed to be readily usable in real-world general-purpose computing environments such as pthreads or C++. We show that the lock exhibits high throughput under contention and low latency in the uncontended case. Under sustained contention, Reciprocating Locks generate less coherence traffic than MCS and CLH locks. The performance of Reciprocating Locks is competitive with, and often better than, the best state-of-the-art scalable queue-based spin locks.
Many concurrent algorithms require processes to perform fetch-and-add operations on a single memory location, which can become a hot spot of contention. We present a novel algorithm called Aggregating Funnels that reduces this contention by spreading the fetch-and-add operations across multiple memory locations. It aggregates fetch-and-add operations into batches so that each batch can be applied by a single hardware fetch-and-add instruction on one location, and all operations in the batch can efficiently compute their results by performing a fetch-and-add instruction on a different location. We show experimentally that this approach achieves higher throughput than previous combining techniques, such as Combining Funnels, and is substantially more scalable than applying hardware fetch-and-add instructions on a single memory location. We show that replacing the fetch-and-add instructions in the fastest state-of-the-art concurrent queue with our Aggregating Funnels eliminates a bottleneck and greatly improves the queue's overall throughput.
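To make the aggregation idea concrete, here is a minimal C++20 sketch of our own (deliberately simplified and barrier-synchronized, not the paper's lock-free algorithm): threads claim offsets within a batch with a fetch-and-add on a side location, one leader applies the whole batch to the hot counter with a single hardware fetch-and-add, and each thread reconstructs the value its own fetch-and-add would have returned.

```cpp
#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

// Simplified illustration of the aggregation idea (not the paper's lock-free
// algorithm): threads FAA into a per-batch counter to claim an offset, a
// leader applies the whole batch to the shared hot counter with one hardware
// FAA, and every thread derives its result as base + offset.
std::atomic<long> shared_counter{0};   // the hot location
std::atomic<long> batch_total{0};      // per-batch aggregation point
std::atomic<long> batch_base{0};       // base value published by the leader

int main() {
    const int kThreads = 8;
    std::barrier sync(kThreads);
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t) {
        ts.emplace_back([&, t] {
            long offset = batch_total.fetch_add(1);      // FAA on the batch, not the hot spot
            sync.arrive_and_wait();                      // batch is now closed
            if (offset == 0)                             // leader: one FAA on the hot counter
                batch_base.store(shared_counter.fetch_add(batch_total.load()));
            sync.arrive_and_wait();
            long my_value = batch_base.load() + offset;  // what this thread's own FAA would return
            std::printf("thread %d got %ld\n", t, my_value);
        });
    }
    for (auto& th : ts) th.join();
}
```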
The MCS lock and its variants provide scalability on many-core architectures by using lists of lock requests to reduce access contention on the mutex data. The most recent variants adopt a two-stage design, allowing requests to be allocated from stack memory rather than the heap. However, this design still produces mutex access contention and limits fairness in the presence of fast paths. This paper proposes the Freezer mechanism and its optimization methods, which extend the list-structure operations of the MCS lock to reduce mutex accesses without using heap memory and to enable an independent choice of fairness policies and fast paths. Additionally, we propose four optimization methods for queue-based reader-writer locks. Our evaluation using three benchmarks demonstrates the effectiveness of the proposed fair reader-writer locks. They achieve up to 3.5× higher throughput and improve tail latency by up to 2.7× compared to the baselines.
Safe memory reclamation techniques that rely on per-read reservations, such as hazard pointers and hazard eras, often cause significant overhead in traversals of linked concurrent data structures. This is primarily due to the need to announce a reservation, and to issue a fence to make it globally visible (and enforce appropriate ordering), before each read. In real-world read-intensive workloads, this overhead is amplified because, even if relatively little memory reclamation actually occurs, the full overhead of reserving records before use is still incurred while traversing data structures.
In this paper, we propose a novel memory reclamation technique by combining POSIX signals and delayed reclamation, introducing a publish-on-ping approach. This method eliminates the need to make reservations globally visible before use. Instead, threads privately track which records they are accessing, and share this information on demand with threads that intend to reclaim memory. The approach can serve as a drop-in replacement for hazard pointers and hazard eras. Furthermore, the capability to retain reservations during traversals in data structure operations and publish them on demand facilitates the construction of a variant of hazard pointers (EpochPOP). This variant uses epochs to approach the performance of epoch-based reclamation in the common case where threads are not frequently delayed (while retaining the robustness of hazard pointers).
Our publish-on-ping implementations based on hazard pointers and hazard eras, when applied to various data structures, exhibit significant performance improvements. The improvements across various workloads and data structures range from 1.2X to 4X over the original hazard pointers (HP), up to 20% compared to a heavily optimized HP implementation similar to the one in the Folly open-source library, and up to 3X over hazard eras. EpochPOP delivers performance similar to epoch-based reclamation while providing stronger guarantees.
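For context, the per-read announcement and fence that the first paragraph identifies as the main cost, and that publish-on-ping removes from the common path, look roughly like the following classic hazard-pointer read protection (our sketch; the names are illustrative, not the paper's API):

```cpp
#include <atomic>

// Classic hazard-pointer protection of a shared node pointer, showing the
// per-read publication and fence that publish-on-ping avoids.
struct Node { int value; std::atomic<Node*> next; };

std::atomic<Node*> head{nullptr};
thread_local std::atomic<Node*> my_hazard{nullptr};   // this thread's reservation slot

Node* protect_head() {
    Node* p;
    do {
        p = head.load(std::memory_order_acquire);
        my_hazard.store(p, std::memory_order_release);        // announce the reservation
        std::atomic_thread_fence(std::memory_order_seq_cst);   // make it globally visible
    } while (p != head.load(std::memory_order_acquire));       // re-validate after publishing
    return p;   // safe to dereference until my_hazard is cleared
}
```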
In-memory key-value (KV) caching bridges the performance gap between high-performance networks and disk devices. However, prior in-memory KV caching systems either target large objects or introduce additional memory overhead. In this paper, we conduct a systematic analysis of 56 production traces and make three observations: (i) small objects dominate the traces and data accesses are highly skewed; (ii) the hotness of objects remains stable across days; and (iii) the multi-get operation, which retrieves multiple objects from the same node, incurs much shorter tail latency than using only the single-get operation.
These observations motivate the design of AC-Cache, an access-correlation-aware in-memory caching system for small objects. AC-Cache comprises three design primitives: (i) we formulate the distribution of KV objects as an integer linear programming problem to balance data accesses and memory consumption; (ii) we capture access correlation in a memory-efficient manner and generate fine-grained correlation groups; and (iii) we formulate the distribution of the correlation groups as a maximum-flow problem to balance data accesses, and leverage a heuristic algorithm to dispatch the remaining KV objects to balance memory consumption. Extensive experiments with billions of objects on Alibaba Cloud show that AC-Cache reduces tail latency by 5.1-80.2% and increases access throughput by 42.8-534.8%.
In today's data-driven era, the explosive growth of global data volume has led to increasing consumption of computing and storage resources. Effective management of virtual machine (VM) memory usage is critical for cloud vendors to optimize system performance and resource utilization. Existing memory prefetching methods often degrade system performance, creating a difficult balance between maintaining service quality and optimizing resource use. For instance, Leap, which primarily relies on address information, performs poorly in VM environments. The main issue is the performance drop caused by the reuse of memory resources in virtualized environments, a common situation in public clouds.
To tackle this, we introduce vPrefetcher, a page prefetching system designed specifically for VMs. vPrefetcher monitors how VMs access memory that is not currently in use and prepares that memory in advance. Our Trend algorithm uses spatial and temporal access patterns to improve prefetching and reduce the overhead of memory reclamation. In tests with typical data center applications under memory-reuse conditions, vPrefetcher reduces the performance loss in VMs by up to 45%, a considerable improvement over existing solutions.
The Mixture of Experts (MoE) architecture enhances model quality by scaling up model parameters. However, its development is hindered in distributed training scenarios by significant communication overhead and expert load imbalance. Existing methods, which only allow coarse-grained overlapping of communication and computation, slightly alleviate communication costs while noticeably impairing computational efficiency. Furthermore, current approaches to addressing load imbalance often compromise model quality.
We introduce CCFuser, a novel framework designed for efficient training of MoE models. CCFuser replaces the costly All2All operations typical in MoE architectures with efficient inter-GPU shared memory access. This allows for the concurrent computation of local and remote data within a fused kernel, achieving substantially higher compute FLOPS for GEMM operations. Additionally, CCFuser addresses load imbalance with a resource-efficient expert reassignment strategy, which optimizes the use of computational resources in expert reassignment through equivalent graph transformations without sacrificing statistical accuracy. By integrating these optimizations, CCFuser significantly enhances GPU utilization efficiency. Experiments conducted on A100 servers show that CCFuser outperforms state-of-the-art methods FastMoE and FasterMoE by an average of 2.96x and 2.48x, respectively, achieving a maximum speedup of 4.34x.
Deep neural networks (DNNs) have shown significant effectiveness in natural language processing and video applications. However, DNN models, especially for long-context tasks, introduce extremely large intermediate tensors, producing substantial memory overhead. Although considerable efforts have been made to optimize DNNs, insufficient awareness of tensor properties has hindered effective memory optimization and can lead to inefficient computation in long-context scenarios.
In this paper, we present FlashTensor, a DNN optimization system that reduces memory overhead and improves inference performance by leveraging fine-grained tensor properties. We first extract and identify essential tensor properties from a computation graph, such as reduce dependency and broadcastability. Then, we apply various optimizations involving transformation and kernel mapping based on these properties. Experiments on seven models demonstrate that FlashTensor achieves speedups of 1.50× and 3.24× on average for end-to-end and core-module performance, respectively, compared to eight state-of-the-art works on H100 (1.86× and 3.70× on A100).
Large language models have to be trained in parallel due to their large number of parameters and significant memory footprint. Among various parallelism techniques, pipeline parallelism is widely adopted in inter-node scenarios because of its minimal communication overhead. However, state-of-the-art pipeline schemes incur extra and imbalanced memory footprints, leaving room for further improvement. In this paper, we propose Mario, a pipeline optimizer that automatically tessellates activation checkpointing into existing pipeline schemes, enabling the training of larger models (or longer sequences) with a smaller and more balanced memory footprint across GPUs and improved GPU utilization. First, activation recomputation can be effectively overlapped with pipeline bubbles by moving it earlier in the execution, thereby improving overall efficiency. With the memory freed by checkpointing, Mario can prepose more forward computation into the pipeline bubbles, making more room for further overlapping with greater flexibility and thus exploiting the bubbles. We then design a lightweight pipeline simulator to model execution behavior with and without Mario. Finally, we introduce an automatic pipeline scheduler specifically for Mario, capable of searching for a near-optimal combination of checkpointing and pipeline configurations within minutes. Experimental results on GPT-3 and LLaMA2 models show that Mario speeds up existing state-of-the-art pipeline schemes (without/with checkpointing), including 1F1B, Chimera, and Interleave, by 1.16×/1.57× on average. This work paves a new direction for effective low-cost pipeline training.
Second-order optimization methods have been developed to enhance convergence and generalization in deep neural network (DNN) training compared to first-order methods like Stochastic Gradient Descent (SGD). However, these methods face challenges in distributed settings due to high communication overhead. Gradient compression, a technique commonly used to accelerate communication for first-order approaches, often results in low communication reduction ratios, decreased model accuracy, and/or high compression overhead when applied to second-order methods. To address these limitations, we introduce a novel gradient compression method for second-order optimizers called COMPSO. This method effectively reduces communication costs while preserving the advantages of second-order optimization. COMPSO employs stochastic rounding to maintain accuracy and filters out minor gradients to improve compression ratios. Additionally, we develop GPU optimizations to minimize compression overhead and performance modeling to ensure end-to-end performance gains across various systems. Evaluation of COMPSO on different DNN models shows that it achieves a compression ratio of 22.1×, reduces communication time by 14.2×, and improves overall performance by 1.9×, all without any drop in model accuracy.
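As a concrete illustration of the stochastic rounding named above (our sketch, not COMPSO's code): a value is rounded up with probability equal to its fractional distance to the next grid point, so the rounded value is unbiased in expectation.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

// Stochastic rounding to a grid of step h: round up with probability equal to
// the fractional distance, so the rounded value equals the input in expectation.
double stochastic_round(double x, double h, std::mt19937& rng) {
    double scaled = x / h;
    double low = std::floor(scaled);
    double frac = scaled - low;                       // in [0, 1)
    std::bernoulli_distribution up(frac);             // P(round up) = frac
    return (low + (up(rng) ? 1.0 : 0.0)) * h;
}

int main() {
    std::mt19937 rng(42);
    double x = 0.0301, h = 0.01, sum = 0.0;
    const int trials = 100000;
    for (int i = 0; i < trials; ++i) sum += stochastic_round(x, h, rng);
    std::printf("mean of rounded values: %.5f (input %.5f)\n", sum / trials, x);
}
```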
Training large language models (LLMs) has become increasingly expensive due to the rapid growth in model size. Pipeline parallelism is a widely used distributed training technique. However, as LLMs with longer contexts become prevalent and memory optimization techniques advance, traditional pipeline-parallel methods encounter greater communication challenges due to the increased size of activations and activation gradients. To address this issue, we introduce weight-pipeline parallelism (WeiPipe), which transitions from an activation-passing pipeline to a weight-passing pipeline. WeiPipe reduces communication costs and achieves more balanced utilization by transmitting only weights and their gradients between workers in a pipelined manner. WeiPipe does not rely on collective communication primitives, thus ensuring scalability. We present four variations of WeiPipe parallelism, including WeiPipe-Interleave, which emphasizes communication efficiency, and WeiPipe-zero-bubble, which explores the potential for minimal bubble ratios. Our implementation of WeiPipe-Interleave, run on up to 32 GPUs and tested in various model configurations, including large-context LLM training, demonstrates a significant improvement in throughput compared to state-of-the-art pipeline parallelism and fully sharded data parallelism on different underlying infrastructures, including NVLink within a cluster with Ethernet across clusters, and PCIe within a cluster with Ethernet across clusters. Additionally, WeiPipe shows greater scalability in communication-constrained scenarios compared to state-of-the-art strategies.
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are achievable also in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads.
In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be practically supported with close to the maximum (4×) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios also leads to significant end-to-end LLM inference speedups (of up to 2.8×) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results, using systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored to the attention mechanism in LLMs. ATTNChecker is designed based on the fault propagation patterns of LLMs and incorporates performance optimizations to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs an average overhead of 7% on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49×.
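For background, the classical ABFT check for a matrix product, which techniques like ATTNChecker build on (our illustration; the paper's checksums are tailored to the attention computation), appends checksum rows and columns and exploits the identity:

```latex
% Classical ABFT encoding for GEMM, with e the all-ones vector:
\[
  \begin{pmatrix} A \\ e^{\top} A \end{pmatrix}
  \begin{pmatrix} B & B e \end{pmatrix}
  \;=\;
  \begin{pmatrix} AB & ABe \\ e^{\top}AB & e^{\top}ABe \end{pmatrix},
  \qquad e = (1, 1, \dots, 1)^{\top}.
\]
% The appended row and column of the result must equal the column and row sums
% of AB; a mismatch detects a fault, and the intersection of the offending
% checksum row and column locates the corrupted entry so it can be corrected.
```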
Cloud service providers heavily colocate high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services on the same GPU to improve resource utilization in data centers. Among the critical shared GPU resources, there has been very limited analysis of the dynamic allocation of compute units and VRAM bandwidth, mainly for two reasons: (1) native GPU resource management solutions are either hardware-specific, or unable to dynamically allocate resources to different tenants, or both; and (2) NVIDIA does not expose interfaces for VRAM bandwidth allocation, and the software stack and VRAM channel architectures are black boxes, both of which limit software-level resource management. These constraints drive prior work to adopt either conservative sharing policies that are detrimental to throughput, or static resource partitioning that is only applicable to a few GPU models.
To bridge this gap, this paper proposes SGDRC, a fully software-defined dynamic VRAM bandwidth and compute unit management solution for concurrent DNN inference services. SGDRC aims at guaranteeing service quality, maximizing the overall throughput, and providing general applicability to NVIDIA GPUs. SGDRC first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs through comprehensive reverse engineering and eliminates VRAM channel conflicts using software-level cache coloring. SGDRC applies bimodal tensors and tidal SM masking to dynamically allocate VRAM bandwidth and compute units, and guides the allocation of resources based on offline profiling. We evaluate 11 mainstream DNNs with real-world workloads on two NVIDIA GPUs. The results show that compared with the state-of-the-art GPU sharing solutions, SGDRC achieves the highest SLO attainment rates (99.0% on average), and improves overall throughput by up to 1.47× and BE job throughput by up to 2.36×.
Deterministic parallelism is a key building block for distributed and fault-tolerant systems that offers substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify certain design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of μs-scale and high-throughput distributed systems deployed in modern datacenters.
We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher that deterministically constructs a dynamic dependency graph of incoming requests, together with worker pools that independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck through core pipelining.
We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, via the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5× better throughput and more than 150× and 300× better tail latency in non-contended and contended cases, respectively. We also compare DORADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
The carbon and water footprint of large-scale computing systems poses serious environmental sustainability risks. In this study, we discover that, unfortunately, carbon and water sustainability are at odds with each other: optimizing one alone hurts the other. To address this, we introduce WaterWise, a novel job scheduler for parallel workloads that intelligently co-optimizes the carbon and water footprint to improve the sustainability of geographically distributed data centers.
Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) equip modern accelerators with superior computing power, which is promising for boosting the performance of matrix operators. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach that bridges the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. Benefiting from the minimal sparse granularity, the computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. Besides, FlashSparse is equipped with a memory-efficient thread mapping strategy for coalesced data access and a sparse matrix storage format that saves memory footprint. Extensive experimental results on H100 and RTX 4090 GPUs show that FlashSparse sets a new state of the art for sparse matrix multiplications (geometric-mean speedups of 5.5x over DTC-SpMM and 3.22x over RoDe).
General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM acceleration. However, in order to fully unleash the hardware's performance, systematic optimization is required. In this paper, we propose Acc-SpMM, a high-performance SpMM library for TCs, with multiple optimizations, including data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing. Compared with state-of-the-art SpMM kernels on various NVIDIA GPU architectures over a diverse range of benchmark matrices, Acc-SpMM achieves significant performance improvements over cuSPARSE: on average 2.52x (up to 5.11x) speedup on RTX 4090, on average 1.91x (up to 4.68x) speedup on A800, and on average 1.58x (up to 3.60x) speedup on H100.
Breadth-First Search (BFS) plays a key role in computational science, networking, and artificial intelligence applications. Although BFS has been extensively studied, particularly in its direction-optimized form, existing implementations still present three main issues: (1) high memory footprint; (2) under-realized lightweight bitmap representations; and (3) underuse of modern hardware such as Tensor Core Units (TCUs).
In this paper, we propose BerryBees, an efficient algebraic BFS algorithm that leverages the Matrix Multiply-Accumulate (MMA) instructions of TCUs. The novelty of BerryBees lies in (1) the Binarized Row Slice (BRS) format, which encodes the adjacency matrix using bitmaps to represent non-empty row segments; and (2) a warp-level algorithm that leverages TCUs to accelerate both SpMV and SpMSpV operations for enhanced BFS performance. Experimental results on three of the latest NVIDIA GPUs show that BerryBees outperforms five state-of-the-art BFS methods: GAP, Gunrock, Enterprise, GSWITCH, and GraphBLAST, delivering average speedups of 1.42×, 1.97×, 5.05×, 1.24×, and 3.74× (up to 9.99×, 13.66×, 114.07×, 13.97×, and 24.74×), respectively.
While Tensor Core Units (TCUs) excel in AI tasks, their application to HPC algorithms like stencil computations faces significant challenges due to sparsity, which leads to underutilization and exacerbates memory-bound limitations. This paper introduces FlashFFTStencil, a memory-efficient stencil computing system designed to bridge FFT to fully-dense stencil computations on TCUs. Aimed at bound shifting, FlashFFTStencil comprises three key techniques: Kernel Tailoring on HBM fuses distinct kernels to enhance parallelism while reducing memory transfer and footprint; Architecture Aligning on SMEM restructures FFT-based stencil computations into dense matrix multiplications tailored to the shared memory architecture; and Computation Streamlining on TCUs optimizes TCU utilization and thread parallelism by minimizing pipeline stalls and maximizing register reuse. Notably, a distinctive extension is FlashFFTStencil's ability to enable theoretically unrestricted temporal fusion via FFT. Results show that FlashFFTStencil achieves effective sparsity-free bound shifting, with an average speedup of 2.57x over the state of the art. FlashFFTStencil unifies computational patterns within the HPC landscape and bridges them with AI-driven hardware innovations such as TCUs.
Approximate Nearest-Neighbor Search (ANNS) has become the standard querying method in vector databases, especially with the recent surge in large-scale, high-dimensional data driven by LLM-based applications. Recently, graph-based ANNS has shown improved throughput by constructing a graph from the dataset, with edges representing the distances between data points, and using best-first or beam search algorithms for query evaluation.
This work highlights that key aspects of the design space for both graph construction and search in graph-based ANNS remain under-explored. Specifically, the construction phase often neglects the potential temporal correlation between input queries and their results, while the search phase lacks a thorough exploration of beam search parameterization. To address these gaps, we present PANNS, a system that embeds temporal information in the proximity graph construction to capture query-result correlations. Moreover, it introduces a fully parameterized beam search algorithm, enabling extensive performance optimization. PANNS achieves up to a 3.7× improvement in query throughput over state-of-the-art graph-based ANNS methods, while maintaining equivalent recall levels. Furthermore, it reduces graph size by up to 30% without degrading query quality.
Relaxed semantics have been introduced to increase the achievable parallelism of concurrent data structures in exchange for weaker ordering semantics. In this paper, we revisit the d-choice load balancing scheme of balanced allocations in the context of relaxed FIFO queues. Our novel load balancing approach distributes operations evenly across n sub-queues based on operation counts, achieving low relaxation errors independent of the queue's size, as opposed to similar earlier designs. We prove that its relaxation errors are O(n log log n / log d) with high probability for a collection of possible executions. Furthermore, our scheme, contrary to previous ones, can interface with and integrate the most performant linearizable queue designs from the literature as components. Our resulting relaxed FIFO queue is experimentally shown to outperform the previously best design using balanced allocations by more than four times in throughput, while simultaneously incurring less than a thousandth of its relaxation errors. In a concurrent breadth-first-search benchmark, our queue consistently outperforms both relaxed and strict state-of-the-art FIFO queues.
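The d-choice placement rule at the heart of this design works as in the sketch below (a simplified, lock-based illustration of ours; the paper plugs linearizable lock-free queues into this role): each enqueue samples d sub-queues and picks the one with the fewest operations so far.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <deque>
#include <mutex>
#include <random>

// d-choice placement over n sub-queues, as in balanced allocations: sample d
// sub-queues at random and enqueue into the one with the fewest enqueues so far.
constexpr int n = 8, d = 2;

struct SubQueue {
    std::mutex m;                          // stand-in for a lock-free sub-queue
    std::deque<int> q;
    std::atomic<uint64_t> enq_count{0};    // operation count used by the d-choice rule
};

std::array<SubQueue, n> subqueues;
thread_local std::mt19937 rng{std::random_device{}()};

void relaxed_enqueue(int item) {
    std::uniform_int_distribution<int> pick(0, n - 1);
    int best = pick(rng);
    for (int i = 1; i < d; ++i) {          // inspect d candidates, keep the least loaded
        int cand = pick(rng);
        if (subqueues[cand].enq_count.load() < subqueues[best].enq_count.load())
            best = cand;
    }
    std::lock_guard<std::mutex> g(subqueues[best].m);
    subqueues[best].q.push_back(item);
    subqueues[best].enq_count.fetch_add(1);
}
```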
The Ray-Tracing (RT) core has become a widely integrated feature in modern GPUs to accelerate ray-traced rendering. Recent research has shown that RT cores can also be repurposed to accelerate non-rendering workloads. Since the RT core essentially serves as a hardware accelerator for Bounding Volume Hierarchy (BVH) tree traversal, it holds the potential to significantly improve the performance of spatial workloads. However, the specialized RT programming model poses challenges for using RT cores in these scenarios. Inspired by the core functionality of RT cores, we designed and implemented LibRTS, a spatial index library that leverages RT cores to accelerate spatial queries. LibRTS supports both point and range queries and remains mutable to accommodate changing data. Instead of relying on a case-by-case approach, LibRTS provides a general, high-performance spatial indexing framework for spatial data processing. By formulating spatial queries as RT-suitable problems and overcoming load-balancing challenges, LibRTS delivers superior query performance through RT cores without requiring developers to master complex programming on this specialized hardware. Compared to CPU and GPU spatial libraries, LibRTS achieves speedups of up to 85.1x for point queries, 94.0x for range-contains queries, and 11.0x for range-intersects queries. In a real-world application, point-in-polygon testing, LibRTS also surpasses the state-of-the-art RT method by up to 3.8x.
Scaling blockchain performance through parallel smart contract execution has gained significant attention, as traditional methods remain constrained by the performance of a single virtual machine (VM), even in multi-chain or Layer-2 systems. Parallel VMs offer a compelling solution by enabling concurrent transaction execution within a single smart contract, using multiple CPU cores. However, Ethereum's sequential, shared-everything model limits the efficiency of existing parallel mechanisms, resulting in frequent rollbacks with optimistic methods and high overhead with pessimistic methods due to state dependency analysis and locking.
This paper introduces Crystality, a programming model for smart contracts on parallel Ethereum Virtual Machines (EVMs) that enables developers to express and leverage the parallelism inherent in smart contracts. Crystality introduces Programmable Contract Scopes to partition contract states into non-overlapping, parallelizable segments and decompose a smart contract function into finer-grained components. Crystality also features Asynchronous Functional Relay to manage execution flow across EVMs. These features simplify parallelism expression and enable asynchronous execution for commutative contract operations.
Crystality extends Solidity with directives, transpiling Crystality code into standard Solidity code for EVM compatibility. The system supports two execution modes: an asynchronous mode for transactions involving commutative operations and an optimistic-based fallback to ensure block-defined transaction order. Our experiments demonstrated Crystality's superior performance compared to Ethereum, Aptos, and Sui on a 64-core machine.
K-means is a popular clustering algorithm with significant applications in numerous scientific and engineering areas. One drawback of K-means is its inability to identify non-linearly separable clusters, which may lead to inaccurate solutions in certain cases. Kernel K-means is a variant of classical K-means that can find non-linearly separable clusters. However, it scales quadratically with respect to the size of the dataset, taking several minutes to cluster even medium-sized datasets on traditional CPU-based machines.
In this paper, we present a formulation of Kernel K-means using sparse-dense matrix multiplication (SpMM) and sparse matrix-vector multiplication (SpMV), and we show that our formulation enables the rapid implementation of a fast GPU-based version of Kernel K-means with little programming effort. Our implementation, named Popcorn, is the first open-source GPU-based implementation of Kernel K-means.
Popcorn achieves a speedup of up to 123.8× over a CPU implementation of Kernel K-means and a speedup of up to 2.6× over a GPU implementation of Kernel K-means that does not use sparse matrix computations. Our results support the effectiveness of sparse matrices as tools for efficient parallel programming.
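The standard kernel k-means distance expansion shows why sparse matrix products arise naturally in such a formulation (our illustration of the general idea; the paper's exact matrix layout may differ): with kernel matrix K and a sparse assignment matrix A where A_{aj} = 1/|C_j| if point a belongs to cluster C_j and 0 otherwise,

```latex
% Kernel k-means distance from point x_i to the centroid of cluster C_j:
\[
  d^2(x_i, c_j) \;=\; K_{ii} \;-\; \frac{2}{|C_j|}\sum_{a \in C_j} K_{ia}
  \;+\; \frac{1}{|C_j|^2}\sum_{a, b \in C_j} K_{ab}
  \;=\; K_{ii} \;-\; 2\,[KA]_{ij} \;+\; [A^{\top} K A]_{jj},
\]
% so the per-iteration work reduces to one matrix product KA over all (i, j)
% pairs plus a small per-cluster term, which is where sparse matrix-matrix and
% matrix-vector products enter the picture.
```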
The Louvain algorithm is one of the most popular algorithms for community detection. Observing that existing implementations suffer from inaccurate pruning and inefficient intermediate-state management, we introduce GALA, a GPU-Accelerated Louvain Algorithm, which incorporates two key innovations. The first is a novel modularity-gain-based pruning strategy, supported by rigorous theoretical guarantees of optimality and able to prune up to 76% of vertices along with their corresponding computations. To take advantage of the memory hierarchy and parallelism of GPUs, the second is a set of workload-aware kernels, featuring a shuffle-based kernel built on warp-level primitives for exchanging states and a hash-based kernel that prioritizes shared memory in its hash-table design. GALA further scales to multiple GPUs by minimizing inter-GPU synchronization overhead through a dense-sparse synchronization strategy. We evaluate the performance of GALA through theoretical analysis and practical experiments on various real-world graphs. The experimental results confirm that GALA significantly improves the performance of the parallel Louvain algorithm on GPUs, surpassing state-of-the-art solutions by 6× on average.
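For reference, the quantity Louvain evaluates when moving a vertex i into a community C is the standard modularity gain below (the paper's pruning strategy bounds such gains; its exact criterion is not stated in the abstract):

```latex
\[
  \Delta Q \;=\; \frac{k_{i,\mathrm{in}}}{m} \;-\; \frac{\Sigma_{\mathrm{tot}}\, k_i}{2m^2},
\]
% where k_{i,in} is the total weight of edges between vertex i and community C,
% k_i is the (weighted) degree of i, \Sigma_{tot} is the total degree of the
% vertices in C, and m is the total edge weight of the graph. Moves whose gain
% is non-positive (or provably below the best alternative) are candidates for
% pruning.
```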
Graph Pattern Mining (GPM) has made significant progress in recent years, supporting a range of big-data applications. However, GPM applications are memory-intensive because they require a tremendous amount of edge checking, which involves enumerating all possible vertex pairs and checking their connectivity or counting the common neighbors of each pair. The adjacency-list (CSR) format is widely used in recent GPM systems to cope with the sparsity of real-world graphs, at the cost of sacrificing the ability to query the connectivity of two vertices in O(1), as the raw adjacency matrix allows. In this paper, we rethink the format used to store graphs in GPM systems: by partially "restoring" the graph from CSR to an adjacency matrix, we can significantly accelerate connectivity checks and common-neighbor counting for GPM algorithms. However, directly storing the graph as an adjacency matrix is cost-prohibitive. We therefore propose GLumin, which selectively constructs a lookup table (LUT) at runtime that holds connectivity information that can be repeatedly retrieved at low cost. The LUTs are compressed and tailored to fit hierarchical memory architectures such as GPU shared memory or CPU caches. We apply GLumin to several state-of-the-art GPM systems, including AutoMine, G2Miner, and GraphFold, bringing two orders of magnitude speedup on more than 100 different graphs and 24 patterns, demonstrating the general applicability and significance of our method.
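A minimal sketch of the lookup-table idea (ours, not GLumin's code): for a small set of candidate vertices, part of the CSR graph is restored into a bitmap adjacency so that connectivity checks become single bit tests.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For a small set of "interesting" vertices (e.g., candidates of a pattern),
// restore part of a CSR graph into a bitmap adjacency so connectivity checks
// become O(1) bit tests. GLumin additionally compresses such LUTs to fit GPU
// shared memory or CPU caches.
struct CSR { std::vector<int> row_ptr, col_idx; };

// Build a |sub| x |sub| bit matrix over the vertices listed in `sub`.
std::vector<uint64_t> build_lut(const CSR& g, const std::vector<int>& sub) {
    int n = (int)sub.size();
    std::vector<int> pos(g.row_ptr.size() - 1, -1);       // global id -> position in sub
    for (int i = 0; i < n; ++i) pos[sub[i]] = i;
    int words = (n + 63) / 64;
    std::vector<uint64_t> lut((std::size_t)n * words, 0);
    for (int i = 0; i < n; ++i) {
        int v = sub[i];
        for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
            int j = pos[g.col_idx[e]];
            if (j >= 0) lut[(std::size_t)i * words + j / 64] |= 1ull << (j % 64);
        }
    }
    return lut;
}

// O(1) connectivity query between the i-th and j-th vertices of `sub`.
bool connected(const std::vector<uint64_t>& lut, int n, int i, int j) {
    int words = (n + 63) / 64;
    return (lut[(std::size_t)i * words + j / 64] >> (j % 64)) & 1ull;
}
```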
Tridiagonalization, a key step in symmetric eigenvalue decomposition (EVD), converts a symmetric matrix to tridiagonal form. In NVIDIA's cuSOLVER library, the FP64 tridiagonalization process reaches only 2.1 TFLOPS out of 67 TFLOPS on an H100 GPU, and it consumes a significant portion of the elapsed time of the entire EVD process, accounting for over 97%. Thus, improving tridiagonalization performance is crucial for accelerating EVD. In this paper, we analyze the reasons behind the suboptimal performance of tridiagonalization on GPU architectures, and we propose a new double-blocking band reduction algorithm along with a GPU-based bulge-chasing implementation to improve tridiagonalization performance. In our experimental evaluation, the proposed FP64 tridiagonalization method achieves up to 19.6 TFLOPS, which is 9.3x and 5.2x faster than cuSOLVER and MAGMA, respectively.
Stencil computation plays a pivotal role in numerous scientific and engineering applications. Previous studies have extensively investigated vectorization techniques to enhance in-core parallelism; however, the performance bottleneck caused by data alignment conflicts (DAC) has not been effectively resolved in all dimensions. This paper proposes Jigsaw, a conflict-free vectorization method that reduces DAC across all dimensions by tessellating swizzled finest-grained lanes. Jigsaw comprises three key components: Lane-based Butterfly Vectorization, SVD-based Dimension Flattening, and Iteration-based Temporal Merging. These components effectively address DAC across spatial and temporal dimensions. Experimental results on different machines demonstrate that Jigsaw achieves significant improvements over state-of-the-art techniques, with an average speedup of 2.31x on various stencil kernels.
Parallel multigrid methods are widely used as preconditioners for solving large sparse linear systems. Most multigrid implementations rely on general sparse matrix formats, which prevents them from achieving optimal performance. There is an emerging trend towards semi-structured multigrids that balance flexibility with performance. However, existing libraries often fall short in speed and scalability on semi-structured problems. To address these limitations, we have designed and implemented Semi-StructMG. It employs multi-dimensional coarsening to reduce complexity and simplify communication patterns. It also accounts for the special role of inter-block connections in smoothers and triple-matrix products to improve convergence under large-scale parallelism. We evaluated Semi-StructMG using two benchmark problems and four real-world applications from petroleum reservoir simulation, ship manufacturing, numerical weather prediction, and ocean modeling. Compared to hypre's multigrid solvers, Semi-StructMG achieves the fastest time-to-solution in all cases, with average speedups of 5.97x, 15.2x, and 3.85x over SSAMG, Split, and BoomerAMG, respectively. It also significantly improves both strong- and weak-scaling efficiencies in all tests. These results suggest that Semi-StructMG can serve as an effective alternative to SSAMG and Split.
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean Lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be individually distinguished simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework to tackle BMGT's computational challenges by proposing three key innovations: 1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; 2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4096 cores; and 3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STDs datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
Global Storm Resolving Models (GSRMs) are crucial for understanding extreme weather events in the context of climate change. In this study, we optimize the Global-Regional Integrated Forecast System (GRIST), a unified weather-climate modeling system designed for research and operations, for the next-generation Sunway supercomputer, incorporating an AI-enhanced physics suite, OpenMP-based parallelization, and mixed-precision optimizations to enhance efficiency, performance portability, and unified modeling capability. Our experiments successfully capture significant events during the "23.7" extreme rainfall over northern China influenced by super Typhoon Doksuri, at 1 km resolution. Notably, our work scales to 34 million cores, enabling simulation speeds of 491 simulated days per day (SDPD) at 3 km and 181 SDPD at 1 km.
In this work, we present theoretically and practically efficient implementations of Big Atomics, i.e., k-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations. While modern hardware supports k = 1 and sometimes k = 2 (e.g., double-width compare-and-swap in x86), our implementations support arbitrary k. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC). We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop. We then use them to develop an efficient concurrent hash table, as evidence of their utility.
We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics). The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version. The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost. The results show that our approach of using big atomics in the design of hash tables is a promising direction.
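As background on the design space, one of the baseline variants compared above, a sequence-lock (seqlock) based k-word register, can be sketched as follows (our simplified illustration, not the paper's lock-free fast-path/slow-path design):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// A sequence-lock based k-word atomic register: readers retry while a writer
// is active; writers serialize on the sequence word. Memory ordering follows
// the standard seqlock pattern in simplified form.
template <std::size_t K>
class SeqlockBigAtomic {
    std::atomic<uint64_t> seq{0};                  // odd while a store is in progress
    std::array<std::atomic<uint64_t>, K> words{};
public:
    std::array<uint64_t, K> load() {
        std::array<uint64_t, K> out;
        uint64_t s1, s2;
        do {
            s1 = seq.load(std::memory_order_acquire);
            for (std::size_t i = 0; i < K; ++i)
                out[i] = words[i].load(std::memory_order_acquire);
            s2 = seq.load(std::memory_order_acquire);
        } while (s1 != s2 || (s1 & 1));            // retry if a writer interfered
        return out;
    }
    void store(const std::array<uint64_t, K>& in) {
        uint64_t s;
        for (;;) {                                 // acquire the writer lock (seq becomes odd)
            s = seq.load(std::memory_order_relaxed);
            if ((s & 1) == 0 &&
                seq.compare_exchange_weak(s, s + 1, std::memory_order_acquire))
                break;
        }
        for (std::size_t i = 0; i < K; ++i)
            words[i].store(in[i], std::memory_order_release);
        seq.store(s + 2, std::memory_order_release);   // even again: publish the new value
    }
};
```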
Graph reordering is an effective technique for improving the access locality of graph processing. However, existing methods often overlook the data access locality among concurrently activated vertices (a.k.a. frontiers). These vertices, while lacking direct connections or shared neighbors, can exhibit significant locality attributable to their overlapping k-order in-neighbors. However, calculating such overlaps directly is computationally prohibitive. We propose to estimate the overlapping k-order in-neighbors through frontier distribution analysis based on a few BFS samples. Our graph reordering method, FrontOrder, constructs feature vectors from the frontier distributions of BFS samples and employs K-means clustering with a custom distance metric to group vertices with high locality. Additionally, the learned clusters can predict runtime computing intensity, enabling load balancing through vertex reordering. FrontOrder achieves average speedups of 2.65× on Ligra and 1.73× on GPOP, outperforming state-of-the-art methods.
Transactional Data Structure Programming Systems (TDSPSs) let programmers compose method invocations on concurrent data structures into coarse-grained, isolated transactions. They detect conflicts among concurrent transactions and recover when operations do not commute.
We present Harmony, a new TDSPS design that orchestrates two categories of metadata: one for synchronization and another for transaction management. By separating these responsibilities, Harmony avoids unnecessary conflicts and simplifies the implementation of complex data structures like skip lists, resizable arrays, deques, and hash tables. These data structures support both read-only and mutating range queries efficiently.
Concurrent queues and stacks are important components of many software systems. They are well known to be contended data structures due to intense contention on hotspots, which results in limited performance and poor scalability. To achieve higher performance, several attempts have applied the batching technique, which packs a group of standard operations into a single batch for execution, to lock-free queues and stacks built upon linked lists, but these still inherit the contended compare-and-swap (CAS).
In this paper, we construct a lock-free queue and stack that support linearizable batch operations, which benefit from the fetch-and-add (FAA) primitive and avoid the inherent CAS contention. In our evaluation, our new designs show a significant performance advantage over their competitors.
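To illustrate why FAA suits batching (a simplified, bounded sketch of ours that omits wrap-around and emptiness handling, not the paper's linearizable design): a batch of b enqueues can claim b consecutive array slots with a single fetch-and-add on the tail index.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Core idea of FAA-based batching on an array queue: a batch of b enqueues
// claims b consecutive slots with one fetch_add on the tail index, then fills
// them without further contention on the hot counter.
constexpr std::size_t CAP = 1 << 20;
std::vector<std::atomic<int>> slots(CAP);
std::atomic<std::size_t> tail{0};

void enqueue_batch(const std::vector<int>& batch) {
    std::size_t base = tail.fetch_add(batch.size());   // one FAA per batch, not per item
    for (std::size_t i = 0; i < batch.size(); ++i)
        slots[base + i].store(batch[i], std::memory_order_release);
}
```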
Molecular dynamics simulation has emerged as an important area in which HPC+AI helps investigate physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (ML) tools have been leveraged to build MLIPs, but the two are not perfectly matched, since ML tools miss many optimization opportunities in MLIPs. This inefficiency arises from the fact that HPC+AI applications involve far more computational complexity than pure AI scenarios. This paper develops an MLIP, named TensorMD, independently of any ML tool. TensorMD has been evaluated on two supercomputers and scaled to 51.8 billion atoms, i.e., ~3× more than the state of the art.
Sequence alignment is a fundamental and often time-consuming step in genomic data analysis. Typically, it adheres to the seed-and-extend paradigm, and numerous accelerator-based approaches have been proposed to optimize one of its kernels. However, these approaches often increase costs and contribute minimally to the overall alignment process. To address this, we have designed an optimized full pipeline, FastBWA, which seeks to enhance performance while keeping costs low and exploring the potential of CPU computing resources. Our implementation demonstrates that FastBWA achieves up to 2.5× and 1.8× higher end-to-end alignment throughput compared to BWA-MEM and its newer version, BWA-MEM2, respectively.
Scientific images play a crucial role in many experimental sciences; however, the large volumes of data generated present significant challenges. Effective image compression must be fast, achieve high compression ratios, and preserve critical domain-specific features. Existing compressors, such as JPEG and SZ, often distort important textures when operating at high compression ratios. Conversely, AI-based compressors offer superior image quality and higher compression ratios but are significantly slower than traditional methods. To address this trade-off, we developed ViSemZ, a high-performance AI-based compressor specifically designed to preserve visual semantics. Our approach enhances AI compression by integrating sparse encoding with variable-length integer truncation, optimized lossless encoding using bitshuffle and a decoupled lookback prefix-sum, and pipelining techniques to enable efficient data streaming and asynchronous processing. Evaluations on scientific datasets demonstrate that, at comparable compression ratios, ViSemZ achieves performance almost on par with existing AI-based compressors while delivering a 9.6× overall compression speedup. These results effectively bridge the performance gap between traditional and AI-based compression methods.
Triangle counting is a fundamental graph algorithm used to count the number of triangles within a graph. The algorithm can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection, and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle sparse workloads for triangle counting presents non-trivial challenges. In this paper, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionality to diverse sparse matrix operations. In our experiments, ToT is evaluated against state-of-the-art methods and outperforms the second-fastest method by 11.56× in end-to-end execution. This work represents a pioneering exploration of utilizing Tensor Cores to accelerate graph algorithms.
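The algebraic reformulation referred to above is the standard one (stated here for a simple undirected graph with symmetric, zero-diagonal adjacency matrix A):

```latex
\[
  \#\triangle \;=\; \tfrac{1}{6}\operatorname{tr}\!\bigl(A^{3}\bigr)
  \;=\; \tfrac{1}{6}\sum_{i,j}\bigl(A \circ (A A)\bigr)_{ij},
\]
% where \circ is the element-wise (Hadamard) product: a masked sparse matrix
% multiplication (common-neighbor counting) followed by a global reduction.
```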
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02× and 4.19×, respectively.
Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to inefficient memory access patterns and high communication overhead. This paper presents a general and efficient GCN training framework for CPU supercomputers. It comprises a general aggregation kernel designed to optimize irregular memory access and a quantization method with label propagation to reduce communication overhead. Experimental results show that our method achieves a speedup of up to 4.1× compared with state-of-the-art implementations.
Speculative data-parallel algorithms for language recognition have been widely studied for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and finally joins the chunk results and checks overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in every possible state, causing an overhead that reduces the speedup over a serial algorithm. Existing data-parallel DFA-based recognizers suffer from an excessive number of starting states, and NFA-based ones suffer from the number of nondeterministic transitions.
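The chunked, speculative scheme described above can be sketched as follows (a sequential illustration of ours with a toy DFA; real recognizers run the chunks in parallel and may combine the per-chunk maps with a parallel prefix):

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <vector>

// Speculative chunked recognition for a DFA: each chunk is processed
// independently by running the automaton from *every* possible start state,
// yielding a state-to-state map; the per-chunk maps are then composed in order.
// Toy DFA: the state counts the number of 'a' characters modulo 4.
constexpr int S = 4;                              // number of DFA states
using StateMap = std::array<int, S>;              // map[s] = state reached from s

int step(int state, char c) { return c == 'a' ? (state + 1) % S : state; }

StateMap run_chunk(const std::string& chunk) {
    StateMap m;
    for (int s = 0; s < S; ++s) {                 // speculative start in each state
        int cur = s;
        for (char c : chunk) cur = step(cur, c);
        m[s] = cur;
    }
    return m;
}

StateMap compose(const StateMap& first, const StateMap& then) {
    StateMap m;
    for (int s = 0; s < S; ++s) m[s] = then[first[s]];
    return m;
}

int main() {
    std::vector<std::string> chunks = {"abba", "baab", "aaab"};  // chunks could run in parallel
    StateMap total;
    for (int s = 0; s < S; ++s) total[s] = s;     // identity map
    for (const auto& ch : chunks) total = compose(total, run_chunk(ch));
    std::printf("final state from start 0: %d\n", total[0]);     // 7 'a's -> state 3
}
```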