High-performance parallel code generation is a complex and fascinating area of computer science that focuses on producing code that executes as quickly and efficiently as possible. In our paper, we design a new architecture for a parallel code generation agent with four interconnected LLM components: Memory, Planning, Tools, and Action. The agent also incorporates three techniques to improve the performance of the generated parallel code: data augmentation, planning-oriented prompting, and retrieval-augmented editing. Data augmentation is implemented by extracting and processing the PIE dataset, together with synthetic data generated by LLMs against the ParEval benchmark. Planning-oriented prompting, code verification, and retrieval-augmented editing are then used to improve the actual performance of the LLM-generated code. The evaluation results confirm that rough speedups of 6.06X and 5.13X are achieved with the Qwen2.5-Coder-7B-Instruct and Qwen2.5-Coder-14B-Instruct models, respectively.
Breadth-first search (BFS) is a fundamental building block of many graph algorithms, such as shortest-path finding, network flow analysis, and connected component detection. As graph sizes continue to increase, efficient implementations of parallel BFS across multiple cores are critical to the design of scalable graph algorithms. In this paper, we introduce an efficient implementation of the popular bi-directional BFS (BD-BFS) algorithm. Evaluated on the Speedcode platform [3], it achieves a 38.07% throughput improvement (3.01B/s vs 2.18B/s) over the conventional BD-BFS implementation.
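The abstract above does not include implementation details; as background, the following is a minimal serial sketch of the bi-directional BFS idea it builds on, assuming an undirected graph in CSR form: expand the smaller of the two frontiers (one growing from the source, one from the target) each round until they meet. All names and the graph representation are illustrative, not the paper's optimized Speedcode implementation.

```cpp
// Minimal serial sketch of bi-directional BFS (illustrative, not the paper's code):
// expand the smaller of the two frontiers each round and stop once they meet.
#include <algorithm>
#include <climits>
#include <cstdint>
#include <vector>

// CSR-style adjacency: the neighbors of v are adj[off[v] .. off[v+1]).
struct Graph {
    std::vector<int64_t> off;
    std::vector<int32_t> adj;
};

// Returns the s-t distance in hops, or -1 if t is unreachable from s.
int bidirectional_bfs(const Graph& g, int32_t s, int32_t t) {
    if (s == t) return 0;
    const int32_t n = static_cast<int32_t>(g.off.size()) - 1;
    std::vector<int> ds(n, -1), dt(n, -1);   // BFS distances from s and from t
    std::vector<int32_t> fs{s}, ft{t};       // current frontiers of both searches
    ds[s] = 0;
    dt[t] = 0;

    while (!fs.empty() && !ft.empty()) {
        // Expand the smaller side: total work stays close to the size of the
        // two "balls" around s and t instead of one large ball.
        const bool from_s = fs.size() <= ft.size();
        auto& frontier = from_s ? fs : ft;
        auto& dist = from_s ? ds : dt;
        auto& other = from_s ? dt : ds;

        int best = INT_MAX;                  // best meeting found in this level
        std::vector<int32_t> next;
        for (int32_t u : frontier) {
            for (int64_t e = g.off[u]; e < g.off[u + 1]; ++e) {
                const int32_t v = g.adj[e];
                if (dist[v] != -1) continue; // already reached from this side
                dist[v] = dist[u] + 1;
                if (other[v] != -1)          // the two searches meet at v
                    best = std::min(best, dist[v] + other[v]);
                next.push_back(v);
            }
        }
        // Finish the whole level before returning so the shortest meeting wins.
        if (best != INT_MAX) return best;
        frontier.swap(next);
    }
    return -1;  // s and t lie in different components
}
```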
The efficiency of breadth-first search (BFS) is predominantly constrained by its poor locality and severe load imbalance. The former primarily arises from random accesses to the neighbors of active vertices (a.k.a. frontiers), while the latter is caused by the skewed distribution of edge traversals among CPU cores. Graph reordering aims to improve data locality by contiguously placing vertices that are likely to be accessed consecutively, and can potentially enhance load balance by distributing workloads more evenly across CPU cores. Prior graph reordering methods generally adopt static structural metrics, such as vertex degree and graph community, to capture vertex locality. However, the data access patterns and load distribution of BFS are intrinsically contingent upon the frontiers generated at runtime, which can vary significantly across inputs and iterations. To this end, we propose a frontier-based locality feature vector that characterizes the locality of frontiers in BFS requests. Building on this feature vector, we propose a graph reordering method named HisOrder, which uses the history of multiple BFS requests to construct the feature vectors and explores vertex locality with K-Means clustering. Furthermore, HisOrder fine-tunes the vertex order to balance workloads among CPU cores by predicting runtime computing intensity from the learned frontier distribution. To enhance the efficiency of HisOrder itself, we eliminate redundant distance computations in K-Means, thereby achieving a significant acceleration. Evaluated on four small-world graph datasets, HisOrder achieves an average performance improvement of 1.20× with reasonable overhead, and consistently surpasses state-of-the-art graph reordering methods.
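HisOrder's exact feature construction and optimized clustering are not given in the abstract; the sketch below only illustrates the flavor of the approach as we read it: use a per-vertex history of frontier activity from sampled BFS runs as a feature vector, cluster vertices with plain K-Means, and emit a new ordering that places each cluster contiguously. The feature definition, parameters, and plain (non-accelerated) K-Means here are assumptions, not HisOrder itself.

```cpp
// Illustrative sketch (not HisOrder): build a per-vertex feature vector from
// the BFS levels observed in a few sampled runs, cluster vertices with
// K-Means, and emit a new vertex order that places each cluster contiguously.
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// features[v] holds, for each sampled BFS run, the (normalized) level at which
// v entered the frontier; vertices with similar histories tend to be active
// together and are therefore placed next to each other.
std::vector<int32_t> cluster_order(const std::vector<std::vector<float>>& features,
                                   int k, int iters = 20, unsigned seed = 1) {
    const int n = static_cast<int>(features.size());
    const int d = static_cast<int>(features[0].size());
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, n - 1);

    // Initialize centroids with k randomly chosen feature vectors.
    std::vector<std::vector<float>> centroid(k);
    for (auto& c : centroid) c = features[pick(rng)];

    std::vector<int> label(n, 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest centroid by squared Euclidean distance.
        for (int v = 0; v < n; ++v) {
            float best = std::numeric_limits<float>::max();
            for (int c = 0; c < k; ++c) {
                float dist = 0.f;
                for (int j = 0; j < d; ++j) {
                    const float diff = features[v][j] - centroid[c][j];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; label[v] = c; }
            }
        }
        // Update step: recompute each centroid as the mean of its members.
        std::vector<std::vector<float>> sum(k, std::vector<float>(d, 0.f));
        std::vector<int> count(k, 0);
        for (int v = 0; v < n; ++v) {
            ++count[label[v]];
            for (int j = 0; j < d; ++j) sum[label[v]][j] += features[v][j];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                for (int j = 0; j < d; ++j) centroid[c][j] = sum[c][j] / count[c];
    }

    // New order: all vertices of cluster 0 first, then cluster 1, and so on.
    std::vector<int32_t> order;
    order.reserve(n);
    for (int c = 0; c < k; ++c)
        for (int v = 0; v < n; ++v)
            if (label[v] == c) order.push_back(v);
    return order;  // order[i] = old ID of the vertex placed at new position i
}
```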
Breadth-first search (BFS) is a fundamental graph algorithm that presents significant challenges for parallel implementation due to irregular memory access patterns, load imbalance, and synchronization overhead. In this paper, we introduce a set of optimization strategies for parallel BFS on multicore systems, including hybrid traversal, a bitmap-based visited set, and a novel non-atomic distance update mechanism. We evaluate these optimizations on two different architectures, a 24-core Intel Xeon platform and a 128-core AMD EPYC system, using a diverse set of synthetic and real-world graphs. Our results demonstrate that the effectiveness of the optimizations varies significantly with graph characteristics and hardware architecture. For small-diameter graphs, our hybrid BFS implementation achieves speedups of 3 -- 8× on the Intel platform and 3 -- 10× on the AMD system over a conventional parallel BFS implementation. For large-diameter graphs, the picture is more nuanced: some optimizations show varied behavior across platforms, including performance degradation in some cases.
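The optimizations are only named in the abstract; as one illustration, the switch between top-down and bottom-up traversal is typically driven by comparing the frontier's outgoing-edge count against the edges incident to still-unvisited vertices (the widely used Beamer-style heuristic). The serial sketch below shows that control loop; the alpha and beta thresholds are placeholder values, and the paper's parallelization, bitmap layout, and non-atomic distance update are not reproduced.

```cpp
// Illustrative direction-optimizing (hybrid) BFS, serial for clarity.
// The per-level direction is chosen with a Beamer-style heuristic:
// go bottom-up when the frontier is "heavy", back to top-down when it shrinks.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Graph {                       // CSR: neighbors of v are adj[off[v]..off[v+1])
    std::vector<int64_t> off;
    std::vector<int32_t> adj;
    int32_t n() const { return static_cast<int32_t>(off.size()) - 1; }
    int64_t m() const { return static_cast<int64_t>(adj.size()); }
};

std::vector<int> hybrid_bfs(const Graph& g, int32_t src,
                            double alpha = 15.0, double beta = 18.0) {
    const int32_t n = g.n();
    std::vector<int> dist(n, -1);
    std::vector<char> frontier(n, 0), next(n, 0);
    dist[src] = 0;
    frontier[src] = 1;
    int64_t frontier_edges = g.off[src + 1] - g.off[src];
    int64_t frontier_size = 1;
    int64_t unvisited_edges = g.m();
    bool bottom_up = false;
    int level = 0;

    while (frontier_size > 0) {
        // Heuristic: bottom-up once the frontier touches a large fraction of
        // the remaining edges; top-down again when the frontier becomes small.
        if (!bottom_up && frontier_edges > unvisited_edges / alpha)
            bottom_up = true;
        else if (bottom_up && frontier_size < n / beta)
            bottom_up = false;

        int64_t next_size = 0, next_edges = 0;
        if (bottom_up) {
            // Bottom-up: each unvisited vertex scans for a parent in the frontier.
            for (int32_t v = 0; v < n; ++v) {
                if (dist[v] != -1) continue;
                for (int64_t e = g.off[v]; e < g.off[v + 1]; ++e) {
                    if (frontier[g.adj[e]]) {
                        dist[v] = level + 1;
                        next[v] = 1;
                        ++next_size;
                        next_edges += g.off[v + 1] - g.off[v];
                        break;       // one parent is enough
                    }
                }
            }
        } else {
            // Top-down: each frontier vertex pushes to its unvisited neighbors.
            // (Real implementations keep a sparse vertex list here instead of
            // scanning the dense frontier flags.)
            for (int32_t u = 0; u < n; ++u) {
                if (!frontier[u]) continue;
                for (int64_t e = g.off[u]; e < g.off[u + 1]; ++e) {
                    const int32_t v = g.adj[e];
                    if (dist[v] != -1) continue;
                    dist[v] = level + 1;
                    next[v] = 1;
                    ++next_size;
                    next_edges += g.off[v + 1] - g.off[v];
                }
            }
        }
        unvisited_edges -= frontier_edges;
        frontier_edges = next_edges;
        frontier_size = next_size;
        frontier.swap(next);
        std::fill(next.begin(), next.end(), 0);
        ++level;
    }
    return dist;
}
```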
Breadth-First Search (BFS) performance on shared-memory systems is often limited by irregular memory access and cache inefficiencies. This work presents two optimizations for BFS graph traversal: a bitmap-based algorithm designed for small-diameter graphs and MergedCSR, a graph storage format that improves cache locality for large-scale graphs. Experimental results on real-world datasets show an average 1.3× speedup over a state-of-the-art implementation, with MergedCSR reducing RAM accesses by approximately 15%.
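Neither the bitmap layout nor the MergedCSR format is specified in the abstract; the sketch below shows only a generic version of the first ingredient, a bit-packed visited set with an atomic test-and-set, which is the usual way to keep the visited state small enough to stay cache-resident on small-diameter graphs. The class name and memory orderings are illustrative.

```cpp
// Generic sketch of a bit-packed "visited" set for parallel BFS: one bit per
// vertex, set with a compare-and-swap so concurrent threads claim a vertex
// exactly once. This is an illustration, not the paper's data structure.
#include <atomic>
#include <cstdint>
#include <vector>

class VisitedBitmap {
public:
    explicit VisitedBitmap(size_t num_vertices)
        : words_((num_vertices + 63) / 64) {
        for (auto& w : words_) w.store(0, std::memory_order_relaxed);
    }

    // Returns true if the calling thread is the first to visit v.
    bool try_visit(size_t v) {
        std::atomic<uint64_t>& word = words_[v >> 6];
        const uint64_t mask = uint64_t{1} << (v & 63);
        uint64_t old = word.load(std::memory_order_relaxed);
        while (!(old & mask)) {
            if (word.compare_exchange_weak(old, old | mask,
                                           std::memory_order_acq_rel))
                return true;         // we set the bit first
        }
        return false;                // another thread already visited v
    }

    bool is_visited(size_t v) const {
        return (words_[v >> 6].load(std::memory_order_acquire) >> (v & 63)) & 1;
    }

private:
    std::vector<std::atomic<uint64_t>> words_;  // 64 visited flags per word
};
```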
Breadth-First Search (BFS) is a fundamental graph algorithm that traverses a graph level by level. It is widely used in real-world applications such as social network analysis, scientific computing, and web crawling. However, achieving high BFS performance on large-scale graphs remains challenging due to irregular memory access patterns, diverse graph structures, and the need for efficient parallelization. This paper addresses these challenges with a highly optimized parallel BFS implementation built on the top-down and bottom-up traversal strategies. It further integrates several key innovations, including graph type-aware selection of the computation strategy, graph pruning, a two-level bottom-up traversal, and an efficient parallel implementation. We evaluate our method on 11 graphs that are diverse in size, diameter, and density. On a CPU server with 48 threads, it achieves an average speedup of 9.5× over a serial BFS implementation. Moreover, on a synthetic dense graph, it processes 9.3 billion edges per second, demonstrating its efficiency in dense-graph traversal.
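The abstract lists its innovations without implementation detail; as background for the bottom-up direction it builds on, the sketch below shows one OpenMP-parallel bottom-up level: every still-unvisited vertex scans its neighbors and adopts a parent that is already in the frontier. The graph type-aware selection, pruning, and two-level bottom-up refinements from the paper are not reproduced, and all names and the chunk size are illustrative.

```cpp
// Background sketch of one OpenMP-parallel bottom-up BFS level (not the
// paper's implementation): every still-unvisited vertex looks for any
// neighbor in the current frontier and, if found, joins the next frontier.
#include <cstdint>
#include <vector>
#include <omp.h>

// CSR graph: neighbors of v are adj[off[v] .. off[v+1]).
struct Graph {
    std::vector<int64_t> off;
    std::vector<int32_t> adj;
};

// Returns the number of vertices discovered in this level.
int64_t bottom_up_level(const Graph& g, int level,
                        std::vector<int>& dist,              // -1 = unvisited
                        const std::vector<char>& frontier,   // current level
                        std::vector<char>& next) {           // next level (out)
    const int32_t n = static_cast<int32_t>(g.off.size()) - 1;
    int64_t discovered = 0;

    // Each vertex is written by at most one thread (its own iteration), so no
    // atomics are needed on dist/next; the reduction handles the counter.
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+ : discovered)
    for (int32_t v = 0; v < n; ++v) {
        if (dist[v] != -1) continue;                 // already reached
        for (int64_t e = g.off[v]; e < g.off[v + 1]; ++e) {
            if (frontier[g.adj[e]]) {                // neighbor is a parent
                dist[v] = level + 1;
                next[v] = 1;
                ++discovered;
                break;                               // one parent suffices
            }
        }
    }
    return discovered;
}
```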
Breadth-First Search (BFS) is a fundamental graph traversal algorithm widely used in applications such as social network analysis, web crawling, and shortest path computations. However, efficiently parallelizing BFS for large-scale and irregular graph structures remains a significant challenge due to issues like workload imbalance, excessive synchronization, and data locality constraints.
In this paper, we introduce BFSBlitz, a highly parallel graph processing system for BFS on shared-memory systems. BFSBlitz dynamically adapts between TopDown and BottomUp traversals, using hybrid partitioning for load balancing and contention reduction while mitigating data races. Experimental results demonstrate an average speedup of 34.61× over a standard serial BFS, highlighting its efficiency for large-scale graph processing.
Breadth-first search (BFS) and single-source shortest paths (SSSP) are two fundamental graph problems with countless real-world applications. It is of major interest to develop efficient parallel algorithms and implementations to solve these problems on modern multicore architectures. The challenge is that these computations are often irregular, and no known algorithm guarantees polylogarithmic depth for all graphs. For BFS, this paper describes several performance engineering methods that yield a practical parallel implementation for all classes of real-world graphs. For SSSP, we introduce a unique contraction-based preprocessing method which significantly speeds up queries on high-diameter graphs, such as road networks. Our method is generic and can be used with any algorithm for SSSP queries.
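The abstract does not say what is contracted; purely as background, one classic form of contraction on road-network-like graphs collapses chains of degree-2 vertices into single weighted shortcut edges so that later SSSP queries traverse far fewer vertices. The sketch below shows that idea under our own assumptions and is not necessarily the paper's preprocessing method.

```cpp
// Background sketch of one classic contraction idea for road-network-like
// graphs (NOT necessarily the paper's method): collapse every chain of
// degree-2 vertices into a single shortcut edge whose weight is the chain
// length. A full implementation must also remember the removed chains so
// distances to their interior vertices can be reconstructed after a query.
#include <cstdint>
#include <vector>

struct Edge { int32_t to; double w; };
using AdjList = std::vector<std::vector<Edge>>;   // undirected weighted graph

AdjList contract_degree2_chains(const AdjList& g) {
    const int32_t n = static_cast<int32_t>(g.size());
    AdjList out(n);
    std::vector<char> is_interior(n, 0);
    for (int32_t v = 0; v < n; ++v) is_interior[v] = (g[v].size() == 2);

    for (int32_t s = 0; s < n; ++s) {
        if (is_interior[s]) continue;             // only start from endpoints
        for (const Edge& first : g[s]) {
            // Walk along the chain of degree-2 vertices, accumulating weight.
            int32_t prev = s, cur = first.to;
            double w = first.w;
            while (is_interior[cur]) {
                const Edge& e0 = g[cur][0];
                const Edge& step = (e0.to == prev) ? g[cur][1] : e0;
                prev = cur;
                cur = step.to;
                w += step.w;
            }
            // Reached another endpoint: add the shortcut edge for this chain.
            out[s].push_back({cur, w});
        }
    }
    return out;   // contracted graph over the endpoint vertices
}
```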
The Parallel Single-Source Shortest Path (SSSP) problem has been tackled through many implementations, yet no single approach consistently outperforms others across diverse graph structures. Moreover, most implementations require extensive parameter tuning to reach peak performance. In this paper, we introduce the AdaMW scheduler, which dynamically selects between the schedulers Wasp and Multi-Queue. To achieve this, we use graph sampling and heuristics to select and configure the scheduler. In contrast to common state-of-the-art bulk-synchronous implementations, AdaMW is fully asynchronous, thus not needing to stop for global barriers. The resulting scheduler is highly competitive with the best manually-tuned, state-of-the-art implementations.
Shortest-path algorithms have many real-world applications, but their largely sequential nature makes high scalability difficult to achieve. To address this, stepping algorithms, including Δ* and ρ stepping, enable parallel execution by grouping nodes into buckets based on distance ranges. However, existing implementations require manual selection of the optimal method and parameters.
In this work, we propose hyb-stepping, a parallel shortest-path algorithm that dynamically selects between Δ* and ρ stepping based on graph properties. By leveraging a lightweight preprocessing analysis, we classify graphs using four key metrics: (i) undirected-graph average degree, (ii) general scale, (iii) large-diameter scale, and (iv) small-diameter scale. Our method eliminates manual tuning and ensures optimal parameter selection, significantly improving performance. Our results indicate that hyb-stepping achieves a 1.5× speedup over fixed-parameter methods, processing 440M edges per second.
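The four metrics and the classification thresholds are not defined further in the abstract; the sketch below only conveys the flavor of such a lightweight preprocessing pass: estimate the average degree and a cheap BFS-based diameter bound, then choose a stepping variant and a Δ value from those numbers. Every threshold and parameter shown is a placeholder, not a value from the paper.

```cpp
// Illustrative preprocessing pass for selecting a stepping variant (not the
// paper's classifier): estimate average degree and diameter cheaply, then pick
// between a Delta-stepping-like and a rho-stepping-like configuration.
// All thresholds below are placeholders, not values from the paper.
#include <cstdint>
#include <queue>
#include <vector>

struct Graph {                      // unweighted CSR view used for the estimate
    std::vector<int64_t> off;
    std::vector<int32_t> adj;
    int32_t n() const { return static_cast<int32_t>(off.size()) - 1; }
    int64_t m() const { return static_cast<int64_t>(adj.size()); }
};

// Rough lower bound on the diameter: number of BFS levels from one vertex.
static int bfs_levels(const Graph& g, int32_t src) {
    std::vector<int> dist(g.n(), -1);
    std::queue<int32_t> q;
    dist[src] = 0;
    q.push(src);
    int levels = 0;
    while (!q.empty()) {
        const int32_t u = q.front();
        q.pop();
        levels = dist[u];
        for (int64_t e = g.off[u]; e < g.off[u + 1]; ++e)
            if (dist[g.adj[e]] == -1) {
                dist[g.adj[e]] = dist[u] + 1;
                q.push(g.adj[e]);
            }
    }
    return levels;
}

enum class Scheduler { DeltaStepping, RhoStepping };

Scheduler choose_scheduler(const Graph& g, double& delta_out) {
    const double avg_degree = double(g.m()) / double(g.n());
    const int diameter_est = bfs_levels(g, 0);   // single-source estimate

    // Placeholder heuristic: dense, small-diameter graphs favor a
    // rho-stepping-like schedule; sparse, long-diameter graphs (e.g. road
    // networks) favor Delta-stepping with a coarser Delta.
    if (avg_degree > 16.0 && diameter_est < 64) {
        delta_out = 1.0;             // effectively fine-grained buckets
        return Scheduler::RhoStepping;
    }
    delta_out = 32.0 / avg_degree;   // coarser buckets for long, sparse graphs
    return Scheduler::DeltaStepping;
}
```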
The single-source shortest path (SSSP) problem is essential in graph theory, with applications in navigation, biology, social networks, and traffic analysis. The Δ-Stepping algorithm enhances parallelism by grouping vertices into "buckets" based on their tentative distances. However, its performance depends on the choice of Δ and on graph properties. This paper introduces an adaptive parallel Δ-Stepping implementation with three innovations: neighbor reordering, bucket fusion, and graph type-aware Δ selection. Tested on 11 diverse graphs, it achieves an average 7.1× speedup over serial Dijkstra's algorithm on a 48-thread CPU.
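For readers unfamiliar with the baseline the paper adapts, the sketch below is a minimal serial Δ-stepping: vertices are grouped into buckets of width Δ by tentative distance, light edges (weight ≤ Δ) are relaxed while a bucket is drained, and heavy edges are relaxed once per settled vertex. The paper's neighbor reordering, bucket fusion, adaptive Δ selection, and parallelization are not shown.

```cpp
// Minimal serial Delta-stepping sketch (baseline only, not the paper's code).
// Light edges (w <= Delta) are relaxed while a bucket is being emptied;
// heavy edges are relaxed once per settled vertex after the bucket drains.
#include <cstdint>
#include <limits>
#include <vector>

struct WEdge { int32_t to; double w; };
using WGraph = std::vector<std::vector<WEdge>>;   // weighted adjacency lists

std::vector<double> delta_stepping(const WGraph& g, int32_t src, double delta) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(g.size(), INF);
    std::vector<std::vector<int32_t>> buckets(1);

    auto relax = [&](int32_t v, double d) {
        if (d < dist[v]) {
            dist[v] = d;
            const size_t b = static_cast<size_t>(d / delta);
            if (b >= buckets.size()) buckets.resize(b + 1);
            buckets[b].push_back(v);              // lazy: stale copies are skipped
        }
    };
    relax(src, 0.0);

    for (size_t i = 0; i < buckets.size(); ++i) {
        std::vector<int32_t> settled;             // vertices finished in bucket i
        while (!buckets[i].empty()) {
            std::vector<int32_t> current;
            current.swap(buckets[i]);
            for (int32_t u : current) {
                if (static_cast<size_t>(dist[u] / delta) != i) continue;  // stale
                settled.push_back(u);
                for (const WEdge& e : g[u])
                    if (e.w <= delta) relax(e.to, dist[u] + e.w);  // light edges
            }
        }
        // Heavy edges once per settled vertex (duplicates only repeat work).
        for (int32_t u : settled)
            for (const WEdge& e : g[u])
                if (e.w > delta) relax(e.to, dist[u] + e.w);
    }
    return dist;
}
```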