ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1

SESSION: Papers

AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows

Multi-stage serverless applications, i.e., workflows with many computation and I/O stages, are becoming increasingly representative of FaaS platforms. Despite their advantages in terms of fine-grained scalability and modular development, these applications suffer from suboptimal performance, resource inefficiency, and high costs to a larger degree than the simple serverless functions that preceded them.

We present Aquatope, a QoS-and-uncertainty-aware resource scheduler for end-to-end serverless workflows that takes into account the inherent uncertainty present in FaaS platforms, and improves performance predictability and resource efficiency. Aquatope uses a set of scalable and validated Bayesian models to create pre-warmed containers ahead of function invocations, and to allocate appropriate resources at function granularity to meet a complex workflow's end-to-end QoS, while minimizing resource cost. Across a diverse set of analytics and interactive multi-stage serverless workloads, Aquatope significantly outperforms prior systems, reducing QoS violations by 5X, and cost by 34% on average and up to 52% compared to other QoS-meeting methods.
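To make the uncertainty-aware allocation idea concrete, here is a minimal, hypothetical sketch (not Aquatope's actual models or API): a Gaussian-process latency model is queried together with its predictive uncertainty, and the cheapest candidate memory size whose mean-plus-margin latency still meets a per-stage QoS target is selected. The training samples, candidate sizes, and the cheapest_allocation helper are all illustrative assumptions.

```python
# Minimal sketch (not Aquatope's actual code): pick the cheapest per-function
# resource allocation whose predicted latency, padded by model uncertainty,
# still meets the stage's QoS target.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical offline samples: memory size (MB) -> observed stage latency (ms).
X_train = np.array([[128], [256], [512], [1024], [2048]], dtype=float)
y_train = np.array([310.0, 180.0, 95.0, 60.0, 52.0])

model = GaussianProcessRegressor().fit(X_train, y_train)

def cheapest_allocation(stage_qos_ms, candidates=(128, 256, 512, 1024, 2048), k=2.0):
    """Return the smallest memory size whose mean + k*std latency meets the target."""
    for mem in candidates:                        # candidates sorted cheapest-first
        mean, std = model.predict(np.array([[float(mem)]]), return_std=True)
        if mean[0] + k * std[0] <= stage_qos_ms:  # uncertainty-aware check
            return mem
    return candidates[-1]                         # fall back to the largest size

print(cheapest_allocation(stage_qos_ms=100.0))
```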

CAFQA: A Classical Simulation Bootstrap for Variational Quantum Algorithms

Classical computing plays a critical role in the advancement of quantum frontiers in the NISQ era. In this spirit, this work uses classical simulation to bootstrap Variational Quantum Algorithms (VQAs). VQAs rely upon the iterative optimization of a parameterized unitary circuit (ansatz) with respect to an objective function. Since quantum machines are noisy and expensive resources, it is imperative to classically choose the VQA ansatz initial parameters to be as close to optimal as possible to improve VQA accuracy and accelerate their convergence on today’s devices.

This work tackles the problem of finding a good ansatz initialization, by proposing CAFQA, a Clifford Ansatz For Quantum Accuracy. The CAFQA ansatz is a hardware-efficient circuit built with only Clifford gates. In this ansatz, the parameters for the tunable gates are chosen by searching efficiently through the Clifford parameter space via classical simulation. The resulting initial states always equal or outperform traditional classical initialization (e.g., Hartree-Fock), and enable high-accuracy VQA estimations. CAFQA is well-suited to classical computation because: a) Clifford-only quantum circuits can be exactly simulated classically in polynomial time, and b) the discrete Clifford space is searched efficiently via Bayesian Optimization.

For the Variational Quantum Eigensolver (VQE) task of molecular ground state energy estimation (up to 18 qubits), CAFQA’s Clifford Ansatz achieves a mean accuracy of nearly 99% and recovers as much as 99.99% of the molecular correlation energy that is lost in Hartree-Fock initialization. CAFQA achieves mean accuracy improvements of 6.4x and 56.8x over the state-of-the-art on different metrics. The scalability of the approach allows for preliminary ground state energy estimation of the challenging chromium dimer (Cr2) molecule. With CAFQA’s high-accuracy initialization, the convergence of VQAs is shown to accelerate by 2.5x, even for small molecules.

Furthermore, a preliminary exploration of allowing a limited number of non-Clifford (T) gates in the CAFQA framework shows that as much as 99.9% of the correlation energy can be recovered at bond lengths for which Clifford-only CAFQA accuracy is relatively limited, while remaining classically simulable.
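As an illustration of searching a discrete Clifford parameter space classically, the sketch below restricts every tunable rotation to a multiple of pi/2 and optimizes with a simple coordinate-descent sweep. CAFQA itself uses Bayesian Optimization and an exact stabilizer simulator; both are replaced here by the hypothetical clifford_energy placeholder, so this is an illustration of the search structure only.

```python
# Illustrative sketch only (not the CAFQA implementation): restrict each tunable
# rotation to a Clifford angle in {0, pi/2, pi, 3pi/2} and search the discrete
# space by coordinate descent.  `clifford_energy` is a stand-in for an exact,
# polynomial-time classical stabilizer simulation of the Clifford-only ansatz.
import math

CLIFFORD_ANGLES = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]

def clifford_energy(params):
    # Hypothetical placeholder: in CAFQA this would be the Hamiltonian
    # expectation value computed with a stabilizer simulator.
    return sum(math.cos(p) for p in params)

def coordinate_descent(num_params, sweeps=5):
    params = [0.0] * num_params
    best = clifford_energy(params)
    for _ in range(sweeps):
        for i in range(num_params):
            for angle in CLIFFORD_ANGLES:
                trial = params[:i] + [angle] + params[i + 1:]
                energy = clifford_energy(trial)
                if energy < best:
                    best, params = energy, trial
    return params, best

print(coordinate_descent(num_params=6))
```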

Cooperative Concurrency Control for Write-Intensive Key-Value Workloads

Key-Value Stores (KVS) are foundational infrastructure components for online services. Due to their latency-critical nature, today’s best-performing KVS contain a plethora of full-stack optimizations that commonly target read-mostly, popularity-skewed workloads. Motivated by production studies showing the increased prevalence of write-intensive workloads, we break down the KVS workload space into four distinct classes and argue that current designs are only sufficient for two of them. The reason is that KVS concurrency control protocols expose a fundamental tradeoff: avoiding synchronization by partitioning writes across threads is mandatory for high throughput, but necessarily creates load imbalance that grows with core count and write fraction. We break this tradeoff with C-4, a co-design between NIC hardware and KVS software that judiciously separates write requests into two classes: independent ones that can be balanced across threads, and dependent ones that must be queued. C-4 dynamically partitions independent writes with the NIC to increase the load balancing flexibility of current KVS designs, and adds a software layer to the KVS to compact dependent writes into batches. Our evaluation shows that for write-intensive workloads, C-4 reduces 99th-percentile tail latency by 1.3-5× and improves throughput by up to 1.7×.
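The write-classification idea can be sketched roughly as follows (hypothetical code, not C-4's NIC/KVS implementation): writes to keys with no in-flight write are treated as independent and sent to the least-loaded worker, while writes to busy keys are queued into per-key batches that are handed back, compacted, when the in-flight write completes.

```python
# Minimal sketch of independent/dependent write separation (illustrative only).
from collections import defaultdict

class WriteDispatcher:
    def __init__(self, num_workers):
        self.worker_load = [0] * num_workers          # outstanding writes per worker
        self.in_flight_keys = set()                   # keys with an unfinished write
        self.dependent_batches = defaultdict(list)    # key -> queued dependent writes

    def dispatch(self, key, value):
        if key in self.in_flight_keys:
            # Dependent write: must be serialized after the in-flight one.
            self.dependent_batches[key].append(value)
            return None
        # Independent write: balance to the least-loaded worker (NIC-side in C-4).
        worker = min(range(len(self.worker_load)), key=self.worker_load.__getitem__)
        self.worker_load[worker] += 1
        self.in_flight_keys.add(key)
        return worker

    def complete(self, key, worker):
        self.worker_load[worker] -= 1
        self.in_flight_keys.discard(key)
        return self.dependent_batches.pop(key, [])    # compacted batch to apply next

d = WriteDispatcher(num_workers=4)
print(d.dispatch("k1", "v1"), d.dispatch("k1", "v2"))  # worker id, then None (dependent)
```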

DecoMine: A Compilation-Based Graph Pattern Mining System with Pattern Decomposition

Graph pattern mining (GPM) is an important application that identifies structures in graphs. Despite recent progress, the performance gap between state-of-the-art GPM systems and an efficient algorithm, pattern decomposition, is still at least an order of magnitude. This paper removes the fundamental obstacles to adopting pattern decomposition in a GPM system.

First, the performance of a pattern decomposition algorithm depends on how the whole pattern is decomposed into subpatterns. The original method performs complexity analysis of the algorithms for the different choices and selects the one with the lowest complexity upper bound; clearly, this approach is not feasible for average or even expert users. To solve this problem, we develop a GPM compiler with conventional and GPM-specific optimizations that generates algorithms for the different decomposition choices, which are then evaluated with an accurate cost model. The executable for the GPM task is built from the algorithm with the best performance. Second, we propose a novel partial-embedding API that is sufficient to construct advanced GPM applications while preserving the advantages of pattern decomposition algorithms. Compared to state-of-the-art systems, our new GPM system, DecoMine, built on these ideas, reduces the execution time of GPM on large graphs and patterns from days to a few hours with low programming effort.
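The compiler's selection step can be pictured with the toy sketch below (hypothetical function names, not DecoMine's API): enumerate candidate decompositions, score each with a cost model, and keep the cheapest one for code generation.

```python
# Illustrative sketch: choose the decomposition with the lowest estimated cost.
def choose_decomposition(pattern, enumerate_decompositions, cost_model):
    """Return the candidate decomposition with the lowest estimated cost."""
    best_decomp, best_cost = None, float("inf")
    for decomp in enumerate_decompositions(pattern):
        cost = cost_model(decomp)        # accurate cost model replaces
        if cost < best_cost:             # worst-case complexity bounds
            best_decomp, best_cost = decomp, cost
    return best_decomp

# Toy usage with stand-in functions: "decompose" a 5-cycle pattern either as a
# whole or into an edge plus a path, and score each choice with a made-up cost.
decomps = lambda p: [("whole", p), ("edge+path", p)]
cost = lambda d: {"whole": 100.0, "edge+path": 12.0}[d[0]]
print(choose_decomposition("C5", decomps, cost))
```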

Erms: Efficient Resource Management for Shared Microservices with SLA Guarantees

A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can differ significantly in their impact on end-to-end performance. Making resource management even more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements.

We present an efficient resource management system, namely Erms, for guaranteeing SLAs in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also introduces new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5× and, more importantly, reduce resource usage by 1.6× compared to state-of-the-art approaches.
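A minimal sketch of piece-wise linear latency profiling, assuming a single breakpoint and using only least-squares line fits, is shown below; Erms's actual models also account for resource usage and interference, which this toy example omits.

```python
# Minimal sketch (not Erms itself): profile latency as a piece-wise linear
# function of workload by searching for one breakpoint and fitting a line
# on each side with least squares.
import numpy as np

def fit_piecewise_linear(workload, latency):
    """Return (breakpoint, (slope1, intercept1), (slope2, intercept2))."""
    order = np.argsort(workload)
    x, y = np.asarray(workload)[order], np.asarray(latency)[order]
    best = None
    for i in range(2, len(x) - 2):                 # candidate breakpoints
        left = np.polyfit(x[:i], y[:i], 1)
        right = np.polyfit(x[i:], y[i:], 1)
        err = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2) +
               np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if best is None or err < best[0]:
            best = (err, x[i], tuple(left), tuple(right))
    return best[1], best[2], best[3]

# Toy data: latency is nearly flat at low load and climbs sharply past saturation.
rps = np.arange(10, 200, 10)
lat = np.where(rps < 120, 5 + 0.01 * rps, 5 + 0.01 * rps + 0.5 * (rps - 120))
print(fit_piecewise_linear(rps, lat))
```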

Glign: Taming Misaligned Graph Traversals in Concurrent Graph Processing

In concurrent graph processing, different queries are evaluated on the same graph simultaneously, sharing the graph accesses via the memory hierarchy. However, different queries may traverse the graph differently, especially for those starting from different source vertices. When these graph traversals are “misaligned”, the benefits of graph access sharing can be seriously compromised. As more concurrent queries are added to the evaluation batch, the issue tends to become even worse.

To address the above issue, this work introduces Glign, a runtime system that automatically aligns the graph traversals for concurrent queries. Glign introduces three levels of graph traversal alignment for iterative evaluation of concurrent queries. First, it synchronizes the accesses of different queries to the active parts of the graph within each iteration of the evaluation—intra-iteration alignment. On top of that, Glign leverages a key insight regarding the “heavy iterations” in query evaluation to achieve inter-iteration alignment and alignment-aware batching. The former aligns the iterations of different queries to increase the graph access sharing, while the latter tries to group queries of better graph access sharing into the same evaluation batch. Together, these alignment techniques can substantially boost the data locality of concurrent query evaluation. Based on our experiments, Glign outperforms the state-of-the-art concurrent graph processing systems Krill and GraphM by 3.6× and 4.7× on average, respectively.
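Alignment-aware batching could be approximated along the following lines (a hypothetical sketch, not Glign's implementation): profile the set of vertices active in each query's heavy iteration and greedily group queries whose heavy-iteration frontiers overlap, so that queries in the same batch share graph accesses.

```python
# Illustrative sketch of alignment-aware batching: group queries whose "heavy
# iteration" touches a similar region of the graph, approximated here by the
# Jaccard similarity of their profiled active vertex sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def batch_queries(heavy_frontiers, batch_size, threshold=0.3):
    """heavy_frontiers: query id -> set of vertices active in its heavy iteration."""
    remaining = dict(heavy_frontiers)
    batches = []
    while remaining:
        seed_id, seed = remaining.popitem()
        batch, merged = [seed_id], set(seed)
        # Pull in the queries that best overlap the batch's combined frontier.
        candidates = sorted(remaining, key=lambda q: jaccard(remaining[q], merged),
                            reverse=True)
        for q in candidates:
            if len(batch) >= batch_size:
                break
            if jaccard(remaining[q], merged) >= threshold:
                merged |= remaining.pop(q)
                batch.append(q)
        batches.append(batch)
    return batches

print(batch_queries({1: {1, 2, 3}, 2: {2, 3, 4}, 3: {9, 10}}, batch_size=2))
```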

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for their size. Intra-layer model parallelism addresses this issue by partitioning individual layers or operators across multiple devices in a distributed accelerator cluster. However, the data communication generated by intra-layer model parallelism can account for a significant proportion of the overall execution time and severely hurt computational efficiency.

As intra-layer model parallelism is critical to enabling large deep learning models, this paper proposes a novel technique to effectively reduce its data communication overhead by overlapping communication with computation. With the proposed technique, an identified communication collective is decomposed, together with the dependent computation operation, into a sequence of finer-grained operations. By creating more overlapping opportunities and executing the newly created, finer-grained communication and computation operations in parallel, the technique effectively hides the data transfer latency and achieves better system utilization. Evaluated on TPU v4 Pods using different types of large models that have 10 billion to 1 trillion parameters, the proposed technique improves system throughput by 1.14-1.38x. The highest peak FLOPS utilization achieved is 72% on 1024 TPU chips with a large language model that has 500 billion parameters.
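The decomposition idea can be pictured with the toy example below: an all-gather followed by a matmul is broken into per-shard partial matmuls that produce the same result, so that on a real accelerator the transfer of one shard could overlap the computation on another. This is an illustrative NumPy sketch, not the paper's TPU/XLA implementation.

```python
# Conceptual sketch: decompose an all-gather + matmul into per-shard steps.
import numpy as np

def allgather_matmul_decomposed(local_shards, weights):
    """Equivalent to concatenating all shards and doing one big matmul."""
    partial_results = []
    for shard in local_shards:
        # In a real system: issue the asynchronous transfer of the next shard
        # here, then compute on the current shard while that transfer is in flight.
        partial_results.append(shard @ weights)       # finer-grained matmul
    return np.concatenate(partial_results, axis=0)    # equals (all-gathered) @ weights

shards = [np.random.rand(4, 8) for _ in range(4)]
w = np.random.rand(8, 16)
assert np.allclose(allgather_matmul_decomposed(shards, w),
                   np.concatenate(shards, axis=0) @ w)
```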

Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures

Dynamic Binary Translation (DBT) is a powerful approach to support cross-architecture emulation of unmodified binaries. However, DBT systems face correctness and performance challenges when emulating concurrent binaries from strong to weak memory consistency architectures. In fact, we report several translation errors in QEMU when emulating x86 binaries on Arm hosts.

To address these challenges, we propose an end-to-end approach that provides correct and efficient emulation for weak memory model architectures. Our contributions are twofold. First, we formalize the memory model of QEMU’s intermediate representation and use it to propose formally verified mapping schemes that bridge the strong-on-weak memory consistency mismatch. Second, we implement these verified mappings in Risotto, a QEMU-based DBT system that optimizes memory fence placement while ensuring correctness. Risotto further enhances emulation performance via cross-architecture dynamic linking of native shared libraries, and fast and correct translation of compare-and-swap operations.

We evaluate Risotto using multi-threaded benchmark suites and real-world applications, and show that Risotto improves the emulation performance by 6.7% on average over “erroneous” QEMU, while ensuring correctness.
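To give a feel for what a mapping scheme looks like, the sketch below drives a toy table-driven translation pass that rewrites guest x86 loads and stores into Armv8 acquire/release accesses. Mapping every access to LDAR/STLR is a deliberately conservative scheme (it enforces more ordering than x86-TSO requires); Risotto's formally verified mappings place fences far more selectively. The table and helper are illustrative, not Risotto's code.

```python
# Illustrative, conservative x86-to-Armv8 mapping table (not Risotto's mappings).
CONSERVATIVE_X86_TO_ARM = {
    "load":  ["ldar  {dst}, [{addr}]"],   # acquire load
    "store": ["stlr  {src}, [{addr}]"],   # release store
}

def translate_access(kind, **operands):
    """Emit the Armv8 instruction(s) for a guest x86 memory access."""
    return [template.format(**operands) for template in CONSERVATIVE_X86_TO_ARM[kind]]

# Toy usage: translate "mov rax, [rbx]" followed by "mov [rcx], rax".
print(translate_access("load", dst="x0", addr="x1"))
print(translate_access("store", src="x0", addr="x2"))
```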

TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators

Memory buffer allocation for on-chip memories is a major challenge in modern machine learning systems that target ML accelerators. In interactive systems such as mobile phones, it is on the critical path of launching ML-enabled applications. In data centers, it is part of complex optimization loops that run many times and are the limiting factor for the quality of compilation results.

In contrast to the traditional memory allocation problem in languages such as C++, where allocation requests dynamically arrive as the application is executing, ML systems typically execute a static control flow graph that is known in advance. The task of the memory allocator is to choose buffer locations in device memory such that the total amount of used memory never exceeds the total memory available on-device. This is a high dimensional, NP-hard optimization problem that is challenging to solve.

Today, ML frameworks approach this problem either using ad-hoc heuristics or solver-based methods. Heuristic solutions work for simple cases but fail for more complex instances of this problem. Solver-based solutions can handle these more complex instances, but are expensive and impractical in scenarios where memory allocation is on the critical path, such as on mobile devices that compile models on-the-fly. We encountered this problem in the development of Google's Pixel 6 phone, where some important models took prohibitively long to compile.

We introduce an approach that solves this challenge by combining constraint optimization with domain-specific knowledge to achieve the best properties of both. We combine a heuristic-based search with a solver to guide its decision making. Our approach matches heuristics for simple inputs while being significantly faster than the best Integer Linear Program (ILP) solver-based approach for complex inputs. We also show how ML can be used to continuously improve the search for the long tail of workloads. Our approach is shipping in two production systems: Google's Pixel 6 phone and TPUv4. It achieves up to two orders of magnitude allocation time speed-up on real ML workloads compared to a highly-tuned production ILP approach that it replaces and enables important real-world models that could not otherwise be supported.
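For intuition, here is a minimal greedy baseline for the static buffer-placement problem (an illustrative sketch, not TelaMalloc): buffers are placed largest-first at the lowest offset that does not collide, in both space and live range, with any already-placed buffer; when the heuristic cannot fit within the memory budget, a solver-based path would take over.

```python
# Minimal greedy baseline for static on-chip buffer placement (illustrative only).
def place_buffers(buffers, memory_size):
    """buffers: list of (name, size, start, end) with [start, end) live ranges.
    Returns {name: offset} on success, or None if the heuristic cannot fit them."""
    placed = []                                        # (offset, size, start, end)
    offsets = {}
    for name, size, start, end in sorted(buffers, key=lambda b: -b[1]):
        offset = 0
        for o, s, st, en in sorted(placed):            # scan in offset order
            overlaps_time = not (end <= st or en <= start)
            if overlaps_time and offset < o + s and o < offset + size:
                offset = o + s                         # bump past the conflicting buffer
        if offset + size > memory_size:
            return None                                # heuristic failed; escalate to solver
        placed.append((offset, size, start, end))
        offsets[name] = offset
    return offsets

print(place_buffers([("a", 64, 0, 3), ("b", 32, 1, 4), ("c", 64, 3, 6)], memory_size=128))
```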