Read-Copy-Update (RCU) is a critical synchronization mechanism for concurrent data structures, enabling efficient deferred memory reclamation. However, implementing and using RCU correctly is challenging due to its inherent concurrency complexities. While previous work verified RCU, they either relied on unrealistic assumptions of sequentially consistent (SC) memory model or lacked three key features of general-purpose RCU libraries: modular specification, switchable critical sections, and concurrent writer support.
We present the first formal verification of a general-purpose RCU in realistic relaxed memory consistency (RMC), addressing the challenges posed by these features. To achieve modular specification that encompasses relaxed behaviors, we extend existing SC specifications to account for explicit synchronization. To support switchable critical sections, which require read-after-write (RAW) synchronization, we introduce a reasoning principle for RAW-synchronizing SC fences. Using this principle, we also present the first formal verification of Peterson’s mutex in RMC. To support concurrent writers performing partially ordered writes, we avoid assuming a total order of links and instead formulate invariants based on per-node incoming link histories. Our proofs are mechanized in the iRC11 relaxed memory separation logic, built upon Iris, in Rocq.
Hazard pointers (HP) is one of the earliest manual memory reclamation algorithms for concurrent data structures. It is widely used for its robustness: memory overhead is bounded (e.g., by the number of threads). To access a node, threads first announce the protection of each to-be-accessed node, which prevents its reclamation. After announcement, they validate the node's reachability from the root to ensure that no threads have missed the announcement and reclaimed it. Traversal-based data structures typically takes a marking-based validation strategy. This strategy uses a node's mark to indicate whether the node is to be detached. Unmarked nodes are considered safe to traverse as both the node and its successors are still reachable, while marked nodes are considered unsafe. However, this strategy is inapplicable to the efficient optimistic traversal strategy that skips over marked nodes.
We propose a new validation strategy for HP that supports lock-free data structures with optimistic traversal, such as lists, trees, and skip lists. The key idea is to exploit the immutability of marked nodes, and validate their reachability at once by checking the reachability of the most recent unmarked node. To ensure correctness, we prove the safety of Harris's list protected with the new strategy in Rocq using the Iris separation logic framework. We show that the new strategy's performance is competitive with state-of-the-art reclamation algorithms when applied to data structures with optimistic traversal, while remaining simple and robust.
We report the first formal verification of a lock-free list, skiplist, and a skiplist-based priority queue against a strong specification in relaxed memory consistency (RMC). RMC allows relaxed behaviors in which memory accesses may be reordered with other operations, posing two significant challenges for the verification of lock-free traversals. (1) Specification challenge: formulating a specification that is flexible enough to capture relaxed behaviors, yet simple enough to be easily understood and used. We address this challenge by proposing the per-key linearizable history specification that enforces a total order of operations for each key that respects causality, rather than a total order of all operations. (2) Verification challenge: devising verification techniques for reasoning about the reachability of edges for traversing threads, which can read stale edges due to relaxed behaviors. We address this challenge by introducing the shadowed-by relation that formalizes the notion of outdated edges. This relation enables us to establish a total order of edges and thus their associated operations for each key, required to satisfy the strong specification. All our proofs are mechanized on the iRC11 relaxed memory separation logic, built on the Iris framework in Rocq.
There has been a recent upsurge of interest in formal, machine-checked verification of timing guarantees for C implementations of real-time system schedulers. However, prior work has only considered tick-based schedulers, which enjoy a clearly defined notion of time: the time "quantum". In this work, we present a new approach to real-time systems verification for interrupt-free schedulers, which are commonly used in deeply embedded and resource-constrained systems but which do not enjoy a natural notion of periodic time. Our approach builds on and connects two recently developed Rocq-based systems—RefinedC (for foundational C verification) and Prosa (for verified response-time analysis)—adapting the former to reason about timed traces and the latter to reason about overheads. We apply the resulting system, which we call RefinedProsa, to verify Rössl, a simple yet representative, fixed-priority, non-preemptive, interrupt-free scheduler implemented in C.
Separation logic (SL) has recently evolved at an exciting pace, opening the way to more complex goals, notably soundness proof of Rust's ownership type system and functional verification of Rust programs. In this paper, we address verification of termination in the presence of advanced features, especially Rust's ownership types. Perhaps surprisingly, this goal cannot be achieved by a simple application of existing studies that dealt only with safety properties. For high-level reasoning about advanced shared mutable state as used in Rust, they used higher-order ghost state (i.e., logical state that depends on SL assertions), but in a way that depends on the later modality, a fundamental obstacle to verifying termination.
To solve this situation, we propose a novel general framework, Nola, which achieves later-free higher-order ghost state. Even in the presence of advanced features such as invariants and borrows, Nola enables natural termination verification, allowing arbitrary induction in the meta-logic. Its key idea is to parameterize higher-order ghost state, generalizing and subsuming the existing approach. Nola is fully mechanized in Rocq as a library of Iris. Moreover, to demonstrate the power of Nola, we develop a prototype of RustHalt, the first semantic and mechanized foundation for total correctness verification of Rust programs.
Generating random variates is a fundamental operation in diverse areas of computer science and is supported in almost all modern programming languages. Traditional software libraries for random variate generation are grounded in the idealized "Real-RAM" model of computation, where algorithms are assumed to be able to access uniformly distributed real numbers from the unit interval and compute with infinite-precision real arithmetic. These assumptions are unrealistic, as any software implementation of a Real-RAM algorithm on a physical computer can instead access a stream of individual random bits and computes with finite-precision arithmetic. As a result, existing libraries have few theoretical guarantees in practice. For example, the actual distribution of a random variate generator is generally unknown, intractable to quantify, and arbitrarily different from the desired distribution; causing runtime errors, unexpected behavior, and inconsistent APIs.
This article introduces a new approach to principled and practical random variate generation with formal guarantees. The key idea is to first specify the desired probability distribution in terms of a finite-precision numerical program that defines its cumulative distribution function (CDF), and then generate exact random variates according to this CDF. We present a universal and fully automated method to synthesize exact random variate generators given any numerical CDF implemented in any binary number format, such as floating-point, fixed-point, and posits. The method is guaranteed to operate with the same precision used to specify the CDF, does not overflow, avoids expensive arbitrary-precision arithmetic, and exposes a consistent API. The method rests on a novel space-time optimal implementation for the class of generators that attain the information-theoretically optimal Knuth and Yao entropy rate, consuming the least possible number of input random bits per output variate. We develop a random variate generation library using our method in C and evaluate it on a diverse set of "continuous" and "discrete" distributions, showing competitive runtime with the state-of-the-art GNU Scientific Library while delivering higher accuracy, entropy efficiency, and automation.
Dynamic race detection based on the happens before (HB) partial order has now become the de facto approach to quickly identify data races in multi-threaded software. Most practical implementations for detecting these races use timestamps to infer causality between events and detect races based on these timestamps. Such an algorithm updates timestamps (stored in vector clocks) at every event in the execution, and is known to induce excessive overhead. Random sampling has emerged as a promising algorithmic paradigm to offset this overhead. It offers the promise of making sound race detection scalable. In this work we consider the task of designing an efficient sampling based race detector with low overhead for timestamping when the number of sampled events is much smaller than the total events in an execution. To solve this problem, we propose (1) a new notion of freshness timestamp, (2) a new data structure to store timestamps, and (3) an algorithm that uses a combination of them to reduce the cost of timestamping in sampling based race detection. Further, we prove that our algorithm is close to optimal --- the number of vector clock traversals is bounded by the number of sampled events and number of threads, and further, on any given dynamic execution, the cost of timestamping due to our algorithm is close to the amount of work any timestamping-based algorithm must perform on that execution, that is it is instance optimal. Our evaluation on real world benchmarks demonstrates the effectiveness of our proposed algorithm over prior timestamping algorithms that are agnostic to sampling.
Many quantum algorithms instantiate and use ancillas, spare qubits that serve as temporary storage in a quantum circuit. In particular, many recently developed high-level and modular quantum programming languages (QPLs) use ancilla qubits to implement various programming constructs. These are lowered to circuits with nested/cascading compute-uncompute gate sequences that use ancilla qubits to track internal state. We present SPARE, a rewrite-based quantum circuit optimizer that restructures these compute-uncompute gate sequences, leveraging the ancilla qubit state information to optimize the circuit. In this work, we prove the correctness of SPARE’s rewrites and link SPARE’s gate-level transforms to language-level program rewrites, which may be performed on the input language. We evaluate SPARE on QPL-generated quantum circuits against Unqomp and Spire, two optimizing compilers for QPLs. SPARE achieves a reduction of up to 27.3% in qubit count, 56.7% in 2-qubit gates, 68.2% in 1-qubit gates and 73.9% in depth against Unqomp, and up to 17.8% in qubits, 67.3% in 2-qubit gates, 61.4% in 1-qubit gates and 59.9% in depth against Spire. We also evaluate SPARE against the Quartz, Feynman, and PyZX circuit optimizers: SPARE achieves up to a 70.0% reduction in two-qubit gates, up to a 53.6% reduction in 1-qubit gates, and up to a 56.7% reduction in depth compared to the best result from all the gate-level optimizers.
Alias analysis is a fundamental compiler analysis that powers numerous optimizations. While research has focused on deriving more precise alias information assuming that the compiler will optimize better, recent work shows a negligible, or even negative, performance impact of alias information. In this work, we shift the perspective from refining to relaxing alias information, i.e., removing information, to complement existing work and challenge that assumption systematically. Our study on a state-of-the-art compiler, LLVM, running the SPEC CPU 2017 benchmark suite, shows (1) a small overall impact—removing alias analysis entirely has little impact on the final binaries, (2) few influential queries—only a small fraction, namely ∼3%, of the alias information leads to changes in the final binary, and (3) lost potential—random relaxations can reduce execution time by 21% and binary size by 39% for certain cases, suggesting that compilers could better utilize alias information. Through this work, we advocate that it is beneficial for future research to avoid simply refining the general precision of alias analysis, but also to explore how to find and refine the most relevant queries, and how to more effectively utilize alias information.
Approximations play a pivotal role in verifying deep neural networks (DNNs). Existing approaches typically rely on either single-neuron approximations (simpler to design but less precise) or multi-neuron approximations (higher precision but significantly more complex to construct). Between them, a notable gap exists.
This work bridges the gap. The idea is to lift single-neuron approximations into multi-neuron approximations with precision gain. To this end, we formulate the approximation transition as a novel problem, named Convex Approximation Lifting (CAL), and propose a constructive approach, Support Triangle Machine (STM), to solving it. STM is grounded in two core insights: (i) there exists a simple geometric structure, called the support triangle, along with an efficient triangle lifting technique that connects single-neuron approximations and multi-neuron approximations; and (ii) typical single-neuron approximations can be easily decomposed into multiple atomically liftable components. Specifically, given a CAL instance, STM constructs a multi-neuron approximation by iteratively processing each output coordinate. For each coordinate, it decomposes the single-neuron approximation into several linear parts, lifts each of them using the triangle lifting technique, and then synthesize an intermediate approximation, which later servers as input for the next iteration.
We theoretically prove the correctness of STM and empirically evaluate its performance on a variety of CAL problems and DNN verification tasks. Experimental results demonstrate STM's broad applicability, improved precision, and sustained efficiency. Beyond DNN verification, STM has the potential to facilitate approximation construction process in more general tasks, and we expect it to catalyze further research in related fields.
Spatial dataflow architectures (SDAs) are a promising and versatile accelerator platform. They are software-programmable and achieve near-ASIC performance and energy efficiency, beating CPUs by orders of magnitude. Unfortunately, many SDAs struggle to efficiently implement irregular computations because they suffer from an abstraction inversion: they fail to capture coarse-grain dataflow semantics in the application — namely asynchronous communication, pipelining, and queueing — that are naturally supported by the dataflow execution model and existing SDA hardware.
Ripple is a language and architecture that corrects the abstraction inversion by preserving dataflow semantics down the stack. Ripple provides asynchronous iterators, shared-memory atomics, and a familiar task-parallel interface to concisely express the asynchronous pipeline parallelism enabled by an SDA. Ripple efficiently implements deadlock-free, asynchronous task communication by exposing hardware token queues in its ISA. Across nine important workloads, compared to a recent ordered-dataflow SDA, Ripple shrinks programs by 1.9×, improves performance by 3×, increases IPC by 58%, and reduces dynamic instructions by 44%.
We develop StacKAT, a network verification language featuring loops, finite state variables, nondeterminism, and---most importantly---access to a stack with accompanying push and pop operations. By viewing the variables and stack as the (parsed) headers and (to-be-parsed) contents of a network packet, StacKAT can express a wide range of network behaviors including parsing, source routing, and telemetry. These behaviors are difficult or impossible to model using existing languages like NetKAT. We develop a decision procedure for StacKAT program equivalence, based on finite automata. This decision procedure provides the theoretical basis for verifying network-wide properties and is able to provide counterexamples for inequivalent programs. Finally, we provide an axiomatization of StacKAT equivalence and establish its completeness.
Fully Homomorphic Encryption (FHE) is a cryptographic technique that enables privacy-preserving computation. State-of-the-art Boolean FHE implementations provide a very low-level interface, usually exposing a limited set of Boolean gates that programmers must use to write their FHE applications. This programming model is unnecessarily restrictive: many Boolean FHE schemes support programmable bootstrapping, an operation that allows evaluation of an arbitrary fixed-size lookup table. However, most modern FHE compilers are only capable of reasoning about traditional Boolean circuits, and therefore struggle to take full advantage of programmable bootstrapping.
We present COATL, an FHE compiler that makes use of programmable bootstrapping to produce circuits that are smaller and more efficient than their traditional Boolean counterparts. COATL generates circuits using arithmetic lookup tables, a novel abstraction we introduce for reasoning about computations in Boolean FHE programs. We demonstrate on a variety of benchmarks that COATL can generate circuits that run up to 1.5× faster than those generated by other state-of-the-art compilation strategies.
There is a tension in dynamic language runtime design between speed and correctness. State-of-the-art JIT compilation, the result of enormous industrial investment and significant research, achieves heroic speedups at the cost of complexity. This complexity leads to subtle and sometimes catastrophic correctness bugs. Much of this complexity comes from the existence of multiple tiers and the need to maintain correspondence between these separate definitions of the language’s semantics; it also comes from the indirect nature of the semantics implicitly encoded in a compiler backend. One way to address this complexity is to automatically derive, as much as possible, the compiled code from a single source-of-truth, such as the interpreter tier. In this work, we introduce a partial evaluator that can compile a whole guest-language function ahead-of-time, without tracing or profiling, “for free.” This transform unrolls an interpreter function expressed in a standard compiler intermediate representation (static single assignment or SSA) and uses partial evaluation of the interpreter function and its regular control flow to drive the guest-language compilation. The effect of this is that the transform is applicable to almost unmodified existing interpreters in systems languages such as C or C++, producing ahead-of-time guest-language compilers. We show the effectiveness of this new tool by applying it to the interpreter tier of an existing industrial JavaScript engine, SpiderMonkey, yielding 2.17× speedups, and the PUC-Rio Lua interpreter, yielding 1.84× speedups. Finally, we outline an approach to carry this work further, deriving more of the capabilities of a JIT backend from first principles while retaining correctness.
The C and C++ languages define hundreds of cases as having undefined behavior (UB). These include, for example, corner cases where different CPU architectures disagree on the semantics of an instruction and the language does not want to force a specific implementation (e.g., shift by a value larger than the bitwidth). Another class of UB involves errors that the language chooses not to detect because it would be too expensive or impractical, such as dereferencing out-of-bounds pointers.
Although there is a common belief within the compiler community that UB enables certain optimizations that would not be possible otherwise, no rigorous large-scale studies have been conducted on this subject. At the same time, there is growing interest in eliminating UB from programming languages to improve security.
In this paper, we present the first comprehensive study that examines the performance impact of exploiting UB in C and C++ applications across multiple CPU architectures. Using LLVM, a compiler known for its extensive use of UB for optimizations, we demonstrate that, for the benchmarks and UB categories that we evaluated, the end-to-end performance gains are minimal. Moreover, when performance regresses, it can often be recovered through small improvements to optimization algorithms or by using link-time optimizations.
Certified compilers are complex software systems. Like other large systems, they demand modular, extensible designs. While there has been progress in extensible metatheory mechanization, scaling extensibility and reuse to meet the demands of full compiler verification remains a major challenge.
We respond to this challenge by introducing novel expressive power to a proof language. Our language design equips the Rocq prover with an extensibility mechanism inspired by the object-oriented ideas of late binding, mixin composition, and family polymorphism. We implement our design as a plugin for Rocq, called Rocqet. We identify strategies for using Rocqet’s new expressive power to modularize the monolithic design of large certified developments as complex as the CompCert compiler. The payoff is a high degree of modularity and reuse in the formalization of intermediate languages, ISAs, compiler transformations, and compiler extensions, with the ability to compose these reusable components—certified compilers à la carte. We report significantly improved proof-compilation performance compared to earlier work on extensible metatheory mechanization. We also report good performance of the extracted compiler.
Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA’s Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called tasks that operate on tensors and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.
A singular function is a partial function such that at one or more points, the left and/or right limit diverge (e.g., the function 1/x). Since programming languages typically support division, programs may denote singular functions. Although on its own, a singularity may be considered a bug, introducing a division-by-zero error, singular integrals—a version of the integral that is well-defined when the integrand is a singular function and the domain of integration contains a singularity—arise in science and engineering, including in physics, aerodynamics, mechanical engineering, and computer graphics.
In this paper, we present the first semantics of a programming language for singular integration. Our differentiable programming language, SingularFlow, supports the evaluation and differentiation of singular integrals. We formally define the denotational semantics of SingularFlow, deriving all the necessary mathematical machinery so that this work is rigorous and self-contained. We then define an operational semantics for SingularFlow that estimates integrals and their derivatives using Monte Carlo samples, and show that the operational semantics is a well-behaved estimator for the denotational semantics.
We implement SingularFlow in JAX and evaluate the implementation on a suite of benchmarks that perform the finite Hilbert transform, an integral transform related to the Fourier transform, which arises in domains such as physics and electrical engineering. We then use SingularFlow to approximate the solutions to four singular integral equations—equations where the unknown function is in the integrand of a singular integral—arising in aerodynamics and mechanical engineering.
When a program synthesis task starts from an ambiguous specification, the synthesis process often involves an iterative specification refinement process. We introduce the Programming by Navigation Synthesis Problem, a new synthesis problem adapted specifically for supporting iterative specification refinement in order to find a particular target solution. In contrast to prior work, we prove that synthesizers that solve the Programming by Navigation Synthesis Problem show all valid next steps (Strong Completeness) and only valid next steps (Strong Soundness). To meet the demands of the Programming by Navigation Synthesis Problem, we introduce an algorithm to turn a type inhabitation oracle (in the style of classical logic) into a fully constructive program synthesizer. We then define such an oracle via sound compilation to Datalog. Our empirical evaluation shows that this technique results in an efficient Programming by Navigation synthesizer that solves tasks that are either impossible or too large for baselines to solve. Our synthesizer is the first to guarantee that its specification refinement process satisfies both Strong Completeness and Strong Soundness.
Data races are a prevalent class of concurrency bugs in shared-memory parallel programs, posing significant challenges to software reliability and reproducibility. While there is an extensive body of research on detecting data races and a wealth of practical detection tools across various programming languages, considerably less effort has been directed toward automatically fixing data races at an industrial scale. In large codebases, data races are continuously introduced and exhibit myriad patterns, making automated fixing particularly challenging.
In this paper, we tackle the problem of automatically fixing data races at an industrial scale. We present Dr.Fix, a tool that combines large language models (LLMs) with program analysis to generate fixes for data races in real-world settings, effectively addressing a broad spectrum of racy patterns in complex code contexts. Implemented for Go—the programming language widely used in modern microservice architectures where concurrency is pervasive and data races are common—Dr.Fix seamlessly integrates into existing development workflows. We detail the design of Dr.Fix and examine how individual design choices influence the quality of the fixes produced. Over the past 18 months, Dr.Fix has been integrated into developer workflows at Uber demonstrating its practical utility. During this period, Dr.Fix produced patches for 224 (55%) from a corpus of 404 data races spanning various categories; 193 of these patches (86%) were accepted by more than a hundred developers via code reviews and integrated into the codebase.
A core design principle of C++ is that users should only incur costs for features they actually use, both in terms of performance and code size. A notable exception to this rule is the run-time type information (RTTI) data, used for dynamic downcasts, exceptions, and run-time type introspection.
For classes that define at least one virtual method, compilers generate RTTI data that uniquely identifies the type, including a string for the type name. In large programs with complex type inheritance hierarchies, this RTTI data can grow substantially in size. Moreover, dynamic casting algorithms are linear in the type hierarchy size, causing some programs to spend considerable time on these casts.
The common workaround is to use the -fno-rtti compiler flag, which disables RTTI data generation. However, this approach has significant drawbacks, such as disabling polymorphic exceptions and dynamic casts, and requiring the flag to be applied across the entire program due to ABI changes.
In this paper, we propose a new link-time optimization to mitigate both the performance and size overhead associated with dynamic casts and RTTI data. Our optimization replaces costly library calls for downcasts with short instruction sequences and eliminates unnecessary RTTI data by modifying vtables to remove RTTI slots. Our prototype, implemented in the LLVM compiler, demonstrates an average speedup of 1.4
Optimizing compilers, such as LLVM, generate debug information in machine code to aid debugging. This information is particularly important when debugging optimized code, as modern software is often compiled with optimization enabled. However, properly updating debug information to reflect code transformations during optimization is a complex task that often relies on manual effort. This complexity makes the process prone to errors, which can lead to incorrect or lost debug information. Finding and fixing potential debug information update errors is vital to maintaining the accuracy and reliability of the overall debugging process. To our knowledge, no existing techniques can rectify debug information update errors in LLVM. While black-box testing approaches can find such bugs, they can neither pinpoint the root causes nor suggest fixes. To fill the gap, we propose the first technique to robustify debug information updates in LLVM. In particular, our robustification approach can find and fix incorrect debug location updates. Central to our approach is the observation that the debug locations in the original and optimized programs must satisfy a conformance relation. The relation ensures that LLVM optimizations do not introduce extraneous debug location information on the control-flow paths of the optimized programs. We introduce control-flow conformance analysis, a novel approach that determines the reference updates ensuring the conformance relation by observing the execution of LLVM optimization passes and analyzing the debug locations in the control-flow graphs of programs under optimization. The determined reference updates are then used to check developer-written updates in LLVM. When discrepancies arise, the reference updates serve as the update skeletons to guide the fixing. We realized our approach as a tool named MetaLoc, which determines proper debug location updates for LLVM optimizations. More importantly, with MetaLoc, we have reported and patched 46 previously unknown update errors in LLVM. All the patches, along with 22 new regression tests, have been merged into the LLVM codebase, effectively improving the accuracy and reliability of debug information in all programs optimized by LLVM. Furthermore, our approach uncovered and led to corrections in two issues within LLVM’s official documentation on debug information updates.
We introduce a novel decision procedure for solving the class of position string constraints, which includes string disequalities, ¬prefixof, ¬suffixof, str.at, and ¬str.at. These constraints are generated frequently in almost any application of string constraint solving. Our procedure avoids expensive encoding of the constraints to word equations and, instead, reduces the problem to checking conflicts on positions satisfying an integerconstraint obtained from the Parikh image of a polynomial-sized finite automaton with a special structure. By the reduction to counting, solving position constraints becomes NP-complete and for some cases even falls into PTime. This is much cheaper than the previously used techniques, which either used reductions generating word equations and length constraints (for which modern string solvers use exponential-space algorithms) or incomplete techniques. Our method is relevant especially for automata-based string solvers, which have recently achieved the best results in terms of practical efficiency, generality, and completeness guarantees. This work allows them to excel also on position constraints, which used to be their weakness. Besides the efficiency gains, we show that our framework may be extended to solve a large fragment of ¬contains (in NExpTime), for which decidability has been long open, and gives a hope to solve the general problem. Our implementation of the technique within the Z3-Noodler solver significantly improves its performance on position constraints.
Quantum Hamiltonian simulation, fundamental in quantum algorithm design, extends far beyond its foundational roots, powering diverse quantum computing applications. However, optimizing the compilation of quantum Hamiltonian simulation poses significant challenges. Existing approaches fall short in reconciling deterministic and randomized compilation, lack appropriate intermediate representations, and struggle to guarantee correctness. Addressing these challenges, we present MarQSim, a novel compilation framework. MarQSim leverages a Markov chain-based approach, encapsulated in the Hamiltonian Term Transition Graph, adeptly reconciling deterministic and randomized compilation benefits. Furthermore, we formulate a Minimum-Cost Flow model that can tune transition matrices to enforce correctness while accommodating various optimization objectives. Experimental results demonstrate MarQSim's superiority in generating more efficient quantum circuits for simulating various quantum Hamiltonians while maintaining precision.
Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic features of general-purpose programming languages. However, LLMs frequently generate code with typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. We formalize our approach on a foundational simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on the HumanEval and MBPP datasets shows that our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including state-of-the-art open-weight models with more than 30B parameters. The results demonstrate the generality and effectiveness of our approach in constraining LLM code generation with formal rules of type systems.
Incorrect compiler optimizations can lead to unintended program behavior and security vulnerabilities. However, the enormous size and complexity of modern compilers make it challenging to ensure the correctness of optimizations. The problem becomes more severe as compiler engineers continuously add new optimizations to improve performance and support new language features. In this paper, we propose Optimuzz, a framework to effectively detect incorrect optimization bugs in such continuously changing compilers. The key idea is to combine two complementary techniques: directed grey-box fuzzing and translation validation. We design a novel optimization-directed fuzzing framework that efficiently generates input programs to trigger specific compiler optimizations. Optimuzz then use existing translation validation tools to verify the correctness of the optimizations on the input programs. We instantiate our approach for two major compilers, LLVM and TurboFan. The results show that Optimuzz can effectively detect miscompilation bugs in these compilers compared to the state-of-the-art tools. We also applied Optimuzz to the latest version of LLVM and discovered 55 new miscompilation bugs.
It is a long-standing open problem to support verified compilation of multi-threaded programs compositionally when sharing of stack data between threads is allowed. Although certain solutions exist on paper, none of them is completely formalized because of the difficulty in simultaneously enabling sharing and forbidding modification of stack memory in presence of arbitrary memory operations (e.g., pointer arithmetic). We present a compiler verification framework that solves this open problem in the setting of cooperative multi-threading. To address the challenges of sharing stack data, we introduce threaded Kripke memory relations (TKMR) to support both protection and sharing of stacks in a multi-stack memory model. We further introduce threaded forward simulations parameterized by TKMR to capture semantics preservation for compiling program modules in multi-threaded contexts. We show that threaded forward simulations are both horizontally composable—thereby enabling the compositional verification of open threads and heterogeneous modules—and vertically composable—thereby enabling composition of compiler correctness for multiple compiler passes. Furthermore, threaded forward simulations can be converted into backward simulations. We apply this framework to 18 passes of CompCert to get CompCertOC, the first optimizing verified compiler that supports compositional verification of cooperative multi-threaded programs with shared stacks.
Dynamic race detection is a highly effective runtime verification technique for identifying data races by instrumenting and monitoring concurrent program runs. However, standard dynamic race detection is incompatible with practical weak memory models; the added instrumentation introduces extra synchronization, which masks weakly consistent behaviors and inherently misses certain data races. In response, we propose to dynamically verify program robustness-a property ensuring that a program exhibits only strongly consistent behaviors. Building on an existing static decision procedure, we develop an algorithm for dynamic robustness verification under a C11-style memory model. The algorithm is based on "location clocks", a variant of vector clocks used in standard race detection. It allows effective and easy-to-apply defense against weak memory on a per-program basis, which can be combined with race detection that assumes strong consistency. We implement our algorithm in a tool, called RSAN, and evaluate it across various settings. To our knowledge, this work is the first to propose and develop dynamic verification of robustness against weak memory models.
Direct manipulation programming gives users a way to write programs without directly writing code, by using the familiar GUI-style interactions they know from direct manipulation interfaces. To date, direct manipulation programming systems have relied on two core components: (1) a patch component, which modifies the program based on a GUI interaction, and (2) a forward evaluator, which executes the modified program to produce an updated program output. This architecture has worked for developing short-running programs—i.e., programs that reliably execute in <1 second—generating outputs such as SVG and HTML documents. However, direct manipulation programming has not yet been applied to long-running programs (e.g., data visualization, mapping), perhaps because executing such programs in response to every GUI interaction would mean crossing outside of interactive speeds. We propose extending direct manipulation programming to long-running programs by pairing a standard patch component (patch) with a corresponding reconciliation component (recon). recon directly updates the program output in response to a GUI interaction, obviating the need for forward evaluation. We introduce corresponding patch and recon procedures for the domain of geospatial data visualization and prove them sound—that is, we show that the output produced by recon is identical to the output produced by forward-evaluating a patch-modified program. recon can operate both incrementally and in parallel with patch. Our implementation of our patch-recon instantiation achieves a 2.92x median reduction in interface latency compared to forward evaluation on a suite of real-world geospatial visualization tasks. Looking forward, our results suggest that patch-reconciliation correspondence offers a promising pathway for extending direct manipulation programming to domains involving large-scale computation.
The Satisfiability Modulo Theory (SMT) problem over floating-point operations presents a significant challenge. State-of-the-art SMT solvers often run into difficulties when dealing with large, complex floating-point constraints. Recently, a new approach to floating-point constraint solving emerges, utilizing mathematical optimization (MO) methods as an engine of their solving approach.
Despite the novelty, these methods can fall short in both effectiveness and efficiency due to issues of the translated functions (e.g., discontinuity) and inherent limitations of their underlying MO method (e.g., imprecise search process, scalability issues).
Driven by these weaknesses of prior solvers, this paper introduces a new MO-based approach that is shown highly potent in solving floating-point constraints. Specifically, on the benchmarks of JFS (a recent solver based on fuzzing), Grater, a realization of our approach, solves as many constraints as Bitwuzla and one more than CVC5 but runs over 10 times faster and over 40 times faster than Bitwuzla and CVC5 in median solving time across all benchmarks. It is worth mentioning that Bitwuzla and CVC5 are the strongest solvers for floating-point constraints according to results of the annual international SMT solver competition (SMT-COMP). Together, they have won all gold medals for QF_FPArith and FPArith divisions, which focus on floating-point constraints solving, over the past three years. To further evaluate Grater, we select over 100 most difficult benchmarks from the FP SMT-LIB, a logic regularly used in SMT-COMP. The difficulty is measured by the complexity of the composition (e.g., number of variables, clauses) and the interdependencies within constraints. Grater again solves the same number of constraints as Bitwuzla and CVC5 while running over 10 times faster than both solvers in average solving time, and over 50 times (resp. 30 times) faster than Bitwuzla (resp. CVC5) in median solving time.
We release the source code of Grater, along with all evaluation data, including detailed comparisons of Grater against each baseline solver (i.e., Z3, CVC5, Bitwuzla, JFS, XSat, and CoverMe), at https://github.com/grater-exp/grater-experiment to facilitate reproducibility.
We define webs to be the collections of producers and consumers (e.g., functions and calls) in a program that are constrained: in higher-order languages, multiple functions can flow to the same call, all of which must agree on an interface (e.g., calling convention). We argue that webs are fundamentally the unit of transformation: a change to one member requires changes across the entire web. We introduce a web-centric intermediate language that exposes webs as annotations, and describe web-based (that is, flow-directed) transformations guided by these annotations. As they affect all members of a web, these transformations are interprocedural, operating over entire modules. Through the lens of webs we reframe and generalize a collection of transformations from the literature, including dead-parameter elimination, uncurrying, and defunctionalization, as well as describe novel transformations. We contrast this approach with rewriting strategies that rely on inlining and cascading rewrites.
Webs are an over-approximation of the semantic function-call relationship produced by control-flow analyses (CFA). This information is inherently independent from the transformations; more precise analyses permit more transformations. A limitation of precise analyses is that the transformations may not maintain well-typedness, as the type system is a less-precise static analysis. Our solution is a simple and lightweight typed-based analysis that causes the flow-directed transformations to preserve well-typedness, making flow-directed, type-preserving transformations easily accessible in many compilers. This analysis builds on unification, distinguishing types that look the same from types that have to be the same. Our experiments show that while our analysis is theoretically less precise, in practice its precision is similar to CFAs.
We present Dependent Lambek Calculus (LambekD), a domain-specific dependent type theory for verified parsing and formal grammar theory. In LambekD, linear types are used as a syntax for formal grammars, and parsers can be written as linear terms. The linear typing restriction provides a form of intrinsic verification that a parser yields only valid parse trees for the input string. We demonstrate the expressivity of this system by showing that the combination of inductive linear types and dependency on non-linear data can be used to encode commonly used grammar formalisms such as regular and context-free grammars as well as traces of various types of automata. Using these encodings, we define parsers for regular expressions using deterministic automata, as well as examples of verified parsers of context-free grammars.
We present a denotational semantics of our type theory that interprets the linear types as functions from strings to sets of abstract parse trees and terms as parse transformers. Based on this denotational semantics, we have made a prototype implementation of LambekD using a shallow embedding in the Agda proof assistant. All of our examples parsers have been implemented in this prototype implementation.
Input-centric program optimization aims to optimize code by considering the relations between program inputs and program behaviors. Despite its promise, a long-standing barrier for its adoption is the difficulty of automatically identifying critical features of complex inputs. This paper introduces a novel technique, reductive analysis through compiler-guided Large Language Models (LLMs), to solve the problem through a synergy between compilers and LLMs. It uses a reductive approach to overcome the scalability and other limitations of LLMs in program code analysis. The solution, for the first time, automates the identification of critical input features without heavy instrumentation or profiling, cutting the time needed for input identification by 44× (or 450× for local LLMs), reduced from 9.6 hours to 13 minutes (with remote LLMs) or 77 seconds (with local LLMs) on average, making input characterization possible to be integrated into the workflow of program compilations. Optimizations on those identified input features show similar or even better results than those identified by previous profiling-based methods, leading to optimizations that yield 92.6% accuracy in selecting the appropriate adaptive OpenMP parallelization decisions, and 20–30% performance improvement of serverless computing while reducing resource usage by 50–60%.
Quantum recursive programming has been recently introduced for describing sophisticated and complicated quantum algorithms in a compact and elegant way. However, implementation of quantum recursion involves intricate interplay between quantum control flow and recursive procedure calls. In this paper, we aim at resolving this fundamental challenge and develop a series of techniques to efficiently implement quantum recursive programs. Our main contributions include:
1. We propose a notion of quantum register machine, the first quantum architecture (including an instruction set) that provides instruction-level support for quantum control flow and recursive procedure calls at the same time.
2. Based on quantum register machine, we describe the first comprehensive implementation process of quantum recursive programs, including the compilation, the partial evaluation of quantum control flow, and the execution on the quantum register machine.
3. As a bonus, our efficient implementation of quantum recursive programs also offers automatic parallelisation of quantum algorithms. For implementing certain quantum algorithmic subroutine, like the widely used quantum multiplexor, we can even obtain exponential parallel speed-up (over the straightforward implementation) from this automatic parallelisation. This demonstrates that quantum recursive programming can be win-win for both modularity of programs and efficiency of their implementation.
The separation logic framework Iris has been built on the premise that all assertions are stable, meaning they unconditionally enjoy the famous frame rule. This gives Iris—and the numerous program logics that build on it—very modular reasoning principles. But stability also comes at a cost. It excludes a core feature of the Viper verifier family, heap-dependent expression assertions, which lift program expressions to the assertion level in order to reduce redundancy between code and specifications and better facilitate SMT-based automation.
In this paper, we bring heap-dependent expression assertions to Iris with Daenerys. To do so, we must first revisit the very core of Iris, extending it with a new form of unstable resources (and adapting the frame rule accordingly). On top, we then build a program logic with heap-dependent expression assertions and lay the foundations for connecting Iris to SMT solvers. We apply Daenerys to several case studies, including some that go beyond what Viper and Iris can do individually and others that benefit from the connection to SMT.
Large-scale, revenue-critical application services are often written in Java or other memory-safe languages whose type systems do not expose immutable state. Such applications are especially exposed to garbage collection performance overheads because latency, throughput, and memory consumption are first-order concerns for service providers. We observe that: i) An important class of server applications are request-based: they scale by concurrently servicing large numbers of quasi-independent requests. ii) Object lifetimes are strongly tied to request lifetimes. iii) Most objects remain private to the request in which they were allocated. iv) Global operations are the primary impediment to responsiveness at scale. If we could perform request-private garbage collection, we might achieve both responsiveness and efficiency at scale. Unfortunately, this straightforward insight runs into significant practical problems. The most obvious of these is that a request-private collection cannot safely move objects that may be referenced outside the scope of that request, and yet moving objects is a requirement of most modern high performance collector designs. This dilemma can be sidestepped by exploiting immutability, which is unfortunately not practical in languages like Java whose type systems do not expose it. We develop Iso, a garbage collector for request-based services that exploits a mark-region heap structure to solve these impediments and deliver outstanding performance. The key contributions of this paper are that: i) We use opportunistic copying to solve the problem of practical thread-local garbage collection for languages without exploitable immutability. ii) We provide the first detailed analysis of the behavior of Java workloads with respect to thread-local collection, identify shortcomings of existing benchmarks and introduce a new one. iii) We design, implement, and evaluate Iso, a practical and effective request-private GC. We show that dynamic tracking of object visibility, a prerequisite for request-private GC, incurs an overhead of just 2% for important request-based workloads including Tomcat and Spring. Iso demonstrates that for suitable workloads, request-based garbage collection is extremely effective, outperforming OpenJDK with its default collector, G1, by 32% and 22% in execution time in a modest heap. This work presents the first request-private garbage collector for Java. It shows a promising way forward for highly responsive collection on an important class of large scale workloads.
We introduce a version of probabilistic Kleene algebra with angelic nondeterminism and a corresponding class of automata. Our approach implements semantics via distributions over multisets in order to overcome theoretical barriers arising from the lack of a distributive law between the powerset and Giry monads. We produce a full Kleene theorem and a coalgebraic theory, as well as both operational and denotational semantics and equational reasoning principles.
Translating software between programming languages is a challenging task, for which automated techniques have been elusive and hard to scale up to larger programs. A key difficulty in cross-language translation is that one has to re-express the intended behavior of the source program into idiomatic constructs of a different target language. This task needs abstracting away from the source language-specific details, while keeping the overall functionality the same. In this work, we propose a novel and systematic approach for making such translation amenable to automation based on a framework we call program skeletons. A program skeleton retains the high-level structure of the source program by abstracting away and effectively summarizing lower-level concrete code fragments, which can be mechanically translated to the target programming language. A skeleton, by design, permits many different ways of filling in the concrete implementation for fragments, which can work in conjunction with existing data-driven code synthesizers. Most importantly, skeletons can conceptually enable sound decomposition, i.e., if each individual fragment is correctly translated, taken together with the mechanically translated skeleton, the final translated program is deemed to be correct as a whole. We present a prototype system called SKEL embodying the idea of skeleton-based translation from Python to JavaScript. Our results show promising scalability compared to prior works. For 9 real-world Python programs, some with more than about 1k lines of code, 95% of their code fragments can be automatically translated, while about 5% require manual effort. All the final translations are correct with respect to whole-program test suites.
Actors are lightweight reactive processes that communicate by asynchronous message-passing. Actors address common problems like concurrency control and fault tolerance, but resource management remains challenging: in all four of the most popular actor frameworks (Pekko, Akka, Erlang, and Elixir) programmers must explicitly kill actors to free up resources. To simplify resource management, researchers have devised actor garbage collectors (actor GCs) that monitor the application and detect when actors are safe to kill. However, existing actor GCs are impractical for distributed systems where the network is unreliable and nodes can fail. The simplest actor GCs do not collect cyclic garbage, whereas more sophisticated actor GCs are not fault-recovering: dropped messages and crashed nodes can cause actors to become garbage that never gets collected.
We present Conflict-free Replicated Garbage Collection (CRGC): the first fault-recovering cyclic actor GC. In CRGC, actors and nodes record information locally and broadcast updates to the garbage collectors running on each node. CRGC does not require locks, explicit memory barriers, or any assumptions about message delivery order, except for reliable FIFO channels from actors to their local garbage collector. Moreover, CRGC is simple: we concisely present its operational semantics, which has been formalized in TLA+, and prove both soundness (non-garbage actors are never killed) and completeness (all garbage actors are eventually killed, under reasonable assumptions). We also present a preliminary implementation in Apache Pekko and measure its performance using two actor benchmark suites. Our results show the performance overhead of CRGC is competitive with simpler approaches like weighted reference counting, while also being much more powerful.
We propose a hybrid approach to end-to-end Rust verification where the proof effort is split into powerful automated verification of safe Rust and targeted semi-automated verification of unsafe Rust. To this end, we present Gillian-Rust, a proof-of-concept semi-automated verification tool built on top of the Gillian platform that can reason about type safety and functional correctness of unsafe code. Gillian-Rust automates a rich separation logic for real-world Rust, embedding the lifetime logic of RustBelt and the parametric prophecies of RustHornBelt, and is able to verify real-world Rust standard library code with only minor annotations and with verification times orders of magnitude faster than those of comparable tools. We link Gillian-Rust with Creusot, a state-of-the-art verifier for safe Rust, by providing a systematic encoding of unsafe code specifications that Creusot can use but cannot verify, demonstrating the feasibility of our hybrid approach.
We present the Quantum Gate Virtualization Machine (QVM), an end-to-end generic system for scalable execution of large quantum circuits with high fidelity on noisy and small quantum processors (QPUs) by leveraging gate virtualization. QVM exposes a virtual circuit intermediate representation (IR) that extends the notion of quantum circuits to incorporate gate virtualization. Based on the virtual circuit as our IR, we propose the QVM compiler—an extensible compiler infrastructure to transpile a virtual circuit through a series of modular optimization passes to produce a set of optimized circuit fragments. Lastly, these transpiled circuit fragments are executed on QPUs using our QVM runtime—a scalable and parallel infrastructure to virtualize and execute circuit fragments on a set of QPUs.
We evaluate QVM on IBM’s 7- and 27-qubit QPUs. Our evaluation shows that our approach allows for the execution of circuits with up to double the number of qubits compared to the qubit-count of a QPU, while improving fidelity by 4.7× on average compared to larger QPUs and that we can effectively reduce circuit depths to only 40% of the original circuit depths.
The Rust programming language is well known for its ownership-based type system, which offers strong guarantees like memory safety and data race freedom. However, Rust also provides unsafe escape hatches, for which safety is not guaranteed automatically and must instead be manually upheld by the programmer. This creates a tension. On the one hand, compilers would like to exploit the strong guarantees of the type system—particularly those pertaining to aliasing of pointers—in order to unlock powerful intraprocedural optimizations. On the other hand, those optimizations are easily invalidated by “badly behaved” unsafe code. To ensure correctness of such optimizations, it thus becomes necessary to clearly define what unsafe code is “badly behaved.” In prior work, Stacked Borrows defined a set of rules achieving this goal. However, Stacked Borrows rules out several patterns that turn out to be common in real-world unsafe Rust code, and it does not account for advanced features of the Rust borrow checker that were introduced more recently.
To resolve these issues, we present Tree Borrows. As the name suggests, Tree Borrows is defined by replacing the stack at the heart of Stacked Borrows with a tree. This overcomes the aforementioned limitations: our evaluation on the 30 000 most widely used Rust crates shows that Tree Borrows rejects 54 % fewer test cases than Stacked Borrows does. Additionally, we prove (in Rocq) that it retains most of the Stacked Borrows optimizations and also enables important new ones, notably read-read reorderings.
Subroutines are essential building blocks in software design: users encapsulate common functionality in libraries and write applications by composing calls to subroutines. Unfortunately, performance may be lost at subroutine boundaries due to reduced locality and increased memory consumption. Operator fusion helps recover the performance lost at composition boundaries. Previous solutions fuse operators by manually rewriting code into monolithic fused subroutines, or by relying on heavy-weight compilers to generate code that performs fusion. Both approaches require a semantic understanding of the entire computation, breaking the decoupling necessary for modularity and reusability of subroutines.
In this work, we attempt to identify the minimal ingredients required to fuse computations, enabling composition of subroutines without sacrificing performance or modularity. We find that, unlike previous approaches that require a semantic understanding of the computation, most opportunities for fusion require understanding only data production and consumption patterns.Exploiting this insight, we add fusion on top of black-box subroutines by proposing a lightweight enrichment of subroutine declarations to expose data-dependence patterns. We implement our approach in a system called Fern, and demonstrate Fern’s benefits by showing that it is competitive with state-of-the-art, high-performance libraries with manually fused operators, can fuse across library and domain boundaries for unforeseen workloads, and can deliver speedups of up to 5× over unfused code.
Quantum error correction (QEC) is fundamental for suppressing noise in quantum hardware and enabling fault-tolerant quantum computation. In this paper, we propose an efficient verification framework for QEC programs. We define an assertion logic and a program logic specifically crafted for QEC programs and establish a sound proof system. We then develop an efficient method for handling verification conditions (VCs) of QEC programs: for Pauli errors, the VCs are reduced to classical assertions that can be solved by SMT solvers, and for non-Pauli errors, we provide a heuristic algorithm. We formalize the proposed program logic in Coq proof assistant, making it a verified QEC verifier. Additionally, we implement an automated QEC verifier, Veri-QEC, for verifying various fault-tolerant scenarios. We demonstrate the efficiency and broad functionality of the framework by performing different verification tasks across various scenarios. Finally, we present a benchmark of 14 verified stabilizer codes.
Differential privacy (DP) has become the gold standard for privacy-preserving data analysis, but implementing it correctly has proven challenging. Prior work has focused on verifying DP at a high level, assuming either that the foundations are correct or that a perfect source of random noise is available. However, the underlying theory of differential privacy can be very complex and subtle. Flaws in basic mechanisms and random number generation have been a critical source of vulnerabilities in real-world DP systems.
In this paper, we present SampCert, the first comprehensive, mechanized foundation for executable implementations of differential privacy. SampCert is written in Lean with over 12,000 lines of proof. It offers a generic and extensible notion of DP, a framework for constructing and composing DP mechanisms, and formally verified implementations of Laplace and Gaussian sampling algorithms. SampCert provides (1) a mechanized foundation for developing the next generation of differentially private algorithms, and (2) mechanically verified primitives that can be deployed in production systems. Indeed, SampCert's verified algorithms power the DP offerings of Amazon Web Services, demonstrating its real-world impact.
SampCert's key innovations include: (1) A generic DP foundation that can be instantiated for various DP definitions (e.g., pure, concentrated, Rényi DP); (2) formally verified discrete Laplace and Gaussian sampling algorithms that avoid the pitfalls of floating-point implementations; and (3) a simple probability monad and novel proof techniques that streamline the formalization.
To enable proving complex correctness properties of DP and random number generation, SampCert makes heavy use of Lean's extensive Mathlib library, leveraging theorems in Fourier analysis, measure and probability theory, number theory, and topology.
NetKAT is a domain-specific programming language and logic that has been successfully used to specify and verify the behavior of packet-switched networks. This paper develops techniques for automatically learning NetKAT models of unknown networks using active learning. Prior work has explored active learning for a wide range of automata (e.g., deterministic, register, Büchi, timed etc.) and also developed applications, such as validating implementations of network protocols. We present algorithms for learning different types of NetKAT automata, including symbolic automata proposed in recent work. We prove the soundness of these algorithms, build a prototype implementation, and evaluate it on a standard benchmark. Our results highlight the applicability of symbolic NetKAT learning for realistic network configurations and topologies.
Choreographic programming (CP) is a paradigm for implementing distributed systems that uses a single global program to define the actions and interactions of all participants. Library-level CP implementations, like HasChor, integrate well with mainstream programming languages but have several limitations: Their conditionals require extra communication; they require specific host-language features (e.g., monads); and they lack support for programming patterns that are essential for implementing realistic distributed applications.
We make three contributions to library-level CP to specifically address these challenges. First, we propose and formalize conclaves and multiply-located values, which enable efficient conditionals in library-level CP without redundant communication. Second, we propose census polymorphism, a technique for abstracting over the number of participants in a choreography. Third, we introduce a design pattern for library-level CP in host languages without support for monads. We demonstrate these contributions via implementations in Haskell, Rust, and TypeScript.
Static analysis of real-world programs combines flow- and context-sensitive analyses of local program states with computation of flow- and context-insensitive invariants at globals, that, e.g., abstract data shared by multiple threads. The values of locals and globals may mutually depend on each other, with the analysis of local program states both making contributions to globals and querying their values. Usually, all contributions to globals are accumulated during fixpoint iteration, with widening applied to enforce termination. Such flow-insensitive information often becomes unnecessarily imprecise and can include superfluous contributions — trash — which, in turn, may be toxic to the precision of the overall analysis. To recover precision of globals, we propose techniques complementing each other: Narrowing on globals differentiates contributions by origin; reluctant widening limits the amount of widening applied at globals; and finally, abstract garbage collection undoes contributions to globals and propagates their withdrawal. The experimental evaluation shows that these techniques increase the precision of mixed flow-sensitive analyses at a reasonable cost.
We introduce a new family of abstractions based on a data structure that we call labeled union-find, an extension of the classic efficient union-find data structure where edges carry labels. These labels have a composition operation that obey the group axioms. Like union-find, the labeled version can efficiently compute the transitive closure of a relation, but it is not limited to equivalence relations; it can represent any injective transformation between equivalence classes, which includes two-variables per equality (TVPE) constraints of the form y = a× x + b. Using abstract interpretation theory, we study the properties deriving from the use of abstract relations as labels, and the combination of labeled union-find with other representations of constraints, allowing both improvements in precision and simplification of existing constraints. Due to its efficiency, the labeled union-find abstractions could find many uses; we use it in two use cases, program analysis based on abstract interpretation and constraint solving for SMT, with encouraging preliminary results.
Nondeterminism introduced by race conditions and message reorderings makes parallel and distributed programming hard. Nevertheless, promising approaches such as LVars and CRDTs address this problem by introducing a partial order structure on shared state that describes how the state evolves over time. Monotone programs that respect the order are deterministic. Datalog-inspired languages incorporate this idea of monotonicity in a first-class way but they are not general-purpose. We would like parallel and distributed languages to be as natural to use as any functional language, without sacrificing expressivity, and with a formal basis of study as appealing as the lambda calculus.
This paper presents λ∨, a core language for deterministic parallelism that embodies the ideas above. In λ∨, values may increase over time according to a streaming order and all computations are monotone with respect to that order. The streaming order coincides with the approximation order found in Scott semantics and so unifies the foundations of functional programming with the foundations of deterministic distributed computation. The resulting lambda calculus has a computationally adequate model rooted in domain theory. It integrates the compositionality and power of abstraction characteristic of functional programming with the declarative nature of Datalog.
This paper is about semantic regular expressions (SemREs). This is a concept that was recently proposed by Smore (Chen et al. 2023) in which classical regular expressions are extended with a primitive to query external oracles such as databases and large language models (LLMs). SemREs can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples.
In this paper, we study the membership testing problem.
First, we present a two-pass NFA-based algorithm to determine whether a string w matches a SemRE r in O(|r|2 |w|2 + |r| |w|3) time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in O(|r|2 |w|2) time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a ≈ 2 × overhead over the time needed for interaction with the oracle.
Second, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an Ω(|w|2) lower bound on the number of oracle queries necessary to make this determination.
Using program synthesis to select instructions for and optimize input programs is receiving increasing attention. However, existing synthesis-based compilers are faced by two major challenges that prohibit the deployment of program synthesis in production compilers: exorbitantly long synthesis times spanning several minutes and hours; and scalability issues that prevent synthesis of complex modern compute and data swizzle instructions, which have been found to maximize performance of modern tensor and stencil workloads. This paper proposes MISAAL, a synthesis-based compiler that employs a novel strategy to use formal semantics of hardware instructions to automatically prune a large search space of rewrite rules for modern complex instructions in an offline stage. MISAAL also proposes a novel methodology to make term-rewriting process in the online stage (at compile-time) extremely lightweight so as to enable programs to compile in seconds. Our results show that MISAAL reduces compilation times by up to a geomean of 16x compared to the state-of-the-art synthesis-based compiler, HYDRIDE. MISAAL also delivers competitive runtime performance against the production compiler for image processing and deep learning workloads, Halide, as well as HYDRIDE across x86, Hexagon and ARM.
Compiler diagnostics for type inference failures are notoriously bad, and type classes only make the problem worse. By introducing a complex search process during inference, type classes can lead to wholly inscrutable or useless errors. We describe a system, Argus, for interactively visualizing type class inferences to help programmers debug inference failures, applied specifically to Rust’s trait system. The core insight of Argus is to avoid the traditional model of compiler diagnostics as one-size-fits-all, instead providing the programmer with different views on the search tree corresponding to different debugging goals. Argus carefully uses defaults to improve debugging productivity, including interface design (e.g., not showing full paths of types by default) and heuristics (e.g., sorting obligations based on the expected complexity of fixing them). We evaluated Argus in a user study where N = 25 participants debugged type inference failures in realistic Rust programs, finding that participants using Argus correctly localized 2.2× as many faults and localized 3.3× faster compared to not using Argus.
We present a novel symbolic reasoning engine for SQL which can efficiently generate an input I for n queries P1, ⋯, Pn, such that their outputs on I satisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of each Pi — that is, a subset of Pi’s input-output behaviors. While it makes our approach both semantics-aware and lightweight, this idea alone is incomplete (as a fixed under-approximation might miss some behaviors of interest). Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques.
The Node.js ecosystem, with its growing popularity and increasing exposure to security vulnerabilities, has a pressing need for more effective security analysis tools. To reduce false positives, recent works on detecting vulnerabilities in Node.js packages have developed synthesis algorithms to generate proof-of-concept exploits. However, these tools focus mainly on vulnerabilities that can be triggered by a single direct call to an exported function of the analyzed package, failing to generate exploits that require more complex interactions. In this paper, we present Explode.js, the first tool capable of synthesizing exploits that include complex call sequences to trigger vulnerabilities in Node.js packages. By combining static analysis and symbolic execution, Explode.js generates functional exploits that confirm the existence of command, code injection, prototype pollution, and path traversal vulnerabilities, effectively eliminating false positives. The results of evaluating Explode.js on two state-of-the-art datasets of Node.js packages with confirmed vulnerabilities show that it generates significantly more exploits than its main competitor tools. Furthermore, when applied to real-world Node.js packages, Explode.js uncovered 44 zero-day vulnerabilities, with 4 new CVEs.
Graphics shaders are the core of modern 3D visual effects, enabling developers to create realistic, real-time rendering of 3D scenes. Shaders are specialized programs written in high-level shading languages like GLSL, and graphics shader compilers translate these high-level shader programs into low-level binaries that run on GPUs. These shader compilers are complex programs with multiple layers: front-end, middle-end, and back-end. Despite significant development efforts from industrial GPU vendors such as NVIDIA and AMD, graphics shader compilers still contain bugs that can impact downstream applications and lead to negative consequences from poor user experience in entertainment to accidents in driving assistance systems. Because they are complex and deep in the compilation pipeline, the back-ends of shader compilers are particularly challenging to test. Our empirical exploration shows that state-of-the-art testing tools for shader compilers do not specifically target the back-ends and are thus ineffective in uncovering back-end bugs.
This work fills this gap and introduces ShaDiv, an automated testing tool specifically designed to uncover bugs in the back-ends of graphics shader compilers. To this end, ShaDiv generates test inputs with two novel, carefully designed strategies to support the unique computational models of the back-ends, namely control and data flow divergence among GPU threads. Indeed, ShaDiv deliberately perturbs divergence patterns in both the control and data flow of shader programs to effectively trigger back-end optimizations. Our evaluation of ShaDiv on graphics shader compilers from four mainstream GPU vendors uncovered 12 back-end bugs. Further comparison with existing shader compiler testing tools shows that ShaDiv achieves a 25% coverage increase in the back-end components and finds four times as many back-end bugs.
Blockchains operating at the global scale demand high-performance byzantine fault-tolerant (BFT) consensus protocols. Most classic PBFT-like protocols suffer from an issue known as the leader bottleneck, which severely limits their throughput and resource utilization. Recently, Directed Acyclic Graph, or DAG-based protocols, have emerged as a promising approach for eliminating the leader bottleneck and achieving better performance. They attain higher throughput by separating data dissemination and block ordering. However, their safety and liveness logic is also significantly more elaborate. So far, most DAG-based protocols have only enjoyed on-paper security proofs, and it is not clear how to construct formal proofs of these protocols efficiently.
We introduce LiDO-DAG, a concurrent object model that abstracts the common logic of these protocols. LiDO-DAG is constructed by combining a DAG abstraction and LiDO, a recently proposed abstraction for leader-based consensus. To demonstrate that our framework enables rapid validation of new DAG-based protocol designs, we implemented LiDO-DAG in Coq and applied it to three recent DAG-based protocols, including Narwhal, Bullshark, and Sailfish. Our framework readily yields mechanized safety and liveness proofs for all three protocols, which are also the first mechanized liveness proofs of any DAG-based protocol. Our framework has also revealed an optimization for Sailfish that improves its worst-case latency.
Database-backed applications form the backbone of modern software, yet their complexity poses significant challenges for static analysis. These applications involve intricate interactions among application code, diverse database frameworks such as JDBC, Hibernate, and Spring Data JPA, and languages like Java and SQL. In this paper, we introduce DBridge, the first pointer analysis specifically designed for Java database-backed applications, capable of statically constructing comprehensive Java-to-database value flows. DBridge unifies application code analysis, database access specification modeling, SQL analysis, and database abstraction within a single pointer analysis framework, capturing interactions across a wide range of database access APIs and frameworks. Additionally, we present DB-Micro, a new micro-benchmark suite with 824 test cases crafted to systematically evaluate static analysis for database-backed applications. Experiments on DB-Micro and large, complex, real-world applications demonstrate DBridge's effectiveness, achieving high recall and precision in building Java-to-database value flows efficiently and outperforming state-of-the-art tools in SQL statement identification. To further validate DBridge's utility, we develop three client analyses for security and program understanding. Evaluation on these real-world applications reveals 30 Stored XSS attack vulnerabilities and 3 horizontal broken access control vulnerabilities, all previously undiscovered and real, as well as a high detection rate in impact analysis for schema changes. By open-sourcing DBridge (14K LoC) and DB-Micro (22K LoC), we seek to help advance static analysis for modern database-backed applications in the future.
At the heart of the Damas-Hindley-Milner (HM) type system lies the abstraction rule which derives a function type for a lambda expression. In this rule, the type of the parameter can be ”guessed”, and can be any type that fits the derivation. The beauty of the HM system is that there always exists a most general type that encompasses all possible derivations – Algorithm W is used to infer these most general types in practice.
Unfortunately, this property is also the bane of the HM type rules. Many languages extend HM typing with additional features which often require complex side conditions to the type rules to maintain principal types. For example, various type systems for impredicative type inference, like HMF, FreezeML, or Boxy types, require let-bindings to always assign most general types. Such a restriction is difficult to specify as a logical deduction rule though, as it ranges over all possible derivations. Despite these complications, the actual implementations of various type inference algorithms are usually straightforward extensions of algorithm W, and from an implementation perspective, much of the complexity of various type system extensions, like boxes or polymorphic weights, is in some sense artificial.
In this article we rephrase the HM type rules as _type inference under a prefix_, called HMQ. HMQ is sound and complete with respect to the HM type rules, but always derives principal types that correspond to the types inferred by algorithm W. The HMQ type rules are close to the clarity of the declarative HM type rules, but also specific enough to ”read off” an inference algorithm, and can form an excellent basis to describe type system extensions in practice. We show in particular how to describe the FreezeML and HMF systems in terms of inference under a prefix, and how we no longer require complex side conditions. We also show a novel formalization of static overloading in HMQ as implemented in Koka language.
MLIR is a toolkit supporting the development of extensible and composable intermediate representations (IRs) called dialects; it was created in response to rapid changes in hardware platforms, programming languages, and application domains such as machine learning. MLIR supports development teams creating compilers and compiler-adjacent tools by factoring out common infrastructure such as parsers and printers. A major limitation of MLIR is that it is syntax-focused: it has no support for directly encoding the semantics of operations in its dialects. Thus, at present, the parts of MLIR tools that depend on semantics—optimizers, analyzers, verifiers, transformers—must all be engineered by hand.
Our work makes formal semantics a first-class citizen in the MLIR ecosystem. We designed and implemented a collection of semantics-supporting MLIR dialects for encoding the semantics of compiler IRs. These dialects support a separation of concerns between three domains of expertise when building formal-methods-based tooling for compilers. First, compiler developers define their dialect’s semantics as a lowering (compilation transformation) from their dialect to one or more of ours. Second, SMT solver experts provide tools to optimize domain-specific high-level semantics and lower them to SMT queries. Third, tool builders create dialect-independent verification tools.
We validate our work by defining semantics for five key MLIR dialects, defining a state-of-the-art SMT encoding for memory-based semantics, and building three dialect-agnostic tools, which we used to find five miscompilation bugs in upstream MLIR, verify a canonicalization pass, and also formally verify transfer functions for two dataflow analyses: “known bits” (that finds individual bits that are always zero or one in all executions) and “demanded bits” (that finds don’t-care bits). The transfer functions that we verify are improved versions of those in upstream MLIR; they detect on average 36.6% more known bits in real-world MLIR programs compared to the upstream implementation.
Cryptographic library developers take care to ensure their library does not leak secrets even when there are (inevitably) exploitable vulnerabilities in the applications the library is linked against. To do so, they choose some class of application vulnerabilities to defend against and hardcode protections against those vulnerabilities in the library code. A single set of choices is a poor fit for all contexts: a chosen protection could impose unnecessary overheads in contexts where those attacks are impossible, and an ignored protection could render the library insecure in contexts where the attack is feasible.
We introduce RoboCop, a new methodology and toolchain for building secure and efficient applications from cryptographic libraries, via four contributions. First, we present an operational semantics that describes the behavior of a (cryptographic) library executing in the context of a potentially vulnerable application so that we can precisely specify what different attackers can observe. Second, we use our semantics to define a novel security property, Robust Constant Time (RCT), that defines when a cryptographic library is secure in the context of a vulnerable application. Crucially, our definition is parameterized by an attacker model, allowing us to factor out the classes of attackers that a library may wish to secure against. This refactoring yields our third contribution: a compiler that can synthesize bespoke cryptographic libraries with security tailored to the specific application context against which the library will be linked, guaranteeing that the library is RCT in that context. Finally, we present an empirical evaluation that shows the RoboCop compiler can automatically generate code to efficiently protect a wide range (over 500) of cryptographic library primitives against three classes of attacks: read gadgets (due to application memory safety vulnerabilities), speculative read gadgets (due to application speculative execution vulnerabilities), and concurrent observations (due to application threads), with performance overhead generally under 2% for protections from read gadgets and under 4% for protections from speculative read gadgets, thus freeing library developers from making one-size-fits-all choices between security and performance.
PulseCore is a new program logic suitable for intrinsic proofs of higher-order, stateful, concurrent, dependently typed programs. It provides many of the features of a modern, concurrent separation logic, including dynamically allocated impredicative invariants, higher-order ghost state, step-indexing with later credits, and support for user-defined ghost state constructions. PulseCore is developed foundationally within the F★ programming language with fully mechanized proofs, and is applicable to F★ programs itself.
To evaluate our work, we use Pulse, a surface language within F★ for PulseCore, to develop a range of program proofs. Illustrating its suitability for proving higher-order concurrent programs, we present a verified library for task pools in the style of OCaml5, together with some verified task-parallel programs. Next, we present various data structures and synchronization primitives, including a barrier that requires the use of higher-order ghost state. Finally, we present a verified implementation of the DICE Protection Environment, an industry standard secure boot protocol. Taken together, our evaluation consists of more than 31,000 lines of verified code in a range of settings, providing evidence that PulseCore is both highly expressive as well as practical for a variety of program proof applications.
Database isolation is a formal contract concerning the level of data consistency that a database provides to its clients. In order to achieve low latency, high throughput, and partition tolerance, modern databases forgo strong transaction isolation for weak isolation guarantees. However, several production databases have been found to suffer from isolation bugs, breaking their data-consistency contract. Black-box testing is a prominent technique for detecting isolation bugs, by checking whether histories of database transactions adhere to a prescribed isolation level.
In order to test databases on realistic workloads of large size, isolation testers must be as efficient as possible, a requirement that has initiated a study of the complexity of isolation testing. Although testing strong isolation has been known to be NP-complete, weak isolation levels were recently shown to be testable in polynomial time, which has propelled the scalability of testing tools. However, existing testers have a large polynomial complexity, restricting testing to workloads of only moderate size, which is not typical of large-scale databases. How efficiently can we provably test weak database isolation?
In this work, we develop AWDIT, a highly-efficient and provably optimal tester for weak database isolation. Given a history H of size n and k sessions, AWDIT tests whether H satisfies the most common weak isolation levels of Read Committed (RC), Read Atomic (RA), and Causal Consistency (CC) in time O(n3/2), O(n3/2), and O(n· k), respectively, improving significantly over the state of the art. Moreover, we prove that AWDIT is essentially optimal, in the sense that there is a lower bound of n3/2, based on the combinatorial BMM hypothesis, for any weak isolation level between RC and CC. Our experiments show that AWDIT is significantly faster than existing, highly optimized testers; e.g., for the ∼20% largest histories, AWDIT obtains an average speedup of 245×, 193×, and 62× for RC, RA, and CC, respectively, over the best baseline.
The ways in which the components of a program interact with each other in a concurrent setting can be considerably more complex than in a sequential setting. The core problem is unrestricted shared mutable state. An alternative to unrestricted shared mutable state is to restrict the sharing using Ownership. Ownership can turn what would have been a race into a deterministic failure that can be explained to the programmer. However, Ownership has predominantly taken place in statically typed languages.
In this paper, we explore retrofitting an existing dynamically typed programming language with an ownership model based on regions. Our core aim is to provide safe concurrency, that is, the ownership model should provide deterministic dynamic failures of ownership that can be explained to the programmer. We present a dynamic model of ownership that provides ownership of groups objects called regions. We provide dynamic enforcement of our region discipline, which we have implemented in a simple interpreter that provides a Python-like syntax and semantics, and report on our first steps into integrating it into an existing language, Python.
We transport multi-stage programming from functional to relational programming, with novel constructs to give programmers control over staging and non-determinism. We stage interpreters written as relations, in which the programs under interpretation can contain holes representing unknown expressions or values. By compiling the known parts without interpretive overhead and deferring interpretation to run time only for the unknown parts, we compound the benefits of staging (e.g., turning interpreters into compilers) and relational interpretation (e.g., turning functions into relations and synthesizing from sketches). We extend miniKanren with staging constructs and apply the resulting multi-stage language to relational interpreters for subsets of Racket and miniKanren as well as a relational recognizer for context-free grammars. We demonstrate significant performance gains across multiple synthesis problems, systematically comparing unstaged and staged computation, as well as indicatively comparing with an existing hand-tuned relational interpreter.
Large language models (LLMs) show promise in code translation due to their ability to generate idiomatic code. However, a significant limitation when using LLMs for code translation is scalability: existing works have shown a drop in translation success rates for code exceeding around 100 lines. We overcome this limitation by developing a modular approach to translation, where we partition the code into small code fragments which can be translated independently and semantically validated (that is, by checking I/O equivalence). When this approach is applied naively, we discover that LLMs are unreliable when translating features of the source language that do not have a direct mapping to the target language, and that the LLM often gets stuck in repair loops when attempting to fix errors. To address these issues, we introduce two key concepts: (1) feature mapping, which integrates predefined translation rules with LLM-based translation to guide the LLM in navigating subtle language differences and producing semantically accurate code; and (2) type-compatibility, which facilitates localized checks at the function signature level to detect errors early, thereby narrowing the scope of potential repairs. We apply our approach to translating real-world Go codebases to Rust, demonstrating that we can consistently generate reliable Rust translations for projects up to 9,700 lines of code and 780 functions, with an average of 73% of functions successfully validated for I/O equivalence, considerably higher than any existing work.
We present the first technique to synthesize programs that compose side-effecting functions, pure functions, and control flow, from partial traces containing records of only the side-effecting functions. This technique can be applied to synthesize API composing scripts from logs of calls made to those APIs, or a script from traces of system calls made by a workload, for example. All of the provided traces are positive examples, meaning that they describe desired behavior. Our approach does not require negative examples. Instead, it generalizes over the examples and uses cost metrics to prevent over-generalization. Because the problem is too complex for traditional monolithic program synthesis techniques, we propose a new combination of optimizing rewrites and syntax-guided program synthesis. The resulting program is correct by construction, so its output will always be able to reproduce the input traces. We evaluate the quality of the programs synthesized when considering various optimization metrics and the synthesizer's efficiency on real-world benchmarks. The results show that our approach can generate useful real-world programs.
Session types provide a formal type system to define and verify communication protocols between message-passing processes. In order to analyze randomized systems, recent works have extended session types with probabilistic type constructors. Unfortunately, all the proposed extensions only support constant probabilities which limits their applicability to real-world systems. Our work addresses this limitation by introducing probabilistic refinement session types which enable symbolic reasoning for concurrent probabilistic systems in a core calculus we call PReST. The type system is carefully designed to be a conservative extension of refinement session types and supports both probabilistic and regular choice type operators. We also implement PReST in a prototype which we use for validating probabilistic concurrent programs. The added expressive power leads to significant challenges, both in the meta theory and implementation of PReST, particularly with type checking: it requires reconstructing intermediate types for channels when type checking probabilistic branching expressions. The theory handles this by semantically quantifying refinement variables in probabilistic typing rules, a deviation from standard refinement type systems. The implementation relies on a bi-directional type checker that uses an SMT solver to reconstruct the intermediate types minimizing annotation overhead and increasing usability. To guarantee that probabilistic processes are almost-surely terminating, we integrate cost analysis into our type system to obtain expected upper bounds on recursion depth. We evaluate PReST on a wide variety of benchmarks from 4 categories: (i) randomized distributed protocols such as Itai and Rodeh's leader election, bounded retransmission, etc., (ii) parametric Markov chains such as random walks, (iii) probabilistic analysis of concurrent data structures such as queues, and (iv) distributions obtained by composing uniform distributions using operators like max, and sum. Our experiments show that the PReST type checker scales to large programs with sophisticated probabilistic distributions.
Formal verification of software and compilers has been used to rule out large classes of security-critical issues, but risk of unintentional information leakage has received much less consideration. It is a key requirement for formal specifications to leave some details of a system's behavior unspecified so that future implementation changes can be accommodated, and yet it is nonetheless expected that these choices would not be made based on confidential information the system handles. This paper formalizes that notion using omnisemantics and plain single-copy assertions, giving for the first time a specification of what it means for a nondeterministic program to be constant-time or more generally to avoid leaking (a part of) its inputs. We use this theory to prove data-leak-free execution of core cryptographic routines compiled from Bedrock2 C to RISC-V machine code, showing that the smooth specification and proof experience omnisemantics provides for nondeterminism extends to constant-time properties in the same setting. We also study variants of the key program-compiler contract, highlighting pitfalls of tempting simplifications and subtle consequences of how inputs to nondeterministic choices are constrained. Our results are backed by modular program-logic and compiler-correctness theorems, and they integrate into a neat end-to-end theorem in the Coq proof assistant.
This paper presents an automated reasoning technique for checking equivalence between graph database queries written in Cypher and relational queries in SQL. To formalize a suitable notion of equivalence in this setting, we introduce the concept of database transformers, which transform database instances between graph and relational models. We then propose a novel verification methodology that checks equivalence modulo a given transformer by reducing the original problem to verifying equivalence between a pair of SQL queries. This reduction is achieved by embedding a subset of Cypher into SQL through syntax-directed translation, allowing us to leverage existing research on automated reasoning for SQL while obviating the need for reasoning simultaneously over two different data models. We have implemented our approach in a tool called Graphiti and used it to check equivalence between graph and relational queries. Our experiments demonstrate that Graphiti is useful both for verification and refutation and that it can uncover subtle bugs, including those found in Cypher tutorials and academic papers.
Semantics-Guided Synthesis (SemGuS) provides a framework to specify synthesis problems in a solver-agnostic and domain-agnostic way, by allowing a user to provide both the syntax and semantics of the language in which the desired program should be synthesized. Because synthesis and verification are closely intertwined, the SemGuS framework raises the following question: how does one verify that a user-given program satisfies a given specification when interpreted according to a user-given semantics?
In this paper, we prove that this form of language-agnostic verification (specifically that verifying whether a program is a valid solution to a SemGuS problem) can be reduced to proving validity of a query in the µCLP calculus, a fixed-point logic that is capable of expressing alternating least and greatest fixed-points. Our encoding into µCLP allows us to further classify the SemGuS verification problems into ones that are reducible to satisfiability of (i) first-order-logic formulas, (ii) Constrained Horn Clauses, and (iii) µCLP queries. Furthermore, our encoding shines light on some limitations of the SemGuS framework, such as its inability to model nondeterminism and reactive synthesis. We thus propose a modification to SemGuS that makes it more expressive, and for which verifying solutions is exactly equivalent to proving validity of a query in the µCLP calculus. Our implementation of SemGuS verifiers based on the above encoding can verify instances that were not even encodable in previous work. Furthermore, we use our SemGuS verifiers within an enumeration-based SemGuS solver to correctly synthesize solutions to SemGuS problems that no previous SemGuS synthesizer could solve.
The selection monad on a set consists of selection functions. These select an element from the set, based on a loss (dually, reward) function giving the loss resulting from a choice of an element. Abadi and Plotkin used the monad to model a language with operations making choices of computations taking account of the loss that would arise from each choice. However, their choices were optimal, and they asked if they could instead be programmer provided.
In this work, we present a novel design enabling programmers to do so. We present a version of algebraic effect handlers enriched by computational ideas inspired by the selection monad. Specifically, as well as the usual delimited continuations, our new kind of handlers additionally have access to choice continuations, that give the possible future losses. In this way programmers can write operations implementing optimisation algorithms that are aware of the losses arising from their possible choices.
We give an operational semantics for a higher-order model language λ C, and establish desirable properties including progress, type soundness, and termination for a subset with a mild hierarchical constraint on allowable operation types. We give this subset a selection monad denotational semantics, and prove soundness and adequacy results. We also present a Haskell implementation and give a variety of programming examples.
Latency is a major concern for web rendering engines like those in Chrome, Safari, and Firefox. These engines reduce latency by using an incremental layout algorithm to redraw the page when the user interacts with it. In such an algorithm, elements that change frame-to-frame are marked dirty, and only those elements are processed to draw the next frame, dramatically reducing latency. However, the standard incremental layout algorithm must search the page for dirty elements, accessing auxiliary elements in the process. These auxiliary elements add cache misses and stalled cycles, and are responsible for a sizable fraction of all layout latency.
We introduce a new, faster incremental layout algorithm called Spineless Traversal. Spineless Traversal uses a cache-friendlier priority queue algorithm that avoids accessing auxiliary nodes and thus reduces cache traffic and stalls. This leads to dramatic speedups on the most latency-critical interactions such as hovering, typing, and animation. Moreover, thanks to numerous low-level optimizations, Spineless Traversal is competitive across the whole spectrum of incremental layout workloads. Spineless Traversal is faster than the standard approach on 83.0% of 2216 benchmarks, with a mean speedup of 1.80× concentrated in the most latency-critical interactions.
There are many state-of-the-art techniques for loop bound analysis. Most of them target an upper bound for a given program, and others find a lower bound. Exact bound analysis still remains largely unexplored, but it offers new applications. To compute an exact bound for a program it is necessary to reason about the possible values the program’s inputs can take and how they relate to each other. Since inputs can vary on any given execution of the program, it makes the problem of computing an exact bound challenging. In this work, we present a new approach to find an exact bound by way of precondition synthesis which iteratively considers under-approximations of a program under which the bound can be precomputed over initial values of program variables. For each precondition, our approach synthesizes a function over program variables such that when the function is applied to the initial values of the program variables, its output is an exact bound for the program. We reduce the precondition synthesis problem to that of safety verification which lends its correctness guarantees to the exact bounds we compute. Our technique has been implemented in a tool called Elba, and we show that it is effective on a set of challenging single loop benchmarks under Linear Integer Arithmetic.
Backward error analysis offers a method for assessing the quality of numerical programs in the presence of floating-point rounding errors. However, techniques from the numerical analysis literature for quantifying backward error require substantial human effort, and there are currently no tools or automated methods for statically deriving sound backward error bounds. To address this gap, we propose Bean, a typed first-order programming language designed to express quantitative bounds on backward error. Bean’s type system combines a graded coeffect system with strict linearity to soundly track the flow of backward error through programs. We prove the soundness of our system using a novel categorical semantics, where every Bean program denotes a triple of related transformations that together satisfy a backward error guarantee.
To illustrate Bean’s potential as a practical tool for automated backward error analysis, we implement a variety of standard algorithms from numerical linear algebra in Bean, establishing fine-grained backward error bounds via typing in a compositional style. We also develop a prototype implementation of Bean that infers backward error bounds automatically. Our evaluation shows that these inferred bounds match worst-case theoretical relative backward error bounds from the literature, underscoring Bean’s utility in validating a key property of numerical programs: numerical stability.
We present new techniques for exact and approximate inference in discrete probabilistic programs, based on two new ways of exploiting lazy evaluation. First, we show how knowledge compilation, a state-of-the art technique for exact inference in discrete probabilistic programs, can be made lazy, enabling asymptotic speed-ups. Second, we show how a probabilistic program’s lazy semantics naturally give rise to a division of its random choices into subproblems, which can be solved in sequence by sequential Monte Carlo with locally-optimal proposals automatically computed via lazy knowledge compilation. We implement our approach in a new tool, Pluck, and evaluate its performance against state-of-the-art approaches to inference in discrete probabilistic languages. We find that on a suite of inference benchmarks, lazy knowledge compilation can be faster than state-of-the-art approaches, sometimes by orders of magnitude.
Equality saturation has gained significant interest as a powerful optimization and reasoning technique. At its heart is the e-graph data structure, that space-efficiently represents equal sub-terms uniquely. An important open problem in this context is extending this efficient representation to languages featuring (bound) variables. Independent of how we represent variables in e-graphs, either as names or nameless (using de Bruijn indices), sharing is broken as sub-terms that differ only in the names of their variables are represented separately. This results in aggressive e-graph growth, bad performance, as well as reduced expressiveness. In this paper, we present a novel approach to representing bound variables in e-graphs by making them a first-class built-in feature of the data structure. Our slotted e-graph represents terms that differ only by (bound or free) variable names uniquely. To do so, e-classes that represent equivalent terms via e-nodes are parameterized by slots, abstracting over free variables of the represented terms. Referring to an e-class from an e-node now requires relating the variables from its context to the slots of the e-class. Our evaluation of slotted e-graph uses two case studies from compiler optimization and theorem proving to show that performing equality saturation for languages with bound variables is greatly simplified and that we can solve practically relevant problems that cannot be solved with e-graphs using de Bruijn indices.
Liquid types can express richer verification properties than simple type systems. However, despite their advantages, liquid types have yet to achieve widespread adoption. To understand why, we conducted a study analyzing developers' challenges with liquid types, focusing on LiquidHaskell. Our findings reveal nine key barriers that span three categories, including developer experience, scalability challenges with complex and large codebases, and understanding the verification process. Together, these obstacles provide a comprehensive view of the usability challenges to the broader adoption of liquid types and offer insights that can inform the current and future design and implementation of liquid type systems.
This paper revisits the fundamental problem of monitoring the linearizability of concurrent stacks, queues, sets, and multisets.
Given a history of a library implementing one of these abstract data types, the monitoring problem is to answer whether the given history is linearizable.
For stacks, queues, and (multi)sets, we present monitoring algorithms with complexities (n2), (n log n), and (n), respectively, where n is the number of operations in the input history.
For stacks and queues, our results hold under the standard assumption of data-independence, i.e., the behavior of the library is not sensitive to the actual values stored in the data structure.
Past works to solve the same problems have cubic time complexity and (more seriously) have correctness issues: they either (i) lack correctness proofs or (ii) have unsound correctness proofs (we present counter-examples of the correctness proofs), or (iii) have unsound algorithms.
Our improved complexity results rely on substantially different algorithms for which we provide detailed proofs of correctness.
We have implemented our stack and queue algorithms in LiMo (Linearizability Monitor). We evaluate LiMo and compare it with the state-of-the-art tool Violin – whose correctness proofs we have found errors in – which checks for linearizability violations. Our experimental evaluation confirms that LiMo outperforms Violin regarding both efficiency and scalability.
When designing new architectural security mechanisms, a key question is whether they actually provide the intended security, but this has historically been very hard to assess. One cannot gain much confidence by testing, as such mechanisms should provide protection in the presence of arbitrary unknown code. Previously, one also could not gain confidence by mechanised proof, as the scale of production instruction-set architecture (ISA) designs, many tens or hundreds of thousands of lines of specification, made that prohibitive.
We focus in this paper especially on the secure encapsulation of software components, as supported by CHERI architectures in general and by the Arm Morello prototype architecture and hardware design in particular. Secure encapsulation is an essential security mechanism, for fault isolation and to constrain untrusted third-party code. It has previously often been implemented using virtual memory, but that does not scale to large numbers of compartments. Morello provides capability-based mechanisms that do scale, within a single address space.
We prove a strong secure encapsulation property for an example of encapsulated code running on Morello, that holds in the presence of arbitrary untrusted code, above a full-scale sequential model of the Morello ISA. To do so, we build on, extend, and unify three orthogonal lines of previous work: the Cerise proof of such an encapsulation property for a highly idealised capability machine, expressed using a logical relation in Iris; the Islaris approach for reasoning about known code in production-scale ISAs; and the T-CHERI security properties of arbitrary Morello code, previously proved only for executions up to domain crossing.
This demonstrates how one can prove such strong properties of security mechanisms for full-scale industry architectures.
Domain-specific languages (DSLs) for machine learning are revolutionizing the speed and efficiency of machine learning workloads as they enable users easy access to high-performance compiler optimizations and accelerators. However, to take advantage of these capabilities, a user must first translate their legacy code from the language it is currently written in, into the new DSL. The process of automatically lifting code into these DSLs has been identified by several recent works, which propose program synthesis as a solution. However, synthesis is expensive and struggles to scale without carefully designed and hard-wired heuristics. In this paper, we present an approach for lifting that combines an enumerative synthesis approach with a Large Language Model used to automatically learn the domain-specific heuristics for program lifting, in the form of a probabilistic grammar. Our approach outperforms the state-of-the-art tools in this area, despite only using learned heuristics.
Compared to familiar hardware-description languages like Verilog, rule-based languages like Bluespec offer opportunities to import modularity features from software programming. While Verilog modules are about connecting wires between submodules, Bluespec modules resemble objects in object-oriented programming, where interactions with a module occur only through calls to its methods. However, while software objects can typically be characterized one method at a time, the concurrent nature of hardware makes it essential to consider the repercussions of invoking multiple methods simultaneously. Prior formalizations of rule-based languages conceptualized modules by describing their semantics considering arbitrary sets of simultaneous method calls. This internalized concurrency significantly complicates correctness proofs. Rather than analyzing methods one-at-a-time, as is done when verifying software object methods, validating the correctness of rule-based modules necessitated simultaneous consideration of arbitrary subsets of method calls. The result was a number of proof cases that grew exponentially in the size of the module’s API.
In this work, we side-step the exponential blowup through a set of judicious language restrictions. We introduce a new Bluespec-inspired formal language, Fjfj, that supports sequential characterization of modules, while preserving the concurrent hardware nature of the language. We evaluated Fjfj by implementing it in Coq, proving the key framework principle: the refinement theorem. We demonstrated Fjfj’s expressivity via implementation and verification of three examples: a pipelined processor, a parameterized crossbar, and a network switch.
Our RLibm project has recently proposed methods to generate a single implementation for an elementary function that produces correctly rounded results for multiple rounding modes and representations with up to 32-bits. They are appealing for developing fast reference libraries without double rounding issues. The key insight is to build polynomial approximations that produce the correctly rounded result for a representation with two additional bits when compared to the largest target representation and with the "non-standard" round-to-odd rounding mode, which makes double rounding the RLibm math library result to any smaller target representation innocuous. The resulting approximations generated by the RLibm approach are implemented with machine supported floating-point operations with the round-to-nearest rounding mode. When an application uses a rounding mode other than the round-to-nearest mode, the RLibm math library saves the application's rounding mode, changes the system's rounding mode to round-to-nearest, computes the correctly rounded result, and restores the application’s rounding mode. This frequent change of rounding modes has a performance cost.
This paper proposes two new methods, which we call rounding-invariant outputs and rounding-invariant input bounds, to avoid the frequent changes to the rounding mode and the dependence on the round-to-nearest mode. First, our new rounding-invariant outputs method proposes using the round-to-zero rounding mode to implement RLibm's polynomial approximations. We propose fast, error-free transformations to emulate a round-to-zero result from any standard rounding mode without changing the rounding mode. Second, our rounding-invariant input bounds method factors any rounding error due to different rounding modes using interval bounds in the RLibm pipeline. Both methods make a different set of trade-offs and improve the performance of resulting libraries by more than 2X.
We introduce Thrust, a new verification tool for ensuring functional correctness in Rust, distinguished by its strengths in automated verification, including the synthesis of inductive invariants for loops and recursive functions. Thrust is built on a novel dependent refinement type system for Rust and refinement type inference techniques based on Constrained Horn Clause (CHC) solvers. Leveraging advantages of the type system, Thrust also supports semi-automated verification utilizing user type annotations to complement CHC solvers in cases where automatic constraint solving is unsuccessful, as well as modular verification at the function and subexpression levels. Thrust also achieves precise verification, especially for programs involving pointer aliasing and borrowing, without sacrificing the benefits of automated verification, by incorporating the notion of prophecy into the refinement type system: it not only enables strong updates by leveraging the “aliasing XOR mutability” guarantee provided by Rust’s type system, but also achieves propagation of update information to the original owner upon mutable borrow release through the use of a prophecy variable. Incorporating prophecy into a refinement type system is itself challenging and requires certain tricks, as discussed in this paper, making a theoretical contribution and paving the way for further research into prophecy-based refinement type systems. While our type system addresses the challenge, we keep it simple for extensibility, specifically by delegating the guarantee of “aliasing XOR mutability,” and, more technically, the “well-borrowedness” of the program in the sense of the stacked borrows aliasing model, to Rust’s type system, allowing us to focus on reasoning about functional correctness and propagating update information through prophecy variables. Compared to RustHorn, another automated verification tool based on prophecy, our approach leverages the strengths of refinement types to support modular verification, higher-order functions, and refinement of data stored in algebraic data structures. We implemented Thrust, a refinement type inference tool as a plugin for the Rust compiler, and evaluated it using RustHorn benchmarks, as well as additional new benchmarks, including those that are beyond the capabilities of RustHorn and other semi-automated verification tools, obtaining promising results.
Exact probabilistic inference is a requirement for many applications of probabilistic programming languages (PPLs) such as in high-consequence settings or verification. However, designing and implementing a PPL with scalable high-performance exact inference is difficult: exact inference engines, much like SAT solvers, are intricate low-level programs that are hard to implement. Due to this implementation challenge, PPLs that support scalable exact inference are restrictive and lack many features of general-purpose languages.
This paper presents Roulette, the first discrete probabilistic programming language that combines high-performance exact inference with general-purpose language features. Roulette supports a significant subset of Racket, including data structures, first-class functions, surely-terminating recursion, mutable state, modules, and macros, along with probabilistic features such as finitely supported discrete random variables, conditioning, and top-level inference. The key insight is that there is a close connection between exact probabilistic inference and the symbolic evaluation strategy of Rosette. Building on this connection, Roulette generalizes and extends the Rosette solver-aided programming system to reason about probabilistic rather than symbolic quantities. We prove Roulette sound by generalizing a proof of correctness for Rosette to handle probabilities, and demonstrate its scalability and expressivity on a number of examples.
Sparse data structures are ubiquitous in modern computing, and numerous formats have been designed to represent them. These formats may exploit specific sparsity patterns, aiming to achieve higher performance for key numerical computations than more general-purpose formats such as CSR and COO.
In this work we present UZP, a new sparse format based on polyhedral sets of integer points. UZP is a flexible format that subsumes CSR, COO, DIA, BCSR, etc., by raising them to a common mathematical abstraction: a union of integer polyhedra, each intersected with an affine lattice. We present a modular approach to building and optimizing UZP: it captures equivalence classes for the sparse structure, enabling the tuning of the representation for target-specific and application-specific performance considerations. UZP is built from any input sparse structure using integer coordinates, and is interoperable with existing software using CSR and COO data layout. We provide detailed performance evaluation of UZP on 200+ matrices from SuiteSparse, demonstrating how simple and mostly unoptimized generic executors for UZP can already achieve solid performance by exploiting Z-polyhedra structures.
Program synthesis aims at the automatic generation of programs based on given specifications. Despite significant progress, the inherent complexity of synthesis tasks and the interplay among intention, invention and adaptation limit its scope. A promising yet challenging avenue is the integration of concurrency to enhance synthesis algorithms. While some efforts have applied basic concurrency by parallelizing search spaces, more intricate synthesis scenarios involving interdependent subproblems remain unexplored. In this paper, we focus on string transformation as the target domain and introduce the first concurrent synthesis algorithm that enables asynchronous coordination between deductive and enumerative processes, featuring an asynchronous deducer for dynamic task decomposition, a versatile enumerator for resolving enumeration requests, and an accumulative case splitter for if-then-else condition/branch search and assembling. Our implementation, Synthphonia exhibits substantial performance improvements over state-of-the-art synthesizers, successfully solving 116 challenging string transformation tasks for the first time.
Traditionally, most concurrent algorithms rely on safe memory reclamation (SMR) schemes for manual memory management. SMR schemes such as epoch-based reclamation (EBR) and hazard pointers (HP) are typically viewed as the only solution for memory recycling.
When using SMR, a new object needs to be allocated whenever something new is added to a data structure. However, in more complex scenarios, the same object may need to be moved between different data structures (e.g., moving a node from one list to another, and then back to the original list) in a copy-free manner, i.e., without deallocating and allocating the node again. It is typically impossible for two reasons: (1) the ABA problem would still arise even when using SMR since the same pointer can reappear (without going through the full SMR cycle) if the same node eventually ends up back in the original data structure; (2) while in simple queues and stacks, nodes can immediately be recycled, it is unclear how to adapt data structures which use non-trivial traversal and two-phase deletion strategies, e.g., linked lists, skip lists, hash tables, trees, etc., where it is seemingly impossible to always immediately move (logically) deleted objects since they might still be accessed by other threads.
We propose a general method of creating RRR (Reduce, Reuse, Recycle) data structures to allow safe memory recycling when using SMR which addresses the above-mentioned problems. Our method is applicable to linked lists, skip lists, hash tables, Natarajan-Mittal tree, and other data structures. We also discuss and propose a specialized approach -- a more efficient version of Michael-and-Scott's (recycling) queue. Our evaluation on x86-64 shows promising results when using our methods for different data structures and SMR schemes.
Modern functional languages rely on sophisticated type inference algorithms. However, there often exists a gap between the theoretical presentation of these algorithms and their practical implementations. Specifically, implementations employ techniques not explicitly included in formal specifications, causing undesirable consequences. First, this leads to confusion and unforeseen challenges for developers adhering to the formal specification. Moreover, theoretical guarantees established for a formal presentation may not directly translate to the implementation. This paper focuses on formalizing one such technique, known as levels, which is widely used in practice but whose theoretical treatment remains largely understudied. We present the first comprehensive formalization of levels and demonstrate their applicability to type inference implementations.