The Entity-Component-System (ECS) software design pattern, long used in game development, encourages a clean separation of identity (entities), data properties (components), and computational behaviors (systems). Programs written using the ECS pattern are naturally concurrent, and the pattern offers modularity, flexibility, and performance benefits that have led to a proliferation of ECS frameworks. Nevertheless, the ECS pattern is little-known and not well understood outside of a few domains. Existing explanations of the ECS pattern tend to be mired in the concrete details of particular ECS frameworks, or they explain the pattern in terms of imperfect metaphors or in terms of what it is not. We seek a rigorous understanding of the ECS pattern via the design of a formal model, Core ECS, that abstracts away the details of specific implementations to reveal the essence of software using the ECS pattern. We identify a class of Core ECS programs that behave deterministically regardless of scheduling, enabling use of the ECS pattern as a deterministic-by-construction concurrent programming model. With Core ECS as a point of comparison, we then survey several real-world ECS frameworks and find that they all leave opportunities for deterministic concurrency unexploited. Our findings point out a space for new ECS implementation techniques that better leverage such opportunities.
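To make the separation concrete, here is a minimal, invented Python sketch of the pattern (Core ECS is a formal calculus, not a library; all names below are illustrative):

```python
# Minimal ECS illustration: entities are bare IDs, components are plain data
# keyed by entity, and systems are functions over component stores.
positions = {}   # component store: entity id -> (x, y)
velocities = {}  # component store: entity id -> (dx, dy)

def spawn(eid, pos, vel=None):
    positions[eid] = pos
    if vel is not None:
        velocities[eid] = vel

def movement_system(dt):
    # A system iterates over entities that have all the components it needs.
    for eid in positions.keys() & velocities.keys():
        x, y = positions[eid]
        dx, dy = velocities[eid]
        positions[eid] = (x + dx * dt, y + dy * dt)

spawn(0, (0.0, 0.0), (1.0, 2.0))
spawn(1, (5.0, 5.0))            # static entity: no velocity component
movement_system(0.5)
print(positions)                # {0: (0.5, 1.0), 1: (5.0, 5.0)}
```

Because each system declares which component stores it reads and writes, systems over disjoint stores can run concurrently; this is the kind of scheduling freedom that the deterministic fragment identified in the paper concerns.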
Given the high cost of formal verification, a large system may include differently analyzed components: a few are fully verified, and the rest are tested. Currently, there is no reasoning system that can soundly compose these heterogeneous analyses and derive the overall formal guarantees of the entire system. The traditional compositional reasoning technique—rely-guarantee reasoning—is effective for verified components, which undergo over-approximated reasoning, but not for those components that undergo under-approximated reasoning, e.g., using testing or other program analysis techniques.
The goal of this paper is to develop a formal, logical foundation for composing heterogeneous analyses, deploying both over-approximated (verification) and under-approximated (testing) reasoning. We focus on systems that can be modeled as a collection of communicating processes. Each process owns its internal resources and a set of channels through which it communicates with other processes. The key idea is to quantify the guarantees obtained about the behavior of a process as a test level, which captures the constraints under which the guarantee is established. We design a novel proof system LabelBI based on the logic of bunched implications that enables rely-guarantee reasoning principles for a system of differently analyzed components. We develop trace semantics for this logic, against which we prove our logic is sound. We also prove cut elimination of our sequent calculus. We demonstrate the expressiveness of our logic via a case study.
We present Pyrosome, a generic framework for modular language metatheory that embodies a novel approach to extensible semantics and compilation, implemented in Coq. Common techniques for semantic reasoning are often tied to the specific structures of the languages and compilers that they support. Contextual equivalence is difficult to work with directly, and both logical relations and transition system-based approaches typically fix a specific notion of effect globally. While modular transition systems have been effective in imperative settings, they are suboptimal for functional code. These limitations restrict the extension and composition of semantics in these systems. In Pyrosome, verified compilers are fully extensible: extending a language requires only defining and verifying the compilation of the new feature, reusing the old correctness theorem for all other cases. The novel enabling idea is an inductive formulation of equivalence preservation that supports the addition of new rules to the source language, target language, and compiler.
Pyrosome defines a formal, deeply embedded notion of programming languages with semantics given by dependently sorted equational theories, so all compiler-correctness proofs boil down to type-checking and equational reasoning. We support vertical composition of any compilers expressed in our framework in addition to feature extension. Since our design requires compilers to support open programs, our correctness guarantees support linking with any target code of the appropriate type. As a case study, we present a multipass compiler from System F with simple references, through CPS translation and closure conversion. Specifically, we demonstrate how we can build such a compiler incrementally by starting with a compiler for simply typed lambda-calculus and adding natural numbers, the unit type, recursive functions, and a global heap, then extending judgments with a type environment and adding type abstraction, all while reusing the original theorems. We also present a linear version of the simply typed CPS pass and compile a small imperative language to the simply typed target to show how Pyrosome handles substructural typing and imperative features.
Program reduction is a widely used technique in testing and debugging language processors. Given a program that triggers a bug in a language processor, program reduction searches for a canonicalized and minimized program that triggers the same bug, thereby facilitating bug deduplication and simplifying the debugging process. To improve reduction performance without sacrificing generality, prior research has leveraged the formal syntax of the programming language as guidance. Two key syntax-guided transformations—Compatible Substructure Hoisting and Quantified Node Reduction—were introduced to enhance this process. While these transformations have proven effective to some extent, their application excessively prunes the search space, preventing the discovery of many smaller results. Consequently, there remains significant potential for further improvement in overall reduction performance.
To this end, we propose a novel syntax-guided transformation named Structure Form Conversion (SFC) to complement the aforementioned two transformations. Building on SFC, we introduce three reduction methods: Smaller Structure Replacement, Identifier Elimination, and Structure Canonicalization, designed to effectively and efficiently leverage SFC for program reduction. By integrating these reduction methods into the previous language-agnostic program reducers Perses and Vulcan, we implement two prototypes named SFCPerses and SFCVulcan. Extensive evaluations show that SFCPerses and SFCVulcan significantly outperform Perses and Vulcan in both minimization and canonicalization. Specifically, compared to Perses, SFCPerses produces programs that are 36.82%, 18.71%, and 41.05% smaller on average on the C, Rust, and SMT-LIBv2 benchmarks at the cost of 3.65×, 16.99×, and 1.42× the time of Perses, respectively. Similarly, SFCVulcan generates programs that are 14.51%, 7.65%, and 7.66% smaller than those produced by Vulcan at the cost of 1.56×, 2.35×, and 1.42× the execution time of Vulcan. Furthermore, in an experiment with a benchmark suite containing 3,796 C programs that trigger 46 unique bugs, SFCPerses and SFCVulcan reduce 442 and 435 more duplicates (programs that trigger the same bug) to identical programs than Perses and Vulcan, respectively.
Parsing—the process of structuring a linear representation according to a given grammar—is a fundamental activity in software engineering. While formal language theory has provided theoretical foundations for parsing, the most common kind of parsers used in practice are written ad hoc. They use common string operations without explicitly defining an input grammar. These ad hoc parsers are often intertwined with application logic and can result in subtle semantic bugs. Grammars, which are complete formal descriptions of input languages, can enhance program comprehension, facilitate testing and debugging, and provide formal guarantees for parsing code. But writing grammars—e.g., in the form of regular expressions—can be tedious and error-prone. Inspired by the success of type inference in programming languages, we propose a general approach for static inference of regular input string grammars from unannotated ad hoc parser source code. We use refinement type inference to synthesize logical and string constraints that represent regular parsing operations, which we then interpret with an abstract semantics into regular expressions. Our contributions include a core calculus λΣ for representing ad hoc parsers, a formulation of (regular) grammar inference as refinement inference, an abstract interpretation framework for solving string refinement variables, and a set of abstract domains for efficiently representing the constraints encountered during regular ad hoc parsing. We implement our approach in the PANINI system and evaluate its efficacy on a benchmark of 204 Python ad hoc parsers. Compared with state-of-the-art approaches, PANINI produces better grammars (100% precision, 93% average recall) in less time (0.82 ± 2.85 s) without prior knowledge of the input space.
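For a flavor of the problem, consider the following sketch (the parser and the inferred pattern are invented for illustration; PANINI derives such regexes via refinement type inference and abstract interpretation, not by hand):

```python
import re

# An ad hoc parser in the style the work targets: string operations with no
# explicit grammar.
def parse_port(s):
    host, sep, port = s.partition(":")
    if not sep or not port.isdigit():
        raise ValueError("expected host:port")
    return host, int(port)

# A regular grammar a grammar-inference tool could report for the accepted
# input language of parse_port:
inferred = re.compile(r"[^:]*:[0-9]+")

for sample in ["localhost:8080", "db:5432", "nope", "a:b", "a:12:3"]:
    accepted = True
    try:
        parse_port(sample)
    except ValueError:
        accepted = False
    # The inferred grammar agrees with the parser on all samples.
    assert accepted == bool(inferred.fullmatch(sample))
```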
REST APIs form the backbone of modern interconnected systems by providing a language-agnostic communication interface. REST API specifications should clearly describe all response types, but automatically generating specifications is difficult with existing tools.
We present REST𝜋, a type inference engine capable of automatically generating REST API specifications. The novel contribution of REST𝜋 is our use of path-sensitive type inference, which encodes symbolic path-constraints directly into a type system. This allows REST𝜋 to enumerate all response types by considering each distinct execution path through an endpoint implementation. We implement path-sensitive type inference for Ruby, a popular language used for REST API servers. We evaluate REST𝜋 by using it to infer types for 132 endpoints across 5 open-source REST API implementations without utilizing existing specifications or test suites. We find REST𝜋 performs type inference efficiently and produces types that are more precise and complete than those obtained via an HTTP proxy. Our results suggest that path-sensitivity is a key technique to enumerate distinct response types for REST endpoints.
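A Python rendering of the idea (the paper targets Ruby; the endpoint and the hand-enumerated path types below are purely illustrative):

```python
# An endpoint whose response type depends on the execution path taken.
def get_user(user_id, db):
    user = db.get(user_id)
    if user is None:
        return {"error": "not found"}, 404                           # path 1
    if user.get("suspended"):
        return {"error": "suspended", "until": user["until"]}, 403   # path 2
    return {"id": user_id, "name": user["name"]}, 200                # path 3

# Path-sensitive inference enumerates one response type per feasible path,
# instead of one imprecise union for the whole endpoint:
#   path 1 -> { error: String }                  (status 404)
#   path 2 -> { error: String, until: String }   (status 403)
#   path 3 -> { id: Integer, name: String }      (status 200)
db = {1: {"name": "ada"}, 2: {"suspended": True, "until": "2026-01-01"}}
print(get_user(1, db), get_user(2, db), get_user(3, db))
```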
Most existing quantum programming languages are based on the quantum circuit model of computation, as higher-level abstractions are particularly challenging to implement—especially ones relating to quantum control flow. The Qunity language, proposed by Voichick et al., offered such an abstraction in the form of a quantum control construct, with great care taken to ensure that the resulting language is still realizable. However, Qunity lacked a working implementation, and the originally proposed compilation procedure was very inefficient, with even simple quantum algorithms compiling to unreasonably large circuits.
In this work, we focus on the efficient compilation of high-level quantum control flow constructs, using Qunity as our starting point. We introduce a wider range of abstractions on top of Qunity's core language that offer compelling trade-offs compared to its existing control construct. We create a complete implementation of a Qunity compiler, which converts high-level Qunity code into the quantum assembly language OpenQASM 3. We develop optimization techniques for multiple stages of the Qunity compilation procedure, including both low-level circuit optimizations as well as methods that consider the high-level structure of a Qunity program, greatly reducing the number of qubits and gates used by the compiler.
The merge operator is a powerful construct in programming languages, enabling flexible composition of various components such as functions, records, or classes. Unfortunately, its application often leads to ambiguity and non-determinism, especially when dealing with overlapping types. To prevent ambiguity, approaches such as disjoint intersection types have been proposed. However, disjointness imposes strict constraints to ensure determinism, at the cost of limiting expressiveness, particularly for function overloading. This paper introduces a novel concept called type apartness, which relaxes the strict disjointness constraints while maintaining type safety and determinism. Type apartness allows some overlap for overloaded functions as long as the calling contexts of those functions can be used to disambiguate upcasts in function calls. By incorporating the notion of guarded subtyping to prevent ambiguity when upcasting, our approach is the first to support function overloading, return type overloading, extensible records, and nested composition in a single calculus while preserving determinism. We formalize our calculi in Coq and prove their type soundness and determinism. Additionally, we demonstrate how type normalization and type difference provide more convenience and help resolve conflicts, enhancing the flexibility and expressiveness of the merge operator.
Probabilistic extensions of logic programming languages, such as ProbLog, integrate logical reasoning with probabilistic inference to evaluate probabilities of output relations; however, prior work does not account for potential statistical correlations among input facts. This paper introduces Praline, a new extension to Datalog designed for precise probabilistic inference in the presence of (partially known) input correlations. We formulate the inference task as a constrained optimization problem, where the solution yields sound and precise probability bounds for output facts. However, due to the complexity of the resulting optimization problem, this approach alone often does not scale to large programs. To address scalability, we propose a more efficient δ-exact inference algorithm that leverages constraint solving, static analysis, and iterative refinement. Our empirical evaluation on challenging real-world benchmarks, including side-channel analysis, demonstrates that our method not only scales effectively but also delivers tight probability bounds.
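For intuition on why bounds, rather than point probabilities, are the right output under partially known correlations, consider two input facts A and B with known marginals but unknown joint distribution. The tightest sound bounds on their conjunction are the classical Fréchet inequalities:

\[
\max\bigl(0,\; P(A) + P(B) - 1\bigr) \;\le\; P(A \wedge B) \;\le\; \min\bigl(P(A),\, P(B)\bigr).
\]

For P(A) = 0.8 and P(B) = 0.7, any value in [0.5, 0.7] is feasible, whereas an independence assumption would report the single point 0.56; the constrained-optimization formulation generalizes this idea to whole Datalog derivations.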
Image processing workflows typically consist of a series of different functions, each working with “image” inputs and outputs in an abstract sense. However, the specific in-memory representation of images differs between and sometimes within libraries. Conversion is therefore necessary when integrating functions from several sources into a single program. The conversion process forces users to consider low-level implementation details, including data types, color channels, channel order, minibatch layout, memory locations, and pixel intensity ranges. Specifically in the case of visual programming languages (VPLs), this distracts from high-level operations. We introduce im2im, a Python library that automates the conversion of in-memory image representations. The central concept of this library is a knowledge graph that describes image representations and how to convert between them. The system queries this knowledge graph to generate the conversion code and execute it, converting an image to the desired representation. The effectiveness of the approach is evaluated through two case studies in VPLs. In each case, we compared a workflow created in a basic block-based VPL with the same workflow enhanced using im2im. These evaluations show that im2im automates type conversions and eliminates the need for manual intervention. Additionally, we compared the overhead of using explicit intermediate representations versus im2im, both of which avoid manual type conversions in VPLs. The results indicate that im2im generates only the necessary conversions, avoiding the runtime overhead associated with converting to and from intermediate formats. A performance comparison between the step-by-step approach used by im2im and a single-function approach demonstrates that the overhead introduced by im2im does not impact practical usability. While focused on block-based VPLs, im2im can be generalized to other VPLs and textual programming environments. Its principles are also applicable to domains other than images. The source code and analyses are available via GitHub.
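The mechanism can be pictured as follows (a toy Python reduction of the idea; im2im's actual knowledge graph, node names, and generated code differ):

```python
from collections import deque
from functools import reduce

import numpy as np

# Nodes are in-memory image representations, edges are unit conversions, and
# a BFS path through the graph is composed into the full conversion function.
EDGES = {
    ("uint8_hwc", "float_hwc"): lambda im: im.astype(np.float32) / 255.0,
    ("float_hwc", "float_chw"): lambda im: np.transpose(im, (2, 0, 1)),
    ("float_chw", "float_bchw"): lambda im: im[np.newaxis, ...],
}

def find_conversion(src, dst):
    queue, seen = deque([(src, [])]), {src}
    while queue:
        node, funcs = queue.popleft()
        if node == dst:
            return lambda im, fs=tuple(funcs): reduce(lambda x, f: f(x), fs, im)
        for (a, b), f in EDGES.items():
            if a == node and b not in seen:
                seen.add(b)
                queue.append((b, funcs + [f]))
    raise ValueError(f"no conversion path from {src} to {dst}")

convert = find_conversion("uint8_hwc", "float_bchw")
out = convert(np.zeros((224, 224, 3), dtype=np.uint8))
print(out.shape, out.dtype)  # (1, 3, 224, 224) float32
```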
We present a coverage-guided testing algorithm for distributed systems implementations. Our main innovation is the use of an abstract formal model of the system that is used to define coverage. Such abstract models are frequently developed in the early phases of protocol design and verification but are infrequently used at testing time. We show that guiding random test generation using model coverage can be effective in covering interesting points in the implementation state space. We have implemented a fuzzer for distributed system implementations and abstract models written in TLA+. Our algorithm achieves better coverage than purely random exploration as well as random exploration guided by different notions of scheduler coverage and mutation. In particular, we show consistently higher coverage on implementations of distributed consensus protocols such as Two-Phase Commit and the Raft implementations in Etcd-raft and RedisRaft and detect bugs faster. Moreover, we discovered 12 previously unknown bugs in their implementations, four of which could only be detected by model-guided fuzzing.
Graph databases have become an important data management technology across various domains, including biology, sociology, industry (e.g. fraud detection, supply chain management, financial services), and investigative journalism, due to their ability to efficiently store and query large-scale knowledge graphs and networks. Recently, the Graph Query Language (GQL) was introduced as a new ISO standard providing a unified framework for querying graphs. However, this initial specification lacks a formal type system for query validation. As a result, queries can fail at runtime due to type inconsistencies or produce empty results without prior warning. Addressing this issue would substantially help users write correct queries, especially when handling large datasets. To address this gap, we introduce a formal type model for a core fragment of GQL extended with property-based filtering and imprecise types both in the schema and the queries. This model, named FPPC, enables static detection of semantically incorrect and stuck queries, improving user feedback. We establish key theoretical properties, including emptiness (detecting empty queries due to type mismatches) and type safety (guaranteeing that well-typed queries do not fail at runtime). Additionally, we prove a gradual guarantee, ensuring that removing type annotations either does not introduce static type errors or only increases the result set. By integrating imprecision into GQL, FPPC offers a flexible solution for handling schema evolution and incomplete type information. This work contributes to making GQL more robust, improving both its usability and its formal foundation.
Machine learning (ML) compilers rely on graph-level transformations to enhance the runtime performance of ML models. However, performing local transformations on individual operations can create effects far beyond the location of the rewrite. In particular, a local rewrite can change the profitability or legality of hard-to-predict downstream transformations, particularly regarding data layout, parallelization, fine-grained scheduling, and memory management. As a result, program transformations are often driven by manually-tuned compiler heuristics, which are quickly rendered obsolete by new hardware and model architectures.
Instead of hand-written local heuristics, we propose the use of equality saturation. We replace such heuristics with a more robust global performance model, which accounts for downstream transformations. Equality saturation addresses the challenge of local optimizations inadvertently constraining or negating the benefits of subsequent transformations, thereby providing a solution that is inherently adaptable to newer workloads. While this approach still requires a global performance model to evaluate the profitability of transformations, it holds significant promise for increased automation and adaptability.
This paper addresses challenges in applying equality saturation on real-world ML compute graphs and state-of-the-art hardware. By doing so, we present an improved method for discovering effective compositions of graph optimizations. We study different cost modeling approaches to deal with fusion and layout optimization, and tackle scalability issues that arise from considering a very wide range of algebraic optimizations. We design an equality saturation pass for the XLA compiler, with an implementation in C++ and Rust. We demonstrate an average speedup of 3.45% over XLA’s optimization flow across our benchmark suite on various CPU and GPU platforms, with a maximum speedup of 56.26% for NasRNN on CPU.
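For readers unfamiliar with the technique, the following toy omits e-graph sharing entirely and simply grows an explicit set of provably equal terms to a fixpoint before extracting the cheapest one (the rules and costs are invented stand-ins for tensor-graph rewrites):

```python
# Terms are nested tuples: ("mul", x, y), ("add", x, y), ("shl", x, k).
def rule_comm(t):
    if isinstance(t, tuple) and t[0] == "mul":
        return ("mul", t[2], t[1])

def rule_shift(t):
    if isinstance(t, tuple) and t[0] == "mul" and t[2] == 2:
        return ("shl", t[1], 1)          # x * 2  ->  x << 1

RULES = [rule_comm, rule_shift]

def one_step(t):
    # Yield every term reachable by one rewrite anywhere inside t.
    for rule in RULES:
        r = rule(t)
        if r is not None:
            yield r
    if isinstance(t, tuple):
        for i in range(1, len(t)):
            for sub in one_step(t[i]):
                yield t[:i] + (sub,) + t[i + 1:]

def saturate(seed):
    eq_class, frontier = {seed}, {seed}
    while frontier:                      # non-destructive: keep every version
        new = {r for t in frontier for r in one_step(t)} - eq_class
        eq_class |= new
        frontier = new
    return eq_class

COST = {"mul": 4, "add": 1, "shl": 1}

def cost(t):
    if not isinstance(t, tuple):
        return 0
    return COST[t[0]] + sum(cost(c) for c in t[1:])

terms = saturate(("add", ("mul", 2, "x"), "y"))
print(min(terms, key=cost))              # ('add', ('shl', 'x', 1), 'y')
```

The key contrast with a rewrite pipeline is that no rule application destroys the original term, so a locally unprofitable rewrite can still enable a globally better extraction later.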
Zero-knowledge (ZK) protocols enable software developers to provide proofs of their programs' correctness to other parties without revealing the programs themselves. Regular expressions are pervasive in real-world software, and prior ZK protocols have addressed the problem of checking whether an individual string appears in the language of a regular expression. However, no existing protocol addresses the more complex PSPACE-complete problem of proving that two regular expressions are equivalent.
We introduce Crepe, the first ZK protocol for encoding regular expression equivalence proofs and also the first ZK protocol to target a PSPACE-complete problem. Crepe uses a custom calculus of proof rules based on regular expression derivatives and coinduction, and we introduce a sound and complete algorithm for generating proofs in our format. We test Crepe on a suite of hundreds of regular expression equivalence proofs. Crepe can validate large proofs in only a few seconds each.
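The derivative-plus-coinduction core can be sketched without any of the ZK machinery (a simplified checker, not Crepe's proof format; termination here relies on the smart constructors simplifying enough, and Crepe's calculus normalizes more aggressively):

```python
# Regexes as tuples; Brzozowski derivatives; coinductive bisimulation loop.
EMPTY, EPS = ("empty",), ("eps",)

def sym(c): return ("chr", c)

def alt(r, s):
    if r == EMPTY: return s
    if s == EMPTY: return r
    return r if r == s else ("alt", r, s)

def cat(r, s):
    if r == EMPTY or s == EMPTY: return EMPTY
    if r == EPS: return s
    return r if s == EPS else ("cat", r, s)

def star(r):
    if r in (EMPTY, EPS): return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag in ("empty", "chr"): return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])         # cat

def deriv(r, c):
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return cat(deriv(r[1], c), r)
    head = cat(deriv(r[1], c), r[2])                 # cat
    return alt(head, deriv(r[2], c)) if nullable(r[1]) else head

def alphabet(r):
    if r[0] == "chr": return {r[1]}
    return set().union(*[alphabet(s) for s in r[1:]]) if len(r) > 1 else set()

def equivalent(r, s):
    sigma = alphabet(r) | alphabet(s)
    todo, seen = [(r, s)], set()
    while todo:
        a, b = todo.pop()
        if (a, b) in seen: continue
        if nullable(a) != nullable(b): return False  # a distinguishing word
        seen.add((a, b))
        todo += [(deriv(a, c), deriv(b, c)) for c in sigma]
    return True                                      # bisimulation closed

a, b = sym("a"), sym("b")
print(equivalent(cat(star(cat(a, b)), a), cat(a, star(cat(b, a)))))  # True
print(equivalent(cat(star(cat(a, b)), a), star(a)))                  # False
```

The `seen` set is exactly the coinductive hypothesis: each visited pair may assume all previously visited pairs are equivalent, which is what makes the loop a proof search rather than mere testing.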
In today’s complex software, internal trusted code is tightly intertwined with external untrusted code. To reason about internal code, programmers must reason about the potential effects of calls to external code, even though that code is not trusted and may not even be available.
The effects of external calls can be limited if internal code is programmed defensively, restricting access to the capabilities necessary to cause those effects.
This paper addresses the specification and verification of internal code that relies on encapsulation and object capabilities to limit the effects of external calls. We propose new assertions for access to capabilities, new specifications for limiting effects, and a Hoare logic to verify that a module satisfies its specification, even while making external calls. We illustrate the approach through a running example with mechanised proofs, and prove soundness of the Hoare logic.
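A minimal sketch of the programming pattern being specified (an invented example; the paper's contribution is the assertion language and Hoare logic, and Python's underscore convention is only a stand-in for true encapsulation):

```python
class Wallet:
    def __init__(self, balance):
        self._balance = balance

    def balance(self):
        return self._balance

    def withdraw(self, amount):
        assert 0 <= amount <= self._balance
        self._balance -= amount

class SpendingCap:
    """Capability granting at most `limit` of withdrawals, nothing else."""
    def __init__(self, wallet, limit):
        self._wallet, self._limit = wallet, limit

    def withdraw(self, amount):
        assert amount <= self._limit, "capability exhausted"
        self._limit -= amount
        self._wallet.withdraw(amount)

def untrusted_plugin(cap):
    cap.withdraw(5)        # allowed by the granted capability
    cap.withdraw(100)      # blocked: exceeds the remaining limit

w = Wallet(1000)
try:
    untrusted_plugin(SpendingCap(w, 10))   # never hand out the wallet itself
except AssertionError as e:
    print(e, "| balance still", w.balance())   # balance still 995
```

Even though `untrusted_plugin` is unknown code, the internal module can bound its possible effects by the capability it passes, which is what the proposed specifications make precise.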
Context-free language (CFL) reachability is a critical framework for various program analyses, widely adopted despite its computational challenges due to cubic or near-cubic time complexity. This often leads to significant performance degradation in client applications. Notably, in real-world scenarios, clients typically require reachability information only for specific source-to-sink pairs, offering opportunities for targeted optimization.
We introduce MoYe, an effective regularization-based graph simplification technique designed to enhance the performance of client-driven CFL-reachability analyses by pruning non-contributing edges—those that do not participate in any specified CFL-reachable paths. MoYe employs a regular approximation to ensure exact reachability results for all designated node pairs and runs in time linear in the number of edges in the graph. This lightweight efficiency makes MoYe a valuable pre-processing step that substantially reduces both computational time and memory requirements for CFL-reachability analysis, outperforming a recent leading graph simplification approach. Our evaluations with two prominent CFL-reachability client applications demonstrate that MoYe can substantially improve performance and reduce resource consumption.
Classical simulation of quantum circuits is critical for the development of implementations of quantum algorithms: it does not require access to specialized hardware, facilitates debugging by allowing direct access to the quantum state, and is the only way to test on inputs that are too big for current NISQ computers.
Many quantum algorithms rely on invariants that result in sparsity in the state vector. A sparse state vector simulator only computes with non-zero amplitudes. For important classes of algorithms, this results in an asymptotic improvement in simulation time. While prior work has investigated promising ways to exploit sparsity, it is still unclear how best to scale sparse simulation to modern multi-core architectures.
In this work, we address this challenge and present qblaze, a highly optimized sparse state vector simulator based on (i) a compact sorted array representation, and (ii) new, easily parallelizable and highly-scalable algorithms for all quantum operations. Our extensive experimental evaluation shows that qblaze is often orders-of-magnitude more efficient than prior sparse state vector simulators even on a single thread, and also that qblaze scales well to a large number of CPU cores.
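The data-structure idea in miniature (a dict-based toy; qblaze itself uses a compact sorted-array representation and parallel algorithms):

```python
import math

# Only nonzero amplitudes are stored: basis state (as an int) -> amplitude.
def apply_h(state, q):
    out, s = {}, 1 / math.sqrt(2)
    for basis, amp in state.items():
        b = (basis >> q) & 1
        for new_bit in (0, 1):
            tgt = (basis & ~(1 << q)) | (new_bit << q)
            sign = -1 if (b == 1 and new_bit == 1) else 1
            out[tgt] = out.get(tgt, 0.0) + sign * s * amp
    return {k: v for k, v in out.items() if abs(v) > 1e-12}  # stay sparse

def apply_cnot(state, ctrl, tgt):
    return {basis ^ (((basis >> ctrl) & 1) << tgt): amp
            for basis, amp in state.items()}

# Build a 3-qubit GHZ state: H(0); CNOT(0,1); CNOT(1,2).
state = {0b000: 1.0}
state = apply_h(state, 0)
state = apply_cnot(state, 0, 1)
state = apply_cnot(state, 1, 2)
print({format(k, "03b"): round(v, 3) for k, v in state.items()})
# {'000': 0.707, '111': 0.707} -- 2 nonzero amplitudes out of 2**3
```

On such states the simulator's work is proportional to the number of nonzero amplitudes, not to the full 2^n vector, which is the source of the asymptotic improvement mentioned above.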
Overall, our work enables testing quantum algorithms on input sizes that were previously out of reach.
React has become the most widely used web front-end framework, enabling the creation of user interfaces in a declarative and compositional manner. Hooks are a set of APIs that manage side effects in function components in React. However, their semantics are often seen as opaque to developers, leading to UI bugs. We introduce React-tRace, a formalization of the essence of React Hooks that clarifies their behavior. We demonstrate that our model captures the behavior of React by theoretically showing that it embodies essential properties of Hooks and empirically comparing our React-tRace-based definitional interpreter against a test suite. Furthermore, we showcase a practical visualization tool based on the formalization to demonstrate how developers can better understand the semantics of Hooks.
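A toy model in the same spirit (invented here, not React-tRace's semantics): hook state lives in a per-component list indexed by call order, which is exactly why conditional hook calls are unsafe.

```python
class Runtime:
    def __init__(self, component):
        self.component, self.slots, self.cursor = component, [], 0
        self.dirty = True

    def use_state(self, initial):
        i = self.cursor                   # identity = position in call order
        self.cursor += 1
        if i == len(self.slots):          # first render: allocate the slot
            self.slots.append(initial)
        def set_state(v):
            self.slots[i] = v
            self.dirty = True             # schedule a re-render
        return self.slots[i], set_state

    def render(self):
        while self.dirty:                 # re-render until a fixpoint
            self.dirty, self.cursor = False, 0
            ui = self.component(self)
        return ui

def counter(rt):
    count, set_count = rt.use_state(0)
    if count < 3:
        set_count(count + 1)              # triggers re-renders until count == 3
    return f"count = {count}"

print(Runtime(counter).render())          # count = 3
```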
Certified programming, as carried out in proof assistants and dependently-typed programming languages, ensures that software meets its requirements by supporting the definition of both specifications and proofs. However, proofs easily break with partial definitions and incremental changes because specifications are not designed to account for the intermediate incomplete states of programs. We advocate for proper support for incremental certified programming by analyzing its objectives and inherent challenges, and propose a formal framework for achieving incremental certified programming in a principled manner. The key idea is to define appropriate notions of completion refinement and completeness to capture incrementality, and to systematically produce specifications that are valid at every stage of development while preserving the intent of the original statements. We provide a prototype implementation in the Rocq Prover, called IncRease, which exploits typeclasses for automation and extensibility, and is independent of any specific mechanism used to handle incompleteness. We illustrate its use with both an incremental textbook formalization of the simply-typed λ-calculus, and a more complex case study of incremental certified programming for an existing dead-code elimination optimization pass of the CompCert project. We show that the approach is compatible with randomized property-based testing as provided by QuickChick. Finally, we study how to combine incremental certified programming with deductive synthesis, using a novel incrementality-friendly adaptation of the Fiat library. This work provides theoretical and practical foundations towards systematic support for incremental certified programming, highlighting challenges and perspectives for future developments.
Many software applications rely on concurrent and distributed (micro)services that interact via message-passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the "core" of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging.
In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and detect deadlocks by only observing the incoming and outgoing messages, and exchanging probes with other monitors. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and a distributed black-box monitoring algorithm that we prove sound and complete (i.e., identifies deadlocked services with neither false positives nor false negatives). We implement our results in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance.
This is the first work that formalises, proves the correctness of, and implements distributed black-box monitors for deadlock detection. Our results are mechanised in Coq. DDMon is the companion artifact of this paper.
Over the past decade, deep reinforcement learning (RL) techniques have significantly advanced robotic systems. However, due to the complex architectures of neural network models, ensuring their trustworthiness is a considerable challenge. Programmatic reinforcement learning has surfaced as a promising approach. Nonetheless, synthesizing robot-control programs remains challenging. Existing methods rely on domain-specific languages (DSLs) populated with user-defined state abstraction predicates and a library of low-level controllers as abstract actions to bootstrap synthesis, which is impractical in unknown environments that lack such predefined components. To address this limitation, we introduce RoboScribe, a novel abstraction refinement-guided program synthesis framework that automatically derives robot state and action abstractions from raw, unsegmented task demonstrations in high-dimensional, continuous spaces. It iteratively enriches and refines an initially coarse abstraction until it generates a task-solving program over the abstracted robot environment. RoboScribe is effective in synthesizing iterative programs by inferring recurring subroutines directly from the robot’s raw, continuous state and action spaces, without needing predefined abstractions. Experimental results show that RoboScribe programs inductively generalize to long-horizon robot tasks involving arbitrary numbers of objects, outperforming baseline methods in terms of both interpretability and efficiency.
The advances in modern storage technologies necessitate the development of new input/output (I/O) APIs to maximize their performance benefits. However, migrating existing software to use different APIs poses significant challenges due to mismatches in computational models and complex code structures surrounding stateful, non-contiguous multi-API call sites. We present Sprout, a new system for automatically migrating programs across I/O APIs that guarantees behavioral equivalence. Sprout uses flow-sensitive pointer analysis to identify semantic variables, which enables typestate analysis for matching API semantics and the synthesis of migrated programs. Experimental results with real-world C programs highlight the efficiency and effectiveness of our approach. We also show that Sprout can be adapted to other domains, such as databases.
Data processing frameworks like Apache Spark and Flink provide built-in support for user-defined aggregation functions (UDAFs), enabling the integration of domain-specific logic. However, for these frameworks to support efficient UDAF execution, the function needs to satisfy a homomorphism property, which ensures that partial results from independent computations can be merged correctly. Motivated by this problem, this paper introduces a novel homomorphism calculus that can both verify and refute whether a UDAF is a dataframe homomorphism. If so, our calculus also enables the construction of a corresponding merge operator which can be used for incremental computation and parallel execution. We have implemented an algorithm based on our proposed calculus and evaluate it on real-world UDAFs, demonstrating that our approach significantly outperforms two leading synthesizers.
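The property in question, on the textbook example of averaging (illustrative Python; the calculus itself works over dataframe UDAFs):

```python
# Keep a mergeable partial state (sum, count) instead of the average itself,
# so partitions can be processed independently and merged.
def init():
    return (0.0, 0)

def update(state, x):
    s, n = state
    return (s + x, n + 1)

def merge(a, b):                # the merge operator such a calculus derives
    return (a[0] + b[0], a[1] + b[1])

def result(state):
    s, n = state
    return s / n if n else 0.0

def run(xs, state=None):
    state = init() if state is None else state
    for x in xs:
        state = update(state, x)
    return state

left, right = [1.0, 2.0, 3.0], [4.0, 5.0]
assert merge(run(left), run(right)) == run(left + right)   # homomorphism
print(result(merge(run(left), run(right))))                # 3.0
```

By contrast, a UDAF whose state is the running average itself would be refuted: merging two averages without their counts cannot reproduce the sequential result.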
We study the problem of synthesizing domain-specific languages (DSLs) for few-shot learning in symbolic domains. Given a base language and instances of few-shot learning problems, where each instance is split into training and testing samples, the DSL synthesis problem asks for a grammar over the base language that guarantees that small expressions solving training samples also solve corresponding testing samples. We prove that the problem is decidable for a class of languages whose semantics over fixed structures can be evaluated by tree automata and when expression size corresponds to parse tree depth in the grammar, and, furthermore, the grammars solving the problem correspond to a regular set of trees. We also prove decidability results for variants of the problem where DSLs are only required to express solutions for input learning problems and where DSLs are defined using macro grammars.
We introduce REPTILE, a compiler that performs tiling optimizations for programs expressed as mathematical recurrence equations. REPTILE recursively decomposes a recurrence program into a set of unique tiles and then simplifies each into a different set of recurrences. Given declarative user specifications of recurrence equations, optimizations, and optional mappings of recurrence subexpressions to external library calls, REPTILE generates C code that composes compiler-generated loops with calls to external hand-optimized libraries. We show that for direct linear solvers expressible as recurrence equations, the generated C code matches and often exceeds the performance of standard hand-optimized libraries. We evaluate REPTILE's generated C code against hand-optimized implementations of linear solvers in Intel MKL, as well as two non-solver recurrences from bioinformatics: Needleman-Wunsch and Smith-Waterman. When the user provides good tiling specifications, REPTILE achieves parity with MKL, achieving between 0.79--1.27x speedup for the LU decomposition, 0.97--1.21x speedup for the Cholesky decomposition, 1.61--2.72x speedup for lower triangular matrix inversion, 1.01--1.14x speedup for triangular solve with multiple right-hand sides, and 1.14--1.73x speedup over handwritten implementations of the bioinformatics recurrences.
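As a concrete instance of the recurrences involved, the standard linear-gap Needleman-Wunsch score table is defined by

\[
H_{i,j} \;=\; \max\bigl( H_{i-1,j-1} + s(x_i, y_j),\;\; H_{i-1,j} - g,\;\; H_{i,j-1} - g \bigr),
\]

where s is the substitution score and g the gap penalty. Tiling partitions the (i, j) iteration space into blocks whose interiors depend only on their boundary values, which is the structure a recursive tile decomposition exploits.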
In untrusted execution environments such as web browsers, code from remote sources is regularly executed. To harden these environments against attacks, constituent programming languages and their implementations must uphold certain safety properties, such as memory safety. These properties must be maintained across the entire compilation stack, which may include intermediate languages that do not provide the same safety guarantees. Any case where properties are not preserved could lead to a serious security vulnerability.
In this work, we identify a specification vulnerability in the WebGPU Shading Language (WGSL) where code with data races can be compiled to intermediate representations in which an optimizing compiler could legitimately remove memory safety guardrails. To address this, we present SafeRace, a collection of threat assessments and specification proposals across the WGSL execution stack. While our threat assessment showed that this vulnerability does not appear to be exploitable on current systems, it creates a “ticking time bomb”, especially as compilers in this area are rapidly evolving. Given this, we introduce the SafeRace Memory Safety Guarantee (SMSG), two components that preserve memory safety in the WGSL execution stack even in the presence of data races. The first component specifies that program slices contributing to memory indexing must be race free and is implemented via a compiler pass for WGSL programs. The second component is a requirement on intermediate representations that limits the effects of data races so that they cannot impact race-free program slices. While the first component is not guaranteed to apply to all possible WGSL programs due to limitations on how some data types can be accessed, we show that existing language constructs are sufficient to implement this component with minimal performance overhead on many existing important WebGPU applications. We test the second component by performing a fuzzing campaign of 81 hours across 21 compilation stacks; our results show violations on only one (likely buggy) machine, thus providing evidence that lower-level GPU frameworks could relatively straightforwardly support this constraint. Finally, our assessments discovered GPU memory isolation vulnerabilities in Apple and AMD GPUs, as well as a security-critical miscompilation of WGSL in a pre-release version of Firefox.
This paper introduces HEMVM, an innovative heterogeneous blockchain framework that seamlessly integrates diverse virtual machines (VMs), including the Ethereum Virtual Machine (EVM) and the Move Virtual Machine (MoveVM), into a unified system. This integration facilitates interoperability while retaining compatibility with existing Ethereum and Move toolchains by preserving high-level language constructs. HEMVM’s unique cross-VM operations allow users to interact with contracts across various VMs using any wallet software, effectively resolving the fragmentation in user experience caused by differing VM designs. Our experimental results demonstrate that HEMVM is both fast and efficient, incurring minimal overhead (less than 4.4%) for intra-VM transactions and achieving up to 9,300 TPS for cross-VM transactions. Our results also show that the cross-VM operations in HEMVM are sufficiently expressive to support complex decentralized finance interactions across multiple VMs. Finally, the parallelized prototype of HEMVM shows performance improvements up to 44.8% compared to the sequential version of HEMVM under workloads with mixed transaction types.
Compute Express Link (CXL) memory sharing, persistent memory, and other related technologies allow data to survive crash events. A key challenge is ensuring that data is consistent after crashes such that it can be safely accessed. While there has been much work on bug-finding tools for persistent memory programs, these tools cannot guarantee that a program is crash-consistent. In this paper, we present a language, CrashLang, and its type system, that together guarantee that well-typed data structure implementations written in CrashLang are crash-consistent. CrashLang leverages the well-known commit-store pattern in which a single store logically commits an entire data structure operation. In this paper, we prove that well-typed CrashLang programs are crash-consistent, and provide a prototype implementation of the CrashLang compiler. We have evaluated CrashLang on five benchmarks: the Harris linked list, the Treiber stack, the Michael–Scott queue, a Read-Copy-Update binary search tree, and a Cache-Line Hash Table. We experimentally verified that each implementation correctly survives crashes.
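The commit-store pattern itself, in a volatile Python stand-in for persistent memory (illustrative only; the type system is what makes the discipline checkable):

```python
# All writes build fresh memory off to the side; one final store of the root
# pointer commits the whole operation. A crash before the commit leaves the
# old state intact; a crash after it leaves the new state.
class PersistentList:
    def __init__(self):
        self.root = None                      # the single commit location

    def push(self, value, crash_before_commit=False):
        node = (value, self.root)             # built off to the side
        # -- a crash anywhere up to here loses only the uncommitted node --
        if crash_before_commit:
            return
        self.root = node                      # the commit store

    def snapshot(self):
        out, node = [], self.root
        while node is not None:
            out.append(node[0])
            node = node[1]
        return out

lst = PersistentList()
lst.push(1)
lst.push(2, crash_before_commit=True)         # simulated crash: 2 never visible
print(lst.snapshot())                         # [1] -- still consistent
```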
The human ability to think about thinking ("theory of mind") is a fundamental object of study in many disciplines. In recent decades, researchers across these disciplines have converged on a rich computational paradigm for modeling theory of mind, grounded in recursive probabilistic reasoning. However, practitioners often find programming in this paradigm challenging: first, because thinking-about-thinking is confusing for programmers, and second, because models are slow to run. This paper presents memo, a new domain-specific probabilistic programming language that overcomes these challenges: first, by providing specialized syntax and semantics for theory of mind, and second, by taking a unique approach to inference that scales well on modern hardware via array programming. memo enables practitioners to write dramatically faster models with much less code, and has already been adopted by several research groups.
Testing compilers with AI models, especially large language models (LLMs), has shown great promise. However, current approaches struggle with two key problems: The generated programs for testing compilers are often too simple, and extensive testing with the LLMs is computationally expensive. In this paper, we propose a novel compiler testing framework that decouples the testing process into two distinct phases: an offline phase and an online phase. In the offline phase, we use LLMs to generate a collection of small but feature-rich code pieces. In the online phase, we reuse these code pieces by strategically combining them to build high-quality and valid test programs, which are then used to test compilers.
We implement this idea in a tool, LegoFuzz, for testing C compilers. The results are striking: we found 66 bugs in GCC and LLVM, the most widely used C compilers. Almost half of the bugs are miscompilation bugs, which are serious and hard-to-find bugs that none of the existing LLM-based tools could find. We believe this efficient design opens up new possibilities for using AI models in software testing beyond just C compilers.
Many natural program correctness properties can be stated in terms of symmetries, but existing formal methods have little support for reasoning about such properties. We consider how to formally verify a broad class of symmetry properties expressed in terms of group actions. To specify these properties, we design a syntax for group actions, supporting standard constructions and a natural notion of entailment. Then, we develop a Hoare-style logic for verifying symmetry properties of imperative programs, where group actions take the place of the typical pre- and post-condition assertions. Finally, we develop a prototype tool SymVerif, and use it to verify symmetry properties on a series of handcrafted benchmarks. Our tool uncovered an error in a model of a dynamical system described by McLachlan and Quispel [Acta Numerica 2002].
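A small executable rendering of such a specification (testing, not verifying, the kind of property the logic would prove; the example is invented): the symmetric group S_3 acts on argument tuples, and the program must be invariant under that action.

```python
from itertools import permutations

def median_of_three(a, b, c):
    return sorted((a, b, c))[1]

def check_invariance(f, inputs):
    # The property: f(g . x) == f(x) for every group element g in S_n.
    for x in inputs:
        base = f(*x)
        for g in permutations(range(len(x))):   # each g is a group element
            acted = tuple(x[i] for i in g)      # the action g . x
            assert f(*acted) == base, (x, g)

check_invariance(median_of_three, [(3, 1, 2), (5, 5, 0), (-1, 7, 7)])
print("median_of_three is S_3-invariant on the sampled inputs")
```

Here the group action plays the role usually taken by pre- and post-condition assertions: the precondition permutes the inputs, and the postcondition demands an unchanged result.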
In static analysis frameworks for Java, the bytecode frontend serves as a critical component, transforming complex, stack-based Java bytecode into a more analyzable register-based, typed 3-address code representation. This transformation often significantly influences the overall performance of analysis frameworks, particularly when processing large-scale Java applications, rendering the efficiency of the bytecode frontend paramount for static analysis. However, the bytecode frontends of currently dominant Java static analysis frameworks, Soot and WALA, despite being time-tested and widely adopted, exhibit limitations in efficiency, hindering their ability to offer a better user experience.
To tackle efficiency issues, we introduce a new bytecode frontend. Typically, bytecode frontends consist of two key stages: (1) translating Java bytecode to untyped 3-address code, and (2) performing type inference on this code. For 3-address code translation, we identified common patterns in bytecode that enable more efficient processing than traditional methods. For type inference, we found that traditional algorithms often include redundant computations that hinder performance. Leveraging these insights, we propose two novel approaches: pattern-aware 3-address code translation and pruning-based type inference, which together form our new frontend and lead to significant efficiency improvements. In addition, our approach can generate SSA IR, enhancing its usability for various static analysis techniques.
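The first stage can be pictured as symbolic execution of the operand stack (a textbook toy with a simplified, invented opcode set, not the pattern-aware algorithm described above):

```python
from itertools import count

def to_three_address(bytecode):
    stack, code, fresh = [], [], count()
    for op, *args in bytecode:
        if op == "load":                 # push a local variable
            stack.append(args[0])
        elif op == "const":
            stack.append(str(args[0]))
        elif op in ("add", "mul"):
            rhs, lhs = stack.pop(), stack.pop()
            tmp = f"t{next(fresh)}"
            code.append(f"{tmp} = {lhs} {'+' if op == 'add' else '*'} {rhs}")
            stack.append(tmp)
        elif op == "store":              # pop into a local variable
            code.append(f"{args[0]} = {stack.pop()}")
    return code

# Stack code for: z = (x + y) * 2
prog = [("load", "x"), ("load", "y"), ("add",),
        ("const", 2), ("mul",), ("store", "z")]
print("\n".join(to_three_address(prog)))
# t0 = x + y
# t1 = t0 * 2
# z = t1
```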
We implemented our new bytecode frontend in Tai-e, a recent state-of-the-art static analysis framework for Java, and evaluated its performance across a diverse set of Java applications. Experimental results demonstrate that our frontend significantly outperforms Soot, WALA, and SootUp (an overhaul of Soot) in terms of efficiency, being on average 14.2×, 14.5×, and 75.2× faster than Soot, WALA, and SootUp, respectively. Moreover, additional experiments reveal that our frontend exhibits superior reliability in processing Java bytecode compared to these tools, thus providing a more robust foundation for Java static analysis.
Property-based testing (PBT) validates software against an executable specification by evaluating it on randomly generated inputs. The standard way that PBT users generate test inputs is via generators that describe how to sample test inputs through random choices. To achieve a good distribution over test inputs, users must tune their generators, i.e., decide on the weights of these individual random choices. Unfortunately, it is very difficult to understand how to choose individual generator weights in order to achieve a desired distribution, so today this process is tedious and limits the distributions that can be practically achieved.
In this paper, we develop techniques for the automatic and offline tuning of generators. Given a generator with undetermined symbolic weights and an objective function, our approach automatically learns values for these weights that optimize for the objective. We describe useful objective functions that allow users to (1) target desired distributions and (2) improve the diversity and validity of their test cases. We have implemented our approach in a novel discrete probabilistic programming system, Loaded Dice, that supports differentiation and parameter learning, and use it as a language for generators. We empirically demonstrate that our approach is effective at optimizing generator distributions according to the specified objective functions. We also perform a thorough evaluation on PBT benchmarks, demonstrating that, when automatically tuned for diversity and validity, the generators exhibit a 3.1-7.4x speedup in bug finding.
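The flavor of offline tuning, on a one-weight generator with a closed-form objective (an invented example; Loaded Dice learns many weights by differentiating the probabilistic program itself):

```python
import random

# A list generator keeps emitting elements with probability p; the length is
# geometric with E[len] = p / (1 - p). We learn p so that E[len] hits 5.
def gen_list(p, rng):
    xs = []
    while rng.random() < p:
        xs.append(rng.randint(0, 9))
    return xs

def expected_length(p):
    return p / (1.0 - p)

def tune(target, p=0.5, lr=0.0005, steps=5000):
    for _ in range(steps):
        err = expected_length(p) - target
        grad = 2 * err / (1.0 - p) ** 2      # d/dp of (E[len] - target)^2
        p = min(max(p - lr * grad, 0.01), 0.99)
    return p

p = tune(target=5.0)
rng = random.Random(0)
avg = sum(len(gen_list(p, rng)) for _ in range(20000)) / 20000
print(round(p, 3), round(avg, 2))   # p ≈ 0.833, empirical mean length ≈ 5
```

Real generators have many interacting weights and objectives (such as validity under a precondition) with no closed form, which is why a differentiable probabilistic language is needed.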
Abstract interpretation provides a systematic framework for static analysis, where widening operators are crucial for ensuring termination when analyzing programs over infinite-height lattices. Current abstract interpreters apply widening operations and fixpoint detection uniformly across all variables at the identified widening point (e.g., control-flow loop headers), leading to costly computations. Through our empirical study, we observe that infinite ascending chains typically originate from only a subset of variables involved in value-flow cycles, providing opportunities for selective widening and targeted fixpoint detection. This paper introduces an efficient approach to optimize abstract interpretation over non-relational domains through selective widening guided by value-flow analysis. We develop a modular and condensed value-flow graph (MVFG) that enables precise identification of variables requiring widening by detecting value-flow cycles across procedure boundaries. Our MVFG design incorporates efficient shortcut edges that summarize interprocedural value flows, achieving the precision of context-sensitive analysis but with linear complexity. By aligning value-flow cycles with the weak topological ordering (WTO) of the control-flow graph, we identify the minimal set of variables requiring widening operations, applying widening exclusively to variables that participate in value-flow back edges. Our evaluation on large-scale open-source projects shows that our selective widening approach reduces analysis time by up to 41.2% while maintaining identical precision. The method significantly reduces the number of widened variables by up to 99.5%, with greater benefits observed in larger codebases.
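A toy rendering of selective widening on intervals (the variable selection here is hard-coded, whereas the approach above derives it from interprocedural value-flow cycles):

```python
import math

# Interval analysis of `i = 0; x = 7; while ...: i += 1` at the loop head.
def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_loop(selective):
    env = {"i": (0, 0), "x": (7, 7)}       # abstract state at the loop head
    for _ in range(100):                    # chaotic iteration
        body = {"i": (env["i"][0] + 1, env["i"][1] + 1), "x": (7, 7)}
        new = {v: join(env[v], body[v]) for v in env}
        if new == env:
            break
        # Only `i` sits on a value-flow cycle; widening `x` is wasted work
        # (and, with thresholds or delays, can also cost precision).
        todo = ["i"] if selective else ["i", "x"]
        env = {v: widen(env[v], new[v]) if v in todo else new[v] for v in env}
    return env

print(analyze_loop(selective=True))   # {'i': (0, inf), 'x': (7, 7)}
```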
In formal hardware verification, particularly for Register-Transfer Level (RTL) designs in Verilog, model checking has been the predominant technique. However, it suffers from state explosion, limited expressive power, and a large trusted computing base (TCB). Deductive verification offers greater expressive power and enables foundational verification with a minimal TCB. Nevertheless, Verilog's standard semantics, characterized by nondeterministic and global scheduling, poses significant challenges to its application.
To address these challenges, we propose a new Verilog semantics designed to facilitate deductive verification. Our semantics is based on least fixpoints to enable cycle-level functional evaluation and modular reasoning. For foundational verification, we prove our semantics equivalent to the standard scheduling semantics for synthesizable designs. We demonstrate the benefits of our semantics with a modular verification of a pipelined RISC-V processor's functional correctness and progress guarantees. All our results are mechanized in Rocq.
Effect handlers are a programming language feature that has recently gained popularity. They allow for non-local yet structured control flow and subsume features like generators, exceptions, asynchronicity, etc. However, implementations of effect handlers currently often sacrifice features to enable efficient implementations. Meta-tracing just-in-time (JIT) compilers promise to yield the performance of a compiler by implementing only an interpreter. They record execution in a trace, dynamically detect hot loops, and aggressively optimize those using information available at runtime. They excel at optimizing dynamic control flow, which is exactly what effect handlers introduce. We present the first evaluation of tracing JIT compilation specifically for effect handlers. To this end, we developed RPython-based tracing JIT implementations for Eff, Effekt, and Koka by compiling them to a common bytecode format. We evaluate the performance, discuss which classes of effectful programs are optimized well, and examine how our additional optimizations influence performance. We also benchmark against a baseline of state-of-the-art mainstream language implementations.
The wide adoption of deep learning in safety-critical domains has driven the need for formally verifying the robustness of neural networks. A critical challenge in this endeavor lies in addressing the inherent non-linearity of activation functions. The convex hull of the activation function has emerged as a promising solution, as it effectively tightens variable ranges and provides multi-neuron constraints, which together enhance verification precision. Given that constructing exact convex hulls is computationally expensive and even infeasible in most cases, existing research has focused on over-approximating them. Several ad-hoc methods have been devised for specific functions such as ReLU and Sigmoid. Nonetheless, there remains a substantial gap in developing broadly applicable approaches for general activation functions.
In this work, we propose WraAct, an approach to efficiently constructing tight over-approximations for activation function hulls. Its core idea is to introduce linear constraints to smooth out the fluctuations in the target function, by leveraging double-linear-piece (DLP) functions to simplify the local geometry. In this way, the problem is reduced to over-approximating DLP functions, which can be efficiently handled. We evaluate WraAct against SBLM+PDDM, the state-of-the-art (SOTA) multi-neuron over-approximation method based on decomposing functions into segments. WraAct outperforms it on commonly-used functions like Sigmoid, Tanh, and MaxPool, offering superior efficiency (on average 400X faster) and precision (on average a 150X improvement) while constructing fewer constraints (average 50% reduction). It can complete the computation of up to 8 input dimensions in 10 seconds. We also integrate WraAct into a neural network verifier to evaluate its capability in verification tasks. On 100 benchmark samples, it significantly enhances single-neuron verification, raising the number of verified samples from under 10 to over 40, and outperforms the multi-neuron verifier PRIMA with up to 20 additional verified samples. On large networks like ResNets with 22k neurons, it can complete the verification of one sample within one minute.
The R programming language is primarily designed for statistical computing and mostly used by researchers without a background in computer science. R provides a wide range of dynamic features and peculiarities that are difficult to analyze statically, like dynamic scoping and lazy evaluation with dynamic side effects. At the same time, the R ecosystem lacks sophisticated analysis tools that support researchers in understanding and improving their code. In this paper, we present a novel static dataflow analysis framework for the R programming language that is capable of handling the dynamic nature of R programs and produces the dataflow graph of given R programs. This graph can be essential in a range of analyses, including program slicing, which we implement as a proof of concept. The core analysis works as a stateful fold over a normalized version of the abstract syntax tree of the R program, which tracks (re-)definitions, values, function calls, side effects, external files, and dynamic control flow to produce one dataflow graph per program. We evaluate the correctness of our analysis using output equivalence testing on a manually curated dataset of 779 sensible slicing points from executable real-world R scripts. Additionally, we use a set of systematic test cases based on the capabilities of the R language and the implementation of the R interpreter, and we measure the runtimes as well as the memory consumption on a set of 4,230 real-world R scripts and 20,815 packages available on R’s package manager CRAN. Furthermore, we evaluate the recall of our program slicer, its accuracy using shrinking, and its improvement over the state of the art. We correctly analyze almost all programs in our equivalence test suite, preserving the identical output for 99.7% of the manually curated slicing points. On average, we require 576ms to analyze the dataflow and around 213kB to store the graph of a research script. This shows that our analysis is capable of analyzing real-world sources quickly and correctly. Our slicer achieves an average reduction of 84.8% of tokens, indicating its potential to improve program comprehension.
Partial Differential Equations (PDEs) play a ubiquitous role in scientific computing and engineering. While numerical methods make solving PDEs tractable, these numerical solvers encounter several issues, particularly for hyperbolic PDEs. These issues arise from multiple sources including the PDE’s physical model, which can lead to effects like shock wave formation, and the PDE solver’s inherent approximations, which can introduce spurious numerical artifacts. These issues can cause the solver’s program execution to crash (due to overflow) or return results with unacceptable levels of inaccuracy (due to spurious oscillations or dissipation). Moreover, these challenges are compounded by the nonlinear nature of many of these PDEs. In addition, PDE solvers must obey numerical invariants like the CFL condition. Hence there exists a critical need to apply program analysis to PDE solvers to certify that such problems do not arise and that invariants are always satisfied.
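As a reference point for the invariants involved: for a linear advection equation \(u_t + a u_x = 0\) discretized with time step \(\Delta t\) and mesh spacing \(\Delta x\), the CFL condition requires

\[
\mathrm{CFL} \;=\; \frac{|a|\,\Delta t}{\Delta x} \;\le\; C_{\max},
\]

with \(C_{\max} = 1\) for explicit upwind schemes. An abstract interpreter must certify this inequality over every reachable abstract state, not merely observe it on a single concrete run.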
As a solution, we develop Phocus, which is the first abstract interpretation of hyperbolic PDE solvers. Phocus can certify precise bounds on nonlinear PDE solutions and certify key invariants such as the CFL condition and a solution’s total variation bound. Hence Phocus can verify the absence of shock formation, the stability of the solver, and bounds on the amount of spurious numerical effects. To enable effective abstract interpretation of hyperbolic PDE solvers, Phocus uses a novel optimization-based procedure to synthesize precise abstract transformers for multiple finite difference schemes. To evaluate Phocus, we develop a new set of PDE benchmark programs and use them to perform an extensive experimental evaluation which demonstrates Phocus’s significant precision benefits and scalability to several thousand mesh points.
Persistent memory (PM), with its ability to retain data across crashes, has found widespread applications. However, programmers have to manually annotate PM operations in programming to achieve crash consistency, which is labor-intensive and error-prone. In this paper, to alleviate the burden of programming PM applications, we develop HybridPersist, compiler support for user-friendly and efficient PM programming. On the one hand, HybridPersist automatically achieves crash consistency while requiring only negligible annotations from programmers. On the other hand, it enhances both the performance and correctness of PM programs through a series of dedicated analysis passes. The evaluations on well-known benchmarks validate that HybridPersist offers superior programming productivity and runtime performance compared to the state-of-the-art.
Confidential computing (CC), designed for security-critical scenarios, uses remote attestation to guarantee code integrity on cloud servers. However, CC alone cannot provide assurance of high-level security properties (e.g., no data leak) on the code. In this paper, we introduce a novel framework, Agora, scrupulously designed to provide a trustworthy and open verification platform for CC. To promote trustworthiness, we observe that certain verification tasks can be delegated to untrusted entities, while the corresponding (smaller) validators are securely housed within the trusted computing base (TCB). Moreover, through a novel blockchain-based bounty task manager, Agora also utilizes crowdsourcing to remove trust in complex theorem provers. These synergistic techniques successfully reduce the TCB size burden associated with two procedures: binary analysis and theorem proving. To promote openness, Agora supports a versatile assertion language that allows verification of various security policies. Moreover, the design of Agora enables untrusted parties to participate in complex processes outside of Agora’s TCB. By implementing verification workflows for software-based fault isolation, information flow control, and side-channel mitigation policies, our evaluation demonstrates the efficacy of Agora.
Quantum computing platforms rely on simulators for modelling circuit behaviour prior to hardware execution, where inconsistencies can lead to costly errors. While existing formal validation methods typically target specific compiler components to manage state explosion, they often miss critical bugs. Meanwhile, conventional testing lacks systematic exploration of corner cases and realistic execution scenarios, resulting in both false positives and negatives.
We present FuzzQ, a novel framework that bridges this gap by combining formal methods with structured test generation and fuzzing for quantum simulators. Our approach employs differential benchmarking complemented by mutation testing and invariant checking. At its core, FuzzQ utilises our Alloy-based formal model of QASM 3.0, which encodes the semantics of quantum circuits to enable automated analysis and to generate structurally diverse, constraint-guided quantum circuits with guaranteed properties. We introduce several test oracles to assess both Alloy’s modelling of QASM 3.0 and simulator correctness, including invariant-based checks, statistical distribution tests, and a novel cross-simulator unitary consistency check that verifies functional equivalence modulo global phase, revealing discrepancies that standard statevector comparisons fail to detect in cross-platform differential testing.
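The idea behind the phase-insensitive check can be sketched as follows (our simplification, not FuzzQ’s implementation): two statevectors describe the same physical state exactly when their inner product has unit modulus.

```python
import numpy as np

# Two statevectors are equivalent modulo global phase iff one is a
# unit-modulus scalar multiple of the other, i.e. |<psi|phi>| = 1.

def equivalent_up_to_global_phase(psi: np.ndarray, phi: np.ndarray,
                                  tol: float = 1e-9) -> bool:
    psi = psi / np.linalg.norm(psi)
    phi = phi / np.linalg.norm(phi)
    return abs(abs(np.vdot(psi, phi)) - 1.0) <= tol

psi = np.array([1, 1j]) / np.sqrt(2)
phi = np.exp(1j * 0.7) * psi           # same state, different global phase
assert equivalent_up_to_global_phase(psi, phi)
assert not np.allclose(psi, phi)       # a naive elementwise comparison would flag a mismatch
```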
We evaluate FuzzQ on both Qiskit and Cirq, demonstrating its platform-agnostic effectiveness. By executing over 800,000 quantum circuits to completion, we assess throughput, code and circuit coverage, and simulator performance metrics, including sensitivity, correctness, and memory overhead. Our analysis revealed eight simulator bugs, six previously undocumented. We also outline a path for extending the framework to support mixed-state simulations under realistic noise models.
To protect security-critical applications, secure compilers have to preserve security policies, such as non-interference, during compilation. The preservation of security policies goes beyond the classical notion of compiler correctness, which only enforces the preservation of the semantics of the source program. As a result, several standard compiler optimisations, although semantics-preserving, are prone to break standard security policies like non-interference. Existing approaches to secure compilation are very restrictive with respect to the compiler optimisations they permit or the security policies they support, because of conceptual limitations in their formal setup.
In this paper, we present hyperproperty simulations, a novel framework for secure compilation that models the preservation of arbitrary k-hyperproperties during compilation and overcomes several limitations of existing approaches; in particular, it is more expressive and more flexible. We demonstrate this by designing and proving a generic non-interference-preserving code transformation that can be applied to different optimisations and leakage models. This approach reduces the proof burden per optimisation to a minimum. We instantiate this code transformation on different leakage models with various standard compiler optimisations that could be handled only in a very limited and less modular way (if at all) by existing approaches. Our results are formally verified in the Rocq theorem prover.
The goal of active learning for program synthesis is to synthesize the desired program by asking targeted questions that minimize user interaction. While prior work has explored active learning in the purely symbolic setting, such techniques are inadequate for the increasingly popular paradigm of neurosymbolic program synthesis, where the synthesized program incorporates neural components. When applied to the neurosymbolic setting, such techniques can -- and, in practice, do -- return an unintended program due to mispredictions of neural components. This paper proposes a new active learning technique that can handle the unique challenges posed by neural network mispredictions. Our approach is based upon a new evaluation strategy called constrained conformal evaluation (CCE), which accounts for neural mispredictions while taking into account user-provided feedback. Our proposed method iteratively makes CCE more precise until all remaining programs are guaranteed to be observationally equivalent. We have implemented this method in a tool called SmartLabel and experimentally evaluated it on three neurosymbolic domains. Our results demonstrate that SmartLabel identifies the ground truth program for 98% of the benchmarks, requiring under 5 rounds of user interaction on average. In contrast, prior techniques for active learning are only able to converge to the ground truth program for at most 65% of the benchmarks.
GraalVM Native Image is increasingly used to optimize the startup performance of applications that run on the Java Virtual Machine (JVM), and particularly of Function-as-a-Service and Serverless workloads. Native Image resorts to Ahead-of-Time (AOT) compilation to produce a binary from a JVM application that contains a snapshot of the pre-initialized heap memory, reducing the initialization time and hence improving startup performance. However, this performance improvement is hindered by page faults that occur when accessing objects in the heap snapshot. Related work has proposed profile-guided approaches to reduce page faults by reordering objects in the heap snapshot of an optimized binary based on the order in which objects are first accessed, obtaining this information by profiling an instrumented binary of the same application. This reordering is effective only if objects in the instrumented binary can be matched to the semantically equivalent ones in the optimized binary. Unfortunately, this is very challenging because objects do not have unique identities and the heap-snapshot contents are not persistent across Native-Image builds of the same program. This work tackles the problem of matching heap snapshots, and proposes a novel approach to improve the mapping between semantically equivalent objects in different binaries of a Native-Image application. We introduce the concept of context-augmented heap path (CAHP)—a list of elements that describes a path to an object stored in the heap snapshot. Our approach associates a CAHP with each object in a way that is as unique as possible. Objects with the same CAHP across different binaries are considered semantically equivalent. Moreover, since some semantically equivalent objects may have different CAHPs in the instrumented and optimized binaries (due to nondeterminism in the image-build process and other factors), we present an approach that finds, for each unmatched CAHP in the optimized binary, the most similar CAHP in the instrumented binary, associating the two objects. We integrate our approach into Native Image, reordering the objects stored in the heap snapshot more effectively using the improved mapping. Our experiments show that our approach leads to far fewer page faults (a 2.98× reduction on average) and considerably improves startup time (a 1.98× speedup on average) with respect to the original Native-Image implementation.
Code style transformation models built on code Language Models (code LMs) have achieved remarkable success. However, they typically focus on basis style transformations, where the target style follows a single criterion, and often struggle with combination styles, where the target style involves multiple criteria. In practice, style guides encompass multiple criteria, making the lack of effective combination style transformation a major limitation to their real-world applicability.
In this paper, we propose Absent-Basis-Combination (abbreviated as ABC), a novel framework for code style transformation that significantly improves combination style transformation and overcomes the limitations of existing approaches. We implement four variants of ABC with parameter sizes of 0.5B, 1.3B, 1.5B, and 3B, demonstrating consistent superiority over existing approaches across all model sizes in both basis and combination style transformations. Specifically, ABC achieves performance gains of up to 86.70%, and remains superior even when baseline approaches use three times the parameters. Furthermore, to address the lack of high-quality datasets and evaluation metrics, we construct and release a new style transformation dataset, Basis & Combination Code Style (abbreviated as BCCStyle), and introduce Code Sequence, Syntactic, Semantic and Stylistic BLEU (abbreviated as CS4BLEU), a novel code similarity metric that surpasses existing metrics in accuracy and consistency.
Compilers are among the most foundational software ever developed. A critical component of a compiler is its optimization phase, which enhances the efficiency of the generated code. Given the sheer size and complexity of modern compilers, automated techniques for improving their optimization component have been a major area of research. This paper focuses on a specific category of issues, namely missed optimizations, where compilers fail to apply an optimization that could have made the generated code more efficient.
To detect missed optimizations, we propose Synchronized Behavior Checking (SBC), a novel approach that cross-validates multiple optimizations by leveraging their coordinated behaviors. The key insight behind SBC is that the outcome of one optimization can validate whether the conditions required for another optimization were met.
For a practical implementation of SBC, we cross-validate two optimizations at once based on two kinds of relationships: co-occurring and complementary. First, in the co-occurring relationship, if an optimization is applied based on a specific semantic constraint (i.e., an optimization condition) from an input program, another optimization that depends on the same semantic constraint should be applied as well. Second, when two optimizations are enabled by complementary semantic constraints, exactly one of the two optimizations should be applied. When an optimization should have been applied (according to either relationship) but was not, we regard it as a missed optimization.
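The checking logic implied by these two relationships can be sketched as follows (rule names and pairings are hypothetical, not actual LLVM/GCC optimization identifiers).

```python
# A minimal sketch of SBC's cross-validation logic. Given the set of
# optimizations the compiler reports as applied, co-occurring pairs
# must fire together and complementary pairs must fire exactly once.

CO_OCCURRING = [("licm.hoist", "gvn.reuse")]           # same enabling constraint
COMPLEMENTARY = [("unswitch.true", "unswitch.false")]  # mutually exclusive

def missed_optimizations(applied: set[str]) -> list[str]:
    reports = []
    for a, b in CO_OCCURRING:
        if (a in applied) != (b in applied):           # only one fired
            missing = b if a in applied else a
            reports.append(f"co-occurring violation: {missing} not applied")
    for a, b in COMPLEMENTARY:
        if (a in applied) == (b in applied):           # both or neither fired
            reports.append(f"complementary violation: expected exactly one of {a}, {b}")
    return reports

print(missed_optimizations({"licm.hoist"}))     # flags the missing gvn.reuse
print(missed_optimizations({"unswitch.true"}))  # empty: exactly one applied, as expected
```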
We conduct an extensive evaluation of SBC on two state-of-the-art industry compilers, LLVM and GCC. SBC successfully detects a large number of missed optimizations in both compilers, rooted in a wide range of compiler analyses. Based on our evaluation results, we reported 101 issues to LLVM and GCC, of which 84 have been confirmed and 39 have been fixed or assigned (for planned fixes).
SBC opens up a new, exciting direction for finding missed compiler optimizations.
The computational fabrication pipeline for 3D printing is much like a compiler --- users design models in Computer Aided Design (CAD) tools that are lowered to polygon meshes to be ultimately compiled to machine code by 3D slicers. For traditional compilers and programming languages, techniques for checking program invariants are well-established. Similarly, methods like differential testing are frequently used to uncover bugs in compilers themselves, which makes them more reliable.
The fabrication pipeline would benefit from similar techniques but traditional approaches do not directly apply to the representations used in this domain. Unlike traditional programs, 3D models exist both as geometric objects (a CAD model or a polygon mesh) as well as machine code that ultimately runs on the hardware. The machine code, like in traditional compiling, is affected by many factors like the model, the slicer being used, and numerous user-configurable parameters that control the slicing process.
In this work, we propose a new algorithm for lifting G-code (a common language used in many fabrication pipelines) by denoting a G-code program as a set of cuboids, and then defining an approximate point cloud representation for efficiently operating on these cuboids. Our algorithm opens up new opportunities: we show three use cases that demonstrate how it enables (1) error localization in CAD models through invariant checking, (2) quantitative comparisons between slicers, and (3) evaluating the efficacy of mesh repair tools. We present a prototype implementation of our algorithm in a tool, GlitchFinder, and evaluate it on 58 real-world CAD models. Our results show that GlitchFinder is particularly effective in identifying slicing issues due to small features, can highlight differences in how popular slicers (Cura and PrusaSlicer) slice the same model, and can identify cases where mesh repair tools (MeshLab and Meshmixer) introduce new errors during repair.
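The following sketch (ours, not GlitchFinder code; the track width and layer height are simplifying assumptions) illustrates the denotation step for a single extrusion move and the subsequent point-cloud approximation.

```python
import numpy as np

# Illustrative lifting of one G-code extrusion move to a cuboid and
# then to sample points. The cross-section parameters w and h are
# hypothetical simplifications, not GlitchFinder's actual model.

def move_to_cuboid(p0, p1, w=0.4, h=0.2):
    """Approximate the material deposited by a G1 move from p0 to p1
    as an oriented cuboid: length along the move, cross-section w x h."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    direction = p1 - p0
    length = np.linalg.norm(direction)
    return {"origin": p0, "axis": direction / length,
            "length": length, "width": w, "height": h}

def cuboid_to_points(c, step=0.1):
    """Sample the cuboid's centerline into an approximate point cloud."""
    ts = np.arange(0.0, c["length"] + step, step)
    return c["origin"] + np.outer(ts, c["axis"])

cuboid = move_to_cuboid((0, 0, 0.2), (10, 0, 0.2))
print(len(cuboid_to_points(cuboid)))  # roughly 101 centerline samples
```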
Fairness in machine learning is more important than ever as ethical concerns continue to grow. Individual fairness demands that individuals differing only in sensitive attributes receive the same outcomes. However, commonly used machine learning algorithms often fail to achieve such fairness. To improve individual fairness, various training methods have been developed, such as incorporating fairness constraints as optimisation objectives. While these methods have demonstrated empirical effectiveness, they lack formal guarantees of fairness. Existing approaches that aim to provide fairness guarantees primarily rely on verification techniques, which can sometimes fail to produce definitive results. Moreover, verification alone does not actively enhance individual fairness during training. To address this limitation, we propose a novel framework that formally guarantees individual fairness throughout training. Our approach consists of two parts: (1) provably fair initialisation that ensures the model starts in a fair state, and (2) a fairness-preserving training algorithm that maintains fairness as the model learns. A key element of our method is the use of randomised response mechanisms, which protect sensitive attributes while maintaining fairness guarantees. We formally prove that this mechanism sustains individual fairness throughout the training process. Experimental evaluations confirm that our approach is effective, producing models that are empirically fair and accurate. Furthermore, our approach is much more efficient than the alternative approach based on certified training (which requires neural network verification during training).
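For readers unfamiliar with randomised response, here is a minimal sketch of the classic mechanism (the flip probability p is illustrative; the paper’s concrete instantiation may differ): each individual’s sensitive attribute is reported truthfully only with some probability, so training never depends deterministically on it.

```python
import random

# Classic randomised response: report the true sensitive bit with
# probability p, otherwise report a uniformly random bit.

def randomized_response(sensitive_bit: int, p: float = 0.75) -> int:
    if random.random() < p:
        return sensitive_bit
    return random.randint(0, 1)

reports = [randomized_response(1) for _ in range(10_000)]
# Expected fraction of 1s: p * 1 + (1 - p) * 0.5 = 0.875 for p = 0.75
print(sum(reports) / len(reports))
```

Because the reported attribute is noisy in a calibrated way, aggregate statistics remain estimable while no single report reveals the true sensitive value.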
Dynamic and polymorphic languages attach information, such as types, to run time objects, and therefore adapt the memory layout of values to include space for this information. This makes it difficult to efficiently implement IEEE754 floating-point numbers as this format does not leave an easily accessible space to store type information. The three main floating-point number encodings in use today, tagged pointers, NaN-boxing, and NuN-boxing, have drawbacks. Tagged pointers entail a heap allocation of all float objects, and NaN/NuN-boxing puts additional run time costs on type checks and the handling of other objects.
This paper introduces self-tagging, a new approach to object tagging that uses an invertible bitwise transformation to map floating-point numbers to tagged values that contain the correct type information at the correct position in their bit pattern, superimposing both their value and type information in a single machine word. Such a transformation can only map a subset of all floats to correctly typed tagged values, hence self-tagging takes advantage of the non-uniform distribution of floating point numbers used in practice to avoid heap allocation of the most frequently encountered floats.
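A deliberately simplified sketch of the idea follows (ours; the tag width, tag value, and the exact invertible transformation used by the actual Scheme compilers differ). Rotating the 64-bit float pattern moves the sign and top exponent bits into the low tag position, so floats whose top bits happen to equal the float tag carry their type "for free", while all others fall back to heap allocation.

```python
import struct

# Illustrative self-tagging via a 3-bit left rotation. With this tag
# value, positive floats with magnitudes roughly in [1e-154, 2.0)
# rotate to a correctly tagged word; other floats return None and
# would be heap-allocated. All constants here are hypothetical.

TAG_BITS, FLOAT_TAG = 3, 0b001
MASK = (1 << 64) - 1

def float_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def rotl(v: int, k: int) -> int:
    return ((v << k) | (v >> (64 - k))) & MASK

def try_self_tag(x: float) -> int | None:
    """Return the tagged word, or None if x would need heap allocation."""
    tagged = rotl(float_bits(x), TAG_BITS)
    return tagged if tagged & ((1 << TAG_BITS) - 1) == FLOAT_TAG else None

def untag(tagged: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", rotl(tagged, 64 - TAG_BITS)))[0]

assert try_self_tag(1.5) is not None       # common small float: self-tagged
assert try_self_tag(3.0) is None           # outside the tagged range: heap fallback
assert untag(try_self_tag(1.5)) == 1.5     # the transformation is invertible
```

The key point the sketch captures is that only a subset of bit patterns self-tag, which is why exploiting the non-uniform distribution of floats encountered in practice matters.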
Variants of self-tagging were implemented in two distinct Scheme compilers and evaluated on four microarchitectures to assess their performance and compare them to tagged pointers, NaN-boxing, and NuN-boxing. Experiments demonstrate that, in practice, the approach eliminates heap allocation of nearly all floating-point numbers and provides good execution speed of float-intensive benchmarks in Scheme with a negligible performance impact on other benchmarks, making it an attractive alternative to tagged pointers, alongside NaN-boxing and NuN-boxing.
Despite extensive in-house testing, bugs often escape to deployed software. Whenever a failure occurs in production software, it is desirable to collect as much execution information as possible to help developers reproduce, diagnose, and fix the bug. To reconcile the tension between trace capability, runtime overhead, and trace scale, we propose continuous tail tracing for production use. Instead of capturing only crash stacks, we produce the complete sequence of function calls and returns. Importantly, to avoid the overwhelming stress on I/O, storage, and network transfer caused by the tremendous amount of trace data, we retain only the final segment of the trace. To accomplish this, we design a novel trace decoder to support precise tail trace decoding, and an effective path-based instrumentation-site selection algorithm to reduce overhead. We implemented our approach as a tool called TailTracer on top of LLVM, and conducted evaluations over the SPEC CPU 2017 benchmark suite, an open-source database system, and real-world bugs. The experimental results validate that TailTracer achieves low-overhead tail tracing while providing more informative trace data than the baseline.
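The core idea of retaining only the final trace segment can be sketched with a ring buffer (ours, not TailTracer’s on-disk format or instrumentation).

```python
from collections import deque

# Call/return events go into a fixed-capacity ring buffer, so only the
# final segment of the trace survives a failure, bounding I/O and
# storage regardless of how long the program ran.

class TailTrace:
    def __init__(self, capacity: int = 4):
        self.buf = deque(maxlen=capacity)  # old events fall off the front

    def on_call(self, fn: str):   self.buf.append(("call", fn))
    def on_return(self, fn: str): self.buf.append(("ret", fn))
    def dump_tail(self):          return list(self.buf)

trace = TailTrace(capacity=4)
for fn in ["main", "parse", "eval"]:
    trace.on_call(fn)
trace.on_return("eval")
trace.on_return("parse")
print(trace.dump_tail())
# [('call', 'parse'), ('call', 'eval'), ('ret', 'eval'), ('ret', 'parse')]
```

Note that the surviving tail may begin mid-call (the call to main has been evicted here), which is precisely why a dedicated decoder is needed to reconstruct a precise tail trace.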
Zero-knowledge proof (ZKP) applications require translating high-level programs into arithmetic circuits—a process that demands both correctness and efficiency. While recent DSLs improve usability, they often yield suboptimal circuits, and hand-optimized implementations remain difficult to construct and verify. We present Tabby, a synthesis-aided compiler that automates the generation of high-performance ZK circuits from high-level code. Tabby introduces a domain-specific intermediate representation designed for symbolic reasoning and applies sketch-based program synthesis to derive optimized low-level implementations. By decomposing programs into reusable components and verifying semantic equivalence via SMT-based reasoning, Tabby ensures correctness while achieving substantial performance improvements. We evaluate Tabby on a suite of real-world ZKP applications and demonstrate significant reductions in proof generation time and circuit size against mainstream ZK compilers.
Context sensitivity is a foundational technique in pointer analysis, essential for improving precision but often incurring significant efficiency costs. Recent advances focus on selective context-sensitive analysis, where only a subset of program elements, such as methods or heap objects, are analyzed under context sensitivity while the rest are analyzed under context insensitivity, aiming to balance precision with efficiency. However, despite the proliferation of such approaches, existing methods are typically driven by specific code patterns and therefore lack a comprehensive theoretical foundation for systematically identifying code scenarios that benefit from context sensitivity.
This paper presents a novel and foundational theory that establishes a sound over-approximation of the ground truth, i.e., the objects that really improve precision under context sensitivity. The proposed theory reformulates the identification of this upper bound as a set of graph reachability problems over a typical Pointer Flow Graph (PFG), each of which can be efficiently solved under context insensitivity. Building on this theoretical foundation, we introduce our selective context-sensitive analysis approach, Moon. Moon performs both backward and forward traversal on a Variable Flow Graph (VFG), an optimized variant of the PFG designed to facilitate efficient traversal. This traversal systematically identifies all objects that improve precision under context sensitivity. Our theoretical foundation, along with carefully designed trade-offs within our approach, allows Moon to limit the scope of objects to be selected, leading to an effective balance between analysis precision and efficiency. Extensive experiments with Moon across 30 Java programs demonstrate that Moon achieves 37.2X and 382.0X speedups for 2-object-sensitive and 3-object-sensitive analyses, respectively, with negligible precision losses of only 0.1% and 0.2%. These results highlight that the balance between efficiency and precision achieved by Moon significantly outperforms all previous approaches.
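A schematic rendering of the reachability idea follows (the graph, node names, and precision criterion are all illustrative, not Moon’s actual selection rule): objects from which some precision-critical variable is reachable on the flow graph form a sound over-approximation of the objects worth analyzing context-sensitively.

```python
from collections import defaultdict, deque

# Backward reachability on a toy value-flow graph: select only those
# abstract objects that can flow into a precision-critical variable.

def reachable(graph: dict, sources: set) -> set:
    seen, work = set(sources), deque(sources)
    while work:
        n = work.popleft()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                work.append(m)
    return seen

flow = {"o1": ["v1"], "v1": ["v2"], "o2": ["v3"]}  # value-flow edges
critical = {"v2"}                                   # precision-critical sinks
rev = defaultdict(list)
for src, dsts in flow.items():
    for d in dsts:
        rev[d].append(src)

selected = {n for n in reachable(rev, critical) if n.startswith("o")}
print(selected)  # {'o1'}: only o1 flows into a critical variable
```

Everything outside the selected set can safely stay context-insensitive, which is where the speedups come from.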
Capturing types in Scala unify static effect and resource tracking with object capabilities, enabling lightweight effect polymorphism with minimal notational overhead. However, their expressiveness has been insufficient for tracking capabilities embedded in generic data structures, preventing them from scaling to the standard collections library – an essential prerequisite for broader adoption. This limitation stems from the inability to name capabilities within the system’s notion of box types.
This paper develops System Capless, a new foundation for capturing types that provides the theoretical basis for reach capabilities (rcaps), a novel mechanism for naming “what’s in the box”. The calculus refines the universal capability notion into a new scheme with existential and universal capture set quantification. Intuitively, rcaps witness existentially quantified capture sets inside the boxes of generic types in a way that does not require exposing existential capture types in the surface language. We have fully mechanized the formal metatheory of System Capless in Lean, including proofs of type soundness and scope safety. System Capless supports the same lightweight notation of capturing types plus rcaps, as certified by a type-preserving translation, and also enables fully optional explicit capture-set quantification to increase expressiveness.
Finally, we present a full reimplementation of capture checking in Scala 3 based on System Capless and migrate the entire Scala collections library and an asynchronous programming library to evaluate its practicality and ergonomics. Our results demonstrate that reach capabilities enable the adoption of capture checking in production code with minimal changes and minimal-to-zero notational overhead in a vast majority of cases.
Multiple frameworks and optimizations have been proposed for accelerating Graph Neural Network (GNN) workloads over the years, achieving sizable runtime performance improvements. However, we notice that existing systems usually explore optimizing either at the intra-operator level or at the inter-operator level, missing synergies that exist due to their composition. Further, most existing works focus primarily on optimizing the forward computation of GNNs, often overlooking opportunities for training-specific optimizations. To exploit these missed optimization opportunities, we introduce GALA, a domain-specific language (DSL) and a compiler that allows composing optimizations at different levels. The GALA DSL exposes intra-operator transformations as scheduling commands, while we introduce novel inter-operator transformations as part of the compiler. The composition of these transformations is made possible through the introduction of two novel intermediate representations (IRs) in the GALA compiler that track and compose transformations at both the intra- and inter-operator levels. Further, the IRs maintain a global view of the GNN program, including its training process. This allows us to introduce training-specific transformations to aggressively optimize GNN training. Our evaluations show that GALA achieves a geo-mean speedup of 2.55× for inference and 2.52× for training across multiple systems, graphs, and GNN models. We also show that GALA performs well across different graph sizes and GNN model configurations, and allows users to explore different methods of performing similar optimizations, leading to different tradeoff spaces.
Choreographic programming is a promising new paradigm for programming concurrent systems where a developer writes a single centralized program that compiles to individual programs for each node. Existing choreographic languages, however, lack critical features integral to modern systems, like the ability of one node to dynamically compute who should perform a computation and send that decision to others. This work addresses this gap with λQC, the first typed choreographic language with first class process names and polymorphism over both types and (sets of) locations. λQC also improves expressive power over previous work by supporting algebraic and recursive data types as well as multiply-located values. We formalize and mechanically verify our results in Rocq, including the standard choreographic guarantee of deadlock freedom.
In this paper, we present structural abstraction refinement, a novel framework for verifying the threshold problem of probabilistic programs. Our approach represents the structure of a Probabilistic Control-Flow Automaton (PCFA) as a Markov Decision Process (MDP) by abstracting away statement semantics. The maximum reachability of the MDP naturally provides a proper upper bound on the violation probability, termed the structural upper bound. This introduces a fresh “structural” characterization of the relationship between PCFA and MDP, contrasting with the traditional “semantical” view, where the MDP reflects semantics. The method uniquely features a clean separation of concerns between probability and computational semantics: the abstraction focuses solely on probabilistic computation, while the refinement handles only the semantic aspect, which allows non-random program verification techniques to be employed without modification.
Building upon this feature, we propose a general counterexample-guided abstraction refinement (CEGAR) framework, capable of leveraging established non-probabilistic techniques for probabilistic verification. We explore its instantiations using trace abstraction. Our method was evaluated on a diverse set of examples against state-of-the-art tools, and the experimental results highlight its versatility and ability to handle more flexible structures swiftly.
Reachability types are a recent proposal to bring Rust-style reasoning about memory properties to higher-level languages, with a focus on higher-order functions, parametric types, and shared mutable state -- features that are only partially supported by current techniques as employed in Rust. While prior work has established key type soundness results for reachability types using the usual syntactic techniques of progress and preservation, stronger metatheoretic properties have so far been unexplored. This paper presents an alternative semantic model of reachability types using logical relations, providing a framework in which we study key properties of interest: (1) semantic type soundness, including for code fragments that are not syntactically well-typed, (2) termination, especially in the presence of higher-order mutable references, (3) effect safety, especially the absence of observable mutation, and, finally, (4) program equivalence, especially reordering of non-interfering expressions for parallelization or compiler optimization.
Live programming environments provide various semantic services, including type checking and evaluation, continuously as the user is editing the program. The live paradigm promises to improve the developer experience, but liveness is an implementation challenge, particularly when working with large programs. This paper specifies and efficiently implements a system that is able to incrementally update type information for a live program in response to fine-grained program edits. This information includes type error marks and information about the expected and actual type of every expression. The system is specified type-theoretically as a small-step dynamics that propagates updates through the marked and annotated program. Most updates flow according to a base bidirectional type system. Additional pointers are maintained to connect bound variables to their binding locations, with type updates traversing these pointers directly. Order maintenance data structures are employed to efficiently maintain these pointers and to prioritize the order of update propagation. We prove this system is equivalent to naive reanalysis in the Agda theorem prover, along with other important metatheoretic properties. We then provide an efficient OCaml implementation, detailing a number of impactful optimizations. We evaluate this implementation's performance with a large stress-test and find that it is able to achieve multiple orders of magnitude speed-up compared to from-scratch reanalysis.
Despite the wide success of neural networks, their computational cost is very high. Quantization techniques reduce this cost, but they can change the classifications of the original floating-point network, even if the training is quantization-aware. In this work, we rely on verification to design correction systems that detect classification inconsistencies at inference time and eliminate them. The key idea is to overapproximate the space of inconsistent inputs with their maximal classification confidence. The main challenge in computing this confidence is that it involves analyzing a quantized network, which introduces a very high degree of nonlinearity, over all possible inputs. We propose CoMPAQt, an algorithm for computing this confidence. CoMPAQt relies on a novel encoding of quantization in mixed-integer linear programming (MILP), along with customized linear relaxations to reduce the high complexity. To prune the search space, it ties the computations of the quantized network and its floating-point counterpart. Given this confidence, we propose two correction mechanisms. The first mechanism guarantees to return the classification of the floating-point network and relies on networks with increasing bit precisions. The second mechanism mitigates classification inconsistencies by an ensemble of quantized networks. We evaluate our approach on MNIST, ACAS-Xu, and tabular datasets over fully connected and convolutional networks. Results show that our first correction mechanism guarantees 100% consistency with the floating-point network’s classifications while reducing its computational cost by 3.8x, on average. Our second mechanism reaches an almost perfect consistency guarantee in our experiments while reducing the computational cost by 4.1x. Our work is the first to provide a formal guarantee on the classification consistency of a quantized network.
Refinement relates an abstract system model to a concrete, executable program, such that properties established for the abstract model carry over to the concrete implementation. Refinement has been used successfully in the development of substantial verified systems. Nevertheless, existing refinement techniques have limitations that impede their practical usefulness. Top-down refinement techniques that automatically generate executable code generally produce implementations with sub-optimal performance. Bottom-up refinement allows one to reason about existing, efficient implementations, but imposes strict requirements on the structure of the code, the structure of the refinement proofs, as well as the employed verification logic and tools.
In this paper, we present a novel bottom-up refinement methodology that removes these limitations. Our methodology uses the familiar notion of guarded transition systems to express abstract models, but combines guards with a novel notion of locally inductive invariants to relate the abstract model locally to concrete state. This approach is much more flexible than standard coupling invariants; in particular, it supports a wide range of program structures, data representations, and proof structures. We integrate our methodology as a library into Rust, leveraging the Rust type system to reason about ownership of guards. This integration allows one to use our methodology with an off-the-shelf Rust verification tool. It also facilitates practical applications, as we demonstrate on a number of substantial case studies including a concurrent implementation of Memcached.
We propose a compositional approach to combine and scale automated reasoning in the static analysis of decentralized system security, such as blockchains. Our focus lies in the game-theoretic security analysis of such systems, allowing us to examine economic incentives behind user actions. In this context, it is particularly important to certify that deviating from the intended, honest behavior of the decentralized protocol is not beneficial: as long as users follow the protocol, they cannot be financially harmed, regardless of how others behave. Such an economic analysis of blockchain protocols can be encoded as an automated reasoning problem in the first-order theory of real arithmetic, reducing game-theoretic reasoning to satisfiability modulo theories (SMT). However, analyzing an entire game-theoretic model (called a game) as a single SMT instance does not scale to protocols with millions of interactions. We address this challenge and propose a divide-and-conquer security analysis based on compositional reasoning over games. Our compositional analysis is incremental: we divide games into subgames such that changes to one subgame do not necessitate re-analyzing the entire game, but only the ancestor nodes. Our approach is sound, complete, and effective: combining the security properties of subgames yields security of the entire game. Experimental results show that compositional reasoning discovers intra-game properties and errors while scaling to games with millions of nodes, enabling security analysis of large protocols.
Enabling more concise and modular proofs is essential for advancing formal reasoning using interactive theorem provers (ITPs). Since many ITPs, such as Rocq and Lean, use tactic-style proofs, learning higher-level custom tactics is crucial for proof modularity and automation. This paper presents a novel approach to tactic discovery, which leverages Tactic Dependence Graphs (TDGs) to identify reusable proof strategies across multiple proofs. TDGs capture logical dependencies between tactic applications while abstracting away irrelevant syntactic details, allowing for both the discovery of new tactics and the refactoring of existing proofs into more modular forms. We have implemented this technique in a tool called TacMineR and compare it against an anti-unification-based approach (Peano) to tactic discovery. Our evaluation demonstrates that TacMineR can learn 3× as many tactics as Peano and reduces the size of proofs by 26% across all benchmarks. Furthermore, our evaluation demonstrates the benefits of learning custom tactics for proof automation, allowing a state-of-the-art proof automation tool to achieve a relative increase of 172% in terms of success rate.
Rust’s novel type system has proved an attractive target for verification and program analysis tools, due to the rich guarantees it provides for controlling aliasing and mutability. However, fully understanding, extracting and exploiting these guarantees is subtle and challenging: existing models for Rust’s type checking either support a smaller idealised language disconnected from real-world Rust code, or come with severe limitations in terms of precise modelling of Rust borrows, composite types storing them, function signatures and loops.
In this paper, we present Place Capability Graphs: a novel model of Rust’s type-checking results, which lifts these limitations, and which can be directly calculated from the Rust compiler’s own programmatic representations and analyses. We demonstrate that our model supports over 97% of Rust functions in the most popular public crates, and show its suitability as a general-purpose basis for verification and program analysis tools by developing promising new prototype versions of the existing Flowistry and Prusti tools.
Linearizability has become the de facto standard for specifying correctness of implementations of concurrent data structures. While formally verifying such implementations remains challenging, linearizability monitoring has emerged as a promising first step to rule out early problems in the development of custom implementations, and serves as a key component in approaches that stress test such implementations. In this work, we undertake an algorithmic investigation of the linearizability monitoring problem, which asks to check if an execution history obtained from a concurrent data structure implementation is linearizable.
While this problem is largely understood to be intractable in general, a systematic understanding of when it becomes tractable has remained elusive. We revisit this problem and first present a unified ‘decrease-and-conquer’ algorithmic framework for designing linearizability monitoring. At its heart, this framework asks to identify special linearizability-preserving values in a given history — values whose presence yields an equi-linearizable sub-history (obtained by removing operations of such values), and whose absence indicates non-linearizability. More importantly, we prove that a polynomial time algorithm for the problem of identifying linearizability-preserving values immediately yields a polynomial time algorithm for the linearizability monitoring problem, while conversely, intractability of the former implies intractability of monitoring.
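Schematically, the framework instantiates as follows (a sketch with a toy oracle; the data-type-specific, polynomial-time checks for finding preserving values are the paper’s actual contribution).

```python
# Decrease-and-conquer skeleton: repeatedly find a linearizability-
# preserving value, drop its operations, and recurse; if none exists
# in a nonempty history, report non-linearizability.

def monitor(history, find_preserving_value):
    """history: list of (value, operation) events. find_preserving_value
    is the data-type-specific oracle; it returns None when no value
    can be peeled off."""
    while history:
        v = find_preserving_value(history)
        if v is None:
            return False   # nothing can be peeled off: not linearizable
        history = [ev for ev in history if ev[0] != v]
    return True            # history emptied value by value: linearizable

# Toy oracle for illustration only: peel values in order of appearance.
events = [(1, "push"), (2, "push"), (2, "pop"), (1, "pop")]
print(monitor(events, lambda h: h[0][0]))  # True
```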
We demonstrate the effectiveness of our decrease-and-conquer framework by instantiating it for several popular concurrent data types — registers, sets, stacks, queues and priority queues — deriving polynomial time algorithms for them, under the (unambiguity) restriction that each insertion to the underlying data structure adds a distinct value. We further optimize these algorithms to achieve log-linear running time through the use of efficient data structures for amortizing the cost of solving induced sub-problems. Our implementation and evaluation on publicly available implementations of concurrent data structures show that our approach scales to very large histories and significantly outperforms existing state-of-the-art tools.
Debugging and monitoring programs are integral to engineering and deploying software. Dynamic analyses monitor applications through source code or IR injection, machine code or bytecode rewriting, virtual machine APIs, or direct hardware support. While these techniques are viable within their respective domains, common tooling across techniques is rare, leading to fragmentation of skills, duplicated efforts, and inconsistent feature support. We address this problem in the WebAssembly ecosystem with Whamm, an instrumentation framework for Wasm that uses engine-level probing and has a bytecode rewriting fallback to promote portability. Whamm solves three problems: 1) tooling fragmentation, 2) prohibitive instrumentation overhead of general-purpose frameworks, and 3) tedium of tailoring low-level high-performance mechanisms. Whamm provides fully-programmable instrumentation with declarative match rules, static and dynamic predication, automatic state reporting, and user library support, achieving high performance through compiler and engine optimizations. The Whamm engine API allows instrumentation to be provided to a Wasm engine as Wasm code, reusing existing engine optimizations and unlocking new ones, most notably intrinsification, to minimize overhead. A key insight of our work is that explicitly requesting program state in match rules, rather than reflection, enables the engine to efficiently bundle arguments and even inline compiled probe logic. Whamm streamlines the tooling effort, as its bytecode-rewriting target can run instrumented programs everywhere, lowering fragmentation and advancing the state of the art for engine support. We evaluate Whamm with case studies of non-trivial monitors and show it is expressive, powerful, and efficient.
Computing derivatives is paramount for multiple domains ranging from training neural networks to precise climate simulations. While derivatives can be generated by Automatic Differentiation (AD) tools, they often require aggressive optimization to avoid compromising program performance. One of the central optimizations consists of identifying inactive operations that do not contribute to the partial derivatives of interest.
Multiple tools provide activity analyses for a variety of input languages, though often with only informal correctness guarantees. This paper formally defines activity analysis for AD as an abstract interpretation, proves its soundness, and implements it within the MLIR compiler infrastructure. To account for MLIR’s genericity, a subset of MLIR’s internal representation amenable to AD is formalized for the first time. Furthermore, the paper proposes a sound intraprocedural approximation of the whole-program activity analysis via function summaries along with a mechanism to automatically derive these summaries from function definitions.
The implementation is evaluated on a differentiation-specific benchmark suite. It achieves a 1.24× geometric-mean speedup on CPU and a 1.7× geometric-mean speedup on GPU in the runtime of generated programs, compared to a baseline that does not use activity analysis. The evaluation also demonstrates that the intraprocedural analysis with function summaries proves inactive 100% of the instructions proven inactive by the whole-program analysis.
We present a comprehensive overview of the Datalog facilities in the Flix programming language. We show how programmers can write functions implemented as Datalog programs and we demonstrate how to build modular and reusable families of Datalog programs using first-class Datalog program values, rho abstraction, parametric polymorphism, and type classes. We describe several features that improve the ergonomics, flexibility, and expressive power of Datalog programming in Flix, including the inject and query program constructs, head and guard expressions, functional predicates, lattice semantics, and more.
We illustrate Datalog programming in Flix with several applications, including implementations of Ullman's algorithm to stratify Datalog programs, the Ford-Fulkerson algorithm for maximum flow, and the IFDS and IDE algorithms for context-sensitive program analysis. The implementations of IFDS and IDE fulfill a long-term goal: to have fully modular, polymorphic, typed, and declarative formulations of these algorithms that can be instantiated with any abstract domain.
A microservice-based application is composed of multiple self-contained components called microservices, and controlling inter-service communication is important for enforcing safety properties. Presently, inter-service communication is configured using microservice deployment tools. However, such tools only support a limited class of single-hop policies, which can be overly permissive because they ignore the rich service tree structure of microservice calls. Policies that can express the service tree structure can offer development and security teams more fine-grained control over communication patterns.
To this end, we design an expressive policy language to specify service tree structures, and we develop a visibly pushdown automata-based dynamic enforcement mechanism to enforce service tree policies. Our technique is non-invasive: it does not require any changes to service implementations, and does not require access to microservice code. To realize our method, we build a runtime monitor on top of a service mesh, an emerging network infrastructure layer that can control inter-service communication during deployment. In particular, we employ the programmable network traffic filtering capabilities of Istio, a popular service mesh implementation, to implement an online and distributed monitor. Our experiments show that our monitor can enforce rich safety properties while adding minimal latency overhead on the order of milliseconds.
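The stack discipline behind such enforcement can be sketched as follows (service names and the example policy are hypothetical; the actual monitor runs distributed inside Istio’s traffic filters): each inter-service call pushes, each response pops, and the policy is consulted against the full chain of active callers rather than only the immediate parent.

```python
# A minimal sketch of visibly-pushdown-style enforcement of a
# service-tree policy over inter-service calls.

def allowed(chain: tuple, callee: str) -> bool:
    # Example policy: "payments" may only be reached via gateway -> orders.
    if callee == "payments":
        return chain[-2:] == ("gateway", "orders")
    return True

stack: list[str] = []

def on_call(service: str) -> bool:
    if not allowed(tuple(stack), service):
        return False          # the monitor blocks the request
    stack.append(service)
    return True

def on_response() -> None:
    stack.pop()

assert on_call("gateway") and on_call("orders") and on_call("payments")
on_response(); on_response(); on_response()
assert on_call("gateway") and not on_call("payments")  # single hop: rejected
```

A single-hop policy, by contrast, could only inspect the top of the stack and would have to either permit both calls to payments or block both.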
An important correctness gap exists between formally verifiable distributed system designs and their implementations. Recently proposed work bridges this gap by automatically extracting, or compiling, an implementation from the formally-verified design. The runtime behavior of this compiled implementation, however, may deviate from its design. For example, the compiler may contain bugs, the design may make incorrect assumptions about the deployment environment, or the implementation might be misconfigured.
In this paper we develop TraceLink, a methodology to detect such deviations through trace validation. TraceLink maps traces, which capture an execution’s behavior, to the corresponding formal design. Unlike previous work on trace validation, our approach is completely automated.
We implement TraceLink for PGo, a compiler from Modular PlusCal to both TLA+ and Go. We present a formal semantics for interpreting execution traces as TLA+, along with a templatization strategy to minimize the size of the TLA+ tracing specification. We also present a novel trace path validation strategy, called sidestep, which detects bugs faster and with little additional overhead.
We evaluated TraceLink on several distributed systems, including an MPCal implementation of a Raft key-value store. Our evaluation demonstrates that TraceLink is able to find 9 previously undetected and diverse bugs in PGo's TCB, including a bug in the PGo compiler itself. We also show the effectiveness of the templatization approach and the sidestep path validation strategy.
Datalog engines for fixpoint evaluation have brought great benefits to static program analysis over the past decades. A Datalog specification of an analysis allows a declarative, easy-to-maintain specification, without sacrificing performance, and indeed often achieving significant speedups compared to hand-coded algorithms.
However, these benefits come with a certain loss of control. Datalog evaluation is bottom-up, meaning that all inferences (from a set of initial facts) are performed and all their conclusions are outputs of the computation. In practice, virtually every program analysis expressed in Datalog becomes unscalable for some inputs, due to the worst-case blowup of computing all results, even when a partial answer would have been perfectly satisfactory.
In this work, we present a simple, uniform, and elegant solution to the problem, with great practical effectiveness and application to virtually any Datalog-based analysis. The approach consists of leveraging the choice construct, supported natively in modern Datalog engines like Souffle. The choice construct allows the definition of functional dependencies in a relation and has been used in the past for expressing worklist algorithms. We show a near-universal construction that allows the choice construct to flexibly limit evaluation of predicates. The technique is applicable to practically any analysis architecture imaginable, since it adaptively prunes evaluation results when a (programmer-controlled) projection of a relation exceeds a desired cardinality.
We apply the technique to probably the largest pre-existing Datalog analysis frameworks in existence: Doop (for Java bytecode) and the main client analyses from the Gigahorse framework (for Ethereum smart contracts). Without needing to understand the existing analysis logic and with minimal, local-only changes, the performance of each framework increases dramatically, by over 20x for the hardest inputs, with near-negligible sacrifice in completeness.
We present AccelerQ, a framework for automatically tuning quantum eigensolver (QE) implementations (quantum programs implementing a specific QE algorithm) using machine learning and search-based optimisation. Rather than redesigning quantum algorithms or manually tweaking the code of an existing implementation, AccelerQ treats QE implementations as black-box programs and learns to optimise their hyperparameters to improve accuracy and efficiency. It incorporates search-based techniques and genetic algorithms (GA) alongside ML models to efficiently explore the hyperparameter space of QE implementations and avoid local minima.
Our approach leverages two ideas: 1) train on data from smaller, classically simulable systems, and 2) use program-specific ML models, exploiting the fact that local physical interactions in molecular systems persist across scales, supporting generalisation to larger systems. We present an empirical evaluation of AccelerQ on two fundamentally different QE implementations: ADAPT-QSCI and QCELS. For each, we trained a QE predictor model, a lightweight XGBoost Python regressor, using data extracted classically from systems of up to 16 qubits. We deployed the model to optimise hyperparameters for executions on larger systems of 20-, 24-, and 28-qubit Hamiltonians, where direct classical simulation becomes impractical. For ADAPT-QSCI, we observed a reduction in error from 5.48% to 5.3% with only the ML model and further to 5.05% with GA; for QCELS, error dropped from 7.5% to 6.5%, with no additional gain from GA. Given inconclusive results for some 20- and 24-qubit systems, we recommend further analysis of training data concerning Hamiltonian characteristics. Nonetheless, our results highlight the potential of ML and optimisation techniques for quantum programs and suggest promising directions for integrating software engineering methods into quantum software stacks.
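The search loop can be rendered schematically as follows (the scoring model is stubbed out; in the paper it is the trained XGBoost regressor, and the hyperparameter names here are illustrative): a genetic algorithm mutates candidate hyperparameter vectors and keeps those the surrogate model scores best.

```python
import random

# Model-guided genetic search over a toy hyperparameter space.
def surrogate_score(hp):                     # stand-in for the trained predictor
    return -((hp["shots"] - 800) ** 2) - ((hp["depth"] - 6) ** 2)

def mutate(hp):
    out = dict(hp)
    out["shots"] += random.choice([-100, 0, 100])
    out["depth"] += random.choice([-1, 0, 1])
    return out

population = [{"shots": random.randrange(100, 1600, 100),
               "depth": random.randint(1, 12)} for _ in range(20)]
for _ in range(50):                          # generations
    population.sort(key=surrogate_score, reverse=True)
    parents = population[:5]                 # elitist selection
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

print(max(population, key=surrogate_score))  # converges near shots=800, depth=6
```

Because the expensive QE execution is replaced by the learned surrogate during search, many candidates can be evaluated cheaply before any large-system run.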
Many imperative programming languages offer global variables to implement common functionality such as global caches and counters. Global variables are typically initialized by module initializers (e.g., static initializers in Java), code blocks that are executed automatically by the runtime system. When or in what order these initializers run is typically not known statically and modularly. For instance in Java, initialization is triggered dynamically upon the first use of a class, while in Go, the order depends on all packages of a program. As a result, reasoning modularly about global variables and their initialization is difficult, especially because module initializers may perform arbitrary side effects and may have cyclic dependencies. Consequently, existing modular verification techniques either do not support global state or impose drastic restrictions that are not satisfied by mainstream languages and programs.
In this paper, we present the first practical verification technique to reason formally and modularly about global state and its initialization. Our technique is based on separation logic and uses module invariants to specify ownership and values of global variables. A partial order on modules and methods allows us to reason modularly about when a module invariant may be soundly assumed to hold, irrespective of when exactly the module initializer establishing it runs. Our technique supports both thread-local and shared global state. We formalize it as a program logic in Iris and prove its soundness in Rocq. We make only minimal assumptions about the initialization semantics, making our technique applicable to a wide range of programming languages. We implemented our technique in existing verifiers for Java and Go and demonstrate its effectiveness on typical use cases of global state as well as a substantial codebase implementing an Internet router.
Virtual memory management (VMM) code is a critical piece of general-purpose OS kernels, but verification of this functionality is challenging due to the complexity of the hardware interface (the page tables are updated via writes to those memory locations, using addresses which are themselves virtualized). Prior work on verification of VMM code has either only handled a single address space, or trusted significant pieces of assembly code.
In this paper, we introduce a modal abstraction to describe the truth of assertions relative to a specific virtual address space: [r]P indicating that P holds in the virtual address space rooted at r. Such modal assertions allow different address spaces to refer to each other, enabling complete verification of instruction sequences manipulating multiple address spaces. Using them effectively requires working with other assertions, such as points-to assertions about memory contents — which implicitly depend on the address space they are used in. We therefore define virtual points-to assertions to definitionally mimic hardware address translation, relative to a page table root. We demonstrate our approach on challenging fragments of VMM code, showing that it handles examples beyond what prior work can address, including reasoning about a sequence of instructions as it changes address spaces. Our results are formalized for a RISC-like fragment of x86-64 assembly in Rocq.
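Schematically (notation ours, not the paper’s exact definitions; translate stands in for the hardware page-table walk from root r), a virtual points-to assertion can be read as follows.

```latex
% Schematic reading of a virtual points-to assertion under the modality
% [r], where translate abstracts the hardware page-table walk from root r:
% virtual address va points to v iff it translates to a physical address
% pa that (physically) points to v.
[r]\,(va \hookrightarrow v) \;\triangleq\; \exists\, pa.\;
    \mathrm{translate}(r, va) = pa \,\wedge\, pa \mapsto v
```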
Quantum networks have capabilities that are impossible to achieve using only classical information. They connect quantum capable nodes, with their fundamental unit of communication being the Bell pair, a pair of entangled quantum bits. Due to the nature of quantum phenomena, Bell pairs are fragile and difficult to transmit over long distances, thus requiring a network of repeaters along with dedicated hardware and software to ensure the desired results. The intrinsic challenges associated with quantum networks, such as competition over shared resources and high probabilities of failure, require quantitative reasoning about quantum network protocols. This paper develops PBKAT, an expressive language for specification, verification and optimization of quantum network protocols for Bell pair distribution. Our language is equipped with primitives for expressing probabilistic and possibilistic behaviors, and with semantics modeling protocol executions. We establish the properties of PBKAT’s semantics, which we use for quantitative analysis of protocol behavior. We further implement a tool to automate PBKAT’s usage, which we evaluated on real-world protocols drawn from the literature. Our results indicate that PBKAT is well suited for both expressing real-world quantum network protocols and reasoning about their quantitative properties.
Debugging non-deterministic programs on microcontrollers is notoriously challenging, especially when bugs manifest in unpredictable, input-dependent execution paths. A recent approach, called multiverse debugging, makes it easier to debug non-deterministic programs by allowing programmers to explore all potential execution paths. Current multiverse debuggers enable both forward and backward traversal of program paths, and some facilitate jumping to any previously visited states, potentially branching into alternative execution paths within the state space.
Unfortunately, debugging programs that involve input/output operations using existing multiverse debuggers can reveal inaccessible program states, i.e. states which are not encountered during regular execution. This can significantly hinder the debugging process, as the programmer may spend substantial time exploring and examining inaccessible program states, or worse, may mistakenly assume a bug is present in the code, when in fact, the issue is caused by the debugger.
This paper presents a novel approach to multiverse debugging, which can accommodate a broad spectrum of input/output operations. We provide the semantics of our approach and prove the correctness of our debugger, ensuring that, despite supporting a wide range of input/output operations, the debugger only explores those program states which can be reached during regular execution.
We have developed a prototype, called MIO, leveraging the WARDuino WebAssembly virtual machine to demonstrate the feasibility and efficiency of our techniques. As a demonstration of the approach, we highlight a color dial built with a Lego Mindstorms motor and color sensor, providing a tangible example of how our approach enables multiverse debugging for programs running on an STM32 microcontroller.
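The correctness property, exploring only accessible states, can be pictured with a toy exploration loop; none of the names below are MIO's API. The point is that branching happens over the inputs the device can actually produce in a given state, never over the input's whole type.

```python
# Toy illustration of the property the debugger guarantees: when
# exploring the multiverse, branch only over input values the device
# can actually report, so every explored state is accessible.

def explore(state, step, sensor_domain, depth):
    """Enumerate reachable states, branching at each sensor read over
    the values the real sensor can report in `state`, rather than
    over all possible integers."""
    if depth == 0:
        return {state}
    reached = {state}
    for value in sensor_domain(state):        # accessible inputs only
        reached |= explore(step(state, value), step, sensor_domain, depth - 1)
    return reached

# Example: a color sensor that can only report one of three colors.
step = lambda s, v: s + (v,)
domain = lambda s: ("red", "green", "blue")
print(len(explore((), step, domain, 2)))      # 13 states, all accessible
```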
Numerical code is often executed repetitively and on hardware with limited resources, which makes it a perfect target for optimizations. One of the most effective ways to boost performance—especially in terms of runtime—is by reducing the precision of computations. However, low precision can introduce significant rounding errors, potentially compromising the correctness of results. Mixed-precision tuning addresses this trade-off by assigning the lowest possible precision to a subset of variables and arithmetic operations in the program while ensuring that the overall error remains within acceptable bounds. State-of-the-art tools validate the accuracy of optimized programs using either sound static analysis or dynamic sampling. Sound methods are often considered safer but overly conservative, while dynamic methods are more aggressive and potentially more effective; the question remains: how do these approaches compare in practice? In this paper, we present the first comprehensive evaluation of existing mixed-precision tuning tools for floating-point programs, offering a quantitative comparison between sound static and (unsound) dynamic approaches. We measure the trade-offs between performance gains, use of the optimization potential, and soundness guarantees on accuracy, which we refer to as the cost of soundness. Our experiments on the standard FPBench benchmark suite challenge the common belief that dynamic optimizers consistently generate faster programs. In fact, for small straight-line numerical programs, we find that sound tools enhanced with regime inference match or outperform dynamic ones, while providing formal correctness guarantees, albeit at the cost of increased optimization time. Standalone sound tools, however, are overly conservative, especially when accuracy constraints are tight; whereas dynamic tools are consistently effective for different targets, but exceed the maximum allowed error by up to 9 orders of magnitude.
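As a minimal, self-contained illustration of the trade-off these tools navigate (not of any particular tool), the following snippet evaluates the same polynomial in float32 and float64 and reports the rounding error a tuner would have to keep below the user's bound. The function and inputs are arbitrary examples.

```python
# The precision/accuracy trade-off in miniature: the same polynomial
# evaluated in float32 vs float64.
import numpy as np

def horner(x, coeffs, dtype):
    x = dtype(x)
    acc = dtype(0.0)
    for c in coeffs:                 # Horner evaluation in `dtype`
        acc = acc * x + dtype(c)
    return acc

coeffs = [1.0, -6.0, 11.0, -6.0]     # (x-1)(x-2)(x-3)
x = 3.0000001                        # near a root: cancellation-prone
lo = horner(x, coeffs, np.float32)   # cheap, but rounds aggressively
hi = horner(x, coeffs, np.float64)   # reference precision
print(f"float32={lo!r} float64={hi!r} error={abs(float(lo) - float(hi)):.2e}")
```

A mixed-precision tuner would keep the low-precision assignment only if the induced error stays within the accuracy bound, established either soundly (static analysis) or by sampling (dynamic).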
Verifying a real-world program’s functional correctness can be decomposed into (1) a refinement proof showing that the program implements a more abstract high-level program and (2) an algorithm correctness proof at the high level. Relational Hoare logic serves as a powerful tool to establish refinement but often necessitates formalization beyond standard Hoare logic. In the nondeterministic setting in particular, ∀∃ relational Hoare logic is required. Existing approaches encode this logic into a Hoare logic with ghost states and invariants, yet these extensions significantly increase formalization complexity and soundness proof overhead. This paper proposes a generic encoding theory that reduces ∀∃ relational Hoare logic to standard (unary) Hoare logic. Specifically, we propose to redefine the validity of relational Hoare triples while preserving the original proof rules and then encapsulate the ∀∃ pattern within assertions. We prove that the validity of encoded standard Hoare triples is equivalent to the validity of the desired relational Hoare triples. Moreover, the encoding theory demonstrates how common relational Hoare logic proof rules are special cases of standard Hoare logic proof rules, and how relational proof steps correspond to standard proof steps. Our theory enables standard Hoare logic to prove ∀∃ relational properties by defining a predicate Exec, without requiring modifications to the logic framework or re-verification of soundness.
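The shape of such an encoding can be sketched as follows; the notation below is ours, chosen for illustration, and need not match the paper's definitions.

```latex
% Sketch: the forall-exists relational triple, and an Exec predicate
% that folds the existential over the abstract program into a unary
% postcondition (illustrative notation).
\[
\models \{P\}\; c_1 \sim c_2\; \{Q\}
  \;\triangleq\;
  \forall \sigma_1\, \sigma_2\, \sigma_1'.\;
    (\sigma_1,\sigma_2) \in P \;\land\; (c_1,\sigma_1) \Downarrow \sigma_1'
    \implies
    \exists \sigma_2'.\; (c_2,\sigma_2) \Downarrow \sigma_2'
      \;\land\; (\sigma_1',\sigma_2') \in Q
\]
\[
\mathrm{Exec}(c_2,\sigma_2,R)
  \;\triangleq\;
  \exists \sigma_2'.\; (c_2,\sigma_2) \Downarrow \sigma_2' \;\land\; \sigma_2' \in R
\]
```

With Exec available inside assertions, the abstract program's pending execution can be carried through an ordinary unary proof, which is how the relational triple reduces to a standard one.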
Garbage collection (GC) implementations must meet efficiency and maintainability requirements, which are often perceived to be at odds. Moreover, the pursuit of efficiency typically sacrifices agility, undermining rapid development and innovation, with unintended consequences for longer-term performance aspirations. Prior GC implementations struggle both to i) maximize efficiency, parallelism, and hardware utilization, and to ii) implement optimizations and scheduling constraints correctly and elegantly. This struggle is reflected in today’s implementations, which tend to be monolithic and depend on coarse phase-based synchronization.
This paper presents a new design for GC implementations that emphasizes both agility and efficiency. The design simplifies and unifies all GC tasks into work packets which define: i) work items, ii) kernels that process them, and iii) scheduling constraints. Our simple insights are that execution is dominated by a few very small, heavily executed kernels, and that GC implementations are high-level algorithms that orchestrate vast numbers of performance-critical work items. Work packets comprise groups of like work items, such as the scanning of a thread’s stack or the tracing of a single object in a multi-million object heap. The kernel attached to a packet specifies how to process items within the packet, such as how to scan a stack, or how to trace an object. The scheduling constraints express dependencies, e.g. all mutators must stop before copying any objects. Fully parallel activities, such as scanning roots and performing a transitive closure, proceed with little synchronization. The implementation of a GC algorithm reduces to declaring required work packets, their kernels, and dependencies. The execution model operates independently of the GC algorithm and work packet type. We broaden the scope of work-stealing, applying it to any type of GC work, and introduce a novel two-tier work-stealing algorithm to further optimize parallelism at fine granularity.
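The following Python sketch shows the flavor of the abstraction; the names and the toy scheduler are invented for illustration and are not MMTk's API. Packets bundle like items with a kernel and dependencies, and a scheduler drains whichever packets are ready.

```python
# Illustrative work-packet abstraction: items + kernel + dependencies,
# drained by a scheduler that runs packets whose deps are satisfied.
class WorkPacket:
    def __init__(self, name, items, kernel, deps=()):
        self.name, self.items = name, items
        self.kernel = kernel          # how to process one item
        self.deps = set(deps)         # scheduling constraints
        self.done = False

    def run(self):
        for item in self.items:
            self.kernel(item)
        self.done = True

def schedule(packets):
    """A parallel collector would hand ready packets to worker threads
    (with work-stealing); here we just run them sequentially."""
    pending = list(packets)
    while pending:
        ready = [p for p in pending if all(d.done for d in p.deps)]
        assert ready, "dependency cycle"
        for p in ready:
            p.run()
            pending.remove(p)

stop = WorkPacket("stop-mutators", ["t1", "t2"], lambda t: print("stop", t))
scan = WorkPacket("scan-stacks", ["t1", "t2"], lambda t: print("scan", t), deps=[stop])
copy = WorkPacket("copy-objects", ["o1", "o2"], lambda o: print("copy", o), deps=[stop])
schedule([stop, scan, copy])          # 'stop' runs before any scan or copy
```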
We show the software engineering benefits of this design via eight collectors that use 23 common work packet types in the MMTk GC framework. We use the LXR collector to show that the work packet abstraction supports innovation and high performance: i) comparing versions of LXR, work packets deliver performance benefits over a phase-based approach, and ii) LXR with work packets outperforms the highly tuned, state-of-the-art G1 garbage collector in the latest OpenJDK (24). We thus demonstrate that work packets achieve high performance while simplifying GC implementation, making collectors inherently easier to optimize and verify.
We present PRISM, a novel technique for detecting overfitting patches in automatic program repair (APR). Despite significant advances in APR, overfitting patches—those passing test suites but not fixing bugs—persist, degrading performance and increasing developer review burden. To mitigate overfitting, various automatic patch correctness classification (APCC) techniques have been proposed. However, while accurate, existing APCC methods often mislabel scarce correct patches as incorrect, significantly lowering the APR fix rate. To address this, we propose (1) novel semantic features capturing patch-induced behavioral changes and (2) a tailored learning algorithm that preserves correct patches while filtering incorrect ones. Experiments on ranked patch data from 10 APR tools show that PRISM uniquely reduces review burden and finds more correct patches. Other methods lower the fix rate by misclassifying correct patches. Evaluations on 1,829 labeled patches confirm that PRISM removes more incorrect patches at equal correct-patch preservation rates.
Pattern matching is a powerful mechanism for writing safe and expressive conditional logic. Once primarily associated with functional programming, it has become a common paradigm even in non-functional languages, such as Java. Languages that support pattern matching provide dedicated analyzers, known as pattern-match coverage analyzers, to ensure its correct and efficient use by statically verifying properties such as exhaustiveness and redundancy. However, these analyzers can suffer from soundness and completeness issues, leading to false negatives (unsafe patterns mistakenly accepted) or false positives (valid patterns incorrectly rejected).
In this work, we present a systematic approach for validating soundness and completeness in pattern-match coverage analyzers. The approach consists of a novel generator for algebraic data types and pattern-matching statements, supporting features that increase the complexity of coverage analysis, such as generalized algebraic data types. To establish the test oracle without building a reference implementation from scratch, the approach generates both exhaustive and inexhaustive pattern-matching cases, either by construction or by encoding them as SMT formulas. The latter leads to a universal test oracle that cross-checks coverage analysis results against a constraint solver, exposing soundness and completeness bugs in case of inconsistencies. We implement this approach in Ikaros, which we evaluate on three major compilers: Scala, Java, and Haskell. Despite pattern-match coverage analyzers being only a small part of these compilers, Ikaros has uncovered 16 bugs, of which 12 have been fixed. Notably, 7 instances were important soundness bugs that could lead to unexpected runtime errors. Additionally, Ikaros provides a scalable framework that can be extended to any language with ML-like pattern matching.
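To give a concrete sense of the oracle idea, here is a simplified Python stand-in: for a small finite ADT, exhaustiveness is decidable by brute-force enumeration, so an enumeration oracle can be cross-checked against an analyzer's verdict. Ikaros instead generates cases by construction or encodes them as SMT formulas, which scales to features such as GADTs where naive enumeration does not; everything below is invented for illustration.

```python
# Enumeration oracle for exhaustiveness over a tiny finite ADT.
from itertools import product

CONSTRUCTORS = {"Color": ["Red", "Green", "Blue"]}

def values(ty):
    return CONSTRUCTORS[ty]

def matches(pattern, value):
    return pattern == "_" or pattern == value     # "_" is a wildcard

def exhaustive(clauses, tys):
    """A match over a tuple of types is exhaustive iff every value
    tuple is matched by some clause."""
    return all(any(all(matches(p, v) for p, v in zip(clause, vs))
                   for clause in clauses)
               for vs in product(*(values(t) for t in tys)))

clauses = [("Red", "_"), ("_", "Green")]           # match on (Color, Color)
print(exhaustive(clauses, ["Color", "Color"]))     # False: e.g. (Green, Blue)
# Any disagreement between such an oracle and a compiler's coverage
# analyzer signals a soundness or completeness bug.
```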
Local reasoning about programs that combine aliasing and mutable state is a longstanding challenge. Existing approaches – ownership systems, linear and affine types, uniqueness types, and lexical effect tracking – impose global restrictions such as uniqueness or linearity, or rely on shallow syntactic analyses. These designs fall short with higher-order functions and shared mutable state. Reachability Types (RT) track aliasing and separation in higher-order programs, ensuring runtime safety and non-interference. However, RT systems face three key limitations: (1) they prohibit cyclic references, ruling out non-terminating computations and fixed-point combinators; (2) they require deep tracking, where a qualifier must include all transitively reachable locations, reducing precision and hindering optimizations like fine-grained parallelism; and (3) referent qualifier invariance prevents referents from escaping their allocation contexts, making reference factories inexpressible.
In this work, we address these limitations by extending RT with three mechanisms that enhance expressiveness. First, we introduce cyclic references, enabling recursive patterns to be encoded directly through the store. Second, we adopt shallow qualifier tracking, decoupling references from their transitively reachable values. Finally, we introduce an escaping rule with reference subtyping, allowing referent qualifiers to outlive their allocation contexts. These extensions are formalized in the F<:∘-calculus with a mechanized proof of type soundness, and case studies illustrate expressiveness through fixpoint combinators, non-interfering parallelism, and escaping read-only references.
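The first extension can be illustrated with Landin's knot, the classic idiom of recursion through the store that systems forbidding cyclic references must reject. The Python below is only an untyped analogy for the typed construction in the calculus.

```python
# Landin's knot: recursion encoded directly through the store.
cell = [None]                      # a one-slot mutable reference

def factorial(n):
    # No syntactic self-reference: the recursive call goes through
    # the store, creating a cycle between the cell and the closure.
    return 1 if n == 0 else n * cell[0](n - 1)

cell[0] = factorial                # tie the knot
print(cell[0](5))                  # 120
```

A reachability-type system must let `cell`'s qualifier mention a closure that in turn reaches `cell`, which is precisely the cyclic-reference capability the extension adds.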
Session types provide a formal framework to enforce rich communication protocols, ensuring correctness properties such as type safety and deadlock freedom. However, the traditional API of functional session type systems with first-class channels often leads to problems with modularity and composability.
This paper proposes a new, alternative session type API based on borrowing, embodied in the core calculus BGV. The borrowing-based API enables building modular and composable code for session type clients without imposing clutter or undue limitations. Its basis is a novel type system, founded on ordered linear typing, for functional session types with an explicit operation for splitting ownership of channels. We give the semantics of BGV via a type-preserving translation to PGV, a deadlock-free functional session type calculus, and use this translation to establish type safety and deadlock freedom for BGV.
We also present an external version of BGV that supports the use of borrow notation. We develop an algorithmic version of the type system that includes a mechanized, verified translation from the external language to BGV, establishing decidable type checking.
Generative AI has shown its value for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLMs to automatically generate correctness proofs for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proofs for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.
In modern cloud applications, data serialization and deserialization (DSD) are common yet expensive operations. Languages such as Java incur significant overhead from DSD operations because of their managed memory, which uses an IO-incompatible, sparse, platform-dependent data layout for storing objects. Can language VMs bypass costly DSD operations by directly operating on an IO-ready binary format? In this paper, we address this question by presenting a new DSD library for Java called HeapBuffers. HeapBuffers maintains a Java-friendly programming model through managed memory, yet it operates on data stored using a dense, platform-independent memory format. In this way, HeapBuffers data can be used to perform IO operations without the need to convert objects from/to a VM-internal memory representation, thus eliminating the need for expensive DSD operations. Compiler optimizations combined with specialized garbage collection ensure that accessing HeapBuffers data is nearly as efficient as handling regular Java objects. The answer to our question is largely positive: our experiments show that HeapBuffers data can be accessed at high performance and can be used as is to perform IO, making the cost of DSD effectively negligible. IO performance of applications using HeapBuffers is often orders of magnitude faster than that obtained using the prevailing DSD libraries.
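The core idea can be sketched in a few lines of Python (purely illustrative; HeapBuffers is a Java library and its actual API differs): object fields live in a dense, platform-independent byte buffer, accessors read them in place, and IO writes the bytes as-is.

```python
# Object views over a dense, IO-ready buffer: no (de)serialization step.
import struct

POINT = struct.Struct("<ii")       # two little-endian 32-bit ints

class Point:
    """An object-like view over a region of a shared byte buffer."""
    def __init__(self, buf, offset):
        self.buf, self.offset = buf, offset

    @property
    def x(self):
        return POINT.unpack_from(self.buf, self.offset)[0]

    def set(self, x, y):
        POINT.pack_into(self.buf, self.offset, x, y)

buf = bytearray(POINT.size)
p = Point(buf, 0)
p.set(3, 4)
print(p.x)                # field access straight from the buffer
with open("point.bin", "wb") as f:
    f.write(buf)          # IO needs no conversion: the bytes are IO-ready
```

HeapBuffers adds what this sketch cannot: compiler optimizations and GC support that make such in-place accesses nearly as fast as regular object field accesses.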
Version space algebra (VSA) is an effective data structure for representing sets of programs and has been extensively used in program synthesis. Despite this success, a crucial shortcoming of VSA-based synthesis is its inefficiency when processing many examples. Given a set of IO examples, a typical VSA-based synthesizer runs by first constructing an individual VSA for each example and then iteratively intersecting these VSAs one by one. However, the intersection of two VSAs can be much larger than the original ones; this effect accumulates during the iteration, making the size of intermediate VSAs quickly explode.
In this paper, we aim to reduce the cost of intersecting VSAs in synthesis. We investigate the process of the iterative intersection and observe that, although this process may construct some huge intermediate VSAs, its final VSA is usually small in practice because only a few programs can pass all examples. Utilizing this observation, we propose the approach of multi-way intersection, which directly intersects multiple small VSAs into the final result, thus avoiding the previous bottleneck of constructing huge intermediate VSAs. Furthermore, since the previous intersection algorithm is inefficient for multiple VSAs, we design a novel algorithm to avoid most unnecessary VSA nodes. We integrated our approach into two state-of-the-art VSA-based synthesizers: a general synthesizer based on VSA and a specialized one for the string domain, Blaze. We evaluate them over 4 datasets comprising 994 synthesis tasks; the results show that our approach can significantly improve the performance of VSA-based synthesis, with up to 105 more tasks solved and a 7.36x speedup.
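The contrast can be seen on a drastically simplified model in which a VSA is just a union of leaf program-sets, one VSA per example (real VSAs also have operator join nodes and shared subgraphs, both elided here, along with the paper's node-avoidance algorithm). Iterated pairwise intersection materializes every intermediate VSA; multi-way intersection goes straight to the final result, pruning a combination as soon as it turns empty.

```python
# Toy contrast: iterated pairwise vs multi-way VSA intersection.
from itertools import product

def pairwise(va, vb):
    return [a & b for a, b in product(va, vb) if a & b]

def iterated(vsas):
    acc, peak = vsas[0], len(vsas[0])
    for v in vsas[1:]:
        acc = pairwise(acc, v)
        peak = max(peak, len(acc))      # size of stored intermediate VSA
    return acc, peak

def multiway(vsas):
    out = []
    for combo in product(*vsas):
        s = combo[0]
        for leaf in combo[1:]:
            s = s & leaf
            if not s:
                break                    # dead combination: nothing stored
        else:
            out.append(s)
    return out

# Programs 0..29; each example keeps the programs with some residue,
# and the last example pins the answer down to a single program.
U = range(30)
vsas = [[{x for x in U if x % p == r} for r in range(p)] for p in (2, 3, 5)]
vsas.append([{0}])
final, peak = iterated(vsas)
print(len(final), peak, len(multiway(vsas)))   # 1 30 1: big intermediate, tiny result
```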
Exception handlers—and effect handlers more generally—are language mechanisms for structured nonlocal control flow. A recent trend in language-design research has introduced lexically scoped handlers, which address a modularity problem with dynamic scoping. While dynamically scoped handlers allow zero-overhead implementations when no effects are raised, existing implementations of lexically scoped handlers require programs to pay a cost just for having handlers in the lexical context.
In this paper, we present a novel approach to implementing lexically scoped handlers of exceptional effects. It satisfies the zero-overhead principle—a property otherwise met by most modern compilers supporting dynamically scoped exception handlers. The key idea is a type-directed translation that emits information indicating how handlers come into the lexical context. This information guides the runtime in walking the stack to locate the right handler. Crucially, no reified lexical identifiers of handlers are needed, and mainline code is not slowed down by the presence of handlers in the program text.
We formalize the essential aspects of this compilation scheme and prove it correct. We integrate our approach into the Lexa language, allowing the compilation strategy to be customized for each declared effect based on its expected invocation rate. Empirical results suggest that the new Lexa compiler reduces run-time overhead in low-effect or no-effect scenarios while preserving competitive performance for effect-heavy workloads.
Behavior trees (BTs) are widely adopted in the field of agent control, particularly in robotics, due to their modularity and reactivity. However, constructing a BT that meets the desired expectations is time-consuming and challenging, especially for non-experts. This paper presents BtBot, a multi-modal sketch-based behavior tree synthesis technique. Given a natural language task description and a set of positive and negative examples, BtBot automatically generates a BT program that aligns with the natural language description and meets the requirements of the examples. Inside BtBot, an LLM is employed to understand the task’s natural language description and generate a sketch of the task execution. Then, BtBot searches the sketch to synthesize a candidate BT program consistent with the user-provided positive and negative examples. When the sketch is proven to be incapable of generating the target BT, BtBot provides a multi-step repairing method that modifies the control nodes and structure of the sketch to search for the desired BT. We have implemented a prototype of BtBot and evaluated it on a benchmark of 70 tasks across multiple scenarios. The experimental results indicate that BtBot outperforms existing BT synthesis techniques in effectiveness and efficiency. In addition, two user studies have been conducted to demonstrate the usefulness of BtBot.
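For readers unfamiliar with BTs, the following minimal interpreter shows the kind of object being synthesized: standard Sequence/Selector control nodes over condition and action leaves, checked against example behaviors. This is textbook BT semantics, not BtBot's internals, and the scenario is invented.

```python
# Minimal behavior-tree interpreter over a shared blackboard `bb`.
def sequence(*kids):            # succeeds iff all children succeed
    return lambda bb: all(k(bb) for k in kids)

def selector(*kids):            # succeeds iff some child succeeds
    return lambda bb: any(k(bb) for k in kids)

def cond(key):                  # leaf: test the blackboard
    return lambda bb: bb.get(key, False)

def act(key):                   # leaf: perform an action, record it
    return lambda bb: bb.setdefault("done", []).append(key) or True

# "Turn toward red if the sensor sees red, otherwise idle."
tree = selector(sequence(cond("sees_red"), act("turn")), act("idle"))

for bb, expected in [({"sees_red": True}, ["turn"]),
                     ({"sees_red": False}, ["idle"])]:
    tree(bb)
    assert bb["done"] == expected   # the examples a synthesizer must satisfy
```

A synthesizer like BtBot searches over such control-node structures until every positive example succeeds and every negative example fails.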
Rust is a non-Garbage Collected (GCed) language, but the lack of GC makes expressing data structures that require shared ownership awkward, inefficient, or both. In this paper we explore a new design for, and implementation of, GC in Rust, called Alloy. Unlike previous approaches to GC in Rust, Alloy allows existing Rust destructors to be automatically used as GC finalizers: this makes Alloy integrate better with existing Rust code than previous solutions, but it introduces surprising soundness and performance problems. Alloy provides novel solutions to these problems: finalizer safety analysis rejects unsound destructors from automatically being reused as finalizers; finalizer elision optimises away unnecessary finalizers; and premature finalizer prevention ensures that finalizers are only run when it is provably safe to do so.
Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications, from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results. To assess the accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs. Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that, at the basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that Async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth through the slowing down of specific basic blocks and found large differences in accuracy between the profilers. We believe our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.
Auto-active verifiers like Dafny aim to make formal methods accessible to non-expert users through SMT automation. However, despite the automation and other programmer-friendly features, they remain sparsely used in real-world software development, due to the significant effort required to apply them in practice. We interviewed 14 experienced Dafny users about their experiences using it in large-scale projects. We apply grounded theory to analyze the interviews, systematically identifying how auto-active verification impacts software development and opportunities to simplify its use and hence expand the adoption of verification in software development.
Code editors provide essential services that help developers understand, navigate, and modify programs. However, these services often fail in the presence of syntax errors. Existing syntax error recovery techniques, like panic mode and multi-option repairs, are either too coarse, e.g. in deleting large swathes of code, or lead to a proliferation of possible completions. This paper introduces tall tylr, an error-handling parser and editor generator that completes malformed code with syntactic obligations that abstract over many possible completions. These obligations generalize the familiar notion of holes in structure editors to cover missing operands, operators, delimiters, and sort transitions.
tall tylr is backed by a novel theory of tile-based parsing, conceptually organized around a molder that turns tokens into tiles and a melder that completes and parses tiles into terms using an error-handling generalization of operator-precedence parsing. We formalize melding as a parsing calculus, meldr, that completes input tiles with additional obligations such that the input can be parsed into a well-formed term, with success guaranteed over all inputs. We further describe how tall tylr implements molding and completion-ranking using the principle of minimizing obligations.
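A toy version of the completion idea (far simpler than meldr, which also handles delimiters, sorts, and precedence-based melding) can be written in a few lines: scan an infix token stream and insert an obligation wherever an operand or operator is missing, so that every input completes to a well-formed term.

```python
# Insert hole obligations so any infix token stream becomes parseable:
# '?' stands for a missing operand, '@' for a missing operator.
def complete(tokens):
    out, need_operand = [], True
    for t in tokens:
        is_op = t in "+-*/"
        if need_operand and is_op:
            out.append("?")          # obligation: operand missing here
        elif not need_operand and not is_op:
            out.append("@")          # obligation: operator missing here
        out.append(t)
        need_operand = is_op
    if need_operand:
        out.append("?")              # dangling operator at the end
    return out

print(complete(list("1+*2")))   # ['1', '+', '?', '*', '2']
print(complete(list("12+")))    # ['1', '@', '2', '+', '?']
```

The obligations generalize structure-editor holes: instead of committing to one repair or enumerating many, the editor records exactly what is owed and where.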
Obligations offer a useful way to scaffold internal program representations, but in tall tylr we go further to investigate the potential of materializing these obligations visually to the programmer. We conduct a user study to evaluate the extent to which an editor like tall tylr that materializes syntactic obligations might be usable and useful, finding both points of positivity and interesting new avenues for future work.
Runtime verification (RV) monitors program executions for conformance with formal specifications (specs). This paper concerns Monitoring-Oriented Programming (MOP), the only RV approach shown to scale to thousands of open-source GitHub projects when simultaneously monitoring passing unit tests against dozens of specs. Explicitly storing traces—sequences of spec-related program events—can make it easier to debug spec violations or to monitor tests against hyperproperties, which requires reasoning about sets of traces. But most online MOP algorithms are implicit-trace, i.e. they work event by event to avoid the time and space costs of storing traces. Yet TraceMOP, the only explicit-trace online MOP algorithm, is often too slow and frequently fails.
We propose LazyMOP, a faster explicit-trace online MOP algorithm for RV of tests that is enabled by three simple optimizations. First, whereas all existing online MOP algorithms eagerly monitor all events as they occur, LazyMOP lazily stores only unique traces at runtime and monitors them just before the test run ends. Lazy monitoring is inspired by a recent finding: 99.87% of traces during RV of tests are duplicates. Second, to speed up trace storage, LazyMOP encodes events and their locations as integers, and amortizes the cost of looking up locations across events. Lastly, LazyMOP only synchronizes accesses to its trace store after detecting multi-threading, unlike TraceMOP’s eager and wasteful synchronization of all accesses.
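The first two optimizations are easy to picture in a sketch (illustrative only; LazyMOP itself targets Java, and all names below are invented): events are interned as integers, whole traces are deduplicated in a set, and the monitor runs once per unique trace just before the run ends.

```python
# Lazy, deduplicated trace storage with end-of-run monitoring.
class LazyTraces:
    def __init__(self):
        self.event_ids = {}        # event name -> small integer
        self.traces = set()        # unique traces only
        self.current = []

    def record(self, event):
        eid = self.event_ids.setdefault(event, len(self.event_ids))
        self.current.append(eid)   # store the integer, not the name

    def end_trace(self):
        self.traces.add(tuple(self.current))
        self.current = []

    def monitor_all(self, ok):
        """Run once, just before the test run ends; return violations."""
        return [t for t in self.traces if not ok(t)]

rt = LazyTraces()
for _ in range(10_000):            # 10,000 duplicate traces...
    rt.record("hasNext"); rt.record("next"); rt.record("next")
    rt.end_trace()
ok = lambda t: t.count(rt.event_ids["next"]) <= t.count(rt.event_ids["hasNext"])
print(len(rt.traces), rt.monitor_all(ok))   # ...but only 1 unique trace monitored
```

Since nearly all traces are duplicates in practice, the monitor runs a tiny fraction of the times an eager, event-by-event algorithm would.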
On 179 Java open-source projects, LazyMOP is up to 4.9x faster and uses 4.8x less memory than TraceMOP, finding the same traces (modulo test non-determinism) and violations. We show LazyMOP’s usefulness in the context of software evolution, where tests are re-run after each code change. LazyMOPe optimizes LazyMOP in this context by generating fewer duplicate traces. Using unique traces from one code version, LazyMOPe finds all pairs of method 𝑚 and spec 𝑠 where all traces for 𝑠 in 𝑚 are identical. Then, in a future version, LazyMOPe generates and monitors only one trace of 𝑠 in 𝑚. LazyMOPe is up to 3.9x faster than LazyMOP, and it accelerates two recent techniques for speeding up RV during evolution by up to 4.6x with no loss in violations.
Applying higher-order model checking techniques to programs that use effect handlers is a major challenge, given the recent undecidability result obtained by Dal Lago and Ghyselen. This challenge has been addressed by using answer-type modification, a monomorphic version of which recovers decidability. However, the absence of polymorphism leads to a loss of modularity, reusability, and even expressivity. In this work, we study the problem of defining a calculus that, on the one hand, supports answer-type polymorphism and subtyping but, on the other hand, ensures that the underlying model checking problem remains decidable. The solution proposed in this paper is based on the introduction of the polymorphic answer type □, whose role is to provide a good compromise between expressiveness and decidability, the latter demonstrated through the construction of a selective type-directed CPS transformation targeting a calculus without effect handlers or any form of polymorphism. Notably, the introduced calculus HEPCF□ATM allows the answer types of effects implemented by tail-resumptive effect handlers to be polymorphic. We also implemented a proof-of-concept model checker for HEPCF□ATM programs.
Fuzzing is an effective technique to detect vulnerabilities in smart contracts. The challenge of smart contract fuzzing lies in the statefulness of contracts, meaning that certain vulnerabilities can only manifest in specific contract states. State-of-the-art fuzzers may generate and execute a plethora of meaningless or redundant transaction sequences during fuzzing, incurring a penalty in efficiency. To this end, we present DepFuzz, a hybrid fuzzer for efficient smart contract fuzzing, which introduces a symbolic execution module into a feedback-based fuzzer. Guided by distance-based dependencies between functions, DepFuzz can efficiently yield meaningful transaction sequences that contribute to vulnerability exposure or code coverage. Experiments on 286 benchmark smart contracts and 500 large real-world smart contracts corroborate that, compared to state-of-the-art approaches, DepFuzz achieves a higher instruction coverage rate and uncovers many more vulnerabilities in less time.
Rust is gaining popularity for its well-known memory safety guarantees and high performance, distinguishing it from C/C++ and JVM-based languages. Its compiler, rustc, enforces these guarantees through specialized mechanisms such as trait solving, borrow checking, and specific optimizations. However, Rust's unique language mechanisms introduce complexity to its compiler, resulting in bugs that are uncommon in traditional compilers. With Rust's increasing adoption in safety-critical domains, understanding these language mechanisms and their impact on compiler bugs is essential for improving the reliability of both rustc and Rust programs: compiler bugs can silently propagate into every compiled program. Such understanding would also provide the foundation for developing more effective testing strategies tailored to rustc. Yet, we still lack a large-scale, detailed, and in-depth study of rustc bugs.
To bridge this gap, this work presents a comprehensive and systematic study of rustc bugs, specifically those originating in semantic analysis and intermediate representation (IR) processing, which are stages that implement essential Rust language features such as ownership and lifetimes. Our analysis examines issues and fixes reported between 2022 and 2024, with a manual review of 301 valid issues. We categorize these bugs based on their causes, symptoms, affected compilation stages, and test case characteristics. Additionally, we evaluate existing rustc testing tools to assess their effectiveness and limitations. Our key findings include: (1) rustc bugs primarily arise from Rust's type system and lifetime model, with frequent errors in the High-Level Intermediate Representation (HIR) and Mid-Level Intermediate Representation (MIR) modules due to complex checkers and optimizations; (2) bug-revealing test cases often involve unstable features, advanced trait usages, lifetime annotations, standard APIs, and specific optimization levels; (3) while both valid and invalid programs can trigger bugs, existing testing tools struggle to detect non-crash errors, underscoring the need for further advancements in rustc testing.
Property-based testing (PBT) is a testing methodology with origins in the functional programming community. In recent years, PBT libraries have been developed for non-functional languages, including Python. However, to date, there is little evidence regarding how effective property-based tests are at finding bugs, and whether some kinds of property-based tests might be more effective than others. To gather this evidence, we conducted a corpus study of 426 Python programs that use Hypothesis, Python’s most popular library for PBT. We developed formal definitions for 12 categories of property-based test and implemented an intraprocedural static analysis that categorizes tests. Then, we evaluated the efficacy of test suites of 40 projects using mutation testing, and found that on average, each property-based test finds about 50 times as many mutations as the average unit test. We also identified the categories with the tests most effective at finding mutations, finding that tests that look for exceptions, that test inclusion in collections, and that check types are over 19 times more effective at finding mutations than other kinds of property-based tests. Finally, we conducted a parameter sweep study to assess the strength of property-based tests as a function of the number of random inputs generated, finding that 76% of mutations found were found within the first 20 inputs.
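To make the categories concrete, here are three properties written with the real Hypothesis API, one from each of the categories the study finds most effective; the function under test is our own toy example, not drawn from the corpus.

```python
# Three property categories: expects-exception, collection-inclusion,
# and type-check, exercised against a toy `chunk` function.
from hypothesis import given, strategies as st

def chunk(xs, n):
    if n <= 0:
        raise ValueError("n must be positive")
    return [xs[i:i + n] for i in range(0, len(xs), n)]

@given(st.lists(st.integers()), st.integers(max_value=0))
def test_rejects_bad_size(xs, n):          # looks for an exception
    try:
        chunk(xs, n)
        assert False, "expected ValueError"
    except ValueError:
        pass

@given(st.lists(st.integers(), min_size=1), st.integers(min_value=1))
def test_elements_preserved(xs, n):        # tests inclusion in a collection
    assert all(x in xs for c in chunk(xs, n) for x in c)

@given(st.lists(st.integers()), st.integers(min_value=1))
def test_returns_lists(xs, n):             # checks types
    assert all(isinstance(c, list) for c in chunk(xs, n))

# @given-wrapped tests are runnable directly, outside any test runner.
test_rejects_bad_size(); test_elements_preserved(); test_returns_lists()
```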
Secure, performant execution of untrusted code—as promised by WebAssembly (Wasm)—requires correct compilation to native code that enforces a sandbox. Errors in instruction selection can undermine the sandbox's guarantees, but prior verification work struggles to scale to the complexity of realistic industrial compilers.
We present Arrival, an instruction-selection verifier for the Cranelift production Wasm-to-native compiler. Arrival enables end-to-end, high-assurance verification while reducing developer effort. Arrival (1) automatically reasons about chains of instruction-selection rules, thereby reducing the need for developer-supplied intermediate specifications, (2) introduces a lightweight, efficient method for reasoning about stateful instruction-selection rules, and (3) automatically derives high-assurance machine code specifications.
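The flavor of the underlying obligation can be shown with the z3 SMT solver on a deliberately tiny, hand-written rule; Arrival's actual pipeline, rule language, and derived machine-code specifications are far richer. Lowering a multiplication by two to a left shift must agree with the IR semantics on all 64-bit inputs.

```python
# Verify one toy instruction-selection rule: x * 2  ==>  x << 1.
from z3 import BitVec, BitVecVal, Solver, unsat

x = BitVec("x", 64)
lhs = x * BitVecVal(2, 64)        # semantics of the source IR pattern
rhs = x << BitVecVal(1, 64)       # semantics of the selected instruction

s = Solver()
s.add(lhs != rhs)                 # search for a counterexample
assert s.check() == unsat         # none exists: the rule is correct

# A buggy rule would instead yield a model, i.e. a concrete 64-bit
# input witnessing the miscompilation.
```

Scaling this check to a production compiler requires exactly what Arrival contributes: reasoning across chains of rules, handling stateful rules, and deriving the machine-code-side specifications automatically.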
Our work verifies nearly all AArch64 instruction-selection rules reachable from Wasm core. Furthermore, Arrival reduces the developer effort required: 60% of all specifications benefit from our automation, thereby requiring 2.6X fewer hand-written specifications than prior approaches. Arrival finds new bugs in Cranelift's instruction selection, and it is viable for integration into production workflows.
Over the last 15 years, researchers have studied a wide variety of important aspects of contract systems, ranging from internal consistency (complete monitoring and correct blame) to subtle details about the semantics of contract combinators (dependency) to the difficulty of efficient checking (avoiding asymptotically bad redundant checking). Although each paper offers essential insights about contract systems, they also differ in inessential ways, making it hard to know how their metatheories combine. Even worse, the metatheories share tremendous tedium in their definitions and proofs, occupying researchers' time with no benefit.
In this paper, we present the idea that higher-order contract systems can be viewed as transition systems and show that this perspective offers an important opportunity for reuse in their metatheories. We demonstrate the value of this perspective by proving representative properties from the literature, and by contributing a new proof establishing that elimination of redundant contract checks can eliminate asymptotic slowdowns. To confirm our claims and encourage the adoption of our ideas, we provide a mechanized development in Agda.
This is a corrigendum for the article “PAFL: Enhancing Fault Localizers by Leveraging Project-Specific Fault Patterns” by Donguk Kim, Minseok Jeon, Doha Hwang, and Hakjoo Oh, published in Proc. ACM Program. Lang. 9, OOPSLA1, Article 129 (April 2025), https://doi.org/10.1145/3720526. The designation of Minseok Jeon, Korea University, and Hakjoo Oh, Korea University, as co-corresponding authors was erroneously left off the article. The correct corresponding authors for this article are Minseok Jeon, Korea University, and Hakjoo Oh, Korea University.