Proceedings of the ACM on Programming Languages: Vol. 8, No. OOPSLA2. 2024


SECTION: Papers

VarLifter: Recovering Variables and Types from Bytecode of Solidity Smart Contracts

Since funds or tokens in smart contracts are maintained through specific state variables, contract audit, an effective means of security assurance, focuses particularly on these variables and their related operations. However, the absence of publicly accessible source code for numerous contracts, with only bytecode exposed, hinders audit efforts. Recovering variables and their types from Solidity bytecode is thus a critical task in smart contract analysis and audit, yet it is challenging because the bytecode loses variable and type information, leaving only low-level data operated on by stack manipulations and untyped memory/storage accesses. The state-of-the-art smart contract decompilers miss many variables and incorrectly infer the types of many of the variables they do identify. To this end, we propose VarLifter, a lifter dedicated to the precise and efficient recovery of typed variables. VarLifter interprets every read or written field of a data region as at least one potential variable, and after discarding falsely identified variables, it progressively refines the variable types based on the variables' behaviors in the form of operation sequences. We evaluate VarLifter on 34,832 real-world Solidity smart contracts. VarLifter attains a precision of 97.48% and a recall of 91.84% for typed variable recovery. Moreover, VarLifter finishes analyzing 77% of smart contracts in around 10 seconds per contract. If VarLifter is used to replace the variable recovery modules of the two state-of-the-art Solidity bytecode decompilers, 52.4% and 74.6% more typed variables are correctly recovered, respectively. Applications of VarLifter to contract decompilation, contract audit, and contract bytecode fuzzing illustrate that the recovered variable information improves many contract analysis tasks.
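
As a rough illustration of the recovery problem (this is not VarLifter's algorithm; the trace format and type heuristics below are invented for the example), one can picture each accessed storage slot as a candidate variable whose type is refined from the widths of the operations applied to it:

```python
# Illustrative sketch (not VarLifter's actual algorithm): treat every storage
# slot touched in a simplified trace of (opcode, slot, bit_width) events as a
# candidate variable and refine a rough type from the access widths observed.

from collections import defaultdict

def recover_variables(trace):
    observed = defaultdict(set)               # slot -> set of access widths
    for op, slot, width in trace:
        observed[slot].add(width)

    typed_vars = {}
    for slot, widths in observed.items():
        width = max(widths)                   # refine towards the widest use
        if width == 160:
            typed_vars[slot] = "address"      # 160-bit values look address-like
        elif width == 1:
            typed_vars[slot] = "bool"
        else:
            typed_vars[slot] = f"uint{width}"
    return typed_vars

trace = [("SSTORE", 0, 160), ("SLOAD", 0, 160),
         ("SSTORE", 1, 256), ("SLOAD", 2, 1)]
print(recover_variables(trace))   # {0: 'address', 1: 'uint256', 2: 'bool'}
```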

A Dependent Nominal Physical Type System for Static Analysis of Memory in Low Level Code

We tackle the problem of checking non-proof-carrying code, i.e. automatically proving type-safety (implying in our type system spatial memory safety) of low-level C code or of machine code resulting from its compilation without modification. This requires a precise static analysis that we obtain by having a type system which (i) is expressive enough to encode common low-level idioms, like pointer arithmetic, discriminating variants by bit-stealing on aligned pointers, storing the size and the base address of a buffer in distinct parts of the memory, or records with flexible array members, among others; and (ii) can be embedded in an abstract interpreter. We propose a new type system that meets these criteria. The distinguishing feature of this type system is a nominal organization of contiguous memory regions, which (i) allows nesting, concatenation, union, and sharing parameters between regions; (ii) induces a lattice over sets of addresses from the type definitions; and (iii) permits updates to memory cells that change their type without requiring one to control aliasing. We provide a semantic model for our type system, which enables us to derive sound type checking rules by abstract interpretation, then to integrate these rules as an abstract domain in a standard flow-sensitive static analysis. Our experiments on various challenging benchmarks show that semantic type-checking using this expressive type system generally succeeds in proving type safety and spatial memory safety of C and machine code programs without modification, using only user-provided function prototypes.

Object-Oriented Fixpoint Programming with Datalog

Modern usages of Datalog exceed its original design purpose in scale and complexity. In particular, Datalog lacks abstractions for code organization and reuse, making programs hard to maintain. Is it possible to exploit abstractions and design patterns from object-oriented programming (OOP) while retaining a Datalog-like fixpoint semantics? To answer this question, we design a new OOP language called OODL with common OOP features: dynamic object allocation, object identity, dynamic dispatch, and mutation. However, OODL has a Datalog-like fixpoint semantics, such that recursive computations iterate until their result becomes stable. We develop two semantics for OODL: a fixpoint interpreter and a compiler that translates OODL to Datalog. Although the side effects found in OOP (object allocation and mutation) conflict with Datalog's fixpoint semantics, we can mostly resolve these incompatibilities through extensions of OODL. Within fixpoint computations, we employ immutable algebraic data structures (e.g. case classes in Scala), rather than relying on object allocation, and we introduce monotonically mutable data types (mono types) to enable a relaxed form of mutation. Our performance evaluation shows that the interpreter fails to solve fixpoint problems efficiently, whereas the compiled code exploits Datalog's semi-naïve evaluation.
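
A minimal sketch of the fixpoint semantics described here, using plain Python sets in place of OODL's constructs: a recursive computation (transitive closure) iterates until its result becomes stable.

```python
# Minimal sketch of Datalog-style fixpoint iteration: compute the transitive
# closure of an edge relation by iterating until the result is stable.
# Immutable frozensets stand in for OODL's fixpoint-friendly data structures.

def transitive_closure(edges):
    reach = frozenset(edges)
    while True:
        new = reach | frozenset((a, d) for (a, b) in reach
                                        for (c, d) in reach if b == c)
        if new == reach:          # fixpoint reached: result is stable
            return reach
        reach = new

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
```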

Intensional Functions

Functions in functional languages have a single elimination form — application — and cannot be compared, hashed, or subjected to other non-application operations. These operations can be approximated via defunctionalization: functions are replaced with first-order data and calls are replaced with invocations of a dispatch function. Operations such as comparison may then be implemented for these first-order data to approximate e.g. deduplication of continuations in algorithms such as unbounded searches. Unfortunately, this encoding is tedious, imposes a maintenance burden, and obfuscates the affected code. We introduce an alternative in intensional functions, a language feature which supports the definition of non-application operations in terms of a function’s definition site and closure-captured values. First-order data operations may be defined on intensional functions without burdensome code transformation. We give an operational semantics and type system and prove their formal properties. We further define intensional monads, whose Kleisli arrows are intensional functions, enabling monadic values to be similarly subjected to additional operations.
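
For readers unfamiliar with the defunctionalization encoding being contrasted here, a small Python sketch of it follows: closures become first-order records (a tag plus captured values) interpreted by a dispatch function, which is what makes comparison and deduplication possible; intensional functions aim to give this view without the manual rewrite.

```python
# Defunctionalization sketch: first-order data + a dispatch function stand in
# for closures, so "functions" can be compared, hashed, and deduplicated.

from typing import NamedTuple

class AddK(NamedTuple):   # represents lambda x: x + k, capturing k
    k: int

class MulK(NamedTuple):   # represents lambda x: x * k, capturing k
    k: int

def apply(f, x):          # the single dispatch function
    if isinstance(f, AddK):
        return x + f.k
    if isinstance(f, MulK):
        return x * f.k
    raise TypeError(f)

# First-order representations support equality and hashing, so duplicate
# continuations can be pruned -- something raw closures do not allow.
worklist = [AddK(1), MulK(2), AddK(1)]
deduped = set(worklist)
print(len(deduped), sorted(apply(f, 10) for f in deduped))
```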

Automating Unrealizability Logic: Hoare-Style Proof Synthesis for Infinite Sets of Programs

Automated verification of all members of a (potentially infinite) set of programs has the potential to be useful in program synthesis, as well as in verification of dynamically loaded code, concurrent code, and language properties. Existing techniques for verification of sets of programs are limited in scope and unable to create or use interpretable or reusable information about sets of programs. The consequence is that one cannot learn anything from one verification problem that can be used in another. Unrealizability logic (UL), proposed by Kim et al. as the first Hoare-style proof system to prove properties over sets of programs (defined by a regular tree grammar), presents a theoretical framework that can express and use reusable insight. In particular, UL features nonterminal summaries---inductive facts that characterize recursive nonterminals (analogous to procedure summaries in Hoare logic). In this work, we design the first UL proof synthesis algorithm, implemented as Wuldo. Specifically, we decouple the problem of deciding how to apply UL rules from the problem of synthesizing/checking nonterminal summaries by computing proof structure in a fully syntax-directed fashion. We show that Wuldo, when provided nonterminal summaries, can express and prove verification problems beyond the reach of existing tools, including establishing how infinitely many programs behave on infinitely many inputs. In some cases, Wuldo can even synthesize the necessary nonterminal summaries. Moreover, Wuldo can reuse previously proven nonterminal summaries across verification queries, making verification 1.96 times as fast as when summaries are instead proven from scratch.

Statistical Testing of Quantum Programs via Fixed-Point Amplitude Amplification

We present a new technique for accelerating quantum program testing. Given a quantum circuit with an input/output specification, our goal is to check whether executing the program on the input state produces the expected output. In quantum computing, however, it is impossible to directly check the equivalence of the two quantum states. Instead, we rely on statistical testing, which involves repeated program executions, state measurements, and subsequent comparisons with the specified output. To guarantee a high level of assurance, however, this method requires an extensive number of measurements. In this paper, we propose a solution to alleviate this challenge by adapting Fixed-Point Amplitude Amplification (FPAA) for quantum program testing. We formally present our technique, demonstrate its ability to reduce the required number of measurements as well as runtime cost without sacrificing the original statistical guarantee, and showcase its runtime effectiveness through case studies.

A Low-Level Look at A-Normal Form

A-normal form (ANF) is a widely studied intermediate form in which local control and data flow are made explicit in syntax, and a normal form in which many programs with equivalent control-flow graphs have a single normal syntactic representation. However, ANF is difficult to implement effectively and, as we formalize, difficult to extend with new lexically scoped constructs such as scoped region-based allocation. The problem, as has often been observed, is that normalization of commuting conversions is hard. This traditional view of ANF, that normalizing commuting conversions is hard, found in formal models and informed by high-level calculi, is wrong. By studying the low-level intensional aspects of ANF, we can derive a normal form in which normalizing commuting conversions is easy, which requires neither join points, nor code duplication, nor renormalization after inlining, and which is easily extended with new lexically scoped effects. We formalize the connection between ANF and monadic form and their intensional properties, derive an imperative ANF, and design a compiler pipeline from an untyped λ-calculus with scoped regions, to monadic form, to a low-level imperative monadic form in which A-normalization is trivial and safe for regions. We prove that any such compiler preserves, or optimizes, stack and memory behaviour compared to ANF. Our formalization reconstructs and systematizes pragmatic choices found in practice, including in current production-ready compilers. The main take-away from this work is that, in general, monadic form should be preferred over ANF, and A-normalization should only be done in a low-level imperative intermediate form. This maximizes the advantages of each form, and avoids all the standard problems with ANF.

Full Iso-Recursive Types

There are two well-known formulations of recursive types: iso-recursive and equi-recursive types. Abadi and Fiore [LICS 1996] have shown that iso- and equi-recursive types have the same expressive power. However, their encoding of equi-recursive types in terms of iso-recursive types requires explicit coercions. These coercions come with significant additional computational overhead, and complicate reasoning about the equivalence of the two formulations of recursive types. This paper proposes a generalization of iso-recursive types called full iso-recursive types. Full iso-recursive types allow encoding all programs with equi-recursive types without computational overhead. Instead of explicit term coercions, all type transformations are captured by computationally irrelevant casts, which can be erased at runtime without affecting the semantics of the program. Consequently, reasoning about the equivalence between the two approaches can be greatly simplified. We present a calculus called λFiµ, which extends the simply typed lambda calculus (STLC) with full iso-recursive types. The λFiµ calculus is proved to be type sound, and shown to have the same expressive power as a calculus with equi-recursive types. We also extend our results to subtyping, and show that equi-recursive subtyping can be expressed in terms of iso-recursive subtyping with cast operators.

SMT2Test: From SMT Formulas to Effective Test Cases

One of the primary challenges in software testing is generating high-quality test inputs and obtaining corresponding test oracles. This paper introduces a novel methodology to mitigate this challenge in testing program verifiers by employing SMT (Satisfiability Modulo Theories) formulas as a universal test case generator. The key idea is to transform SMT formulas into programs and link the satisfiability of the formulas with the safety property of the programs, allowing the satisfiability of the formulas to act as a test oracle for program verifiers. This method was implemented as a framework named SMT2Test, which enables the transformation of SMT formulas into Dafny and C programs. An intermediate representation was designed to augment the flexibility of this framework, streamlining the transformation for other programming languages and fostering modular transformation strategies. We evaluated the effectiveness of SMT2Test by finding defects in two program verifiers: the Dafny verifier and CPAchecker. Utilizing the SMT2Test framework with the SMT formulas from the SMT competition and SMT solver fuzzers, we discovered and reported a total of 14 previously unknown defects in these program verifiers that were not found by previous methods. After reporting, all of them have been confirmed, and 6 defects have been fixed. These findings show the effectiveness of our method and imply its potential application in testing other programming language infrastructures.
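
The core transformation can be pictured with a toy sketch (the real SMT2Test framework targets Dafny and C and handles full SMT-LIB; the formula representation and the emitted program below are simplified assumptions): a formula becomes a program whose assertion is reachable exactly when the formula is satisfiable, so satisfiability serves as the test oracle for the verifier's verdict.

```python
# Toy sketch of the SMT-formula-to-program idea: emit a C-like program whose
# assertion is violated exactly when the formula is satisfiable.  The input
# formula syntax here is a simplified assumption, not SMT2Test's front end.

def formula_to_c(formula_c_expr, variables):
    decls = "\n".join(f"  int {v} = nondet_int();" for v in variables)
    return (
        "int main(void) {\n"
        f"{decls}\n"
        f"  if ({formula_c_expr}) {{\n"
        "    assert(0);  /* reachable iff the formula is satisfiable */\n"
        "  }\n"
        "  return 0;\n"
        "}\n"
    )

# Example: (and (> x 5) (< x 3)) is unsatisfiable, so a correct verifier
# must report the generated program as safe.
print(formula_to_c("x > 5 && x < 3", ["x"]))
```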

Non-termination Proving at Scale

Program termination is a classic non-safety property whose falsification cannot in general be witnessed by a finite trace. This makes testing for non-termination challenging, and also makes it a natural target for symbolic proof. Several works in the literature apply non-termination proving to small, self-contained benchmarks, but it has not been developed for large, real-world projects; as such, despite its allure, non-termination proving has had limited practical impact. We develop a compositional theory for non-termination proving, paving the way for its scalable application to large codebases. Discovering non-termination is an under-approximate problem, and we present UNTer, a sound and complete under-approximate logic for proving non-termination. We then extend UNTer with separation logic and develop UNTerSL for heap-manipulating programs, yielding a compositional proof method amenable to automation via under-approximation and bi-abduction. We extend the Pulse analyser from Meta into an automated, compositional prover for non-termination based on UNTerSL. We have run this prover on large codebases and libraries, each comprising hundreds of thousands of lines of code, including OpenSSL, libxml2, libxpm and CryptoPP; it discovered several previously unknown non-termination bugs, which we have reported to the developers of these libraries.


Compiler Support for Sparse Tensor Convolutions

This paper extends prior work on sparse tensor algebra compilers to generate asymptotically efficient code for tensor expressions with affine subscript expressions. Our technique enables compiler support for a wide range of sparse computations, including sparse convolutions and pooling that are widely used in ML and graphics applications. We propose an approach that gradually rewrites compound subscript expressions to simple subscript expressions with loops that exploit the sparsity pattern of the input sparse tensors. As a result, the time complexity of the generated kernels is bounded by the number of stored elements and not by the shape of the tensors. Our approach seamlessly integrates into existing frameworks and is compatible with recent advances in compilers for sparse computations, including the flexibility to efficiently handle arbitrary combinations of different sparse tensor formats. The implementation of our algorithm is open source and upstreamed to the MLIR sparse compiler. Experimental results show that our method achieves 19.5x speedup when compared with the state-of-the-art compiler-based method at 99.9% sparsity. The generated sparse kernels start to outperform dense convolution implementations at about 80% sparsity.
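
To make the asymptotic claim concrete, here is a hand-written Python sketch of a 1-D sparse convolution whose work is proportional to the number of stored nonzeros rather than to the tensor shape; the compiler of course generates loop nests over compressed formats, not Python dictionaries.

```python
# Sketch: 1-D convolution over a sparse input stored as {index: value}.
# The loops visit only stored elements, so the cost is O(nnz(x) * len(w))
# rather than O(len_x * len(w)) -- the property the generated kernels preserve.

def sparse_conv1d(x_sparse, len_x, w):
    out = {}
    for i, xv in x_sparse.items():          # only stored elements of x
        for k, wv in enumerate(w):
            j = i - k                        # affine subscript i = j + k
            if 0 <= j <= len_x - len(w):
                out[j] = out.get(j, 0) + xv * wv
    return out

x = {2: 1.0, 7: 3.0}    # sparse input: two stored nonzeros out of ten positions
print(sparse_conv1d(x, len_x=10, w=[1.0, 0.5, 0.25]))
```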

Iris-MSWasm: Elucidating and Mechanising the Security Invariants of Memory-Safe WebAssembly

WebAssembly offers coarse-grained encapsulation guarantees via its module system, but does not support fine-grained sharing of its linear memory. MSWasm is a recent proposal which extends WebAssembly with fine-grained memory sharing via handles, a type of capability that guarantees spatial and temporal safety, and thus enables an expressive yet safe style of programming with flexible sharing. In this paper, we formally validate the pen-and-paper design of MSWasm. To do so, we first define MSWasmCert, a mechanisation of MSWasm that makes it a fully-defined, conservative extension of WebAssembly 1.0, including the module system. We then develop Iris-MSWasm, a foundational reasoning framework for MSWasm composed of a separation logic to reason about known code, and a logical relation to reason about unknown, potentially adversarial code. Iris-MSWasm thereby makes explicit a key aspect of the implicit universal contract of MSWasm: robust capability safety. We apply Iris-MSWasm to reason about key use cases of handles, in which the effect of calling an unknown function is bounded by robust capability safety. Iris-MSWasm thus works as a framework to prove complex security properties of MSWasm programs, and provides a foundation to evaluate the language-level guarantees of MSWasm.

libLISA: Instruction Discovery and Analysis on x86-64

Even though heavily researched, a full formal model of the x86-64 instruction set is still not available. We present libLISA, a tool for automated discovery and analysis of the ISA of a CPU. This produces the most extensive formal x86-64 model to date, with over 118000 different instruction groups. The process requires as little human specification as possible: specifically, we do not rely on a human-written (dis)assembler to dictate which instructions are executable on a given CPU, or what their in- and outputs are. The generated model is CPU-specific: behavior that is "undefined" is synthesized for the current machine. Producing models for five different x86-64 machines, we mutually compare them, discover undocumented instructions, and generate instruction sequences that are CPU-specific. Experimental evaluation shows that we enumerate virtually all instructions within scope, that the instructions' semantics are correct w.r.t. existing work, and that we improve existing work by exposing bugs in their handwritten models.

Synthesizing Formal Semantics from Executable Interpreters

Program verification and synthesis frameworks that allow one to customize the language in which one is interested typically require the user to provide a formally defined semantics for the language. Because writing a formal semantics can be a daunting and error-prone task, this requirement stands in the way of such frameworks being adopted by non-expert users. We present an algorithm that can automatically synthesize inductively defined syntax-directed semantics when given (i) a grammar describing the syntax of a language and (ii) an executable (closed-box) interpreter for computing the semantics of programs in the language of the grammar. Our algorithm synthesizes the semantics in the form of Constrained-Horn Clauses (CHCs), a natural, extensible, and formal logical framework for specifying inductively defined relations that has recently received widespread adoption in program verification and synthesis. The key innovation of our synthesis algorithm is a Counterexample-Guided Synthesis (CEGIS) approach that breaks the hard problem of synthesizing a set of constrained Horn clauses into small, tractable expression-synthesis problems that can be dispatched to existing SyGuS synthesizers. Our tool Synantic synthesized inductively-defined formal semantics from 14 interpreters for languages used in program-synthesis applications. When synthesizing formal semantics for one of our benchmarks, Synantic unveiled an inconsistency in the semantics computed by the interpreter for a language of regular expressions; fixing the inconsistency resulted in a more efficient semantics and, for some cases, in a 1.2x speedup for a synthesizer solving synthesis problems over such a language.

A Modal Type Theory of Expected Cost in Higher-Order Probabilistic Programs

The design of online learning algorithms typically aims to optimise the incurred loss or cost, e.g., the number of classification mistakes made by the algorithm. The goal of this paper is to build a type-theoretic framework to prove that a certain algorithm achieves its stated bound on the cost. Online learning algorithms often rely on randomness, their loss functions are often defined as expectations, precise bounds are often non-polynomial (e.g., logarithmic) and proofs of optimality often rely on potential-based arguments. Accordingly, we present pλ-amor, a type-theoretic graded modal framework for analysing (expected) costs of higher-order probabilistic programs with recursion. pλ-amor is an effect-based framework which uses graded modal types to represent potentials, cost and probability at the type level. It extends prior work (λ-amor) on cost analysis for deterministic programs. We prove pλ-amor sound relative to a Kripke step-indexed model which relates potentials with probabilistic coupling. We use pλ-amor to prove cost bounds of several examples from the online machine learning literature. Finally, we describe an extension of pλ-amor with a graded comonad and describe the relationship between the different modalities.

Sensitivity by Parametricity

The work of Fuzz has pioneered the use of functional programming languages where types allow reasoning about the sensitivity of programs. Fuzz and subsequent work (e.g., DFuzz and Duet) use advanced technical devices like linear types, modal types, and partial evaluation. These features usually require the design of a new programming language from scratch, a significant task on its own! While these features are part of the classical toolbox of programming languages, they are often unfamiliar to non-experts in this field. Fortunately, recent studies (e.g., Solo) have shown that linear types, and complex types in general, are not strictly needed for the task of determining programs' sensitivity, since this can be achieved by annotating base types with static sensitivity information. In this work, we take a different approach. We propose to enrich base types with information about the metric relation between values, and we present the novel idea of applying parametricity to derive direct proofs for the sensitivity of functions. A direct consequence of our result is that calculating and proving the sensitivity of functions is reduced simply to type-checking in a programming language with support for polymorphism and type-level naturals. We formalize our main result in a calculus, prove its soundness, and implement a software library in the programming language Haskell, in which we reason about the sensitivity of canonical examples. We show that the simplicity of our approach allows us to exploit the type inference of the host language to support a limited form of sensitivity inference. Furthermore, we extend the language with a privacy monad to showcase how our library can be used in practical scenarios such as the implementation of differentially private programs, where the privacy guarantees depend on the sensitivity of user-defined functions. Our library, called Spar, is implemented in less than 500 lines of code.
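
The sensitivity arithmetic tracked by such libraries can be previewed with a small value-level Python sketch (Spar does this statically, in Haskell's type system, which is the point of the paper): a c-sensitive function scales distances by at most c, sensitivities multiply under composition and add under pointwise sum.

```python
# Value-level sketch of sensitivity tracking (Spar does this at the type
# level): a function tagged with sensitivity c promises
# |f(x) - f(y)| <= c * |x - y|.

class Sens:
    def __init__(self, c, f):
        self.c, self.f = c, f
    def __call__(self, x):
        return self.f(x)

def compose(g, f):              # sensitivities multiply under composition
    return Sens(g.c * f.c, lambda x: g(f(x)))

def add(f, g):                  # ... and add under pointwise sum
    return Sens(f.c + g.c, lambda x: f(x) + g(x))

double = Sens(2, lambda x: 2 * x)       # 2-sensitive
shift  = Sens(1, lambda x: x + 10)      # 1-sensitive

h = add(compose(double, shift), shift)  # sensitivity 2*1 + 1 = 3
print(h.c, h(5))
```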

Mix Testing: Specifying and Testing ABI Compatibility of C/C++ Atomics Implementations

The correctness of complex software depends on the correctness of both the source code and the compilers that generate corresponding binary code. Compilers must do more than preserve the semantics of a single source file: they must ensure that generated binaries can be composed with other binaries to form a final executable. The compatibility of composition is ensured using an Application Binary Interface (ABI), which specifies details of calling conventions, exception handling, and so on. Unfortunately, there are no official ABIs for concurrent programs, so different atomics mappings, although correct in isolation, may induce bugs when composed. Indeed, today, mixing binaries generated by different compilers can lead to an erroneous resulting binary. We present mix testing: a new technique designed to find compiler bugs when the instructions of a C/C++ test are separately compiled for multiple compatible architectures and then mixed together. We define a class of compiler bugs, coined mixing bugs, that arise when parts of a program are compiled separately using different mappings from C/C++ atomic operations to assembly sequences. To demonstrate the generality of mix testing, we have designed and implemented a tool, atomic-mixer, which we have used: (a) to reproduce an existing non-mixing bug, the kind of bug that state-of-the-art concurrency testing tools are limited to finding (showing that atomic-mixer at least matches the capabilities of these tools), and (b) to find four previously-unknown mixing bugs in LLVM and GCC, and one prospective mixing bug in mappings proposed for the Java Virtual Machine. Lastly, we have worked with engineers at Arm to specify, for the first time, an atomics ABI for Armv8, and have used atomic-mixer to validate the LLVM and GCC compilers against it.

Statically Contextualizing Large Language Models with Typed Holes

Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate code context, particularly when working with definitions that are neither in the training data nor near the cursor. This paper demonstrates that tighter integration with the type and binding structure of the programming language in use, as exposed by its language server, can help address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server is able to identify the type and typing context of the hole that the programmer is filling, with Hazel's total syntax and type error correction ensuring that a meaningful program sketch is available whenever the developer requests a completion. This allows the system to prompt the LLM with codebase-wide contextual information that is not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer's goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server, which provides error localization and error messages. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications with accompanying unit tests that have been written from scratch to avoid data contamination, and that can easily be ported to new languages because they do not have large external library dependencies. These applications serve as challenge problems due to their extensive reliance on application-specific data structures. Through an ablation study, we examine the impact of contextualization with type definitions, function headers, and error messages, individually and in combination. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel, a low-resource language, we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource mainstream languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.

FlowCert: Translation Validation for Asynchronous Dataflow via Dynamic Fractional Permissions

Coarse-grained reconfigurable arrays (CGRAs) have gained attention in recent years due to their promising power efficiency compared to traditional von Neumann architectures. To program these architectures using ordinary languages such as C, a dataflow compiler must transform the original sequential, imperative program into an equivalent dataflow graph, composed of dataflow operators running in parallel. This transformation is challenging since the asynchronous nature of dataflow graphs allows out-of-order execution of operators, leading to behaviors not present in the original imperative programs. We address this challenge by developing a translation validation technique for dataflow compilers to ensure that the dataflow program has the same behavior as the original imperative program on all possible inputs and schedules of execution. We apply this method to a state-of-the-art dataflow compiler targeting the RipTide CGRA architecture. Our tool uncovers 8 compiler bugs where the compiler outputs incorrect dataflow graphs, including a data race that is otherwise hard to discover via testing. After repairing these bugs, our tool verifies the correct compilation of all programs in the RipTide benchmark suite.

SparseAuto: An Auto-scheduler for Sparse Tensor Computations using Recursive Loop Nest Restructuring

Automated code generation and performance enhancements for sparse tensor algebra have become essential in many real-world applications, such as quantum computing, physical simulations, computational chemistry, and machine learning. General sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to generate asymptotically better schedules for complex sparse tensor expressions using kernel fission and fusion. We present generalized loop restructuring transformations to reduce asymptotic time complexity and memory footprint. Furthermore, we present an auto-scheduler that uses a partially ordered set (poset)-based cost model that uses both time and auxiliary memory complexities to prune the search space of schedules. In addition, we highlight the use of Satisfiability Modulo Theories (SMT) solvers in sparse auto-schedulers to approximate the Pareto frontier of better schedules to the smallest number of possible schedules, with user-defined constraints available at compile-time. Finally, we show that our auto-scheduler can select better-performing schedules and generate code for them. Our results show that the auto-scheduler-provided schedules achieve orders-of-magnitude speedup compared to the code generated by the Tensor Algebra Compiler (TACO) for several computations on different real-world tensors.

Quantum Probabilistic Model Checking for Time-Bounded Properties

Probabilistic model checking (PMC) is a verification technique for analyzing the properties of probabilistic systems. However, existing techniques face challenges in verifying large systems with high accuracy. PMC struggles with state explosion, where the number of states grows exponentially with the size of the system, making large system verification infeasible. While statistical model checking (SMC) avoids PMC’s state explosion problem by using a simulation approach, it suffers from runtime explosion, requiring numerous samples for high accuracy. To address these limitations in verifying large systems with high accuracy, we present quantum probabilistic model checking (QPMC), the first method leveraging quantum computing for PMC with respect to time-bounded properties. QPMC addresses state explosion by encoding PMC problems into quantum circuits that superpose states within qubits. Additionally, QPMC resolves runtime explosion through Quantum Amplitude Estimation, efficiently estimating the probabilities of specified properties. We prove that QPMC correctly solves PMC problems and achieves a quadratic speedup in time complexity compared to SMC.

Verified Lock-Free Session Channels with Linking

Type systems and program logics based on session types provide powerful high-level reasoning principles for message-passing concurrency. Modern versions employ bidirectional session channels that (1) are asynchronous so that send operations do not block, (2) have buffers in both directions so that both parties can send messages in parallel, and (3) feature a link operation (also called forward) to concisely write programs in process style. These features complicate a low-level lock-free implementation of channels and therefore increase the gap between the meta theory of prior work—which is verified w.r.t. a high-level semantics of channels (e.g., π-calculus)—and the code that runs on an actual computer. We address this problem by verifying a low-level lock-free implementation of session channels w.r.t. a high-level specification based on session types. We carry out our verification in a layered manner by employing the Iris framework for concurrent separation logic. We start with an abstract specification of (unidirectional) queues—of which we provide a linked-list and array-segment based implementation—and gradually build up to session channels with all of the aforementioned features. To make a layered verification possible we develop two logical abstractions—queues with ghost linking and pairing invariants—to reason about the atomicity and changing endpoints due to linking, respectively. All our results are mechanized in the Coq proof assistant.

Crabtree: Rust API Test Synthesis Guided by Coverage and Type

Rust's type system constrains pointer operations, preventing bugs such as use-after-free. However, these constraints may be too strict for programming tasks such as implementing cyclic data structures. For such tasks, programmers can temporarily suspend checks using the unsafe keyword. Rust libraries wrap unsafe code blocks and expose higher-level APIs. They need to be extensively tested to uncover memory-safety bugs that can only be triggered by unexpected API call sequences or inputs. While prior works have attempted to automatically test Rust library APIs, they fail to test APIs with common Rust features, such as polymorphism, traits, and higher-order functions, or they have scalability issues and can only generate tests for a small number of combined APIs. We propose Crabtree, a testing tool for Rust library APIs that can automatically synthesize test cases with native support for Rust traits and higher-order functions. Our tool improves upon the test synthesis algorithms of prior works by combining synthesis and fuzzing through a coverage- and type-guided search algorithm that intelligently grows test programs and input corpus towards testing more code. To the best of our knowledge, our tool is the first to generate well-typed tests for libraries that make use of higher-order trait functions. Evaluation of Crabtree on 30 libraries found four previously unreported memory-safety bugs, all of which were accepted by the respective authors.

Merging Gradual Typing

Programming language mechanisms with a type-directed semantics are nowadays common and widely used. Such mechanisms include gradual typing, type classes, implicits and intersection types with a merge operator. While sharing common challenges in their design and having complementary strengths, type-directed mechanisms have been mostly independently studied. This paper studies a new calculus, called λM⋆, which combines two type-directed mechanisms: gradual typing and a merge operator based on intersection types. Gradual typing enables a smooth transition between dynamically and statically typed code, and is available in languages such as TypeScript or Flow. The merge operator generalizes record concatenation to allow merges of values of any two types. Recent work has shown that the merge operator enables modelling expressive OOP features like first-class traits/classes and dynamic inheritance with static type-checking. These features are not found in mainstream statically typed OOP languages, but they can be found in dynamically or gradually typed languages such as JavaScript or TypeScript. In λM⋆, by exploiting the complementary strengths of gradual typing and the merge operator, we obtain a foundation for modelling gradually typed languages with both first-class classes and dynamic inheritance. We study a static variant of λM⋆ (called λM); prove the type-soundness of λM⋆; show that λM⋆ can encode gradual rows and all well-typed terms in the GTFL≲ calculus; and show that λM⋆ satisfies gradual typing criteria. The dynamic gradual guarantee (DGG) is challenging due to the possibility of ambiguity errors. We establish a variant of the DGG using a semantic notion of precision based on a step-indexed logical relation.

Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language. Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
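
The three-step pipeline reads naturally as a filter chain; the sketch below shows only its shape, with hypothetical stub functions (llm_generate_tests, llm_translate_code, compile_tests_and_run) standing in for the Code LLM and the per-language tooling, not MultiPL-T's actual interfaces.

```python
# Shape of the MultiPL-T filter chain.  The three helpers are hypothetical
# stubs standing in for the Code LLM and the per-language test tooling.

def llm_generate_tests(py_src):                  # step 1: tests for Python code
    return ["assert add(1, 2) == 3"]

def llm_translate_code(py_src, target):          # step 2: candidate translation
    return "let add a b = a + b"

def compile_tests_and_run(code, tests, target):  # step 3: validate translation
    return True                                  # real pipeline compiles the
                                                 # tests to the target and runs

def multipl_t(source_items, target):
    validated = []
    for py_src in source_items:
        tests = llm_generate_tests(py_src)
        if not tests:                            # drop poorly covered items
            continue
        candidate = llm_translate_code(py_src, target)
        if compile_tests_and_run(candidate, tests, target):
            validated.append(candidate)          # keep only test-validated code
    return validated

print(multipl_t(["def add(a, b): return a + b"], target="ocaml"))
```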

WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models

Compiler correctness is crucial, as miscompilation can falsify program behaviors, leading to serious consequences over the software supply chain. In the literature, fuzzing has been extensively studied to uncover compiler defects. However, compiler fuzzing remains challenging: existing approaches focus on black- and grey-box fuzzing, which generates test programs without sufficient understanding of internal compiler behaviors. As such, they often fail to construct test programs to exercise intricate optimizations. Meanwhile, traditional white-box techniques, such as symbolic execution, are computationally inapplicable to the giant codebase of compiler systems. Recent advances demonstrate that Large Language Models (LLMs) excel in code generation/understanding tasks and even have achieved state-of-the-art performance in black-box fuzzing. Nonetheless, guiding LLMs with compiler source-code information remains a missing piece of research in compiler testing. To this end, we propose WhiteFox, the first white-box compiler fuzzer using LLMs with source-code information to test compiler optimization, with a spotlight on detecting deep logic bugs in the emerging deep learning (DL) compilers. WhiteFox adopts a multi-agent framework: (i) an LLM-based analysis agent examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; (ii) an LLM-based generation agent produces test programs based on the summarized requirements. Additionally, optimization-triggering tests are also used as feedback to further enhance the test generation prompt on the fly. Our evaluation on the three most popular DL compilers (i.e., PyTorch Inductor, TensorFlow-XLA, and TensorFlow Lite) shows that WhiteFox can generate high-quality test programs to exercise deep optimizations requiring intricate conditions, practicing up to 8 times more optimizations than state-of-the-art fuzzers. To date, WhiteFox has found in total 101 bugs for the compilers under test, with 92 confirmed as previously unknown and 70 already fixed. Notably, WhiteFox has been recently acknowledged by the PyTorch team, and is in the process of being incorporated into its development workflow. Finally, beyond DL compilers, WhiteFox can also be adapted for compilers in different domains, such as LLVM, where WhiteFox has already found multiple bugs.

HiPy: Extracting High-Level Semantics from Python Code for Data Processing

Data science workloads frequently include Python code, but Python's dynamic nature makes efficient execution hard. Traditional approaches either treat Python as a black box, missing out on optimization potential, or are limited to a narrow domain. However, a deep and efficient integration of user-defined Python code into data processing systems requires extracting the semantics of the entire Python code. In this paper, we propose a novel approach for extracting the high-level semantics by transforming general Python functions into program generators that generate a statically-typed IR when executed. The extracted IR then allows for high-level, domain-specific optimizations and the generation of efficient C++ code. With our prototype implementation, HiPy, we achieve single-threaded speedups of 2-20x for many workloads. Furthermore, HiPy is also capable of accelerating Python code in other domains like numerical data, where it can sometimes even outperform specialized compilers.
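
A minimal sketch of the "program generator" idea: instead of interpreting the Python function, run it over tracer objects that record a typed IR of the operations performed; HiPy's actual IR, typing, and C++ backend are far richer than this toy.

```python
# Minimal sketch of turning a Python function into a program generator:
# executing the function over tracer values records a small typed IR
# instead of computing a result.  (HiPy's real IR and backend are richer.)

IR = []          # emitted IR instructions, in order

class Val:
    _n = 0
    def __init__(self, op, args, typ="i64"):
        Val._n += 1
        self.name = f"v{Val._n}"
        IR.append(f"{self.name}: {typ} = {op}({', '.join(a.name for a in args)})")

    def __add__(self, other):
        return Val("add", [self, other])

    def __mul__(self, other):
        return Val("mul", [self, other])

def user_code(x, y):             # ordinary-looking Python
    return x * x + y

# Running the function over tracer values generates the IR.
result = user_code(Val("arg", []), Val("arg", []))
print("\n".join(IR), "\nreturn", result.name)
```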

Automatically Reducing Privilege for Access Control Policies

Access control policies are programs used to secure cloud resources. These policies should grant only the permissions that a given application actually needs. However, it is challenging to write and maintain policies as applications and their required permissions change over time. In this paper, we focus on the Amazon Web Services (AWS) IAM policy language and present an approach that, given a policy, synthesizes a modified policy that is more restrictive and better abides by the principle of least privilege. Our approach looks at the actual access history (e.g., access logs) of an application and computes the least permissive local modification of the user-given policy that still provides the same permissions that were observed in the access history. We treat the problem of computing the least permissive policy as a generalization problem in a lattice of possible policies (i.e., the set of local modifications). We show that our synthesis algorithm comes with correctness guarantees and is amenable to an efficient implementation that is easy to parallelize. We implement our algorithm in a tool, IAM-PolicyRefiner, and evaluate it on policies attached to AWS roles with access logs. For each role, IAM-PolicyRefiner can compute easy-to-inspect refined policies in less than 1 minute, and the refined policies do not overfit to the requests in the log, i.e., the policies also allow requests in a left-out test set of requests.
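
A simplified sketch of the "least permissive local modification" idea: given the actions actually observed in the access log, pick the tightest pattern from a small candidate lattice that still covers every observed action. The candidate lattice and pattern language below are assumptions for illustration; IAM-PolicyRefiner works over full IAM policies, not single action patterns.

```python
# Sketch: refine an over-broad IAM action wildcard to the least-permissive
# candidate pattern that still allows every action seen in the access log.
# The candidate lattice below is an illustrative assumption.

from fnmatch import fnmatch

def covers(pattern, actions):
    return all(fnmatch(action, pattern) for action in actions)

def refine(observed_actions, candidates):
    """candidates are ordered from most specific to most general."""
    for pattern in candidates:
        if covers(pattern, observed_actions):
            return pattern
    return "*"                       # fall back to the original broad grant

log = ["s3:GetObject", "s3:GetObjectTagging"]
lattice = ["s3:GetObject", "s3:GetObject*", "s3:Get*", "s3:*"]
print(refine(log, lattice))          # -> "s3:GetObject*"
```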

Fully Verified Instruction Scheduling

The CompCert project, the state-of-the-art compiler and the first end-to-end formally verified C compiler, does not support fully verified instruction scheduling. Existing research on this topic instead implements translation validation: there is no direct formal proof that the scheduling algorithm is correct, only an a posteriori check of each compilation run. With such a method, CompCert accepts a valid C program and compiles it correctly only when the untrusted scheduler happens to produce a correct result; the complete correctness of the scheduler itself is not guaranteed, and the validation adds compile-time overhead. In this work, we present the first mechanized library for fully verified instruction scheduling that keeps the proof workload acceptably lightweight. The idea for reducing the proof length is to exploit a simple property: any topological reordering of a topologically sorted list can be obtained by a sequence of swaps of adjacent unordered elements. Together with the transitivity of the semantic simulation relation, the only remaining burden is to prove semantic preservation for a transition that swaps two adjacent independent instructions inside one block. Once this result is proved, showing the correctness of any new instruction scheduling algorithm only requires proving that it preserves the syntax-level dependences among instructions, instead of reasoning about semantic details every time. We implemented a mechanized library realizing this method in the Coq proof assistant, on top of CompCert's library as a framework, and used the list scheduling algorithm as a case study to show that its correctness can be formally proved using our theory. With our method, which abstracts away the semantic details, it is easy to implement any scheduler that reorders instructions with little extra proof burden. The scheduler in our case study also abstracts the scheduling heuristic as a universal parameter, so the heuristic can be modified without touching any correctness proof.
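
The key proof idea (a dependence-respecting reordering can be realized by repeatedly swapping adjacent independent instructions) can be exercised with a small Python sketch; the dependence check below is a crude read/write-set approximation invented for the example, not CompCert's syntactic dependence relation.

```python
# Sketch of the proof idea behind the scheduling library: a reordering that
# respects dependences can be reached by swapping adjacent *independent*
# instructions only.  Instructions are (writes, reads) register sets; the
# dependence test is a rough approximation of the syntactic one.

def independent(a, b):
    (wa, ra), (wb, rb) = a, b
    return not (wa & wb or wa & rb or wb & ra)

def realize_by_adjacent_swaps(block, target_order):
    """Permute `block` into `target_order` using only legal adjacent swaps."""
    block = list(block)
    swaps = 0
    for i, instr in enumerate(target_order):
        j = block.index(instr, i)
        while j > i:                               # bubble it leftwards
            assert independent(block[j - 1], block[j]), "illegal reordering"
            block[j - 1], block[j] = block[j], block[j - 1]
            j -= 1
            swaps += 1
    return block, swaps

i1 = (frozenset({"r1"}), frozenset({"r0"}))        # r1 := f(r0)
i2 = (frozenset({"r2"}), frozenset({"r0"}))        # r2 := g(r0), independent of i1
i3 = (frozenset({"r3"}), frozenset({"r1", "r2"}))  # r3 := r1 + r2
print(realize_by_adjacent_swaps([i1, i2, i3], [i2, i1, i3]))
```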

Quantitative Weakest Hyper Pre: Unifying Correctness and Incorrectness Hyperproperties via Predicate Transformers

We present a novel weakest pre calculus for reasoning about quantitative hyperproperties over nondeterministic and probabilistic programs. Whereas existing calculi allow reasoning about the expected value that a quantity assumes after program termination from a single initial state, we do so for initial sets of states or initial probability distributions. We thus (i) obtain a weakest pre calculus for hyper Hoare logic and (ii) enable reasoning about so-called hyperquantities which include expected values but also quantities (e.g. variance) out of scope of previous work. As a byproduct, we obtain a novel strongest post for weighted programs that extends both existing strongest and strongest liberal post calculi. Our framework reveals novel dualities between forward and backward transformers, correctness and incorrectness, as well as nontermination and unreachability.

CoolerSpace: A Language for Physically Correct and Computationally Efficient Color Programming

Color programmers manipulate lights, materials, and the resulting colors from light-material interactions. Existing libraries for color programming provide only a thin layer of abstraction around matrix operations. Color programs are, thus, vulnerable to bugs arising from mathematically permissible but physically meaningless matrix computations. Correct implementations are difficult to write and optimize. We introduce CoolerSpace to facilitate physically correct and computationally efficient color programming. CoolerSpace raises the level of abstraction of color programming by allowing programmers to focus on describing the logic of color physics. Correctness and efficiency are handled by CoolerSpace. The type system in CoolerSpace assigns physical meaning and dimensions to user-defined objects. The typing rules permit only legal computations informed by color physics and perception. Along with type checking, CoolerSpace also generates performance-optimized programs using equality saturation. CoolerSpace is implemented as a Python library and compiles to ONNX, a common intermediate representation for tensor computations. CoolerSpace not only prevents common errors in color programming, but also does so without run-time overhead: even unoptimized CoolerSpace programs out-perform existing Python-based color programming systems by up to 5.7 times; our optimizations provide up to an additional 1.4 times speed-up.

Plume: Efficient and Complete Black-Box Checking of Weak Isolation Levels

Modern databases embrace weak isolation levels to cater for highly available transactions. However, weak isolation bugs have recently manifested in many production databases. This raises the concern of whether database implementations actually deliver their promised isolation guarantees in practice. In this paper we present Plume, the first efficient, complete, black-box checker for weak isolation levels. Plume builds on modular, fine-grained, transactional anomalous patterns, with which we establish sound and complete characterizations of representative weak isolation levels, including read committed, read atomicity, and transactional causal consistency. Plume leverages a novel combination of two techniques, vectors and tree clocks, to accelerate isolation checking. Our extensive assessment shows that Plume can reproduce all known violations in a large collection of anomalous database execution histories, detect new isolation bugs in three production databases along with informative counterexamples, find more weak isolation anomalies than the state-of-the-art checkers, and efficiently validate isolation guarantees under a wide variety of workloads.

Computing Precise Control Interface Specifications

Verifying network programs is challenging because of how they divide labor: the control plane computes high level routes through the network and compiles them to device configurations, while the data plane uses these configurations to realize the desired forwarding behavior. In practice, the correctness of the data plane often assumes that the configurations generated by the control plane will satisfy complex specifications. Consequently, validation tools such as program verifiers, runtime monitors, fuzzers, and test-case generators must be aware of these control interface specifications (ci-specs) to avoid raising false alarms. In this paper, we propose the first algorithm for computing precise ci-specs for network data planes. Our specifications are designed to be efficiently monitorable—concretely, checking that a fixed configuration satisfies a ci-spec can be done in polynomial time. Our algorithm, based on modular program instrumentation, quantifier elimination, and a path-based analysis, is more expressive than prior work, and is applicable to practical network programs. We describe an implementation and show that ci-specs computed by our tool are useful for finding real bugs in real-world data plane programs.

Automating Pruning in Top-Down Enumeration for Program Synthesis Problems with Monotonic Semantics

In top-down enumeration for program synthesis, abstraction-based pruning uses an abstract domain to approximate the set of possible values that a partial program, when completed, can output on a given input. If the set does not contain the desired output, the partial program and all its possible completions can be pruned. In its general form, abstraction-based pruning requires manually designed, domain-specific abstract domains and semantics, and thus has only been used in domain-specific synthesizers. This paper provides sufficient conditions under which a form of abstraction-based pruning can be automated for arbitrary synthesis problems in the general-purpose Semantics-Guided Synthesis (SemGuS) framework without requiring manually defined abstract domains. We show that if the semantics of the language for which we are synthesizing programs exhibits some monotonicity properties, one can obtain an abstract interval-based semantics for free from the concrete semantics of the programming language, and use such semantics to effectively prune the search space. We also identify a condition that ensures such abstract semantics can be used to compute a precise abstraction of the set of values that a program derivable from a given hole in a partial program can produce. These precise abstractions make abstraction-based pruning more effective. We implement our approach in a tool, Moito, which can tackle synthesis problems defined in the SemGuS framework. Moito can automate interval-based pruning without any a priori knowledge of the problem domain, and solve synthesis problems that previously required domain-specific, abstraction-based synthesizers, e.g., synthesis of regular expressions, CSV file schemas, and imperative programs from examples.
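
Interval-based pruning can be sketched concretely: if the language's semantics is monotone in each argument, evaluating a partial program over an interval of the hole's possible values yields an interval guaranteed to contain every completion's output, and any example whose desired output falls outside that interval prunes the partial program. The example below hard-codes a toy arithmetic language; Moito derives such abstract semantics automatically from SemGuS definitions.

```python
# Sketch of interval-based pruning for a monotone toy language: a partial
# program "x + HOLE * c" is pruned if, for some example, the desired output
# lies outside the interval obtained by evaluating the hole abstractly.

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def interval_scale(a, c):                     # monotone when c >= 0
    return (a[0] * c, a[1] * c)

def abstract_eval(x, hole_interval, c):
    # abstract semantics of  x + HOLE * c  with HOLE ranging over hole_interval
    return interval_add((x, x), interval_scale(hole_interval, c))

def can_prune(examples, hole_interval, c):
    for x, desired in examples:
        lo, hi = abstract_eval(x, hole_interval, c)
        if not (lo <= desired <= hi):
            return True                       # no completion can satisfy this example
    return False

examples = [(1, 100), (2, 101)]
print(can_prune(examples, hole_interval=(0, 10), c=3))   # True: 100 > 1 + 10*3
```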

Refinement Type Refutations

Refinement types combine SMT-decidable constraints with a compositional, syntax-directed type system to provide a convenient way to statically and automatically check properties of programs. However, when type checking fails, programmers must rely on cryptic error messages that, at best, point out the code location where a subtyping constraint failed, in order to determine the root cause of the failure. In this paper, we introduce refinement type refutations, a new approach to explaining why refinement type checking fails, which mirrors the compositional way in which refinement type checking is carried out. First, we show how to systematically transform standard bidirectional type checking rules to obtain refutations. Second, we extend the approach to account for global constraint-based refinement inference via the notion of a must-instantiation: a set of concrete inhabitants of the types of subterms that suffice to demonstrate why typing fails. Third, we implement our method in HayStack, an extension to LiquidHaskell which automatically finds type refutations when refinement type checking fails, and helps users understand refutations via an interactive user interface. Finally, we present an empirical evaluation of HayStack using the regression benchmark set of LiquidHaskell, and the benchmark set of G2, a previous method that searches for (non-compositional) counterexample traces by symbolically executing Haskell source. We show that HayStack can find refutations for 99.7% of benchmarks, including those with complex typing constructs (e.g., abstract and bounded refinements, and reflection), and does so an order of magnitude faster than G2.

The Ultimate Conditional Syntax

Functional programming languages typically support expressive pattern-matching syntax allowing programmers to write concise and type-safe code, especially appropriate for manipulating algebraic data types. Many features have been proposed to enhance the expressiveness of stock pattern-matching syntax, such as pattern bindings, pattern alternatives (a.k.a. disjunction), pattern conjunction, view patterns, pattern guards, pattern synonyms, active patterns, ‘if-let’ patterns, multi-way if-expressions, etc. In this paper, we propose a new pattern-matching syntax that is both more expressive and (we argue) simpler and more readable than previous alternatives. Our syntax supports parallel and nested matches interleaved with computations and intermediate bindings. This is achieved through a form of nested multi-way if-expressions with a condition-splitting mechanism to factor common conditional prefixes as well as a binding technique we call conditional pattern flowing. We motivate this new syntax with many examples in the setting of MLscript, a new ML-family programming language. We describe a straightforward desugaring pass from our rich source syntax into a minimal core syntax that only supports flat patterns and has an intuitive small-step semantics. We then provide a translation from the core syntax into a normalized syntax without backtracking, which is more amenable to coverage checking and compilation, and formally prove that our translation is semantics-preserving. We view this work as a step towards rethinking pattern matching to make it more powerful and natural to use. Our syntax can easily be integrated, in part or in whole, into existing as well as future programming language designs.

On the Expressive Power of Languages for Static Variability

Variability permeates software development to satisfy ever-changing requirements and mass-customization needs. A prime example is the Linux kernel, which employs the C preprocessor to specify a set of related but distinct kernel variants. To study, analyze, and verify variational software, several formal languages have been proposed. For example, the choice calculus has been successfully applied for type checking and symbolic execution of configurable software, while other formalisms have been used for variational model checking and change impact analysis, among other use cases. Yet, these languages have not been formally compared; hence, little is known about their relationships. Crucially, it is unclear to what extent one language subsumes another, how research results from one language can be applied to other languages, and which language is suitable for which purpose or domain. In this paper, we propose a formal framework to compare the expressive power of languages for static (i.e., compile-time) variability. By establishing a common semantic domain to capture a widely used intuition of explicit variability, we can formulate the basic, yet to date neglected, properties of soundness, completeness, and expressiveness for variability languages. We then prove the (un)soundness and (in)completeness of a range of existing languages, and relate their ability to express the same variational systems. We implement our framework as an extensible open source Agda library in which proofs act as correct compilers between languages or differencing algorithms. We find different levels of expressiveness as well as complete and incomplete languages w.r.t. our unified semantic domain, with the choice calculus being among the most expressive languages.

Programmable MCMC with Soundly Composed Guide Programs

Probabilistic programming languages (PPLs) provide language support for expressing flexible probabilistic models and solving Bayesian inference problems. PPLs with programmable inference make it possible for users to obtain improved results by customizing inference engines using guide programs that are tailored to a corresponding model program. However, errors in guide programs can compromise the statistical soundness of the inference. This article introduces a novel coroutine-based framework for verifying the correctness of user-written guide programs for a broad class of Markov chain Monte Carlo (MCMC) inference algorithms. Our approach rests on a novel type system for describing communication protocols between a model program and a sequence of guides that each update only a subset of random variables. We prove that, by translating guide types to context-free processes with finite norms, it is possible to check structural type equality between models and guides in polynomial time. This connection gives rise to an efficient type-inference algorithm for probabilistic programs with flexible constructs such as general recursion and branching. We also contribute a coverage-checking algorithm that verifies the support of sequentially composed guide programs agrees with that of the model program, which is a key soundness condition for MCMC inference with multiple guides. Evaluations on diverse benchmarks show that our type-inference and coverage-checking algorithms efficiently infer types and detect sound and unsound guides for programs that existing static analyses cannot handle.

Extending the C/C++ Memory Model with Inline Assembly

Programs written in C/C++ often include inline assembly: a snippet of architecture-specific assembly code used to access low-level functionalities that are impossible or expensive to simulate in the source language. Although inline assembly is widely used, its semantics has not yet been formally studied. In this paper, we overcome this deficiency by investigating the effect of inline assembly on the consistency semantics of C/C++ programs. We propose the first memory model of the C++ Programming Language with support for inline assembly for Intel's x86 including non-temporal stores and store fences. We argue that previous provably correct compiler optimizations and correct compiler mappings should remain correct under such an extended model and we prove that this requirement is met by our proposed model.

Effects and Coeffects in Call-by-Push-Value

Effect and coeffect tracking integrate many types of compile-time analysis, such as cost, liveness, or dataflow, directly into a language's type system. In this paper, we investigate the addition of effect and coeffect tracking to the type system of call-by-push-value (CBPV), a computational model useful in compilation for its isolation of effects and for its ability to cleanly express both call-by-name and call-by-value computations. Our main result is effect-and-coeffect soundness, which asserts that the type system accurately bounds the effects that the program may trigger during execution and accurately tracks the demands that the program may make on its environment. This result holds for two different dynamic semantics: a generic one that can be adapted for different coeffects and one that is adapted for reasoning about resource usage. In particular, the second semantics discards the evaluation of unused values and pure computations while ensuring that effectful computations are always evaluated, even if their results are not required. Our results have been mechanized using the Coq proof assistant.

Gradient: Gradual Compartmentalization via Object Capabilities Tracked in Types

Modern software needs fine-grained compartmentalization, i.e., intra-process isolation. A particularly important reason is the threat of supply-chain attacks, which is aggravated by modern applications depending on hundreds or even thousands of libraries. Object capabilities are a particularly salient approach to compartmentalization, but they require the entire program to assume a lack of ambient authority. Most existing code was written under no such assumption; effectively, existing applications need to undergo a rewrite-the-world migration to reap the advantages of ocap. We propose gradual compartmentalization, an approach which allows gradually migrating an application to object capabilities, component by component in arbitrary order, all the while continuously enjoying security guarantees. The approach relies on runtime authority enforcement and tracking the authority of objects in the type system. We present Gradient, a proof-of-concept gradual compartmentalization extension to Scala which uses Enclosures and Capture Tracking as its key components. We evaluate our proposal by migrating the standard XML library of Scala to Gradient.

Compilation of Shape Operators on Sparse Arrays

We show how to build a compiler for a sparse array language that supports shape operators such as reshaping or concatenating arrays, in addition to compute operators. Existing sparse array programming systems implement generic shape operators for only some sparse data structures, reduce shape operators on other data structures to those, and do not support fusion. Our system compiles sparse array expressions to code that efficiently iterates over reshaped views of irregular sparse data structures, without needing to materialize temporary storage for intermediates. Our evaluation shows that our approach generates sparse array code competitive with popular sparse array libraries: our generated shape operators achieve geometric mean speed-ups of 1.66×–15.3× when compared to hand-written kernels in scipy.sparse and 1.67×–651× when compared to generic implementations in pydata/sparse. For operators that require data structure conversions in these libraries, our generated code achieves geometric mean speed-ups of 7.29×–13.0× when compared to scipy.sparse and 21.3×–511× when compared to pydata/sparse. Finally, our evaluation demonstrates that fusing shape and compute operators improves the performance of several expressions by geometric mean speed-ups of 1.22×–2.23×.
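
A minimal, illustrative sketch (plain Python over a COO coordinate representation, not the code generated by the system described above) of the kind of shape operator being compiled: reshape becomes pure index arithmetic over the stored coordinates, so the reshaped view can be iterated over without materializing a dense or temporary sparse intermediate.

    def coo_reshape(rows, cols, vals, old_shape, new_shape):
        """Map each (row, col) coordinate through the row-major flat index."""
        assert old_shape[0] * old_shape[1] == new_shape[0] * new_shape[1]
        new_rows, new_cols = [], []
        for r, c in zip(rows, cols):
            flat = r * old_shape[1] + c
            new_rows.append(flat // new_shape[1])
            new_cols.append(flat % new_shape[1])
        return new_rows, new_cols, vals

    # A 2x6 matrix with three non-zeros, reshaped to 3x4.
    rows, cols, vals = [0, 0, 1], [1, 5, 2], [10.0, 20.0, 30.0]
    print(coo_reshape(rows, cols, vals, (2, 6), (3, 4)))
    # ([0, 1, 2], [1, 1, 0], [10.0, 20.0, 30.0])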

Tachis: Higher-Order Separation Logic with Credits for Expected Costs

We present Tachis, a higher-order separation logic to reason about the expected cost of probabilistic programs. Inspired by the uses of time credits for reasoning about the running time of deterministic programs, we introduce a novel notion of probabilistic cost credit. Probabilistic cost credits are a separation logic resource that can be used to pay for the cost of operations in programs, and that can be distributed across all possible branches of sampling instructions according to their weight, thus enabling us to reason about expected cost. The representation of cost credits as separation logic resources gives Tachis a great deal of flexibility and expressivity. In particular, it permits reasoning about amortized expected cost by storing excess credits as potential into data structures to pay for future operations. Tachis further supports a range of cost models, including running time and entropy usage. We showcase the versatility of this approach by applying our techniques to prove upper bounds on the expected cost of a variety of probabilistic algorithms and data structures, including randomized quicksort, hash tables, and meldable heaps. All of our results have been mechanized using Coq, Iris, and the Coquelicot real analysis library.

Making Formulog Fast: An Argument for Unconventional Datalog Evaluation

With its combination of Datalog, SMT solving, and functional programming, the language Formulog provides an appealing mix of features for implementing SMT-based static analyses (e.g., refinement type checking, symbolic execution) in a natural, declarative way. At the same time, the performance of its custom Datalog solver can be an impediment to using Formulog beyond prototyping—a common problem for Datalog variants that aspire to solve large problem instances. In this work we speed up Formulog evaluation, with some surprising results: while 2.2× speedups can be obtained by using the conventional techniques for high-performance Datalog (e.g., compilation, specialized data structures), the big wins come by abandoning the central assumption in modern performant Datalog engines, semi-naive Datalog evaluation. In place of semi-naive evaluation, we develop eager evaluation, a concurrent Datalog evaluation algorithm that explores the logical inference space via a depth-first traversal order. In practice, eager evaluation leads to an advantageous distribution of Formulog’s SMT workload to external SMT solvers and improved SMT solving times: our eager evaluation extensions to the Formulog interpreter and Soufflé’s code generator achieve mean 5.2× and 7.6× speedups, respectively, over the optimized code generated by off-the-shelf Soufflé on SMT-heavy Formulog benchmarks. All in all, using compilation and eager evaluation (as appropriate), Formulog implementations of refinement type checking, bottom-up pointer analysis, and symbolic execution achieve speedups on 20 out of 23 benchmarks over previously published, hand-tuned analyses written in F#, Java, and C++, providing strong evidence that Formulog can be the basis of a realistic platform for SMT-based static analysis. Moreover, our experience adds nuance to the conventional wisdom that traditional semi-naive evaluation is the one-size-fits-all best Datalog evaluation algorithm for static analysis workloads.
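
A minimal, illustrative sketch of eager evaluation on a transitive-closure program (plain Python, standing in for the Formulog interpreter and the Soufflé-generated code): every newly derived fact is pushed onto a stack and its consequences are explored immediately in depth-first order, instead of iterating to a fixpoint in semi-naive rounds.

    def transitive_closure(edges):
        path = set()
        stack = []

        def derive(fact):
            if fact not in path:
                path.add(fact)
                stack.append(fact)        # explore consequences immediately (DFS)

        for e in edges:                   # rule: path(x, y) :- edge(x, y).
            derive(e)
        while stack:
            x, y = stack.pop()
            for (a, b) in edges:          # rule: path(x, z) :- path(x, y), edge(y, z).
                if a == y:
                    derive((x, b))
        return path

    print(sorted(transitive_closure({("a", "b"), ("b", "c"), ("c", "a")})))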

Realistic Realizability: Specifying ABIs You Can Count On

The Application Binary Interface (ABI) for a language defines the interoperability rules for its target platforms, including data layout and calling conventions, such that compliance with the rules ensures “safe” execution and perhaps certain resource usage guarantees. These rules are relied upon by compilers, libraries, and foreign-function interfaces. Unfortunately, ABIs are typically specified in prose, and while type systems for source languages have evolved, ABIs have comparatively stalled, lacking advancements in expressivity and safety. We propose a vision for richer, semantic ABIs to improve interoperability and library integration, supported by a methodology for formally specifying ABIs using realizability models. These semantic ABIs connect abstract, high-level types to unwieldy, but well-behaved, low-level code. We illustrate our approach with a case study formalizing the ABI of a functional source language in terms of a reference-counting implementation in a C-like target language. A key contribution supporting this case study is a graph-based model of separation logic that captures the ownership and accessibility of reference-counted resources using modalities inspired by hybrid logic. To highlight the flexibility of our methodology, we show how various design decisions can be interpreted into the semantic ABI. Finally, we provide the first formalization of library evolution, a distinguishing feature of Swift’s ABI.

Hypra: A Deductive Program Verifier for Hyper Hoare Logic

Hyperproperties relate multiple executions of a program and are useful to express common correctness properties (such as determinism) and security properties (such as non-interference). While there are a number of powerful program logics for the deductive verification of hyperproperties, their automation falls behind. Most existing deductive verification tools are limited to safety properties, but cannot reason about the existence of executions, for instance, to prove the violation of a safety property. Others support more flexible hyperproperties such as generalized non-interference, but have limitations in terms of the programs and proof structures they support. In this paper, we present the first deductive verification technique for arbitrary hyperproperties over multiple executions of the same program. Our technique automates the generation of verification conditions for Hyper Hoare Logic. Our key insight is that arbitrary hyperproperties and the corresponding proof rules can be encoded into a standard intermediate verification language by representing sets of states of the input program explicitly in the states of the intermediate program. Verification is then automated using an existing SMT-based verifier for the intermediate language. We implement our technique in a tool called Hypra and demonstrate that it can reliably verify complex hyperproperties.

PolyJuice: Detecting Mis-compilation Bugs in Tensor Compilers with Equality Saturation Based Rewriting

Tensor compilers are essential for deploying deep learning applications across various hardware platforms. While powerful, they are inherently complex and present significant challenges in ensuring correctness. This paper introduces PolyJuice, an automatic detection tool for identifying mis-compilation bugs in tensor compilers. Its basic idea is to construct semantically equivalent computation graphs to validate the correctness of tensor compilers. The main challenge is to construct equivalent graphs capable of efficiently exploring the diverse optimization logic during compilation. We approach this challenge along two dimensions. First, we propose arithmetic and structural equivalent rewrite rules to modify the dataflow of a tensor program. Second, we design an efficient equality saturation based rewriting framework to identify the most simplified and the most complex equivalent computation graphs for an input graph. The resulting computation graphs have different dataflow and are therefore likely to undergo different optimization processes during compilation. We applied PolyJuice to five well-tested industrial tensor compilers, namely PyTorch Inductor, OnnxRuntime, TVM, TensorRT, and XLA, as well as two well-maintained academic tensor compilers, EinNet and Hidet. In total, PolyJuice detected 84 non-crash mis-compilation bugs, of which 49 were confirmed and 20 fixed.
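
A minimal sketch of the underlying differential-testing idea, with NumPy standing in for the compiled artifacts and a single hypothetical rewrite rule (distributivity of matrix multiplication over addition): two computation graphs related by the rule must agree on random inputs, so a divergence beyond numerical tolerance after compilation would indicate a mis-compilation.

    import numpy as np

    def graph_original(a, b, c):
        return np.maximum(a @ b + a @ c, 0.0)     # relu(A@B + A@C)

    def graph_rewritten(a, b, c):
        return np.maximum(a @ (b + c), 0.0)       # relu(A@(B + C)), by distributivity

    rng = np.random.default_rng(0)
    a, b, c = (rng.standard_normal((8, 8)) for _ in range(3))
    assert np.allclose(graph_original(a, b, c), graph_rewritten(a, b, c), atol=1e-6)
    print("equivalent graphs agree")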

Semantics Lifting for Syntactic Sugar

Syntactic sugar plays a crucial role in engineering programming languages. It offers convenient syntax and a higher level of abstraction, as witnessed by its pervasive use in both general-purpose and domain-specific contexts. Unfortunately, the traditional approach of translating programs containing syntactic sugars into the host language can lead to abstraction leakage, breaking the promise of convenience and hindering program comprehension. To address this challenge, we introduce the idea of semantics lifting, which aims to statically derive self-contained evaluation rules for syntactic sugars. More specifically, we propose a semantics-lifting framework that consists of (i) a general algorithm for deriving host-independent semantics of syntactic sugars from the semantics of the host language and the desugaring rules, (ii) a formulation of the correctness and abstraction properties for a lifted semantics, and (iii) a systematic investigation of sufficient conditions that ensure a lifted semantics is provably correct and abstract. To evaluate our semantics-lifting framework, we have implemented a system named Osazone and conducted several case studies, demonstrating that our approach is flexible, effective, and practical for implementing domain-specific languages.

MEA2: A Lightweight Field-Sensitive Escape Analysis with Points-to Calculation for Golang

Escape analysis plays a crucial role in garbage-collected languages as it enables the allocation of non-escaping variables on the stack by identifying the dynamic lifetimes of objects and pointers. This helps in reducing heap allocations and alleviating garbage collection pressure. However, Go, as a garbage-collected language, employs a fast yet conservative escape analysis, which is field-insensitive and omits points-to-set calculation to expedite compilation. This results in more variables being allocated on the heap. Empirical statistics reveal that field access and indirect memory access are prevalent in real-world Go programs, suggesting potential opportunities for escape analysis to enhance program performance. In this paper, we propose MEA2, an escape analysis framework atop GoLLVM (an LLVM-based Go compiler), which combines field sensitivity and points-to analysis. Moreover, a novel generic function summary representation is designed to facilitate fast inter-procedural analysis. We evaluated MEA2 by using it to perform stack allocation in 12 widely used open-source projects. The results show that, compared to Go’s escape analysis, MEA2 can reduce heap allocation sites by 7.9

Weighted Context-Free-Language Ordered Binary Decision Diagrams

This paper presents a new data structure, called Weighted Context-Free-Language Ordered BDDs (WCFLOBDDs), which are hierarchically structured decision diagrams, akin to Weighted BDDs (WBDDs) enhanced with a procedure-call mechanism. For some functions, WCFLOBDDs are exponentially more succinct than WBDDs. They are potentially beneficial for representing functions of type Bⁿ → D, when a function’s image V ⊆ D has many different values. We apply WCFLOBDDs in quantum-circuit simulation, and find that they perform better than WBDDs on certain benchmarks. With a 15-minute timeout, the number of qubits that can be handled by WCFLOBDDs is 1–64× that of WBDDs (and 1–128× that of CFLOBDDs, which are an unweighted version of WCFLOBDDs). These results support the conclusion that for this application—from the standpoint of problem size, measured as the number of qubits—WCFLOBDDs provide the best of both worlds: performance roughly matches whichever of WBDDs and CFLOBDDs is better. (From the standpoint of running time, the results are more nuanced.)

Finding ∀∃ Hyperbugs using Symbolic Execution

Many important hyperproperties, such as refinement and generalized non-interference, fall into the class of ∀∃ hyperproperties and require, for each execution trace of a system, the existence of another trace relating to the first one in a certain way. The alternation of quantifiers renders ∀∃ hyperproperties extremely difficult to verify, or even just to test. Indeed, contrary to trace properties, where it suffices to find a single counterexample trace, refuting a ∀∃ hyperproperty requires not only finding a trace, but also proving that no second trace satisfies the specified relation with the first. As a consequence, automated testing of ∀∃ hyperproperties falls out of the scope of existing automated testing tools. In this paper, we present a fully automated approach to detect violations of ∀∃ hyperproperties in software systems. Our approach extends bug-finding techniques based on symbolic execution with support for trace quantification. We provide a prototype implementation of our approach, and demonstrate its effectiveness on a set of challenging examples.

Multris: Functional Verification of Multiparty Message Passing in Separation Logic

We introduce Multris, a separation logic for verifying functional correctness of programs that combine multiparty message-passing communication with shared-memory concurrency. The foundation of our work is a novel concept of multiparty protocol consistency, which guarantees safe communication among a set of parties, provided each party adheres to its prescribed protocol. Our concept of protocol consistency is inspired by the bottom-up approach for multiparty session types. However, by considering it in the context of separation logic instead of a type system, we go further in terms of generality by supporting new notions of implicit transfer of knowledge and implicit transfer of resources. We develop tactics for automatically verifying protocol consistency and for reasoning about message-passing operations in Multris. We evaluate Multris on a range of examples, including the well-known two- and three-buyer protocols, as well as a new verification benchmark based on Chang and Roberts's ring leader election protocol. To ensure the reliability of our work, we prove soundness of Multris w.r.t. a low-level channel semantics using the Iris framework in Coq.

Law and Order for Typestate with Borrowing

Typestate systems are notoriously complex as they require sophisticated machinery for tracking aliasing. We propose a new, transition-oriented foundation for typestate in the setting of impure functional programming. Our approach relies on ordered types for simple alias tracking and its formalization draws on work on bunched implications. Yet, we support a flexible notion of borrowing in the presence of typestate. Our core calculus comes with a notion of resource types indexed by an ordered partial monoid that models abstract state transitions. We prove syntactic type soundness with respect to a resource-instrumented semantics. We give an algorithmic version of our type system and prove its soundness. Algorithmic typing facilitates a simple surface language that does not expose tedious details of ordered types. We implemented a typechecker for the surface language along with an interpreter for the core language.

FPCC: Detecting Floating-Point Errors via Chain Conditions

Floating-point arithmetic is notorious for its rounding errors, which can propagate and accumulate, leading to unacceptable results. Detecting inputs that can trigger significant floating-point errors is crucial for enhancing the reliability of numerical programs. Existing methods for generating error-triggering inputs often rely on costly shadow executions that involve high-precision computations or suffer from false positives. This paper introduces chain conditions to capture the propagation and accumulation of floating-point errors, using them to guide the search for error-triggering inputs. We have implemented a tool named FPCC and evaluated it on 88 functions from the GNU Scientific Library, as well as 21 functions with multiple inputs from previous research. The experimental results demonstrate the effectiveness and efficiency of our approach: (1) FPCC achieves 100% accuracy in detecting significant errors for the reported rank-1 inputs, while only 72.69% of the rank-1 inputs from the state-of-the-art tool ATOMU can trigger significant errors. Overall, 99.64% (1049/1053) of the inputs reported by FPCC can trigger significant errors, whereas only 19.45% (141/723) of the inputs reported by ATOMU can trigger significant errors; (2) FPCC exhibits a 2.17× speedup over ATOMU in detecting significant errors; (3) FPCC also excels in supporting functions with multiple inputs, outperforming the state-of-the-art technique. To facilitate further research in the community, we have made FPCC available on GitHub at https://github.com/DataReportRe/FPCC.
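
A minimal sketch of the intuition (using textbook condition numbers, not FPCC's exact definition of chain conditions): the relative error introduced by earlier operations in a chain is amplified by the condition numbers of the later operations, so multiplying these factors along the chain cheaply flags error-prone inputs, such as near-cancellations, without any high-precision shadow execution.

    import math

    def cond_sub(x, y):
        """Condition number of x - y with respect to a relative perturbation of x."""
        return abs(x) / abs(x - y) if x != y else math.inf

    def cond_sqrt(_x):
        return 0.5                            # sqrt damps relative error

    def chain_condition(x, y):
        """Estimated error amplification of sqrt(x) - sqrt(y) computed naively."""
        return cond_sqrt(x) * cond_sub(math.sqrt(x), math.sqrt(y))

    print(chain_condition(1e8, 1e8 - 1))      # huge: near-cancellation, error-prone input
    print(chain_condition(4.0, 1.0))          # small: benign input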

Scaling Abstraction Refinement for Program Analyses in Datalog using Graph Neural Networks

Counterexample-guided abstraction refinement (CEGAR) is a popular approach for automatically selecting abstractions with high precision and low time costs. Existing works cast abstraction refinement as a constraint-solving problem; due to the complexity of these problems, they cannot scale to large programs or complex analyses. We propose a novel approach that applies graph neural networks to improve the scalability of CEGAR for Datalog-based program analyses. By constructing graphs directly from the Datalog solver’s calculations, our method uses a neural network to score abstraction parameters based on the information in these graphs. We then reformulate the constraint problems so that the constraint solver ignores parameters with low scores. This in turn reduces the solution space and the size of the constraint problems. Since our graphs are constructed directly from Datalog computation without human effort, our approach can be applied to a broad range of parametric static analyses implemented in Datalog. We evaluate our approach on a pointer analysis and a typestate analysis; our approach can answer 2.83× and 1.5× as many queries as the baseline approach on large programs for the pointer analysis and the typestate analysis, respectively.

Minotaur: A SIMD-Oriented Synthesizing Superoptimizer

A superoptimizing compiler—one that performs a meaningful search of the program space as part of the optimization process—can find optimization opportunities that are missed by even the best existing optimizing compilers. We created Minotaur: a superoptimizer for LLVM that uses program synthesis to improve its code generation, focusing on integer and floating-point SIMD code. On an Intel Cascade Lake processor, Minotaur achieves an average speedup of 7.3% on the GNU Multiple Precision library (GMP)’s benchmark suite, with a maximum speedup of 13%. On SPEC CPU 2017, our superoptimizer produces an average speedup of 1.5%, with a maximum speedup of 4.5% for 638.imagick. Every optimization produced by Minotaur has been formally verified, and several optimizations that it has discovered have been implemented in LLVM as a result of our work.

A Typed Multi-level Datalog IR and Its Compiler Framework

The resurgence of Datalog in the last two decades has led to a multitude of new Datalog systems. These systems explore novel ideas for improving Datalog's programmability and performance, making important contributions to the field. Unfortunately, the individual systems progress at a much slower pace than the overall field, because improvements in one system are rarely ported to other systems. The reason for this rift is that each system provides its own Datalog dialect with specific notation, language features, and invariants, enabling specific optimization and execution strategies. This paper presents the first compiler framework for Datalog that can be used to support any Datalog frontend language and to target any Datalog backend. The centerpiece of our framework is a novel typed multi-level Datalog IR that supports IR extensions and guarantees executability. Existing Datalog systems can provide a compiler frontend that translates their Datalog dialect to the extended IR. The IR is then progressively lowered toward core Datalog, allowing optimizations at each level. Finally, compiler backends can target different Datalog solvers. We have implemented the compiler framework and integrated 4 Datalog frontends and 3 Datalog backends, using 16 IR extensions. We also formalize the IR's flexible type system, which is bidirectional, flow-sensitive, bipolar, and uses three-valued typing contexts. The type system simultaneously validates type compatibility and precisely tracks bindings of logic variables while permitting IR extensions.

HardTaint: Production-Run Dynamic Taint Analysis via Selective Hardware Tracing

Dynamic taint analysis (DTA), a fundamental analysis technique, is widely used in security, privacy, diagnosis, and other areas. Because DTA must collect and analyze massive amounts of taint data online, it suffers from extremely high runtime overhead. Over the past decades, numerous attempts have been made to lower the overhead of DTA. Unfortunately, the reductions achieved are marginal, making DTA applicable only to debugging/testing scenarios. In this paper, we propose and implement HardTaint, a system that can realize production-run dynamic taint tracking. HardTaint adopts a hybrid and systematic design which combines static analysis, selective hardware tracing, and parallel graph processing techniques. Comprehensive evaluations demonstrate that HardTaint introduces only around 8% runtime overhead, which is an order of magnitude lower than the state of the art, without sacrificing any taint detection capability.

Jmvx: Fast Multi-threaded Multi-version Execution and Record-Replay for Managed Languages

Multi-Version eXecution (MVX) is a technique that deploys many equivalent versions of the same program — variants — as a single program, with direct applications in important fields such as security, reliability, analysis, and availability. MVX can be seen as “online Record/Replay (RR)”, as RR captures a program’s execution as a log stored on disk that can later be replayed to observe the same execution. Unfortunately, current MVX techniques target programs written in C/C++ and do not support programs written in managed languages, which account for the vast majority of code written nowadays. This paper presents the design, implementation, and evaluation of Jmvx—a novel system for performing MVX and RR on programs written in managed languages. Jmvx supports programs written in Java by intercepting automatically identified non-deterministic methods, via a novel dynamic analysis technique, and ensuring that all variants execute the same methods and obtain the same data. Jmvx supports multi-threaded programs by capturing synchronization operations in one variant and ensuring all other variants follow the same ordering. We validated that Jmvx supports MVX and RR by applying it to a suite of benchmarks representative of programs written in Java. Internally, Jmvx uses a circular buffer located in shared memory between JVMs to enable fast communication between all variants, averaging 5%|47% performance overhead when performing MVX with multithreading support disabled|enabled, 8%|25% when recording, and 13%|73% when replaying.

Lexical Effect Handlers, Directly

Lexically scoping effect handlers is a language-design idea that equips algebraic effects with a modular semantics: it enables local-reasoning principles without giving up on the control-flow expressiveness that makes effect handlers powerful. However, we observe that existing implementations risk incurring costs akin to the run-time search for dynamically scoped handlers. This paper presents a compilation strategy for lexical effect handlers, adhering to the lexical scoping principle and targeting a language with low-level control over stack layout. Key aspects of this approach are formalized and proven correct. We embody the ideas in a language called Lexa: the Lexa compiler translates high-level effect handling to low-level stack switching. We evaluate the Lexa compiler on a set of benchmarks; the results suggest that it generates efficient code, reducing running-time complexity from quadratic to linear in some cases.

HybridSA: GPU Acceleration of Multi-pattern Regex Matching using Bit Parallelism

Multi-pattern matching is widely used in modern software for applications requiring high throughput, such as protein search, network traffic inspection, and virus or spam detection. Graphics Processing Units (GPUs) excel at executing massively parallel workloads. Regular expression (regex) matching is typically performed by simulating the execution of deterministic finite automata (DFAs) or nondeterministic finite automata (NFAs). The natural implementations of these automata simulation algorithms on GPUs are highly inefficient because they give rise to irregular memory access patterns. This paper presents HybridSA, a heterogeneous CPU-GPU parallel engine for multi-pattern matching. HybridSA uses bit parallelism to efficiently simulate NFAs on GPUs, thus reducing the number of memory accesses and increasing the throughput. Our bit-parallel algorithms extend the classical shift-and algorithm for string matching to a large class of regular expressions and reduce automata simulation to a small number of bitwise operations. We have developed a compiler to translate regular expressions into bit masks, perform optimizations, and choose the best algorithms to run on the GPU. The majority of the regular expressions are accelerated on the GPU, while the patterns that exhibit random memory accesses are executed on the CPU in parallel. We evaluate HybridSA against state-of-the-art CPU and GPU engines, as well as a hybrid combination of the two. HybridSA achieves between 4 and 60 times higher throughput than the state-of-the-art CPU engine and between 4 and 233 times higher throughput than the state-of-the-art GPU engine across a collection of real-world benchmarks.
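
A minimal sketch of the classical shift-and algorithm that HybridSA generalizes from exact strings to a large class of regular expressions (illustrative Python for a single pattern): the pattern's NFA is advanced with one shift, OR, and AND per input character plus a bit-mask lookup, exactly the kind of regular, branch-free work that maps well onto GPUs.

    def shift_and_search(pattern, text):
        masks = {}
        for i, ch in enumerate(pattern):          # bit i of masks[ch] <=> pattern[i] == ch
            masks[ch] = masks.get(ch, 0) | (1 << i)
        accept = 1 << (len(pattern) - 1)
        state, hits = 0, []
        for pos, ch in enumerate(text):
            state = ((state << 1) | 1) & masks.get(ch, 0)
            if state & accept:
                hits.append(pos - len(pattern) + 1)
        return hits

    print(shift_and_search("aba", "ababa"))       # [0, 2]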

A Runtime System for Interruptible Query Processing: When Incremental Computing Meets Fine-Grained Parallelism

Online data services have stringent performance requirements and must tolerate workload fluctuation. This paper introduces PitStop, a new query language runtime design built on the idea of interruptible query processing: the time-consuming task of data inspection for processing each query or update may be interrupted and resumed later at the boundary of fine-grained data partitions. This counter-intuitive idea enables a novel form of fine-grained concurrency while preserving sequential consistency. We build PitStop by modifying the language runtime of Cypher, the query language of a state-of-the-art graph database, Neo4j. Our evaluation on the Google Cloud shows that PitStop can outperform unmodified Neo4j during workload fluctuation, with reduced latency and increased throughput.

StarMalloc: Verifying a Modern, Hardened Memory Allocator

We present StarMalloc, a verified, efficient, security-oriented, and concurrent memory allocator. Using the Steel separation logic framework, we show how to specify and verify a multitude of low-level patterns and delicate security mechanisms, by relying on a combination of dependent types, SMT, and modular abstractions to enable efficient verification. We produce a verified artifact, in C, that implements the entire API surface of an allocator, and as such works as a drop-in replacement for real-world projects, notably the Firefox browser. As part of StarMalloc, we develop several generic data structures and proof libraries directly reusable in future systems verification projects. We also extend the Steel toolchain to express several low-level idioms that were previously missing. Finally, we show that StarMalloc exhibits competitive performance by evaluating it against 10 state-of-the-art memory allocators, and against a variety of real-world projects, such as Redis, the Lean compiler, and the Z3 SMT solver.

AUTOMAP: Inferring Rank-Polymorphic Function Applications with Integer Linear Programming

Dynamically typed array languages such as Python, APL, and Matlab lift scalar operations to arrays and replicate scalars to fit applications. We present a mechanism for automatically inferring map and replicate operations in a statically-typed language in a way that resembles the programming experience of a dynamically-typed language while preserving the static typing guarantees. Our type system---which supports parametric polymorphism, higher-order functions, and top-level let-generalization---makes use of integer linear programming in order to find the minimum number of operations needed to elaborate to a well-typed program. We argue that the inference system provides useful and unsurprising guarantees to the programmer. We demonstrate important theoretical properties of the mechanism and report on the implementation of the mechanism in the statically-typed array programming language Futhark.

Concurrent Data Structures Made Easy

Design of an efficient thread-safe concurrent data structure is a balancing act between its implementation complexity and performance. Lock-based concurrent data structures, which are relatively easy to derive from their sequential counterparts and to prove thread-safe, suffer from poor throughput under even light multi-threaded workload. At the same time, lock-free concurrent structures allow for high throughput, but are notoriously difficult to get right and require careful reasoning to formally establish their correctness. In this work, we explore a solution to this conundrum based on a relatively old idea of batch parallelism---an approach for designing high-throughput concurrent data structures via a simple insight: efficiently processing a batch of a priori known operations in parallel is easier than optimising performance for a stream of arbitrary asynchronous requests. Alas, batch-parallel structures have not seen wide practical adoption due to (i) the inconvenience of having to structure multi-threaded programs to explicitly group operations and (ii) the lack of a systematic methodology to implement batch-parallel structures as simply as lock-based ones. We present OBatcher---a Multicore OCaml library that streamlines the design, implementation, and usage of batch-parallel structures. OBatcher solves the first challenge (how to use) by suggesting a new lightweight implicit batching design pattern that is built on top of generic asynchronous programming mechanisms. The second challenge (how to implement) is addressed by identifying a family of strategies for converting common sequential structures into the corresponding efficient batch-parallel versions, and by providing a library of functors that embody those strategies. We showcase OBatcher with a diverse set of benchmarks ranging from Red-Black and AVL trees to van Emde Boas trees, skip lists, and a thread-safe implementation of a Datalog solver. Our evaluation of all the implementations on large asynchronous workloads shows that (a) they consistently outperform the corresponding coarse-grained lock-based implementations---the only ones available in OCaml to date, and that (b) their throughput scales reasonably with the number of processors.

Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

Large language models (LLMs) have revolutionized language processing, but face critical challenges with security, privacy, and generating hallucinations — coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content contradicting ground truth facts. Addressing FCH is difficult due to two key challenges: 1) Automatically constructing and updating benchmark datasets is hard, as existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases. 2) Validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations. To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources like Wikipedia, seamlessly integrated into Drowzee. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the semantic structures of the LLM answers and ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings include LLMs struggling with temporal concepts, out-of-distribution knowledge, and lack of logical reasoning capabilities. The results show that logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations.

Monotone Procedure Summarization via Vector Addition Systems and Inductive Potentials

This paper presents a technique for summarizing recursive procedures operating on integer variables. The motivation of our work is to create more predictable program analyzers, and in particular to formally guarantee compositionality and monotonicity of procedure summarization. To summarize a procedure, we compute its best abstraction as a vector addition system with resets (VASR) and exactly summarize the executions of this VASR over the context-free language of syntactic paths through the procedure. We improve upon this technique by refining the language of syntactic paths using (automatically synthesized) linear potential functions that bound the number of recursive calls within valid executions of the input program. We implemented our summarization technique in an automated program verification tool; our experimental evaluation demonstrates that our technique computes more precise summaries than existing abstract interpreters and that our tool’s verification capabilities are comparable with state-of-the-art software model checkers.

Model Checking Distributed Protocols in Must

We describe the design and implementation of Must, a framework for modeling and automatically verifying distributed systems. Must provides a concurrency API that supports multiple communication models, on top of a mainstream programming language, such as Rust. Given a program using this API, Must verifies it by means of a novel, optimal dynamic partial order reduction algorithm that maintains completeness and optimality for all communication models supported by the API. We use Must to design and verify models of distributed systems in an industrial context. We demonstrate the usability of Must's API by modeling high-level system idioms (e.g., timeouts, leader election, versioning) as abstractions over the core API, and demonstrate Must's scalability by verifying systems employed in production (e.g., replicated logs, distributed transaction management protocols), the verification of which lies beyond the capacity of previous model checkers.

Reward Augmentation in Reinforcement Learning for Testing Distributed Systems

Bugs in popular distributed protocol implementations have been the source of many downtimes in popular internet services. We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning. Since the natural reward structure is very sparse, the key to successful exploration in reinforcement learning is reward augmentation. We show two different techniques that build on one another. First, we provide a decaying exploration bonus based on the discovery of new states---the reward decays as the same state is visited multiple times. The exploration bonus captures the intuition from coverage-guided fuzzing of prioritizing new coverage points; in contrast to other schemes, we show that taking the maximum of the bonus and the Q-value leads to more effective exploration. Second, we provide waypoints to the algorithm as a sequence of predicates that capture interesting semantic scenarios. Waypoints exploit designer insight about the protocol and guide the exploration to "interesting" parts of the state space. Our reward structure ensures that new episodes can reliably get to deep interesting states even without execution caching. We have implemented our algorithm in Go. Our evaluation on three large benchmarks (RedisRaft, Etcd, and RSL) shows that our algorithm can significantly outperform baseline approaches in terms of coverage and bug finding.
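
A minimal sketch of the first augmentation as a tabular Q-learning update (illustrative Python with made-up constants and state names, not the Go implementation): the exploration bonus decays with the visit count of a state, and it is combined with the Q-value by taking the maximum rather than adding it to the reward.

    from collections import defaultdict

    ALPHA, GAMMA, BONUS = 0.1, 0.95, 1.0
    q = defaultdict(float)                    # Q[(state, action)]
    visits = defaultdict(int)                 # visit counts per state

    def exploration_bonus(state):
        return BONUS / (1 + visits[state])    # decays as the state is revisited

    def update(state, action, reward, next_state, actions):
        visits[next_state] += 1
        # Value of the successor: max over actions of max(bonus, Q-value).
        best = max(max(exploration_bonus(next_state), q[(next_state, a)])
                   for a in actions)
        target = reward + GAMMA * best
        q[(state, action)] += ALPHA * (target - q[(state, action)])

    update("init", "deliver", 0.0, "leader_elected", ["deliver", "drop", "crash"])
    print(q[("init", "deliver")])             # small positive value, driven by the bonus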

Rustlantis: Randomized Differential Testing of the Rust Compiler

Compilers are at the core of all computer architecture. Their middle-end and back-end are full of subtle code that is easy to get wrong. At the same time, the consequences of compiler bugs can be severe. Therefore, it is important that we develop techniques to increase our confidence in compiler correctness, and to help find the bugs that inevitably happen. One promising such technique that has successfully found many compiler bugs in the past is randomized differential testing, a fuzzing approach whereby the same program is executed with different compilers or different compiler settings to detect any unexpected differences in behavior. We present Rustlantis, the first fuzzer for the Rust programming language that is able to find new correctness bugs in the official Rust compiler. To avoid having to deal with Rust’s strict type and borrow checker, Rustlantis directly generates MIR, the central IR of the Rust compiler for optimizations. The program generation strategy of Rustlantis is a combination of statically tracking the state of the program, obscuring the program state for the compiler, and decoy blocks to lead optimizations astray. This has allowed us to identify 22 previously unknown bugs in the Rust compiler, most of which have been fixed.

Semantics of Remote Direct Memory Access: Operational and Declarative Models of RDMA on TSO Architectures

Remote direct memory access (RDMA) is a modern technology enabling networked machines to exchange information without involving the operating system of either side, and thus significantly speeding up data transfer in computer clusters. While RDMA is extensively used in practice and studied in various research papers, a formal underlying model specifying the allowed behaviours of concurrent RDMA programs running in modern multicore architectures is still missing. This paper aims to close this gap and provide semantic foundations of RDMA on x86-TSO machines. We propose three equivalent formal models, two operational models in different levels of abstraction and one declarative model, and prove that the three characterisations are equivalent. To gain confidence in the proposed semantics, the more concrete operational model has been reviewed by NVIDIA experts, a major vendor of RDMA systems, and we have empirically validated the declarative formalisation on various subtle litmus tests by extensive testing. We believe that this work is a necessary initial step for formally addressing RDMA-based systems by proposing language-level models, verifying their mapping to hardware, and developing reasoning techniques for concurrent RDMA programs.

Imperative Compositional Programming: Type Sound Distributive Intersection Subtyping with References via Bidirectional Typing

Compositional programming is a programming paradigm that emphasizes modularity and is implemented in the CP programming language. The foundations for compositional programming are based on a purely functional variant of System F with intersection types, called Fi+, which includes distributivity rules for subtyping. This paper shows how to extend compositional programming and CP with mutable references, enabling a modular, imperative compositional programming style. A technical obstacle solved in our work is the interaction between distributive intersection subtyping and mutable references. Davies and Pfenning [2000] studied this problem in standard formulations of intersection type systems and argued that, when combined with references, distributive subtyping rules lead to type unsoundness. To recover type soundness, they proposed dropping distributivity rules in subtyping. CP cannot adopt this solution, since it fundamentally relies on distributivity for modularity. Therefore, we revisit the problem and show that, by adopting bidirectional typing, a more lightweight and type sound restriction is possible: we can simply restrict the typing rule for references. This solution retains distributivity and an unrestricted intersection introduction rule. We present a first calculus, based on Davies and Pfenning's work, which illustrates the generality of our solution. Then we present an extension of Fi+ with references, which adopts our restriction and enables imperative compositional programming. We implement an extension of CP with references and show how to model a modular live-variable analysis in CP. Both calculi and their proofs are formalized in the Coq proof assistant.

QuAC: Quick Attribute-Centric Type Inference for Python

Python’s dynamic typing facilitates rapid prototyping and underlies its popularity in many domains. However, dynamic typing reduces the power of many static checking and bug-finding tools. Python type annotations can make these tools more useful. Type inference tools aim to reduce developers’ burden of adding them. However, existing type inference tools struggle to support dynamic features, infer correct types (especially container type parameters and non-builtin types), and run in reasonable time. Inspired by Python’s duck typing, where the attributes accessed on Python expressions characterize their implicit interfaces, we propose QuAC (Quick Attribute-Centric Type Inference for Python). At its core, QuAC collects attribute sets for Python expressions and leverages information retrieval techniques to predict classes from these attribute sets. It also recursively predicts container type parameters. We evaluate QuAC’s performance on popular Python projects. Compared to state-of-the-art non-LLM baselines, QuAC predicts types with high accuracy complementary to those predicted by the baselines while not sacrificing coverage. It also demonstrates clear advantages in predicting container type parameters and non-builtin types and reduces run times. Furthermore, QuAC is nearly two orders of magnitude faster than an LLM-based method while covering nearly half of its errorless non-trivial type predictions. It is also significantly more consistent at predicting container type parameters and non-builtin types than the LLM-based method, regardless of whether the project has ground-truth type annotations.
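
A minimal sketch of the attribute-centric idea (illustrative Python, with a hand-written candidate table and Jaccard similarity as a crude stand-in for the information-retrieval scoring QuAC uses): the attributes accessed on an expression form its implicit interface, and candidate classes are ranked by how well their attribute sets match it.

    CANDIDATES = {
        "list": {"append", "pop", "sort", "extend", "index", "count"},
        "dict": {"keys", "values", "items", "get", "pop", "update"},
        "set":  {"add", "discard", "union", "pop", "intersection"},
    }

    def rank_classes(observed_attrs):
        """Score each candidate class by overlap with the observed attribute set."""
        scores = []
        for cls, attrs in CANDIDATES.items():
            score = len(observed_attrs & attrs) / len(observed_attrs | attrs)
            scores.append((score, cls))
        return sorted(scores, reverse=True)

    # Attributes seen on some variable in the analyzed code:
    print(rank_classes({"keys", "get", "items"}))   # "dict" ranks first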

Automated Verification of Parametric Channel-Based Process Communication

A challenge of writing concurrent message passing programs is ensuring the absence of partial deadlocks, which can cause severe memory leaks in long running systems. Several static analysis techniques have been proposed for automatically detecting partial deadlocks in Go programs. For a large enterprise code base, we found these tools too imprecise to reason about process communication that is parametric, i.e., where the number of channel communication operations or the channel capacities are determined at runtime. We present a novel approach to automatically verify the absence of partial deadlocks in Go program fragments with such parametric process communication. The key idea is to translate Go fragments to a core language that is sufficiently expressive to represent real-world parametric communication patterns and can be encoded into Dafny programs annotated with postconditions enforcing partial deadlock freedom. In situations where a fragment is partial deadlock free only when the concurrency parameters satisfy certain conditions, a suitable precondition can often be inferred. Experimental results on a real-world code base containing 583 program fragments that are beyond the reach of existing techniques have shown that the approach can verify the absence of partial deadlocks in 145 cases. For an additional 228 cases, a nontrivial precondition is inferred that the surrounding code must satisfy to ensure partial deadlock freedom.

Modular Synthesis of Efficient Quantum Uncomputation

A key challenge of quantum programming is uncomputation: the reversible deallocation of qubits. And while there has been much recent progress on automating uncomputation, state-of-the-art methods are insufficient for handling today’s expressive quantum programming languages. A core reason is that they operate on primitive quantum circuits, while quantum programs express computations beyond circuits, for instance, they can capture families of circuits defined recursively in terms of uncomputation and adjoints. In this paper, we introduce the first modular automatic approach to synthesize correct and efficient uncomputation for expressive quantum programs. Our method is based on two core technical contributions: (i) an intermediate representation (IR) that can capture expressive quantum programs and comes with support for uncomputation, and (ii) modular algorithms over that IR for synthesizing uncomputation and adjoints. We have built a complete end-to-end implementation of our method, including an implementation of the IR and the synthesis algorithms, as well as a translation from an expressive fragment of the Silq programming language to our IR and circuit generation from the IR. Our experimental evaluation demonstrates that we can handle programs beyond the capabilities of existing uncomputation approaches, while being competitive on the benchmarks they can handle. More broadly, we show that it is possible to benefit from the greater expressivity and safety offered by high-level quantum languages without sacrificing efficiency.

Type Inference Logics

Type inference is essential for statically-typed languages such as OCaml and Haskell. It can be decomposed into two (possibly interleaved) phases: a generator converts programs to constraints; a solver decides whether a constraint is satisfiable. Elaboration, the task of decorating a program with explicit type annotations, can also be structured in this way. Unfortunately, most machine-checked implementations of type inference do not follow this phase-separated, constraint-based approach. Those that do are rarely executable, lack effectful abstractions, and do not include elaboration. To close the gap between common practice in real-world implementations and mechanizations inside proof assistants, we propose an approach that enables modular reasoning about monadic constraint generation in the presence of elaboration. Our approach includes a domain-specific base logic for reasoning about metavariables and a program logic that allows us to reason abstractly about the meaning of constraints. To evaluate it, we report on a machine-checked implementation of our techniques inside the Coq proof assistant. As a case study, we verify both soundness and completeness for three elaborating type inferencers for the simply typed lambda calculus with Booleans. Our results are the first demonstration that type inference algorithms can be verified in the same form as they are implemented in practice: in an imperative style, modularly decomposed into constraint generation and solving, and delivering elaborated terms to the remainder of the compiler chain.
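
A minimal sketch of the phase-separated, constraint-based style the paper reasons about (illustrative Python, not the verified Coq artifact): a generator walks a term of the simply typed lambda calculus with Booleans and emits equality constraints over metavariables, and a separate unification-based solver discharges them; for brevity the sketch omits the occurs check and elaboration.

    import itertools

    fresh = itertools.count()

    def metavar():
        return ("meta", next(fresh))

    def generate(term, env):
        """Return (type, constraints); constraints are equalities between types."""
        tag = term[0]
        if tag == "var":
            return env[term[1]], []
        if tag == "bool":
            return ("Bool",), []
        if tag == "lam":                      # ("lam", x, body)
            a = metavar()
            t_body, cs = generate(term[2], {**env, term[1]: a})
            return ("Arrow", a, t_body), cs
        if tag == "app":                      # ("app", fun, arg)
            t_f, cs_f = generate(term[1], env)
            t_a, cs_a = generate(term[2], env)
            b = metavar()
            return b, cs_f + cs_a + [(t_f, ("Arrow", t_a, b))]
        if tag == "if":                       # ("if", cond, then, else)
            t_c, cs_c = generate(term[1], env)
            t_t, cs_t = generate(term[2], env)
            t_e, cs_e = generate(term[3], env)
            return t_t, cs_c + cs_t + cs_e + [(t_c, ("Bool",)), (t_t, t_e)]
        raise ValueError(tag)

    def resolve(t, subst):
        while t[0] == "meta" and t in subst:
            t = subst[t]
        return t

    def solve(constraints):
        """First-order unification (no occurs check, for brevity)."""
        subst, work = {}, list(constraints)
        while work:
            s, t = work.pop()
            s, t = resolve(s, subst), resolve(t, subst)
            if s == t:
                continue
            if s[0] == "meta":
                subst[s] = t
            elif t[0] == "meta":
                subst[t] = s
            elif s[0] == "Arrow" and t[0] == "Arrow":
                work += [(s[1], t[1]), (s[2], t[2])]
            else:
                raise TypeError(f"cannot unify {s} and {t}")
        return subst

    # (\x. if x then False else True) True  :  Bool
    prog = ("app",
            ("lam", "x", ("if", ("var", "x"), ("bool", False), ("bool", True))),
            ("bool", True))
    ty, cs = generate(prog, {})
    print(resolve(ty, solve(cs)))             # ('Bool',)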

Wasm-R3: Record-Reduce-Replay for Realistic and Standalone WebAssembly Benchmarks

WebAssembly (Wasm for short) brings a new, powerful capability to the web as well as Edge, IoT, and embedded systems. Wasm is a portable, compact binary code format with high performance and robust sandboxing properties. As Wasm applications grow in size and importance, the complex performance characteristics of diverse Wasm engines demand robust, representative benchmarks for proper tuning. Stopgap benchmark suites, such as PolyBenchC and libsodium, continue to be used in the literature, though they are known to be unrepresentative. Porting of more complex suites remains difficult because Wasm lacks many system APIs, and extracting real-world Wasm benchmarks from the web is difficult due to complex host interactions. To address this challenge, we introduce Wasm-R3, the first record and replay technique for Wasm. Wasm-R3 transparently injects instrumentation into Wasm modules to record an execution trace from inside the module, then reduces the execution trace via several optimizations, and finally produces a replay module that is executable standalone, without any host environment, on any engine. The benchmarks created by our approach are (i) realistic, because the approach records real-world web applications, (ii) faithful to the original execution, because the replay benchmark includes the unmodified original code, only adding emulation of host interactions, and (iii) standalone, because the replay benchmarks run on any engine. Applying Wasm-R3 to web-based Wasm applications in the wild demonstrates the correctness of our approach as well as the effectiveness of our optimizations, which reduce the recorded traces by 99.53% and the size of the replay benchmark by 9.98%. We release the resulting benchmark suite of 27 applications, called Wasm-R3-Bench, to the community, to inspire a new generation of realistic and standalone Wasm benchmarks.
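
A minimal sketch of the record/replay idea in Python (the real system instruments Wasm modules; the names here are illustrative): during recording, the result of every call that crosses into the host is logged, and the replay artifact serves the logged results back in order, so the same execution runs standalone without any host environment.

    import json

    class Recorder:
        """Wraps a host function and logs every result it returns."""
        def __init__(self, host_fn):
            self.host_fn, self.trace = host_fn, []
        def __call__(self, *args):
            result = self.host_fn(*args)          # real host interaction
            self.trace.append(result)
            return result
        def dump(self):
            return json.dumps(self.trace)

    class Replayer:
        """Emulates the host by replaying the recorded results in order."""
        def __init__(self, trace_json):
            self.trace = iter(json.loads(trace_json))
        def __call__(self, *args):
            return next(self.trace)

    def module_main(host_time):                   # stands in for the instrumented module
        return host_time() + host_time()

    recorded = Recorder(lambda: 42)
    assert module_main(recorded) == 84
    replayed = Replayer(recorded.dump())
    assert module_main(replayed) == 84            # same execution, no host needed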

Semantic-Type-Guided Bug Finding

In recent years, there has been an increased interest in tools that establish incorrectness rather than correctness of program properties. In this work we build on this line of work by developing a novel methodology to prove incorrectness of semantic typing properties of functional programs, extending the incorrectness approach to the model theory of functional program typing. We define a semantic type refuter which refutes semantic typings for a simple functional language. We prove our refuter is co-recursively enumerable, and that it is sound and complete with respect to a semantic typing notion. We describe an initial implementation that uses symbolic evaluation to efficiently find type errors over a functional language with a rich type system.

Control-Flow Deobfuscation using Trace-Informed Compositional Program Synthesis

Code deobfuscation, which attempts to simplify code that has been intentionally obfuscated to prevent understanding, is a critical technique for downstream security analysis tasks like malware detection. While there has been significant prior work on code deobfuscation, most techniques either do not handle obfuscations that modify control flow or target only specific classes of control-flow obfuscations, making them unsuitable for handling new types of obfuscations or combinations of existing ones. In this paper, we study a new deobfuscation technique that is based on program synthesis and that can handle a broad class of control-flow obfuscations. Given an obfuscated program P, our approach aims to synthesize a smallest program that is a control-flow reduction of P and that is semantically equivalent to P. Since our method does not assume knowledge about the types of obfuscations that have been applied to the original program, the underlying synthesis problem ends up being very challenging. To address this challenge, we propose a novel trace-informed compositional synthesis algorithm that leverages hints present in dynamic traces of the obfuscated program to decompose the synthesis problem into a set of simpler subproblems. In particular, we show how dynamic traces can be useful for inferring a suitable control-flow skeleton of the deobfuscated program and performing independent synthesis of each basic block. We have implemented this approach in a tool called Chisel and evaluate it on 546 benchmarks that have been obfuscated using combinations of six different obfuscation techniques. Our evaluation shows that our approach is effective and that it produces code that is almost identical (modulo variable renaming) to the original (non-obfuscated) program in 86% of cases. Our evaluation also shows that Chisel significantly outperforms existing techniques.

Unifying Static and Dynamic Intermediate Languages for Accelerator Generators

Compilers for accelerator design languages (ADLs) translate high-level languages into application-specific hardware. ADL compilers rely on a hardware control interface to compose hardware units. There are two choices: static control, which relies on cycle-level timing; or dynamic control, which uses explicit signalling to avoid depending on timing details. Static control is efficient but brittle; dynamic control incurs hardware costs to support compositional reasoning. Piezo is an ADL compiler that unifies static and dynamic control in a single intermediate language (IL). Its key insight is that the IL’s static fragment is a refinement of its dynamic fragment: static code admits a subset of the run-time behaviors of the dynamic equivalent. Piezo can optimize code by combining facts from static and dynamic submodules, and it opportunistically converts code from dynamic to static control styles. We implement Piezo as an extension to an existing dynamic ADL compiler, Calyx. We use Piezo to implement a frontend for an existing ADL, a systolic array generator, and a packet-scheduling hardware generator to demonstrate its optimizations and the static–dynamic interactions it enables.

Mark–Scavenge: Waiting for Trash to Take Itself Out

Moving garbage collectors (GCs) typically free memory by evacuating live objects in order to reclaim contiguous memory regions. Evacuation is typically done either during tracing (scavenging), or after tracing when identification of live objects is complete (mark–evacuate). Scavenging typically requires more memory (memory for all objects to be moved), but performs less work in sparse memory areas (single pass). This makes it attractive for collecting young objects. Mark–evacuate typically requires less memory and performs less work in memory areas with dense object clusters, by focusing relocation around sparse regions, making it attractive for collecting old objects. Mark–evacuate also completes identification of live objects faster, making it attractive for concurrent GCs that can reclaim memory immediately after identification of live objects finishes (as opposed to when evacuation finishes), at the expense of more work compared to scavenging, for young objects.

We propose an alternative approach for concurrent GCs to combine the benefits of scavenging with the benefits of mark–evacuate, for young objects. The approach is based on the observation that by the time young objects are relocated by a concurrent GC, they are likely to already be unreachable. By performing relocation lazily, most of the relocations in the defragmentation phase of mark–evacuate can typically be eliminated. Similar to scavenging, objects are relocated during tracing with the proposed approach. However, instead of relocating all objects that are live in the current GC cycle, it lazily relocates profitable sparse object clusters that survived from the previous GC cycle. This turns the memory headroom that concurrent GCs typically “waste” in order to safely avoid running out of memory before GC finishes, into an asset used to eliminate much of the relocation work, which constitutes a significant portion of the GC work.

We call this technique mark–scavenge and implement it on top of ZGC in OpenJDK, in a collector we call MS-ZGC. We evaluate the performance of MS-ZGC against ZGC. The most striking result is an (up to) 91% reduction in relocation of dead objects (depending on machine-dependent factors).

Compositionality and Observational Refinement for Linearizability with Crashes

Crash-safety is an important property of real systems, as the main functionality of some systems is resilience to crashes. Toward a compositional verification approach for crash-safety under full-system crashes, one observes that crashes propagate instantaneously to all components across all levels of abstraction, even to unspecified components, hindering compositionality. Furthermore, in the presence of concurrency, a correctness criterion that addresses both crashes and concurrency proves necessary. For this, several adaptations of linearizability have been suggested, each featuring different trade-offs between complexity and expressiveness. The recently proposed compositional linearizability framework shows that to achieve compositionality with linearizability, both a locality and an observational refinement property are necessary. Despite that, no linearizability criterion with crashes has been proven to support an observational refinement property. In this paper, we define a compositional model of concurrent computation with full-system crashes. We use this model to develop a compositional theory of linearizability with crashes, which reveals a criterion, crash-aware linearizability, as its inherent notion of linearizability and supports both locality and observational refinement. We then show that strict linearizability and durable linearizability factor through crash-aware linearizability as two different ways of translating between concurrent computation with and without crashes, enabling simple proofs of locality and observational refinement for a generalization of these two criteria. Then, we show how the theory can be connected with a program logic for durable and crash-aware linearizability, which gives the first program logic that verifies a form of linearizability with crashes. We showcase the advantages of compositionality by verifying a library that facilitates programming persistent data structures and a fragment of a transactional interface for a file system.

Making Sense of Multi-threaded Application Performance at Scale with NonSequitur

Modern multi-threaded systems are highly complex. This makes their behavior difficult to understand. Developers frequently capture behavior in the form of program traces and then manually inspect these traces. Existing tools, however, fail to scale to traces larger than a million events. In this paper we present an approach to compress multi-threaded traces in order to allow developers to visually explore these traces at scale. Our approach is able to compress traces that contain millions of events down to a few hundred events. We use this approach to design and implement a tool called NonSequitur. We present three case studies which demonstrate how we used NonSequitur to analyze real-world performance issues with Meta's storage engine RocksDB and MongoDB's storage engine WiredTiger, two complex database backends. We also evaluate NonSequitur with 42 participants on traces from RocksDB and WiredTiger. We demonstrate that, in some cases, participants on average scored 11 times higher when performing performance analysis tasks on large execution traces. Additionally, for some performance analysis tasks, the participants spent on average three times longer with other tools than with NonSequitur.

Dependency-Aware Code Naturalness

Code naturalness, which captures repetitiveness and predictability in programming languages, has proven valuable for various code-related tasks in software engineering. However, precisely measuring code naturalness remains a fundamental challenge. Existing methods measure code naturalness over individual lines of code while ignoring the deep semantic relations among different lines, e.g., program dependency, which may negatively affect the precision of the measure. Despite the intuitive appeal of extending the code naturalness measure to the code dependency domain (some prior work has already begun exploiting code dependency for diverse code-related tasks), this extension remains unexplored and warrants direct investigation. In this work, we perform the first empirical study to investigate whether incorporating code dependency, instead of analyzing individual lines, can enhance the precision of measuring code naturalness. To achieve that, we first propose a new method named DAN for measuring code naturalness by incorporating the rich dependency information in the code. Specifically, DAN extracts multiple sequences of code lines by traversing the program dependency graph, where different code lines are connected by dependencies in each sequence, and then the code naturalness is measured by taking each sequence as a whole. In this way, the dependency information can be well captured. Finally, we conduct an extensive study to evaluate the influence of code dependency on measuring code naturalness with DAN, comparing it with state-of-the-art methods under three emerging application scenarios of code naturalness. The results demonstrate that DAN can not only better distinguish natural and unnatural code, but also substantially boost two important downstream applications of code naturalness, i.e., distinguishing buggy and non-buggy code lines and data cleansing for training better code models, reflecting the significance of code dependency in measuring code naturalness.
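
As a rough illustration of the dependency-aware idea (not DAN's actual implementation), the sketch below walks a program dependency graph, collects dependency-connected line sequences, and scores each sequence as a whole. The `cross_entropy` scorer and the graph encoding are hypothetical stand-ins for a trained language model and a real dependency analysis.

```python
# Hedged sketch: score dependency-connected line sequences instead of
# individual lines. dep_graph maps a line number to the lines that depend on it.

def dependency_sequences(dep_graph, max_len=5):
    """Enumerate line sequences in which consecutive lines are linked by a
    dependency edge."""
    sequences = []
    def walk(path):
        if len(path) == max_len or not dep_graph.get(path[-1]):
            sequences.append(path)
            return
        for nxt in dep_graph[path[-1]]:
            walk(path + [nxt])
    for start in dep_graph:
        walk([start])
    return sequences

def naturalness(source_lines, dep_graph, cross_entropy):
    """Average per-sequence cross-entropy; lower means more 'natural'."""
    seqs = dependency_sequences(dep_graph)
    scores = [cross_entropy("\n".join(source_lines[i] for i in seq))
              for seq in seqs]
    return sum(scores) / len(scores)

# Toy usage with a trivial stand-in scorer (token count as a fake surprisal).
code = ["x = read_input()", "y = x + 1", "print(y)", "z = 42"]
deps = {0: [1], 1: [2], 3: []}        # line 1 depends on 0, 2 on 1; 3 isolated
print(naturalness(code, deps, cross_entropy=lambda s: len(s.split())))
```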

Validating SMT Solvers for Correctness and Performance via Grammar-Based Enumeration

We introduce ET, a grammar-based enumerator for validating SMT solver correctness and performance. By compiling grammars of the SMT theories to algebraic datatypes, ET leverages the functional enumerator FEAT. ET is highly effective at bug finding and has many complementary benefits. Despite the extensive and continuous testing of the state-of-the-art SMT solvers Z3 and cvc5, ET found 102 bugs, out of which 76 were confirmed and 32 were fixed. Moreover, ET can be used to understand the evolution of solvers. We derive eight grammars realizing all major SMT theories including the booleans, integers, reals, realints, bit-vectors, arrays, floating points, and strings. Using ET, we test all consecutive releases of the SMT solvers Z3 and CVC4/cvc5 from the last six years (61 versions) on 8 million formulas and 488 million solver calls. Our results suggest improved correctness in recent versions of both solvers but decreased performance in newer releases of Z3 on small timeouts (since z3-4.8.11) and regressions in early cvc5 releases on larger timeouts. Due to its systematic testing and efficiency, we further advocate ET's use for continuous integration.
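
The differential-testing core of such a workflow can be pictured with the following sketch, which enumerates Boolean formulas from a tiny grammar and compares the answers of two solvers. Unlike ET, it uses a naive recursive enumerator instead of compiling the grammar to algebraic datatypes for FEAT, and it assumes `z3` and `cvc5` executables are available on the PATH.

```python
# Hedged sketch of grammar-based enumeration plus differential testing.
# Enumerate closed Boolean formulas up to a small depth, run two solvers on
# each one, and flag any disagreement as a candidate correctness bug.

import itertools, subprocess, tempfile

def booleans(depth):
    """Naively enumerate Boolean SMT-LIB terms over one free variable p."""
    if depth == 0:
        yield from ("true", "false", "p")
        return
    smaller = list(booleans(depth - 1))
    for t in smaller:
        yield f"(not {t})"
    for a, b in itertools.product(smaller, repeat=2):
        yield f"(and {a} {b})"
        yield f"(or {a} {b})"

def check(solver, formula):
    """Run one solver on (assert formula) and return its sat/unsat answer."""
    script = f"(declare-const p Bool)\n(assert {formula})\n(check-sat)\n"
    with tempfile.NamedTemporaryFile("w", suffix=".smt2", delete=False) as f:
        f.write(script)
        path = f.name
    out = subprocess.run([solver, path], capture_output=True, text=True)
    return out.stdout.strip()

for formula in booleans(1):
    answers = {s: check(s, formula) for s in ("z3", "cvc5")}
    if len(set(answers.values())) > 1:    # solvers disagree: a candidate bug
        print("disagreement on", formula, answers)
```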

Practical Verification of Smart Contracts using Memory Splitting

SMT-based verification of low-level code requires modeling and reasoning about memory operations. Prior work has shown that optimizing memory representations is beneficial for scaling verification: pointer analysis, for example, can be used to split memory into disjoint regions, leading to faster SMT solving. However, these techniques are mostly designed for C and C++ programs with explicit memory-allocation operations, which are not present in all languages. For instance, on the Ethereum virtual machine, memory is simply a monolithic array of bytes that can be freely accessed by Ethereum bytecode, and there is no allocation primitive. In this paper, we present a memory splitting transformation guided by a conservative memory analysis for Ethereum bytecode generated by the Solidity compiler. The analysis consists of two phases: recovering memory allocation and memory regions, followed by a pointer analysis. The goal of the analysis is to enable memory splitting, which in turn speeds up verification. We implemented both the analysis and the memory splitting transformation as part of a verification tool, CertoraProver, and show that the transformation speeds up SMT solving by up to 120x and additionally mitigates 16 timeouts when used on 229 real-world smart contract verification tasks.

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their codes. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and need to be validated.

In this paper, we establish a True Positive Theorem for a static data-race detector for GPU programs, i.e., a result that identifies a class of programs for which our technique only raises true alarms. Our work builds on the formalism of memory access protocols, which models the concurrency operations of CUDA programs. The crux of our approach is an approximation analysis that can correctly identify true alarms and pinpoint the conditions that make an alarm imprecise. Our approximation analysis detects when the reported locations are reachable (control independence, or CI) and when the reported locations are precise (data independence, or DI), as well as identifies inexact values in an alarm. In addition to a True Positive result for programs that are CI and DI, we establish the root causes of spurious alarms depending on whether CI or DI is present.

We apply our theory to introduce FaialAA, the first sound and partially complete data-race detector. We evaluate FaialAA in three experiments. First, in a comparative study with the state-of-the-art tools, we show that FaialAA confirms more DRF programs than others while emitting 1.9× fewer potential alarms. Importantly, the approximation analysis of FaialAA detects 10 undocumented data-races. Second, in an experiment studying 6 commits of data-race fixes in open source projects OpenMM and Nvidia’s MegaTron, FaialAA confirmed the buggy and fixed versions of 5 commits, while others were only able to confirm 2. Third, we show that 59.5% of 2,770 programs are CI and DI, quantifying when the approximation analysis of FaialAA is complete.

This paper is accompanied by the mechanized proofs of the theoretical results presented therein and a tool (FaialAA) implementing our theory.

Effect Handlers for C via Coroutines

Effect handlers provide a structured means for implementing user-defined, composable, and customisable computational effects, ranging from exceptions to generators to lightweight threads. We introduce libseff, a novel effect handlers library for C, based on coroutines. Whereas prior effect handler libraries for C are intended primarily as compilation targets, libseff is intended to be used directly from C programs. As such, the design of libseff parts ways from traditional effect handler implementations, both by using mutable coroutines as the main representation of pending computations, and by avoiding closures as handlers by way of reified effects. We show that the performance of libseff is competitive across a range of platforms and benchmarks.
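
To give a flavor of the coroutine-based design without reproducing libseff's C API, the following Python sketch uses generators as coroutines: the computation yields reified effect requests, and a handler drives the coroutine and resumes it with results. This is only an illustration of the general idea; names such as `perform` and `handle` are ours, not libseff's.

```python
# Hedged sketch: effect handling via coroutines. The computation suspends with
# a reified (effect, payload) request; the handler interprets it and resumes.

def perform(effect, payload=None):
    """Suspend the computation with a reified effect request."""
    return (yield (effect, payload))

def computation():
    name = yield from perform("ask", "What is your name?")
    yield from perform("print", f"Hello, {name}!")
    return len(name)

def handle(coro, handlers):
    """Drive a coroutine, dispatching each reified effect to a handler."""
    result = None
    try:
        request = coro.send(None)            # start the coroutine
        while True:
            effect, payload = request
            request = coro.send(handlers[effect](payload))
    except StopIteration as done:            # computation finished normally
        result = done.value
    return result

answers = iter(["world"])
print(handle(computation(), {
    "ask":   lambda prompt: next(answers),   # reply without real I/O
    "print": lambda text: print(text),
}))
```

In libseff itself, per the abstract, the pending computation is a mutable coroutine and handlers avoid closures by way of reified effects; the generator-based dispatch above only conveys the high-level shape of that design.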

When Your Infrastructure Is a Buggy Program: Understanding Faults in Infrastructure as Code Ecosystems

Modern applications have become increasingly complex and their manual installation and configuration is no longer practical. Instead, IT organizations heavily rely on Infrastructure as Code (IaC) technologies to automate the provisioning, configuration, and maintenance of computing infrastructures and systems. IaC systems typically offer declarative, domain-specific languages (DSLs) that allow system administrators and developers to write high-level programs that specify the desired state of their infrastructure in a reliable, predictable, and documented fashion. Just like traditional programs, IaC software is not immune to faults, with issues ranging from deployment failures to critical misconfigurations that often impact production systems used by millions of end users. Surprisingly, despite its crucial role in global infrastructure management, the tooling and techniques for ensuring IaC reliability still have room for improvement. In this work, we conduct a comprehensive analysis of 360 bugs identified in IaC software within prominent IaC ecosystems, including Ansible, Puppet, and Chef. Our work is the first in-depth exploration of bug characteristics in these widely-used IaC environments. Through our analysis we aim to understand: (1) how these bugs manifest, (2) their underlying root causes, (3) their reproduction requirements in terms of system state (e.g., operating system versions) or input characteristics, and (4) how these bugs are fixed. Based on our findings, we evaluate the state-of-the-art techniques for IaC reliability, identify their limitations, and provide a set of recommendations for future research. We believe that our study helps researchers to (1) better understand the complexity and peculiarities of IaC software, and (2) develop advanced tooling for more reliable and robust system configurations.

A Case for First-Class Environments

Formalizations of programming languages typically adopt the substitution model from the lambda calculus. However, substitution creates notorious complications for reasoning and implementation. Furthermore, it is disconnected from practical implementations, which normally adopt environments and closures.

In this paper we advocate for formalizing programming languages using a novel style of small-step environment-based semantics, which avoids substitution and is closer to implementations. We present a call-by-value statically typed calculus, called λE, using our small-step environment semantics. With our alternative environment semantics, programming language constructs for first-class environments arise naturally, without creating significant additional complexity. Therefore, λE also adopts first-class environments, adding expressive power that is not available in conventional lambda calculi. λE is a conservative extension of the call-by-value Simply Typed Lambda Calculus (STLC), and employs de Bruijn indices for its formalization, which fit naturally with the environment-based semantics. Reasoning about λE is simple, and in many cases simpler than reasoning about the traditional STLC. We present an abstract machine that implements the semantics of λE and has a simple correctness proof. We also extend λE with references. We show that λE can model a simple form of first-class modules, and suggest using first-class environments as an alternative to objects for modelling capabilities. All technical results are formalized in the Coq proof assistant. In summary, our work shows that the small-step environment semantics that we adopt has three main and orthogonal benefits: 1) it simplifies the notorious binding problem in formalizations and proof assistants; 2) it is closer to implementations; and 3) additional expressive power is obtained from first-class environments almost for free.
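
For intuition only, here is a small environment-based interpreter in Python; it is not λE's formal small-step semantics but a big-step evaluator in the same spirit. It uses de Bruijn indices, treats environments as ordinary values, and adds two hypothetical constructs: one that reifies the current environment and one that evaluates a term under a computed environment.

```python
# Hedged sketch: call-by-value evaluation with de Bruijn indices and
# environments (tuples) as first-class values. Closures pair code with a
# captured environment; no substitution is ever performed.

def evaluate(term, env):
    tag = term[0]
    if tag == "var":                        # ("var", i): de Bruijn index i
        return env[term[1]]
    if tag == "lam":                        # ("lam", body): capture env
        return ("closure", term[1], env)
    if tag == "app":                        # ("app", fn, arg)
        _, body, captured = evaluate(term[1], env)
        arg = evaluate(term[2], env)
        return evaluate(body, (arg,) + captured)   # extend the captured env
    if tag == "lit":                        # ("lit", value)
        return term[1]
    if tag == "env":                        # reify the current environment
        return env
    if tag == "with":                       # evaluate body under a computed env
        return evaluate(term[2], evaluate(term[1], env))
    raise ValueError(term)

# (\x. x) 42 under the empty environment.
print(evaluate(("app", ("lam", ("var", 0)), ("lit", 42)), ()))
# Reify the current environment, then evaluate a variable under it.
print(evaluate(("with", ("env",), ("var", 0)), (7,)))
```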

Fast and Optimal Extraction for Sparse Equality Graphs

Equality graphs (e-graphs) are used to compactly represent equivalence classes of terms in symbolic reasoning systems. Beyond their original roots in automated theorem proving, e-graphs have been used in a variety of applications. They have become particularly important as the key ingredient in the popular technique of equality saturation, which has notable applications in compiler optimization, program synthesis, program verification, and symbolic execution, among others. In a typical equality saturation workflow, an e-graph is used to store a large number of equalities that are generated by local rewrites during a saturation phase, after which an optimal term is extracted from the e-graph as the output of the technique. However, despite its crucial role in equality saturation, e-graph extraction has received relatively little attention in the literature, which we seek to start addressing in this paper. Extraction is a challenging problem and is notably known to be NP-hard in general, so current equality saturation tools rely either on slow optimal extraction algorithms based on integer linear programming (ILP) or on heuristics that may not always produce the optimal result. In fact, in this paper, we show that e-graph extraction is hard to approximate within any constant ratio. Thus, any such heuristic will produce wildly suboptimal results in the worst case. Fortunately, we show that the problem becomes tractable when the e-graph is sparse, which is the case in many practical applications. We present a novel parameterized algorithm for extracting optimal terms from e-graphs with low treewidth, a measure of how “tree-like” a graph is, and prove its correctness. We also present an efficient Rust implementation of our algorithm and evaluate it against ILP on a number of benchmarks extracted from the Cranelift benchmark suite, a real-world compiler optimization library based on equality saturation. Our algorithm optimally extracts e-graphs with treewidths of up to 10 in a fraction of the time taken by ILP. These results suggest that our algorithm can be a valuable tool for equality saturation users who need to extract optimal terms from sparse e-graphs.
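
To make the extraction problem concrete, the sketch below shows the kind of greedy bottom-up cost heuristic the abstract alludes to, not the paper's treewidth-parameterized algorithm. It picks, per e-class, the cheapest e-node under a unit cost model, and can therefore be suboptimal when e-classes share subterms; the e-graph encoding is hypothetical.

```python
# Hedged sketch: greedy bottom-up extraction from an e-graph. Each e-class maps
# to its e-nodes, given as (operator, child e-class ids) pairs.

def extract(egraph, root):
    INF = float("inf")
    best = {c: (INF, None) for c in egraph}          # e-class -> (cost, e-node)

    changed = True
    while changed:                                   # fixpoint over e-classes
        changed = False
        for cls, enodes in egraph.items():
            for op, children in enodes:
                cost = 1 + sum(best[ch][0] for ch in children)  # unit op cost
                if cost < best[cls][0]:
                    best[cls] = (cost, (op, children))
                    changed = True

    def build(cls):                                  # rebuild a term from choices
        op, children = best[cls][1]
        return (op,) + tuple(build(ch) for ch in children)

    return best[root][0], build(root)

# (a * 2) stored alongside the equivalent (a << 1) in the same e-class.
egraph = {
    "a":   [("a", [])],
    "two": [("2", [])],
    "one": [("1", [])],
    "mul": [("*", ["a", "two"]), ("<<", ["a", "one"])],
}
print(extract(egraph, "mul"))     # both alternatives cost 3 under unit costs
```

The paper's contribution is an algorithm that recovers optimality when the e-graph has low treewidth, which this greedy pass does not attempt.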

Automated Robustness Verification of Concurrent Data Structure Libraries against Relaxed Memory Models

Clients reason about the behavior of concurrent data structure libraries such as sets, queues, or stacks using specifications that capture well-understood correctness conditions, such as linearizability. The implementations of these libraries, however, focused as they are on performance, may additionally exploit relaxed memory behaviors allowed by the language or the underlying hardware, which weaken the strong ordering and visibility constraints on shared-memory accesses that would otherwise be imposed by a sequentially consistent (SC) memory model. As an alternative to developing new specification and verification mechanisms for reasoning about libraries under relaxed memory models, we instead consider the orthogonal problem of library robustness, a property that holds when all possible behaviors of a library implementation under a relaxed memory model are also possible under SC. In this paper, we develop a new automated technique for verifying robustness of library implementations in the context of a C11-style memory model. This task is challenging because a most-general client may invoke an unbounded number of concurrently executing library operations that can manipulate an unbounded number of shared locations. We establish a novel inductive technique for verifying library robustness that leverages prior work on the robustness problem for the C11 memory model, which is based on the search for a non-robustness witness under SC executions. We crucially rely on the fact that this search is carried out over SC executions, and use high-level SC specifications (including linearizability) of the library to verify the absence of a non-robustness witness. Our technique is compositional: we show how we can safely preserve robustness of multiple interacting library implementations and clients using additional SC fences to guarantee robustness of entire executions. Experimental results on a number of complex realistic library implementations demonstrate the feasibility of our approach.

The ART of Sharing Points-to Analysis: Reusing Points-to Analysis Results Safely and Efficiently

Data-flow analyses like points-to analysis can vastly improve the precision of other analyses, and enable powerful code optimizations. However, whole-program points-to analysis of large Java programs tends to be expensive – both in terms of time and memory. Consequently, many compilers (both static and JIT) and program-analysis tools tend to employ faster – but more conservative – points-to analyses to improve usability. As an alternative to such trading of precision for performance, various techniques have been proposed to perform precise yet expensive fixed-point points-to analyses ahead of time in a static analyzer, store the results, and then transmit them to independent compilation/program-analysis stages that may need them. However, an underlying concern of safety affects all such techniques: can a compiler (or program analysis tool) trust the points-to analysis results generated by another compiler/tool? In this work, we address this issue of trust in the context of Java, while accounting for the issue of performance. We propose ART: Analysis-results Representation Template – a novel scheme to efficiently and concisely encode results of flow-sensitive, context-insensitive points-to analysis computed by a static analyzer for use in any independent system that may benefit from such a precise points-to analysis. ART also allows for fast regeneration of the encoded sound analysis results in such systems. Our scheme has two components: (i) a producer that can statically perform the expensive points-to analysis and encode its results concisely, and (ii) a consumer that, on receiving such encoded results (called artwork), can regenerate the points-to analysis results encoded by the artwork if it is deemed “safe”. The regeneration scheme completely avoids fixed-point computations and thus can help consumers like static analyzers and JIT compilers to obtain precise points-to information without paying a prohibitively high cost. We demonstrate the usage of ART by implementing a producer (in Soot) and two consumers (in Soot and the Eclipse OpenJ9 JIT compiler). We have evaluated our implementation over various benchmarks from the DaCapo and SPECjvm2008 suites. Our results demonstrate that using ART, a consumer can obtain precise flow-sensitive, context-insensitive points-to analysis results in less than (average) 1% of the time taken by a static analyzer to perform the same analysis, with the storage overhead of ART representing a small fraction of the program size (average around 4%).

Boosting the Performance of Alias-Aware IFDS Analysis with CFL-Based Environment Transformers

The IFDS algorithm is pivotal in solving field-sensitive data-flow problems. However, its conventional use of access paths for field sensitivity leads to the generation of a large number of data-flow facts. This causes scalability challenges in larger programs, limiting its practical application in extensive codebases. In response, we propose a new field-sensitive technique that reinterprets the generation of access paths as a Context-Free Language (CFL) for field sensitivity and formulates it as an IDE problem. This approach significantly reduces the number of data-flow facts generated and handled during the analysis, which is a major factor in performance degradation. To demonstrate the effectiveness of this approach, we developed a taint analysis tool, IDEDroid, in the IFDS/IDE framework. IDEDroid outperforms FlowDroid, an established IFDS-based taint analysis tool, in the analysis of 24 major Android apps while improving its precision (with a theoretical guarantee). The speed improvement ranges from 2.1× to 2,368.4×, averaging 222.0×, with precision gains reaching up to 20.0% (in terms of false positives reduced). This performance indicates that IDEDroid is substantially more effective in detecting information-flow leaks, making it a potentially superior tool for mobile app vetting.

Higher-Order Model Checking of Effect-Handling Programs with Answer-Type Modification

Model checking is a successful program verification methodology. Since the seminal work by Ong, the model checking of higher-order programs, called higher-order model checking (HOMC for short), has gained attention. To make HOMC applicable to real-world software, it is also crucial to address programs involving computational effects. Recently, Dal Lago and Ghyselen considered an extension of HOMC to algebraic effect handlers, which enable programming the semantics of effects. They showed a negative result: HOMC with algebraic effect handlers is undecidable. In this work, we explore a restriction on programs with algebraic effect handlers that ensures the decidability of HOMC while still allowing implementations of various effects. We identify the crux of the undecidability as having an unbounded number of algebraic effect handlers active at the same time. To prevent this, we introduce answer-type modification (ATM), which bounds the number of algebraic effect handlers that can be active at the same time. We prove that ATM ensures the decidability of HOMC and show that it accommodates a wide range of effects. To evaluate our approach, we implemented an automated verifier, EffCaml, based on the presented techniques and confirmed that the program examples discussed in this paper can be verified automatically.