PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Full Citation in the ACM Digital Library

SESSION: Papers

Incremental whole-program analysis in Datalog with lattices

Incremental static analyses provide up-to-date analysis results in time proportional to the size of a code change, not the entire code base. This promises fast feedback to programmers in IDEs and when checking in commits. However, existing incremental analysis frameworks fail to deliver on this promise for whole-program lattice-based data-flow analyses. In particular, prior Datalog-based frameworks yield good incremental performance only for intra-procedural analyses.

In this paper, we first present a methodology to empirically test if a computation is amenable to incrementalization. Using this methodology, we find that incremental whole-program analysis may be possible. Second, we present a new incremental Datalog solver called LADDDER to eliminate the shortcomings of prior Datalog-based analysis frameworks. Our Datalog solver uses a non-standard aggregation semantics which allows us to loosen monotonicity requirements on analyses and to improve the performance of lattice aggregators considerably. Our evaluation on real-world Java code confirms that LADDDER provides up-to-date points-to, constant propagation, and interval information in milliseconds.

Revamping hardware persistency models: view-based and axiomatic persistency models for Intel-x86 and Armv8

Non-volatile memory (NVM) is a cutting-edge storage technology that promises the performance of DRAM with the durability of SSD. Recent work has proposed several persistency models for mainstream architectures such as Intel-x86 and Armv8, describing the order in which writes are propagated to NVM. However, these models have several limitations; most notably, they either lack operational models or do not support persistent synchronization patterns.

We close this gap by revamping the existing persistency models. First, inspired by the recent work on promising semantics, we propose a unified operational style for describing persistency using views, and develop view-based operational persistency models for Intel-x86 and Armv8, thus presenting the first operational model for Armv8 persistency. Next, we propose a unified axiomatic style for describing hardware persistency, allowing us to recast and repair the existing axiomatic models of Intel-x86 and Armv8 persistency. We prove that our axiomatic models are equivalent to the authoritative semantics reviewed by Intel and Arm engineers. We further prove that each axiomatic hardware persistency model is equivalent to its operational counterpart. Finally, we develop a persistent model checking algorithm and tool, and use it to verify several representative examples.

Repairing serializability bugs in distributed database programs via automated schema refactoring

Serializability is a well-understood concurrency control mechanism that eases reasoning about highly-concurrent database programs. Unfortunately, enforcing serializability has a high performance cost, especially on geographically distributed database clusters. Consequently, many databases allow programmers to choose when a transaction must be executed under serializability, with the expectation that transactions would only be so marked when necessary to avoid serious concurrency bugs. However, this is a significant burden to impose on developers, requiring them to (a) reason about subtle concurrent interactions among potentially interfering transactions, (b) determine when such interactions would violate desired invariants, and (c) then identify the minimum number of transactions whose executions should be serialized to prevent these violations. To mitigate this burden, this paper presents a sound fully-automated schema refactoring procedure that refactors a program’s data layout – rather than its concurrency control logic – to eliminate statically identified concurrency bugs, allowing more transactions to be safely executed under weaker and more performant database guarantees. Experimental results over a range of realistic database benchmarks indicate that our approach is highly effective in eliminating concurrency bugs, with safe refactored programs showing an average of 120% higher throughput and 45% lower latency compared to a serialized baseline.

Gleipnir: toward practical error analysis for Quantum programs

Practical error analysis is essential for the design, optimization, and evaluation of Noisy Intermediate-Scale Quantum(NISQ) computing. However, bounding errors in quantum programs is a grand challenge, because the effects of quantum errors depend on exponentially large quantum states. In this work, we present Gleipnir, a novel methodology toward practically computing verified error bounds in quantum programs. Gleipnir introduces the (ρ,δ)-diamond norm, an error metric constrained by a quantum predicate consisting of the approximate state ρ and its distance δ to the ideal state ρ. This predicate (ρ,δ) can be computed adaptively using tensor networks based on the Matrix Product States. Gleipnir features a lightweight logic for reasoning about error bounds in noisy quantum programs, based on the (ρ,δ)-diamond norm metric. Our experimental results show that Gleipnir is able to efficiently generate tight error bounds for real-world quantum programs with 10 to 100 qubits, and can be used to evaluate the error mitigation performance of quantum compiler transformations.

Alive2: bounded translation validation for LLVM

We designed, implemented, and deployed Alive2: a bounded translation validation tool for the LLVM compiler’s intermediate representation (IR). It limits resource consumption by, for example, unrolling loops up to some bound, which means there are circumstances in which it misses bugs. Alive2 is designed to avoid false alarms, is fully automatic through the use of an SMT solver, and requires no changes to LLVM. By running Alive2 over LLVM’s unit test suite, we discovered and reported 47 new bugs, 28 of which have been fixed already. Moreover, our work has led to eight patches to the LLVM Language Reference—the definitive description of the semantics of its IR—and we have participated in numerous discussions with the goal of clarifying ambiguities and fixing errors in these semantics. Alive2 is open source and we also made it available on the web, where it has active users from the LLVM community.

Transfinite Iris: resolving an existential dilemma of step-indexed separation logic

Step-indexed separation logic has proven to be a powerful tool for modular reasoning about higher-order stateful programs. However, it has only been used to reason about safety properties, never liveness properties. In this paper, we observe that the inability of step-indexed separation logic to support liveness properties stems fundamentally from its failure to validate the existential property, connecting the meaning of existential quantification inside and outside the logic. We show how to validate the existential property—and thus enable liveness reasoning—by moving from finite step-indices (natural numbers) to transfinite step-indices (ordinals). Concretely, we transform the Coq-based step-indexed logic Iris to Transfinite Iris, and demonstrate its effectiveness in proving termination and termination-preserving refinement for higher-order stateful programs.

Perceus: garbage free reference counting with reuse

We introduce Perceus, an algorithm for precise reference counting with reuse and specialization. Starting from a functional core language with explicit control-flow, Perceus emits precise reference counting instructions such that (cycle-free) programs are _garbage free_, where only live references are retained. This enables further optimizations, like reuse analysis that allows for guaranteed in-place updates at runtime. This in turn enables a novel programming paradigm that we call _functional but in-place_ (FBIP). Much like tail-call optimization enables writing loops with regular function calls, reuse analysis enables writing in-place mutating algorithms in a purely functional way. We give a novel formalization of reference counting in a linear resource calculus, and prove that Perceus is sound and garbage free. We show evidence that Perceus, as implemented in Koka, has good performance and is competitive with other state-of-the-art memory collectors.

Proof repair across type equivalences

We describe a new approach to automatically repairing broken proofs in the Coq proof assistant in response to changes in types. Our approach combines a configurable proof term transformation with a decompiler from proof terms to suggested tactic scripts. The proof term transformation implements transport across equivalences in a way that removes references to the old version of the changed type and does not rely on axioms beyond those Coq assumes.

We have implemented this approach in Pumpkin Pi, an extension to the Pumpkin Patch Coq plugin suite for proof repair. We demonstrate Pumpkin Pi’s flexibility on eight case studies, including supporting a benchmark from a user study,easing development with dependent types, porting functions and proofs between unary and binary numbers, and supporting an industrial proof engineer to interoperate between Coq and other verification tools more easily.

Compiler-assisted object inlining with value fields

Object Oriented Programming has flourished in many areas ranging from web-oriented microservices, data processing, to databases. However, while representing domain entities as objects is appealing to developers, it leads to data fragmentation, resulting in high memory footprint and poor locality.

To improve memory footprint and memory locality, embedding the payload of an object into another (object inlining) has been proposed, however, with severe limitations. We argue that object inlining is mostly useful to optimize objects in the application data-path and that such objects have value semantics, unlocking great potential for inlining objects.

We propose value fields, an abstraction which allows fields to be marked as having value semantics. We take advantage of the closed-world assumption provided by GraalVM Native Image to implement Object inlining. Results show that using value fields requires minimal to no effort from developers and leads to improvements in throughput of up to 3x, memory footprint of up to 40%, and GC pause times of up to 35%.

Unleashing the hidden power of compiler optimization on binary code difference: an empirical study

Hunting binary code difference without source code (i.e., binary diffing) has compelling applications in software security. Due to the high variability of binary code, existing solutions have been driven towards measuring semantic similarities from syntactically different code. Since compiler optimization is the most common source contributing to binary code differences in syntax, testing the resilience against the changes caused by different compiler optimization settings has become a standard evaluation step for most binary diffing approaches. For example, 47 top-venue papers in the last 12 years compared different program versions compiled by default optimization levels (e.g., -Ox in GCC and LLVM). Although many of them claim they are immune to compiler transformations, it is yet unclear about their resistance to non-default optimization settings. Especially, we have observed that adversaries explored non-default compiler settings to amplify malware differences.

This paper takes the first step to systematically studying the effectiveness of compiler optimization on binary code differences. We tailor search-based iterative compilation for the auto-tuning of binary code differences. We develop BinTuner to search near-optimal optimization sequences that can maximize the amount of binary code differences. We run BinTuner with GCC 10.2 and LLVM 11.0 on SPEC benchmarks (CPU2006 & CPU2017), Coreutils, and OpenSSL. Our experiments show that at the cost of 279 to 1,881 compilation iterations, BinTuner can find custom optimization sequences that are substantially better than the general -Ox settings. BinTuner's outputs seriously undermine prominent binary diffing tools' comparisons. In addition, the detection rate of the IoT malware variants tuned by BinTuner falls by more than 50%. Our findings paint a cautionary tale for security analysts that attackers have a new way to mutate malware code cost-effectively, and the research community needs to step back to reassess optimization-resistance evaluations.

RefinedC: automating the foundational verification of C code with refined ownership types

Given the central role that C continues to play in systems software, and the difficulty of writing safe and correct C code, it remains a grand challenge to develop effective formal methods for verifying C programs. In this paper, we propose a new approach to this problem: a type system we call RefinedC, which combines ownership types (for modular reasoning about shared state and concurrency) with refinement types (for encoding precise invariants on C data types and Hoare-style specifications for C functions).

RefinedC is both automated (requiring minimal user intervention) and foundational (producing a proof of program correctness in Coq), while at the same time handling a range of low-level programming idioms such as pointer arithmetic. In particular, following the approach of RustBelt, the soundness of the RefinedC type system is justified semantically by interpretation into the Coq-based Iris framework for higher-order concurrent separation logic. However, the typing rules of RefinedC are also designed to be encodable in a new “separation logic programming” language we call Lithium. By restricting to a carefully chosen (yet expressive) fragment of separation logic, Lithium supports predictable, automatic, goal-directed proof search without backtracking. We demonstrate the effectiveness of RefinedC on a range of representative examples of C code.

Wire sorts: a language abstraction for safe hardware composition

Effective digital hardware design fundamentally requires decomposing a design into a set of interconnected modules, each a distinct unit of computation and state. However, naively connecting hardware modules leads to real-world pathological cases which are surprisingly far from obvious when looking at the interfaces alone and which are very difficult to debug after synthesis. We show for the first time that it is possible to soundly abstract even complex combinational dependencies of arbitrary hardware modules through the assignment of IO ports to one of four new sorts which we call: to-sync, to-port, from-sync, and from-port. This new taxonomy, and the reasoning it enables, facilitates modularity by escalating problematic aspects of module input/output interaction to the language-level interface specification. We formalize and prove the soundness of our new wire sorts, implement them in a practical hardware description language, and demonstrate they can be applied and even inferred automatically at scale. Through an examination of the BaseJump STL, the OpenPiton manycore research platform, and a complete RISC-V implementation, we find that even on our biggest design containing 1.5 million primitive gates, analysis takes less than 31 seconds; that across 172 unique modules analyzed, the inferred sorts are widely distributed across our taxonomy; and that by using wire sorts, our tool is 2.6–33.9x faster at finding loops than standard synthesis-time cycle detection.

DeepCuts: a deep learning optimization framework for versatile GPU workloads

Widely used Deep Learning (DL) frameworks, such as TensorFlow, PyTorch, and MXNet, heavily rely on the NVIDIA cuDNN for performance. However, using cuDNN does not always give the best performance. One reason is that it is hard to handle every case of versatile DNN models and GPU architectures with a library that has a fixed implementation. Another reason is that cuDNN lacks kernel fusion functionality that gives a lot of chances to improve performance. In this paper, we propose a DL optimization framework for versatile GPU workloads, called DeepCuts. It considers both kernel implementation parameters and GPU architectures. It analyzes the DL workload, groups multiple DL operations into a single GPU kernel, and generates optimized GPU kernels considering kernel implementation parameters and GPU architecture parameters. The evaluation result with various DL workloads for inference and training indicates that DeepCuts outperforms cuDNN/cuBLAS-based implementations and the state-of-the-art DL optimization frameworks, such as TVM, TensorFlow XLA, and TensorRT.

Retrofitting effect handlers onto OCaml

Effect handlers have been gathering momentum as a mechanism for modular programming with user-defined effects. Effect handlers allow for non-local control flow mechanisms such as generators, async/await, lightweight threads and coroutines to be composably expressed. We present a design and evaluate a full-fledged efficient implementation of effect handlers for OCaml, an industrial-strength multi-paradigm programming language. Our implementation strives to maintain the backwards compatibility and performance profile of existing OCaml code. Retrofitting effect handlers onto OCaml is challenging since OCaml does not currently have any non-local control flow mechanisms other than exceptions. Our implementation of effect handlers for OCaml: (i) imposes a mean 1% overhead on a comprehensive macro benchmark suite that does not use effect handlers; (ii) remains compatible with program analysis tools that inspect the stack; and (iii) is efficient for new code that makes use of effect handlers.

Unqomp: synthesizing uncomputation in Quantum circuits

A key challenge when writing quantum programs is the need for uncomputation: temporary values produced during the computation must be reset to zero before they can be safely discarded. Unfortunately, most existing quantum languages require tedious manual uncomputation, often leading to inefficient and error-prone programs. We present Unqomp, the first procedure to automatically synthesize uncomputation in a given quantum circuit. Unqomp can be readily integrated into popular quantum languages, allowing the programmer to allocate and use temporary values analogously to classical computation, knowing they will be uncomputed by Unqomp. Our evaluation shows that programs leveraging Unqomp are not only shorter (-19% on average), but also generate more efficient circuits (-71% gates and -19% qubits on average).

Zooid: a DSL for certified multiparty computation: from mechanised metatheory to certified multiparty processes

We design and implement Zooid, a domain specific language for certified multiparty communication, embedded in Coq and implemented atop our mechanisation framework of asynchronous multiparty session types (the first of its kind). Zooid provides a fully mechanised metatheory for the semantics of global and local types, and a fully verified end-point process language that faithfully reflects the type-level behaviours and thus inherits the global types properties such as deadlock freedom, protocol compliance, and liveness guarantees.

Fluid: a framework for approximate concurrency via controlled dependency relaxation

In this work, we introduce the Fluid framework, a set of language, compiler and runtime extensions that allow for the expression of regions within which dataflow dependencies can be approximated in a disciplined manner. Our framework allows the eager execution of dependent tasks before their inputs have finalized in order to capitalize on situations where an eagerly-consumed input has a high probability of sufficiently resembling the value or structure of the final value that would have been produced in a conservative/precise execution schedule. We introduce controlled access to the early consumption of intermediate values and provide hooks for user-specified quality assurance mechanisms that can automatically enforce re-execution of eagerly-executed tasks if their output values do not meet heuristic expectations. Our experimental analysis indicates that the fluidized versions of the applications bring 22.2% average execution time improvements, over their original counterparts, under the default values of our fluidization parameters. The Fluid approach is largely orthogonal to approaches that aim to reduce the task effort itself and we show that utilizing the Fluid framework can yield benefits for both originally precise and originally approximate versions of computation.

Developer and user-transparent compiler optimization for interactive applications

Traditional offline optimization frameworks rely on representative hardware, software, and inputs to compare different optimization decisions on. With application-specific optimization for mobile systems though, the idea of a representative test bench is unrealistic while creating offline inputs is non-trivial. Online approaches partially overcome these problems but they might expose users to suboptimal or even erroneously optimized code. As a result, our mobile code is poorly optimized and this results in wasted performance, wasted energy, and user frustration. In this paper, we introduce a novel compiler optimization approach designed for mobile applications. It requires no developer effort, it tunes applications for the user’s device and usage patterns, and has no negative impact on the user experience. It is based on a lightweight capture and replay mechanism. In its online stage, it captures the state accessed by any targeted code region. By re-purposing existing OS capabilities, it keeps the overhead low. In its offline stage, it replays the code region but under different optimization decisions to enable sound comparisons of different optimizations under realistic conditions. Coupled with a search heuristic for the compiler optimization space, it allows us to discover optimization decisions that improve performance without testing these decisions directly on the user. We implemented a prototype system in Android based on LLVM combined with a genetic search engine. We evaluated it on both benchmarks and real Android applications. Online captures are infrequent and each one introduces an overhead of less than 15ms on average. For this negligible effect on user experience, we achieve speedups of 44% on average over the Android compiler and 35% over LLVM -O3.

Demanded abstract interpretation

We consider the problem of making expressive static analyzers interactive. Formal static analysis is seeing increasingly widespread adoption as a tool for verification and bug-finding, but even with powerful cloud infrastructure it can take minutes or hours to get batch analysis results after a code change. While existing techniques offer some demand-driven or incremental aspects for certain classes of analysis, the fundamental challenge we tackle is doing both for arbitrary abstract interpreters. Our technique, demanded abstract interpretation, lifts program syntax and analysis state to a dynamically evolving graph structure, in which program edits, client-issued queries, and evaluation of abstract semantics are all treated uniformly. The key difficulty addressed by our approach is the application of general incremental computation techniques to the complex, cyclic dependency structure induced by abstract interpretation of loops with widening operators. We prove that desirable abstract interpretation meta-properties, including soundness and termination, are preserved in our approach, and that demanded analysis results are equal to those computed by a batch abstract interpretation. Experimental results suggest promise for a prototype demanded abstract interpretation framework: by combining incremental and demand-driven techniques, our framework consistently delivers analysis results at interactive speeds, answering 95% of queries within 1.2 seconds.

Learning to find naming issues with big code and small supervision

We introduce a new approach for finding and fixing naming issues in source code. The method is based on a careful combination of unsupervised and supervised procedures: (i) unsupervised mining of patterns from Big Code that express common naming idioms. Program fragments violating such idioms indicates likely naming issues, and (ii) supervised learning of a classifier on a small labeled dataset which filters potential false positives from the violations.

We implemented our method in a system called Namer and evaluated it on a large number of Python and Java programs. We demonstrate that Namer is effective in finding naming mistakes in real world repositories with high precision (~70%). Perhaps surprisingly, we also show that existing deep learning methods are not practically effective and achieve low precision in finding naming issues (up to ~16%).

DIY assistant: a multi-modal end-user programmable virtual assistant

While Alexa can perform over 100,000 skills, its capability covers only a fraction of what is possible on the web. Individuals need and want to automate a long tail of web-based tasks which often involve visiting different websites and require programming concepts such as function composition, conditional, and iterative evaluation. This paper presents DIYA (Do-It-Yourself Assistant), a new system that empowers users to create personalized web-based virtual assistant skills that require the full generality of composable control constructs, without having to learn a formal programming language.

With DIYA, the user demonstrates their task of interest in the browser and issues a few simple voice commands, such as naming the skills and adding conditions on the action. DIYA turns these multi-modal specifications into voice-invocable skills written in the ThingTalk 2.0 programming language we designed for this purpose. DIYA is a prototype that works in the Chrome browser. Our user studies show that 81% of the proposed routines can be expressed using DIYA. DIYA is easy to learn, and 80% of users surveyed find DIYA useful.

Web question answering with neurosymbolic program synthesis

In this paper, we propose a new technique based on program synthesis for extracting information from webpages. Given a natural language query and a few labeled webpages, our method synthesizes a program that can be used to extract similar types of information from other unlabeled webpages. To handle websites with diverse structure, our approach employs a neurosymbolic DSL that incorporates both neural NLP models as well as standard language constructs for tree navigation and string manipulation. We also propose an optimal synthesis algorithm that generates all DSL programs that achieve optimal F1 score on the training examples. Our synthesis technique is compositional, prunes the search space by exploiting a monotonicity property of the DSL, and uses transductive learning to select programs with good generalization power. We have implemented these ideas in a new tool called WebQA and evaluate it on 25 different tasks across multiple domains. Our experiments show that WebQA significantly outperforms existing tools such as state-of-the-art question answering models and wrapper induction systems.

RbSyn: type- and effect-guided program synthesis

In recent years, researchers have explored component-based synthesis, which aims to automatically construct programs that operate by composing calls to existing APIs. However, prior work has not considered efficient synthesis of methods with side effects, e.g., web app methods that update a database. In this paper, we introduce RbSyn, a novel type- and effect-guided synthesis tool for Ruby. An RbSyn synthesis goal is specified as the type for the target method and a series of test cases it must pass. RbSyn works by recursively generating well-typed candidate method bodies whose write effects match the read effects of the test case assertions. After finding a set of candidates that separately satisfy each test, RbSyn synthesizes a solution that branches to execute the correct candidate code under the appropriate conditions. We formalize RbSyn on a core, object-oriented language λsyn and describe how the key ideas of the model are scaled-up in our implementation for Ruby. We evaluated RbSyn on 19 benchmarks, 12 of which come from popular, open-source Ruby apps. We found that RbSyn synthesizes correct solutions for all benchmarks, with 15 benchmarks synthesizing in under 9 seconds, while the slowest benchmark takes 83 seconds. Using observed reads to guide synthesize is effective: using type-guidance alone times out on 10 of 12 app benchmarks. We also found that using less precise effect annotations leads to worse synthesis performance. In summary, we believe type- and effect-guided synthesis is an important step forward in synthesis of effectful methods from test cases.

High performance correctly rounded math libraries for 32-bit floating point representations

This paper proposes a set of techniques to develop correctly rounded math libraries for 32-bit float and posit types. It enhances our RLIBM approach that frames the problem of generating correctly rounded libraries as a linear programming problem in the context of 16-bit types to scale to 32-bit types. Specifically, this paper proposes new algorithms to (1) generate polynomials that produce correctly rounded outputs for all inputs using counterexample guided polynomial generation, (2) generate efficient piecewise polynomials with bit-pattern based domain splitting, and (3) deduce the amount of freedom available to produce correct results when range reduction involves multiple elementary functions. The resultant math library for the 32-bit float type is faster than state-of-the-art math libraries while producing the correct output for all inputs. We have also developed a set of correctly rounded elementary functions for 32-bit posits.

Porcupine: a synthesizing compiler for vectorized homomorphic encryption

Homomorphic encryption (HE) is a privacy-preserving technique that enables computation directly on encrypted data. Despite its promise, HE has seen limited use due to performance overheads and compilation challenges. Recent work has made significant advances to address the performance overheads but automatic compilation of efficient HE kernels remains relatively unexplored.

This paper presents Porcupine, an optimizing compiler that generates vectorized HE code using program synthesis. HE poses three major compilation challenges: it only supports a limited set of SIMD-like operators, it uses long-vector operands, and decryption can fail if ciphertext noise growth is not managed properly. Porcupine captures the underlying HE operator behavior so that it can automatically reason about the complex trade-offs imposed by these challenges to generate optimized, verified HE kernels. To improve synthesis time, we propose a series of optimizations including a sketch design tailored to HE to narrow the program search space. We evaluate Porcupine using a set of kernels and show speedups of up to 52% (25% geometric mean) compared to heuristic-driven hand-optimized kernels. Analysis of Porcupine’s synthesized code reveals that optimal solutions are not always intuitive, underscoring the utility of automated reasoning in this domain.

Concolic program repair

Automated program repair reduces the manual effort in fixing program errors. However, existing repair techniques modify a buggy program such that it passes given tests. Such repair techniques do not discriminate between correct patches and patches that overfit the available tests (breaking untested but desired functionality). We propose an integrated approach for detecting and discarding overfitting patches via systematic co-exploration of the patch space and input space. We leverage concolic path exploration to systematically traverse the input space (and generate inputs), while ruling out significant parts of the patch space. Given a long enough time budget, this approach allows a significant reduction in the pool of patch candidates, as shown by our experiments. We implemented our technique in the form of a tool called 'CPR' and evaluated its efficacy in reducing the patch space by discarding overfitting patches from a pool of plausible patches. We evaluated our approach for fixing real-world software vulnerabilities and defects, for fixing functionality errors in programs drawn from SV-COMP benchmarks used in software verification, as well as for test-suite guided repair. In our experiments, we observed a patch space reduction due to our concolic exploration of up to 74% for fixing software vulnerabilities and up to 63% for SV-COMP programs. Our technique presents the viewpoint of gradual correctness - repair run over longer time leads to less overfitting fixes.

Concise, type-safe, and efficient structural diffing

A structural diffing algorithm compares two pieces of tree-shaped data and computes their difference. Existing structural diffing algorithms either produce concise patches or ensure type safety, but never both. We present a new structural diffing algorithm called truediff that achieves both properties by treating subtrees as mutable, yet linearly typed resources. Mutation is required to derive concise patches that only mention changed nodes, but, in contrast to prior work, truediff guarantees all intermediate trees are well-typed. We formalize type safety, prove truediff has linear run time, and evaluate its performance and the conciseness of the derived patches empirically for real-world Python documents. While truediff ensures type safety, the size of its patches is on par with Gumtree, a popular untyped diffing implementation. Regardless, truediff outperforms Gumtree and a typed diffing implementation by an order of magnitude.

CoStar: a verified ALL(*) parser

Parsers are security-critical components of many software systems, and verified parsing therefore has a key role to play in secure software design. However, existing verified parsers for context-free grammars are limited in their expressiveness, termination properties, or performance characteristics. They are only compatible with a restricted class of grammars, they are not guaranteed to terminate on all inputs, or they are not designed to be performant on grammars for real-world programming languages and data formats.

In this work, we present CoStar, a verified parser that addresses these limitations. The parser is implemented with the Coq Proof Assistant and is based on the ALL(*) parsing algorithm. CoStar is sound and complete for all non-left-recursive grammars; it produces a correct parse tree for its input whenever such a tree exists, and it correctly detects ambiguous inputs. CoStar also provides strong termination guarantees; it terminates without error on all inputs when applied to a non-left-recursive grammar. Finally, CoStar achieves linear-time performance on a range of unambiguous grammars for commonly used languages and data formats.

Automated conformance testing for JavaScript engines via deep compiler fuzzing

JavaScript (JS) is a popular, platform-independent programming language. To ensure the interoperability of JS programs across different platforms, the implementation of a JS engine should conform to the ECMAScript standard. However, doing so is challenging as there are many subtle definitions of API behaviors, and the definitions keep evolving.

We present COMFORT, a new compiler fuzzing framework for detecting JS engine bugs and behaviors that deviate from the ECMAScript standard. COMFORT leverages the recent advance in deep learning-based language models to automatically generate JS test code. As a departure from prior fuzzers, COMFORT utilizes the well-structured ECMAScript specifications to automatically generate test data along with the test programs to expose bugs that could be overlooked by the developers or manually written test cases. COMFORT then applies differential testing methodologies on the generated test cases to expose standard conformance bugs. We apply COMFORT to ten mainstream JS engines. In 200 hours of automated concurrent testing runs, we discover bugs in all tested JS engines. We had identified 158 unique JS engine bugs, of which 129 have been verified, and 115 have already been fixed by the developers. Furthermore, 21 of the COMFORT-generated test cases have been added to Test262, the official ECMAScript conformance test suite.

Beyond the elementary representations of program invariants over algebraic data types

First-order logic is a natural way of expressing properties of computation. It is traditionally used in various program logics for expressing the correctness properties and certificates. Although such representations are expressive for some theories, they fail to express many interesting properties of algebraic data types (ADTs). In this paper, we explore three different approaches to represent program invariants of ADT-manipulating programs: tree automata, and first-order formulas with or without size constraints. We compare the expressive power of these representations and prove the negative definability of both first-order representations using the pumping lemmas. We present an approach to automatically infer program invariants of ADT-manipulating programs by a reduction to a finite model finder. The implementation called RInGen has been evaluated against state-of-the-art invariant synthesizers and has been experimentally shown to be competitive. In particular, program invariants represented by automata are capable of expressing more complex properties of computation and their automatic construction is often less expensive.

Fast and precise certification of transformers

We present DeepT, a novel method for certifying Transformer networks based on abstract interpretation. The key idea behind DeepT is our new Multi-norm Zonotope abstract domain, an extension of the classical Zonotope designed to handle ℓ1 and ℓ2-norm bound perturbations. We introduce all Multi-norm Zonotope abstract transformers necessary to handle these complex networks, including the challenging softmax function and dot product. Our evaluation shows that DeepT can certify average robustness radii that are 28× larger than the state-of-the-art, while scaling favorably. Further, for the first time, we certify Transformers against synonym attacks on long sequences of words, where each word can be replaced by any synonym. DeepT achieves a high certification success rate on sequences of words where enumeration-based verification would take 2 to 3 orders of magnitude more time.

Trace-based control-flow analysis

We define a small-step semantics for the untyped λ-calculus, that traces the β-reductions that occur during evaluation. By abstracting the computation traces, we reconstruct k-CFA using abstract interpretation, and justify constraint-based k-CFA in a semantic way. The abstract interpretation of the trace semantics also paves the way for introducing widening operators in CFA that go beyond existing analyses, that are all based on exploring a finite state space. We define ∇CFA, a widening-based analysis that limits the cycles in call stacks, and can achieve better precision than k-CFA at a similar cost.

Compiling Stan to generative probabilistic languages and extension to deep probabilistic programming

Stan is a probabilistic programming language that is popular in the statistics community, with a high-level syntax for expressing probabilistic models. Stan differs by nature from generative probabilistic programming languages like Church, Anglican, or Pyro. This paper presents a comprehensive compilation scheme to compile any Stan model to a generative language and proves its correctness. We use our compilation scheme to build two new backends for the Stanc3 compiler targeting Pyro and NumPyro. Experimental results show that the NumPyro backend yields a 2.3x speedup compared to Stan in geometric mean over 26 benchmarks. Building on Pyro we extend Stan with support for explicit variational inference guides and deep probabilistic models. That way, users familiar with Stan get access to new features without having to learn a fundamentally new language.

Filling typed holes with live GUIs

Text editing is powerful, but some types of expressions are more naturally represented and manipulated graphically. Examples include expressions that compute colors, music, animations, tabular data, plots, diagrams, and other domain-specific data structures. This paper introduces live literals, or livelits, which allow clients to fill holes of types like these by directly manipulating a user-defined GUI embedded persistently into code. Uniquely, livelits are compositional: a livelit GUI can itself embed spliced expressions, which are typed, lexically scoped, and can in turn embed other livelits. Livelits are also uniquely live: a livelit can provide continuous feedback about the run-time implications of the client’s choices even when splices mention bound variables, because the system continuously gathers closures associated with the hole that the livelit is filling. We integrate livelits into Hazel, a live hole-driven programming environment, and describe case studies that exercise these novel capabilities. We then define a simply typed livelit calculus, which specifies how livelits operate as live graphical macros. The metatheory of macro expansion has been mechanized in Agda.

Concurrent deferred reference counting with constant-time overhead

We present a safe automatic memory reclamation approach for concurrent programs, and show that it is both theoretically and practically efficient. Our approach combines ideas from referencing counting and hazard pointers in a novel way to implement concurrent reference counting with wait-free, constant-time overhead. It overcomes the limitations of previous approaches by significantly reducing modifications to, and hence contention on, the reference counts. Furthermore, it is safer and easier to use than manual approaches. Our technique involves using a novel generalization of hazard pointers to defer reference-count decrements until no other process can be incrementing them, and to defer or elide reference-count increments for short-lived references.

We have implemented the approach as a C++ library and compared it experimentally to several methods including existing atomic reference-counting libraries and state-of-the art manual techniques. Our results indicate that our technique is faster than existing reference-counting implementations, and competitive with manual memory reclamation techniques. More importantly, it is significantly safer than manual techniques since objects are reclaimed automatically.

Quantum abstract interpretation

In quantum computing, the basic unit of information is a qubit. Simulation of a general quantum program takes exponential time in the number of qubits, which makes simulation infeasible beyond 50 qubits on current supercomputers. So, for the understanding of larger programs, we turn to static techniques. In this paper, we present an abstract interpretation of quantum programs and we use it to automatically verify assertions in polynomial time. Our key insight is to let an abstract state be a tuple of projections. For such domains, we present abstraction and concretization functions that form a Galois connection and we use them to define abstract operations. Our experiments on a laptop have verified assertions about the Bernstein-Vazirani, GHZ, and Grover benchmarks with 300 qubits.

Central moment analysis for cost accumulators in probabilistic programs

For probabilistic programs, it is usually not possible to automatically derive exact information about their properties, such as the distribution of states at a given program point. Instead, one can attempt to derive approximations, such as upper bounds on tail probabilities. Such bounds can be obtained via concentration inequalities, which rely on the moments of a distribution, such as the expectation (the first raw moment) or the variance (the second central moment). Tail bounds obtained using central moments are often tighter than the ones obtained using raw moments, but automatically analyzing central moments is more challenging.

This paper presents an analysis for probabilistic programs that automatically derives symbolic upper and lower bounds on variances, as well as higher central moments, of cost accumulators. To overcome the challenges of higher-moment analysis, it generalizes analyses for expectations with an algebraic abstraction that simultaneously analyzes different moments, utilizing relations between them. A key innovation is the notion of moment-polymorphic recursion, and a practical derivation system that handles recursive functions.

The analysis has been implemented using a template-based technique that reduces the inference of polynomial bounds to linear programming. Experiments with our prototype central-moment analyzer show that, despite the analyzer’s upper/lower bounds on various quantities, it obtains tighter tail bounds than an existing system that uses only raw moments, such as expectations.

Synthesizing data structure refinements from integrity constraints

Implementations of many data structures use several correlated fields to improve their performance; however, inconsistencies between these fields can be a source of serious program errors. To address this problem, we propose a new technique for automatically refining data structures from integrity constraints. In particular, consider a data structure D with fields F and methods M, as well as a new set of auxiliary fields F′ that should be added to D. Given this input and an integrity constraint Φ relating F and F′, our method automatically generates a refinement of D that satisfies the provided integrity constraint. Our method is based on a modular instantiation of the CEGIS paradigm and uses a novel inductive synthesizer that augments top-down search with three key ideas. First, it computes necessary preconditions of partial programs to dramatically prune its search space. Second, it augments the grammar with promising new productions by leveraging the computed preconditions. Third, it guides top-down search using a probabilistic context-free grammar obtained by statically analyzing the integrity checking function and the original code base. We evaluated our method on 25 data structures from popular Java projects and show that our method can successfully refine 23 of them. We also compare our method against two state-of-the-art synthesis tools and perform an ablation study to justify our design choices. Our evaluation shows that (1) our method is successful at refining many data structure implementations in the wild, (2) it advances the state-of-the-art in synthesis, and (3) our proposed ideas are crucial for making this technique practical.

Provable repair of deep neural networks

Deep Neural Networks (DNNs) have grown in popularity over the past decade and are now being used in safety-critical domains such as aircraft collision avoidance. This has motivated a large number of techniques for finding unsafe behavior in DNNs. In contrast, this paper tackles the problem of correcting a DNN once unsafe behavior is found. We introduce the provable repair problem, which is the problem of repairing a network N to construct a new network N′ that satisfies a given specification. If the safety specification is over a finite set of points, our Provable Point Repair algorithm can find a provably minimal repair satisfying the specification, regardless of the activation functions used. For safety specifications addressing convex polytopes containing infinitely many points, our Provable Polytope Repair algorithm can find a provably minimal repair satisfying the specification for DNNs using piecewise-linear activation functions. The key insight behind both of these algorithms is the introduction of a Decoupled DNN architecture, which allows us to reduce provable repair to a linear programming problem. Our experimental results demonstrate the efficiency and effectiveness of our Provable Repair algorithms on a variety of challenging tasks.

Integration verification across software and hardware for a simple embedded system

The interfaces between layers of a system are susceptible to bugs if developers of adjacent layers proceed under subtly different assumptions. Formal verification of two layers against the same formal model of the interface between them can be used to shake out these bugs. Doing so for every interface in the system can, in principle, yield unparalleled assurance of the correctness and security of the system as a whole. However, there have been remarkably few efforts that carry out this exercise, and all of them have simplified the task by restricting interactivity of the application, inventing new simplified instruction sets, and using unrealistic input and output mechanisms. We report on the first verification of a realistic embedded system, with its application software, device drivers, compiler, and RISC-V processor represented inside the Coq proof assistant as one mathematical object, with a machine-checked proof of functional correctness. A key challenge is structuring the proof modularly, so that further refinement of the components or expansion of the system can proceed without revisiting the rest of the system.

Symbolic Boolean derivatives for efficiently solving extended regular expression constraints

The manipulation of raw string data is ubiquitous in security-critical software, and verification of such software relies on efficiently solving string and regular expression constraints via SMT. However, the typical case of Boolean combinations of regular expression constraints exposes blowup in existing techniques. To address solvability of such constraints, we propose a new theory of derivatives of symbolic extended regular expressions (extended meaning that complement and intersection are incorporated), and show how to apply this theory to obtain more efficient decision procedures. Our implementation of these ideas, built on top of Z3, matches or outperforms state-of-the-art solvers on standard and handwritten benchmarks, showing particular benefits on examples with Boolean combinations.

Our work is the first formalization of derivatives of regular expressions which both handles intersection and complement and works symbolically over an arbitrary character theory. It unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform. It relies on a parsimonious augmentation of regular expressions: a construct for symbolic conditionals is shown to be sufficient to obtain relevant closure properties for derivatives over extended regular expressions.

Abstraction for conflict-free replicated data types

Strong eventual consistency (SEC) has been used as a classic notion of correctness for Conflict-Free Replicated Data Types (CRDTs). However, it does not give proper abstractions of functionality, thus is not helpful for modular verification of client programs using CRDTs. We propose a new correctness formulation for CRDTs, called Abstract Converging Consistency (ACC), to specify both data consistency and functional correctness. ACC gives abstract atomic specifications (as an abstraction) to CRDT operations, and establishes consistency between the concrete execution traces and the execution using the abstract atomic operations. The abstraction allows us to verify the CRDT implementation and its client programs separately, resulting in more modular and elegant proofs than monolithic approaches for whole program verification. We give a generic proof method to verify ACC of CRDT implementations, and a rely-guarantee style program logic to verify client programs. Our Abstraction theorem shows that ACC is equivalent to contextual refinement, linking the verification of CRDT implementations and clients together to derive functional correctness of whole programs.

Boosting SMT solver performance on mixed-bitwise-arithmetic expressions

Satisfiability Modulo Theories (SMT) solvers have been widely applied in automated software analysis to reason about the queries that encode the essence of program semantics, relieving the heavy burden of manual analysis. Many SMT solving techniques rely on solving Boolean satisfiability problem (SAT), which is an NP-complete problem, so they use heuristic search strategies to seek possible solutions, especially when no known theorem can efficiently reduce the problem. An emerging challenge, named Mixed-Bitwise-Arithmetic (MBA) obfuscation, impedes SMT solving by constructing identity equations with both bitwise operations (and, or, negate) and arithmetic computation (add, minus, multiply). Common math theorems for bitwise or arithmetic computation are inapplicable to simplifying MBA equations, leading to performance bottlenecks in SMT solving.

In this paper, we first scrutinize solvers' performance on solving different categories of MBA expressions: linear, polynomial, and non-polynomial. We observe that solvers can handle simple linear MBA expressions, but facing a severe performance slowdown when solving complex linear and non-linear MBA expressions. The root cause is that complex MBA expressions break the reduction laws for pure arithmetic or bitwise computation. To boost solvers' performance, we propose a semantic-preserving transformation to reduce the mixing degree of bitwise and arithmetic operations. We first calculate a signature vector based on the truth table extracted from an MBA expression, which captures the complete MBA semantics. Next, we generate a simpler MBA expression from the signature vector. Our large-scale evaluation on 3000 complex MBA equations shows that our technique significantly boost modern SMT solvers' performance on solving MBA formulas.

Distance-in-time versus distance-in-space

Cache behavior is one of the major factors that influence the performance of applications. Most of the existing compiler techniques that target cache memories focus exclusively on reducing data reuse distances in time (DIT). However, current manycore systems employ distributed on-chip caches that are connected using an on-chip network. As a result, a reused data element/block needs to travel over this on-chip network, and the distance to be traveled -- reuse distance in space (DIS) -- can be as influential in dictating application performance as reuse DIT. This paper represents the first attempt at defining a compiler framework that accommodates both DIT and DIS. Specifically, it first classifies data reuses into four groups: G1: (low DIT, low DIS), G2: (high DIT, low DIS), G3: (low DIT, high DIS), and G4: (high DIT, high DIS). Then, observing that reuses in G1 represent the ideal case and there is nothing much to be done in computations in G4, it proposes a "reuse transfer" strategy that transfers select reuses between G2 and G3, eventually, transforming each reuse to either G1 or G4. Finally, it evaluates the proposed strategy using a set of 10 multithreaded applications. The collected results reveal that the proposed strategy reduces parallel execution times of the tested applications between 19.3% and 33.3%.

An efficient interpreter for Datalog by de-specializing relations

Datalog is becoming increasingly popular as a standard tool for a variety of use cases. Modern Datalog engines can achieve high performance by specializing data structures for relational operations. For example, the Datalog engine Soufflé achieves high performance with a synthesizer that specializes data structures for relations. However, the synthesizer cannot always be deployed, and a fast interpreter is required.

This work introduces the design and implementation of the Soufflé Tree Interpreter (STI). Key for the performance of the STI is the support for fast operations on relations. We obtain fast operations by de-specializing data structures so that they can work in a virtual execution environment. Our new interpreter achieves a competitive performance slowdown between 1.32 and 5.67× when compared to synthesized code. If compile time overheads of the synthesizer are also considered, the interpreter can be 6.46× faster on average for the first run.

Adaptive restarts for stochastic synthesis

We consider the problem of program synthesis from input-output examples via stochastic search. We identify a robust feature of stochastic synthesis: The search often progresses through a series of discrete plateaus. We observe that the distribution of synthesis times is often heavy-tailed and analyze how these distributions arise. Based on these insights, we present an algorithm that speeds up synthesis by an order of magnitude over the naive algorithm currently used in practice. Our experimental results are obtained in part using a new program synthesis benchmark for superoptimization distilled from widely used production code.

Scooter & Sidecar: a domain-specific approach to writing secure database migrations

Web applications often handle large amounts of sensitive user data. Modern secure web frameworks protect this data by (1) using declarative languages to specify security policies alongside database schemas and (2) automatically enforcing these policies at runtime. Unfortunately, these frameworks do not handle the very common situation in which the schemas or the policies need to evolve over time---and updates to schemas and policies need to be performed in a carefully coordinated way. Mistakes during schema or policy migrations can unintentionally leak sensitive data or introduce privilege escalation bugs. In this work, we present a domain-specific language (Scooter) for expressing schema and policy migrations, and an associated SMT-based verifier (Sidecar) which ensures that migrations are secure as the application evolves. We describe the design of Scooter and Sidecar and show that our framework can be used to express realistic schemas, policies, and migrations, without giving up on runtime or verification performance.

When threads meet events: efficient and precise static race detection with origins

Data races are among the worst bugs in software in that they exhibit non-deterministic symptoms and are notoriously difficult to detect. The problem is exacerbated by interactions between threads and events in real-world applications. We present a novel static analysis technique, O2, to detect data races in large complex multithreaded and event-driven software. O2 is powered by “origins”, an abstraction that unifies threads and events by treating them as entry points of code paths attributed with data pointers. Origins in most cases are inferred automatically, but can also be specified by developers. More importantly, origins provide an efficient way to precisely reason about shared memory and pointer aliases.

Together with several important design choices for race detection, we have implemented O2 for both C/C++ and Java/Android applications and applied it to a wide range of open-source software. O2 has found new races in every single real-world code base we evaluated with, including Linux kernel, Redis, OVS, Memcached, Hadoop, Tomcat, ZooKeeper and Firefox Android. Moreover, O2 scales to millions of lines of code in a few minutes, on average 70x faster (up to 568x) compared to an existing static analysis tool from our prior work, and reduces false positives by 77%. We also compared O2 with the state-of-the-art static race detection tool, RacerD, showing highly promising results. At the time of writing, O2 has revealed more than 40 unique previously unknown races that have been confirmed or fixed by developers.

Viaduct: an extensible, optimizing compiler for secure distributed programs

Modern distributed systems involve interactions between principals with limited trust, so cryptographic mechanisms are needed to protect confidentiality and integrity. At the same time, most developers lack the training to securely employ cryptography. We present Viaduct, a compiler that transforms high-level programs into secure, efficient distributed realizations. Viaduct's source language allows developers to declaratively specify security policies by annotating their programs with information flow labels. The compiler uses these labels to synthesize distributed programs that use cryptography efficiently while still defending the source-level security policy. The Viaduct approach is general, and can be easily extended with new security mechanisms.

Our implementation of the Viaduct compiler comes with an extensible runtime system that includes plug-in support for multiparty computation, commitments, and zero-knowledge proofs. We have evaluated the system on a set of benchmarks, and the results indicate that our approach is feasible and can use cryptography in efficient, nontrivial ways.

Reticle: a virtual machine for programming modern FPGAs

Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past—they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization.

Polynomial reachability witnesses via Stellensätze

We consider the fundamental problem of reachability analysis over imperative programs with real variables. Previous works that tackle reachability are either unable to handle programs consisting of general loops (e.g. symbolic execution), or lack completeness guarantees (e.g. abstract interpretation), or are not automated (e.g. incorrectness logic). In contrast, we propose a novel approach for reachability analysis that can handle general and complex loops, is complete, and can be entirely automated for a wide family of programs. Through the notion of Inductive Reachability Witnesses (IRWs), our approach extends ideas from both invariant generation and termination to reachability analysis.

We first show that our IRW-based approach is sound and complete for reachability analysis of imperative programs. Then, we focus on linear and polynomial programs and develop automated methods for synthesizing linear and polynomial IRWs. In the linear case, we follow the well-known approaches using Farkas' Lemma. Our main contribution is in the polynomial case, where we present a push-button semi-complete algorithm. We achieve this using a novel combination of classical theorems in real algebraic geometry, such as Putinar's Positivstellensatz and Hilbert's Strong Nullstellensatz. Finally, our experimental results show we can prove complex reachability objectives over various benchmarks that were beyond the reach of previous methods.

Sound probabilistic inference via guide types

Probabilistic programming languages aim to describe and automate Bayesian modeling and inference. Modern languages support programmable inference, which allows users to customize inference algorithms by incorporating guide programs to improve inference performance. For Bayesian inference to be sound, guide programs must be compatible with model programs. One pervasive but challenging condition for model-guide compatibility is absolute continuity, which requires that the model and guide programs define probability distributions with the same support.

This paper presents a new probabilistic programming language that guarantees absolute continuity, and features general programming constructs, such as branching and recursion. Model and guide programs are implemented as coroutines that communicate with each other to synchronize the set of random variables they sample during their execution. Novel guide types describe and enforce communication protocols between coroutines. If the model and guide are well-typed using the same protocol, then they are guaranteed to enjoy absolute continuity. An efficient algorithm infers guide types from code so that users do not have to specify the types. The new programming language is evaluated with an implementation that includes the type-inference algorithm and a prototype compiler that targets Pyro. Experiments show that our language is capable of expressing a variety of probabilistic models with nontrivial control flow and recursion, and that the coroutine-based computation does not introduce significant overhead in actual Bayesian inference.

SPPL: probabilistic programming with fast exact symbolic inference

We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints. We formalize SPPL via a novel translation strategy from probabilistic programs to sum-product expressions and give sound exact algorithms for conditioning on and computing probabilities of events. SPPL imposes a collection of restrictions on probabilistic programs to ensure they can be translated into sum-product expressions, which allow the system to leverage new techniques for improving the scalability of translation and inference by automatically exploiting probabilistic structure. We implement a prototype of SPPL with a modular architecture and evaluate it on benchmarks the system targets, showing that it obtains up to 3500x speedups over state-of-the-art symbolic systems on tasks such as verifying the fairness of decision tree classifiers, smoothing hidden Markov models, conditioning transformed random variables, and computing rare event probabilities.

Reverse engineering for reduction parallelization via semiring polynomials

Parallel reduction, which summarizes a given dataset, e.g., the total, average, and maximum, plays a crucial role in parallel programming. This paper presents a new approach, reverse engineering, to automatically discovering nontrivial parallel reductions in sequential programs. The body of the sequential reduction loop is regarded as a black box, and its input-output behaviors are sampled. If the behaviors correspond to a set of linear polynomials over a semiring, a divide-and-conquer parallel reduction is generated. Auxiliary reverse-engineering methods enable a long and nested loop body to be decomposed, which makes our parallelization scheme applicable to various types of reduction loops. This approach is not only simple and efficient but also agnostic to the details of the input program. Its potential is demonstrated through several use case scenarios. A proof-of-concept implementation successfully inferred linear polynomials for nearly all of the 74 benchmarks exhaustively collected from the literature. These characteristics and experimental results demonstrate the promise of the proposed approach, despite its inherent unsoundness.

DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning

We present a system for inductive program synthesis called DreamCoder, which inputs a corpus of synthesis problems each specified by one or a few examples, and automatically derives a library of program components and a neural search policy that can be used to efficiently solve other similar synthesis problems. The library and search policy bootstrap each other iteratively through a variant of "wake-sleep" approximate Bayesian learning. A new refactoring algorithm based on E-graph matching identifies common sub-components across synthesized programs, building a progressively deepening library of abstractions capturing the structure of the input domain. We evaluate on eight domains including classic program synthesis areas and AI tasks such as planning, inverse graphics, and equation discovery. We show that jointly learning the library and neural search policy leads to solving more problems, and solving them more quickly.

Automatically enforcing fresh and consistent inputs in intermittent systems

Intermittently powered energy-harvesting devices enable new applications in inaccessible environments. Program executions must be robust to unpredictable power failures, introducing new challenges in programmability and correctness. One hard problem is that input operations have implicit constraints, embedded in the behavior of continuously powered executions, on when input values can be collected and used. This paper aims to develop a formal framework for enforcing these constraints. We identify two key properties---freshness (i.e., uses of inputs must satisfy the same time constraints as in continuous executions) and temporal consistency (i.e., the collection of a set of inputs must satisfy the same time constraints as in continuous executions). We formalize these properties and show that they can be enforced using atomic regions. We develop Ocelot, an LLVM-based analysis and transformation tool targeting Rust, to enforce these properties automatically. Ocelot provides the programmer with annotations to express these constraints and infers atomic region placement in a program to satisfy them. We then formalize Ocelot's design and show that Ocelot generates correct programs with little performance cost or code changes.

Modular data-race-freedom guarantees in the promising semantics

Local data-race-freedom guarantees, ensuring strong semantics for locations accessed by non-racy instructions, provide a fruitful methodology for modular reasoning in relaxed memory concurrency. We observe that standard compiler optimizations are in inherent conflict with such guarantees in general fully-relaxed memory models. Nevertheless, for a certain strengthening of the promising model by Lee et al. that only excludes relaxed RMW-store reorderings, we establish multiple useful local data-racefreedom guarantees that enhance the programmability aspect of the model.We also demonstrate that the performance price of forbidding these reorderings is insignificant. To the best of our knowledge, these results are the first to identify a model that includes the standard concurrency constructs, supports the efficient mapping of relaxed reads and writes to plain hardware loads and stores, and yet validates several local data-race-freedom guarantees. To gain confidence, our results are fully mechanized in Coq.

DNNFusion: accelerating deep neural networks execution with advanced operator fusion

Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep with hundreds or even thousands of operator layers, leading to high memory and computational requirements for inference. Operator fusion (or kernel/layer fusion) is key optimization in many state-of-the-art DNN execution frameworks, such as TensorFlow, TVM, and MNN, that aim to improve the efficiency of the DNN inference. However, these frameworks usually adopt fusion approaches based on certain patterns that are too restrictive to cover the diversity of operators and layer connections, especially those seen in many extremely deep models. Polyhedral-based loop fusion techniques, on the other hand, work on a low-level view of the computation without operator-level information, and can also miss potential fusion opportunities. To address this challenge, this paper proposes a novel and extensive loop fusion framework called DNNFusion. The basic idea of this work is to work at an operator view of DNNs, but expand fusion opportunities by developing a classification of both individual operators and their combinations. In addition, DNNFusion includes 1) a novel mathematical-property-based graph rewriting framework to reduce evaluation costs and facilitate subsequent operator fusion, 2) an integrated fusion plan generation that leverages the high-level analysis and accurate light-weight profiling, and 3) additional optimizations during fusion code generation. DNNFusion is extensively evaluated on 15 DNN models with varied types of tasks, model sizes, and layer counts. The evaluation results demonstrate that DNNFusion finds up to 8.8 × higher fusion opportunities, outperforms four state-of-the-art DNN execution frameworks with 9.3× speedup. The memory requirement reduction and speedups can enable the execution of many of the target models on mobile devices and even make them part of a real-time application.

SyRust: automatic testing of Rust libraries with semantic-aware program synthesis

Rust’s type system ensures the safety of Rust programs; however, programmers can side-step some of the strict typing rules by using the unsafe keyword. A common use of unsafe Rust is by libraries. Bugs in these libraries undermine the safety of the entire Rust program. Therefore, it is crucial to thoroughly test library APIs to rule out bugs. Unfortunately, such testing relies on programmers to manually construct test cases, which is an inefficient and ineffective process.

The goal of this paper is to develop a methodology for automatically generating Rust programs to effectively test Rust library APIs. The main challenge is to synthesize well-typed Rust programs to account for proper chaining of API calls and Rust’s ownership type system and polymorphic types. We develop a program synthesis technique for Rust library API testing, which relies on a novel logical encoding of typing constraints from Rust’s ownership type system. We implement SyRust, a testing framework for Rust libraries that automatically synthesizes semantically valid test cases. Our experiments on 30 popular open-source Rust libraries found 4 new bugs.

Chianina: an evolving graph system for flow- and context-sensitive analyses of million lines of C code

Sophisticated static analysis techniques often have complicated implementations, much of which provides logic for tuning and scaling rather than basic analysis functionalities. This tight coupling of basic algorithms with special treatments for scalability makes an analysis implementation hard to (1) make correct, (2) understand/work with, and (3) reuse for other clients. This paper presents Chianina, a graph system we developed for fully context- and flow-sensitive analysis of large C programs. Chianina overcomes these challenges by allowing the developer to provide only the basic algorithm of an analysis and pushing the tuning/scaling work to the underlying system. Key to the success of Chianina is (1) an evolving graph formulation of flow sensitivity and (2) the leverage of out-of-core, disk support to deal with memory blowup resulting from context sensitivity. We implemented three context- and flow-sensitive analyses on top of Chianina and scaled them to large C programs like Linux (17M LoC) on a single commodity PC.

Path-sensitive sparse analysis without path conditions

Sparse program analysis is fast as it propagates data flow facts via data dependence, skipping unnecessary control flows. However, when path-sensitively checking millions of lines of code, it is still prohibitively expensive because a huge number of path conditions have to be computed and solved via an SMT solver. This paper presents Fusion, a fused approach to inter-procedurally path-sensitive sparse analysis. In Fusion, the SMT solver does not work as a standalone tool on path conditions but directly on the program together with the sparse analysis. Such a fused design allows us to determine the path feasibility without explicitly computing path conditions, not only saving the cost of computing path conditions but also providing an opportunity to enhance the SMT solving algorithm. To the best of our knowledge, Fusion, for the first time, enables whole program bug detection on millions of lines of code in a common personal computer, with the precision of inter-procedural path-sensitivity. Compared to two state-of-the-art tools, Fusion is 10× faster but consumes only 10% of memory on average. Fusion has detected over a hundred bugs in mature open-source software, some of which have even been assigned CVE identifiers due to their security impact.

Cyclic program synthesis

We describe the first approach to automatically synthesizing heap-manipulating programs with auxiliary recursive procedures. Such procedures occur routinely in data structure transformations (e.g., flattening a tree into a list) or traversals of composite structures (e.g., n-ary trees). Our approach, dubbed cyclic program synthesis, enhances deductive program synthesis with a novel application of cyclic proofs. Specifically, we observe that the machinery used to form cycles in cyclic proofs can be reused to systematically and efficiently abduce recursive auxiliary procedures.

We develop the theory of cyclic program synthesis by extending Synthetic Separation Logic (SSL), a logical framework for deductive synthesis of heap-manipulating programs from Separation Logic specifications. We implement our approach as a tool called Cypress, and showcase it by automatically synthesizing a number of programs manipulating linked data structures using recursive auxiliary procedures and mutual recursion, many of which were beyond the reach of existing program synthesis tools.

Hashing modulo alpha-equivalence

In many applications one wants to identify identical subtrees of a program syntax tree. This identification should ideally be robust to alpha-renaming of the program, but no existing technique has been shown to achieve this with good efficiency (better than O(n2) in expression size). We present a new, asymptotically efficient way to hash modulo alpha-equivalence. A key insight of our method is to use a weak (commutative) hash combiner at exactly one point in the construction, which admits an algorithm with O(n (logn)2) time complexity. We prove that the use of the commutative combiner nevertheless yields a strong hash with low collision probability. Numerical benchmarks attest to the asymptotic behaviour of the method.

Phased synthesis of divide and conquer programs

We propose a fully automated method that takes as input an iterative or recursive reference implementation and produces divide-and-conquer implementations that are functionally equivalent to the input. Three interdependent components have to be synthesized: a function that divides the original problem instance, a function that solves each sub-instance, and a function that combines the results of sub-computations. We propose a methodology that splits the synthesis problem into three successive phases, each with a substantially reduced state space compared to the original monolithic task, and therefore substantially more tractable. Our methodology is implemented as an addition to the existing synthesis tool Parsynt, and we demonstrate the efficacy of it by synthesizing highly nontrivial divide-and-conquer implementations of a set of benchmarks fully automatically.

Snapshot-free, transparent, and robust memory reclamation for lock-free data structures

We present a family of safe memory reclamation schemes, Hyaline, which are fast, scalable, and transparent to the underlying lock-free data structures. Hyaline is based on reference counting -- considered impractical for memory reclamation in the past due to high overheads. Hyaline uses reference counters only during reclamation, but not while accessing individual objects, which reduces overheads for object accesses. Since with reference counters, an arbitrary thread ends up freeing memory, Hyaline's reclamation workload is (almost) balanced across all threads, unlike most prior reclamation schemes such as epoch-based reclamation (EBR) or hazard pointers (HP). Hyaline often yields (excellent) EBR-grade performance with (good) HP-grade memory efficiency, which is a challenging trade-off with all existing schemes.

Hyaline schemes offer: (i) high performance; (ii) good memory efficiency; (iii) robustness: bounding memory usage even in the presence of stalled threads, a well-known problem with EBR; (iv) transparency: supporting virtually unbounded number of threads (or concurrent entities) that can be created and deleted dynamically, and effortlessly join existent workload; (v) autonomy: avoiding special OS mechanisms and being non-intrusive to runtime or compiler environments; (vi) simplicity: enabling easy integration into unmanaged C/C++ code; and (vii) generality: supporting many data structures. All existing schemes lack one or more properties.

We have implemented and tested Hyaline on x86(-64), ARM32/64, PowerPC, and MIPS. The general approach requires LL/SC or double-width CAS, while a specialized version also works with single-width CAS. Our evaluation reveals that Hyaline's throughput is very high -- it steadily outperforms EBR by 10% in one test and yields 2x gains in oversubscribed scenarios. Hyaline's superior memory efficiency is especially evident in read-dominated workloads.

Logical bytecode reduction

Reducing a failure-inducing input to a smaller one is challenging for input with internal dependencies because most sub-inputs are invalid. Kalhauge and Palsberg made progress on this problem by mapping the task to a reduction problem for dependency graphs that avoids invalid inputs entirely. Their tool J-Reduce efficiently reduces Java bytecode to 24 percent of its original size, which made it the most effective tool until now. However, the output from their tool is often too large to be helpful in a bug report. In this paper, we show that more fine-grained modeling of dependencies leads to much more reduction. Specifically, we use propositional logic for specifying dependencies and we show how this works for Java bytecode. Once we have a propositional formula that specifies all valid sub-inputs, we run an algorithm that finds a small, valid, failure-inducing input. Our algorithm interleaves runs of the buggy program and calls to a procedure that finds a minimal satisfying assignment. Our experiments show that we can reduce Java bytecode to 4.6 percent of its original size, which is 5.3 times better than the 24.3 percent achieved by J-Reduce. The much smaller output is more suitable for bug reports.

Test-case reduction and deduplication almost for free with transformation-based compiler testing

Recent transformation-based approaches to compiler testing look for mismatches between the results of pairs of equivalent programs, where one program is derived from the other by randomly applying semantics-preserving transformations. We present a formulation of transformation-based compiler testing that provides effective test-case reduction almost for free: if transformations are designed to be as small and independent as possible, standard delta debugging can be used to shrink a bug-inducing transformation sequence to a smaller subsequence that still triggers the bug. The bug can then be reported as a delta between an original and minimally-transformed program. Minimized transformation sequences can also be used to heuristically deduplicate a set of bug-inducing tests, recommending manual investigation of those that involve disparate types of transformations and thus may have different root causes. We demonstrate the effectiveness of our approach via a new tool, spirv-fuzz, the first compiler-testing tool for the SPIR-V intermediate representation that underpins the Vulkan GPU programming model.

Proving non-termination by program reversal

We present a new approach to proving non-termination of non-deterministic integer programs. Our technique is rather simple but efficient. It relies on a purely syntactic reversal of the program's transition system followed by a constraint-based invariant synthesis with constraints coming from both the original and the reversed transition system. The latter task is performed by a simple call to an off-the-shelf SMT-solver, which allows us to leverage the latest advances in SMT-solving. Moreover, our method offers a combination of features not present (as a whole) in previous approaches: it handles programs with non-determinism, provides relative completeness guarantees and supports programs with polynomial arithmetic. The experiments performed with our prototype tool RevTerm show that our approach, despite its simplicity and stronger theoretical guarantees, is at least on par with the state-of-the-art tools, often achieving a non-trivial improvement under a proper configuration of its parameters.

Vectorized secure evaluation of decision forests

As the demand for machine learning–based inference increases in tandem with concerns about privacy, there is a growing recognition of the need for secure machine learning, in which secret models can be used to classify private data without the model or data being leaked. Fully Homomorphic Encryption (FHE) allows arbitrary computation to be done over encrypted data, providing an attractive approach to providing such secure inference. While such computation is often orders of magnitude slower than its plaintext counterpart, the ability of FHE cryptosystems to do ciphertext packing—that is, encrypting an entire vector of plaintexts such that operations are evaluated elementwise on the vector—helps ameliorate this overhead, effectively creating a SIMD architecture where computation can be vectorized for more efficient evaluation. Most recent research in this area has targeted regular, easily vectorizable neural network models. Applying similar techniques to irregular ML models such as decision forests remains unexplored, due to their complex, hard-to-vectorize structures.

In this paper we present COPSE, the first system that exploits ciphertext packing to perform decision-forest inference. COPSE consists of a staging compiler that automatically restructures and compiles decision forest models down to a new set of vectorizable primitives for secure inference. We find that COPSE’s compiled models outperform the state of the art across a range of decision forest models, often by more than an order of magnitude, while still scaling well.

Task parallel assembly language for uncompromising parallelism

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation.

In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.

JPortal: precise and efficient control-flow tracing for JVM programs with Intel processor trace

Hardware tracing modules such as Intel Processor Trace perform continuous control-flow tracing of an end-to-end program execution with an ultra-low overhead. PT has been used in a variety of contexts to support applications such as testing, debugging, and performance diagnosis. However, these hardware modules have so far been used only to trace native programs, which are directly compiled down to machine code. As high-level languages (HLL) such as Java and Go become increasingly popular, there is a pressing need to extend these benefits to the HLL community. This paper presents JPortal, a JVM-based profiling tool that bridges the gap between HLL applications and low-level hardware traces by using a set of algorithms to precisely recover an HLL program’s control flow from PT traces. An evaluation of JPortal with the DaCapo benchmark shows that JPortal achieves an overall 80% accuracy for end-to-end control flow profiling with only a 4-16% runtime overhead.

CompCertO: compiling certified open C components

Since the introduction of CompCert, researchers have been refining its language semantics and correctness theorem, and used them as components in software verification efforts. Meanwhile, artifacts ranging from CPU designs to network protocols have been successfully verified, and there is interest in making them interoperable to tackle end-to-end verification at an even larger scale.

Recent work shows that a synthesis of game semantics, refinement-based methods, and abstraction layers has the potential to serve as a common theory of certified components. Integrating certified compilers to such a theory is a critical goal. However, none of the existing variants of CompCert meets the requirements we have identified for this task.

CompCertO extends the correctness theorem of CompCert to characterize compiled program components directly in terms of their interaction with each other. Through a careful and compositional treatment of calling conventions, this is achieved with minimal effort.

Example-guided synthesis of relational queries

Program synthesis tasks are commonly specified via input-output examples. Existing enumerative techniques for such tasks are primarily guided by program syntax and only make indirect use of the examples. We identify a class of synthesis algorithms for programming-by-examples, which we call Example-Guided Synthesis (EGS), that exploits latent structure in the provided examples while generating candidate programs. We present an instance of EGS for the synthesis of relational queries and evaluate it on 86 tasks from three application domains: knowledge discovery, program analysis, and database querying. Our evaluation shows that EGS outperforms state-of-the-art synthesizers based on enumerative search, constraint solving, and hybrid techniques in terms of synthesis time, quality of synthesized programs, and ability to prove unrealizability.

Canary: practical static detection of inter-thread value-flow bugs

Concurrent programs are still prone to bugs arising from the subtle interleavings of threads. Traditional static analysis for concurrent programs, such as data-flow analysis and symbolic execution, has to explicitly explore redundant control states, leading to prohibitive computational complexity.

This paper presents a value flow analysis framework for concurrent programs called Canary that is practical to statically find diversified inter-thread value-flow bugs. Our work is the first to convert the concurrency bug detection to a source-sink reachability problem, effectively reducing redundant thread interleavings. Specifically, we propose a scalable thread-modular algorithm to capture data and interference dependence in a value-flow graph. The relevant edges of value flows are annotated with execution constraints as guards to describe the conditions of value flows. Canary then traverses the graph to detect concurrency defects via tracking the source-sink properties and solving the aggregated guards of value flows with an SMT solver to decide the realizability of interleaving executions. Experiments show that Canary is precise, scalable and practical, detecting over eighteen previously unknown concurrency bugs in large, widely-used software systems with low false positives.

Robustness certification with generative models

Generative neural networks are powerful models capable of learning a wide range of rich semantic image transformations such as altering person's age, head orientation, adding mustache, changing the hair color and many more. At a high level, a generative model effectively produces new and previously unseen images with the desired properties, which can then be used to improve the accuracy of existing models. In this work, we advance the state-of-the-art in verification by bridging the gap between (i) the well studied but limited norm-based and geometric transformations, and (ii) the rich set of semantic transformations used in practice. This problem is especially hard since the images are generated from a highly non-convex image manifold, preventing the use of most existing verifiers, which often rely on convex relaxations. We present a new verifier, called GenProve, which is capable of certifying the rich set of semantic transformations of generative models. GenProve can provide both sound deterministic and probabilistic guarantees, by capturing infinite non-convex sets of activation vectors and distributions over them, while scaling to realistic networks.

Execution reconstruction: harnessing failure reoccurrences for failure reproduction

Reproducing production failures is crucial for software reliability. Alas, existing bug reproduction approaches are not suitable for production systems because they are not simultaneously efficient, effective, and accurate. In this work, we survey prior techniques and show that existing approaches over-prioritize a subset of these properties, and sacrifice the remaining ones. As a result, existing tools do not enable the plethora of proposed failure reproduction use-cases (e.g., debugging, security forensics, fuzzing) for production failures.

We propose Execution Reconstruction (ER), a technique that strikes a better balance between efficiency, effectiveness and accuracy for reproducing production failures. ER uses hardware-assisted control and data tracing to shepherd symbolic execution and reproduce failures. ER’s key novelty lies in identifying data values that are both inexpensive to monitor and useful for eliding the scalability limitations of symbolic execution. ER harnesses failure reoccurrences by iteratively performing tracing and symbolic execution, which reduces runtime overhead. Whereas prior production-grade techniques can only reproduce short executions, ER can reproduce any reoccuring failure. Thus, unlike existing tools, ER reproduces fully replayable executions that can power a variety of debugging and reliabilty use cases. ER incurs on average 0.3% (up to 1.1%) runtime monitoring overhead for a broad range of real-world systems, making it practical for real-world deployment.

Quantitative analysis of assertion violations in probabilistic programs

We consider the fundamental problem of deriving quantitative bounds on the probability that a given assertion is violated in a probabilistic program. We provide automated algorithms that obtain both lower and upper bounds on the assertion violation probability. The main novelty of our approach is that we prove new and dedicated fixed-point theorems which serve as the theoretical basis of our algorithms and enable us to reason about assertion violation bounds in terms of pre and post fixed-point functions. To synthesize such fixed-points, we devise algorithms that utilize a wide range of mathematical tools, including repulsing ranking supermartingales, Hoeffding's lemma, Minkowski decompositions, Jensen's inequality, and convex optimization. On the theoretical side, we provide (i) the first automated algorithm for lower-bounds on assertion violation probabilities, (ii) the first complete algorithm for upper-bounds of exponential form in affine programs, and (iii) provably and significantly tighter upper-bounds than the previous approaches. On the practical side, we show our algorithms can handle a wide variety of programs from the literature and synthesize bounds that are remarkably tighter than previous results, in some cases by thousands of orders of magnitude.

IOOpt: automatic derivation of I/O complexity bounds for affine programs

Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location.

This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation.

We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.

Specification synthesis with constrained Horn clauses

The problem of synthesizing specifications of undefined procedures has a broad range of applications, but the usefulness of the generated specifications depends on their quality. In this paper, we propose a technique for finding maximal and non-vacuous specifications. Maximality allows for more choices for implementations of undefined procedures, and non-vacuity ensures that safety assertions are reachable. To handle programs with complex control flow, our technique discovers not only specifications but also inductive invariants. Our iterative algorithm lazily generalizes non-vacuous specifications in a counterexample-guided loop. The key component of our technique is an effective non-vacuous specification synthesis algorithm. We have implemented the approach in a tool called HornSpec, taking as input systems of constrained Horn clauses. We have experimentally demonstrated the tool's effectiveness, efficiency, and the quality of generated specifications on a range of benchmarks.

Mirror: making lock-free data structures persistent

With the recent launch of the Intel Optane memory platform, non-volatile main memory in the form of fast, dense, byte-addressable non-volatile memory has now become available. Nevertheless, designing crash-resilient algorithms and data structures is complex and error-prone as caches and machine registers are still volatile and the data residing in memory after a crash might not reflect a consistent view of the program state. This complex setting has often led to durable data structures being inefficient or incorrect, especially in the concurrent setting.

In this paper, we present Mirror -- a simple, general automatic transformation that adds durability to lock-free data structures, with a low performance overhead. Moreover, in the current non-volatile main memory configuration, where non-volatile memory operates side-by-side with a standard fast DRAM, our mechanism exploits the hybrid system to substantially improve performance. Evaluation shows a significant performance advantage over NVTraverse, which is the state-of-the-art general transformation technique, and over Intel's concurrent lock-based key-value datastore. Unlike some previous transformations, Mirror does not require any restriction on the lock-free data structure format.

AKG: automatic kernel generation for neural processing units using polyhedral transformations

Existing tensor compilers have proven their effectiveness in deploying deep neural networks on general-purpose hardware like CPU and GPU, but optimizing for neural processing units (NPUs) is still challenging due to the heterogeneous compute units and complicated memory hierarchy.

In this paper, we present AKG, a tensor compiler for NPUs. AKG first lowers the tensor expression language to a polyhedral representation, which is used to automate the memory management of NPUs. Unlike existing approaches that resort to manually written schedules, AKG leverages polyhedral schedulers to perform a much wider class of transformations, and extends the semantics of the polyhedral representation to combine complex tiling techniques and hierarchical fusion strategies. We also implement the domain-specific optimization of convolution in AKG. Moreover, to achieve the optimal performance, we introduce complementary optimizations in code generation, which is followed by an auto-tuner.

We conduct extensive experiments on benchmarks ranging from single operators to end-to-end networks. The experimental results show that AKG can obtain superior performance to both manual scheduling approaches and vendor provided libraries. We believe AKG will cast a light on the follow-up compiler works on NPUs.

Frequent background polling on a shared thread, using light-weight compiler interrupts

Recent work in networking, storage and multi-threading has demonstrated improved performance and scalability by replacing kernel-mode interrupts with high-rate user-space polling. Typically, such polling is performed by a dedicated core. Compiler Interrupts (CIs) instead enable efficient, automatic high-rate polling on a shared thread, which performs other work between polls. CIs are instrumentation-based and light-weight, allowing frequent interrupts with little performance impact. For example, when targeting a 5,000 cycle interval, the median overhead of our fastest CI design is 4% vs. 800% for hardware interrupts, across programs in the SPLASH-2, Phoenix and Parsec benchmark suites running with 32 threads. We evaluate CIs on three systems-level applications: (a) kernel bypass networking with mTCP, (b) joint kernel bypass networking and CPU scheduling with Shenango, and (c) delegation, a message-passing alternative to locking, with FFWD. For each application, we find that CIs offer compelling qualitative and quantitative improvements over the current state of the art. For example, CI-based mTCP achieves ≈2× stock mTCP throughput on a sample HTTP application.

Satisfiability modulo ordering consistency theory for multi-threaded program verification

Analyzing multi-threaded programs is hard due to the number of thread interleavings. Partial orders can be used for modeling and analyzing multi-threaded programs. However, there is no dedicated decision procedure for solving partial-order constraints. In this paper, we propose a novel ordering consistency theory for multi-threaded program verification under sequential consistency, and we elaborate its theory solver, which realizes incremental consistency checking, minimal conflict clause generation, and specialized theory propagation to improve the efficiency of SMT solving. We conducted extensive experiments on credible benchmarks; the results show significant promotion of our approach.

Bliss: auto-tuning complex applications using a pool of diverse lightweight learning models

As parallel applications become more complex, auto-tuning becomes more desirable, challenging, and time-consuming. We propose, Bliss, a novel solution for auto-tuning parallel applications without requiring apriori information about applications, domain-specific knowledge, or instrumentation. Bliss demonstrates how to leverage a pool of Bayesian Optimization models to find the near-optimal parameter setting 1.64× faster than the state-of-the-art approaches.

Termination analysis without the tears

Determining whether a given program terminates is the quintessential undecidable problem. Algorithms for termination analysis may be classified into two groups: (1) algorithms with strong behavioral guarantees that work in limited circumstances (e.g., complete synthesis of linear ranking functions for polyhedral loops), and (2) algorithms that are widely applicable, but have weak behavioral guarantees (e.g., Terminator). This paper investigates the space in between: how can we design practical termination analyzers with useful behavioral guarantees?

This paper presents a termination analysis that is both compositional (the result of analyzing a composite program is a function of the analysis results of its components) and monotone (“more information into the analysis yields more information out”). The paper has two key contributions. The first is an extension of Tarjan’s method for solving path problems in graphs to solve infinite path problems. This provides a foundation upon which to build compositional termination analyses. The second is a collection of monotone conditional termination analyses based on this framework. We demonstrate that our tool ComPACT (Compositional and Predictable Analysis for Conditional Termination) is competitive with state-of-the-art termination tools while providing stronger behavioral guarantees.

On probabilistic termination of functional programs with continuous distributions

We study termination of higher-order probabilistic functional programs with recursion, stochastic conditioning and sampling from continuous distributions. Reasoning about the termination probability of programs with continuous distributions is hard, because the enumeration of terminating executions cannot provide any non-trivial bounds. We present a new operational semantics based on traces of intervals, which is sound and complete with respect to the standard sampling-based semantics, in which (countable) enumeration can provide arbitrarily tight lower bounds. Consequently we obtain the first proof that deciding almost-sure termination (AST) for programs with continuous distributions is Π20-complete (for CbN). We also provide a compositional representation of our semantics in terms of an intersection type system. In the second part, we present a method of proving AST for non-affine programs, i.e., recursive programs that can, during the evaluation of the recursive body, make multiple recursive calls (of a first-order function) from distinct call sites. Unlike in a deterministic language, the number of recursion call sites has direct consequences on the termination probability. Our framework supports a proof system that can verify AST for programs that are well beyond the scope of existing methods. We have constructed prototype implementations of our methods for computing lower bounds on the termination probability, and AST verification.

Practical smart contract sharding with ownership and commutativity analysis

Sharding is a popular way to achieve scalability in blockchain protocols, increasing their throughput by partitioning the set of transaction validators into a number of smaller committees, splitting the workload. Existing approaches for blockchain sharding, however, do not scale well when concurrent transactions alter the same replicated state component—a common scenario in Ethereum-style smart contracts.

We propose a novel approach for efficiently sharding such transactions. It is based on a folklore idea: state-manipulating atomic operations that commute can be processed in parallel, with their cumulative result defined deterministically, while executing non-commuting operations requires one to own the state they alter. We present CoSplit—a static program analysis tool that soundly infers ownership and commutativity summaries for smart contracts and translates those summaries to sharding signatures that are used by the blockchain protocol to maximise parallelism. Our evaluation shows that using CoSplit introduces negligible overhead to the transaction validation cost, while the inferred signatures allow the system to achieve a significant increase in transaction processing throughput for real-world smart contracts.