Datalog is a declarative logic programming language that has been used in a variety of applications, including big-data analytics, language processing, networking and distributed systems, and program analysis.
In this paper, we propose first-class Datalog constraints as a mechanism to construct, compose, and solve Datalog programs at run time. The benefits are twofold: We gain the full power of a functional programming language to operate on Datalog constraints-as-values, while simultaneously we can use Datalog where it really shines: to declaratively express and solve fixpoint problems.
We present an extension of the lambda calculus with first-class Datalog constraints, including its semantics and a type system with row polymorphism based on Hindley-Milner. We prove soundness of the type system and implement it as an extension of the Flix programming language.
Effect handlers have recently gained popularity amongst programming language researchers. Existing type- and effect systems for effect handlers are often complicated and potentially hinder a wide-spread adoption.
We present the language Effekt with the goal to close the gap between research languages with effect handlers and languages for working programmers. The design of Effekt revolves around a different view of effects and effect types. Traditionally, effect types express which side effects a computation might have. In Effekt, effect types express which capabilities a computation requires from its context. While this new point in the design space of effect systems impedes reasoning about purity, we demonstrate that it simplifies the treatment of effect polymorphism and the related issues of effect parametricity and effect encapsulation. To guarantee effect safety, we separate functions from values and treat all functions as second-class. We define the semantics of Effekt as a translation to System Xi, a calculus in explicit capability-passing style.
Static typing can guide programmers if feedback is immediate. Therefore, all major IDEs incrementalize type checking in some way. However, prior approaches to incremental type checking are often specialized and hard to transfer to new type systems. In this paper, we propose a systematic approach for deriving incremental type checkers from textbook-style type system specifications. Our approach is based on compiling inference rules to Datalog, a carefully limited logic programming language for which incremental solvers exist. The key contribution of this paper is to discover an encoding of the infinite typing relation as a finite Datalog relation in a way that yields efficient incremental updates. We implemented the compiler as part of a type system DSL and show that it supports simple types, some local type inference, operator overloading, universal types, and iso-recursive types.
Modern highly-concurrent search data structures, such as search trees, obtain multi-core scalability and performance by having operations traverse the data structure without any synchronization. As a result, however, these algorithms are notoriously difficult to prove linearizable, which requires identifying a point in time in which the traversal's result is correct. The problem is that traversing the data structure as it undergoes modifications leads to complex behaviors, necessitating intricate reasoning about all interleavings of reads by traversals and writes mutating the data structure.
In this paper, we present a general proof technique for proving unsynchronized traversals correct in a significantly simpler manner, compared to typical concurrent reasoning and prior proof techniques. Our framework relies only on sequential properties of traversals and on a conceptually simple and widely-applicable condition about the ways an algorithm's writes mutate the data structure. Establishing that a target data structure satisfies our condition requires only simple concurrent reasoning, without considering interactions of writes and reads. This reasoning can be further simplified by using our framework.
To demonstrate our technique, we apply it to prove several interesting and challenging concurrent binary search trees: the logical-ordering AVL tree, the Citrus tree, and the full contention-friendly tree. Both the logical-ordering tree and the full contention-friendly tree are beyond the reach of previous approaches targeted at simplifying linearizability proofs.
CompCert is a moderately optimizing C compiler with a formal, machine-checked, proof of correctness: after successful compilation, the assembly code has a behavior faithful to the source code. Previously, it only supported target instruction sets with sequential semantics, and did not attempt reordering instructions for optimization.
We present here a CompCert backend for a VLIW core (i.e. with explicit parallelism at the instruction level), the first CompCert backend providing scalable and efficient instruction scheduling. Furthermore, its highly modular implementation can be easily adapted to other VLIW or non-VLIW pipelined processors.
Approximation errors must be taken into account when compiling quantum programs into a low-level gate set. We present a methodology that tracks such errors automatically and then optimizes accuracy parameters to guarantee a specified overall accuracy while aiming to minimize the implementation cost in terms of quantum gates. The core idea of our approach is to extract functions that specify the optimization problem directly from the high-level description of the quantum program. Then, custom compiler passes optimize these functions, turning them into (near-)symbolic expressions for (1) the total error and (2) the implementation cost (e.g., total quantum gate count). All unspecified parameters of the quantum program will show up as variables in these expressions, including accuracy parameters. After solving the corresponding optimization problem, a circuit can be instantiated from the found solution. We develop two prototype implementations, one in C++ based on Clang/LLVM, and another using the Q# compiler infrastructure. We benchmark our prototypes on typical quantum computing programs, including the quantum Fourier transform, quantum phase estimation, and Shor's algorithm.
This paper describes a compiler optimization to eliminates dynamic occurrences of expressions in the format a ← a ⊕ b ⊗ c. The operation ⊕ must admit an identity element z, such that a ⊕ z = a. Also, z must be the absorbing element of ⊗, such that b ⊗ z = z ⊗ c = z. Semirings where ⊕ is the additive operator and ⊗ is the multiplicative operator meet this contract. This pattern is common in high-performance benchmarks—its canonical representative being the multiply-add operation a ← a + b × c. However, several other expressions involving arithmetic and logic operations satisfy the required algebra. We show that the runtime elimination of such assignments can be implemented in a performance-safe way via online profiling. The elimination of dynamic redundancies involving identity and absorbing elements in 35 programs of the LLVM test suite that present semiring patterns brings an average speedup of 1.19x (total optimized time over total unoptimized time) on top of clang -O3. When projected onto the entire test suite (259 programs) the optimization leads to a speedup of 1.025x. Once added onto clang, semiring optimizations approximates it to TACO, a specialized tensor compiler.
Some blockchain programs (smart contracts) have included serious security vulnerabilities. Obsidian is a new typestate-oriented programming language that uses a strong type system to rule out some of these vulnerabilities. Although Obsidian was designed to promote usability to make it as easy as possible to write programs, strong type systems can cause a language to be difficult to use. In particular, ownership, typestate, and assets, which Obsidian uses to provide safety guarantees, have not seen broad adoption together in popular languages and result in significant usability challenges. We performed an empirical study with 20 participants comparing Obsidian to Solidity, which is the language most commonly used for writing smart contracts today. We observed that Obsidian participants were able to successfully complete more of the programming tasks than the Solidity participants. We also found that the Solidity participants commonly inserted asset-related bugs, which Obsidian detects at compile time.
Quantum computers promise to perform certain computations exponentially faster than any classical device. Precise control over their physical implementation and proper shielding from unwanted interactions with the environment become more difficult as the space/time volume of the computation grows. Code optimization is thus crucial in order to reduce resource requirements to the greatest extent possible. Besides manual optimization, previous work has adapted classical methods such as constant-folding and common subexpression elimination to the quantum domain. However, such classically-inspired methods fail to exploit certain optimization opportunities across subroutine boundaries, limiting the effectiveness of software reuse. To address this insufficiency, we introduce an optimization methodology which employs annotations that describe how subsystems are entangled in order to exploit these optimization opportunities. We formalize our approach, prove its correctness, and present benchmarks: Without any prior manual optimization, our methodology is able to reduce, e.g., the qubit requirements of a 64-bit floating-point subroutine by 34×.
We present a programming model and typing discipline for complex multi-robot coordination programming. Our model encompasses both synchronisation through message passing and continuous-time dynamic motion primitives in physical space. We specify continuous-time motion primitives in an assume-guarantee logic that ensures compatibility of motion primitives as well as collision freedom. We specify global behaviour of programs in a choreographic type system that extends multiparty session types with jointly executed motion primitives, predicated refinements, as well as a separating conjunction that allows reasoning about subsets of interacting robots. We describe a notion of well-formedness for global types that ensures motion and communication can be correctly synchronised and provide algorithms for checking well-formedness, projecting a type, and local type checking. A well-typed program is communication safe, motion compatible, and collision free. Our type system provides a compositional approach to ensuring these properties.
We have implemented our model on top of the ROS framework. This allows us to program multi-robot coordination scenarios on top of commercial and custom robotics hardware platforms. We show through case studies that we can model and statically verify quite complex manoeuvres involving multiple manipulators and mobile robots---such examples are beyond the scope of previous approaches.
System call whitelisting is a powerful sandboxing approach that can significantly reduce the capabilities of an attacker if an application is compromised. Given a policy that specifies which system calls can be invoked with what arguments, a sandboxing framework terminates any execution that violates the policy. While this mechanism greatly reduces the attack surface of a system, manually constructing these policies is time-consuming and error-prone. As a result, many applications —including those that take untrusted user input— opt not to use a system call sandbox.
Motivated by this problem, we propose a technique for automatically constructing system call whitelisting policies for a given application and policy DSL. Our method combines static code analysis and program synthesis to construct sound and precise policies that never erroneously terminate the application, while restricting the program’s system call usage as much as possible. We have implemented our approach in a tool called Abhayaand experimentally evaluate it 493 Linux and OpenBSD applications by automatically synthesizing Seccomp-bpfand Pledgepolicies. Our experimental results indicate that Abhayacan efficiently generate useful and precise sandboxes for real-world applications.
Rust’s ownership type system enforces a strict discipline on how memory locations are accessed and shared. This discipline allows the compiler to statically prevent memory errors, data races, inadvertent side effects through aliasing, and other errors that frequently occur in conventional imperative programs. However, the restrictions imposed by Rust’s type system make it difficult or impossible to implement certain designs, such as data structures that require aliasing (e.g. doubly-linked lists and shared caches). To work around this limitation, Rust allows code blocks to be declared as unsafe and thereby exempted from certain restrictions of the type system, for instance, to manipulate C-style raw pointers. Ensuring the safety of unsafe code is the responsibility of the programmer. However, an important assumption of the Rust language, which we dub the Rust hypothesis, is that programmers use Rust by following three main principles: use unsafe code sparingly, make it easy to review, and hide it behind a safe abstraction such that client code can be written in safe Rust.
Understanding how Rust programmers use unsafe code and, in particular, whether the Rust hypothesis holds is essential for Rust developers and testers, language and library designers, as well as tool developers. This paper studies empirically how unsafe code is used in practice by analysing a large corpus of Rust projects to assess the validity of the Rust hypothesis and to classify the purpose of unsafe code. We identify queries that can be answered by automatically inspecting the program’s source code, its intermediate representation MIR, as well as type information provided by the Rust compiler; we complement the results by manual code inspection. Our study supports the Rust hypothesis partially: While most unsafe code is simple and well-encapsulated, unsafe features are used extensively, especially for interoperability with other languages.
Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods can be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. Of late, Graph Neural Network (GNN) was proposed to learn embeddings of programs from their graph representations. Due to the homogeneous (i.e. do not take advantage of the program-specific graph characteristics) and expensive (i.e. require heavy information exchange among nodes in the graph) message-passing procedure, GNN can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of the existing GNN. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to aid models to learn. In particular, GINN focuses exclusively on intervals (generally manifested in looping construct) for mining the feature representation of a program, furthermore, GINN operates on a hierarchy of intervals for scaling the learning to large graphs.
We evaluate GINN for two popular downstream applications: variable misuse prediction and method name prediction. Results show in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer deference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, GINN-based bug detector significantly outperforms GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer, arguably the state-of-the-art static analysis tool, to scan the codebase of 20 highly starred projects on GitHub. Through our manual inspection, we confirm 38 bugs out of 102 warnings raised by GINN-based bug detector compared to 34 bugs out of 129 warnings for Facebook Infer. We have reported 38 bugs GINN caught to developers, among which 11 have been fixed and 12 have been confirmed (fix pending). GINN has shown to be a general, powerful deep neural network for learning precise, semantic program embeddings.
Automatic software plagiarism detection tools are widely used in educational settings to ensure that submitted work was not copied. These tools have grown in use together with the rise in enrollments in computer science programs and the widespread availability of code on-line. Educators rely on the robustness of plagiarism detection tools; the working assumption is that the effort required to evade detection is as high as that required to actually do the assigned work.
This paper shows this is not the case. It presents an entirely automatic program transformation approach, MOSSAD, that defeats popular software plagiarism detection tools. MOSSAD comprises a framework that couples techniques inspired by genetic programming with domain-specific knowledge to effectively undermine plagiarism detectors. MOSSAD is effective at defeating four plagiarism detectors, including Moss and JPlag. MOSSAD is both fast and effective: it can, in minutes, generate modified versions of programs that are likely to escape detection. More insidiously, because of its non-deterministic approach, MOSSAD can, from a single program, generate dozens of variants, which are classified as no more suspicious than legitimate assignments. A detailed study of MOSSAD across a corpus of real student assignments demonstrates its efficacy at evading detection. A user study shows that graduate student assistants consistently rate MOSSAD-generated code as just as readable as authentic student code. This work motivates the need for both research on more robust plagiarism detection tools and greater integration of naturally plagiarism-resistant methodologies like code review into computer science education.
Pressed by the difficulty of writing asynchronous, event-driven code, mainstream languages have recently been building in support for a variety of advanced control-flow features. Meanwhile, experimental language designs have suggested effect handlers as a unifying solution to programmer-defined control effects, subsuming exceptions, generators, and async–await. However, despite these trends, complex control flow—in particular, control flow that exhibits a bidirectional pattern—remains challenging to manage.
We introduce bidirectional algebraic effects, a new programming abstraction that supports bidirectional control transfer in a more natural way. Handlers of bidirectional effects can raise further effects to transfer control back to the site where the initiating effect was raised, and can use themselves to handle their own effects. We present applications of this expressive power, which falls out naturally as we push toward the unification of effectful programming with object-oriented programming. We pin down the mechanism and the unification formally using a core language that makes generalizations to effect operations and effect handlers.
The usual propagation semantics of control effects such as exceptions conflicts with modular reasoning in the presence of effect polymorphism—it breaks parametricity. Bidirectionality exacerbates the problem. Hence, we set out to show the core language, which builds on the existing tunneling semantics for algebraic effects, is not only type-safe (no effects go unhandled), but also abstraction-safe (no effects are accidentally handled). We devise a step-indexed logical-relations model, and construct its parametricity and soundness proofs. These core results are fully mechanized in Coq. While a full-featured compiler is left to future work, experiments show that as a first-class language feature, bidirectional handlers can be implemented efficiently.
Probabilistic programming languages (PPLs) are an expressive means of representing and reasoning about probabilistic models. The computational challenge of probabilistic inference remains the primary roadblock for applying PPLs in practice. Inference is fundamentally hard, so there is no one-size-fits all solution. In this work, we target scalable inference for an important class of probabilistic programs: those whose probability distributions are discrete. Discrete distributions are common in many fields, including text analysis, network verification, artificial intelligence, and graph analysis, but they prove to be challenging for existing PPLs.
We develop a domain-specific probabilistic programming language called Dice that features a new approach to exact discrete probabilistic program inference. Dice exploits program structure in order to factorize inference, enabling us to perform exact inference on probabilistic programs with hundreds of thousands of random variables. Our key technical contribution is a new reduction from discrete probabilistic programs to weighted model counting (WMC). This reduction separates the structure of the distribution from its parameters, enabling logical reasoning tools to exploit that structure for probabilistic inference. We (1) show how to compositionally reduce Dice inference to WMC, (2) prove this compilation correct with respect to a denotational semantics, (3) empirically demonstrate the performance benefits over prior approaches, and (4) analyze the types of structure that allow Dice to scale to large probabilistic programs.
Satisfiability modulo theories (SMT) solving has become a critical part of many static analyses, including symbolic execution, refinement type checking, and model checking. We propose Formulog, a domain-specific language that makes it possible to write a range of SMT-based static analyses in a way that is both close to their formal specifications and amenable to high-level optimizations and efficient evaluation.
Formulog extends the logic programming language Datalog with a first-order functional language and mechanisms for representing and reasoning about SMT formulas; a novel type system supports the construction of expressive formulas, while ensuring that neither normal evaluation nor SMT solving goes wrong. Our case studies demonstrate that a range of SMT-based analyses can naturally and concisely be encoded in Formulog, and that — thanks to this encoding — high-level Datalog-style optimizations can be automatically and advantageously applied to these analyses.
Ensuring computations are unit-wise consistent is an important task in software development. Numeric computations are usually performed with primitive types instead of abstract data types, which results in very weak static guarantees about correct usage and conversion of units. This paper presents PUnits, a pluggable type system for expressive units of measurement types and a precise, whole-program inference approach for these types. PUnits can be used in three modes: (1) modularly check the correctness of a program, (2) ensure a possible unit typing exists, and (3) annotate a program with units. Annotation mode allows human inspection and is essential since having a valid typing does not guarantee that the inferred specification expresses design intent. PUnits is the first units type system with this capability. Compared to prior work, PUnits strikes a novel balance between expressiveness, inference complexity, and annotation effort. We implement PUnits for Java and evaluate it by specifying the correct usage of frequently used JDK methods. We analyze 234k lines of code from eight open-source scientific computing projects with PUnits. We compare PUnits against an encapsulation-based units API (the javax.measure package) and discovered unit errors that the API failed to find. PUnits infers 90 scientific units for five of the projects and generates well-specified applications. The experiments show that PUnits is an effective, sound, and scalable alternative to using encapsulation-based units APIs, enabling Java developers to reap the performance benefits of using primitive types instead of abstract data types for unit-wise consistent scientific computations.
Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposing a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named WATCHER, that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support together to perform the diagnosis without human involvement. It further proposes two optimizations to reduce the diagnosis time and diagnose failures with control flow hijacks. WATCHER can be easily deployed, without requiring custom hardware or operating system, program modification, or recompilation. We evaluate WATCHER with 24 program failures in real-world deployed software, including large-scale applications, such as Memcached, SQLite, and OpenJPEG. Experimental results show that WATCHER can accurately identify the root causes in only a few seconds.
Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing the build operations that were affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures, and non-deterministic and incorrect outputs.
To reason about arbitrary build executions, we present BuildFS, a generally-applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operation) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of single full build, translates it into BuildFS, and uncovers faults by checking for corresponding violations.
We evaluate the effectiveness, efficiency, and applicability of our approach by examining 612 Make and Gradle projects. Notably, thanks to our treatment of build executions, our method is the first to handle JVM-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (247 issues found in 47 open-source projects have been confirmed and fixed by the upstream developers), and (2) much faster than a state-of-the-art tool for Make builds (the median and average speedup is 39X and 74X respectively).
Dynamic linking is extremely common in modern software systems, thanks to the flexibility and space savings it offers. However, this flexibility comes at a cost: it’s impossible to perform interprocedural optimizations that involve calls to a dynamic library. The basic problem is that the run-time behavior of the dynamic linker can’t be predicted at compile time, so the compiler can make no assumptions about how such calls will behave.
This paper introduces guided linking, a technique for optimizing dynamically linked software when some information about the dynamic linker’s behavior is known in advance. The developer provides an arbitrary set of programs, libraries, and plugins to our tool, along with constraints that limit the possible dynamic linking behavior of the software. By taking advantage of the constraints, our tool enables any existing optimization to be applied across dynamic linking boundaries. For example, the NoOverride constraint can be applied to a function when the developer knows it will never be overridden with a different definition at run time; guided linking then enables the function to be inlined into its callers in other libraries. We also introduce a novel code size optimization that deduplicates identical functions even across different parts of the software set.
By applying guided linking to the Python interpreter and its dynamically loaded modules, supplying the constraint that no other programs or modules will be used, we increase speed by an average of 9%. By applying guided linking to a dynamically linked distribution of Clang and LLVM, and using the constraint that no other software will use the LLVM libraries, we can increase speed by 5% and reduce file size by 13%. If we relax the constraint to allow other software to use the LLVM libraries, we can still increase speed by 5% and reduce file size by 5%. If we use guided linking to combine 11 different versions of the Boost library, using minimal constraints, we can reduce the total library size by 57%.
Modern SAT solvers are extremely efficient at solving boolean satisfiability problems, enabling a wide spectrum of techniques for checking, verifying, and validating real-world programs. What remains challenging, though, is how to encode a domain problem (e.g., model checking) into a SAT formula because the same problem can have multiple distinct encodings, which can yield performance results that are orders-of-magnitude apart, regardless of the underlying solvers used. We develop Satune, a tool that can automatically synthesize SAT encoders for different problem domains. Satune employs a DSL that allows developers to express domain problems at a high level and a search algorithm that can effectively find efficient solutions. The search process is guided by observations made over example encodings and their performance for the domain and hence Satune can quickly synthesize a high-performance encoder by incorporating patterns from examples that yield good performance. A thorough evaluation with JMCR, SyPet, Dirk, Hexiom, Sudoku, and KillerSudoku demonstrates that Satune can easily synthesize high-performance encoders for different domains including model checking, synthesis, and games. These encoders generate constraint problems that are often several orders of magnitude faster to solve than the original encodings used by the tools.
As one of the fundamental optimizations in modern processors, the out-of-order execution boosts the pipeline throughput by executing independent instructions in parallel rather than in their program orders. However, due to the side effects introduced by such microarchitectural optimization to the CPU cache, secret-critical applications may suffer from timing side-channel leaks. This paper presents a symbolic execution-based technique, named SymO3, for exposing cache timing leaks under the context of out-of-order execution. SymO3 proposes new components that address the modeling, reduction, and reasoning challenges of accommodating program analysis to the software code out-of-order analysis. We implemented SymO3 upon KLEE and conducted three evaluations on it. Experimental results show that SymO3 successfully uncovers a set of cache timing leaks in five real-world programs. Also, SymO3 finds that, in general, program transformation from compiler optimizations shrink the surface to timing leaks. Furthermore, augmented with a speculative execution modeling, SymO3 identifies five more leaky programs based on the compound analysis.
With distributed computing becoming ubiquitous in the modern era, safe distributed programming is an open challenge. To address this, multiparty session types (MPST) provide a typing discipline for message-passing concurrency, guaranteeing communication safety properties such as deadlock freedom.
While originally MPST focus on the communication aspects, and employ a simple typing system for communication payloads, communication protocols in the real world usually contain constraints on the payload. We introduce refined multiparty session types (RMPST), an extension of MPST, that express data dependent protocols via refinement types on the data types.
We provide an implementation of RMPST, in a toolchain called Session*, using Scribble, a toolchain for multiparty protocols, and targeting F*, a verification-oriented functional programming language. Users can describe a protocol in Scribble and implement the endpoints in F* using refinement-typed APIs generated from the protocol. The F* compiler can then statically verify the refinements. Moreover, we use a novel approach of callback-styled API generation, providing static linearity guarantees with the inversion of control. We evaluate our approach with real world examples and show that it has little overhead compared to a naive implementation, while guaranteeing safety properties from the underlying theory.
We describe a design for generics in Go inspired by previous work on Featherweight Java by Igarashi, Pierce, and Wadler. Whereas subtyping in Java is nominal, in Go it is structural, and whereas generics in Java are defined via erasure, in Go we use monomorphisation. Although monomorphisation is widely used, we are one of the first to formalise it. Our design also supports a solution to The Expression Problem.
In this paper, we propose Proq, a runtime assertion scheme for testing and debugging quantum programs on a quantum computer. The predicates in Proq are represented by projections (or equivalently, closed subspaces of the state space), following Birkhoff-von Neumann quantum logic. The satisfaction of a projection by a quantum state can be directly checked upon a small number of projective measurements rather than a large number of repeated executions. On the theory side, we rigorously prove that checking projection-based assertions can help locate bugs or statistically assure that the semantic function of the tested program is close to what we expect, for both exact and approximate quantum programs. On the practice side, we consider hardware constraints and introduce several techniques to transform the assertions, making them directly executable on the measurement-restricted quantum computers. We also propose to achieve simplified assertion implementation using local projection technique with soundness guaranteed. We compare Proq with existing quantum program assertions and demonstrate the effectiveness and efficiency of Proq by its applications to assert two sophisticated quantum algorithms, the Harrow-Hassidim-Lloyd algorithm and Shor’s algorithm.
The advent of non-volatile memory (NVM) technologies is expected to transform how software systems are structured fundamentally, making the task of correct programming significantly harder. This is because ensuring that memory stores persist in the correct order is challenging, and requires low-level programming to flush the cache at appropriate points. This has in turn resulted in a noticeable verification gap.
To address this, we study the verification of NVM programs, and present Persistent Owicki-Gries (POG), the first program logic for reasoning about such programs. We prove the soundness of POG over the recent Intel-x86 model, which formalises the out-of-order persistence of memory stores and the semantics of the Intel cache line flush instructions. We then use POG to verify several programs that interact with NVM.
Lighthouse projects like CompCert, seL4, IronFleet, and DeepSpec have demonstrated that full system verification is feasible by establishing a refinement between an abstract system specification and an executable implementation. Existing approaches however impose severe restrictions on the abstract system specifications due to their limited expressiveness or versatility, or on the executable code due to their use of suboptimal code extraction or inexpressive program logics. We propose a novel methodology that combines the compositional refinement of event-based models of distributed systems with the verification of full-fledged program code using expressive separation logics, which support features of realistic programming languages like heap data structures and concurrency. Our main technical contribution is a formal framework that soundly relates event-based system models to program specifications in separation logics. This enables protocol development tools to soundly interoperate with program verifiers to establish a refinement between the model and the code. We formalized our framework, Igloo, in Isabelle/HOL. We report on three case studies, a leader election protocol, a replication protocol, and a security protocol, for which we refine formal requirements into program specifications that we implement in Java and Python and prove correct using the VeriFast and Nagini tools.
High performance architectures for processing distributed data streams, such as Flink, Spark Streaming, and Storm, are increasingly deployed in emerging data-driven computing systems. Exploiting the parallelism afforded by such platforms, while preserving the semantics of the desired computation, is prone to errors, and motivates the development of tools for specification, testing, and verification. We focus on the problem of differential output testing for distributed stream processing systems, that is, checking whether two implementations produce equivalent output streams in response to a given input stream. The notion of equivalence allows reordering of logically independent data items, and the main technical contribution of the paper is an optimal online algorithm for checking this equivalence. Our testing framework is implemented as a library called DiffStream in Flink. We present four case studies to illustrate how our framework can be used to (1) correctly identify bugs in a set of benchmark MapReduce programs, (2) facilitate the development of difficult-to-parallelize high performance applications, and (3) monitor an application for a long period of time with minimal performance overhead.
We present a simple, practical, and expressive type and effect system based on Boolean constraints. The effect system extends the Hindley-Milner type system, supports parametric polymorphism, and preserves principal types modulo Boolean equivalence. We show how to support type inference by extending Algorithm W with Boolean unification based on the successive variable elimination algorithm.
We implement the type and effect system in the Flix programming language. We perform an in-depth evaluation on the impact of Boolean unification on type inference time and end-to-end compilation time. While the computational complexity of Boolean unification is NP-hard, the experimental results demonstrate that it works well in practice. We find that the impact on type inference time is on average a 1.4x slowdown and the overall impact on end-to-end compilation time is a 1.1x slowdown.
This paper presents CAMP, a new static performance analysis framework for message-passing concurrent and distributed systems, based on the theory of multiparty session types (MPST). Understanding the run-time performance of concurrent and distributed systems is of great importance for the identification of bottlenecks and optimisation opportunities. In the message-passing setting, these bottlenecks are generally communication overheads and synchronisation times. Despite its importance, reasoning about these intensional properties of software, such as performance, has received little attention, compared to verifying extensional properties, such as correctness. Behavioural protocol specifications based on sessions types capture not only extensional, but also intensional properties of concurrent and distributed systems. CAMP augments MPST with annotations of communication latency and local computation cost, defined as estimated execution times, that we use to extract cost equations from protocol descriptions. CAMP is also extendable to analyse asynchronous communication optimisation built on a recent advance of session type theories. We apply our tool to different existing benchmarks and use cases in the literature with a wide range of communication protocols, implemented in C, MPI-C, Scala, Go, and OCaml. Our benchmarks show that, in most of the cases, we predict an upper-bound on the real execution costs with < 15% error.
Verifying the correctness of concurrent software with subtle synchronization is notoriously challenging. We present the Anchor verifier, which is based on a new formalism for specifying synchronization disciplines that describes both (1) what memory accesses are permitted, and (2) how each permitted access commutes with concurrent operations of other threads (to facilitate reduction proofs). Anchor supports the verification of both lock-based blocking and cas-based non-blocking algorithms. Experiments on a variety concurrent data structures and algorithms show that Anchor significantly reduces the burden of concurrent verification.
A software analysis is a computer program that takes some representation of a software product as input and produces some useful information about that product as output. A software product line encompasses many software product variants, and thus existing analyses can be applied to each of the product variations individually, but not to the entire product line as a whole. Enumerating all product variants and analyzing them one by one is usually intractable due to the combinatorial explosion of the number of product variants with respect to product line features. Several software analyses (e.g., type checkers, model checkers, data flow analyses) have been redesigned/re-implemented to support variability. This usually requires a lot of time and effort, and the variability-aware version of the analysis might have new errors/bugs that do not exist in the original one.
Given an analysis program written in a functional language based on PCF, in this paper we present two approaches to transforming (lifting) it into a semantically equivalent variability-aware analysis. A light-weight approach (referred to as shallow lifting) wraps the analysis program into a variability-aware version, exploring all combinations of its input arguments. Deep lifting, on the other hand, is a program rewriting mechanism where the syntactic constructs of the input program are rewritten into their variability-aware counterparts. Compositionally this results in an efficient program semantically equivalent to the input program, modulo variability.
We present the correctness criteria for functional program lifting, together with correctness proof sketches of shallow lifting. We evaluate our approach on a set of program analyses applied to the BusyBox C-language product line.
We address the problem of optimizing sparse tensor algebra in a compiler and show how to define standard loop transformations---split, collapse, and reorder---on sparse iteration spaces. The key idea is to track the transformation functions that map the original iteration space to derived iteration spaces. These functions are needed by the code generator to emit code that maps coordinates between iteration spaces at runtime, since the coordinates in the sparse data structures remain in the original iteration space. We further demonstrate that derived iteration spaces can tile both the universe of coordinates and the subset of nonzero coordinates: the former is analogous to tiling dense iteration spaces, while the latter tiles sparse iteration spaces into statically load-balanced blocks of nonzeros. Tiling the space of nonzeros lets the generated code efficiently exploit heterogeneous compute resources such as threads, vector units, and GPUs.
We implement these concepts by extending the sparse iteration theory implementation in the TACO system. The associated scheduling API can be used by performance engineers or it can be the target of an automatic scheduling system. We outline one heuristic autoscheduling system, but other systems are possible. Using the scheduling API, we show how to optimize mixed sparse-dense tensor algebra expressions on CPUs and GPUs. Our results show that the sparse transformations are sufficient to generate code with competitive performance to hand-optimized implementations from the literature, while generalizing to all of the tensor algebra.
A frequent programming pattern for small tasks, especially expressions, is to repeatedly evaluate the program on an input as its editing progresses. The Read-Eval-Print Loop (REPL) interaction model has been a successful model for this programming pattern. We present the new notion of Read-Eval-Synth Loop (RESL) that extends REPL by providing in-place synthesis on parts of the expression marked by the user. RESL eases programming by synthesizing parts of a required solution. The underlying synthesizer relies on a partial solution from the programmer and a few examples.
RESL hinges on bottom-up synthesis with general predicates and sketching, generalizing programming by example. To make RESL practical, we present a formal framework that extends observational equivalence to non-example specifications.
We evaluate RESL by conducting a controlled within-subjects user-study on 19 programmers from 8 companies, where programmers are asked to solve a small but challenging set of competitive programming problems. We find that programmers using RESL solve these problems with far less need to edit the code themselves and by browsing documentation far less. In addition, they are less likely to leave a task unfinished and more likely to be correct.
Mobile operating systems, especially Android, expose apps to a volatile runtime environment. The app state that reflects past user interaction and system environment updates (e.g., battery status changes) can be destroyed implicitly, in response to runtime configuration changes (e.g., screen rotations) or memory pressure. Developers are therefore responsible for identifying app state affected by volatility and preserving it across app lifecycles. When handled inappropriately, the app may lose state or end up in an inconsistent state after a runtime configuration change or when users return to the app.
To free developers from this tedious and error-prone task, we propose a systematic solution, LiveDroid, which precisely identifies the necessary part of the app state that needs to be preserved across app lifecycles, and automatically saves and restores it. LiveDroid consists of: (i) a static analyzer that reasons about app source code and resource files to pinpoint the program variables and GUI properties that represent the necessary app state, and (ii) a runtime system that manages the state saving and recovering. We implemented LiveDroid as a plugin in Android Studio and a patching tool for APKs. Our evaluation shows that LiveDroid can be successfully applied to 966 Android apps. A focused study with 36 Android apps shows that LiveDroid identifies app state much more precisely than an existing solution that includes all mutable program variables but ignores GUI properties. As a result, on average, LiveDroid is able to reduce the costs of state saving and restoring by 16.6X (1.7X - 141.1X) and 9.5X (1.1X - 43.8X), respectively. Furthermore, compared with the manual state handling performed by developers, our analysis reveals a set of 46 issues due to incomplete state saving/restoring, all of which can be successfully eliminated by LiveDroid.
Automation of fixpoint reasoning has been extensively studied for various mathematical structures, logical formalisms, and computational domains, resulting in specialized fixpoint provers for heaps, for streams, for term algebras, for temporal properties, for program correctness, and for many other formal systems and inductive and coinductive properties. However, in spite of great theoretical and practical interest, there is no unified framework for automated fixpoint reasoning. Although several attempts have been made, there is no evidence that such a unified framework is possible, or practical. In this paper, we propose a candidate based on matching logic, a formalism recently shown to theoretically unify the above mentioned formal systems. Unfortunately, the (Knaster-Tarski) proof rule of matching logic, which enables inductive reasoning, is not syntax-driven. Worse, it can be applied at any step during a proof, making automation seem hopeless. Inspired by recent advances in automation of inductive proofs in separation logic, we propose an alternative proof system for matching logic, which is amenable for automation. We then discuss our implementation of it, which although not superior to specialized state-of-the-art automated provers for specific domains, we believe brings some evidence and hope that a unified framework for automated reasoning is not out of reach.
Neural models of code have shown impressive results when performing tasks such as predicting method names and identifying certain kinds of bugs. We show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code using adversarial examples. The main idea of our approach is to force a given trained model to make an incorrect prediction, as specified by the adversary, by introducing small perturbations that do not change the program’s semantics, thereby creating an adversarial example. To find such perturbations, we present a new technique for Discrete Adversarial Manipulation of Programs (DAMP). DAMP works by deriving the desired prediction with respect to the model’s inputs, while holding the model weights constant, and following the gradients to slightly modify the input code.
We show that our DAMP attack is effective across three neural architectures: code2vec, GGNN, and GNN-FiLM, in both Java and C#. Our evaluations demonstrate that DAMP has up to 89% success rate in changing a prediction to the adversary’s choice (a targeted attack) and a success rate of up to 94% in changing a given prediction to any incorrect prediction (a non-targeted attack). To defend a model against such attacks, we empirically examine a variety of possible defenses and discuss their trade-offs. We show that some of these defenses can dramatically drop the success rate of the attacker, with a minor penalty of 2% relative degradation in accuracy when they are not performing under attack.
Our code, data, and trained models are available at <a>https://github.com/tech-srl/adversarial-examples</a> .
Intermittently powered devices enable new applications in harsh or inaccessible environments, such as space or in-body implants, but also introduce problems in programmability and correctness. Researchers have developed programming models to ensure that programs make progress and do not produce erroneous results due to memory inconsistencies caused by intermittent executions. As the technology has matured, more and more features are added to intermittently powered devices, such as I/O. Prior work has shown that all existing intermittent execution models have problems with repeated device or sensor inputs (RIO). RIOs could leave intermittent executions in an inconsistent state. Such problems and the proliferation of existing intermittent execution models necessitate a formal foundation for intermittent computing.
In this paper, we formalize intermittent execution models, their correctness properties with respect to memory consistency and inputs, and identify the invariants needed to prove systems correct. We prove equivalence between several existing intermittent systems. To address RIO problems, we define an algorithm for identifying variables affected by RIOs that need to be restored after reboot and prove the algorithm correct. Finally, we implement the algorithm in a novel intermittent runtime system that is correct with respect to input operations and evaluate its performance.
Building effective symbolic execution engines poses challenges in multiple dimensions: an engine must correctly model the program semantics, provide flexibility in symbolic execution strategies, and execute them efficiently.
This paper proposes a principled approach to building correct, flexible, and efficient symbolic execution engines, directly rooted in the semantics of the underlying language in terms of a high-level definitional interpreter. The definitional interpreter induces algebraic effects to abstract over semantic variants of symbolic execution, e.g., collecting path conditions as a state effect and path exploration as a nondeterminism effect. Different handlers of these effects give rise to different symbolic execution strategies, making execution strategies orthogonal to the symbolic execution semantics, thus improving flexibility. Furthermore, by annotating the symbolic definitional interpreter with binding-times and specializing it to the input program via the first Futamura projection, we obtain a "symbolic compiler", generating efficient instrumented code having the symbolic execution semantics. Our work reconciles the interpretation- and instrumentation-based approaches to building symbolic execution engines in a uniform framework.
We illustrate our approach on a simple imperative language step-by-step and then scale up to a significant subset of LLVM IR. We also show effect handlers for common path selection strategies. Evaluating our prototype's performance shows speedups of 10~30x over the unstaged counterpart, and ~2x over KLEE, a state-of-the-art symbolic interpreter for LLVM IR.
Applying differential privacy at scale requires convenient ways to check that programs computing with sensitive data appropriately preserve privacy. We propose here a fully automated framework for testing differential privacy, adapting a well-known “pointwise” technique from informal proofs of differential privacy. Our framework, called DPCheck, requires no programmer annotations, handles all previously verified or tested algorithms, and is the first fully automated framework to distinguish correct and buggy implementations of PrivTree, a probabilistically terminating algorithm that has not previously been mechanically checked.
We analyze the probability of DPCheck mistakenly accepting a non-private program and prove that, theoretically, the probability of false acceptance can be made exponentially small by suitable choice of test size.
We demonstrate DPCheck’s utility empirically by implementing all benchmark algorithms from prior work on mechanical verification of differential privacy, plus several others and their incorrect variants, and show DPCheck accepts the correct implementations and rejects the incorrect variants.
We also demonstrate how DPCheck can be deployed in a practical workflow to test differentially privacy for the 2020 US Census Disclosure Avoidance System (DAS).
Halide is a domain-specific language for high-performance image processing and tensor computations, widely adopted in industry. Internally, the Halide compiler relies on a term rewriting system to prove properties of code required for efficient and correct compilation. This rewrite system is a collection of handwritten transformation rules that incrementally rewrite expressions into simpler forms; the system requires high performance in both time and memory usage to keep compile times low, while operating over the undecidable theory of integers. In this work, we apply formal techniques to prove the correctness of existing rewrite rules and provide a guarantee of termination. Then, we build an automatic program synthesis system in order to craft new, provably correct rules from failure cases where the compiler was unable to prove properties. We identify and fix 4 incorrect rules as well as 8 rules which could give rise to infinite rewriting loops. We demonstrate that the synthesizer can produce better rules than hand-authored ones in five bug fixes, and describe four cases in which it has served as an assistant to a human compiler engineer. We further show that it can proactively improve weaknesses in the compiler by synthesizing a large number of rules without human supervision and showing that the enhanced ruleset lowers peak memory usage of compiled code without appreciably increasing compilation times.
Academia has spent much effort into making context-sensitive analyses practical, with great profit. However, the implementation of context-sensitive optimizations, in contrast to analyses, is still not practical, due to code-size explosion. This growth happens because current technology requires the cloning of full paths in the Calling Context Tree. In this paper, we present a solution to this problem. We combine finite state machines and dynamic dispatching to allow fully context-sensitive specialization while cloning only functions that are effectively optimized. This technique makes it possible to apply very liberal optimizations, such as context-sensitive constant propagation, in large programs—something that could not have been easily done before. We demonstrate the viability of our idea by formalizing it in Prolog, and implementing it in LLVM. As a proof of concept, we have used our state machines to implement context-sensitive constant propagation in LLVM. The binaries produced by traditional full cloning are 2.63 times larger than the binaries that we generate with our state machines. When applied on Mozilla Firefox, our optimization increases binary size from 7.2MB to 9.2MB. Full cloning, in contrast, yields a binary of 34MB.
Java 8 introduced streams that allow developers to work with collections of data using functional-style operations. Streams are often used in pipelines of operations for processing the data elements, which leads to concise and elegant program code. However, the declarative data processing style comes at a cost. Compared to processing the data with traditional imperative language mechanisms, constructing stream pipelines requires extra heap objects and virtual method calls, which often results in significant run-time overheads.
In this work we investigate how to mitigate these overheads to enable processing data in the declarative style without sacrificing performance. We argue that ahead-of-time bytecode-to-bytecode transformation is a suitable approach to optimization of stream pipelines, and we present a static analysis that is designed to guide such transformations. Experimental results show a significant performance gain, and that the technique works for realistic stream pipelines. For 10 of 11 micro-benchmarks, the optimizer is able to produce bytecode that is as effective as hand-written imperative-style code. Additionally, 77% of 6879 stream pipelines found in real-world Java programs are optimized successfully.
Build scripts for most build systems describe the actions to run, and the dependencies between those actions - but often build scripts get those dependencies wrong. Most build scripts have both too few dependencies (leading to incorrect build outputs) and too many dependencies (leading to excessive rebuilds and reduced parallelism). Any programmer who has wondered why a small change led to excess compilation, or who resorted to a clean step, has suffered the ill effects of incorrect dependency specification. We outline a build system where dependencies are not specified, but instead captured by tracing execution. The consequence is that dependencies are always correct by construction and build scripts are easier to write. The simplest implementation of our approach would lose parallelism, but we are able to recover parallelism using speculation.
Optimizing the physical data storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express both a relational query and the layout of its data. Our language can express a wide range of physical database layouts, going well beyond the row- and column-based methods that are widely used in database management systems. We use deductive program synthesis to turn a high-level relational representation of a database query into a highly optimized low-level implementation which operates on a specialized layout of the dataset. We build an optimizing compiler for this language and conduct experiments using a popular database benchmark, which shows that the performance of our specialized queries is better than a state-of-the-art in memory compiled database system while achieving an order-of-magnitude reduction in memory use.
In courses that involve programming assignments, giving meaningful feedback to students is an important challenge. Human beings can give useful feedback by manually grading the programs but this is a time-consuming, labor intensive, and usually boring process. Automatic graders can be fast and scale well but they usually provide poor feedback. Although there has been research on improving automatic graders, research on scaling and improving human grading is limited.
We propose to scale human grading by augmenting the manual grading process with an equivalence algorithm that can identify the equivalences between student submissions. This enables human graders to give targeted feedback for multiple student submissions at once. Our technique is conservative in two aspects. First, it identifies equivalence between submissions that are algorithmically similar, e.g., it cannot identify the equivalence between quicksort and mergesort. Second, it uses formal methods instead of clustering algorithms from the machine learning literature. This allows us to prove a soundness result that guarantees that submissions will never be clustered together in error. Despite only reporting equivalence when there is algorithmic similarity and the ability to formally prove equivalence, we show that our technique can significantly reduce grading time for thousands of programming submissions from an introductory functional programming course.
We present a novel methodology for the automated resource analysis of non-deterministic, probabilistic imperative programs, which gives rise to a modular approach. Program fragments are analysed in full independence. Moreover, the established results allow us to incorporate sampling from dynamic distributions, making our analysis applicable to a wider class of examples, for example the Coupon Collector’s problem.
We have implemented our contributions in the tool , exploiting a constraint-solver over iterative refineable cost functions facilitated by off-the-shelf SMT solvers. We provide ample experimental evidence of the prototype’s algorithmic power. Our experiments show that our tool runs typically at least one order of magnitude faster than comparable tools. On more involved examples, it may even be the case that execution times of seconds become milliseconds. At the same time we retain the precision of existing tools.
The extensions in applicability and the greater efficiency of our prototype, yield scalability of sorts. This effects into a wider class of examples, whose expected cost analysis can be thus be performed fully automatically.
In domains that deal with physical space and geometry, programmers need to track the coordinate systems that underpin a computation. We identify a class of geometry bugs that arise from confusing which coordinate system a vector belongs to. These bugs are not ruled out by current languages for vector-oriented computing, are difficult to check for at run time, and can generate subtly incorrect output that can be hard to test for.
We introduce a type system and language that prevents geometry bugs by reflecting the coordinate system for each geometric object. A value's geometry type encodes its reference frame, the kind of geometric object (such as a point or a direction), and the coordinate representation (such as Cartesian or spherical coordinates). We show how these types can rule out geometrically incorrect operations, and we show how to use them to automatically generate correct-by-construction code to transform vectors between coordinate systems. We implement a language for graphics programming, Gator, that checks geometry types and compiles to OpenGL's shading language, GLSL. Using case studies, we demonstrate that Gator can raise the level of abstraction for shader programming and prevent common errors without inducing significant annotation overhead or performance cost.
TypeScript is a dynamically typed language widely used to develop large-scale applications nowadays. These applications are usually designed with complex class or interface hierarchies and have highly polymorphic behaviors. These object-oriented (OO) features will lead to inefficient inline caches (ICs) or trigger deoptimizations, which impact the performance of TypeScript applications.
To address this problem, we introduce an inline caching design called hidden inheritance (HI). The basic idea of HI is to cache the static information of class or interface hierarchies into hidden classes, which are leveraged to generate efficient inline caches for improving the performance of OO-style TypeScript programs. The HI design is implemented in a TypeScript engine STSC (Static TypeScript Compiler) including a static compiler and a runtime system. STSC statically generates hidden classes and enhanced inline caches, which are applied to generate specialized machine code via ahead-of-time compilation (AOTC) or just-in-time compilation (JITC). To evaluate the efficiency of this technique, we implement STSC on a state-of-the-art JavaScript virtual machine V8 and demonstrate its performance improvements on industrial benchmarks and applications.
Every newly created object goes through several initialization states: starting from a state where all fields are uninitialized until all of them are assigned. Any operation on the object during its initialization process, which usually happens in the constructor via this, has to observe the initialization states of the object for correctness, i.e. only initialized fields may be used. Checking safe usage of this statically, without manual annotation of initialization states in the source code, is a challenge, due to aliasing and virtual method calls on this.
Mainstream languages either do not check initialization errors, such as Java, C++, Scala, or they defend against them by not supporting useful initialization patterns, such as Swift. In parallel, past research has shown that safe initialization can be achieved for varying degrees of expressiveness but by sacrificing syntactic simplicity.
We approach the problem by upholding local reasoning about initialization which avoids whole-program analysis, and we achieve typestate polymorphism via subtyping. On this basis, we put forward a novel type-and-effect system that can effectively ensure initialization safety while allowing flexible initialization patterns. We implement an initialization checker in the Scala 3 compiler and evaluate on several real-world projects.
Garbage collection (GC) support for unmanaged languages can reduce programming burden in reasoning about liveness of dynamic objects. It also avoids temporal memory safety violations and memory leaks. Sound GC for weakly-typed languages such as C/C++, however, remains an unsolved problem. Current value-based GC solutions examine values of memory locations to discover the pointers, and the objects they point to. The approach is inherently unsound in the presence of arbitrary type casts and pointer manipulations, which are legal in C/C++. Such language features are regularly used, especially in low-level systems code.
In this paper, we propose Dynamic Pointer Provenance Tracking to realize sound GC. We observe that pointers cannot be created out-of-thin-air, and they must have provenance to at least one valid allocation. Therefore, by tracking pointer provenance from the source (e.g., malloc) through both explicit data-flow and implicit control-flow, our GC has sound and precise information to compute the set of all reachable objects at any program state. We discuss several static analysis optimizations, that can be employed during compilation aided with profiling, to significantly reduce the overhead of dynamic provenance tracking from nearly 8× to 16% for well-behaved programs that adhere to the C standards. Pointer provenance based sound GC invocation is also 13% faster and reclaims 6% more memory on average, compared to an unsound value-based GC.
Superoptimization is a compilation strategy that uses search to improve code quality, rather than relying on a canned sequence of transformations, as traditional optimizing compilers do. This search can be seen as a program synthesis problem: from unoptimized code serving as a specification, the synthesis procedure attempts to create a more efficient implementation. An important family of synthesis algorithms works by enumerating candidates and then successively checking if each refines the specification, using an SMT solver. The contribution of this paper is a pruning technique which reduces the enumerative search space using fast dataflow-based techniques to discard synthesis candidates that contain symbolic constants and uninstantiated instructions. We demonstrate the effectiveness of this technique by improving the runtime of an enumerative synthesis procedure in the Souper superoptimizer for the LLVM intermediate representation. The techniques presented in this paper eliminate 65% of the solver calls made by Souper, making it 2.32x faster (14.54 hours vs 33.76 hours baseline, on a large multicore) at solving all 269,113 synthesis problems that Souper encounters when optimizing the C and C++ programs from SPEC CPU 2017.
Reachability analysis is a fundamental program analysis with a wide variety of applications. We present FlowCFL, a type-based reachability analysis that accounts for mutable heap data. The underlying semantics of FlowCFL is Context-Free-Language (CFL)-reachability.
We make three contributions. First, we define a dynamic semantics that captures the notion of flow commonly used in reachability analysis. Second, we establish correctness of CFL-reachability over graphs with inverse edges (inverse edges are necessary for the handling of mutable heap data). Our approach combines CFL-reachability with reference immutability to avoid the addition of certain inverse edges, which results in graph reduction and precision improvement. The key contribution of our work is the formal account of correctness, which extends to the case when inverse edges are removed. Third, we present a type-based reachability analysis and establish equivalence between a certain CFL-reachability analysis and the type-based analysis, thus proving correctness of the type-based analysis.
We present Graphick, a new technique for automatically learning graph-based heuristics for pointer analysis. Striking a balance between precision and scalability of pointer analysis requires designing good analysis heuristics. For example, because applying context sensitivity to all methods in a real-world program is impractical, pointer analysis typically uses a heuristic to employ context sensitivity only when it is necessary. Past research has shown that exploiting the program's graph structure is a promising way of developing cost-effective analysis heuristics, promoting the recent trend of ``graph-based heuristics'' that work on the graph representations of programs obtained from a pre-analysis. Although promising, manually developing such heuristics remains challenging, requiring a great deal of expertise and laborious effort. In this paper, we aim to reduce this burden by learning graph-based heuristics automatically, in particular without hand-crafted application-specific features. To do so, we present a feature language to describe graph structures and an algorithm for learning analysis heuristics within the language. We implemented Graphick on top of Doop and used it to learn graph-based heuristics for object sensitivity and heap abstraction. The evaluation results show that our approach is general and can generate high-quality heuristics. For both instances, the learned heuristics are as competitive as the existing state-of-the-art heuristics designed manually by analysis experts.
There is a large gap between the specification of type systems and the implementation of their type checkers, which impedes reasoning about the soundness of the type checker with respect to the specification. A vision to close this gap is to automatically obtain type checkers from declarative programming language specifications. This moves the burden of proving correctness from a case-by-case basis for concrete languages to a single correctness proof for the specification language. This vision is obstructed by an aspect common to all programming languages: name resolution. Naming and scoping are pervasive and complex aspects of the static semantics of programming languages. Implementations of type checkers for languages with name binding features such as modules, imports, classes, and inheritance interleave collection of binding information (i.e., declarations, scoping structure, and imports) and querying that information. This requires scheduling those two aspects in such a way that query answers are stable—i.e., they are computed only after all relevant binding structure has been collected. Type checkers for concrete languages accomplish stability using language-specific knowledge about the type system.
In this paper we give a language-independent characterization of necessary and sufficient conditions to guarantee stability of name and type queries during type checking in terms of critical edges in an incomplete scope graph. We use critical edges to give a formal small-step operational semantics to a declarative specification language for type systems, that achieves soundness by delaying queries that may depend on missing information. This yields type checkers for the specified languages that are sound by construction—i.e., they schedule queries so that the answers are stable, and only accept programs that are name- and type-correct according to the declarative language specification. We implement this approach, and evaluate it against specifications of a small module and record language, as well as subsets of Java and Scala.
The R programming language is widely used in a variety of domains. It was designed to favor an interactive style of programming with minimal syntactic and conceptual overhead. This design is well suited to data analysis, but a bad fit for tools such as compilers or program analyzers. In particular, R has no type annotations, and all operations are dynamically checked at run-time. The starting point for our work are the two questions: what expressive power is needed to accurately type R code? and which type system is the R community willing to adopt? Both questions are difficult to answer without actually experimenting with a type system. The goal of this paper is to provide data that can feed into that design process. To this end, we perform a large corpus analysis to gain insights in the degree of polymorphism exhibited by idiomatic R code and explore potential benefits that the R community could accrue from a simple type system. As a starting point, we infer type signatures for 25,215 functions from 412 packages among the most widely used open source R libraries. We then conduct an evaluation on 8,694 clients of these packages, as well as on end-user code from the Kaggle data science competition website.
Traditionally, IoT devices send collected sensor data to an intelligent cloud where machine learning (ML) inference happens. However, this course is rapidly changing and there is a recent trend to run ML on the edge IoT devices themselves. An intelligent edge is attractive because it saves network round trip (efficiency) and keeps user data at the source (privacy). However, the IoT devices are much more resource constrained than the cloud, which makes running ML on them challenging. Specifically, consider Arduino Uno, a commonly used board, that has 2KB of RAM and 32KB of read-only Flash memory. Although recent breakthroughs in ML have created novel recurrent neural network (RNN) models that provide good accuracy with KB-sized models, deploying them on tiny devices with such hard memory requirements has remained elusive.
We provide, Shiftry, an automatic compiler from high-level floating-point ML models to fixed-point C-programs with 8-bit and 16-bit integers, which have significantly lower memory requirements. For this conversion, Shiftry uses a data-driven float-to-fixed procedure and a RAM management mechanism. These techniques enable us to provide first empirical evaluation of RNNs running on tiny edge devices. On simpler ML models that prior work could handle, Shiftry-generated code has lower latency and higher accuracy.
Real-time data analysis applications increasingly rely on complex streaming computations over time-series data. We propose StreamQL, a language that facilitates the high-level specification of complex analyses over streaming time series. StreamQL is designed as an algebra of stream transformations and provides a collection of combinators for composing them. It integrates three language-based approaches for data stream processing: relational queries, dataflow composition, and temporal formalisms. The relational constructs are useful for specifying simple transformations, aggregations, and the partitioning of data into key-based groups or windows. The dataflow abstractions enable the modular description of a computation as a pipeline of stages or, more generally, as a directed graph of independent tasks. Finally, temporal constructs can be used to specify complex temporal patterns and time-varying computations. These constructs can be composed freely to describe complex streaming computations. We provide a formal denotational semantics for StreamQL using a class of monotone functions over streams. We have implemented StreamQL as a lightweight Java library, which we use to experimentally evaluate our approach. The experiments show that the throughput of our implementation is competitive compared to state-of-the-art streaming engines such as RxJava and Reactor.
Software products are evolving during their life cycles. Ideally, every revision need be formally verified to ensure software quality. Yet repeated formal verification requires significant computing resources. Verifying each and every revision can be very challenging. It is desirable to ameliorate regression verification for practical purposes. In this paper, we regard predicate analysis as a process of assertion annotation. Assertion annotations can be used as a certificate for the verification results. It is thus a waste of resources to throw them away after each verification. We propose to reuse the previously-yielded assertion annotation in regression verification. A light-weight impact-analysis technique is proposed to analyze the reusability of assertions. A novel assertion strengthening technique is furthermore developed to improve reusability of annotation. With these techniques, we present an incremental predicate analysis technique for regression verification. Correctness of our incremental technique is formally proved. We performed comprehensive experiments on revisions of Linux kernel device drivers. Our technique outperforms the state-of-the-art program verification tool CPAchecker by getting 2.8x speedup in total time and solving additional 393 tasks.
Recently, there is growing concern that machine-learned software, which currently assists or even automates decision making, reproduces, and in the worst case reinforces, bias present in the training data. The development of tools and techniques for certifying fairness of this software or describing its biases is, therefore, critical. In this paper, we propose a perfectly parallel static analysis for certifying fairness of feed-forward neural networks used for classification of tabular data. When certification succeeds, our approach provides definite guarantees, otherwise, it describes and quantifies the biased input space regions. We design the analysis to be sound, in practice also exact, and configurable in terms of scalability and precision, thereby enabling pay-as-you-go certification. We implement our approach in an open-source tool called Libra and demonstrate its effectiveness on neural networks trained on popular datasets.
Modern programming languages support concurrent programming based on channels and processes. Channels enable synchronous and asynchronous message-passing between independent light-weight processes making it easy to express common concurrency patterns. The implementation of channels and processes in compilers and language runtimes is a difficult task that relies heavily on traditional and error-prone low-level concurrency primitives, raising concerns about correctness and reliability. In this paper, we present an automatic program generation technique to test such programming language implementations. We define a type and effect system for programs that communicate over channels and where every execution is guaranteed to eventually terminate. We can generate and run such programs, and if a program fails to terminate, we have found a bug in the programming language implementation. We implement such an automatic program generator and apply it to Go, Kotlin, Crystal, and Flix. We find two new bugs in Flix, and reproduce two bugs; one in Crystal and one in Kotlin.
JavaScript libraries are widely used and evolve rapidly. When adapting client code to non-backwards compatible changes in libraries, a major challenge is how to locate affected API uses in client code, which is currently a difficult manual task. In this paper we address this challenge by introducing a simple pattern language for expressing API access points and a pattern-matching tool based on lightweight static analysis.
Experimental evaluation on 15 popular npm packages shows that typical breaking changes are easy to express as patterns. Running the static analysis on 265 clients of these packages shows that it is accurate and efficient: it reveals usages of breaking APIs with only 14% false positives and no false negatives, and takes less than a second per client on average. In addition, the analysis is able to report its confidence, which makes it easier to identify the false positives. These results suggest that the approach, despite its simplicity, can reduce the manual effort of the client developers.
Large scale distributed systems require to embrace the trade off between consistency and availability, accepting lower levels of consistency to guarantee higher availability. Existing programming languages are, however, agnostic to this compromise, resulting in consistency guarantees that are the same for the whole application and are implicitly adopted from the middleware or hardcoded in configuration files. In this paper, we propose to integrate availability in the design of an object-oriented language, allowing developers to specify different consistency and isolation constraints in the same application at the granularity of single objects. We investigate how availability levels interact with object structure and define a type system that preserves correct program behavior. Our evaluation shows that our solution performs efficiently and improves the design of distributed applications.
There is growing interest in termination reasoning for nonlinear programs and, meanwhile, recent dynamic strategies have shown they are able to infer invariants for such challenging programs. These advances led us to hypothesize that perhaps such dynamic strategies for nonlinear invariants could be adapted to learn recurrent sets (for non-termination) and/or ranking functions (for termination).
In this paper, we exploit dynamic analysis and draw termination and non-termination as well as static and dynamic strategies closer together in order to tackle nonlinear programs. For termination, our algorithm infers ranking functions from concrete transitive closures, and, for non-termination, the algorithm iteratively collects executions and dynamically learns conditions to refine recurrent sets. Finally, we describe an integrated algorithm that allows these algorithms to mutually inform each other, taking counterexamples from a failed validation in one endeavor and crossing both the static/dynamic and termination/non-termination lines, to create new execution samples for the other one.
We have implemented these algorithms in a new tool called DynamiTe. For nonlinear programs, there are currently no SV-COMP termination benchmarks so we created new sets of 38 terminating and 39 non-terminating programs. Our empirical evaluation shows that we can effectively guess (and sometimes even validate) ranking functions and recurrent sets for programs with nonlinear behaviors. Furthermore, we show that counterexamples from one failed validation can be used to generate executions for a dynamic analysis of the opposite property. Although we are focused on nonlinear programs, as a point of comparison, we compare DynamiTe's performance on linear programs with that of the state-of-the-art tool, Ultimate. Although DynamiTe is an order of magnitude slower it is nonetheless somewhat competitive and sometimes finds ranking functions where Ultimate was unable to. Ultimate cannot, however, handle the nonlinear programs in our new benchmark suite.
Static analysis of smart contracts as-deployed on the Ethereum blockchain has received much recent attention. However, high-precision analyses currently face significant challenges when dealing with the Ethereum VM (EVM) execution model. A major such challenge is the modeling of low-level, transient “memory” (as opposed to persistent, on-blockchain “storage”) that smart contracts employ. Statically understanding the usage patterns of memory is non-trivial, due to the dynamic allocation nature of in-memory buffers. We offer an analysis that models EVM memory, recovering high-level concepts (e.g., arrays, buffers, call arguments) via deep modeling of the flow of values. Our analysis opens the door to Ethereum static analyses with drastically increased precision. One such analysis detects the extraction of ERC20 tokens by unauthorized users. For another practical vulnerability (redundant calls, possibly used as an attack vector), our memory modeling yields analysis precision of 89%, compared to 16% for a state-of-the-art tool without precise memory modeling. Additionally, precise memory modeling enables the static computation of a contract’s gas cost. This gas-cost analysis has recently been instrumental in the evaluation of the impact of the EIP-1884 repricing (in terms of gas costs) of EVM operations, leading to a reward and significant publicity from the Ethereum Foundation.
Gradual typing provides a methodology to integrate static and dynamic typing, harmonizing their often conflicting advantages in a single language. When a user wants to enjoy the advantages of static typing, most gradual languages require that they add type annotations. Many nontrivial tasks must be undertaken while adding type annotations, including understanding program behaviors and invariants. Unfortunately, if this is done incorrectly then the added type annotations can be wrong–leading to inconsistencies between the program and the type annotations. Gradual typing implementations detect such inconsistencies at runtime, raise cast errors, and generate messages. However, solely relying on such error messages for understanding and fixing inconsistencies and their resulting cast errors is often insufficient for multiple reasons. One reason is that while such messages cover inconsistencies in one execution path, fixing them often requires reconciling information from multiple paths. Another is that users may add many wrong type annotations that they later find difficult to identify and fix, when considering all added annotations.
Recent studies provide evidence that type annotations added during program migration are often wrong and that many programmers prefer compile-time warnings about wrong annotations. Motivated by these results, we develop exploratory typing to help with the static detection, understanding, and fixing of inconsistencies. The key idea of exploratory typing is that it systematically removes dynamic types and explores alternative types for static type annotations that can remedy inconsistencies. To demonstrate the feasibility of exploratory typing, we have implemented it in PyHound, which targets programs written in Reticulated Python, a gradual variant of Python. We have evaluated PyHound on a set of Python programs, and the evaluation results demonstrate that our idea can effectively detect inconsistencies in 98% of the tested programs and fix 93% of inconsistencies, significantly outperforming pytype, a widely used Python tool for enforcing type annotations.
Solvers in the framework of Satisfiability Modulo Theories (SMT) have been widely successful in practice. Recently there has been an increasing interest in solvers for string constraints to address security issues in web programming, for example. To be practically useful, the solvers need to support an expressive constraint language over unbounded strings, and in particular, over string lengths. Satisfiability checking for these formulas, especially in the SMT context, is very hard; it is generally undecidable for a rich fragment. In this paper, we propose a form of dependency analysis for a rich fragment of string constraints including high-level operations such as length, contains to deal with their inter-theory interaction so as to solve them more efficiently. We implement our dependency analysis in the string theory of the Z3 solver to obtain a new one, called S3N. Finally, we demonstrate the superior performance of S3N over state-of-the-art string solvers such as Z3str3, CVC4, S3P, and Z3 on several large industrial-strength benchmarks.
We propose type-aware operator mutation, a simple, but unusually effective approach for testing SMT solvers. The key idea is to mutate operators of conforming types within the seed formulas to generate well-typed mutant formulas. These mutant formulas are then used as the test cases for SMT solvers. We realized type-aware operator mutation within the OpFuzz tool and used it to stress-test Z3 and CVC4, two state-of-the-art SMT solvers. Type-aware operator mutations are unusually effective: During one year of extensive testing with OpFuzz, we reported 1092 bugs on Z3’s and CVC4’s respective GitHub issue trackers, out of which 819 unique bugs were confirmed and 685 of the confirmed bugs were fixed by the developers. The detected bugs are highly diverse — we found bugs of many different types (soundness bugs, invalid model bugs, crashes, etc.), logics and solver configurations. We have further conducted an in-depth study of the bugs found by OpFuzz. The study results show that the bugs found by OpFuzz are of high quality. Many of them affect core components of the SMT solvers’ codebases, and some required major changes for the developers to fix. Among the 819 confirmed bugs found by OpFuzz,184 were soundness bugs, the most critical bugs in SMT solvers,and 489 were in the default modes of the solvers. Notably, OpFuzz found 27 critical soundness bugs in CVC4, which has proved to be a very stable SMT solver.
Relaxed memory models must simultaneously achieve efficient implementability and thread-compositional reasoning. Is that why they have become so complicated? We argue that the answer is no: It is possible to achieve these goals by combining an idea from the 60s (preconditions) with an idea from the 80s (pomsets), at least for X64 and ARMv8. We show that the resulting model (1) supports compositional reasoning for temporal safety properties, (2) supports all expected sequential compiler optimizations, (3) satisfies the DRF-SC criterion, and (4) compiles to X64 and ARMv8 microprocessors without requiring extra fences on relaxed accesses.
A plethora of program analysis and optimization techniques rely on linear programming at their heart. However, such techniques are often considered too slow for production use. While today’s best solvers are optimized for complex problems with thousands of dimensions, linear programming, as used in compilers, is typically applied to small and seemingly trivial problems, but to many instances in a single compilation run. As a result, compilers do not benefit from decades of research on optimizing large-scale linear programming. We design a simplex solver targeted at compilers. A novel theory of transprecision computation applied from individual elements to full data-structures provides the computational foundation. By carefully combining it with optimized representations for small and sparse matrices and specialized small-coefficient algorithms, we (1) reduce memory traffic, (2) exploit wide vectors, and (3) use low-precision arithmetic units effectively. We evaluate our work by embedding our solver into a state-of-the-art integer set library and implement one essential operation, coalescing, on top of our transprecision solver. Our evaluation shows more than an order-of-magnitude speedup on the core simplex pivot operation and a mean speedup of 3.2x (vs. GMP) and 4.6x (vs. IMath) for the optimized coalescing operation. Our results demonstrate that our optimizations exploit the wide SIMD instructions of modern microarchitectures effectively. We expect our work to provide foundations for a future integer set library that uses transprecision arithmetic to accelerate compiler analyses.
Compilers should not crash and they should not miscompile applications. Random testing is an effective method for finding compiler bugs that have escaped other kinds of testing. This paper presents Yet Another Random Program Generator (YARPGen), a random test-case generator for C and C++ that we used to find and report more than 220 bugs in GCC, LLVM, and the Intel® C++ Compiler. Our research contributions include a method for generating expressive programs that avoid undefined behavior without using dynamic checks, and generation policies, a mechanism for increasing diversity of generated code and for triggering more optimizations. Generation policies decrease the testing time to find hard-to-trigger compiler bugs and, for the kinds of scalar optimizations YARPGen was designed to stress-test, increase the number of times these optimizations are applied by the compiler by an average of 20% for LLVM and 40% for GCC. We also created tools for automating most of the common tasks related to compiler fuzzing; these tools are also useful for fuzzers other than ours.
We present CompCertELF, the first extension to CompCert that supports verified compilation from C programs all the way to a standard binary file format, i.e., the ELF object format. Previous work on Stack-Aware CompCert provides a verified compilation chain from C programs to assembly programs with a realistic machine memory model. We build CompCertELF by modifying and extending this compilation chain with a verified assembler which further transforms assembly programs into ELF object files.
CompCert supports large-scale verification via verified separate compilation: C modules can be written and compiled separately, and then linked together to get a target program that refines the semantics of the program linked from the source modules. However, verified separate compilation in CompCert only works for compilation to assembly programs, not to object files. For the latter, the main difficulty is to bridge the two different views of linking: one for CompCert's programs that allows arbitrary shuffling of global definitions by linking and the other for object files that treats blocks of encoded definitions as indivisible units.
We propose a lightweight approach that solves the above problem without any modification to CompCert's framework for verified separate compilation: by introducing a notion of syntactical equivalence between programs and proving the commutativity between syntactical equivalence and the two different kinds of linking, we are able to transit from the more abstract linking operation in CompCert to the more concrete one for ELF object files. By applying this approach to CompCertELF, we obtain the first compiler that supports verified separate compilation of C programs into ELF object files.
A major challenge in writing applications that execute across hosts, such as distributed online services, is to reconcile (a) parallelism (i.e., allowing components to execute independently on disjoint tasks), and (b)cooperation (i.e., allowing components to work together on common tasks). A good compromise between the two is vital to scalability, a core concern in distributed networked applications.
The actor model of computation is a widely promoted programming model for distributed applications, as actors can execute in individual threads (parallelism) across different hosts and interact via asynchronous message passing (collaboration). However, this makes it hard for programmers to reason about combinations of messages as opposed to individual messages, which is essential in many scenarios.
This paper presents a pragmatic variant of the actor model in which messages can be grouped into units that are executed in a serializable manner, whilst still retaining a high degree of parallelism. In short, our model is based on an orchestration of actors along a directed acyclic graph that supports efficient decentralized synchronization among actors based on their actual interaction. We present the implementation of this model, based on a dynamic DAG-inducing referencing discipline, in the actor-based programming language AEON. We argue serializability and the absence of deadlocks in our model, and demonstrate its scalability and usability through extensive evaluation and case studies of wide-ranging applications.
Research on program termination has a long tradition. However, most of the existing techniques target a single program only. We propose in this paper an incremental termination analysis approach by reusing certified modules across different program versions. A transformation-based procedure is further developed to increase the reusability of certified modules. The proposed approach has wide applicability, applicable to various program changes. The proposed technique, to the best of our knowledge, represents a novel attempt to the termination analysis of evolving programs. We implemented the approach on top of Ultimate Automizer. Experimental results show dramatic improvement of our approach over the state-of-the-art tool.
Computer programs are increasingly being deployed in partially-observable environments. A partially observable environment is an environment whose state is not completely visible to the program, but from which the program receives partial observations. Developers typically deal with partial observability by writing a state estimator that, given observations, attempts to deduce the hidden state of the environment. In safety-critical domains, to formally verify safety properties developers may write an environment model. The model captures the relationship between observations and hidden states and is used to prove the software correct.
In this paper, we present a new methodology for writing and verifying programs in partially observable environments. We present belief programming, a programming methodology where developers write an environment model that the program runtime automatically uses to perform state estimation. A belief program dynamically updates and queries a belief state that captures the possible states the environment could be in. To enable verification, we present Epistemic Hoare Logic that reasons about the possible belief states of a belief program the same way that classical Hoare logic reasons about the possible states of a program. We develop these concepts by defining a semantics and a program logic for a simple core language called BLIMP. In a case study, we show how belief programming could be used to write and verify a controller for the Mars Polar Lander in BLIMP. We present an implementation of BLIMP called CBLIMP and evaluate it to determine the feasibility of belief programming.
Motivated by applications in robotics, we consider the task of synthesizing linear temporal logic (LTL) specifications based on examples and natural language descriptions. While LTL is a flexible, expressive, and unambiguous language to describe robotic tasks, it is often challenging for non-expert users. In this paper, we present an interactive method for synthesizing LTL specifications from a single example trace and a natural language description. The interaction is limited to showing a small number of behavioral examples to the user who decides whether or not they exhibit the original intent. Our approach generates candidate LTL specifications and distinguishing examples using an encoding into optimization modulo theories problems. Additionally, we use a grammar extension mechanism and a semantic parser to generalize synthesized specifications to parametric task descriptions for subsequent use. Our implementation in the tool LtlTalk starts with a domain-specific language that maps to a fragment of LTL and expands it through example-based user interactions, thus enabling natural language-like robot programming, while maintaining the expressive power and precision of a formal language. Our experiments show that the synthesis method is precise, quick, and asks only a few questions to the users, and we demonstrate in a case study how LtlTalk generalizes from the synthesized tasks to other, yet unseen, tasks.
Flaky tests are tests that can non-deterministically pass or fail for the same code version. These tests undermine regression testing efficiency, because developers cannot easily identify whether a test fails due to their recent changes or due to flakiness. Ideally, one would detect flaky tests right when flakiness is introduced, so that developers can then immediately remove the flakiness. Some software organizations, e.g., Mozilla and Netflix, run some tools—detectors—to detect flaky tests as soon as possible. However, detecting flaky tests is costly due to their inherent non-determinism, so even state-of-the-art detectors are often impractical to be used on all tests for each project change. To combat the high cost of applying detectors, these organizations typically run a detector solely on newly added or directly modified tests, i.e., not on unmodified tests or when other changes occur (including changes to the test suite, the code under test, and library dependencies). However, it is unclear how many flaky tests can be detected or missed by applying detectors in only these limited circumstances.
To better understand this problem, we conduct a large-scale longitudinal study of flaky tests to determine when flaky tests become flaky and what changes cause them to become flaky. We apply two state-of-theart detectors to 55 Java projects, identifying a total of 245 flaky tests that can be compiled and run in the code version where each test was added. We find that 75% of flaky tests (184 out of 245) are flaky when added, indicating substantial potential value for developers to run detectors specifically on newly added tests. However, running detectors solely on newly added tests would still miss detecting 25% of flaky tests. The percentage of flaky tests that can be detected does increase to 85% when detectors are run on newly added or directly modified tests. The remaining 15% of flaky tests become flaky due to other changes and can be detected only when detectors are always applied to all tests. Our study is the first to empirically evaluate when tests become flaky and to recommend guidelines for applying detectors in the future.
Differential privacy has emerged as a leading theoretical framework for privacy-preserving data gathering and analysis. It allows meaningful statistics to be collected for a population without revealing ``too much'' information about any individual member of the population. For software profiling, this machinery allows profiling data from many users of a deployed software system to be collected and analyzed in a privacy-preserving manner. Such a solution is appealing to many stakeholders, including software users, software developers, infrastructure providers, and government agencies.
We propose an approach for differentially-private collection of frequency vectors from software executions. Frequency information is reported with the addition of random noise drawn from the Laplace distribution. A key observation behind the design of our scheme is that event frequencies are closely correlated due to the static code structure. Differential privacy protections must account for such relationships; otherwise, a seemingly-strong privacy guarantee is actually weaker than it appears. Motivated by this observation, we propose a novel and general differentially-private profiling scheme when correlations between frequencies can be expressed through linear inequalities. Using a linear programming formulation, we show how to determine the magnitude of random noise that should be added to achieve meaningful privacy protections under such linear constraints. Next, we develop an efficient instance of this general machinery for an important subclass of constraints. Instead of LP, our solution uses a reachability analysis of a constraint graph. As an exemplar, we employ this approach to implement differentially-private method frequency profiling for Android apps.
Any differentially-private scheme has to balance two competing aspects: privacy and accuracy. Through an experimental study to characterize these trade-offs, we (1) show that our proposed randomization achieves much higher accuracy compared to related prior work, (2) demonstrate that high accuracy and high privacy protection can be achieved simultaneously, and (3) highlight the importance of linear constraints in the design of the randomization. These promising results provide evidence that our approach is a good candidate for privacy-preserving frequency profiling of deployed software.
Garbage collectors relieve the programmer from manual memory management, but lead to compiler-generated machine code that can behave differently (e.g. out-of-memory errors) from the source code. To ensure that the generated code behaves exactly like the source code, programmers need a way to answer questions of the form: what is a sufficient amount of memory for my program to never reach an out-of-memory error?
This paper develops a cost semantics that can answer such questions for CakeML programs. The work described in this paper is the first to be able to answer such questions with proofs in the context of a language that depends on garbage collection. We demonstrate that positive answers can be used to transfer liveness results proved for the source code to liveness guarantees about the generated machine code. Without guarantees about space usage, only safety results can be transferred from source to machine code.
Our cost semantics is phrased in terms of an abstract intermediate language of the CakeML compiler, but results proved at that level map directly to the space cost of the compiler-generated machine code. All of the work described in this paper has been developed in the HOL4 theorem prover.
We present Hoogle+, a web-based API discovery tool for Haskell. A Hoogle+ user can specify a programming task using either a type, a set of input-output tests, or both. Given a specification, the tool returns a list of matching programs composed from functions in popular Haskell libraries, and annotated with automatically-generated examples of their behavior. These features of Hoogle+ are powered by three novel techniques. First, to enable efficient type-directed synthesis from tests only, we develop an algorithm that infers likely type specifications from tests. Second, to return high-quality programs even with ambiguous specifications, we develop a technique that automatically eliminates meaningless and repetitive synthesis results. Finally, we show how to extend this elimination technique to automatically generate informative inputs that can be used to demonstrate program behavior to the user. To evaluate the effectiveness of Hoogle+ compared with traditional API search techniques, we perform a user study with 30 participants of varying Haskell proficiency. The study shows that programmers equipped with Hoogle+ generally solve tasks faster and were able to solve 50% more tasks overall.
Resolution and subtyping are two common mechanisms in programming languages. Resolution is used by features such as type classes or Scala-style implicits to synthesize values automatically from contextual type information. Subtyping is commonly used to automatically convert the type of a value into another compatible type. So far the two mechanisms have been considered independently of each other. This paper shows that, with a small extension, subtyping with intersection types can subsume resolution. This has three main consequences. Firstly, resolution does not need to be implemented as a separate mechanism. Secondly, the interaction between resolution and subtyping becomes apparent. Finally, the integration of resolution into subtyping enables first-class (implicit) environments. The extension that recovers the power of resolution via subtyping is the modus ponens rule of propositional logic. While it is easily added to declarative subtyping, significant care needs to be taken to retain desirable properties, such as transitivity and decidability of algorithmic subtyping, and coherence. To materialize these ideas we develop λiMP, a calculus that extends a iprevious calculus with disjoint intersection types, and develop its metatheory in the Coq theorem prover.
Dynamic programming languages face semantic and performance challenges in the presence of features, such as eval, that can inject new code into a running program. The Julia programming language introduces the novel concept of world age to insulate optimized code from one of the most disruptive side-effects of eval: changes to the definition of an existing function. This paper provides the first formal semantics of world age in a core calculus named juliette, and shows how world age enables compiler optimizations, such as inlining, in the presence of eval. While Julia also provides programmers with the means to bypass world age, we found that this mechanism is not used extensively: a static analysis of over 4,000 registered Julia packages shows that only 4-9% of packages bypass world age. This suggests that Julia's semantics aligns with programmer expectations.
The Dependent Object Types (DOT) calculus serves as a foundation of the Scala programming language, with a machine-verified soundness proof. However, Scala's type system has been shown to be unsound due to null references, which are used as default values of fields of objects before they have been initialized. This paper proposes ιDOT, an extension of DOT for ensuring safe initialization of objects. DOT was previously extended to κDOT with the addition of mutable fields and constructors. To κDOT, ιDOT adds an initialization effect system that statically prevents the possibility of reading a null reference from an uninitialized object. To design ιDOT, we have reformulated the Freedom Before Commitment object initialization scheme in terms of disjoint subheaps to make it easier to formalize in an effect system and prove sound. Soundness of ιDOT depends on the interplay of three systems of rules: a type system close to that of DOT, an effect system to ensure definite assignment of fields in each constructor, and an initialization system that tracks the initialization status of objects in a stack of subheaps. We have proven the overall system sound and verified the soundness proof using the Coq proof assistant.
Callbacks are an effective programming discipline for implementing event-driven programming, especially in environments like Ethereum which forbid shared global state and concurrency. Callbacks allow a callee to delegate the execution back to the caller. Though effective, they can lead to subtle mistakes principally in open environments where callbacks can be added in a new code. Indeed, several high profile bugs in smart contracts exploit callbacks.
We present the first static technique ensuring modularity in the presence of callbacks and apply it to verify prominent smart contracts. Modularity ensures that external calls to other contracts cannot affect the behavior of the contract. Importantly, modularity is guaranteed without restricting programming.
In general, checking modularity is undecidable—even for programs without loops. This paper describes an effective technique for soundly ensuring modularity harnessing SMT solvers. The main idea is to define a constructive version of modularity using commutativity and projection operations on program segments. We believe that this approach is also accessible to programmers, since counterexamples to modularity can be generated automatically by the SMT solvers, allowing programmers to understand and fix the error.
We implemented our approach in order to demonstrate the precision of the modularity analysis and applied it to real smart contracts, including a subset of the 150 most active contracts in Ethereum. Our implementation decompiles bytecode programs into an intermediate representation and then implements the modularity checking using SMT queries. Overall, we argue that our experimental results indicate that the method can be applied to many realistic contracts, and that it is able to prove modularity where other methods fail.
Large scale production distributed systems are difficult to design and test. Correctness must be ensured when processes run asynchronously, at arbitrary rates relative to each other, and in the presence of failures, e.g., process crashes or message losses. These conditions create a huge space of executions that is difficult to explore in a principled way. Current testing techniques focus on systematic or randomized exploration of all executions of an implementation while treating the implemented algorithms as black boxes. On the other hand, proofs of correctness of many of the underlying algorithms often exploit semantic properties that reduce reasoning about correctness to a subset of behaviors. For example, the communication-closure property, used in many proofs of distributed consensus algorithms, shows that every asynchronous execution of the algorithm is equivalent to a lossy synchronous execution, thus reducing the burden of proof to only that subset. In a lossy synchronous execution, processes execute in lock-step rounds, and messages are either received in the same round or lost forever—such executions form a small subset of all asynchronous ones.
We formulate the communication-closure hypothesis, which states that bugs in implementations of distributed consensus algorithms will already manifest in lossy synchronous executions and present a testing algorithm based on this hypothesis. We prioritize the search space based on a bound on the number of failures in the execution and the rate at which these failures are recovered. We show that a random testing algorithm based on sampling lossy synchronous executions can empirically find a number of bugs—including previously unknown ones—in production distributed systems such as Zookeeper, Cassandra, and Ratis, and also produce more understandable bug traces.
Logic bugs in Database Management Systems (DBMSs) are bugs that cause an incorrect result for a given query, for example, by omitting a row that should be fetched. These bugs are critical, since they are likely to go unnoticed by users. We propose Query Partitioning, a general and effective approach for finding logic bugs in DBMSs. The core idea of Query Partitioning is to, starting from a given original query, derive multiple, more complex queries (called partitioning queries), each of which computes a partition of the result. The individual partitions are then composed to compute a result set that must be equivalent to the original query's result set. A bug in the DBMS is detected when these result sets differ. Our intuition is that due to the increased complexity, the partitioning queries are more likely to stress the DBMS and trigger a logic bug than the original query. As a concrete instance of a partitioning strategy, we propose Ternary Logic Partitioning (TLP), which is based on the observation that a boolean predicate p can either evaluate to TRUE, FALSE, or NULL. Accordingly, a query can be decomposed into three partitioning queries, each of which computes its result on rows or intermediate results for which p, NOT p, and p IS NULL hold. This technique is versatile, and can be used to test WHERE, GROUP BY, as well as HAVING clauses, aggregate functions, and DISTINCT queries. As part of an extensive testing campaign, we found 175 bugs in widely-used DBMSs such as MySQL, TiDB, SQLite, and CockroachDB, 125 of which have been fixed. Notably, 77 of these were logic bugs, while the remaining were error and crash bugs. We expect that the effectiveness and wide applicability of Query Partitioning will lead to its broad adoption in practice, and the formulation of additional partitioning strategies.
Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, as well as mixtures of formats where data in one format is embedded into a different format. It is therefore a significant challenge to parse these files into a structured tabular form, which is important to enable any downstream data processing.
We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc and mixed format files. The framework is also easily extensible to add support for previously unseen formats, and also supports interactivity from the user in terms of examples to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combination of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which the file structure can be shared and refined by different techniques. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation where the variety of techniques supported by our system can be exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel is able to extract the intended table in many more cases compared to state-of-the-art techniques.
Synchronization primitives for fault-tolerant distributed systems that ensure an effective and efficient cooperation among processes are an important challenge in the programming languages community. We present a new programming abstraction, ReSync, for implementing benign and Byzantine fault-tolerant protocols. ReSync has a new round structure that offers a simple abstraction for group communication, like it is customary in synchronous systems, but also allows messages to be received one by one, like in the asynchronous systems. This extension allows implementing network and algorithm-specific policies for the message reception, which is not possible in classic round models.
The execution of ReSync programs is based on a new generic round switch protocol that generalizes the famous theoretical result about consensus in the presence of partial synchrony by of Dwork, Lynch, and Stockmeyer. We evaluate experimentally the performance of ReSync’s execution platform, by comparing consensus implementations in ReSync with LibPaxos3, etcd, and Bft-SMaRt, three consensus libraries tolerant to benign, resp. byzantine faults.
Actor concurrency is becoming increasingly important in the development of real-world software systems. Although actor concurrency may be less susceptible to some multithreaded concurrency bugs, such as low-level data races and deadlocks, it comes with its own bugs that may be different. However, the fundamental characteristics of actor concurrency bugs, including their symptoms, root causes, API usages, examples, and differences when they come from different sources are still largely unknown. Actor software development can significantly benefit from a comprehensive qualitative and quantitative understanding of these characteristics, which is the focus of this work, to foster better API documentation, development practices, testing, debugging, repairing, and verification frameworks. To conduct this study, we take the following major steps. First, we construct a set of 186 real-world Akka actor bugs from Stack Overflow and GitHub via manual analysis of 3,924 Stack Overflow questions, answers, and comments and 3,315 GitHub commits, messages, original and modified code snippets, issues, and pull requests. Second, we manually study these actor bugs and their fixes to understand and classify their symptoms, root causes, and API usages. Third, we study the differences between the commonalities and distributions of symptoms, root causes, and API usages of our Stack Overflow and GitHub actor bugs. Fourth, we discuss real-world examples of our actor bugs with these symptoms and root causes. Finally, we investigate the relation of our findings with those of previous work and discuss their implications. A few findings of our study are: (1) symptoms of our actor bugs can be classified into five categories, with Error as the most common symptom and Incorrect Exceptions as the least common, (2) root causes of our actor bugs can be classified into ten categories, with Logic as the most common root cause and Untyped Communication as the least common, (3) a small number of Akka API packages are responsible for most of API usages by our actor bugs, and (4) our Stack Overflow and GitHub actor bugs can differ significantly in commonalities and distributions of their symptoms, root causes, and API usages. While some of our findings agree with those of previous work, others sharply contrast.
We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program’s Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task.
We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves a 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code, as opposed to modeling the edits directly.
Our code, dataset, and trained models are publicly available at <a>https://github.com/tech-srl/c3po/</a> .
This paper presents an extension to Liquid Haskell that facilitates stating and semi-automatically proving properties of typeclasses. Liquid Haskell augments Haskell with refinement types—our work allows such types to be attached to typeclass method declarations, and ensures that instance implementations respect these types. The engineering of this extension is a modular interaction between GHC, the Glasgow Haskell Compiler, and Liquid Haskell’s core proof infrastructure. The design sheds light on the interplay between modular proofs and typeclass resolution, which in Haskell is coherent by default (meaning that resolution always selects the same implementation for a particular instantiating type), but in other dependently typed languages is not.
We demonstrate the utility of our extension by using Liquid Haskell to modularly verify that 34 instances satisfy the laws of five standard typeclasses.
More substantially, we implement a framework for programming distributed applications based on replicated data types (RDTs). We define a typeclass whose Liquid Haskell type captures the mathematical properties RDTs should satisfy; prove in Liquid Haskell that these properties are sufficient to ensure that replicas’ states converge despite out-of-order update delivery; implement (and prove correct) several instances of our RDT typeclass; and use them to build two realistic applications, a multi-user calendar event planner and a collaborative text editor.
We present Deuterium---a framework for implementing Java methods as executable contracts. Deuterium introduces a novel, type-safe way to write method contracts entirely in Java, as a combination of imperative generators and declarative specifications (written in a first-order relational logic with transitive closure). Existing approaches are typically based on encoding both the specification and the program heap into a constraint language, and then using an off-the-shelf constraint solver---without any additional guidance---to search for a new program heap that satisfies the specification. Deuterium takes advantage of user-provided generators to prune the search space and reduce incurred overhead of constraint solving. Deuterium supports two ways of solving declarative constraints: SAT-based and search-based with in-memory state exploration. We evaluate our approach on a suite of data structures, established as a standard benchmark by prior work. Furthermore, we use random and sequence-based test generation to create a new benchmark designed to mimic realistic execution scenarios. Our results show that generators improve the performance of executable contracts and that in-memory state exploration is the algorithm of choice when heap sizes are small.
We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.
While editing code, it is common for developers to make multiple related repeated edits that are all instances of a more general program transformation. Since this process can be tedious and error-prone, we study the problem of automatically learning program transformations from past edits, which can then be used to predict future edits. We take a novel view of the problem as a semi-supervised learning problem: apart from the concrete edits that are instances of the general transformation, the learning procedure also exploits access to additional inputs (program subtrees) that are marked as positive or negative depending on whether the transformation applies on those inputs. We present a procedure to solve the semi-supervised transformation learning problem using anti-unification and programming-by-example synthesis technology. To eliminate reliance on access to marked additional inputs, we generalize the semi-supervised learning procedure to a feedback-driven procedure that also generates the marked additional inputs in an iterative loop. We apply these ideas to build and evaluate three applications that use different mechanisms for generating feedback. Compared to existing tools that learn program transformations from edits, our feedback-driven semi-supervised approach is vastly more effective in successfully predicting edits with significantly lesser amounts of past edit data.
In order to generate efficient code, dynamic language compilers often need information, such as dynamic types, not readily available in the program source. Leveraging a mixture of static and dynamic information, these compilers speculate on the missing information. Within one compilation unit, they specialize the generated code to the previously observed behaviors, betting that past is prologue. When speculation fails, the execution must jump back to unoptimized code. In this paper, we propose an approach to further the specialization, by disentangling classes of behaviors into separate optimization units. With contextual dispatch, functions are versioned and each version is compiled under different assumptions. When a function is invoked, the implementation dispatches to a version optimized under assumptions matching the dynamic context of the call. As a proof-of-concept, we describe a compiler for the R language which uses this approach. Our implementation is, on average, 1.7× faster than the GNU R reference implementation. We evaluate contextual dispatch on a set of benchmarks and measure additional speedup, on top of traditional speculation with deoptimization techniques. In this setting contextual dispatch improves the performance of 18 out of 46 programs in our benchmark suite.
Automatic translation validation across the unoptimized intermediate representation (IR) of the original source code and the optimized executable assembly code is a desirable capability, and has the potential to compete with existing approaches to verified compilation such as CompCert. A difficult subproblem is the automatic identification of the correlations across the transitions between the two programs' respective locations. We present a counterexample-guided algorithm to identify these correlations in a robust and scalable manner. Our algorithm has both theoretical and empirical advantages over prior work in this problem space.
Many programming problems call for turning geometrical thoughts into code: tables, hierarchical structures, nests of objects, trees, forests, graphs, and so on. Linear text does not do justice to such thoughts. But, it has been the dominant programming medium for the past and will remain so for the foreseeable future.
This paper proposes a novel mechanism for conveniently extending textual programming languages with problem-specific visual syntax. It argues the necessity of this language feature, demonstrates the feasibility with a robust prototype, and sketches a design plan for adapting the idea to other languages.
The Amber rules are well-known and widely used for subtyping iso-recursive types. They were first briefly and informally introduced in 1985 by Cardelli in a manuscript describing the Amber language. Despite their use over many years, important aspects of the metatheory of the iso-recursive style Amber rules have not been studied in depth or turn out to be quite challenging to formalize.
This paper aims to revisit the problem of subtyping iso-recursive types. We start by introducing a novel declarative specification that we believe captures the “spirit” of Amber-style iso-recursive subtyping. Informally, the specification states that two recursive types are subtypes if all their finite unfoldings are subtypes. The Amber rules are shown to be sound with respect to this declarative specification. We then derive a sound, complete and decidable algorithmic formulation of subtyping that employs a novel double unfolding rule. Compared to the Amber rules, the double unfolding rule has the advantage of: 1) being modular; 2) not requiring reflexivity to be built in; and 3) leading to an easy proof of transitivity of subtyping. This work sheds new insights on the theory of subtyping iso-recursive types, and the new double unfolding rule has important advantages over the original Amber rules for both implementations and metatheoretical studies involving recursive types. All results are mechanically formalized in the Coq theorem prover. As far as we know, this is the first comprehensive treatment of iso-recursive subtyping dealing with unrestricted recursive types in a theorem prover.
Programming by example (PBE) is an important subproblem of program synthesis, and PBE techniques have been applied to many domains. Though many techniques for accelerating PBE systems have been explored, the scalability remains one of the main challenges: There is still a gap between the performances of state-of-the-art synthesizers and the industrial requirement. To further speed up solving PBE tasks, in this paper, we propose a novel PBE framework MaxFlash. MaxFlash uses a model based on structural probability, named topdown prediction models, to guide a search based on dynamic programming, such that the search will focus on subproblems that form probable programs, and avoid improbable programs. Our evaluation shows that MaxFlash achieves × 4.107− × 2080 speed-ups against state-of-the-art solvers on 244 real-world tasks.
We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations. We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures. Our evaluation shows that our models produce predictions that are difficult and time consuming for humans, while improving on existing methods by 28% and by 100% over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero.
Modern memory consistency models are complex, and it is difficult to reason about the relaxed behaviors that current systems allow. Programming languages, such as C and OpenCL, offer a memory model interface that developers can use to safely write concurrent applications. This abstraction provides functional portability across any platform that implements the interface, regardless of differences in the underlying systems. This powerful abstraction hinges on the ability of the system to correctly implement the interface. Many techniques for memory consistency model validation use empirical testing, which has been effective at uncovering undocumented behaviors and even finding bugs in trusted compilation schemes. Memory model testing consists of small concurrent unit tests called “litmus tests”. In these tests, certain observations, including potential bugs, are exceedingly rare, as they may only be triggered by precise interleaving of system steps in a complex processor, which is probabilistic in nature. Thus, each test must be run many times in order to provide a high level of confidence in its coverage.
In this work, we rigorously investigate empirical memory model testing. In particular, we propose methodologies for navigating complex stressing routines and analyzing large numbers of testing observations. Using these insights, we can more efficiently tune stressing parameters, which can lead to higher confidence results at a faster rate. We emphasize the need for such approaches by performing a meta-study of prior work, which reveals results with low reproducibility and inefficient use of testing time.
Our investigation is presented alongside empirical data. We believe that OpenCL targeting GPUs is a pragmatic choice in this domain as there exists a variety of different platforms to test, from large HPC servers to power-efficient edge devices. The tests presented in the work span 3 GPUs from 3 different vendors. We show that our methodologies are applicable across the GPUs, despite significant variances in the results. Concretely, our results show: lossless speedups of more than 5× in tuning using data peeking; a definition of portable stressing parameters which loses only 12% efficiency when generalized across our domain; a priority order of litmus tests for tuning. We stress test a conformance test suite for the OpenCL 2.0 memory model and discover a bug in Intel’s compiler. Our methods are evaluated on the other two GPUs using mutation testing. We end with recommendations for official memory model conformance tests.
A key challenge in program synthesis is the astronomical size of the search space the synthesizer has to explore. In response to this challenge, recent work proposed to guide synthesis using learned probabilistic models. Obtaining such a model, however, might be infeasible for a problem domain where no high-quality training data is available. In this work we introduce an alternative approach to guided program synthesis: instead of training a model ahead of time we show how to bootstrap one just in time, during synthesis, by learning from partial solutions encountered along the way. To make the best use of the model, we also propose a new program enumeration algorithm we dub guided bottom-up search, which extends the efficient bottom-up search with guidance from probabilistic models.
We implement this approach in a tool called Probe, which targets problems in the popular syntax-guided synthesis (SyGuS) format. We evaluate Probe on benchmarks from the literature and show that it achieves significant performance gains both over unguided bottom-up search and over a state-of-the-art probability-guided synthesizer, which had been trained on a corpus of existing solutions. Moreover, we show that these performance gains do not come at the cost of solution quality: programs generated by Probe are only slightly more verbose than the shortest solutions and perform no unnecessary case-splitting.
Current static verification techniques do not provide good support for incrementality, making it difficult for developers to focus on specifying and verifying the properties and components that are most important. Dynamic verification approaches support incrementality, but cannot provide static guarantees. To bridge this gap, prior work proposed gradual verification, which supports incrementality by allowing every assertion to be complete, partial, or omitted, and provides sound verification that smoothly scales from dynamic to static checking. The prior approach to gradual verification, however, was limited to programs without recursive data structures. This paper extends gradual verification to programs that manipulate recursive, mutable data structures on the heap. We address several technical challenges, such as semantically connecting iso- and equi-recursive interpretations of abstract predicates, and supporting gradual verification of heap ownership. This work thus lays the foundation for future tools that work on realistic programs and support verification within an engineering process in which cost-benefit trade-offs can be made.
Macros provide a powerful means of extending languages. They have proven useful in both general-purpose and domain-specific programming contexts. This paper presents an architecture for implementing macro-extensible DSLs on top of macro-extensible host languages. The macro expanders of these DSLs inherit the syntax system, hygienic expansion, and more from the host. They transform the extensible DSL syntax into a DSL core language. This arrangement has several important consequences. It becomes straightforward to integrate the syntax of various DSLs and the host language when their expanders share these inherited components. Also, a DSL compiler may be designed around a fixed core language, even for an extensible DSL. Finally, macros empower programmers to safely grow DSLs on their own and tailor them to their needs.
Concurrency bugs are notoriously hard to detect and reproduce. Controlled concurrency testing (CCT) techniques aim to offer a solution, where a scheduler explores the space of possible interleavings of a concurrent program looking for bugs. Since the set of possible interleavings is typically very large, these schedulers employ heuristics that prioritize the search to “interesting” subspaces. However, current heuristics are typically tuned to specific bug patterns, which limits their effectiveness in practice.
In this paper, we present QL, a learning-based CCT framework where the likelihood of an action being selected by the scheduler is influenced by earlier explorations. We leverage the classical Q-learning algorithm to explore the space of possible interleavings, allowing the exploration to adapt to the program under test, unlike previous techniques. We have implemented and evaluated QL on a set of microbenchmarks, complex protocols, as well as production cloud services. In our experiments, we found QL to consistently outperform the state-of-the-art in CCT.
Formally verifying software correctness is a highly manual process. However, because verification proof scripts often share structure, it is possible to learn from existing proof scripts to fully automate some formal verification. The goal of this paper is to improve proof script synthesis and enable fully automating more verification. Interactive theorem provers, such as the Coq proof assistant, allow programmers to write partial proof scripts, observe the semantics of the proof state thus far, and then attempt more progress. Knowing the proof state semantics is a significant aid. Recent research has shown that the proof state can help predict the next step. In this paper, we present TacTok, the first technique that attempts to fully automate proof script synthesis by modeling proof scripts using both the partial proof script written thus far and the semantics of the proof state. Thus, TacTok more completely models the information the programmer has access to when writing proof scripts manually. We evaluate TacTok on a benchmark of 26 software projects in Coq, consisting of over 10 thousand theorems. We compare our approach to five tools. Two prior techniques, CoqHammer, the state-of-the-art proof synthesis technique, and ASTactic, a proof script synthesis technique that models proof state. And three new proof script synthesis technique we create ourselves, SeqOnly, which models only the partial proof script and the initial theorem being proven, and WeightedRandom and WeightedGreedy, which use metaheuristic search biased by frequencies of proof tactics in existing, successful proof scripts. We find that TacTok outperforms WeightedRandom and WeightedGreedy, and is complementary to CoqHammer and ASTactic: for 24 out of the 26 projects, TacTok can synthesize proof scripts for some theorems the prior tools cannot. Together with TacTok, 11.5% more theorems can be proven automatically than by CoqHammer alone, and 20.0% than by ASTactic alone. Compared to a combination of CoqHammer and ASTactic, TacTok can prove an additional 3.6% more theorems, proving 115 theorems no tool could previously prove. Overall, our experiments provide evidence that partial proof script and proof state semantics, together, provide useful information for proof script modeling, and that metaheuristic search is a promising direction for proof script synthesis. TacTok is open-source and we make public all our data and a replication package of our experiments.
A robot’s code needs to sense the environment, control the hardware, and communicate with other robots. Current programming languages do not provide suitable abstractions that are independent of hardware platforms. Currently, developing robot applications requires detailed knowledge of signal processing, control, path planning, network protocols, and various platform-specific details. Further, porting applications across hardware platforms remains tedious. We present Koord—a domain specific language for distributed robotics—which abstracts platform-specific functions for sensing, communication, and low-level control. Koord makes the platform-independent control and coordination code portable and modularly verifiable. Koord raises the level of abstraction in programming by providing distributed shared memory for coordination and port interfaces for sensing and control. We have developed the formal executable semantics of Koord in the K framework. With this symbolic execution engine, we can identify assumptions (proof obligations) needed for gaining high assurance from Koord applications. We illustrate the power of Koord through three applications: formation flight, distributed delivery, and distributed mapping. We also use the three applications to demonstrate how platform-independent proof obligations can be discharged using the Koord Prover while platform-specific proof obligations can be checked by verifying the obligations using physics-based models and hybrid verification tools.
Code embedding, as an emerging paradigm for source code analysis, has attracted much attention over the past few years. It aims to represent code semantics through distributed vector representations, which can be used to support a variety of program analysis tasks (e.g., code summarization and semantic labeling). However, existing code embedding approaches are intraprocedural, alias-unaware and ignoring the asymmetric transitivity of directed graphs abstracted from source code, thus they are still ineffective in preserving the structural information of code.
This paper presents Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a value-flows). By approximating the high-order proximity, i.e., the asymmetric transitivity of value-flows, Flow2Vec embeds control-flows and alias-aware data-flows of a program in a low-dimensional vector space. Our value-flow embedding is formulated as matrix multiplication to preserve context-sensitive transitivity through CFL reachability by filtering out infeasible value-flow paths. We have evaluated Flow2Vec using 32 popular open-source projects. Results from our experiments show that Flow2Vec successfully boosts the performance of two recent code embedding approaches codevec and codeseq for two client applications, i.e., code classification and code summarization. For code classification, Flow2Vec improves codevec with an average increase of 21.2%, 20.1% and 20.7% in precision, recall and F1, respectively. For code summarization, Flow2Vec outperforms codeseq by an average of 13.2%, 18.8% and 16.0% in precision, recall and F1, respectively.