Modern software systems heavily use the memory heap. As systems grow more complex and process increasing amounts of data, it can be difficult for developers to understand how their programs actually use the bytes that they allocate on the heap and whether improvements are possible. To answer this question of heap usage efficiency, we have built a new, detailed heap profiler called Memoro. Memoro uses a combination of static instrumentation, subroutine interception, and runtime data collection to build a clear picture of exactly when and where a program performs heap allocation, and crucially how it actually uses that memory. Memoro also introduces a new visualization application that can distill collected data into scores and visual cues that allow developers to quickly pinpoint and eliminate inefficient heap usage in their software. Our evaluation and experience with several applications demonstrate that Memoro can reduce heap usage and yield runtime improvements of 10%.
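To make the "subroutine interception" ingredient concrete, the following is a minimal sketch of the classic LD_PRELOAD-style interposition a heap profiler can rely on to observe every allocation and free. It is illustrative only: `record_alloc` and `record_free` are hypothetical hooks, not Memoro's actual API, and Memoro itself also uses compiler instrumentation rather than only interposition.

```cpp
// Hedged sketch: intercept malloc/free so every allocation and lifetime end
// can be recorded. A real interceptor must also guard against re-entrancy
// (the hooks themselves may allocate).
#include <dlfcn.h>    // dlsym, RTLD_NEXT (glibc; g++ defines _GNU_SOURCE)
#include <cstddef>

static void record_alloc(void* /*p*/, std::size_t /*n*/) { /* timestamp, call stack */ }
static void record_free(void* /*p*/)                      { /* mark lifetime end */ }

extern "C" void* malloc(std::size_t n) noexcept {
    static auto real_malloc =
        reinterpret_cast<void* (*)(std::size_t)>(dlsym(RTLD_NEXT, "malloc"));
    void* p = real_malloc(n);   // forward to the real allocator
    record_alloc(p, n);
    return p;
}

extern "C" void free(void* p) noexcept {
    static auto real_free =
        reinterpret_cast<void (*)(void*)>(dlsym(RTLD_NEXT, "free"));
    record_free(p);
    real_free(p);
}
```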
We present FRC, a high-performance concurrent parallel reference counter for unmanaged languages. It is well known that high-performance garbage collectors help developers write memory-safe, highly concurrent systems and data structures. While C++, C, and other unmanaged languages are used in high-performance applications, adding concurrent memory management to these languages has proven to be difficult. Unmanaged languages like C++ use pointers instead of references, and have uncooperative mutators which do not pause easily at a safe point. Thus, scanning mutator stack root references is challenging.
FRC only defers decrements and does not require mutator threads to pause during collection. By deferring only decrements, FRC avoids much of the synchronization overhead of a fully-deferred implementation. Root references are scanned without interrupting the mutator by publishing these references to a thread-local array. FRC's performance can exceed that of the C++ standard library's shared pointer by orders of magnitude. FRC's thread-safety guarantees and low synchronization overhead enable significant throughput gains for concurrently-readable shared data structures.
We describe the components of FRC, including our static tree router data structure: a novel barrier which improves the scalability of parallel collection workers. FRC's performance is evaluated on several concurrent data structures. We release FRC and our tests as open-source code and expect FRC will be useful for many concurrent C++ software systems.
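A minimal sketch of the deferral idea described above may help: increments are applied eagerly, while decrements are only logged into a per-thread buffer that a collector drains later, so mutators never pause and the hot path avoids heavy synchronization. The names and layout are illustrative assumptions, not FRC's actual implementation.

```cpp
// Hedged sketch of deferred-decrement reference counting.
#include <atomic>
#include <vector>

struct RcObject {
    std::atomic<long> count{1};
    virtual ~RcObject() = default;
};

// Per-thread log of pending decrements; handed to the collector at an epoch
// boundary, so dropping a reference never blocks or frees inline.
thread_local std::vector<RcObject*> deferred_decrements;

inline void rc_acquire(RcObject* o) {
    // Increments are applied eagerly; the count therefore never undercounts.
    o->count.fetch_add(1, std::memory_order_relaxed);
}

inline void rc_release(RcObject* o) {
    // Defer the decrement instead of applying it (and possibly freeing) now.
    deferred_decrements.push_back(o);
}

// Collector side: apply buffered decrements and reclaim anything that
// reaches zero. Safe because the live count is always >= the true count.
inline void rc_collect(std::vector<RcObject*>& buffered) {
    for (RcObject* o : buffered) {
        if (o->count.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete o;
    }
    buffered.clear();
}
```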
We propose a scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O(log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O(E_a) time, where E_a is the number of edges in the induced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.
Dynamic programming languages are becoming increasingly popular, yet often show a significant performance slowdown compared to static languages. In this paper, we study the performance overhead of automatic memory management in dynamic languages. We propose to improve the performance and memory bandwidth usage of dynamic languages by co-optimizing garbage collection overhead and cache performance for newly-initialized and dead objects. Our study shows that less frequent garbage collection results in a large number of cache misses for initial stores to new objects. We solve this problem by directly placing uninitialized objects into on-chip caches without off-chip memory accesses. We further optimize garbage collection by reducing unnecessary cache pollution and write-backs through partial tracing that invalidates dead objects between full garbage collections. Experimental results on PyPy and V8 show that less frequent garbage collection along with our optimizations can significantly improve the performance of dynamic languages.
The cloud is an increasingly popular platform to deploy applications as it lets cloud users provision resources for their applications as needed. Furthermore, cloud providers are now starting to offer a "pay-as-you-use" model in which users are charged only for the resources they actually use instead of paying for a statically sized instance. This new model allows cloud users to save money, and cloud providers to better utilize their hardware.
However, applications running on top of runtime environments such as the Java Virtual Machine (JVM) cannot benefit from this new model because they cannot adapt the amount of resources they use at runtime. In particular, if an application needs more memory than was predicted at launch time, the JVM will not allow it to grow its memory beyond the maximum value defined at launch. In addition, the JVM holds on to memory that is no longer being used by the application. This lack of dynamic vertical scalability negates the benefits of the "pay-as-you-use" model, forcing users to over-provision resources and to pay for resources they do not use.
We propose a new JVM heap sizing strategy that allows the JVM to dynamically scale its memory utilization according to the application's needs. First, we provide a configurable limit on how much the application can grow its memory. This limit is dynamic and can be changed at runtime, as opposed to the current static limit that can only be set at launch time. Second, we adapt current garbage collection policies that control how much the heap can grow and shrink to better fit what is currently being used by the application.
The proposed solution is implemented in the OpenJDK 9 HotSpot JVM, the new release of OpenJDK. Changes were also introduced inside the Parallel Scavenge collector and the Garbage First collector (the new default collector in HotSpot). Evaluation experiments using real workloads and data show that dynamic vertical memory scalability can be achieved with negligible throughput and memory overhead. This allows users to save significant amounts of money by not paying for unused resources, and cloud providers to better utilize their physical machines.
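The kind of sizing decision such a policy makes after each collection can be sketched as follows: the heap target follows the live data with some headroom, bounded by a limit that can be changed while the application runs. The names (`current_max_bytes`, `compute_heap_target`, the 30% headroom) are illustrative assumptions, not HotSpot's actual code or flags.

```cpp
// Hedged sketch of a dynamic heap-sizing decision taken at the end of a GC.
#include <algorithm>
#include <atomic>
#include <cstddef>

// Runtime-adjustable limit (e.g., set through a management interface),
// replacing a maximum fixed once at launch.
std::atomic<std::size_t> current_max_bytes{static_cast<std::size_t>(4) << 30};  // 4 GiB

// Hypothetical hook invoked after a collection cycle.
std::size_t compute_heap_target(std::size_t live_bytes,
                                std::size_t min_bytes,
                                double headroom /* e.g. 1.3 = 30% free space */) {
    auto desired = static_cast<std::size_t>(live_bytes * headroom);
    std::size_t limit = current_max_bytes.load(std::memory_order_relaxed);
    // Grow toward the desired size, shrink when the application needs less,
    // and never exceed the dynamic limit.
    return std::min(std::max(desired, min_bytes), limit);
}
```

The runtime would then commit or uncommit memory to reach that target, returning unused pages to the operating system so the user is not billed for them.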
While single-machine MapReduce systems can squeeze maximum performance out of the available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently suffers out-of-memory crashes. Even though today's computers are equipped with efficient secondary storage devices, these frameworks do not utilize them, mainly because disk access latencies are much higher than those of main memory. The single-machine setup of the Hadoop system therefore performs much slower when presented with datasets larger than main memory. Moreover, such frameworks require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk reads and writes whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that OMR delivers far higher performance than the standalone single-machine setup of the Hadoop system. In contrast to Metis, OMR avoids out-of-memory crashes on large datasets and also delivers higher performance when datasets are small enough to fit in main memory.
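The combination of on-the-fly aggregation and block-sequential spilling can be sketched roughly as below. This is an illustrative assumption of how such a combiner might be structured, not OMR's actual code; keys and values are integers for brevity, and a real system would also sort and merge the runs during the reduce phase with sequential reads.

```cpp
// Hedged sketch: aggregate map output in memory, and when the memory budget
// is exceeded, write the whole table to disk in one sequential run.
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>

class SpillingCombiner {
public:
    explicit SpillingCombiner(std::size_t max_entries) : max_entries_(max_entries) {}

    // On-the-fly aggregation (word-count-style sum); spill when over budget.
    void emit(int key, long value) {
        table_[key] += value;
        if (table_.size() >= max_entries_) spill();
    }

    void spill() {
        std::string name = "spill_" + std::to_string(run_++) + ".bin";
        if (FILE* f = std::fopen(name.c_str(), "wb")) {
            for (const auto& kv : table_)            // one sequential run,
                std::fwrite(&kv, sizeof(kv), 1, f);  // no random disk I/O
            std::fclose(f);
        }
        table_.clear();                              // free memory before continuing
    }

private:
    std::unordered_map<int, long> table_;
    std::size_t max_entries_;
    int run_ = 0;
};
```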
Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves a web application's performance by serving requests from memory and avoiding fetches from the backend database. Since memory space is limited, maximizing memory utilization is key to delivering the best possible performance. This has led to the use of multi-tenant systems, allowing applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation.
In this work, we address both multi-tenancy (where a single cache is used for multiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near-optimal memory allocation given the available system memory, while adapting to changing data access patterns. We show that our model outperforms an existing state-of-the-art sharing model, Memshare, in terms of overall cache hit ratio and does so at a lower time cost. We show that for a typical system, overall hit ratio is consistently 1 percentage point greater and 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests.
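As a rough illustration of how miss ratio curves can drive partitioning, the sketch below allocates memory one unit at a time to whichever tenant's curve predicts the largest drop in miss ratio. This is a generic greedy marginal-gain scheme under assumed inputs, not the paper's exact algorithm; weighting by each tenant's request rate is omitted for brevity, and greedy allocation is only near-optimal when the curves are convex.

```cpp
// Hedged sketch: partition cache memory using per-tenant miss ratio curves.
#include <cstddef>
#include <vector>

// mrc[t][s] = predicted miss ratio of tenant t when given s units of cache.
std::vector<std::size_t> allocate(const std::vector<std::vector<double>>& mrc,
                                  std::size_t total_units) {
    std::vector<std::size_t> alloc(mrc.size(), 0);
    for (std::size_t u = 0; u < total_units; ++u) {
        std::size_t best = 0;
        double best_gain = -1.0;
        for (std::size_t t = 0; t < mrc.size(); ++t) {
            std::size_t s = alloc[t];
            if (s + 1 >= mrc[t].size()) continue;      // curve exhausted
            double gain = mrc[t][s] - mrc[t][s + 1];   // miss-ratio reduction
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best_gain < 0) break;                      // no tenant can benefit
        ++alloc[best];                                 // grant one more unit
    }
    return alloc;
}
```

Re-running this allocation as the online miss ratio curves are updated is what lets the partitioning adapt to changing access patterns.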
The cache in multicore machines is often shared, and cache performance depends on how memory accesses belonging to different programs interleave with one another. The full range of performance possibilities includes all possible interleavings, which are too numerous to be studied experimentally for any mix of non-trivial programs.
This paper presents a theory to characterize the effect of memory access interleaving due to parallel execution of non-data-sharing programs. The theory uses an established metric called the footprint (which can be used to calculate miss ratios in fully-associative LRU caches) to measure cache demand, and considers the full range of interleaving possibilities. The paper proves a lower bound for footprints of interleaved traces, and then formulates an upper bound in terms of the footprints of the constituent traces. It also shows the correctness of footprint composition used in a number of existing techniques, and places precise bounds on its accuracy.
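For orientation, the quantities involved can be sketched as follows, using the standard footprint definition from the locality literature; the lower and upper bounds themselves are the paper's contribution and are not reproduced here.

```latex
% Footprint: the average number of distinct blocks accessed over all
% windows of length w in a trace of N accesses.
\[
  \mathrm{fp}(w) \;=\; \frac{1}{N - w + 1} \sum_{i=1}^{N-w+1}
      \bigl|\{\,\text{distinct blocks in accesses } i,\dots,i+w-1\,\}\bigr|
\]
% Additive composition used by a number of existing techniques for two
% non-data-sharing programs A and B, where a window of length w contains
% w_A accesses from A and w_B = w - w_A from B:
\[
  \mathrm{fp}_{A \parallel B}(w) \;\approx\; \mathrm{fp}_A(w_A) + \mathrm{fp}_B(w_B)
\]
```

The paper's bounds quantify how far the true footprint of an interleaved trace can deviate from this additive approximation across all possible interleavings.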
Work-stealing is promising for scheduling and balancing parallel workloads. It is widely applicable in middleware, libraries, and the runtime systems of programming languages. OpenJDK uses work-stealing in its copying garbage collection (GC) to balance copying tasks among GC threads. Each thread has its own queue to store tasks. When a thread has no task in its queue, it acts as a thief and attempts to steal a task from another thread's queue. However, this work-stealing algorithm requires expensive memory fences for pushing, popping, and stealing tasks, especially on weak memory models such as POWER and ARM. To address this problem, we propose a work-stealing algorithm that uses double queues. Each GC thread has a public queue that is accessible from other GC threads and a private queue that only it can access. Pushing and popping tasks in the private queue are free from expensive memory fences. The key point of our algorithm is a mechanism that maintains load balance across the double queues. We developed a prototype implementation for the parallel GC in OpenJDK 8 for ppc64le and evaluated our algorithm using SPECjbb2015, SPECjvm2008, TPC-DS, and Apache DayTrader.
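The double-queue idea can be sketched roughly as follows: the owner pushes and pops on a plain private queue with no fences, periodically publishing part of its work so idle threads have something to steal from the public queue. This is an illustrative assumption (a mutex stands in for a lock-free shared deque), not OpenJDK's implementation.

```cpp
// Hedged sketch of per-thread double queues for work-stealing.
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

using Task = void (*)();

struct WorkerQueues {
    std::vector<Task> priv;   // private: owner-only, fence-free access
    std::deque<Task> pub;     // public: shared with thieves
    std::mutex pub_lock;      // stands in for a lock-free deque

    void push(Task t) {
        priv.push_back(t);
        // Load balancing: once the private queue grows, expose some tasks.
        if (priv.size() > kPublishThreshold) publish();
    }

    std::optional<Task> pop() {
        if (!priv.empty()) {                      // fast path: no fences
            Task t = priv.back();
            priv.pop_back();
            return t;
        }
        std::lock_guard<std::mutex> g(pub_lock);  // slow path: reclaim
        if (pub.empty()) return std::nullopt;     // published work
        Task t = pub.back();
        pub.pop_back();
        return t;
    }

    std::optional<Task> steal() {                 // called by other threads
        std::lock_guard<std::mutex> g(pub_lock);
        if (pub.empty()) return std::nullopt;
        Task t = pub.front();                     // steal from the far end
        pub.pop_front();
        return t;
    }

    void publish() {                              // move half to the public queue
        std::lock_guard<std::mutex> g(pub_lock);
        std::size_t half = priv.size() / 2;
        for (std::size_t i = 0; i < half; ++i) {
            pub.push_back(priv.back());
            priv.pop_back();
        }
    }

    static constexpr std::size_t kPublishThreshold = 64;
};
```

The balance of how much work stays private versus how much is published is the crux: publishing too eagerly reintroduces synchronization on every operation, while publishing too lazily starves the thieves.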