PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

SESSION: Applications

Coarse grain parallelization of deep neural networks

High performance model based image reconstruction

Exploiting accelerators for efficient high dimensional similarity search

SESSION: Language implementation and domain specific languages

Declarative coordination of graph-based parallel programs

Distributed Halide

Parallel type-checking with Haskell using saturating LVars and stream generators

SESSION: Algorithms

Articulation points guided redundancy elimination for betweenness centrality

Multi-core on-the-fly SCC decomposition

A high-performance parallel algorithm for nonnegative matrix factorization

AUTOGEN: automatic discovery of cache-oblivious parallel recursive algorithms for solving dynamic programs

SESSION: GPUs and scheduling

Gunrock: a high-performance graph processing library on the GPU

GPU multisplit

Keep calm and react with foresight: strategies for low-latency and energy-efficient elastic data stream processing

Work stealing for interactive services to meet target latency

SESSION: Shared-memory data structures

Adding approximate counters

A wait-free queue as fast as fetch-and-add

Lease/release: architectural support for scaling contended data structures

SESSION: Optimistic concurrency

Optimistic concurrency with OPTIK

Refined transactional lock elision

Drinking from both glasses: combining pessimistic and optimistic tracking of cross-thread dependences

SESSION: Locking

Be my guest: MCS lock now welcomes guests

Contention-conscious, locality-preserving locks

DomLock: a new multi-granularity locking technique for hierarchies

SESSION: Consistency models

Benchmarking weak memory models

The virtues of conflict: analysing modern concurrency

Causal consistency: beyond memory

SESSION: Performance analysis and debugging

ESTIMA: extrapolating scalability of in-memory applications

Grain graphs: OpenMP performance analysis made easy

Production-guided concurrency debugging

POSTER SESSION: Poster abstracts

Affinity-aware work-stealing for integrated CPU-GPU processors

An interval constrained memory allocator for the Givy GAS runtime

A programming system for future proofing performance critical libraries

A scalable lock-free hash table with open addressing

Concurrent hash tables: fast and general?(!)

CUDA acceleration for Xen virtual machines in InfiniBand clusters with rCUDA

Effect of portable fine-grained locality on energy efficiency and performance in concurrent search trees

Efficient distributed workstealing via matchmaking

Data-centric combinatorial optimization of parallel code

DSMR: a shared and distributed memory algorithm for single-source shortest path problem

Generic messages: capability-based shared memory parallelism for event-loop systems

Hybrid CPU-GPU scheduling and execution of tree traversals

Improving efficacy of internal binary search trees using local recovery

Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format

NUMA-aware scheduling and memory allocation for data-flow task-parallel applications

On designing NUMA-aware concurrency control for scalable transactional memory

On ordering transaction commit

OPR: deterministic group replay for one-sided communication

Preemption-aware planning on big-data systems

Samsara parallel: a non-BSP parallel-in-time model

Scalable adaptive NUMA-aware lock: combining local locking and remote locking for efficient concurrency

SPIRIT: a runtime system for distributed irregular tree applications

Tidex: a mutual exclusion lock

Unifying fixed code and fixed data mapping of load-imbalanced pipelined loops

User-assisted storage reuse determination for dynamic task graphs

Verification of MPI Java programs using software model checking