Abstracts
Performance of Fractal-Tree Databases
Stony Brook University and Tokutek, Inc.
Insertion bottlenecks lie at the heart of database and file-system innovations, best practices, and system workarounds. Most databases and file systems are based on B-tree data structures, and suffer from the performance cliffs and unpredictable run times of B-trees. In this talk, I introduce the Fractal Tree data structure and explain how it provides dramatically improved performance in both theory and in practice.
From a theoretical perspective, if B is the block-transfer size, the B-tree performs O(log_B N) block transfers per insert in the worst case. In contrast, the Fractal Tree structure performs O((log N)/B) memory transfers per insert, which translates to run-time improvements of up to two orders of magnitude.
To relate that theory to practice, I present an algorithmic model for B-tree performance bottlenecks. I explain how the bottlenecks affect best practice and how database designers typically modify B-trees to try to mitigate the bottlenecks. Then I show how Fractal Tree structures can attain faster insertion rates, intuitively by transforming disk-seek bottlenecks into disk-bandwidth bottlenecks.
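The buffering idea behind such structures can be sketched with a simplified cache-oblivious lookahead array (COLA), one published variant of the Fractal Tree idea. The toy Python version below is illustrative only and is not Tokutek's storage engine: every key moves O(log N) times in total, but always via sequential merges rather than random seeks.

```python
import bisect

class COLA:
    """Toy cache-oblivious lookahead array: level k holds a sorted run
    of at most 2**k keys; full levels are merged downward sequentially."""

    def __init__(self):
        self.levels = []  # levels[k] is a sorted list of size <= 2**k

    def insert(self, key):
        carry = [key]
        k = 0
        # merge the carry into successive levels until one has room
        while True:
            if k == len(self.levels):
                self.levels.append(carry)
                return
            merged = sorted(self.levels[k] + carry)  # sequential merge
            if len(merged) <= 2 ** k:
                self.levels[k] = merged
                return
            self.levels[k] = []
            carry = merged
            k += 1

    def contains(self, key):
        # a query probes every level; real designs add fractional
        # cascading or Bloom filters to speed this up
        for run in self.levels:
            i = bisect.bisect_left(run, key)
            if i < len(run) and run[i] == key:
                return True
        return False
```

Because each key participates only in large sequential merges, inserts are bandwidth-bound rather than seek-bound, which is the intuition stated above.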
I conclude with performance results. Tokutek has developed a
transaction-safe Fractal-Tree storage engine for MySQL. I present
performance results, showing how the system can maintain rich indexes
more efficiently than B-trees. Surprisingly, Fractal Tree structures
seem to maintain their order-of-magnitude competitive advantage over
B-trees on both traditional rotating media and SSDs.
Network Obliviousness
University of Padova
The design of algorithms that can run unchanged yet efficiently on a variety of machines characterized by different degrees of parallelism and communication capabilities is a highly desirable goal. We propose a framework for network-obliviousness based on a model of computation where the only parameter is the problem's input size. Algorithms are then evaluated on a model with two parameters, capturing parallelism and granularity of communication. We show that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in a block-variant of the Decomposable BSP model, which effectively describes a wide and significant class of parallel platforms. We illustrate our framework by providing optimal network-oblivious algorithms for a few key problems, and also establish some negative results.
Automatic Performance Tuning: Today and Future Challenges
University of Southern California
On the Implementation of Circuits for the Discrete Fourier Transform
University of Padova
The Discrete Fourier Transform (DFT) remains a key building block in countless computing systems. Commercial solutions usually rely on ASIC implementations of the radix-2 FFT algorithm, but this classical, static approach is being questioned by the demanding requirements of modern applications. A clear example is given by new communication standards, which mandate support for many combinations of parameters to ensure good performance in different scenarios: for instance, DVB-T2 requires switching between 6 different DFT lengths. Other applications, such as cellular telephony, mandate support for multiple standards at once. The need for adaptivity is foreseen to grow further in the future, with the spread of software-defined and cognitive radio systems.
In this scenario, the talk will present some remarks on the silicon
implementation of the DFT in modern processors, such as multi-core
chips or application-specific instruction-set processors. The objective
is to shed light on key properties that are valid regardless of the
size, algorithm and other details of the DFT implementation.
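As background, the radix-2 FFT that the ASIC solutions mentioned above typically implement can be stated in a few lines. This Python sketch illustrates only the algorithm's butterfly structure, not a silicon implementation; the input length must be a power of two, which is exactly the rigidity that multi-length standards strain against.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be 2**m)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

In hardware, the twiddle factors and the even/odd data routing are fixed per DFT length, which is why supporting many lengths complicates a static radix-2 design.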
SW/HW Approach for Optimizing the Performance
of Synchronous Shared Memory Architectures to Low-TLP Situations
VTT Electronics
Synchronous shared memory (SSM) architectures are promising candidates for future CMP architectures due to their ability to execute general purpose parallel code efficiently down to the finest granularity and support for easy-to-use parallel programming models. While the recent SSM architectures are tuned for fast execution of parallel workloads and co-exploitation of ILP and TLP, the solutions used in them do not support efficient execution of low-TLP code fragments. More generally speaking, this inability of architectures optimized for parallel execution to efficiently execute sequential code has been shown to be one of the design bottlenecks in the theory of architectures.
In this presentation we propose a SW/HW approach for dynamically
optimizing the performance of recent SSM architectures to low-TLP
situations. The HW part includes changes to the processor pipeline and
instruction set as well as a new technique called bunching that
combines execution slots of multiple threads into a single bunch
executing a single thread with a speedup proportional to the number of
threads. The SW part includes a language-level mechanism to support
seamless bunching concurrently with parallel execution. A preliminary
evaluation of the approach is given.
Abstraction in Manycore Programming: Active Libraries and Indexed Metadata
Imperial College London
This talk is a review and manifesto, looking at what makes building
high-performance software hard. I'll talk about some of the
correctness and performance issues, with a specific focus on how these
concerns are destructive to software quality - how performance hacking
breaks software abstractions and reusability. My research has
attacked this problem from multiple perspectives; the main aim of this
talk is to set out our agenda for software technologies that allow
performance engineering to be isolated from higher-level algorithmic
development - by delivering pluggable domain-specific, or
library-specific, optimisation tools. Our recent examples of this
include a collaboration with The Foundry Ltd. on image processing in
visual effects postproduction, an active library for dense and sparse
linear algebra, and program generation tools for finite-element CFD.
Context-aware Composition of Parallel Components
Linköping University
Programming parallel systems efficiently is notoriously difficult. Components are a well-proven concept to manage software design and implementation complexity. However, to be reusable, components are often more general than necessary. Moreover, they hardcode and hide too many design decisions such as scheduling, algorithm selection or data representation selection, which could be bound better at a later point in time, e.g. at run-time when more information about available resources or problem parameters is known. We show how such decisions can be guided at run-time with low overhead by look-up tables that are computed off-line from profiling data. This context-aware composition is a powerful optimization technique that can be seen as a generalization of the currently very popular autotuning methods for special domains and library functions. While formulated for a parallel scenario, our method also applies to optimized composition of sequential programs as a special case.
(This is joint work with Welf Löwe, Univ. of Växjö, Sweden.)
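The table-driven dispatch described above can be illustrated with a toy sketch. The variants, context feature (input size), and profiling procedure here are invented for illustration; the real system handles richer contexts and parallel components.

```python
import bisect
import timeit

def insertion_sort(a):
    # fast for small inputs, quadratic for large ones
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def merge_sort(a):
    # the library sort stands in for a divide-and-conquer variant
    return sorted(a)

VARIANTS = [insertion_sort, merge_sort]

def profile(sizes):
    """Off-line phase: time each variant per context and record the winner."""
    table = []
    for n in sizes:
        data = list(range(n, 0, -1))  # worst-case input for this sketch
        best = min(VARIANTS,
                   key=lambda f: timeit.timeit(lambda: f(data), number=5))
        table.append((n, best))
    return table

def dispatch(table, a):
    """Run-time phase: a cheap table lookup selects the variant."""
    sizes = [n for n, _ in table]
    i = max(0, bisect.bisect_right(sizes, len(a)) - 1)
    return table[i][1](a)
```

The run-time cost of composition is a binary search over the profiled contexts, which is the low-overhead property the abstract emphasizes.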
Obliviousness
Massachusetts Institute of Technology
To achieve high performance and high productivity, systems must support oblivious programmers. Programmers do not and cannot deal with complex systems well, so as systems theorists and designers we must hide a bunch of details. Here are some examples:
- Scheduler-Obliviousness: Programmers know there are many cores in their computer, but it's better if they don't know exactly how many. The number of cores will certainly change over the useful life of a program, and it may even change dynamically over a single run. A provably good thread scheduler (such as a work-stealing scheduler) can hide the details of scheduling work onto cores, and a provably good job scheduler can hide the details of which jobs get how many cores. Instead of worrying about how to schedule a program, a programmer can worry about minimizing the work and span (critical path length) of a program.
- Cache-Obliviousness: To achieve good performance in a cache hierarchy (which may include a disk system), software must operate on memory in a blocked fashion. The optimal block size may be difficult to ascertain and to program, especially in a multi-level hierarchy. With disk systems, the optimal block size may change over time (because of prefetching) and over space (e.g., disk inner tracks have smaller effective block sizes than outer tracks). Cache-oblivious algorithms provide a way to get good performance without measuring the cache hierarchy.
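The second example can be made concrete with a classic cache-oblivious algorithm, recursive matrix transpose. This sketch is illustrative, not taken from the talk: the recursion keeps halving the longer dimension, so subproblems eventually fit in every level of the cache hierarchy without the code ever being told the block size B.

```python
def transpose(a, b, r0=0, r1=None, c0=0, c1=None):
    """Write the transpose of a[r0:r1][c0:c1] into b, so b[j][i] = a[i][j]."""
    if r1 is None:
        r1, c1 = len(a), len(a[0])
    rows, cols = r1 - r0, c1 - c0
    if rows <= 2 and cols <= 2:
        # base case: a tiny tile, copied directly
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # split the longer dimension and recurse
        mid = r0 + rows // 2
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)
```

No cache parameter appears anywhere, yet the access pattern is asymptotically cache-optimal at every level of the hierarchy simultaneously, which is the point of the bullet above.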
Moving Threads instead of Data - Benefits and Challenges
University of Turku
A traditional solution in multiprocessor/multicore systems based on the shared memory concept is to move data read/write requests between processors/cores. This can lead to memory consistency problems if cache memories are attached to each processor/core and memory values can be replicated into several caches to compensate for the slowness of the memory hierarchy. Other possible problems are an imbalance of threads allocated to cores and/or dynamic hotspots in the memory system due to unsuccessful allocation of the program's data structures.
In this talk, we consider moving threads dynamically instead of data read/write requests. The moving threads approach makes it possible to avoid consistency problems, if each core can access only a separate portion of the whole shared memory through its cache memory. In principle, randomness in mapping the program's data structures into the shared memory can make the threads load each core (and its cache) evenly. An obvious drawback, however, is the rather high thread-moving capability required. On the other hand, one can also consider having special annotations in programs to guide the allocation of data structures so that temporal locality in the execution of threads is maximized and the thread-moving requirements are minimized. Of course, this complicates programming considerably.
We present a RISC-based multicore architecture framework supporting the moving threads approach and consider the cost of synchronous execution of programs on it. We consider issues such as the size of threads (the number of registers), the number of threads per core needed to hide thread-moving and memory hierarchy latencies, which of these two types of latency dominates, and heap construction.
Enhanced Performance Analysis of Parallel Applications with an Integrated Tool-chain
Jülich Supercomputing Center
Programming and optimizing large parallel applications is an ambitious
and time-consuming challenge. Therefore a number of software tools
have been developed in the past to assist the programmer in analysing
and optimizing their codes. Scalasca from Juelich Supercomputing
Centre and Vampir from the Technical University of Dresden are two of
these performance-analysis tools that are already well established and
recognized in the community. While Scalasca shows performance flaws
due to certain common patterns in a compact and easy-to-use tree
representation, the Vampir framework visualizes all details of the
function call sequences and communication pattern in a great variety
of time-line or statistic displays. Scalasca shows the problem and its
severity but gives only limited practical hints for finding a solution.
The huge amount of data in Vampir, in turn, can make it hard to find the
interesting spots, in spatial and temporal resolution, for identifying
problems and selecting them for further investigation. This observation
led to the combination of these tools, resulting in an information gain
higher than the sum of what each tool offers by itself. In this
presentation we present our approach to
connect Scalasca and Vampir through a common bus system. A practical
real-world example is provided, and the beneficial impact of the
integration on the work-flow is shown by optimizing the adaptive mesh
refinement and load balancing phases of the metal-forming FE simulation
software INDEED.
The Importance and Effectiveness of Speculative Optimizations
IBM T. J. Watson Research Center
Dynamic approaches have the potential for significantly more impact on
performance optimization than static approaches. As an example,
just-in-time or runtime compilers have access to a lot more
information about the behavior of a program than static
compilers. That information includes execution metrics such as method
frequency and loop bounds, as well as dependence information that can
be useful in enabling compiler transformations. Nevertheless, it can
still be difficult (or expensive) to determine the legality of certain
optimizations even at runtime. In this talk, we will discuss the
effectiveness of performing optimizations speculatively. That is,
optimized code is executed even though the compiler and runtime system
are not sure of its legality. Possible violations of proper program
semantics are then detected while the code is executing, and some form
of correction is performed to undo the effects of incorrect code. We will
review mechanisms to both detect and correct the violations and discuss
the performance impact of those mechanisms and the optimizations that
they enable.
Scalable performance analysis on large systems: What to do with all the data?
Technical University of Dresden
Performance analysis and optimization is an interesting - and challenging - field in computer science. The goal is to understand what happens on a computer system, to identify shortcomings in programming and system bottlenecks, and to improve the execution behaviour to get a faster solution with better utilization of system resources. If everything goes well, the user gets a faster solution - sometimes by one or two orders of magnitude.
Over time, the community has learned how to deal with parallel machines. Nevertheless, looking at current trends in technology, we risk losing the analysis infrastructure needed to investigate systems with half a million cores and beyond; this has already become a severe issue today. Therefore, new ideas and software concepts are needed, and the talk will present a strategy and a prototype implementation to address this field.
HCE: The Optimized Design Flow from ANSI C Programs to Optimized FPGA Architectures
ENEA and Ylichron Srl
The HCE (HARWEST Compiling Environment) design flow, developed by Ylichron, will be shortly presented. Starting from an algorithm expressed through an ANSI C program, HCE generates an optimized parallel architecture which implements the algorithm and is described as synthesizable VHDL-RTL, ready to be implemented on FPGA technology. The theoretical basis of HCE will be reviewed, as well as the latest developments and improvements on the design flow. Examples will be shown to illustrate the impact of different C programming styles on the performances achievable by the final architecture.
Programming Landscape and its Impact on System Design
IBM T. J. Watson Research Center
In recent years, besides the leading programming languages like C,
COBOL, Fortran etc., a series of new languages such as Java, Perl,
PHP, Python, and Ruby are becoming quite popular. Productivity
considerations and the globalization of code development and code
service are impacting the evolution of these languages. It is expected
that, in enterprise deployment of applications written in these
languages, the runtime and the hardware will synergistically and
autonomically provide the needed performance to
applications. This talk, after briefly describing the evolution of the
programming language landscape, will deal with some of its impact on
hardware design, and close with a number of significant but as yet
unanswered questions.
Self-adaptive Networks: a Coordination Concept for Open Environments
University of Hertfordshire
In this talk we will briefly introduce the coordination language S-Net and then discuss some extensions to it that allow a distributed application to be self-adaptive. Extended S-Net will enable the most complete possible separation of concerns between the stream processing schema, which can be reasoned about in terms of data manipulation, correctness, etc., and a coexisting problem-specific management layer which shares the communication infrastructure and computational resources with the rest of the code and which has the ability to monitor and adapt the schema to achieve better utilisation of resources in a distributed environment. Our proposed solution is in terms of a component technology and a coordination ``glue'' that enable concurrency engineers and subject area programmers to collaborate without knowing much of each other's toolkit. The presentation will be illustrated by code examples.
This work was funded by the EU framework 6 project AETHER and its
continuation is being funded under framework 7, the project ADVANCE.
It's a Computer Jim; but not as we know it!
ARM Ltd
For the conventional computation model, increased performance equates
directly to increased processing throughput. Embedded Intelligent
Systems are Computers; just not like 'proper' computers! So when we talk
about Scaling the Performance of such computers it is by no means
obvious what this might mean. Typically scaling of an Embedded Computer
is much less concerned with raw throughput, than improvements in
non-functional aspects of performance, such as Power, Cost, System
Integration, Productivity, Quality and Reusability. Any understanding of
Scalable Performance in this domain must relate to this context. The
presenter will explore the architecture of typical embedded systems and
what scalability might mean to them.
Amorphous Data-Parallelism
University of Texas, Austin
Most client-side applications running on multicore processors are
likely to be irregular programs that deal with complex, pointer-based
data structures such as sparse graphs and trees. However, we
understand very little about the nature of parallelism in irregular
algorithms. In this talk, we define a generalized form of
data-parallelism that we call amorphous data-parallelism and show that
it is ubiquitous in irregular algorithms. We also show that the
concepts behind amorphous data-parallelism lead to a natural
categorization of irregular algorithms. We argue that optimistic or
speculative execution is essential for exploiting amorphous
data-parallelism in many irregular algorithms, but show that the
overheads of speculative execution can be reduced dramatically in most
cases by exploiting algorithmic structure.
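A minimal sketch may make the optimistic-execution idea concrete: worklist items are processed speculatively; an item commits only if the neighborhood of data it touches does not conflict with a concurrently executing item, otherwise it is aborted and retried. The rounds-based sequential simulation below is an invented illustration, not the actual system behind the talk.

```python
def run_optimistic(items, neighborhood, apply_op, round_size=4):
    """Process worklist items speculatively in batches; items whose
    neighborhoods overlap within a batch are aborted and retried later."""
    work = list(items)
    commits = aborts = 0
    while work:
        batch, work = work[:round_size], work[round_size:]
        locked = set()   # data elements touched by committed items this round
        retry = []
        for item in batch:
            nbh = neighborhood(item)
            if locked & nbh:
                retry.append(item)   # conflict: abort and retry
                aborts += 1
            else:
                locked |= nbh        # acquire the neighborhood
                apply_op(item)       # commit the operation
                commits += 1
        work = retry + work
    return commits, aborts
```

In each round the first conflicting item only conflicts with already-committed work, so at least one item commits per round and progress is guaranteed; reducing the abort count by exploiting algorithmic structure is the optimization opportunity the abstract points to.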
STAPL: A Parallel Programming Infrastructure
Texas A&M University
The Standard Template Adaptive Parallel Library (STAPL) is a
collection of generic data structures and algorithms that provides a
high productivity, parallel programming infrastructure with an
approach that draws heavily on the design of the C++ Standard Template
Library (STL). By abstracting away much of the complexity of parallelism
from the end user, STAPL provides a platform for high productivity by
enabling the user to focus on algorithmic design instead of lower
level parallel implementation issues. In this talk, we provide an
overview of the major STAPL components, discuss its framework for
adaptive algorithm selection, and show that several important
scientific applications can be written with relative ease in STAPL and
still have scalable performance.
The QPACE Project
University of Ferrara & INFN - Ferrara
The talk is focused on the QPACE project, a Petaflops-range supercomputer system designed by a collaboration of academic and industrial partners. QPACE is a massively parallel system based on the IBM-PowerXCell8i processor, and interconnected via a first-neighbor 3D-torus network. The talk will give an overview of the hardware and software of the system, and an update of the status.
HPC in Phase Change
Louisiana State University
Since 2007, high performance computing has been at the beginning of
the most dramatic change in form and function in the last decade and a
half. Since the advent of the killer micro and the MPPs and commodity
Clusters it spawned supported by message-passing programming
techniques, most notably MPI, HPC has been on an exponential curve
augmenting performance at historic rates through incremental changes
to feature size, clock rate, and architectural complexity. But as
always happens with S-curves, HPC is turning towards its final
asymptote and is undergoing what may prove to be its 6th and potentially
final phase change. Most visible is the adoption of multicore
heterogeneous system architectures driven by constraints in power,
complexity, clock rate, and reliability while continuing to exploit
improvements in feature size to achieve growth in performance. To
realize this goal and the achievement of Exascale performance by the
end of the next decade within practical limitations, critical advances
in efficiency, scalability, energy, and programmability will be
required. In all previous such metamorphoses in HPC, the underlying
principles of a new execution model were used to guide the co-design of
new architectures, programming methods, and system software. Such is
the case for the emerging HPC Phase VI. This presentation will discuss
the likely elements of the new execution model, based on the exploratory
ParalleX model of computation, and describe key attributes of
architecture, operating and runtime system software, and programming
methods that are likely to gain ascendancy over the next decade.
Green Flash: Co-designing Hardware/Applications/Auto-tuners for Extreme Scale Computing Systems
Lawrence Berkeley National Laboratory
Green Flash is a project to develop a new design methodology for extreme scale computing systems. The approach is to take a vertical slice through the space of applications, algorithms, software, and hardware to perform a study on how to build the most scalable, energy-efficient system to solve a particular computational science problem. The initial target is climate simulation, and the solution is likely to require higher degrees of parallelism in the climate algorithms and software, as well as new hardware designs based on technology from outside the traditional workstation and server domains, particularly low-power embedded technology. Auto-tuning technologies are one key component in the design cycle between hardware and software to achieve optimal performance and power efficiency. The result will be a proof-of-concept machine design, along with supporting analysis of how this compares to alternative designs, and new algorithms and software techniques as needed to make the case for feasibility of computing the science problem on the system.
A Bridging Model for Multi-Core Computing
Harvard University
We propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for such a bridging model. Portable algorithms would contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. We show that for several fundamental problems, namely matrix multiplication, fast Fourier Transform, and sorting, algorithms do exist that are optimal in a defined sense, for all combinations of input size and parameter values for this model.
Dynamic Adaptability for a Many-computing Element Dataflow Processor
Institute for High Performance Computing and Networking (ICAR) - CNR
In the last twenty years, industry has delivered dramatic performance gains by increasing the frequency of its processors, from 5 MHz to more than 3 GHz, while at the same time improving IPC (instructions per cycle). However, power and thermal issues, such as dissipating heat from increasingly densely packed transistors, have begun to limit the rate at which processor frequency can be increased. Although frequency increases have been a design staple for the last 20 years, the next 20 years will require a new approach. Basically, industry needs to develop improved micro-architectures at a faster rate, and in coordination with each new silicon manufacturing process, from 45 nm, to 32 nm, and beyond. For this new approach we can take advantage of Moore's law because transistor feature size is expected to continue to be reduced at a rate similar to that in the past. New technologies, such as 3-D die stacking, may allow even greater increases in total transistor counts within a given footprint, beyond the increases made possible by improvements in lithography alone.
In this scenario, many cores on a single chip are going to become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with thousands of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. Because memory access will continue to be an important limiting factor on achieving performance, heterogeneous systems may make use of cores with varying capabilities and performance characteristics, and an appropriate programming model can address scalability and expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores.
In this talk we present some ideas on our dynamic approach to the dataflow processor design of the PHOENIX project, whose key ingredients are dynamic dataflow graph reconfiguration, virtual computing units, and adaptation of hardware resources to data parallelism.