Intel
This talk proposes architectural support for accelerating transactions
executed entirely in software. We propose instruction set architecture
(ISA) extensions and novel hardware mechanisms that improve STM
performance. We adapt a high-performance STM algorithm supporting rich
transactional semantics to our ISA extensions; we call the result
hardware-accelerated software transactional memory (HASTM). HASTM accelerates
fully virtualized nested transactions, supports language integration,
and provides both object-based and cache-line based conflict detection.
Our results show that (1) HASTM single-thread performance is comparable
to a conventional HTM implementation; (2) HASTM scaling is comparable to
an STM implementation; and (3) HASTM is resilient to spurious aborts and
can scale better than HTM in a multi-core setting. HASTM thus provides
the flexibility and rich semantics of STM, while giving the performance
of HTM.
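
To make the mechanism concrete, here is a minimal C sketch of how per-cache-line mark bits could let an STM skip redundant read-barrier work. hw_mark and hw_mark_intact are hypothetical intrinsics standing in for ISA extensions of the kind the talk proposes, not the actual HASTM instructions.

/* Hedged sketch: hw_mark(p) marks the cache line holding p for this
 * thread; hw_mark_intact(p) reports whether that mark is still set,
 * i.e., no other thread wrote the line and it was not evicted.
 * Both are invented names, not the talk's ISA. */
#include <stdint.h>

extern void     hw_mark(const void *addr);                   /* hypothetical */
extern int      hw_mark_intact(const void *addr);            /* hypothetical */
extern uint64_t stm_slow_read_barrier(const uint64_t *addr); /* full STM logic */

uint64_t stm_read(const uint64_t *addr)
{
    if (hw_mark_intact(addr))      /* fast path: line unchanged since marked */
        return *addr;              /* skip logging and validation entirely */
    uint64_t v = stm_slow_read_barrier(addr); /* slow path: full bookkeeping */
    hw_mark(addr);                 /* mark so later reads take the fast path */
    return v;
}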
University of Padova
In this talk, we consider a framework for the stochastic analysis of cache replacement policies. The address trace is modeled as a stochastic process, characterized by independent, identically distributed LRU stack distances. A wide class of workloads can be modeled by varying the stack-distance distribution. The framework enables analytical evaluation of the miss probability for some well-known policies. It also allows one to derive the optimal policy for a given distribution.
(The above line of research is in collaboration with K. Ekanadham and
P. Pattnaik of IBM Research and with A. Ferrante and F. Versaci of the
University of Padova.)
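
As a concrete illustration of the model (not code from the talk): under i.i.d. stack distances, an LRU cache of capacity C hits exactly when an access's stack distance is at most C, so the miss probability is the tail mass of the distance distribution. A minimal C sketch:

#include <stddef.h>

/* p[d] = Pr{stack distance == d}, with d >= 1 the reuse depth and p[0]
 * unused; first-touch accesses carry the remaining probability mass and
 * always miss.  Illustrative only; conventions vary across the literature. */
double lru_miss_probability(const double *p, size_t max_d, size_t capacity)
{
    double hit = 0.0;
    for (size_t d = 1; d <= max_d && d <= capacity; d++)
        hit += p[d];               /* distances that fit in the cache hit */
    return 1.0 - hit;              /* everything else misses */
}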
IBM T. J. Watson Research Center
Productivity is the latest buzzword in High Performance Computing (HPC)
programming, motivated by the DARPA HPCS investment of $600 million to change
the landscape of HPC and make supercomputing performance available to the
masses. Is that an attainable goal? In this talk we shall present our
experience with two programming paradigms that enhance productivity, PGAS
languages and Transactional Memory. We discuss some of their advantages in
terms of ease of use, the implementation challenges they pose, and how adding
these features affects the entire execution stack. We
conclude by discussing how these programming paradigms may affect the
design of commercial systems.
Yahoo!
In our experimental evaluation, we achieve up to 22% execution-time
reduction versus GotoBLAS/ATLAS alone on a single core and up to 19% on a
system with two dual-core processors. Even for small matrices, our approach
already attains a 10% execution-time reduction and, for larger matrices, it
delivers performance that would correspond, for a classic algorithm, to
faster-than-peak processor performance (i.e., our algorithm delivers the
equivalent of 5 GFLOPS on a single core with 4.4 GFLOPS peak performance,
where GotoBLAS achieves only 4 GFLOPS). Therefore, our algorithm is faster
than any classic MM algorithm could ever be for matrices of these sizes.
Furthermore, we present experimental evidence that our algorithm is, for
practical purposes, as accurate as the classic algorithms.
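
For readers unfamiliar with the convention, the "equivalent" GFLOPS figure credits any matrix-multiplication algorithm with the 2n^3 flops a classic algorithm would perform, so an algorithm that finishes in less than 2n^3/peak seconds reports an effective rate above peak. A small C illustration with hypothetical values of n and t (the abstract's actual matrix sizes are not reproduced here):

#include <stdio.h>

int main(void)
{
    double n = 10000.0;   /* hypothetical matrix order */
    double t = 400.0;     /* hypothetical wall-clock seconds */
    /* effective rate = classic flop count / time, regardless of how many
     * operations the (e.g., Strassen-like) algorithm actually executed */
    double effective = 2.0 * n * n * n / t / 1e9;
    printf("effective rate: %.2f GFLOPS\n", effective);  /* prints 5.00 */
    return 0;
}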
Ghent University
The roadmap details several of the key challenges that need to be tackled in the coming decade in order to achieve scalable performance in multi-core systems and to make them a practical mainstream technology for high-performance embedded systems.
The HiPEAC roadmap is organized around 10 central themes: (i) single-core architecture, (ii) multi-core architecture, (iii) interconnection networks, (iv) programming models and tools, (v) compilation, (vi) run-time systems, (vii) benchmarking, (viii) simulation and system modeling, (ix) reconfigurable computing, and (x) real-time systems. For each theme, a list of challenges is identified; in total, 55 key challenges are listed in this roadmap. The list of challenges can serve as a valuable source of reference for researchers active in the field; it can help companies build their own R&D roadmap; and, although not intended as a tutorial document, it can even serve as an introduction for scientists and professionals interested in learning about high-performance embedded architecture and compilation.
(Joint work with: Koen De Bosschere, Wayne Luk, Xavier Martorell, Nacho Navarro,
Mike O’Boyle, Dionisios Pnevmatikatos, Alex Ramirez, Pascal Sainrat, André
Seznec, Per Stenström, and Olivier Temam)
IBM India Systems and Technology Lab
Delft University of Technology
Cornell University
Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance growth by integrating more cores on the die with each new technology generation. In the short term, on-chip integration of a few relatively large cores may yield sufficient throughput when running multiprogrammed workloads. However, harnessing the full potential of CMPs in the long term makes a broad adoption of parallel programming inevitable.
We envision a CMP-dominated future where a diverse landscape of software in different stages of parallelization exists at all times. Unfortunately, in this future, the inherent rigidity in current proposals for CMP designs makes it hard to come up with a ``universal'' CMP that can accommodate this software diversity.
In this talk I will discuss Core Fusion, a CMP architecture where cores can
``fuse'' into larger cores on demand to execute sequential code very fast, while
still retaining the ability to operate independently to run highly parallel code
efficiently. Core Fusion builds upon a substrate of fundamentally independent
cores and conventional memory coherence/consistency support, and enables the
CMP to dynamically morph into different configurations to adapt to the changing
needs of software at run-time. Core Fusion does not require specialized software
support; it leverages mature microarchitecture technology; and it can
interface with the application through small extensions encapsulated in ordinary
parallelization libraries, macros, or directives.
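
As a purely illustrative sketch (the talk's actual interface is not shown here), fusion might be driven from a parallelization library along these lines; FUSE and SPLIT are invented names:

/* Hedged sketch of a hypothetical library wrapper around fusion:
 * FUSE(n) asks the hardware to merge n idle cores into one wide core
 * for an upcoming sequential phase; SPLIT() releases them again. */
extern void FUSE(int n_cores);    /* hypothetical: fuse n cores into one */
extern void SPLIT(void);          /* hypothetical: back to independent cores */
extern void sequential_phase(void);
extern void parallel_phase(void); /* e.g., an ordinary parallel region */

void step(void)
{
    FUSE(4);              /* run the serial bottleneck on one fused core */
    sequential_phase();
    SPLIT();              /* fission back for the parallel section */
    parallel_phase();
}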
ClearSpeed
IBM T. J. Watson Research Center
University of Padova
As a network of processing elements grows in size and complexity, must the computational resources of individual nodes also grow? Or is there a ``universal processor'' with limited state (say, 32 bits, total) and reliability (say, one out of every hundred operations produces a wrong result) out of which we can build reliable, arbitrarily large networks of arbitrarily complex topology?
In this talk, after presenting some reasons why one might want to build
systems out of such skimpy processing units, we'll review what is known
and what appear to be the main open problems. In a nutshell, with a
handful of (unreliable) bits, one might be able to achieve much more than
might be expected.
University of Padova
The design of algorithms that can run unchanged yet
efficiently on a variety of machines characterized by different
degrees of parallelism and communication capabilities is a highly
desirable goal. We propose a framework for network-obliviousness
based on a model of computation where the only parameter is the
problem's input size. Algorithms are then evaluated on a model with
two parameters, capturing parallelism and granularity of
communication. We show that, for a wide class of network-oblivious
algorithms, optimality in the latter model implies optimality in a
block-variant of the Decomposable BSP model, which effectively
describes a wide and significant class of parallel platforms. We
illustrate our framework by providing optimal network-oblivious
algorithms for a few key problems, and also establish some negative
results.
University of Texas at Austin
A cache-oblivious algorithm is a system-independent, cache-efficient algorithm that simultaneously adapts to all levels of a multi-level memory hierarchy.
In this talk we present the cache-oblivious Gaussian Elimination Paradigm (GEP), a cache-oblivious framework for problems solvable using the triply-nested loop construct similar to the computation in Gaussian elimination without pivoting.
The traditional GEP code executes O(n^3) operations and incurs O(n^3/B) I/Os (or transfers of memory blocks) on an n-by-n input, where B is the block size. Our cache-oblivious version executes the same O(n^3) operations but incurs only O(n^3/(B*sqrt(M))) I/Os, where M is the size of the cache.
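
For concreteness, a minimal C sketch of the triply-nested GEP construct, instantiated for Gaussian elimination without pivoting (Floyd-Warshall arises from the same nest with a min/plus update). This is the traditional version whose I/O behavior is discussed above, not the cache-oblivious reorganization:

/* Each GEP instance updates x[i][j] from x[i][j], x[i][k], x[k][j],
 * and x[k][k] inside this loop nest; only the update rule changes
 * across the problems the paradigm covers. */
void gep(double **x, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                x[i][j] -= x[i][k] * x[k][j] / x[k][k];  /* elimination update */
}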
We present two versions of cache-oblivious GEP:
- IGEP, which is in-place and applies to Gaussian elimination without pivoting, Floyd-Warshall's all-pairs shortest paths, and other important special cases of GEP, and
- CGEP, which is completely general and applies to all GEP problems, at the expense of using a modest amount of extra space.
We compare the performance of IGEP both to traditional code and to high-performance BLAS. Our results show that IGEP incurs fewer cache misses than BLAS. In terms of performance, IGEP is somewhat slower than BLAS, but overall it provides a good trade-off between computation speed on one hand, and simplicity and portability on the other. Our code for IGEP and CGEP is easy to parallelize, leading to good scalable performance for these important applications.
Both IGEP and CGEP have potential application in optimizing compilers as cache-oblivious tiling techniques.
(Joint work with Rezaul Alam Chowdhury)
MIT & VMWare
The challenge for Chip Multiprocessors is to keep the memory bandwidth
requirements constant while increasing the number of operations happening in
parallel. This assumes that there is a sufficient number of parallel activities
to keep all the processors or cores active. This also assumes that there is
some time sharing of the processors to hide the memory latency. In such cases,
gang scheduling for parallel jobs reduces the memory footprint. Not only that:
a form of gang scheduling also applies to independent, non-communicating jobs.
There may be many identical but independent tasks or jobs, such as Apache
servers, that can share pages of code and data. We show how to use machine
virtualization techniques to share identical pages across processors. The
challenge is how to modify the scheduler to take advantage of this sharing.
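
A minimal sketch of content-based page sharing of the kind described, with invented names and FNV-1a as a stand-in for whatever hash a real hypervisor uses: pages that hash equal are byte-compared and, if identical, can be backed by one read-only frame with copy-on-write.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page { uint8_t data[PAGE_SIZE]; };

/* Hash each candidate page's contents (FNV-1a, illustrative choice). */
static uint64_t page_hash(const struct page *p)
{
    uint64_t h = 14695981039346656037ull;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= p->data[i];
        h *= 1099511628211ull;
    }
    return h;
}

/* Hash collisions must be confirmed by a full byte comparison before
 * the mapping is collapsed to a single read-only physical frame. */
static int pages_identical(const struct page *a, const struct page *b)
{
    return memcmp(a->data, b->data, PAGE_SIZE) == 0;
}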
Dolphin Interconnect Solutions
We see a clear trend in multi-core development for standard microprocessors,
with on-chip cores presenting a cache-coherent shared memory view. How does this relate
to multi-processing beyond the chip and the programming paradigms involved? Can
an efficient off-chip cache coherency protocol provide a seamless shared memory
view to ease the burden on programmers for creating parallel software?
University of Ferrara & INFN - Ferrara
Sun Microsystems
Transactional Memory has emerged as a leading technique that enables
applications to better take advantage of multi-threaded, multi-core
microprocessors. Setting goals for the scope of an implementation of
Transactional Memory is a key milestone that has a pervasive impact upon the
overall architecture of a modern microprocessor. In this talk, a description of
what we believe is the first hardware implementation of Transactional Memory
will be given. The synergy between a modern pipeline capable of handling today's
memory latency and support for sophisticated multithreading is the key
enabler of our approach to Transactional Memory. This will be the second public
disclosure for TM support on the Rock microprocessor.
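
As a hedged illustration of the programming model such hardware enables (these are invented compiler intrinsics, not the Rock ISA): a transaction is attempted speculatively, and a software fallback handles hardware aborts.

extern int  tx_begin(void);   /* hypothetical: checkpoint + start transaction;
                                 returns nonzero on entry, zero after an abort
                                 rolls back to the checkpoint */
extern void tx_commit(void);  /* hypothetical: commit speculative state */
extern void fallback_lock(void);
extern void fallback_unlock(void);

void atomic_increment(long *counter)
{
    if (tx_begin()) {          /* speculative hardware fast path */
        (*counter)++;
        tx_commit();
    } else {                   /* hardware aborted: take a software path */
        fallback_lock();
        (*counter)++;
        fallback_unlock();
    }
}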
Brown University
We study the long term (steady state) performance of a simple, randomized, local load balancing technique. We assume a system of n processors connected by an arbitrary network topology. Jobs are placed in the processors by a deterministic or randomized adversary. The adversary knows the current and past load distribution in the network and can use this information to place the new tasks in the processors. The adversary can put a number of new jobs in each processor, in each step, as long as the (expected) total number of new jobs arriving at a given step is bounded by λn, for a constant λ. A node can execute one job per step, and also participate in one load balancing operation in which it can move tasks to a direct neighbor in the network. In the protocol we analyze here, a node equalizes its load with a random neighbor in the graph.
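
A minimal C sketch of one (simplified) round of the protocol just described; the graph representation and arrival process are left abstract, and for simplicity this sequential loop lets a node take part in more than one equalization per round:

#include <stdlib.h>

/* load[v]   = queue length at node v
 * nbr[v][i] = i-th neighbor of v, deg[v] = number of neighbors of v */
void balance_step(int *load, int **nbr, const int *deg, int n)
{
    for (int v = 0; v < n; v++) {
        if (load[v] > 0)
            load[v]--;                       /* execute one job */
        int u = nbr[v][rand() % deg[v]];     /* pick a random neighbor */
        int total = load[v] + load[u];       /* equalize the two queues */
        load[v] = total / 2;
        load[u] = total - total / 2;
    }
}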
We first study the stability of a system running our load balancing protocol. Clearly, if λ > 1 the system cannot be stable. We show that for any λ < 1, and any connected network topology, the system is stable.
When the system is stable, the next performance parameter of interest is the waiting time of jobs. We develop high-probability bounds and bounds on the expectation of the waiting time of jobs in terms of the network topology. In particular, if the network is an expander graph, the expected wait of a task is O(log n), and the waiting time of a task that enters the network at an arbitrary time is O(log n) with high probability.
We contrast these results with the work stealing load balancing protocol, where we show that, in sparse networks, the load in the system and the execution time can be exponential in the network size.
(Joint work with Aris Anagnostopoulos and Adam Kirsch.)
CNR - ICAR
From the architectural point of view, this new trend raises at least two
questions: how to exploit such spatial parallelism, and how to program such systems.
The first question leads us to seriously reconsider the dataflow paradigm, given
the fine-grained nature of its operations. Instead of carrying out a set of
operations in sequence as a von Neumann processor does, a many-core dataflow
processor could compute a function by first connecting and configuring a number
of identical simple cores as a dataflow graph, and then letting data flow
asynchronously through them.
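
A minimal C sketch of the data-driven firing rule this paragraph describes (illustrative only): an actor fires when all of its input tokens have arrived, consumes them, and forwards the result token downstream, with no program counter sequencing the operations.

#include <stdbool.h>

struct actor {
    double in[2];                    /* operand slots */
    bool   present[2];               /* token-arrival flags */
    double (*op)(double, double);    /* the actor's operation */
    struct actor *dst;               /* downstream actor (may be NULL) */
    int    dst_port;
};

static void send(struct actor *a, int port, double token);

static void try_fire(struct actor *a)
{
    if (a->present[0] && a->present[1]) {      /* all input tokens present */
        double r = a->op(a->in[0], a->in[1]);
        a->present[0] = a->present[1] = false; /* consume inputs */
        if (a->dst)
            send(a->dst, a->dst_port, r);      /* forward the result token */
    }
}

static void send(struct actor *a, int port, double token)
{
    a->in[port] = token;
    a->present[port] = true;
    try_fire(a);                               /* purely data-driven execution */
}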
However, despite the model's simplicity, in the past, technology limits and the
heterogeneity of links and actor I/O conditions shifted this approach more and
more towards the von Neumann model. But with the introduction of the
homogeneous High-Level Dataflow System (hHLDS), we believe that
the dataflow approach remains a valid proposal to increase performance and
seriously challenge the von Neumann model, at least at the processor level.
The second question leads us to seriously reconsider the functional programming
style, given the intrinsic simplicity of writing parallel programs in it. In fact,
functional languages have three key properties that make them attractive for
parallel programming: they have powerful mechanisms for abstracting over both
computation and coordination; they eliminate unnecessary dependencies; and their
high-level coordination achieves a largely architecture-independent style of
parallelism.
Moreover, functional programming, thanks to properties that stem from the
inherent mathematical foundation of functions, constitutes a valid computational
model for naturally extracting the fine-grained parallelism that programs
present and that a dataflow engine can exploit.
Eurotech SpA