Abstracts (in progress)
Experience in scaling performance-portable weather and climate applications
CSCS Lugano, Switzerland
Developing performance-portable libraries is challenging: first, you need a clear separation of concerns and clear interfaces that provide the right abstraction layers for a given application field; second, it is often necessary to constrain users in ways that cannot be enforced with (language-provided) syntactic checks. When scaling applications to large machines, the communication primitives also pose additional requirements, depending on which programming paradigms are supported for the on-node computations. This talk will discuss some of the experiences we have had, and the challenges we are facing, in the GridTools project.
The I/O complexity of Strassen’s matrix multiplication
University of Padova, Italy
A tight Ω((n/√M)^(log₂ 7) M) lower bound is derived on the I/O complexity of Strassen’s algorithm to multiply two n × n matrices, in a two-level storage hierarchy with M words of fast memory. A proof technique is introduced, which exploits the Grigoriev flow of the matrix multiplication function as well as some combinatorial properties of the Strassen computation DAG. Applications to parallel computation are also developed. The result generalizes a similar bound previously obtained under the constraint of no recomputation, that is, that intermediate results cannot be computed more than once. For this restricted case, another lower bound technique is presented, which leads to a simpler analysis of the I/O complexity of Strassen’s algorithm and can be readily extended to other “Strassen-like” algorithms.
Joint work with Lorenzo De Stefani (Brown University).
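Written out, the bound above (with M the number of fast-memory words) and an algebraically equivalent form are:

```latex
% Equivalent forms of the Strassen I/O lower bound
\[
  IO(n, M) \;=\; \Omega\!\left(\Big(\tfrac{n}{\sqrt{M}}\Big)^{\log_2 7} M\right)
           \;=\; \Omega\!\left(\frac{n^{\log_2 7}}{M^{\log_2 7/2 \,-\, 1}}\right),
  \qquad \log_2 7 \approx 2.81 .
\]
```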
Automatic Tuning of Stencil Computations with Ordinal Regression
T.U. Berlin, Germany
Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may not be affordable in many scenarios. Meanwhile, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. In this presentation, I will introduce a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples (i.e., rankings), the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes and by analyzing the quality of the obtained ranking in terms of Kendall rank correlation coefficients. (Reference paper)
Joint work with Juan Durillo (Leibniz Supercomputing Centre), Stefano Ermon (Stanford University) and Ben Juurlink (TU Berlin).
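As a rough illustration of the ranking formulation (not the authors' actual tool or dataset), the ordinal-regression problem can be reduced to binary classification over pairs of candidate implementations; a minimal sketch with synthetic tuning features:

```python
# Sketch: rank stencil-code variants with a pairwise (ordinal) ranking model.
# Feature vectors (tile sizes, unroll factors, ...) and runtimes are synthetic.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 4))        # 40 variants, 4 tuning features
t = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=40)  # runtimes

# Reduce ranking to binary classification on feature differences:
# label = 1 if variant i is faster than variant j.
pairs, labels = [], []
for i, j in combinations(range(len(X)), 2):
    pairs.append(X[i] - X[j])
    labels.append(int(t[i] < t[j]))
clf = LogisticRegression(max_iter=1000).fit(np.array(pairs), np.array(labels))

# Score unseen variants: a higher score means "predicted to be faster".
X_new = rng.uniform(0, 1, size=(5, 4))
scores = X_new @ clf.coef_.ravel()
print("predicted ranking (best first):", np.argsort(-scores))
```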
Going from C to stream-like RTL implementations for FPGA acceleration
Xilinx, San Jose, California, USA
TBA
MemComputing with self-organizing logic gates
University of California, San Diego, USA
I will discuss how to
employ memory (time non-locality) in a novel physics-based approach to
computation, memcomputing, and its practical realization
with self-organizing logic gates (SOLGs). SOLGs are
terminal-agnostic gates that self-organize to always satisfy their
logical proposition regardless of which terminal(s) the truth value is
assigned. As examples, I will show the polynomial-time solution of
prime factorization, the search version of the subset-sum problem,
approximations to the Max-SAT beyond the inapproximability gap,
and acceleration of deep learning using polynomial resources. I
will also show that these digital memcomputing machines compute via an
instantonic phase, implying that they are robust against noise and
disorder. The digital memcomputing machines we propose can be
efficiently simulated, are scalable and can be easily realized with
available nanotechnology components. Work supported in part by
MemComputing, Inc. (http://memcpu.com/).
Accelerating general purpose parallel computing with the TPA architecture
VTT, Finland
The coexisting proximity of the end of Moore’s Law and the dramatic rise of AI technologies is causing confusion in the computer architecture community. Some researchers are ready to abandon current architectures entirely and switch to neural networks or tensor processing units (TPUs), while others emphasize the role of application-specific architectures in the form of GPUs, TPUs, and the like. Characteristic of these groups seems to be a sudden and surprising change of research direction; or could we have seen this inevitably coming?
In this presentation we will first take a short but constructive look at how these approaches cover a simplified computational space. As these trends seem to be (again) ignoring the need for efficient general-purpose parallel computing, we will then have a look at our Thick Control Flow Processor Architecture (TPA), which tries to contribute to covering that vast area of the computational space. We will show that many parallel-computing-specific techniques can be implemented efficiently in TPA despite the generality of the underlying Thick Control Flow (TCF) model. We will also consider the usefulness of some ideas for tuning this kind of architecture for linear algebra in ML.
Accelerating Dense Matrix Multiplication through DataFlow-Threads (DF-Threads)
University of Siena, Italy
Dataflow techniques have been explored at different levels of
granularity, such as instruction-level in the very successful
superscalar processors, where instructions fire out-of-order when
operands are ready, or at a coarser level, in programming models such as OmpSs/StarSs (now converging into the next OpenMP), where the runtime manages the data flow among tasks and schedules a possibly large group of instructions (i.e., a dataflow task) on the available computational units (CPU cores, GPU cores, accelerators). However, the hardware-software interface does not yet provide: i) proper support for parallelizing threads; ii) a universally accepted memory consistency model. Most of the burden is caused by the need for synchronization, consistency, and coherency: an old problem now exacerbated by the intensive use of inexpensive, massively parallel systems. In the recent projects TERAFLUX and AXIOM, DataFlow-Threads (DF-Threads) have been explored as a possible approach for scaling performance while providing a cleaner interface for future massively parallel systems: DF-Threads can be supported in the architecture by a few new instructions, and can extend existing processors so that they provide more efficient and better-performing parallelism. Our experimental results show almost perfect scalability for systems with 1000+ general-purpose x86_64 (extended) cores, running dense matrix multiplications on a Linux-based OS. Could DF-Threads become an easier and more efficient way to deploy highly scalable general-purpose systems?
Using AI to Transform Informational Videos and Our Watching Behavior
VideoKen and IIIT Bangalore
Videos account for about 75% of the Internet traffic and enterprises are
increasingly using videos for various informational purposes, including
training of customers, partners and employees, marketing and internal
communication. However, most viewers do not have the patience to watch
these videos end-to-end and our video watching experience has not evolved
much in over a decade. We present an AI-based approach to automatically
index videos in the form of a table-of-contents, a phrase cloud and a
searchable transcript, which helps summarize the key topics in a video and
lets viewers navigate directly to the topics of interest. We use a
combination of visual classification, object detection, automated speech
recognition, text summarization, and domain classification, and show the
results achieved on a range of informational videos. We conclude with some
thoughts on the promise of transforming how informational videos are
consumed as well as open problems and future directions.
Power Modeling of Heterogeneous Mobile SoCs using Machine Learning
T.U. Berlin, Germany
GPUs have recently become important computational units on mobile
devices, resulting in heterogeneous SoCs that can run a variety of
parallel processing applications. While developing and optimizing such
applications, estimating power consumption is of immense importance as
energy efficiency has become the key design constraint to optimize
for on these devices. In this work, we apply machine learning
techniques in building a predictive model for estimating power
consumption of parallel applications on heterogeneous mobile SoCs. Our
model is an artificial neural network trained using CPU and GPU
performance counter values along with measured power data. The model
is trained and evaluated with data collected using a set of graphics
OpenGL workloads as well as OpenCL compute benchmarks. Our
evaluations show that our model can achieve accurate power estimates
with mean relative error of 4% for both compute and graphics
workloads. Our neural network model is about 4x better than a state-of-the-art statistical linear regression model.
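A minimal sketch of the modeling approach on synthetic data (the counters, network size, and power relation below are placeholders, not the measured SoC dataset from the talk):

```python
# Sketch: neural-network power model trained on performance-counter samples.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# columns stand in for counters such as CPU cycles, instructions, GPU busy,
# memory accesses, cache misses (all synthetic here)
counters = rng.uniform(0, 1, size=(n, 5))
power = (1.0 + 2.0 * counters[:, 0] + 1.5 * counters[:, 2]
         + 0.8 * counters[:, 3] + 0.05 * rng.normal(size=n))  # watts, with idle baseline

X_tr, X_te, y_tr, y_te = train_test_split(counters, power, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
mre = np.mean(np.abs(pred - y_te) / np.abs(y_te))
print(f"mean relative error: {mre:.1%}")
```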
Theory of Accelerator Computing
Huawei, Paris, France
The talk will survey basic problems, algorithms, and results related to computing "deep computations" on memory-limited edge devices and accelerators. Pebbling and Space-Time tradeoffs are revisited for the new AI Era.
Implementing the GraphBLAS C API
IBM Research, NY, USA
This talk will describe our implementation of the GraphBLAS C API. The
GraphBLAS C API defines a set of objects and methods that are building
blocks for graph algorithms. Furthermore, the same building blocks
have been demonstrated to also support deep neural network
algorithms. The implementation fully hides the internals of GraphBLAS
objects from application programs, which only operate on pointers to
opaque objects. The interface is defined by a C11-compliant include
file, while the methods are implemented in C++ and packaged as a
library with C language bindings. Internally, the library is built
using a layered approach, in which most of the heavy work is done by
member methods of the objects, while an API layer provides the
compatibility with the specification. We present an overview of the
GraphBLAS C API and then proceed to describe the organization of the
library, explain how it translates the polymorphic interface to its
nonpolymorphic version, and describe the inner workings of two
GraphBLAS methods, reduction and matrix-matrix multiplication, which
serve as running examples of the implementation.
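To give a feel for the semantics that a GrB_mxm-style masked matrix-matrix multiply provides (the library itself is implemented in C++ behind a C11 interface; this NumPy/SciPy sketch only mirrors the observable behavior, not its internals):

```python
# Sketch: a masked, plus-times semiring matrix product in SciPy terms.
import numpy as np
from scipy.sparse import random as sprand, csr_matrix

rng = np.random.default_rng(2)
A = sprand(6, 6, density=0.3, format="csr", random_state=0)
B = sprand(6, 6, density=0.3, format="csr", random_state=1)
mask = csr_matrix(rng.random((6, 6)) < 0.5)   # structural output mask

C = A @ B                      # plus-times semiring product
C = C.multiply(mask)           # keep only entries where the mask is present
print(C.toarray())
```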
T.U. Dresden, Germany
For many years, HPC has been using fast supercomputers to provide solutions for many different scientific areas. Besides technology improvements, modern algorithms and methods are significantly contributing to the successes achieved in modeling and simulation. These days, progress is more and more often also coming from large data sets and innovative data analytics methods. To obtain excellent results in both areas, HPC infrastructures have to adequately support data and data analytics capabilities. For that reason, TU Dresden has extended its existing HPC infrastructure – already organized as InfiniBand islands – by three components:
- A large 2 PB SSD farm supporting an I/O-bandwidth of up to 2 TB/s;
- An object store infrastructure for about 10 PB;
- A Power9-based machine learning infrastructure from IBM with more than 100 Nvidia V100 GPUs.
The talk will introduce the architectural environment and – based on very detailed measurements – will compare the linear algebra capabilities of several Nvidia GPUs, especially the Nvidia V100.
Ca’ Foscari University of Venice, Italy
Machine-learnt models based on additive ensembles of regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. The deployment of such models is computationally demanding: to compute the final prediction, the whole ensemble must be traversed by accumulating the contributions of all its trees. In particular, traversal cost impacts applications where the number of candidate items is large, the time budget available to apply the learnt model to them is limited, and the users’ expectations in terms of quality-of-service are high. Document ranking in web search, where sub-optimal ranking models are deployed to find a proper trade-off between efficiency and effectiveness of query answering, is probably the most typical example of this challenging issue. This paper investigates multi/many-core parallelization strategies for speeding up the traversal of large ensembles of regression trees, thus obtaining machine-learnt models that are, at the same time, effective, fast, and scalable. Our best results are obtained by the GPU-based parallelization of the state-of-the-art algorithm, with speedups of up to 102.6x.
Joint work with F. Lettich, C. Lucchese, F. M. Nardini, R. Perego, N. Tonellotto, R. Venturini.
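The core operation being parallelized is the accumulation of per-tree contributions; a toy scalar version for intuition (hand-built trees, not the learnt ranking models used in the paper):

```python
# Sketch: scoring with an additive ensemble of regression trees —
# every tree is traversed to a leaf and its contribution accumulated.
import numpy as np

# each internal node: (feature, threshold, left, right); leaves are plain floats
tree1 = (0, 0.5, (1, 0.2, -0.3, 0.1), 0.4)
tree2 = (1, 0.0, -0.1, (0, 0.8, 0.2, 0.6))
ensemble = [tree1, tree2]

def score(x, trees):
    total = 0.0
    for node in trees:
        while isinstance(node, tuple):              # descend until a leaf
            feat, thr, left, right = node
            node = left if x[feat] <= thr else right
        total += node                               # accumulate leaf contribution
    return total

docs = np.random.default_rng(7).random((4, 2))
print([round(score(x, ensemble), 3) for x in docs])
```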
LORE: A Loop Repository for the Evaluation of Compilers and other Educational Applications
University of California Irvine, USA
Although numerous loop optimization techniques have been designed and
deployed in commercial compilers in the past, virtually no common
experimental infrastructure or repository exists to help the compiler community evaluate the effectiveness of these techniques. This paper describes a repository, LORE, that maintains a large number of C-language for-loop nests extracted from popular benchmarks, libraries,
and real applications. It also describes the infrastructure that
builds and maintains the repository. Each loop nest in the repository
has been compiled, transformed, executed, and measured independently.
These loops cover a variety of properties that can be used by the
compiler community to evaluate loop optimizations using a broad and
representative collection of loops. To illustrate the usefulness of
the repository, we also present two example applications. One is
assessing the capabilities of the auto-vectorization features of three
widely used compilers. The other is our use of this repository as a
vehicle for quick hands-on teaching of some core parallel processing
concepts and vectorization.
Aspects of provenance and performance for computational workflows
Brookhaven National Laboratory, NY, USA
In extreme-scale computing environments such as the DOE Leadership Computing Facilities that exhibit increased heterogeneity and expected asynchrony, scientific workflows are routinely used to coordinate software processes for the execution of complex computational applications that perform in-silico experiments run as model simulations. Workflows enable the orchestration of the numerous processes that read, write, and analyze data and calculate quantities of interest for parallel and distributed scientific applications that range from quantum chemistry and molecular dynamics to climate modeling and many others. Important factors for any application running in such environments include execution time and performance, accuracy of calculations, and the ability to analyze results. Given the limitations of a fixed resource budget (number of cores allocated for a specific period of time) and with simulations running for several days or weeks, it is important to determine if a simulation run is progressing as expected, what variations in performance a run exhibits, and to what they can be attributed. Monitoring the performance of workflows in HPC provides insights into this progression, how the computational resources are used, and where execution bottlenecks occur. But monitoring performance without also simultaneously tracking provenance is not sufficient to understand variations between runs, configurations, versions of a code, and changes in the implemented stack and systems, i.e., the variability of performance metrics data in their historical context. To gain this type of insight, the provenance of workflow performance is needed. We define the provenance of workflow performance as the provenance that captures and correlates traditional provenance characteristics and performance metrics data. This type of provenance is used for performance analysis in empirical studies on the performance of a piece of software or a workflow during a development phase or in different computational environments.
Against the backdrop of complicated architectures with large heterogeneity of devices, complex memory hierarchies, and communication patterns, teams face an increasing need to establish credibility by producing reproducible science. We propose an approach for improved reproducibility in High Performance Computing that captures and relates fine-grained, targeted provenance, including the run-time environment. Extracting provenance and environment will become even more important in the context of exascale systems that will feature multiple GPUs per node. We extend our approach by capturing provenance characteristics of machine learning algorithms and performance metrics in a data analysis architecture, as such techniques are increasingly used in science for a variety of purposes, in particular anomaly detection and data reduction. The system capabilities are illustrated on a quantum chemistry use case provided by NWChemEx, where we detail performance reproducibility in molecular dynamics workflows on HPC computing platforms.
Program Generation for Small-Scale Linear Algebra
Department of Computer Science, ETH Zürich, Switzerland
Many performance-critical computations in communication, control, multimedia processing, machine learning, or graphics fall into the domain of linear algebra. Existing linear algebra libraries are usually optimized for large-scale computation and for use in scientific computing. For small-scale computations in other domains they are often suboptimal. In this talk I present our work on generating optimized linear algebra code directly from a mathematical description using techniques developed in Spiral (www.spiral.net): layers of domain-specific languages to express the mathematics and the use of rewriting systems to reshape the computation at a high level of abstraction to overcome known compiler limitations.
This is the thesis work by Daniele Spampinato; project website: https://acl.inf.ethz.ch/research/LGen/.
Towards tools for performance bottleneck analysis
Ohio State University, OH 43210, USA
Accurate modeling of the performance of alternative configurations is very desirable for optimizing compilers but is an extremely challenging problem. This talk will discuss a new approach being pursued for performance bottleneck identification in GPU code. It involves abstract emulation of the binary code of a GPU kernel along with sensitivity analysis with respect to fundamental latency/throughput parameters of component resources. The use of the bottleneck analysis approach to optimize tensor contraction kernels will be discussed.
Joint work with C. Hong, A. Sukumaran-Rajam, J. Kim, P. Rawat, F. Rastello, L.-N. Pouchet, S. Krishnamoorthy.
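A toy rendering of the sensitivity-analysis idea: perturb each latency/throughput parameter of an abstract bottleneck model and observe which one the predicted kernel time responds to. All numbers and resource names below are hypothetical; the talk's approach operates on emulated GPU binaries.

```python
# Sketch: sensitivity analysis against a simple bottleneck performance model.
demands = {"dram_bytes": 4e9, "flops": 2e11, "shared_mem_bytes": 8e9}   # toy kernel demands
caps = {"dram_bytes": 900e9, "flops": 7e12, "shared_mem_bytes": 4e12}   # toy throughputs

def predicted_time(caps):
    # bottleneck model: execution time limited by the slowest resource
    return max(demands[r] / caps[r] for r in demands)

base = predicted_time(caps)
for r in caps:
    perturbed = dict(caps, **{r: caps[r] * 1.10})   # speed this resource up by 10%
    sensitivity = (base - predicted_time(perturbed)) / base
    print(f"{r:18s} sensitivity: {sensitivity:.1%}")  # 0% means: not the bottleneck
```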
Tensor Comprehensions for Machine Learning in SaC
Heriot-Watt University, United Kingdom
Over the last few years, several machine learning frameworks such as TensorFlow, Caffe, Caffe2, or Torch have established themselves. They can be seen as specialised DSLs with highly elaborate, hand-tuned high-performance implementations at their core engine, typically with many man-hours of implementation and tuning effort behind them. This talk looks at an alternative approach of using the generic programming language SaC, which aims at high productivity. We show how the tensor comprehension constructs in SaC can be used to implement deep-learning algorithms very
concisely, reflecting the underlying mathematics very directly. We also present some
initial performance figures which suggest that the generated codes are competitive with
those obtained from the highly tuned frameworks.
Adaptive MapReduce Similarity Joins
University of Padova, Italy
Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumuller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be carried over to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple modification to Hu et al.’s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
Joint work with Samuel McCauley.
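A toy LSH-style join for intuition only (random-hyperplane signatures and a single hash table; the algorithms in the paper are adaptive and come with guarantees this sketch lacks):

```python
# Sketch: bucket points by an LSH signature, then verify candidate pairs exactly.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
S = rng.normal(size=(200, 8))
R = rng.normal(size=(200, 8))
r = 2.0                                   # distance threshold

planes = rng.normal(size=(8, 12))         # 12 random hyperplanes -> 12-bit signature
sig = lambda P: ((P @ planes) > 0).astype(np.uint8)

buckets = defaultdict(list)
for i, s in enumerate(sig(S)):
    buckets[s.tobytes()].append(i)

join = []
for j, key in enumerate(map(lambda v: v.tobytes(), sig(R))):
    for i in buckets.get(key, []):        # candidates sharing the signature
        if np.linalg.norm(S[i] - R[j]) <= r:
            join.append((i, j))
print(len(join), "pairs found (a single table, so some true pairs may be missed)")
```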
Efficient Implementation of Unconstrained Submodular Function Minimization
ETH Zürich, Switzerland
Submodular set functions are an important topic in machine learning. These are functions that operate on subsets, returning a real number. They exhibit a diminishing returns property whereby it is more valuable to add an element to a smaller set than to a larger one. This is a property that arises naturally in many applications, giving rise to tasks that boil down to either submodular function minimization or maximization. Examples of interesting tasks that can be solved with either submodular function minimization or maximization include: optimal sensor placement, finding the maximum flow in a flow network, corpus selection, and image de-noising.
In this talk we will discuss an efficient implementation of the
Fujishige-Wolfe minimum norm point algorithm. This algorithm solves a
submodular set function minimization problem by turning it into a
geometric one. This geometric problem is solved by maintaining a set
of vectors, represented as a matrix S, and finding the point closest
to the origin within their affine hull. The computations involved in
this algorithm can be broken into three categories: incrementally
updating the QR factorization of S, using that QR factorization to
find the point closest to the origin within the affine hull of S, and
evaluating the submodular function to decide the next vector to add to
S. We will describe efficient, scalable implementations of these
operations and analyze their performance.
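A sketch of the geometric core step, the point in the affine hull of the columns of S closest to the origin, solved here directly via a small KKT system; the implementation discussed in the talk maintains an incrementally updated QR factorization of S for this purpose instead.

```python
# Sketch: the "affine minimizer" step used inside the Fujishige-Wolfe algorithm.
import numpy as np

def affine_minimizer(S):
    """S: d x k matrix; returns the point of the affine hull of its columns
    closest to the origin, and the affine coefficients."""
    k = S.shape[1]
    G = S.T @ S
    # KKT system for: minimize ||S w||^2 subject to sum(w) == 1
    A = np.block([[G, np.ones((k, 1))],
                  [np.ones((1, k)), np.zeros((1, 1))]])
    rhs = np.concatenate([np.zeros(k), [1.0]])
    w = np.linalg.solve(A, rhs)[:k]
    return S @ w, w

rng = np.random.default_rng(4)
S = rng.normal(size=(5, 3))
y, w = affine_minimizer(S)
print("coefficients sum to", w.sum(), "; ||y|| =", np.linalg.norm(y))
```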
How to spot Linear Algebra in Spreadsheets
Imperial College, United Kingdom
Spreadsheets provide Turing-complete programmability to a large number of
businesses and users. In this talk, we discuss the benefits of vectorising the compute-graph
of spreadsheets, with a view towards finding patterns matching the compute structures of
linear algebra, in order to optimise evaluation times of end nodes.
On the Use of Data Mining for multi-dimensional Sensor Data
T.U. Munich, Germany
The talk will present TurbO (Turbine Optimisation), a project funded by Bayerische Forschungsstiftung which investigates the applicability of data mining algorithms for industrial sensor data gathered from gas-fired power plants across the world. As a follow-up to the talk given at ScalPerf’16, which presented some preliminary ideas, we will show results which were obtained within the first year of the project, which started in 2017 and will last for 3 years in total.
Operation of gas turbines creates huge amounts of data (within the range of several TB per year for each power plant) through comprehensive sensor measurements. To be able to process and analyse these huge amounts of sensor data, which come as time series, dedicated hardware is required in terms of storage, compute power, and network, as well as specific software infrastructure. Once this infrastructure has been set up, appropriate data mining and machine learning algorithms need to be identified and adapted to properly cope with the data.
We will start by introducing the hardware and software solutions that we set up for our investigations. The main part of the talk will focus on the algorithms we consider suitable so far.
The first algorithm under investigation (SCRIMP++) is based on the so-called matrix profile, which contains information on similarities within multi-dimensional time series. By detecting these similarities we are able to analyse recurring events (which could hint at some unwanted behaviour) within the gas-fired power station.
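For intuition, a naive version of the matrix profile computation (quadratic in the series length; SCRIMP++ and related algorithms compute the same quantity far more efficiently, and on multi-dimensional data):

```python
# Sketch: the matrix profile records, for each window of a time series,
# the distance to its nearest-neighbour window elsewhere in the series.
import numpy as np

def matrix_profile_naive(ts, m):
    n = len(ts) - m + 1
    windows = np.array([ts[i:i + m] for i in range(n)])
    # z-normalize each window so similarity is shape-based
    windows = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
    profile = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(windows - windows[i], axis=1)
        d[max(0, i - m // 2): i + m // 2 + 1] = np.inf   # exclude trivial self-matches
        profile[i] = d.min()
    return profile

ts = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * np.random.default_rng(5).normal(size=1000)
mp = matrix_profile_naive(ts, m=50)
print("most repeated motif starts near index", int(np.argmin(mp)))
```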
Another algorithm being investigated is TICC (Toeplitz Inverse Covariance-based Clustering), which is based on Markov Random Fields (MRFs). Through this, clusters with similar characteristics can be identified within multi-dimensional time series.
These two approaches are currently being applied to sensor data from Stadtwerke München which serves as a real world example to demonstrate their feasibility.
The talk will conclude with an outlook on future analysis using higher sampling frequencies (25 kHz instead of 1 Hz) and additional data sources. Open questions with regard to specific frameworks, machine learning methods, and hardware will be discussed.
A model-driven approach for a new generation of adaptive libraries
Free University of Bozen, Italy
Efficient high-performance libraries often expose multiple
tunable parameters to provide highly optimized routines. These can range
from simple loop unroll factors or vector sizes all the way to algorithmic
changes, given that some implementations can be more suitable for certain
devices by exploiting hardware characteristics such as local memories and
vector units. Traditionally, such parameters and algorithmic choices are
tuned and then hard-coded for a specific architecture and for certain
characteristics of the inputs. However, emerging applications are often
data-driven, thus traditional approaches are not effective across the wide
range of inputs and architectures used in practice. In this paper, we propose a predictive model that selects the optimal algorithmic parameters, trained with synthetic and real datasets. We demonstrate the effectiveness of our approach on a BLAS library and specifically on its matrix multiplication routine. We
present experimental results for two GPU architectures and show significant
performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x
(on an embedded ARM Mali GPU) when compared to a traditionally optimized
library.
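A minimal sketch of the idea of predicting an algorithmic variant from input characteristics; the variant names, features, and training labels below are synthetic placeholders rather than the GPU BLAS tuning space from the talk:

```python
# Sketch: a model that picks a matrix-multiplication variant from input sizes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
# features: (M, N, K) matrix dimensions, log-scaled
dims = 2 ** rng.uniform(4, 12, size=(500, 3))
X = np.log2(dims)
# synthetic ground truth: small problems -> "blocked", skinny -> "vectorized", else "tiled"
best = np.where(dims.max(axis=1) < 2 ** 6, 0,
                np.where(dims.min(axis=1) < 2 ** 5, 1, 2))

model = DecisionTreeClassifier(max_depth=4).fit(X, best)
variant = ["blocked", "vectorized", "tiled"][model.predict(np.log2([[512, 512, 64]]))[0]]
print("selected kernel variant:", variant)
```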
Components for a High Performance GraphBLAS
Louisiana State University, Baton Rouge, LA, USA
The GraphBLAS provides a powerful API for instantiating graph analytic
operations in the language of linear algebra. Like its namesake, the
BLAS, a goal of this project is to provide a set of efficient building
blocks for creating performance portable graph routines. However, for
unstructured graph operations it is particularly difficult to extract
performance from modern hardware features, such as multilevel caches
and vector instructions. In the case of more structured domains, like
dense linear algebra, this is achieved through the careful
organization of data movement and computation based on the underlying
structure of the data. In this work, we adapt this approach for linear
algebra-based graph analytic algorithms. When applicable, we exploit
the hierarchical structure observed in real-world networks to fit the
data to the cache structure of the target system. In turn, these
efficient access patterns feed generated and tuned computational
kernels. The end result of this work includes high performance
implementations of key graph operations along with a collection of
efficient components for a broader set of GraphBLAS operations.
A collective communication model for WK-recursive network topologies
CNR-ICAR Napoli, Italy
Deep learning techniques have seen great success in diverse application
domains including speech processing, computer vision, and natural
language processing. The two main reasons for their recent success
are: 1) the availability of large-scale training data; and 2) advances
in computing hardware to efficiently train large-scale neural networks
using this data. Aside from the high data parallelism, convolutional neural network (CNN) training also involves high volumes of message
transfers between the many-core accelerators, mainly due to the
forwarding and storing of data between adjacent CNN layers. Most of
these accelerators are spatial in nature, i.e., an array of
interconnected processing elements (PEs) is used to provide
parallelism. The performance of a many-core architecture is highly
dependent on the capabilities of its communication backbone, that is, the network-on-chip (NoC). An efficient NoC designed for a
many-core platform should align the connectivity of the NoC with the
application’s on-chip traffic patterns. However, there has been little
research on the architecture or implementation of a NoC
interconnecting the PEs to each other. Usually, NoCs rely on message passing, since it is scalable and more efficient than using shared memory. For computation-intensive tasks, parallel applications are typically employed which leverage the underlying parallel architecture. These parallel applications employ multiple
co-operating and communicating processes to speed up the
computation. Communication operations may be either point-to-point,
which involves a single source and a single destination, or collective, in which more than two processes participate. Collective communication traffic is, however, more challenging to handle on conventional NoC platforms than unicast traffic, because it requires long-range data exchanges and dense communication that can cause heavy congestion. Without proper support
from the underlying NoC architecture and routing protocol, this
communication can lead to high network and queueing latencies. In a WK-recursive interconnect topology, the number of nodes can scale into the thousands. Consequently, simple policies are required to efficiently manage inter-node communication, especially when groups of cooperating nodes need to exchange groups of related information. The solution proposed for this collective communication consists in treating message transfer as communication among homologous level-nodes. This solution, which takes advantage of the recursive network properties, allows each gather node of a sender level-node to 1) identify the correct message sequence; 2) evaluate the quantum to assign to the scheduler; and 3) transfer the sequence as a single message. The resulting message is then transferred to the homologous nodes of a receiver level-node over a level channel in a pipelined fashion to increase throughput. Delivery follows a straightforward communication algorithm.
Are existing Parallel Programming Models ready for Future HPC Systems?
T.U. Munich, Germany
In this talk, I will discuss the pros and cons of existing and proposed parallel programming models with regard to scalability challenges on future, deeply hierarchical, heterogeneous systems with huge numbers of compute components, as well as new usage expectations such as
malleability and fault tolerance. To this end, I will present the
current state of our own library proposal (“LAIK”) on top of MPI,
together with first promising results.