Abstracts (in progress)
A Machine Learning Approach to Online Fault Classification in HPC Systems
University of Bologna, Italy
As High-Performance Computing (HPC) systems push towards the Exascale
goal, failure rates both at the hardware and software levels will
increase significantly. Detecting and classifying faults in HPC
systems as they occur and initiating corrective actions before they
can transform into failures becomes essential for their continued
operation. In this talk, I will present a fault classification method
for HPC systems based on machine learning. The novelty of our approach
rests with the fact that it can be operated on streamed data in an
online manner, thus opening the possibility to devise and enact
control actions on the target system in real-time. I will then
describe a high-level, easy-to-use fault injection tool called FINJ,
with a focus on the management of complex experiments. In order to
train and evaluate our machine learning classifiers, we injected
faults into an in-house experimental HPC system using FINJ, and
generated an extensive fault dataset. Experimental results demonstrate
that our approach allows almost perfect classification accuracy to be
reached for different fault types with low computational overhead and
minimal delay. Both FINJ and the dataset are publicly available to
facilitate resiliency research in the HPC systems field.
The I/O complexity of Toom-Cook integer multiplication
University of Padova, Italy
Nearly matching upper and lower bounds are derived for the I/O complexity of the Toom-Cook-k (or Toom-k) algorithm computing the product of two integers, each represented with n digits in a given base s, in a two-level storage hierarchy with M words of fast memory, where different digits are stored in different memory words.
An IO_{A_k}(n,M) = Ω((n/M)^{log_k(2k−1)} · M) lower bound on the I/O complexity is established by a technique that combines an analysis of the size of the dominators of suitable sub-CDAGs of the Toom-Cook-k CDAG (Computational Directed Acyclic Graph) with the analysis of a function, which we call “Partial Grigoriev’s flow”, that captures the amount of information which must be transferred between specific subsets of input and output variables by any algorithm that solves the integer multiplication problem. The lower bound applies even if the recomputation of partial results is allowed.
A careful implementation of the Toom-Cook-k algorithm, assuming that M = Ω(k^3 log_s k), is also developed and analyzed, leading to an I/O complexity upper bound that is within a factor O(k^2) of the corresponding lower bound, and hence asymptotically optimal for fixed k.
Both the lower and the upper bounds are actually derived in the more
general case where the value of k is allowed to vary with the level
of recursion, although the quantitative expressions become more
involved. Extensions of the lower bound for a parallel model
with P processors are also discussed.
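The bounds above can be stated compactly (a sketch in asymptotic notation, restating for fixed k the forms given in this abstract):

```latex
IO_{A_k}(n, M) \;=\; \Omega\!\left(\left(\frac{n}{M}\right)^{\log_k(2k-1)} M\right),
\qquad
IO_{A_k}(n, M) \;=\; O\!\left(k^{2}\left(\frac{n}{M}\right)^{\log_k(2k-1)} M\right)
\;\text{ for } M = \Omega\!\left(k^{3}\log_s k\right).
```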
OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations
T.U. Berlin, Germany
Agent-based simulations are becoming widespread among scientists from
different areas, who use them to model increasingly complex
problems. To cope with the growing computational complexity, parallel
and distributed implementations have been developed for a wide range
of platforms. However, it is difficult to have simulations that are
portable to different platforms while still achieving high
performance. We present OpenABL, a domain-specific language for
portable, high-performance, parallel agent modeling. It comprises an
easy-to-program language that relies on high-level abstractions for
programmability and explicitly exploits agent parallelism to deliver
high performance. A source-to-source compiler translates the input
code to a high-level intermediate representation exposing parallelism,
locality and synchronization, and, thanks to an architecture based on
pluggable backends, generates target code for multi-core CPUs, GPUs,
large clusters and cloud systems. OpenABL has been evaluated on six
applications from various fields such as ecology, animation, and
social sciences. The generated code scales to large clusters and
performs similarly to hand-written target-specific code, while
requiring significantly fewer lines of code.
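The agent-parallel execution pattern that OpenABL-style languages exploit can be illustrated with a minimal sketch, purely illustrative Python rather than OpenABL syntax: every agent's new state is computed from a read-only snapshot of the previous generation, so all updates within a step are independent and hence parallelizable.

```python
# Illustrative sketch of the agent-parallel pattern (not OpenABL syntax):
# each agent's update reads only the *previous* generation's state,
# so all updates within one step are independent and parallelizable.

def step(agents, interact):
    """Compute one simulation step with double buffering."""
    # Every new state depends only on the old, read-only snapshot.
    return [interact(a, agents) for a in agents]

def flock_toward_centre(agent, agents):
    """Toy interaction: move each 1-D agent 10% toward the mean position."""
    centre = sum(agents) / len(agents)
    return agent + 0.1 * (centre - agent)

agents = [0.0, 10.0]
agents = step(agents, flock_toward_centre)
print(agents)  # → [0.5, 9.5]
```

Because `interact` never mutates the old state, a backend is free to evaluate the comprehension on CPU threads, a GPU, or across cluster nodes.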
HPC and AI: Impact and Opportunities
NTNU: Norwegian University of Science and Technology, Norway
HPC (High Performance Computing) techniques are needed to achieve the desired performance for computationally intensive tasks, ranging from image processing to astrophysics and weather simulations. Traditionally, the HPC field drove companies like Cray, IBM and others to develop processors for supercomputing. However, since the proliferation of COTS (commercial off-the-shelf) processors, including GPUs for gaming and, more recently, AI, market forces in other fields have driven the innovation in processor design. This means that algorithms, tools and applications should now adapt to and take advantage of tensor processors, Machine Learning techniques, and other related technology, rather than expecting old computational models to hold true. In this talk, we will discuss these issues, including how this is also an opportunity to develop better graph algorithms for AI and to apply some of the techniques from AI to HPC challenges.
A Lightweight Programming Model for a Data-Flow Execution Model
University of Siena, Italy
Data-Flow Threads (DF-Threads) represent a simple and scalable approach to High Performance Computing. To ease the mapping from generic high-level programming models, we defined a low-level minimalistic API, which can be targeted by compilers, ideally by directly mapping internal data-flow intermediate representations onto this API. The API uses the C language as its programming model and can also be used directly by the programmer: some basic examples are provided. We have evaluated our approach against OpenMPI and CUDA. Our approach achieves higher GFLOPS per core than both MPI and CUDA. In particular, OpenMPI incurs a large overhead due to OS-kernel activity.
ModSim in the AI Era
Brookhaven National Laboratory, USA
It is difficult to make predictions. Especially about the future. But for sure it is heterogeneous
T.U. Berlin, Germany
Dennard observed in 1974 that, as transistors become smaller, their
power density stays roughly constant, so that the power consumption of
integrated circuits stays roughly constant. This scaling law broke
down around the year 2006, which motivated the multicore
revolution. Furthermore, there are also strong signs that Moore’s law
is close to ending. For these reasons it seems that the only way to
provide even higher performance (besides, as of yet, unproven
technologies such as quantum computers) is heterogeneous computer
systems. In fact, such heterogeneous systems have dominated the TOP500
list of supercomputers in recent years, since they achieve higher
performance per Watt than homogeneous designs. In this talk I will
briefly describe some of the research my team members and I performed
in this area. In particular, some of the research results obtained in
the EU projects Encore, LPGPU, and LPGPU2 will be presented.
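Dennard's observation can be stated quantitatively (a textbook sketch, not taken from the talk): if linear dimensions and supply voltage both scale by 1/κ, then capacitance, voltage and frequency scale so that dynamic power per transistor shrinks exactly as fast as transistor density grows:

```latex
P \;\propto\; C V^{2} f \;\propto\; \frac{1}{\kappa}\cdot\frac{1}{\kappa^{2}}\cdot\kappa
\;=\; \frac{1}{\kappa^{2}},
\qquad
\text{density} \propto \kappa^{2}
\quad\Longrightarrow\quad
\frac{P}{\text{area}} \approx \text{constant}.
```

Once supply voltage could no longer be reduced with each shrink, power density started rising instead of staying constant, which is what broke the scaling law around 2006.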
“Turing Tariff” Reduction: architectures, compilers and languages to break the universality barrier
Imperial College, United Kingdom
This talk is about the “Turing tariff”, sometimes called the “Turing Tax” (I
did not invent the term but I don’t know where it came from). We pay
Turing tariffs when we use a general-purpose machine for a purpose
where a specialized product would be better. The aim of this talk is
to focus attention on how, when and why we pay Turing tariffs – and
what strategies we know for avoiding them. I will look at processor
and accelerator architecture, as well as the role of the compiler.
I’ll also try to show that there is an analogous price to pay when we
use a general-purpose language where we might have done better with a
toolchain able to deliver domain-specific optimisations.
High-level programming infrastructure and data movement optimization for accelerator-based HPC architectures
Linkoping University, Sweden
This presentation has two parts. In the first part I will shortly present the design of the EXA2PRO high-level programming model and framework, which is currently under development. EXA2PRO (exa2pro.eu) is a H2020 FETHPC project (2018-2021) with 7 partners from 5 countries. The vision of EXA2PRO is to develop, and apply, a programming environment for productive deployment of highly parallel, portable applications in exascale computing systems. The EXA2PRO programming model is designed to support convenient and high-level programming of heterogeneous parallel systems by seamlessly integrating: multi-backend algorithmic skeletons, multi-variant components, smart data-containers, and a platform modeling and introspection framework, in an HPC programming and execution environment.
In the second part I consider the global transfer fusion problem in heterogeneous computer systems with non-shared memory, i.e., the problem of minimizing, for a heterogeneous program modeled by a dataflow graph of kernel calls, the overall number of operand data transfers, and thus, the accumulated transfer startup overhead. Our approach analyzes the kernel-operand dependence graph and reorders the operand arrays in memory such that transfers and memory allocations of multiple operands adjacent in memory can be merged, saving transfer startup costs and memory allocation overheads.
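The payoff of reordering operands so that co-transferred arrays become adjacent in memory can be sketched as a simple interval merge (illustrative Python; the function name and representation are assumptions, not the actual framework API): every merged range pays one transfer startup cost instead of several.

```python
def merge_transfers(ranges):
    """Merge adjacent or overlapping (start, end) operand ranges in memory.

    Each merged range costs one transfer startup instead of several,
    which is the saving that operand reordering enables.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:      # contiguous or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Three operands laid out back-to-back: one transfer instead of three.
print(merge_transfers([(0, 100), (100, 250), (250, 300)]))
# → [(0, 300)]
```

The hard part, which the talk addresses, is choosing the memory layout from the kernel-operand dependence graph so that operands transferred together actually end up adjacent.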
Performance Software for Deep Learning Chips
Intel, USA
Scaling a Fine-Grained Parallel Biological Sequence Alignment Application to Hundreds of GPUs
University of Brasilia, Brazil
In this talk, we discuss related work in the area of parallel biological sequence alignment with the Smith-Waterman algorithm and present CUDAlign, our fine-grained multi-GPU strategy to align DNA sequences with up to 249 million characters on hundreds of GPUs. In order to achieve this, CUDAlign uses (a) a fine-grained parallelogram-shaped strategy to exploit parallelism; (b) overlapping of computation and communication; and (c) an innovative speculation technique, which is able to parallelize a phase of the Smith-Waterman algorithm that is inherently sequential. We will show that CUDAlign is able to attain the impressive rate of 10.3 TCUPS (Trillions of matrix Cells Updated per Second), the best performance in the literature so far. We also present a pruning technique for one GPU that is able to prune more than 50% of the Smith-Waterman matrix and still retrieve the optimal alignment. Then, we will discuss our thoughts on how to adapt this pruning strategy to a multi-GPU environment and make it scalable to hundreds of GPUs, showing some preliminary results. Finally, we will present open challenges and research directions.
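For reference, the recurrence at the heart of what CUDAlign parallelizes is the Smith-Waterman local-alignment score. A deliberately minimal single-threaded sketch with a linear gap penalty and illustrative scoring parameters (not CUDAlign's actual configuration):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the optimal local alignment score (linear gap penalty)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # local: restart alignment
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] + gap,    # deletion
                          H[i][j - 1] + gap)    # insertion
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # → 8 (four matches at +2 each)
```

Each cell depends on its left, upper and upper-left neighbours, so cells on the same anti-diagonal are independent, which is the parallelism a fine-grained GPU strategy exploits.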
On the ROI of Parallel Performance Optimization
Jülich Supercomputing Center, Germany
Developers and users of HPC applications can count on free advice from European experts to analyse the performance of their scientific codes. The Performance Optimization and Productivity (POP) Centre of Excellence, funded by the European Commission under H2020, gathers experts from BSC, IT4I, JSC, HLRS, RWTH Aachen, NAG, Teratec, and UVSQ. A first phase ran from October 2015 to March 2018, and the CoE was renewed for another three years in December 2018. The objective of POP is to provide performance measurement and analysis services to the industrial and academic HPC community, help them better understand the performance behaviour of their codes, and suggest improvements to increase their efficiency. Training and user education regarding application tuning is also provided. Further information can be found at https://www.pop-coe.eu/.
The talk will briefly introduce the POP CoE, its objectives, and its
performance analysis and optimization services provided to the
community. Details about the tools, methods, and metrics used for the
services will also be given. The main part of the presentation will
report on statistics, outcomes, and success stories from the over 150
performance assessments performed so far during the project. The talk
will conclude with lessons learned from providing performance analysis
and optimization in a service-oriented professional setting.
IBM, T. J. Watson Research Lab., USA
Cyber security threats and vulnerabilities draw more public attention
to computing than any other issue, and rightly so. Millions of people
are impacted every year by unexpected attacks, with varying degrees
of financial loss. It is no surprise that the community, including
government, industry, and academia, is paying far more attention to
the problem. In this talk, we will cover some of the basic forms of
attacks that threaten our computing systems. We will also discuss the
architectural innovations being developed and deployed in computer
systems to counter those attacks. We plan to discuss a broad
range of issues, including return-oriented and jump-oriented
programming, memory encryption, memory integrity, side-channel
attacks, and homomorphic computing. We will conclude the talk with a
list of specific actions that can make our computing systems more
secure.
Large Capacity Memory Systems with Intel(R) Optane DC Memory: An Introduction
Intel, USA
A new kind of memory based on flash technology, far denser than traditional DRAM but with higher latency, is available on the market today. The high density of this memory technology enables semiconductor manufacturers to build memory DIMMs with very large capacity. One such product is Intel(R) Optane DC Persistent Memory, which allows a user to build a compute platform containing up to 6TB of main memory. In this talk we introduce the system architecture of such a compute platform and share performance results for a few graph algorithms, operating on very large graphs, that use a significant portion of the machine's memory capacity. We will also share details of the optimizations we had to apply, in the operating system and in user code, to achieve good performance. Finally, we will introduce some development tools for these machines that were useful in tuning applications to run efficiently on such large-capacity systems.
What can machine learning do for computer systems?
University of Texas at Austin, USA
Machine learning has transformed areas like vision and AI but its impact on
computer systems has so far been more limited. In this talk, I discuss some new directions
in computer systems for which I believe machine learning is essential.
Leveraging the Use of High Performance Computing Systems in Mining of Streams of Real Valued Data Points
T.U. Munich, Germany
The TurbO project aims at studying and optimizing heavy-duty gas turbines and power plants. Monitoring the operation and combustion process of these turbines generates large, fine-grained data streams, which must be analyzed in detail. Such analyses bring insights to turbine operators and manufacturers, enabling further optimization of their operation and better future designs. These analyses require substantial computational resources; I will discuss how HPC systems can be exploited for this purpose, opening the door to a wide range of new possibilities for gaining fine-grained insights from data streams.
Modern online processing techniques rely on domain knowledge and insights from previous operations of the system under investigation. Various techniques are used to extract such insights from the operation of systems. In particular, we focus on the class of “Matrix Profile” approaches, which encode similarities and anomalies in univariate and multivariate streams and simplify data mining tasks on the streams. In this talk, I will discuss kernels for computing Matrix Profiles and how they can be implemented on HPC systems.
I will present efficient kernels to calculate Matrix Profiles, the resource requirements for their efficient execution, and how one might approach implementing an efficient matrix profile calculator in a distributed system. I will introduce a domain decomposition scheme based on the partitioning of records, leading to similar data distributions for the parallel execution of different kernels. Furthermore, I will discuss the optimization techniques used to boost the performance of the kernels on a CPU-based system.
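A matrix profile records, for every length-m subsequence of a series, the distance to its nearest non-trivial match elsewhere in the series. A deliberately naive O(n²·m) sketch conveys the definition (the kernels discussed in the talk are far more efficient, and practical variants typically z-normalize the windows):

```python
import math

def matrix_profile(ts, m):
    """Naive matrix profile: for each length-m window, the Euclidean
    distance to its nearest match, excluding trivial matches that
    overlap the window itself."""
    n = len(ts) - m + 1
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m // 2 + 1:     # exclusion zone around i
                continue
            d = math.dist(ts[i:i + m], ts[j:j + m])
            best = min(best, d)
        profile.append(best)
    return profile

# A repeated motif [1, 2, 3] yields zero profile values at its two
# occurrences, while windows covering the spike (9) score high.
print(matrix_profile([1, 2, 3, 0, 0, 9, 0, 1, 2, 3], 3))
```

The all-pairs structure is what makes the computation both expensive and highly parallelizable, motivating the HPC kernels and the record-partitioning decomposition mentioned above.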
Productivity through Abstraction
Transfinite Arrays for Unifying Streams and Arrays
Heriot-Watt University, United Kingdom
Arrays and streams seem to be fundamentally at odds: arrays require their size to be known in advance, streams grow dynamically; arrays offer direct access to all of their elements, streams provide direct access to the front elements only; the list goes on. Despite these differences, many computational problems at some point require shifting from one paradigm to the other. The driving forces for such a shift can be manifold. Typically, it is a shift in the task requirements at hand, or it is motivated by efficiency considerations. In this talk, we propose a generalisation of arrays, named "Transfinite Arrays", which allows switching between these two worlds without changing the program specification itself. Instead, a compiler can generate both a streamed version and an array-based version from the very same program specification.
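The flavour of the unification can be conveyed with a small sketch (illustrative Python, not the paper's formal semantics or implementation): one indexed read interface backed either by a finite array or by a memoized, potentially unbounded stream, so the same "program" runs unchanged against both.

```python
import itertools

class IndexedSeq:
    """One indexed read interface over a finite array *or* an unbounded
    stream; stream elements are memoized on demand. Illustrative only."""
    def __init__(self, source):
        if isinstance(source, list):
            self.cache, self.it = source, iter(())
        else:
            self.cache, self.it = [], source
    def __getitem__(self, i):
        while len(self.cache) <= i:          # pull from the stream lazily
            self.cache.append(next(self.it))
        return self.cache[i]

# The same specification works on an array and on an infinite stream.
def third_plus_first(xs):
    return xs[2] + xs[0]

print(third_plus_first(IndexedSeq([10, 20, 30])))          # → 40
print(third_plus_first(IndexedSeq(itertools.count(10))))   # → 22 (12 + 10)
```

A compiler with this abstraction in the language, rather than in a runtime wrapper, can specialize the same specification into either a strict array program or a streaming one.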
Faster Matrix Multiplication via Sparse Decomposition
Hebrew University, Israel
Fast matrix multiplication algorithms are of practical use only if the
leading coefficient of their arithmetic complexity is sufficiently
small. Many algorithms with low asymptotic cost have large leading
coefficients, and are thus impractical. Karstadt and Schwartz (2017)
have demonstrated a technique that reduces the leading coefficient by
introducing fast O(n^2 log n) basis transformations, applied to the
input and output matrices. We generalize their technique, by allowing
bases of larger dimension for the transformations, while maintaining
low overhead. Thus we accelerate several matrix multiplication
algorithms, beyond what is known to be possible using the previous
technique. Of particular interest are a few new sub-cubic algorithms
with leading coefficient 2, matching that of classical matrix
multiplication. For example, we obtain an algorithm with arithmetic
complexity of 2n^{log_3 23} + o(n^{log_3 23}), compared to the
2n^3 − n^2 of the classical algorithm. Such new algorithms can outperform
previous ones (classical included) even on relatively small
matrices. We obtain lower bounds matching the coefficient of several
of our algorithms, proving them to be optimal.
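The role of the leading coefficient can be seen from a standard back-of-the-envelope recurrence (a folklore analysis, not this work's derivation): a recursive algorithm performing t block multiplications on n₀ × n₀ blocks plus a·n² additional work per level satisfies

```latex
T(n) = t\,T\!\left(\frac{n}{n_0}\right) + a\,n^{2}
\quad\Longrightarrow\quad
T(n) \approx \left(1 + \frac{a\,n_0^{2}}{t - n_0^{2}}\right) n^{\omega_0},
\qquad \omega_0 = \log_{n_0} t .
```

For the classical algorithm (t = 8, n₀ = 2, a = 1) this yields the coefficient 2, while for Strassen (t = 7, n₀ = 2, a = 4.5) it yields 7; cheap basis transformations reduce the additive work a, and hence the leading coefficient.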
Toward a Complexity Theory for Massively Parallel Computations
University of Padova, Italy
The Massively Parallel Computation (MPC) model serves as a common abstraction of many modern large-scale data processing frameworks, and has been receiving increasingly more attention over the past few years, especially in the context of classical graph problems. So far, the only way to argue lower bounds for this model is to condition on conjectures about the hardness of some specific problems, such as graph connectivity on promise graphs that are either one cycle or two cycles, usually called the “one cycle vs. two cycles” problem. This is unlike the traditional arguments based on conjectures about complexity classes (e.g., P ≠ NP), which are often more robust in the sense that refuting them would lead to groundbreaking algorithms for a whole class of problems.
In this work we present connections between problems and classes of
problems that allow the latter type of arguments. Our conjectures
concern the class of problems solvable in a sub-logarithmic amount of
rounds in the MPC model, denoted by MPC(o(log n)), and some standard
classes concerning space complexity, namely L and NL. Our conjectures
are robust in the sense that refuting them would lead to many
surprisingly fast new algorithms in the MPC model. In the process we
also obtain new conditional lower bounds by leveraging known
reductions and by proving new reductions and equivalences between
problems.
Advancing General-Purpose Optimising Compilers for GPUs
T.U. Berlin, Germany
Stratton identified eight code optimisation patterns particularly
relevant to improving run time performance on GPUs. Questions
regarding policy - when and to which degree to apply each optimisation
- are commonly difficult problems that involve managing a complex set
of trade-offs with the potential to gain significant speed-ups. I will
look at different optimisation patterns and their applicability to
general-purpose GPU compilers. I will discuss different approaches to
policy decisions in a fully automated manner, and throughout focus on
my own experience of working on restructuring application parallelism
in general-purpose GPU codes.
Celerity: High-productivity Programming for Accelerator Clusters
University of Innsbruck, Austria
In contemporary large-scale HPC systems, the highest tiers of performance and efficiency are generally achieved by accelerator clusters. However, while efficient, these architectures burden developers with both the complexities inherent in distributed memory systems as well as those introduced by accelerator programming. The Celerity system is a more user-focused approach designed to enable high-productivity programming of accelerator clusters.
While several efforts to improve on the status quo have been made in the past, the limited support of research toolchains means that the classic "MPI + X" approach – where X is most likely CUDA – remains the most popular real-world choice. To enable broad adoption, Celerity builds on an established industry standard, SYCL. SYCL allows writing single-source parallel programs targeting accelerators using a convenient and modern C++14 API, and Celerity extends this approach to clusters of accelerators.
The main novelty of the Celerity API comes in the form of “range mappers”: While SYCL’s device accessors express concisely which resources are accessed on a compute device, and how, Celerity’s range mappers additionally specify how each individual GPU thread will access a particular buffer. This allows the Celerity runtime to transparently split kernels across multiple workers and to know the required inputs as well as the produced outputs for each of the kernel chunks. Based on this information, the Celerity runtime system asynchronously builds and distributes a task and fine-grained command graph, efficiently executing the given workload on an accelerator cluster.
This talk primarily aims to introduce the Celerity API from a user perspective, but will also provide some insight into the design challenges faced by the underlying system components.
Two Years of TurbO – an Update
T.U. Munich, Germany
The talk will give an update on the TurbO project, a joint research project between TUM and IfTA, a medium-sized enterprise specialising in thermoacoustic analysis. The goal of the project is to make power plants more productive through failure prediction, the monitoring and modelling of combustion processes, and increased energy efficiency.
The talk will focus on the analysis and modelling of real data obtained from a power plant in Munich that exhibited an unwanted operational event. This data was subject to comprehensive investigation and analysis. In the end it turned out that, although everyone is discussing artificial intelligence and machine learning, a relatively simple analytical model helped to significantly improve failure prediction for gas turbines. This case will be presented and discussed in the talk.
The basic approach works as follows: Assuming that we have a (recurring) failure pattern, we aim at modeling and predicting future events based on oscillation data. Within this context, the following issues arise:
- How do you decide when a failure is likely to occur?
- When modelling the process, one has to distinguish between operational data and traditional engineering models and methods.
- As a consequence, a tradeoff has to be made between using analytical methods and machine learning.
- Very often the former is way more efficient and straightforward!
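A minimal example of what such a "simple analytical model" can look like (purely illustrative; the project's actual model is not described in this abstract): flag a likely failure whenever the rolling RMS amplitude of the oscillation signal crosses a threshold.

```python
import math

def rolling_rms(signal, window):
    """Rolling root-mean-square amplitude of an oscillation signal."""
    return [math.sqrt(sum(x * x for x in signal[i:i + window]) / window)
            for i in range(len(signal) - window + 1)]

def alarm_indices(signal, window, threshold):
    """Window-start indices where the RMS exceeds the threshold."""
    return [i for i, r in enumerate(rolling_rms(signal, window))
            if r > threshold]

# Calm oscillation, then growing amplitude toward the end.
signal = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]
print(alarm_indices(signal, window=4, threshold=1.2))  # → [2, 3, 4]
```

Such an analytical rule is cheap, interpretable, and easy to validate against engineering models, which is exactly the tradeoff raised in the list above.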
University of Colorado, Boulder, USA
We present a brief history of high-performance computing (HPC) through
the lens of the tera-, peta-, and pre-exascale milestones. Key
challenges and (potential) solutions are discussed and trends
identified. An overview of the current HPC landscape is provided and,
when coupled with technology roadmaps and evolving workloads, several
possible avenues to practical exascale computing are identified. But
many challenges remain, and we discuss them in the context of several
real-world examples.
Notes to the Intel Configurable Spatial Accelerator
CNR, Naples, Italy
While the demand for computing performance continues to grow exponentially, the underlying advances due to Moore’s law and Dennard scaling that have sustained improvements in performance have been faltering. To respond to this demand, hardware accelerators seem to be a promising solution. Common hardware accelerators come in many forms, from the fully customizable ASIC designed for a specific function (e.g., a floating-point unit) to the more flexible graphics processing unit (GPU) and the highly programmable field-programmable gate array (FPGA). The problem with today’s hardware accelerators, however, is that they are ad-hoc solutions for specific compute/data-intensive queries. Although FPGAs are programmable, reconfiguration, including compilation, synthesis, and routing, usually takes hours. Therefore, applications must know the query patterns in advance so that they can prepare the hardware for the tasks. On July 5, 2018, Intel, in conjunction with the US Department of Defense, disclosed a US patent describing a “configurable spatial accelerator (CSA)” that “targets both high performance computing and the direct execution of a dataflow graph to yield a computationally dense yet energy efficient spatial microarchitecture which far exceeds conventional roadmap architectures.” These notes focus mainly on the dataflow aspects of the CSA engine.
Stream processing of genomic data
CRS4, Cagliari, Italy
Modern sequencing machines produce on the order of a terabyte of data per day, which subsequently needs to go through a complex processing pipeline. The conventional workflow begins with a few independent, shared-memory tools, which communicate by means of intermediate files. Given its lack of robustness and scalability, this approach is ill-suited to exploiting the full potential of sequencing in the context of healthcare, where large-scale, population-wide applications are the norm.
In this work we propose the adoption of stream computing to simplify
the genomic resequencing pipeline, boosting its performance and
improving its fault-tolerance. We decompose the first steps of the
genomic processing in two distinct and specialized modules:
preprocessing, implemented within the Apache Flink framework, and
alignment, based on BWA-MEM. We loosely compose them via communication
through Kafka streams, in order to allow for easy composability and
integration in the already-existing YARN-based pipelines. The
proposed solution is then experimentally validated on real data and
shown to scale almost linearly.
More Science by Better Energy Efficiency and Power Management
Leibniz Supercomputing Centre, Germany
Nowadays the cost of the energy consumed by large HPC systems makes up a significant portion of the total expense of building and running such a system. For a compute centre with a given budget to serve academia, high energy efficiency and adequate power management that adapts to the workload result in more scientific output for the same money. In this talk, I will describe the efforts done, and still to be done, at LRZ towards these goals.