Abstracts (in progress)
Time series data is ubiquitous, as it can model processes related to the most diverse phenomena. When analyzing time series, a particularly interesting task is to find so-called motifs, i.e., repeating patterns in the data. State-of-the-art approaches fall into one of two categories: they are either exact, requiring time at least quadratic in the size of the input, or approximate, with no guarantees on the quality of the returned solution.
In this talk I will present ongoing work on an approximate algorithm to find motifs using Locality Sensitive Hashing (LSH). On the one hand, the rich theory of LSH allows us to state probabilistic guarantees on the quality of the solution. On the other hand, the structure of time series data allows us to speed up the computation of hash values in ways that are not possible in the vector spaces where LSH is usually applied.
To relate this work to this year's SCALPERF topic, we will discuss how implementing the algorithm in the Rust programming language allows us to simultaneously achieve high performance and peace of mind with respect to memory management.
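As a minimal illustration of the general idea (not the algorithm presented in the talk), the following Rust sketch buckets time-series subsequences by a SimHash-style random-projection signature; subsequences that collide become motif candidates to be verified with an exact distance computation. The window length, number of hash bits, and toy series are all made up.

    use std::collections::HashMap;

    /// Tiny deterministic pseudo-random generator (xorshift) so the sketch
    /// needs no external crates; a real implementation would use a proper RNG.
    struct XorShift(u64);
    impl XorShift {
        fn next_f64(&mut self) -> f64 {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            // Map to roughly [-1, 1); good enough for an illustrative projection.
            (self.0 as f64 / u64::MAX as f64) * 2.0 - 1.0
        }
    }

    /// SimHash-style signature: one bit per random direction,
    /// set according to the sign of the dot product with that direction.
    fn hash_bits(window: &[f64], projections: &[Vec<f64>]) -> u32 {
        let mut sig = 0u32;
        for (i, p) in projections.iter().enumerate() {
            let dot: f64 = window.iter().zip(p).map(|(x, y)| x * y).sum();
            if dot >= 0.0 {
                sig |= 1 << i;
            }
        }
        sig
    }

    fn main() {
        let series: Vec<f64> = (0..256).map(|i| (i as f64 * 0.3).sin()).collect();
        let w = 16; // subsequence (motif candidate) length
        let k = 8;  // number of hash bits

        let mut rng = XorShift(0x9E3779B97F4A7C15);
        let projections: Vec<Vec<f64>> =
            (0..k).map(|_| (0..w).map(|_| rng.next_f64()).collect()).collect();

        // Bucket every subsequence by its LSH signature; colliding subsequences
        // are motif candidates to be verified with an exact distance computation.
        let mut buckets: HashMap<u32, Vec<usize>> = HashMap::new();
        for start in 0..=series.len() - w {
            let sig = hash_bits(&series[start..start + w], &projections);
            buckets.entry(sig).or_default().push(start);
        }
        for (sig, starts) in buckets.iter().filter(|(_, s)| s.len() > 1).take(3) {
            println!("signature {sig:08b}: candidate starts {starts:?}");
        }
    }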
Security is often presented as being based on the CIA triad, where
the “C” actually stands for Confidentiality. Indeed, in many human
activities we like to keep some things confidential, or “private”;
this is particularly true when these activities are carried out in the cyber
world, where a lot of our private data is transmitted, processed, and
stored. In this talk, we will first introduce the concept of privacy,
and then see how it is intertwined with two important research
threads. First, we will discuss how computer architectures, and
particularly “trusted components” in processors, could help
protect privacy by allowing us to trust remote systems. Finally, we will
discuss the issue of side channels (in a broad sense, not only in
processors) that could lead to leaks of private information.
Our daily lives are becoming increasingly dependent on a variety of sensor- and
IoT-enabled cyber-physical systems (CPS) such as smart city, smart energy, smart
transportation, smart healthcare, smart manufacturing, and smart agriculture, whose
goal is to improve the quality of life. However, CPS and IoT systems are
extremely vulnerable to dependability and security challenges due to uncertainty related
to their complexity, scale, heterogeneity, big data, human behavior, and resource
limitations. This talk will first highlight the unique challenges in such environments and
then propose novel frameworks and models for dependable, secure, and trustworthy CPS and
IoT systems, based on a rich set of theoretical and practical design principles, such as
machine learning, data analytics, uncertainty reasoning, information theory, prospect
theory, reputation scoring, and trust models. Two case studies will be considered. The
first one aims to design security solutions and lightweight anomaly detection in smart
grid CPS to defend against organized and persistent threats that can launch data integrity
attacks on smart meters using stealthy strategies. The novelty of this approach lies in a
newly defined information-theoretic metric that quantifies robustness and security,
minimizing the attacker’s impact and false alarm rates. The second case study deals with
secure and trustworthy decision-making in mobile crowd-sensing-based vehicular CPS to
detect false/spam contributions due to users’ selfish and malicious behaviors. Based on
cumulative prospect theory and a reputation and trust model, our approach prevents revenue loss
owing to undue incentives and improves operational reliability and decision accuracy at
scale. The talk will conclude with directions for future research.
MemComputing is a novel physics-based approach to
computation that employs time non-locality (memory) to both
process and store information at the same physical location. Its digital version is designed to solve combinatorial
optimization problems. A practical realization of digital memcomputing machines (DMMs)
can be accomplished via circuits of non-linear dynamical systems with memory engineered so
that periodic orbits and chaos can be avoided. A given logic problem is first mapped into this
type of dynamical system whose point attractors represent the solutions of the original
problem. A DMM then finds the solution via a succession of elementary avalanches
(instantons) whose role is to eliminate configurations of logical inconsistency (‘‘logical
defects’’) from the circuit. I will discuss the physics behind MemComputing and show many
examples of its applicability to various combinatorial optimization and Machine Learning
problems, demonstrating its advantages over traditional approaches and even quantum
computing. Work supported by DARPA, DOE, NSF, CMRR, and MemComputing, Inc.
(http://memcpu.com/).
Not all are aware that Norway fostered one of the earliest computing
pioneers, Fredrik Rosing Bull, whose name lives on in today's largest
supercomputer in Europe: the Atos BullSequana. In this talk I will
first take you through the history of Norwegian (and European)
computing up to today's low-power devices from ARM, MyWo and others.
The second part of the talk will cover some of the recent autotuning
and benchmarking work in my research group, including a tool developed
for automatically generating practical performance Roofline models
using high-performing autotuned benchmarks;
the BAT Benchmark suite we developed for AutoTuning; and LS-CAT, the
dataset we built from available CUDA codes on GitHub, which we have
parameterized and then autotuned using basic ML techniques.
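As an illustration of the basic Roofline relation that such a tool instantiates from measured compute and bandwidth ceilings, the short Rust sketch below evaluates attainable performance as the minimum of the compute roof and the memory roof times the arithmetic intensity. The numbers are made-up placeholders, not output of the tool.

    /// Attainable performance in GFLOP/s for a kernel with the given
    /// arithmetic intensity (FLOPs per byte moved from memory).
    fn roofline(peak_gflops: f64, bandwidth_gbs: f64, intensity_flop_per_byte: f64) -> f64 {
        (bandwidth_gbs * intensity_flop_per_byte).min(peak_gflops)
    }

    fn main() {
        let peak = 3000.0; // hypothetical measured compute roof (GFLOP/s)
        let bw = 400.0;    // hypothetical measured memory roof (GB/s)
        for &ai in &[0.25, 1.0, 4.0, 16.0, 64.0] {
            println!("AI = {ai:>5} FLOP/B -> attainable {:.0} GFLOP/s", roofline(peak, bw, ai));
        }
    }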
The simulation of the hit/miss behaviour of the OPT cache replacement policy has been investigated over the last six decades. Particular attention has been devoted to computing the stack distance of each access in the input trace, which encodes the hit/miss behaviour of all cache capacities. The Multi Valued MIN (MV-MIN) algorithm by Belady and Palermo (1974) computes the stack distance in sequential time T(1) = O(LV), where L and V respectively denote the length and the number of distinct addresses in the trace. The Critical Stack Algorithm (CSA) by Bilardi, Ekanadham, and Pattnaik (2017) achieves sequential time T(1) = O(L log V), which is optimal within several models of computation.
In this work, we begin the exploration of parallel versions of the above algorithms, starting with MV-MIN, which has a simpler structure. We develop a parallel variant which runs in time T(P) = O((LV log V)/P + V² log P) = O((T(1) log V)/P + V² log P) on P processors. In particular, T(L/V) = O(V² log L).
The main ideas behind the parallel algorithm are: (i) Casting the sequential algorithm as the evolution of a finite state machine (FSM); (ii) Formulating the evolution of the FSM as a prefix computation on the semigroup of state-transition functions associated with the machine; (iii) Exposing a "triangular" structure of the semigroup, which enables a decomposition of the machine state computation into V computations for machines with much smaller state spaces; (iv) Developing specialized representations of the semigroup elements (for the smaller machines), which can be implemented by dynamic balanced trees so that the multiplication of two semigroup elements can be done in (sequential) time O(V) and the multiplication of an arbitrary semigroup element by a semigroup generator can be done in (sequential) time O(log V). Future work will try to extend the outlined methodology to the CSA.
(Joint work with Gianfranco Bilardi of the University of Padova.)
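To make ideas (i) and (ii) above concrete, here is a minimal illustrative Rust sketch (not the actual MV-MIN construction): each trace symbol is represented as a state-transition table, composing such tables is associative, and evolving the FSM over the trace is a prefix computation over these compositions, which a parallel scan could evaluate in any bracketing. The state count and trace are invented for the example.

    const S: usize = 4; // number of FSM states (small, for illustration)

    /// A transition function, represented as a table: f[s] = next state from s.
    type Trans = [usize; S];

    /// Composition g ∘ f (apply f first, then g). This operation is associative,
    /// so prefixes can be combined in any bracketing, e.g., by a parallel scan.
    fn compose(f: &Trans, g: &Trans) -> Trans {
        let mut h = [0usize; S];
        for s in 0..S {
            h[s] = g[f[s]];
        }
        h
    }

    fn main() {
        // One generator (transition table) per input symbol of the trace.
        let gen_a: Trans = [1, 2, 3, 0]; // symbol 'a': cyclic shift of states
        let gen_b: Trans = [0, 0, 2, 2]; // symbol 'b': collapses some states
        let trace = [gen_a, gen_b, gen_a, gen_a, gen_b];

        // Sequential prefix computation; a parallel version would combine the
        // same compositions with a tree-structured scan.
        let identity: Trans = [0, 1, 2, 3];
        let mut prefix = identity;
        let start_state = 0;
        for (t, g) in trace.iter().enumerate() {
            prefix = compose(&prefix, g);
            println!("after {} symbols, state = {}", t + 1, prefix[start_state]);
        }
    }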
Commercial multicore central processing units (CPUs) integrate a number of processor cores on a single chip to support parallel execution of computational tasks. Multicore CPUs can improve performance over single cores nearly linearly for independent parallel tasks, as long as sufficient bandwidth is available. Ideal speedup is, however, difficult to achieve when dense intercommunication between the cores or complex memory access patterns are required. This is caused by expensive synchronization and thread switching, and by insufficient latency tolerance. These facts guide programmers away from straightforward parallel processing patterns toward complex and error-prone programming techniques. To address these problems, we have introduced the Thick Control Flow (TCF) Processor Architecture. TCF is an abstraction of parallel computation that combines self-similar threads into computational entities. In this presentation, we give an early comparison of the performance and programmability of an entry-level TCF processor versus two Intel Skylake multicore CPUs on commonly used parallel kernels, to find out how well our architecture solves the issues that greatly reduce the productivity of parallel software development.
As the number of transistors per computing system is ever-increasing, so
are the fault rates, which are particularly critical in large computations
on High-Performance Computing (HPC) systems.
It has been predicted that we are reaching a point where the time to
checkpoint such systems will be longer than the mean time to
interruption (MTTI) as seen by applications.
In this presentation, checkpointing is applied at a very fine granularity
by relying on a disciplined data flow among the application threads.
The underlying execution model is known as dataflow-threads (DF-threads),
and the fault-detection extension of this model makes it possible to achieve
resilient execution of an application while faults are affecting the system.
In the proposed implementation, the execution time gracefully degrades as
the number of faults increases, without the need for global checkpointing
and without interrupting the application execution.
The technique has been evaluated on a full-system x86-64 simulator with
encouraging results.
Many optimizations for massively parallel programs aim at improving
load balance, thereby enforcing a lockstep-like behavior of the code
even if strict synchronization is not required. Processes getting
out of sync is generally regarded as an undesirable state. In my talk
I will show some of the dynamics that may ensue if we drop the
requirement of lock-step execution and accept a certain level of
noise as something that does not necessarily have to
be suppressed. The concept of bottlenecks plays a crucial role here
since it determines whether the lockstep mode is stable against noise.
Within certain limits, desynchronized execution can mitigate the
impact of communication overhead, which leads to the non-intuitive
conclusion that deliberate injection of noise may be an optimization
strategy.
Optimization problems of the structure and size relevant to industry will not be solved entirely by a quantum computer in the foreseeable future. Therefore, we address the question of how quantum computing can be usefully integrated into an optimization environment for large-scale and complex planning problems in the near future. With our applied research we want to find a decomposition approach for analyzing the structure of optimization problems and slicing them into subproblems, such that the overall cost of (conventional and quantum) computation is minimized and the given restrictions are respected.
Here, "dependable" is understood in the sense of "foreseeable" runtime, meaning that problems are broken down in such a way that the mix of conventional computing and QC fits within the given runtime.
In HPC, we nowadays see a strong demand from AI applications for fast
computing capabilities. However, these applications require special
support from the HPC system and therefore bring in a range of diverse
requirements. In contrast to classic simulation-based applications,
several different aspects come into play at once. Efficient AI
applications with high expressive power or predictive quality need
access to high-quality mass data to achieve good results. HPC systems
provide fast access to large amounts of data while allowing highly
parallel execution of learning applications. However, it is not always
easy for the user to choose the appropriate execution environment,
since it is not directly obvious whether an AI application will run
efficiently on the system. The HPC system must be able to provide,
depending on the actual needs of the application, adaptable and
customized execution environments for development and production.
Furthermore, there are further dependencies which might lead to different
execution environments, e.g. those induced by security or data needs. In my
presentation I will mainly focus on the requirements that the AI community
in particular imposes on HPC with respect to "dependable computing".
TBA
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed in a nonblocking, lock-free style. Presentation servers maintain no persistent local state, making them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds the configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key mini-transactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
Note: I'm at Google now, and this talk reflects the state of the system
when I left Oracle nearly two years ago. As I understand it, the Oracle
system has been under continued development, but the architecture remains
basically unchanged.
Joint work with Matteo Frigo, Justin Mazzola
Paluska and Sasha Sandler.
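As a back-of-the-envelope illustration of the throughput-prediction argument in the abstract, the Rust sketch below divides the available network bandwidth by a replication write-amplification factor; both numbers are hypothetical placeholders rather than FSS figures.

    // Client-visible write throughput is roughly the NIC bandwidth divided by
    // the bytes of replication traffic generated per byte written.
    fn predicted_throughput(nic_bandwidth: f64, write_amplification: f64) -> f64 {
        nic_bandwidth / write_amplification
    }

    fn main() {
        let nic_gbit_s = 25.0;   // assumed NIC bandwidth per storage server (Gbit/s)
        let amplification = 5.0; // assumed replication write amplification
        println!(
            "predicted write throughput: {:.1} Gbit/s per server",
            predicted_throughput(nic_gbit_s, amplification)
        );
    }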
We are scaling AI everywhere with a combination of software and hardware acceleration. While AI hardware continues to expand its compute capabilities, Intel software AI accelerators deliver an additional 10-100X gain in performance. Our partners and customers have created a wide range of AI applications, pushing the boundaries of AI into our daily lives, from healthcare and finance to entertainment. As the usages expand, the need for compute and security keeps increasing. This talk will cover machine learning, privacy-preserving techniques, and tools.
Special requirements of space missions, including limited energy budgets and radiation tolerance, impose strict dependability requirements on the on-board data processing system. Specifically, physical integrity and the integrity of the computing system call for the deployment of highly robust, tolerant, and resilient solutions. Similar challenges arise when deploying Machine Learning (ML) inference in satellites. FPGA-enabled Commercial-Off-The-Shelf (COTS) solutions are viable platforms that satisfy many of these requirements. At the same time, developing solutions and benchmarks that take the additional accountability measures into account requires flexibility and support from the programming paradigms, given that machine learning solutions are very diverse and change rapidly.
Various programming paradigms have been introduced and are used in practice. In this presentation, we provide a high-level overview of the paradigms introduced and used by Xilinx (as a main player in this domain) and discuss their flexibility to cope with such accountability measures. We specifically discuss the development of an inference benchmark that considers these various programming paradigms.
Floating-point arithmetic is widely used in scientific and engineering applications but is unsound, i.e., there is no guarantee on the accuracy obtained. When working with floating-point numbers, developers often use them as a proxy for real arithmetic and expect close-to-exact results. Unfortunately, while this is often true, the result may also deviate due to round-off errors. Their non-intuitive nature makes it hard to predict when and how these errors accumulate to the extent that they invalidate the final result. In this talk, we discuss an automatic approach to soundly bound such uncertainties. We present a source-to-source compiler that translates a C program performing floating-point computations into an equivalent C program with soundness guarantees, i.e., the generated program yields a precision certificate on the number of correct bits in the result. We will briefly discuss the design and implementation of our compiler, followed by the techniques used to improve the performance and accuracy of the result.
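As a small, self-contained illustration of the kind of round-off accumulation the talk addresses (this is not the compiler described above), the Rust snippet below repeatedly adds 0.1 in single precision and compares the result against what a developer treating floating-point as real arithmetic would expect.

    fn main() {
        let n = 10_000_000u32;
        let mut acc = 0.0f32;              // single-precision accumulator
        for _ in 0..n {
            acc += 0.1;                    // 0.1 is not exactly representable in binary
        }
        let expected = 0.1_f64 * n as f64; // what "real arithmetic" would give
        println!("accumulated: {acc}");
        println!("expected:    {expected}");
        println!("abs. error:  {}", (expected - acc as f64).abs());
        // A sound tool, as described above, would instead certify a priori how
        // many bits of the result are guaranteed correct.
    }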
Knowledge Graphs now power many applications across diverse industries such as FinTech, Pharma, and Manufacturing. Data volumes are growing at a staggering rate, and graphs with hundreds of billions of edges are not uncommon. Computations on such data sets include querying, analytics, and pattern mining, and there is growing interest in using machine learning to perform inference on large graphs. In many applications, it is necessary to combine these operations seamlessly to extract actionable intelligence as quickly as possible. Katana Graph is a start-up based in Austin and the Bay Area that is building a scale-out platform for seamless, high-performance computing on such graph data sets. We describe the key features of the Katana Graph Intelligence Platform and how dependability is incorporated into this platform.
TBA.
Performance predictability is an important aspect of dependable computing. This is obvious for safety-critical systems (e.g., reactor control, safety systems in cars, or stability control in aircraft), but it is also a key aspect in many other systems: if computing is ubiquitous in all aspects of our lives, we automatically have certain performance expectations. Unfortunately, modern architectures and systems are becoming more and more variable in their performance, e.g., due to manufacturing variability, active power/energy adaptation, background services, or the changing availability of accelerators. In this talk, I will first highlight aspects of this variability in the area of HPC and Cloud systems and will then discuss approaches we are pursuing to mitigate this variability or, in some cases, even exploit it. Finally, I will discuss how some of these approaches may also be applicable in dependable systems, but may in turn impact other critical aspects such as data integrity, reproducibility, or resource isolation.
With the prevalence of hardware security vulnerabilities, such as
RowHammer, Meltdown, and Spectre, system architects not only need to assure
that the software itself is secure, but also that it protects against
vulnerabilities of the underlying hardware upon which it is executed. This
comes at a cost: for example, Linux kernel performance has degraded as
a result of implementing mitigations against speculative side-channel
attacks, i.e., Spectre-like mitigations. The performance impact has been
shown to be especially severe for IO-intensive applications, and thus
negatively impacts large-scale systems with a high amount of
communication. In this talk, we will highlight some of the security
vulnerabilities in modern computer architectures, potential mitigations,
and the impact on legacy as well as future computer systems.
Many objects in modern scientific computing are encoded by sparse multidimensional data structures, and the most computationally expensive part of problems in such domains is often the repeated multiplication of sparse matrices. Unfortunately, the irregular data layout of sparse objects is unfriendly to modern accelerators such as TPUs and tensor cores: unbalanced loads and irregular memory accesses make it hard to fully exploit the power of these throughput-oriented devices, which were mostly developed to excel at the multiplication of dense matrices. Existing solutions for high-performance parallel sparse multiplication thus have to forgo the many optimisations developed for dense problems. Instead, they reorganize the multiplication routine at a fundamental level, customizing the mapping from specific sparse storage formats to a given parallel architecture.
In this talk, I will explore one of the alternative solutions to extend the use of dense-specific accelerators to sparse problems: breaking a sparse matrix multiplication into a set of smaller dense matrix multiplications. This is obtained by reordering the rows of a sparse matrix to cluster its non-zero entries into small, dense blocks. Such reordering is a high-risk, high-reward gamble: a costly transformation for a potentially big increase in efficiency. However, bridging the gap between sparse and dense multiplication may allow us to extend the domain of high-level dense-dense matrix multiplication routines and dense-specific architectures, and also reduce the need for custom multiplication routines for sparse formats.
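As a toy illustration of the row-reordering idea (the actual strategy in this work may differ), the Rust sketch below orders the rows of a small sparse matrix so that rows with similar nonzero patterns become adjacent, making it more likely that nonzeros cluster into small dense blocks suitable for dense multiplication routines. The greedy Jaccard-similarity heuristic and the example matrix are purely illustrative.

    use std::collections::HashSet;

    /// Jaccard similarity of two rows' nonzero column sets.
    fn similarity(a: &HashSet<usize>, b: &HashSet<usize>) -> f64 {
        let inter = a.intersection(b).count() as f64;
        let union = a.union(b).count() as f64;
        if union == 0.0 { 0.0 } else { inter / union }
    }

    fn main() {
        // Nonzero column indices of each row of a small sparse matrix.
        let rows: Vec<HashSet<usize>> = vec![
            [0, 1, 7].into_iter().collect(),
            [4, 5, 6].into_iter().collect(),
            [0, 1, 2].into_iter().collect(),
            [5, 6, 7].into_iter().collect(),
        ];

        // Greedy ordering: start from row 0, then repeatedly append the
        // unvisited row most similar to the last one placed.
        let mut order = vec![0usize];
        let mut used = vec![false; rows.len()];
        used[0] = true;
        while order.len() < rows.len() {
            let last = *order.last().unwrap();
            let next = (0..rows.len())
                .filter(|&r| !used[r])
                .max_by(|&a, &b| {
                    similarity(&rows[last], &rows[a])
                        .partial_cmp(&similarity(&rows[last], &rows[b]))
                        .unwrap()
                })
                .unwrap();
            used[next] = true;
            order.push(next);
        }
        println!("row order clustering similar patterns: {order:?}");
    }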
High Performance Computing (HPC) and its innovations have been a driving
force for technical innovation in computing technology for many years,
or even decades.
For example, today's smartphones are equipped with multi-core processors
and powerful components, so that these devices would have been ranked in
the Top500 list some twenty years ago.
Another field in which HPC innovations are becoming of interest is
spaceflight: with satellites becoming smaller and smaller, swarms of
nano-satellites can be launched into space, with each of these
nano-satellites acting like a node in an HPC cluster. Dedicated nodes in
this "cluster" can, e.g., act as hubs or fulfil monitoring tasks (like
wildfire detection) and transmit their sensor data within the cluster
for further processing. Furthermore, some components can be removed or
reconfigured, while others can be added, without having to launch an
entire satellite swarm.
When it comes to dependability, well-known redundancy concepts can be
applied within the satellite swarm. These concepts can also be used to
maintain data dependability and security in space.
The talk will first give an introduction to nano-satellites and those
launched by TUM. It will then be discussed how these can be used
to implement the strategies outlined above.
TBA
For useful elasticity support in parallel programming models, the application must
be able to cope with shrinking compute resources during runtime. With such a
feature, it becomes a natural extension to also cope with failing nodes.
In this talk, I will briefly introduce our ongoing study of an elastic parallel programming
model (which we call LAIK) and its benefits for compute centers. Then, I will show how we
integrated fault-tolerance features using neighbour checkpointing.
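As a toy, single-process illustration of the neighbour-checkpointing idea mentioned above (LAIK's actual API is not shown; everything here is illustrative), each partition's data is copied to a neighbouring rank during a checkpoint step, so a failed rank's state can be restored from its neighbour instead of from a global checkpoint.

    #[derive(Clone, Debug)]
    struct Partition {
        data: Vec<f64>,
    }

    fn main() {
        let ranks: usize = 4;
        // Each "rank" owns one partition of the global data.
        let mut owned: Vec<Partition> =
            (0..ranks).map(|r| Partition { data: vec![r as f64; 4] }).collect();

        // Checkpoint step: rank r stores a copy of rank (r-1)'s partition.
        let backups: Vec<Partition> =
            (0..ranks).map(|r| owned[(r + ranks - 1) % ranks].clone()).collect();

        // Simulated failure of rank 2: its data is lost...
        owned[2].data.clear();
        // ...and recovered from the neighbour (rank 3) holding its backup.
        owned[2] = backups[3].clone();

        println!("recovered partition 2: {:?}", owned[2]);
    }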