Abstracts (in progress)
Time series data is ubiquitous, as it can model processes related to the most diverse phenomena. When analyzing time series, a particularly interesting task is to find so-called motifs, i.e., repeating patterns in the data. State-of-the-art approaches fall into one of two categories: they are either exact, requiring time at least quadratic in the size of the input, or approximate, with no guarantees on the quality of the returned solution.
In this talk I will present ongoing work on an approximate algorithm to find motifs using Locality Sensitive Hashing (LSH). On the one hand, the rich theory of LSH allows us to state probabilistic guarantees on the quality of the solution. On the other hand, the structure of time series data allows us to speed up the computation of hash values in ways that are not possible in the vector spaces where LSH is usually applied.
To relate this work to this year's SCALPERF topic, we will discuss how implementing the algorithm in the Rust programming language allows us to simultaneously achieve high performance and peace of mind with respect to memory management.
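As a minimal illustration of the general idea (not the algorithm presented in the talk), the following Rust sketch buckets time-series subsequences by a SimHash-style random-projection signature; subsequences that collide become motif candidates to be verified with an exact distance computation. The window length, number of hash bits, and toy series are all made up.

    use std::collections::HashMap;

    /// Tiny deterministic pseudo-random generator (xorshift) so the sketch
    /// needs no external crates; a real implementation would use a proper RNG.
    struct XorShift(u64);
    impl XorShift {
        fn next_f64(&mut self) -> f64 {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            // Map to roughly [-1, 1); good enough for an illustrative projection.
            (self.0 as f64 / u64::MAX as f64) * 2.0 - 1.0
        }
    }

    /// SimHash-style signature: one bit per random direction,
    /// set according to the sign of the dot product with that direction.
    fn hash_bits(window: &[f64], projections: &[Vec<f64>]) -> u32 {
        let mut sig = 0u32;
        for (i, p) in projections.iter().enumerate() {
            let dot: f64 = window.iter().zip(p).map(|(x, y)| x * y).sum();
            if dot >= 0.0 {
                sig |= 1 << i;
            }
        }
        sig
    }

    fn main() {
        let series: Vec<f64> = (0..256).map(|i| (i as f64 * 0.3).sin()).collect();
        let w = 16; // subsequence (motif candidate) length
        let k = 8;  // number of hash bits

        let mut rng = XorShift(0x9E3779B97F4A7C15);
        let projections: Vec<Vec<f64>> =
            (0..k).map(|_| (0..w).map(|_| rng.next_f64()).collect()).collect();

        // Bucket every subsequence by its LSH signature; colliding subsequences
        // are motif candidates to be verified with an exact distance computation.
        let mut buckets: HashMap<u32, Vec<usize>> = HashMap::new();
        for start in 0..=series.len() - w {
            let sig = hash_bits(&series[start..start + w], &projections);
            buckets.entry(sig).or_default().push(start);
        }
        for (sig, starts) in buckets.iter().filter(|(_, s)| s.len() > 1).take(3) {
            println!("signature {sig:08b}: candidate starts {starts:?}");
        }
    }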
Security is often presented as being based on the CIA triad, where
the “C” actually stands for Confidentiality. Indeed, in many human
activities we like to keep some things confidential, or “private”;
this is particularly true when these activities are carried out in the cyber
world, where a lot of our private data is transmitted, processed, and
stored. In this talk, we will first introduce the concept of privacy,
and then see how it is intertwined with two important research
threads. First, we will discuss how computer architectures, and
particularly “trusted components” in processors, could help
protect privacy by allowing us to trust remote systems. Finally, we will
discuss the issue of side channels (in a broad sense, not only in
processors) that could lead to leaks of private information.
Our daily lives are becoming increasingly dependent on a variety of sensor- and
IoT-enabled cyber-physical systems (CPS) such as smart city, smart energy, smart
transportation, smart healthcare, smart manufacturing, and smart agriculture, whose
goal is to improve the quality of life. However, CPS and IoT systems are
extremely vulnerable to dependability and security challenges due to uncertainty related
to their complexity, scale, heterogeneity, big data, human behavior, and resource
limitations. This talk will first highlight the unique challenges in such environments and
then propose novel frameworks and models for dependable, secure, and trustworthy CPS and
IoT systems, based on a rich set of theoretical and practical design principles, such as
machine learning, data analytics, uncertainty reasoning, information theory, prospect
theory, reputation scoring, and trust models. Two case studies will be considered. The
first one aims to design security solutions and lightweight anomaly detection in smart
grid CPS to defend against organized and persistent threats that can launch data integrity
attacks on smart meters using stealthy strategies. The novelty of this approach lies in a
newly defined information-theoretic metric that quantifies robustness and security,
minimizing the attacker’s impact and false alarm rates. The second case study deals with
secure and trustworthy decision-making in mobile crowd-sensing-based vehicular CPS to
detect false/spam contributions due to users’ selfish and malicious behaviors. Based on
cumulative prospect theory and a reputation and trust model, our approach prevents revenue loss
owing to undue incentives and improves operational reliability and decision accuracy at
scale. The talk will conclude with directions for future research.
MemComputing is a novel physics-based approach to
computation that employs time non-locality (memory) to both
process and store information at the same physical location. Its digital version is designed to solve combinatorial
optimization problems. A practical realization of digital memcomputing machines (DMMs)
can be accomplished via circuits of non-linear dynamical systems with memory engineered so
that periodic orbits and chaos can be avoided. A given logic problem is first mapped into this
type of dynamical system whose point attractors represent the solutions of the original
problem. A DMM then finds the solution via a succession of elementary avalanches
(instantons) whose role is to eliminate configurations of logical inconsistency (‘‘logical
defects’’) from the circuit. I will discuss the physics behind MemComputing and show many
examples of its applicability to various combinatorial optimization and Machine Learning
problems, demonstrating its advantages over traditional approaches and even quantum
computing. Work supported by DARPA, DOE, NSF, CMRR, and MemComputing, Inc.
(http://memcpu.com/).
Not all are aware that Norway fostered one of the earliest computing
pioneers, Fredrik Rosing Bull, whose name lives on in today's largest
supercomputer in Europe: the Atos BullSequana. In this talk I will
first take you through the history of Norwegian (and European)
computing up to today's low-power devices from ARM, MyWo and others.
The second part of the talk will cover some of the recent autotuning
and benchmarking work in my research group, including a tool developed
for automatically generating practical performance Roofline models
using high-performing autotuned benchmarks;
the BAT Benchmark suite we developed for AutoTuning; and LS-CAT, the
dataset we built from available CUDA codes on GitHub, which we have
parameterized and then autotuned using basic ML techniques.
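As an illustration of the basic Roofline relation that such a tool instantiates from measured compute and bandwidth ceilings, the short Rust sketch below evaluates attainable performance as the minimum of the compute roof and the memory roof times the arithmetic intensity. The numbers are made-up placeholders, not output of the tool.

    /// Attainable performance in GFLOP/s for a kernel with the given
    /// arithmetic intensity (FLOPs per byte moved from memory).
    fn roofline(peak_gflops: f64, bandwidth_gbs: f64, intensity_flop_per_byte: f64) -> f64 {
        (bandwidth_gbs * intensity_flop_per_byte).min(peak_gflops)
    }

    fn main() {
        let peak = 3000.0; // hypothetical measured compute roof (GFLOP/s)
        let bw = 400.0;    // hypothetical measured memory roof (GB/s)
        for &ai in &[0.25, 1.0, 4.0, 16.0, 64.0] {
            println!("AI = {ai:>5} FLOP/B -> attainable {:.0} GFLOP/s", roofline(peak, bw, ai));
        }
    }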
The simulation of the hit/miss behaviour of the OPT cache replacement policy has been investigated over the last six decades. Particular attention has been devoted to computing the stack distance of each access in the input trace, which encodes the hit/miss behaviour of all cache capacities. The Multi Valued MIN (MV-MIN) algorithm by Belady and Palermo (1974) computes the stack distance in sequential time T(1) = O(LV), where L and V respectively denote the length and the number of distinct addresses in the trace. The Critical Stack Algorithm (CSA) by Bilardi, Ekanadham, and Pattnaik (2017) achieves sequential time T(1) = O(L log V), which is optimal within several models of computation.
In this work, we begin the exploration of parallel versions of the above algorithms, starting with MV-MIN, which has a simpler structure. We develop a parallel variant which runs in time T(P) = O((LV log V)/P + V² log P) = O((T(1) log V)/P + V² log P) on P processors. In particular, T(L/V) = O(V² log L).
The main ideas behind the parallel algorithm are: (i) Casting the sequential algorithm as the evolution of a finite state machine (FSM); (ii) Formulating the evolution of the FSM as a prefix computation on the semigroup of state-transition functions associated with the machine; (iii) Exposing a "triangular" structure of the semigroup, which enables a decomposition of the machine state computation into V computations for machines with much smaller state spaces; (iv) Developing specialized representations of the semigroup elements (for the smaller machines), which can be implemented by dynamic balanced trees so that the multiplication of two semigroup elements can be done in (sequential) time O(V) and the multiplication of an arbitrary semigroup element by a semigroup generator can be done in (sequential) time O(log V). Future work will try to extend the outlined methodology to the CSA.
(Joint work with Gianfranco Bilardi of the University of Padova.)
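To make ideas (i) and (ii) above concrete, here is a minimal illustrative Rust sketch (not the actual MV-MIN construction): each trace symbol is represented as a state-transition table, composing such tables is associative, and evolving the FSM over the trace is a prefix computation over these compositions, which a parallel scan could evaluate in any bracketing. The state count and trace are invented for the example.

    const S: usize = 4; // number of FSM states (small, for illustration)

    /// A transition function, represented as a table: f[s] = next state from s.
    type Trans = [usize; S];

    /// Composition g ∘ f (apply f first, then g). This operation is associative,
    /// so prefixes can be combined in any bracketing, e.g., by a parallel scan.
    fn compose(f: &Trans, g: &Trans) -> Trans {
        let mut h = [0usize; S];
        for s in 0..S {
            h[s] = g[f[s]];
        }
        h
    }

    fn main() {
        // One generator (transition table) per input symbol of the trace.
        let gen_a: Trans = [1, 2, 3, 0]; // symbol 'a': cyclic shift of states
        let gen_b: Trans = [0, 0, 2, 2]; // symbol 'b': collapses some states
        let trace = [gen_a, gen_b, gen_a, gen_a, gen_b];

        // Sequential prefix computation; a parallel version would combine the
        // same compositions with a tree-structured scan.
        let identity: Trans = [0, 1, 2, 3];
        let mut prefix = identity;
        let start_state = 0;
        for (t, g) in trace.iter().enumerate() {
            prefix = compose(&prefix, g);
            println!("after {} symbols, state = {}", t + 1, prefix[start_state]);
        }
    }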
Commercial multicore central processing units (CPUs) integrate a number of processor cores on a single chip to support parallel execution of computational tasks. Multicore CPUs can improve performance over single cores nearly linearly for independent parallel tasks, as long as sufficient bandwidth is available. Ideal speedup is, however, difficult to achieve when dense intercommunication between the cores or complex memory access patterns are required. This is caused by expensive synchronization and thread switching, and by insufficient latency tolerance. These facts guide programmers away from straightforward parallel processing patterns toward complex and error-prone programming techniques. To address these problems, we have introduced the Thick Control Flow (TCF) Processor Architecture. TCF is an abstraction of parallel computation that combines self-similar threads into computational entities. In this presentation, we give an early comparison of the performance and programmability of an entry-level TCF processor versus two Intel Skylake multicore CPUs on commonly used parallel kernels, to find out how well our architecture solves the issues that greatly reduce the productivity of parallel software development.
As the number of transistors per computing system is ever-increasing, so
are the fault rates, which are particularly critical in large computations
on High-Performance Computing (HPC) systems.
It has been predicted that we are reaching a point where the time to
checkpoint such systems will be longer than the mean time to
interruption (MTTI) as seen by applications.
In this presentation, checkpointing is applied at a very fine granularity
by relying on a disciplined data flow among the application threads.
The underlying execution model is known as dataflow-threads (DF-threads),
and the fault-detection extension of this model makes it possible to achieve
resilient execution of an application while faults are affecting the system.
In the proposed implementation, the execution time gracefully degrades as
the number of faults increases, without the need for global checkpointing
and without interrupting the application execution.
The technique has been evaluated on a full-system x86-64 simulator with
encouraging results.
Many optimizations for massively parallel programs aim at improving
load balance, thereby enforcing a lockstep-like behavior of the code
even if strict synchronization is not required. Processes getting
out of sync is generally regarded as an undesirable state. In my talk
I will show some of the dynamics that may ensue if we drop the
requirement of lock-step execution and accept a certain level of
noise as something that does not necessarily have to
be suppressed. The concept of bottlenecks plays a crucial role here
since it determines whether the lockstep mode is stable against noise.
Within certain limits, desynchronized execution can mitigate the
impact of communication overhead, which leads to the non-intuitive
conclusion that deliberate injection of noise may be an optimization
strategy.
Optimization problems of the structure and size relevant to industry will not be solved entirely by a quantum computer in the foreseeable future. Therefore, we address the question of how quantum computing can be usefully integrated into an optimization environment for large-scale and complex planning problems in the near future. With our applied research we want to find a decomposition approach for analyzing the structure of optimization problems and slicing them into subproblems, such that the overall cost of (conventional and quantum) computation is minimized and the given restrictions are respected.
Here, "dependable" is understood in the sense of "foreseeable" runtime, meaning that problems are broken down in such a way that the mix of conventional computing and QC fits within the given runtime.
In HPC, we nowadays see a strong demand from AI applications for fast
computing capabilities. However, these applications require special
support from the HPC system and therefore bring in a range of diverse
requirements. In contrast to classic simulation-based applications,
several different aspects come into play at once. Efficient AI
applications with high expressive power or predictive quality need
access to high-quality mass data to achieve good results. HPC systems
provide fast access to large amounts of data while allowing highly
parallel execution of learning applications. However, it is not always
easy for the user to choose the appropriate execution environment,
since it is not directly obvious whether an AI application will run
efficiently on the system. The HPC system must be able to provide,
depending on the actual needs of the application, adaptable and
customized execution environments for development and production.
Furthermore, there are further dependencies which might lead to different
execution environments, e.g. those induced by security or data needs. In my
presentation I will mainly focus on the requirements that the AI community
in particular imposes on HPC with respect to "dependable computing".
TBA
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed in a nonblocking, lock-free style. Presentation servers maintain no persistent local state, making them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds the configuration information required to bootstrap the system. An additional B-tree provides conversational multi-key mini-transactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
Note: I'm at Google now, and this talk reflects the state of the system
when I left Oracle nearly two years ago. As I understand it, the Oracle
system has been under continued development, but the architecture remains
basically unchanged.
Joint work with Matteo Frigo, Justin Mazzola
Paluska and Sasha Sandler.
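As a back-of-the-envelope illustration of the throughput-prediction argument in the abstract, the Rust sketch below divides the available network bandwidth by a replication write-amplification factor; both numbers are hypothetical placeholders rather than FSS figures.

    // Client-visible write throughput is roughly the NIC bandwidth divided by
    // the bytes of replication traffic generated per byte written.
    fn predicted_throughput(nic_bandwidth: f64, write_amplification: f64) -> f64 {
        nic_bandwidth / write_amplification
    }

    fn main() {
        let nic_gbit_s = 25.0;   // assumed NIC bandwidth per storage server (Gbit/s)
        let amplification = 5.0; // assumed replication write amplification
        println!(
            "predicted write throughput: {:.1} Gbit/s per server",
            predicted_throughput(nic_gbit_s, amplification)
        );
    }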
We are scaling AI everywhere with a combination of software and hardware acceleration. While AI hardware continues to expand its compute capabilities, Intel software AI accelerators deliver an additional 10-100X gain in performance. Our partners and customers have created a wide range of AI applications, pushing the boundaries of AI into our daily lives, from healthcare and finance to entertainment. As the usages expand, the need for compute and security keeps increasing. This talk will cover machine learning, privacy-preserving techniques, and tools.
Special requirements of space missions, including limited energy budgets and radiation tolerance, impose strict dependability requirements on the on-board data processing system. Specifically, physical integrity and the integrity of the computing system call for the deployment of highly robust, tolerant, and resilient solutions. Similar challenges arise when deploying Machine Learning (ML) inference in satellites. FPGA-enabled Commercial-Off-The-Shelf (COTS) solutions are viable platforms that satisfy many of these requirements. At the same time, developing solutions and benchmarks that take the additional accountability measures into account requires flexibility and support from the programming paradigms, given that machine learning solutions are very diverse and change rapidly.
Various programming paradigms have been introduced and are used in practice. In this presentation, we provide a high-level overview of the paradigms introduced and used by Xilinx (as a main player in this domain) and discuss their flexibility to cope with such accountability measures. We specifically discuss the development of an inference benchmark that considers these various programming paradigms.
Floating-point arithmetic is widely used in scientific and engineering applications but is unsound, i.e., there is no guarantee on the accuracy obtained. When working with floating-point numbers, developers often use them as a proxy for real arithmetic and expect close-to-exact results. Unfortunately, while this is often true, the result may also deviate due to round-off errors. Their non-intuitive nature makes it hard to predict when and how these errors accumulate to the extent that they invalidate the final result. In this talk, we discuss an automatic approach to soundly bound such uncertainties. We present a source-to-source compiler that translates a C program performing floating-point computations into an equivalent C program with soundness guarantees, i.e., the generated program yields a precision certificate on the number of correct bits in the result. We will briefly discuss the design and implementation of our compiler, followed by the techniques used to improve the performance and accuracy of the result.
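As a small, self-contained illustration of the kind of round-off accumulation the talk addresses (this is not the compiler described above), the Rust snippet below repeatedly adds 0.1 in single precision and compares the result against what a developer treating floating-point as real arithmetic would expect.

    fn main() {
        let n = 10_000_000u32;
        let mut acc = 0.0f32;              // single-precision accumulator
        for _ in 0..n {
            acc += 0.1;                    // 0.1 is not exactly representable in binary
        }
        let expected = 0.1_f64 * n as f64; // what "real arithmetic" would give
        println!("accumulated: {acc}");
        println!("expected:    {expected}");
        println!("abs. error:  {}", (expected - acc as f64).abs());
        // A sound tool, as described above, would instead certify a priori how
        // many bits of the result are guaranteed correct.
    }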
Knowledge Graphs now power many applications across diverse industries such as FinTech, Pharma, and Manufacturing. Data volumes are growing at a staggering rate, and graphs with hundreds of billions of edges are not uncommon. Computations on such data sets include querying, analytics, and pattern mining, and there is growing interest in using machine learning to perform inference on large graphs. In many applications, it is necessary to combine these operations seamlessly to extract actionable intelligence as quickly as possible. Katana Graph is a start-up based in Austin and the Bay Area that is building a scale-out platform for seamless, high-performance computing on such graph data sets. We describe the key features of the Katana Graph Intelligence Platform and how dependability is incorporated into this platform.
TBA.
Performance predictability is an important aspect of dependable computing. This is obvious for safety-critical systems (e.g., reactor control, safety systems in cars, or stability control in aircraft), but it is also a key aspect in many other systems: if computing is ubiquitous in all aspects of our lives, we automatically have certain performance expectations. Unfortunately, modern architectures and systems are becoming more and more variable in their performance, e.g., due to manufacturing variability, active power/energy adaptation, background services, or the changing availability of accelerators. In this talk, I will first highlight aspects of this variability in the area of HPC and Cloud systems and will then discuss approaches we are pursuing to mitigate this variability or, in some cases, even exploit it. Finally, I will discuss how some of these approaches may also be applicable in dependable systems, but may in turn impact other critical aspects such as data integrity, reproducibility, or resource isolation.
With the prevalence of hardware security vulnerabilities, such as
RowHammer, Meltdown, and Spectre, system architects not only need to assure
that the software itself is secure, but also that it protects against
vulnerabilities of the underlying hardware upon which it is executed. This
comes at a cost: for example, Linux kernel performance has degraded as
a result of implementing mitigations against speculative side-channel
attacks, i.e., Spectre-like mitigations. The performance impact has been
shown to be especially severe for IO-intensive applications, and thus
negatively impacts large-scale systems with a high amount of
communication. In this talk, we will highlight some of the security
vulnerabilities in modern computer architectures, potential mitigations,
and the impact on legacy as well as future computer systems.
Many objects in modern scientific computing are encoded by sparse multidimensional data structures, and the most computationally expensive part of problems in such domains is often the repeated multiplication of sparse matrices. Unfortunately, the irregular data layout of sparse objects is unfriendly to modern accelerators such as TPUs and tensor cores: unbalanced loads and irregular memory accesses make it hard to fully exploit the power of these throughput-oriented devices, which were mostly developed to excel at the multiplication of dense matrices. Existing solutions for high-performance parallel sparse multiplication thus have to forgo the many optimisations developed for dense problems. Instead, they reorganize the multiplication routine at a fundamental level, customizing the mapping from specific sparse storage formats to a given parallel architecture.
In this talk, I will explore one of the alternative solutions to extend the use of dense-specific accelerators to sparse problems: breaking a sparse matrix multiplication into a set of smaller dense matrix multiplications. This is obtained by reordering the rows of a sparse matrix to cluster its non-zero entries into small, dense blocks. Such reordering is a high-risk, high-reward gamble: a costly transformation for a potentially big increase in efficiency. However, bridging the gap between sparse and dense multiplication may allow us to extend the domain of high-level dense-dense matrix multiplication routines and dense-specific architectures, and also reduce the need for custom multiplication routines for sparse formats.
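As a toy illustration of the row-reordering idea (the actual strategy in this work may differ), the Rust sketch below orders the rows of a small sparse matrix so that rows with similar nonzero patterns become adjacent, making it more likely that nonzeros cluster into small dense blocks suitable for dense multiplication routines. The greedy Jaccard-similarity heuristic and the example matrix are purely illustrative.

    use std::collections::HashSet;

    /// Jaccard similarity of two rows' nonzero column sets.
    fn similarity(a: &HashSet<usize>, b: &HashSet<usize>) -> f64 {
        let inter = a.intersection(b).count() as f64;
        let union = a.union(b).count() as f64;
        if union == 0.0 { 0.0 } else { inter / union }
    }

    fn main() {
        // Nonzero column indices of each row of a small sparse matrix.
        let rows: Vec<HashSet<usize>> = vec![
            [0, 1, 7].into_iter().collect(),
            [4, 5, 6].into_iter().collect(),
            [0, 1, 2].into_iter().collect(),
            [5, 6, 7].into_iter().collect(),
        ];

        // Greedy ordering: start from row 0, then repeatedly append the
        // unvisited row most similar to the last one placed.
        let mut order = vec![0usize];
        let mut used = vec![false; rows.len()];
        used[0] = true;
        while order.len() < rows.len() {
            let last = *order.last().unwrap();
            let next = (0..rows.len())
                .filter(|&r| !used[r])
                .max_by(|&a, &b| {
                    similarity(&rows[last], &rows[a])
                        .partial_cmp(&similarity(&rows[last], &rows[b]))
                        .unwrap()
                })
                .unwrap();
            used[next] = true;
            order.push(next);
        }
        println!("row order clustering similar patterns: {order:?}");
    }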
High Performance Computing (HPC) and its innovations have been a driving
force for technical innovation in computing technology for many years,
or even decades.
For example, today's smartphones are equipped with multi-core processors
and powerful components, so that these devices would have been ranked in
the Top500 list some twenty years ago.
Another field in which HPC innovations are becoming of interest is
spaceflight: with satellites becoming smaller and smaller, swarms of
nano-satellites can be launched into space, with each of these
nano-satellites acting like a node in an HPC cluster. Dedicated nodes in
this "cluster" can, e.g., act as hubs or fulfil monitoring tasks (like
wildfire detection) and transmit their sensor data within the cluster
for further processing. Furthermore, some components can be removed or
reconfigured, while others can be added, without having to launch an
entire satellite swarm.
When it comes to dependability, well-known redundancy concepts can be
applied within the satellite swarm. These concepts can also be used to
maintain data dependability and security in space.
The talk will first give an introduction to nano-satellites and those
launched by TUM. It will then be discussed how these can be used
to implement the strategies outlined above.
TBA
For useful elasticity support in parallel programming models, the application must
be able to cope with shrinking compute resources during runtime. With such a
feature, it becomes a natural extension to also cope with failing nodes.
In this talk, I will briefly introduce our ongoing study of an elastic parallel programming
model (which we call LAIK) and its benefits for compute centers. Then, I will show how we
integrated fault-tolerance features using neighbour checkpointing.
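As a toy, single-process illustration of the neighbour-checkpointing idea mentioned above (LAIK's actual API is not shown; everything here is illustrative), each partition's data is copied to a neighbouring rank during a checkpoint step, so a failed rank's state can be restored from its neighbour instead of from a global checkpoint.

    #[derive(Clone, Debug)]
    struct Partition {
        data: Vec<f64>,
    }

    fn main() {
        let ranks: usize = 4;
        // Each "rank" owns one partition of the global data.
        let mut owned: Vec<Partition> =
            (0..ranks).map(|r| Partition { data: vec![r as f64; 4] }).collect();

        // Checkpoint step: rank r stores a copy of rank (r-1)'s partition.
        let backups: Vec<Partition> =
            (0..ranks).map(|r| owned[(r + ranks - 1) % ranks].clone()).collect();

        // Simulated failure of rank 2: its data is lost...
        owned[2].data.clear();
        // ...and recovered from the neighbour (rank 3) holding its backup.
        owned[2] = backups[3].clone();

        println!("recovered partition 2: {:?}", owned[2]);
    }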