Abstracts (in progress)
Data access lower bounds for memory hierarchies
University of Padova
We consider the Clique of H-RAMs (f(x),L)−CHp, a fully connected networks of p processors, each equipped with a local memory hierarchy. An access to memory location x of node k performed by node j takes time f(x) (for some function f) if j=k and time L+f(x) if j≠k. Thus L models the inter-node latency while f(x) models the hierarchical-memory latency.
The closed-dichotomy technique is introduced to derive lower bounds to the access complexity of a computation on a single processor, based on the closed dichotomy width of the computation DAG. (The lower bounds hold under the restriction that a vertex of the computation DAG cannot be recomputed multiple times.)
A generalization of the close dichotomy width, dubbed distributed closed dichotomy width, enables the derivation of lower bounds for the execution time of DAG computations on the Clique of H-RAMs.
Portability, Efficiency and Scalability of scientific applications on heterogeneous architectures.
University of Ferrara and INFN
High performance computing increasingly relies on heterogeneous systems, based on multi-core CPUs tightly coupled to accelerators such as GPUs or many-core devices.
Designing software to run on clusters of these systems poses new challenges: to reach high sustained performances, parallelism has to be exploited at several levels; at the same time, the lack of a single standard programming framework may undermine code portability.
This talk presents several implementations of massively parallel and portable codes adopting some of the available programming frameworks for heterogeneous systems, such as OpenCL (Open Computing Language), OpenACC (Open ACCelerators) and Message Passing Interface (MPI). Presented codes run on standard clusters based on multi-core CPUs as well as on hybrid clusters including GPU accelerators and many-core devices.
The discussion focuses on different implementations of the same algorithms, addressing portability issues across several heterogeneous architectures, highlighting the trade-offs with efficiency and scalability.
Experimental Evaluation of Multi-Round Matrix Multiplication on MapReduce
University of Padova
In this talk, we propose an Hadoop library, named M3, for performing dense and sparse matrix multiplication in MapReduce. The library features multi-round MapReduce algorithms that allow to tradeoff round number with the amount of data shuffled in each round and the amount of memory required by reduce functions. We claim that multi-round MapReduce algorithms are preferable in cloud settings to traditional monolithic algorithms, that is, algorithms requiring just one or two rounds. We perform an extensive experimental evaluation of the M3 library on an in-house cluster and on a cloud provider, aiming at assessing the performance of the library and at comparing the multi-round and monolithic approaches. Joint work with Francesco Silvestri.
DAG-based Optimizations for Distributed Heterogeneous Computing Systems
University of Innsbruck
The current landscape of architectures for High-Performance Computing (HPC) is increasingly dominated by large-scale, heterogeneous systems. While providing potentially very high peak performance, many-core systems pose new, difficult challenges for application programmers in programming models and performance exploitation.
This talk will introduce our recent work on libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater provides a runtime system that tracks dependency information enforced by event synchronization to dynamically build a DAG-based representation of all program operations. DAG operations (commands) include computation and data transfer, and can be easily integrated with existing performance models.
The talk will show two DAG-based optimization approaches to minimize compute time and energy consumption, and how it is possible to integrate the libWater runtime with different parallel programming approaches.
Graph models of computations, NP-completeness, algorithms, heuristics:
what can we expect from them?
ENS Lyon
What is a model and how do we use them? In this talk, I will try to illustrate, through examples I met during my past work on DAG scheduling and loop optimizations, what can be extracted from models of computations, NP-completeness, algorithms and heuristics, and what cannot or should not be concluded. On the positive side, there are many cases where simplified models make possible the generalization of methods, the discovery of fruitful links between problems or algorithms that seem very different, or the derivation of methods whose “performance” can be analyzed. There are also cases where NP-completeness can help contradict some simplifying intuitions. On the negative side, there are also many cases where wrong conclusions were taken from models, either because the models were wrong, not well defined, or not well-understood. Not a criticism on how we work, this talk is just a reminder that we should never forget that models are just a particular representation of the input data and objectives on which we work, and not of the full complexity.
Decision Making and Inference under Limited Information and High Dimensionality
Stanford University
Statistical inference in high-dimensional probabilistic graphical models (i.e., with many variables) is one of the central problems of statistical machine learning and stochastic decision making. To date, only a handful of distinct methods have been developed, most notably (MCMC) sampling, decomposition, and variational methods. In this talk, I will introduce a fundamentally new approach based on random projections and combinatorial optimization. Our approach provides provable guarantees on accuracy, is amenable to massive parallelization, and outperforms traditional methods in a range of domains, in particular those involving combinations of probabilistic and causal dependencies (such as those coming from physical laws) among the variables. As an example, I will discuss an application in the emerging field of Computational Sustainability aimed at discovering new fuel-cell materials where we greatly improved the quality of the results by incorporating prior background knowledge of the physics of the system into the model.
Solving Sparse Linear Systems on GPUs: the Conjugate Gradient Method and FSAI Preconditioner
University of Padova
It is well known that GPUs exhibit significantly higher peak performance than conventional CPUs; however, it is equally acknowledged that highly parallel algorithms are required to exploit GPU microarchitectures to their fullest. In this context, and targeting the solution of sparse linear systems, conjugate gradient (CG) methods and the Factorized Sparse Approximate Inverse (FSAI) preconditioner promise to be appropriate ingredients for an efficient solver on GPUs. During the talk we will illustrate our experience with the porting of a CG-like solver and FSAI preconditioner on NVidia GPUs, and show that multiple, nontrivial recastings of computational steps were required. Our efforts were rewarded with a significant speedup over the CPU-only implementation.
Software Defined Environments
IBM Research
Information-processing applications and domains require a rethinking of how IT is to be delivered as solutions and services, meeting a variety of requirements that include performance, resilience, security, and agility. Cloud-computing platforms are becoming one of the preferred deployment platforms for both traditional applications and a new generation of mobile, social, and analytics applications. Changes are evident at all layers of the platform stack. At the infrastructure level, the network, compute, and storage are becoming more software defined, i.e., the control layer and operational layers are separating, allowing the control to often be executed on standard servers and thus enabling a larger variety of offerings. For instance, software defined networking and software defined storage are currently undergoing this transition. As the underlying environment becomes more software defined and programmable, applications will become progressively easier to develop, deploy, optimize, and manage. This, in turn, will foster innovation in the solutions, tooling, and runtime domains, leading to entirely new ecosystems that will enable capturing workload topologies and characteristics, and that will enable compiling them onto an infrastructure that is assembled from the software defined components for optimal outcomes. The emergence of flat networks as the back bone of the data center allows this rethinking on how systems will be composed. Rather then assembling racks of servers, the infrastructure will be composed of resource pools ( storage, accelerators, memory nodes) that are interconnected through the high speed low latency flat networks. Network latencies and bandwidth approach bus attached I/O performance and therefore allow for resource disaggregation and subsequent re-aggregation based on the specific resource requirements of individual workloads. The programmable nature of this process enables a highly agile, flexible and consequently robust environment. In this talk, we will discuss the key ingredients of this disruptive trend of software defined environments and illustrate their potential benefits.
Contention Bounds for Combinations of Computation Graphs and Network Topologies
University of California at Berkeley
Network topologies can have significant effect on the costs of algorithms due to inter-processor communication. Parallel algorithms that ignore network topology can suffer from contention along network links. However, for particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even for optimally designed algorithms. We obtain a novel contention lower bound that is a function of the network and the computation graph parameters. To this end, we compare the communication bandwidth needs of subsets of processors and the available network capacity (as opposed to per-processor analysis in most previous studies). Applying this analysis we improve communication cost lower bounds for several combinations of fundamental computations on common network topologies.
A Graph based Framework for Data Analytics
Xerox
We describe a framework for addressing a broad class of problems in services, including customer care, transportation and healthcare, based on exploiting inherent relationships among data elements. A key element of our approach is to integrate and model the data using non-planar graphs and discover new insights through sophisticated graph analytics techniques such as dimensionality reduction using Laplacian embedding, spectral-analysis for diffusion analysis, spatio-temporal analysis and optimized execution through efficient node placement. While some business problems inherently exhibit graphical structure (such as social networks and bus routes), other business problems (such as the creation of an entity profile using different data sources or understanding business processes) can be solved using an induced graph through techniques such as semantic-data fusion, process mining, and dependency analysis. Our approach can provide new insights based on the inherent and implicit relationships among different data elements.
Ordering Heuristics for Parallel Graph Coloring
Massachusetts Institute of Technology
This paper introduces the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) ordering heuristics for parallel greedy graph-coloring algorithms, which are inspired by the largest-degree-first (LF) and smallest-degree-last (SL) serial heuristics, respectively. We show that although LF and SL, in practice, generate colorings with relatively small numbers of colors, they are vulnerable to adversarial inputs for which any parallelization yields a poor parallel speedup. In contrast, LLF and SLL allow for provably good speedups on arbitrary inputs while, in practice, producing colorings of competitive quality to their serial analogs.
We applied LLF and SLL to the parallel greedy coloring algorithm introduced by Jones and Plassmann, referred to here as JP. Jones and Plassmann analyze the variant of JP that processes the vertices of a graph in a random order, and show that on an O(1)-degree graph G=(V,E), this JP-R variant has an expected parallel running time of O(lgV / lglgV) in a PRAM model. We improve this bound to show, using work-span analysis, that JP-R, augmented to handle arbitrary-degree graphs, colors a graph G=(V,E) with degree Δ using Θ(V+E) work and O(lgV + lgΔ · min √E, Δ + lgΔ lgV/ lglgV ) expected span. We prove that JP-LLF and JP-SLL — JP using the LLF and SLL heuristics, respectively — execute with the same asymptotic work as JP-R and only logarithmically more span while producing higher-quality colorings than JP-R in practice.
We engineered an efficient implementation of JP for modern shared-memory multicore computers and evaluated its performance on a machine with 12 Intel Core-i7 (Nehalem) processor cores. Our implementation of JP-LLF achieves a geometric-mean speedup of 7.83 on eight real-world graphs and a geometric-mean speedup of 8.08 on ten synthetic graphs, while our implementation using SLL achieves a geometric-mean speedup of 5.36 on these real-world graphs and a geometric-mean speedup of 7.02 on these synthetic graphs. Furthermore, on one processor, JP-LLF is slightly faster than a well-engineered serial greedy algorithm using LF, and likewise, JP-SLL is slightly faster than the greedy algorithm using SL.
Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling
Massachusetts Institute of Technology
A data-graph computation — popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi — is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex’s prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round.
This talk introduces Prism, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Prism uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by Prism to maintain a dynamic set of active vertices as an unordered set partitioned by color. We prove theoretical bounds on the work and span of Prism and compare it empirically to existing nondeterministic lock-based schedulers in GraphLab. On 7 application benchmarks the 12-core running time of Prism is 1.1-2.3 times faster than Graphlab’s nondeterministic lock-based scheduler.
Cell-based simulations of plants and graph analysis: a perfect match?
University of Antwerpen
The last decade has seen a significant increase in computer simulations of plants that combine a cell-based approach with regulatory networks of biochemicals coupled with the physics of the plant’s tissue and the intricacies of cell growth and division. The ultimate goal is creating in silico models of plants that will help us to fully understand the origin of form and function in plant biology. The multidisciplinarity of this relatively young research field attracts a broad audience of experimental plant biologists, systems biologists, numerical mathematicians and computer scientists. This talk gives an introduction to this wide subject from the perspective of a numerical computer scientist. In particular it highlights some of the challenges and opportunities for the HPC community as the increasing size and complexity of the emerging models are becoming limiting factors for large scale simulations. Approaching these simulations of cell networks as graph algorithms seems like a very natural fit. The open question though still remains: how to design simulation frameworks that are very efficient and scalable while being flexible enough to answer the needs of the biological models that are yet to come?
Industry oriented Graph benchmarking, the case for LDBC
Polytechnic University of Catalonia
In this talk we will cover all the aspects of the creation of an industry oriented benchmark for Graph technologies. In particular, we will discuss the case for the Social Network Benchmark of LDBC (www.ldbcouncil.org) which will support three different workloads based on the data generator created, DATAGEN (see github.com/ldbc/ldbc_snb_docs) and three specific sets of queries, namely the Interactive, the Business Intelligence and the Analytical. We will explain the different important characteristics of the data generated as well as the requirements posed on the queries to support the so called "choke points" which entail explicitly injected difficulties for the technologies that will have to solve the benchmarks.
Architectural Support for Graph Analytics
IBM Research
The purpose of graph analytics is to extract relationships from data. Key computations for graph analytics include breadth first search, betweenness centrality, and minimal spanning tree. All these operations can be formulated in a linear algebra approach, using (sparse) matrix and vector kernels as building blocks. In this talk we will discuss important kernels for graph analytics and explain how they can be accelerated through architectural facilities such as GPUs, SIMD processing and near memory operations. The same architectural facilities can also be used to accelerate other types of applications.
OpenModelica in Science an Engineering: How to schedule parallel tasks
Technical University of Dresden
The language "Modelica" has become a widely used standard to describe and simulate physical models in science and engineering. A Modelica compiler, e.g. OpenModelica, is able to transform the model equations into a set of differential equations, which are converted to C/C++ - code. Unfortunately, this code is only single threaded at the moment. We are developing a task graph based approach to simplify and parallelize the calculation of the model equations. The main goal is to achieve a scalable implementation for shared and distributed memory systems. The major challenge is the automated analysis and transformation of the task graph. For any given model, it has to be automatically decided which rules for partitioning and scheduling should be applied, using what memory layout for best performance.
Graph Analytics
IBM Research
Graph Analytics play a key role in toolkit of cognitive computing. Graph analytics is confluence of large volume data processing and sparse linear algebra. From the perspective derived from our experience with the IBM POWER systems, which have been a leadership systems in both these areas, our talk highlights the values and nature of graph computing and describe the computation using a linear algebra on semiring formulation. Using this background the will point out some of the challenging computer science problems in this area, and the societal values the solutions to the problem can bring.
This talk is based on joint work with K. Ekanadham, Bill Horn, Joefon Jann, Manoj Kumar, Jose Moreira, Gabi Tanase, Hao Yu of IBM Research.
Parallel Graph Decomposition and Diameter Approximation in o(Diameter) Time and Linear Space
University of Padova
We present the first parallel (MapReduce) algorithm to approximate the diameter of large graphs through graph decomposition which requires a number of parallel rounds that is sub-linear in the diameter and total space linear in the graph size. The quality of the diameter approximation is expressed in terms of the doubling dimension of the graph and is polylogarithmic when the graph has a constant doubling dimension. Extensive experiments demonstrate the effectiveness and efficiency of our approach on large graphs.
Joint work with Matteo Ceccarello, Andrea Pietracaprina, and Eli Upfal
On characterizing the data movement complexity of computational directed acyclic graphs for parallel execution
Louisiana State University
Technology trends are making the cost of data movement increasingly dominant over the cost of performing arithmetic operations in computer systems, both in terms of energy and time. The fundamental ratio of aggregate data movement bandwidth to the total computational power in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support it on future systems can be well understood. In this talk, we will consider an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs.
On Using Data Movement Lower Bounds to Identify Fundamental Performance and Energy Limits
Ohio State University
The roofline model is a popular approach to performance analysis. It models upper limits on performance as a function of operational intensity, the ratio of operations per byte of data moved from/to memory. But it is currently very complicated to use it to obtain broader insights on bottlenecks, when all semantically equivalent implementations of algorithms are considered. In this talk, we present an approach to alleviate this problem by using lower bounds on data movement, enabling schedule-independent bottleneck analysis. We present its utility in assessing fundamental limits to performance and energy efficiency for benchmark algorithms across a design space of architectural variations.
Type-Driven Data-Layout Inference for Improved Vectorisation
Heriot-Watt University
SIMD instructions of modern CPUs are crucially important for the performance of compute-intensive programs. Often, auto-vectorisation leads to sub-optimal results due to the data-layout chosen by the application programmer. The reasons for choosing an unfavourable layout can be manifold: software engineering principles may contradict performance demands, the layout demands may depend on the target architecture a program was first run on, etc. However, most of the time it is simply a lack of awareness of the programmer.
In this talk, we present an approach towards liberating the programmer from the burden of considering the suitability of his data structures for vectorisation. We present an inference scheme that allows compilers to identify suitable data-layouts and a refactoring scheme to modify the code accordingly. We demonstrate our ideas by means of a running example and present some initial performance results of our technique.
AtraKahn: Kahn Networks as a coordination model; bridging between number–crunching and graph-based processing.
University of Hertfordshire
In this talk I will describe an approach to HPC based on separation of concerns. The program is represented as a graph where vertices are either stateless functions (called boxes) or finite state-machines (with channel messages as input signals) augmented with a path functional that defines output messages. The latter are called synchronisers and used for combining messages in order to feed them to boxes; the boxes are written by domain experts without knowledge of or concern for issues of distributed processing. We discuss the ramifications of this approach to integration of data analytics (graph processing) and conventional HPC and show that it nicely combines data and process mobility, which is useful for data analytics.
The Input/Output Complexity of Triangle Enumeration
University of Padova
In this talk, we consider the well-known problem of enumerating all triangles of an undirected graph. Our focus is on determining the input/output (I/O) complexity of this problem. Let E be the number of edges, M<E the size of internal memory, and B the block size. The best previous results are sort(E3/2) I/Os (Dementiev, PhD thesis, 2006) and O(E2/(MB)) I/Os (Hu et al., SIGMOD 2013), where sort(n) denotes the number of I/Os for sorting n items. By exploiting a new color coding technique, we propose cache-oblivious and cache-aware algorithms for triangle enumeration that require O(E3/2/(√M B)) expected I/Os and improve previous bounds by a factor min(√E/M,√M). Our results can be extended to the enumeration of any given subgraph and easily parallelized in the MapReduce framework. Joint work with U Kang, Rasmus Pagh, and Ha-Myung Park.
Parallel Graph Analytics: The Dataset-Algorithm-Platform Triangle
Technical University of Delft
Due to increasingly large datasets, graph analytics applications “?” like traversals, all pairs shortest path computations, or centrality measures - are becoming the focus of high-performance computing (HPC). Because HPC is currently dominated by many-core architectures (both CPUs and GPUs), new, highly-parallel graph processing solutions have to be defined to efficiently use such computing resources.
In this talk, we discuss the performance impact that the dataset, the algorithm, and the platform have on parallel graph analytics. We present two empirical studies: one measuring the performance of different traversal implementations on GPUs for 7 different datasets, and a second one measuring the performance of 3 graph analytics algorithms (a single code implemented in OpenCL and targeted at many-core CPUs and GPUs), on 3 different platforms, using 11 real-world and synthetic datasets.
Our results show a large variation in the observed performance, making it impossible to chose the best-performing algorithm or computational platform without understanding the dataset. Therefore, we discuss our first attempt to define these dependencies for our chosen algorithms.
Our conclusion is two-fold.
First, the impact of datasets on performance is too high to allow
platform-specific algorithms to outperform portable ones by large
margins in all cases; therefore, simple, portable parallel graph
analytics implementations remain a good solution for environments with
high-productivity requirements.
Second, we find that high-performance graph analytics requires the
Dataset-Algorithm-Platform triangle to be well understood; therefore,
our future work focuses on generalizing and/or defining this triangle
for different algorithms and platforms.
Architecture – the Small for the Large
Indiana University, Bloomington
The approaching end of Moore’s Law and Dennard scaling has forced a rapid increase in the number of microprocessor cores within high-end systems to sustain the exponential growth in delivered system performance of the last two decades. But the exploitation of traditional processor architectures developed for mainstream markets is imposing severe operational limitations on advanced techniques for efficiency, scalability, and generality; three among the many challenges to extreme scale computing. Processor design has been focused on local socket performance with some support for inter-socket integration for node capability as applied to enterprise server capabilities. The rest of the system architecture, which may comprise millions of cores, is left to the interconnection network and software. New experimental execution models are being pursued to attack the challenges of starvation, latency, overhead, and contention to achieve unprecedented scale and generality through dynamic adaptive runtime methods. But for full success to be realized, processor core architectures are going to have to change beyond conventional practices to reflect implications of global parallel computation including system-wide name spaces, asynchrony management, lightweight multi-threading, event-driven control, rich synchronization semantics, locality management, and introspection. Additional architecture features are required for fault tolerance and energy reduction. A significant change in goals is the return to strong scaling after decades of emphasis on weak scaling. Without this, supercomputers will become increasingly specialized with ever more problems falling off the scaling curve. This presentation will identify and discuss candidate architecture advances that may be required for future HPC systems design including the integration of network, processing, and global address supervision.
Towards Analytical Modeling of Application Irregularity
Pacific Northwest National Laboratory
In extreme-scale computing, it is increasingly important to model both application irregularity and irregular applications. We discuss work at PNNL motivated by the challenges of irregularity. First, we have been working on the problem of model generation and reproducibility. To ease the burden of analytical performance modeling, we developed Palm (Performance and Architecture Lab Modeling tool), a new tool for generating models of application performance. Palm coordinates models and source code, resulting in ‘first-class’ and reproducible performance models. Second, we show that it is possible to use analytical models to predict application irregularity. We have used this work to reduce an application’s power and energy consumption by decreasing core frequency during blocking communication. Third, we combine lightweight measurement and modeling to to determine when asynchronous tasks cause network contention. Finally, we touch on new work in modeling performance of scientific workflows and silicon photonic interconnects.
The Melbourne Shuffle: Improving Oblivious Storage in the Cloud
Brown University
We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data. (joint work with Olga Ohrimenko, Michael T. Goodrich, and Roberto Tamassia.)
Graph Processing Platforms and Their Performance: An Overview with Examples
University of Amsterdam
Graph-processing platforms have emerged as a response to the increasing popularity of graph algorithms in a variety of domains. Aiming to reduce the cost of developing scalable algorithms for processing graph data, these systems can differ visibly in terms of usability, scalability, underlying programming model(s), and supported hardware platforms. This diversity makes them difficult to rank and compare, leaving potential users simply guessing when selecting an appropriate platform for their specific application.
In this talk, we first sketch the landscape of large scale graph processing systems, emphasizing the classes of systems and their range of supported platforms and applications. Further, we propose an empirical method for benchmarking graph-processing platforms. We define a comprehensive process, and a selection of representative metrics, datasets, and algorithmic classes. We implement a benchmarking suite of five classes of algorithms and seven diverse graphs. Our suite reports on basic (user-lever) performance, resource utilization, scalability, and various overhead. We use our benchmarking suite to analyze and compare six platforms. We gain valuable insights for each platform and present a first comprehensive comparison of graph-processing platforms.
Our conclusion is two-fold: first, despite the large set of available systems, there still are desirable features missing across the entire spectrum, and, second, there is a large variance in the performance of these systems, making it impossible to select an absolute winner - the "fastest" or the "best" platform. Therefore, we end the talk with guidelines and future work recommendations for both users and system designers.
A graph based configurable architecture: the static dataflow approach
Consiglio Nazionale delle Ricerche, Napoli
Graphs are a very powerful abstraction of various phenomena that can be expressed as a relation between entities. For several decades researchers in theoretical computer science and discrete mathematics have been developing a wealth of graph theory and graph algorithms. Recently, however, we see a qualitative change how graph algorithms are used in practice. This change is due to (i) the complex structure of graphs in new and emerging applications, (ii) the size of typical inputs, and (iii) the computer systems on which graph problems are solved. Additional challenges are posed by the hardware. Tomorrow’s computer architectures are becoming increasingly parallel and heterogeneous but how to build such architectures remains an open question. More, their organization may require a critical re-examination of both architecture design and consolidated wisdom in terms of programming style and execution model. Yet, such systems are expected to be constituted by thousands of processors with thousands of cores per chip. However, many graph algorithms as well as techniques in relevant applications are inherently sequential, and that is why many previous results do not transfer easily into efficient computational tools. To fully exploit emerging architectures, however, it is important to develop parallel algorithms and tools. Algorithmic techniques that show promise try to be oblivious to the specific hardware, and optimize for general classes of typical hardware. This talk focuses a novel computing system based on a configurable architecture and a static dataflow execution model. We assume that the basic computational unit is constituted by a dataflow graph. Each processing node is constituted by an ad hoc kernel processor – designed to manage and schedule dataflow graphs, and a manycore dataflow execution engine - designed to execute such dataflow graphs. The main components of the dataflow execution engine are the Dataflow Actor Cores (DACs) which are small, identical and configurable with a self-scheduled clockless mechanism to start operations on the presence of validity tokens only. Such design choice avoids the need of temporary storage for tokens on the links of the DACs allowing, thus, to transfer the image of a dataflow program graph into a dataflow execution engine and execute it like in the static dataflow model.