Abstracts
Portability of Performance: the Case for Runtime Systems
INRIA
Computing platform hardware has evolved dramatically since the beginnings of computer science, always striving to provide new, convenient accelerating features. Each new accelerating hardware feature inevitably leaves programmers to decide whether to make their application dependent on that feature (and break compatibility), not to use it (and miss the potential benefit), or to handle both cases (at the cost of extra management code in the application). This common problem is known as the performance portability issue. The first purpose of runtime systems is thus to provide abstraction. Runtime systems offer a uniform programming interface for a specific subset of hardware (e.g., OpenGL and DirectX are well-established examples of runtime systems dedicated to hardware-accelerated graphics) or of low-level software entities (e.g., POSIX-thread implementations). Applications then target these uniform programming interfaces in a portable manner. The abstraction provided by runtime systems thus enables portability.
Abstraction alone is however not enough to provide portability of performance, as it does nothing to leverage low-level-specific features to get increased performance. Consequently, the second role of runtime systems is to optimize abstract application requests by dynamically mapping them onto low-level requests and resources as efficiently as possible. This mapping process makes use of scheduling algorithms and heuristics to decide the best actions to take for a given metric and the application state at a given point in its execution time. This allows applications to readily benefit from available underlying low-level capabilities to their full extent without breaking their portability. Thus, optimization together with abstraction allows runtime systems to offer portability of performance.
Models for Parallel and Hierarchical Computation
Department of Information Engineering
University of Padova
To fully exploit the potential of computing engines, it is crucial to expose and exploit both the concurrency and the locality of the computation. In this scenario, we focus on models of computation that can guide the optimization of algorithms and of architectures. Specifically, we will present results and open issues along three directions:
- The pipelining of accesses in the memory hierarchy to increase memory bandwidth utilization.
- The information-exchange methodology to identify the best partition of chip area between functional units and storage elements, under chip I/O bandwidth constraints.
- The network-oblivious approach as a step toward efficient algorithmic portability across multiprocessors with different organizations.
Parametric Tiling with Pipelining
ENS Lyon
Tiling is a well-known and important loop transformation for increasing performance. By combining loop iterations into blocks that can be computed in an atomic fashion, it has several positive effects on performance: it increases the granularity, it enables communication coalescing (grouping data transfers) for each tile, it enables the reordering of memory accesses, which may improve both spatial and temporal locality, etc. However, so far, all methods compute tiles either in sequence or in parallel, but with no inter-tile data reuse and no tile pipelining. In this talk, we will describe a method for tile pipelining with inter-tile data reuse. A priori, when the tile size is a parameter, such a problem involving analysis across multiple tiles is not solvable with polyhedral techniques, as it leads to non-linear optimizations. However, surprisingly, we found a way to solve it in a parametric fashion. This opens the door to an automatic method for what we call “parametric kernel offloading”, where the computations of a kernel are pipelined to a distant platform, tile by tile, with automatic tile size selection and automatic local memory reorganization. One of our goals is to pursue this effort towards an efficient compilation of streaming languages, with automatic tiling and pipelining.
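As a toy illustration of the basic idea (plain Python; a sketch only, with no relation to the polyhedral machinery discussed above): iterations are grouped into tiles of parametric size T, and each tile moves its data and computes as one atomic unit.

```python
# Illustrative sketch only: a 1-D loop processed tile by tile with a
# parametric tile size T. Each tile is an atomic unit of work whose data
# transfer could be coalesced and, in an offloading setting, overlapped
# (pipelined) with the computation of the previous tile.

def tiled_sum_of_squares(data, T):
    total = 0
    for start in range(0, len(data), T):     # inter-tile loop
        tile = data[start:start + T]         # per-tile data transfer
        total += sum(x * x for x in tile)    # intra-tile computation
    return total
```

Note that the result is independent of the tile size T; only granularity and locality change, which is what makes T available for automatic selection.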
Thick Control Flow Computing
VTT Electronics
Many advanced parallel programming languages (Fork, e, Replica) attempt to capture the power and simplicity of the PRAM model by providing the programmer with a MIMD-style fixed set of threads. While this organization of computation has its strengths and may seem inevitable in the light of the most obvious architectural implementations of PRAM, it often replicates much of the execution unnecessarily, makes implementing time-shared multitasking expensive, and forces the programmer to use looping and conditional control primitives when the application parallelism does not match the hardware parallelism.
In this presentation we introduce the thick control flow (TCF) computing scheme, which solves the above "thread arithmetics" problem by providing a number of control flows that have a certain thickness, which can vary according to the needs of the application. This captures the best parts of the dynamism and generality of the original unbounded PRAM model and the simplicity of the SIMD model. We discuss the architectural implementation and programming of the TCF model.
A look at the existing programming models for HPC and their future in the clouds
HP Labs Singapore
The first part of this talk will discuss the different programming models that have been developed for high performance computing, and how they have affected the development of applications. In particular, it will focus on how they can enable scientists to focus on the implementation of their algorithms, rather than their parallelization. The second part will look into the future, and will illustrate with an example how we can pursue high performance computations outside of traditional supercomputers. More specifically, computational resources within cloud datacenters are readily available to a broad community, and this community should be able to effectively utilize these resources for HPC tasks.
A Note on Duality of Peak and Application Peak Performance
University of Munich
Scientific computing is a fast-growing technology, as it allows scientists to gather deeper insights into a specific application field within shorter time frames. Since the complexity of the implemented models, as well as the complexity of supercomputers, has increased rapidly during the last years, highly efficient implementations of numerical simulation codes and, in particular, of their underlying numerical core routines are mandatory but also challenging. Modern hardware architectures expose parallelism at several levels: instruction-level parallelism, vector instructions, hardware threading, multi-/many-core, multi-socket, and multi-node. In order to achieve the system’s peak performance, all these different levels have to be exploited. This talk will discuss how, and whether, these features can be used in the latest numerical methods as well as in standard benchmarks. Furthermore, it will highlight difficulties for scientists and derive some possible new mindsets for the field of algorithm design and implementation.
Practical performance modeling for HPC: past and future
PNNL
In this presentation we will describe the trajectory of practical performance modeling over the last couple of decades and predict its path to exascale computing. A taxonomy of methodologies will be discussed, and success stories related to applications of modeling in practice will be presented. We will then discuss the challenges that exascale computing will pose, and correspondingly the future R&D directions needed in our field to establish modeling as the preferred co-design tool for exascale systems and applications.
Software for scalable performance – retrospective, and news from the “messy” frontier – unstructured-mesh CFD
Imperial College London
This talk begins with an attempt to reflect on ten years of software research for high-performance computing. Comparing the state of practice, and the programs of major HPC conferences, it's clear that much has changed, but rarely as a result of specific research efforts.
Indeed, often the story is of the triumph of simple, effective ideas, well delivered, over sophisticated and elaborate research ideas. In this context, I present our own work on access-execute descriptors, and the OP2 model for parallel computation on unstructured meshes - and our ambition to deliver tools that are transformative yet simple enough to get used. In particular, I'll report on OP2 in an industrial fluid dynamics application, and our experience of moving from model benchmark problems to production-scale code. I'll show performance results on clusters of GPU-accelerated multicore processors, and some of the optimisations necessary: loop fission, fusion and tiling.
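For readers unfamiliar with the terms, the loop transformations mentioned above can be sketched on a toy traversal (illustrative Python, not OP2 code):

```python
# Loop fusion: one pass applies both operations to each element, improving
# temporal locality (each element is touched once). Loop fission splits the
# work into two simpler passes, which can help vectorisation or staging of
# intermediate data through fast memory.

def fused(xs):
    return [(x + 1) * 2 for x in xs]    # single fused pass

def fissioned(xs):
    tmp = [x + 1 for x in xs]           # pass 1
    return [t * 2 for t in tmp]         # pass 2

# Both orderings compute the same result; only the access pattern differs.
```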
The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-based Systems
University of Linkoping
The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation.
We describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. Thus, the composition tool also provides a high-level programming front-end for the task-based PEPPHER runtime system (StarPU). The current prototype supports code generation for components written in sequential C/C++, OpenMP, CUDA, and OpenCL. Recently added features include so-called smart containers, which optimize data transfers and automatically leverage inter-component parallelism, and an adaptive method for off-line sampling and tuning based on adaptive incremental decision-tree learning.
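The composition idea can be sketched as follows; all names and the dispatch mechanism below are hypothetical illustrations, not the actual PEPPHER API: one functionality, several registered implementation variants, and a dispatcher that picks a variant based on the run-time context.

```python
# Hypothetical sketch of performance-aware variant selection: each variant
# carries an applicability predicate over the call context; the dispatcher
# picks the first registered variant whose predicate holds.

variants = []

def variant(functionality, predicate):
    """Register an implementation variant for a named functionality."""
    def register(fn):
        variants.append((functionality, predicate, fn))
        return fn
    return register

@variant("vector_sum", lambda ctx: ctx["n"] < 10_000)
def sum_sequential(xs):
    return sum(xs)

@variant("vector_sum", lambda ctx: ctx["n"] >= 10_000)
def sum_chunked(xs):
    # stand-in for a parallel or accelerator variant
    return sum(sum(xs[i:i + 1024]) for i in range(0, len(xs), 1024))

def call(functionality, xs):
    ctx = {"n": len(xs)}                      # run-time context
    for name, pred, fn in variants:
        if name == functionality and pred(ctx):
            return fn(xs)
    raise LookupError(functionality)
```

A real composition tool would of course base the choice on measured performance models and available execution units rather than a hand-written size threshold.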
The Pochoir Stencil Compiler System
Massachusetts Institute of Technology
Pochoir is a compiler for a domain-specific language embedded in C++ which produces excellent code from a simple specification of a desired stencil computation. Pochoir allows a wide variety of boundary conditions to be specified, and it automatically parallelizes and optimizes cache performance. Benchmarks of Pochoir-generated code demonstrate a performance advantage of 2-10 times over straightforward parallel loop code. I'll describe the Pochoir specification language and show how a wide range of stencil computations can be easily specified.
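As a point of reference, here is the kind of computation meant by "stencil", written naively (a 1-D three-point Jacobi average in illustrative Python, with fixed boundary values). A stencil compiler takes an equivalent high-level specification and generates cache-efficient, parallel code in place of such loops.

```python
# Naive 1-D Jacobi stencil: each interior point is replaced by the average
# of itself and its two neighbours; the two boundary points are held fixed
# (a simple Dirichlet boundary condition).

def jacobi_1d(u, steps):
    u = list(u)
    for _ in range(steps):
        nxt = list(u)                     # boundaries carried over unchanged
        for i in range(1, len(u) - 1):
            nxt[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0   # stencil kernel
        u = nxt
    return u
```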
Performance, Performance, Wherefore Art Thou Performance?
University of Oregon
Scalable parallel computing has at its core the goal of performance. Since the beginning of “high-performance” parallel computing, observing and analyzing performance for purposes of finding bottlenecks and identifying opportunities for improvement has been at the heart of delivering on the performance potential of next-generation scalable systems.
However, the outlook to exascale poses new challenges and demands a new perspective on the role of performance observation and analysis as integral parts of the exascale software stack that enables top-down application transformations to be optimized for runtime and system layers by feeding back dynamic information about hardware and software resources from the bottom-up. Performance observation and analysis technology should be an inherent aspect at all exascale levels to make it possible not only to bridge the gap between programming (computation) model semantics and execution model operation, but to deliver opportunities for online, adaptive optimization and control.
The reliance on post-mortem analysis of low-level performance measurements is prohibitive for exascale because of the performance data volume, the primitive basis for performance data attribution, and the inability to reflect back execution dynamics at runtime. With a multi-level exascale programming stack involving high-level transformations, it is necessary to provide richer context for attributions, beyond code locations and simple program events, together with a programmable, hierarchical, and dynamic “performance backplane” with model-driven measurement and analysis, and meaningful mapping back to program performance abstractions.
The perspective can go beyond the exascale software stack to consider how certain performance observation, computational semantics, and feedback support can be implemented in the exascale system architecture, and what advantages that may entail. Thinking here could lead to the creation of new hardware technology specifically designed to make the exascale machine more performance-aware and performance-adaptive.
On-The-Fly Computing
Heinz Nixdorf Institute, University of Paderborn
This talk will survey an approach to developing techniques and processes for the automatic on-the-fly configuration and provision of individual IT services out of basic services that are available on world-wide markets. In addition to the configuration by special "on-the-fly service providers" and the provision by "on-the-fly compute centers", this involves developing methods for quality assurance and the protection of participating clients and providers, methods for the target-oriented further development of markets, and methods to support the interaction of the participants in dynamically changing markets. This research direction is proposed in our CRC 901 "On-The-Fly Computing (OTF Computing)"; see http://sfb901.uni-paderborn.de/.
HOlistic Performance System Analysis (HOPSA)
Juelich Supercomputing Center
HOPSA is a coordinated twin project funded under EU FP7-ICT-2011-EU-Russia and by the Russian Ministry of Education and Science.
To maximise the scientific output of a high-performance computing system, different stakeholders pursue different strategies. While individual application developers are trying to shorten the time to solution by optimising their codes, system administrators are tuning the configuration of the overall system to increase its throughput. Yet, the complexity of today’s machines with their strong interrelationship between application and system performance presents serious challenges to achieving these goals.
The HOPSA project (HOlistic Performance System Analysis) therefore sets out to create an integrated diagnostic infrastructure for combined application and system tuning - with the former provided by the EU and the latter by the Russian project partners. Starting from system-wide basic performance screening of individual jobs, an automated workflow routes findings on potential bottlenecks either to application developers or to system administrators, with recommendations on how to identify their root cause using more powerful diagnostic tools. Developers can choose from a variety of mature performance-analysis tools developed by our consortium. Within this project, the tools will be further integrated and enhanced with respect to scalability, depth of analysis, and support for asynchronous tasking, a node-level paradigm playing an increasingly important role in hybrid programs on emerging hierarchical and heterogeneous systems.
All Roads Lead to Parallelism
IBM Research
When we compare two top supercomputers roughly 40 years apart, the Cray 1 and Blue Gene/Q, we observe the following: 1) Processor frequency went up 20-fold; 2) The number of floating-point operations per instruction stream per cycle has remained constant; and 3) The number of instruction streams has increased (approximately) 10 million-fold. This is a remarkable, yet often overlooked, success story for parallelism. In this talk we will take a more detailed look into the evolution of parallelism in supercomputers, including an analysis of the factors that enabled and supported this incredible growth in number of instruction streams. We will also take a look into the future, discussing if and how we can continue this trend.
Performance on HPC systems: what have we learned, what do we have to expect?
Technical University of Dresden
For some years now, parallelism and scalability have been the major issues in all areas of computing, and exascale computing is nowadays used as the motivating buzzword to highlight the shortcomings we have experienced in the development of HPC systems over many years. We have to recognize that technology is driving hardware development, and so far software has followed most architectural trends only very slowly. We have had success in addressing Linpack performance; nevertheless, we have failed in many other areas such as scalability, sustained application performance, and I/O. The talk will summarize a couple of facts and identify some strategies which could help us make reasonable progress in HPC computing, perhaps, as a side effect, even towards exascale computing.
Energy Efficient Multicore Computing at NTNU. Early Results and Future Research
Norwegian University of Science and Technology
The talk gives an overview of the strategic project "Energy Efficient Computing Systems" (EECS) at NTNU, with a focus on research within the CARD group at the Department of Computer and Information Systems and the NTNU HPC section. We follow a vertical approach from HPC applications down to the hardware, where we add energy measurements to performance monitoring to increase our knowledge of software resource usage on Intel Core i7 processors. We use the OmpSs environment for dependency-aware Task-Based Programming (TBP), and the talk will discuss recent energy-efficiency results for a set of benchmarks/applications on three platforms: off-the-shelf Intel Core i7 quad-core desktops (Sandy Bridge and Ivy Bridge) and a server equipped with two 8-core Sandy Bridge processors. The talk ends by presenting ideas for future research in the areas of multicore data structures and autotuning. The research is partly supported by PRACE.
Analytics in the age of Smarter Planet
IBM Research
Over the past few years, information and communication technology (ICT) has played a key role in the world economy and in lifting many people out of poverty. At ScalPerf 2010 we discussed the nature of this revolution. Since then, significant advancements have been made in the area of Smarter Planet. This talk will present the role of analytics computation in this field and give an overview of the area, while pointing out the topics of intense research.
Ieri, Oggi, Domani
University of Texas at Austin
This talk is a perspective on the accomplishments and failures of high-performance and parallel computing research over the past 25 years in the particular areas of parallel programming and software. We will also extract the main lessons for the future from these successes and failures.
Machine versus algorithmic performance from the perspective of Lattice QCD
Juelich Supercomputing Center
In this talk we will introduce in detail some key algorithmic and computational challenges related to simulations of Lattice QCD. We will review both the developments and performance improvements in the area of machines as well as algorithms. While a broader quantitative comparison of the performance gains due to new machines and algorithms is not possible in practice, we will present selected results to provide some quantitative estimates. After our conclusions we will provide a brief outlook on future challenges.
Data storage for a data-hungry world
HGST, a Western Digital company
In my presentation I will give an overview of the evolution of digital data-storage technology towards ever-increasing bit-densities. A brief survey of key gating items in current data-storage is followed by a discussion of an outlook for future ultra-high data-density technologies, including thermally assisted magnetic recording (TAMR) and bit-patterned magnetic recording (BPMR).
On the Stress Field Between Explicit Program Control and Performance Portability
Heriot-Watt University Edinburgh
Many approaches to parallel programming argue for the need to provide the programmer with a cost intuition and for the need to give programmers explicit control over the intended parallelism within their applications.
In this talk, we argue that the above position is fundamentally at odds with performance portability, at least if we look at performance portability across rather different hardware architectures. Using a few examples, we show how programs need to be vastly refactored when targeting a wider range of architectures such as SMPs, GPUs, Microgrids, or FPGAs as accelerators.
How coordination programming may be the way forward for scalable performance and high productivity
Compiler Technology and Computer Architecture Group University of Hertfordshire
The ideas of coordination programming are enjoying something of a comeback now that Intel promotes its Concurrent Collections, Microsoft its language F#, and there are a number of academic projects (StreamIT, Reo, S-Net to name but a few) that research various aspects of coordination languages.
Concurrency and communication control, abstracted from data computation, are at the core of coordination programming in all its various forms. However, real applications, and especially those that differ from primarily number-crunching HPC, require concurrent processing of distributed data structures, where sharing of data and its concurrent update, potentially with race conditions, is essential. In other words, it is not just concurrency and communication but also serialisability and sequential semantics that are important, which is something coordination is not quite at home with.
This talk will outline the coordination language S-Net, being developed by a consortium of European universities. S-Net is a language that focuses on connecting and synchronising a collection of independent components that only interact through their Single-Input, Single-Output interfaces. Several industrial applications have been programmed with S-Net, and we have reported performance and usability results before. The present talk will focus on the advancement of the language, specifically in giving it the facilities to handle distributed data structures and to enable sharing between concurrent activities. Interestingly, we have found a way of doing this without breaking the opacity of the coordinated components. We argue that our form of coordination paves the way not only to scalable performance, but also to high productivity, due to the drastic separation of concerns inherent in our method.
Space-Round Tradeoffs for MapReduce Computations
University of Padova
This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by featuring parameters which embody only global and local memory constraints, thus favoring a more data-centric view. Second, we apply the model to the fundamental computation task of matrix multiplication presenting upper and lower bounds for both dense and sparse matrix multiplication, which highlight interesting tradeoffs between space and round complexity. Finally, building on the matrix multiplication results, we derive further space-round tradeoffs on matrix inversion and matching.
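To convey the functional flavor of the paradigm, a minimal single-round map/shuffle/reduce matrix multiplication can be sketched as follows (illustrative Python; the model in the work above additionally accounts for global and local memory constraints and for round complexity, which this sketch ignores):

```python
# Toy MapReduce-style dense matrix multiplication in one round:
# map emits one partial product per (i, k, j) triple, keyed by output cell;
# shuffle groups values by key; reduce sums each group.

from collections import defaultdict

def mapreduce_matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    # Map phase: emit ((i, j), partial product) pairs.
    pairs = [((i, j), A[i][k] * B[k][j])
             for i in range(n) for k in range(m) for j in range(p)]
    # Shuffle phase: group values by output key.
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    # Reduce phase: sum the partial products for each output cell.
    C = [[0] * p for _ in range(n)]
    for (i, j), vals in groups.items():
        C[i][j] = sum(vals)
    return C
```

This one-round formulation materialises all n*m*p partial products at once; the space-round tradeoffs studied in the work arise precisely from bounding that intermediate volume and spreading the computation over multiple rounds.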
Supporting Non-Regular Problems in Modern Multicores: A Software Perspective
Delft University of Technology
Modern Multicores exhibit a number of properties that make the effective implementation of non-regular problems a challenge. Non-regularity appears in various forms and includes irregularity and dynamical behaviour in both data structures and computations. A major question is whether software solutions can support the development of such problems. In the talk we will review existing solutions and approaches and discuss their effectiveness. As a case-study, we present the requirements, design, and potential uses of a dynamic memory manager for OpenCL kernels.
Towards Extreme Scale Computing in the Coming Decade
Indiana University at Bloomington
The continued increase of performance enabled by advances in device technology demands innovations in system structures, methods of resource management and task scheduling, and programming models and tools. Bounding conditions of productivity, power, and resilience, as well as other cost factors, preclude reliance on conventional practices to achieve exascale computing. “Exascale” is vaguely perceived as 1000X petaflops computing measured across a number of dimensions, and it is much more than an HPL exaflops Rmax. This is particularly evident in the emerging domain of “big data” in its many forms, including graph analytics, where floating-point operations play little or no role. A new execution model is required to redirect strategies for efficiency and scalability, while modulating consumption within a bounded energy budget and sustaining execution in the presence of faults. Transitioning from static to dynamic methods is necessary to exploit information during execution, requiring a new generation of runtime systems. This presentation will describe a perspective on HPC evolving over the next decade, defined to satisfy the dominant constraints and to achieve the levels of efficiency and scalability necessary to enable exascale operation. It will discuss the new US DOE X-Stack Program, just begun, as the latest initiative to conduct focused research attacking the salient problems that are inhibiting progress.
System-level Energy-efficient scalable HPC
Eurotech
In recent years, HPC technology has been undergoing a radical transformation that reaches all layers of the HPC system architecture, from transistors to software. This transformation is mainly due to the need for sustainable performance growth compatible with an affordable energy budget for the state-of-the-art supercomputers that will be deployed in the future. This presentation will describe how this requirement is addressed in two different design domains: at the architectural level, by describing the main goals of the DEEP project, where the scalability aspects are mainly addressed; and at the system level, by presenting the Eurotech approach to the design and deployment of energy-efficient HPC systems.
The major Sources of Performance - Past, Present and Beyond
University of Munich
The talk will identify the major milestones that helped boost supercomputers’ performance by taking a look at how supercomputers evolved, which features they introduced, and how these represented further steps towards Mega-, Giga-, Tera-, Peta-, and Exascale systems. It will be discussed to what extent these features are present in today’s (super)computers, and an outlook will be given as to which of these technologies will prevail, which may disappear, and what new features can be expected in upcoming architectures.
A new computational model and microprocessor architecture: a way to scale program performance in the next decade
Institute for High Performance Computing and Networking (ICAR)- CNR
While chip manufacturers continue to increase the number of cores on a single chip and promise to market thousand-core chips in the next decade, the effort required of software developers and compiler writers to exploit such parallel hardware also increases, as they attempt to match the performance demand by exploiting many, but slower (power-efficient), cores. Program performance, however, has not scaled at the envisioned rate in the past decade, and this increase in the complexity of both hardware and software techniques makes performance scaling an uncertain goal. After all, the solution of going beyond homogeneous parallelism to embrace heterogeneity does not, by itself, seem to be the right one.
This has created a renewed interest in all things parallel (compilation, languages, etc.) that poses major technical challenges (the parallel processing problem) and drives considerable innovation in computing, by forcing us to rethink the von Neumann model that has prevailed since the 1940s. As future growth in computing performance must come from parallelism, and the microprocessor industry has already begun to deliver chips with hundreds (and soon thousands) of cores, new techniques will be required in software models, languages, tools, and architectures. Programming models should express application parallelism naturally and explicitly. However, whether or when most programmers should be exposed to explicit parallelism remains an open question. Future systems will include many more processors, whose allocation, load balancing, and data communication and synchronization interactions will be difficult to handle well. For chip multiprocessor (CMP) architectures, we must determine whether large numbers of cores work well in most computer deployments, how to handle synchronization and scheduling, and how to address the challenges associated with power and energy.
In order to tackle the parallel processing problem, we believe that innovative solutions are urgently needed, which in turn require extensive co-development of hardware and software, and hence a rethinking of the canonical computing stack, parallel architectures, and power efficiency. The Demand Data Driven Architecture System (D3AS) project, a step in this direction, is an attempt to apply a true hardware/software (H/S) co-design approach that can support a variety of H/S research projects. The project's target is to investigate a computing system capable of exploiting coarse-grained functional parallelism as well as fine-grained instruction-level parallelism, through the direct hardware execution of static dataflow program graphs, following a new execution model named homogeneous High-Level Dataflow System and generated from a functional programming language. The language is both the high- and low-level programming language for the entire system. The D3AS computing system is based on the integration of two naturally parallel organizations of computation: the data-driven and demand-driven models.
Joint work with Rosario Cammarota (University of California Irvine) and Roberto Vaccaro (Institute for High Performance Computing and Networking (ICAR)- CNR).
Stochastic Optimization and Memory Management
Vienna University of Technology
The eviction problem for memory hierarchies is studied for the Hidden Markov Reference Model (HMRM) of the memory trace, showing how miss minimization can be naturally formulated in the optimal control setting. In addition to the traditional version assuming a buffer of fixed capacity, a relaxed version is also considered, in which buffer occupancy can vary and its average is constrained. Resorting to multiobjective optimization, viewing occupancy as a cost rather than as a constraint, the optimal eviction policy is obtained by composing solutions for the individual addressable items. This approach is then specialized to the Least Recently Used Stack Model (LRUSM), a type of HMRM often considered for traces.
The techniques shown will deal with Markov chains, optimal control and multiobjective optimization and might (hopefully) be of interest in any application where a stochastic modeling of inputs is considered.
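For concreteness, the classical fixed-capacity LRU policy that the LRUSM generalises can be sketched as follows (illustrative Python; the contribution above is the optimal-control treatment under a stochastic trace model, not this deterministic baseline):

```python
# Fixed-capacity buffer with LRU eviction, counting misses over a trace.
# The LRUSM makes the reuse (stack-distance) behaviour stochastic and asks
# which eviction policy minimises the expected miss count.

from collections import OrderedDict

def lru_misses(trace, capacity):
    buf = OrderedDict()                    # keys in recency order, newest last
    misses = 0
    for item in trace:
        if item in buf:
            buf.move_to_end(item)          # hit: refresh recency
        else:
            misses += 1                    # miss: item must be fetched
            if len(buf) >= capacity:
                buf.popitem(last=False)    # evict least recently used
            buf[item] = True
    return misses
```

The relaxed, variable-occupancy version discussed in the talk would let len(buf) fluctuate and charge average occupancy as a cost instead of enforcing the hard capacity check.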
Some hard computational problems from Soft Matter
Department of Physical and Inorganic Chemistry, University of Bologna
The physical properties of soft materials vary significantly with their phase organization and morphology, and a key aspect in the development of new materials of this kind rests on the possibility of predicting their molecular assembly from the nanoscale, e.g. for organic electronics applications [1], up to the much larger micro-scale, e.g. for photonic applications. Correspondingly, the modelling and simulation problem also has to be tackled at different resolutions, and atomistic as well as coarser-grain molecular models are employed. In the talk we plan to show some examples of each, highlighting some current grand computational challenges. In particular, for coarse-grained models the vision is that of directly simulating devices, e.g. liquid crystal displays [2] or elastomeric actuators [3], from the molecular level up, thus avoiding the use of continuum theories. At the atomistic level we discuss the novel possibilities, offered by high performance computing, of realistically predicting the physical properties and molecular organizations of functional organic materials both in the bulk [4] and at interfaces [5,6] from their molecular structure.
[1] D. Beljonne, J. Cornil, L. Muccioli, C. Zannoni, J.-L. Brédas, F. Castet, Electronic processes at organic-organic interfaces: Insight from modeling and implications for opto-electronic devices, Chem. Mater. 23, 591-609 (2011)
[2] M. Ricci, M. Mazzeo, R. Berardi, P. Pasini and C. Zannoni, A molecular level simulation of a Twisted Nematic cell, Faraday Discuss. 144, 171-185 (2010)
[3] G. Skacej and C. Zannoni, Molecular Simulations Elucidate Electric Field Actuation in Swollen Liquid Crystal Elastomers, PNAS 109, 10193-10198 (2012)
[4] G. Tiberio, L. Muccioli, R. Berardi and C. Zannoni, Towards in silico liquid crystals. Realistic transition temperatures and physical properties for n-cyanobiphenyls via molecular dynamics simulations, ChemPhysChem 10, 125 (2009)
[5] A. Pizzirusso, L. Muccioli, M. Ricci and C. Zannoni, An Atomistic Approach to Predicting Liquid Crystal Anchoring. The Molecular Organization Across a Thin Film of 5CB on Crystalline Silicon, Chem. Sci. 3, 573-579 (2012)
[6] L. Muccioli, G. D’Avino and C. Zannoni, Simulation of vapor-phase deposition and growth of a pentacene thin film on C60(001), Adv. Mater. 23, 4532-6 (2011)