Ebook: Communicating Process Architectures 2015 & 2016
This book presents the proceedings of two conferences, the 37th and 38th in the WoTUG series: Communicating Process Architectures (CPA) 2015, held in Canterbury, England, in August 2015, and CPA 2016, held in Copenhagen, Denmark, in August 2016.
Fifteen papers were accepted for presentation at the 2015 conference. They cover a spectrum of concurrency concerns: mathematical theory, programming languages, design and support tools, verification, multicore infrastructure and applications ranging from supercomputing to embedded. Three workshops and two evening fringe sessions also formed part of the conference, and the workshop position papers and fringe abstracts are included in this book.
Fourteen papers covering the same broad spectrum of topics were presented at the 2016 conference, one of them in the form of a workshop. They are all included here, together with abstracts of the five fringe sessions from the conference.
Preface CPA 2015
Communicating Process Architectures (CPA) 2015 is the thirty-seventh in the WoTUG series of conferences and took place from Sunday August 23rd to Wednesday August 26th 2015, hosted by the School of Computing at the University of Kent in Canterbury, England.
Keynote talks this year were from David May (FRS) and Eric Verhulst. David, of course, was the architect of the Inmos Transputer – the first microprocessor designed to support multiprocessing – and the designer of the occam concurrent programming language. His most recent venture is XMOS, which he co-founded in 2005 and which develops and markets multicore multiprocessors for embedded applications together with the all-important software tools for harnessing them (i.e. concurrent design, reasoning and programming). David's talk reviews and forecasts the development of the ideas behind Communicating Processes and Processors from 1975 through 2025. Eric led the development of the Virtuoso multi-board RTOS, used in the ESA's Rosetta space mission to comet 67P/Churyumov-Gerasimenko. Virtuoso and its successor developments, OpenComRTOS and VirtuosoNext, all apply valuable principles and lessons learned from CSP, the transputer and occam. Eric's talk relates his own personal journey through the challenges of parallel programming, leading to the realisation of industrial applications based on viewing complex systems as a set of interacting entities exchanging information according to laws formalised in CSP – and the success that has resulted.
Fifteen papers were accepted for presentation at the conference, following the strong editorial process developed and refined by CPA over many years. They cover a spectrum of concurrency concerns: mathematical theory, programming languages, design and support tools, verification, multicore infrastructure and applications ranging from supercomputing to embedded. Three workshops were included in the conference, asking questions about the use of concurrency in modern High Performance Computing, how to work with and model analog (e.g. continuous time) and digital mechanisms in real-world hybrid systems, and how to classify and usefully benchmark the many languages and support libraries now available for message-passing concurrency. The conference also hosted two evening fringe sessions, for reporting most recent work, floating wild ideas and/or challenging received wisdom. The workshop position papers and fringe abstracts are included in these Proceedings. Altogether, contributions came from twelve different countries: Belgium, Brazil, Denmark, Canada, Germany, the Netherlands, Norway, South Africa, Spain, Sweden, the UK and the USA.
Finally, we thank everyone who submitted papers, the reviewers for their conscientious and very hard work, the delegates attending for their contributions that make it all worthwhile and the secretaries and conference officers who just make it work. Thank you all very much.
Kevin Chalmers (Edinburgh Napier University)
Jan Bækgaard Pedersen (University of Nevada Las Vegas)
Frederick R.M. Barnes (University of Kent)
Jan F. Broenink (University of Twente)
Ruth Ivimey-Cook (eLifesciences Ltd.)
Adam Sampson (Abertay University)
Peter H. Welch (University of Kent)
Preface CPA 2016
Communicating Process Architectures (CPA) 2016 is the thirty-eighth in the WoTUG series of conferences and took place from Sunday August 21st to Tuesday August 23rd 2016, hosted by the Niels Bohr Institute at the University of Copenhagen, Denmark. This was the conference's first visit to a Scandinavian country.
The keynote talk this year was from Markus Jochum from the Niels Bohr Institute. His talk, “Computational Challenges for Climate Modelling”, presented two key issues facing climate science and climate modelling: turbulence in the fluid dynamics of oceans and establishing the fidelity of the models (by demonstrating their accuracy over prehistoric warm periods). Both require an order of magnitude faster and higher resolution integration than is available even on today's massively parallel computing architectures, and need to take into account Earth's full carbon biochemistry. Dr. Jochum completed his PhD in physical oceanography in 2002 at MIT. He then worked for eight years as an ocean model developer at the National Center for Atmospheric Research, USA. Since 2012, he has been Professor of Physical Oceanography at the Niels Bohr Institute, Copenhagen.
Fourteen papers were accepted for presentation at the conference and publication in these proceedings, following the strong editorial process developed and refined by CPA over many years. They cover a full spectrum of concurrency concerns – from mathematical theory, programming languages, design and support tools, verification and multicore run-time infrastructure through to applications at all levels, from supercomputing to embedded. One of the fourteen papers was presented as a workshop on building sensor systems for a large-scale Arctic observatory. This year's submissions came from Denmark, France, the Netherlands, Norway, Malta, the UK and the USA.
As usual, the conference also hosted fringe sessions presenting work in progress, new ideas, demonstrations and concerns that certain common practices in concurrency are harmful. This year, we had five such fringe presentations and their abstracts can be found in these proceedings.
Finally, we thank everyone who submitted papers, the reviewers for their conscientious and very hard work, the delegates attending for their contributions that make it all worthwhile and the secretaries and conference officers who just make it work. Thank you all very much.
Kevin Chalmers (Edinburgh Napier University)
Jan Bækgaard Pedersen (University of Nevada Las Vegas)
Kenneth Skovhede (University of Copenhagen)
Brian Vinter (University of Copenhagen)
Peter H. Welch (University of Kent)
The ideas that gave rise to CSP, occam and transputers originated in the UK around 1975; occam and the Inmos transputer were launched around 1985. Thousands of computer scientists and engineers learned about concurrency and parallel computing using the occam-transputer platform and the influence is evident in many current architectures and programming tools. I will reflect on the relevance of these ideas today – after 30 years – and talk about communicating processes and processors in the age of clouds, things and robots. Slides used in the presentation can be downloaded from [1].
“Keep it simple but not too simple” means that a complex solution is really a problem that is not very well understood. In formal methods, this is reflected not only in the size of the state space, but also in the dependencies between these states. This is the main reason why Formal Modelling is not delivering as expected: the state space explosion would require an infinite amount of resources. If an automated tool cannot handle the state space, how can we expect engineers to do so? This is where CSP comes in: it divides the state space into small, manageable chunks, making it easier to reason about the behaviour. There are, however, a few pre-conditions for this to work: one must take a step back, dividing the complex state space before conquering it, hence thinking about functionalities and how they are related before thinking about the particular states in space and time.
Extrapolating the CSP abstract process algebra leads to a generic concept of describing systems as a set of Interacting Entities, whereby the Interactions are seen as first-class citizens, at the same level as the Entities, decoupling the Entities' states by explicit information exchanges. This takes us into the domain of modelling. One major issue with modelling approaches is that, while we need different and complementary models to develop a real system, these often have different semantics (if the semantics are properly defined at all). By being able to hide the internal semantics, one can focus on the interactions and use these as standardised interfaces.
It is clear that for this to work in the software domain, the natural programming model should be concurrent and execute on hardware that is compatible with it – a design feature of the transputer that has not been matched since. This opens the door to multi-domain modelling where, for example, parts of the system are continuous and other parts are discrete (as in clocked logic). This gives us an interesting new domain of hybrid logic, a topic we want to explore further in a workshop at the conference.
This lecture will be guided by my own personal journey, starting with a spreadsheet to program a parallel machine, covering Peter Welch's courses in occam [1] and the formal development of our distributed RTOS. Slides used in the presentation can be downloaded from [2].
The OpenTransputer is a new implementation of the Transputer first launched by Inmos in 1985. It supports the same instruction set, but uses a different microarchitecture that takes advantage of today's manufacturing technology; this results in a reduced cycle count for many of the instructions. Our new Transputer includes support for channels that connect to input-output ports, enabling direct connection to external devices such as sensors, actuators and standard communication interfaces. We have also generalised the channel communications with the support of virtual channels and the design of a message routing component based on a Beneš switch to enable the construction of networks with scalable throughput and low latency. We aim to make the OpenTransputer and switch components available as open-source designs.
SpiNNaker is a computing system composed of over a million ARM cores, embedded in a bespoke asynchronous communication fabric. The physical realization of the system consists of 57600 nodes (a node is a silicon die), each node containing 18 ARM cores and a routing engine. The communication infrastructure allows the cores to communicate via short, fixed-length (40- or 72-bit), hardware-brokered packets. The packets find their way through the network in a sequence of hops, and the specifics of each route are held (distributed) in the routing engines, not unlike internet routing. On arrival at a target core, a hardware-triggered interrupt invokes code to handle the incoming packet. Within this computing model, the state of the system-under-simulation is distributed, held in memory local to the cores, and the topology is also distributed, held in the routing engines' internal tables. The message passing is non-deterministic and non-transitive, there is no memory coherence between the core-local memories, and there is no global synchronization. This paper shows how such a system can be used to simulate large systems of neurons using discrete event-based techniques. More notably, the solution time remains approximately constant with neural system size as long as sufficient hardware cores are available.
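The computing model described above can be caricatured in a few lines of Go (purely an illustrative sketch with hypothetical types; it is not the SpiNNaker API and ignores routing, packet formats and interrupts): each simulated core holds only local neuron state and does work solely in response to arriving packets, emitting a notification when a neuron fires.

    package main

    import "fmt"

    // Packet is a hypothetical stand-in for a short, hardware-brokered spike packet.
    type Packet struct{ Source int }

    // core models one node: all neuron state is local, and computation happens
    // only when a packet arrives (mirroring the hardware-triggered handler).
    func core(id int, in <-chan Packet, fired chan<- int) {
        potential, threshold := 0.0, 3.0
        for range in {
            potential++ // update local state on each incoming spike
            if potential >= threshold {
                potential = 0
                fired <- id // signal a spike; real routing is the fabric's job
            }
        }
        close(fired)
    }

    func main() {
        in, fired := make(chan Packet), make(chan int)
        go core(0, in, fired)
        go func() {
            for i := 0; i < 10; i++ {
                in <- Packet{Source: 99} // inject ten incoming spikes
            }
            close(in)
        }()
        for id := range fired {
            fmt.Println("core", id, "fired")
        }
    }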
This work extends the subset of CSPm that can be translated to C++ using the tool CSP++. Specifications can now contain user defined functions, make limited use of set and sequence data, and utilise built-in CSPm functions that operate on such data. An extended development paradigm is suggested, based on applying transformational methods from UML to C++. All techniques are demonstrated over three case-study examples.
A significant number of parallel applications are implemented using MPI (Message Passing Interface) and several existing approaches focus on their verification. However, these approaches typically work with complete applications, and fixing any undesired behaviour at this late stage of application development is difficult and time consuming. To address this problem, we present a lightweight formal approach that helps developers build safety into MPI applications from the early stages of program development. Our approach consists of a methodology that includes verification during the program development process. We provide tools that hide the more difficult formal aspects from developers, making it possible to verify properties such as freedom from deadlock as well as automatically generating partial skeletons of the code. We evaluate our approach with respect to its ability and efficiency in detecting deadlocks.
Model-driven methodologies can help developers create more reliable software. In previous work, we have advocated a model-driven approach for the analysis and design of concurrent, safety-critical systems. However, to take full advantage of those techniques, they must be supported by code generation schemes for concrete programming languages. Ideally, this translation should be traceable, automated and should support verification of the generated code. In this paper we consider the problem of generating concurrent Java code from high-level interaction models. Our proposal includes an extension of JML that allows shared resources to be specified as Java interfaces, and several translation patterns for the (partial) generation of CSP-based Java code. The template code thus obtained is JML-annotated Java using the JCSP library, with proof obligations that help with both traceability and verification. Finally, we present an experiment in code verification using the KeY tool.
MicroCSP is a run-time system written in C supporting process-oriented concurrency. It features CSP-style synchronous communication over point-to-point channels and alternation (including timeouts) and implements preemptive priority process scheduling. The MicroCSP programmer writes the logic of a process as an ordinary function which, barring preemption, runs to completion every time the process is scheduled. To make such an approach feasible requires that the programmer cast each process's logic in normal form: a loop controlling a single choice of guarded events, with event-free computation following each event. A context switch requires a mere function call, with the exception of those that occur as the result of an interrupt. The system is memory-efficient, fast and has a narrow hardware interface. The intended target of MicroCSP is bare microcontroller hardware, and its efficient use of the stack makes it particularly suitable for microcontrollers with restricted memory. The current version runs on a single processor over a Linux implementation of the hardware interface and serves as a prototype for the microcontroller implementations. A multicore implementation appears to be possible.
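As an illustration only — MicroCSP itself is a C run-time system and this is not its API — the normal form described above (a loop around a single choice of guarded events, with only event-free computation following each event) has roughly the following shape when written with Go channels and select:

    package main

    import (
        "fmt"
        "time"
    )

    // worker is written in the normal form: one loop, one choice of guarded
    // events, and only event-free computation after each event.
    func worker(in <-chan int, cmd <-chan string, out chan<- int) {
        total := 0
        for {
            select { // the single choice of guarded events
            case v := <-in: // event: a value arrives
                total += v // event-free computation follows
            case c := <-cmd: // event: a command arrives
                if c == "stop" {
                    out <- total
                    return
                }
            case <-time.After(time.Second): // event: timeout guard
                total = 0
            }
        }
    }

    func main() {
        in, cmd, out := make(chan int), make(chan string), make(chan int)
        go worker(in, cmd, out)
        in <- 2
        in <- 3
        cmd <- "stop"
        fmt.Println(<-out) // prints 5
    }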
Research on concurrent systems requires modelling, simulation and evaluation in many settings. We describe a design for Interchangeable Simulation and Implementation (ISI, pronounced easy), an approach that enables the simultaneous modelling, simulation and evaluation of highly concurrent, process-based systems.
We describe how ISI can be used to design and evaluate techniques used in storage systems. Our key insight is that, while the simulation of such systems could be implemented as a regular discrete event simulation, designing it as the highly concurrent, process-based system that it is — with the full data and control flow in mind — enables seamless interchangeability between discrete and real-time simulation.
Core Aldwych is a simple model for concurrent computation, involving the concept of agents that communicate through shared variables. Each variable will have exactly one agent that can write to it, and its value can never be changed once written, but a value can contain further variables that are written to later. A key aspect is that the reader of a value may become the writer of variables in it. In this paper we show how this model can be used to encode lambda calculus. Individual function applications can be explicitly encoded as lazy or not, as required. We then show how this encoding can be extended to cover functions that manipulate mutable variables, but with the underlying Core Aldwych implementation still using only immutable variables. The ordering of function applications then becomes an issue, with Core Aldwych able to model either the enforcement of an ordering or the retention of indeterminate ordering, which allows parallel execution.
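The single-writer, write-once variables described here can be mimicked very loosely in Go (a hedged sketch of the idea only, not Core Aldwych itself): a cell that exactly one party fills, that any number of parties may later read, and whose value may itself contain further unfilled cells — so the reader of one cell can become the writer of the next.

    package main

    import "fmt"

    // Cell is a write-once variable: one writer, any number of readers.
    type Cell struct {
        set   chan struct{} // closed exactly once, when the value is written
        value interface{}
    }

    func NewCell() *Cell { return &Cell{set: make(chan struct{})} }

    // Write must be called exactly once, by the cell's single writer.
    func (c *Cell) Write(v interface{}) { c.value = v; close(c.set) }

    // Read blocks until the value has been written, then returns it.
    func (c *Cell) Read() interface{} { <-c.set; return c.value }

    // Pair is a value containing a further variable (its Next cell).
    type Pair struct {
        Head int
        Next *Cell
    }

    func main() {
        c := NewCell()
        go func() {
            // The writer of c places a value that contains an unwritten cell.
            c.Write(Pair{Head: 1, Next: NewCell()})
        }()
        p := c.Read().(Pair)
        p.Next.Write(Pair{Head: 2, Next: nil}) // reader of c becomes writer of Next
        fmt.Println(p.Head, p.Next.Read().(Pair).Head) // prints 1 2
    }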
In this paper, we examine a new CSP-inspired library for the Common Intermediate Language: CoCoL (Concurrent Communications Library). The use of the Common Intermediate Language makes the library accessible from a number of languages, including C#, F#, Visual Basic and IronPython. The processes are based on tasks and continuation callbacks, rather than threads, which enables networks with millions of running processes on a single machine. The channels are based on request queues with two-phase commit tickets, which enables external choice without coordination among channels. We evaluate the performance of the library on different operating systems and compare the performance with JCSP and C++CSP.
This case study is embedded in a wider project aimed at investigating process-based software development to better utilise the multiple cores on contemporary hardware platforms. Three alternative process-based architectures for the classical Aho-Corasick failure function construction algorithm are proposed, described in CSP and implemented in Go. Empirical results show that these process-based implementations attain significant speedups over the conventional sequential implementation of the algorithm for significantly-sized data sets. Evidence is also presented to demonstrate that the process-based performances are comparable to the performance of a more conventional concurrent implementation in which the input data is simply partitioned over several concurrent processes.
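The “more conventional concurrent implementation” mentioned at the end can be sketched along the following lines (a simplified illustration, not the authors' code: it uses strings.Count rather than an Aho-Corasick automaton and ignores matches that straddle a partition boundary): the input data is simply partitioned over several concurrent processes, whose partial counts are collected over a channel.

    package main

    import (
        "fmt"
        "strings"
    )

    // countIn counts keyword occurrences in one partition of the input and
    // sends the partial total to the collector.
    func countIn(part string, keywords []string, partial chan<- int) {
        n := 0
        for _, k := range keywords {
            n += strings.Count(part, k)
        }
        partial <- n
    }

    func main() {
        text := strings.Repeat("he said she said hers ", 1000)
        keywords := []string{"he", "she", "hers"}

        const workers = 4
        partial := make(chan int)
        chunk := len(text) / workers

        // Partition the input data over several concurrent processes.
        for w := 0; w < workers; w++ {
            lo, hi := w*chunk, (w+1)*chunk
            if w == workers-1 {
                hi = len(text)
            }
            go countIn(text[lo:hi], keywords, partial)
        }

        // Collect and sum the partial counts.
        total := 0
        for w := 0; w < workers; w++ {
            total += <-partial
        }
        fmt.Println("total matches:", total)
    }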
Guppy is a new and experimental process-oriented programming language, taking much inspiration (and some code-base) from the existing occam-π language. This paper reports on a variety of aspects related to this, specifically language, compiler and run-time system development, enabling Guppy programs to run on desktop and embedded systems. A native code-generation approach is taken, using C as the intermediate language, and with stack-space requirements determined at compile-time.
This work presents some initial thoughts and future directions when considering Communicating Process Architectures in light of parallel design patterns and algorithmic skeletons. We consider where CPA aligns with current work in parallel application design, and propose, as future work, ways in which the scope of CPA can expand to better support current trends in parallel application development.
In this work we present a new design and implementation of the Synchronous Message Exchange model. The new version uses explicit busses, which may include multiple fields, and allows a component to use a bus for both reading and writing, whereas the original version allowed only reading from or writing to a bus, which meant that some busses had to exist in two versions for the two directions. In addition to the new and improved bus model, the new SME version also produces traces that may be used for validating a later VHDL implementation of the designed component, and can produce a graphical representation of a design to help with debugging.
Model-Driven Development (MDD) – based on the concepts of model, meta-model and model transformation – is an approach to develop predictable and reliable software for Cyber-Physical Systems (CPS). The work presented here concerns a methodology to design simulation software based on MDD techniques, supporting the TERRA tool suite to describe and simulate process communication flows. TERRA is implemented using MDD techniques and Communicating Sequential Process algebra (CSP). Simulation support for TERRA helps the designer to understand the semantics of the designed model and, hence, to increase the probability of first-time-right software implementation. A new simulation meta-model is proposed, abstracting the simulation process of a TERRA model. With this new meta-model and our previously designed CSP meta-model, a simulation model can be transformed from its TERRA source. The Eclipse Modelling Framework (EMF) is used to implement the meta-model. The Eclipse Epsilon Framework includes the Epsilon Transformation Language (ETL) and the Epsilon Generation Language (EGL), used for model-to-model and model-to-text transformation. The simulation support is shown using an example, in which the generated trace text is shown as well. Further work is to implement an animation facility showing the trace text in the TERRA graphical model using colours.
In this work we explore and evaluate the effect of automatic code specialisation on auto-generated GPU kernels. When combining the high productivity coding environment of computational science with the Just-In-Time compilation nature of many GPU runtime systems there is a clear cut opportunity for code optimisation and specialisation. We have developed a hybrid kernel generation method which is shown to be useful and competitive across very different use cases, and requires minimal knowledge of the overall structure of the program. Stencil codes which are commonly found at the core of computer simulations are ideal candidates for this type of code specialisation. For exactly this type of application we are able to achieve speedups of up to 2.5 times with the implemented strategy.
One of the issues that has been bothering embedded systems engineers is how to deal with time. Some approaches have attempted to make time part of the modelling language, other approaches turn it into a partial order of events, while most programmers ignore it completely, equating QoS with real-time (met most of the time, but not guaranteed). While in the discrete domain time is considered to progress as a number of clock cycles, in the continuous domain time has an infinitesimal granularity. If we now want to prove the correctness of a hybrid system, people traditionally use time-dependent partial or ordinary differential equations in the continuous domain and model checkers or provers for the discrete domain.
How can we combine the two? Following the Keynote [1] theme, we remember to separate the concerns. Hence we need time-independent models that, when executed on a given platform or in a given system context, result in specific time properties (like meeting a deadline or stability). In the discrete domain, this can be achieved by using priority as a scheduling parameter. In the continuous domain, engineers routinely transform the model into the frequency domain to prove desired properties using Nyquist or Laplace. The workshop will look for answers on how such hybrid models can be formally verified (as opposed to simulation and testing only). Slides used introducing this workshop can be downloaded from [2].
The modern HPC centre is a complex entity. A centre easily hosts tens of thousands of processors in the smallest cases, and several million processors for the largest facilities. Before 2020, this number is expected to pass 100 million processors in a single centre alone. Thus, massively parallel processing is everyday business today and the major challenges of running such facilities are well established, with solutions existing for most of these challenges.
In the following we will make a strong distinction between parallelism (e.g. Single Program Multiple Data and Single Instruction Multiple Data) and concurrency (multiple processes, both identical and different) that may run in parallel or by interleaving, depending on timing and available hardware.
While concurrency mechanisms were commonly used to express applications for parallel computers two or three decades ago, this use of concurrency has completely died out. There are several explanations for this, but the most important is that the cost of an HPC installation today is so high that users must document that they use the facility efficiently. All applications must stay above a specified threshold, typically 70%, of CPU utilisation. Even with SPMD type programming, asynchrony amongst processors is a common challenge when trying to stay above this threshold.
This does not mean that concurrency is not relevant in the modern HPC centre. While the actual compute nodes may not use concurrency, the underlying fabric that allows those nodes to operate has a large potential for concurrency. In many cases, these elements could benefit from a formal approach to concurrency control.
In this workshop, we will present challenges in the HPC centre that do depend on concurrency: storage-systems, schedulers, backup-systems, archiving and network-bulk transfers, to name a few. The interesting challenge is that while all these elements require concurrency control to operate correctly and efficiently, they are also highly interdependent – the concurrency aspects must cover the full set of infrastructure components for optimal efficiency. We will seek to describe scenarios for real HPC centres and sketch solutions that are built on a structured concurrency approach. Slides used introducing this workshop can be downloaded from [1].
In the last few years, there have been a number of new programming languages which incorporate message-passing concurrency. Examples, such as Google's Go and Mozilla's Rust, have shown an increased industry and academic interest in the ideas of message-passing concurrency as a first order concern. These languages have joined existing ones such as occam-π, Erlang, and Ada with strong communication-based concurrency. It is therefore argued that the concurrent systems programmer has a number of options for exploiting message-based concurrency.
The Communicating Process Architectures (CPA) community, and others, have for a number of years developed libraries to support message-passing concurrency within existing programming languages and run-times. This support is normally built upon the thread support libraries of the host language. JCSP (for Java) and PyCSP (for Python) are commonly discussed, but support for CPA ideas has also been implemented in Haskell, C++, Lua, and other languages.
The languages and libraries supporting message-passing concurrency are normally inspired by one or more process algebras. Hoare's Communicating Sequential Processes (CSP) and Milner's π-calculus are the two main inspirations for message-passing work. It is questionable, however, how well these process algebras are supported in the languages and libraries they inspire.
Slides used introducing this workshop can be downloaded from [1].
Communicating to fellow programmers that the concept of blocking in process-oriented design is perfectly acceptable, while using a word with basically negative connotations, is difficult. The mere fact that this has to be explained means that misunderstanding of the concept is common, and harmful. The first contender on a “blocking” channel has alternatively been said (by several people) to wait (rather than block) on the channel. This seems a more accurate description and is certainly more positive.
The negative meaning of “blocking” corresponds better to the case where it describes serious side effects on the temporal properties of other concurrent components. This is the blocking that is rightly feared.
One example is a process with too many internal work-tasks to handle, where no single task can acceptably “block” the other work-tasks for too long. The “blocking-on-each-other” kind that we know as deadlock is perhaps the worst: a system cannot survive for long should this happen (in a sub-system, say).
When the literature of recent years talks about non-blocking algorithms, the blocking it refers to is in fact the “waiting” type. It might be easier to refer to this when describing the semantics of a synchronous (and, in this context, non-blocking) CSP-type channel. The same reasoning applies to the interface-RPC-call or entry-type mechanisms of Ada and xC: they, too, are of the “waiting” type. To gain understanding and acceptance of these types of solutions, we should simply avoid the “blocking” term in the CSP context. Phrases like “wait”, “yield”, “synchronous” and “non-buffered” should suffice. This fringe presentation will explore this further and invite discussion. Slides used in the presentation can be downloaded from [1].
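The distinction is easy to demonstrate with an unbuffered Go channel (again only an illustrative sketch): the first party to arrive at the channel simply waits for its partner in the rendezvous, while the rest of the system carries on unaffected — nothing else is “blocked”.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        ch := make(chan string) // synchronous, non-buffered channel

        // The sender arrives first and waits (rather than "blocks") for a partner.
        go func() {
            ch <- "hello" // completes only when a receiver is ready
        }()

        // Meanwhile, the rest of the system is unaffected by that waiting.
        go func() {
            for i := 0; i < 3; i++ {
                fmt.Println("other work", i)
                time.Sleep(10 * time.Millisecond)
            }
        }()

        time.Sleep(20 * time.Millisecond) // let the sender wait for a while
        fmt.Println(<-ch)                 // the rendezvous completes here
        time.Sleep(20 * time.Millisecond) // let the other goroutine finish
    }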
Concurrency is, in the literature, often used as a noun with a range of strengths: there is more or less concurrency; it is more or less limited; it may even be described as complete. In trying to discuss multi-threaded programming with programmers who state that they program single-threaded, it is important to communicate that they may program less concurrently, but probably not as non-concurrently as they believe. What are the factors that increase concurrency and which factors are orthogonal to the degree of concurrency? Does a Go language goroutine increase it and is a C++ object orthogonal? Will the CSP paradigm generally enable increased concurrency? Is the CSP paradigm of communication-with-synchronisation itself orthogonal to the degree of concurrency? It is also important to understand the term parallel slackness: does it introduce or enable more concurrency? And what about atomicity?
This fringe presentation aims to raise more questions than it is able to answer. However, some lines of reasoning are suggested. Finally: is it at all meaningful to raise awareness of “concurrent” as an adjective?
Slides used in the presentation can be downloaded from [1].
This fringe presentation focuses on the development of a library to support CSP ideas within the new C++11 standard. The library has been developed to support building an MPI layer for CPA Networking, C and C++ being simpler languages in which to build up such a framework. This fringe presentation will take a general look at how we can write applications using C++11 CSP.