Laboratoire LIP / ENS Lyon / Equipe Projet Compsys

Communicating Regular Processes

http://perso.ens-lyon.fr/paul.feautrier/samos.pdf

Chunking

The gist of Cédric Bastoul's thesis http://perso.ens-lyon.fr/paul.feautrier/chunking.pdf

Polymorphic Torus

An old proposal for an intelligent network http://perso.ens-lyon.fr/paul.feautrier/polymorphicTorus.pdf

A recent elaboration of the same theme: Kayhan M. Imre, Cesur Baransel and Harun Artuner: Efficient and Scalable Routing Algorithms for Collective Communication Operations on 2D All-Port Torus Networks. International Journal of Parallel Programming, 39(6):746-782, 2011.

Paul's Own Thoughts on Process Networks on Chip

Landscape

Due to technological problems (quantum effects, power dissipation), the clock frequency of present-day digital circuits seems to have reached an upper bound around 3 GHz. However, gate density is still increasing, albeit not as fast as it did before the millennium. The increases in processing power required by new applications can therefore only come from increased parallelism. Reaching the exascale level (10^18 floating-point operations per second) will need billions of processors. Embedded and mobile applications, while less greedy, will run on thousands of processors, as exemplified by the Intel MIC (1000 cores) or the Kalray chip (from 256 to 1024 processors in the near future).

Increasing the number of processors does not mechanically result in increased processing power: there are many bottlenecks or walls to be breached before being able to efficiently use such architectures, among them the power wall, the memory and pin wall, the communication bottleneck, and the most important one, the programmer wall.

Memory

Each processor consumes a few data words per cycle. These words can come from registers, from several levels of local memory (caches or scratchpads), from global memory, or from permanent storage. Since the efficiency of a memory hierarchy cannot be made arbitrarily large, the number of memory banks will be a fraction of the number of processors. Since each processor chip can be connected only to a small number of memory chips (the pin wall), the system will necessarily have Non-Uniform Memory Access latency. A purely dynamic replacement policy, as found for instance in present-day caches and virtual memory, may not scale up to the necessary sizes. Besides, such a policy forbids accurate execution-time prediction and is unsuitable for real-time applications. In the same way, it seems that dynamic coherence protocols will not scale: the required data movements must be orchestrated by the application software, possibly with the help of the compiler.

Control and Parallelism

Conditional jumps and parallelism have never fit well together. This is true for instance

  • for superscalar and pipelined processors, which have to stall instruction issue while waiting for the outcome of a test, or to resort to heuristics like speculation, branch prediction and delay slots,
  • for SIMD systems, in which all processors execute the same code, some of them being prevented from modifying memory by an activity bit, which is set to the outcome of a test,
  • for vector processors and GPUs, which use variants of this scheme (mask vectors, thread serialization),
  • in hardware, where there are two solutions:
    • start the test and both branches at the same time, and feed both results to a multiplexer controlled by the outcome of the test (efficient if the test and the two branches are well balanced, but costly in silicon area);
    • use the result of the test to select the next state of a finite state machine (same problem as for ordinary processors: the test is on the critical path; however, it is easier to anticipate if dependences permit).
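The SIMD activity-bit scheme above can be sketched in a few lines of Python. This is a toy model, not any real ISA: every lane executes both branches, and the mask (the per-lane test outcome) gates which result is written back.

```python
# Toy model of SIMD predication for "if x > 0: y = x * 2 else: y = -x".
# All lanes run both branches; an activity mask selects the write-back.
def simd_select(xs):
    mask = [x > 0 for x in xs]          # one test per lane
    then_branch = [x * 2 for x in xs]   # every lane runs the 'then' code
    else_branch = [-x for x in xs]      # every lane runs the 'else' code
    # the activity bit gates the write-back: exactly one result per lane
    return [t if m else e for m, t, e in zip(mask, then_branch, else_branch)]

print(simd_select([3, -1, 0, 5]))  # [6, 1, 0, 10]
```

Note that both branches are always paid for, which is exactly why divergent control flow is costly on such machines.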

It is likely that introducing control in dataflow systems will be at least as difficult as for ordinary processors. Observe that there is no problem introducing tests in dataflow *languages*; but the net result is that one loses almost all interesting properties, like static scheduling, deadlock detection, channel bounds, and even determinism if one relaxes the "only one writer and one reader per channel" constraint.

A suggestion: start from the Communicating Regular Processes model and introduce control progressively. Control in purely computational parts is easy: one just has to be careful to discard dependences between opposite branches of a test. Since in CRP reading is not destructive, there is no problem with conditional reads, except that they might enlarge the lifespan of some channel tokens. Dealing with conditional writes is more difficult. One must ensure that the single-assignment property is not violated. Besides, conditional writes may create undefined values (bubbles) in channels. How should bubbles be dealt with at the receiving end?
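These constraints can be made concrete with a toy Python model of a CRP channel (the class and its API are hypothetical, purely for illustration): writes are checked for single assignment, reads are non-destructive, and a skipped conditional write leaves a bubble that the reader must confront.

```python
# Toy CRP channel: a single-assignment array of tokens indexed by
# iteration number. A conditional write that is skipped leaves a
# "bubble" (an undefined value) at that index.
class Channel:
    def __init__(self):
        self.tokens = {}

    def write(self, i, value):
        if i in self.tokens:
            raise RuntimeError(f"single assignment violated at index {i}")
        self.tokens[i] = value

    def read(self, i):
        # non-destructive read, as in CRP; hitting a bubble is an error,
        # so the receiving process must somehow avoid undefined indices
        if i not in self.tokens:
            raise RuntimeError(f"bubble at index {i}")
        return self.tokens[i]

ch = Channel()
for i in range(4):
    if i % 2 == 0:        # conditional write: odd indices become bubbles
        ch.write(i, i * i)

print(ch.read(2))          # 4
```

The restriction the text calls for amounts to the compiler proving, statically, that no `read` ever reaches a bubble and no `write` ever fires twice at the same index.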

The net result is that conditionals in CRP processes have to be severely restricted, and these restrictions must be verifiable by the compiler.

Synchronous Languages and the Polyhedral Model

There has been an attempt in the past to combine SIGNAL and MMALPHA, but I have no information on the subject. An obvious possibility is to allow signals of arrays (signals whose values are arrays instead of scalars). Since all elements of the array will have the same clock, there will be no modification to the clock calculus. The point-wise operators will have to be extended to allow polyhedral operations (loops or vector notation). Scheduling will be done in two steps: first by the usual clock calculus, and then, at each event, by a polyhedral scheduling which may even uncover additional parallelism.
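The "signal of arrays" idea can be sketched as follows (a minimal Python model, not SIGNAL syntax; the `lift` helper and the event lists are illustrative assumptions): a signal is a list of events, each event carries a whole array, and a scalar point-wise operator is lifted to act element-wise at each event.

```python
# Sketch of signals of arrays: each event of the stream carries an
# array, and all elements of that array share the event's clock.
def lift(op):
    """Lift a scalar point-wise operator to arrays, element by element."""
    def array_op(a, b):
        return [op(x, y) for x, y in zip(a, b)]
    return array_op

add = lift(lambda x, y: x + y)

# two signals of arrays, one array per tick of their (common) clock
s1 = [[1, 2, 3], [4, 5, 6]]
s2 = [[10, 20, 30], [40, 50, 60]]

# the clock calculus aligns events as usual; at each event an
# element-wise (in general, polyhedral) computation runs, exposing
# parallelism across the array elements
s3 = [add(a, b) for a, b in zip(s1, s2)]
print(s3)  # [[11, 22, 33], [44, 55, 66]]
```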

A more ambitious proposal is to introduce arrays of signals. Here, each elementary signal may have its own clock. Analyzing such a system may necessitate a parametric clock calculus.

Network Scheduling

Present-day communication networks, from Networks on Chips to the Internet, operate in "best effort" mode. Each message bears a routing prefix, and each router dispatches it as best it can, according to local information, and sometimes to scanty global information (e.g. neighboring link congestion). In some cases, one may create a route and assign to it some portion of the available bandwidth. Another solution would be to use passive routers, able only to connect their input ports to their output ports in arbitrary patterns, under the control either of the associated processor or of a pre-stored configuration table, prepared beforehand by a compiler. See the "polymorphic torus" paper for a similar proposal; note however that some of the assumptions in this paper may not be valid for today's architectures; an example is the hypothesis that each router acts as a short circuit, and hence that the propagation delay across the network is negligible. This scheme is also reminiscent of SIMD architectures like the Connection Machine 2 or Propal, but in their case the connection pattern was fixed: a hypercube for the CM2 and a shift register for Propal.
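A passive router of this kind can be modeled in a few lines of Python (a sketch under illustrative assumptions: the port names, the table layout, and `route_step` are all hypothetical). The router makes no decision; it merely applies the crossbar configuration the compiler stored for the current communication step.

```python
# Toy passive router: it only connects input ports to output ports
# according to a pre-stored, compiler-prepared configuration table.
def route_step(routers, config, inputs):
    """Apply one pre-computed crossbar configuration.
    routers: router ids; config: {router: {out_port: in_port}};
    inputs: {(router, in_port): message}.
    Returns {(router, out_port): message}."""
    outputs = {}
    for r in routers:
        for out_port, in_port in config[r].items():
            outputs[(r, out_port)] = inputs.get((r, in_port))
    return outputs

# step 0 of a hypothetical schedule: router 0 shorts 'west' to 'east'
config_table = {0: {"east": "west"}}
msgs = {(0, "west"): "hello"}
print(route_step([0], config_table, msgs))  # {(0, 'east'): 'hello'}
```

The whole intelligence lives in `config_table`, one entry per communication step, which is exactly what the compiler would have to construct from the application's communication pattern.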

As these examples show, this proposal implies synchrony, at least whenever a communication is intended; the programming style must be SPMD and somewhat like BSP. This might be barely possible for systems on chip like the Intel MIC or the Kalray chip, but is excluded for exascale architectures; such architectures will probably use a hierarchy of interconnects, and the proposed scheme may be applicable to the lowest level.

Another point is that the compiler must have complete knowledge of the application's communication pattern and of the network topology. On the other hand, the placement of the computations on the network is one of the unknowns of the problem. In abstract terms, one has to find both a placement and several communication patterns, the main objective being the minimization of link contention. This will need a mixture of temporal and spatial reasoning, much like the construction of a railway timetable, a very interesting subject.

Last modified on Jan 9, 2012, 12:43:02 PM