WikiStart: anr-2010.tex

File anr-2010.tex, 39.0 KB (added by coach, 14 years ago)

1	\section{Project context}
2	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
3	% 1. CONTEXT AND POSITIONING OF THE PROJECT
4	% (1 page maximum) General presentation of the problem that the project proposes to
5	% address and of the working framework (fundamental research, industrial research or
6	% experimental development).
7	\end{verbatim}
8	\end{scriptsize}
9	An embedded system is an application integrated into one or several chips,
10	either to accelerate it or to embed it into a small device such as a personal
11	digital assistant (PDA).
12	This topic has been investigated since the 1980s using Application-Specific Integrated Circuits (ASIC),
13	Digital Signal Processors (DSP) and parallel computing on multiprocessor machines or networks.
14	More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW) processors,
15	Application-Specific Instruction-set Processors (ASIP), Systems on Chip (SoC) and
16	Multi-Processor SoCs (MPSoC).
17	\\
18	Over the last decades, embedded system design was reserved to major industrial companies targeting high-volume markets,
19	because of the design and fabrication costs.
20	Nowadays, Field Programmable Gate Arrays (FPGA), like Virtex5 from Xilinx and Stratix4 from Altera,
21	can implement a SoC with multiple processors and several coprocessors for less than 10K euros per device.
22	In addition, High Level Synthesis (HLS) is becoming more mature; it automates design
23	and drastically decreases its cost in terms of manpower. Thus, both FPGA and HLS tend to spread into
24	High Performance Computing (HPC) for small companies targeting low-volume markets.
25	\par
26	To obtain an efficient embedded system, the designer has to take the application characteristics into account when
27	choosing one of the above technologies.
28	This choice is not easy, and in most cases the designer has to try different technologies before retaining the
29	most suitable one.
30	\\
31	The first objective of COACH is to provide an open-source framework for designing embedded systems
32	on FPGA devices.
33	The COACH framework allows the designer to explore various software/hardware partitions of the
34	target application, to run timing and functional simulations, and to automatically generate both
35	the software and the synthesizable description of the hardware.
36	The main topics of the project are:
37	\begin{itemize}
38	\item
39	Design space exploration: it consists in analysing the application running on the FPGA, defining the target
40	technology (SoC, MPSoC, ASIP, ...) and partitioning tasks between hardware and software according to the
41	technology choice. This exploration is driven mainly by throughput, latency and power consumption
42	criteria.
43	\item
44	Micro-architectural exploration: When hardware components are required, the HLS tools of the framework
45	generate them automatically. At this stage the framework provides various HLS tools, allowing the
46	exploration of the micro-architectural design space. The exploration criteria are also throughput, latency
47	and power consumption.
48	% FIXME
49	%CA At this stage, preliminary source-level transformations will be
50	%CA required to improve the efficiency of the target component.
51	%CA COACH will also provide such facilities, such as automatic parallelization
52	%CA and memory optimisation.
53	\item
54	Performance measurement: for each point of the design space, metrics are available,
55	such as throughput, latency, power consumption, area, memory allocation and data locality.
56	They are evaluated using virtual prototyping, estimation or analysis methodologies.
57	\item
58	Targeted hardware technology: the COACH description of a system is independent of the FPGA family.
59	Every point of the design space can be implemented on any FPGA that has the required resources.
60	As a baseline, COACH handles both Altera and Xilinx FPGA families.
61	\end{itemize}
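
The exploration criteria listed above pull in different directions, so the comparison of two design points can be made precise with a dominance relation. The following is only an illustrative formalisation, not a description of how COACH ranks candidates: the symbols $\Theta$ (throughput), $\Lambda$ (latency) and $P$ (power consumption) are notations introduced here for the sake of the example. A design point $d$ is said to dominate a design point $d'$ when
\[
  \Theta(d) \geq \Theta(d'), \qquad \Lambda(d) \leq \Lambda(d'), \qquad P(d) \leq P(d'),
\]
with at least one of the three inequalities being strict. Under this assumption, the exploration only needs to retain the non-dominated points, and the final choice among them depends on the constraints of the target application.
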
62	As an extension of embedded system design, COACH also deals with High Performance Computing (HPC).
63	In HPC, the targeted application is an existing one running on a PC. COACH helps the designer
64	to accelerate it by migrating critical parts into a SoC implemented on an FPGA plugged into the PC bus.
65	\par
66	COACH results from the will of several laboratories to unify their know-how and skills in the
67	following domains: operating systems and hardware communication (TIMA, SITI), SoC and MPSoC (LIP6 and TIMA),
68	ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various
69	domains into a single free framework (licence ...) that hides these domains and their
70	different tools from the user as much as possible.
71
72
73	\subsection{Economic context and interest}
74	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
75	% 1.1. ECONOMIC AND SOCIETAL CONTEXT AND STAKES
76	% (2 pages maximum)
77	% Describe the economic, social and regulatory context of the
78	% project, with an analysis of the social, economic, environmental and
79	% industrial stakes. If possible, give quantified arguments, for example the relevance and
80	% scope of the project with respect to economic demand (market analysis, trend
81	% analysis), competition analysis, cost-reduction indicators, market
82	% prospects (fields of application, ...). Indicators of environmental gains, life
83	% cycle.
84	\end{verbatim}
85	\end{scriptsize}
86	Microelectronics makes it possible to integrate complex functions into products, increasing their
87	commercial attractiveness and improving their competitiveness. The multimedia and communication
88	sectors have taken advantage of microelectronics thanks to the development of
89	design methodologies and tools for real-time embedded systems. Many other sectors could
90	benefit from microelectronics if these methodologies and tools were adapted to their needs.
91	The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are
92	very high: several billion euros for an IC factory and several million to fabricate
93	a specific circuit. Consequently, it is generally infeasible to design and fabricate ASICs in
94	low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of
95	degraded performance.
96	\\
97	Today, FPGAs are becoming important players in a computational domain that was originally dominated
98	by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed
99	on a per-application basis. At the same time, FPGAs offer significant performance benefits over
100	microprocessor implementations for a number of applications. Although these benefits are still
101	generally an order of magnitude smaller than those of equivalent ASIC implementations, the low cost
102	(500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive
103	choice for low-to-medium volume applications.
104	Since their introduction in the mid-1980s, FPGAs have evolved from a simple,
105	low-capacity gate array technology to devices (Altera STRATIX III, Xilinx Virtex V) that
106	provide a mix of coarse-grained data path units, memory blocks, microprocessor cores,
107	on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible to implement
108	complex systems such as multiprocessor platforms with application-dedicated coprocessors.
109	Using FPGAs limits the NRE costs to the design cost. This boosts the development of methodologies
110	and tools that automate design and reduce its cost.
111	\par
112	Nowadays, neither commercial nor free tools cover the whole design process.
113	For instance, with SOPC Builder from Altera, users can select and parameterize IP components
114	from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor
115	and bus interface cores, as well as incorporate their own IP. Designers can then generate
116	a synthesized netlist, simulation test bench and custom software library that reflect the hardware
117	configuration.
118	Nevertheless, SOPC Builder does not provide any facility to synthesize coprocessors or to
119	simulate the platform at a high level of abstraction (SystemC).
120	In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation
121	tool to implement designs on Altera devices (Stratix, Arria, Cyclone).
122	PICO [CITATION] and CATAPULT [CITATION] synthesize coprocessors from a C++ description.
123	Nevertheless, they can only deal with data-dominated applications and they do not handle the
124	platform level.
125	The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
126	Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
127	Designers can design and simulate a system using MATLAB and Simulink. The tool will then
128	automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx
129	pre-optimized algorithms.
130	However, this tool targets only DSP-based algorithms.
131	\\
132	Consequently, a designer developing an embedded system needs to master, for example,
133	SoCLib for design exploration,
134	SOPC Builder at the platform level,
135	PICO for synthesizing the data-dominated coprocessors
136	and Quartus for design implementation.
137	This requires a significant tool-interfacing effort and makes the design process very complex
138	and achievable only by designers skilled in various domains.
139	The COACH project integrates all these tools in the same framework and hides them from the user.
140	The objective is to allow \textbf{pure software} developers to build embedded systems.
141	\par
142	% ZIED: MARKET FIGURES, ASIC, EMBEDDED systems, HPC. Number and size of companies doing ES and HPC
143	The combination of a framework dedicated to software developers with FPGA targets allows small
144	and even very small companies to offer embedded systems and acceleration solutions for standard
145	software applications at acceptable prices.
146	Unlike ASIC-based solutions, this avoids huge hardware investments.
147	\\
148	This combination can thus open new markets
149	to small and even very small companies.
150	Among such markets are HPC (High Performance Computing) and embedded applications.
151	HPC consists in offering acceleration solutions for standard software applications at acceptable
152	prices, for example DNA sequence recognition or DBMS acceleration.
153	Embedded applications consist in implementing an application on a low-power standalone device,
154	for example distributed intelligent sensors.
155	\\
156	This new market may boom as micro-computing did in the 1980s. That success was due
157	to the low cost of the first micro-computers (compared to mainframes) and to the advent of high-level
158	programming languages, which allowed a large number of programmers to launch start-ups in software
159	engineering.
160
161	\subsection{Project position}
162	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
163	% 1.2. POSITIONING OF THE PROJECT
164	% (2 pages maximum)
165	% Specify:
166	% - positioning of the project with respect to the context developed above:
167	% with respect to competing, complementary or earlier projects and research,
168	% patents and standards.
169	% - positioning of the project with respect to the thematic axes of the call for proposals.
170	% - positioning of the project at the European and international levels.
171	\end{verbatim}
172	\end{scriptsize}
173	The aim of this project is to propose an open-source framework for architecture synthesis
174	targeting mainly field programmable gate array circuits (FPGA).
175	\\% LIP6/TIMA
176	To evaluate the different architectures, the project uses the prototyping platform
177	of the SoCLIB ANR project (2006-2009).
178	\\% IRISA
179	The project will also borrow from the ROMA ANR project (2007-2009) and the ongoing
180	joint INRIA-STMicro Nano2012 project. In particular we will adapt
181	existing pattern extraction algorithms and datapath merging techniques to the synthesis of customized
182	ASIP processors.
183	\par
184	%%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE
185	%%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT
186	%%% 1 -- European or ANR projects reused or continued
187	%%% 1 LIP6/TIMA/LAB-STIC OK
188	Regarding expertise in High Level Synthesis (HLS), the project builds on know-how acquired over 15 years
189	with the GAUT project developed at the Lab-STIC laboratory and the UGH project developed at the LIP6
190	and TIMA laboratories. \\
191	Regarding architecture synthesis, the project is based on know-how acquired over 10 years
192	with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6. \\
193	%%% 1 IRISA OK
194	Regarding Application-Specific Instruction-set Processor (ASIP) design, the CAIRN group at INRIA Bretagne
195	Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife
196	since 1996, and the Gecos compilers since 2002).
197	% LIP FIXME: A BIT LONG AND OFF-TOPIC
198	%CA% The source-level transformations required by the HLS tools will be
199	%CA% designed in the {\em polyhedral model}, a general framework
200	%CA% initiated by Paul Feautrier 20 years ago. The programs handled in
201	%CA% the polyhedral model are such that loop iterators describe a
202	%CA% polyhedron (hence the name). This includes most of the kernels used
203	%CA% in embedded applications. This property allows to design precise
204	%CA% analysis by means of integer programming techniques.
205	%CA% %communaute active & internationale
206	%CA% %transfert techno (Reservoir)
207	%CA% The polyhedral community is very active, and the technological
208	%CA% transfer has now started. Reservoir Labs inc., a company based in
209	%CA% New-York, is currently integrating the last polyhedral developments
210	%CA% in its commercial compiler.
211	%CA% %transfert techno (gcc)
212	%CA% Also, polyhedra are progressively migrating into the {\sc GNU Gcc}
213	%CA% compiler, via {\sc Graphite}, a module initially developed by
214	%CA% Sebastian Pop.
215	%CA% %outils existants
216	%CA% Several tools have been developed in the polyhedral community,
217	%CA% such as {\sc Piplib} (parameter integer programming library), and
218	%CA% {\sc Polylib}, a library providing set operations on polyhedra. Both
219	%CA% tools are almost mandatory in polyhedral tools, and have reached
220	%CA% a sufficient level of maturity to be considered as standard.
221	%syntol & bee ???
222	% FIN
223	% and on more than 15 years of experience on parallel hardware generation
224	% in the polyedral model in the CAIRN group (MMAlpha software
225	% developped in the group since 1996).
226	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
227	%%% 2 -- TO BE COMPLETED (SHORT)
228	%%% 2 -- For polyedric transformation and memory optimization ... LIP
229	%%% 2 -- For ASIP IRISA
230	%%% 2 -- For ... CITI
231	%%% 2 -- For ... TIMA
232	\par
233	The SoCLIB ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to
234	describe hardware architectures with a shared memory space and to deploy software
235	applications on them in order to evaluate their performance.
236	The heart of this platform is a library containing simulation models (in SystemC)
237	of hardware IP cores such as processors, buses, networks, memories and I/O controllers.
238	The platform also provides embedded operating systems and software/hardware
239	communication components that help implement applications quickly.
240	However, the synthesizable descriptions of the IPs have to be provided by users. \\
241	This project enhances SoCLib by providing synthesizable VHDL for standard IPs.
242	In addition, HLS tools such as UGH and GAUT automatically produce a synthesizable
243	description of an IP (coprocessor) from a sequential algorithm.
244	%\par
245	%%% 2 IRISA ?
246	%%% 2 ASIP tool such as ... IRISA
247	%%% 2 ...
248	%%% 2 Coach uses pattern extractions from ROMA
249	%\par
250	%%% 2 LIP ?
251	\par
252	The different points proposed in this project cover priorities defined by the commission
253	experts in the field of Information Society Technologies (IST) for embedded
254	systems: <<Concepts, methods and tools for designing systems dealing with system complexity
255	and allowing applications and various products to be deployed efficiently on embedded platforms,
256	considering resource constraints (delays, power, memory, etc.), security and quality of
257	service>>.
258	\\
259	Our team aims at covering all the steps of the design flow of architecture synthesis.
260	Our project overcomes the complexity of using various synthesis tools and description
261	languages required today to design architectures.
262
263	\section{Scientific and Technical Description}
264	\subsection{State of the art}
265	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
266	% 2. SCIENTIFIC AND TECHNICAL DESCRIPTION
267	% 2.1. STATE OF THE ART
268	% (3 pages maximum)
269	% Describe the scientific context and stakes of the project
270	% by presenting a national and international state of the art that summarises current
271	% knowledge on the subject. Highlight any preliminary results.
272	% Include the necessary bibliographic references in appendix 7.1.
273	\end{verbatim}
274	\end{scriptsize}
275	Our project covers several critical domains in system design in order
276	to achieve high performance computing. Starting from a high level description we aim
277	at generating automatically both hardware and software components of the system.
278
279	\subsubsection{High Performance Computing}
280	Accelerating high-performance computing (HPC) applications with field-programmable
281	gate arrays (FPGAs) can potentially improve performance.
282	However, using FPGAs presents significant challenges [1].
283	First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
284	Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
285	to the implementation quality [2].
286	Finally, high-performance computing programmers are a highly sophisticated but scarce
287	resource. Such programmers are expected to readily use new technology but lack the time
288	to learn a completely new skill such as logic design [3].
289	\\
290	HPC/FPGA hardware is only now emerging and is in its early commercial stages,
291	but the corresponding design techniques have not yet caught up.
292	Thus, much effort is required to develop design tools that translate high-level
293	language programs to FPGA configurations.
294
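
The sensitivity to implementation quality mentioned above can be illustrated with Amdahl's law. The notation is ours: $p$ denotes the fraction of the original execution time spent in the part migrated to the FPGA, and $s$ the speedup obtained on that part. The overall speedup is then
\[
  S(p, s) = \frac{1}{(1 - p) + \frac{p}{s}} .
\]
As a purely illustrative example (the figures are assumptions, not measurements), take $p = 0.9$: a coprocessor reaching $s = 10$ yields $S \approx 5.3$, whereas a less careful implementation reaching only $s = 2$ yields $S \approx 1.8$. Whatever the value of $s$, $S$ remains below $1/(1-p) = 10$, so both the fraction that is accelerated and the quality of its implementation matter.
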
295	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
296	[1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable
297	Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable
298	Systems and Algorithms, CSREA Press, 2006, pp. 11-20;
299	http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper.
300	pdf.
301	[2] D. Buell, Programming Reconfigurable Computers: Language
302	Lessons Learned, keynote address, Reconfigurable Systems
303	Summer Institute 2006, 12 July 2006; http://gladiator.
304	ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf
305	[3] T. Van Court et al., Achieving High Performance
306	with FPGA-Based Computing, Computer, vol. 40, no. 3,
307	pp. 50-57, Mar. 2007, doi:10.1109/MC.2007.79
308	\end{verbatim}
309	\end{scriptsize}
310
311	\subsubsection{System Synthesis}
312	Today, several solutions for system design are proposed and commercialized. The most common are
313	those provided by Altera and Xilinx to promote their FPGA devices.
314	\\
315	The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
316	Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
317	Designers can design and simulate a system using MATLAB and Simulink. The tool will then
318	automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx
319	pre-optimized algorithms.
320	However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and cannot handle a complete
321	SoC. Thus, it is not really a system synthesis tool.
322	\\
323	In contrast, SOPC Builder [CITATION] allows the designer to describe a system, to synthesize it,
324	to program it into a target FPGA and to upload a software application.
325	% FIXME(C2H from Altera, runs fast but uses a huge amount of resources)
326	Nevertheless, SOPC Builder does not provide any facility to synthesize coprocessors.
327	Users have to provide the synthesizable description together with a suitable bus interface.
328	\\
329	In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes
330	its own IPs, which are not interchangeable.
331	We can conclude that the existing commercial or free tools do not cover the whole system
332	synthesis process in a fully automatic way. Moreover, they are bound to a particular device family
333	and to a particular IP library.
334
335	\subsubsection{High Level Synthesis}
336	High Level Synthesis translates a sequential algorithmic description and a set of constraints
337	(area, power, frequency, ...) into a micro-architecture at Register Transfer Level (RTL).
338	Several academic and commercial tools are available today.
339	The most common tools are SPARK [HLS1], GAUT [HLS2] and UGH [HLS3] in the academic world,
340	and catapultC [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world.
341	Despite their maturity, their usage is restricted by the following limitations:
342	\begin{itemize}
343	\item They do not accurately meet the frequency constraint when they target an FPGA device.
344	The error is about 10 percent. This is a problem when the generated component is integrated
345	in a SoC, since it slows down the whole system.
346	\item These tools take only one or a few constraints into account simultaneously, whereas realistic
347	designs are multi-constrained.
348	Moreover, a low power consumption constraint is mandatory for embedded systems.
349	However, it is not yet well handled by common synthesis tools.
350	\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
351	the amount of required memory, the user must rewrite it, although techniques such as polyhedral
352	transformations exist to increase the intrinsic parallelism.
353	\item Although they share the same input language (C/C++), they are sensitive to the style in
354	which the algorithm is written. Consequently, engineering work is required to switch from
355	one tool to another.
356	\item The HLS tools are not integrated into an architecture and system exploration tool.
357	Thus, a designer who needs to accelerate a software part of the system must manually adapt it
358	to the HLS input dialect and perform engineering work to exploit the synthesis result
359	at the system level.
360	\end{itemize}
361	Given these limitations, it is necessary to create a new generation of tools that reduces the gap
362	between the specification of a heterogeneous system and its hardware implementation.
363
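
The impact of the first limitation can be estimated with a simple calculation, under two assumptions that are ours: the SoC uses a single clock shared by all components, and a 10 percent error means that the generated component only reaches 90 percent of the requested frequency $f_{\mathrm{target}}$. The system clock is bounded by its slowest component,
\[
  f_{\mathrm{sys}} = \min_i f_i = 0.9\, f_{\mathrm{target}},
  \qquad
  \frac{T_{\mathrm{sys}}}{T_{\mathrm{target}}} = \frac{f_{\mathrm{target}}}{f_{\mathrm{sys}}} = \frac{1}{0.9} \approx 1.11,
\]
so every component of the platform, including the processors, runs with a clock period about 11 percent longer than planned.
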
364	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
365	[HLS1] SPARK, University of California, San Diego
366	[HLS2] GAUT, UBS/Lab-STIC
367	[HLS3] UGH, LIP6/TIMA
368	[HLS4] catapultC, Mentor
369	[HLS5] PICO, Synfora
370	[HLS6] Cynthesizer, Forte Design Systems
371	\end{verbatim}
372	\end{scriptsize}
373
374	\subsubsection{Application Specific Instruction Processors}
375	ASIPs (Application-Specific Instruction-Set Processors) are programmable
376	processors in which both the instruction set and the micro-architecture have
377	been tailored to a given application domain (e.g. video processing), or
378	in some extreme cases to a specific application (e.g. an H.264-specific ASIP).
379	This processor specialization usually offers a good compromise between
380	performance (compared to a pure software implementation on a COTS
381	embedded processor) and flexibility (compared to an application-specific
382	hardware coprocessor).
383	\\
384	As a consequence, this type of architecture is a very attractive choice
385	as a System on Chip building block. In spite of their obvious
386	advantages, using and designing ASIPs remains a difficult task, since it
387	involves both designing an efficient micro-architecture and implementing
388	an efficient compiler for this
389	specific micro-architecture.
390	\\
391	Recently, the use of instruction set extensions has received a lot of
392	interest from the embedded systems design community [NIOS2,FSL,ST70],
393	since it makes it possible to rely on a template micro-architecture in which only a
394	small fraction of the architecture has to be specialized. Even if such
395	an approach offers less flexibility and forbids very tight coupling
396	between the extensions and the template micro-architecture, it makes the
397	design of the micro-architecture more tractable and amenable to a fully
398	automated flow.
399	\\
400	However, to our knowledge, there is still no available open-source
401	design flow addressing those two design challenges together, either
402	because the target architecture is proprietary, or because the compiler
403	technology is closed/commercial.
404	\\
405	In the context of the COACH project, we propose to add to the
406	infrastructure a design flow dedicated to automatic instruction set
407	extension for the MIPS-based CPU, which will complement or offer an
408	alternative to the other proposed approaches (hardware accelerators,
409	multiprocessors).
410
411	\subsubsection{Automatic Parallelization}
412	\begin{Large}\begin{verbatim}
413	-- TO BE COMPLETED BY LIP
414	\end{verbatim}
415	\end{Large}
416	%CA% Parallel machines are often difficult and painful to program
417	%CA% directly, and one would like the compiler to %do the job, that is to
418	%CA% turn automatically a sequential program into a parallel form. This
419	%CA% transformation is referred as {\em automatic parallelization}, and has
420	%CA% been widely addressed since the 70s. Automatic parallelization
421	%CA% relies on data dependences, which cannot be computed in general.%, as
422	%CA% %one cannot predict at compile time the variable values on a given
423	%CA% %execution point.
424	%CA% This negative result led researchers to (i) find a
425	%CA% program model in which no approximation is needed (ie polyhedral
426	%CA% model), (ii) make conservative approximations (iii) remark that
427	%CA% variable values are known at runtime, and make the decisions during
428	%CA% program execution. The latter approach is obviously not suitable
429	%CA% there, as we target hardware generation. We will give there a short
430	%CA% history of the approaches that fall in the first category.
431	%CA%
432	%CA%% In the real world, we deal with a limited amount of processors,
433	%CA%% and the communication between processors takes time, and is
434	%CA%% critical for performance. %Whenever we have synchronisation-free
435	%CA%% parallelism, like for embarrassingly parallel kernels, this is not an
436	%CA%% issue. But in case of pipelined parallelism, we need to reduce
437	%CA%% communications as much as possible.
438	%CA%% So we also need to find parallelism toghether with a proper mapping
439	%CA%% of operations and data on physical processors.
440	%CA%
441	%CA% As programs spend most of there time in loops, the community has
442	%CA% focused on loop transformations that reveal parallelism.
443	%CA%%unimodulaire
444	%CA% The first approaches worked on perfect loop nests, where the tree
445	%CA% formed by the nested loops is linear. In this program model, the
446	%CA% loops can be seen as a basis that drive the way the iteration
447	%CA% domain will be described. Hence, a first idea was to change this
448	%CA% basis such that one vector (one loop) at least is parallel. To ease
449	%CA% the code generation, the area of defined by the news vectors must
450	%CA% be a unit volume. %Otherwise, one would produce an homothetic
451	%CA%% expansion of the iteration domain, which will force to put modulos
452	%CA%% in the target code.
453	%CA% For this reason, these transformations are called {\em unimodular
454	%CA% transformations}.
455	%CA%%tiling
456	%CA%
457	%CA% The next approaches include {\em loop tiling}, a simple
458	%CA% partitioning of the iteration domain, whose initial purpose is to
459	%CA% execute every partition on a different processor. %In the same way,
460	%CA% The execution order is modified with a proper unimodular
461	%CA% transformation, then the tiles are obtained by cutting the
462	%CA% iteration domain with the hyperplanes directed by every vector of
463	%CA% the new (unimodular) basis, at regular intervals. When the tiling
464	%CA% hyperplanes are properly chosen, we can both improve data-locality
465	%CA% on every processor, and reduce the communication between two
466	%CA% different tiles (which will be mapped on processors). This last
467	%CA% property implying that one tend to find a degree of parallelism as
468	%CA% great as possible.
469	%CA%
470	%CA%%affine scheduling
471	%CA% The previous approaches were restricted to kernels with perfect
472	%CA% loop nests (linear loop tree), and unimodular transformations. The
473	%CA% last generation of approaches broke with these limitations. We now
474	%CA% choose a different basis for every assignment, without the
475	%CA% unimodularity restriction. A dual way to present the things is the
476	%CA% notion of {\em affine schedule}, introduced by Feautrier [part1],
477	%CA% that simply assigns an abstract execution date to every assignment
478	%CA% execution. As an assignment execution is exactly characterised by
479	%CA% the current value of the loops counters (iteration vector), the
480	%CA% affine schedule will be defined as an affine form of the iteration
481	%CA% vector (hence the 'affine'). The affine property allows to use
482	%CA% integer programming techniques to compute the schedule. With this
483	%CA% approach, additional techniques are required to allocate the
484	%CA% parallel operations and the data to processor in an efficient way
485	%CA% [griebl, feautrier].
486	%CA%
487	%CA%%modularity??
488	%CA%%% As loop nests are no longer perfect, we deal with (transformed)
489	%CA%%% iteration domains of different dimensions, which can possibly (and
490	%CA%%% certainly) overlap. At this point, a new code generation technique
491	%CA%%% was needed. The first attempt is due to Chamsky et al. [??], and
492	%CA%%% was improved by Quillere et al. [QRW]. The code is now implemented
493	%CA%%% in an efficient tool [cloog], that gave a new life to polyhedral
494	%CA%%% techniques.
495	%CA%
496	%CA%%pluto's tiling
497	%CA% The tiling techniques were extended to non-perfect loop nest with
498	%CA% {\em affine partitioning}. Affine partitioning is to affine
499	%CA% scheduling what (original) tiling was to unimodular
500	%CA% transformations. An affine partitioning assigns to every assignment
501	%CA% its coordinates in the basis defined by the normals to the tiling
502	%CA% hyperplanes. Recently, a way to compute efficient hyperplanes were
503	%CA% found [uday], with a good data locality, and communications
504	%CA% confined in a small neighborhood around every processor.
505	%CA%
506	%CA%\subsubsection{Source-level Memory Optimisation}
507	%CA% The HLS process allows to customise memory, which impacts on final
508	%CA% circuit size and power consumption. Though most HLS tools already
509	%CA% try to optimise memory usage, it is better to provide an independent
510	%CA% source-level pass, that could be reused for different tools and in
511	%CA% other contexts.
512	%CA%
513	%CA% There exist many approaches to evaluate and reduce the memory
514	%CA% requirement of a program. The first approaches are concerned with
515	%CA% {\em memory size estimation}, where the memory size is the maximum
516	%CA% number of memory cells used at the same time [clauss,zhao]. These
517	%CA% approaches provide an estimation as a symbolic expression of program
518	%CA% parameters, which can be used further to guide loop optimisations.
519	%CA% However, no explicit way to reduce the memory size is given. {\em
520	%CA% Intra-array reuse} approaches break with this limitation, and
521	%CA% collapse the array cells which are not live at the same time. The
522	%CA% collapse is done by means of a data layout transformation, specified
523	%CA% with a linear (modular) mapping. The first approaches were
524	%CA% developed at IMEC [balasa,catthoor], and basically try to linearize
525	%CA% the arrays and fold them using a modulo operator. Then, Lefebvre et
526	%CA% al. propose a solution to fold independently the array dimensions
527	%CA% [lefebvre]. Finally, Darte et al. provide a general formalisation of
528	%CA% the problem, together with a solution that subsumes the previous
529	%CA% approaches [darte]. A first implementation was made with the tool
530	%CA% {\sc Bee}, but there are still many limitations.
531	%CA%
532	%CA% \begin{itemize}
533	%CA% \item The tool is restricted to regular programs, whereas more
534	%CA% general programs could be handled with a conservative array liveness
535	%CA% analysis.
536	%CA%
537	%CA% \item Programs depending on parameters (inputs) are not handled,
538	%CA% which prevents handling, for example, the body of tiled loops.
539	%CA%
540	%CA% \item The new array layout can break spatial locality, and thus impact
541	%CA% performance and power consumption. One would like to get a mapping
542	%CA% that improves or, at least, preserves the spatial locality of the
543	%CA% program.
544	%CA%
545	%CA% \item Finally, the final memory compaction strongly depends on the
546	%CA% program schedule, and is naturally hindered by the
547	%CA% parallelism. Consequently, there is a trade-off to find with
548	%CA% automatic parallelization. An ideal solution would be to reduce
549	%CA% memory usage, while preserving parallelism.
550	%CA% \end{itemize}
551
552	\subsubsection{Interfaces}
553	\begin{Large}\begin{verbatim}
554	-- TO BE COMPLETED BY INSA: state of the art
555	\end{verbatim}
556	\end{Large}
557	%
558	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
559	\subsection{Objectives and innovation aspects}
560	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
561	% 2.2. OBJECTIVES AND AMBITIOUS/INNOVATIVE NATURE OF THE PROJECT
562	% (2 pages maximum)
563	% Describe the scientific/technical objectives of the project.
564	% Present the expected scientific advance. State the originality and the
565	% ambitious nature of the project.
566	% Detail the scientific and technical barriers to be overcome by the project.
567	% Where relevant, describe the final product(s) developed by the end of the project,
568	% showing the innovative nature of the project.
569	% Present the expected results, proposing where possible success and
570	% evaluation criteria suited to the type of project, so that the results can be
571	% assessed at the end of the project.
572	% Where applicable (programmes requiring multidisciplinarity), show how the
573	% scientific disciplines fit together.
574	\end{verbatim}
575	\end{scriptsize}
576
577	% scientific/technical objectives of the project.
578	The objective of the COACH project is to develop a complete framework for
579	HPC (accelerating existing software applications)
580	and embedded applications (implementing an application on a low-power standalone device).
581	The design steps are presented in figure~\ref{coach-flow}.
582	\begin{figure}[hbtp]\leavevmode\center
583	\includegraphics[width=.8\linewidth]{anr-2010}
584	\caption{\label{coach-flow} COACH flow.}
585	\end{figure}
586	\begin{description}
587	\item[HPC setup] Here the user splits the application into 2 parts: the host application,
588	which remains on the PC, and the SoC application, which migrates to the SoC.
589	The framework provides a simulation model for evaluating the partitioning.
590	\item[SoC design] In this phase,
591	the user can obtain simulators of the SoC at different abstraction levels by giving the COACH framework
592	a SoC description.
593	This description consists of a process network corresponding to the SoC application,
594	an OS, an instance of a generic hardware platform
595	and a mapping of processes onto the platform components. The supported mappings are
596	software (the process runs on a SoC processor),
597	XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
598	and hardware (the process runs in a coprocessor generated by HLS and plugged into the SoC bus).
599	\item[Application compilation] Once the SoC description is validated, COACH automatically generates
600	an FPGA bitstream containing the hardware platform with the SoC application software, and
601	an executable containing the host application. The user can launch the application by
602	loading the bitstream onto the FPGA and running the executable on the PC.
603	\end{description}
604
605	% the expected scientific advance. State the originality and the
606	% ambitious nature of the project.
607	The main scientific contribution of the project is to unify various synthesis techniques
608	(same input and output formats), allowing the user to switch from one to another
609	without engineering effort and even to chain them, for example to run polyhedral transformations
610	before synthesis.
611	Another advantage of this framework is that it provides different abstraction levels from
612	a single description.
613	Finally, this description is independent of the device family and its hardware implementation
614	is automatically generated.
615
616	% Detail the scientific and technical barriers to be overcome by the project.
617	System design is a very complicated task and in this project we try to simplify it
618	as much as possible. For this purpose we have to deal with the following scientific
619	and technological barriers.
620	\begin{itemize}
621	\item The main problem in HPC is the communication between the PC and the SoC.
622	This problem has 2 aspects. The first is efficiency. The second is
623	eliminating the engineering effort needed to implement it at different abstraction levels.
624	\item The COACH design flow has a top-down approach. In such a case,
625	the required performance figures of a coprocessor (run frequency, maximum cycles for
626	a given computation, power consumption, etc.) are imposed by the other system
627	components. The challenge is to allow the user to control the synthesis
628	process accurately. For instance, the run frequency must not be a result of the RTL synthesis
629	but a strict synthesis constraint.
630	\item HLS tools are sensitive to the style in which the algorithm is written.
631	In addition, they are not integrated into an architecture and system
632	exploration tool.
633	Consequently, engineering work is required to switch from one tool to another,
634	to integrate the resulting simulation model into an architectural exploration tool
635	and to synthesize the generated RTL description.
636	%CA Additional preprocessing and source-level transformations are thus
637	%CA required to improve the process.
638	%CA In particular, this includes parallelism exposure and efficient memory mapping.
639	\item Most HLS tools translate a sequential algorithm into a coprocessor
640	containing a single data-path and finite state machine (FSM). In this way,
641	only the fine-grained parallelism is exploited (instruction-level parallelism, ILP).
642	The challenge is to identify the coarse-grained parallelism and to generate,
643	from a sequential algorithm, a coprocessor containing multiple communicating
644	tasks (data-paths and FSMs).
645	\end{itemize}
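
As a worked illustration of the top-down constraint discussed in the list above, the
performance imposed on a coprocessor can be read as a timing budget. The symbols $f$
(run frequency fixed by the system), $N$ (cycles taken by the synthesized coprocessor for
a given computation) and $T_{\max}$ (latency allowed by the other components), as well as
the numerical values, are assumptions introduced for this sketch and are not project figures:
\[
  \frac{N}{f} \le T_{\max}
  \quad\Longleftrightarrow\quad
  N \le N_{\max} = f \, T_{\max} .
\]
For instance, with $f = 100\,\mathrm{MHz}$ and $T_{\max} = 10\,\mu\mathrm{s}$, the coprocessor
has a budget of $N_{\max} = 10^{8} \times 10^{-5} = 1000$ cycles. Treating $f$ as a strict
constraint means that the synthesis may only trade cycles against area and power within this
budget; it may not lower the clock frequency to meet timing.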
646
647	% Present the expected results, proposing where possible success and
648	% evaluation criteria suited to the type of project, so that the results can be
649	% assessed at the end of the project.
650	The main result is the framework. Concretely, it is composed of:
651	2 HPC communication schemes with their implementation,
652	5 HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
653	memory optimisation HLS and ASIP),
654	3 SystemC-based virtual prototyping environments extended with synthesizable
655	RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
656	one design space exploration tool,
657	one operating system (OS).
658	\\
659	The framework functionality will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2
660	and XXX-EXAMPLE3 on 4 architectures (generic/XILINX, generic/ALTERA,
661	proprietary/XILINX, proprietary/ALTERA).
662
663	%% \section{}
664	%% %3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
665	%% \subsection{}
666	%% %3.1. SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
667	%% %(2 pages maximum)
668	%% %Present the scientific programme and justify the breakdown of the work
669	%% %programme into tasks, consistent with the objectives pursued.
670	%% %Use a diagram to show the links between the different tasks
671	%% %(technical organisation chart)
672	%% %The tasks represent the major phases of the project. Their number is limited.
673	%% %Do not forget the activities and actions related to dissemination and
674	%% %exploitation.
675	%%
676	%% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
677	%% %IN THE BACKGROUND ? )
678	%% \subsection{}
679	%% %3.2. PROJECT MANAGEMENT
680	%% %(2 pages maximum)
681	%% %Specify the organisational aspects of the project and the coordination arrangements
682	%% %(if possible, a separate coordination task: cf. task 0 of submission
683	%% %document A).
684	%% \subsection{}
685	%% %3.3. DESCRIPTION OF THE WORK BY TASK
686	%% %(ideally 1 or 2 pages per task)
687	%% %For each task, describe:
688	%% %- the objectives of the task and any success indicators,
689	%% %- the task leader and the partners involved (this may be
690	%% %shown graphically),
691	%% %- the detailed work programme for the task,
692	%% %- the deliverables of the task,
693	%% %- the contributions of the partners ("who does what"),
694	%% %- the description of the methods and technical choices and of how
695	%% %the solutions will be provided,
696	%% %- the risks of the task and the fallback solutions envisaged.
697
698
699
700
701
702
