
Course "Architecture of Multi-Processor Systems"

TP7: DMA Controller

(franck.wajsburt@…)

A. Objectives

The aim of this tutorial is to analyse the operation of a more complex device than those analysed in TP6. A device with DMA (Direct Memory Access) capability behaves both as a master capable of reading or writing directly to memory, and as a target capable - like any other device - of receiving commands from the operating system.

We use the same architecture as in TP5 and TP6, but we instantiate a single processor, and we activate the DMA controller. We will use caches with a capacity of 2 Kbytes (lines of 16 bytes, 4-way associative, 32 sets).
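As a quick check of these cache parameters, the sketch below recomputes the capacity and the width of each address field. It assumes 32-bit addresses; the function names are illustrative only and do not come from the course files.

```c
#include <assert.h>

/* Cache parameters quoted in the text. */
enum { LINE_BYTES = 16, WAYS = 4, SETS = 32 };

/* Total data capacity in bytes: line size x associativity x number of sets. */
unsigned cache_capacity_bytes(void)
{
    return LINE_BYTES * WAYS * SETS;
}

/* log2 of a power of two (helper used for the field widths). */
unsigned log2u(unsigned x)
{
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Address split, assuming 32-bit addresses: tag | set index | byte offset. */
unsigned offset_bits(void) { return log2u(LINE_BYTES); }
unsigned index_bits(void)  { return log2u(SETS); }
unsigned tag_bits(void)    { return 32 - index_bits() - offset_bits(); }
```

With these values, cache_capacity_bytes() gives 2048 bytes, i.e. the 2 Kbytes stated above.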

B. DMA controller

The PibusDma component is able to address memory directly (read and write), to move data from one area of memory to another. The DMA coprocessor therefore behaves like a target, since it must be configured by the operating system to start the transfer, but it also behaves like a master, since it is able to initiate transactions on the bus to read or write to memory.

So we will have two masters that can work in parallel in this architecture: the MIPS32 processor and the DMA coprocessor. We wish to analyse the mechanisms - hardware and software - allowing the cooperation between the two parallel processes, i.e. the program (software) running on the processor, and the DMA coprocessor automaton (hardware) performing the transfer. The general mechanism is as follows:

  • The system software running on the MIPS32 processor configures the DMA coprocessor and starts the transfer by writing the transfer parameters to the "memory-mapped" registers of the DMA controller: the base address of the source buffer, the base address of the destination buffer, and finally the number of words to be transferred. Once this has been done, the program that ordered the transfer continues its execution.
  • The DMA coprocessor performs the transfer by building bursts of fixed length. It executes as many pairs of transactions as necessary: each pair consists of a burst read of a packet in the source buffer, followed by a burst write of that packet to the destination buffer. The length of the burst is defined by the internal storage capacity of the DMA coprocessor. The term "coprocessor" is used because the DMA controller and the processor work in parallel during this transfer.
  • The transfer time is highly variable, as it depends on both the bus load and the number of bytes to be transferred. When the transfer is complete, the DMA controller signals it to the operating system by activating an interrupt. This end-of-transfer signal is necessary to allow the program that initiated the transfer to re-use the relevant memory buffers (source and destination).
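The three steps above can be illustrated with a purely software model of the device. This is only a sketch: the structure fields, the function names and the burst length of 4 words are assumptions made for the example, and the real register map is the one documented in pibus_dma.h. The model also assumes that the transfer length is a multiple of the burst length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST 4   /* assumed burst length, in 32-bit words */

/* Hypothetical view of the DMA model (not the real register names). */
typedef struct {
    const uint32_t *src;    /* base address of the source buffer */
    uint32_t       *dst;    /* base address of the destination buffer */
    size_t          nwords; /* words still to be transferred */
    int             irq;    /* 1 once the end-of-transfer interrupt is raised */
    unsigned        pairs;  /* number of read/write burst pairs executed */
} dma_model_t;

/* Step 1: the software writes the transfer parameters. */
void dma_start(dma_model_t *d, const uint32_t *src, uint32_t *dst, size_t nwords)
{
    assert(nwords > 0 && nwords % BURST == 0);   /* model restriction */
    d->src = src;
    d->dst = dst;
    d->nwords = nwords;
    d->irq = 0;
    d->pairs = 0;
}

/* Step 2: one burst read into the internal buffer, then one burst write.
   Returns 0 when there is nothing left to do. */
int dma_step(dma_model_t *d)
{
    uint32_t buf[BURST];   /* internal storage of the device */
    size_t i;
    if (d->nwords == 0) return 0;
    for (i = 0; i < BURST; i++) buf[i] = d->src[i];   /* burst read  */
    for (i = 0; i < BURST; i++) d->dst[i] = buf[i];   /* burst write */
    d->src += BURST;
    d->dst += BURST;
    d->nwords -= BURST;
    d->pairs++;
    if (d->nwords == 0) d->irq = 1;   /* step 3: signal the end of transfer */
    return 1;
}

/* Run the model until the whole transfer is done. */
void dma_run(dma_model_t *d)
{
    while (dma_step(d)) { /* the processor would run in parallel here */ }
}
```

In this model, a transfer of 12 words with 4-word bursts takes three read/write pairs before the interrupt flag is set. Reducing BURST increases the number of pairs, hence the number of bus transactions.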

Read the functional specification of the component PibusDma that you will find in the header of the file pibus_dma.h to answer the following questions.

Question B1: What are the addressable registers of the DMA controller, and what is the effect of a read or write to each of these registers? Why must the base address of the segment associated with the DMA controller as a target be aligned to a 32-byte block boundary?

Question B2: What is the meaning of the burst argument of the PibusDma component constructor?

Question B3: Why are two automata (MASTER_FSM and TARGET_FSM) needed to control the DMA coprocessor?

Question B4: This hardware component obviously contains other registers than the 5 addressable registers. By analysing the SystemC model contained in the file pibus_dma.cpp, describe precisely the function of the r_stop flip-flop.

Question B5: Complete the graph below representing the transition function of the MASTER_FSM automaton of the PibusDma component. Caution, the graph below is incomplete.

  • Two transitions must be added to take into account the software reset mechanism (writing to the RESET register): from the state (READ_DT) to the state (IDLE), labelled "V", and from the state (WRITE_DT) to the state (IDLE), labelled "W".
  • Two transitions must be added to account for single-word bursts: from state (READ_AD) to state (READ_DT), labelled "X", and from state (WRITE_AD) to state (WRITE_DT), labelled "Y".

The signals to be used as inputs are as follows:

  • STOP: internal signal (r_stop flip-flop) indicating no request or software reset
  • GNT: signal from the bus indicating that bus access is authorized
  • LAST: internal signal indicating the last word of a burst request
  • ACK: signal from the bus indicating whether the request has been taken into account (possible values: READY, WAIT, ERROR)
  • END: internal signal indicating that this is the last request for the requested copy.

C. Hardware Architecture

The archive multi_tp7.tgz contains the files you will need. Create a working directory tp7, copy the files tp5_top.cpp and tp5.desc into this directory, then unpack the archive. As usual, you will find the embedded software in the soft directory, which you copy into tp7.

By analysing the SystemC code describing the generic architecture (file tp5_top.cpp), answer the following questions:

Question C1: What is the default length of a burst (in number of 32-bit words)? What is the advantage of using large bursts? What is the hardware consequence of increasing the burst length?

Question C2: What is the base address of the segment associated with the DMA device? What is its target number for the BCU component? Since the DMA device is also a master on the bus, it is connected to the BCU component by the REQ_DMA and GNT_DMA signals. What is its "master number" for the BCU? To which input port of the ICU component is the IRQ interrupt line controlled by the DMA connected?

D. Software Application

Go to the soft directory.

The program defined in the file main_dma.c is an extension of the program used in TP5: it displays successively on the graphic screen a series of images, which are checkerboards whose number of cells varies from one image to the next. At each iteration, the program builds a new image, which it stores in the BUF buffer before using the fb_sync_write() system call to display this image.
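For readers who do not have the TP5 program at hand, here is a minimal sketch of how one checkerboard image could be built in a buffer. The image size (128 x 128), the one-byte-per-pixel format and the function name are assumptions; the code in main_dma.c may be organised differently.

```c
#include <assert.h>

#define NPIXEL 128   /* assumed image width, in pixels */
#define NLINE  128   /* assumed image height, in lines */

unsigned char BUF[NLINE * NPIXEL];   /* one byte per pixel (assumption) */

/* Fill buf with a checkerboard whose square cells are `step` pixels wide. */
void build_checkerboard(unsigned char *buf, unsigned step)
{
    unsigned line, pixel;
    assert(step > 0);
    for (line = 0; line < NLINE; line++) {
        for (pixel = 0; pixel < NPIXEL; pixel++) {
            /* cells alternate between black (0) and white (255) */
            unsigned odd = ((line / step) + (pixel / step)) & 1;
            buf[line * NPIXEL + pixel] = odd ? 255 : 0;
        }
    }
}
```

Calling build_checkerboard(BUF, step) with a different step at each iteration changes the number of cells from one image to the next.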

Question D1: The fb_sync_write() system call does not use the DMA coprocessor. Which hardware component performs the transfer of the image pixels between the memory buffer in user space and the video memory (frame buffer)? Explain why this system call is blocking. The answer can be found in the stdio.c and drivers.c files.

Question D2: Compile and run on the virtual prototype this first software application not using the DMA controller. How long does it take to build an image (time to fill the buffer)? What is the display time?

The DMA controller is usually used when large amounts of data need to be transferred, such as copying an image from a memory buffer (in user space) to graphics memory (located in protected system space). The system calls fb_write() and fb_completed() are used.

Question D3: What is the difference between the fb_sync_write() system call and the fb_write() system call? What is the use of the fb_completed() system call?

Modify the program contained in the file main_dma.c to replace the system call fb_sync_write() with the pair fb_write() / fb_completed(). Check in the reset.s file that the ISR _isr_dma is initialized in the interrupt vector, and that the mask register of the ICU component only allows the DMA IRQ to pass. Compile, then run the virtual prototype.

Question D4: How long does it take to display an image with the DMA?

Since we have two hardware components capable of running in parallel, it is tempting to parallelize the build and display phases to further reduce processing time. For example, we can try to build image (n + 1) while image (n) is being displayed.

Restart the execution after deleting the fb_completed() system call in the main_dma.c file, so that you don't have to wait for the end of the display of image (n) to start building image (n + 1). To highlight the problem, the length of the DMA controller's bursts must be reduced to a single 32-bit word, to deliberately slow down the display (using the DMABURST parameter on the simulator command line).

Question D5: What defect do you observe on the left edge of the displayed image? Explain precisely the cause of this malfunction.

To synchronize the user program and the DMA controller, we use _dma_busy, a global variable of the operating system that behaves like a SET/RESET flip-flop: it is set to 1 by the program that gives the display order, and reset to 0 by the DMA coprocessor when the transfer is finished.
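The SET/RESET behaviour can be sketched with a small software model. Everything here is hypothetical: the names fb_write_sketch(), fb_completed_sketch(), isr_dma_sketch() and hw_tick() are invented for the example, the hardware is replaced by a counter, and the real code in drivers.c and irq_handler.c should be read to see what the system actually does.

```c
#include <assert.h>

volatile int dma_busy = 0;   /* stand-in for the kernel variable _dma_busy */
int dma_words_left = 0;      /* simulated hardware: words still to copy */

/* RESET side: in this model, executed when the transfer has finished. */
void isr_dma_sketch(void)
{
    dma_busy = 0;
}

/* One simulated hardware cycle; signals the end after the last word. */
void hw_tick(void)
{
    if (dma_words_left > 0 && --dma_words_left == 0)
        isr_dma_sketch();
}

/* SET side: start a transfer and return without waiting for its end. */
int fb_write_sketch(int nwords)
{
    assert(nwords > 0);
    while (dma_busy) hw_tick();   /* wait until the device is free */
    dma_busy = 1;
    dma_words_left = nwords;      /* stands for programming the device */
    return 0;
}

/* Blocking wait for the end of the transfer in progress. */
int fb_completed_sketch(void)
{
    while (dma_busy) hw_tick();   /* on real hardware the DMA runs by itself */
    return 0;
}
```

In this model, the flag is 1 from the return of fb_write_sketch() until the simulated transfer ends, so a program that skips fb_completed_sketch() has no way of knowing when the buffer can be reused.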

Question D6: How is this variable used by the two system calls fb_write() and fb_completed()? In which function is the code for setting the variable _dma_busy to 1? In which function is the code for setting it to 0? In which segment is this variable stored?

E. Software Pipeline

The difficulties analysed in the previous section are caused by concurrent accesses to the BUF buffer, which is the communication channel between the build task executed by the processor (producer) and the display task executed by the DMA coprocessor (consumer).

To parallelize these two tasks, two buffers BUF1 and BUF2 must be used alternately, allowing the producer task to write to buffer BUF2 while the consumer task reads from buffer BUF1, and vice versa. This allows a software pipeline mechanism to be implemented as described below:

        Period 1   Period 2    Period 3    Period 4    Period 5    Period 6
PROC    Built[1]   Built[2]    Built[3]    Built[4]    Built[5]    -
DMA     -          Display[1]  Display[2]  Display[3]  Display[4]  Display[5]

The general operation is as follows:

  • During odd periods (2i + 1):
    1. Build image (2i + 1) in BUF1.
    2. Display image (2i) stored in BUF2 (if i > 0).
  • During even periods (2i):
    1. Build image (2i) in BUF2 (if i < 3).
    2. Display image (2i - 1) stored in BUF1.

Because the two memory buffers are used alternately for odd and even images, the same buffer is never read and written at the same time.
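The absence of conflict can be checked mechanically. The sketch below only models the schedule (which image is built or displayed in each period, and in which buffer); it is not the program main_pipe.c, and all function names are invented for the example.

```c
#include <assert.h>

#define NIMAGES 5   /* number of images in the timeline above */

/* Buffer holding image n (1-based): odd images in BUF1, even ones in BUF2. */
int buffer_of(int image)
{
    assert(image >= 1);
    return (image & 1) ? 1 : 2;
}

/* Image built during period p (1-based), or 0 if the processor is idle. */
int built_in(int period)
{
    return (period >= 1 && period <= NIMAGES) ? period : 0;
}

/* Image displayed during period p, or 0 if the DMA is idle. */
int displayed_in(int period)
{
    return (period >= 2 && period <= NIMAGES + 1) ? period - 1 : 0;
}

/* Returns 1 if no period builds and displays in the same buffer. */
int schedule_is_conflict_free(void)
{
    int p;
    for (p = 1; p <= NIMAGES + 1; p++) {
        int b = built_in(p);
        int d = displayed_in(p);
        if (b && d && buffer_of(b) == buffer_of(d)) return 0;
    }
    return 1;
}
```

In every period where both units are active, the image being built and the image being displayed have consecutive numbers, so they have opposite parity and live in different buffers.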

Write a program main_pipe.c that implements the above software pipeline. This program should be organised in three phases:

  • initial loading of the pipeline, called prologue (period 1)
  • traversing the pipeline (periods 2, 3, 4, 5)
  • emptying the pipeline, called epilogue (period 6)

Question E1: The pipeline must be synchronised. Which condition must be tested by the software to go from period (n) to period (n + 1)?

Question E2: What is the gain (in number of cycles) brought by the pipeline parallelism, compared to a sequential execution? How do you interpret this result?

F. Error handling

We are now interested in error handling and reporting.

The DMA controller is a master that can directly address the memory (read and write), but this master takes no initiative. It only executes transfers that have been defined by a user program, and the programmer can make mistakes in the source or destination buffer addresses.

Question F1: Why does the operating system prohibit the source buffer address (in the case of the fb_write() system call) or the destination buffer address (in the case of the fb_read() system call) from belonging to the protected area of the addressable space? Why must this type of error be detected before the DMA controller starts transferring?

Another cause of error is the use of addresses that do not correspond to any defined segment. This error is detected when the DMA controller performs the transfer, and receives a "bus error" response from the memory controller in response to its read or write command. A programmable processor such as the MIPS32 that receives a "bus error" response would branch to the exception handler to report the error and allow debugging, but the DMA controller is a wired automaton that obviously has no "software exception handler"...

Question F2: What is the mechanism that allows the DMA controller to report this type of error to the user program? To answer this question, we must analyse the whole signalling chain:

  • behaviour of the DMA controller which receives a bus error response: in the pibus_dma.cpp file
  • code of the interrupt routine _isr_dma associated with the DMA: in the file irq_handler.c.
  • code of the system function _fb_completed() : in the file drivers.c
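As a reading aid for this chain, here is a generic software model of how a device without an exception handler can pass an error up to a program: the device records a status, the interrupt routine saves it, and the completion call returns it. This is a sketch of the general pattern only. The status codes, the address map, the function names and the assumption that an interrupt is raised on errors as well as on success are all hypothetical and must be checked against pibus_dma.cpp, irq_handler.c and drivers.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes; the real ones are defined with the component. */
enum { ST_IDLE, ST_SUCCESS, ST_READ_ERROR, ST_WRITE_ERROR };

/* Hypothetical address map of this model: a single valid segment. */
#define SEG_BASE 0x10000000u
#define SEG_SIZE 0x00010000u

int dev_status = ST_IDLE;    /* status kept inside the device model */
int dma_status = ST_IDLE;    /* copy saved by the interrupt routine */
volatile int dma_busy = 0;   /* stand-in for the kernel flag _dma_busy */

int is_mapped(uint32_t addr)
{
    return addr >= SEG_BASE && addr - SEG_BASE < SEG_SIZE;
}

/* Interrupt routine: save the device status, acknowledge, release the flag. */
void isr_dma_sketch(void)
{
    dma_status = dev_status;
    dev_status = ST_IDLE;
    dma_busy = 0;
}

/* Device: a bus error stops the transfer and is recorded in the status.
   Assumption: the interrupt is raised on errors as well as on success. */
void device_transfer(uint32_t src, uint32_t dst)
{
    assert(dma_busy);   /* a transfer must have been started */
    if (!is_mapped(src))      dev_status = ST_READ_ERROR;
    else if (!is_mapped(dst)) dev_status = ST_WRITE_ERROR;
    else                      dev_status = ST_SUCCESS;
    isr_dma_sketch();
}

/* Completion call: wait for the flag, then report the saved status. */
int fb_completed_sketch(void)
{
    while (dma_busy) { /* busy wait */ }
    return dma_status != ST_SUCCESS;
}
```

In this model the error is only visible to the program when it calls the completion function, since the call that started the transfer has already returned.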

To test these error reporting mechanisms, modify the program contained in the file main_dma.c to intentionally introduce an error on the buffer address passed as an argument to the fb_write() system call.

Question F3: Which system call reports the error if the erroneous address is undefined (e.g. address 0x0, which does not correspond to any defined segment)? Which system call signals the error if the erroneous address belongs to the protected area (e.g. address 0x80000000)?

Don't be lazy. Take the time to analyse the code in detail, as this part on error reporting is the most important part of the course.

G. Improving parallelism

In a software pipeline, the gain from parallelization between several tasks is small in the case where one task has a much longer duration than the others, because the pipeline is not "balanced". In this case, it is the image construction task that is much longer than the display task. In order to improve performance, the image construction task must be split between several software tasks running in parallel on several cores. This is the aim of this last part.

The aim is to create a pipeline where one "hardware" task and four "software" tasks run in parallel according to the following timeline:

          Period 1   Period 2    Period 3    Period 4    Period 5    Period 6
PROC_0    Built[1]   Built[2]    Built[3]    Built[4]    Built[5]    -
PROC_1    Built[1]   Built[2]    Built[3]    Built[4]    Built[5]    -
PROC_2    Built[1]   Built[2]    Built[3]    Built[4]    Built[5]    -
PROC_3    Built[1]   Built[2]    Built[3]    Built[4]    Built[5]    -
DMA       -          Display[1]  Display[2]  Display[3]  Display[4]  Display[5]

As far as synchronisation is concerned, we now need a barrier between 5 tasks instead of 2. This can be achieved by instructing a single software task to synchronise with the hardware DMA task using the fb_write() and fb_completed() primitives, and adding a synchronisation barrier between the four software tasks.
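One common way to build such a barrier between the four software tasks is a sense-reversing barrier. The sketch below uses C11 atomics as a portable stand-in; it is an assumption made for illustration, and the synchronisation primitives available in the course environment (for example those used in TP5) may be different.

```c
#include <assert.h>
#include <stdatomic.h>

#define NTASKS 4   /* the four software tasks */

typedef struct {
    atomic_int count;   /* tasks that have arrived in the current phase */
    atomic_int sense;   /* flips each time the barrier opens */
} barrier_t;

void barrier_init(barrier_t *b)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
}

/* Register the arrival of one task and return the sense value to wait for.
   The last task to arrive resets the counter, then flips the sense,
   which releases the tasks that are waiting. */
int barrier_arrive(barrier_t *b)
{
    int target = !atomic_load(&b->sense);
    int prev = atomic_fetch_add(&b->count, 1);
    assert(prev >= 0 && prev < NTASKS);   /* more arrivals would be a misuse */
    if (prev == NTASKS - 1) {
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, target);
    }
    return target;
}

/* Full barrier: arrive, then spin until the barrier has opened. */
void barrier_wait(barrier_t *b)
{
    int target = barrier_arrive(b);
    while (atomic_load(&b->sense) != target) { /* busy wait */ }
}
```

With this scheme, the task in charge of the DMA would call fb_completed() before calling barrier_wait(), so that the other three tasks cannot enter the next period before the display of the previous image has ended.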

Question G1: Modify the hardware architecture to have 4 processors. Modify the image building program to share the work between 4 tasks based on what you did in TP5. You also need to modify the reset code to run the four software applications on all four processors.

H. Reporting

The answers to the above questions should be written up in a text file, and this report should be handed in at the beginning of the next lab session. Likewise, the simulator will be checked (in pairs) at the beginning of next week's lab session.

Last modified on Mar 15, 2024, 3:12:18 PM