
Course "Architecture of Multi-Processor Systems"

TP8: Disk Controller & Device Sharing

(franck.wajsburt@…)

A. Objectives

The first objective of this practical is to analyze the operation of a new device controller: the IOC (Input Output Controller). This component performs data transfers between memory and an external storage device (magnetic disk, USB key, etc.).

The second objective is to analyse the problems posed by the sharing of peripherals when several programs run in parallel on several processors, and use the same peripheral. The hardware architecture is therefore the generic multi-processor architecture, already used in TP5, TP6, and TP7.

B. Disk Controller

The hardware component PibusBlockDevice allows access to an external storage device. Unlike RAM, which is volatile, external storage devices retain stored information even when the machine is turned off. The price to pay is usually a very long access time (several million cycles for a magnetic disk).

In the external storage space, data is grouped in blocks of 512 bytes, and the external address (called Logical Block Address or LBA) is a block number, instead of a byte number as in the internal address space. All data transfers between the internal and external space are therefore done in blocks of 512 bytes.
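The relation between byte counts and block numbers can be sketched in a few lines of C. This is only an illustration of the 512-byte block arithmetic described above; the helper names `lba_of_offset` and `blocks_needed` are hypothetical and do not come from the course sources.

```c
#include <stdint.h>

#define BLOCK_SIZE 512u  /* bytes per block, as stated above */

/* Block number (LBA) that contains a given byte offset of the external space. */
static uint32_t lba_of_offset(uint32_t byte_offset)
{
    return byte_offset / BLOCK_SIZE;
}

/* Number of whole blocks needed to hold nbytes (rounded up).
 * Assumes nbytes is small enough that the addition does not overflow. */
static uint32_t blocks_needed(uint32_t nbytes)
{
    return (nbytes + BLOCK_SIZE - 1u) / BLOCK_SIZE;
}
```

Because transfers are always made in whole blocks, a buffer whose size is not a multiple of 512 bytes still occupies a full final block.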

The IOC controller therefore allows a program running on the MIPS32 processor to trigger the transfer of one or more data blocks between a memory buffer and the external storage device.

To simplify the simulator, a file stored on the disk of the workstation running the simulation is used directly as external storage. The disk of the simulated platform is thus represented by a single file located on the workstation disk. A device of this type is seen as a sequence of 512-byte blocks. For the simulated machine, the file representing the disk and the number of blocks it contains are hardware characteristics that cannot be modified by the software. The disk image provided for this tutorial is contained in the file images.raw. It contains a sequence of 21 images. Each image contains 128 lines of 128 pixels, and each pixel is encoded in one byte (256 grey levels).

The communication scenario between the user program and the disk controller is as follows:

  1. The user program, running on the MIPS32 processor, uses a system call that configures the IOC controller and initiates the transfer by writing to various "memory-mapped" registers. This system call writes to four registers to:
    • set the direction of the transfer (read or write to disk),
    • set the base address of the memory buffer,
    • set the number of blocks to be transferred,
    • set the number of the first block on disk (LBA),
  2. The IOC coprocessor performs the transfer block by block (one burst transaction per block). If the transfer is a write to the disk, the transactions are reads from the pibus, and vice versa. We call this an I/O coprocessor, because the I/O controller and the main processor work in parallel during this transfer.
  3. The transfer time is even more unpredictable than in the case of the DMA controller, because it depends not only on the bus load, but also on the (mechanical) position of the disk read head. When the transfer is complete, the IOC controller activates an interrupt, to signal the end of the transfer to the operating system, which can then unblock the application that requested the transfer.
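Step 1 of this scenario can be sketched in C as four stores to a register bank. The register indices, the operation codes and the function name `ioc_start` are all hypothetical: the real names, order and encoding are defined in pibus_block_device.h. The sketch also assumes that the write to the operation register is the one that starts the transfer, so it is done last.

```c
#include <stdint.h>

/* Hypothetical register indices: the real names and order are defined
 * in pibus_block_device.h and may differ. */
enum { IOC_BUFFER = 0, IOC_COUNT = 1, IOC_LBA = 2, IOC_OP = 3, IOC_NREGS = 4 };

/* Hypothetical operation codes. */
enum { IOC_OP_READ = 1, IOC_OP_WRITE = 2 };

/* Program one transfer. 'regs' points to the controller's register bank
 * (on real hardware: its memory-mapped base address). The operation
 * register is written last, on the assumption that this write is what
 * starts the transfer. */
static void ioc_start(volatile uint32_t *regs, uint32_t op,
                      uint32_t buffer_addr, uint32_t nblocks, uint32_t lba)
{
    regs[IOC_BUFFER] = buffer_addr;  /* base address of the memory buffer */
    regs[IOC_COUNT]  = nblocks;      /* number of blocks to transfer      */
    regs[IOC_LBA]    = lba;          /* first block on disk               */
    regs[IOC_OP]     = op;           /* direction of the transfer         */
}
```

The `volatile` qualifier keeps the compiler from reordering or removing the stores, which matters when each store has a side effect in a device.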

Read the functional specification of the PibusBlockDevice component found in the header of the pibus_block_device.h file to answer the following questions:

Question B1: What is the meaning of the block_size and latency arguments of the PibusBlockDevice hardware component constructor?

Question B2: How many blocks does an image occupy on the disk, for blocks of 512 bytes?

Question B3: What are the addressable registers of the disk controller, and what effect does a write or read to each of these registers have?

Question B4: What are the different values of the internal state of the disk controller that can be read by the software, and what is the meaning of each of these states?

C. Hardware architecture

The archive multi_tp8.tgz contains the files you will need. Create a working directory tp8, and copy the two files tp5_top.cpp and tp5.desc into this directory. Unpack the archive. As usual, you will find the embedded software in the soft directory, which you will copy into tp8. We will use a 4-core architecture in this project.

Caution: it is necessary to edit the file tp5_top.cpp to modify the value of a parameter of the constructor of the component PibusSegBcu. The timeout parameter defines the maximum duration of a transaction. If the duration of a transaction exceeds this value, the bus controller aborts the transaction. It is the use of the PibusBlockDevice component that makes this modification necessary.

Question C1: Why does the use of the PibusBlockDevice component require increasing the time-out value of the PibusSegBcu component? What value should be given to the timeout parameter of the PibusSegBcu component constructor?

For the Frame Buffer component, we will choose a height and width corresponding to the images on the disk (images of 128 lines of 128 pixels).

By analysing the content of the file tp5_top.cpp, answer the following questions:

Question C2: What are the values of the base address and the length of the segment associated with the disk controller (IOC)? Given the variable number of processors in this architecture, what are the lengths of the segments associated with the ICU, TTY and TIMER components?

Question C3: How many master components are there in this architecture? How many target components?

Question C4: Knowing that this generic architecture allows the number of cores to vary, analyse the file tp5_top.cpp to determine how many incoming interrupt lines the ICU component receives from the 4 devices TTY, TIMER, DMA and IOC. How many outgoing interrupt lines does it have? How are the IRQs from the devices connected to the IRQ_IN[i] ports of the ICU?

Check and/or change the default values of the hardware parameters to give the hardware platform the above characteristics, and generate the simulator.

D. Boot Code

Go to the soft directory.

The reset.s file contains the boot code to initialise the interrupt vector and the ICU device, as well as the stack pointers of the different processors.

Question D1: Recall why the initialization of the stack pointer depends on the processor number.

Question D2: Recall the general mechanism that allows the operating system to route - by software - the different incoming interrupt lines on the ICU component to different processors.

Question D3: In the case of a 4 processor architecture, what are the values to be stored in the 4 mask registers of the ICU if we want to perform the following routing:

  • IRQ_TIMER[0], IRQ_TTY[0], IRQ_DMA and IRQ_IOC to processor 0
  • IRQ_TIMER[1], IRQ_TTY[1] to processor 1
  • IRQ_TIMER[2], IRQ_TTY[2] to processor 2
  • IRQ_TIMER[3], IRQ_TTY[3] to processor 3
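A mask value of this kind can be built from a list of IRQ_IN indices, as in the C sketch below. It assumes that bit i of a processor's mask register enables IRQ_IN[i] for that processor; this convention and the helper name `icu_mask` are assumptions to be checked against the ICU specification. The indices themselves are not given here, since they depend on the wiring found in tp5_top.cpp.

```c
#include <stdint.h>

/* Build an ICU mask with one bit set per IRQ_IN index listed in 'irqs'.
 * Assumption: bit i of a processor's mask enables IRQ_IN[i] for it.
 * Every index must be lower than 32. */
static uint32_t icu_mask(const unsigned *irqs, unsigned n)
{
    uint32_t mask = 0;
    for (unsigned k = 0; k < n; k++)
        mask |= (uint32_t)1 << irqs[k];
    return mask;
}
```

For example, the index list {0, 3, 5} gives the mask 0x29. In reset.s the resulting constants are written directly in assembly; the C form only makes the bit arithmetic explicit.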

Complete the reset.s file to initialise the stack pointer of the 4 processors (64 Kbytes per processor), and to perform the interrupt routing defined above. The TIMER device is not used in this tutorial and does not need to be initialized.
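The stack pointer computation can be written out in C to make the arithmetic explicit before coding it in assembly. This is a sketch under two assumptions: the stacks of all processors are contiguous slices of one stack segment, and each stack grows toward lower addresses. The base address parameter and the name `initial_sp` are hypothetical.

```c
#include <stdint.h>

#define STACK_SIZE 0x10000u  /* 64 Kbytes per processor */

/* Initial stack pointer of processor 'proc_id'. The stack grows toward
 * lower addresses, so each processor starts at the top of its own slice.
 * 'stack_base' is a hypothetical base address of the stack segment. */
static uint32_t initial_sp(uint32_t stack_base, uint32_t proc_id)
{
    return stack_base + (proc_id + 1u) * STACK_SIZE;
}
```

With this layout, processor 0 uses the lowest slice and processor 3 the highest, and no two processors share any part of a stack.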

E. Image processing software application

The file main_image.c contains an image processing program. This program has been designed to allow parallel execution on an architecture with 1, 2 or 4 processors.

We start with a sequential execution on a single processor architecture.

At each iteration of the main loop, the program successively performs the following three operations:

  1. Read an image from disk and copy it into a memory buffer buf_in, using the system call ioc_read(), then wait for the transfer to complete with the system call ioc_completed().
  2. Process this image pixel by pixel, applying a threshold to the image. The modified image is copied to a second memory buffer buf_out.
  3. Display the modified image stored in the buf_out buffer on the graphics terminal, using the system call fb_sync_write(), which does not use the DMA controller.
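The processing step (operation 2) can be sketched as follows. This is not the code of main_image.c: the function name `threshold`, the cut-off value passed as `thr` and the choice of 0 and 255 as output levels are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'in' to 'out', mapping each pixel to black or white.
 * The cut-off 'thr' and the 0/255 output levels are illustrative. */
static void threshold(const uint8_t *in, uint8_t *out, size_t npixels, uint8_t thr)
{
    for (size_t i = 0; i < npixels; i++)
        out[i] = (in[i] >= thr) ? 255 : 0;
}
```

Each output pixel depends only on the input pixel at the same position, so the loop can be cut into independent ranges without changing the result.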

Question E1: What are the arguments to the ioc_read() system call? What does this system call do? The answer can be found in the files stdio.c and drivers.c. Does this system call wait until the transfer is complete before returning? In what case is this system call blocking?

Question E2: What does the ioc_completed() system call do? What are its arguments? Is this system call blocking?

Complete the main_image.c file to define the missing arguments for the system calls.

In the configuration file config.h, check that the argument NB_PROCS has the value 1, and that the argument NO_HARD_CC has the value 0.

Use the Makefile to generate the binary code.

Return to the tp8 directory and run on a single processor architecture, setting the path to the disk image file on the command line, and using large caches:

./simul.x -DISK images.raw -ISETS 1024 -IWAYS 4 -DSETS 1024 -DWAYS 4 -NPROCS 1

This is an interactive application: the main loop ends with a tty_getc() system call to read a character typed on the keyboard. Since this system call is blocking, you have to press a key on the keyboard (any key) to allow the next image to be loaded and displayed.

Question E3: What problem do you observe when displaying the subsequent images? Explain precisely the cause of this malfunction. Hint: the cause of the problem is related to the operation of the data cache.

Change, on the command line, the value of the SNOOP parameter, which enables the bus snooping mechanism of the PibusMips32Xcache component, and verify that the observed malfunction disappears.

By analysing the content of the file pibus_mips32_xcache.cpp, and more particularly the behaviour of the SNOOP_FSM automaton, answer the following questions:

Question E4: What conditions cause the SNOOP_FSM to exit the IDLE state? Is the strategy implemented in case of an external hit an update or an invalidation?

Question E5: Why is there a particular problem with detecting multiple consecutive external hits? How is this problem solved?

Change the default value of the SNOOP parameter in the tp5_top.cpp file, so that SNOOP is always enabled.

Question E6: Measure, for the first 4 images, the execution times of each of the three stages of the program (loading, thresholding, display) and report these measurements in a table.

F. Running on a multi-processor architecture

We now want to run the image processing application on an architecture with 4 processors. The basic idea is to accelerate the processing by sharing the work between several processors working in parallel. The principle of parallelization is to split the image into horizontal strips, so that each processor is responsible for processing one strip. For an image of 128 lines, each processor processes 32 lines, representing 1/4 of the image.
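The split into horizontal strips can be sketched in C as below. The type `strip_t` and the function `strip_of` are hypothetical names; the sketch assumes that the number of processors divides the number of lines exactly, which holds for 128 lines and 1, 2 or 4 processors.

```c
/* Range of image lines handled by one processor. */
typedef struct {
    unsigned first_line;  /* index of the first line of the strip */
    unsigned nlines;      /* number of lines in the strip         */
} strip_t;

/* Strip assigned to processor 'proc_id' among 'nprocs' processors.
 * Assumes nprocs divides nlines exactly. */
static strip_t strip_of(unsigned proc_id, unsigned nprocs, unsigned nlines)
{
    strip_t s;
    s.nlines     = nlines / nprocs;
    s.first_line = proc_id * s.nlines;
    return s;
}
```

With 4 processors and 128 lines, processor 3 gets lines 96 to 127. The same function covers the 1-processor and 2-processor configurations without any change.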

Question F1: Of the three processing phases (loading, filtering, displaying), which ones can actually be parallelized?

At the start of processing an image, each processor seeks to use the disk controller to load into memory the part of the image that concerns it.

Question F2: Since the IOC controller can only perform one transfer at a time, describe the general mechanism that allows the 4 transfers requested by the 4 processors to be sequenced.

It is the _ioc_get_lock() system function (itself called by the ioc_read() system call) that allows the operating system to take the lock giving exclusive access to the IOC controller. This function uses two special assembly instructions, LL and SC, to perform an atomic read-then-write access:

  • LL(X) (load linked) instruction: this is a read of a 32-bit word at memory address X, with a reservation taken on the memory slot X. Any write to address X (by another processor) cancels the reservation.
  • SC(X) (store conditional) instruction: this is a conditional write of a 32-bit word to address X. If there has been no other write access to address X since the reservation made by LL(X), it is a success, the writing is done, and the reservation is cancelled. Otherwise, it is a failure and the write is not performed. This SC(X) instruction is blocking like a read, since it returns a value: 1 in case of success, and 0 in case of failure.

In the (frequent) case where several programs take a reservation (LL instruction) on the same address X, the winner is the first program to execute the SC instruction.
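The LL/SC rules above can be captured in a small software model, written here in C. This is neither the hardware implementation nor the code of _ioc_get_lock(): the type `llsc_word_t`, the functions `ll`, `sc` and `try_lock`, and the convention that 0 means "lock free" are assumptions made for the illustration. The model keeps one reservation flag per processor for a single word X.

```c
#include <stdint.h>

#define NPROCS 4

/* One memory word X plus one reservation flag per processor. */
typedef struct {
    uint32_t value;
    int      reserved[NPROCS];
} llsc_word_t;

/* LL(X): read the word and take a reservation for processor p. */
static uint32_t ll(llsc_word_t *x, int p)
{
    x->reserved[p] = 1;
    return x->value;
}

/* SC(X): write only if p still holds its reservation.
 * A successful write cancels every reservation on X.
 * Returns 1 on success and 0 on failure. */
static int sc(llsc_word_t *x, int p, uint32_t v)
{
    if (!x->reserved[p])
        return 0;
    x->value = v;
    for (int i = 0; i < NPROCS; i++)
        x->reserved[i] = 0;
    return 1;
}

/* One lock attempt: succeeds only if the lock was free (0) and the SC wins. */
static int try_lock(llsc_word_t *lock, int p)
{
    if (ll(lock, p) != 0)
        return 0;
    return sc(lock, p, 1);
}
```

If processors 0 and 1 both execute `ll` and processor 1 executes `sc` first, its write succeeds and the later `sc` of processor 0 fails, so processor 0 has to retry from the `ll`.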

Question F3: Analyse in detail the code of the function _ioc_get_lock() found in the file drivers.c, and explain what this code does.

Question F4: What is the system function that releases the lock protecting exclusive access to the IOC component? Why is it not necessary to use a particular instruction to release the lock?

Recompile the image processing application in the soft directory for a 4 processor architecture by specifying the number of processors in the config.h file. Launch the execution.

$ ./simul.x -DISK images.raw -ISETS 1024 -IWAYS 4 -DSETS 1024 -DWAYS 4 -NPROCS 4

G. Hardware implementation of LL/SC

The LL/SC mechanism allows any memory address X to be used as an exclusive access lock. The association between an address X and a protected resource is a convention defined either by the operating system (for locks used by the system) or by the applications (for locks directly managed by the application code). LL and SC are therefore not protected instructions reserved for the operating system. In order to give the software full freedom to choose the lock addresses, the hardware must be able to register the reservation made by the LL(X) instruction on any address X.

In practice, in bus-based architectures, the registers that record this reservation are usually located in the cache controller and not in the memory controller.

Question G1: What is the reason for performing this registration on the processor side rather than the memory side?

This processor-side storage obviously poses a problem: to guarantee exclusive access, a reservation on address X made by an LL(X) instruction executed on processor P0 must be cancelled if another processor P1 executes an SC(X) instruction on the same address first.

Question G2: In the scenario described above, how is the processor P0 informed of the write performed by P1?

Question G3: To test your hypothesis, run the image processing application with the snoop mechanism disabled, using the argument (-SNOOP 0) on the command line. How do you explain what you observe?

Question G4: Summarise in one sentence the two uses of the snoop mechanism that have been highlighted in this tutorial.

H. Report

The answers to the above questions must be written in a text editor, and this report must be handed in at the beginning of the next lab session. Similarly, the simulator will be checked (in pairs) at the beginning of next week's practical session.

Last modified on Mar 19, 2024, 11:13:26 AM