wiki:EnMultiCourseTP4

Course "Architecture of Multi-Processor Systems"

TP4: Characterization and Sizing of Caches

(pirouz.bazargan-sabet@…)

A. Objectives

In this tutorial, we try to evaluate in a systematic way the influence of different cache characteristics on the execution time of a software application. The important parameters are the capacity of the caches (data and instruction), the number of words per cache line, the number of associativity levels, and the depth of the posted write buffer.

To have a point of comparison, we start by running the software application on an architecture with very large caches. Indeed, when all the executable binary code can reside in the instruction cache and all the data can be stored in the data cache, the MISS rate tends to 0, and we have a "near perfect" memory system.

Unfortunately, the capacity of the caches is limited by two physical factors:

  • on the one hand, the silicon area occupied by the caches, and thus the manufacturing cost, are proportional to the storage capacity.
  • On the other hand, the access time (time needed to extract data from the cache) increases with capacity, and this access time influences the cycle time, and therefore the frequency of the processor.

Trade-offs must therefore be made.

Furthermore, it is known that - for the same capacity - a set associative cache has a lower MISS rate than a direct match cache. But the associative cache takes up (a little) more space and is (a little) slower. Here again, there are trade-offs to be made, and we will study the influence of different parameters on performance.

To facilitate the measurement of performance, we use the proctime() system call of the GIET, which returns the content of an internal cycle counter of the processor. This counter is initialized to 0 when the simulation starts.

B. Hardware architecture

To make these comparisons, we use the same hardware architecture as in the two previous tutorials (a single processor, a RAM, a ROM and a TTY terminal). We choose a software application written in C language, which recursively calculates the sum of the first N integers, for N varying between 0 and 29, and we display the result of the calculation at each iteration of the main loop.

The archive multi_tp4.tgz contains the files you will need. Create a working directory tp4, and copy the two files tp2_top.cpp and tp2.desc describing the hardware architecture into this directory. Unzip the archive, and copy in tp4 the soft directory containing the main.c file (C program), the reset.s file (boot code), the sys.ld, app.ld and seg.ld files (linker control), and the Makefile file (binary code generation).

C. Almost perfect memory system

Compile the software application by running the makefile provided in the soft directory to generate the sys.bin and app.bin files containing the binary code.

Compile the virtual prototype describing the hardware architecture, after checking that all segments other than the seg_tty and seg_kunc segments are cacheable in the Segment Table.

This virtual prototype is generic, since it is possible to modify the characteristics of both caches, by defining parameter values on the command line.

Question C1: Run the simulation, with very large caches: 256 sets, 16 words per line, and 4 levels of associativity i.e. a capacity of 64 Kbytes for each of the two caches. We also choose a post-write buffer of depth 8 words of 32 bits. What is the execution time of the application?

The SystemC model of the !PibusMips32Xcache hardware component has pseudo-registers used to do non-intrusive instrumentation: The c_total_cycles pseudo-register counts the total number of cycles executed since the RESET. The pseudo-records c_imiss_count and c_imiss_frz count respectively the total number of miss instructions and the total number of processor freeze cycles due to miss instructions. This measures the "miss instruction rate" and the "miss instruction cost". Symmetrically, the pseudo-registers c_dmiss_count and c_dmiss_frz provide the same service for the data cache. One can also measure the average number of cycles per instruction (CPI), as well as the percentage of data read instructions (cached or uncached), or the percentage of write instructions.

To display these statistics, you can use the -STATS argument on the command line. Caution: the display period must be chosen carefully so that the measurements are made while the processor is actually active.

Question C2: Rerun the simulation, and determine the MISS rates for the instruction cache and the data cache as well as the CPI value. Are the MISS rates constant over time? How do they change over the first 1000 cycles? Interpret this behaviour.

Question C3: By analysing the contents of the files pibus_mips32_xcache.cpp and pibus_mips32_xcache.h, explain how the CPI and the two MISS rates are calculated.

D. Influence of the instruction cache capacity

We now rerun the application for different instruction cache capacities, but keeping a large data cache of 64 Kbytes. We use a direct match instruction cache (-IWAYS 1), and vary the number of slots for the values: 1, 4, 16, 64 and 256, keeping a fixed width of 8 words per cache line.

-ISETS -IWORDS -IWAYS -DSETS -DWORDS -DWAYS
256 8 1 256 16 4
64 8 1 256 16 4
16 8 1 256 16 4
4 8 1 256 16 4
1 8 1 256 16 4

Question D1: For each configuration, find the program execution time, the MISS rate on the instruction cache, the instruction miss cost and the CPI value. How do you interpret these results?

Question D2: Why is the value obtained for the MISS cost different from the value estimated in TP3? How do you explain that the cost of the miss can have a non-integer value? Why do not all MISS have the same cost?

E. Influence of the cache line width

We now try to evaluate the influence of the width of the cache line, at constant storage capacity. We keep a data cache of 64 Kbytes, and we therefore restart the execution of the application for the following 5 configurations of the instruction cache (constant capacity of 1 Kbyte):

-ISETS -IWORDS -IWAYS -DSETS -DWORDS -DWAYS
256 1 1 256 16 4
128 2 1 256 16 4
64 4 1 256 16 4
32 8 1 256 16 4
16 16 1 256 16 4
8 32 1 256 16 4

Question E1: Graph the execution time of the program as a function of the width of the cache line. What is the most efficient configuration? How do you explain this result?

F. Influence of the data cache capacity

For the data cache, one could similarly vary the cache capacity, the width of the cache line, and the number of associativity levels separately. Here we simply study the influence of the data cache capacity by continuing to use a large 64 Kbyte instruction cache. We use a direct match data cache, 8-word lines, and vary the number of sets.

-ISETS -IWORDS -IWAYS -DSETS -DWORDS -DWAYS
256 16 4 256 8 1
256 16 4 64 8 1
256 16 4 16 8 1
256 16 4 4 8 1
256 16 4 1 8 1

Question F1: Write down, for each value, the execution time of the program, the MISS rate on the data cache, the cost of the data miss and the value of the CPI.

G. Influence of the depth of the posted write buffer

Finally, we study the influence of the depth of the post-write buffer. The parameter WBUF_DEPTH defines the maximum number of write requests (corresponding to instructions of type sw or sb) that can be stored in this buffer.

Question G1: How is the posted write buffer implemented? What information must be stored in the buffer for each posted write request? What happens when the processor makes a write request when the posted write buffer is full? What happens when the processor makes a read request (instruction or data) that misses, while the write buffer is not empty? Why does this behaviour occur?

We keep an instruction cache and a data cache of 64 Kbytes each, and we restart the execution of the application for different depths of the posted write buffer. The parameter WBUF represents the maximum number of write requests that can be stored in the buffer. Execution times will be measured successively for depths of 1, 2, 4, and 8 words.

Question G2: Measure the execution time of the application, the CPI, the cost of writes, and the frequency of writes for depths of 1, 2, 4, and 8 words. What is the cost of writes? How do you explain the fact that the write cost is so low?

H. Report

The answers to the above questions must be written in a text editor and this report must be handed in at the beginning of the next lab session. Similarly, the simulator will be checked (by pairs) at the beginning of the next week's practical session.

Last modified 17 months ago Last modified on Dec 15, 2022, 5:33:46 PM