
Cache coherence protocol

1. General Principles

This section describes the DHCCP protocol (Distributed Hybrid Cache Coherence Protocol) implemented by the TSAR architecture. For scalability purposes, the TSAR architecture implements a “Directory Based” cache coherence policy. From a conceptual point of view, the coherence protocol is supported by a Global Directory located in the memory controller: this Global Directory stores the status of each cache line replicated in at least one L1 cache of the TSAR architecture.

The main goal being the protocol scalability, the L1 caches implement a WRITE-THROUGH policy. The coherence protocol is much simpler than the MESI or MSI protocols used in most architectures implementing a WRITE-BACK policy. With a WRITE-THROUGH policy, the main memory always contains the most recent value of a cache line, and there is NO exclusive ownership state for a cache line.

The basic mechanism is the following: when the memory controller receives a WRITE request for a given cache line, it must send an UPDATE or INVAL request to all L1 caches containing a copy (except the writer). The write request is acknowledged to the writer only when all UPDATE or INVAL transactions are completed.

In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed L2 caches (one per cluster). Therefore, the Global Directory itself is distributed. The L2 cache is inclusive for all L1 caches: a cache line L that is present in at least one L1 cache must be present in the owner L2 cache. With this property, the Global Directory can be implemented as an extension of the memory cache directory.

In case of a MISS, the L2 cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusion property, all copies of the evicted cache line in the L1 caches must be invalidated. To do so, the L2 cache controller must send invalidate requests to all L1 caches containing a copy.

The TSAR architecture guarantees cache coherence by hardware, for both the data and instruction L1 caches. Modifications of shared data are very frequent events, but the number of copies is generally not very high. Modifications of shared code are very rare events (self-modifying code, or dynamic libraries), but the number of replicated copies can be very large (the exception handler or the libc are generally replicated in all L1 caches). Reflecting the different behaviour of data & instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies, depending on the number of copies (both are sketched in the example after this list):

  • MULTICAST_UPDATE: When the number of copies is smaller than the DHCCP threshold, the L2 cache controller registers the locations of all the copies, and can send a dedicated update(L) request to each relevant L1 cache in case of modification of L.
  • BROADCAST_INVAL: When the number of copies is larger than the DHCCP threshold, the memory cache controller registers only the number of copies (without their locations), and broadcasts an inval request to all L1 caches in case of modification of L.
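
As an illustration of this hybrid strategy, the C++ sketch below shows how an L2 directory entry and the write-handling decision could look. It is only a sketch, not the TSAR implementation: the DirectoryEntry layout and the send_update / send_broadcast_inval helpers are illustrative assumptions.

{{{
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative types, not the actual TSAR data structures.
struct Copy {
    uint32_t l1_srcid;   // SRCID of the L1 cache holding the copy
    bool     is_inst;    // instruction (true) or data (false) copy
};

// Directory entry attached to one L2 cache line: either an exact list of
// copies (multicast mode) or a simple counter (broadcast mode).
struct DirectoryEntry {
    bool              counter_mode = false;
    std::vector<Copy> copies;          // valid when counter_mode == false
    std::size_t       nb_copies = 0;   // valid when counter_mode == true
};

// Hypothetical helpers standing in for the coherence network interface.
void send_update(uint32_t /*l1_srcid*/, uint64_t /*line*/, const uint32_t* /*data*/) {}
void send_broadcast_inval(uint64_t /*line*/) {}

// On a WRITE to a replicated line, the L2 cache either multicasts one
// update packet per registered copy (except the writer), or broadcasts
// a single invalidate request to all L1 caches.
void handle_write(const DirectoryEntry& dir, uint64_t line,
                  uint32_t writer_srcid, const uint32_t* new_data)
{
    if (!dir.counter_mode) {                          // MULTICAST_UPDATE
        for (const Copy& c : dir.copies)
            if (c.l1_srcid != writer_srcid)
                send_update(c.l1_srcid, line, new_data);
    } else {                                          // BROADCAST_INVAL
        send_broadcast_inval(line);
    }
}
}}}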

2. Transactions between L1 and L2 caches

Nine types of transactions have been identified, which can be split into two classes:

  • 5 Direct transactions : READ / WRITE / LL / SC / CAS
  • 4 Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP

For deadlock prevention, these transactions must be transported on (virtually or physically) separated networks: two for the direct transactions and three for the coherence transactions, as detailed below.

2.1 Direct transactions

These transactions are always initiated by an L1 cache controller, which can be located in any cluster. The target is an L2 cache controller, acting as a physical memory bank, which can also be located in any cluster.

All direct transactions require two packets: one command packet (from L1 to L2), and one response packet (from L2 to L1).

To avoid deadlocks, the direct transactions require two separated physical networks: one for commands and one for responses.

For all direct transactions, the packets (commands & responses) respect the VCI format. As the L1 cache controller can issue several simultaneous direct transactions, these transactions are distinguished by the VCI TRDID and PKTID values. The command and response flit counts for the five direct transaction types are summarized in the sketch after the list below.

  • A READ transaction can have four sub-types: it can be instruction or data, and it can be cacheable or uncacheable. In case of a burst transaction, the burst must be contained in a single 16-word cache line. This constraint applies to both the L1 cache controllers and the I/O controllers with DMA capability. For all READ transactions, the VCI command packet contains one single VCI flit, and the VCI response packet contains at most 16 flits.
  • A WRITE transaction can be a single-word request or a variable-length burst request. In case of a burst, all words must belong to the same cache line, and the BE field can have different values for each flit (including the zero value). The VCI command packet contains at most 16 flits and the VCI response packet contains one VCI flit. A WRITE burst transaction initiated by a DMA controller must respect the same constraint.
  • An LL (Linked Load) transaction can target any single word contained in a memory cache. The response returns two 32-bit values: the addressed data value, and a signature that has been allocated by the memory cache to this LL reservation. This means that the VCI command packet contains one flit and the VCI response packet contains two flits.
  • An SC (Store Conditional) transaction can target any single word contained in a memory cache. The command must transport both the new data value and the signature obtained from the previous LL transaction. The response returns only a Boolean indicating failure/success for the SC transaction. This means that the VCI command packet contains two flits and the VCI response packet contains one flit.
  • A CAS (Compare & Swap) transaction can target any single word contained in a memory cache. The command must transport both the old data value and the new data value. The response returns only a Boolean indicating failure/success for the CAS transaction. This means that the VCI command packet contains two flits and the VCI response packet contains one flit.
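
To summarize the packet sizes listed above, the sketch below maps each direct transaction type to its command and response flit counts; the enum and function names are illustrative, and nwords stands for the number of words actually transferred by a READ or WRITE burst (at most 16, i.e. one cache line).

{{{
#include <cstddef>
#include <utility>

enum class DirectCmd { READ, WRITE, LL, SC, CAS };

// Returns {command flits, response flits} for a direct transaction,
// following the sizes given in the text above.
std::pair<std::size_t, std::size_t>
flit_counts(DirectCmd cmd, std::size_t nwords = 1)
{
    switch (cmd) {
    case DirectCmd::READ:  return {1, nwords};  // 1 cmd flit, up to 16 rsp flits
    case DirectCmd::WRITE: return {nwords, 1};  // up to 16 cmd flits, 1 rsp flit
    case DirectCmd::LL:    return {1, 2};       // rsp = data value + signature
    case DirectCmd::SC:    return {2, 1};       // cmd = new value + signature
    case DirectCmd::CAS:   return {2, 1};       // cmd = old value + new value
    }
    return {0, 0};                              // unreachable
}
}}}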

2.2 Coherence transactions

For each cache line stored in the L2 cache, the L2 cache implements a linked list of the copies replicated in the L1 caches. Each entry in this list contains the SRCID of the L1 cache that holds a copy, as well as the type of the copy (instruction/data). If the same cache line is replicated in both the instruction cache and the data cache of a given core, this defines two separate entries in the list. When the number of copies for a given cache line L exceeds the DHCCP threshold, the corresponding list of copies is flushed, and the L2 cache registers only the number of copies.
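
A possible implementation of this "flush to counter" behaviour is sketched below, with the same illustrative DirectoryEntry layout as in the sketch of section 1; the DHCCP_THRESHOLD value and the function name are assumptions, not the actual TSAR code.

{{{
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr std::size_t DHCCP_THRESHOLD = 4;   // assumed threshold value

struct Copy { uint32_t l1_srcid; bool is_inst; };

struct DirectoryEntry {
    bool              counter_mode = false;
    std::vector<Copy> copies;          // exact locations (multicast mode)
    std::size_t       nb_copies = 0;   // count only (broadcast mode)
};

// Register a new L1 copy of a cache line. An instruction copy and a data
// copy held by the same core count as two separate entries. When the
// number of copies exceeds the DHCCP threshold, the list is flushed and
// only the number of copies is kept.
void register_copy(DirectoryEntry& dir, uint32_t l1_srcid, bool is_inst)
{
    if (dir.counter_mode) {
        ++dir.nb_copies;
        return;
    }
    dir.copies.push_back({l1_srcid, is_inst});

    if (dir.copies.size() > DHCCP_THRESHOLD) {
        dir.nb_copies    = dir.copies.size();   // keep only the count
        dir.copies.clear();                     // forget the locations
        dir.counter_mode = true;
    }
}
}}}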

A coherence transaction can be initiated by the L1 cache or by the L2 cache. Depending on the transaction type, a coherence transaction can require two or three packets.

  • A CLEANUP transaction is initiated by the L1 cache when it must evict a line L for replacement, to signal to the owner L2 cache that it no longer contains a copy of L. This transaction requires two packet types:
    1. The L1 cache sends a cleanup(L) packet to the owner L2 cache.
    2. The L2 cache returns a clack(L) packet to signal that its list of copies for L has been updated.

For the L1 cache, the CLEANUP transaction is completed when the L1 cache receives the clack packet.
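
The L1 side of this handshake can be sketched as follows; send_cleanup and wait_for_clack are hypothetical stand-ins for the real network interface, not the actual L1 controller logic.

{{{
#include <cstdint>

// Hypothetical helpers standing in for the coherence network interface.
void send_cleanup(uint64_t /*line*/)   {}   // emit the cleanup(L) packet
void wait_for_clack(uint64_t /*line*/) {}   // return once clack(L) is received

// When the L1 cache evicts line L for replacement, it signals the owner
// L2 cache; the CLEANUP transaction is completed only when the clack(L)
// packet has been received, and only then can the slot be reused.
void evict_line(uint64_t line)
{
    send_cleanup(line);
    wait_for_clack(line);
}
}}}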

  • A MULTI_UPDATE transaction is a multicast transaction initiated by the L2 cache when it receives a WRITE request to a replicated cache line, and the number of copies does not exceed the DHCCP threshold. This transaction requires two packet types:
    1. The L2 cache sends as many update(L,DATA) packets as the number of registered copies (except the writer).
    2. Each L1 cache returns an update_ack(L) packet to the L2 cache to signal that the local copy has been updated.

For the L2 cache, the MULTI_UPDATE transaction is completed when the L2 cache has received all expected update_ack packets.
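
The L2-side bookkeeping for a MULTI_UPDATE can be sketched as follows; the MultiUpdate class and the send_update helper are illustrative names, not the actual L2 controller code.

{{{
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical helper: one update(L,DATA) packet to one L1 cache.
void send_update(uint32_t /*l1_srcid*/, uint64_t /*line*/, const uint32_t* /*data*/) {}

// One update packet is sent per registered copy (except the writer);
// the transaction is completed, and the WRITE can be acknowledged to
// the writer, when all expected update_ack packets have been received.
struct MultiUpdate {
    std::size_t expected_acks = 0;

    void start(const std::vector<uint32_t>& copy_srcids, uint32_t writer_srcid,
               uint64_t line, const uint32_t* data)
    {
        for (uint32_t srcid : copy_srcids)
            if (srcid != writer_srcid) {
                send_update(srcid, line, data);
                ++expected_acks;
            }
    }

    // Called for each received update_ack(L); returns true on completion.
    bool on_update_ack() { return --expected_acks == 0; }
};
}}}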

  • A MULTI_INVAL transaction is a multicast transaction, initiated by the L2 cache when it must evict a given line L, and the number of copies does not exceed the DHCCP threshold. To keep the inclusion property, all copies in the L1 caches must be invalidated. This transaction requires three types of packets:
    1. The L2 cache sends an inval(L) packet to each registered L1 cache (as many packets as registered copies).
    2. Each L1 cache sends a cleanup(L) packet to the L2 cache to signal that the local copy has been invalidated.
    3. The L2 cache returns to each L1 cache a clack(L) packet to signal that its list of copies for L has been updated.

For the L2 cache, the MULTI_INVAL transaction is completed when the last cleanup packet has been received.
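
The corresponding L2-side bookkeeping can be sketched in the same way, with hypothetical network helpers; note that every received cleanup is answered by a clack.

{{{
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical helpers standing in for the coherence network interface.
void send_inval(uint32_t /*l1_srcid*/, uint64_t /*line*/) {}
void send_clack(uint32_t /*l1_srcid*/, uint64_t /*line*/) {}

// One inval packet is sent to each registered L1 copy; the transaction
// is completed when the last expected cleanup has been received.
struct MultiInval {
    uint64_t    line = 0;
    std::size_t expected_cleanups = 0;

    void start(const std::vector<uint32_t>& copy_srcids, uint64_t l)
    {
        line = l;
        for (uint32_t srcid : copy_srcids) {
            send_inval(srcid, line);
            ++expected_cleanups;
        }
    }

    // Called for each received cleanup(L); returns true on completion.
    bool on_cleanup(uint32_t l1_srcid)
    {
        send_clack(l1_srcid, line);
        return --expected_cleanups == 0;
    }
};
}}}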

  • A BROADCAST_INVAL transaction is a broadcast transaction initiated by an L2 cache when a line L has been modified by a WRITE, or when the line L must be evicted for replacement, and the number of copies exceeds the DHCCP threshold. This transaction requires three types of packets:
    1. The L2 cache sends a bc_inval(L) broadcast packet to all L1 cache controllers.
    2. Each L1 cache that contains a copy of L sends a cleanup(L) packet to the L2 cache to signal that the local copy has been invalidated.
    3. The L2 cache returns a clack(L) packet to each L1 cache that sent a cleanup, to signal that its counter of copies for L has been updated.

The L2 cache simply decrements the counter of copies for each received cleanup, and the BROADCAST_INVAL transaction is completed when the last cleanup packet has been received.
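
In counter mode the L2 cache does not know the locations of the copies, so the bookkeeping reduces to a single counter, as in the sketch below (same hypothetical helpers as above).

{{{
#include <cstdint>
#include <cstddef>

// Hypothetical helpers standing in for the coherence network interface.
void send_bc_inval(uint64_t /*line*/) {}
void send_clack(uint32_t /*l1_srcid*/, uint64_t /*line*/) {}

// Only the number of copies is registered: each received cleanup
// decrements the counter and is answered by a clack; the transaction
// is completed when the counter reaches zero.
struct BroadcastInval {
    uint64_t    line = 0;
    std::size_t nb_copies = 0;

    void start(uint64_t l, std::size_t copies)
    {
        line = l;
        nb_copies = copies;
        send_bc_inval(line);   // single broadcast to all L1 cache controllers
    }

    bool on_cleanup(uint32_t l1_srcid)
    {
        send_clack(l1_srcid, line);
        return --nb_copies == 0;
    }
};
}}}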

As the MULTI_INVAL and BROADCAST_INVAL transactions require three packets, the coherence transactions require three separated physical networks.

3. Transactions between L2 and L3 caches

These transactions are initiated by the L2 caches, to fetch or save a complete cache line from/to the L3 cache. The general policy between the memory caches and the external memory is WRITE-BACK: the external memory is only updated in case of line replacement. The target is always the external RAM controller.

All these L2/L3 transactions use a separated external network, implementing a separated address space. The memory cache and external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format: the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64-byte cache line index). The VCI WDATA & RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index). As the L2 cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index. This simplified format and the GET/PUT flit counts are summarized in the sketch after the list below.

  • The L2 cache makes a GET transaction to the L3, to handle an L2 miss. The VCI command packet contains one single flit. The VCI CMD field contains the READ value. The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
  • The L2 cache makes a PUT transaction to the L3, to handle the replacement of a dirty line. The VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value. The VCI response packet contains one single flit.
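
The simplified external VCI format and the GET / PUT flit counts can be summarized by the following sketch; the struct and constant names are illustrative, while the field widths follow the text above.

{{{
#include <cstdint>
#include <cstddef>

// One flit of the simplified VCI format used on the external network:
// the PLEN, PKTID, CONST, CONTIG and BE fields are not used.
struct ExtVciFlit {
    uint32_t address : 30;  // 64-byte cache line index
    uint32_t srcid;         // memory cache index (cluster index)
    uint32_t trdid;         // transaction index (several in flight)
    uint8_t  cmd;           // READ for a GET, WRITE for a PUT
    uint64_t data;          // 64-bit WDATA / RDATA
};

// Flit counts: a 64-byte cache line is transferred as 8 flits of 64 bits.
constexpr std::size_t GET_CMD_FLITS = 1, GET_RSP_FLITS = 8;  // L2 miss
constexpr std::size_t PUT_CMD_FLITS = 8, PUT_RSP_FLITS = 1;  // dirty line replacement
}}}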