
Version 2 (modified by alain, 15 years ago)

--

Cache coherence protocol

1. General Principles

This section describes the DHCCP protocol (Distributed Hybrid Cache Coherence Protocol) implemented by the TSAR architecture. For scalability, the TSAR architecture implements a “Directory Based” cache coherence policy. Conceptually, the coherence protocol is supported by a Global Directory located in the memory controller: this Global Directory stores the status of each cache line replicated in at least one L1 cache of the TSAR architecture.

Since the main goal is protocol scalability, the L1 caches implement a WRITE-THROUGH policy. The resulting coherence protocol is much simpler than the MESI protocol used in most architectures implementing a WRITE-BACK policy. With a WRITE-THROUGH policy, the main memory always contains the most recent value of a cache line, and there is NO exclusive ownership state for an L1 cache.

The basic mechanism is the following: when the memory controller receives a WRITE request for a given cache line, it must send an UPDATE or INVALIDATE request to all L1 caches containing a copy (except the writer). The write request is acknowledged only when all UPDATE or INVALIDATE transactions are completed.
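The write path described above can be sketched as follows. This is only an illustration under stated assumptions, not the TSAR hardware: `DirectoryEntry`, `handle_write` and the `send_coherence_request` callback are hypothetical names, and the callback is assumed to return True once the target L1 cache has completed the UPDATE or INVALIDATE.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry: identifiers of the L1 caches holding a copy."""
    sharers: set = field(default_factory=set)


def handle_write(entry, writer_id, send_coherence_request):
    """Notify every copy holder except the writer, then decide on the acknowledge.

    send_coherence_request(target_id) is a hypothetical callback that returns
    True when the target has completed its UPDATE or INVALIDATE transaction.
    """
    targets = sorted(entry.sharers - {writer_id})
    pending = len(targets)
    for target in targets:
        if send_coherence_request(target):
            pending -= 1
    # The WRITE is acknowledged only when no coherence transaction is pending.
    return pending == 0, targets
```

For example, with copies registered in caches 1, 2 and 3 and a write coming from cache 2, only caches 1 and 3 receive a coherence request.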

In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed memory caches (one per cluster). Therefore, the global directory itself is distributed. The memory cache is inclusive: a cache line L that is present in at least one L1 cache must be present in the corresponding memory cache (in the home cluster). With this property, the Global Directory can be implemented as an extension of the memory cache directory.

In case of a MISS, the memory cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusive property, all copies of the evicted cache line in L1 caches must be invalidated. To do so, the memory cache controller must send INVALIDATE requests to all L1 caches containing a copy.

The TSAR architecture guarantees cache coherence in hardware for both the data and instruction caches (L1 caches). Reflecting the different behaviour of data and instruction caches, the DHCCP protocol defines different strategies, depending on the number of copies:

  • Regarding data, modifications of shared data are very frequent events, but, on average, the number of copies is not very high. Therefore, the DHCCP protocol preferably uses a multicast/update strategy for the data caches.
  • Regarding instructions, modifications of shared code are rather rare events (self-modifying code or dynamic libraries), but the number of replicated copies can be very large (the system call handler or the libc are likely to be replicated in all L1 caches). Therefore, the DHCCP protocol generally uses a broadcast/invalidate policy for the instruction caches.

2. Types of transaction

Three types of transactions have been identified:

  • Direct transactions : READ / WRITE / LL / SC
  • Coherence transactions : UPDATE / INVALIDATE / CLEANUP
  • External Transactions : PUT / GET

For deadlock prevention, these three types of transactions must be transported on three (virtually or physically) separate networks.

As a general rule, all these transactions respect the VCI advanced packet format, and there is one response packet for each command packet. For a burst transaction, a READ command packet contains one single flit, and the corresponding READ response packet contains N flits. Symmetrically, a WRITE command packet contains N flits, and the corresponding WRITE response packet contains one single flit.

There is one exception: for a BROADCAST_INVAL transaction, the initiator sends one single-flit VCI packet, but receives several single-flit VCI response packets (see section 2.2).

2.1 READ / WRITE / LL / SC

These transactions are initiated by a processor (actually the L1 cache controller), or by another initiator (an I/O peripheral or a hardware coprocessor with DMA capability). This initiator can be located in any cluster. For these transactions, the target is a memory cache controller, acting as a physical memory bank, or another VCI target peripheral. This target can be located in any cluster.

  • A READ transaction can be a single word request (in case of uncached access), or a burst corresponding to a complete cache line (16 words). A READ burst transaction initiated by any DMA controller must respect the same 16-word cache line format. For all READ transactions, the VCI command packet contains one single VCI flit. The VCI CMD field contains the VCI_READ code. The VCI PLEN field is used to define the burst length. A READ transaction has a type, encoded with two bits in the VCI TRDID field: bit 0 of the TRDID field is 0 for an uncached access, and 1 for a cached access; bit 1 of the TRDID field is 0 for a data cache request, and 1 for an instruction cache request. The response packet contains one VCI flit (single word) or 16 VCI flits (cache line). The VCI PKTID field is not used.
  • A WRITE transaction can be a single word request or a variable length burst request. In case of a burst, the VCI command packet contains at most 8 VCI flits, with consecutive addresses. All words belong to the same half cache line, and the VCI BE field can have a different value for each flit (including the zero value). The VCI response packet contains one VCI flit. A WRITE burst transaction initiated by any DMA controller must respect the same constraint of 8 aligned words. The VCI CMD field contains the VCI_WRITE code. When the VCI TRDID field contains a non-zero value, it signals that the write request is “posted”: the VCI target must send a response to respect the VCI protocol, but this response can be sent before the write is actually performed. This can be used by the VCI/HT bridge. The VCI PKTID field is not used. If the modified cache line is replicated in one or several other L1 caches, all copies must be updated or invalidated before the WRITE transaction is acknowledged.
  • The TSAR architecture supports the LL/SC mechanism for atomic operations (see AtomicOperation). For both an LL (Linked Load) and an SC (Store Conditional) transaction, the VCI command packet and the VCI response packet contain one single VCI flit. The VCI CMD field must contain the VCI_LINKED_READ value (resp. the VCI_STORE_CONDITIONNAL value). The VCI PKTID and TRDID fields are not used.
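The READ type encoding and the WRITE burst constraint above can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and it assumes 4-byte words (so a 16-word cache line is 64 bytes and a half line is 32 bytes) and byte addresses.

```python
WORD_BYTES = 4                                   # assumption: 32-bit words
LINE_WORDS = 16                                  # cache line length given in the text
HALF_LINE_BYTES = LINE_WORDS * WORD_BYTES // 2   # 32 bytes


def read_trdid(cached, instruction):
    """Encode the READ type: bit 0 = cached access, bit 1 = instruction cache request."""
    return (1 if cached else 0) | ((1 if instruction else 0) << 1)


def read_response_flits(cached):
    """A cached READ returns a full line, an uncached READ a single word."""
    return LINE_WORDS if cached else 1


def write_burst_is_legal(address, nwords):
    """Check that a WRITE burst has 1 to 8 words, all in the same half cache line."""
    if address % WORD_BYTES != 0 or not 1 <= nwords <= 8:
        return False
    first_half = address // HALF_LINE_BYTES
    last_half = (address + nwords * WORD_BYTES - 1) // HALF_LINE_BYTES
    return first_half == last_half
```

Under these assumptions, a 2-word burst starting at byte address 0x1C is rejected because it crosses the half-line boundary at 0x20.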

2.2 MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP

These transactions are initiated by a memory cache controller to update or invalidate copies in the L1 caches. For each cache line stored in the memory cache, the memory cache maintains an INS bit indicating that this cache line is replicated in at least one L1 instruction cache. This bit is set as soon as the memory cache receives a cache line READ request with the INS bit set in the TRDID field. When the cache line is marked as data (INS = 0), the memory cache maintains an explicit set of the SRCIDs of all L1 caches containing a copy. When the cache line is marked as instruction (INS = 1), the memory cache maintains a counter containing the number of copies in L1 caches.
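The per-line bookkeeping described above can be modelled roughly as follows. This is a sketch, not the actual memory cache design: the class and method names are hypothetical, and two behaviours are assumptions that the text does not specify, namely that existing data copies are folded into the counter when the INS bit becomes set, and that a CLEANUP removes one registered copy.

```python
class LineDirectoryEntry:
    """Hypothetical per-line directory state kept by a memory cache."""

    def __init__(self):
        self.ins = False      # INS bit: line replicated in an L1 instruction cache
        self.srcids = set()   # explicit set of copy holders, used while INS = 0
        self.copies = 0       # number of copies, used once INS = 1

    def on_line_read(self, srcid, ins_request):
        """Register a copy after a cache line READ request."""
        if ins_request and not self.ins:
            self.ins = True
            # Assumption: data copies already registered are folded into the counter.
            self.copies = len(self.srcids)
            self.srcids.clear()
        if self.ins:
            self.copies += 1
        else:
            self.srcids.add(srcid)

    def on_cleanup(self, srcid):
        """Assumption: a CLEANUP from an L1 cache removes one registered copy."""
        if self.ins:
            self.copies -= 1
        else:
            self.srcids.discard(srcid)
```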

  • A MULTI_UPDATE transaction is a multicast transaction that is composed of several VCI transactions. When a memory cache controller receives a WRITE request to a replicated cache line marked as data (INS = 0), it sends as many VCI transactions as the number of registered copies (except the writer). The VCI command packet contains (N+2) flits. The VCI ADDRESS field is constant and contains the address of the memory mapped UPDATE register in the L1 cache. The VCI CMD field contains the WRITE value. As the memory cache controller can handle several update/invalidate transactions simultaneously, the VCI TRDID field contains the transaction index. The VCI PLEN field contains the value 4*N, where N is the actual number of modified words in the cache line. The VCI WDATA field contains the line index in the first flit (30 bits), the first modified word index (4 bits) in the second flit, and the N modified words in the N following flits. For each modified word, the VCI BE field can have a different value (including the 0x0 value). The VCI response packet contains one single flit. The memory cache controller counts the number of VCI responses to detect the completion of the MULTI_UPDATE transaction.
  • A MULTI_INVAL transaction is a multicast transaction that is composed of several VCI transactions. When a memory cache makes a cache line replacement (following a MISS), and the victim line has the data type (INS = 0), it sends as many VCI transactions as the number of registered copies. Both the VCI command packet and the VCI response packet contain only one flit. The VCI CMD field contains the WRITE value. The VCI ADDRESS field contains the address of the memory mapped INVAL register in the L1 cache. As the memory cache controller can handle several update/invalidate transactions simultaneously, the VCI TRDID field contains the transaction index. The VCI WDATA field contains the line index. The memory cache controller counts the number of VCI responses to detect the completion of the MULTI_INVAL transaction.
  • A BROADCAST_INVAL transaction is a broadcast transaction. This transaction is initiated when a memory cache controller replaces a line that has the instruction type (INS = 1), or when the memory cache receives a WRITE request to a replicated cache line that has the instruction type (INS = 1). The VCI command packet contains one single flit. This packet is replicated and dynamically broadcast by the network itself. The VCI CMD field contains the WRITE value. The VCI ADDRESS field contains the global broadcast address 0x000000003 (only the two LSB bits are set). The VCI WDATA field contains the line index. This VCI command is broadcast to all L1 caches in the system, but only the L1 caches that have a copy send a VCI response packet. All VCI response packets are returned independently to the initiating memory cache, which counts the number of VCI responses to detect the completion of the BROADCAST_INVAL transaction. If an L1 cache contains two copies of a cache line (i.e. the line is replicated in both the DATA cache and the INSTRUCTION cache), it must send two VCI responses.
  • A CLEANUP transaction is initiated by an L1 cache controller to a memory cache controller, to signal that a cache line copy has been removed from an instruction or data cache. Both the VCI command packet and the VCI response packet contain one single flit. For a CLEANUP transaction, the VCI ADDRESS field must contain the removed cache line address, and the VCI TRDID field must contain a non-zero value.
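The choice between the three memory-cache-initiated transactions, and the layout of a MULTI_UPDATE command packet, can be summarised in code. This is a sketch with hypothetical function names and event labels ("write" and "evict"), and it models only the WDATA and PLEN fields of the packet.

```python
WORD_BYTES = 4  # PLEN is 4*N for N modified words


def select_coherence_transaction(event, ins):
    """Pick the transaction type from the event and the INS bit of the line.

    event is "write" (WRITE to a replicated line) or "evict" (line replacement);
    both labels are hypothetical.
    """
    if event not in ("write", "evict"):
        raise ValueError("unknown event: %r" % (event,))
    if ins:
        return "BROADCAST_INVAL"
    return "MULTI_UPDATE" if event == "write" else "MULTI_INVAL"


def multi_update_wdata(line_index, first_word, words):
    """Return (PLEN, WDATA values), with one WDATA value per command flit.

    Flit 0 carries the line index, flit 1 the index of the first modified
    word, and the N following flits carry the N modified words.
    """
    plen = WORD_BYTES * len(words)
    return plen, [line_index, first_word] + list(words)
```

For three modified words, `multi_update_wdata` yields a PLEN of 12 and five flits, which matches the (N+2) flit count given above.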

2.3 PUT / GET

The PUT and GET transactions are initiated by the memory caches, to get or save a complete cache line in case of a MISS. The targets are always the external RAM controller(s). All these transactions use a separate network and a separate address space. The memory cache and external RAM controller ports respect a simplified version of the VCI advanced format: the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (the index of a 64-byte cache line). The VCI WDATA and RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index). As the memory cache controller can process several PUT and/or GET transactions simultaneously, the VCI TRDID field contains the transaction index.

  • For a GET transaction, the VCI command packet contains one single flit. The VCI CMD field contains the READ value. The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
  • For a PUT transaction, the VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value. The VCI response packet contains 1 flit.
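The flit counts above follow directly from the 64-byte line and the 64-bit data fields, as the short sketch below shows. The function names are hypothetical, and deriving the 30-bit line index from a byte address by dropping the 6 low-order bits is an assumption consistent with the 64-byte line size.

```python
LINE_BYTES = 64        # cache line size
FLIT_BYTES = 8         # WDATA / RDATA are 64 bits wide
LINE_INDEX_BITS = 30   # width of the VCI ADDRESS field on this network


def line_index(byte_address):
    """Assumed mapping from a byte address to the 30-bit cache line index."""
    return (byte_address // LINE_BYTES) & ((1 << LINE_INDEX_BITS) - 1)


def packet_flits(kind):
    """Return (command flits, response flits) for a GET or PUT transaction."""
    data_flits = LINE_BYTES // FLIT_BYTES
    if kind == "GET":
        return 1, data_flits
    if kind == "PUT":
        return data_flits, 1
    raise ValueError("unknown transaction: %r" % (kind,))
```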