Changes between Version 8 and Version 9 of CacheCoherence


Timestamp:
Feb 7, 2010, 3:59:45 PM (14 years ago)
Author:
alain
Comment:

--

  • CacheCoherence

    v8 v9  
    66
    77This section describes the DHCCP protocol (Distributed Hybrid Cache Coherence Protocol), implemented by the TSAR architecture.
    8 For scalability purposes, the TSAR architecture implement a “Directory Based” cache coherence policy. From a conceptual point of view, the coherence protocol  is supported by a Global Directory located in the memory controller : This Global Directory stores the status of each cache line replicated in at least one L1 cache of the TSAR architecture.
     8For scalability purposes, the TSAR architecture implements a “Directory Based” cache coherence policy.
     9From a conceptual point of view, the coherence protocol is supported by a Global Directory located in the memory controller:
     10This Global Directory stores the status of each cache line replicated in at least one L1 cache of the TSAR architecture.
    911
    10 The main goal being the protocol scalability, the L1 caches implement a WRITE-THROUGH policy. The coherence protocol is much simpler than the MESI protocol used in most architectures implementing a WRITE_BACK policy. With a WRITE-THROUGH policy, the main memory contains always the most recent value of a cache line, and there is NO exclusive ownership state for a L1 cache.
     12Because scalability is the main goal of the protocol, the L1 caches implement a WRITE-THROUGH policy. The coherence protocol
     13is much simpler than the MESI protocol used in most architectures implementing a WRITE_BACK policy.
     14With a WRITE-THROUGH policy, the main memory always contains the most recent value of a cache line,
     15and there is NO exclusive ownership state for an L1 cache.
    1116
    12 The basic mechanism is the following : when the memory controller receives a WRITE request for a given cache line, he must send an UPDATE or INVALIDATE request to all L1 caches containing a copy (but the writer). The write request is acknowledged only when all UPDATE or INVALIDATE transactions are completed.
     17The basic mechanism is the following: when the memory controller receives a WRITE request for a given cache line,
     18it must send an UPDATE or INVALIDATE request to all L1 caches containing a copy (except the writer).
     19The write request is acknowledged only when all UPDATE or INVALIDATE transactions are completed.
    1320
    14 In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed memory caches (one per cluster). Therefore, the global directory itself is distributed.  The memory cache being inclusive: a cache line L that is present in at least one L1 cache must be present in the corresponding memory cache cache (in the home cluster). With this property, the Global Directory can be implemented as an extension of the memory cache directory.
     21In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed memory caches
     22(one per cluster). Therefore, the global directory itself is distributed. The memory cache is inclusive:
     23a cache line L that is present in at least one L1 cache must be present in the corresponding memory cache
     24(in the home cluster). With this property, the Global Directory can be implemented as an extension of the memory cache directory.
    1525
    16 In case of MISS, the memory cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusive property, all copies of the evicted cache line in L1 caches must be invalidated. To do it, the memory cache controller must send INVALIDATE requests to all L1 caches containing a copy.
     26In case of a MISS, the memory cache controller must evict a victim line to bring in the missing line. To maintain the inclusive property,
     27all copies of the evicted cache line in L1 caches must be invalidated. To do so, the memory cache controller must send
     28INVALIDATE requests to all L1 caches containing a copy.
    1729
    18 The TSAR architecture wants to guaranty the cache coherence by hardware, for both the data and instruction L1 caches. Reflecting the different behaviour of data & instruction caches, the "hybrid" cache coherence protocol defines two different strategies, depending on the number of copies :
    19  * '''MULTICAST_UPDATE''' :  the modifications of shared data are very frequent events, but the number of copies is generally not very high. When the number of copies is smaller than the DHCCP threshold, the memory cache controller registers the locations of all the copies, and send a ''multicast_update'' transaction to each concerned L1 cache in case of modification.
    20  * '''BROADCAST_INVAL''' : the modifications of shared code rare events ( self modifying code, or dynamic libraries ), but the number of replicated copies can be very large ( the exception handler, or the libc are generally replicated in all L1 caches ). When the number of copies is larger than the DHCCP threshold, the memory cache controller will simply store the number of copies (without localization) and send a ''broadcast_inval'' transaction  to all L1 caches in case of modication. 
     30The TSAR architecture guarantees cache coherence in hardware, for both the data and instruction L1 caches.
     31Modifications of shared data are very frequent events, but the number of copies is generally not very high.
     32Modifications of shared code are very rare events (self-modifying code, or dynamic libraries), but the number
     33of replicated copies can be very large (the exception handler and the libc are generally replicated in all L1 caches).
     34Reflecting the different behaviour of data and instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies,
     35depending on the number of copies:
     36 * '''MULTICAST_UPDATE''': when the number of copies is smaller than the DHCCP threshold, the memory cache controller registers the locations
     37of all the copies, and sends a ''multicast_update'' transaction to each concerned L1 cache in case of modification.
     38 * '''BROADCAST_INVALIDATE''': when the number of copies is larger than the DHCCP threshold, the memory cache controller registers only the number
     39of copies (without localization) and sends a ''broadcast_invalidate'' transaction to all L1 caches in case of modification.
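
The choice between the two strategies can be illustrated with a minimal Python sketch. The names `Strategy` and `select_strategy` are hypothetical and do not come from the TSAR sources. The treatment of a copy count exactly equal to the threshold follows the wording of section 2.2 (multicast is used while the count "does not exceed" the threshold).

```python
from enum import Enum


class Strategy(Enum):
    MULTICAST_UPDATE = "multicast_update"
    BROADCAST_INVALIDATE = "broadcast_invalidate"


def select_strategy(copy_count: int, threshold: int) -> Strategy:
    """Pick the coherence strategy used when a replicated line is written.

    Assumption: a count equal to the threshold still uses multicast,
    because section 2.2 says multicast applies while the number of
    copies "does not exceed" the DHCCP threshold.
    """
    if copy_count < 0 or threshold < 0:
        raise ValueError("copy_count and threshold must be non-negative")
    if copy_count <= threshold:
        return Strategy.MULTICAST_UPDATE
    return Strategy.BROADCAST_INVALIDATE
```

With a hypothetical threshold of 4, a line with 3 copies would be handled by multicast update, and a line with 5 copies by broadcast invalidate.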
    2140
    2241== 2.  Types of transaction ==
     
    2443Three types of transactions have been identified:
    2544 * Direct transactions : READ / WRITE / LL / SC
    26  * Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP
     45 * Coherence transactions : MULTICAST_UPDATE / MULTICAST_INVALIDATE / BROADCAST_INVALIDATE / CLEANUP
    2746 * External transactions : PUT / GET
    2847       
    2948For deadlock prevention, these three types of transaction must be transported on three (virtually or physically) separate networks.
    3049
    31 As a general rule, all these transactions respect the VCI advanced packet format, and there is one response packet for each command packet : For a burst transaction, a READ command packet contains one single flit, and the corresponding READ response packet contains N flits. Symmetrically, a WRITE command packet contains N flits, and the corresponding WRITE response packet contains one single flit.
     50As a general rule, all these transactions respect the VCI advanced packet format, and there is one response packet for each command packet.
     51For a burst transaction, a READ command packet contains one single flit, and the corresponding READ response packet contains N flits.
     52Symmetrically, a WRITE command packet contains N flits, and the corresponding WRITE response packet contains one single flit.
    3253
    33 There is one exception : For a BROADCAST_INVALIDATE transaction, the initiator sends one single flit VCI packet, but receives several single flit VCI response packets.
     54There is one exception: for a BROADCAST_INVALIDATE transaction, the initiator sends a single-flit VCI packet,
     55but receives several single-flit VCI response packets.
    3456 
    3557=== 2.1  Direct transactions ===
    3658 
    37 These transactions are initiated by a processor (actually the L1 cache controller), or by another initiator ( an I/O peripheral or hardware coprocessor with a DMA capability). This initiator can be located in any cluster. For those transactions, the target is a memory cache controller, acting as a physical memory bank, or another VCI target peripheral. This target can be located in any cluster.
     59These transactions are initiated by a processor (actually the L1 cache controller), or by another initiator
     60(an I/O peripheral or hardware coprocessor with a DMA capability). This initiator can be located in any cluster. For those transactions,
     61the target is a memory cache controller, acting as a physical memory bank, or another VCI target peripheral. This target can be located in any cluster.
    3862
    39  * A '''READ''' transaction can be a single word request (in case of uncached access), or a burst, corresponding to a complete cache line (16 words). A READ burst transaction initiated by any DMA controller must respect the same 16 words cache line format. For all READ transaction, the VCI command packet contains one single VCI flit. The  VCI CMD field contains the VCI_READ code. The VCI PLEN field is used to define the burst length. A READ transaction has a type, encoded with two bits in the VCI TRDID field : bit 0 of the TRID field is 0 for an uncached access, and 1 for a cached access. bit 1 of the TRDID field is 0 for a data cache request, and 1 for an instruction cache request. The response packet contains one VCI flit (single word) or 16 VCI flits (cache line). The VCI PKTID field is not used.
     63The L1 cache controller can issue several simultaneous VCI transactions, which must be distinguished by the TRDID field value.
    4064
    41  * A '''WRITE''' transaction can be a single word request or a variable length burst request. In case of burst, the VCI command packet contains at most 8 VCI flits, with consecutive addresses. All words belong to the same half cache line, and the VCI BE field can have different values for each flit (including the zero value). The VCI response packet contains one VCI flit. A WRITE burst transaction initiated by any DMA controller must respect the same 8 aligned words constraint.  The VCI CMD field contains the VCI_WRITE code. When the VCI TRDID field contains a non-zero value, it signals that the write request is “posted” : The VCI target must send a response to respect the VCI protocol, but this response can be send before the write is actually performed. This can be used by by the VCI/HT bridge. The VCI PKTID fields is not used. If the modified cache line is replicated in one or several other L1 caches, all copies must be updated or invalidated before the WRITE transaction is acknowledged.
     65 * A '''READ''' transaction can be a single word request (in case of uncached access), or a burst, corresponding to a complete cache line (16 words).
     66A READ burst transaction initiated by any DMA controller must respect the same 16-word cache line format.
     67For all READ transactions, the VCI command packet contains one single VCI flit. The VCI CMD field contains the VCI_READ code.
     68The VCI PLEN field is used to define the burst length. A READ transaction has a type, encoded in the VCI TRDID field:
     69The MSB of the TRDID field has the value 1. Bit 0 of the TRDID field is 0 for an uncached access, and 1 for a cached access.
     70Bit 1 of the TRDID field is 0 for a data cache request, and 1 for an instruction cache request.
     71The response packet contains one VCI flit (single word) or 16 VCI flits (cache line).
     72The VCI PKTID field is not used.
     73
     74 * A '''WRITE''' transaction can be a single word request or a variable length burst request. In case of burst, the VCI command packet contains
     75at most 8 VCI flits, with consecutive addresses. All words belong to the same half cache line, and the VCI BE field can have different values
     76for each flit (including the zero value). The VCI response packet contains one VCI flit.
     77A WRITE burst transaction initiated by any DMA controller must respect the same constraint of 8 aligned words.
     78The VCI CMD field contains the VCI_WRITE code.
     79The MSB of the TRDID field has the value 0. The LSB bits of the TRDID field define the index of the write transaction.
     80When these LSB bits have a non-zero value, the write request is “posted”: the VCI target must send a response
     81to respect the VCI protocol, but this response can be sent before the write is actually performed. This can be used by the VCI/HT bridge.
     82The VCI PKTID field is not used.
     83If the modified cache line is replicated in one or several other L1 caches, the memory cache must guarantee that all copies have been
     84updated or invalidated before the WRITE transaction is acknowledged.
    4285
    4386 * The TSAR architecture supports the '''LL/SC''' mechanism for atomic operations (see AtomicOperations). For both an LL (Linked Load) and an SC (Store Conditional) transaction, the VCI command packet and the VCI response packet contain one single VCI flit. The VCI CMD field must contain the VCI_LINKED_READ value (resp. VCI_STORE_CONDITIONNAL value). The VCI PKTID and TRDID fields are not used.
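
The TRDID encoding described for READ and WRITE transactions can be sketched as follows. This is only an illustration: the text does not give the width of the TRDID field, so `TRDID_WIDTH = 4` is an assumption, and the function names are hypothetical.

```python
TRDID_WIDTH = 4                    # assumption: the field width is not stated in the text
MSB = 1 << (TRDID_WIDTH - 1)       # MSB = 1 for READ, 0 for WRITE


def read_trdid(cached: bool, instruction: bool) -> int:
    """READ: MSB set, bit 0 = cached access, bit 1 = instruction cache request."""
    return MSB | (int(instruction) << 1) | int(cached)


def write_trdid(index: int) -> int:
    """WRITE: MSB clear, the LSB bits carry the write transaction index."""
    if not 0 <= index < MSB:
        raise ValueError("write index does not fit below the MSB")
    return index


def decode_trdid(trdid: int) -> dict:
    """Decode a TRDID value according to the rules above."""
    if not 0 <= trdid < (1 << TRDID_WIDTH):
        raise ValueError("TRDID out of range")
    if trdid & MSB:
        return {"kind": "READ",
                "cached": bool(trdid & 0b01),
                "instruction": bool(trdid & 0b10)}
    # A non-zero index marks the write as "posted".
    return {"kind": "WRITE", "index": trdid, "posted": trdid != 0}
```

Under the 4-bit assumption, a cached instruction fetch is encoded as `0b1011`, and a WRITE with index 0 is decoded as not posted.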
     
    4588=== 2.2 Coherence transactions ===
    4689
    47 These transactions implement the DHCCP protocol : For each cache line stored in the memory cache, the memory cache implement a Registration Table that contain the copies replicated in the L1 caches. Each entry in this Registration Table contains the SRCID of the L1 cache that contains a copy, as well as the type of the copy (instruction/data). When the same cache line is replicated in both the instruction cache and the data cache of a processor, this defines two separated entries in the Registration Table. When the number copies for a given cache line L exceeds the DHCCP threshold, the corresponding Registration Table is flushed, and the memory cache registers only the number of copies.
     90For each cache line stored in the memory cache, the memory cache implements a Registration Table that records the copies replicated
     91in the L1 caches. Each entry in this Registration Table contains the SRCID of the L1 cache that contains a copy, as well as the type
     92of the copy (instruction/data). When the same cache line is replicated in both the instruction cache and the data cache of a processor,
     93this defines two separate entries in the Registration Table. When the number of copies for a given cache line L exceeds the DHCCP threshold,
     94the corresponding Registration Table is flushed, and the memory cache registers only the number of copies.
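
The behaviour of the Registration Table for one cache line can be sketched as below. The class name `RegistrationTable` and its attributes are hypothetical, and the sketch assumes that once the table has been flushed the memory cache only increments a counter (it cannot detect a repeated registration in that mode).

```python
class RegistrationTable:
    """Per-line record of the L1 copies, as described in the text above."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.copies = set()        # entries: (srcid, is_instruction)
        self.count = 0             # number of copies
        self.count_only = False    # True once the table has been flushed

    def register(self, srcid: int, is_instruction: bool) -> None:
        if self.count_only:
            # Locations are no longer tracked: only the number of copies is kept.
            self.count += 1
            return
        # Data and instruction copies in the same processor are separate entries.
        self.copies.add((srcid, is_instruction))
        self.count = len(self.copies)
        if self.count > self.threshold:
            # Threshold exceeded: flush the locations, keep only the count.
            self.copies.clear()
            self.count_only = True
```

With a hypothetical threshold of 2, registering the data and instruction copies of one processor keeps two located entries, and a third registration flushes the table and leaves a count of 3.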
    4895
    4996The coherence transactions use a logically separate ''coherence network'', implementing a separate address space.
     97All these transactions are write transactions.
    5098 
    51  * A '''MULTICAST_UPDATE''' transaction is a multi-cast transaction sent by the memory cache controller when it receives a WRITE request to a replicated cache line and the number of copies does not exceeds the DHCCP threshold. It sends as many VCI transactions as the number of registered copies (but the writer). The VCI command packet contains (N+2) flits. The VCI ADDRESS field is constant and contains the address of the memory mapped UPDATE register in the L1 cache. The VCI CMD field contains the WRITE value. As the memory cache controller can handle several simultaneous update/invalidate transactions, the VCI TRDID field contains the transaction index. The VCI PLEN field contains the value  4*N, where N is the actual number of modified words in the cache line. The line index (34 bits) is transported in the VCI WDATA and VCI BE fields (the two LSB bits), of the first flit. The first modified word index (3 bits) is transported in the WDATA field of the second flit, and the N modified words in the WDATA and BE fields of the N following flits. For each modified word, the VCI BE field can have a different value (including the 0x0 value). The VCI response packet contains one single flit. The memory cache controller counts the number of VCI responses to detect the completion of the MULTI_UPDATE transaction.
     99 * A '''MULTICAST_UPDATE''' transaction is a multi-cast transaction sent by the memory cache controller when it receives a WRITE request
     100to a replicated cache line and the number of copies does not exceed the DHCCP threshold. It sends as many VCI transactions as the number
     101of registered copies (except the writer). The VCI command packet contains (N+2) flits. The VCI ADDRESS field is constant and contains the address
     102of the memory mapped UPDATE register in the L1 cache. The VCI CMD field contains the VCI_WRITE code. As the memory cache controller can
     103handle several simultaneous update/invalidate transactions, the VCI TRDID field contains the transaction index. The VCI PLEN field contains the value 4*N,
     104where N is the actual number of modified words in the cache line. The line index (34 bits) is transported in the VCI WDATA and VCI BE fields (the two LSB bits)
     105of the first flit. The first modified word index (3 bits) is transported in the WDATA field of the second flit, and the N modified words in the WDATA and BE fields
     106of the N following flits. For each modified word, the VCI BE field can have a different value (including the 0x0 value).
     107The VCI response packet contains one single flit. The memory cache controller counts the number of VCI responses to detect the completion
     108of the MULTICAST_UPDATE transaction.
    52109
    53  * A '''MULTICAST_INVAL''' transaction is a multi-cast transaction, that is composed of several VCI transactions. When a memory cache makes a cache line replacement (following a MISS), and the victim line has a number of copies smaller than the DHCCP threshold, it sends as many VCI transactions as the number of registered copies. Both the VCI command packet and the VCI response packet contain only one flit. The VCI ADDRESS field contains the address of the memory mapped INVAL register in the L1 cache. The VCI CMD field contains the WRITE value. As the memory cache controller can handle several update/invalidate transactions simultaneously, the VCI TRDID field contains the transaction index.The VCI WDATA and VCI BE (the two LSB bits) fields contain the 34 bits line index. The memory cache controller counts the number of VCI responses to detect the completion of the  MULTI_INVAL transaction.
     110 * A '''MULTICAST_INVALIDATE''' transaction is a multi-cast transaction composed of several VCI transactions. When a memory cache makes a cache line
     111replacement (following a MISS in the memory cache), and the victim line has a number of copies smaller than the DHCCP threshold, it sends as many VCI transactions
     112as the number of registered copies. Both the VCI command packet and the VCI response packet contain only one flit. The VCI ADDRESS field contains the address
     113of the memory mapped INVAL register in the L1 cache. The VCI CMD field contains the VCI_WRITE code. As the memory cache controller can handle several
     114update/invalidate transactions simultaneously, the VCI TRDID field contains the transaction index. The VCI WDATA and VCI BE (the two LSB bits) fields contain the
     11534-bit line index. The memory cache controller counts the number of VCI responses to detect the completion of the MULTICAST_INVALIDATE transaction.
    54116
    55  * A '''BROADCAST_INVAL''' transaction is a broadcast transaction. This transaction is initiated when a memory cache controller replaces a line, or receives a WRITE request to a replicated cache line, that has a number of copies larger than the DHCCP threshold. The VCI command packet contains one single flit. This packet is replicated and dynamically broadcasted by the network itself. The VCI CMD field contains the WRITE value. The VCI ADDRESS field contains the global broadcast address 0x0000000003 (only the two LSB bits are set). The VCI WDATA and the VCI BE (the two LSB bits) field contain the line index. This VCI command is broadcasted to all L1 caches in the system, but only L1 caches that have a copy send a VCI response packet. All VCI response packets are independently returned to the memory cache initiator, that counts the number of VCI responses to detect the completion of the BROADCAST_INVAL transaction. If a L1 cache contains two copies of a cache line (i.e. the line is replicated in both the DATA cache, and the INSTRUCTION cache), it must send two VCI responses.
     117 * A '''BROADCAST_INVALIDATE''' transaction is a broadcast transaction. This transaction is initiated when a memory cache controller replaces a line,
     118or receives a WRITE request to a replicated cache line, and this cache line has a number of copies larger than the DHCCP threshold.
     119The VCI command packet contains one single flit. This packet is replicated and dynamically broadcast by the network itself.
     120The VCI CMD field contains the VCI_WRITE code. The VCI ADDRESS field contains the global broadcast address 0x0000000003
     121(only the two LSB bits are set). The VCI WDATA and the VCI BE (the two LSB bits) fields contain the line index.
     122This VCI command is broadcast to all L1 caches in the system, but only L1 caches that have a copy send a VCI response packet.
     123All VCI response packets are independently returned to the memory cache initiator, which counts the number of VCI responses
     124to detect the completion of the BROADCAST_INVALIDATE transaction.
     125If an L1 cache contains two copies of a cache line (i.e. the line is replicated in both the DATA cache and the INSTRUCTION cache), it must send two VCI responses.
    56126
    57  * A '''CLEANUP''' transaction is initiated by a L1 cache controller to a memory cache controller, to signal that a cache line copy has been removed from an instruction or data cache. Both the VCI command packet and the VCI response packet contain one single flit. For a CLEANUP transaction, the VCI ADDRESS field must contain the removed cache line address, and the VCI TRDID field must contain a non zero value.
     127 * A '''CLEANUP''' transaction is initiated by an L1 cache controller to a memory cache controller, to signal that a cache line copy
     128has been removed from an instruction or data cache. Both the VCI command packet and the VCI response packet contain one single flit.
     129For a CLEANUP transaction, the VCI ADDRESS field must contain the removed cache line address, and the VCI TRDID field must contain a non-zero value.
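
The layout of a MULTICAST_UPDATE command packet can be sketched as follows. Several points are assumptions that the text does not settle: the low 32 bits of the line index are placed in WDATA and the two upper bits in the two LSB bits of BE, WDATA is 32 bits wide with a 4-bit BE, and the BE value of the second flit is left as `None` because it is not specified. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Flit = Tuple[int, Optional[int]]   # (WDATA, BE)


def build_multicast_update(line_index: int, first_word: int,
                           words: List[Tuple[int, int]]) -> dict:
    """Build the (N+2)-flit command packet for N modified words."""
    if not 0 <= line_index < (1 << 34):
        raise ValueError("line index must fit in 34 bits")
    if not 0 <= first_word < (1 << 3):
        raise ValueError("first modified word index must fit in 3 bits")
    if not words:
        raise ValueError("at least one modified word is required")
    flits: List[Flit] = [
        # Flit 0: line index (assumed split: WDATA = bits 31..0, BE[1:0] = bits 33..32).
        (line_index & 0xFFFFFFFF, (line_index >> 32) & 0x3),
        # Flit 1: index of the first modified word; BE is not specified by the text.
        (first_word, None),
    ]
    # Flits 2..N+1: one modified word each, with its own byte enable (may be 0x0).
    flits.extend((wdata & 0xFFFFFFFF, be & 0xF) for wdata, be in words)
    return {"cmd": "VCI_WRITE", "plen": 4 * len(words), "flits": flits}


def line_index_of(first_flit: Flit) -> int:
    """Recover the 34-bit line index from the first flit (same assumed split)."""
    wdata, be = first_flit
    return (((be or 0) & 0x3) << 32) | wdata
```

For two modified words, the packet has 4 flits and a PLEN value of 8. An L1 cache receiving the packet would apply `line_index_of` to the first flit to find the line to update.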
     130
     131
    58132
    59133=== 2.3 External transactions ===
    60134
    61 These transactions are initiated by the memory caches, to fetch or save a complete cache line in case of MISS in the memory cache. The general policy between the memory caches and the external memory is WRITE_BACK : The external memory is only updated in case of line replacement. The target is always the external RAM controller.
     135These transactions are initiated by the memory caches, to fetch or save a complete cache line in case of MISS in the memory cache.
     136The general policy between the memory caches and the external memory is WRITE_BACK : The external memory is only updated
     137in case of line replacement. The target is always the external RAM controller.
    62138
    63 All the external transactions use a separated ''external network'', implementing a separated address space. The memory cache and external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format : the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64 bytes cache line index). (30 bits). The VCI WDATA & RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index). As the memory cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index.
     139All the external transactions use a separate ''external network'', implementing a separate address space. The memory cache and
     140the external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format:
     141the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64-byte cache line index).
     142The VCI WDATA and RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index).
     143As the memory cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index.
    64144
    65  * For a '''GET''' transaction, the VCI command packet contains one single flit. The VCI CMD field contains the READ value. The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
     145 * For a '''GET''' transaction, the VCI command packet contains one single flit. The VCI CMD field contains the READ value.
     146The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
    66147
    67  * For a PUT transaction, the VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value. The VCI response packet contains 1 flit.
     148 * For a '''PUT''' transaction, the VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value.
     149The VCI response packet contains 1 flit.
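
The flit counts of the GET and PUT transactions follow from the 64-byte cache line and the 64-bit data fields of the external network. The short sketch below reproduces this arithmetic; the function and constant names are hypothetical.

```python
LINE_BYTES = 64          # cache line size given in the text
EXT_DATA_BYTES = 8       # 64-bit WDATA / RDATA fields on the external network


def external_flits(kind: str) -> tuple:
    """Return (command flits, response flits) for an external transaction."""
    data_flits = LINE_BYTES // EXT_DATA_BYTES
    if kind == "GET":
        return (1, data_flits)     # single-flit command, full line in the response
    if kind == "PUT":
        return (data_flits, 1)     # full line in the command, single-flit response
    raise ValueError("unknown external transaction: " + kind)
```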
    68150
    69151