[[PageOutline]]

= Atomic Operations =

== 1. Goals ==

The TSAR architecture implements two atomic read-then-write operations to support various synchronization mechanisms:
 * The '''LL/SC''' (Linked Load / Store Conditional) operations are implemented as TWO specific Command/Response VCI transactions. As the LL and SC instructions are part of the MIPS32 instruction set, they can be used by both the kernel code and the application code to read the data at address X, test this data, and write the (possibly modified) data back at the same address X, with the guarantee that no other access to this data was done between the read and the write access.
 * The '''CAS''' (Compare and Swap) operation is implemented as ONE specific Command/Response VCI transaction. As there is no CAS instruction in the MIPS32 instruction set, this operation cannot be used by the software. It is only used by some hardware components, such as the MMU contained in the L1 cache controller (to update the DIRTY bit in the page tables), or by some DMA peripherals such as the vci_mwmr_dma component, to atomically access the lock protecting a shared communication channel.

== 2. LL/SC mechanism ==

As we want to support commodity operating systems and existing software applications, any memory address can be the target of an atomic access. As the atomic access can be used to implement spin-locks, the address must be cacheable in order to benefit from the general coherence protocol, and avoid unnecessary transactions on the interconnection network.

=== 2.1 General principle ===

From a conceptual point of view, atomicity is handled on the memory controller side, which is actually the L2 cache controller in the TSAR architecture. Each L2 cache controller keeps the list of all pending LL/SC atomic operations in an associative ''reservation table'' that contains 32 entries.
 * When a processor P executes the LL(X) instruction to an address X, this reservation request is sent to the L2 cache by the L1 cache. The L2 cache allocates a 32-bit authentication key K for this reservation. It registers both the address X and the key K in the associative ''reservation table'', and returns both the value stored at address X and the K value to the L1 cache. Both the address X and the key K are also registered in the L1 cache. If another processor P' requests a reservation for the same address X, it receives the same K value from the L2 cache.
 * When a processor P executes the SC(X,D) instruction to an address X, this conditional write is sent to the L2 cache by the L1 cache, and the command also contains the reservation key K and the data D to be written. The L2 cache makes an associative search in the ''reservation table''. If both the address X and the key K match, the atomic operation is a success: the reservation is canceled in the ''reservation table'', the D value is written at address X, and a ''success'' value is returned in the response to the L1 cache. If there is no match in the ''reservation table'', the atomic operation is a failure: the D value is not written at address X, the ''reservation table'' is not modified, and a ''failure'' value is returned to the L1 cache.

Clearly, in case of concurrent LL/SC accesses to the same address X by two or more L1 caches, the winner is defined by the first SC(X) command received by the L2 cache.
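The C fragment below is a minimal functional sketch of this principle, not the actual SystemC model: a 32-entry table, a free-running key counter (see section 2.2), and the associative match performed when a SC command is received. The names llsc_table, llsc_ll and llsc_sc, as well as the choice of a victim entry, are purely illustrative, and the data actually read or written at address X is not shown.
{{{
#include <stdint.h>
#include <stdbool.h>

#define LLSC_ENTRIES 32

/* one entry of the associative reservation table held by the L2 cache */
typedef struct {
    bool     valid;
    uint64_t address;    /* 40-bit physical address in the real hardware */
    uint32_t key;
} llsc_entry_t;

static llsc_entry_t llsc_table[LLSC_ENTRIES];
static uint32_t     next_key;     /* simple 32-bit key allocator */

/* LL(X): register a reservation and return the key K sent back to the L1 */
uint32_t llsc_ll(uint64_t x)
{
    int slot = 0;
    for (int i = 0; i < LLSC_ENTRIES; i++) {
        if (llsc_table[i].valid && llsc_table[i].address == x)
            return llsc_table[i].key;       /* same address => same key   */
        if (!llsc_table[i].valid)
            slot = i;                       /* remember a free entry      */
    }
    /* if no free entry exists, a victim is chosen (policy not modeled)   */
    llsc_table[slot] = (llsc_entry_t){ true, x, next_key };
    return next_key++;
}

/* SC(X,K): associative match, cancel the reservation, report the status  */
bool llsc_sc(uint64_t x, uint32_t k)
{
    for (int i = 0; i < LLSC_ENTRIES; i++) {
        if (llsc_table[i].valid &&
            llsc_table[i].address == x &&
            llsc_table[i].key == k) {
            llsc_table[i].valid = false;    /* first SC wins              */
            return true;                    /* the write is then performed */
        }
    }
    return false;                           /* failure: nothing written   */
}
}}}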
=== 2.2 Key allocation policy ===

The key allocator is a simple 32-bit counter, which is incremented each time a new K value is allocated to satisfy a LL requiring a new reservation. As there is a finite number of values, the mechanism requires that a K value allocated for a given reservation has a finite ''time of life'':
 * In the ''reservation table'' implemented in the L2 cache, the maximum ''time of life'' for an entry containing the K value is defined by a maximum number of (2**31) allocated keys: each time a new K' value is allocated, any entry in the ''reservation table'' containing a K value such that (|K - K'| == 2**31) is invalidated.
 * In the L1 cache, the bounded ''time of life'' of the registered reservation is implemented as a cycle counter: the counter starts when a new reservation is registered in the L1 cache, and the reservation is invalidated when the counter reaches 2**31 cycles.

=== 2.3 Replacement policy ===

=== 2.4 Detailed specification for the L1 cache ===

Each L1 cache controller contains 4 registers to store one single LL/SC reservation:
 * physical address : 40 bits
 * reservation key : 32 bits
 * cycle counter : 31 bits
 * valid reservation : 1 bit

We summarize below the actions done by the L1 cache controller receiving a LL(X), SC(X,D) or SW(X,D) request from the processor:
 * '''LL(X)''' : The L1 cache registers the X address and the K key in the reservation registers, activates the cycle counter, and sends a single-flit VCI LL command containing the X address to the L2 cache.
 * '''SC(X,D)''' : The L1 cache checks the X address against the registered address. In case of miss, it returns a ''failure'' code to the processor, without any VCI transaction on the network. In case of hit, it invalidates the local reservation and sends a two-flit VCI SC command containing the X address, the registered K value, and the D value.
 * '''SW(X,D)''' : The L1 cache checks the X address against the registered address. In case of hit, the reservation is invalidated. In case of miss, the reservation is not modified. In both cases the write request is sent to the L2 cache.

=== 2.5 Detailed specification for the L2 cache ===

We summarize

=== 2.6 Failure / Success encoding ===

The actual encoding of the (success/failure) return value for a SC access depends on the processor core: for the MIPS32 and ARM processors, a success is encoded as a non-zero value; for the PPC processor, a success is encoded as a zero value. In the TSAR architecture, the memory cache controller returns the value 0 for a success, and the value 1 for a failure. If the architecture uses a MIPS or ARM processor, the SC value must therefore be transcoded by the L1 cache controller before being transmitted to the processor core.

=== 2.7 Software implementation on MIPS32 processor ===

As described below, the LL/SC mechanism can be used to implement a spin-lock, using any memory address:
 * The lock acquisition is done by an atomic LL/SC operation.
 * The lock release is done by a simple WRITE instruction.

Remember that a SC failure is encoded by a zero value for the MIPS processor.
{{{
    _itmask                 # enter critical section

                            # lock acquisition
loop:
    LL   r1, 0(r4)          # r1 <= M[r4]
    BNEZ r1, loop           # retry if lock already taken (r1 != 0)
    ORI  r1, r0, 1          # r1 <= 1
    SC   r1, 0(r4)          # if atomic (M[r4] <= 1 / r1 <= 1) else (r1 <= 0)
    BEQZ r1, loop           # retry if not atomic (r1 == 0)
    ...
                            # lock release
    ORI  r1, r0, 0          # r1 <= 0
    SW   r1, 0(r4)          # M[r4] <= 0

    _itunmask               # exit critical section
}}}
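For code written in C, the same sequence can be wrapped in an inline-assembly helper. The sketch below assumes a GCC-style MIPS32 toolchain; the spin_lock / spin_unlock names are illustrative, and the interrupt masking performed by _itmask / _itunmask in the listing above is left to the caller.
{{{
/* minimal sketch of a C wrapper around the spin-lock sequence above */
static inline void spin_lock(volatile unsigned int *lock)
{
    unsigned int tmp;
    __asm__ __volatile__(
        ".set push              \n"
        ".set noreorder         \n"
        "1: ll   %0, 0(%1)      \n"   /* tmp <= M[lock]                    */
        "   bnez %0, 1b         \n"   /* retry if lock already taken       */
        "   ori  %0, $0, 1      \n"   /* delay slot: tmp <= 1              */
        "   sc   %0, 0(%1)      \n"   /* if atomic: M[lock] <= 1, tmp <= 1 */
        "   beqz %0, 1b         \n"   /* retry if the SC was not atomic    */
        "   nop                 \n"   /* delay slot                        */
        ".set pop               \n"
        : "=&r" (tmp)
        : "r" (lock)
        : "memory");
}

static inline void spin_unlock(volatile unsigned int *lock)
{
    *lock = 0;                        /* lock release is a simple write    */
}
}}}
The release path is a plain write, consistent with the SW(X,D) behaviour described in section 2.4: the write invalidates any matching local reservation and is sent to the L2 cache.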
== 3. CAS mechanism ==

=== 3.1 New semantic ===

The formal semantic of a LL/SC access is:

(1) The Store Conditional succeeds if there was no other Store Conditional at the same address since the last Linked Load.

The implemented semantic is:

(2) The Store Conditional succeeds if the content of the memory has not changed since the last Linked Load.

Clearly we have (1) => (2). But (2) => (1) holds only when the software does not use the LL/SC mechanism to monitor WRITEs. Such use of the LL/SC mechanism has to be avoided.

=== 3.2 New protocol ===

In the new protocol, there is no LL transaction on the network any more:
 * Linked Loads become simple Reads, where the data sent to the processor is recorded in a register of the L1 cache. When the processor issues a Store Conditional, the L1 cache sends a "SC" packet, where the first flits contain the data previously read and the next flits contain the data to write in the memory. This new SC packet is therefore 2 flits long (32-bit accesses) or 4 flits long (64-bit accesses).
 * The memory cache controller then compares the data read by the L1 cache to the data in the memory cache. If these two values are equal, the Store Conditional is done, and the response to the SC is success (0 value); otherwise the Store Conditional is not done, and the response is failure (1 value).

=== 3.3 Memory cache controller ===

The memory cache controller does not have a linked access buffer any more. It simply has to be capable of recording pending Store Conditional requests in the table of transactions to the external RAM controller, in case of miss in the memory cache.

The actions done by the memory cache controller for the various commands are described below:

'''SC(SRCID, X)'''
{{{
Read the data in the memory cache.
If ( data read by the L1 cache == data from the memory cache )
{
    - write data in the memory cache
    - send updates or invalidates to the owners of this cache line
    - after all responses to UPDATE or INVALIDATE have been received,
      return true to the L1 cache.
}
else
{
    return false to the L1 cache
}
}}}

'''WRITE(SRCID, X)'''
{{{
- Scan the linked list of copies to send an UPDATE request to the L1 caches (other than SRCID)
- Write data in the memory cache
- After all responses to UPDATE have been received, acknowledge the write request.
}}}

'''READ(SRCID, X)'''
{{{
If ( cachable request )
{
    - register the SRCID in the list of copies associated to the X address.
    - return the complete or partial cache line
}
else
{
    - return a single word.
}
}}}
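The comparison step itself can be modeled by the short C sketch below. It ignores the coherence traffic (updates and invalidates to the copies) and the miss case handled through the transaction table; the memc_sc_32 / memc_sc_64 names are illustrative only.
{{{
#include <stdint.h>

/* Functional model of the SC comparison in the memory cache controller:
 * 'old_data' is carried by the first flit(s) of the SC packet (the value
 * previously read by the L1 cache), 'new_data' is the data to write.
 * The returned value is the SC response: 0 = success, 1 = failure.      */
uint32_t memc_sc_32(uint32_t *word, uint32_t old_data, uint32_t new_data)
{
    if (*word == old_data) {
        *word = new_data;   /* write in the memory cache                  */
        return 0;           /* success (updates/invalidates not modeled)  */
    }
    return 1;               /* failure: the memory cache is not modified  */
}

/* 64-bit variant, corresponding to the 4-flit SC packets */
uint32_t memc_sc_64(uint64_t *dword, uint64_t old_data, uint64_t new_data)
{
    if (*dword == old_data) {
        *dword = new_data;
        return 0;
    }
    return 1;
}
}}}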
=== 3.4 L1 cache controller ===

The L1 cache controller receiving a new LL(X) request from the processor must locally register both the X address and the data sent to the processor. When it receives a SC(X) request from the processor, it checks the LL register and, if this register is valid, it sends a Store Conditional to the memory cache controller containing both the data read by the LL(X) and the data to write. This requires an extra register to store the address and the data, a VALID flip-flop, and a LONG flip-flop (in case of 64-bit LL/SC) in the L1 cache controller.

The actions done by the L1 cache controller for the various commands are described below:

'''LL(X) from processor'''
{{{
If (VALID == true and ADDRESS == X)
{
    // local spin-lock
    return the registered data to the processor
}
else
{
    // first LL access
    VALID   <= true
    ADDRESS <= X
    DATA    <= READ(X), and return the read value to the processor
}
}}}

'''SC(X) from processor'''
{{{
If (VALID == true and ADDRESS == X)
{
    // possible success
    send a SC(X) request to the memory cache containing the data stored
    in the LL register and the data to write, and return the Boolean
    response to the processor
    VALID <= false
}
else
{
    // failure
    return a false value to the processor
    VALID <= false
}
}}}

'''INVAL(L) or UPDATE(L) from memory controller'''
{{{
The LL register must not be invalidated.
The L1 cache line is updated or invalidated.
}}}
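The following C sketch models this reservation register and the LL/SC handling described above. The vci_read_32 and vci_sc_32 helpers are assumptions standing for the VCI transactions (the memory cache returns 0 for success and 1 for failure), and the final transcoding to the MIPS32 convention (non-zero = success) follows section 2.6.
{{{
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers modeling the VCI transactions issued by the L1:
 * a simple cached read, and the 2-flit SC command of section 3.2.       */
extern uint32_t vci_read_32(uint64_t x);
extern uint32_t vci_sc_32(uint64_t x, uint32_t old_data, uint32_t new_data);

/* LL/SC reservation register of one L1 cache controller (new protocol).
 * An INVAL or UPDATE received from the memory cache updates or
 * invalidates the L1 cache line, but must not clear the VALID flag.     */
typedef struct {
    bool     valid;      /* VALID flip-flop                              */
    bool     is_long;    /* LONG flip-flop, 64-bit LL/SC (not modeled)   */
    uint64_t address;    /* 40-bit physical address                      */
    uint32_t data;       /* data returned to the processor by the LL     */
} l1_llsc_reg_t;

/* LL(X) from the processor: a simple read whose result is recorded      */
uint32_t l1_ll(l1_llsc_reg_t *r, uint64_t x)
{
    if (!(r->valid && r->address == x)) {   /* first LL access           */
        r->valid   = true;
        r->is_long = false;
        r->address = x;
        r->data    = vci_read_32(x);
    }
    return r->data;                         /* local spin-lock: reuse it */
}

/* SC(X,D) from the processor: old data + data to write sent to memory   */
uint32_t l1_sc(l1_llsc_reg_t *r, uint64_t x, uint32_t d)
{
    uint32_t resp = 1;                      /* failure by default        */
    if (r->valid && r->address == x)
        resp = vci_sc_32(x, r->data, d);
    r->valid = false;
    return (resp == 0) ? 1 : 0;             /* MIPS32: non-zero = success */
}
}}}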