wiki:AtomicOperations

Version 11 (modified by alain, 8 years ago) (diff)

--

Atomic Operations

1. Goals

The TSAR architecture implements atomic read-then-write operations to support various software synchronization mechanisms. The constraints are the following :

  • A software program must have the possibility to read a data at address X, test this data, and write the (possibly modified) data at the same address X, with the guaranty that no other access to this data was done between the read and write access.
  • As we want to support commodity operating systems and existing software applications, any memory address can be the target of an atomic access.
  • As the atomic access can be used to implement spin-locks, the lock address must be cacheable in order to benefit from the general coherence protocol, and avoid unnecessary transactions on the interconnection network.

2. LL/SC mechanism

The TSAR memory sub-system supports the LL/SC mechanism. The LL & SC commands are defined in the VCI/OCP protocol, and the LL and SC instructions must be defined in the processor Instructon Set Architecture. This is natively the case for the MIPS32 & PowerPC processors. On the direct network, the VCI CMD field can take four values : READ, WRITE, LINKED_LOAD (LL), and STORE_CONDITIONAL (SC). From a conceptual point of view, the atomicity his handled on the memory controller side (actually the memory cache controller), as the memory controllers must maintain a list of all pending atomic operations in a reservation table :

2.1 General principle

  • When a processor, identified by its SRCID, executes the LL(X) instruction to an address X, the memory controller registers an entry (SRCID, X) in the reservation table, and returns the memory value stored at address X in the VCI RDATA field. If there was another reservation for the same processor SRCID, but for another address X’, the previous reservation for X’ is lost (it means that the previous reservation is cancelled).
  • When a processor, identified by its SRCID, executes the SC(X) instruction, there is two possibilities. If there is a valid reservation entry (SRCID, X) indicating that no other access to the X address has been received, the atomic operation is a success : the write is done, the memory cache controller returns a success value in he RDATA VCI field and all entries in the reservation table for the X address are cancelled. If there is no valid reservation entry (SRCID, X) in the reservation table, the write is not done, and the memory cache returns a fail value in the RDATA field.

Clearly, in case of concurrent access, the winner is defined by the first SC instruction received by the memory controller.

2.2 Failure / Success encoding

The actual encoding of the (success/failure) return value for a SC access depends on the processor core: For the MIPS2 and ARM processors, a success is encoded as a non-zero value. For the PPC processor, a success is encoded as a zero value. In the TSAR architecture, the memory cache controller returns the value 0 for a success, and the value 1 for a failure. If the architecture uses a MIPS or ARM processor, the SC value must be transcoded by the L1 cache controller before to be transmitted to the processor core.

2.3 Software implementation on MIPS32 processor

As described below, the LL/SC mechanism can be used to implement a spin-lock, using any memory address :

  • The lock acquisition is done by an atomic LL/SC operation.
  • The lock release is done by a simple WRITE instruction.

Remember that a SC failure in encoded by a zero value for the MIPS processor.

			_itmask		        # enter critical section
# lock acquisition
loop		        LL r1,   0(r4)		# r1 <= M[r4]
			BNEZ r1, loop	        # retry if lock already taken (r1 != 0)
			ORI r1,  r0, 1	        # r1 <= 1
			SC  r1,  0(r4)	        # if atomic (M[r4] <= 1 / r1 <= 1) else (r1 <= 0)
			BEQZ r1, loop	        # retry if not atomic (r1 == 0)
			...
# lock release
			ORI r1, r0, 0           # r1 <= 0
			SW r1, 0(r4)	        # M[r4] <= 0
			_itunmask		# exit critical section

3. Cachable atomic operations

In order to support cachable spin-locks and a better scalability, the TSAR memory cache controller, and the L1 cache controller cooperate to implement the LL/SC mechanism. But the standard semantic of the LL/SC mechanism has to be modified:

  • The LL operation is implemented by the L1 cache controller as a standard Read operation.
  • The SC opration is implemented as a Compare and Swap operation.

Furthermore, the LL/SC mechanism is extended to support both 32 and 64 bits atomic accesses.

3.1 new semantic

The formal semantic of a LL/SC access is : (1) The Store Conditional succeeds if there was no other Store Conditional at the same address since the last Linked Load.

The implemented semantic is : (2) The Store Conditional succeeds if the content of the memory has not changed since the last Linked Load.

Clearly we have (1) => (2) . But we have only (2) => (1) when the software does not use the LL/SC mechanism to monitor WRITEs. Such use of the LL/SC mechanism has to be avoided.

3.2 new protocol

In the new protocol, there is no more LL on the network :

  • Linked Loads become simple Reads, where the data sent to the processor is recorded in a register of the L1 cache. When the processor issues a Store Conditional, the L1 cache sends a "SC" packet, where the first flits contain the data previously read and the next flits contain the data to write in the memory. So this new SC packet is 2 (32 bits accesses) or 4 flits (64 bits accesses) long.
  • The memory cache controller then compares the data read by the L1 cache to the data in the memory cache. If these two values are equal, the Store Conditional is done, and the response to the SC is success (0 value), else the Store Conditional is not done, and the response is failure (1 value).

3.3 memory cache controller

The memory cache controller does not have a linked access buffer any more. It simply has to be capable of recording pending Store Conditional in the table of transactions to the external RAM controller, in case of miss in the memory cache.

The actions done by the memory cache controller for the various commands are described below :

SC(SRCID, X)

  Read the data in the memory cache.
  If ( data read by the L1 cache == data from the memory cache ) {
      - write data in the memory cache
      - send updates or invalidates to the owners of this cache line
      - after all responses to UPDATE or INVALIDATE have been received, return true 
        to the L1 cache.   
  } else {
      return false to the L1 cache
  }

WRITE(SRCID, X)

  - Scan the linked list of copies to send an UPDATE request 
    to the L1 caches (other than SRCID)
  - Write data in the memory cache
  - after all responses to UPDATE have been received, acknowledge the 
    write request.

READ(SRCID, X)

  If ( cachable request ) {
    - register the SRCID in the list of copies associated to the X address.
    - return the complete or partial cache line
  } else {
    - return a single word.
  }

3.4 L1 cache controller

The L1 cache controller receiving a new LL(X) request from the processor must locally register this request on the X address and the data sent to the processor. When it receives a SC(X) request from the processor, it checks the LL register and, if this register is valid, it sends a Store Conditional to the memory cache controller containing both the data read by the LL(X) and the data to write. This requires an extra register to store the address, the data, a VALID flip-flop, and a LONG flip-flop (in case of 64 bits LL/SC) in the L1 cache controller.

The actions done by the L1 cache controller for the various commands are described below :

LL(X) from processor

  If (VALID = true & ADDRESS = X) {    // local spin-lock         
    return the read data to the processor
  } else {                                          // first LL access                       
    VALID <= true
    ADDRESS <= X
    DATA <= READ(X),
    and return the read value to the processor
  } 

SC(X) from processor

  If (VALID = true & ADDRESS = X)  {    // possible success
    send a SC(X) request to the memory cache containing the data stored in the LL register and the data to write,
    and return the Boolean response to the processor
    VALID <= false
  } else {                                         // failure  
   return a false value to the processor
   VALID <= false
  }

INVAL(L) or UPDATE(L) from memory controller

  The LL register must not be invalidated.
  The L1 cache is updated or invalidated.