Version 33 (modified by alain, 4 years ago)


Input/Output Operations

A) Peripheral identification

ALMOS-MK identifies a peripheral by a composite index (func,impl). The func index defines a functional type, the impl index defines a specific hardware implementation.

  • Each value of the functional index func defines a generic (implementation-independent) device XXX, characterized by an API defined in the dev_xxx.h file. This generic API allows the kernel to access the peripheral without regard to the actual hardware implementation.
  • For each generic device XXX, several hardware implementations can exist, and each value of the implementation index impl is associated with a specific driver that must implement the API defined for the XXX generic device.

ALMOS-MK supports two types of peripheral components:

  • External peripherals are accessed through a bridge located in one single cluster (called cluster_io, identified by the io_cxy parameter in the arch_info description). External devices are shared resources that can be used by any thread running in any cluster. Examples are the generic IOC device (Block Device Controller), the generic NIC device (Network Interface Controller), the generic TXT device (Text Terminal), the generic FBF device (Frame Buffer for Graphical Display Controller).
  • Internal peripherals are replicated in all clusters. Each internal peripheral is associated with the local kernel instance, but can be accessed by any thread running in any cluster. There are very few internal peripherals. Examples are the generic ICU device (Interrupt Controller Unit) and the generic MMC device (L2 Cache Configuration and coherence management).

ALMOS-MK supports multi-channel external peripherals, where one single peripheral controller contains N channels that can run in parallel. Each channel has a separate set of addressable registers, and each channel can be used by the OS as an independent device. Examples are the TXT peripheral (one channel per text terminal) and the NIC peripheral (one channel per MAC interface).

The set of available peripherals and their location in a given many-core architecture must be described in the arch_info file. For each peripheral, the composite index is implemented as a 32-bit integer, where the 16 MSB bits define the type (the func index) and the 16 LSB bits define the subtype (the impl index).
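The 16/16 bit split described above can be illustrated with a minimal C sketch. The helper names (dev_index_make, dev_index_func, dev_index_impl) are hypothetical and are not taken from the ALMOS-MK sources; the sketch also assumes that the "type" field holds the func index and the "subtype" field holds the impl index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: only the 16 MSB / 16 LSB split comes from the text. */

/* Pack a functional index and an implementation index into one 32-bit value. */
static inline uint32_t dev_index_make(uint32_t func, uint32_t impl)
{
    assert(func <= 0xFFFFu && impl <= 0xFFFFu);  /* each field is 16 bits wide */
    return (func << 16) | impl;
}

/* Extract the functional index (16 MSB bits). */
static inline uint32_t dev_index_func(uint32_t index)
{
    return index >> 16;
}

/* Extract the implementation index (16 LSB bits). */
static inline uint32_t dev_index_impl(uint32_t index)
{
    return index & 0xFFFFu;
}
```

For instance, with func = 3 and impl = 1, the composite index would be 0x00030001.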

B) Generic Devices APIs

To represent the available peripherals in a given manycore architecture, ALMOS-MK uses generic device descriptors (implemented by the device_t structure). For multi-channel peripherals, ALMOS-MK defines one device descriptor per channel. This descriptor contains the functional index, the implementation index, the channel index, and the physical base address of the segment containing the addressable registers for this peripheral channel.

Each device descriptor contains a waiting queue of pending commands registered by the various client threads.

For each generic device type, the device specific API defines the list of available commands, and the specific structure defining the command descriptor (containing the command type and arguments). As an IO operation is blocking for the calling thread, a client thread can only post one command at a given time. This command is registered in the client thread descriptor, to be passed to the hardware specific driver.
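As an illustration of the "one descriptor per channel" rule, here is a minimal sketch of a device descriptor limited to the four fields listed above. The field names, the device_init_channels() helper, and the assumption that channel register segments are laid out contiguously with a fixed span are all hypothetical; the actual device_t structure also contains the waiting queue and the driver link.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced device descriptor: hypothetical field names. */
typedef struct device_s
{
    uint32_t func;     /* functional index                              */
    uint32_t impl;     /* implementation index                          */
    uint32_t channel;  /* channel index                                 */
    uint64_t base;     /* physical base address of the channel segment  */
} device_t;

/* Fill one descriptor per channel of a multi-channel peripheral.
 * Assumption: channel k registers start at base + k * span. */
static void device_init_channels(device_t *devs, uint32_t channels,
                                 uint32_t func, uint32_t impl,
                                 uint64_t base, uint64_t span)
{
    assert(devs != 0 && channels > 0);
    for (uint32_t k = 0; k < channels; k++)
    {
        devs[k].func    = func;
        devs[k].impl    = impl;
        devs[k].channel = k;
        devs[k].base    = base + (uint64_t)k * span;
    }
}
```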

The set of supported generic devices, and their associated APIs are defined below:

device   type   usage                           api definition
IOC      ext    block device controller         ioc_api
TXT      ext    text terminal controller        txt_api
NIC      ext    network interface controller    nic_api
PIC      ext    External Interrupt controller   pic_api
ICU      int    Internal Interrupt Controller   icu_api

To signal the completion of an I/O operation, ALMOS-MK defines three types of interrupts:

  • HWI : Hardware Interrupts are physical signals connecting one peripheral IRQ to the distributed XCU hardware component.
  • WTI : Write Triggered Interrupts are mailboxes implemented in the distributed ICU component to support software IPIs (Inter Processor Interrupts), or to route external peripheral IRQs from the PIC component to the client core through a specific ICU.
  • PTI : Programmable Timer Interrupts are implemented in the distributed XCU to support the periodic interrupts used by the preemptive context switch mechanism.

WARNING: The PIC (external) and ICU (internal) devices in the list above have a special role: they do NOT perform I/O operations, but are used as configurable interrupt routers to dynamically link a peripheral channel interrupt to a given core. Therefore, the functions defined by the ICU and PIC APIs are service functions, called by the functions of the other devices. These ICU and PIC functions do not use the waiting queue implemented in the generic device descriptor, but directly call the ICU or PIC drivers.

C) Devices Descriptors Placement

Internal peripherals are replicated in all clusters. In each cluster, the device descriptor is stored in the same cluster as the hardware device itself. These device descriptors are shared resources: they are mostly accessed by the local kernel instance, but can also be accessed by threads running in another cluster. This is the case for both the ICU and the MMC devices.

External peripherals are shared resources, located in the I/O cluster. To minimize contention, the corresponding device descriptors are distributed over all clusters, as uniformly as possible. Therefore, an I/O operation generally involves three clusters: the client cluster, the I/O cluster containing the external peripheral, and the server cluster containing the device descriptor.

The devices_directory_t structure contains extended pointers to all generic device descriptors defined in the manycore architecture. This structure is organized as a set of arrays:

  • There is one entry per channel for each external peripheral, and the corresponding array is indexed by the channel index.
  • There is one entry per cluster for each internal peripheral, and the corresponding array is indexed by the cluster index (it is not indexed by the cluster identifier cxy, because cxy is not a contiguous index).

This device directory, implemented as a global variable, is replicated in all clusters, and is initialized in the kernel initialization phase.
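The array organization described above can be sketched as follows. This is only an illustration under stated assumptions: the array names and sizes are hypothetical, xptr_t is modeled as a plain 64-bit integer, and the cxy encoding (x in the upper bits, y in the lower y_width bits) as well as the cluster_index() helper are assumptions used to show why a non-contiguous identifier needs to be converted before indexing an array.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t xptr_t;          /* extended pointer, modeled as 64 bits */

#define DIR_MAX_CLUSTERS      16  /* hypothetical sizes */
#define DIR_MAX_TXT_CHANNELS   4

/* Reduced directory: one array per generic device. */
typedef struct devices_directory_s
{
    xptr_t txt[DIR_MAX_TXT_CHANNELS];  /* external: indexed by channel       */
    xptr_t icu[DIR_MAX_CLUSTERS];      /* internal: indexed by cluster index */
} devices_directory_t;

/* Convert a cluster identifier into a contiguous cluster index.
 * Assumption: cxy = (x << y_width) | y, on a mesh with y_size rows. */
static inline uint32_t cluster_index(uint32_t cxy, uint32_t y_width,
                                     uint32_t y_size)
{
    uint32_t x = cxy >> y_width;
    uint32_t y = cxy & ((1u << y_width) - 1u);
    assert(y < y_size);
    return x * y_size + y;
}
```

With y_width = 4 and y_size = 2, the identifiers 0x00, 0x01, 0x10, 0x11 are not contiguous, but they map to the indexes 0, 1, 2, 3.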

D) Waiting queue Management

The commands waiting queue is implemented as a distributed XLIST, rooted in the device descriptor. To launch an I/O operation, a client thread, running in any cluster, calls a function of the device API. This function builds the command descriptor embedded in the thread descriptor, and registers the thread in the waiting queue.

For all I/O operations, ALMOS-MK implements a blocking policy: the thread calling a command function is blocked on the THREAD_BLOCKED_IO condition, and descheduled. It will be re-activated by the driver ISR (Interrupt Service Routine) signaling the completion of the I/O operation.

The waiting queue is handled as a Multi-Writers / Single-Reader FIFO, protected by a remote_lock. The N writers are the client threads, whose number is not bounded. The single reader is a server thread associated with the device descriptor, and created at kernel initialization. This thread is in charge of consuming the pending commands from the waiting queue. When the queue is empty, the server thread blocks on the THREAD_BLOCKED_QUEUE condition, and is descheduled. It is activated by the client thread when a new command is registered in the queue.
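The Multi-Writers / Single-Reader discipline can be modeled in a few lines of C. This is a simplified single-address-space sketch, not the ALMOS-MK implementation: the XLIST and the remote_lock are replaced by a local linked list and a C11 atomic_flag spinlock, blocking and descheduling are reduced to flags, and all names (wait_queue_t, wq_register, wq_consume) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Reduced thread descriptor with an embedded command. */
typedef struct thread_s
{
    struct thread_s *next;        /* link in the waiting queue               */
    int              cmd_type;    /* command descriptor, reduced to one field */
    int              blocked_io;  /* models THREAD_BLOCKED_IO                */
} thread_t;

typedef struct wait_queue_s
{
    atomic_flag lock;             /* stands in for the remote_lock           */
    thread_t   *head;
    thread_t   *tail;
    int         server_blocked;   /* models THREAD_BLOCKED_QUEUE             */
} wait_queue_t;

static void wq_init(wait_queue_t *q)
{
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
    q->server_blocked = 1;        /* nothing to consume yet */
}

static void wq_lock(wait_queue_t *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) { }
}

static void wq_unlock(wait_queue_t *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Writer side: a client registers itself at the tail and wakes the server. */
static void wq_register(wait_queue_t *q, thread_t *client)
{
    assert(client != NULL);
    client->next = NULL;
    client->blocked_io = 1;
    wq_lock(q);
    if (q->tail) q->tail->next = client; else q->head = client;
    q->tail = client;
    q->server_blocked = 0;
    wq_unlock(q);
}

/* Reader side: the server takes the oldest client, or blocks when empty. */
static thread_t *wq_consume(wait_queue_t *q)
{
    wq_lock(q);
    thread_t *t = q->head;
    if (t) { q->head = t->next; if (q->head == NULL) q->tail = NULL; }
    else   { q->server_blocked = 1; }
    wq_unlock(q);
    return t;
}
```

Commands are consumed in registration order, and the server is marked blocked only when it finds the queue empty.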

Finally, each generic device descriptor contains a link to the specific driver associated to the available hardware implementation. This link is established in the kernel initialization phase.

E) Drivers API

To start an I/O operation, the server thread associated to the device must call the specific driver corresponding to the hardware peripheral available in the manycore architecture.

To signal the completion of a given I/O operation, the peripheral raises an IRQ to execute a specific ISR (Interrupt Service Routine) in the client cluster, on the core running the client thread. This requires dynamically routing the IRQ to this core.

Any driver must therefore implement the three following functions:


This function initialises both the peripheral hardware registers, and the specific global variables defined by a given hardware implementation. It is called in the kernel initialization phase.

driver_cmd( xptr_t thread , device_t * device )

This function is called by the server thread. It accesses the peripheral hardware registers to start the I/O operation. Depending on the hardware peripheral implementation, it can be blocking or non-blocking for the server thread.

  • It is blocking on the THREAD_BLOCKED_DEV_ISR condition, if the hardware peripheral supports only one simultaneous I/O operation. Examples are a simple disk controller, or a text terminal controller. The blocked server thread must be re-activated by the ISR signaling completion of the current I/O operation.
  • It is non-blocking if the hardware peripheral supports several simultaneous I/O operations. An example is an AHCI-compliant disk controller. It blocks only if the number of simultaneous I/O operations becomes larger than the maximum number of concurrent operations supported by the hardware.

The thread argument is the extended pointer to the client thread, containing the embedded command descriptor. The device argument is the local pointer to the device descriptor.

driver_isr( xptr_t device )

This function is executed in the client cluster, on the core running the client thread. It accesses the peripheral hardware registers to get the I/O operation error status, acknowledge the IRQ, and unblock the client thread. If the server thread has been blocked, it also unblocks the server thread. The device argument is the extended pointer on the device descriptor.
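One possible way to link a device descriptor to its driver is a table of function pointers, sketched below. The cmd and isr prototypes follow the signatures given above; everything else is an assumption made for illustration: the driver_t type, the init prototype, the device_attach() helper, the mock driver, and the modeling of xptr_t as a 64-bit integer holding a local address.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t xptr_t;              /* extended pointer, modeled as 64 bits */
typedef struct device_s device_t;

/* Hypothetical driver table: one entry per function a driver must provide. */
typedef struct driver_s
{
    void (*init)(device_t *device);              /* assumed prototype */
    void (*cmd)(xptr_t thread, device_t *device);
    void (*isr)(xptr_t device);
} driver_t;

struct device_s
{
    uint32_t        func, impl, channel;
    const driver_t *driver;           /* link set at kernel initialization   */
    int             busy;             /* model state: one operation in flight */
};

/* Mock driver for a peripheral supporting a single I/O operation at a time. */
static void mock_init(device_t *device) { device->busy = 0; }

static void mock_cmd(xptr_t thread, device_t *device)
{
    (void)thread;                     /* the command descriptor is not modeled */
    device->busy = 1;                 /* operation started                     */
}

static void mock_isr(xptr_t device)
{
    /* Simplification: the extended pointer holds a local address. */
    device_t *d = (device_t *)(uintptr_t)device;
    d->busy = 0;                      /* operation completed */
}

static const driver_t mock_driver = { mock_init, mock_cmd, mock_isr };

/* Establish the descriptor-to-driver link and initialize the peripheral. */
static void device_attach(device_t *device, const driver_t *driver)
{
    assert(driver->init && driver->cmd && driver->isr);
    device->driver = driver;
    driver->init(device);
}
```

With such a table, the server thread would call device->driver->cmd() without knowing which hardware implementation is present.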

F) I/O operation

The I/O operation mechanism generally involves three clusters: client cluster / server cluster / IO cluster. It does not use any RPC:

  • To post a new command in the waiting queue of a given (remote) device descriptor, the client thread uses only a few remote accesses to be registered in the distributed XLIST rooted in the server cluster.
  • To launch the I/O operation on the (remote) peripheral, the server thread uses only remote accesses to the physical registers located in the I/O cluster.
  • To complete the I/O operation, the ISR running on the client cluster accesses peripheral registers in the I/O cluster, reports the I/O operation status in the command descriptor, and unblocks the client and server threads, using only local or remote accesses.

G) Interrupts Routing

The completion of an I/O operation is signaled by the involved hardware device using an interrupt. This interrupt must be handled by the core running the server thread that launched the I/O operation. Therefore, the interrupt must be routed to the cluster containing the device descriptor involved in the I/O operation. ALMOS-MKH makes the assumption that interrupt routing (from peripherals to cores) is done by a dedicated hardware device, called PIC (Programmable Interrupt Controller). The main service provided by the PIC device is to allow the OS to route any IRQ (generated by a given peripheral channel) to any core.

This generic PIC device is supposed to be implemented by a distributed hardware infrastructure containing two types of hardware components:

  • The IOPIC component (one single component in the I/O cluster) interfaces the external peripheral IRQs (one IRQ per channel) to the PIC infrastructure.
  • The LAPIC components (one component per cluster) interface the PIC infrastructure to the local cores in a given cluster.

Each external IRQ (IRQ generated by a given external device channel) is identified by an irq_id index, used as an identifier by the kernel. For a given hardware architecture, this irq_id index is defined by the arch_info file describing the architecture, and is registered by the kernel in the iopic_input structure, which is a global variable replicated in all clusters.

The actual interrupt routing is defined during the PIC device initialization, by the architecture specific PIC driver, as defined below.

TSAR_MIPS32 architecture

The IOPIC external controller provides two services:

  1. It translates each IRQ identified by its irq_id into a write transaction to a specific mailbox contained in a local LAPIC controller, for a given core in a given cluster, as explained below.
  2. It allows the kernel to selectively enable/disable any external IRQ identified by its irq_id index.
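The two IOPIC services can be modeled as a small per-IRQ table. This is a behavioral sketch only, not a description of the hardware registers: the iopic_t structure, the table size, the function names, and the mailbox address used in the example are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOPIC_MAX_IRQS 32   /* hypothetical capacity */

/* Behavioral model: one routing entry per external IRQ. */
typedef struct iopic_s
{
    uint64_t wti_addr[IOPIC_MAX_IRQS];  /* target WTI mailbox address */
    bool     enabled[IOPIC_MAX_IRQS];   /* per-IRQ enable bit         */
} iopic_t;

/* Service 1 (setup): select the mailbox that an IRQ will write to. */
static void iopic_bind(iopic_t *pic, uint32_t irq_id, uint64_t wti_addr)
{
    assert(irq_id < IOPIC_MAX_IRQS);
    pic->wti_addr[irq_id] = wti_addr;
}

/* Service 2: selectively enable or disable one external IRQ. */
static void iopic_set_enabled(iopic_t *pic, uint32_t irq_id, bool on)
{
    assert(irq_id < IOPIC_MAX_IRQS);
    pic->enabled[irq_id] = on;
}

/* Model of an active IRQ: when enabled, report the mailbox address
 * that the write transaction would target; when disabled, do nothing. */
static bool iopic_translate(const iopic_t *pic, uint32_t irq_id,
                            uint64_t *target)
{
    assert(irq_id < IOPIC_MAX_IRQS);
    if (!pic->enabled[irq_id]) return false;
    *target = pic->wti_addr[irq_id];
    return true;
}
```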

The LAPIC controller (called XCU) is replicated in all clusters containing at least one core. It handles three types of IRQs. The HWI (Hardware Interrupts) are generated by local internal peripherals and connected to the local XCU, to be routed to a given local core. The WTI (Write Triggered Interrupts) are actually mailboxes implemented in the local XCU. They are used both to implement software IPIs (Inter-Processor Interrupts) and to register the external IRQs (write transactions) generated by the IOPIC controller. Finally, the PTI (Programmable Timer Interrupts) are actually timers contained in the LAPIC, programmed by the kernel, and routed to a local core to implement context switches (TICK event). The number of interrupts of each type in a given cluster is defined in the XCU_CONFIG register of the XCU component, and cannot be larger than the SOCLIB_MAX_HWI, SOCLIB_MAX_WTI, and SOCLIB_MAX_PTI constants defined in the soclib_pic.h file.

The actual IRQ routing policy implemented by the SOCLIB_PIC driver depends on the IRQ type. For the external IRQs, the routing is done by the soclib_pic_bind_irq() function.

  • Local Hardware Interrupt : There are only two local peripherals. The MMC device is the L2 cache configuration
  • PTI : There is one PTI per local core, and the PTI index is equal to the core local index.
  • IPI : There is one IPI per local core. Each IPI is implemented as a WTI mailbox.
  • External Hardware Interrupt : each external IRQ is translated to a WTI event,

The LAPIC controller provides three main services:

  1. It allows the kernel to selectively enable/disable any IRQ (identified by its type and index) for a given core. It is the kernel's responsibility to enable a given IRQ for a single core, as a given IRQ event should be handled by only one core.
  2. It makes a global OR between all enabled IRQs for a given core, to interrupt the core when at least one enabled IRQ is active.
  3. It can return the highest priority active IRQ of each type. For each type, the lowest index has the highest priority.
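The three LAPIC services can be summarized by a short behavioral model. It assumes at most 32 IRQs per type, stored as one bit per IRQ, and uses hypothetical names (lapic_core_t, lapic_core_interrupted, lapic_highest_priority); it does not describe the actual XCU register layout.

```c
#include <assert.h>
#include <stdint.h>

enum { IRQ_HWI = 0, IRQ_WTI = 1, IRQ_PTI = 2, IRQ_TYPES = 3 };

/* Per-core view of the LAPIC: one bit per IRQ, for each type. */
typedef struct lapic_core_s
{
    uint32_t active[IRQ_TYPES];  /* IRQs currently active                     */
    uint32_t mask[IRQ_TYPES];    /* IRQs enabled for this core (service 1)    */
} lapic_core_t;

/* Service 2: global OR of all enabled and active IRQs for this core. */
static int lapic_core_interrupted(const lapic_core_t *c)
{
    uint32_t any = 0;
    for (int t = 0; t < IRQ_TYPES; t++) any |= c->active[t] & c->mask[t];
    return any != 0;
}

/* Service 3: index of the highest priority enabled and active IRQ of one
 * type (lowest index wins), or -1 when none is pending. */
static int lapic_highest_priority(const lapic_core_t *c, int type)
{
    assert(type >= 0 && type < IRQ_TYPES);
    uint32_t pending = c->active[type] & c->mask[type];
    for (int i = 0; i < 32; i++)
        if (pending & (1u << i)) return i;
    return -1;
}
```

For example, if HWI 2 and HWI 3 are both active and enabled for a core, the model returns index 2 for the HWI type.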

X86_64 architecture