
Virtual segments replication & distribution policy

The segment replication / distribution policy has two goals: enforce locality as much as possible, and, above all, avoid contention, which is the main goal. To actually control data placement on the physical memory banks, the kernel uses the paged virtual memory MMU to map a virtual segment to a given physical memory bank in a given cluster.

A vseg is a contiguous memory zone in the process virtual space, defined by the two (base, size) values. All addresses in this interval can be accessed by this process without segmentation violation: if the corresponding page is not mapped, the page fault is handled by the kernel, and a physical page is dynamically allocated (and initialized if required). A vseg always occupies an integer number of pages, as a given page cannot be shared by two different vsegs.
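For illustration, a vseg descriptor can be sketched as the following C structure (field names and types are simplified assumptions, the actual definition is in the vseg.h file):

{{{#!c
/* Minimal sketch of a vseg descriptor (field names are simplified
 * assumptions, the actual definition is in the vseg.h file). */
#include <stdint.h>

typedef struct vseg_s
{
    uint32_t   type;    /* vseg type, defining the replication / distribution policy */
    intptr_t   min;     /* base : first virtual address (page aligned)               */
    intptr_t   max;     /* min + size : first virtual address after the vseg         */
    uint32_t   flags;   /* access rights (readable / writable / executable)          */
    uint32_t   cxy;     /* identifier of the mapping cluster, for localized vsegs    */
}
vseg_t;
}}}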

Depending on its type, a vseg has specific attributes defining the access rights and the replication and/or distribution policy:

  • A vseg is public when it can be accessed by any thread T of the process, whatever the cluster running this thread T. It is private when it can only be accessed by the threads running in the cluster that contains the physical memory bank where this vseg is defined and mapped.
  • For a public vseg, ALMOS-MKH implements a global mapping : In all clusters, a given virtual address is mapped to the same physical address. For a private vseg, ALMOS-MKH implements a local mapping : the same virtual address can be mapped to different physical addresses, in different clusters.
  • A public vseg can be localized (all vseg pages are mapped in the same cluster), or distributed (different pages are mapped on different clusters). A private vseg is always localized.

To avoid contention when a parallel application creates a large number of threads in a single process P, almos-mkh replicates the process descriptor in all clusters containing at least one thread of P; these clusters are called active clusters. The virtual memory manager VMM(P,K) of process P in cluster K contains two main structures:

  • The VSL(P,K) is the list of all vsegs registered for process P in cluster K,
  • The GPT(P,K) is the generic page table, defining the actual physical mapping of these vsegs.
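The sketch below illustrates how these two structures can be grouped in a per-cluster VMM descriptor (field and type names are simplified assumptions, the actual definition is in the vmm.h file):

{{{#!c
/* Illustrative sketch of the per-cluster VMM(P,K) structure
 * (field and type names are assumptions, see vmm.h for the real one). */
#include <stdint.h>

typedef struct gpt_s  gpt_t;             /* generic page table (opaque here)        */

typedef struct list_entry_s
{
    struct list_entry_s * next;
    struct list_entry_s * pred;
}
list_entry_t;                            /* minimal double linked list root         */

typedef struct vmm_s
{
    list_entry_t   vsegs_root;           /* VSL(P,K) : root of the local vseg list  */
    uint32_t       vsegs_nr;             /* number of vsegs registered in VSL(P,K)  */
    gpt_t        * gpt;                  /* GPT(P,K) : local generic page table     */
    /* ... STACK and MMAP allocators, locks, etc. ...                               */
}
vmm_t;
}}}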

For a given process P, all VMM(P,K) descriptors in different clusters can have different contents for several reasons :

  1. A private vseg can be registered in only one VSL(P,K), in cluster K, and be totally undefined in the other VSL(P,K').
  2. A public vseg can be replicated in several VSL(P,K), but the registration of a vseg in a given VSL(P,K) is done on demand: the vseg is only registered in VSL(P,K) when a thread of process P running in cluster K tries to access this vseg.
  3. Similarly, the mapping of a given virtual page VPN of a given vseg (i.e. the allocation of a physical page PPN to a virtual page VPN, and the registration of this PPN in the GPT(P,K)) is done on demand: the page table entry is updated in the GPT(P,K) only when a thread of process P in cluster K tries to access this VPN.

The replication of the VSL(P,K) and GPT(P,K) kernel structures creates a coherence problem for the public vsegs:

  • A VSL(P,K) contains all private vsegs in cluster K, but contains only the public vsegs that have actually been accessed by a thread of P running in cluster K. Only the reference process descriptor, stored in the reference cluster KREF, contains the complete list VSL(P,KREF) of all public vsegs for process P.
  • A GPT(P,K) contains all mapped entries corresponding to private vsegs but for public vsegs, it contains only the entries corresponding to pages that have been accessed by a thread running in cluster K. Only the reference cluster KREF contains the complete GPT(P,KREF) of all mapped entries of public vsegs for process P.

Therefore, almos-mkh defines the following rules :

For the public vsegs, the VMM(P,K) structures - other than the reference one - can be considered as read-only caches. When a given vseg or a given page table entry must be removed by the kernel, this modification is done first in the reference cluster, and then broadcast to all other clusters for update. When a miss is detected in a non-reference cluster, the reference VMM(P,KREF) must be accessed first, to check for a possible false segmentation fault or false page fault.
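The following sketch illustrates this rule for a page fault on a public vseg; all names below are illustrative assumptions, not the actual almos-mkh functions:

{{{#!c
/* Illustrative handling of a page fault on a public vseg in cluster K :
 * the reference GPT(P,KREF) is consulted first, to filter out "false"
 * page faults (page already mapped in the reference cluster).
 * All names below are assumptions, not the actual almos-mkh API. */
#include <stdint.h>

typedef uint32_t vpn_t;      /* virtual page number  */
typedef uint32_t ppn_t;      /* physical page number */
typedef struct gpt_s gpt_t;  /* generic page table (opaque here) */

/* assumed helpers : return 0 on success / hit */
int   gpt_get_mapping( gpt_t * gpt , vpn_t vpn , ppn_t * ppn );
int   gpt_set_mapping( gpt_t * gpt , vpn_t vpn , ppn_t   ppn );
ppn_t ppm_alloc_page ( void );

int handle_public_page_fault( gpt_t * gpt_local ,   /* GPT(P,K)    */
                              gpt_t * gpt_ref   ,   /* GPT(P,KREF) */
                              vpn_t   vpn )
{
    ppn_t ppn;

    /* 1. false page fault : the page is already mapped in cluster KREF,
     *    simply copy the missing entry in the local GPT(P,K)            */
    if( gpt_get_mapping( gpt_ref , vpn , &ppn ) == 0 )
    {
        return gpt_set_mapping( gpt_local , vpn , ppn );
    }

    /* 2. true page fault : allocate a physical page, register it first
     *    in the reference GPT(P,KREF), then copy the entry in GPT(P,K)  */
    ppn = ppm_alloc_page();
    gpt_set_mapping( gpt_ref   , vpn , ppn );
    gpt_set_mapping( gpt_local , vpn , ppn );
    return 0;
}
}}}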

For the private vsegs, and the corresponding entries in the page table, the VSL(P,K) and the GPT(P,K) are only shared by the threads of P running in cluster K, and these structures can be privately handled by the local kernel instance in cluster K.

For more details on implementation:

The vseg API is defined in the almos-mkh/kernel/mm/vseg.h and almos-mkh/kernel/mm/vseg.c files.

The Virtual Memory Manager API is defined in the almos-mkh/kernel/mm/vmm.h and almos-mkh/kernel/mm/vmm.c files.

1. User segment types

This section describes the six types of user virtual segments defined by almos-mkh:

||= Type =||= Access =||= Replication =||= Placement =||= Allocation policy in virtual space =||
|| STACK (private, localized) || Read Write || one physical mapping per thread || same cluster as the thread using it || dynamic (one stack allocator per cluster) ||
|| CODE (private, localized) || Read Only || one physical mapping per cluster || same cluster as the thread using it || static (defined in .elf file) ||
|| DATA (public, distributed) || Read Write || same mapping for all threads || distributed on all clusters || static (defined in .elf file) ||
|| ANON (public, localized) || Read Write || same mapping for all threads || same cluster as the calling thread || dynamic (one heap allocator per process) ||
|| FILE (public, localized) || Read Write || same mapping for all threads || same cluster as the file cache || dynamic (one heap allocator per process) ||
|| REMOTE (public, localized) || Read Write || same mapping for all threads || cluster defined by the user || dynamic (one heap allocator per process) ||
  1. CODE : This private vseg contains the application code. It is replicated in all active clusters: almos-mkh creates one CODE vseg per active cluster. For a process P, the CODE vseg is registered in the VSL(P,KREF) when the process is created in the reference cluster KREF. In the other clusters K, the CODE vseg is registered in VSL(P,K) when a page fault is signaled by a thread of P running in cluster K. In each active cluster K, the CODE vseg is localized, and physically mapped in cluster K.
  2. DATA : This public vseg contains the user application global data. almos-mkh creates one single DATA vseg, that is registered in the reference VSL(P,KREF) when the process P is created in the reference cluster KREF. In the other clusters K, the DATA vseg is registered in VSL(P,K) when a page fault is signaled by a thread of P running in cluster K. To avoid contention, this vseg is physically distributed on all clusters, with a page granularity. For each page, the physical mapping is defined by the LSB bits of the page VPN (see the sketch after this list).
  3. STACK : This private vseg contains the execution stack of a thread. Almos-mkh creates one STACK vseg for each thread of P running in cluster K. This vseg is registered in the VSL(P,K) when the thread descriptor is created in cluster K. To enforce locality, this vseg is of course mapped in cluster K.
  4. ANON : This public vseg is dynamically created by almos-mkh to serve an anonymous mmap system call executed by a client thread running in cluster K. The vseg is registered first in the reference cluster KREF, but it is physically mapped in the client cluster K.
  5. FILE : This public vseg is dynamically created by almos-mkh to serve a file based mmap system call executed by a client thread running in cluster K. The vseg is registered first in the reference cluster KREF, but it is physically mapped in the cluster containing the file cache.
  6. REMOTE : This public vseg is dynamically created by almos-mkh to serve a remote mmap system call, where a client thread running in cluster X requests the creation of a new vseg mapped in another cluster Y. The vseg is registered first in the reference cluster KREF, but it is physically mapped in the cluster Y specified by the user.
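The sketch below illustrates the page placement rule used for the distributed DATA vseg, where the target cluster of each page is derived from the LSB bits of its VPN (names and mesh size are illustrative assumptions):

{{{#!c
/* Illustrative selection of the mapping cluster for one page of a
 * distributed (DATA) vseg : the cluster index is simply taken from
 * the LSB bits of the VPN (names and mesh size are assumptions). */
#define X_SIZE  4                       /* number of clusters per row    */
#define Y_SIZE  4                       /* number of clusters per column */

static inline unsigned int data_page_cluster( unsigned int vpn )
{
    return vpn % (X_SIZE * Y_SIZE);     /* cluster index in [0 .. 15]    */
}
}}}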

2. Kernel segment types

For any process P in a cluster K, the VSL(P,K) and the GPT(P,K) contain not only the user vsegs, but also the kernel vsegs, because all user threads can make system calls that must access these kernel vsegs, and this requires address translation. This section describes the types of kernel virtual segments defined by almos-mkh:

||= Type =||= Access =||= Replication =||= Placement =||= Allocation policy in virtual space =||
|| KCODE (private, localized) || Read Only || one physical mapping per cluster || same cluster as the thread using it || static (defined in .elf file) ||
|| KDATA (public, localized) || Read Write || same mapping for all threads || distributed on all clusters || static (defined in .elf file) ||
|| KHEAP (public) || || || || ||
|| KDEV (public, localized) || Read Write || one physical mapping per thread || same cluster as the thread using it || dynamic (one stack allocator per cluster) ||
  1. KCODE : This private vseg contains the kernel code. almos-mkh creates one KCODE vseg per cluster. For a process P, the KCODE vseg is registered in the VSL(P,KREF) when the process is created in the reference cluster KREF. In the other clusters K, the KCODE vseg is registered in VSL(P,K) when a page fault is signaled by a thread of P running in cluster K. In each active cluster K, the KCODE vseg is localized, and physically mapped in cluster K.
  2. KDATA : This vseg contains the kernel global data, statically allocated at compilation time. almos-mkh creates one KDATA vseg per cluster. The initial values are identical in all clusters, but these vsegs are not read-only, and their contents can evolve differently in different clusters, as explained below.

To enforce locality, there is one KDATA segment per cluster, containing a copy of all global variables statically allocated at compilation time. But these vsegs are not read-only, and can evolve differently in different clusters. On the other hand, all structures dynamically allocated by the kernel (to create a new process descriptor, a new thread descriptor, a new file descriptor, etc.) are allocated in the KHEAP segment of the target cluster, and are mainly handled by the kernel instance running in this same cluster. Therefore, most kernel memory accesses are expected to be local.

In the - rare - situations where the kernel running in cluster K must access data in a remote cluster K' (to access a globally distributed structure such as the DQDT, or for inter-cluster client/server communication), almos-mkh uses specific remote access primitives defined in the hal_remote.h file.
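For illustration, such a remote access typically combines a cluster identifier and a local pointer into an extended pointer. The sketch below shows the general idea; the exact names and signatures defined in hal_remote.h may differ, and should be treated as assumptions:

{{{#!c
/* Illustrative remote read of a counter located in cluster K'
 * (names and signatures are assumptions, see hal_remote.h for the real API). */
#include <stdint.h>

typedef uint64_t xptr_t;                /* extended pointer : cluster + local pointer */

#define XPTR( cxy , ptr )  ( ((xptr_t)(cxy) << 32) | (uint32_t)(intptr_t)(ptr) )

uint32_t hal_remote_l32( xptr_t xp );                  /* remote 32 bits load  (HAL) */
void     hal_remote_s32( xptr_t xp , uint32_t data );  /* remote 32 bits store (HAL) */

uint32_t read_remote_counter( uint32_t   remote_cxy ,  /* target cluster K'           */
                              uint32_t * local_ptr )   /* address valid in cluster K' */
{
    xptr_t xp = XPTR( remote_cxy , local_ptr );
    return hal_remote_l32( xp );        /* performed without mapping K' memory locally */
}
}}}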

2.1 TSAR-MIPS32

In the TSAR architecture, and for any process P in any cluster K, almos-mkh registers only one extra KCODE vseg in the VMM(P,K), because almos-mkh does not use the DATA-MMU during kernel execution: each time a core enters the kernel, to handle a syscall, an interrupt, or an exception, the DATA-MMU is deactivated, and it is reactivated when the core returns to user code.

The architecture-dependent remote access functions use the TSAR-specific extension register to build a 40-bit physical address from the 32-bit virtual address.
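For illustration, and assuming the TSAR convention where the 8 most significant bits of the physical address identify the cluster, the 40-bit physical address can be built as follows (a sketch, not the actual HAL code):

{{{#!c
/* Illustrative construction of a TSAR 40-bit physical address from a
 * 32-bit local address and a cluster identifier written in the data
 * extension register (a sketch, not the actual almos-mkh HAL code). */
#include <stdint.h>

uint64_t tsar_paddr( uint32_t cxy ,       /* cluster identifier (8 bits)           */
                     uint32_t local )     /* 32-bit address inside the cluster     */
{
    return ((uint64_t)(cxy & 0xFF) << 32) | (uint64_t)local;
}
}}}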

This pseudo identity mapping imposes some constraints on the KCODE vseg.

2.2 Intel 64 bits

TODO

3. Virtual space organisation

3.1 TSAR-MIPS32

The virtual address space of a user process P is split into five fixed-size zones, defined by configuration parameters in https://www-soc.lip6.fr/trac/almos-mkh/browser/trunk/kernel/kernel_config.h. Each zone contains one or several vsegs, as described below.

3.1.1 The kernel zone

It contains the kcode vseg (type KCODE), which must be mapped in all user processes. It is located in the lower part of the virtual space, and starts at address 0. Its size cannot be less than a big page size (2 Mbytes for the TSAR architecture), because it is mapped as one (or several) big pages.

3.1.2 The utils zone

It contains the two args and envs vsegs, whose sizes are defined by specific configuration parameters. The args vseg (DATA type) contains the process main() arguments. The envs vseg (DATA type) contains the process environment variables. This zone is located on top of the kernel zone, and starts at the address defined by the CONFIG_VMM_ELF_BASE parameter.

3.1.3 The elf zone

It contains the text (CODE type) and data (DATA type) vsegs, defining the process binary code and global data. The actual vseg base addresses and sizes are defined in the .elf file and reported in the boot_info_t structure by the boot loader.

3.1.4 The heap zone

It contains all vsegs dynamically allocated / released by the mmap / munmap system calls (i.e. FILE / ANON / REMOTE types). It is located on top of the elf zone, and starts at the address defined by the CONFIG_VMM_HEAP_BASE parameter. The VMM defines a specific MMAP allocator for this zone, implementing the buddy algorithm. The mmap( FILE ) syscall directly maps a file in user space. The user-level malloc library uses the mmap( ANON ) syscall to allocate virtual memory from the heap and map it in the same cluster as the calling thread. Besides the standard malloc() function, this library implements a non-standard remote_malloc() function, which uses the mmap( REMOTE ) syscall to dynamically allocate virtual memory from the heap, and map it to a remote physical cluster.
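A minimal user-level usage sketch is given below; the remote_malloc() prototype shown here is an assumption and must be checked against the actual malloc library header:

{{{#!c
/* Illustrative user-level usage of malloc() and remote_malloc()
 * (the remote_malloc() prototype below is an assumption, check the
 *  almos-mkh user-level malloc library for the actual signature). */
#include <stdlib.h>

void * remote_malloc( size_t size , unsigned int cxy );

int main( void )
{
    /* buffer mapped in the same cluster as the calling thread (mmap(ANON)) */
    int * local_buf  = malloc( 1024 * sizeof(int) );

    /* buffer mapped in the physical memory of cluster 3 (mmap(REMOTE)) */
    int * remote_buf = remote_malloc( 1024 * sizeof(int) , 3 );

    free( local_buf );
    return 0;
}
}}}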

3.1.5 The stack zone

It is located on top of the mmap zone and starts at the address defined by the CONFIG_VMM_STACK_BASE parameter. It contains an array of fixed-size slots, and each slot contains one stack vseg. The size of a slot is defined by the CONFIG_VMM_STACK_SIZE parameter. In each slot, the first page is not mapped, in order to detect stack overflows. As threads are dynamically created and destroyed, the VMM implements a specific STACK allocator for this zone, using a bitmap vector. As the stack vsegs are private (the same virtual address can have different mappings, depending on the cluster), the number of slots in the stack zone actually defines the maximum number of threads of a given process in a given cluster.
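For illustration, the base address of the stack slot of index i can be computed as sketched below (the CONFIG_* values are placeholders, and the exact formula is an assumption):

{{{#!c
/* Illustrative computation of the virtual base address of a stack slot
 * (the two CONFIG_* values below are placeholders, the real ones are in
 *  kernel_config.h ; the formula itself is an assumption). */
#define CONFIG_VMM_STACK_BASE  0xC0000000   /* placeholder value                    */
#define CONFIG_VMM_STACK_SIZE  0x00100000   /* placeholder value : 1 Mbyte per slot */
#define PAGE_SIZE              0x1000       /* 4 Kbytes small page                  */

/* base of the usable stack area for slot <index> :
 * the first page of each slot is left unmapped to catch stack overflows */
static inline unsigned int stack_slot_base( unsigned int index )
{
    return CONFIG_VMM_STACK_BASE + index * CONFIG_VMM_STACK_SIZE + PAGE_SIZE;
}
}}}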

3.2 Intel 64 bits

TODO