Version 17 (modified by alain, 5 years ago)

Page tables and vseg lists implementation

1) vseg descriptors

A vseg descriptor contains the following fields:

  • TYPE : Defines the replication/distribution policy (CODE / STACK / DATA / HEAP / HEAPXY / FILE / ANON)
  • FLAGS : Defines access rights
  • VBASE : Base virtual address
  • LENGTH : Segment length
  • BIN : Pathname to the .elf file (only for DATA and CODE types).
  • X,Y : Coordinates of the cluster where the vseg is mapped (only for a localized vseg).
  • MAPPER : radix-tree containing the physical pages allocated to this vseg (only for CODE, DATA and FILE types).
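
As an illustration, the fields listed above could be grouped in a C structure along the following lines. This is only a sketch: the names vseg_t, vseg_type_t, the VSEG_* constants, the opaque struct mapper and the two helper functions are hypothetical, and the field widths are assumptions rather than the kernel's actual layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Vseg types, as listed for the TYPE field. */
typedef enum {
    VSEG_CODE, VSEG_STACK, VSEG_DATA, VSEG_HEAP,
    VSEG_HEAPXY, VSEG_FILE, VSEG_ANON
} vseg_type_t;

/* Hypothetical access-right bits for the FLAGS field. */
enum { VSEG_READ = 1u << 0, VSEG_WRITE = 1u << 1, VSEG_EXEC = 1u << 2 };

struct mapper;  /* opaque radix tree of physical pages (not modeled here) */

typedef struct {
    vseg_type_t    type;    /* TYPE                                        */
    uint32_t       flags;   /* FLAGS                                       */
    uintptr_t      vbase;   /* VBASE                                       */
    size_t         length;  /* LENGTH                                      */
    const char    *bin;     /* BIN: .elf pathname, DATA and CODE only      */
    uint32_t       x, y;    /* X,Y: mapping cluster, localized vsegs only  */
    struct mapper *mapper;  /* MAPPER: CODE, DATA and FILE only            */
} vseg_t;

/* True if vaddr falls inside [vbase, vbase + length). */
static bool vseg_contains(const vseg_t *v, uintptr_t vaddr)
{
    return vaddr >= v->vbase && vaddr - v->vbase < v->length;
}

/* Private vsegs (CODE and STACK) have a mapping that differs per cluster. */
static bool vseg_is_private(const vseg_t *v)
{
    return v->type == VSEG_CODE || v->type == VSEG_STACK;
}
```

The range test is written as vaddr - vbase < length so that it cannot overflow when a vseg ends at the top of the address space.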

2) Page tables and vseg lists

All information associated with a process P is found in its process descriptor (process_t structure). This descriptor and the structures it contains are partly replicated in every cluster containing at least one thread of P. These clusters are called the "active" clusters of P.

The VSL(P,K) (Virtual Segment List of process P in cluster K) is replicated in all active clusters. The kernel uses it when a page fault occurs, to check that the unmapped virtual address belongs to a registered vseg and to determine the vseg type.
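
The check performed on a page fault can be sketched as a simple search in the list. The linked-list representation, the reduced vseg_t and the function name vsl_find() are assumptions made for the example; the actual VSL data structure is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { VSEG_CODE, VSEG_STACK, VSEG_DATA, VSEG_HEAP,
               VSEG_HEAPXY, VSEG_FILE, VSEG_ANON } vseg_type_t;

/* Reduced vseg descriptor: only the fields the lookup needs. */
typedef struct vseg {
    vseg_type_t  type;
    uintptr_t    vbase;
    size_t       length;
    struct vseg *next;   /* hypothetical: VSL(P,K) as a singly linked list */
} vseg_t;

/* Return the registered vseg containing vaddr, or NULL when the address
 * belongs to no registered vseg (an illegal access). */
static const vseg_t *vsl_find(const vseg_t *head, uintptr_t vaddr)
{
    for (const vseg_t *v = head; v != NULL; v = v->next) {
        if (vaddr >= v->vbase && vaddr - v->vbase < v->length)
            return v;
    }
    return NULL;
}
```

The page fault handler would then branch on the type field of the returned vseg.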

The GPT(P,K) (Generic Page Table of process P in cluster K) is also replicated in all active clusters. The kernel uses it to store the mapping of each page of each vseg of the process.

2.1) Dynamic behaviour of the GPT(P,K)

For a given process P, the contents of the page tables GPT(P,K) change over time and evolve differently in each active cluster. First, each GPT(P,K) is filled dynamically, depending on the page faults triggered by the threads of P running in cluster K. Second, the mapping of private vsegs (CODE and STACK types) differs from one cluster to another, since the same virtual address is mapped to a different physical address in each cluster. For public vsegs, only the reference cluster holds the complete mapping state.

2.2) Dynamic behaviour of the VSL(P,K)

For a given process P, the contents of the vseg lists VSL(P,K) also change over time and are not the same in all clusters. Public vsegs must be registered identically in every VSL, whereas each private vseg is registered only in the cluster it belongs to. Every dynamic insertion of a new public vseg, or extension of an existing one, must therefore be propagated to all active clusters.

3) Registration and destruction of vsegs in VSL(P,K)

The registration and destruction policy in the VSL(P,K) depends on the vseg type:

3.1) DATA

A DATA vseg is registered in VSL(P,Z) when process P is created, Z being the owner cluster of P. It is then registered in VSL(P,A) each time a thread of P is created in a cluster A that did not yet contain a thread of P. The length is defined in the .elf file containing the binary code of the process. There is no mapping cluster, since this is a distributed vseg. This type of vseg is destroyed only when process P is destroyed.

3.2) CODE

A CODE vseg is registered in VSL(P,Z) when process P is created, Z being the owner cluster of P. It is then registered in VSL(P,A) each time a thread of P is created in a cluster A that did not yet contain a thread of P. The length is defined in the .elf file containing the binary code of the process. The mapping cluster is always the local cluster, since this is a private vseg. This type of vseg is destroyed only when process P is destroyed.

3.3) STACK

A STACK vseg is registered in VSL(P,X) each time a new thread of process P is created in cluster X. The VSL(P,Y) of the other clusters Y do not need to be updated, because a STACK vseg in cluster X is never known to, nor accessed by, another cluster Y. The length is defined by a global OS parameter: MIN_STACK_SIZE. The mapping cluster is always the local cluster, since this is a private vseg. This type of vseg is removed from VSL(P,X) when the thread is destroyed.

3.4) HEAP

This type of vseg is registered in VSL(P,Z) when process P is created, Z being the owner cluster of P. It is then registered in VSL(P,A) each time a thread of P is created in a cluster A that did not yet contain a thread of P. The length is defined by a global OS parameter: STANDARD_MALLOC_HEAP_SIZE. There is no mapping cluster, since this is a distributed vseg. This type of vseg is destroyed only when the process is destroyed.

3.5) REMOTE

A REMOTE vseg is registered in the VSL(P,A) of every cluster A containing at least one thread of P when a thread of P executes remote_malloc(x,y) in a cluster K. If VSL(P,K) does not already contain a REMOTE vseg, the kernel instance in cluster K sends a VSEG_REQUEST_RPC to cluster Z, the owner of P. The arguments are the PID and the type of the missing vseg. The length is defined by a global OS parameter: REMOTE_MALLOC_HEAP_SIZE. The mapping cluster is defined by the (x,y) arguments of remote_malloc(). This type of vseg is destroyed only when the process is destroyed.

3.6) FILE

A FILE vseg is registered in the VSL(P,A) of every cluster A containing at least one thread of P when a thread of P executes mmap(file, size) in a cluster K. The kernel instance running in cluster K sends a VSEG_REQUEST_RPC to cluster Z, the owner of process P. The arguments are the PID, the vseg type, the file descriptor and the size. The kernel instance in cluster Z then broadcasts a VSEG_REGISTER_RPC to all the other active clusters of P. The vseg length is defined by the size argument of mmap(). The mapping cluster is defined by the file argument, and it can be any cluster, since a file cache can be placed in any cluster (uniform dispatching policy). This type of vseg is destroyed on a munmap() call, using the same two-RPC mechanism as for the creation.
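
The two-RPC registration can be modeled in a few lines, with plain function calls standing in for the RPCs. Everything in this sketch is hypothetical (the cluster_t structure, NB_CLUSTERS, the function names); it only illustrates that the owner registers the vseg locally and then reaches each other active cluster exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NB_CLUSTERS 4   /* hypothetical platform size */

/* Per-cluster view of process P, reduced to what the sketch needs. */
typedef struct {
    bool   active;      /* cluster holds at least one thread of P        */
    size_t file_vsegs;  /* number of FILE vsegs registered in VSL(P,K)   */
} cluster_t;

/* Stand-in for the handling of VSEG_REGISTER_RPC in one cluster. */
static void vseg_register(cluster_t *c)
{
    c->file_vsegs++;
}

/* Stand-in for the handling of VSEG_REQUEST_RPC in owner cluster z:
 * register the vseg locally, then broadcast the registration to the
 * other active clusters. Returns the number of register RPCs sent. */
static size_t vseg_request(cluster_t clusters[NB_CLUSTERS], size_t z)
{
    size_t sent = 0;
    assert(z < NB_CLUSTERS);
    vseg_register(&clusters[z]);
    for (size_t k = 0; k < NB_CLUSTERS; k++) {
        if (k != z && clusters[k].active) {
            vseg_register(&clusters[k]);
            sent++;
        }
    }
    return sent;
}
```

Destruction on munmap() would follow the same pattern, with a removal in place of vseg_register().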

3.7) ANON

An ANON vseg is registered in the VSL(P,A) of every cluster A containing at least one thread of P when a thread of P executes mmap(anonymous, size) in a cluster K. The kernel instance of cluster K sends a VSEG_REQUEST_RPC to cluster Z, the owner of P. The arguments are the PID, the vseg type, the file descriptor, the size, ... to be completed... The kernel instance of cluster Z then broadcasts a VSEG_REGISTER_BCRPC to all active clusters of P. The vseg length is defined by the size argument of mmap(). There is no mapping cluster, since this is a distributed vseg. This type of vseg is destroyed on a munmap() call, using the same two-RPC mechanism as for the creation.

4) Insertion of a PTE (Page Table Entry) in the GPT(P,K)

A new entry is added to the GPT(P,K) of a process P in a cluster K as the result of a page fault triggered by any thread of P running in cluster K, following the "on-demand paging" principle. All threads of P in cluster K use exclusively the local PT(P,K), and report page faults to the local kernel instance. The handling of the page fault depends on the vseg type:

4.1) CODE

A CODE vseg is registered in the VSL of every cluster containing at least one thread of process P. If the cluster K that detected the page fault is not the owner cluster Z, the kernel of cluster K allocates a physical page in cluster K. To initialize this page, it sends a PT_MISS_RPC to cluster Z. When it receives the PTE stored in PT(P,Z), it uses remote_memcpy() to copy the contents of the physical page in cluster Z to the physical page in cluster K. It then inserts the missing PTE in PT(P,K). If cluster K is the owner cluster, it allocates a physical page, initializes it by asking the file system for the contents of the missing page in the .elf file cache, then updates PT(P,Z).

QUESTION: in the owner cluster Z, should the page be copied from the file cache to another physical page? [AG]

4.2) STACK

The STACK vsegs associated with the threads placed in a cluster X are mapped in cluster X, and each cluster handles its STACK vsegs independently of the others. The kernel instance in cluster X allocates a physical page and registers it in the local GPT(P,X) without initializing it. If the requested address is in the last possible page of the vseg, the length of the STACK vseg can be increased dynamically in the local VSL(P,X), provided there is enough room in the virtual address zone reserved for the stacks. As suggested by Franck, one could use an allocation policy by dichotomy with two parameters: MAX_STACK_SIZE, defining the total length of the zone reserved for the stacks, and MIN_STACK_SIZE, defining the minimal length of one stack.

4.3) DATA

Since this vseg is distributed, its physical pages are spread over all the clusters according to the VPN LSBs. If the cluster K that detects the page fault is not the owner cluster Z, the kernel instance of cluster K sends a PT_MISS_RPC to cluster Z in order to obtain the PTE stored in PT(P,Z). The arguments are the PID and the VPN of the missing page. When it receives the response, it updates PT(P,K). If the cluster that detects the page fault is the owner cluster Z, it selects a target cluster M from the VPN LSBs and sends an RPC_PMEM_GET_SPP to cluster M in order to obtain the PPN of a physical page in cluster M. In response to this RPC, the kernel instance of cluster M allocates a physical page and returns its PPN. The kernel instance of cluster Z asks the file system for the contents of the missing page in the .elf file cache and initializes the physical page in M with a remote_memcpy(). It then updates PT(P,Z).
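
The selection of the target cluster M from the VPN LSBs could look like the sketch below. The exact bit assignment is not specified here, so the one used in the example is an assumption: the mesh is taken to be 2^X_WIDTH by 2^Y_WIDTH clusters, the lowest bits give x and the next bits give y. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mesh geometry: 2^X_WIDTH by 2^Y_WIDTH clusters. */
#define X_WIDTH 2u
#define Y_WIDTH 2u

typedef struct { uint32_t x, y; } cluster_xy_t;

/* Pick the target cluster M from the low-order bits of the VPN:
 * the X_WIDTH lowest bits give x, the next Y_WIDTH bits give y. */
static cluster_xy_t vpn_to_cluster(uint32_t vpn)
{
    cluster_xy_t m;
    m.x = vpn & ((1u << X_WIDTH) - 1u);
    m.y = (vpn >> X_WIDTH) & ((1u << Y_WIDTH) - 1u);
    return m;
}
```

With this assignment, consecutive virtual pages land in different clusters, which spreads the pages of a distributed vseg evenly over the mesh.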

4.4) HEAP

Since this vseg is distributed, its physical pages are spread over all the clusters according to the VPN LSBs. If the cluster K that detects the page fault is not the owner cluster Z, the kernel instance of cluster K sends a PT_MISS_RPC to cluster Z in order to obtain the PTE stored in PT(P,Z). The arguments are the PID and the VPN of the missing page. When it receives the response, it updates PT(P,K). If the cluster that detects the page fault is the owner cluster Z, it selects a target cluster M from the VPN LSBs and sends an RPC_PMEM_GET_SPP to cluster M in order to obtain the PPN of a physical page in cluster M. In response to this RPC, the kernel instance of cluster M allocates a physical page and returns its PPN. When the kernel of cluster Z obtains the PPN, it updates PT(P,Z).
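
The HEAP page fault path can be summarized by the following model, in which plain function calls stand in for PT_MISS_RPC and RPC_PMEM_GET_SPP. The sketch assumes that the owner resolves the fault itself when PT(P,Z) does not yet hold the PTE requested by a PT_MISS_RPC; the text above does not state this explicitly. The process_t layout, the PPN encoding and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NB_CLUSTERS  4u   /* hypothetical platform size                     */
#define NB_PAGES     16u  /* hypothetical vseg size, in pages               */
#define PTE_UNMAPPED 0u   /* hypothetical encoding: PPN 0 means no mapping  */

/* pt[k][vpn] models PT(P,k); next_ppn[m] models the allocator of cluster m. */
typedef struct {
    uint32_t owner;                      /* index of the owner cluster Z */
    uint32_t pt[NB_CLUSTERS][NB_PAGES];
    uint32_t next_ppn[NB_CLUSTERS];
} process_t;

/* Stand-in for RPC_PMEM_GET_SPP: cluster m allocates a page and returns
 * its PPN. The cluster is encoded in the upper bits, so a PPN is never 0. */
static uint32_t pmem_get_spp(process_t *p, uint32_t m)
{
    return ((m + 1u) << 16) | p->next_ppn[m]++;
}

/* Fault handling in owner cluster Z: allocate a page in the cluster
 * selected by the VPN LSBs, then update PT(P,Z). */
static uint32_t owner_fault(process_t *p, uint32_t vpn)
{
    uint32_t z = p->owner;
    if (p->pt[z][vpn] == PTE_UNMAPPED)
        p->pt[z][vpn] = pmem_get_spp(p, vpn % NB_CLUSTERS);
    return p->pt[z][vpn];
}

/* Fault handling in any cluster k. When k is not the owner, the call to
 * owner_fault() stands in for PT_MISS_RPC and the reply updates PT(P,k). */
static uint32_t handle_fault(process_t *p, uint32_t k, uint32_t vpn)
{
    assert(k < NB_CLUSTERS && vpn < NB_PAGES);
    uint32_t pte = owner_fault(p, vpn);
    if (k != p->owner)
        p->pt[k][vpn] = pte;
    return pte;
}
```

A second fault on the same page from another cluster reuses the PTE already stored in PT(P,Z), so the physical page is allocated only once.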

4.5) REMOTE

Since this vseg is localized, the coordinates of the mapping cluster M are registered in the vseg descriptor. If the cluster K that detects the page fault is not the owner cluster Z, the kernel instance of cluster K sends a PT_MISS_RPC to cluster Z in order to obtain the PTE stored in PT(P,Z). The arguments are the PID and the VPN of the missing page. When it receives the response, it updates PT(P,K). If the cluster that detects the page fault is the owner cluster Z, it sends an RPC_PMEM_GET_SPP to cluster M in order to obtain the PPN of a physical page in cluster M. In response to this RPC, the kernel of cluster M allocates a physical page and returns its PPN. When the kernel of cluster Z obtains the PPN, it updates PT(P,Z).

4.6) FILE

Since this vseg is localized, the coordinates of the mapping cluster M are registered in the vseg descriptor. If the cluster K that detects the page fault is not the owner cluster Z, the kernel instance of cluster K sends a PT_MISS_RPC to cluster Z in order to obtain the PTE stored in PT(P,Z). The arguments are the PID and the VPN of the missing page. When it receives the response, it updates PT(P,K). If the cluster that detects the page fault is the owner cluster Z, it sends a GET_FILE_CACHE_RPC to cluster M, which contains the file cache, in order to obtain the PPN. The arguments are the PID, the file descriptor and the page index in the mapper. In response to this RPC, the kernel of cluster M accesses the vseg mapper and returns the corresponding PPN. When the kernel instance of cluster Z obtains the PPN, it updates PT(P,Z).

4.7) ANON

Since this vseg is distributed, its physical pages are spread over all the clusters according to the VPN LSBs. A page fault is handled in the same way as for a HEAP vseg.

5) Invalidation of an entry in the page table

In cluster Z, the owner of process P, the kernel can invalidate an entry of PT(P,Z). This can happen, for example, when cluster Z runs short of memory, or more simply on a munmap(). If the vseg concerned is not a STACK vseg, the entry invalidated in PT(P,Z) must also be invalidated in the PT(P,K) of all other clusters. To do so, the kernel of cluster Z broadcasts a PT_INVAL_BCRPC to all other active clusters of P.

6) RPC broadcasts optimization

In a broadcast RPC, every recipient cluster must signal completion by atomically incrementing a response counter, which is polled by the initiator cluster. To reduce the number of recipients, the descriptor of process P in the owner cluster Z can hold four variables XMIN, XMAX, YMIN, YMAX, defining the minimal rectangle that covers all active clusters of P at any time. A broadcast RPC then has to be sent to only (XMAX - XMIN + 1) * (YMAX - YMIN + 1) recipients. These variables are updated on each thread creation.
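
A minimal sketch of the bounding rectangle follows. The structure and function names are hypothetical, and so is the empty flag used to represent a process that has no thread yet.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal rectangle covering all active clusters of a process. */
typedef struct {
    uint32_t xmin, xmax, ymin, ymax;
    int      empty;   /* hypothetical flag: no thread created yet */
} active_rect_t;

static void rect_init(active_rect_t *r)
{
    r->xmin = r->xmax = r->ymin = r->ymax = 0;
    r->empty = 1;
}

/* Called on each thread creation in cluster (x,y): grow the rectangle
 * just enough to include that cluster. */
static void rect_add(active_rect_t *r, uint32_t x, uint32_t y)
{
    if (r->empty) {
        r->xmin = r->xmax = x;
        r->ymin = r->ymax = y;
        r->empty = 0;
        return;
    }
    if (x < r->xmin) r->xmin = x;
    if (x > r->xmax) r->xmax = x;
    if (y < r->ymin) r->ymin = y;
    if (y > r->ymax) r->ymax = y;
}

/* Number of clusters a broadcast RPC must reach. */
static uint32_t rect_recipients(const active_rect_t *r)
{
    if (r->empty)
        return 0;
    return (r->xmax - r->xmin + 1u) * (r->ymax - r->ymin + 1u);
}
```

Because the update happens only on thread creation, the rectangle in this sketch never shrinks. It may therefore include clusters that hold no thread of P, which costs extra messages but does not affect correctness.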

7) Page fault handling optimization

To reduce the number of RPCs triggered by page faults, the kernel of a cluster X detecting a page fault can use a remote_read() in the PT(P,Z) table of the reference cluster instead of a PT_MISS_RPC. In this case, however, a multi-reader lock must be used to avoid an inconsistent state when cluster Z simultaneously initiates a PT_INVAL_BCRPC transaction. This lock must be systematically taken by the owner cluster before a PT_INVAL_BCRPC, and by the other clusters before a remote_read(). It ensures that a PT_INVAL_BCRPC is launched only after all remote_read() in progress have completed, and that no new remote_read() is accepted before the PT_INVAL_BCRPC completes.
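
The locking rule can be illustrated with a small multi-reader lock built on C11 atomics. This is a sketch under stated assumptions, not the kernel's lock: the pt_lock_t type and its functions are hypothetical, C11 atomics stand in for the remote atomic accesses a real implementation would need, and the non-blocking "try" form is chosen only to keep the example short. A caller would retry, or fall back to a PT_MISS_RPC, when a try fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* state >= 0 : number of remote reads in progress
 * state == -1: an invalidation broadcast is in progress */
typedef struct { atomic_int state; } pt_lock_t;

#define PT_LOCK_INVAL (-1)

static void pt_lock_init(pt_lock_t *l)
{
    atomic_init(&l->state, 0);
}

/* Taken by a non-owner cluster before a remote read in PT(P,Z).
 * Fails while an invalidation is in progress. */
static bool pt_lock_try_read(pt_lock_t *l)
{
    int s = atomic_load(&l->state);
    while (s != PT_LOCK_INVAL) {
        /* On failure, s is reloaded with the current state. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;
}

static void pt_lock_read_done(pt_lock_t *l)
{
    atomic_fetch_sub(&l->state, 1);
}

/* Taken by the owner cluster before the invalidation broadcast.
 * Succeeds only when no remote read is in progress. */
static bool pt_lock_try_inval(pt_lock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, PT_LOCK_INVAL);
}

static void pt_lock_inval_done(pt_lock_t *l)
{
    atomic_store(&l->state, 0);
}
```

This simple scheme gives no priority to the owner: a steady stream of readers can delay the invalidation indefinitely, so a production lock would probably also need a way to block new readers once an invalidation is pending.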