Nowadays, single-chip cache-coherent many-core processors with up to 100 cores are a reality, and many-cores with hundreds or even a thousand cores are planned for the near future. In these architectures, the locality of the traffic caused by L1 cache misses (data, instruction and TLB) is essential for both scalability and power consumption (energy per moved bit). Our thesis is that: (i) the locality of memory accesses should be handled at the kernel level of an operating system, transparently to user applications; and (ii) current monolithic kernels are unable to enforce the locality of the memory accesses of multi-threaded applications, because the concept of thread in these kernels is inherently unsuitable for many-core processors. We therefore believe that the evolutionary approach to monolithic kernels undertaken until now is insufficient, and that the locality of memory accesses must be put at the heart of this evolution.

To prove our thesis, we designed and implemented ALMOS (Advanced Locality Management Operating System), an experimental operating system based on a distributed monolithic kernel. ALMOS introduces a new concept of thread, called the Hybrid Process, which allows its kernel to enforce the locality of the memory accesses of each thread. The management of resources (cores and physical memory) in ALMOS's kernel is distributed, enforcing the locality of memory accesses when performing system services. Decision making regarding memory allocation, task placement and load balancing in ALMOS's kernel is decentralized, multi-criteria and lock-free. It is based on a distributed infrastructure that coordinates access to resources in a scalable manner.

Using the cycle-accurate and bit-accurate virtual prototype of the TSAR many-core processor, we experimentally demonstrated that: (i) the performance (scalability and execution time) on 256 cores of the distributed scheduling scheme of ALMOS's kernel outperforms that of the shared scheduling scheme found in existing monolithic kernels; (ii) the distributed implementation of the fork system call enables this system service to scale to 512 cores; (iii) updating the distributed decision-making infrastructure of ALMOS's kernel costs just 0.05 % of the total computing power of the TSAR processor; (iv) the performance (scalability, execution time and remote traffic) of the memory affinity strategy of ALMOS's kernel, called Auto-Next-Touch, outperforms that of the two existing strategies, First-Touch and Interleave, on 64 cores; (v) the Hybrid Process concept of ALMOS's kernel scales two existing highly multi-threaded applications up to 256 cores and a third one up to 1024 cores; and finally (vi) the ALMOS/TSAR couple (64 cores) systematically achieves much better scalability than the Linux/AMD couple (Interlagos, 64 cores) for 8 multi-threaded applications from the HPC and image processing domains.