Timestamp:
Jul 18, 2019, 2:06:55 PM
Author:
alain
Message:

Introduce the non-standard pthread_parallel_create() system call
and rewrite the <fft> and <sort> applications to improve their
intrinsic parallelism.

File:
1 edited

  • trunk/libs/libalmosmkh/almosmkh.h

    r629 r637  
    22 * almosmkh.h - User level ALMOS-MKH specific library definition.
    33 *
    4  * Author     Alain Greiner (2016,2017,2018)
     4 * Author     Alain Greiner (2016,2017,2018,2019)
    55 *
    66 * Copyright (c) UPMC Sorbonne Universites
     
    7272
    7373/***************************************************************************************
    74  * This syscall returns the cluster an local index for the calling core.
     74 * This syscall returns the cluster identifier and the local index
     75 * for the calling core.
    7576 ***************************************************************************************
    7677 * @ cxy      : [out] cluster identifier.
     
    7879 * @ return always 0.
    7980 **************************************************************************************/
    80 int get_core( unsigned int * cxy,
    81               unsigned int * lid );
     81int get_core_id( unsigned int * cxy,
     82                 unsigned int * lid );
     83
     84/***************************************************************************************
     85 * This syscall returns the number of cores in a given cluster.
     86 ***************************************************************************************
     87 * @ cxy      : [in]  target cluster identifier.
     88 * @ ncores   : [out] number of cores in target cluster.
     89 * @ return always 0.
     90 **************************************************************************************/
     91int get_nb_cores( unsigned int   cxy,
     92                  unsigned int * ncores );
     93
     94/***************************************************************************************
     95 * This syscall uses the DQDT to search, in a macro-cluster specified by the
     96 * <cxy_base> and <level> arguments, the core with the lowest load.
     97 * It writes in the <cxy> and <lid> buffers the selected core cluster identifier
     98 * and the local core index.
     99 ***************************************************************************************
     100 * @ cxy_base : [in]  any cluster identifier in macro-cluster.
     101 * @ level    : [in]  macro-cluster level in [1,2,3,4,5].
     102 * @ cxy      : [out] selected core cluster identifier.
     103 * @ lid      : [out] selected core local index.
     104 * @ return 0 if success / 1 if no core in macro-cluster / -1 if illegal arguments.
     105 **************************************************************************************/
     106int get_best_core( unsigned int   cxy_base,
     107                   unsigned int   level,
     108                   unsigned int * cxy,
     109                   unsigned int * lid );
    82110
    83111/***************************************************************************************
    84  * This function returns the calling core cycles counter,
     112 * This function returns the value contained in the calling core cycles counter,
    85113 * taking into account a possible overflow on 32 bits architectures.
    86114 ***************************************************************************************
     
    414442                      unsigned int cxy );
    415443
     444/********* Non standard (ALMOS-MKH specific) pthread_parallel_create() syscall  *********/
     445
     446//////////////////////////////////////////////////////////////////////////////////////////
     447// This system call can be used to parallelize the creation and the termination
     448// of a parallel multi-threaded application. It removes the loop in the main thread that
     449// creates the N working threads (N sequential pthread_create() ). It also removes the
     450// loop that waits for completion of these N working threads (N sequential pthread_join() ).
     451// It creates one "work" thread (in detached mode) per core in the target architecture.
     452// Each "work" thread is identified by the [cxy][lid] indexes (cluster / local core).
     453// The pthread_parallel_create() function returns only when all "work" threads completed
     454// (successfully or not).
     455//
     456// To use this system call, the application code must define the following structures:
     457// - To define the arguments to pass to the <work> function the application must allocate
     458//   and initialize a first 2D array, indexed by [cxy] and [lid] indexes, where each slot
     459//   contains an application specific structure, and another 2D array, indexed by the same
     460//   indexes, containing pointers on these structures. This array of pointers is one
     461//   argument of the pthread_parallel_create() function.
     462// - To detect the completion of the <work> threads, the application must allocate a 1D
     463//   array, indexed by the cluster index [cxy], where each slot contains a pthread_barrier
     464//   descriptor. This barrier is initialised by the pthread_parallel_create() function,
     465//   in all clusters containing at least one work thread. This array of barriers is another
     466//   argument of the pthread_parallel_create() function.
     467//
     468// Implementation note:
     469// To parallelize the "work" threads creation and termination, the pthread_parallel_create()
     470// function creates a distributed quad-tree (DQT) of "build" threads covering all cores
     471// required to execute the parallel application.
     472// Depending on the hardware topology, this DQT can be truncated, (i.e. some
     473// parent nodes can have fewer than 4 children), if (x_size != y_size), or if one size
     474// is not a power of 2. Each "build" thread is identified by two indexes [cxy][level].
     475// Each "build" thread performs the following tasks:
     476// 1) It calls the pthread_create() function to create up to 4 children threads, that
     477//    are "work" threads when (level == 0), or "build" threads, when (level > 0).
     478// 2) It initializes the barrier (global variable), used to block/unblock
     479//    the parent thread until children completion.
     480// 3) It calls the pthread_barrier_wait( self ) to wait until all children threads
     481//    completed (successfully or not).
     482// 4) It calls the pthread_barrier_wait( parent ) to unblock the parent thread.
     483//////////////////////////////////////////////////////////////////////////////////////////
     484
     485/*****************************************************************************************
     486 * This blocking function creates N working threads that execute the code defined
     487 * by the <work_func> and <work_args_array> arguments.
     488 * The number N of created threads is entirely defined by the <root_level> argument.
     489 * This value defines an abstract quad-tree, with a square base : level in [0,1,2,3,4],
     490 * side in [1,2,4,8,16], nclusters in [1,4,16,64,256]. This base is called a macro-cluster.
     491 * A working thread is created on all cores contained in the specified macro-cluster.
     492 * The actual number of physical clusters containing cores can be smaller than the number
     493 * of clusters covered by the quad tree. The actual number of cores in a cluster can be
     494 * less than the max value.
     495 *
     496 * In the current implementation, all threads execute the same <work_func> function,
     497 * on different arguments, that are specified as a 2D array of pointers <work_args_array>.
     498 * This can be modified in a future version, where the <work_func> argument can become
     499 * a 2D array of pointers, to have one specific function for each thread.
     500 *****************************************************************************************
     501 * @ root_level            : [in]  DQT root level in [0,1,2,3,4].
     502 * @ work_func             : [in]  pointer on start function.
     503 * @ work_args_array       : [in]  pointer on a 2D array of pointers.
     504 * @ parent_barriers_array : [in]  pointer on a 1D array of barriers.
     505 * @ return 0 if success / return -1 if failure.
     506 ****************************************************************************************/
     507int pthread_parallel_create( unsigned int   root_level,
     508                             void         * work_func,
     509                             void         * work_args_array,
     510                             void         * parent_barriers_array );
     511
    416512#endif /* _LIBALMOSMKH_H_ */
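
The header comment for pthread_parallel_create() relates the <root_level> argument to the macro-cluster geometry (side in [1,2,4,8,16], nclusters in [1,4,16,64,256]). The arithmetic behind those values can be sketched as below. The helper names are hypothetical and are not part of almosmkh.h, and the thread count is only an upper bound, since the comment notes that some covered clusters or cores may be absent.

```c
/* Hypothetical helpers, NOT part of almosmkh.h: they only reproduce the
 * level / side / nclusters relation quoted in the header comment. */

/* Side of the square macro-cluster: 2^level.
 * Returns 0 for a level outside [0,4], the range given in the header. */
unsigned int macro_cluster_side( unsigned int level )
{
    if( level > 4 ) return 0;
    return 1u << level;
}

/* Number of clusters covered by the quad-tree base: side * side = 4^level. */
unsigned int macro_cluster_nclusters( unsigned int level )
{
    unsigned int side = macro_cluster_side( level );
    return side * side;
}

/* Upper bound on the number of "work" threads, assuming every covered
 * cluster exists and holds <cores_per_cluster> cores (an assumption:
 * the real platform may have fewer clusters or cores). */
unsigned int max_work_threads( unsigned int level,
                               unsigned int cores_per_cluster )
{
    return macro_cluster_nclusters( level ) * cores_per_cluster;
}
```

For example, under these assumptions a <root_level> of 2 covers a 4 x 4 base of 16 clusters, so with 4 cores per cluster at most 64 work threads would be created.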
    417513
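
The implementation note in the header describes a tree of "build" threads that create their children and then synchronize through barriers. The sketch below illustrates that protocol with standard POSIX threads only. It is not the ALMOS-MKH implementation: the node_t type, run_tree(), spawn_detached() and the fixed fan-out of 4 are assumptions made for illustration, thread placement on cores is ignored, and the barrier is initialized before the children are created (the header lists these two steps in the other order) so that no child can reach an uninitialized barrier.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define FANOUT 4                      /* assumed: full quad-tree, 4 cores per cluster */

typedef struct                        /* hypothetical "build" node descriptor */
{
    unsigned int        level;        /* 0 : children are "work" threads      */
    pthread_barrier_t   self;         /* shared by this node and its children */
    pthread_barrier_t * parent;       /* barrier owned by the parent node     */
} node_t;

static atomic_uint work_done;         /* counts executed "work" threads */

/* Create one detached thread; abort on failure, because a missing child
 * would leave its parent blocked on the barrier forever. */
static void spawn_detached( void * (*func)( void * ), void * arg )
{
    pthread_t      tid;
    pthread_attr_t attr;
    pthread_attr_init( &attr );
    pthread_attr_setdetachstate( &attr, PTHREAD_CREATE_DETACHED );
    if( pthread_create( &tid, &attr, func, arg ) ) abort();
    pthread_attr_destroy( &attr );
}

static void * work_thread( void * arg )
{
    atomic_fetch_add( &work_done, 1 );            /* stand-in for real work   */
    pthread_barrier_wait( (pthread_barrier_t *)arg );  /* signal the parent   */
    return NULL;
}

static void * build_thread( void * arg )
{
    node_t     * node = arg;
    node_t       kids[FANOUT];                    /* children descriptors     */
    unsigned int i;

    /* barrier is shared by the FANOUT children and this build thread */
    pthread_barrier_init( &node->self, NULL, FANOUT + 1 );

    for( i = 0 ; i < FANOUT ; i++ )
    {
        if( node->level == 0 )
        {
            spawn_detached( work_thread, &node->self );
        }
        else
        {
            kids[i].level  = node->level - 1;
            kids[i].parent = &node->self;
            spawn_detached( build_thread, &kids[i] );
        }
    }
    pthread_barrier_wait( &node->self );          /* wait for all children    */
    pthread_barrier_destroy( &node->self );
    pthread_barrier_wait( node->parent );         /* unblock the parent       */
    return NULL;
}

/* Blocking entry point: returns the number of "work" threads executed. */
unsigned int run_tree( unsigned int root_level )
{
    node_t            root;
    pthread_barrier_t done;                       /* caller + root build thread */

    atomic_store( &work_done, 0 );
    pthread_barrier_init( &done, NULL, 2 );
    root.level  = root_level;
    root.parent = &done;
    spawn_detached( build_thread, &root );
    pthread_barrier_wait( &done );
    pthread_barrier_destroy( &done );
    return atomic_load( &work_done );
}
```

With a full fan-out, run_tree( L ) executes 4^(L+1) work threads. The real system call can create fewer, since the header states that the tree may be truncated by the hardware topology.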