CPU 调度器如何工作?#

NumPy 调度器基于多源编译,这意味着获取某个源并使用不同的编译器标志以及影响代码路径的不同 C 定义多次编译它.这为每个编译对象启用某些指令集,具体取决于所需的优化,并最终将返回的对象链接在一起.

../../_images/opt-infra.png

此机制应支持所有编译器,并且不需要任何编译器特定的扩展,但与此同时,它会向正常的编译添加一些步骤,如下所述.

1- 配置#

在通过两个命令参数开始构建源文件之前,用户配置所需的优化,如上所述:

  • --cpu-baseline : 所需优化的最小集合.

  • --cpu-dispatch : 调度的额外优化集合.

2- 发现环境#

在此部分中,我们检查编译器和平台架构,并缓存一些中间结果以加快重新构建.

3- 验证请求的优化#

通过针对编译器进行测试,并根据请求的优化查看编译器可以支持的内容.

4- 生成主配置文件头#

生成的头文件 _cpu_dispatch.h 包含在先前步骤中已验证的所需优化的指令集的所有定义和头文件.

它还包含额外的 C 定义,这些定义用于定义 NumPy 的 Python 级别模块属性 __cpu_baseline____cpu_dispatch__ .

这个头文件中有什么?

该示例头文件由 gcc 在 X86 机器上动态生成.编译器支持 --cpu-baseline="sse sse2 sse3"--cpu-dispatch="ssse3 sse41" ,结果如下.

// The header should be located at numpy/numpy/_core/src/common/_cpu_dispatch.h
/**NOTE
 ** C definitions prefixed with "NPY_HAVE_" represent
 ** the required optimizations.
 **
 ** C definitions prefixed with 'NPY__CPU_TARGET_' are protected and
 ** shouldn't be used by any NumPy C sources.
 */
/******* baseline features *******/
/** SSE **/
#define NPY_HAVE_SSE 1
#include <xmmintrin.h>
/** SSE2 **/
#define NPY_HAVE_SSE2 1
#include <emmintrin.h>
/** SSE3 **/
#define NPY_HAVE_SSE3 1
#include <pmmintrin.h>

/******* dispatch-able features *******/
#ifdef NPY__CPU_TARGET_SSSE3
  /** SSSE3 **/
  #define NPY_HAVE_SSSE3 1
  #include <tmmintrin.h>
#endif
#ifdef NPY__CPU_TARGET_SSE41
  /** SSE41 **/
  #define NPY_HAVE_SSE41 1
  #include <smmintrin.h>
#endif

基线特性是通过 --cpu-baseline 配置的所需最小优化集.它们没有预处理器保护,并且始终处于启用状态,这意味着它们可以在任何源中使用.

这是否意味着 NumPy 的基础结构将编译器的基线特性标志传递给所有源?

绝对是,是的.但是 dispatch-able sources 的处理方式有所不同.

如果用户在构建期间指定了某些基线功能,但在运行时,机器甚至不支持这些功能,该怎么办?编译后的代码会通过这些定义之一调用,或者编译器本身是否会根据提供的命令行编译器标志自动生成/向量化某些代码段?

在加载 NumPy 模块期间,有一个验证步骤可以检测此行为.它将引发 Python 运行时错误来通知用户.这是为了防止 CPU 达到非法指令错误而导致段错误.

可分派特性是我们通过 --cpu-dispatch 配置的一组额外的优化.它们默认情况下未激活,并且始终受以 NPY__CPU_TARGET_ 为前缀的其他 C 定义保护. C 定义 NPY__CPU_TARGET_ 仅在可分派源中启用.

5- 可分派源和配置语句#

可分派源是特殊的 C 文件,可以使用不同的编译器标志多次编译,也可以使用不同的 C 定义进行编译. 这些会影响代码路径,从而根据必须在 C 注释(//) 之间声明并在每个可分派源顶部以特殊标记 @targets 开头的“配置语句”,为每个编译对象启用某些指令集. 同时,如果通过命令参数 --disable-optimization 禁用了优化,则可分派源将被视为普通 C 源.

什么是配置语句?

配置语句是组合在一起以确定可分派源所需的优化的关键字.

示例:

/*@targets avx2 avx512f vsx2 vsx3 asimd asimdhp */
// C code

这些关键字主要代表通过 --cpu-dispatch 配置的额外优化,但它也可以代表其他选项,例如:

  • 目标组:预配置的配置语句,用于从可分派源外部管理所需的优化.

  • 策略:用于更改默认行为或强制编译器执行某些操作的选项集合.

  • “baseline”:一个唯一的关键字,表示通过 --cpu-baseline 配置的最小优化.

Numpy的基础设施通过四个步骤处理可分派源:

  • (A) Recognition: Just like source templates and F2PY, the dispatch-able sources requires a special extension *.dispatch.c to mark C dispatch-able source files, and for C++ *.dispatch.cpp or *.dispatch.cxx NOTE: C++ not supported yet.

  • (B) Parsing and validating: In this step, the dispatch-able sources that had been filtered by the previous step are parsed and validated by the configuration statements for each one of them one by one in order to determine the required optimizations.

  • (C) Wrapping: This is the approach taken by NumPy’s infrastructure, which has proved to be sufficiently flexible in order to compile a single source multiple times with different C definitions and flags that affect the code paths. The process is achieved by creating a temporary C source for each required optimization that related to the additional optimization, which contains the declarations of the C definitions and includes the involved source via the C directive #include. For more clarification take a look at the following code for AVX512F :

    /*
     * this definition is used by NumPy utilities as suffixes for the
     * exported symbols
     */
    #define NPY__CPU_TARGET_CURRENT AVX512F
    /*
     * The following definitions enable
     * definitions of the dispatch-able features that are defined within the main
     * configuration header. These are definitions for the implied features.
     */
    #define NPY__CPU_TARGET_SSE
    #define NPY__CPU_TARGET_SSE2
    #define NPY__CPU_TARGET_SSE3
    #define NPY__CPU_TARGET_SSSE3
    #define NPY__CPU_TARGET_SSE41
    #define NPY__CPU_TARGET_POPCNT
    #define NPY__CPU_TARGET_SSE42
    #define NPY__CPU_TARGET_AVX
    #define NPY__CPU_TARGET_F16C
    #define NPY__CPU_TARGET_FMA3
    #define NPY__CPU_TARGET_AVX2
    #define NPY__CPU_TARGET_AVX512F
    // our dispatch-able source
    #include "/the/absolute/path/of/hello.dispatch.c"
    
  • (D) Dispatch-able configuration header: The infrastructure generates a config header for each dispatch-able source, this header mainly contains two abstract C macros used for identifying the generated objects, so they can be used for runtime dispatching certain symbols from the generated objects by any C source. It is also used for forward declarations.

    生成的头文件采用可分派源的名称,在排除扩展名后,将其替换为 .h ,例如,假设我们有一个名为 hello.dispatch.c 的可分派源,其中包含以下内容:

    // hello.dispatch.c
    /*@targets baseline sse42 avx512f */
    #include <stdio.h>
    #include "numpy/utils.h" // NPY_CAT, NPY_TOSTR
    
    #ifndef NPY__CPU_TARGET_CURRENT
      // wrapping the dispatch-able source only happens to the additional optimizations
      // but if the keyword 'baseline' provided within the configuration statements,
      // the infrastructure will add extra compiling for the dispatch-able source by
      // passing it as-is to the compiler without any changes.
      #define CURRENT_TARGET(X) X
      #define NPY__CPU_TARGET_CURRENT baseline // for printing only
    #else
      // since we reach to this point, that's mean we're dealing with
        // the additional optimizations, so it could be SSE42 or AVX512F
      #define CURRENT_TARGET(X) NPY_CAT(NPY_CAT(X, _), NPY__CPU_TARGET_CURRENT)
    #endif
    // Macro 'CURRENT_TARGET' adding the current target as suffix to the exported symbols,
    // to avoid linking duplications, NumPy already has a macro called
    // 'NPY_CPU_DISPATCH_CURFX' similar to it, located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h
    // NOTE: we tend to not adding suffixes to the baseline exported symbols
    void CURRENT_TARGET(simd_whoami)(const char *extra_info)
    {
        printf("I'm " NPY_TOSTR(NPY__CPU_TARGET_CURRENT) ", %s\n", extra_info);
    }
    

    现在假设您将 hello.dispatch.c 附加到源代码树,那么基础设施应该生成一个名为 hello.dispatch.h 的临时配置头文件,该文件可以被源代码树中的任何源文件访问,并且它应该包含以下代码:

    #ifndef NPY__CPU_DISPATCH_EXPAND_
      // To expand the macro calls in this header
        #define NPY__CPU_DISPATCH_EXPAND_(X) X
    #endif
    // Undefining the following macros, due to the possibility of including config headers
    // multiple times within the same source and since each config header represents
    // different required optimizations according to the specified configuration
    // statements in the dispatch-able source that derived from it.
    #undef NPY__CPU_DISPATCH_BASELINE_CALL
    #undef NPY__CPU_DISPATCH_CALL
    // nothing strange here, just a normal preprocessor callback
    // enabled only if 'baseline' specified within the configuration statements
    #define NPY__CPU_DISPATCH_BASELINE_CALL(CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB(__VA_ARGS__))
    // 'NPY__CPU_DISPATCH_CALL' is an abstract macro is used for dispatching
    // the required optimizations that specified within the configuration statements.
    //
    // @param CHK, Expected a macro that can be used to detect CPU features
    // in runtime, which takes a CPU feature name without string quotes and
    // returns the testing result in a shape of boolean value.
    // NumPy already has macro called "NPY_CPU_HAVE", which fits this requirement.
    //
    // @param CB, a callback macro that expected to be called multiple times depending
    // on the required optimizations, the callback should receive the following arguments:
    //  1- The pending calls of @param CHK filled up with the required CPU features,
    //     that need to be tested first in runtime before executing call belong to
    //     the compiled object.
    //  2- The required optimization name, same as in 'NPY__CPU_TARGET_CURRENT'
    //  3- Extra arguments in the macro itself
    //
    // By default the callback calls are sorted depending on the highest interest
    // unless the policy "$keep_sort" was in place within the configuration statements
    // see "Dive into the CPU dispatcher" for more clarification.
    #define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(SSE)&&CHK(SSE2)&&CHK(SSE3)&&CHK(SSSE3)&&CHK(SSE41)), SSE41, __VA_ARGS__))
    

    以下是使用上述配置头文件的示例:

    // NOTE: The following macros are only defined for demonstration purposes only.
    // NumPy already has a collections of macros located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h, that covers all dispatching
    // and declarations scenarios.
    
    #include "numpy/npy_cpu_features.h" // NPY_CPU_HAVE
    #include "numpy/utils.h" // NPY_CAT, NPY_EXPAND
    
    // An example for setting a macro that calls all the exported symbols at once
    // after checking if they're supported by the running machine.
    #define DISPATCH_CALL_ALL(FN, ARGS) \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_ALL_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_ALL_CB, FN, ARGS)
    // The preprocessor callbacks.
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_ALL_CB(CHECK, TARGET_NAME, FN, ARGS) \
      if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_ALL_CB(FN, ARGS) \
      FN NPY_EXPAND(ARGS);
    
    // An example for setting a macro that calls the exported symbols of highest
    // interest optimization, after checking if they're supported by the running machine.
    #define DISPATCH_CALL_HIGH(FN, ARGS) \
      if (0) {} \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_HIGH_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_HIGH_CB, FN, ARGS)
    // The preprocessor callbacks
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \
      else if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_HIGH_CB(FN, ARGS) \
      else { FN NPY_EXPAND(ARGS); }
    
    // NumPy has a macro called 'NPY_CPU_DISPATCH_DECLARE' can be used
    // for forward declarations any kind of prototypes based on
    // 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
    // However in this example, we just handle it manually.
    void simd_whoami(const char *extra_info);
    void simd_whoami_AVX512F(const char *extra_info);
    void simd_whoami_SSE41(const char *extra_info);
    
    void trigger_me(void)
    {
        // bring the auto-generated config header
        // which contains config macros 'NPY__CPU_DISPATCH_CALL' and
        // 'NPY__CPU_DISPATCH_BASELINE_CALL'.
        // it is highly recommended to include the config header before executing
      // the dispatching macros in case if there's another header in the scope.
        #include "hello.dispatch.h"
        DISPATCH_CALL_ALL(simd_whoami, ("all"))
        DISPATCH_CALL_HIGH(simd_whoami, ("the highest interest"))
        // An example of including multiple config headers in the same source
        // #include "hello2.dispatch.h"
        // DISPATCH_CALL_HIGH(another_function, ("the highest interest"))
    }