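How does the CPU dispatcher work?

The NumPy dispatcher is based on multi-source compiling, which means taking a certain source and compiling it multiple times with different compiler flags and with different C definitions that affect the code paths. This enables certain instruction sets for each compiled object, depending on the required optimizations, and finally links the resulting objects together.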

../../_images/opt-infra.png

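This mechanism should support all compilers and does not require any compiler-specific extension, but at the same time it adds a few steps to normal compilation, which are explained below.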

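1- Configuration

Before starting to build the source files, the user configures the required optimizations via two command-line arguments, as explained above:

  • --cpu-baseline: the minimal set of required optimizations.

  • --cpu-dispatch: the dispatched set of additional optimizations.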

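2- Discovering the environment

In this part, we check the compiler and the platform architecture, and cache some of the intermediate results to speed up rebuilding.

3- Validating the requested optimizations

The requested optimizations are tested against the compiler to see which of them the compiler can actually support.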

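4- Generating the main configuration header

The generated header _cpu_dispatch.h contains all the definitions and headers of the instruction sets for the required optimizations that were validated during the previous step.

It also contains extra C definitions that are used for defining NumPy's Python-level module attributes __cpu_baseline__ and __cpu_dispatch__.

What is in this header?

The example header below was dynamically generated by gcc on an X86 machine. The compiler supports --cpu-baseline="sse sse2 sse3" and --cpu-dispatch="ssse3 sse41", and the result is as follows.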

// The header should be located at numpy/numpy/_core/src/common/_cpu_dispatch.h
/**NOTE
 ** C definitions prefixed with "NPY_HAVE_" represent
 ** the required optimizations.
 **
 ** C definitions prefixed with 'NPY__CPU_TARGET_' are protected and
 ** shouldn't be used by any NumPy C sources.
 */
/******* baseline features *******/
/** SSE **/
#define NPY_HAVE_SSE 1
#include <xmmintrin.h>
/** SSE2 **/
#define NPY_HAVE_SSE2 1
#include <emmintrin.h>
/** SSE3 **/
#define NPY_HAVE_SSE3 1
#include <pmmintrin.h>

/******* dispatch-able features *******/
#ifdef NPY__CPU_TARGET_SSSE3
  /** SSSE3 **/
  #define NPY_HAVE_SSSE3 1
  #include <tmmintrin.h>
#endif
#ifdef NPY__CPU_TARGET_SSE41
  /** SSE41 **/
  #define NPY_HAVE_SSE41 1
  #include <smmintrin.h>
#endif

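Baseline features are the minimal set of required optimizations configured via --cpu-baseline. They have no preprocessor guards and are always on, which means they can be used in any source.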
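A minimal sketch of how an ordinary C source might branch on a baseline definition. The helper name sum_f32 is hypothetical, and the block at the top is only a stand-in for the generated _cpu_dispatch.h: it borrows the compiler's own __SSE__ macro so that the example compiles on any platform, whereas in a real build NPY_HAVE_SSE would come from the generated header.

```c
#include <stddef.h>

/* Stand-in for the generated _cpu_dispatch.h (assumption for this sketch):
 * a real build gets NPY_HAVE_SSE and the intrinsics header from that file. */
#if defined(__SSE__) && !defined(NPY_HAVE_SSE)
  #define NPY_HAVE_SSE 1
  #include <xmmintrin.h>
#endif

/* Hypothetical helper: sums n floats, using SSE when it is in the baseline. */
static float sum_f32(const float *src, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#ifdef NPY_HAVE_SSE
    /* Baseline path: no runtime check is needed, the feature is always on. */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(src + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail, which is also the whole loop when SSE is unavailable. */
    for (; i < n; ++i) {
        total += src[i];
    }
    return total;
}
```

Because the guard is a plain #ifdef on a baseline definition, the same function body works in any source file without going through the dispatching machinery.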

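Does this mean NumPy's infrastructure passes the compiler flags of the baseline features to all sources?

Definitely, yes. But the dispatch-able sources are treated differently.

What if the user specifies certain baseline features during the build, but at runtime the machine does not support even these features? Will the compiled code be called via one of these definitions, or has the compiler itself perhaps auto-generated or vectorized certain pieces of code based on the provided command-line compiler flags?

During the loading of the NumPy module, there is a validation step that detects this situation. It raises a Python runtime error to inform the user. This prevents the CPU from hitting an illegal-instruction error and causing a segfault.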
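The load-time check can be pictured as a comparison between the features the build assumed and the features the running CPU reports. The sketch below is only an illustration under stated assumptions: the feature bits, the function name demo_validate_baseline, and the error text are all hypothetical and are not NumPy's actual implementation.

```c
#include <stdio.h>

/* Hypothetical feature bits; the real feature table is much larger. */
enum demo_feature {
    DEMO_SSE  = 1 << 0,
    DEMO_SSE2 = 1 << 1,
    DEMO_SSE3 = 1 << 2
};

/* Returns 0 when every baseline feature is available at runtime.
 * Otherwise returns -1 and writes a short message into err. */
static int demo_validate_baseline(unsigned baseline, unsigned runtime,
                                  char *err, size_t err_len)
{
    /* Bits that the build requires but the CPU does not report. */
    unsigned missing = baseline & ~runtime;
    if (missing == 0) {
        return 0;
    }
    snprintf(err, err_len,
             "baseline features missing at runtime (mask 0x%x)", missing);
    return -1;
}
```

A module initializer could call such a check once and turn a non-zero result into an error before any optimized code runs.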

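Dispatch-able features are the dispatched set of additional optimizations configured via --cpu-dispatch. They are not activated by default and are always guarded by other C definitions prefixed with NPY__CPU_TARGET_. The NPY__CPU_TARGET_ definitions are only enabled within dispatch-able sources.

5- Dispatch-able sources and configuration statements

Dispatch-able sources are special C files that can be compiled multiple times with different compiler flags and different C definitions. These affect the code paths and enable certain instruction sets for each compiled object according to the "configuration statements", which must be declared inside a C comment (/* */) and start with the special mark @targets at the top of each dispatch-able source. At the same time, dispatch-able sources are treated as normal C sources if optimization has been disabled via the command argument --disable-optimization.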

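What are configuration statements?

Configuration statements are keywords combined together to determine the required optimizations for a dispatch-able source.

Example: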

/*@targets avx2 avx512f vsx2 vsx3 asimd asimdhp */
// C code

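The keywords mainly represent the additional optimizations configured via --cpu-dispatch, but they can also represent other options such as:

  • Target groups: pre-configured configuration statements used for managing the required optimizations from outside the dispatch-able source.

  • Policies: collections of options used for changing the default behaviors or forcing the compiler to perform certain things.

  • "baseline": a unique keyword that represents the minimal optimizations configured via --cpu-baseline.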
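To make the shape of a configuration statement concrete, here is a minimal sketch that pulls the whitespace-separated keywords out of an @targets comment. The function demo_parse_targets is hypothetical; it only splits the text into keywords and does not interpret target groups or policies, which the real infrastructure handles in its build tooling.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: copies up to max_out keywords that follow "@targets"
 * into out (each truncated to 31 characters). Returns the number of keywords
 * found, or -1 when the text contains no "@targets" mark. */
static int demo_parse_targets(const char *src, char out[][32], int max_out)
{
    const char *p = strstr(src, "@targets");
    if (p == NULL) {
        return -1;
    }
    p += strlen("@targets");

    int count = 0;
    while (count < max_out) {
        /* Skip whitespace between keywords. */
        while (*p != '\0' && isspace((unsigned char)*p)) {
            ++p;
        }
        /* Stop at the end of the text or at the closing comment marker. */
        if (*p == '\0' || (p[0] == '*' && p[1] == '/')) {
            break;
        }
        /* Measure the keyword. */
        size_t len = 0;
        while (p[len] != '\0' && !isspace((unsigned char)p[len]) &&
               !(p[len] == '*' && p[len + 1] == '/')) {
            ++len;
        }
        size_t copy = len < 31 ? len : 31;
        memcpy(out[count], p, copy);
        out[count][copy] = '\0';
        ++count;
        p += len;
    }
    return count;
}
```

Applied to the example statement above, the sketch would yield the six keywords avx2, avx512f, vsx2, vsx3, asimd, and asimdhp in that order.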

NumPy's infrastructure handles dispatch-able sources in four steps:

  • (A) Recognition: Just like source templates and F2PY, dispatch-able sources require a special extension, *.dispatch.c, to mark C dispatch-able source files, and *.dispatch.cpp or *.dispatch.cxx for C++. NOTE: C++ is not supported yet.

  • (B) Parsing and validating: In this step, the dispatch-able sources filtered by the previous step are parsed one by one, and the configuration statements of each are validated in order to determine the required optimizations.

  • (C) Wrapping: This is the approach taken by NumPy's infrastructure, which has proved to be sufficiently flexible to compile a single source multiple times with different C definitions and flags that affect the code paths. For each required additional optimization, the infrastructure creates a temporary C source that contains the declarations of the C definitions and includes the involved source via the C directive #include. For more clarification, take a look at the following code for AVX512F:

    /*
     * this definition is used by NumPy utilities as suffixes for the
     * exported symbols
     */
    #define NPY__CPU_TARGET_CURRENT AVX512F
    /*
     * The following definitions enable
     * definitions of the dispatch-able features that are defined within the main
     * configuration header. These are definitions for the implied features.
     */
    #define NPY__CPU_TARGET_SSE
    #define NPY__CPU_TARGET_SSE2
    #define NPY__CPU_TARGET_SSE3
    #define NPY__CPU_TARGET_SSSE3
    #define NPY__CPU_TARGET_SSE41
    #define NPY__CPU_TARGET_POPCNT
    #define NPY__CPU_TARGET_SSE42
    #define NPY__CPU_TARGET_AVX
    #define NPY__CPU_TARGET_F16C
    #define NPY__CPU_TARGET_FMA3
    #define NPY__CPU_TARGET_AVX2
    #define NPY__CPU_TARGET_AVX512F
    // our dispatch-able source
    #include "/the/absolute/path/of/hello.dispatch.c"
    
  • (D) Dispatch-able configuration header: The infrastructure generates a config header for each dispatch-able source. This header mainly contains two abstract C macros used for identifying the generated objects, so that any C source can use them to dispatch certain symbols from the generated objects at runtime. It is also used for forward declarations.

    The generated header takes the name of the dispatch-able source, with the extension removed and replaced by .h. For example, assume we have a dispatch-able source called hello.dispatch.c that contains the following:

    // hello.dispatch.c
    /*@targets baseline sse41 avx512f */
    #include <stdio.h>
    #include "numpy/utils.h" // NPY_CAT, NPY_TOSTR
    
    #ifndef NPY__CPU_TARGET_CURRENT
      // wrapping the dispatch-able source only happens for the additional optimizations,
      // but if the keyword 'baseline' is provided within the configuration statements,
      // the infrastructure adds an extra compilation of the dispatch-able source by
      // passing it as-is to the compiler without any changes.
      #define CURRENT_TARGET(X) X
      #define NPY__CPU_TARGET_CURRENT baseline // for printing only
    #else
      // reaching this point means we're dealing with
      // the additional optimizations, so it could be SSE41 or AVX512F
      #define CURRENT_TARGET(X) NPY_CAT(NPY_CAT(X, _), NPY__CPU_TARGET_CURRENT)
    #endif
    // The macro 'CURRENT_TARGET' adds the current target as a suffix to the exported
    // symbols to avoid duplicate symbols at link time. NumPy already has a similar
    // macro called 'NPY_CPU_DISPATCH_CURFX', located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h
    // NOTE: we tend not to add suffixes to the baseline exported symbols
    void CURRENT_TARGET(simd_whoami)(const char *extra_info)
    {
        printf("I'm " NPY_TOSTR(NPY__CPU_TARGET_CURRENT) ", %s\n", extra_info);
    }
    

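    Now assume you attached hello.dispatch.c to the source tree. The infrastructure should then generate a temporary config header called hello.dispatch.h that can be reached by any source in the source tree, and it should contain the following code: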

    #ifndef NPY__CPU_DISPATCH_EXPAND_
      // To expand the macro calls in this header
      #define NPY__CPU_DISPATCH_EXPAND_(X) X
    #endif
    // Undefine the following macros, since config headers may be included
    // multiple times within the same source and each config header represents
    // different required optimizations, according to the configuration
    // statements of the dispatch-able source it is derived from.
    #undef NPY__CPU_DISPATCH_BASELINE_CALL
    #undef NPY__CPU_DISPATCH_CALL
    // nothing strange here, just a normal preprocessor callback
    // enabled only if 'baseline' specified within the configuration statements
    #define NPY__CPU_DISPATCH_BASELINE_CALL(CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB(__VA_ARGS__))
    // 'NPY__CPU_DISPATCH_CALL' is an abstract macro used for dispatching
    // the required optimizations specified within the configuration statements.
    //
    // @param CHK, a macro that can be used to detect CPU features
    // at runtime; it takes a CPU feature name without string quotes and
    // returns the test result as a boolean value.
    // NumPy already has a macro called "NPY_CPU_HAVE", which fits this requirement.
    //
    // @param CB, a callback macro that is expected to be called multiple times depending
    // on the required optimizations. The callback should receive the following arguments:
    //  1- The pending calls of @param CHK filled in with the required CPU features,
    //     which need to be tested at runtime before executing a call that belongs to
    //     the compiled object.
    //  2- The required optimization name, same as in 'NPY__CPU_TARGET_CURRENT'
    //  3- Extra arguments passed to the macro itself
    //
    // By default the callback calls are sorted by highest interest,
    // unless the policy "$keep_sort" is in place within the configuration statements.
    // See "Dive into the CPU dispatcher" for more clarification.
    #define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(SSE)&&CHK(SSE2)&&CHK(SSE3)&&CHK(SSSE3)&&CHK(SSE41)), SSE41, __VA_ARGS__))
    

    An example of using the config header in light of the above:

    // NOTE: The following macros are defined for demonstration purposes only.
    // NumPy already has a collection of macros located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h that covers all dispatching
    // and declaration scenarios.
    
    #include "numpy/npy_cpu_features.h" // NPY_CPU_HAVE
    #include "numpy/utils.h" // NPY_CAT, NPY_EXPAND
    
    // An example for setting a macro that calls all the exported symbols at once
    // after checking if they're supported by the running machine.
    #define DISPATCH_CALL_ALL(FN, ARGS) \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_ALL_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_ALL_CB, FN, ARGS)
    // The preprocessor callbacks.
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_ALL_CB(CHECK, TARGET_NAME, FN, ARGS) \
      if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_ALL_CB(FN, ARGS) \
      FN NPY_EXPAND(ARGS);
    
    // An example for setting a macro that calls the exported symbols of highest
    // interest optimization, after checking if they're supported by the running machine.
    #define DISPATCH_CALL_HIGH(FN, ARGS) \
      if (0) {} \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_HIGH_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_HIGH_CB, FN, ARGS)
    // The preprocessor callbacks
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \
      else if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_HIGH_CB(FN, ARGS) \
      else { FN NPY_EXPAND(ARGS); }
    
    // NumPy has a macro called 'NPY_CPU_DISPATCH_DECLARE' that can be used
    // for forward declarations of any kind of prototype based on
    // 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
    // However, in this example we just handle it manually.
    void simd_whoami(const char *extra_info);
    void simd_whoami_AVX512F(const char *extra_info);
    void simd_whoami_SSE41(const char *extra_info);
    
    void trigger_me(void)
    {
        // bring in the auto-generated config header,
        // which contains the config macros 'NPY__CPU_DISPATCH_CALL' and
        // 'NPY__CPU_DISPATCH_BASELINE_CALL'.
        // it is highly recommended to include the config header before executing
        // the dispatching macros, in case there's another header in the scope.
        #include "hello.dispatch.h"
        DISPATCH_CALL_ALL(simd_whoami, ("all"))
        DISPATCH_CALL_HIGH(simd_whoami, ("the highest interest"))
        // An example of including multiple config headers in the same source
        // #include "hello2.dispatch.h"
        // DISPATCH_CALL_HIGH(another_function, ("the highest interest"))
    }
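The "highest interest" pattern above can be tried out without a NumPy build tree. The following single-file sketch imitates it under loud assumptions: every name prefixed with DEMO_ or demo_ is hypothetical, the two dispatch macros are hand-written stand-ins for what a generated config header would provide, and the runtime feature checks are plain variables instead of real CPU detection.

```c
#include <stddef.h>

/* Hypothetical runtime feature flags; a real build would query the CPU. */
static int demo_have_AVX512F = 0;
static int demo_have_SSE41 = 0;
#define DEMO_CPU_HAVE(FEATURE) (demo_have_##FEATURE)

/* Hand-written stand-ins for the macros of a generated config header. */
#define DEMO_DISPATCH_CALL(CHK, CB, ...) \
    CB((CHK(AVX512F)), AVX512F, __VA_ARGS__) \
    CB((CHK(SSE41)), SSE41, __VA_ARGS__)
#define DEMO_DISPATCH_BASELINE_CALL(CB, ...) CB(__VA_ARGS__)

/* One symbol per compiled object; suffixes match the target names. */
static const char *whoami(void)         { return "baseline"; }
static const char *whoami_AVX512F(void) { return "AVX512F"; }
static const char *whoami_SSE41(void)   { return "SSE41"; }

/* Callbacks: build an else-if chain ordered from highest interest down. */
#define DEMO_PICK_CB(CHECK, TARGET, FN) \
    else if (CHECK) { result = FN##_##TARGET(); }
#define DEMO_PICK_BASELINE_CB(FN) \
    else { result = FN(); }

/* Returns the name of the best variant the (simulated) CPU supports. */
static const char *pick_best(void)
{
    const char *result = NULL;
    if (0) {}
    DEMO_DISPATCH_CALL(DEMO_CPU_HAVE, DEMO_PICK_CB, whoami)
    DEMO_DISPATCH_BASELINE_CALL(DEMO_PICK_BASELINE_CB, whoami)
    return result;
}
```

With both flags cleared, pick_best() falls through to the baseline symbol; setting demo_have_SSE41 or demo_have_AVX512F moves the selection up the chain, which mirrors how the callback order in the config header decides which compiled object wins.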