Skip to content

Latest commit

 

History

History
799 lines (774 loc) · 35.7 KB

4 llvm pass overview.md

File metadata and controls

799 lines (774 loc) · 35.7 KB

引言

bitcode文件生成后,经过优化进入后端,这是bitcode文件到ptx文件的生成过程 clang的生成过程使用如下的命令:

"/home/yhz/llvm-project/build-debug/bin/clang-18" -cc1 -triple nvptx64-nvidia-cuda -aux-triple x86_64-unknown-linux-gnu -S -dumpdir axpy- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name axpy.cu -mrelocation-model static -mframe-pointer=all -fno-rounding-math -no-integrated-as -aux-target-cpu x86-64 -fcuda-is-device -mllvm -enable-memcpyopt-without-libcalls -fcuda-allow-variadic-functions -mlink-builtin-bitcode /usr/local/cuda-12.3/nvvm/libdevice/libdevice.10.bc -target-sdk-version=12.3 -target-cpu sm_60 -target-feature +ptx83 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/yhz/cuda -v -resource-dir /home/yhz/llvm-project/build-debug/lib/clang/18 -fno-autolink -ferror-limit 19 -pthread -fgnuc-version=4.2.1 -fcolor-diagnostics -cuid=805820bbe493b9aa -o axpy-cuda-nvptx64-nvidia-cuda-sm_60.s -x ir axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc

这个命令很复杂,生成的nvptx代码如下:

//
// Generated by LLVM NVPTX Back-End
//

.version 8.3
.target sm_60
.address_size 64

        // .globl       _Z4axpyfPfS_            // -- Begin function _Z4axpyfPfS_
                                        // @_Z4axpyfPfS_
.visible .entry _Z4axpyfPfS_(
        .param .f32 _Z4axpyfPfS__param_0,
        .param .u64 _Z4axpyfPfS__param_1,
        .param .u64 _Z4axpyfPfS__param_2
)
{
        .reg .b32       %r<2>;
        .reg .f32       %f<4>;
        .reg .b64       %rd<8>;

// %bb.0:                               // %entry
        ld.param.f32    %f1, [_Z4axpyfPfS__param_0];
        ld.param.u64    %rd1, [_Z4axpyfPfS__param_2];
        cvta.to.global.u64      %rd2, %rd1;
        ld.param.u64    %rd3, [_Z4axpyfPfS__param_1];
        cvta.to.global.u64      %rd4, %rd3;
这里直接引用了%tid.x预定义寄存器的值,访问线程块的x维索引
        mov.u32         %r1, %tid.x;
        mul.wide.u32    %rd5, %r1, 4;
        add.s64         %rd6, %rd4, %rd5;
        ld.global.f32   %f2, [%rd6];
        mul.rn.f32      %f3, %f2, %f1;
        add.s64         %rd7, %rd2, %rd5;
        st.global.f32   [%rd7], %f3;
        ret;
                                        // -- End function
}

也可以使用如下命令生成nvptx

llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s

但是生成的结果不同,使用-print-after-all打印可以看出,两个过程的优化pass并不相同,特别的clang调用会调用AlwaysInlinerPass,把对函数的调用进行内联,因此clang生成的结果更简单。 llc生成的结果保留了对函数的调用,为了分析的完整性,我们采用llc的过程。 调整一下输入cuda源文件,增加device函数

__device__ float multiply(float a, float b) {
  return a * b;
}
__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = multiply(a, x[threadIdx.x]);
}

后面可能用到的命令

llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -debug-only=isel -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -print-after-all -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -view-dag-combine1-dags -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s

nvptx后端分析 - 完整的遍表结构

检查后端的PASS

llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -debug-pass=Structure -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
Pass Arguments:  -targetlibinfo -targetpassconfig -machinemoduleinfo -tti -nvptx-aa -external-aa -assumption-cache-tracker -tbaa -scoped-noalias-aa -profile-summary-info -collector-metadata -machine-branch-prob -pre-isel-intrinsic-lowering -expand-large-div-rem -expand-large-fp-convert -nvvm-reflect -nvptx-assign-valid-global-names -generic-to-nvvm -nvptx-lower-args -domtree -sroa -nvptx-lower-alloca -infer-address-spaces -nvptx-atomic-lower -domtree -loops -separate-const-offset-from-gep -speculative-execution -scalar-evolution -slsr -early-cse -scalar-evolution -nary-reassociate -early-cse -atomic-expand -nvptx-lower-ctor-dtor -verify -domtree -basic-aa -loops -loop-simplify -scalar-evolution -canon-freeze -iv-users -loop-reduce -basic-aa -aa -mergeicmps -loops -lazy-branch-prob -lazy-block-freq -expand-memcmp -gc-lowering -shadow-stack-gc-lowering -lower-constant-intrinsics -unreachableblockelim -loops -postdomtree -branch-prob -block-freq -consthoist -replace-with-veclib -partially-inline-libcalls -expandvp -scalarize-masked-mem-intrin -expand-reductions -loops -tlshoist -early-cse -basic-aa -aa -scalar-evolution -load-store-vectorizer -sroa -nvptx-lower-unreachable -domtree -loops -codegenprepare -lowerinvoke -unreachableblockelim -callbrprepare -safe-stack -stack-protector -verify -nvptx-lower-aggr-copies -alloca-hoisting -domtree -basic-aa -aa -loops -postdomtree -branch-prob -debug-ata -lazy-branch-prob -lazy-block-freq -nvptx-isel -finalize-isel -lazy-machine-block-freq -early-tailduplication -opt-phis -slotindexes -stack-coloring -localstackalloc -dead-mi-elimination -machinedomtree -machine-loops -machine-block-freq -early-machinelicm -machinedomtree -machine-block-freq -machine-cse -machinepostdomtree -machine-cycles -machine-sink -peephole-opt -nvptx-proxyreg-erasure -processimpdefs -unreachable-mbb-elimination -livevars -machinedomtree -machine-loops -phi-node-elimination -twoaddressinstruction -slotindexes -liveintervals -register-coalescer -machine-scheduler -livestacks -machine-block-freq -stack-slot-coloring -nvptx-peephole -removeredundantdebugvalues -fixup-statepoint-caller-saved -machinedomtree -machine-loops -machine-block-freq -branch-folder -postrapseudos -gc-analysis -machinedomtree -machine-loops -machine-block-freq -machinepostdomtree -block-placement -fentry-insert -xray-instrumentation -machine-sanmd -lazy-machine-block-freq -machine-opt-remark-emitter -stack-frame-layout -machinedomtree -machine-loops
这是第一个PASS,提供当前目标可用的库信息
Target Library Information
这是第二个PASS,完成对Target相关PASS配置的初始化
Target Pass Configuration
Machine Module Information
Target Transform Information
NVPTX Address space based Alias Analysis
由createExternalAAWrapperPass生成,这里添加了一个NVPTX的定制NVPTXAAWrapperPass
External Alias Analysis
Assumption Cache Tracker
Type-Based Alias Analysis
Scoped NoAlias Alias Analysis
Profile summary info
Create Garbage Collector Module Metadata
Machine Branch Probability Analysis
这之前的PASS都是ImmutablePass类型PASS,优先放在了前面。

  ModulePass Manager
这是对应到createPreISelIntrinsicLoweringPass生成PASS
    Pre-ISel Intrinsic Lowering
    FunctionPass Manager
createExpandLargeDivRemPass生成PASS
      Expand large div/rem
createExpandLargeFpConvertPass生成PASS
      Expand large fp convert
addIRPasses -> 
      Replace occurrences of __nvvm_reflect() calls with 0/1
      NVPTX Image Optimizer
    Assign valid PTX names to globals
    Ensure that the global variables are in the global address space
    FunctionPass Manager
      Lower pointer arguments of CUDA kernels
      Dominator Tree Construction
      SROA
      convert address space of alloca'ed memory to local
      Infer address spaces
      NVPTX lower atomics of local memory
      Dominator Tree Construction
      Natural Loop Information
      Split GEPs to a variadic base and a constant offset for better CSE
      Speculatively execute instructions
      Scalar Evolution Analysis
      Straight line strength reduction
      Early CSE
      Scalar Evolution Analysis
      Nary reassociation
      Early CSE
      Expand Atomic instructions
    Lower ctors and dtors for NVPTX
    FunctionPass Manager
      Module Verifier
      Dominator Tree Construction
      Basic Alias Analysis (stateless AA impl)
      Natural Loop Information
      Canonicalize natural loops
      Scalar Evolution Analysis
      Loop Pass Manager
        Canonicalize Freeze Instructions in Loops
        Induction Variable Users
        Loop Strength Reduction
      Basic Alias Analysis (stateless AA impl)
      Function Alias Analysis Results
      Merge contiguous icmps into a memcmp
      Natural Loop Information
      Lazy Branch Probability Analysis
      Lazy Block Frequency Analysis
      Expand memcmp() to load/stores
      Lower Garbage Collection Instructions
      Shadow Stack GC Lowering
      Lower constant intrinsics
      Remove unreachable blocks from the CFG
      Natural Loop Information
      Post-Dominator Tree Construction
      Branch Probability Analysis
      Block Frequency Analysis
      Constant Hoisting
      Replace intrinsics with calls to vector library
      Partially inline calls to library functions
      Expand vector predication intrinsics
      Scalarize Masked Memory Intrinsics
      Expand reduction intrinsics
      Natural Loop Information
      TLS Variable Hoist
      Early CSE
      Basic Alias Analysis (stateless AA impl)
      Function Alias Analysis Results
      Scalar Evolution Analysis
      GPU Load and Store Vectorizer
      SROA
      add an exit instruction before every unreachable
      Dominator Tree Construction
      Natural Loop Information
      CodeGen Prepare
      Lower invoke and unwind, for unwindless code generators
      Remove unreachable blocks from the CFG
      Prepare callbr
      Safe Stack instrumentation pass
      Insert stack protectors
      Module Verifier
      Lower aggregate copies/intrinsics into loops
      NVPTX specific alloca hoisting
      Dominator Tree Construction
      Basic Alias Analysis (stateless AA impl)
      Function Alias Analysis Results
      Natural Loop Information
      Post-Dominator Tree Construction
      Branch Probability Analysis
      Assignment Tracking Analysis
      Lazy Branch Probability Analysis
      Lazy Block Frequency Analysis
      NVPTX DAG->DAG Pattern Instruction Selection
      Finalize ISel and expand pseudo-instructions
      Lazy Machine Block Frequency Analysis
      Early Tail Duplication
      Optimize machine instruction PHIs
      Slot index numbering
      Merge disjoint stack slots
      Local Stack Slot Allocation
      Remove dead machine instructions
      MachineDominator Tree Construction
      Machine Natural Loop Construction
      Machine Block Frequency Analysis
      Early Machine Loop Invariant Code Motion
      MachineDominator Tree Construction
      Machine Block Frequency Analysis
      Machine Common Subexpression Elimination
      MachinePostDominator Tree Construction
      Machine Cycle Info Analysis
      Machine code sinking
      Peephole Optimizations
      NVPTX Proxy Register Instruction Erasure
      Process Implicit Definitions
      Remove unreachable machine basic blocks
      Live Variable Analysis
      MachineDominator Tree Construction
      Machine Natural Loop Construction
      Eliminate PHI nodes for register allocation
      Two-Address instruction pass
      Slot index numbering
      Live Interval Analysis
      Register Coalescer
      Machine Instruction Scheduler
      Live Stack Slot Analysis
      Machine Block Frequency Analysis
      Stack Slot Coloring
      NVPTX Prolog Epilog Pass
      NVPTX optimize redundant cvta.to.local instruction
      Remove Redundant DEBUG_VALUE analysis
      Fixup Statepoint Caller Saved
      MachineDominator Tree Construction
      Machine Natural Loop Construction
      Machine Block Frequency Analysis
      Control Flow Optimizer
      Post-RA pseudo instruction expansion pass
      Analyze Machine Code For Garbage Collection
      MachineDominator Tree Construction
      Machine Natural Loop Construction
      Machine Block Frequency Analysis
      MachinePostDominator Tree Construction
      Branch Probability Basic Block Placement
      Insert fentry calls
      Insert XRay ops
      Machine Sanitizer Binary Metadata
      Lazy Machine Block Frequency Analysis
      Machine Optimization Remark Emitter
      Stack Frame Layout Analysis
      MachineDominator Tree Construction
      Machine Natural Loop Construction
      NVPTX Assembly Printer
      Free MachineFunction

后端考虑根据遍情况,从前向后分析编译过程,尤其是有些有趣的遍可以重点分析一下。

如何增加PASS结构

第一个PASS "Target Library Information"

llvm/tools/llc/llc.cpp

定义PASS管理结构,增加所有PASS
645   // Build up all of the passes that we want to do to the module.
646   legacy::PassManager PM;
648   // Add an appropriate TargetLibraryInfo pass for the module's triple.
返回Triple信息,实现TargetLibraryInfoImpl结构
649   TargetLibraryInfoImpl TLII(Triple(M->getTargetTriple()));
增加一个PASS,提供当前目标可用的库信息
654   PM.add(new TargetLibraryInfoWrapperPass(TLII));

这里getTargetTriple返回TargetTriple,这是通过M->setTargetTriple设置的。对于nvptx后端为"nvptx64-nvidia-cuda" Triple构建Triple结构,用于初始化TargetLibraryInfoImpl

 891 TargetLibraryInfoImpl::TargetLibraryInfoImpl(const Triple &T) {
 892   // Default to everything being available.
 893   memset(AvailableArray, -1, sizeof(AvailableArray));
 894
 895   initialize(*this, T, StandardNames);
 896 }

这里调用了initialize定义为llvm/lib/Analysis/TargetLibraryInfo.cpp

 161 /// Initialize the set of available library functions based on the specified
 162 /// target triple. This should be carefully written so that a missing target
 163 /// triple gets a sane set of defaults.
 164 static void initialize(TargetLibraryInfoImpl &TLI, const Triple &T,
 165                        ArrayRef<StringLiteral> StandardNames) {
根据Triple初始化可用的库函数
 842   // As currently implemented in clang, NVPTX code has no standard library to
 843   // speak of.  Headers provide a standard-ish library implementation, but many
 844   // of the signatures are wrong -- for example, many libm functions are not
 845   // extern "C".
 846   //
 847   // libdevice, an IR library provided by nvidia, is linked in by the front-end,
 848   // but only used functions are provided to llvm.  Moreover, most of the
 849   // functions in libdevice don't map precisely to standard library functions.
 850   //
 851   // FIXME: Having no standard library prevents e.g. many fastmath
 852   // optimizations, so this situation should be fixed.
这里根据NVPTX后端设置了一些可用的库函数
 853   if (T.isNVPTX()) {
 854     TLI.disableAllFunctions();
 855     TLI.setAvailable(LibFunc_nvvm_reflect);
 856     TLI.setAvailable(llvm::LibFunc_malloc);
 857     TLI.setAvailable(llvm::LibFunc_free);
 858
 859     // TODO: We could enable the following two according to [0] but we haven't
 860     //       done an evaluation wrt. the performance implications.
 861     // [0]
 862     // https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
 863     //
 864     //    TLI.setAvailable(llvm::LibFunc_memcpy);
 865     //    TLI.setAvailable(llvm::LibFunc_memset);
 866
 867     TLI.setAvailable(llvm::LibFunc___kmpc_alloc_shared);
 868     TLI.setAvailable(llvm::LibFunc___kmpc_free_shared);
 869   } else {
 870     TLI.setUnavailable(LibFunc_nvvm_reflect);
 871   }

这里看在TargetLibraryInfoImpl定义了可用的库函数信息。

TargetLibraryInfoWrapperPass也是一个PASS类型,是第一个PASS,这里调用了

633   explicit TargetLibraryInfoWrapperPass(const TargetLibraryInfoImpl &TLI);

1363 TargetLibraryInfoWrapperPass::TargetLibraryInfoWrapperPass(
1364     const TargetLibraryInfoImpl &TLIImpl)
1365     : ImmutablePass(ID), TLA(TLIImpl) {
1366   initializeTargetLibraryInfoWrapperPassPass(*PassRegistry::getPassRegistry());
1367 }

initializeTargetLibraryInfoWrapperPassPasss是类定义时定义的函数,通过如下方式

1371 // Register the basic pass.
1372 INITIALIZE_PASS(TargetLibraryInfoWrapperPass, "targetlibinfo",
1373                 "Target Library Information", false, true)

INITIALIZE_PASS定义了函数initializeTargetLibraryInfoWrapperPassPasss

这里第三个参数是名字name信息,对应PASS的名字,因此第一个PASS的名字是"Target Library Information"
 38 #define INITIALIZE_PASS(passName, arg, name, cfg, analysis)                    \
 39   static void *initialize##passName##PassOnce(PassRegistry &Registry) {        \
新建一个PassInfo数据结构
 40     PassInfo *PI = new PassInfo(                                               \
 41         name, arg, &passName::ID,                                              \
 42         PassInfo::NormalCtor_t(callDefaultCtor<passName>), cfg, analysis);     \
并且登记到Registry中
 43     Registry.registerPass(*PI, true);                                          \
返回PassInfo数据结构指针
 44     return PI;                                                                 \
 45   }                                                                            \
 46   static llvm::once_flag Initialize##passName##PassFlag;                       \
这里定义了上面的函数,调用前面PassOnce一次。
 47   void llvm::initialize##passName##Pass(PassRegistry &Registry) {              \
 48     llvm::call_once(Initialize##passName##PassFlag,                            \
 49                     initialize##passName##PassOnce, std::ref(Registry));       \
 50   }

第二个PASS "Target Pass Configuration"

#0  addPassesToGenerateCode (TM=..., PM=..., DisableVerify=85, MMIWP=...) at /home/yhz/llvm-project/llvm/lib/CodeGen/LLVMTargetMachine.cpp:118
#1  0x000055555779de5e in llvm::LLVMTargetMachine::addPassesToEmitFile (this=0x7ffff7a18010, PM=..., Out=..., DwoOut=0x0,
    FileType=llvm::CodeGenFileType::AssemblyFile, DisableVerify=false, MMIWP=0x55555be71020)
    at /home/yhz/llvm-project/llvm/lib/CodeGen/LLVMTargetMachine.cpp:241
#2  0x0000555556998333 in compileModule (argv=0x7fffffffdf68, Context=...) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:720
#3  0x000055555699601a in main (argc=8, argv=0x7fffffffdf68) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:425

这里检查后端增加了那些PASS lib/CodeGen/LLVMTargetMachine.cpp

114 /// addPassesToX helper drives creation and initialization of TargetPassConfig.
115 static TargetPassConfig *
116 addPassesToGenerateCode(LLVMTargetMachine &TM, PassManagerBase &PM,
117                         bool DisableVerify,
118                         MachineModuleInfoWrapperPass &MMIWP) {
119   // Targets may override createPassConfig to provide a target-specific
120   // subclass.
这里实际调用的NVPTXTargetMachine子类的函数createPassConfig,定义在llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
实际返回的TargetPassConfig类型是子类NVPTXPassConfig,但是类型名字不变"Target Pass Configuration"
121   TargetPassConfig *PassConfig = TM.createPassConfig(PM);

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

213 TargetPassConfig *NVPTXTargetMachine::createPassConfig(PassManagerBase &PM) {
214   return new NVPTXPassConfig(*this, PM);
215 }

NVPTXPassConfig调用TargetPassConfig的constructor函数

 571 TargetPassConfig::TargetPassConfig(LLVMTargetMachine &TM, PassManagerBase &pm)
 572     : ImmutablePass(ID), PM(&pm), TM(&TM) {
 575   // Register all target independent codegen passes to activate their PassIDs,
 576   // including this pass itself.
 577   initializeCodeGen(*PassRegistry::getPassRegistry());

 19 /// initializeCodeGen - Initialize all passes linked into the CodeGen library.
 20 void llvm::initializeCodeGen(PassRegistry &Registry) {
133   initializeTargetPassConfigPass(Registry);

 353 INITIALIZE_PASS(TargetPassConfig, "targetpassconfig",
 354                 "Target Pass Configuration", false, false)

返回addPassesToGenerateCode

TargetPassConfig本身也是一个PASS
124   PM.add(PassConfig);

因此第二个PASS为"Target Pass Configuration",功能是完成对Target相关PASS配置的初始化

125   PM.add(&MMIWP);

这里的PASS名称为"Machine Module Information"

183 // Handle the Pass registration stuff necessary to use DataLayout's.
184 INITIALIZE_PASS(MachineModuleInfoWrapperPass, "machinemoduleinfo",
185                 "Machine Module Information", false, false)

238   if (!MMIWP)
239     MMIWP = new MachineModuleInfoWrapperPass(this);

171 MachineModuleInfoWrapperPass::MachineModuleInfoWrapperPass(
172     const LLVMTargetMachine *TM)
173     : ImmutablePass(ID), MMI(TM) {
174   initializeMachineModuleInfoWrapperPassPass(*PassRegistry::getPassRegistry());
175 }

完成Pass的初始化。

addISelPasses

238   /// High level function that adds all passes necessary to go from llvm IR
239   /// representation to the MI representation.
240   /// Adds IR based lowering and target specific optimization passes and finally
241   /// the core instruction selection passes.
242   /// \returns true if an error occurred, false otherwise.
243   bool addISelPasses();

TargetPassConfig.cpp
1054 bool TargetPassConfig::addISelPasses() {
没有使用这个PASS
1055   if (TM->useEmulatedTLS())
1056     addPass(createLowerEmuTLSPass());
1057
PASS "Target Transform Information",ImmutablePass,直接增加到PASS 管理器
1058   PM->add(createTargetTransformInfoWrapperPass(TM->getTargetIRAnalysis()));
PreISelIntrinsicLoweringLegacyPasss是ModulePass类型,看遍表的打印结果,ModulePass在ImmutablePass后面,不清楚是调用确实存在先后,还是遍表打印过程分配打印,猜测应该是前者?MPPassManager是一个Pass的子类,本身应该和ImmutablePass优先级相同?
1059   addPass(createPreISelIntrinsicLoweringPass());
这是一个FunctionPass类型Pass,运行在后面。FPPassManager是函数Pass管理器,本身是一个ModulePass。
1060   addPass(createExpandLargeDivRemPass());
这是一个FunctionPass类型Pass
1061   addPass(createExpandLargeFpConvertPass());
这个函数NVPTX后端有自己的定义
1062   addIRPasses();
...
1068 }

NVPTXTargetMachine.cpp

339 void NVPTXPassConfig::addIRPasses() {
这是ImmutablePass类型PASS,"NVPTX Address space based Alias Analysis",对应打印遍表的下面的遍。
357   addPass(createNVPTXAAWrapperPass());
358   addPass(createExternalAAWrapperPass([](Pass &P, Function &, AAResults &AAR) {
359     if (auto *WrapperPass = P.getAnalysisIfAvailable<NVPTXAAWrapperPass>())
360       AAR.addAAResult(WrapperPass->getResult());
361   }));

362
363   // NVVMReflectPass is added in addEarlyAsPossiblePasses, so hopefully running
364   // it here does nothing.  But since we need it for correctness when lowering
365   // to NVPTX, run it here too, in case whoever built our pass pipeline didn't
366   // call addEarlyAsPossiblePasses.
367   const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
增加一个FunctionPass
368   addPass(createNVVMReflectPass(ST.getSmVersion()));

370   if (getOptLevel() != CodeGenOptLevel::None)
371     addPass(createNVPTXImageOptimizerPass());
另外的FunctionPASS,前面的FunctionPASS都合并到一个FunctionManager中管理
这是一个ModuelPASS
372   addPass(createNVPTXAssignValidGlobalNamesPass());
这是一个ModuelPASS,对应"Ensure that the global variables are in the global address space"
373   addPass(createGenericToNVVMLegacyPass());
FunctionPASS,对应"Lower pointer arguments of CUDA kernels"
377   addPass(createNVPTXLowerArgsPass());
因为打开了优化开关,这里对应一系列的优化PASS
378   if (getOptLevel() != CodeGenOptLevel::None) {
379     addAddressSpaceInferencePasses();
380     addStraightLineScalarOptimizationPasses();
381   }
对应"Expand Atomic instructions"
383   addPass(createAtomicExpandPass());
对应 "Lower ctors and dtors for NVPTX" PASS
384   addPass(createNVPTXCtorDtorLoweringLegacyPass());

387   TargetPassConfig::addIRPasses();
调用基类的PASS。
408   addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
409                                           Options.NoTrapAfterNoreturn));
最后为"add an exit instruction before every unreachable"

返回TargetPassConfig.cpp

1067   return addCoreISelPasses();
 967 bool TargetPassConfig::addCoreISelPasses() {
1041   } else if (addInstSelector())
1046   addPass(&FinalizeISelID);

NVPTX后端指令选择PASS

NVPTX后端定制了函数addInstSelector

412 bool NVPTXPassConfig::addInstSelector() {
413   const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
414
PASS "Lower aggregate copies/intrinsics into loops"
415   addPass(createLowerAggrCopies());
PASS "NVPTX specific alloca hoisting"
416   addPass(createAllocaHoisting());
PASS "NVPTX DAG->DAG Pattern Instruction Selection"
417   addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));
418
419   if (!ST.hasImageHandles())
420     addPass(createNVPTXReplaceImageHandlesPass());
421
422   return false;
423 }

CodeGen/LLVMTargetMachine.cpp

129   PassConfig->addMachinePasses();

llvm/lib/CodeGen/TargetPassConfig.cpp

1077 /// Add the complete set of target-independent postISel code generator passes.
1079 /// This can be read as the standard order of major LLVM CodeGen stages. Stages
1080 /// with nontrivial configuration or multiple passes are broken out below in
1081 /// add%Stage routines.
1095 void TargetPassConfig::addMachinePasses() {
1099   if (getOptLevel() != CodeGenOptLevel::None) {
1100     addMachineSSAOptimization();
有优化标志,增加优化PASS
470 void NVPTXPassConfig::addMachineSSAOptimization() {
"Early Tail Duplication"
"Optimize machine instruction PHIs"
"Merge disjoint stack slots"
"Local Stack Slot Allocation"
"Remove dead machine instructions"
1111   addPreRegAlloc();
"NVPTX Proxy Register Instruction Erasure"
1260   PM->add(createStackFrameLayoutAnalysisPass());
"Stack Frame Layout"

addPassesToGenerateCode结束

245   if (TargetPassConfig::willCompleteCodeGenPipeline()) {
246     if (addAsmPrinter(PM, Out, DwoOut, FileType, MMIWP->getMMI().getContext()))
247       return true;
"NVPTX Assembly Printer"

"Free MachineFunction"

PASS Name的来源

查看PASS NAME的来源,例如PASS "NVPTX Assembly Printer"

#0  llvm::NVPTXAsmPrinter::getPassName (this=0x0) at /home/yhz/llvm-project/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.h:154
#1  0x0000555558137b73 in llvm::Pass::dumpPassStructure (this=0x55555bea7f90, Offset=3) at /home/yhz/llvm-project/llvm/lib/IR/Pass.cpp:75
#2  0x00005555580bb80e in llvm::FPPassManager::dumpPassStructure (this=0x55555be88790, Offset=2)
    at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1393
#3  0x00005555580b6e4d in (anonymous namespace)::MPPassManager::dumpPassStructure (this=0x55555be73520, Offset=1)
    at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:435
#4  0x00005555580b8c06 in llvm::PMTopLevelManager::dumpPasses (this=0x55555be73128) at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:837
#5  0x00005555580b704c in llvm::legacy::PassManagerImpl::run (this=0x55555be72f80, M=...)
    at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:529
#6  0x00005555580bcb57 in llvm::legacy::PassManager::run (this=0x7fffffffd7b0, M=...) at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1685
#7  0x000055555699855a in compileModule (argv=0x7fffffffdf68, Context=...) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:749
#8  0x000055555699601a in main (argc=8, argv=0x7fffffffdf68) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:425

/home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp

 525 bool PassManagerImpl::run(Module &M) {
 528   dumpArguments();
 529   dumpPasses();
 840 void PMTopLevelManager::dumpArguments() const {
 841
 842   if (PassDebugging < Arguments)
 843     return;
 844
这里打印Pass Arguments
 845   dbgs() << "Pass Arguments: ";
首先打印所有的ImmutablePasses
 846   for (ImmutablePass *P : ImmutablePasses)
 847     if (const PassInfo *PI = findAnalysisPassInfo(P->getPassID())) {
 848       assert(PI && "Expected all immutable passes to be initialized");
 849       if (!PI->isAnalysisGroup())
 850         dbgs() << " -" << PI->getPassArgument();
 851     }
其他PASS
 852   for (PMDataManager *PM : PassManagers)
 853     PM->dumpPassArguments();
 854   dbgs() << "\n";
 855 }
 822 void PMTopLevelManager::dumpPasses() const {
 823
根据PassDebugging,小于Struture不打印
 824   if (PassDebugging < Structure)
 825     return;
同样先打印ImmutablePasses
 827   // Print out the immutable passes
 828   for (unsigned i = 0, e = ImmutablePasses.size(); i != e; ++i) {
 829     ImmutablePasses[i]->dumpPassStructure(0);
 830   }
再打印其他PASS的信息
 836   for (PMDataManager *Manager : PassManagers)
 837     Manager->getAsPass()->dumpPassStructure(1);
调用对应PASS的getPassName函数,打印PASS NAME
 74 void Pass::dumpPassStructure(unsigned Offset) {
 75   dbgs().indent(Offset*2) << getPassName() << "\n";
 76 }
llvm::NVPTXAsmPrinter::getPassName()

再检查"NVPTX DAG->DAG Pattern Instruction Selection"遍

#define PASS_NAME "NVPTX DAG->DAG Pattern Instruction Selection"
INITIALIZE_PASS(NVPTXDAGToDAGISel, DEBUG_TYPE, PASS_NAME, false, false)

这个PASS的名字来自INITIALIZE_PASS定义的第三个参数。 这是调用的Pass类的

StringRef Pass::getPassName()
 81 StringRef Pass::getPassName() const {
这里首先找到PassID
 82   AnalysisID AID =  getPassID();
然后找到对应的PassInfo
 83   const PassInfo *PI = PassRegistry::getPassRegistry()->getPassInfo(AID);
调用对应的PassInfo的getPassName,这是通过INITIALIZE_PASS注册的Pass Name
 84   if (PI)
 85     return PI->getPassName();
 86   return "Unnamed pass: implement Pass::getPassName()";
 87 }

这里的NVPTXDAGToDAGISel定义的Pass一路继承自Pass,因此调用了Pass::getPassName()函数。 而前面的NVPTXAsmPrinter PASS则自己重载了getPassName,因此打印的是自己定义的Pass Name。 因此自定义了getPassName函数的情况下打印名字优先,否则打印的名字是INITIALIZE_PASS这些宏定义的PassName

后端逐PASS分析

Pre-ISel Intrinsic Lowering

PreISelIntrinsicLowering.cpp

381   bool runOnModule(Module &M) override {
用来返回TargetTransformInfo
382     auto LookupTTI = [this](Function &F) -> TargetTransformInfo & {
383       return this->getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
384     };
385
返回TargetMachine
386     const auto &TM = getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
生成PreISelIntrinsicLowering
387     PreISelIntrinsicLowering Lowering(TM, LookupTTI);
调用lowerIntrinsics
388     return Lowering.lowerIntrinsics(M);
274 bool PreISelIntrinsicLowering::lowerIntrinsics(Module &M) const {
276   for (Function &F : M) {
编译当前Module的所有函数
277     switch (F.getIntrinsicID()) {
返回当前函数的Intrinsic编号,或者返回不是intrinsic
282     case Intrinsic::memset:
283       Changed |= expandMemIntrinsicUses(F);
如果函数是memset

208 bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
209   Intrinsic::ID ID = F.getIntrinsicID();
210   bool Changed = false;
211
找到所有对Intrinsic的调用
212   for (User *U : llvm::make_early_inc_range(F.users())) {
213     Instruction *Inst = cast<Instruction>(U);
214
215     switch (ID) {

250     case Intrinsic::memset: {
251       auto *Memset = cast<MemSetInst>(Inst);
252       Function *ParentFunc = Memset->getFunction();
查找memset函数的目标变换信息
253       const TargetTransformInfo &TTI = LookupTTI(*ParentFunc);
254       if (shouldExpandMemIntrinsicWithSize(Memset->getLength(), TTI)) {
如果仍然使用intrinsic函数,并且支持MEMSET,则退出
255         if (UseMemIntrinsicLibFunc &&
256             canEmitLibcall(TM, ParentFunc, RTLIB::MEMSET))
257           break;
258
这里把memset修改为循环处理
259         expandMemSetAsLoop(Memset);
260         Changed = true;
把原来的函数删除掉
261         Memset->eraseFromParent();
262       }
263
264       break;
265     }

Expand large div/rem

CodeGen/ExpandLargeDivRem.cpp

ExpandLargeDivRemLegacyPass::runOnFunction
118   bool runOnFunction(Function &F) override {
119     auto *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
120     auto *TLI = TM->getSubtargetImpl(F)->getTargetLowering();
121     return runImpl(F, *TLI);

 57 static bool runImpl(Function &F, const TargetLowering &TLI) {

依次使用Function PASS处理每个函数,直到处理完毕所有的函数,当前Function PASS处理完毕,然后进入接下来的Module Pass

Replace occurrences of __nvvm_reflect() calls with 0/1

参考https://llvm.org/docs/NVPTXUsage.html#ptxas-complains-of-undefined-function-nvvm-reflect This pass looks for calls to the @__nvvm_reflect function and replaces them with constants based on the defined reflection parameters. 实现Target/NVPTX/NVVMReflect.cpp

Assign valid PTX names to globals

实现NVPTXAssignValidGlobalNames.cpp 参考

// Clean up the names of global variables in the module to not contain symbols
// that are invalid in PTX.
//
// Currently NVPTX, like other backends, relies on generic symbol name
// sanitizing done by MC. However, the ptxas assembler is more stringent and
// disallows some additional characters in symbol names. This pass makes sure
// such names do not reach MC at all.

替换掉后端不允许的一些变量名命名。

Ensure that the global variables are in the global address space

实现NVPTXGenericToNVVM.cpp

289   bool runOnModule(Module &M) override;
304 bool GenericToNVVMLegacyPass::runOnModule(Module &M) {
305   return GenericToNVVM().runOnModule(M);
306 }

 56 bool GenericToNVVM::runOnModule(Module &M) {

把ADDRESS_SPACE_GENERIC全局变量替换为ADDRESS_SPACE_GLOBAL

Lower pointer arguments of CUDA kernels

481 bool NVPTXLowerArgs::runOnFunction(Function &F) {
482   auto &TM = getAnalysis<TargetPassConfig>().getTM<NVPTXTargetMachine>();
483
分别处理kernel函数和device函数
484   return isKernelFunction(F) ? runOnKernelFunction(TM, F)
485                              : runOnDeviceFunction(TM, F);
486 }

kernel函数

421 bool NVPTXLowerArgs::runOnKernelFunction(const NVPTXTargetMachine &TM,
422                                          Function &F) {

427   auto HandleIntToPtr = [this](Value &V) {
如果V的所有使用都是整数到指针的转换操作
428     if (llvm::all_of(V.users(), [](User *U) { return isa<IntToPtrInst>(U); })) {
429       SmallVector<User *, 16> UsersToUpdate(V.users());
将其替换为全局地址空间的指针类型
430       for (User *U : UsersToUpdate)
431         markPointerAsGlobal(U);
432     }
433   };

457   for (Argument &Arg : F.args()) {
458     if (Arg.getType()->isPointerTy()) {
459       if (Arg.hasByValAttr())
460         handleByValParam(TM, &Arg);
461       else if (TM.getDrvInterface() == NVPTX::CUDA)
462         markPointerAsGlobal(&Arg);
这里将参数转换到全局地址空间,然后再转换回原来的通用地址空间。
463     } else if (Arg.getType()->isIntegerTy() &&
464                TM.getDrvInterface() == NVPTX::CUDA) {
465       HandleIntToPtr(Arg);
466     }
467   }

Lower ctors and dtors for NVPTX

NVPTXCtorDtorLowering.cpp

/// This pass creates a unified init and fini kernel with the required metadata
267 static bool lowerCtorsAndDtors(Module &M) {
268   bool Modified = false;
269   Modified |= createInitOrFiniKernel(M, "llvm.global_ctors", /*IsCtor =*/true);
270   Modified |= createInitOrFiniKernel(M, "llvm.global_dtors", /*IsCtor =*/false);
271   return Modified;
272 }

这里的实例不涉及这个问题。

add an exit instruction before every unreachable

NVPTXLowerUnreachable.cpp LLVM IR处理,方便后面的ptxas生成指令

Lower aggregate copies/intrinsics into loops

// Lower aggregate copies, memset, memcpy, memmov intrinsics into loops when
// the size is large or is not a compile-time constant.

NVPTX DAG->DAG Pattern Instruction Selection

进入LLVM到DAG转换过程