bitcode文件生成后,经过优化进入后端,这是bitcode文件到ptx文件的生成过程 clang的生成过程使用如下的命令:
"/home/yhz/llvm-project/build-debug/bin/clang-18" -cc1 -triple nvptx64-nvidia-cuda -aux-triple x86_64-unknown-linux-gnu -S -dumpdir axpy- -save-temps=cwd -disable-free -clear-ast-before-backend -main-file-name axpy.cu -mrelocation-model static -mframe-pointer=all -fno-rounding-math -no-integrated-as -aux-target-cpu x86-64 -fcuda-is-device -mllvm -enable-memcpyopt-without-libcalls -fcuda-allow-variadic-functions -mlink-builtin-bitcode /usr/local/cuda-12.3/nvvm/libdevice/libdevice.10.bc -target-sdk-version=12.3 -target-cpu sm_60 -target-feature +ptx83 -debugger-tuning=gdb -fno-dwarf-directory-asm -fdebug-compilation-dir=/home/yhz/cuda -v -resource-dir /home/yhz/llvm-project/build-debug/lib/clang/18 -fno-autolink -ferror-limit 19 -pthread -fgnuc-version=4.2.1 -fcolor-diagnostics -cuid=805820bbe493b9aa -o axpy-cuda-nvptx64-nvidia-cuda-sm_60.s -x ir axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc
这个命令很复杂,生成的nvptx代码如下:
//
// Generated by LLVM NVPTX Back-End
//
.version 8.3
.target sm_60
.address_size 64
// .globl _Z4axpyfPfS_ // -- Begin function _Z4axpyfPfS_
// @_Z4axpyfPfS_
.visible .entry _Z4axpyfPfS_(
.param .f32 _Z4axpyfPfS__param_0,
.param .u64 _Z4axpyfPfS__param_1,
.param .u64 _Z4axpyfPfS__param_2
)
{
.reg .b32 %r<2>;
.reg .f32 %f<4>;
.reg .b64 %rd<8>;
// %bb.0: // %entry
ld.param.f32 %f1, [_Z4axpyfPfS__param_0];
ld.param.u64 %rd1, [_Z4axpyfPfS__param_2];
cvta.to.global.u64 %rd2, %rd1;
ld.param.u64 %rd3, [_Z4axpyfPfS__param_1];
cvta.to.global.u64 %rd4, %rd3;
这里直接引用了%tid.x预定义寄存器的值,访问线程块的x维索引
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f2, [%rd6];
mul.rn.f32 %f3, %f2, %f1;
add.s64 %rd7, %rd2, %rd5;
st.global.f32 [%rd7], %f3;
ret;
// -- End function
}
也可以使用如下命令生成nvptx
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
但是生成的结果不同,使用-print-after-all打印可以看出,两个过程的优化pass并不相同,特别的clang调用会调用AlwaysInlinerPass,把对函数的调用进行内联,因此clang生成的结果更简单。 llc生成的结果保留了对函数的调用,为了分析的完整性,我们采用llc的过程。 调整一下输入cuda源文件,增加device函数
__device__ float multiply(float a, float b) {
return a * b;
}
__global__ void axpy(float a, float* x, float* y) {
y[threadIdx.x] = multiply(a, x[threadIdx.x]);
}
后面可能用到的命令
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -debug-only=isel -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -print-after-all -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -view-dag-combine1-dags -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
检查后端的PASS
llc -O1 -mcpu=sm_60 -mattr=+ptx83 axpy-cuda-nvptx64-nvidia-cuda-sm_60.bc -debug-pass=Structure -o axpy-cuda-nvptx64-nvidia-cuda-sm_60-llc.s
Pass Arguments: -targetlibinfo -targetpassconfig -machinemoduleinfo -tti -nvptx-aa -external-aa -assumption-cache-tracker -tbaa -scoped-noalias-aa -profile-summary-info -collector-metadata -machine-branch-prob -pre-isel-intrinsic-lowering -expand-large-div-rem -expand-large-fp-convert -nvvm-reflect -nvptx-assign-valid-global-names -generic-to-nvvm -nvptx-lower-args -domtree -sroa -nvptx-lower-alloca -infer-address-spaces -nvptx-atomic-lower -domtree -loops -separate-const-offset-from-gep -speculative-execution -scalar-evolution -slsr -early-cse -scalar-evolution -nary-reassociate -early-cse -atomic-expand -nvptx-lower-ctor-dtor -verify -domtree -basic-aa -loops -loop-simplify -scalar-evolution -canon-freeze -iv-users -loop-reduce -basic-aa -aa -mergeicmps -loops -lazy-branch-prob -lazy-block-freq -expand-memcmp -gc-lowering -shadow-stack-gc-lowering -lower-constant-intrinsics -unreachableblockelim -loops -postdomtree -branch-prob -block-freq -consthoist -replace-with-veclib -partially-inline-libcalls -expandvp -scalarize-masked-mem-intrin -expand-reductions -loops -tlshoist -early-cse -basic-aa -aa -scalar-evolution -load-store-vectorizer -sroa -nvptx-lower-unreachable -domtree -loops -codegenprepare -lowerinvoke -unreachableblockelim -callbrprepare -safe-stack -stack-protector -verify -nvptx-lower-aggr-copies -alloca-hoisting -domtree -basic-aa -aa -loops -postdomtree -branch-prob -debug-ata -lazy-branch-prob -lazy-block-freq -nvptx-isel -finalize-isel -lazy-machine-block-freq -early-tailduplication -opt-phis -slotindexes -stack-coloring -localstackalloc -dead-mi-elimination -machinedomtree -machine-loops -machine-block-freq -early-machinelicm -machinedomtree -machine-block-freq -machine-cse -machinepostdomtree -machine-cycles -machine-sink -peephole-opt -nvptx-proxyreg-erasure -processimpdefs -unreachable-mbb-elimination -livevars -machinedomtree -machine-loops -phi-node-elimination -twoaddressinstruction -slotindexes -liveintervals -register-coalescer -machine-scheduler -livestacks -machine-block-freq -stack-slot-coloring -nvptx-peephole -removeredundantdebugvalues -fixup-statepoint-caller-saved -machinedomtree -machine-loops -machine-block-freq -branch-folder -postrapseudos -gc-analysis -machinedomtree -machine-loops -machine-block-freq -machinepostdomtree -block-placement -fentry-insert -xray-instrumentation -machine-sanmd -lazy-machine-block-freq -machine-opt-remark-emitter -stack-frame-layout -machinedomtree -machine-loops
这是第一个PASS,提供当前目标可用的库信息
Target Library Information
这是第二个PASS,完成对Target相关PASS配置的初始化
Target Pass Configuration
Machine Module Information
Target Transform Information
NVPTX Address space based Alias Analysis
由createExternalAAWrapperPass生成,这里添加了一个NVPTX的定制NVPTXAAWrapperPass
External Alias Analysis
Assumption Cache Tracker
Type-Based Alias Analysis
Scoped NoAlias Alias Analysis
Profile summary info
Create Garbage Collector Module Metadata
Machine Branch Probability Analysis
这之前的PASS都是ImmutablePass类型PASS,优先放在了前面。
ModulePass Manager
这是对应到createPreISelIntrinsicLoweringPass生成PASS
Pre-ISel Intrinsic Lowering
FunctionPass Manager
createExpandLargeDivRemPass生成PASS
Expand large div/rem
createExpandLargeFpConvertPass生成PASS
Expand large fp convert
addIRPasses ->
Replace occurrences of __nvvm_reflect() calls with 0/1
NVPTX Image Optimizer
Assign valid PTX names to globals
Ensure that the global variables are in the global address space
FunctionPass Manager
Lower pointer arguments of CUDA kernels
Dominator Tree Construction
SROA
convert address space of alloca'ed memory to local
Infer address spaces
NVPTX lower atomics of local memory
Dominator Tree Construction
Natural Loop Information
Split GEPs to a variadic base and a constant offset for better CSE
Speculatively execute instructions
Scalar Evolution Analysis
Straight line strength reduction
Early CSE
Scalar Evolution Analysis
Nary reassociation
Early CSE
Expand Atomic instructions
Lower ctors and dtors for NVPTX
FunctionPass Manager
Module Verifier
Dominator Tree Construction
Basic Alias Analysis (stateless AA impl)
Natural Loop Information
Canonicalize natural loops
Scalar Evolution Analysis
Loop Pass Manager
Canonicalize Freeze Instructions in Loops
Induction Variable Users
Loop Strength Reduction
Basic Alias Analysis (stateless AA impl)
Function Alias Analysis Results
Merge contiguous icmps into a memcmp
Natural Loop Information
Lazy Branch Probability Analysis
Lazy Block Frequency Analysis
Expand memcmp() to load/stores
Lower Garbage Collection Instructions
Shadow Stack GC Lowering
Lower constant intrinsics
Remove unreachable blocks from the CFG
Natural Loop Information
Post-Dominator Tree Construction
Branch Probability Analysis
Block Frequency Analysis
Constant Hoisting
Replace intrinsics with calls to vector library
Partially inline calls to library functions
Expand vector predication intrinsics
Scalarize Masked Memory Intrinsics
Expand reduction intrinsics
Natural Loop Information
TLS Variable Hoist
Early CSE
Basic Alias Analysis (stateless AA impl)
Function Alias Analysis Results
Scalar Evolution Analysis
GPU Load and Store Vectorizer
SROA
add an exit instruction before every unreachable
Dominator Tree Construction
Natural Loop Information
CodeGen Prepare
Lower invoke and unwind, for unwindless code generators
Remove unreachable blocks from the CFG
Prepare callbr
Safe Stack instrumentation pass
Insert stack protectors
Module Verifier
Lower aggregate copies/intrinsics into loops
NVPTX specific alloca hoisting
Dominator Tree Construction
Basic Alias Analysis (stateless AA impl)
Function Alias Analysis Results
Natural Loop Information
Post-Dominator Tree Construction
Branch Probability Analysis
Assignment Tracking Analysis
Lazy Branch Probability Analysis
Lazy Block Frequency Analysis
NVPTX DAG->DAG Pattern Instruction Selection
Finalize ISel and expand pseudo-instructions
Lazy Machine Block Frequency Analysis
Early Tail Duplication
Optimize machine instruction PHIs
Slot index numbering
Merge disjoint stack slots
Local Stack Slot Allocation
Remove dead machine instructions
MachineDominator Tree Construction
Machine Natural Loop Construction
Machine Block Frequency Analysis
Early Machine Loop Invariant Code Motion
MachineDominator Tree Construction
Machine Block Frequency Analysis
Machine Common Subexpression Elimination
MachinePostDominator Tree Construction
Machine Cycle Info Analysis
Machine code sinking
Peephole Optimizations
NVPTX Proxy Register Instruction Erasure
Process Implicit Definitions
Remove unreachable machine basic blocks
Live Variable Analysis
MachineDominator Tree Construction
Machine Natural Loop Construction
Eliminate PHI nodes for register allocation
Two-Address instruction pass
Slot index numbering
Live Interval Analysis
Register Coalescer
Machine Instruction Scheduler
Live Stack Slot Analysis
Machine Block Frequency Analysis
Stack Slot Coloring
NVPTX Prolog Epilog Pass
NVPTX optimize redundant cvta.to.local instruction
Remove Redundant DEBUG_VALUE analysis
Fixup Statepoint Caller Saved
MachineDominator Tree Construction
Machine Natural Loop Construction
Machine Block Frequency Analysis
Control Flow Optimizer
Post-RA pseudo instruction expansion pass
Analyze Machine Code For Garbage Collection
MachineDominator Tree Construction
Machine Natural Loop Construction
Machine Block Frequency Analysis
MachinePostDominator Tree Construction
Branch Probability Basic Block Placement
Insert fentry calls
Insert XRay ops
Machine Sanitizer Binary Metadata
Lazy Machine Block Frequency Analysis
Machine Optimization Remark Emitter
Stack Frame Layout Analysis
MachineDominator Tree Construction
Machine Natural Loop Construction
NVPTX Assembly Printer
Free MachineFunction
后端考虑根据遍情况,从前向后分析编译过程,尤其是有些有趣的遍可以重点分析一下。
llvm/tools/llc/llc.cpp
定义PASS管理结构,增加所有PASS
645 // Build up all of the passes that we want to do to the module.
646 legacy::PassManager PM;
648 // Add an appropriate TargetLibraryInfo pass for the module's triple.
返回Triple信息,实现TargetLibraryInfoImpl结构
649 TargetLibraryInfoImpl TLII(Triple(M->getTargetTriple()));
增加一个PASS,提供当前目标可用的库信息
654 PM.add(new TargetLibraryInfoWrapperPass(TLII));
这里getTargetTriple返回TargetTriple,这是通过M->setTargetTriple设置的。对于nvptx后端为"nvptx64-nvidia-cuda" Triple构建Triple结构,用于初始化TargetLibraryInfoImpl
891 TargetLibraryInfoImpl::TargetLibraryInfoImpl(const Triple &T) {
892 // Default to everything being available.
893 memset(AvailableArray, -1, sizeof(AvailableArray));
894
895 initialize(*this, T, StandardNames);
896 }
这里调用了initialize定义为llvm/lib/Analysis/TargetLibraryInfo.cpp
161 /// Initialize the set of available library functions based on the specified
162 /// target triple. This should be carefully written so that a missing target
163 /// triple gets a sane set of defaults.
164 static void initialize(TargetLibraryInfoImpl &TLI, const Triple &T,
165 ArrayRef<StringLiteral> StandardNames) {
根据Triple初始化可用的库函数
842 // As currently implemented in clang, NVPTX code has no standard library to
843 // speak of. Headers provide a standard-ish library implementation, but many
844 // of the signatures are wrong -- for example, many libm functions are not
845 // extern "C".
846 //
847 // libdevice, an IR library provided by nvidia, is linked in by the front-end,
848 // but only used functions are provided to llvm. Moreover, most of the
849 // functions in libdevice don't map precisely to standard library functions.
850 //
851 // FIXME: Having no standard library prevents e.g. many fastmath
852 // optimizations, so this situation should be fixed.
这里根据NVPTX后端设置了一些可用的库函数
853 if (T.isNVPTX()) {
854 TLI.disableAllFunctions();
855 TLI.setAvailable(LibFunc_nvvm_reflect);
856 TLI.setAvailable(llvm::LibFunc_malloc);
857 TLI.setAvailable(llvm::LibFunc_free);
858
859 // TODO: We could enable the following two according to [0] but we haven't
860 // done an evaluation wrt. the performance implications.
861 // [0]
862 // https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
863 //
864 // TLI.setAvailable(llvm::LibFunc_memcpy);
865 // TLI.setAvailable(llvm::LibFunc_memset);
866
867 TLI.setAvailable(llvm::LibFunc___kmpc_alloc_shared);
868 TLI.setAvailable(llvm::LibFunc___kmpc_free_shared);
869 } else {
870 TLI.setUnavailable(LibFunc_nvvm_reflect);
871 }
这里看在TargetLibraryInfoImpl定义了可用的库函数信息。
TargetLibraryInfoWrapperPass也是一个PASS类型,是第一个PASS,这里调用了
633 explicit TargetLibraryInfoWrapperPass(const TargetLibraryInfoImpl &TLI);
1363 TargetLibraryInfoWrapperPass::TargetLibraryInfoWrapperPass(
1364 const TargetLibraryInfoImpl &TLIImpl)
1365 : ImmutablePass(ID), TLA(TLIImpl) {
1366 initializeTargetLibraryInfoWrapperPassPass(*PassRegistry::getPassRegistry());
1367 }
initializeTargetLibraryInfoWrapperPassPasss是类定义时定义的函数,通过如下方式
1371 // Register the basic pass.
1372 INITIALIZE_PASS(TargetLibraryInfoWrapperPass, "targetlibinfo",
1373 "Target Library Information", false, true)
INITIALIZE_PASS定义了函数initializeTargetLibraryInfoWrapperPassPasss
这里第三个参数是名字name信息,对应PASS的名字,因此第一个PASS的名字是"Target Library Information"
38 #define INITIALIZE_PASS(passName, arg, name, cfg, analysis) \
39 static void *initialize##passName##PassOnce(PassRegistry &Registry) { \
新建一个PassInfo数据结构
40 PassInfo *PI = new PassInfo( \
41 name, arg, &passName::ID, \
42 PassInfo::NormalCtor_t(callDefaultCtor<passName>), cfg, analysis); \
并且登记到Registry中
43 Registry.registerPass(*PI, true); \
返回PassInfo数据结构指针
44 return PI; \
45 } \
46 static llvm::once_flag Initialize##passName##PassFlag; \
这里定义了上面的函数,调用前面PassOnce一次。
47 void llvm::initialize##passName##Pass(PassRegistry &Registry) { \
48 llvm::call_once(Initialize##passName##PassFlag, \
49 initialize##passName##PassOnce, std::ref(Registry)); \
50 }
#0 addPassesToGenerateCode (TM=..., PM=..., DisableVerify=85, MMIWP=...) at /home/yhz/llvm-project/llvm/lib/CodeGen/LLVMTargetMachine.cpp:118
#1 0x000055555779de5e in llvm::LLVMTargetMachine::addPassesToEmitFile (this=0x7ffff7a18010, PM=..., Out=..., DwoOut=0x0,
FileType=llvm::CodeGenFileType::AssemblyFile, DisableVerify=false, MMIWP=0x55555be71020)
at /home/yhz/llvm-project/llvm/lib/CodeGen/LLVMTargetMachine.cpp:241
#2 0x0000555556998333 in compileModule (argv=0x7fffffffdf68, Context=...) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:720
#3 0x000055555699601a in main (argc=8, argv=0x7fffffffdf68) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:425
这里检查后端增加了那些PASS lib/CodeGen/LLVMTargetMachine.cpp
114 /// addPassesToX helper drives creation and initialization of TargetPassConfig.
115 static TargetPassConfig *
116 addPassesToGenerateCode(LLVMTargetMachine &TM, PassManagerBase &PM,
117 bool DisableVerify,
118 MachineModuleInfoWrapperPass &MMIWP) {
119 // Targets may override createPassConfig to provide a target-specific
120 // subclass.
这里实际调用的NVPTXTargetMachine子类的函数createPassConfig,定义在llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
实际返回的TargetPassConfig类型是子类NVPTXPassConfig,但是类型名字不变"Target Pass Configuration"
121 TargetPassConfig *PassConfig = TM.createPassConfig(PM);
llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
213 TargetPassConfig *NVPTXTargetMachine::createPassConfig(PassManagerBase &PM) {
214 return new NVPTXPassConfig(*this, PM);
215 }
NVPTXPassConfig调用TargetPassConfig的constructor函数
571 TargetPassConfig::TargetPassConfig(LLVMTargetMachine &TM, PassManagerBase &pm)
572 : ImmutablePass(ID), PM(&pm), TM(&TM) {
575 // Register all target independent codegen passes to activate their PassIDs,
576 // including this pass itself.
577 initializeCodeGen(*PassRegistry::getPassRegistry());
19 /// initializeCodeGen - Initialize all passes linked into the CodeGen library.
20 void llvm::initializeCodeGen(PassRegistry &Registry) {
133 initializeTargetPassConfigPass(Registry);
353 INITIALIZE_PASS(TargetPassConfig, "targetpassconfig",
354 "Target Pass Configuration", false, false)
返回addPassesToGenerateCode
TargetPassConfig本身也是一个PASS
124 PM.add(PassConfig);
因此第二个PASS为"Target Pass Configuration",功能是完成对Target相关PASS配置的初始化
125 PM.add(&MMIWP);
这里的PASS名称为"Machine Module Information"
183 // Handle the Pass registration stuff necessary to use DataLayout's.
184 INITIALIZE_PASS(MachineModuleInfoWrapperPass, "machinemoduleinfo",
185 "Machine Module Information", false, false)
238 if (!MMIWP)
239 MMIWP = new MachineModuleInfoWrapperPass(this);
171 MachineModuleInfoWrapperPass::MachineModuleInfoWrapperPass(
172 const LLVMTargetMachine *TM)
173 : ImmutablePass(ID), MMI(TM) {
174 initializeMachineModuleInfoWrapperPassPass(*PassRegistry::getPassRegistry());
175 }
完成Pass的初始化。
238 /// High level function that adds all passes necessary to go from llvm IR
239 /// representation to the MI representation.
240 /// Adds IR based lowering and target specific optimization passes and finally
241 /// the core instruction selection passes.
242 /// \returns true if an error occurred, false otherwise.
243 bool addISelPasses();
TargetPassConfig.cpp
1054 bool TargetPassConfig::addISelPasses() {
没有使用这个PASS
1055 if (TM->useEmulatedTLS())
1056 addPass(createLowerEmuTLSPass());
1057
PASS "Target Transform Information",ImmutablePass,直接增加到PASS 管理器
1058 PM->add(createTargetTransformInfoWrapperPass(TM->getTargetIRAnalysis()));
PreISelIntrinsicLoweringLegacyPasss是ModulePass类型,看遍表的打印结果,ModulePass在ImmutablePass后面,不清楚是调用确实存在先后,还是遍表打印过程分配打印,猜测应该是前者?MPPassManager是一个Pass的子类,本身应该和ImmutablePass优先级相同?
1059 addPass(createPreISelIntrinsicLoweringPass());
这是一个FunctionPass类型Pass,运行在后面。FPPassManager是函数Pass管理器,本身是一个ModulePass。
1060 addPass(createExpandLargeDivRemPass());
这是一个FunctionPass类型Pass
1061 addPass(createExpandLargeFpConvertPass());
这个函数NVPTX后端有自己的定义
1062 addIRPasses();
...
1068 }
NVPTXTargetMachine.cpp
339 void NVPTXPassConfig::addIRPasses() {
这是ImmutablePass类型PASS,"NVPTX Address space based Alias Analysis",对应打印遍表的下面的遍。
357 addPass(createNVPTXAAWrapperPass());
358 addPass(createExternalAAWrapperPass([](Pass &P, Function &, AAResults &AAR) {
359 if (auto *WrapperPass = P.getAnalysisIfAvailable<NVPTXAAWrapperPass>())
360 AAR.addAAResult(WrapperPass->getResult());
361 }));
362
363 // NVVMReflectPass is added in addEarlyAsPossiblePasses, so hopefully running
364 // it here does nothing. But since we need it for correctness when lowering
365 // to NVPTX, run it here too, in case whoever built our pass pipeline didn't
366 // call addEarlyAsPossiblePasses.
367 const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
增加一个FunctionPass
368 addPass(createNVVMReflectPass(ST.getSmVersion()));
370 if (getOptLevel() != CodeGenOptLevel::None)
371 addPass(createNVPTXImageOptimizerPass());
另外的FunctionPASS,前面的FunctionPASS都合并到一个FunctionManager中管理
这是一个ModuelPASS
372 addPass(createNVPTXAssignValidGlobalNamesPass());
这是一个ModuelPASS,对应"Ensure that the global variables are in the global address space"
373 addPass(createGenericToNVVMLegacyPass());
FunctionPASS,对应"Lower pointer arguments of CUDA kernels"
377 addPass(createNVPTXLowerArgsPass());
因为打开了优化开关,这里对应一系列的优化PASS
378 if (getOptLevel() != CodeGenOptLevel::None) {
379 addAddressSpaceInferencePasses();
380 addStraightLineScalarOptimizationPasses();
381 }
对应"Expand Atomic instructions"
383 addPass(createAtomicExpandPass());
对应 "Lower ctors and dtors for NVPTX" PASS
384 addPass(createNVPTXCtorDtorLoweringLegacyPass());
387 TargetPassConfig::addIRPasses();
调用基类的PASS。
408 addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
409 Options.NoTrapAfterNoreturn));
最后为"add an exit instruction before every unreachable"
返回TargetPassConfig.cpp
1067 return addCoreISelPasses();
967 bool TargetPassConfig::addCoreISelPasses() {
1041 } else if (addInstSelector())
1046 addPass(&FinalizeISelID);
NVPTX后端定制了函数addInstSelector
412 bool NVPTXPassConfig::addInstSelector() {
413 const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();
414
PASS "Lower aggregate copies/intrinsics into loops"
415 addPass(createLowerAggrCopies());
PASS "NVPTX specific alloca hoisting"
416 addPass(createAllocaHoisting());
PASS "NVPTX DAG->DAG Pattern Instruction Selection"
417 addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));
418
419 if (!ST.hasImageHandles())
420 addPass(createNVPTXReplaceImageHandlesPass());
421
422 return false;
423 }
CodeGen/LLVMTargetMachine.cpp
129 PassConfig->addMachinePasses();
llvm/lib/CodeGen/TargetPassConfig.cpp
1077 /// Add the complete set of target-independent postISel code generator passes.
1079 /// This can be read as the standard order of major LLVM CodeGen stages. Stages
1080 /// with nontrivial configuration or multiple passes are broken out below in
1081 /// add%Stage routines.
1095 void TargetPassConfig::addMachinePasses() {
1099 if (getOptLevel() != CodeGenOptLevel::None) {
1100 addMachineSSAOptimization();
有优化标志,增加优化PASS
470 void NVPTXPassConfig::addMachineSSAOptimization() {
"Early Tail Duplication"
"Optimize machine instruction PHIs"
"Merge disjoint stack slots"
"Local Stack Slot Allocation"
"Remove dead machine instructions"
1111 addPreRegAlloc();
"NVPTX Proxy Register Instruction Erasure"
1260 PM->add(createStackFrameLayoutAnalysisPass());
"Stack Frame Layout"
addPassesToGenerateCode结束
245 if (TargetPassConfig::willCompleteCodeGenPipeline()) {
246 if (addAsmPrinter(PM, Out, DwoOut, FileType, MMIWP->getMMI().getContext()))
247 return true;
"NVPTX Assembly Printer"
"Free MachineFunction"
查看PASS NAME的来源,例如PASS "NVPTX Assembly Printer"
#0 llvm::NVPTXAsmPrinter::getPassName (this=0x0) at /home/yhz/llvm-project/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.h:154
#1 0x0000555558137b73 in llvm::Pass::dumpPassStructure (this=0x55555bea7f90, Offset=3) at /home/yhz/llvm-project/llvm/lib/IR/Pass.cpp:75
#2 0x00005555580bb80e in llvm::FPPassManager::dumpPassStructure (this=0x55555be88790, Offset=2)
at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1393
#3 0x00005555580b6e4d in (anonymous namespace)::MPPassManager::dumpPassStructure (this=0x55555be73520, Offset=1)
at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:435
#4 0x00005555580b8c06 in llvm::PMTopLevelManager::dumpPasses (this=0x55555be73128) at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:837
#5 0x00005555580b704c in llvm::legacy::PassManagerImpl::run (this=0x55555be72f80, M=...)
at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:529
#6 0x00005555580bcb57 in llvm::legacy::PassManager::run (this=0x7fffffffd7b0, M=...) at /home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1685
#7 0x000055555699855a in compileModule (argv=0x7fffffffdf68, Context=...) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:749
#8 0x000055555699601a in main (argc=8, argv=0x7fffffffdf68) at /home/yhz/llvm-project/llvm/tools/llc/llc.cpp:425
/home/yhz/llvm-project/llvm/lib/IR/LegacyPassManager.cpp
525 bool PassManagerImpl::run(Module &M) {
528 dumpArguments();
529 dumpPasses();
840 void PMTopLevelManager::dumpArguments() const {
841
842 if (PassDebugging < Arguments)
843 return;
844
这里打印Pass Arguments
845 dbgs() << "Pass Arguments: ";
首先打印所有的ImmutablePasses
846 for (ImmutablePass *P : ImmutablePasses)
847 if (const PassInfo *PI = findAnalysisPassInfo(P->getPassID())) {
848 assert(PI && "Expected all immutable passes to be initialized");
849 if (!PI->isAnalysisGroup())
850 dbgs() << " -" << PI->getPassArgument();
851 }
其他PASS
852 for (PMDataManager *PM : PassManagers)
853 PM->dumpPassArguments();
854 dbgs() << "\n";
855 }
822 void PMTopLevelManager::dumpPasses() const {
823
根据PassDebugging,小于Struture不打印
824 if (PassDebugging < Structure)
825 return;
同样先打印ImmutablePasses
827 // Print out the immutable passes
828 for (unsigned i = 0, e = ImmutablePasses.size(); i != e; ++i) {
829 ImmutablePasses[i]->dumpPassStructure(0);
830 }
再打印其他PASS的信息
836 for (PMDataManager *Manager : PassManagers)
837 Manager->getAsPass()->dumpPassStructure(1);
调用对应PASS的getPassName函数,打印PASS NAME
74 void Pass::dumpPassStructure(unsigned Offset) {
75 dbgs().indent(Offset*2) << getPassName() << "\n";
76 }
llvm::NVPTXAsmPrinter::getPassName()
再检查"NVPTX DAG->DAG Pattern Instruction Selection"遍
#define PASS_NAME "NVPTX DAG->DAG Pattern Instruction Selection"
INITIALIZE_PASS(NVPTXDAGToDAGISel, DEBUG_TYPE, PASS_NAME, false, false)
这个PASS的名字来自INITIALIZE_PASS定义的第三个参数。 这是调用的Pass类的
StringRef Pass::getPassName()
81 StringRef Pass::getPassName() const {
这里首先找到PassID
82 AnalysisID AID = getPassID();
然后找到对应的PassInfo
83 const PassInfo *PI = PassRegistry::getPassRegistry()->getPassInfo(AID);
调用对应的PassInfo的getPassName,这是通过INITIALIZE_PASS注册的Pass Name
84 if (PI)
85 return PI->getPassName();
86 return "Unnamed pass: implement Pass::getPassName()";
87 }
这里的NVPTXDAGToDAGISel定义的Pass一路继承自Pass,因此调用了Pass::getPassName()函数。 而前面的NVPTXAsmPrinter PASS则自己重载了getPassName,因此打印的是自己定义的Pass Name。 因此自定义了getPassName函数的情况下打印名字优先,否则打印的名字是INITIALIZE_PASS这些宏定义的PassName
PreISelIntrinsicLowering.cpp
381 bool runOnModule(Module &M) override {
用来返回TargetTransformInfo
382 auto LookupTTI = [this](Function &F) -> TargetTransformInfo & {
383 return this->getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
384 };
385
返回TargetMachine
386 const auto &TM = getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
生成PreISelIntrinsicLowering
387 PreISelIntrinsicLowering Lowering(TM, LookupTTI);
调用lowerIntrinsics
388 return Lowering.lowerIntrinsics(M);
274 bool PreISelIntrinsicLowering::lowerIntrinsics(Module &M) const {
276 for (Function &F : M) {
编译当前Module的所有函数
277 switch (F.getIntrinsicID()) {
返回当前函数的Intrinsic编号,或者返回不是intrinsic
282 case Intrinsic::memset:
283 Changed |= expandMemIntrinsicUses(F);
如果函数是memset
208 bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
209 Intrinsic::ID ID = F.getIntrinsicID();
210 bool Changed = false;
211
找到所有对Intrinsic的调用
212 for (User *U : llvm::make_early_inc_range(F.users())) {
213 Instruction *Inst = cast<Instruction>(U);
214
215 switch (ID) {
250 case Intrinsic::memset: {
251 auto *Memset = cast<MemSetInst>(Inst);
252 Function *ParentFunc = Memset->getFunction();
查找memset函数的目标变换信息
253 const TargetTransformInfo &TTI = LookupTTI(*ParentFunc);
254 if (shouldExpandMemIntrinsicWithSize(Memset->getLength(), TTI)) {
如果仍然使用intrinsic函数,并且支持MEMSET,则退出
255 if (UseMemIntrinsicLibFunc &&
256 canEmitLibcall(TM, ParentFunc, RTLIB::MEMSET))
257 break;
258
这里把memset修改为循环处理
259 expandMemSetAsLoop(Memset);
260 Changed = true;
把原来的函数删除掉
261 Memset->eraseFromParent();
262 }
263
264 break;
265 }
CodeGen/ExpandLargeDivRem.cpp
ExpandLargeDivRemLegacyPass::runOnFunction
118 bool runOnFunction(Function &F) override {
119 auto *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
120 auto *TLI = TM->getSubtargetImpl(F)->getTargetLowering();
121 return runImpl(F, *TLI);
57 static bool runImpl(Function &F, const TargetLowering &TLI) {
依次使用Function PASS处理每个函数,直到处理完毕所有的函数,当前Function PASS处理完毕,然后进入接下来的Module Pass
参考https://llvm.org/docs/NVPTXUsage.html#ptxas-complains-of-undefined-function-nvvm-reflect This pass looks for calls to the @__nvvm_reflect function and replaces them with constants based on the defined reflection parameters. 实现Target/NVPTX/NVVMReflect.cpp
实现NVPTXAssignValidGlobalNames.cpp 参考
// Clean up the names of global variables in the module to not contain symbols
// that are invalid in PTX.
//
// Currently NVPTX, like other backends, relies on generic symbol name
// sanitizing done by MC. However, the ptxas assembler is more stringent and
// disallows some additional characters in symbol names. This pass makes sure
// such names do not reach MC at all.
替换掉后端不允许的一些变量名命名。
实现NVPTXGenericToNVVM.cpp
289 bool runOnModule(Module &M) override;
304 bool GenericToNVVMLegacyPass::runOnModule(Module &M) {
305 return GenericToNVVM().runOnModule(M);
306 }
56 bool GenericToNVVM::runOnModule(Module &M) {
把ADDRESS_SPACE_GENERIC全局变量替换为ADDRESS_SPACE_GLOBAL
481 bool NVPTXLowerArgs::runOnFunction(Function &F) {
482 auto &TM = getAnalysis<TargetPassConfig>().getTM<NVPTXTargetMachine>();
483
分别处理kernel函数和device函数
484 return isKernelFunction(F) ? runOnKernelFunction(TM, F)
485 : runOnDeviceFunction(TM, F);
486 }
kernel函数
421 bool NVPTXLowerArgs::runOnKernelFunction(const NVPTXTargetMachine &TM,
422 Function &F) {
427 auto HandleIntToPtr = [this](Value &V) {
如果V的所有使用都是整数到指针的转换操作
428 if (llvm::all_of(V.users(), [](User *U) { return isa<IntToPtrInst>(U); })) {
429 SmallVector<User *, 16> UsersToUpdate(V.users());
将其替换为全局地址空间的指针类型
430 for (User *U : UsersToUpdate)
431 markPointerAsGlobal(U);
432 }
433 };
457 for (Argument &Arg : F.args()) {
458 if (Arg.getType()->isPointerTy()) {
459 if (Arg.hasByValAttr())
460 handleByValParam(TM, &Arg);
461 else if (TM.getDrvInterface() == NVPTX::CUDA)
462 markPointerAsGlobal(&Arg);
这里将参数转换到全局地址空间,然后再转换回原来的通用地址空间。
463 } else if (Arg.getType()->isIntegerTy() &&
464 TM.getDrvInterface() == NVPTX::CUDA) {
465 HandleIntToPtr(Arg);
466 }
467 }
NVPTXCtorDtorLowering.cpp
/// This pass creates a unified init and fini kernel with the required metadata
267 static bool lowerCtorsAndDtors(Module &M) {
268 bool Modified = false;
269 Modified |= createInitOrFiniKernel(M, "llvm.global_ctors", /*IsCtor =*/true);
270 Modified |= createInitOrFiniKernel(M, "llvm.global_dtors", /*IsCtor =*/false);
271 return Modified;
272 }
这里的实例不涉及这个问题。
NVPTXLowerUnreachable.cpp LLVM IR处理,方便后面的ptxas生成指令
// Lower aggregate copies, memset, memcpy, memmov intrinsics into loops when
// the size is large or is not a compile-time constant.
进入LLVM到DAG转换过程