Skip to content

nvFuser Profiler Instructions

Kevin Stephano edited this page Nov 3, 2023 · 3 revisions

What is being profiled?

A Cuda Device Synchronize is executed prior to each measured fusion in order to expose the Host Time to measurement. The first line of each profile includes the aggregate fusion information and 1st segment's information. Successive segments have their aggregate fusion information omitted.

  • Aggregate Fusion information
    • Fus# (nocupti) - Fusion Number
    • NSegs (nocupti) - Number of Segments
    • CuEvtTime(ms) (nocupti) - The Cuda Event Time that wraps the entire fusion process including kernel execution.
    • CmpTm(ms) (nocupti) - Is the Parallel Compilation time or the aggregate of serialized segment compilation depending on which option is used. Compilation time includes Kernel Lowering and NVRTC compilation.
    • KerTime(ms) - The aggregated Kernel Time of all the segments executed.
    • EffBw(GB/S) - The Effective Bandwidth of the inputs/outputs to the fusions which does not include the I/O specific to each segment. Therefore, segmentation should penalize the aggregate Effective Bandwidth.
    • %PeakBw - The Percentage of the Peak Bandwidth of the Fusion with respect to the previous Effective Bandwidth. This is also the %SOL of the Fusion.
    • In(MB) ( verbose) - The size of the fusion inputs in MBs.
    • Out(MB) ( verbose) - The size of the fusion outputs in MBs.
  • Segment Information
    • S-Seg# = The segment index found in the line
    • S-KerName - The Cuda Kernel's short name
    • S-KerTm(ms) - The CUPTI profiled Kernel Time for the given segment
    • S-CmpTm(ms) ( verbose) - The Compile Time for the given segment if kernels are compiled without Parallel Compilation.
    • S-EffBw(GB/s) - The effective bandwidth of the segment given the specific I/O of this segment's kernel.
    • S-%PeakBw - The percent of the Peak Bandwidth of this kernel. The kernel's %SOL assuming the kernel is bandwidth limited.
    • S-In(MB) - The size of the kernel inputs in MBs.
    • S-Out(MB) - The size of the kernel outputs in MBs.
    • S-Smem[Dyn,Stat] - The Dynamic and Static Shared Memory used by the kernel in bytes
    • S-Regs - The registers-per-thread used by the kernel
    • S-Grid - The grid dimensions of the kernel
    • S-Block - The block dimensions of the kernel
    • S-Cluster ( verbose) - The cluster dimensions of the kernel
    • S-Dev ( verbose) - The device id of the kernel execution.
    • S-Stm (verbose) - The stream id of the kernel execution.
    • S-DeviceName ( verbose) - The market name of the device used for kernel execution.
    • S-PeakBW(GB/s) ( verbose) - The Peak Bandwidth of the device used for execution.

Limitations

This profiler is adding device synchronizations prior to every fusion. Therefore, this profiling is not suitable for end-to-end profiling of a network where you should expect nvFuser's host work to overlap with previous device work. This profiler is suitable for measuring the host and device activity of an individual fusion and its kernel.

For End-to-End measure these Nsight Systems commands are appropriate:

  • Print trace to console: nsys nvprof --print-gpu-trace <Your Program>
  • Create an Nsys Report: nsys profile -w true -t cublas,cuda,nvtx,osrt -o <Report Name> <Your Program>

ENV Variable NVFUSER_PROF Values to Enable Profiling

All environment variables minimally enable the Fusion Profiler that populates the FusionProfile object.

NVFUSER_PROF values:

  • enable : Enables full profiling and records the activity in the FusionProfile object only.
  • enable.nocupti : Only enables host timer profiling of fusion level activity and records the activity in the FusionProfile object only. The "nocupti" object allows the user to see the host timing without CUPTI overhead that can be as significant as 10us to 30us.
  • print : Enables full profiling and records the activity in the FusionProfile object and prints it to the console.
  • print.verbose : Enables full profiling and records the activity in the FusionProfile object and prints it to the console. Additional information is printed in extra columns that may cause line wrapping on smaller terminal windows. The (verbose) level indicates the set of additional columns printed, above.
  • print.nocupti : Only enables host timer profiling of fusion level activity and records the activity in the FusionProfile object and print it to the console with an abbreviated set of columns for the limited information. The (nocupti) level indicates the set of columns printed, above.

Profiling Output as an Object

You can collect statistics from profiling a specific fusion by querying the output object:

FusionProfile fprof = FusionProfiler::get()->profile();

FusionProfile Object

struct FusionProfile {
  int64_t fusion_id{-1};
  int64_t segments{0};

  double cuda_evt_time_ms{0.0};
  double host_time_ms{0.0};
  double compile_time_ms{0.0};
  double kernel_time_ms{0.0};

  int64_t input_bytes{0};
  int64_t output_bytes{0};

  double effective_bandwidth_gbs{0.0};
  double percentage_peak_bandwidth{0.0};

  std::vector<KernelProfile> kernel_profiles{};
};

KernelProfile Object

struct KernelProfile {
  std::string name{};
  int64_t segment_id;
  int device{-1};
  uint32_t stream{0};
  uint32_t correlation_id{0};

  double compile_time_ms{0.0};
  double time_ms{0.0};
  double effective_bandwidth_gbs{0.0};
  double percentage_peak_bandwidth{0.0};

  std::array<int32_t, 3> grid{0, 0, 0};
  std::array<int32_t, 3> block{0, 0, 0};
  std::array<uint32_t, 3> cluster{0, 0, 0};

  int32_t dynamic_shared_mem{0};
  int32_t static_shared_mem{0};
  uint32_t registers{0};

  int64_t input_bytes{0};
  int64_t output_bytes{0};

  std::string device_name{};
  double peak_bandwidth_gbs{0.0};
};

Profiling Output in C++ stdout

Default - NVFUSER_PROF=print

Fus#  NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms) KerTm(ms) EffBw(GB/s) %PeakBw   Seg# S-KerName  S-KerTm(ms) S-EffBw(GB/s) S-%PeakBw S-In(MB)  S-Out(MB) S-Smem[Dyn,Stat] S-Regs S-Grid           S-Block
    0     1     328.582   328.529   277.940     0.059     1765.88     91.24    0    kernel1       0.059       1765.88     91.24    52.442    52.494        [512, 16]     41     [8192, 1, 1]      [128, 1, 1]
    1     1      95.395    95.345    93.025     0.057     1845.05     95.33    0    kernel2       0.057       1845.05     95.33    52.429    52.429           [0, 0]     16    [25600, 1, 1]      [128, 1, 1]
    2     1    1076.190  1076.054  1012.063     0.141     1583.82     81.84    0    kernel3       0.141       1583.82     81.84   104.870   118.030        [512, 16]     52     [8192, 1, 1]      [128, 1, 1]
    3     1     139.589   139.350   129.783     0.246     1707.12     88.21    0    kernel4       0.246       1707.12     88.21   209.715   209.715           [0, 0]     16   [102400, 1, 1]      [128, 1, 1]
    4     1     283.830   283.744   276.638     0.092     1855.98     95.90    0    kernel5       0.092       1855.98     95.90   104.858    65.536           [0, 4]     34    [25600, 1, 1]      [128, 1, 1]
    5     1     124.316   124.253   119.113     0.069     1716.20     88.68    0    kernel6       0.069       1716.20     88.68    65.536    52.429           [0, 0]     18    [25600, 1, 1]      [128, 1, 1]
    6     2     369.042   368.662   226.779     0.431     1582.54     81.77    0    kernel7       0.048       1085.88     56.11    52.429     0.006       [2048, 16]     32      [50, 17, 1]      [32, 16, 1]
    -     -           -         -         -         -           -         -    1    kernel8       0.382       1645.26     85.01   419.430   209.715           [0, 0]     22   [102400, 1, 1]      [128, 1, 1]
    7     2    2024.829  2024.633   978.544     0.338     1436.02     74.20    0    kernel9       0.132       1585.10     81.90   209.715     0.026       [2048, 16]     32      [50, 30, 1]       [64, 8, 1]
    -     -           -         -         -         -           -         -    1   kernel10       0.205       1340.02     69.24   170.466   104.870        [512, 16]    124      [1, 432, 1]       [2, 64, 1]
    8     2     197.137   197.089   193.144     0.101     1559.95     80.60    0   kernel12       0.049       1070.98     55.34    52.429     0.006       [2048, 16]     32      [50, 17, 1]      [32, 16, 1]
    -     -           -         -         -         -           -         -    1   kernel11       0.052       2021.47    104.45    52.429    52.429           [0, 0]     16    [25600, 1, 1]      [128, 1, 1]
    9     2    1729.789  1729.616   872.094     0.284     1294.23     66.87    0   kernel13       0.104       1506.07     77.82   157.286     0.019       [2048, 16]     32      [38, 31, 1]       [64, 8, 1]
    -     -           -         -         -         -           -         -    1   kernel14       0.179       1170.76     60.49   157.358    52.442        [512, 16]    126      [1, 432, 1]       [2, 64, 1]
   10     1       0.149     0.099     0.000     0.058     1809.74     93.51    0    kernel1       0.058       1809.74     93.51    52.442    52.494        [512, 16]     41     [8192, 1, 1]      [128, 1, 1]

Verbose - NVFUSER_PROF=print.verbose

This option gives added details in the far right columns that includes things like the device type profiled and the peak bandwidth of the device.

Fus#  NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms) KerTm(ms) EffBw(GB/s) %PeakBw   Seg# S-KerName  S-KerTm(ms) S-CmpTm(ms) S-EffBw(GB/s) S-%PeakBw S-In(MB)  S-Out(MB) S-Smem[Dyn,Stat] S-Regs S-Grid           S-Block          S-Cluster        S-Dev S-Stm S-DeviceName         S-PeakBw(GB/s)
    0     1     328.880   328.828   277.962     0.059     1777.37     91.84    0    kernel1       0.059     276.450       1777.37     91.84    52.442    52.494        [512, 16]     41     [8192, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    1     1      95.295    95.241    92.902     0.059     1791.61     92.57    0    kernel2       0.059      92.573       1791.61     92.57    52.429    52.429           [0, 0]     16    [25600, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    2     1    1078.698  1078.567  1014.494     0.139     1599.47     82.64    0    kernel3       0.139    1013.166       1599.47     82.64   104.870   118.030        [512, 16]     52     [8192, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    3     1     139.003   138.766   129.382     0.244     1719.43     88.84    0    kernel4       0.244     128.690       1719.43     88.84   209.715   209.715           [0, 0]     16   [102400, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    4     1     283.905   283.822   276.647     0.092     1855.35     95.87    0    kernel5       0.092     276.213       1855.35     95.87   104.858    65.536           [0, 4]     34    [25600, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    5     1     124.911   124.845   119.094     0.069     1706.67     88.18    0    kernel6       0.069     118.597       1706.67     88.18    65.536    52.429           [0, 0]     18    [25600, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    6     2     368.721   368.344   226.877     0.428     1593.32     82.33    0    kernel7       0.048       0.000       1085.91     56.11    52.429     0.006       [2048, 16]     32      [50, 17, 1]      [32, 16, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    -     -           -         -         -         -           -         -    1    kernel8       0.379       0.000       1657.88     85.66   419.430   209.715           [0, 0]     22   [102400, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    7     2    2031.692  2031.500   975.809     0.336     1441.90     74.50    0    kernel9       0.133       0.000       1571.43     81.20   209.715     0.026       [2048, 16]     32      [50, 30, 1]       [64, 8, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    -     -           -         -         -         -           -         -    1   kernel10       0.203       0.000       1356.72     70.10   170.466   104.870        [512, 16]    124      [1, 432, 1]       [2, 64, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    8     2     196.050   196.000   192.079     0.103     1533.67     79.24    0   kernel12       0.049       0.000       1073.08     55.45    52.429     0.006       [2048, 16]     32      [50, 17, 1]      [32, 16, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    -     -           -         -         -         -           -         -    1   kernel11       0.054       0.000       1952.80    100.90    52.429    52.429           [0, 0]     16    [25600, 1, 1]      [128, 1, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    9     2    1722.642  1722.485   865.384     0.266     1381.35     71.37    0   kernel13       0.103       0.000       1530.45     79.08   157.286     0.019       [2048, 16]     32      [38, 31, 1]       [64, 8, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36
    -     -           -         -         -         -           -         -    1   kernel14       0.163       0.000       1287.31     66.52   157.358    52.442        [512, 16]    126      [1, 432, 1]       [2, 64, 1]        [0, 0, 0]     0     7   NVIDIA A100 80GB P        1935.36

No-Cupti - NVFUSER_PROF=print.nocupti

The "no-cupti" option omits kernel profiling and, therefore, you do not see any segment information. The purpose of this option is to show you the Host Time HstTm(ms) without the overhead of CUPTI. It empirically appears to be between 10us and 30us in the Host Time when the Compile Time CmpTm(ms) does not dominate the Host Time.

Fus#  NSegs CuEvtTm(ms) HstTm(ms) CmpTm(ms)
    0     1     328.874   328.818   278.639
    1     1      94.973    94.920    92.629
    2     1    1076.356  1076.217  1012.515
    3     1     139.532   139.288   129.602
    4     1     283.904   283.817   277.174
    5     1     124.174   124.110   119.053
    6     2     370.288   369.907   226.604
    7     2    2031.641  2031.442   977.103
    8     2     196.486   196.436   192.158
    9     2    1726.821  1726.661   870.991