Before you begin, make sure you have the following ready:
- The gemma-3-4b-it-f16.gguf model file
Specify the GPU device to use for profiling by setting the CUDA_VISIBLE_DEVICES environment variable.
export CUDA_VISIBLE_DEVICES=1
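If you are unsure which index maps to which card, it can help to list the devices first (this assumes nvidia-smi is available on the host):

```bash
# List GPUs and their indices so that CUDA_VISIBLE_DEVICES points at the intended card
nvidia-smi -L
```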
Launch the llamacpp server under `nsys profile` so that profiling is enabled. The following command specifies the output file for the profiling data as well as the server's runtime parameters:
nsys profile -o dev1-gemma-4b.nsys-rep bin/llama-server -ngl 81 -t 8 -c 0 --port 8000 -fa -m /media/do/sata-512G/modelhub/gemma-3-4b-it-f16.gguf
- `-o dev1-gemma-4b.nsys-rep`: output filename for the profiling report.
- `bin/llama-server`: the llamacpp server executable.
- `-ngl 81`: number of model layers to offload to the GPU.
- `-t 8`: number of threads.
- `-c 0`: context size.
- `--port 8000`: server port.
- `-fa`: enables Flash Attention.
- `-m /media/do/sata-512G/modelhub/gemma-3-4b-it-f16.gguf`: path to the model file.

Once the server is up, send it inference requests to generate profiling data (an example request is sketched below).
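As a minimal sketch, assuming the server exposes its default OpenAI-compatible endpoint on the port configured above, a few chat requests like the following are enough to exercise inference (the prompt and max_tokens values are arbitrary):

```bash
# Send a handful of chat-completion requests so the profile captures steady-state
# inference rather than just model loading (endpoint path assumes llama-server's
# OpenAI-compatible API; adjust if your build differs)
for i in 1 2 3; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Summarize GPU profiling in one paragraph."}],"max_tokens":256}'
done
```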
After enough profiling data has been collected, shut down the llamacpp server cleanly with the following command:
pidof llama-server | xargs kill
This command looks up the llamacpp server's process ID and terminates it gracefully.
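nsys writes the report file once the profiled process exits, so a quick sanity check that it was produced does not hurt:

```bash
# The report is flushed when llama-server exits; confirm it is present before analysis
ls -lh dev1-gemma-4b.nsys-rep
```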
Analyze the generated profiling data file with the `nsys stats` command:
nsys stats dev1-gemma-4b.nsys-rep
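If you only care about a subset of the summaries, or want the tables in CSV for further processing, the output can be narrowed down; the flag names below are from recent Nsight Systems releases, so check `nsys stats --help` on your version:

```bash
# Restrict the output to the CUDA API and GPU kernel summaries and export them as CSV
nsys stats --report cuda_api_sum,cuda_gpu_kern_sum \
    --format csv --output dev1-gemma-4b dev1-gemma-4b.nsys-rep
```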
** CUDA API Summary (cuda_api_sum):
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ --------- ----------- ------------ ---------------------------------
81.2 18,784,449,306 13,185 1,424,683.3 6,991.0 306 26,541,226 4,702,507.3 cudaStreamSynchronize
10.0 2,306,825,472 6,810 338,740.9 2,772.5 1,153 413,139,275 5,360,508.9 cudaMemcpyAsync
7.7 1,781,747,074 1,055,137 1,688.6 1,547.0 1,399 21,880,618 23,246.7 cudaLaunchKernel
0.6 140,144,078 3 46,714,692.7 37,087.0 28,224 140,078,767 80,855,660.3 cudaMemGetInfo
0.2 54,673,249 2 27,336,624.5 27,336,624.5 574,638 54,098,611 37,847,164.3 cudaMallocHost
0.1 25,398,086 2 12,699,043.0 12,699,043.0 635,453 24,762,633 17,060,492.6 cudaFreeHost
0.1 16,873,777 7 2,410,539.6 143,525.0 3,083 15,853,845 5,929,834.4 cudaMalloc
0.0 10,527,215 8 1,315,901.9 1,214,730.0 24,559 3,369,309 1,214,198.7 cudaFree
0.0 3,075,263 4 768,815.8 769,219.5 209,398 1,327,426 483,970.0 cuLibraryLoadData
0.0 2,260,349 1 2,260,349.0 2,260,349.0 2,260,349 2,260,349 0.0 cuMemUnmap
0.0 1,974,822 2 987,411.0 987,411.0 922,431 1,052,391 91,895.6 cudaGetDeviceProperties_v2_v12000
0.0 718,345 2 359,172.5 359,172.5 66,360 651,985 414,099.4 cuMemSetAccess
0.0 582,601 68 8,567.7 148.5 143 569,861 69,082.9 cuKernelGetFunction
0.0 527,918 101 5,226.9 1,402.0 793 161,138 17,268.8 cudaEventRecord
0.0 332,472 68 4,889.3 1,609.0 1,390 30,410 8,298.4 cuLaunchKernel
0.0 136,102 838 162.4 138.0 73 4,072 154.4 cuGetProcAddress_v2
0.0 76,657 6 12,776.2 11,467.5 3,590 24,162 7,379.2 cudaMemsetAsync
0.0 68,629 2 34,314.5 34,314.5 22,629 46,000 16,525.8 cuMemCreate
0.0 39,030 1 39,030.0 39,030.0 39,030 39,030 0.0 cudaStreamDestroy
0.0 31,947 606 52.7 43.0 37 984 48.9 cuStreamGetCaptureInfo_v2
0.0 21,251 4 5,312.8 3,251.5 1,966 12,782 5,060.8 cudaDeviceSynchronize
0.0 17,689 18 982.7 762.5 479 4,059 808.4 cudaEventDestroy
0.0 15,733 5 3,146.6 3,520.0 444 4,435 1,593.2 cuLibraryGetKernel
0.0 15,477 1 15,477.0 15,477.0 15,477 15,477 0.0 cuMemAddressFree
0.0 13,237 1 13,237.0 13,237.0 13,237 13,237 0.0 cudaStreamCreateWithFlags
0.0 9,735 1 9,735.0 9,735.0 9,735 9,735 0.0 cuMemAddressReserve
0.0 9,523 18 529.1 290.0 229 2,234 590.5 cudaEventCreateWithFlags
0.0 5,787 2 2,893.5 2,893.5 2,639 3,148 359.9 cuMemMap
0.0 4,399 4 1,099.8 1,249.5 626 1,274 316.0 cuInit
0.0 4,155 1 4,155.0 4,155.0 4,155 4,155 0.0 cudaEventQuery
0.0 3,589 1 3,589.0 3,589.0 3,589 3,589 0.0 cuMemGetAllocationGranularity
0.0 1,116 3 372.0 164.0 132 820 388.3 cuModuleGetLoadingMode
0.0 990 2 495.0 495.0 331 659 231.9 cudaGetDriverEntryPoint_v11030
0.0 353 2 176.5 176.5 145 208 44.5 cuMemRelease
Processing [dev1-gemma-4b.sqlite] with [/usr/local/cuda-12.9/nsight-systems-2025.1.3/host-linux-x64/reports/cuda_gpu_kern_sum.py]...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
86.2 16,650,623,556 253,109 65,784.4 20,736.0 10,976 5,023,823 194,450.8 void mul_mat_vec<__half, __half, (int)1, (int)256>(const T1 *, const float *, const int *, float *,…
5.0 958,434,630 36,040 26,593.6 25,632.0 21,088 1,551,515 30,793.9 void flash_attn_ext_f16<(int)256, (int)256, (int)4, (int)2, (int)2, (int)1, (bool)0, (bool)0>(const…
2.2 416,796,498 217,505 1,916.3 1,952.0 1,663 20,928 236.7 void k_bin_bcast<&op_mul, float, float, float>(const T2 *, const T3 *, T4 *, int, int, int, int, in…
2.1 407,080,542 145,357 2,800.6 2,720.0 2,399 17,472 383.8 void rms_norm_f32<(int)1024>(const float *, float *, int, long, long, long, float)
1.2 225,183,495 36,040 6,248.2 6,240.0 5,120 370,527 2,037.9 void flash_attn_stream_k_fixup<(int)256, (int)4, (int)2>(float *, const float2 *, int, int, int)
0.7 138,333,795 72,148 1,917.4 2,112.0 1,567 18,144 351.3 void rms_norm_f32<(int)32>(const float *, float *, int, long, long, long, float)
0.7 131,226,534 72,148 1,818.9 1,824.0 1,504 17,440 154.4 void k_bin_bcast<&op_add, float, float, float>(const T2 *, const T3 *, T4 *, int, int, int, int, in…
0.7 129,531,394 74,270 1,744.1 1,696.0 1,631 17,856 276.2 void cpy_f32_f16<&cpy_1_f32_f16>(const char *, char *, int, int, int, int, int, int, int, int, int,…
0.5 93,407,271 72,148 1,294.7 1,280.0 1,184 20,127 150.0 void rope_neox<(bool)1, (bool)0, float>(const T3 *, T3 *, int, int, int, int, int, const int *, flo…
0.4 68,189,621 36,074 1,890.3 1,856.0 1,760 18,144 364.0 void unary_gated_op_kernel<&op_gelu, float>(const T2 *, const T2 *, T2 *, long, long, long, long)
0.2 40,957,914 37,135 1,102.9 1,088.0 992 13,600 238.3 scale_f32(const float *, float *, float, int)
0.1 14,449,912 235 61,489.0 22,688.0 12,448 1,269,181 110,896.0 void mul_mat_vec<__half, __half, (int)2, (int)256>(const T1 *, const float *, const int *, float *,…
0.1 12,504,441 66 189,461.2 164,304.0 159,168 1,805,209 201,962.9 turing_h1688gemm_256x64_sliced1x2_ldg8_tn
0.1 10,182,623 101 100,818.0 39,296.0 30,816 1,614,491 201,661.3 turing_h1688gemm_256x64_ldg8_stages_32x1_tn
0.0 3,792,981 2,122 1,787.5 1,984.0 1,472 2,656 265.1 void k_get_rows_float<float, float>(const T1 *, const int *, T2 *, long, long, unsigned long, unsig…
0.0 1,759,995 68 25,882.3 25,968.0 25,056 26,368 300.3 void cutlass::Kernel2<cutlass_75_wmma_tensorop_h161616gemm_32x32_128x2_tn_align8>(T1::Params)
0.0 1,740,533 34 51,192.1 50,591.5 49,056 67,776 3,121.1 void flash_attn_ext_f16<(int)256, (int)256, (int)16, (int)2, (int)4, (int)2, (bool)0, (bool)0>(cons…
0.0 1,079,070 235 4,591.8 3,456.0 2,208 9,023 2,622.3 void convert_unary<__half, float>(const void *, T2 *, long, long, long, long, long, long)
0.0 1,016,827 235 4,326.9 3,648.0 3,008 10,304 1,904.0 void convert_unary<float, __half>(const void *, T2 *, long, long, long, long, long, long)
0.0 582,528 34 17,133.2 17,120.0 17,056 17,568 87.1 void flash_attn_stream_k_fixup<(int)256, (int)16, (int)2>(float *, const float2 *, int, int, int)
0.0 478,112 101 4,733.8 4,480.0 3,744 12,096 1,080.3 void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, __half, __half, __half, __half, (bool)0, …
Processing [dev1-gemma-4b.sqlite] with [/usr/local/cuda-12.9/nsight-systems-2025.1.3/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]...
** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ----------- ----------- -------- ----------- ----------- ----------------------------
87.6 2,527,438,955 5,749 439,631.1 1,600.0 319 413,623,340 5,864,457.8 [CUDA memcpy Host-to-Device]
11.7 338,976,207 1,061 319,487.5 319,103.0 318,719 332,639 1,401.7 [CUDA memcpy Device-to-Host]
0.7 18,757,906 6 3,126,317.7 2,499,002.0 320,127 7,067,246 3,163,909.3 [CUDA memset]
| Time (%) | API Name | Calls | Avg Time (ns) | Max Time (ns) | Notes |
|---|---|---|---|---|---|
| 81.2% | cudaStreamSynchronize | 13,185 | 1,424,683 | 26,541,226 | Synchronization dominates the API overhead |
| 10.0% | cudaMemcpyAsync | 6,810 | 338,741 | 413,139,275 | Asynchronous memory copies |
| 7.7% | cudaLaunchKernel | 1,055,137 | 1,689 | 21,880,618 | Per-launch overhead is low |
Key finding: cudaStreamSynchronize takes an excessive share of time (81.2%), indicating a severe CPU-GPU synchronization bottleneck.

| Time (%) | Kernel Name | Calls | Avg Time (μs) | Notes |
|---|---|---|---|---|
| 86.2% | void mul_mat_vec<__half, __half, ...> | 253,109 | 65.78 | Matrix-vector multiplication core |
| 5.0% | void flash_attn_ext_f16<...> | 36,040 | 26.59 | FlashAttention |
| 2.2% | void k_bin_bcast<&op_mul, ...> | 217,505 | 1.92 | Element-wise broadcast operations |
Key finding: the matrix-vector multiplication kernel (mul_mat_vec) dominates the computation (86.2% of GPU time).
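To drill further into these numbers than the built-in reports allow, the SQLite database that `nsys stats` generates alongside the report (dev1-gemma-4b.sqlite in the output above) can be queried directly. The table and column names below are what recent Nsight Systems exports use; treat this as a sketch and adapt it to the schema of your version:

```bash
# Top 5 kernels by total GPU time, straight from the exported SQLite database
# (assumes the CUPTI_ACTIVITY_KIND_KERNEL and StringIds tables of recent nsys exports)
sqlite3 dev1-gemma-4b.sqlite "
  SELECT s.value AS kernel, COUNT(*) AS launches, SUM(k.end - k.start) AS total_ns
  FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
  JOIN StringIds AS s ON s.id = k.demangledName
  GROUP BY s.value
  ORDER BY total_ns DESC
  LIMIT 5;"
```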