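Small networks run slower on the GPU than on the CPU on Huawei Kirin chips, but faster on the GPU than on the CPU on Snapdragon chips. Why? #1230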

Open
MHGL opened this issue Aug 11, 2021 · 4 comments
@MHGL

MHGL commented Aug 11, 2021

1. Environment

  • Build OS and Version: Ubuntu20.04
  • RunTime OS Version: Android 8.0
  • RunTime DEVICE: ARM

2. GitHub version

  • branch: master
  • commit(optional): 8c4178d

3. Compile method

4. Kirin chip (HUAWEI P9, HiSilicon Kirin 955)

  • shufflenet_v2.tnnproto (official TNN model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 1326.2 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 31.6 MB/s (7525920 bytes in 0.227s)
test/TNNTest: 1 file pushed, 0 skipped. 360.1 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 3.6 MB/s (283496 bytes in 0.074s)
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 10.3 MB/s (52083 bytes in 0.005s)
EVA-AL10

benchmark device: ARM 

                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|    Convolution |                 7.473 |      86.476 |
|   StridedSlice |                 0.451 |       5.215 |
|        Pooling |                 0.262 |       3.034 |
|   BatchNormCxx |                 0.194 |       2.244 |
| ShuffleChannel |                 0.134 |       1.546 |
|         Concat |                 0.128 |       1.486 |
--------------------------------------------------------
kernel runtime total: 8.64113 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - ARM                  TNN Benchmark time cost: min =  9.404   ms  |  max =  9.596   ms  |  avg =  9.474   ms 
08-11 15:55:58.555 18785 18785 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - ARM                  TNN Benchmark time cost: min =  9.404   ms  |  max =  9.596   ms  |  avg =  9.474   ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context

                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |                 5.047 |      49.808 |
|       Conv_3x3 |                 1.331 |      13.131 |
| Conv_Depthwise |                 1.110 |      10.951 |
| ShuffleChannel |                 0.956 |       9.437 |
|         Concat |                 0.680 |       6.705 |
|    StrideSlice |                 0.556 |       5.491 |
|        Pooling |                 0.320 |       3.157 |
|      BatchNorm |                 0.134 |       1.318 |
--------------------------------------------------------
kernel runtime total: 10.1338 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - OPENCL               TNN Benchmark time cost: min = 32.275   ms  |  max = 47.919   ms  |  avg = 38.150   ms 
08-11 15:56:06.504 18802 18802 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - OPENCL               TNN Benchmark time cost: min = 32.275   ms  |  max = 47.919   ms  |  avg = 38.150   ms 
  • ESPNetV2 (custom model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 776.0 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 22.5 MB/s (7525920 bytes in 0.320s)
test/TNNTest: 1 file pushed, 0 skipped. 357.9 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 4.4 MB/s (283496 bytes in 0.062s)
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 10.8 MB/s (123124 bytes in 0.011s)
EVA-AL10

benchmark device: ARM 
                       Summary
------------------------------------------------------
|      Op Type | Total Kernel Time(ms) | Percent (%) |
------------------------------------------------------
|  Convolution |               171.766 |      75.569 |
|        PReLU |                 9.712 |       4.273 |
|     Upsample |                 8.760 |       3.854 |
|       Concat |                 8.670 |       3.814 |
|          Add |                 8.004 |       3.522 |
|      Pooling |                 6.693 |       2.945 |
|          Pad |                 4.063 |       1.788 |
| BatchNormCxx |                 3.819 |       1.680 |
| SoftmaxCaffe |                 3.749 |       1.649 |
|       SplitV |                 2.059 |       0.906 |
------------------------------------------------------
kernel runtime total: 227.296 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 227.797  ms  |  max = 232.323  ms  |  avg = 230.944  ms 
08-11 16:07:09.257 18913 18913 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 227.797  ms  |  max = 232.323  ms  |  avg = 230.944  ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context
                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |               147.789 |      69.348 |
| Conv_Depthwise |                11.484 |       5.389 |
|         Concat |                 9.341 |       4.383 |
|            Pad |                 8.492 |       3.985 |
|      BatchNorm |                 8.187 |       3.842 |
|          PRelu |                 7.954 |       3.732 |
|            Add |                 7.828 |       3.673 |
|        Pooling |                 3.697 |       1.735 |
|       Conv_3x3 |                 3.306 |       1.551 |
|       Upsample |                 3.086 |       1.448 |
|         SplitV |                 1.397 |       0.655 |
|        SoftMax |                 0.552 |       0.259 |
--------------------------------------------------------
kernel runtime total: 213.112 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 288.641  ms  |  max = 298.735  ms  |  avg = 293.257  ms 
08-11 16:07:19.737 18930 18930 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 288.641  ms  |  max = 298.735  ms  |  avg = 293.257  ms 
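
The timing lines above all share one layout, so they can be collected with a regular expression rather than by hand. The following is a minimal sketch, assuming the `Timer::Print()` line format stays exactly as it appears in these logs; the names `TIME_COST` and `parse_time_cost` are made up for illustration and are not part of TNN.

```python
import re

# Matches the "<model> - <device>  TNN Benchmark time cost: min = .. | max = .. | avg = .." tail
# of a Timer::Print() log line as it appears in this issue (format assumed, not taken from TNN docs).
TIME_COST = re.compile(
    r"(?P<model>\S+)\s+-\s+(?P<device>\w+)\s+TNN Benchmark time cost:"
    r"\s*min\s*=\s*(?P<min>[\d.]+)\s*ms\s*\|"
    r"\s*max\s*=\s*(?P<max>[\d.]+)\s*ms\s*\|"
    r"\s*avg\s*=\s*(?P<avg>[\d.]+)\s*ms"
)


def parse_time_cost(line):
    """Return (model, device, min_ms, max_ms, avg_ms), or None if the line has no timing summary."""
    m = TIME_COST.search(line)
    if m is None:
        return None
    return (m["model"], m["device"], float(m["min"]), float(m["max"]), float(m["avg"]))


# One of the log lines from the Kirin 955 run above.
sample = (
    "I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] "
    "shufflenet_v2.tnnproto - OPENCL               TNN Benchmark time cost: "
    "min = 32.275   ms  |  max = 47.919   ms  |  avg = 38.150   ms "
)
print(parse_time_cost(sample))
```

Because each result is printed twice (once with the `I/tnn:` prefix and once with a logcat timestamp), a caller would probably want to de-duplicate the parsed tuples, for example by collecting them into a set.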

5. Snapdragon chip (Xiaomi Mix2s, Snapdragon 845)

  • ESPNetV2 (custom model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 597.1 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 61.0 MB/s (7525920 bytes in 0.118s)
test/TNNTest: 1 file pushed, 0 skipped. 342.2 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 11.6 MB/s (283496 bytes in 0.023s)
E/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 187] load program cache skipped, ret: 40966, msg: code: 0xA006 msg: open program cache file failed, input path: /data/local/tmp//d1_tnn_ocl_fd8c6f613ff9c0d503dbc462bf21353f_abc87b1bd5bec928c91c17fc45884487
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 24.3 MB/s (122071 bytes in 0.005s)
MIX 2S

benchmark device: ARM 
                       Summary
------------------------------------------------------
|      Op Type | Total Kernel Time(ms) | Percent (%) |
------------------------------------------------------
|  Convolution |               149.520 |      82.845 |
|     Upsample |                 4.961 |       2.749 |
|      Pooling |                 4.927 |       2.730 |
|        PReLU |                 4.814 |       2.667 |
|          Add |                 4.553 |       2.523 |
|       Concat |                 3.513 |       1.947 |
| SoftmaxCaffe |                 2.735 |       1.515 |
|          Pad |                 2.203 |       1.221 |
| BatchNormCxx |                 2.183 |       1.210 |
|       SplitV |                 1.072 |       0.594 |
------------------------------------------------------
kernel runtime total: 180.481 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 180.755  ms  |  max = 185.242  ms  |  avg = 183.615  ms 
08-11 16:14:20.985 19859 19859 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 180.755  ms  |  max = 185.242  ms  |  avg = 183.615  ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context
                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |                19.559 |      61.558 |
| Conv_Depthwise |                 3.476 |      10.940 |
|         Concat |                 2.204 |       6.936 |
|            Add |                 1.651 |       5.197 |
|          PRelu |                 1.597 |       5.026 |
|        Pooling |                 0.777 |       2.445 |
|            Pad |                 0.718 |       2.261 |
|      BatchNorm |                 0.562 |       1.769 |
|       Upsample |                 0.549 |       1.727 |
|       Conv_3x3 |                 0.372 |       1.172 |
|         SplitV |                 0.241 |       0.759 |
|        SoftMax |                 0.067 |       0.211 |
--------------------------------------------------------
kernel runtime total: 31.7729 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 34.534   ms  |  max = 38.537   ms  |  avg = 36.155   ms 
08-11 16:14:34.795 19885 19885 I tnn     : void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 34.534   ms  |  max = 38.537   ms  |  avg = 36.155   ms 
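
Each run above reports two numbers: the "kernel runtime total" from the per-op summary and the average benchmark time. A minimal sketch of the arithmetic between them follows, using only figures copied from the logs in this issue. The function name `non_kernel_ms` is made up for illustration, and the logs do not say whether both numbers come from the same pass, so the gap is only a rough indicator of time spent outside the profiled kernels.

```python
# (device, model, backend) -> (kernel runtime total in ms, benchmark avg in ms),
# copied from the logs above.
RUNS = {
    ("Kirin 955", "shufflenet_v2.tnnproto", "ARM"): (8.64113, 9.474),
    ("Kirin 955", "shufflenet_v2.tnnproto", "OPENCL"): (10.1338, 38.150),
    ("Kirin 955", "portrait.tnnproto", "ARM"): (227.296, 230.944),
    ("Kirin 955", "portrait.tnnproto", "OPENCL"): (213.112, 293.257),
    ("Snapdragon 845", "portrait.tnnproto", "ARM"): (180.481, 183.615),
    ("Snapdragon 845", "portrait.tnnproto", "OPENCL"): (31.7729, 36.155),
}


def non_kernel_ms(kernel_total_ms, avg_ms):
    """Time in the benchmark average that the per-op kernel totals do not account for."""
    return avg_ms - kernel_total_ms


for (device, model, backend), (kernel_ms, avg_ms) in RUNS.items():
    gap = non_kernel_ms(kernel_ms, avg_ms)
    share = 100.0 * gap / avg_ms
    print(f"{device:15s} {model:24s} {backend:7s} gap = {gap:7.3f} ms ({share:4.1f}% of avg)")
```

On these figures the ARM runs leave a gap of under 4 ms, while the OPENCL runs on the Kirin 955 leave about 28 ms (shufflenet_v2) and about 80 ms (portrait) unaccounted for, against about 4 ms for OPENCL on the Snapdragon 845. What fills that gap (for example command submission, synchronization, or memory transfers) is not visible in these logs.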

6. How can the inference performance of small networks on Kirin chips be optimized?

@MHGL
Author

MHGL commented Aug 11, 2021

@lnmdlong
Collaborator

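@MHGL The speed issue you reported depends on the device tested; it is not a general problem with GPUs on Kirin processors. The Kirin 955 device you chose has a CPU with four Cortex-A72 cores plus four Cortex-A53 cores and a Mali-T880 MP4 GPU, and the Mali-T880 MP4 has little performance advantage over the A72. To take full advantage of GPU speed, you could deploy your project on Kirin 970/980 devices (Mali-G architecture GPUs).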
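
The comparison in the reply above can be checked against the numbers posted in this issue. The sketch below computes CPU/GPU time ratios both from the per-op kernel totals and from the benchmark averages; the values are copied from the logs, and the helper name `gpu_speedup` is made up for illustration.

```python
# backend -> (kernel runtime total in ms, benchmark avg in ms), copied from the logs in this issue.
RESULTS = {
    ("Kirin 955", "shufflenet_v2.tnnproto"): {"ARM": (8.64113, 9.474), "OPENCL": (10.1338, 38.150)},
    ("Kirin 955", "portrait.tnnproto"): {"ARM": (227.296, 230.944), "OPENCL": (213.112, 293.257)},
    ("Snapdragon 845", "portrait.tnnproto"): {"ARM": (180.481, 183.615), "OPENCL": (31.7729, 36.155)},
}


def gpu_speedup(run):
    """Return (kernel-level, end-to-end) CPU/GPU time ratios; a value above 1 means the GPU is faster."""
    cpu_kernel, cpu_avg = run["ARM"]
    gpu_kernel, gpu_avg = run["OPENCL"]
    return cpu_kernel / gpu_kernel, cpu_avg / gpu_avg


for (device, model), run in RESULTS.items():
    kernel_ratio, e2e_ratio = gpu_speedup(run)
    print(f"{device:15s} {model:24s} kernel x{kernel_ratio:.2f}  end-to-end x{e2e_ratio:.2f}")
```

For portrait.tnnproto the kernel-level ratio is about 1.07 on the Kirin 955 and about 5.7 on the Snapdragon 845, which fits the statement that the Mali-T880 MP4 has little advantage over the A72. End to end, the Kirin 955 GPU falls below 1 for both models, so on that device the ARM backend is the faster choice for these two networks.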

@MHGL
Author

MHGL commented Aug 25, 2021

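@lnmdlong Thank you very much for your reply!
In the Kirin 970 data in this performance test file, small networks such as ShuffleNet and SqueezeNet also show the CPU outperforming the GPU. So my question is: how can small-network TNN models be optimized specifically for Kirin chips? Is there a concrete workflow for deploying TNN on Huawei devices? Thanks.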

@lnmdlong
Collaborator

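@MHGL TNN has already made some optimizations for the GPU performance of small networks on Kirin chips. For some models the GPU is still slower than the CPU, which is related to the model structure and the hardware characteristics. There is no further optimization plan for now; we will share updates promptly once one is scheduled. For the deployment workflow, see the TNN demo: https://github.com/Tencent/TNN/blob/master/doc/en/user/demo_en.md#ii-introduction-to-android-demo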
