The goVPP synchronous API exhibits slow performance. #254

Open
mohsinkazmi opened this issue Dec 13, 2024 · 1 comment
mohsinkazmi commented Dec 13, 2024

Description
Upon restart, the agent sends a large number of API calls (2,100,140) to configure VPP, and the process takes a significant amount of time to complete. I benchmarked the goVPP APIs using the perf-bench example; the results indicate that the performance of the RPC, synchronous channel, and synchronous stream paths is inadequate and requires substantial improvement. These paths should be capable of at least 30K requests per second.

How to reproduce
Steps to reproduce the behavior:
I used a request count of 2,100,140 to profile and benchmark the goVPP API. Below are the exact command lines used for benchmarking.
sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140 --rpc
sudo ./perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140
sudo ./perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140 --v2
sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140
sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140 --v2

Please apply PR #253, which adds RPC support to the perf-bench example.
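
For context, the synchronous code paths exercised by these commands look roughly like the sketch below (a minimal example assuming the go.fd.io/govpp module layout, the Channel and Stream APIs, and the vpe control_ping message; error handling shortened):

```go
package main

import (
	"context"
	"log"

	"go.fd.io/govpp"
	"go.fd.io/govpp/binapi/vpe"
)

func main() {
	conn, err := govpp.Connect("/run/vpp/api.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Disconnect()

	// Synchronous channel API: each request blocks until its reply arrives.
	ch, err := conn.NewAPIChannel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()
	if err := ch.SendRequest(&vpe.ControlPing{}).ReceiveReply(&vpe.ControlPingReply{}); err != nil {
		log.Fatal(err)
	}

	// Synchronous stream API (--v2): SendMsg followed by a blocking RecvMsg.
	stream, err := conn.NewStream(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close()
	if err := stream.SendMsg(&vpe.ControlPing{}); err != nil {
		log.Fatal(err)
	}
	if _, err := stream.RecvMsg(); err != nil {
		log.Fatal(err)
	}
}
```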

RPC

sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140 --rpc
Running RPC perf test with 2100140 requests...
Test took: 2m38.333176979s
Requests per second: 13264

Channel

sudo ./perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140
Running synchronous perf test with 2100140 requests...
Test took: 2m41.739194965s
Requests per second: 12985

Stream

sudo ./perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140 --v2
Running synchronous perf test with 2100140 requests...
Test took: 1m46.650463371s
Requests per second: 19692

Channel Async (uses another goroutine)

sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140
Running asynchronous perf test with 2100140 requests...
Sending asynchronous requests finished
Sending 2100140 Receiving 2100140
Test took: 23.745621161s
Requests per second: 88443

Stream Async (uses another goroutine)

sudo ./perf-bench --api-socket /run/vpp/api.sock --count=2100140 --v2
Running asynchronous perf test with 2100140 requests...
Sending asynchronous requests finished
Sending 2100140 Receiving 2100140
Test took: 32.591164196s
Requests per second: 64439
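
The async variants gain their throughput by decoupling sending from receiving: one goroutine keeps issuing requests while another drains the replies, so individual calls no longer pay a full round trip each. A rough sketch of that pattern with the channel API is shown below (buffer sizes and error handling simplified; this illustrates the idea, not the exact perf-bench code):

```go
import (
	"log"

	"go.fd.io/govpp/api"
	"go.fd.io/govpp/binapi/vpe"
)

// runAsync pipelines count control_ping requests over an already-open channel:
// one goroutine sends, the caller's goroutine drains the replies in order.
// The channel's reply buffer must be sized for the number of in-flight
// requests (e.g. by creating it with conn.NewAPIChannelBuffered).
func runAsync(ch api.Channel, count int) {
	reqCtxs := make(chan api.RequestCtx, 1000)

	go func() {
		for i := 0; i < count; i++ {
			reqCtxs <- ch.SendRequest(&vpe.ControlPing{})
		}
		close(reqCtxs)
	}()

	for ctx := range reqCtxs {
		if err := ctx.ReceiveReply(&vpe.ControlPingReply{}); err != nil {
			log.Fatal(err)
		}
	}
}
```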

Environment

Info            Value
OS/arch         linux/x86_64
VPP version     master
GoVPP version   master
uname -r
5.15.0-107-generic
lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
vpp# show cpu
Model name:               Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
Microarch model (family): [0x6] Skylake ([0x55] Skylake X/SP) stepping 0x4
Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe avx512f rdseed aes aperfmperf invariant_tsc 
Base frequency:           3.20 GHz
vpp# sh version
vpp v25.02-rc0~210-ge210d413d built by sykazmi on vpp-220-6 at 2025-01-03T14:36:50

To compare the results with the C API bindings (shared memory for now):
Fetch or cherry-pick the patch that adds benchmarking support for the C API bindings to VAT2: https://gerrit.fd.io/r/c/vpp/+/42105
Then compile and run VPP.

C API Async performance:

sudo ./vpp/build-root/install-vpp-native/vpp/bin/vat2 --benchmark 2100000
Benchmarking 2100000 repeats
Elapsed time: 9.837724 seconds Reqs/s: 213464.01

C API Sync performance:

  1. Find the control_ping message:
sudo ./vpp/build-root/install-vpp-native/vpp/bin/vat2 --dump-apis | grep control_ping
control_ping
  2. Find the JSON message format for control_ping:
sudo ./vpp/build-root/install-vpp-native/vpp/bin/vat2 -t control_ping
{
	"_msgname":	"control_ping",
	"_crc":	"51077d14"
}
  3. Create and execute the command:
sudo ./vpp/build-root/install-vpp-native/vpp/bin/vat2 --benchmark 210000 control_ping '{"_msgname":  "control_ping","_crc":  "51077d14"}'
Benchmarking 210000 repeats
Elapsed time: 5.342462 seconds Reqs/s: 39307.72
@VladoLavor (Collaborator) commented:

cc @ondrej-fabry

Thank you @mohsinkazmi for providing the PR with the RPC perf test code. Here are the results reproduced on my local machine.

RPC

$perf-bench --api-socket /run/vpp/api.sock --count=2100140 --rpc
Running RPC perf test with 2100140 requests...
Test took: 2m29.437366922s
Requests per second: 14054

Channel

$perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140
Running synchronous perf test with 2100140 requests...
Test took: 2m52.709149771s
Requests per second: 12160

Stream

$perf-bench --sync --api-socket /run/vpp/api.sock --count=2100140 --v2
Running synchronous perf test with 2100140 requests...
Test took: 2m37.342074475s
Requests per second: 13348

Channel async

$perf-bench --api-socket /run/vpp/api.sock --count=2100140
Running asynchronous perf test with 2100140 requests...
Sending asynchronous requests finished
Sending 2100140 Receiving 2100140
Test took: 9.175053706s
Requests per second: 228897

Stream async

$perf-bench --api-socket /run/vpp/api.sock --count=2100140 --v2
Running asynchronous perf test with 2100140 requests...
Sending asynchronous requests finished
Sending 2100140 Receiving 2100140
Test took: 16.187933466s
Requests per second: 129735

All measurements below were done using the Stream API and control_ping message.

[1] Stream request/reply per method duration (better result):

Request:
  SendMsg/newRequest                                               // 60ns
  SendMsg/processRequest                                           // 17.948µs
    SendMsg/processRequest/checkConnection                         // 60ns
    SendMsg/processRequest/retrieveMsg                             // 1.672µs
    SendMsg/processRequest/encode                                  // 373ns
    SendMsg/processRequest/packRequest                             // 36ns
    SendMsg/processRequest/sendRequest                             // 15.411µs
  SendMsg/lock                                                     // 336ns
  // total request 18.511µs
Reply:
    RecvMsg/recvReply/timer                                        // 1.102µs
    RecvMsg/recvReply/receiveReply                                 // 41ns
  RecvMsg/lock                                                     // 41ns
  RecvMsg/lookupId                                                 // 221ns
  RecvMsg/msgInterface                                             // 178ns
  RecvMsg/decode                                                   // 107ns
  // total response 51.303µs
// total request + response 69.814µs

[2] Stream request/reply per method duration (worse results):

Request:
  SendMsg/newRequest                                               // 176ns
  SendMsg/processRequest                                           // 11.025µs
    SendMsg/processRequest/checkConnection                         // 466ns
    SendMsg/processRequest/retrieveMsg                             // 661ns
    SendMsg/processRequest/encode                                  // 4.806µs
    SendMsg/processRequest/packRequest                             // 34ns
    SendMsg/processRequest/sendRequest                             // 4.521µs
  SendMsg/lock                                                     // 190ns
  // total request 11.539µs
Reply:
    RecvMsg/recvReply/timer                                        // 533.535µs
    RecvMsg/recvReply/receiveReply                                 // 166ns
  RecvMsg/lock                                                     // 45ns
  RecvMsg/lookupId                                                 // 942ns
  RecvMsg/msgInterface                                             // 1.858µs
  RecvMsg/decode                                                   // 758ns
  // total response 539.487µs
// total request + response 551.026µs

Although the request side is a bit slower in [1], the overall request/reply duration is much shorter there because the reply arrives quickly. The slowest operation on the request side is writing to the socket, followed by the encoding and the GetMessageID call. The request/reply operation is slowed down mostly by waiting for the reply, which varies a lot between calls.

Test with 2M control_ping requests:
The best request/response duration - 3.73 µs
The worst request/response duration - 17.07 ms (that's extremely slow)
Average request/response duration - 62.68µs
Requests per second - 15954
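
For reference, round-trip numbers like the ones above can be collected with a simple wrapper around the Stream API, along the lines of the sketch below (a minimal example; not the exact instrumentation used for these measurements):

```go
import (
	"time"

	"go.fd.io/govpp/api"
	"go.fd.io/govpp/binapi/vpe"
)

// timedPing sends one control_ping over the stream and returns how long the
// full request/reply round trip took. Min/max/average can be aggregated by
// the caller over the 2M iterations.
func timedPing(stream api.Stream) (time.Duration, error) {
	start := time.Now()
	if err := stream.SendMsg(&vpe.ControlPing{}); err != nil {
		return 0, err
	}
	if _, err := stream.RecvMsg(); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}
```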

One of my thoughts was that those latency spikes are caused by GC stop-the-world (STW) pauses. Here are the results from the memory profiler for simple APIs and a regular VPP configuration (creating objects, etc.) using the latest GoVPP:

Simple API

┌──────────────────────────────────────────────────┐
│ Simple API 1000000 calls                         │
├──────────────────────────────────────────────────┤
│ Time: 1m13.786936647s                            │
│ ~API calls: 1000000                              │
│ Rate per second: 13,552                          │
│ Result: PASSED                                   │
├────────────────┬────────────────┬────────────────┤
│                │    Measured    │    Threshold   │
├────────────────┼────────────────┼────────────────┤
│ Total alloc    │      848.79 MB │        1.20 GB │
│ Memory Freed   │      847.60 MB │                │
│ Heap alloc     │        1.19 MB │        2.40 MB │
│ Objects alloc  │     16,504,766 │                │
│ Obj freed      │     16,482,096 │                │
│ Objects remain │         22,670 │         55,000 │
│ Num goroutines │             10 │                │
└────────────────┴────────────────┴────────────────┘

VPP API

┌──────────────────────────────────────────────────┐
│ VPP API 40000 iterations:                        │
├──────────────────────────────────────────────────┤
│ Time: 4m25.484724434s     Iterations: 40000      │
│ ~API calls: 1000000                              │
│ Rate per second: 3,766     Result: PASSED        │
├────────────────┬────────────────┬────────────────┤
│                │    Measured    │    Threshold   │
├────────────────┼────────────────┼────────────────┤
│ Total alloc    │        1.04 GB │        1.10 GB │
│ Memory Freed   │        1.04 GB │                │
│ Heap alloc     │            0 B │        2.40 MB │
│ Objects alloc  │     23,032,895 │                │
│ Obj freed      │     23,032,786 │                │
│ Objects remain │            109 │         55,000 │
│ Num goroutines │             10 │                │
└────────────────┴────────────────┴────────────────┘

Short explanation: Total alloc and Objects alloc increase with the number and type of API calls. Memory Freed and Obj freed are supposed to stay close to the allocation numbers. Heap alloc and Objects remain should always stay below a certain threshold; otherwise it hints at a memory leak.
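
The columns in these tables map closely onto Go's runtime counters; a minimal sketch of how such numbers can be sampled with runtime.ReadMemStats (the thresholds themselves are test-specific and not shown here):

```go
import (
	"fmt"
	"runtime"
)

// reportMem prints counters corresponding to the table rows above:
// Total alloc, Heap alloc, Objects alloc/freed/remaining and goroutine count.
func reportMem() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Printf("Total alloc:    %d B\n", m.TotalAlloc)             // cumulative bytes allocated
	fmt.Printf("Memory freed:   %d B\n", m.TotalAlloc-m.HeapAlloc) // rough approximation
	fmt.Printf("Heap alloc:     %d B\n", m.HeapAlloc)              // live heap bytes
	fmt.Printf("Objects alloc:  %d\n", m.Mallocs)
	fmt.Printf("Obj freed:      %d\n", m.Frees)
	fmt.Printf("Objects remain: %d\n", m.Mallocs-m.Frees)          // equals m.HeapObjects
	fmt.Printf("Num goroutines: %d\n", runtime.NumGoroutine())
}
```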

I created an experimental change (#262) to decrease the number of allocated objects, and re-ran both tests.

Simple API patched

┌──────────────────────────────────────────────────┐
│ Simple API 1000000 calls                         │
├──────────────────────────────────────────────────┤
│ Time: 1m1.946427138s                             │
│ ~API calls: 1000000                              │
│ Rate per second: 16,142                          │
│ Result: PASSED                                   │
├────────────────┬────────────────┬────────────────┤
│                │    Measured    │    Threshold   │
├────────────────┼────────────────┼────────────────┤
│ Total alloc    │      786.02 MB │        1.20 GB │
│ Memory Freed   │      785.26 MB │                │
│ Heap alloc     │      756.82 kB │        2.40 MB │
│ Objects alloc  │     15,510,008 │                │
│ Obj freed      │     15,495,785 │                │
│ Objects remain │         14,223 │         55,000 │
│ Num goroutines │             10 │                │
└────────────────┴────────────────┴────────────────┘

The allocated memory/objects decreased by around 5% and RPS went up from 13.5k to 16k.

VPP API patched

┌──────────────────────────────────────────────────┐
│ Frouter2 API 40000 iterations:                   │
├──────────────────────────────────────────────────┤
│ Time: 4m18.235219032s     Iterations: 40000      │
│ ~API calls: 1000000                              │
│ Rate per second: 3,872     Result: PASSED        │
├────────────────┬────────────────┬────────────────┤
│                │    Measured    │    Threshold   │
├────────────────┼────────────────┼────────────────┤
│ Total alloc    │      915.39 MB │        1.10 GB │
│ Memory Freed   │      915.16 MB │                │
│ Heap alloc     │      238.78 kB │        2.40 MB │
│ Objects alloc  │     21,116,020 │                │
│ Obj freed      │     21,110,574 │                │
│ Objects remain │          5,446 │         55,000 │
│ Num goroutines │             10 │                │
└────────────────┴────────────────┴────────────────┘

Here the decrease in memory allocation is even more apparent, but the RPS gain is only marginal.

I also ran the Stream API test with 2M requests and got 16,552 RPS (compared to 15,954 RPS with the unpatched version).
So this could be one way to improve GoVPP performance.
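
For illustration only, a common way to cut per-call allocations in Go is to reuse encode buffers through a sync.Pool; the sketch below shows the generic technique and is not necessarily what #262 implements:

```go
import (
	"bytes"
	"sync"
)

// bufPool reuses encode buffers across API calls instead of allocating a
// fresh buffer per request.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// encodeWithPool runs the given encode function against a pooled buffer and
// returns a copy of the encoded bytes, so the buffer can be safely reused.
func encodeWithPool(encode func(*bytes.Buffer) error) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	if err := encode(buf); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}
```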

@mohsinkazmi as per our previous discussion, here are additional points to investigate:

  • what is the C performance over the socket transport? (vat2 uses shared memory, AFAIK)
  • are the vat2 and govpp results from the same environment (both local or both in the DC)?
  • govpp API trace?
