NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf #594

msherif1234 · 2025-03-03T13:32:24Z

Description

Using perf events not only performs much worse than ringbuf, it also forces the application to run in privileged mode because of kernel restrictions.

This PR migrates PCA to use ringbuf.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

msherif1234 · 2025-03-03T13:57:20Z

/ok-to-test

github-actions · 2025-03-03T13:59:49Z

New images:
quay.io/netobserv/ebpf-bytecode:3dcf15b
quay.io/netobserv/netobserv-ebpf-agent:3dcf15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=3dcf15b make set-agent-image

msherif1234 · 2025-03-03T14:46:15Z

Tested with the CLI:

 USER=netobserv NETOBSERV_AGENT_IMAGE=quay.io/netobserv/netobserv-ebpf-agent:3dcf15b COMMAND_ARGS="--protocol=TCP --port=80" make packets

openshift-ci-robot · 2025-03-03T14:51:57Z

msherif1234 · 2025-03-03T17:19:29Z

Before:

15565: sched_cls  name tcx_egress_pca_parse  tag cf185091d59c5f15  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4740B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 650
	pids netobserv-ebpf-(1147708)
15566: sched_cls  name tcx_ingress_pca_parse  tag 6e1c4d2436defe26  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4737B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 651
	pids netobserv-ebpf-(1147708)

sudo perf stat -e cycles,instructions --bpf-prog 15565 --timeout 10000
 Performance counter stats for 'BPF program(s) 15565':

         2,798,598      cycles                                                                
           940,169      instructions                     #    0.34  insn per cycle            

      10.012628932 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15566 --timeout 10000
Performance counter stats for 'BPF program(s) 15566':

         2,661,480      cycles                                                                
           831,513      instructions                     #    0.31  insn per cycle            

      10.011295383 seconds time elapsed

After:

15634: sched_cls  name tcx_egress_pca_parse  tag 3b823b0d1e696fd4  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5020B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 738
	pids netobserv-ebpf-(1152392)
15635: sched_cls  name tcx_ingress_pca_parse  tag 8784a6295ce1517f  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5017B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 739
	pids netobserv-ebpf-(1152392)

sudo perf stat -e cycles,instructions --bpf-prog 15634 --timeout 10000

 Performance counter stats for 'BPF program(s) 15634':

         1,064,322      cycles                                                                
           388,642      instructions                     #    0.37  insn per cycle            

      10.012311676 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15635 --timeout 10000

 Performance counter stats for 'BPF program(s) 15635':

         2,018,524      cycles                                                                
           644,693      instructions                     #    0.32  insn per cycle            

      10.012645276 seconds time elapsed
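
For a rough reading of the perf stat numbers above, the relative change in cycles per program can be computed as sketched below. The helper name `reductionPct` is made up for illustration. The before and after windows were not necessarily measured under identical traffic, so the percentages are indicative at best.

```go
package main

import "fmt"

// reductionPct returns by how many percent `after` is lower than `before`.
// Hypothetical helper for illustration, not part of the agent.
func reductionPct(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Cycle counts copied from the perf stat output in this comment.
	fmt.Printf("egress:  %.1f%% fewer cycles\n", reductionPct(2798598, 1064322))
	fmt.Printf("ingress: %.1f%% fewer cycles\n", reductionPct(2661480, 2018524))
}
```

Dividing by the number of packets seen in each window would give a fairer per-packet comparison, but that count is not part of the output above.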

jotak · 2025-03-04T08:28:22Z

bpf/maps_definition.h

-    __type(value, u32);
-    __uint(max_entries, 256);
+    __uint(type, BPF_MAP_TYPE_RINGBUF);
+    __uint(max_entries, 1 << 16);


that's much bigger than the perfevent array, is there a specific reason for that?
Also: ringbuf doesn't ask for values type?

I felt 256 was too low, especially since this map is only allocated when PCA is enabled and it's the only active map for the packet agent + filters, so why not give it a bit more?
Right, this map type doesn't have a key or value: https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_RINGBUF/

fwiw, max_entries is the size in bytes you want to allocate to the ringbuf.
65536 bytes (64 KiB) is tiny, but perhaps that's what you want.
Personally I would size based on:

size of payload

expected events/sec

desired amount of buffer space

and document the formula in a comment.
e.g. a 256-byte payload * 1000 events/sec * 5 sec of buffer = 1,280,000 bytes.
The next power of 2 above that is 2^21.
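
The sizing rule suggested above (payload size × events per second × seconds of buffering, rounded up to a power of two) can be sketched in Go as follows. The function name `ringbufBytes` is hypothetical, and the inputs are the illustrative figures from the comment, not measurements from the agent.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ringbufBytes returns the smallest power of two that can hold
// recordBytes * eventsPerSec * seconds bytes.
// Hypothetical helper; the inputs used in main are illustrative only.
func ringbufBytes(recordBytes, eventsPerSec, seconds uint64) uint64 {
	need := recordBytes * eventsPerSec * seconds
	if need <= 1 {
		return 1
	}
	// bits.Len64(need-1) is the exponent of the next power of two >= need.
	return uint64(1) << bits.Len64(need-1)
}

func main() {
	size := ringbufBytes(256, 1000, 5)
	fmt.Printf("max_entries = %d (2^%d)\n", size, bits.Len64(size)-1)
	// prints "max_entries = 2097152 (2^21)"
}
```

The ring buffer size also has to be a multiple of the page size. Any power of two at or above the page size satisfies that.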

coming back to this, I see there's at least 4096 bytes allocated for packet payload in each record, so as it stands you'll hold 15 packets in the ringbuf before you start losing data.

On second thought, 4k is way too much. I dropped that to 256, which should be enough for basic headers, including IPv6.
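
To make the "15 packets" estimate concrete, here is a small sketch of how many records fit in a 64 KiB ring buffer for the two payload sizes discussed. The 8-byte per-record header is what the kernel prepends to each ringbuf record. The 16-byte metadata size is an assumption chosen for illustration; the real value depends on the struct in bpf/types.h.

```go
package main

import "fmt"

const (
	ringbufHdrBytes = 8  // per-record header added by the kernel (BPF_RINGBUF_HDR_SZ)
	metaBytes       = 16 // ASSUMPTION: metadata struct size, check bpf/types.h
)

// recordsThatFit estimates how many records fit in a ring buffer of
// ringBytes when each record carries payloadBytes of packet data.
// Hypothetical helper for illustration.
func recordsThatFit(ringBytes, payloadBytes uint64) uint64 {
	rec := payloadBytes + metaBytes + ringbufHdrBytes
	rec = (rec + 7) &^ 7 // records are rounded up to a multiple of 8 bytes
	return ringBytes / rec
}

func main() {
	const ring = 1 << 16
	fmt.Println("4096-byte payload:", recordsThatFit(ring, 4096), "records")
	fmt.Println(" 256-byte payload:", recordsThatFit(ring, 256), "records")
}
```

Under these assumptions, dropping the payload from 4096 to 256 bytes raises the capacity of the same 64 KiB buffer from 15 records to a little over two hundred.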

jotak · 2025-03-04T08:32:14Z

bpf/pca.h

    // Enable the flag to add packet header
    // Packet payload follows immediately after the meta struct
    u32 packetSize = (u32)(data_end - data);

    // Record the current time.
    u64 current_time = bpf_ktime_get_ns();

+    e = bpf_ringbuf_reserve(&packet_record, sizeof(payload_meta), 0);
+    if (!e) {
+        return TC_ACT_UNSPEC;


the return value of attach_packet_payload is ignored in the hooks; they always return TC_ACT_OK / TCX_NEXT ; could you take the opportunity to make it return void (or not ignore the return value) ?

Sure, I will carry the return code all the way back to the hook.

It probably doesn't make a difference, but I think I'd prefer the other option, to return void :-)

Because netobserv should not signal anything particular to the kernel regarding how it has to process the packet - we must not take any decision.
I get that "UNSPEC" and "OK" is most of the time equivalent... but wondering if there are some devils in the details

Here are all the possible TC return codes, for reference:

    #define TC_ACT_UNSPEC (-1)
    #define TC_ACT_OK 0
    #define TC_ACT_RECLASSIFY 1
    #define TC_ACT_SHOT 2
    #define TC_ACT_PIPE 3
    #define TC_ACT_STOLEN 4
    #define TC_ACT_QUEUED 5
    #define TC_ACT_REPEAT 6
    #define TC_ACT_REDIRECT 7
    #define TC_ACT_TRAP 8 /* For hw path, this means "trap to cpu"
                           * and don't further process the frame
                           * in hardware. For sw path, this is
                           * equivalent of TC_ACT_STOLEN - drop
                           * the skb and act like everything
                           * is alright.
                           */
    #define TC_ACT_VALUE_MAX TC_ACT_TRAP

For TC/TCX, the return code affects how a TC hook coexists with other TC/TCX hooks. We aren't returning anything crazy that would affect kernel packet processing much.

But our story for interacting with existing TC/TCX hooks is not fully matured or tested yet.

So you are suggesting not to trickle down the return code and to let the hook always return TC_ACT_OK?

I don't know if it's better to always return TC_ACT_OK or always return TC_ACT_UNSPEC (it seems it doesn't make much difference), but I would always return the same thing just for the sake of telling: netobserv doesn't take any decision about what to do with the packets - it's purely an observer.

We can frame that differently: what's the rationale behind returning sometimes OK, and sometimes UNSPEC ? What are we trying to convey with this differentiation? If the answer is nothing, then why doing this differentiation?

Neither return value affects kernel packet processing. UNSPEC is the default and is simply ignored, while OK indicates that the TC or TCX program ran successfully; in the TCX case we can then advance to the next TCX hook if one is available. Maybe it's not worth it and we should always return 0 for TC and NEXT for TCX regardless. It was just a suggestion to handle the return code :)

jotak · 2025-03-04T08:33:58Z

bpf/types.h

@@ -69,6 +69,8 @@ typedef __u64 u64;
 #define MAX_OBSERVED_INTERFACES 6
 #define OBSERVED_DIRECTION_BOTH 3

+#define MAX_PAYLOAD_SIZE 512


we may have to document this as a limitation, wdyt?

What happens if it's reached ? 🤔
I wonder if the pcapng file will still work

If the packet header exceeds 512 bytes we will cap it at 512, which should be enough for most basic IPv4 and IPv6 packets. We have to place a limit: I can't use a dynamic size to reserve ringbuf space, as I get verifier errors. I can make it even larger, 4k maybe?

I used 4k to be on the safe side, especially with IPv6, and we need to document this limit.

Could it be configurable ?
That would be the best !

No, it has to be static when you define the structure.

Actually, 256 is more than enough even for the IPv6 case: the IPv6 base header is just 40 bytes. So I will go back to a conservative header array size unless that causes issues in the future. And yes, we need to document this.
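
As a sanity check on the 256-byte cap, the sketch below adds up worst-case basic header sizes and shows the capping behaviour described above. The constant and function names are hypothetical, and 256 is assumed to be the final value of the limit. IPv6 extension headers, VLAN tags, or tunnel encapsulation would add to the header total.

```go
package main

import "fmt"

const (
	maxPayloadSize = 256 // ASSUMPTION: the cap this thread settles on
	ethHdr         = 14  // Ethernet II, no VLAN tag
	ipv6BaseHdr    = 40  // fixed IPv6 header
	tcpMaxHdr      = 60  // TCP header with maximum options
)

// capturedLen mirrors the capping described above: at most maxPayloadSize
// bytes of each packet are copied. Hypothetical helper for illustration.
func capturedLen(packetLen int) int {
	if packetLen > maxPayloadSize {
		return maxPayloadSize
	}
	return packetLen
}

func main() {
	hdrs := ethHdr + ipv6BaseHdr + tcpMaxHdr
	fmt.Println("worst-case basic headers:", hdrs, "bytes")
	fmt.Println("captured from a 1500-byte packet:", capturedLen(1500), "bytes")
	fmt.Println("captured from a 64-byte packet:", capturedLen(64), "bytes")
}
```

Packets longer than the cap end up truncated in the capture, which is the limitation the thread agrees should be documented.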

jotak

looks good overall
just, it would be nice to fix the ignored returned value in pca.h

dave-tucker

adding some comments since I promised @msherif1234 I'd take a look

dave-tucker · 2025-03-04T15:09:28Z

bpf/maps_definition.h

-    __type(value, u32);
-    __uint(max_entries, 256);
+    __uint(type, BPF_MAP_TYPE_RINGBUF);
+    __uint(max_entries, 1 << 16);


fwiw, max_entries is the size in bytes you want to allocate to the ringbuf.
65536 bytes (64 KiB) is tiny, but perhaps that's what you want.
Personally I would size based on:

size of payload

expected events/sec

desired amount of buffer space

and document the formula in a comment.
e.g. a 256-byte payload * 1000 events/sec * 5 sec of buffer = 1,280,000 bytes.
The next power of 2 above that is 2^21.

bpf/pca.h

dave-tucker · 2025-03-04T15:25:56Z

bpf/maps_definition.h

-    __type(value, u32);
-    __uint(max_entries, 256);
+    __uint(type, BPF_MAP_TYPE_RINGBUF);
+    __uint(max_entries, 1 << 16);


coming back to this, I see there's at least 4096 bytes allocated for packet payload in each record, so as it stands you'll hold 15 packets in the ringbuf before you start losing data.

dave-tucker · 2025-03-04T16:15:30Z

bpf/pca.h

    }
-    return TC_ACT_UNSPEC;
+    if (packetSize > 0 && bpf_skb_load_bytes(skb, 0, e->payload, packetSize)) {


Since you've called bpf_skb_pull_data here, the skb data has already been linearized, so you may as well just use memcpy, which might save you some instructions. Otherwise you could just call bpf_skb_load_bytes and not bpf_skb_pull_data.

/cc @tohojo to check my understanding is correct.

I don't think memcpy will compile with variable length

Yeah, definitely don't do bpf_skb_pull_data() for this! Pulling the full data into a linear buffer can have performance side effects for the rest of that skb's lifetime.

Instead, just use bpf_skb_load_bytes() on the skb as-is, that will work just fine, and won't modify the skb itself. This is essentially also what perf_event_output() does when you pass it an skb pointer :)

Great, thanks @tohojo and @dave-tucker for the review and the comments. I dropped bpf_skb_pull_data() in favor of bpf_skb_load_bytes() and updated the code comments accordingly, so I can remember this when I revisit the code in the future.

dave-tucker · 2025-03-04T16:24:11Z

pkg/model/packet_record.go

 	pr.Time = currentTime.Add(-tsDelta)

-	err := binary.Read(reader, binary.LittleEndian, &pr.Stream)
+	err := binary.Read(reader, binary.NativeEndian, &pr.Stream)


I get that the reads of other fields should be NativeEndian...

But if you are reading data from skb->data, shouldn't you be reading in binary.BigEndian format? NetworkEndian == BigEndian

It's an array of bytes, so it shouldn't matter. But when it is copied to userspace, don't we need it in host-endian format?

msherif1234 · 2025-03-05T11:47:25Z

/ok-to-test

github-actions · 2025-03-05T11:49:38Z

New images:
quay.io/netobserv/ebpf-bytecode:2d61633
quay.io/netobserv/netobserv-ebpf-agent:2d61633

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2d61633 make set-agent-image

jotak · 2025-03-10T15:33:51Z

/lgtm

Amoghrd · 2025-03-10T16:02:48Z

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

msherif1234 · 2025-03-10T16:34:21Z

/test qe-e2e-tests

msherif1234 · 2025-03-10T16:35:49Z

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

@Amoghrd my changes should have no impact on regular agent functionality. They are limited to the PCA feature, which isn't something the e2e tests run. I reran the tests to see whether this is consistent or a flake.

Amoghrd · 2025-03-10T21:09:47Z

/retest

openshift-ci · 2025-03-10T22:55:38Z

New changes are detected. LGTM label has been removed.

memodi · 2025-03-11T01:39:47Z

/ok-to-test

github-actions · 2025-03-11T01:42:06Z

New images:
quay.io/netobserv/ebpf-bytecode:1355cdd
quay.io/netobserv/netobserv-ebpf-agent:1355cdd

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1355cdd make set-agent-image

msherif1234 · 2025-03-11T12:33:53Z

/ok-to-test

github-actions · 2025-03-11T12:36:20Z

New images:
quay.io/netobserv/ebpf-bytecode:a8fc15b
quay.io/netobserv/netobserv-ebpf-agent:a8fc15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=a8fc15b make set-agent-image

msherif1234 · 2025-03-11T13:25:58Z

/test images
/test qe-e2e-tests

msherif1234 · 2025-03-11T13:26:41Z

/hold
https://issues.redhat.com/browse/RHEL-83254

msherif1234 · 2025-03-11T17:58:32Z

/test images
/test qe-e2e-tests

msherif1234 · 2025-03-12T10:38:33Z

/test images
/test qe-e2e-tests

msherif1234 · 2025-03-18T10:42:35Z

/ok-to-test

msherif1234 · 2025-03-26T11:06:05Z

/hold

Signed-off-by: Mohamed Mahmoud <[email protected]>

openshift-ci · 2025-04-01T03:13:35Z

@msherif1234: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/qe-e2e-tests	`b00a62b`	link	false	`/test qe-e2e-tests`


msherif1234 force-pushed the pcap_use_rb branch 2 times, most recently from add528a to 95098ea Compare March 3, 2025 13:38

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 3, 2025

msherif1234 changed the title ~~Switch PCA feature from using perf events to ringbuf~~ NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf Mar 3, 2025

openshift-ci-robot added the jira/valid-reference label Mar 3, 2025

msherif1234 force-pushed the pcap_use_rb branch from 95098ea to 337214c Compare March 3, 2025 17:35

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 3, 2025

msherif1234 force-pushed the pcap_use_rb branch from 337214c to 96549c6 Compare March 4, 2025 02:47

jotak reviewed Mar 4, 2025

msherif1234 force-pushed the pcap_use_rb branch from 96549c6 to bc506da Compare March 4, 2025 11:26

msherif1234 requested review from jpinsonneau and jotak March 4, 2025 11:29

dave-tucker reviewed Mar 4, 2025

msherif1234 force-pushed the pcap_use_rb branch 4 times, most recently from dfad84e to 153f993 Compare March 5, 2025 11:44

msherif1234 requested review from dave-tucker and tohojo March 5, 2025 11:45

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 5, 2025

openshift-ci bot added the lgtm label Mar 10, 2025

msherif1234 force-pushed the pcap_use_rb branch from 770121d to 6aefad5 Compare March 10, 2025 22:55

openshift-ci bot removed the lgtm label Mar 10, 2025

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 10, 2025

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025

msherif1234 force-pushed the pcap_use_rb branch from 6aefad5 to ceff2b0 Compare March 11, 2025 12:31

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025

openshift-ci bot added the do-not-merge/hold label Mar 26, 2025

msherif1234 added 2 commits March 31, 2025 17:00

eBPF kernel changes to switch from pref events to ringbuf

54b946c

Signed-off-by: Mohamed Mahmoud <[email protected]>

userspace to switch to use ringbuf

b00a62b

Signed-off-by: Mohamed Mahmoud <[email protected]>

msherif1234 force-pushed the pcap_use_rb branch from ceff2b0 to b00a62b Compare March 31, 2025 21:00

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 31, 2025
