-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf #594
base: main
Are you sure you want to change the base?
Conversation
add528a
to
95098ea
Compare
/ok-to-test |
New images: These will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=3dcf15b make set-agent-image |
@msherif1234: This pull request references NETOBSERV-2148 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
15565: sched_cls name tcx_egress_pca_parse tag cf185091d59c5f15 gpl
loaded_at 2025-03-03T11:54:45-0500 uid 0
xlated 7384B jited 4740B memlock 12288B map_ids 615,616,617,618,619
btf_id 650
pids netobserv-ebpf-(1147708)
15566: sched_cls name tcx_ingress_pca_parse tag 6e1c4d2436defe26 gpl
loaded_at 2025-03-03T11:54:45-0500 uid 0
xlated 7384B jited 4737B memlock 12288B map_ids 615,616,617,618,619
btf_id 651
pids netobserv-ebpf-(1147708)
sudo perf stat -e cycles,instructions --bpf-prog 15565 --timeout 10000
Performance counter stats for 'BPF program(s) 15565':
2,798,598 cycles
940,169 instructions # 0.34 insn per cycle
10.012628932 seconds time elapsed
sudo perf stat -e cycles,instructions --bpf-prog 15566 --timeout 10000
Performance counter stats for 'BPF program(s) 15566':
2,661,480 cycles
831,513 instructions # 0.31 insn per cycle
10.011295383 seconds time elapsed
|
bpf/maps_definition.h
Outdated
__type(value, u32); | ||
__uint(max_entries, 256); | ||
__uint(type, BPF_MAP_TYPE_RINGBUF); | ||
__uint(max_entries, 1 << 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's much bigger than the perfevent array, is there a specific reason for that?
Also: ringbuf doesn't ask for values type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt 256 is too low specially this map is only allocated when PCA is enabled and its the only active map for the packet agent + filters so why not give it a bit more ?
right this map type doesn't have key or value https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_RINGBUF/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, max_entries is a size in bytes you want to allocate to the ringbuf.
65535 bytes is tiny (64kB) but perhaps that what you want.
Personally I would size based on:
- size of payload
- expected events/sec
- desired amount of buffer space
and document the formula in a comment.
e.g payload is 256bytes * 1000 events/sec * 5 sec buffer = 1310720 bytes.
Nearest power of 2 would be 2^21.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coming back to this, I see there's at least 4096 bytes allocated for packet payload in each record, so as it stands you'll hold 15 packets in the ringbuf before you start losing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on 2dn thought 4k is why too much I dropped that to 256 which should be enough for basic hdrs including v6
bpf/pca.h
Outdated
// Enable the flag to add packet header | ||
// Packet payload follows immediately after the meta struct | ||
u32 packetSize = (u32)(data_end - data); | ||
|
||
// Record the current time. | ||
u64 current_time = bpf_ktime_get_ns(); | ||
|
||
e = bpf_ringbuf_reserve(&packet_record, sizeof(payload_meta), 0); | ||
if (!e) { | ||
return TC_ACT_UNSPEC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the return value of attach_packet_payload is ignored in the hooks; they always return TC_ACT_OK
/ TCX_NEXT
; could you take the opportunity to make it return void (or not ignore the return value) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure will carry on the return code all the way back to the hook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't make a difference, but I think I'd prefer the other option, to return void :-)
Because netobserv should not signal anything particular to the kernel regarding how it has to process the packet - we must not take any decision.
I get that "UNSPEC" and "OK" is most of the time equivalent... but wondering if there are some devils in the details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here are all TC possible return code for reference
#define TC_ACT_UNSPEC (-1)
#define TC_ACT_OK 0
#define TC_ACT_RECLASSIFY 1
#define TC_ACT_SHOT 2
#define TC_ACT_PIPE 3
#define TC_ACT_STOLEN 4
#define TC_ACT_QUEUED 5
#define TC_ACT_REPEAT 6
#define TC_ACT_REDIRECT 7
#define TC_ACT_TRAP 8 /* For hw path, this means "trap to cpu"
* and don't further process the frame
* in hardware. For sw path, this is
* equivalent of TC_ACT_STOLEN - drop
* the skb and act like everything
* is alright.
*/
#define TC_ACT_VALUE_MAX TC_ACT_TRAP
for TC/TCx that return code impact how TC hook can coexists with other TC/TCx hooks, we aren't returning anything crazy that will impact kernel pkt processing much.
but our story with interaction with existing TC/TCX hooks not fully matured/tested yet
so you are suggesting to not trickle down the return code and let the hook always returns TC_ACT_OK
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if it's better to always return TC_ACT_OK or always return TC_ACT_UNSPEC (it seems it doesn't make much difference), but I would always return the same thing just for the sake of telling: netobserv doesn't take any decision about what to do with the packets - it's purely an observer.
We can frame that differently: what's the rationale behind returning sometimes OK, and sometimes UNSPEC ? What are we trying to convey with this differentiation? If the answer is nothing, then why doing this differentiation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both returns won't impact kernel pkt processing UNSPEC is the default and just is ignored while OK indicated the TC or TCX were able to run successfully and in case of TCX we can advance to the next available TCX hook if its available maybe its not worth it and we should always return 0 for TC and NEXT for TCX regardless its was suggestion to handle the return code :)
bpf/types.h
Outdated
@@ -69,6 +69,8 @@ typedef __u64 u64; | |||
#define MAX_OBSERVED_INTERFACES 6 | |||
#define OBSERVED_DIRECTION_BOTH 3 | |||
|
|||
#define MAX_PAYLOAD_SIZE 512 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may have to document this as a limitation, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if it's reached ? 🤔
I wonder if the pcapng file will still work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the packet header exceed 512 bytes we will cap at 512 which should be enough for most of ipv4 and ipv6 basic pkts but we have to place a limit I can't use dynamic size to reserve ringbuf even I will get verifier errors I can make it even larger 4k maybe ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used 4k to be on the safe side specially with IPv6 and we need to doc this limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be configurable ?
That would be the best !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it has to be static when u define the structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually 256 is more than enough even for v6 case v6 base hdrs is just 40 bytes so I will go back to conservative hdr size array unless that caused issues in the future and yes we need to doc this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good overall
just, it would be nice to fix the ignored returned value in pca.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding some comments since I promised @msherif1234 I'd take a look
bpf/maps_definition.h
Outdated
__type(value, u32); | ||
__uint(max_entries, 256); | ||
__uint(type, BPF_MAP_TYPE_RINGBUF); | ||
__uint(max_entries, 1 << 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, max_entries is a size in bytes you want to allocate to the ringbuf.
65535 bytes is tiny (64kB) but perhaps that what you want.
Personally I would size based on:
- size of payload
- expected events/sec
- desired amount of buffer space
and document the formula in a comment.
e.g payload is 256bytes * 1000 events/sec * 5 sec buffer = 1310720 bytes.
Nearest power of 2 would be 2^21.
bpf/maps_definition.h
Outdated
__type(value, u32); | ||
__uint(max_entries, 256); | ||
__uint(type, BPF_MAP_TYPE_RINGBUF); | ||
__uint(max_entries, 1 << 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coming back to this, I see there's at least 4096 bytes allocated for packet payload in each record, so as it stands you'll hold 15 packets in the ringbuf before you start losing data.
bpf/pca.h
Outdated
} | ||
return TC_ACT_UNSPEC; | ||
if (packetSize > 0 && bpf_skb_load_bytes(skb, 0, e->payload, packetSize)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you've called bpf_skb_pull_data
here the skb data has been linearized already you may as well just use memcpy
, which might save you some instructions. otherwise you could just call bpf_skb_load_bytes
and not bpf_skb_pull_data
.
/cc @tohojo to check my understanding is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think memcpy will compile with variable length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely don't do bpf_skb_pull_data()
for this! Pulling the full data into a linear buffer can have performance side effects for the rest of that skb's lifetime.
Instead, just use bpf_skb_load_bytes()
on the skb as-is, that will work just fine, and won't modify the skb itself. This is essentially also what perf_event_output()
does when you pass it an skb pointer :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great thanks @tohojo and @dave-tucker for review and the comments I dropped bpf_skb_pull_data()
in favor of using bpf_skb_load_bytes()
and updated the comments accordingly so I can remember when visit this code in the future
pr.Time = currentTime.Add(-tsDelta) | ||
|
||
err := binary.Read(reader, binary.LittleEndian, &pr.Stream) | ||
err := binary.Read(reader, binary.NativeEndian, &pr.Stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that the reads of other fields should be NativeEndian
...
But if you are reading data from skb->data
, shouldn't you be reading in binary.BigEndian
format? NetworkEndian
== BigEndian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its array of bytes so it shouldn't matter but when copied to userspace we need to be in host endian fmt ?
dfad84e
to
153f993
Compare
/ok-to-test |
New images: These will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=2d61633 make set-agent-image |
/lgtm |
@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL? |
/test qe-e2e-tests |
@Amoghrd my changes should have no impact to regular agent functionality its limited to pca feature which isn't something e2e will be running I rerun it again to see if this consistent or flake |
/retest |
770121d
to
6aefad5
Compare
New changes are detected. LGTM label has been removed. |
/ok-to-test |
New images: These will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=1355cdd make set-agent-image |
6aefad5
to
ceff2b0
Compare
/ok-to-test |
New images: These will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=a8fc15b make set-agent-image |
/test images |
/test images |
1 similar comment
/test images |
/ok-to-test |
/hold |
Signed-off-by: Mohamed Mahmoud <[email protected]>
Signed-off-by: Mohamed Mahmoud <[email protected]>
ceff2b0
to
b00a62b
Compare
@msherif1234: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
using pref events while it has much lower performance compared to ringbuf but also enforce application to run in
privileged
mode because of kernel restrictions.This PR migrate pca to use ringbuf
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.