NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf #594

Open
wants to merge 2 commits into
base: main

Conversation

msherif1234
Contributor

Description

Using perf events not only gives much lower performance than ringbuf, it also forces the application to run in privileged mode because of kernel restrictions.

This PR migrates PCA to use ringbuf.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@msherif1234 msherif1234 force-pushed the pcap_use_rb branch 2 times, most recently from add528a to 95098ea Compare March 3, 2025 13:38
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 3, 2025
github-actions bot commented Mar 3, 2025

New images:
quay.io/netobserv/ebpf-bytecode:3dcf15b
quay.io/netobserv/netobserv-ebpf-agent:3dcf15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=3dcf15b make set-agent-image

@msherif1234
Contributor Author

tested with cli

 USER=netobserv NETOBSERV_AGENT_IMAGE=quay.io/netobserv/netobserv-ebpf-agent:3dcf15b COMMAND_ARGS="--protocol=TCP --port=80" make packets

image

@msherif1234 msherif1234 changed the title Switch PCA feature from using perf events to ringbuf NETOBSERV-2148: Switch PCA feature from using perf events to ringbuf Mar 3, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 3, 2025

@msherif1234: This pull request references NETOBSERV-2148 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

@msherif1234
Contributor Author

  • Before:
15565: sched_cls  name tcx_egress_pca_parse  tag cf185091d59c5f15  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4740B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 650
	pids netobserv-ebpf-(1147708)
15566: sched_cls  name tcx_ingress_pca_parse  tag 6e1c4d2436defe26  gpl
	loaded_at 2025-03-03T11:54:45-0500  uid 0
	xlated 7384B  jited 4737B  memlock 12288B  map_ids 615,616,617,618,619
	btf_id 651
	pids netobserv-ebpf-(1147708)

sudo perf stat -e cycles,instructions --bpf-prog 15565 --timeout 10000
 Performance counter stats for 'BPF program(s) 15565':

         2,798,598      cycles                                                                
           940,169      instructions                     #    0.34  insn per cycle            

      10.012628932 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15566 --timeout 10000
Performance counter stats for 'BPF program(s) 15566':

         2,661,480      cycles                                                                
           831,513      instructions                     #    0.31  insn per cycle            

      10.011295383 seconds time elapsed
  • After:
15634: sched_cls  name tcx_egress_pca_parse  tag 3b823b0d1e696fd4  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5020B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 738
	pids netobserv-ebpf-(1152392)
15635: sched_cls  name tcx_ingress_pca_parse  tag 8784a6295ce1517f  gpl
	loaded_at 2025-03-03T12:02:29-0500  uid 0
	xlated 7984B  jited 5017B  memlock 12288B  map_ids 687,688,689,690,691
	btf_id 739
	pids netobserv-ebpf-(1152392)

sudo perf stat -e cycles,instructions --bpf-prog 15634 --timeout 10000

 Performance counter stats for 'BPF program(s) 15634':

         1,064,322      cycles                                                                
           388,642      instructions                     #    0.37  insn per cycle            

      10.012311676 seconds time elapsed

sudo perf stat -e cycles,instructions --bpf-prog 15635 --timeout 10000

 Performance counter stats for 'BPF program(s) 15635':

         2,018,524      cycles                                                                
           644,693      instructions                     #    0.32  insn per cycle            

      10.012645276 seconds time elapsed
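
To put the two runs side by side, the relative change in cycles can be computed from the perf stat figures above. This is only a rough sketch: the thread does not say whether both 10-second windows saw the same traffic, so the percentages are indicative rather than a controlled benchmark. The helper name `percentChange` is illustrative and not part of the PR.

```go
package main

import "fmt"

// percentChange returns the relative change from before to after, in percent.
// A negative value means the "after" figure is lower.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Cycle counts copied from the perf stat output above.
	fmt.Printf("egress cycles:  %+.1f%%\n", percentChange(2798598, 1064322))
	fmt.Printf("ingress cycles: %+.1f%%\n", percentChange(2661480, 2018524))
}
```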

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 3, 2025
- __type(value, u32);
- __uint(max_entries, 256);
+ __uint(type, BPF_MAP_TYPE_RINGBUF);
+ __uint(max_entries, 1 << 16);
Member

@jotak jotak Mar 4, 2025

that's much bigger than the perfevent array, is there a specific reason for that?
Also: ringbuf doesn't ask for values type?

Contributor Author

I felt 256 was too low, especially since this map is only allocated when PCA is enabled and it's the only active map for the packet agent + filters, so why not give it a bit more?
Right, this map type doesn't have a key or value: https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_RINGBUF/

fwiw, max_entries is the size in bytes you want to allocate to the ringbuf.
65536 bytes (64 kB) is tiny, but perhaps that's what you want.
Personally I would size based on:

  • size of payload
  • expected events/sec
  • desired amount of buffer space

and document the formula in a comment.
e.g. payload is 256 bytes * 1000 events/sec * 5 sec buffer = 1,280,000 bytes.
Rounded up to the next power of 2, that would be 2^21.
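
The sizing rule above can be sketched as a small calculation. This is a minimal illustration, assuming the three inputs from the comment (payload size, events per second, seconds of buffering) and rounding up to a power of two, which as far as I know the kernel requires for ring buffer sizes. The function names are hypothetical, and per-record header overhead is ignored here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPowerOfTwo rounds n up to a power of two (n itself if it already is one).
func nextPowerOfTwo(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return uint64(1) << bits.Len64(n-1)
}

// ringbufBytes estimates a ring buffer size from the payload size,
// the expected event rate and the desired buffering window.
func ringbufBytes(payloadBytes, eventsPerSec, bufferSecs uint64) uint64 {
	return nextPowerOfTwo(payloadBytes * eventsPerSec * bufferSecs)
}

func main() {
	raw := uint64(256 * 1000 * 5)
	fmt.Println(raw, ringbufBytes(256, 1000, 5)) // prints "1280000 2097152"
}
```

Keeping a formula like this in a comment next to the map definition makes it easier to revisit the size when the payload cap or the expected rate changes.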

coming back to this, I see there's at least 4096 bytes allocated for packet payload in each record, so as it stands you'll hold 15 packets in the ringbuf before you start losing data.
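
The estimate of about 15 packets can be reproduced with a short calculation. This is a sketch under assumptions: the 8-byte per-record header and 8-byte alignment reflect my understanding of how the kernel lays out ring buffer records, and the 16-byte metadata size is a hypothetical placeholder, since the real struct is not shown in this thread.

```go
package main

import "fmt"

// ringbufHdrBytes is the per-record header the kernel prepends (assumed to be 8 bytes).
const ringbufHdrBytes = 8

// recordsThatFit estimates how many fixed-size records fit in a ring buffer.
func recordsThatFit(ringBytes, metaBytes, payloadBytes uint64) uint64 {
	data := (metaBytes + payloadBytes + 7) &^ 7 // round the record up to 8 bytes
	return ringBytes / (ringbufHdrBytes + data)
}

func main() {
	const metaBytes = 16 // hypothetical size of the metadata preceding the payload
	fmt.Println(recordsThatFit(1<<16, metaBytes, 4096)) // 4096-byte payload
	fmt.Println(recordsThatFit(1<<16, metaBytes, 256))  // 256-byte payload
}
```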

Contributor Author

On second thought, 4k is way too much. I dropped that to 256, which should be enough for basic headers, including v6.

bpf/pca.h Outdated
// Enable the flag to add packet header
// Packet payload follows immediately after the meta struct
u32 packetSize = (u32)(data_end - data);

// Record the current time.
u64 current_time = bpf_ktime_get_ns();

e = bpf_ringbuf_reserve(&packet_record, sizeof(payload_meta), 0);
if (!e) {
return TC_ACT_UNSPEC;
Member

the return value of attach_packet_payload is ignored in the hooks; they always return TC_ACT_OK / TCX_NEXT ; could you take the opportunity to make it return void (or not ignore the return value) ?

Contributor Author

Sure, I will propagate the return code all the way back to the hook.

Member

It probably doesn't make a difference, but I think I'd prefer the other option, to return void :-)

Because netobserv should not signal anything particular to the kernel regarding how it has to process the packet - we must not take any decision.
I get that "UNSPEC" and "OK" is most of the time equivalent... but wondering if there are some devils in the details

Contributor Author

@msherif1234 msherif1234 Mar 6, 2025

Here are all the possible TC return codes, for reference:

#define TC_ACT_UNSPEC	(-1)
#define TC_ACT_OK		0
#define TC_ACT_RECLASSIFY	1
#define TC_ACT_SHOT		2
#define TC_ACT_PIPE		3
#define TC_ACT_STOLEN		4
#define TC_ACT_QUEUED		5
#define TC_ACT_REPEAT		6
#define TC_ACT_REDIRECT		7
#define TC_ACT_TRAP		8 /* For hw path, this means "trap to cpu"
				   * and don't further process the frame
				   * in hardware. For sw path, this is
				   * equivalent of TC_ACT_STOLEN - drop
				   * the skb and act like everything
				   * is alright.
				   */
#define TC_ACT_VALUE_MAX	TC_ACT_TRAP

For TC/TCX, the return code affects how our hook coexists with other TC/TCX hooks. We aren't returning anything unusual that would significantly affect kernel packet processing.

But our story for interacting with existing TC/TCX hooks is not fully matured or tested yet.

So you are suggesting not to propagate the return code, and to let the hook always return TC_ACT_OK?

Member

@jotak jotak Mar 7, 2025

I don't know if it's better to always return TC_ACT_OK or always return TC_ACT_UNSPEC (it seems it doesn't make much difference), but I would always return the same thing just for the sake of telling: netobserv doesn't take any decision about what to do with the packets - it's purely an observer.

We can frame that differently: what's the rationale behind returning sometimes OK, and sometimes UNSPEC ? What are we trying to convey with this differentiation? If the answer is nothing, then why doing this differentiation?

Contributor Author

Neither return value affects kernel packet processing. UNSPEC is the default and is simply ignored, while OK indicates that the TC or TCX program ran successfully; in the TCX case, we can then advance to the next TCX hook if there is one. Maybe it's not worth it and we should always return 0 for TC and NEXT for TCX regardless. Handling the return code was only a suggestion :)

bpf/types.h Outdated
@@ -69,6 +69,8 @@ typedef __u64 u64;
#define MAX_OBSERVED_INTERFACES 6
#define OBSERVED_DIRECTION_BOTH 3

#define MAX_PAYLOAD_SIZE 512
Member

we may have to document this as a limitation, wdyt?

Contributor

What happens if it's reached ? 🤔
I wonder if the pcapng file will still work

Contributor Author

If the packet header exceeds 512 bytes, we cap it at 512, which should be enough for most basic IPv4 and IPv6 packets. We have to set a limit: I can't use a dynamic size when reserving from the ringbuf (I get verifier errors). I can make it even larger, 4k maybe?

Contributor Author

I used 4k to be on the safe side, especially with IPv6, and we need to document this limit.

Contributor

Could it be configurable ?
That would be the best !

Contributor Author

No, it has to be static when you define the structure.

Contributor Author

Actually, 256 is more than enough even for the v6 case: the v6 base header is just 40 bytes. So I will go back to the conservative header array size unless that causes issues in the future. And yes, we need to document this.
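
The effect of a fixed cap can be illustrated with a small sketch. It only shows the arithmetic of truncation and does not reproduce the agent's code; `maxPayloadSize` mirrors the 256-byte value settled on above, and `capturedLen` is a hypothetical helper. On the earlier pcapng question: pcap-style formats store a captured length separately from the original packet length, which is how truncated captures are normally represented. Whether the agent fills both fields is not stated in this thread.

```go
package main

import "fmt"

// maxPayloadSize mirrors the MAX_PAYLOAD_SIZE value discussed above (assumed to be 256).
const maxPayloadSize = 256

// capturedLen returns how many bytes of a packet would be kept under the cap.
func capturedLen(packetLen uint32) uint32 {
	if packetLen > maxPayloadSize {
		return maxPayloadSize
	}
	return packetLen
}

func main() {
	for _, n := range []uint32{60, 256, 1500} {
		c := capturedLen(n)
		fmt.Printf("original=%d captured=%d truncated=%t\n", n, c, c < n)
	}
}
```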

Member

@jotak jotak left a comment

looks good overall
just, it would be nice to fix the ignored returned value in pca.h

@dave-tucker dave-tucker left a comment

adding some comments since I promised @msherif1234 I'd take a look

bpf/pca.h Outdated
}
return TC_ACT_UNSPEC;
if (packetSize > 0 && bpf_skb_load_bytes(skb, 0, e->payload, packetSize)) {

Since you've called bpf_skb_pull_data here, the skb data has already been linearized, so you may as well just use memcpy, which might save you some instructions. Otherwise, you could just call bpf_skb_load_bytes and not bpf_skb_pull_data.

/cc @tohojo to check my understanding is correct.

Contributor Author

I don't think memcpy will compile with variable length

Yeah, definitely don't do bpf_skb_pull_data() for this! Pulling the full data into a linear buffer can have performance side effects for the rest of that skb's lifetime.

Instead, just use bpf_skb_load_bytes() on the skb as-is, that will work just fine, and won't modify the skb itself. This is essentially also what perf_event_output() does when you pass it an skb pointer :)

Contributor Author

Great, thanks @tohojo and @dave-tucker for the review and the comments. I dropped bpf_skb_pull_data() in favor of bpf_skb_load_bytes() and updated the code comments accordingly, so I can remember why when I revisit this code in the future.

  pr.Time = currentTime.Add(-tsDelta)

- err := binary.Read(reader, binary.LittleEndian, &pr.Stream)
+ err := binary.Read(reader, binary.NativeEndian, &pr.Stream)

I get that the reads of other fields should be NativeEndian...

But if you are reading data from skb->data, shouldn't you be reading in binary.BigEndian format? NetworkEndian == BigEndian

Contributor Author

It's an array of bytes, so it shouldn't matter; but when it is copied to userspace, don't we need to be in host-endian format?
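
The point about byte arrays can be checked with the standard library alone. The sketch below assumes the destination is a fixed-size byte array, as the reply suggests; the actual type of `pr.Stream` is not shown in this thread, and `decode` and `compare` are hypothetical helpers. It reads the same input with two byte orders, once into a byte array and once into a 16-bit integer.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// decode reads raw into a 4-byte array and into a uint16 using the given byte order.
func decode(raw []byte, order binary.ByteOrder) (payload [4]byte, first uint16, err error) {
	if err = binary.Read(bytes.NewReader(raw), order, &payload); err != nil {
		return
	}
	err = binary.Read(bytes.NewReader(raw), order, &first)
	return
}

// compare decodes raw with both byte orders and reports whether the results agree.
func compare(raw []byte) (bytesEqual, intsEqual bool, err error) {
	pb, ib, err := decode(raw, binary.BigEndian)
	if err != nil {
		return false, false, err
	}
	pl, il, err := decode(raw, binary.LittleEndian)
	if err != nil {
		return false, false, err
	}
	return pb == pl, ib == il, nil
}

func main() {
	fmt.Println(compare([]byte{0x12, 0x34, 0x56, 0x78})) // prints "true false <nil>"
}
```

The byte array comes out identical either way, while the integer does not. The byte order passed to `binary.Read` therefore only matters for multi-byte numeric fields. Header fields parsed later out of the raw packet bytes would still need to be interpreted in network order.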

@msherif1234 msherif1234 force-pushed the pcap_use_rb branch 4 times, most recently from dfad84e to 153f993 Compare March 5, 2025 11:44
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 5, 2025
github-actions bot commented Mar 5, 2025

New images:
quay.io/netobserv/ebpf-bytecode:2d61633
quay.io/netobserv/netobserv-ebpf-agent:2d61633

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2d61633 make set-agent-image

@jotak
Member

jotak commented Mar 10, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 10, 2025
@Amoghrd
Amoghrd commented Mar 10, 2025

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

@msherif1234
Contributor Author

/test qe-e2e-tests

@msherif1234
Contributor Author

@msherif1234 All QE backend e2e tests are failing with eBPF daemonset not getting ready. Could you PTAL?

@Amoghrd my changes should have no impact on regular agent functionality; they are limited to the PCA feature, which isn't something the e2e tests run. I reran them to see whether this is consistent or a flake.

@Amoghrd
Amoghrd commented Mar 10, 2025

/retest

openshift-ci bot commented Mar 10, 2025

New changes are detected. LGTM label has been removed.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 10, 2025
@memodi
Contributor

memodi commented Mar 11, 2025

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025
New images:
quay.io/netobserv/ebpf-bytecode:1355cdd
quay.io/netobserv/netobserv-ebpf-agent:1355cdd

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1355cdd make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 11, 2025
New images:
quay.io/netobserv/ebpf-bytecode:a8fc15b
quay.io/netobserv/netobserv-ebpf-agent:a8fc15b

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=a8fc15b make set-agent-image

@msherif1234
Contributor Author

/test images
/test qe-e2e-tests

@msherif1234
Contributor Author

/ok-to-test

@msherif1234
Contributor Author

/hold

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 31, 2025
openshift-ci bot commented Apr 1, 2025

@msherif1234: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/qe-e2e-tests b00a62b link false /test qe-e2e-tests
