[PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF.

BPF List
 help / color / mirror / Atom feed

* [PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF.
       [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
@ 2024-05-29 16:02 ` Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, bpf

There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/lwt_bpf.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 {
 	int ret;
 
-	/* Migration disable and BH disable are needed to protect per-cpu
-	 * redirect_info between BPF prog and skb_do_redirect().
+	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+	 * BPF prog and skb_do_redirect().
 	 */
-	migrate_disable();
 	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	}
 
 	local_bh_enable();
-	migrate_enable();
 
 	return ret;
 }
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states.
       [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
  2024-05-29 16:02 ` [PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
@ 2024-05-29 16:02 ` Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()), that
is accessing seg6_bpf_srh_states, should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh then
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If this is not always invoked from softirq then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t the data structure and use local_lock_nested_bh() in
guard notation for locking. Add lockdep_assert_held() to ensure the lock
is held while the per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/seg6_local.h |  1 +
 net/core/filter.c        |  3 +++
 net/ipv6/seg6_local.c    | 22 ++++++++++++++--------
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..888c1ce6f5272 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -19,6 +19,7 @@ extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
 
 struct seg6_bpf_srh_state {
+	local_lock_t bh_lock;
 	struct ipv6_sr_hdr *srh;
 	u16 hdrlen;
 	bool valid;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c46ecba3b01b..ba1a739a9bedc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6450,6 +6450,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
 	void *srh_tlvs, *srh_end, *ptr;
 	int srhoff = 0;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (srh == NULL)
 		return -EINVAL;
 
@@ -6506,6 +6507,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
 	int hdroff = 0;
 	int err;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	switch (action) {
 	case SEG6_LOCAL_ACTION_END_X:
 		if (!seg6_bpf_has_valid_srh(skb))
@@ -6582,6 +6584,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
 	int srhoff = 0;
 	int ret;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return -EINVAL;
 
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..c4828c6620f07 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 {
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 		this_cpu_ptr(&seg6_bpf_srh_states);
 	struct ipv6_sr_hdr *srh = srh_state->srh;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return false;
 
@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 static int input_action_end_bpf(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
 {
-	struct seg6_bpf_srh_state *srh_state =
-		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state *srh_state;
 	struct ipv6_sr_hdr *srh;
 	int ret;
 
@@ -1420,10 +1422,14 @@ static int input_action_end_bpf(struct sk_buff *skb,
 	}
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
-	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	/* The access to the per-CPU buffer srh_state is protected by running
+	 * always in softirq context (with disabled BH). On PREEMPT_RT the
+	 * required locking is provided by the following local_lock_nested_bh()
+	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+	 * bpf_prog_run_save_cb().
 	 */
-	preempt_disable();
+	local_lock_nested_bh(&seg6_bpf_srh_states.bh_lock);
+	srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
 	srh_state->srh = srh;
 	srh_state->hdrlen = srh->hdrlen << 3;
 	srh_state->valid = true;
@@ -1446,15 +1452,15 @@ static int input_action_end_bpf(struct sk_buff *skb,
 
 	if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
 		goto drop;
+	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);
 
-	preempt_enable();
 	if (ret != BPF_REDIRECT)
 		seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
 drop:
-	preempt_enable();
+	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad.
       [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
  2024-05-29 16:02 ` [PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2024-05-29 16:02 ` Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
  2024-05-29 16:02 ` [PATCH v3 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
  4 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/filter.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ba1a739a9bedc..d6cf1a63c3f43 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1658,9 +1658,12 @@ struct bpf_scratchpad {
 		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
 		u8     buff[MAX_BPF_STACK];
 	};
+	local_lock_t	bh_lock;
 };
 
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
@@ -2029,6 +2032,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 		     diff_size > sizeof(sp->diff)))
 		return -EINVAL;
 
+	guard(local_lock_nested_bh)(&bpf_sp.bh_lock);
 	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
 		sp->diff[j] = ~from[i];
 	for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
       [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
                   ` (2 preceding siblings ...)
  2024-05-29 16:02 ` [PATCH v3 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
@ 2024-05-29 16:02 ` Sebastian Andrzej Siewior
  2024-05-29 21:09   ` Alexei Starovoitov
  2024-05-29 22:09   ` Toke Høiland-Jørgensen
  2024-05-29 16:02 ` [PATCH v3 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior
  4 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Toke Høiland-Jørgensen,
	Yonghong Song, bpf

The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it. Use the __free() annotation to automatically reset the pointer once
function returns.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocations
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

On PREEMPT_RT the pointer to bpf_net_context is saved task's
task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
variable (which is always NODE-local memory). Using always the
bpf_net_context approach has the advantage that there is almost zero
differences between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 42 ++++++++++++++++++++++++++++++++-----
 include/linux/sched.h  |  3 +++
 kernel/bpf/cpumap.c    |  3 +++
 kernel/fork.c          |  1 +
 net/bpf/test_run.c     | 11 +++++++++-
 net/core/dev.c         | 19 ++++++++++++++++-
 net/core/filter.c      | 47 +++++++++++++++++++-----------------------
 net/core/lwt_bpf.c     |  3 +++
 8 files changed, 96 insertions(+), 33 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b02aea291b7e8..a0d0ea356f925 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,39 @@ struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	if (bpf_net_ctx)
+		current->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	return current->bpf_net_context;
+}
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -1022,21 +1054,21 @@ void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
@@ -1592,7 +1624,7 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
 	/* Lower bits of the flags are used as return code on lookup failure */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9b0ca72db55f..dfa1843ab2916 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1508,6 +1509,8 @@ struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 				int xdp_n, struct xdp_cpumap_stats *stats,
 				struct list_head *list)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
 
 	if (!rcpu->prog)
 		return xdp_n;
 
 	rcu_read_lock_bh();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
 
@@ -255,6 +257,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (unlikely(!list_empty(list)))
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d83..f314bdd7e6108 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,7 @@ __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+	p->bpf_net_context =  NULL;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
+	struct bpf_redirect_info *ri;
 	struct xdp_page_head *head;
 	struct xdp_frame *frm;
 	bool redirect = false;
@@ -295,6 +296,8 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 	xdp_clear_return_frame_no_direct();
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
 }
@@ -394,6 +398,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		if (xdp)
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
+
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2c3f86c8cd176..3c0f5474eae78 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4030,11 +4030,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4085,13 +4089,17 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6356,13 +6364,15 @@ static void __napi_busy_loop(unsigned int napi_id,
 		      bool (*loop_end)(void *, unsigned long),
 		      void *loop_end_arg, unsigned flags, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
-
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
 
@@ -6833,6 +6843,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 
 static void napi_threaded_poll_loop(struct napi_struct *napi)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
 	unsigned long last_qs = jiffies;
 
@@ -6841,6 +6852,8 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		void *have;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		sd = this_cpu_ptr(&softnet_data);
 		sd->in_napi_threaded_poll = true;
 
@@ -6856,6 +6869,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 			net_rps_action_and_irq_enable(sd);
 		}
 		skb_defer_free_flush(sd);
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!repoll)
@@ -6878,13 +6892,16 @@ static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
 	int budget = READ_ONCE(net_hotdata.netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index d6cf1a63c3f43..0a3cc71505b43 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2475,9 +2475,6 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
-
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -2490,7 +2487,7 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net *net = dev_net(skb->dev);
 	struct net_device *dev;
 	u32 flags = ri->flags;
@@ -2523,7 +2520,7 @@ int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
@@ -2544,7 +2541,7 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
@@ -2566,7 +2563,7 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
@@ -4294,19 +4291,17 @@ void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
+	 * program/ during NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protected and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exists. dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4314,8 +4309,8 @@ EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
@@ -4432,7 +4427,7 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4446,7 +4441,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4463,7 +4458,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id,
 				       u32 flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
@@ -4517,7 +4512,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
@@ -4553,7 +4548,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +78,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 
 	return ret;
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.
       [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
                   ` (3 preceding siblings ...)
  2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-05-29 16:02 ` Sebastian Andrzej Siewior
  4 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-29 16:02 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Björn Töpel,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Jonathan Lemon,
	KP Singh, Maciej Fijalkowski, Magnus Karlsson, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Toke Høiland-Jørgensen,
	Yonghong Song, bpf

The per-CPU flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are per-CPU. There are subject to the
same problem as struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
access.

Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 32 ++++++++++++++++++++++++++++++++
 kernel/bpf/cpumap.c    | 19 +++----------------
 kernel/bpf/devmap.c    | 11 +++--------
 net/xdp/xsk.c          | 12 ++++--------
 4 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a0d0ea356f925..f94115d154890 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -746,6 +746,9 @@ struct bpf_redirect_info {
 
 struct bpf_net_context {
 	struct bpf_redirect_info ri;
+	struct list_head cpu_map_flush_list;
+	struct list_head dev_map_flush_list;
+	struct list_head xskmap_map_flush_list;
 };
 
 static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
@@ -754,6 +757,14 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
 
 	if (tsk->bpf_net_context != NULL)
 		return NULL;
+
+	if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
+		INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+		INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+	}
+	if (IS_ENABLED(CONFIG_XDP_SOCKETS))
+		INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
 	tsk->bpf_net_context = bpf_net_ctx;
 	return bpf_net_ctx;
 }
@@ -776,6 +787,27 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
 	return &bpf_net_ctx->ri;
 }
 
+static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->cpu_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->dev_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->xskmap_map_flush_list;
+}
+
 DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 66974bd027109..068e994ed781a 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -79,8 +79,6 @@ struct bpf_cpu_map {
 	struct bpf_cpu_map_entry __rcu **cpu_map;
 };
 
-static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
-
 static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 {
 	u32 value_size = attr->value_size;
@@ -709,7 +707,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -761,7 +759,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 
 void __cpu_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
 	struct xdp_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -775,20 +773,9 @@ void __cpu_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool cpu_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+	if (list_empty(bpf_net_ctx_get_cpu_map_flush_list()))
 		return false;
 	__cpu_map_flush();
 	return true;
 }
 #endif
-
-static int __init cpu_map_init(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
-	return 0;
-}
-
-subsys_initcall(cpu_map_init);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 4e2cdbb5629f2..9a66ac78a22ad 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,6 @@ struct bpf_dtab {
 	u32 n_buckets;
 };
 
-static DEFINE_PER_CPU(struct list_head, dev_flush_list);
 static DEFINE_SPINLOCK(dev_map_lock);
 static LIST_HEAD(dev_map_list);
 
@@ -408,7 +407,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
  */
 void __dev_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -422,7 +421,7 @@ void __dev_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool dev_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&dev_flush_list)))
+	if (list_empty(bpf_net_ctx_get_dev_flush_list()))
 		return false;
 	__dev_flush();
 	return true;
@@ -453,7 +452,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1156,15 +1155,11 @@ static struct notifier_block dev_map_notifier = {
 
 static int __init dev_map_init(void)
 {
-	int cpu;
-
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
 		     offsetof(struct _bpf_dtab_netdev, dev));
 	register_netdevice_notifier(&dev_map_notifier);
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
 	return 0;
 }
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 727aa20be4bde..8b0b557408fc2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,6 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)
 
-static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
-
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
 {
 	if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -375,7 +373,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	int err;
 
 	err = xsk_rcv(xs, xdp);
@@ -390,7 +388,7 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 void __xsk_map_flush(void)
 {
-	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
 	struct xdp_sock *xs, *tmp;
 
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
@@ -402,7 +400,7 @@ void __xsk_map_flush(void)
 #ifdef CONFIG_DEBUG_NET
 bool xsk_map_check_flush(void)
 {
-	if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+	if (list_empty(bpf_net_ctx_get_xskmap_flush_list()))
 		return false;
 	__xsk_map_flush();
 	return true;
@@ -1775,7 +1773,7 @@ static struct pernet_operations xsk_net_ops = {
 
 static int __init xsk_init(void)
 {
-	int err, cpu;
+	int err;
 
 	err = proto_register(&xsk_proto, 0 /* no slab */);
 	if (err)
@@ -1793,8 +1791,6 @@ static int __init xsk_init(void)
 	if (err)
 		goto out_pernet;
 
-	for_each_possible_cpu(cpu)
-		INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
 	return 0;
 
 out_pernet:
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
@ 2024-05-29 21:09   ` Alexei Starovoitov
  2024-05-29 22:09   ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2024-05-29 21:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Toke Høiland-Jørgensen,
	Yonghong Song, bpf

On Wed, May 29, 2024 at 9:29 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protections reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved task's
> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> variable (which is always NODE-local memory). Using always the
> bpf_net_context approach has the advantage that there is almost zero
> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Eduard Zingerman <eddyz87@gmail.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

lgtm
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
  2024-05-29 21:09   ` Alexei Starovoitov
@ 2024-05-29 22:09   ` Toke Høiland-Jørgensen
  2024-05-31 10:38     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-29 22:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protections reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved task's
> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> variable (which is always NODE-local memory). Using always the
> bpf_net_context approach has the advantage that there is almost zero
> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Eduard Zingerman <eddyz87@gmail.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

[...]
> @@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>  				int xdp_n, struct xdp_cpumap_stats *stats,
>  				struct list_head *list)
>  {
> +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>  	int nframes;

I think we need to zero-initialise all the context objects we allocate
on the stack.

The reason being that an XDP program can return XDP_REDIRECT without
calling any of the redirect helpers first; which will lead to
xdp_do_redirect() being called without any of the fields in struct
bpf_redirect_info having being set. This can lead to a crash if the
values happen to be the wrong value; and if we're not initialising the
stack space used by this struct, we have no guarantees about what value
they will end up with.

We fixed a similar bug relatively recently, see:

5bcf0dcbf906 ("xdp: use flags field to disambiguate broadcast redirect")

>  void bpf_clear_redirect_map(struct bpf_map *map)
>  {
> -	struct bpf_redirect_info *ri;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu) {
> -		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> -		/* Avoid polluting remote cacheline due to writes if
> -		 * not needed. Once we pass this test, we need the
> -		 * cmpxchg() to make sure it hasn't been changed in
> -		 * the meantime by remote CPU.
> -		 */
> -		if (unlikely(READ_ONCE(ri->map) == map))
> -			cmpxchg(&ri->map, map, NULL);
> -	}
> +	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
> +	 * program/ during NAPI callback. It is used during
> +	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
> +	 * redirect callback afterwards. ri->map is cleared after usage.
> +	 * The path has no explicit RCU read section but the local_bh_disable()
> +	 * is also a RCU read section which makes the complete softirq callback
> +	 * RCU protected. This in turn makes ri->map RCU protected and it is
> +	 * sufficient to wait a grace period to ensure that no "ri->map == map"
> +	 * exists. dev_map_free() removes the map from the list and then
> +	 * invokes synchronize_rcu() after calling this function.
> +	 */
>  }

With the zeroing of the stack variable mentioned above, I agree that
this is not needed anymore, but I think we should just get rid of the
function entirely and put a comment in devmap.c instead of the call to
the (now empty) function.

-Toke


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-29 22:09   ` Toke Høiland-Jørgensen
@ 2024-05-31 10:38     ` Sebastian Andrzej Siewior
  2024-06-05 10:28       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-05-31 10:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jesper Dangaard Brouer, Jiri Olsa,
	John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

On 2024-05-30 00:09:21 [+0200], Toke Høiland-Jørgensen wrote:
> [...]
> > @@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
> >  				int xdp_n, struct xdp_cpumap_stats *stats,
> >  				struct list_head *list)
> >  {
> > +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> >  	int nframes;
> 
> I think we need to zero-initialise all the context objects we allocate
> on the stack.
> 
> The reason being that an XDP program can return XDP_REDIRECT without
> calling any of the redirect helpers first; which will lead to
> xdp_do_redirect() being called without any of the fields in struct
> bpf_redirect_info having being set. This can lead to a crash if the
> values happen to be the wrong value; and if we're not initialising the
> stack space used by this struct, we have no guarantees about what value
> they will end up with.

Okay, I can do that.

> >  void bpf_clear_redirect_map(struct bpf_map *map)
> >  {
> > -	struct bpf_redirect_info *ri;
> > -	int cpu;
> > -
> > -	for_each_possible_cpu(cpu) {
> > -		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
> > -		/* Avoid polluting remote cacheline due to writes if
> > -		 * not needed. Once we pass this test, we need the
> > -		 * cmpxchg() to make sure it hasn't been changed in
> > -		 * the meantime by remote CPU.
> > -		 */
> > -		if (unlikely(READ_ONCE(ri->map) == map))
> > -			cmpxchg(&ri->map, map, NULL);
> > -	}
> > +	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
> > +	 * program/ during NAPI callback. It is used during
> > +	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
> > +	 * redirect callback afterwards. ri->map is cleared after usage.
> > +	 * The path has no explicit RCU read section but the local_bh_disable()
> > +	 * is also a RCU read section which makes the complete softirq callback
> > +	 * RCU protected. This in turn makes ri->map RCU protected and it is
> > +	 * sufficient to wait a grace period to ensure that no "ri->map == map"
> > +	 * exists. dev_map_free() removes the map from the list and then
> > +	 * invokes synchronize_rcu() after calling this function.
> > +	 */
> >  }
> 
> With the zeroing of the stack variable mentioned above, I agree that
> this is not needed anymore, but I think we should just get rid of the
> function entirely and put a comment in devmap.c instead of the call to
> the (now empty) function.

I wasn't entirely sure if my reasoning is valid. In that case…

> -Toke

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-05-31 10:38     ` Sebastian Andrzej Siewior
@ 2024-06-05 10:28       ` Jesper Dangaard Brouer
  2024-06-05 10:37         ` Toke Høiland-Jørgensen
  2024-06-05 10:41         ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2024-06-05 10:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Toke Høiland-Jørgensen
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf



On 31/05/2024 12.38, Sebastian Andrzej Siewior wrote:
> On 2024-05-30 00:09:21 [+0200], Toke Høiland-Jørgensen wrote:
>> [...]
>>> @@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>>>   				int xdp_n, struct xdp_cpumap_stats *stats,
>>>   				struct list_head *list)
>>>   {
>>> +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>>   	int nframes;
 >>
>> I think we need to zero-initialise all the context objects we allocate
>> on the stack.
>>
> Okay, I can do that.

Hmm, but how will this affect performance?

--Jesper

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-06-05 10:28       ` Jesper Dangaard Brouer
@ 2024-06-05 10:37         ` Toke Høiland-Jørgensen
  2024-06-05 10:41         ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-06-05 10:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Sebastian Andrzej Siewior
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

Jesper Dangaard Brouer <hawk@kernel.org> writes:

> On 31/05/2024 12.38, Sebastian Andrzej Siewior wrote:
>> On 2024-05-30 00:09:21 [+0200], Toke Høiland-Jørgensen wrote:
>>> [...]
>>>> @@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>>>>   				int xdp_n, struct xdp_cpumap_stats *stats,
>>>>   				struct list_head *list)
>>>>   {
>>>> +	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>>>>   	int nframes;
>  >>
>>> I think we need to zero-initialise all the context objects we allocate
>>> on the stack.
>>>
>> Okay, I can do that.
>
> Hmm, but how will this affect performance?

My hunch would be that this would be in a cache line we're touching
anyway, so it won't make much difference? But better measure, I suppose :)

-Toke


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-06-05 10:28       ` Jesper Dangaard Brouer
  2024-06-05 10:37         ` Toke Høiland-Jørgensen
@ 2024-06-05 10:41         ` Sebastian Andrzej Siewior
  2024-06-05 11:55           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-05 10:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, linux-kernel, netdev,
	David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf

On 2024-06-05 12:28:08 [+0200], Jesper Dangaard Brouer wrote:
> 
> Hmm, but how will this affect performance?

As I wrote in the changelog for v4, I haven't notice a difference. I
tried to move bpf_net_ctx_set() from cpu_map_bpf_prog_run() to
cpu_map_kthread_run() to have this assignment only once and I didn't see
a difference/ I couldn't tell the two kernels apart.

This is what I have been using for testing

| xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

in case I was changing the wrong part…

> --Jesper

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
  2024-06-05 10:41         ` Sebastian Andrzej Siewior
@ 2024-06-05 11:55           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2024-06-05 11:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Toke Høiland-Jørgensen, linux-kernel, netdev,
	David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman, Hao Luo,
	Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau, Song Liu,
	Stanislav Fomichev, Yonghong Song, bpf



On 05/06/2024 12.41, Sebastian Andrzej Siewior wrote:
> On 2024-06-05 12:28:08 [+0200], Jesper Dangaard Brouer wrote:
>>
>> Hmm, but how will this affect performance?
> 
> As I wrote in the changelog for v4, I haven't notice a difference. I
> tried to move bpf_net_ctx_set() from cpu_map_bpf_prog_run() to
> cpu_map_kthread_run() to have this assignment only once and I didn't see
> a difference/ I couldn't tell the two kernels apart.
> 

This would be my preferred solution.
See below, your benchmark wasn't testing/measuring this changed code on
remote CPU running kthread.

> This is what I have been using for testing
> 
> | xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e
> 
> in case I was changing the wrong part…

As we saw earlier (with your hardware setup) this test is benchmarking
the RX-NAPI XDP-redirect code.  As the cpumap "remote" CPUs kthread had
idle cycles.

The extra clearing bpf_net_ctx_set() for each packet in the kthread on
the remote CPU will not change the benchmark numbers (as it have idle
cycles).

Looking closer at kernel code + your patch, I see that this clearing
isn't done for each packet, but per bulk (up-to CPUMAP_BATCH 8).  Given
that, I'm more okay with this change.

--Jesper



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-06-05 11:55 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240529162927.403425-1-bigeasy@linutronix.de>
2024-05-29 16:02 ` [PATCH v3 net-next 11/15] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2024-05-29 16:02 ` [PATCH v3 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT Sebastian Andrzej Siewior
2024-05-29 21:09   ` Alexei Starovoitov
2024-05-29 22:09   ` Toke Høiland-Jørgensen
2024-05-31 10:38     ` Sebastian Andrzej Siewior
2024-06-05 10:28       ` Jesper Dangaard Brouer
2024-06-05 10:37         ` Toke Høiland-Jørgensen
2024-06-05 10:41         ` Sebastian Andrzej Siewior
2024-06-05 11:55           ` Jesper Dangaard Brouer
2024-05-29 16:02 ` [PATCH v3 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context " Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox