[PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF.

* [PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, bpf

There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/lwt_bpf.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 {
 	int ret;
 
-	/* Migration disable and BH disable are needed to protect per-cpu
-	 * redirect_info between BPF prog and skb_do_redirect().
+	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+	 * BPF prog and skb_do_redirect().
 	 */
-	migrate_disable();
 	local_bh_disable();
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	}
 
 	local_bh_enable();
-	migrate_enable();
 
 	return ret;
 }
-- 
2.43.0



* [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
  2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  3:39   ` kernel test robot
  2023-12-18  8:33   ` Paolo Abeni
  2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and every
other function that accesses seg6_bpf_srh_states (the BPF helper
functions bpf_lwt_seg6_*()) should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Judging by how it is exercised via test_lwt_seg6local.sh,
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If it is not always invoked from softirq context then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh() in
guard notation for locking. Add lockdep_assert_held() to ensure the lock
is held while the per-CPU variable is referenced in the helper functions.
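
The scope-based locking described above can be illustrated outside the
kernel. The following is only a userspace sketch, not the kernel
implementation: it assumes the GCC/Clang cleanup attribute, and every
name in it (srh_state_model, MODEL_GUARD, model_*) is hypothetical. The
"held" flag is a crude stand-in for what lockdep_assert_held() checks,
and the model is single-threaded, so no real lock is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state with an embedded lock. */
struct srh_state_model {
	bool held;	/* models "the lock is currently held" */
	int hdrlen;
	bool valid;
};

static struct srh_state_model state_model;

static struct srh_state_model *model_lock(struct srh_state_model *s)
{
	assert(!s->held);	/* no recursion in this model */
	s->held = true;
	return s;
}

/* Cleanup handler: runs whenever the guard variable leaves its scope. */
static void model_unlock(struct srh_state_model **s)
{
	(*s)->held = false;
}

/* Declares a guard variable that releases the lock at end of scope. */
#define MODEL_GUARD(name, s) \
	struct srh_state_model *name \
	__attribute__((cleanup(model_unlock))) = model_lock(s)

/* Helper that may only run with the lock held, like the BPF helpers. */
static bool model_has_valid_srh(struct srh_state_model *s)
{
	assert(s->held);	/* analogous in spirit to lockdep_assert_held() */
	return s->valid && s->hdrlen > 0;
}

/* Entry point: takes the lock, fills the state, calls the helper. */
static int model_input_action(int hdrlen)
{
	MODEL_GUARD(st, &state_model);

	st->hdrlen = hdrlen << 3;
	st->valid = true;
	if (!model_has_valid_srh(st))
		return -1;	/* early return, the guard still unlocks */
	return st->hdrlen;
}

static bool model_lock_is_held(void)
{
	return state_model.held;
}
```

The point of the sketch is that both the normal and the early return
path drop the lock without an explicit unlock call, which is what the
guard notation is meant to provide.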

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/seg6_local.h |  1 +
 net/core/filter.c        |  3 ++
 net/ipv6/seg6_local.c    | 59 ++++++++++++++++++++++------------------
 3 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..0f22771359f4c 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
 
 struct seg6_bpf_srh_state {
 	struct ipv6_sr_hdr *srh;
+	local_lock_t bh_lock;
 	u16 hdrlen;
 	bool valid;
 };
diff --git a/net/core/filter.c b/net/core/filter.c
index 1737884be52f8..c8013f762524b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
 	void *srh_tlvs, *srh_end, *ptr;
 	int srhoff = 0;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (srh == NULL)
 		return -EINVAL;
 
@@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
 	int hdroff = 0;
 	int err;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	switch (action) {
 	case SEG6_LOCAL_ACTION_END_X:
 		if (!seg6_bpf_has_valid_srh(skb))
@@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
 	int srhoff = 0;
 	int ret;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return -EINVAL;
 
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..ed7278af321a2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
+};
 
 bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 {
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 		this_cpu_ptr(&seg6_bpf_srh_states);
 	struct ipv6_sr_hdr *srh = srh_state->srh;
 
+	lockdep_assert_held(&srh_state->bh_lock);
 	if (unlikely(srh == NULL))
 		return false;
 
@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
 static int input_action_end_bpf(struct sk_buff *skb,
 				struct seg6_local_lwt *slwt)
 {
-	struct seg6_bpf_srh_state *srh_state =
-		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state *srh_state;
 	struct ipv6_sr_hdr *srh;
 	int ret;
 
@@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
 	}
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
-	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	/* The access to the per-CPU buffer srh_state is protected by running
+	 * always in softirq context (with disabled BH). On PREEMPT_RT the
+	 * required locking is provided by the following local_lock_nested_bh()
+	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+	 * bpf_prog_run_save_cb().
 	 */
-	preempt_disable();
-	srh_state->srh = srh;
-	srh_state->hdrlen = srh->hdrlen << 3;
-	srh_state->valid = true;
+	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
+		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
+		srh_state->srh = srh;
+		srh_state->hdrlen = srh->hdrlen << 3;
+		srh_state->valid = true;
 
-	rcu_read_lock();
-	bpf_compute_data_pointers(skb);
-	ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
-	rcu_read_unlock();
+		rcu_read_lock();
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+		rcu_read_unlock();
 
-	switch (ret) {
-	case BPF_OK:
-	case BPF_REDIRECT:
-		break;
-	case BPF_DROP:
-		goto drop;
-	default:
-		pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
-		goto drop;
+		switch (ret) {
+		case BPF_OK:
+		case BPF_REDIRECT:
+			break;
+		case BPF_DROP:
+			goto drop;
+		default:
+			pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+			goto drop;
+		}
+
+		if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
+			goto drop;
 	}
 
-	if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
-		goto drop;
-
-	preempt_enable();
 	if (ret != BPF_REDIRECT)
 		seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
 drop:
-	preempt_enable();
 	kfree_skb(skb);
 	return -EINVAL;
 }
-- 
2.43.0



* [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
  2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/filter.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c8013f762524b..896aa3fa699f9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1652,9 +1652,12 @@ struct bpf_scratchpad {
 		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
 		u8     buff[MAX_BPF_STACK];
 	};
+	local_lock_t	lock;
 };
 
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+	.lock	= INIT_LOCAL_LOCK(lock),
+};
 
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
@@ -2023,6 +2026,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 		     diff_size > sizeof(sp->diff)))
 		return -EINVAL;
 
+	guard(local_lock_nested_bh)(&bpf_sp.lock);
 	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
 		sp->diff[j] = ~from[i];
 	for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
-- 
2.43.0



* [PATCH net-next 14/24] net: Add a lock which held during the redirect process.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (2 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

Introduce redirect_lock as a lock to be acquired whenever these per-CPU
variables are accessed. Usually the lock is part of the per-CPU variable
which it protects, but since a few different per-CPU variables need to
be protected at the same time (and some of them depend on a CONFIG
setting), a new per-CPU data structure with the variable bpf_run_lock is
used for this.

The lock is a nested-BH lock meaning that on non-PREEMPT_RT kernels this
simply results in a lockdep check and ensuring that bottom halves are
disabled. On PREEMPT_RT kernels this will provide the needed
synchronisation once local_bh_disable() no longer acts as a per-CPU
lock.
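
The two behaviours of a nested-BH lock described above can be modelled
in plain C. This is a simplified, single-threaded sketch under stated
assumptions, not kernel code: struct model_cpu, the model_* functions
and the preempt_rt flag are all hypothetical, and a real lock would
block on contention instead of returning an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's state; none of this is kernel API. */
struct model_cpu {
	int bh_disabled;	/* nesting depth of "BH disabled" sections */
	bool redirect_locked;	/* only used by the PREEMPT_RT flavour */
};

enum model_result {
	MODEL_OK = 0,
	MODEL_BH_ENABLED = -1,	/* what a lockdep check would flag */
	MODEL_CONTENDED = -2,	/* a real lock would block here */
};

static enum model_result model_lock_nested_bh(struct model_cpu *cpu,
					       bool preempt_rt)
{
	/* Both flavours expect bottom halves to be disabled already. */
	if (cpu->bh_disabled == 0)
		return MODEL_BH_ENABLED;

	/* !PREEMPT_RT: disabled BH already serialises the access. */
	if (!preempt_rt)
		return MODEL_OK;

	/* PREEMPT_RT: disabled BH is not enough, take an actual lock. */
	if (cpu->redirect_locked)
		return MODEL_CONTENDED;
	cpu->redirect_locked = true;
	return MODEL_OK;
}

static void model_unlock_nested_bh(struct model_cpu *cpu, bool preempt_rt)
{
	assert(cpu->bh_disabled > 0);
	if (preempt_rt)
		cpu->redirect_locked = false;
}
```

In the non-RT flavour of the model, acquiring the lock twice in a row
succeeds because nothing is actually taken; in the RT flavour the
second attempt is reported as contended until the lock is released.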

This patch introduces the bpf_run_lock.redirect_lock lock. It will be
used by drivers in the following patches.

A follow-up step could be to keep bpf_prog_run_xdp() and the
XDP_REDIRECT switch case (with xdp_do_redirect()) close together. That
would allow a single scoped_guard() macro to cover the two instances
that require locking instead of the whole switch case.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/bpf.h | 6 ++++++
 net/core/filter.c   | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cff5bb08820ec..6912b85209b12 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -291,6 +291,12 @@ struct bpf_map {
 	s64 __percpu *elem_count;
 };

+struct bpf_run_lock {
+	local_lock_t redirect_lock;
+};
+
+DECLARE_PER_CPU(struct bpf_run_lock, bpf_run_lock);
+
 static inline const char *btf_field_type_name(enum btf_field_type type)
 {
 	switch (type) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 896aa3fa699f9..7c9653734fb60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -89,6 +89,11 @@
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);

+DEFINE_PER_CPU(struct bpf_run_lock, bpf_run_lock) = {
+	.redirect_lock = INIT_LOCAL_LOCK(redirect_lock),
+};
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_run_lock);
+
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len)
 {
 	if (in_compat_syscall()) {
-- 
2.43.0


* [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (3 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-20  0:25   ` Alexei Starovoitov
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Ronak Doshi <doshir@vmware.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/vmxnet3/vmxnet3_xdp.c |  1 +
 kernel/bpf/cpumap.c               |  2 ++
 net/bpf/test_run.c                | 11 ++++++++---
 net/core/dev.c                    |  3 +++
 net/core/filter.c                 |  1 +
 net/core/lwt_bpf.c                |  2 ++
 net/sched/cls_api.c               |  2 ++
 7 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
index 80ddaff759d47..18bce98fd2e31 100644
--- a/drivers/net/vmxnet3/vmxnet3_xdp.c
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -257,6 +257,7 @@ vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp,
 	u32 act;
 
 	rq->stats.xdp_packets++;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	page = virt_to_page(xdp->data_hard_start);
 
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c26d49bb78679 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -144,6 +144,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
 	int err;
 
 	list_for_each_entry_safe(skb, tmp, listp, list) {
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
 		switch (act) {
 		case XDP_PASS:
@@ -182,6 +183,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce10..db8f7eb35c6ca 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -348,6 +349,9 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 out:
+	xdp_clear_return_frame_no_direct();
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
 	if (redirect)
 		xdp_do_flush();
 	if (nframes) {
@@ -356,7 +360,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			err = ret;
 	}
 
-	xdp_clear_return_frame_no_direct();
 	local_bh_enable();
 	return err;
 }
@@ -417,10 +420,12 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
-		if (xdp)
+		if (xdp) {
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			*retval = bpf_prog_run_xdp(prog, ctx);
-		else
+		} else {
 			*retval = bpf_prog_run(prog, ctx);
+		}
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a0f6da7b3ae5..5ba7509e88752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		*pt_prev = NULL;
 	}
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
 	tcx_set_ingress(skb, true);
 
@@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	if (!entry)
 		return skb;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c9653734fb60..72a7812f933a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
  */
 void xdp_do_flush(void)
 {
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	__dev_flush();
 	__cpu_map_flush();
 	__xsk_map_flush();
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..74b88e897a7e3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
 	local_bh_enable();
 
 	return ret;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd1639863..da61b99bc558f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -23,6 +23,7 @@
 #include <linux/jhash.h>
 #include <linux/rculist.h>
 #include <linux/rhashtable.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
 
 	fl = rcu_dereference_bh(qe->filter_chain);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
 	case TC_ACT_SHOT:
 		qdisc_qstats_drop(sch);
-- 
2.43.0



* [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (4 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-18  8:52   ` Daniel Borkmann
  2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, K. Y. Srinivasan, Michael S. Tsirkin,
	Alexei Starovoitov, Andrii Nakryiko, Dexuan Cui, Haiyang Zhang,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	Juergen Gross, KP Singh, Martin KaFai Lau, Nikolay Aleksandrov,
	Song Liu, Stanislav Fomichev, Stefano Stabellini, Wei Liu,
	Willem de Bruijn, Xuan Zhuo, Yonghong Song, bpf, virtualization,
	xen-devel

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Cc: virtualization@lists.linux.dev
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/hyperv/netvsc_bpf.c |  1 +
 drivers/net/netkit.c            | 13 +++++++----
 drivers/net/tun.c               | 28 +++++++++++++----------
 drivers/net/veth.c              | 40 ++++++++++++++++++++-------------
 drivers/net/virtio_net.c        |  1 +
 drivers/net/xen-netfront.c      |  1 +
 6 files changed, 52 insertions(+), 32 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 4a9522689fa4f..55f8ca92ca199 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -58,6 +58,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
 
 	memcpy(xdp->data, data, len);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	switch (act) {
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
 	skb->dev = peer;
 	entry = rcu_dereference(nk->active);
-	if (entry)
-		ret = netkit_run(entry, skb, ret);
+	if (entry) {
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			ret = netkit_run(entry, skb, ret);
+			if (ret == NETKIT_REDIRECT) {
+				dev_sw_netstats_tx_add(dev, 1, len);
+				skb_do_redirect(skb);
+			}
+		}
+	}
 	switch (ret) {
 	case NETKIT_NEXT:
 	case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 		break;
 	case NETKIT_REDIRECT:
-		dev_sw_netstats_tx_add(dev, 1, len);
-		skb_do_redirect(skb);
 		break;
 	case NETKIT_DROP:
 	default:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afa5497f7c35c..fe0d31f11e4b6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1708,16 +1708,18 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp_init_buff(&xdp, buflen, &tfile->xdp_rxq);
 		xdp_prepare_buff(&xdp, buf, pad, len, false);
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
-		if (act == XDP_REDIRECT || act == XDP_TX) {
-			get_page(alloc_frag->page);
-			alloc_frag->offset += buflen;
-		}
-		err = tun_xdp_act(tun, xdp_prog, &xdp, act);
-		if (err < 0) {
-			if (act == XDP_REDIRECT || act == XDP_TX)
-				put_page(alloc_frag->page);
-			goto out;
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			if (act == XDP_REDIRECT || act == XDP_TX) {
+				get_page(alloc_frag->page);
+				alloc_frag->offset += buflen;
+			}
+			err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+			if (err < 0) {
+				if (act == XDP_REDIRECT || act == XDP_TX)
+					put_page(alloc_frag->page);
+				goto out;
+			}
 		}
 
 		if (err == XDP_REDIRECT)
@@ -2460,8 +2462,10 @@ static int tun_xdp_one(struct tun_struct *tun,
 		xdp_init_buff(xdp, buflen, &tfile->xdp_rxq);
 		xdp_set_data_meta_invalid(xdp);
 
-		act = bpf_prog_run_xdp(xdp_prog, xdp);
-		ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, xdp);
+			ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+		}
 		if (ret < 0) {
 			put_page(virt_to_head_page(xdp->data));
 			return ret;
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 977861c46b1fe..c69e5ff9f8795 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -624,7 +624,18 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 		xdp->rxq = &rq->xdp_rxq;
 		vxbuf.skb = NULL;
 
-		act = bpf_prog_run_xdp(xdp_prog, xdp);
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			act = bpf_prog_run_xdp(xdp_prog, xdp);
+			if (act == XDP_REDIRECT) {
+				orig_frame = *frame;
+				xdp->rxq->mem = frame->mem;
+				if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+					frame = &orig_frame;
+					stats->rx_drops++;
+					goto err_xdp;
+				}
+			}
+		}
 
 		switch (act) {
 		case XDP_PASS:
@@ -644,13 +655,6 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			rcu_read_unlock();
 			goto xdp_xmit;
 		case XDP_REDIRECT:
-			orig_frame = *frame;
-			xdp->rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-				frame = &orig_frame;
-				stats->rx_drops++;
-				goto err_xdp;
-			}
 			stats->xdp_redirect++;
 			rcu_read_unlock();
 			goto xdp_xmit;
@@ -857,7 +861,18 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
+		if (act == XDP_REDIRECT) {
+			veth_xdp_get(xdp);
+			consume_skb(skb);
+			xdp->rxq->mem = rq->xdp_mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+				stats->rx_drops++;
+				goto err_xdp;
+			}
+		}
+	}
 
 	switch (act) {
 	case XDP_PASS:
@@ -875,13 +890,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(xdp);
-		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
-			stats->rx_drops++;
-			goto err_xdp;
-		}
 		stats->xdp_redirect++;
 		rcu_read_unlock();
 		goto xdp_xmit;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061f..5e362c4604239 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1010,6 +1010,7 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	u64_stats_inc(&stats->xdp_packets);
 
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index ad29f370034e4..e3daa8cdeb84e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -978,6 +978,7 @@ static u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
 	xdp_prepare_buff(xdp, page_address(pdata), XDP_PACKET_HEADROOM,
 			 len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_TX:
-- 
2.43.0



* [PATCH net-next 18/24] net: Freescale: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (5 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Clark Wang,
	Claudiu Manoil, Ioana Ciornei, Jesper Dangaard Brouer,
	John Fastabend, Madalin Bucur, NXP Linux Team, Shenwei Wang,
	Vladimir Oltean, Wei Fang, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
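
The two locking scopes described above (whole function versus only the run
plus redirect step) can be modelled in userspace. The following is a minimal
sketch, assuming a GCC/Clang compiler that supports the cleanup variable
attribute. All toy_* names and both macros are hypothetical stand-ins made up
for illustration; they are not the kernel's guard()/scoped_guard()
implementation and the lock is a plain flag, not a per-CPU lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a nested-BH local lock. */
struct toy_lock {
	int held;
};

/* Plays the role of bpf_run_lock.redirect_lock in this sketch only. */
static struct toy_lock toy_redirect_lock;

static struct toy_lock *toy_lock_enter(struct toy_lock *lock)
{
	assert(!lock->held);	/* no recursion in this toy model */
	lock->held = 1;
	return lock;
}

/* Called by the compiler when the guard variable goes out of scope. */
static void toy_lock_exit(struct toy_lock **lockp)
{
	(*lockp)->held = 0;
}

/* Function-scope guard: held until the enclosing block ends. */
#define TOY_GUARD(lock) \
	struct toy_lock *toy_guard \
		__attribute__((cleanup(toy_lock_exit), unused)) = \
		toy_lock_enter(lock)

/* Block-scope guard: a one-iteration for loop owns the guard variable. */
#define TOY_SCOPED_GUARD(lock) \
	for (struct toy_lock *toy_scope \
		__attribute__((cleanup(toy_lock_exit))) = \
		toy_lock_enter(lock), *toy_once = toy_scope; \
	     toy_once; toy_once = NULL)

/* Returns the lock state seen by the post-processing step. */
static int toy_run_function_scope(void)
{
	int held;

	TOY_GUARD(&toy_redirect_lock);
	/* program run and redirect would happen here */
	held = toy_redirect_lock.held;	/* post-processing is still locked */
	return held;
}

static int toy_run_block_scope(void)
{
	TOY_SCOPED_GUARD(&toy_redirect_lock) {
		/* program run and redirect would happen here */
	}
	return toy_redirect_lock.held;	/* post-processing runs unlocked */
}
```

In this model toy_run_function_scope() reports the lock as held during
post-processing, while toy_run_block_scope() reports it as already released,
which is the difference between the one-line and the block-style driver changes.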

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Clark Wang <xiaoning.wang@nxp.com>
Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Madalin Bucur <madalin.bucur@nxp.com>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Shenwei Wang <shenwei.wang@nxp.com>
Cc: Vladimir Oltean <vladimir.oltean@nxp.com>
Cc: Wei Fang <wei.fang@nxp.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 .../net/ethernet/freescale/dpaa/dpaa_eth.c    |  1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-xsk.c  | 30 ++++++++++---------
 drivers/net/ethernet/freescale/enetc/enetc.c  |  1 +
 drivers/net/ethernet/freescale/fec_main.c     |  1 +
 5 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index dcbc598b11c6c..8adc766282fde 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2597,6 +2597,7 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	}
 #endif
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	/* Update the length and the offset of the FD */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 888509cf1f210..08be35a3e3de7 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -442,6 +442,7 @@ static u32 dpaa2_eth_run_xdp(struct dpaa2_eth_priv *priv,
 	xdp_prepare_buff(&xdp, vaddr + offset, XDP_PACKET_HEADROOM,
 			 dpaa2_fd_get_len(fd), false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	/* xdp.data pointer may have changed */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
index 051748b997f3f..e3ae9de6b0a34 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
@@ -56,23 +56,25 @@ static u32 dpaa2_xsk_run_xdp(struct dpaa2_eth_priv *priv,
 	xdp_buff->rxq = &ch->xdp_rxq;
 
 	xsk_buff_dma_sync_for_cpu(xdp_buff, ch->xsk_pool);
-	xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
 
-	/* xdp.data pointer may have changed */
-	dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
-	dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
+		/* xdp.data pointer may have changed */
+		dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
+		dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
 
-	if (likely(xdp_act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
-		if (unlikely(err)) {
-			ch->stats.xdp_drop++;
-			dpaa2_eth_recycle_buf(priv, ch, addr);
-		} else {
-			ch->buf_count--;
-			ch->stats.xdp_redirect++;
+		if (likely(xdp_act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
+			if (unlikely(err)) {
+				ch->stats.xdp_drop++;
+				dpaa2_eth_recycle_buf(priv, ch, addr);
+			} else {
+				ch->buf_count--;
+				ch->stats.xdp_redirect++;
+			}
+
+			goto xdp_redir;
 		}
-
-		goto xdp_redir;
 	}
 
 	switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index cffbf27c4656b..d516b28815af4 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -1578,6 +1578,7 @@ static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
 			rx_byte_cnt += VLAN_HLEN;
 		rx_byte_cnt += xdp_get_buff_len(&xdp_buff);
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		xdp_act = bpf_prog_run_xdp(prog, &xdp_buff);
 
 		switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c3b7694a74851..335b1e307d468 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1587,6 +1587,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail and xdp_adjust_head: DMA sync for_device cover
-- 
2.43.0



* [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (6 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, K. Y. Srinivasan, Alexei Starovoitov,
	AngeloGioacchino Del Regno, Dexuan Cui, Dimitris Michailidis,
	Felix Fietkau, Haiyang Zhang, Horatiu Vultur, Jeroen de Borst,
	Jesper Dangaard Brouer, John Crispin, John Fastabend,
	Lorenzo Bianconi, Mark Lee, Matthias Brugger, Praveen Kaligineedi,
	Sean Wang, Shailend Chand, UNGLinuxDriver, Wei Liu, bpf,
	linux-hyperv, linux-mediatek

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
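
Why the lock has to span both the program run and the redirect can be shown
with a deterministic toy model. This is only a sketch under stated
assumptions: a single CPU, two tasks A and B, and a made-up
struct toy_redirect_info with a hypothetical tgt_ifindex field standing in
for the per-CPU redirect state. The interleaving is written out by hand, no
real preemption takes place.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-CPU redirect state (one CPU only). */
struct toy_redirect_info {
	int tgt_ifindex;
};

static struct toy_redirect_info toy_ri;

/* "Program" side: records where the packet should go. */
static void toy_prog_run(int ifindex)
{
	toy_ri.tgt_ifindex = ifindex;
}

/* "Redirect" side: consumes whatever the shared state currently holds. */
static int toy_do_redirect(void)
{
	return toy_ri.tgt_ifindex;
}

/* Without a lock, task B may run between task A's two steps. */
static int toy_task_a_target_unlocked(void)
{
	toy_prog_run(1);		/* task A: program picks ifindex 1 */
	toy_prog_run(2);		/* task B preempts A on the same CPU */
	return toy_do_redirect();	/* task A resumes, reads B's target */
}

/* With a lock around run + redirect, task B has to wait for task A. */
static int toy_task_a_target_locked(void)
{
	int a_target;

	toy_prog_run(1);		/* task A: run ... */
	a_target = toy_do_redirect();	/* ... and redirect, uninterrupted */
	toy_prog_run(2);		/* task B only starts afterwards */
	(void)toy_do_redirect();
	return a_target;
}
```

In the unlocked ordering task A ends up with task B's target, in the locked
ordering it gets its own.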

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Dimitris Michailidis <dmichail@fungible.com>
Cc: Felix Fietkau <nbd@nbd.name>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Crispin <john@phrozen.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>
Cc: Mark Lee <Mark-MC.Lee@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Sean Wang <sean.wang@mediatek.com>
Cc: Shailend Chand <shailend@google.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Wei Liu <wei.liu@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-mediatek@lists.infradead.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/fungible/funeth/funeth_rx.c     |  1 +
 drivers/net/ethernet/google/gve/gve_rx.c             | 12 +++++++-----
 drivers/net/ethernet/mediatek/mtk_eth_soc.c          |  1 +
 drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c |  1 +
 drivers/net/ethernet/microsoft/mana/mana_bpf.c       |  1 +
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/fungible/funeth/funeth_rx.c b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
index 7e2584895de39..e7b1382545908 100644
--- a/drivers/net/ethernet/fungible/funeth/funeth_rx.c
+++ b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
@@ -152,6 +152,7 @@ static void *fun_run_xdp(struct funeth_rxq *q, skb_frag_t *frags, void *buf_va,
 	xdp_prepare_buff(&xdp, buf_va, FUN_XDP_HEADROOM, skb_frag_size(frags) -
 			 (FUN_RX_TAILROOM + FUN_XDP_HEADROOM), false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_prog = READ_ONCE(q->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 73655347902d2..504c8ef761a33 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -779,11 +779,13 @@ static void gve_rx(struct gve_rx_ring *rx, netdev_features_t feat,
 				 page_info->page_offset, GVE_RX_PAD,
 				 len, false);
 		old_data = xdp.data;
-		xdp_act = bpf_prog_run_xdp(xprog, &xdp);
-		if (xdp_act != XDP_PASS) {
-			gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
-			ctx->total_size += frag_size;
-			goto finish_ok_pkt;
+		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+			xdp_act = bpf_prog_run_xdp(xprog, &xdp);
+			if (xdp_act != XDP_PASS) {
+				gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
+				ctx->total_size += frag_size;
+				goto finish_ok_pkt;
+			}
 		}
 
 		page_info->pad += xdp.data - old_data;
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 3cf6589cfdacf..477a74ee18c0a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1946,6 +1946,7 @@ static u32 mtk_xdp_run(struct mtk_eth *eth, struct mtk_rx_ring *ring,
 	if (!prog)
 		goto out;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
index 9ee61db8690b4..026311af07f9e 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
@@ -84,6 +84,7 @@ int lan966x_xdp_run(struct lan966x_port *port, struct page *page, u32 data_len)
 	xdp_prepare_buff(&xdp, page_address(page),
 			 IFH_LEN_BYTES + XDP_PACKET_HEADROOM,
 			 data_len - IFH_LEN_BYTES, false);
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 23b1521c0df96..d465b1dd9fca0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -93,6 +93,7 @@ u32 mana_run_xdp(struct net_device *ndev, struct mana_rxq *rxq,
 	xdp_init_buff(xdp, PAGE_SIZE, &rxq->xdp_rxq);
 	xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	rx_stats = &rxq->stats;
-- 
2.43.0



* [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (7 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-16  4:53   ` kernel test robot
  2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
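
Several of the zero-copy paths leave the locked section through an early
return or a goto to a failure label. The sketch below models that control
flow in userspace, assuming a GCC/Clang compiler with the cleanup variable
attribute. The toy_* names, the enum values and the macro are hypothetical
and only mimic the shape of the xsk handlers; the point is that the scope
exit drops the lock on every path.

```c
#include <assert.h>
#include <stddef.h>

enum toy_act { TOY_PASS, TOY_REDIRECT };
enum toy_result { TOY_RES_PASS, TOY_RES_REDIR, TOY_RES_CONSUMED };

static int toy_lock_held;		/* 1 while the toy lock is held */
static int toy_held_at_failure = -1;	/* lock state seen at out_failure */

static int *toy_enter(void)
{
	toy_lock_held = 1;
	return &toy_lock_held;
}

/* Called by the compiler whenever the guard variable leaves its scope. */
static void toy_exit(int **held)
{
	**held = 0;
}

/* Hypothetical scoped guard: a one-iteration for loop owns the lock. */
#define TOY_SCOPED_GUARD() \
	for (int *toy_g __attribute__((cleanup(toy_exit))) = toy_enter(), \
	     *toy_once = toy_g; toy_once; toy_once = NULL)

static enum toy_result toy_run_zc(enum toy_act act, int redirect_err)
{
	enum toy_result result = TOY_RES_PASS;

	TOY_SCOPED_GUARD() {
		if (act == TOY_REDIRECT) {
			if (!redirect_err)
				return TOY_RES_REDIR;	/* unlocks on return */
			result = TOY_RES_CONSUMED;
			goto out_failure;		/* unlocks on goto */
		}
	}
	return result;

out_failure:
	/* Error handling runs outside of the locked section. */
	toy_held_at_failure = toy_lock_held;
	return result;
}
```

Note that the label sits outside the guarded block: jumping out of the scope
is fine, whereas jumping over a guard declaration into its scope would not be.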

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: bpf@vger.kernel.org
Cc: intel-wired-lan@lists.osuosl.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 22 +++++++++--------
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  1 +
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 21 ++++++++--------
 drivers/net/ethernet/intel/igb/igb_main.c     |  1 +
 drivers/net/ethernet/intel/igc/igc_main.c     |  5 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 24 ++++++++++---------
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  3 ++-
 9 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index dd410b15000f7..76e069ae2183a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2326,6 +2326,7 @@ static int i40e_run_xdp(struct i40e_ring *rx_ring, struct xdp_buff *xdp, struct
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e99fa854d17f1..2b0c0c1f3ddc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -201,17 +201,19 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp,
 	struct i40e_ring *xdp_ring;
 	u32 act;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return I40E_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = I40E_XDP_EXIT;
-		else
-			result = I40E_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return I40E_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = I40E_XDP_EXIT;
+			else
+				result = I40E_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 9e97ea8630686..5d4cfa3455b37 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -571,6 +571,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	if (!xdp_prog)
 		goto exit;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 99954508184f9..02f89c22d19e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -762,17 +762,18 @@ ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	int err, result = ICE_XDP_PASS;
 	u32 act;
 
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return ICE_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = ICE_XDP_EXIT;
-		else
-			result = ICE_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return ICE_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = ICE_XDP_EXIT;
+			else
+				result = ICE_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b2295caa2f0ab..e01be809d030e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8621,6 +8621,7 @@ static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index e9bb403bbacf9..8321419b3a307 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2485,7 +2485,10 @@ static int __igc_xdp_run_prog(struct igc_adapter *adapter,
 			      struct bpf_prog *prog,
 			      struct xdp_buff *xdp)
 {
-	u32 act = bpf_prog_run_xdp(prog, xdp);
+	u32 act;
+
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
+	act = bpf_prog_run_xdp(prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 94bde2cad0f47..de564e8b83be2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2203,6 +2203,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 59798bc33298f..b988f758aad49 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -104,18 +104,20 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 	struct xdp_frame *xdpf;
 	u32 act;
 
-	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	if (likely(act == XDP_REDIRECT)) {
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		if (!err)
-			return IXGBE_XDP_REDIR;
-		if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
-			result = IXGBE_XDP_EXIT;
-		else
-			result = IXGBE_XDP_CONSUMED;
-		goto out_failure;
+		if (likely(act == XDP_REDIRECT)) {
+			err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+			if (!err)
+				return IXGBE_XDP_REDIR;
+			if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+				result = IXGBE_XDP_EXIT;
+			else
+				result = IXGBE_XDP_CONSUMED;
+			goto out_failure;
+		}
 	}
 
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a44e4bd561421..1c58c08aa15ff 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1059,7 +1059,8 @@ static struct sk_buff *ixgbevf_run_xdp(struct ixgbevf_adapter *adapter,
 	if (!xdp_prog)
 		goto xdp_out;
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
-- 
2.43.0



* [PATCH net-next 21/24] net: marvell: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (8 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Geetha sowjanya,
	Jesper Dangaard Brouer, John Fastabend, Marcin Wojtas,
	Russell King, Subbaraya Sundeep, Sunil Goutham, Thomas Petazzoni,
	hariprasad, bpf

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Geetha sowjanya <gakula@marvell.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Marcin Wojtas <mw@semihalf.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Subbaraya Sundeep <sbhatta@marvell.com>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: hariprasad <hkelam@marvell.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/marvell/mvneta.c                  | 2 ++
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c        | 1 +
 drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c | 1 +
 3 files changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 29aac327574d6..9c7aacd73b590 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2263,6 +2263,8 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 	len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
 	data_len = xdp->data_end - xdp->data;
+
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 93137606869e2..3a5524ffaba68 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3793,6 +3793,7 @@ mvpp2_run_xdp(struct mvpp2_port *port, struct bpf_prog *prog,
 	u32 ret, act;
 
 	len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM;
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
index 4d519ea833b2c..e48e84d6159bc 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
@@ -1422,6 +1422,7 @@ static bool otx2_xdp_rcv_pkt_handler(struct otx2_nic *pfvf,
 	xdp_prepare_buff(&xdp, hard_start, data - hard_start,
 			 cqe->sg.seg_size, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, &xdp);
 
 	switch (act) {
-- 
2.43.0



* [PATCH net-next 22/24] net: mellanox, nfp, sfc: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (9 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Edward Cree,
	Jesper Dangaard Brouer, John Fastabend, Leon Romanovsky,
	Louis Peens, Martin Habets, Saeed Mahameed, Tariq Toukan, bpf,
	linux-net-drivers, linux-rdma, oss-drivers

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT, these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally, the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect(), and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
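
The difference between the two locking scopes can be sketched in userspace C.
This is only an illustration under stated assumptions, not kernel code:
demo_lock, DEMO_GUARD(), run_prog(), demo_flush() and the rx_*() functions
are hypothetical stand-ins, and the macro relies on the GCC/Clang cleanup
attribute, the compiler feature that guard() is built on.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for a per-CPU local_lock_t. */
struct demo_lock {
	int depth;
};

static struct demo_lock redirect_lock;	/* stand-in for bpf_run_lock.redirect_lock */
static int depth_during_flush = -1;	/* lock depth observed by the flush step */

static void demo_lock_acquire(struct demo_lock *l) { l->depth++; }
static void demo_unlock_cleanup(struct demo_lock **l) { (*l)->depth--; }

/* Take the lock now and release it when the enclosing scope ends. */
#define DEMO_GUARD(l)							\
	struct demo_lock *demo_guard_var				\
	__attribute__((cleanup(demo_unlock_cleanup), unused)) =		\
		(demo_lock_acquire(l), (l))

/* Stand-in for bpf_prog_run_xdp(): reports the lock depth it ran under. */
static int run_prog(void) { return redirect_lock.depth; }

/* Stand-in for work such as xdp_do_flush() that should not need the lock. */
static void demo_flush(void) { depth_during_flush = redirect_lock.depth; }

/* Function-wide guard: the flush also runs with the lock held. */
static int rx_function_scope(void)
{
	DEMO_GUARD(&redirect_lock);
	int seen = run_prog();

	demo_flush();
	return seen;
}

/* Narrow scope: only the program run is covered by the lock. */
static int rx_narrow_scope(void)
{
	int seen;

	{
		DEMO_GUARD(&redirect_lock);
		seen = run_prog();
	}
	demo_flush();
	return seen;
}
```

In both variants the program runs with the lock held. Only the function-wide
variant still holds it during the flush step, which is the situation the
paragraph above describes for drivers that flush in the same context.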

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Edward Cree <ecree.xilinx@gmail.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Martin Habets <habetsm.xilinx@gmail.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: bpf@vger.kernel.org
Cc: linux-net-drivers@amd.com
Cc: linux-rdma@vger.kernel.org
Cc: oss-drivers@corigine.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c       | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
 drivers/net/ethernet/netronome/nfp/nfd3/dp.c     | 3 ++-
 drivers/net/ethernet/netronome/nfp/nfd3/xsk.c    | 1 +
 drivers/net/ethernet/netronome/nfp/nfdk/dp.c     | 3 ++-
 drivers/net/ethernet/sfc/rx.c                    | 1 +
 drivers/net/ethernet/sfc/siena/rx.c              | 1 +
 7 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a09b6e05337d9..c0a3ac3405bc5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -833,6 +833,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			mxbuf.ring = ring;
 			mxbuf.dev = dev;
 
+			guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
 			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 7decc81ed33a9..b4e3c6a5a6da6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -269,6 +269,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	u32 act;
 	int err;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
index 17381bfc15d72..a041b55514aa3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
@@ -1011,7 +1011,8 @@ static int nfp_nfd3_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 					 pkt_off - NFP_NET_RX_BUF_HEADROOM,
 					 pkt_len, true);
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+				act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 			pkt_len = xdp.data_end - xdp.data;
 			pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
index 45be6954d5aae..38f2d4c2b5b7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
@@ -216,6 +216,7 @@ nfp_nfd3_xsk_rx(struct nfp_net_rx_ring *rx_ring, int budget,
 			}
 		}
 
+		guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 		act = bpf_prog_run_xdp(xdp_prog, xrxbuf->xdp);
 
 		pkt_len = xrxbuf->xdp->data_end - xrxbuf->xdp->data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
index 8d78c6faefa8a..af0a36c4fb018 100644
--- a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
@@ -1130,7 +1130,8 @@ static int nfp_nfdk_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 					 pkt_off - NFP_NET_RX_BUF_HEADROOM,
 					 pkt_len, true);
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+				act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 			pkt_len = xdp.data_end - xdp.data;
 			pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index f77a2d3ef37ec..3712d29150af5 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
 			 rx_buf->len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	offset = (u8 *)xdp.data - *ehp;
diff --git a/drivers/net/ethernet/sfc/siena/rx.c b/drivers/net/ethernet/sfc/siena/rx.c
index 98d3c0743c0f5..6bfc4cd1c83e0 100644
--- a/drivers/net/ethernet/sfc/siena/rx.c
+++ b/drivers/net/ethernet/sfc/siena/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
 			 rx_buf->len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	offset = (u8 *)xdp.data - *ehp;
-- 
2.43.0



* [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: Use nested-BH locking for XDP redirect.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (10 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexandre Torgue, Alexei Starovoitov,
	Ariel Elior, Ilias Apalodimas, Jassi Brar, Jesper Dangaard Brouer,
	John Fastabend, Jose Abreu, Manish Chopra, Maxime Coquelin,
	Ravi Gunasekaran, Roger Quadros, Siddharth Vadapalli, bpf,
	linux-omap, linux-stm32

The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT, these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work because not
all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally, the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect(), and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.

Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ariel Elior <aelior@marvell.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jassi Brar <jaswinder.singh@linaro.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jose Abreu <joabreu@synopsys.com>
Cc: Manish Chopra <manishc@marvell.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Ravi Gunasekaran <r-gunasekaran@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: bpf@vger.kernel.org
Cc: linux-omap@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c        |  1 +
 drivers/net/ethernet/socionext/netsec.c           |  1 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  1 +
 drivers/net/ethernet/ti/cpsw_priv.c               | 15 +++++++++------
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index cb1746bc0e0c5..ce5af094fb817 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1091,6 +1091,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 	xdp_prepare_buff(&xdp, page_address(bd->data), *data_offset,
 			 *len, false);
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, &xdp);
 
 	/* Recalculate, as XDP might have changed the headers */
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 0891e9e49ecb5..47e314338f3f3 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -905,6 +905,7 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
 	int err;
 	u32 act;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 37e64283f9107..9e92affc8c22c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4893,6 +4893,7 @@ static int __stmmac_xdp_run_prog(struct stmmac_priv *priv,
 	u32 act;
 	int res;
 
+	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c
index 764ed298b5708..f38c49f9fab35 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -1335,9 +1335,15 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
 	if (!prog)
 		return CPSW_XDP_PASS;
 
-	act = bpf_prog_run_xdp(prog, xdp);
-	/* XDP prog might have changed packet data and boundaries */
-	*len = xdp->data_end - xdp->data;
+	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+		act = bpf_prog_run_xdp(prog, xdp);
+		/* XDP prog might have changed packet data and boundaries */
+		*len = xdp->data_end - xdp->data;
+		if (act == XDP_REDIRECT) {
+			if (xdp_do_redirect(ndev, xdp, prog))
+				goto drop;
+		}
+	}
 
 	switch (act) {
 	case XDP_PASS:
@@ -1352,9 +1358,6 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
 			xdp_return_frame_rx_napi(xdpf);
 		break;
 	case XDP_REDIRECT:
-		if (xdp_do_redirect(ndev, xdp, prog))
-			goto drop;
-
 		/*  Have to flush here, per packet, instead of doing it in bulk
 		 *  at the end of the napi handler. The RX devices on this
 		 *  particular hardware is sharing a common queue, so the
-- 
2.43.0



* [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process.
       [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
                   ` (11 preceding siblings ...)
  2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
@ 2023-12-15 17:07 ` Sebastian Andrzej Siewior
  12 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-12-15 17:07 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	Hao Luo, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

The users of bpf_redirect_info should lock the access by acquiring the
nested BH-lock bpf_run_lock.redirect_lock. This lock should be acquired
before the first usage (bpf_prog_run_xdp()) and dropped after the last
user in the context (xdp_do_redirect()).

Current users in the tree have been audited and updated.

Add a lockdep annotation to ensure that new users acquire the lock.
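
What such an assertion catches can be sketched in userspace C. This is a
minimal sketch, not the lockdep implementation: demo_lock,
demo_assert_held() and the two callers are hypothetical stand-ins, and the
macro counts violations instead of emitting a lockdep report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU redirect lock. */
struct demo_lock {
	bool held;
};

static struct demo_lock redirect_lock;
static int missing_lock_reports;	/* number of failed assertions */

/* Stand-in for lockdep_assert_held(): record a report, do not abort. */
#define demo_assert_held(l)			\
	do {					\
		if (!(l)->held)			\
			missing_lock_reports++;	\
	} while (0)

/* Stand-in for xdp_do_redirect(): expects the caller to hold the lock. */
static void demo_do_redirect(void)
{
	demo_assert_held(&redirect_lock);
}

/* Audited caller: lock taken before the first user, dropped after the last. */
static void audited_caller(void)
{
	redirect_lock.held = true;
	demo_do_redirect();
	redirect_lock.held = false;
}

/* New caller that forgot the lock: the assertion reports it. */
static void unaudited_caller(void)
{
	demo_do_redirect();
}
```

The audited caller passes silently, while the unaudited one is flagged the
first time it runs.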

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/net/xdp.h |  1 +
 net/core/filter.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 349c36fb5fd8f..cdeab175abf18 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -493,6 +493,7 @@ static inline void xdp_clear_features_flag(struct net_device *dev)
 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	/* Driver XDP hooks are invoked within a single NAPI poll cycle and thus
 	 * under local_bh_disable(), which provides the needed RCU protection
 	 * for accessing map entries.
diff --git a/net/core/filter.c b/net/core/filter.c
index 72a7812f933a1..a2f97503ed578 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2495,6 +2495,7 @@ int skb_do_redirect(struct sk_buff *skb)
 	struct net_device *dev;
 	u32 flags = ri->flags;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	dev = dev_get_by_index_rcu(net, ri->tgt_index);
 	ri->tgt_index = 0;
 	ri->flags = 0;
@@ -2525,6 +2526,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
 
@@ -2546,6 +2549,8 @@ BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
 
@@ -2568,6 +2573,8 @@ BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
 
@@ -4287,6 +4294,8 @@ u32 xdp_master_redirect(struct xdp_buff *xdp)
 	struct net_device *master, *slave;
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
 	if (slave && slave != xdp->rxq->dev) {
@@ -4394,6 +4403,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	enum bpf_map_type map_type = ri->map_type;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
@@ -4408,6 +4418,7 @@ int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	enum bpf_map_type map_type = ri->map_type;
 
+	lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
 		return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
 
-- 
2.43.0



* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
@ 2023-12-16  3:39   ` kernel test robot
  2023-12-18  8:33   ` Paolo Abeni
  1 sibling, 0 replies; 34+ messages in thread
From: kernel test robot @ 2023-12-16  3:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Andrii Nakryiko,
	David Ahern, Hao Luo, Jiri Olsa, John Fastabend, KP Singh,
	Martin KaFai Lau, Song Liu, Stanislav Fomichev, Yonghong Song,
	bpf

Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-13-bigeasy%40linutronix.de
patch subject: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
config: x86_64-randconfig-r131-20231216 (https://download.01.org/0day-ci/archive/20231216/202312161151.k1MBvXUD-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161151.k1MBvXUD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161151.k1MBvXUD-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/ipv6/seg6_local.c:1431:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct local_lock_t [usertype] *l @@     got struct local_lock_t [noderef] __percpu * @@
   net/ipv6/seg6_local.c:1431:9: sparse:     expected struct local_lock_t [usertype] *l
   net/ipv6/seg6_local.c:1431:9: sparse:     got struct local_lock_t [noderef] __percpu *

vim +1431 net/ipv6/seg6_local.c

  1410	
  1411	static int input_action_end_bpf(struct sk_buff *skb,
  1412					struct seg6_local_lwt *slwt)
  1413	{
  1414		struct seg6_bpf_srh_state *srh_state;
  1415		struct ipv6_sr_hdr *srh;
  1416		int ret;
  1417	
  1418		srh = get_and_validate_srh(skb);
  1419		if (!srh) {
  1420			kfree_skb(skb);
  1421			return -EINVAL;
  1422		}
  1423		advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
  1424	
  1425		/* The access to the per-CPU buffer srh_state is protected by running
  1426		 * always in softirq context (with disabled BH). On PREEMPT_RT the
  1427		 * required locking is provided by the following local_lock_nested_bh()
  1428		 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
  1429		 * bpf_prog_run_save_cb().
  1430		 */
> 1431		scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
  1432			srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
  1433			srh_state->srh = srh;
  1434			srh_state->hdrlen = srh->hdrlen << 3;
  1435			srh_state->valid = true;
  1436	
  1437			rcu_read_lock();
  1438			bpf_compute_data_pointers(skb);
  1439			ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
  1440			rcu_read_unlock();
  1441	
  1442			switch (ret) {
  1443			case BPF_OK:
  1444			case BPF_REDIRECT:
  1445				break;
  1446			case BPF_DROP:
  1447				goto drop;
  1448			default:
  1449				pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
  1450				goto drop;
  1451			}
  1452	
  1453			if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
  1454				goto drop;
  1455		}
  1456	
  1457		if (ret != BPF_REDIRECT)
  1458			seg6_lookup_nexthop(skb, NULL, 0);
  1459	
  1460		return dst_input(skb);
  1461	
  1462	drop:
  1463		kfree_skb(skb);
  1464		return -EINVAL;
  1465	}
  1466	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
@ 2023-12-16  4:53   ` kernel test robot
  2023-12-19  0:01     ` Nathan Chancellor
  0 siblings, 1 reply; 34+ messages in thread
From: kernel test robot @ 2023-12-16  4:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: llvm, oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Sebastian Andrzej Siewior, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

Hi Sebastian,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base:   net-next/main
patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
                   goto xdp_out;
                   ^
   drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
           guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
           ^
   include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
           CLASS(_name, __UNIQUE_ID(guard))
                        ^
   include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
   #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
                               ^
   include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
   #define __PASTE(a,b) ___PASTE(a,b)
                        ^
   include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
   #define ___PASTE(a,b) a##b
                         ^
   <scratch space>:52:1: note: expanded from here
   __UNIQUE_ID_guard753
   ^
   1 error generated.
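
The error arises because the goto jumps forward into the scope of a variable
that carries a cleanup attribute. Two possible restructurings are sketched
below in userspace C. This is an illustration only, not a proposed fix for
igb: demo_lock, DEMO_GUARD() and both functions are hypothetical stand-ins.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for a per-CPU local_lock_t. */
struct demo_lock {
	int depth;
};

static struct demo_lock redirect_lock;

static struct demo_lock *demo_lock_get(struct demo_lock *l)
{
	l->depth++;
	return l;
}

static void demo_unlock(struct demo_lock **l) { (*l)->depth--; }

/* Take the lock now and release it when the enclosing scope ends. */
#define DEMO_GUARD(l)						\
	struct demo_lock *demo_guard_var			\
	__attribute__((cleanup(demo_unlock), unused)) = demo_lock_get(l)

enum { DEMO_PASS = 0, DEMO_CONSUMED = 1 };

/* Variant A: return early, so nothing jumps over the guard. */
static int run_xdp_early_return(int have_prog, int act)
{
	if (!have_prog)
		return DEMO_PASS;

	DEMO_GUARD(&redirect_lock);
	assert(redirect_lock.depth == 1);	/* lock is held here */
	return act ? DEMO_CONSUMED : DEMO_PASS;
}

/* Variant B: keep the label, but confine the guard to an inner block
 * so that the label lies outside the guard variable's scope.
 */
static int run_xdp_inner_block(int have_prog, int act)
{
	int result = DEMO_PASS;

	if (!have_prog)
		goto out;

	{
		DEMO_GUARD(&redirect_lock);
		if (act)
			result = DEMO_CONSUMED;
	}
out:
	return result;
}
```

In both variants the lock is released on every path, and no jump enters the
scope of the guard variable after its initialization was skipped.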


vim +8620 drivers/net/ethernet/intel/igb/igb_main.c

b1bb2eb0a0deb0 Alexander Duyck           2017-02-06  8608  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8609  static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8610  				   struct igb_ring *rx_ring,
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8611  				   struct xdp_buff *xdp)
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8612  {
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8613  	int err, result = IGB_XDP_PASS;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8614  	struct bpf_prog *xdp_prog;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8615  	u32 act;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8616  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8617  	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8618  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8619  	if (!xdp_prog)
9cbc948b5a20c9 Sven Auhagen              2020-09-02 @8620  		goto xdp_out;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8621  
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8622  	prefetchw(xdp->data_hard_start); /* xdp_frame write */
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8623  
d568b111738dbb Sebastian Andrzej Siewior 2023-12-15  8624  	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8625  	act = bpf_prog_run_xdp(xdp_prog, xdp);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8626  	switch (act) {
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8627  	case XDP_PASS:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8628  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8629  	case XDP_TX:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8630  		result = igb_xdp_xmit_back(adapter, xdp);
74431c40b9c5fa Magnus Karlsson           2021-05-10  8631  		if (result == IGB_XDP_CONSUMED)
74431c40b9c5fa Magnus Karlsson           2021-05-10  8632  			goto out_failure;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8633  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8634  	case XDP_REDIRECT:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8635  		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
74431c40b9c5fa Magnus Karlsson           2021-05-10  8636  		if (err)
74431c40b9c5fa Magnus Karlsson           2021-05-10  8637  			goto out_failure;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8638  		result = IGB_XDP_REDIR;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8639  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8640  	default:
c8064e5b4adac5 Paolo Abeni               2021-11-30  8641  		bpf_warn_invalid_xdp_action(adapter->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8642  		fallthrough;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8643  	case XDP_ABORTED:
74431c40b9c5fa Magnus Karlsson           2021-05-10  8644  out_failure:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8645  		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8646  		fallthrough;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8647  	case XDP_DROP:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8648  		result = IGB_XDP_CONSUMED;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8649  		break;
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8650  	}
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8651  xdp_out:
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8652  	return ERR_PTR(-result);
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8653  }
9cbc948b5a20c9 Sven Auhagen              2020-09-02  8654  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
  2023-12-16  3:39   ` kernel test robot
@ 2023-12-18  8:33   ` Paolo Abeni
  2024-01-12 11:23     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 34+ messages in thread
From: Paolo Abeni @ 2023-12-18  8:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, David Ahern, Hao Luo, Jiri Olsa, John Fastabend,
	KP Singh, Martin KaFai Lau, Song Liu, Stanislav Fomichev,
	Yonghong Song, bpf

On Fri, 2023-12-15 at 18:07 +0100, Sebastian Andrzej Siewior wrote:
> The access to seg6_bpf_srh_states is protected by disabling preemption.
> Based on the code, the entry point is input_action_end_bpf() and
> every other function (the bpf helper functions bpf_lwt_seg6_*()), that
> is accessing seg6_bpf_srh_states, should be called from within
> input_action_end_bpf().
> 
> input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
> the function and then disables preemption. This looks wrong because if
> preemption needs to be disabled as part of the locking mechanism then
> the variable shouldn't be accessed beforehand.
> 
> Looking at how it is used via test_lwt_seg6local.sh then
> input_action_end_bpf() is always invoked from softirq context. If this
> is always the case then the preempt_disable() statement is superfluous.
> If this is not always invoked from softirq then disabling only
> preemption is not sufficient.
> 
> Replace the preempt_disable() statement with nested-BH locking. This is
> not an equivalent replacement as it assumes that the invocation of
> input_action_end_bpf() always occurs in softirq context and thus the
> preempt_disable() is superfluous.
> Add a local_lock_t to the data structure and use local_lock_nested_bh() in
> guard notation for locking. Add lockdep_assert_held() to ensure the lock
> is held while the per-CPU variable is referenced in the helper functions.
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/net/seg6_local.h |  1 +
>  net/core/filter.c        |  3 ++
>  net/ipv6/seg6_local.c    | 59 ++++++++++++++++++++++------------------
>  3 files changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
> index 3fab9dec2ec45..0f22771359f4c 100644
> --- a/include/net/seg6_local.h
> +++ b/include/net/seg6_local.h
> @@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
>  
>  struct seg6_bpf_srh_state {
>  	struct ipv6_sr_hdr *srh;
> +	local_lock_t bh_lock;
>  	u16 hdrlen;
>  	bool valid;
>  };
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1737884be52f8..c8013f762524b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
>  	void *srh_tlvs, *srh_end, *ptr;
>  	int srhoff = 0;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (srh == NULL)
>  		return -EINVAL;
>  
> @@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
>  	int hdroff = 0;
>  	int err;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	switch (action) {
>  	case SEG6_LOCAL_ACTION_END_X:
>  		if (!seg6_bpf_has_valid_srh(skb))
> @@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
>  	int srhoff = 0;
>  	int ret;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (unlikely(srh == NULL))
>  		return -EINVAL;
>  
> diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> index 24e2b4b494cb0..ed7278af321a2 100644
> --- a/net/ipv6/seg6_local.c
> +++ b/net/ipv6/seg6_local.c
> @@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
>  	return err;
>  }
>  
> -DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
> +DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
> +	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
> +};
>  
>  bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  {
> @@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  		this_cpu_ptr(&seg6_bpf_srh_states);
>  	struct ipv6_sr_hdr *srh = srh_state->srh;
>  
> +	lockdep_assert_held(&srh_state->bh_lock);
>  	if (unlikely(srh == NULL))
>  		return false;
>  
> @@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
>  static int input_action_end_bpf(struct sk_buff *skb,
>  				struct seg6_local_lwt *slwt)
>  {
> -	struct seg6_bpf_srh_state *srh_state =
> -		this_cpu_ptr(&seg6_bpf_srh_states);
> +	struct seg6_bpf_srh_state *srh_state;
>  	struct ipv6_sr_hdr *srh;
>  	int ret;
>  
> @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
>  	}
>  	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
>  
> -	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
> -	 * which is also accessed by the bpf_lwt_seg6_* helpers
> +	/* The access to the per-CPU buffer srh_state is protected by running
> +	 * always in softirq context (with disabled BH). On PREEMPT_RT the
> +	 * required locking is provided by the following local_lock_nested_bh()
> +	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> +	 * bpf_prog_run_save_cb().
>  	 */
> -	preempt_disable();
> -	srh_state->srh = srh;
> -	srh_state->hdrlen = srh->hdrlen << 3;
> -	srh_state->valid = true;
> +	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> +		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> +		srh_state->srh = srh;
> +		srh_state->hdrlen = srh->hdrlen << 3;
> +		srh_state->valid = true;

Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
the added indentation. What about using
local_lock_nested_bh()/local_unlock_nested_bh() directly?
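
For illustration, a minimal user-space sketch of the two styles being
compared. The fake_lock type, its helpers and the srh_state struct are
hypothetical stand-ins, not the kernel's local_lock API; the scoped
variant relies on the compiler's cleanup attribute, which is what the
guard macros expand to according to the build log elsewhere in this
thread.

```c
#include <assert.h>

/* Hypothetical stand-in for a lock; it only tracks whether it is held. */
struct fake_lock {
	int held;
	int acquisitions;
};

static void fake_lock_acquire(struct fake_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void fake_lock_release(struct fake_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handler: receives a pointer to the guard variable. */
static void fake_guard_exit(struct fake_lock **gp)
{
	fake_lock_release(*gp);
}

/* Hypothetical, simplified state protected by the lock. */
struct srh_state {
	int hdrlen;
	int valid;
};

/* Scoped style: the lock is dropped when 'g' leaves the inner block,
 * at the cost of one extra indentation level.
 */
static void set_state_scoped(struct fake_lock *l, struct srh_state *s,
			     int hdrlen)
{
	{
		struct fake_lock *g __attribute__((cleanup(fake_guard_exit))) = l;

		fake_lock_acquire(g);
		s->hdrlen = hdrlen << 3;
		s->valid = 1;
	}
}

/* Explicit style: no extra indentation, but every exit path must unlock. */
static void set_state_explicit(struct fake_lock *l, struct srh_state *s,
			       int hdrlen)
{
	fake_lock_acquire(l);
	s->hdrlen = hdrlen << 3;
	s->valid = 1;
	fake_lock_release(l);
}
```

Both functions leave the lock released and the state updated; the
difference is only in how the unlock is expressed and how much the
surrounding code has to be re-indented.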

Cheers,

Paolo



* Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
@ 2023-12-18  8:52   ` Daniel Borkmann
  2024-01-12 15:37     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2023-12-18  8:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, netdev
  Cc: David S. Miller, Boqun Feng, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, K. Y. Srinivasan,
	Michael S. Tsirkin, Alexei Starovoitov, Andrii Nakryiko,
	Dexuan Cui, Haiyang Zhang, Hao Luo, Jesper Dangaard Brouer,
	Jiri Olsa, John Fastabend, Juergen Gross, KP Singh,
	Martin KaFai Lau, Nikolay Aleksandrov, Song Liu,
	Stanislav Fomichev, Stefano Stabellini, Wei Liu, Willem de Bruijn,
	Xuan Zhuo, Yonghong Song, bpf, virtualization, xen-devel

Hi Sebastian,

On 12/15/23 6:07 PM, Sebastian Andrzej Siewior wrote:
> The per-CPU variables used during bpf_prog_run_xdp() invocation and
> later during xdp_do_redirect() rely on disabled BH for their protection.
> Without locking in local_bh_disable() on PREEMPT_RT these data structure
> require explicit locking.
> 
> This is a follow-up on the previous change which introduced
> bpf_run_lock.redirect_lock and uses it now within drivers.
> 
> The simple way is to acquire the lock before bpf_prog_run_xdp() is
> invoked and hold it until the end of function.
> This does not always work because some drivers (cpsw, atlantic) invoke
> xdp_do_flush() in the same context.
> Acquiring the lock in bpf_prog_run_xdp() and dropping in
> xdp_do_redirect() (without touching drivers) does not work because not
> all driver, which use bpf_prog_run_xdp(), do support XDP_REDIRECT (and
> invoke xdp_do_redirect()).
> 
> Ideally the minimal locking scope would be bpf_prog_run_xdp() +
> xdp_do_redirect() and everything else (error recovery, DMA unmapping,
> free/ alloc of memory, …) would happen outside of the locked section.
[...]

>   drivers/net/hyperv/netvsc_bpf.c |  1 +
>   drivers/net/netkit.c            | 13 +++++++----
>   drivers/net/tun.c               | 28 +++++++++++++----------
>   drivers/net/veth.c              | 40 ++++++++++++++++++++-------------
>   drivers/net/virtio_net.c        |  1 +
>   drivers/net/xen-netfront.c      |  1 +
>   6 files changed, 52 insertions(+), 32 deletions(-)
[...]

Please exclude netkit from this set given it does not support XDP, but
instead only accepts tc BPF typed programs.

Thanks,
Daniel

> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 39171380ccf29..fbcf78477bda8 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>   	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
>   	skb->dev = peer;
>   	entry = rcu_dereference(nk->active);
> -	if (entry)
> -		ret = netkit_run(entry, skb, ret);
> +	if (entry) {
> +		scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
> +			ret = netkit_run(entry, skb, ret);
> +			if (ret == NETKIT_REDIRECT) {
> +				dev_sw_netstats_tx_add(dev, 1, len);
> +				skb_do_redirect(skb);
> +			}
> +		}
> +	}
>   	switch (ret) {
>   	case NETKIT_NEXT:
>   	case NETKIT_PASS:
> @@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>   		}
>   		break;
>   	case NETKIT_REDIRECT:
> -		dev_sw_netstats_tx_add(dev, 1, len);
> -		skb_do_redirect(skb);
>   		break;
>   	case NETKIT_DROP:
>   	default:


* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-16  4:53   ` kernel test robot
@ 2023-12-19  0:01     ` Nathan Chancellor
  2023-12-19 16:55       ` Nick Desaulniers
  0 siblings, 1 reply; 34+ messages in thread
From: Nathan Chancellor @ 2023-12-19  0:01 UTC (permalink / raw)
  To: kernel test robot
  Cc: Sebastian Andrzej Siewior, linux-kernel, netdev, llvm,
	oe-kbuild-all, Boqun Feng, Daniel Borkmann, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Jesper Dangaard Brouer, Jesse Brandeburg,
	John Fastabend, Tony Nguyen, bpf, intel-wired-lan

On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> Hi Sebastian,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on net-next/main]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> base:   net-next/main
> patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
> compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
> >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
>                    goto xdp_out;
>                    ^
>    drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
>            guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>            ^
>    include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
>            CLASS(_name, __UNIQUE_ID(guard))
>                         ^
>    include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
>    #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
>                                ^
>    include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
>    #define __PASTE(a,b) ___PASTE(a,b)
>                         ^
>    include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
>    #define ___PASTE(a,b) a##b
>                          ^
>    <scratch space>:52:1: note: expanded from here
>    __UNIQUE_ID_guard753
>    ^
>    1 error generated.

I initially thought that this may have been
https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
involved here.

This error occurs because jumping over the initialization of a variable
declared with __attribute__((__cleanup__(...))) does not prevent the
clean up function from running as one may expect it to, but could
instead result in the clean up function getting run on uninitialized
memory. A contrived example (see the bottom of the "Output" tabs for the
execution output):

https://godbolt.org/z/9bvGboxvc

While there is a warning from GCC in that example, I don't see one in
the kernel's case. I see there is an open GCC issue around this problem:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951

While it is possible that there may not actually be a problem with how
the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
clang's behavior is reasonable given the potential footguns that this
construct has.

Cheers,
Nathan


* Re: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
  2023-12-19  0:01     ` Nathan Chancellor
@ 2023-12-19 16:55       ` Nick Desaulniers
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Desaulniers @ 2023-12-19 16:55 UTC (permalink / raw)
  To: Nathan Chancellor, Sebastian Andrzej Siewior
  Cc: kernel test robot, linux-kernel, netdev, llvm, oe-kbuild-all,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Jesper Dangaard Brouer, Jesse Brandeburg, John Fastabend,
	Tony Nguyen, bpf, intel-wired-lan

On Mon, Dec 18, 2023 at 4:01 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> > Hi Sebastian,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on net-next/main]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> > base:   net-next/main
> > patch link:    https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> > patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> > config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/config)
> > compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/202312161212.D5tju5i6-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202312161212.D5tju5i6-lkp@intel.com/
> >
> > All errors (new ones prefixed by >>):
> >
> > >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
> >                    goto xdp_out;
> >                    ^

^ The problematic goto should be replaced with an early return (and
perhaps a comment that you can't jump over __cleanup variable
initialization).

Otherwise the compiler cannot put the cleanup in the destination basic
block; it would have to split the edges and have all the happy paths
go to a synthesized basic block that runs the cleanup, then jumps to
the original destination.
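
A minimal user-space sketch of that restructuring, assuming a
hypothetical fake_lock type and a hypothetical run_prog() function
(neither is kernel code). The error path returns before the guard
variable is declared, so no jump crosses its initialization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock; not the kernel's local_lock API. */
struct fake_lock {
	int held;
};

/* Cleanup handler, called with a pointer to the guard variable. */
static void fake_guard_exit(struct fake_lock **gp)
{
	(*gp)->held = 0;
}

static struct fake_lock *fake_guard_enter(struct fake_lock *l)
{
	l->held = 1;
	return l;
}

/* Returns -1 for a missing buffer, otherwise the value read while the
 * lock is held.
 */
static int run_prog(struct fake_lock *l, const int *buf)
{
	if (buf == NULL)
		return -1;	/* early return instead of "goto out" */

	/* Do not jump over this declaration: the cleanup handler would
	 * then run on an uninitialized pointer.
	 */
	struct fake_lock *g __attribute__((cleanup(fake_guard_exit))) =
		fake_guard_enter(l);

	/* The return expression is evaluated before the cleanup runs. */
	return g->held ? *buf : -2;
}
```

With this shape both GCC and clang accept the function, because every
path that reaches the end of the guard's scope has also passed through
its initialization.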

> >    drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
> >            guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >            ^
> >    include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
> >            CLASS(_name, __UNIQUE_ID(guard))
> >                         ^
> >    include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
> >    #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> >                                ^
> >    include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
> >    #define __PASTE(a,b) ___PASTE(a,b)
> >                         ^
> >    include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
> >    #define ___PASTE(a,b) a##b
> >                          ^
> >    <scratch space>:52:1: note: expanded from here
> >    __UNIQUE_ID_guard753
> >    ^
> >    1 error generated.
>
> I initially thought that this may have been
> https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
> involved here.
>
> This error occurs because jumping over the initialization of a variable
> declared with __attribute__((__cleanup__(...))) does not prevent the
> clean up function from running as one may expect it to, but could
> instead result in the clean up function getting run on uninitialized
> memory. A contrived example (see the bottom of the "Output" tabs for the
> execution output):
>
> https://godbolt.org/z/9bvGboxvc
>
> While there is a warning from GCC in that example, I don't see one in
> the kernel's case. I see there is an open GCC issue around this problem:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951
>
> While it is possible that there may not actually be a problem with how
> the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
> clang's behavior is reasonable given the potential footguns that this
> construct has.
>
> Cheers,
> Nathan
>


-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
@ 2023-12-20  0:25   ` Alexei Starovoitov
  2024-01-04 19:29     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-12-20  0:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a0f6da7b3ae5..5ba7509e88752 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>                 *pt_prev = NULL;
>         }
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>         tcx_set_ingress(skb, true);
>
> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>         if (!entry)
>                 return skb;
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>          * already set by the caller.
>          */
> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>                 u32 act;
>                 int err;
>
> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>                 if (act != XDP_PASS) {
>                         switch (act) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7c9653734fb60..72a7812f933a1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>   */
>  void xdp_do_flush(void)
>  {
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         __dev_flush();
>         __cpu_map_flush();
>         __xsk_map_flush();
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index a94943681e5aa..74b88e897a7e3 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>          * BPF prog and skb_do_redirect().
>          */
>         local_bh_disable();
> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>         bpf_compute_data_pointers(skb);
>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>
> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>                 break;
>         }
>
> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>         local_bh_enable();
>
>         return ret;
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 1976bd1639863..da61b99bc558f 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -23,6 +23,7 @@
>  #include <linux/jhash.h>
>  #include <linux/rculist.h>
>  #include <linux/rhashtable.h>
> +#include <linux/bpf.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  #include <net/netlink.h>
> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>
>         fl = rcu_dereference_bh(qe->filter_chain);
>
> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>         case TC_ACT_SHOT:
>                 qdisc_qstats_drop(sch);

Here and in all other places this patch adds locks that
will kill performance of XDP, tcx and everything else in networking.

I'm surprised Jesper and other folks are not jumping in with nacks.
We measure performance in nanoseconds here.
Extra lock is no go.
Please find a different way without ruining performance.


* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2023-12-20  0:25   ` Alexei Starovoitov
@ 2024-01-04 19:29     ` Toke Høiland-Jørgensen
  2024-01-12 17:41       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-04 19:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Sebastian Andrzej Siewior
  Cc: LKML, Network Development, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Paolo Abeni, Peter Zijlstra, Thomas Gleixner,
	Waiman Long, Will Deacon, Alexei Starovoitov, Andrii Nakryiko,
	Cong Wang, Hao Luo, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Jiri Olsa, Jiri Pirko, John Fastabend, KP Singh, Martin KaFai Lau,
	Ronak Doshi, Song Liu, Stanislav Fomichev,
	VMware PV-Drivers Reviewers, Yonghong Song, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 5a0f6da7b3ae5..5ba7509e88752 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>>                 *pt_prev = NULL;
>>         }
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         qdisc_skb_cb(skb)->pkt_len = skb->len;
>>         tcx_set_ingress(skb, true);
>>
>> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>>         if (!entry)
>>                 return skb;
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>>          * already set by the caller.
>>          */
>> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>>                 u32 act;
>>                 int err;
>>
>> +               guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>                 act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>>                 if (act != XDP_PASS) {
>>                         switch (act) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 7c9653734fb60..72a7812f933a1 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>   */
>>  void xdp_do_flush(void)
>>  {
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         __dev_flush();
>>         __cpu_map_flush();
>>         __xsk_map_flush();
>> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
>> index a94943681e5aa..74b88e897a7e3 100644
>> --- a/net/core/lwt_bpf.c
>> +++ b/net/core/lwt_bpf.c
>> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>          * BPF prog and skb_do_redirect().
>>          */
>>         local_bh_disable();
>> +       local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>>         bpf_compute_data_pointers(skb);
>>         ret = bpf_prog_run_save_cb(lwt->prog, skb);
>>
>> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>>                 break;
>>         }
>>
>> +       local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>>         local_bh_enable();
>>
>>         return ret;
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 1976bd1639863..da61b99bc558f 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/jhash.h>
>>  #include <linux/rculist.h>
>>  #include <linux/rhashtable.h>
>> +#include <linux/bpf.h>
>>  #include <net/net_namespace.h>
>>  #include <net/sock.h>
>>  #include <net/netlink.h>
>> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>>
>>         fl = rcu_dereference_bh(qe->filter_chain);
>>
>> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>>         case TC_ACT_SHOT:
>>                 qdisc_qstats_drop(sch);
>
> Here and in all other places this patch adds locks that
> will kill performance of XDP, tcx and everything else in networking.
>
> I'm surprised Jesper and other folks are not jumping in with nacks.
> We measure performance in nanoseconds here.
> Extra lock is no go.
> Please find a different way without ruining performance.

I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
believe there are people who are using XDP on PREEMPT_RT kernels and
still expect decent performance. And to achieve that it is absolutely
imperative that we can amortise expensive operations (such as locking)
over multiple packets.

I realise there's a fundamental trade-off between the amount of
amortisation and the latency hit that we take from holding locks for
longer, but tuning the batch size (while still keeping some amount of
batching) may be a way forward? I suppose Jakub's suggestion in the
other part of the thread, of putting the locks around napi->poll(), is a
step towards something like this.
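
To make the trade-off concrete, a minimal user-space sketch that counts
lock round trips for per-packet locking versus locking once per batch.
The counting_lock type, process_packet() and both poll functions are
hypothetical and only model the bookkeeping, not NAPI or XDP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock that only counts how often it is taken. */
struct counting_lock {
	unsigned int acquisitions;
	int held;
};

static void lock_take(struct counting_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void lock_drop(struct counting_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Placeholder for the per-packet work done under the lock. */
static void process_packet(int *pkt)
{
	*pkt += 1;
}

/* One lock round trip per packet. */
static void poll_per_packet(struct counting_lock *l, int *pkts, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		lock_take(l);
		process_packet(&pkts[i]);
		lock_drop(l);
	}
}

/* One lock round trip per batch of at most 'batch' packets. A smaller
 * batch bounds how long the lock is held; a larger one amortises more.
 */
static void poll_batched(struct counting_lock *l, int *pkts, size_t n,
			 size_t batch)
{
	if (batch == 0)
		batch = 1;	/* avoid an endless loop on a zero batch size */

	for (size_t start = 0; start < n; start += batch) {
		size_t end = start + batch < n ? start + batch : n;

		lock_take(l);
		for (size_t i = start; i < end; i++)
			process_packet(&pkts[i]);
		lock_drop(l);
	}
}
```

For ten packets and a batch size of four, the batched variant takes the
lock three times instead of ten, and holds it for at most four packets
at a time.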

-Toke



* Re: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
  2023-12-18  8:33   ` Paolo Abeni
@ 2024-01-12 11:23     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 11:23 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng,
	Daniel Borkmann, Eric Dumazet, Frederic Weisbecker, Ingo Molnar,
	Jakub Kicinski, Peter Zijlstra, Thomas Gleixner, Waiman Long,
	Will Deacon, Alexei Starovoitov, Andrii Nakryiko, David Ahern,
	Hao Luo, Jiri Olsa, John Fastabend, KP Singh, Martin KaFai Lau,
	Song Liu, Stanislav Fomichev, Yonghong Song, bpf

On 2023-12-18 09:33:39 [+0100], Paolo Abeni wrote:
> > --- a/net/ipv6/seg6_local.c
> > +++ b/net/ipv6/seg6_local.c
> > @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
> >  	}
> >  	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
> >  
> > -	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
> > -	 * which is also accessed by the bpf_lwt_seg6_* helpers
> > +	/* The access to the per-CPU buffer srh_state is protected by running
> > +	 * always in softirq context (with disabled BH). On PREEMPT_RT the
> > +	 * required locking is provided by the following local_lock_nested_bh()
> > +	 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> > +	 * bpf_prog_run_save_cb().
> >  	 */
> > -	preempt_disable();
> > -	srh_state->srh = srh;
> > -	srh_state->hdrlen = srh->hdrlen << 3;
> > -	srh_state->valid = true;
> > +	scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> > +		srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> > +		srh_state->srh = srh;
> > +		srh_state->hdrlen = srh->hdrlen << 3;
> > +		srh_state->valid = true;
> 
> Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
> the added indentation. What about using directly
> local_lock_nested_bh()/local_unlock_nested_bh() ?

If this is preferred, sure.

> Cheers,
> 
> Paolo

Sebastian


* Re: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
  2023-12-18  8:52   ` Daniel Borkmann
@ 2024-01-12 15:37     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 15:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: linux-kernel, netdev, David S. Miller, Boqun Feng, Eric Dumazet,
	Frederic Weisbecker, Ingo Molnar, Jakub Kicinski, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	K. Y. Srinivasan, Michael S. Tsirkin, Alexei Starovoitov,
	Andrii Nakryiko, Dexuan Cui, Haiyang Zhang, Hao Luo,
	Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, Juergen Gross,
	KP Singh, Martin KaFai Lau, Nikolay Aleksandrov, Song Liu,
	Stanislav Fomichev, Stefano Stabellini, Wei Liu, Willem de Bruijn,
	Xuan Zhuo, Yonghong Song, bpf, virtualization, xen-devel

On 2023-12-18 09:52:05 [+0100], Daniel Borkmann wrote:
> Hi Sebastian,
Hi Daniel,

> Please exclude netkit from this set given it does not support XDP, but
> instead only accepts tc BPF typed programs.

okay, thank you.

> Thanks,
> Daniel

Sebastian


* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-04 19:29     ` Toke Høiland-Jørgensen
@ 2024-01-12 17:41       ` Sebastian Andrzej Siewior
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-12 17:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >>
> >>         fl = rcu_dereference_bh(qe->filter_chain);
> >>
> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> >>         case TC_ACT_SHOT:
> >>                 qdisc_qstats_drop(sch);
> >
> > Here and in all other places this patch adds locks that
> > will kill performance of XDP, tcx and everything else in networking.
> >
> > I'm surprised Jesper and other folks are not jumping in with nacks.
> > We measure performance in nanoseconds here.
> > Extra lock is no go.
> > Please find a different way without ruining performance.
> 
> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
> believe there are people who are using XDP on PREEMPT_RT kernels and
> still expect decent performance. And to achieve that it is absolutely
> imperative that we can amortise expensive operations (such as locking)
> over multiple packets.
> 
> I realise there's a fundamental trade-off between the amount of
> amortisation and the latency hit that we take from holding locks for
> longer, but tuning the batch size (while still keeping some amount of
> batching) may be a way forward? I suppose Jakub's suggestion in the
> other part of the thread, of putting the locks around napi->poll(), is a
> step towards something like this.

The RT requirements are usually different. Networking as in CAN might be
important but Ethernet could only be used for remote communication and so
"not" important. People complained that they need to wait for Ethernet
to be done before the CAN packet can be injected into the stack.
With that expectation you would like to pause Ethernet immediately and
switch over to the CAN interrupt thread.

But if someone managed to set up XDP then it is likely to be important.
With RT traffic it is usually not the throughput that matters but the
latency. You are likely in the position to receive a packet, say every
1ms, and need to respond immediately. XDP would be used to inspect the
packet and either hand it over to the stack or process it.

I expected the lock operation (under RT) to always succeed and not
cause any delay because it should not be contended. It should only
block if something with higher priority preempted the current interrupt
thread _and_ also happens to use XDP on the same CPU. In that case (XDP
is needed) it would flush the current user out of the locked section
before the higher-prio thread could take over. Doing bulk and allowing
the low-priority thread to complete would delay the high-priority
thread. Maybe I am too pessimistic here and having two XDP programs on
one CPU is unlikely to happen.

Adding the lock on a per-NAPI basis would allow batching packets.
Acquiring the lock only if XDP is supported would not block the CAN
drivers since they don't support XDP. But it sounds like a hack.

Daniel said netkit doesn't need this locking because it does not
support this redirect, and it made me think. Would it work to make
the redirect structures part of the bpf_prog structure instead of
per-CPU? My understanding is that an eBPF program's data structures are
part of it and contain locking that allows one eBPF program to preempt
another one.
Having the redirect structures be part of the program would make the
locking obsolete. Am I missing anything?

> -Toke

Sebastian

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-12 17:41       ` Sebastian Andrzej Siewior
@ 2024-01-17 16:37         ` Toke Høiland-Jørgensen
  2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-17 16:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> 
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >>         fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> +       guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >>         switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >>         case TC_ACT_SHOT:
>> >>                 qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>> 
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>> 
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important, but Ethernet might only be used for remote communication and
> so be "not" important. People complained that they need to wait for
> Ethernet to be done before the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over to the CAN interrupt thread.
>
> But if someone managed to set up XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.

I am not contesting that latency is important, but it's a pretty
fundamental trade-off and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; and there
are definitely people who want to use XDP on an RT kernel and still
achieve high PPS rates.

(Whether those people really strictly speaking need to be running an RT
kernel is maybe debatable, but it does happen).

> I expected the lock operation (under RT) to always succeed and not
> cause any delay because it should not be contended.

A lock does cause delay even when it's not contended. Bear in mind that
at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
packet (for 64-byte packets). So just the atomic op to figure out
whether there's any contention (around 10ns on the Intel processors I
usually test on) will blow a huge chunk of the total processing budget.
We can't actually do the full processing needed in those 64 nanoseconds
(not to mention the 6.4 nanoseconds we have available at 100Gbps), which
is why it's essential to amortise as much as we can over multiple
packets.
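
The budget arithmetic above can be reproduced with a small standalone C
sketch. It assumes minimum-size 64-byte Ethernet frames plus 20 bytes of
per-frame wire overhead (preamble, start-of-frame delimiter and
inter-frame gap), which is an assumption added here and gives roughly
67 ns per packet at 10 Gbps, in the same ballpark as the 64 ns quoted
above. The helper names are made up for illustration and are not kernel
APIs; the 10 ns lock cost is simply the figure mentioned in the text.

```c
#include <assert.h>

/* Wire overhead per Ethernet frame: 7 B preamble + 1 B SFD + 12 B gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20

/* Time available per packet, in nanoseconds, at full line rate.
 * Hypothetical helper, not a kernel API. */
static double pkt_budget_ns(double link_bps, unsigned int frame_bytes)
{
	double bits_on_wire;

	assert(link_bps > 0.0);
	bits_on_wire = (frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8.0;
	return bits_on_wire / link_bps * 1e9;
}

/* Fraction of the per-packet budget eaten by a fixed per-packet cost. */
static double cost_share(double cost_ns, double budget_ns)
{
	assert(budget_ns > 0.0);
	return cost_ns / budget_ns;
}
```

Under these assumptions pkt_budget_ns(10e9, 64) comes out at about
67.2 ns and pkt_budget_ns(100e9, 64) at about 6.72 ns, so a 10 ns
per-packet atomic operation would take roughly 15% of the budget at
10 Gbps and more than the whole budget at 100 Gbps.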

This is all back-of-the-envelope calculations, of course. Having some
actual numbers to look at would be great; I don't suppose you have a
setup where you can run xdp-bench and see how your patches affect the
throughput?

> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happens to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk and allowing the low-priority thread to complete would
> delay the high-priority thread. Maybe I am too pessimistic here and
> having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on a per-NAPI basis would allow batching packets.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers since they don't support XDP. But it sounds like a hack.

I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing in two stages, each
with their individual batching. So whereas right now we're doing
something like:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      run_netstack(pkt)  // this is the expensive bit
  bh_enable()

We could instead do:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  bh_enable()
  // sched point
  bh_disable()
  for pkt in to_stack_list:
    run_netstack(pkt)
  bh_enable()

This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
being that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So not really
sure if this approach is feasible, TBH.
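
The two-stage split described above can be modelled in plain userspace
C. This is only a sketch under stated assumptions: section_enter() and
section_exit() stand in for whatever protects the batch (bh_disable()
in the pseudocode), run_xdp() and run_netstack() are stubs, and every
name here is hypothetical rather than an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define BUDGET 64

enum verdict { VERDICT_DROP, VERDICT_PASS };

struct pkt { unsigned int id; };

struct stats {
	unsigned int lock_sections;	/* how often the section was entered */
	unsigned int to_stack;		/* packets handed to the stack */
};

/* Stand-ins for the real protection, XDP program and network stack. */
static void section_enter(struct stats *st) { st->lock_sections++; }
static void section_exit(struct stats *st) { (void)st; }

static enum verdict run_xdp(const struct pkt *p)
{
	/* Toy policy: pass even packet ids, drop odd ones. */
	return (p->id % 2) ? VERDICT_DROP : VERDICT_PASS;
}

static void run_netstack(const struct pkt *p, struct stats *st)
{
	(void)p;
	st->to_stack++;
}

/* Stage 1 runs XDP for the whole batch in one protected section and
 * only queues the passed packets; stage 2 feeds them to the stack. */
static void run_napi_two_stage(const struct pkt *pkts, size_t n,
			       struct stats *st)
{
	const struct pkt *to_stack_list[BUDGET];
	size_t queued = 0, i;

	assert(st != NULL);
	if (n > BUDGET)
		n = BUDGET;

	section_enter(st);
	for (i = 0; i < n; i++)
		if (run_xdp(&pkts[i]) == VERDICT_PASS)
			to_stack_list[queued++] = &pkts[i];
	section_exit(st);

	/* A scheduling point could sit here, between the two stages. */

	section_enter(st);
	for (i = 0; i < queued; i++)
		run_netstack(to_stack_list[i], st);
	section_exit(st);
}
```

In this model the protected section is entered twice per poll no matter
how many packets are in the batch, which is the amortisation the
proposal is after.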

> Daniel said netkit doesn't need this locking because it does not
> support this redirect, and it made me think. Would it work to make
> the redirect structures part of the bpf_prog structure instead of
> per-CPU? My understanding is that an eBPF program's data structures are
> part of it and contain locking that allows one eBPF program to preempt
> another one.
> Having the redirect structures be part of the program would make the
> locking obsolete. Am I missing anything?

This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why this is all being kept in per-CPU variables today.
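
A toy model may make the problem concrete. The sketch below is a
simplified userspace illustration, not the kernel's data structures:
redirect_info is reduced to a single field, the two "CPUs" are simulated
by calling the same function twice in a fixed order, and all names are
assumptions made for the example.

```c
#include <assert.h>

#define NR_CPUS 2

struct redirect_info { int tgt_index; };

/* Variant A (the idea being rejected): state lives in the program. */
struct prog_shared { struct redirect_info ri; };

/* Variant B (what the text describes): one slot per CPU. */
static struct redirect_info per_cpu_ri[NR_CPUS];

/* "Run the program": record where this packet should be redirected. */
static void prog_run_shared(struct prog_shared *prog, int tgt)
{
	prog->ri.tgt_index = tgt;
}

static void prog_run_percpu(int cpu, int tgt)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	per_cpu_ri[cpu].tgt_index = tgt;
}

/* Interleaving: CPU 0 runs the program, then CPU 1 runs the same
 * program for another packet, and only then does CPU 0 act on its
 * redirect. Returns the target CPU 0 ends up seeing. */
static int cpu0_target_shared(struct prog_shared *prog)
{
	prog_run_shared(prog, 10);	/* CPU 0 wants target 10 */
	prog_run_shared(prog, 20);	/* CPU 1 wants target 20 */
	return prog->ri.tgt_index;	/* CPU 0 reads 20: clobbered */
}

static int cpu0_target_percpu(void)
{
	prog_run_percpu(0, 10);
	prog_run_percpu(1, 20);
	return per_cpu_ri[0].tgt_index;	/* CPU 0 still reads 10 */
}
```

With the state stored in the program, the second CPU overwrites the
first CPU's pending redirect target; with one slot per CPU each CPU
keeps its own.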

-Toke

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
@ 2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 34+ messages in thread
From: Jakub Kicinski @ 2024-01-18  2:04 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
> I am not contesting that latency is important, but it's a pretty
> fundamental trade-off and we don't want to kill throughput entirely
> either. Especially since this is global to the whole kernel; and there
> are definitely people who want to use XDP on an RT kernel and still
> achieve high PPS rates.
> 
> (Whether those people really strictly speaking need to be running an RT
> kernel is maybe debatable, but it does happen).
> 
> > I expected the lock operation (under RT) to always succeed and not
> > cause any delay because it should not be contended.  
> 
> A lock does cause delay even when it's not contended. Bear in mind that
> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
> packet (for 64-byte packets). So just the atomic op to figure out
> whether there's any contention (around 10ns on the Intel processors I
> usually test on) will blow a huge chunk of the total processing budget.
> We can't actually do the full processing needed in those 64 nanoseconds
> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
> is why it's essential to amortise as much as we can over multiple
> packets.
> 
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

A potentially stupid idea which I have been turning over in my head is
how we could get away from having the driver handle details of NAPI
budgeting. It's a source of bugs and endless review comments.

All drivers end up maintaining a counter of "how many packets have
I processed" and comparing that against the budget. Would it be crazy
if we put that inside napi_struct? Add a "budget" member inside
napi_struct as well, and:

struct napi_struct {
...
	// poll state
	unsigned int budget;
	unsigned int rx_used;
...
}

static inline bool napi_rx_has_budget(napi)
{
	return napi->budget > napi->rx_used;
}

poll(napi) // no budget
{
	while (napi_rx_has_budget(napi)) {
		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
		// maybe add explicit napi_rx_count() if
		// driver did something funny with the frame.
	}
}

We can also create napi_tx_has_budget() so that people stop being
confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
so that people stop completing XDP in hard irq context (budget==0)...

And we can pass napi into napi_consume_skb(), instead of, presumably
inexplicably to a newcomer, passing in budget.
And napi_complete_done() can lose the work_done argument, too.
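
To make the idea concrete, here is a compilable userspace model of the
budget encapsulation sketched above. It is not the kernel's napi_struct:
struct napi_model, napi_model_receive() and model_poll() are
hypothetical names, and only the budget and rx_used fields and the
napi_rx_has_budget() check are taken from the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the proposed fields; not the kernel's struct. */
struct napi_model {
	unsigned int budget;	/* packets allowed in this poll */
	unsigned int rx_used;	/* packets consumed so far */
};

static bool napi_rx_has_budget(const struct napi_model *napi)
{
	return napi->budget > napi->rx_used;
}

/* Models the receive helper doing the accounting for the driver. */
static void napi_model_receive(struct napi_model *napi)
{
	assert(napi_rx_has_budget(napi));
	napi->rx_used++;
}

/* A driver poll loop with no local counter and no budget argument.
 * 'pending' is the number of frames waiting in an imaginary RX ring.
 * Returns how many frames are left over for the next poll. */
static unsigned int model_poll(struct napi_model *napi, unsigned int pending)
{
	while (pending > 0 && napi_rx_has_budget(napi)) {
		napi_model_receive(napi);
		pending--;
	}
	return pending;
}
```

The driver side shrinks to a loop condition; all counting happens in
the receive helper, so there is no per-driver work_done variable to get
wrong.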

Oh, and I'm bringing it up here, because CONFIG_RT can throw
in "need_resched()" into the napi_rx_has_budget(), obviously.

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-17 16:37         ` Toke Høiland-Jørgensen
  2024-01-18  2:04           ` Jakub Kicinski
@ 2024-01-18  7:35           ` Sebastian Andrzej Siewior
  2024-01-18 11:58             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18  7:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?

No but I probably could set it up.

> I chatted with Jesper about this, and he had an idea not too far from
> this: split up the XDP and regular stack processing in two stages, each
> with their individual batching. So whereas right now we're doing
> something like:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       run_netstack(pkt)  // this is the expensive bit
>   bh_enable()
> 
> We could instead do:
> 
> run_napi()
>   bh_disable()
>   for pkt in budget:
>     act = run_xdp(pkt)
>     if (act == XDP_PASS)
>       add_to_list(pkt, to_stack_list)
>   bh_enable()
>   // sched point
>   bh_disable()
>   for pkt in to_stack_list:
>     run_netstack(pkt)
>   bh_enable()
> 
> 
> This would limit the batching that blocks everything to only the XDP
> processing itself, which should limit the maximum time spent in the
> blocking state significantly compared to what we have today. The caveat
> being that rearranging things like this is potentially a pretty major
> refactoring task that needs to touch all the drivers (even if some of
> the logic can be moved into the core code in the process). So not really
> sure if this approach is feasible, TBH.

This does not work because bh_disable() does not disable scheduling.
Scheduling may happen. bh_disable() acquires a lock which is currently
the only synchronisation point between, say, two network drivers doing
NAPI. And this is what I want to get rid of.
Regarding the expensive bit, as in XDP_PASS: this doesn't need locking as
per the proposal, just the REDIRECT piece.

> > Daniel said netkit doesn't need this locking because it does not
> > support this redirect, and it made me think. Would it work to make
> > the redirect structures part of the bpf_prog structure instead of
> > per-CPU? My understanding is that an eBPF program's data structures are
> > part of it and contain locking that allows one eBPF program to preempt
> > another one.
> > Having the redirect structures be part of the program would make the
> > locking obsolete. Am I missing anything?
> 
> This won't work, unfortunately: the same XDP program can be attached to
> multiple interfaces simultaneously, and for hardware with multiple
> receive queues (which is most of the hardware that supports XDP), it can
> even run simultaneously on multiple CPUs on the same interface. This is
> the reason why this is all being kept in per-CPU variables today.

So I started hacking on this and noticed yesterday that you can run
multiple BPF programs. This is how I learned that it won't work.
My plan B is now to move it into task_struct.

> -Toke

Sebastian

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  2:04           ` Jakub Kicinski
@ 2024-01-18  8:27             ` Sebastian Andrzej Siewior
  2024-01-18 16:38               ` Jakub Kicinski
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18  8:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> Oh, and I'm bringing it up here, because CONFIG_RT can throw
> in "need_resched()" into the napi_rx_has_budget(), obviously.

need_resched() does not work on PREEMPT_RT the way you think. This
context (the NAPI poll callback) is preemptible and (by default) runs at
SCHED_FIFO 50 (within a threaded IRQ), so a task with higher priority
can force a context switch at any time.
If threadA gets preempted while owning a lock that threadB, with higher
priority, wants, then threadA will get back on the CPU, inherit the
priority of threadB and continue to run until it releases the lock.

If this is the per-CPU BH lock (which I want to remove) then it will
continue until all softirqs complete.
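
The priority-inheritance behaviour described above can be modelled with
a few lines of userspace C. This is a deliberately small sketch, not the
kernel's rtmutex code: it handles one lock and one waiter, uses the
SCHED_FIFO convention that a higher number means higher priority, and
all type and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Higher number means higher priority, as with SCHED_FIFO priorities. */
struct task_model {
	int normal_prio;	/* priority the task was given */
	int prio;		/* effective priority, possibly boosted */
};

struct pi_lock_model {
	struct task_model *owner;	/* NULL when unlocked */
};

/* Returns 1 if the lock was taken, 0 if the caller has to wait. While
 * a higher-priority task waits, the owner runs at the waiter's
 * priority. */
static int pi_lock(struct pi_lock_model *lock, struct task_model *t)
{
	if (lock->owner == NULL) {
		lock->owner = t;
		return 1;
	}
	if (t->prio > lock->owner->prio)
		lock->owner->prio = t->prio;	/* boost the owner */
	return 0;
}

/* Dropping the lock ends the boost. */
static void pi_unlock(struct pi_lock_model *lock)
{
	assert(lock->owner != NULL);
	lock->owner->prio = lock->owner->normal_prio;
	lock->owner = NULL;
}
```

With threadA at priority 50 holding the lock and threadB at 80 trying to
take it, threadA runs at 80 until it unlocks and then drops back to 50.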

Sebastian

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  2:04           ` Jakub Kicinski
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
@ 2024-01-18 11:51             ` Toke Høiland-Jørgensen
  2024-01-18 16:37               ` Jakub Kicinski
  1 sibling, 1 reply; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-18 11:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>> 
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>> 
>> > I expected the lock operation (under RT) to always succeed and not
>> > cause any delay because it should not be contended.  
>> 
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>> 
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning over in my head is
> how we could get away from having the driver handle details of NAPI
> budgeting. It's a source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> ...
> 	// poll state
> 	unsigned int budget;
> 	unsigned int rx_used;
> ...
> }
>
> static inline bool napi_rx_has_budget(napi)
> {
> 	return napi->budget > napi->rx_used;
> }
>
> poll(napi) // no budget
> {
> 	while (napi_rx_has_budget(napi)) {
> 		napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> 		// maybe add explicit napi_rx_count() if
> 		// driver did something funny with the frame.
> 	}
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
> so that people stop completing XDP in hard irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb(), instead of, presumably
> inexplicably to a newcomer, passing in budget.
> And napi_complete_done() can lose the work_done argument, too.

I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the Intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll
API?
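
For illustration, an even split of the budget over the RX rings could
look like the sketch below. It is an assumption about what "dividing the
budget" means here, written as a standalone helper with a made-up name,
and is not copied from any particular driver.

```c
#include <assert.h>

/* Even split of one NAPI budget over several RX rings, with every ring
 * getting at least one packet. Illustrative only. */
static unsigned int per_ring_budget(unsigned int budget, unsigned int nr_rings)
{
	unsigned int share;

	assert(nr_rings > 0);
	if (nr_rings == 1)
		return budget;
	share = budget / nr_rings;
	return share ? share : 1;
}
```

With a budget of 64 and three rings each ring gets 21, so one packet of
budget goes unused; with more rings than budget the rings together can
exceed the budget. A shared counter would not have either effect.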

-Toke


* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  7:35           ` Sebastian Andrzej Siewior
@ 2024-01-18 11:58             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-18 11:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, LKML, Network Development, David S. Miller,
	Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
	Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
	Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
	Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
	Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
	KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> No but I probably could set it up.

That would be great! Feel free to ping me if you need any pointers to
how we usually do the perf measurements :)

>> I chatted with Jesper about this, and he had an idea not too far from
>> this: split up the XDP and regular stack processing in two stages, each
>> with their individual batching. So whereas right now we're doing
>> something like:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       run_netstack(pkt)  // this is the expensive bit
>>   bh_enable()
>> 
>> We could instead do:
>> 
>> run_napi()
>>   bh_disable()
>>   for pkt in budget:
>>     act = run_xdp(pkt)
>>     if (act == XDP_PASS)
>>       add_to_list(pkt, to_stack_list)
>>   bh_enable()
>>   // sched point
>>   bh_disable()
>>   for pkt in to_stack_list:
>>     run_netstack(pkt)
>>   bh_enable()
>> 
>> 
>> This would limit the batching that blocks everything to only the XDP
>> processing itself, which should limit the maximum time spent in the
>> blocking state significantly compared to what we have today. The caveat
>> being that rearranging things like this is potentially a pretty major
>> refactoring task that needs to touch all the drivers (even if some of
>> the logic can be moved into the core code in the process). So not really
>> sure if this approach is feasible, TBH.
>
> This does not work because bh_disable() does not disable scheduling.
> Scheduling may happen. bh_disable() acquires a lock which is currently
> the only synchronisation point between, say, two network drivers doing
> NAPI. And this is what I want to get rid of.
> Regarding the expensive bit, as in XDP_PASS: this doesn't need locking as
> per the proposal, just the REDIRECT piece.

Right, well s/bh_disable()/lock()/; my main point was splitting up the
processing so that the XDP processing itself and the stack activation on
XDP_PASS are not interleaved. This will make it possible to hold the lock
around the whole XDP batch, not just individual packets, and so retain
the performance we gain from amortising expensive operations over
multiple packets.

-Toke


* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 11:51             ` Toke Høiland-Jørgensen
@ 2024-01-18 16:37               ` Jakub Kicinski
  2024-01-20 14:41                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2024-01-18 16:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
> I do agree that conceptually it makes a lot of sense to encapsulate the
> budget like this so drivers don't have to do all this state tracking
> themselves. It does appear that drivers are doing different things with
> the budget as it is today, though. For instance, the Intel drivers seem
> to divide the budget over all the enabled RX rings(?); so I'm wondering
> if it'll be possible to unify drivers around a more opaque NAPI poll API?

We can come up with APIs which would cater to multi-queue cases.
Bigger question is what is the sensible polling strategy for those,
just dividing the budget seems, hm, crude.

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18  8:27             ` Sebastian Andrzej Siewior
@ 2024-01-18 16:38               ` Jakub Kicinski
  2024-01-18 16:50                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2024-01-18 16:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On Thu, 18 Jan 2024 09:27:54 +0100 Sebastian Andrzej Siewior wrote:
> On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> > Oh, and I'm bringing it up here, because CONFIG_RT can throw
> > in "need_resched()" into the napi_rx_has_budget(), obviously.  
> 
> need_resched() does not work on PREEMPT_RT the way you think. This
> context (the NAPI poll callback) is preemptible and (by default) runs at
> SCHED_FIFO 50 (within a threaded IRQ), so a task with higher priority
> can force a context switch at any time.
> If threadA gets preempted while owning a lock that threadB, with higher
> priority, wants, then threadA will get back on the CPU, inherit the
> priority of threadB and continue to run until it releases the lock.
> 
> If this is the per-CPU BH lock (which I want to remove) then it will
> continue until all softirqs complete.

So there's no way for a process to know on RT that someone with higher
prio is waiting for it to release its locks? :(

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 16:38               ` Jakub Kicinski
@ 2024-01-18 16:50                 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-18 16:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

On 2024-01-18 08:38:12 [-0800], Jakub Kicinski wrote:
> > If this is the per-CPU BH lock (which I want to remove) then it will
> > continue until all softirqs complete.
> 
> So there's no way for a process to know on RT that someone with higher
> prio is waiting for it to release its locks? :(

You could add a function to check if your current priority is inherited
from someone else and if so start dropping the locks you think are
responsible for it.
I made a PoC that appears to work for timer_list timers, which are one
of the softirqs. This made me realise that I need it in more spots and I
am doing it for the wrong reasons…

Sebastian

* Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
  2024-01-18 16:37               ` Jakub Kicinski
@ 2024-01-20 14:41                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-01-20 14:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sebastian Andrzej Siewior, Alexei Starovoitov, LKML,
	Network Development, David S. Miller, Boqun Feng, Daniel Borkmann,
	Eric Dumazet, Frederic Weisbecker, Ingo Molnar, Paolo Abeni,
	Peter Zijlstra, Thomas Gleixner, Waiman Long, Will Deacon,
	Alexei Starovoitov, Andrii Nakryiko, Cong Wang, Hao Luo,
	Jamal Hadi Salim, Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko,
	John Fastabend, KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
	Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song,
	bpf

Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
>> I do agree that conceptually it makes a lot of sense to encapsulate the
>> budget like this so drivers don't have to do all this state tracking
>> themselves. It does appear that drivers are doing different things with
>> the budget as it is today, though. For instance, the intel drivers seem
>> to divide the budget over all the enabled RX rings(?); so I'm wondering
>> if it'll be possible to unify drivers around a more opaque NAPI poll API?
>
> We can come up with APIs which would cater to multi-queue cases.
> Bigger question is what is the sensible polling strategy for those,
> just dividing the budget seems, hm, crude.

Right, agreed, though I don't have a good answer for what else to do off
the top of my head...

-Toke
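
For reference, the "just divide the budget" strategy discussed above
can be sketched as below. This is a minimal illustration, not code from
any particular driver; per_ring_budget() is a hypothetical name, and
handing every ring at least 1 is an assumption made here so that no
ring is skipped when there are more rings than budget.

```c
#include <assert.h>

/*
 * Hypothetical helper: split one NAPI poll budget evenly over nr_rings
 * RX rings. Each ring gets at least 1 so none is starved when
 * budget < nr_rings.
 */
static int per_ring_budget(int budget, int nr_rings)
{
	int share;

	assert(budget > 0 && nr_rings > 0);
	share = budget / nr_rings;
	return share > 0 ? share : 1;
}
```

With a budget of 64 and 3 rings each ring gets 21, so one unit of budget
is never used. With a budget of 2 and 8 rings each ring gets 1, so the
shares add up to more than the budget. An idle ring also keeps its share
while a busy ring is capped, which is one way to read "crude" here.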



end of thread, last message: 2024-01-20 14:41 UTC

Thread overview: 34+ messages
     [not found] <20231215171020.687342-1-bigeasy@linutronix.de>
2023-12-15 17:07 ` [PATCH net-next 11/24] lwt: Don't disable migration prio invoking BPF Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states Sebastian Andrzej Siewior
2023-12-16  3:39   ` kernel test robot
2023-12-18  8:33   ` Paolo Abeni
2024-01-12 11:23     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 13/24] net: Use nested-BH locking for bpf_scratchpad Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 14/24] net: Add a lock which held during the redirect process Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect Sebastian Andrzej Siewior
2023-12-20  0:25   ` Alexei Starovoitov
2024-01-04 19:29     ` Toke Høiland-Jørgensen
2024-01-12 17:41       ` Sebastian Andrzej Siewior
2024-01-17 16:37         ` Toke Høiland-Jørgensen
2024-01-18  2:04           ` Jakub Kicinski
2024-01-18  8:27             ` Sebastian Andrzej Siewior
2024-01-18 16:38               ` Jakub Kicinski
2024-01-18 16:50                 ` Sebastian Andrzej Siewior
2024-01-18 11:51             ` Toke Høiland-Jørgensen
2024-01-18 16:37               ` Jakub Kicinski
2024-01-20 14:41                 ` Toke Høiland-Jørgensen
2024-01-18  7:35           ` Sebastian Andrzej Siewior
2024-01-18 11:58             ` Toke Høiland-Jørgensen
2023-12-15 17:07 ` [PATCH net-next 16/24] net: netkit, veth, tun, virt*: " Sebastian Andrzej Siewior
2023-12-18  8:52   ` Daniel Borkmann
2024-01-12 15:37     ` Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 18/24] net: Freescale: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 19/24] net: fungible, gve, mtk, microchip, mana: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 20/24] net: intel: " Sebastian Andrzej Siewior
2023-12-16  4:53   ` kernel test robot
2023-12-19  0:01     ` Nathan Chancellor
2023-12-19 16:55       ` Nick Desaulniers
2023-12-15 17:07 ` [PATCH net-next 21/24] net: marvell: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 22/24] net: mellanox, nfp, sfc: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 23/24] net: qlogic, socionext, stmmac, cpsw: " Sebastian Andrzej Siewior
2023-12-15 17:07 ` [PATCH net-next 24/24] net: bpf: Add lockdep assert for the redirect process Sebastian Andrzej Siewior
