* [PATCH net] net: lwtunnel: disable preemption when required
@ 2025-04-03  8:39 Justin Iurman
  2025-04-03 16:24 ` Stanislav Fomichev
  0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-03  8:39 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, kuniyu, justin.iurman,
	Alexei Starovoitov, bpf, Stanislav Fomichev

In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
preemptible scope for PREEMPT kernels. This patch disables preemption
before calling dev_xmit_recursion(). Preemption is re-enabled only at
the end, since we must ensure the same CPU is used for both
dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
recursion levels in some cases) in order to maintain valid per-cpu
counters.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/
Fixes: 986ffb3a57c5 ("net: lwtunnel: fix recursion loops")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Stanislav Fomichev <stfomichev@gmail.com>
---
 net/core/lwtunnel.c | 47 ++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..a9ad068e5707 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -345,11 +347,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		ret = -EINVAL;
 		goto drop;
 	}
-	lwtstate = dst->lwtstate;
 
+	lwtstate = dst->lwtstate;
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -364,11 +368,11 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_output);
@@ -380,6 +384,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -394,10 +400,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
 	}
 
 	lwtstate = dst->lwtstate;
-
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -412,11 +419,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_xmit);
@@ -428,6 +435,8 @@ int lwtunnel_input(struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -440,11 +449,13 @@ int lwtunnel_input(struct sk_buff *skb)
 		ret = -EINVAL;
 		goto drop;
 	}
-	lwtstate = dst->lwtstate;
 
+	lwtstate = dst->lwtstate;
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -459,11 +470,11 @@ int lwtunnel_input(struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_input);
-- 
2.34.1



* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-03  8:39 [PATCH net] net: lwtunnel: disable preemption when required Justin Iurman
@ 2025-04-03 16:24 ` Stanislav Fomichev
  2025-04-03 19:08   ` Justin Iurman
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2025-04-03 16:24 UTC (permalink / raw)
  To: Justin Iurman
  Cc: netdev, davem, edumazet, kuba, pabeni, horms, kuniyu,
	Alexei Starovoitov, bpf

On 04/03, Justin Iurman wrote:
> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> preemptible scope for PREEMPT kernels. This patch disables preemption
> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> the end, since we must ensure the same CPU is used for both
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> recursion levels in some cases) in order to maintain valid per-cpu
> counters.

Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
track the recursion. Any reason not to do it in the generic PREEMPT case?
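
For reference, the two variants look roughly like this today (paraphrasing the
dev_xmit_recursion*() helpers in include/linux/netdevice.h; simplified sketch,
exact details may differ):

  #ifdef CONFIG_PREEMPT_RT
  /* RT: the recursion depth lives in the task, so it follows the task across
   * preemption/migration.
   */
  static inline bool dev_xmit_recursion(void)
  {
  	return unlikely(current->net_xmit.recursion > XMIT_RECURSION_LIMIT);
  }

  static inline void dev_xmit_recursion_inc(void)
  {
  	current->net_xmit.recursion++;
  }

  static inline void dev_xmit_recursion_dec(void)
  {
  	current->net_xmit.recursion--;
  }
  #else
  /* !RT: the recursion depth is a per-CPU counter in softnet_data, so the
   * caller must stay on the same CPU between the inc and the dec.
   */
  static inline bool dev_xmit_recursion(void)
  {
  	return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
  			XMIT_RECURSION_LIMIT);
  }

  static inline void dev_xmit_recursion_inc(void)
  {
  	__this_cpu_inc(softnet_data.xmit.recursion);
  }

  static inline void dev_xmit_recursion_dec(void)
  {
  	__this_cpu_dec(softnet_data.xmit.recursion);
  }
  #endif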


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-03 16:24 ` Stanislav Fomichev
@ 2025-04-03 19:08   ` Justin Iurman
  2025-04-03 20:35     ` Alexei Starovoitov
  2025-04-03 23:38     ` Jakub Kicinski
  0 siblings, 2 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-03 19:08 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, davem, edumazet, kuba, pabeni, horms, kuniyu,
	Alexei Starovoitov, bpf

On 4/3/25 18:24, Stanislav Fomichev wrote:
> On 04/03, Justin Iurman wrote:
>> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
>> preemptible scope for PREEMPT kernels. This patch disables preemption
>> before calling dev_xmit_recursion(). Preemption is re-enabled only at
>> the end, since we must ensure the same CPU is used for both
>> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
>> recursion levels in some cases) in order to maintain valid per-cpu
>> counters.
> 
> Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> track the recursion. Any reason not to do it in the generic PREEMPT case?

I'd say PREEMPT_RT is a different beast. IMO, softirqs can be 
preempted/migrated in RT kernels, which is not true for non-RT kernels. 
Maybe RT kernels could use __this_cpu_* instead of "current" though, but 
it would be less trivial. For example, see commit ecefbc09e8ee ("net: 
softnet_data: Make xmit per task.") on why it makes sense to use 
"current" in RT kernels. I guess the opposite as you suggest (i.e., 
non-RT kernels using "current") would be technically possible, but there 
must be a reason it is defined the way it is... so probably incorrect or 
inefficient?


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-03 19:08   ` Justin Iurman
@ 2025-04-03 20:35     ` Alexei Starovoitov
  2025-04-04 14:19       ` Sebastian Sewior
  2025-04-03 23:38     ` Jakub Kicinski
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-03 20:35 UTC (permalink / raw)
  To: Justin Iurman, Sebastian Sewior
  Cc: Stanislav Fomichev, Network Development, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, bpf

On Thu, Apr 3, 2025 at 12:08 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.
> >
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?
>
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be
> preempted/migrated in RT kernels, which is not true for non-RT kernels.
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but
> it would be less trivial. For example, see commit ecefbc09e8ee ("net:
> softnet_data: Make xmit per task.") on why it makes sense to use
> "current" in RT kernels. I guess the opposite as you suggest (i.e.,
> non-RT kernels using "current") would be technically possible, but there
> must be a reason it is defined the way it is... so probably incorrect or
> inefficient?

Stating the obvious...
Sebastian did a lot of work removing preempt_disable from the networking
stack.
We're certainly not adding them back.
This patch is no go.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-03 19:08   ` Justin Iurman
  2025-04-03 20:35     ` Alexei Starovoitov
@ 2025-04-03 23:38     ` Jakub Kicinski
  1 sibling, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-04-03 23:38 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Stanislav Fomichev, netdev, davem, edumazet, pabeni, horms,
	kuniyu, Alexei Starovoitov, bpf

On Thu, 3 Apr 2025 21:08:12 +0200 Justin Iurman wrote:
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:  
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.  
> > 
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?  
> 
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be 
> preempted/migrated in RT kernels, which is not true for non-RT kernels. 
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but 
> it would be less trivial. For example, see commit ecefbc09e8ee ("net: 
> softnet_data: Make xmit per task.") on why it makes sense to use 
> "current" in RT kernels. I guess the opposite as you suggest (i.e., 
> non-RT kernels using "current") would be technically possible, but there 
> must be a reason it is defined the way it is... so probably incorrect or 
> inefficient?

I suspect it's to avoid the performance overhead.
IIUC you would be better off using local_bh_disable() here.
It doesn't disable preemption on RT.
I don't believe a "disable preemption if !RT" primitive exists.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-03 20:35     ` Alexei Starovoitov
@ 2025-04-04 14:19       ` Sebastian Sewior
  2025-04-06  8:59         ` Justin Iurman
  0 siblings, 1 reply; 18+ messages in thread
From: Sebastian Sewior @ 2025-04-04 14:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Justin Iurman, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf

Alexei, thank you for the Cc.

On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> Stating the obvious...
> Sebastian did a lot of work removing preempt_disable from the networking
> stack.
> We're certainly not adding them back.
> This patch is no go.

While looking through the code, it looks as if lwtunnel_xmit() lacks a
local_bh_disable().

Sebastian


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-04 14:19       ` Sebastian Sewior
@ 2025-04-06  8:59         ` Justin Iurman
  2025-04-07 17:54           ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-06  8:59 UTC (permalink / raw)
  To: Sebastian Sewior, Alexei Starovoitov
  Cc: Stanislav Fomichev, Network Development, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, bpf

On 4/4/25 16:19, Sebastian Sewior wrote:
> Alexei, thank you for the Cc.
> 
> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>> Stating the obvious...
>> Sebastian did a lot of work removing preempt_disable from the networking
>> stack.
>> We're certainly not adding them back.
>> This patch is no go.
> 
> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> local_bh_disable().

Thanks Sebastian for the confirmation, as the initial idea was to use 
local_bh_disable() as well. Then I thought preempt_disable() would be 
enough in this context, but I didn't realize you made efforts to remove 
it from the networking stack.

@Alexei, just to clarify: would you ACK this patch if we do 
s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?

> Sebastian


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-06  8:59         ` Justin Iurman
@ 2025-04-07 17:54           ` Alexei Starovoitov
  2025-04-11 18:34             ` Justin Iurman
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-07 17:54 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf

On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/4/25 16:19, Sebastian Sewior wrote:
> > Alexei, thank you for the Cc.
> >
> > On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >> Stating the obvious...
> >> Sebastian did a lot of work removing preempt_disable from the networking
> >> stack.
> >> We're certainly not adding them back.
> >> This patch is no go.
> >
> > While looking through the code, it looks as if lwtunnel_xmit() lacks a
> > local_bh_disable().
>
> Thanks Sebastian for the confirmation, as the initial idea was to use
> local_bh_disable() as well. Then I thought preempt_disable() would be
> enough in this context, but I didn't realize you made efforts to remove
> it from the networking stack.
>
> @Alexei, just to clarify: would you ACK this patch if we do
> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?

You need to think it through and not sprinkle local_bh_disable in
every lwt related function.
Like lwtunnel_input should be running with bh disabled already.

I don't remember the exact conditions where bh is disabled in xmit path.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-07 17:54           ` Alexei Starovoitov
@ 2025-04-11 18:34             ` Justin Iurman
  2025-04-14 23:13               ` Alexei Starovoitov
  2025-04-15  0:54               ` Andrea Mayer
  0 siblings, 2 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-11 18:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer

On 4/7/25 19:54, Alexei Starovoitov wrote:
> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>> Alexei, thank you for the Cc.
>>>
>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>> Stating the obvious...
>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>> stack.
>>>> We're certainly not adding them back.
>>>> This patch is no go.
>>>
>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>> local_bh_disable().
>>
>> Thanks Sebastian for the confirmation, as the initial idea was to use
>> local_bh_disable() as well. Then I thought preempt_disable() would be
>> enough in this context, but I didn't realize you made efforts to remove
>> it from the networking stack.
>>
>> @Alexei, just to clarify: would you ACK this patch if we do
>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> 
> You need to think it through and not sprinkle local_bh_disable in
> every lwt related function.
> Like lwtunnel_input should be running with bh disabled already.

Having nested calls to local_bh_{disable|enable}() is fine (i.e., 
disabling BHs when they're already disabled), but I guess it's cleaner 
to avoid it here as you suggest. And since lwtunnel_input() is indeed 
(always) running with BHs disabled, no changes needed. Thanks for the 
reminder.

> I don't remember the exact conditions where bh is disabled in xmit path.

Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can 
definitely run with or without BHs disabled. So, what I propose is the 
following logic (applied to lwtunnel_xmit() too): if BHs disabled then 
NOP else local_bh_disable(). Thoughts on this new version? (sorry, my 
mailer messes it up, but you got the idea):

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d44d341683c5 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	const struct lwtunnel_encap_ops *ops;
  	struct lwtunnel_state *lwtstate;
  	struct dst_entry *dst;
+	bool in_softirq;
  	int ret;

+	in_softirq = in_softirq();
+	if (!in_softirq)
+		local_bh_disable();
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  		ret = -EINVAL;
  		goto drop;
  	}
-	lwtstate = dst->lwtstate;

+	lwtstate = dst->lwtstate;
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}

  	ret = -EOPNOTSUPP;
  	rcu_read_lock();
@@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	if (ret == -EOPNOTSUPP)
  		goto drop;

-	return ret;
-
+	goto out;
  drop:
  	kfree_skb(skb);
+out:
+	if (!in_softirq)
+		local_bh_enable();

  	return ret;
  }
@@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	const struct lwtunnel_encap_ops *ops;
  	struct lwtunnel_state *lwtstate;
  	struct dst_entry *dst;
+	bool in_softirq;
  	int ret;

+	in_softirq = in_softirq();
+	if (!in_softirq)
+		local_bh_disable();
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	}

  	lwtstate = dst->lwtstate;
-
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}

  	ret = -EOPNOTSUPP;
  	rcu_read_lock();
@@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	if (ret == -EOPNOTSUPP)
  		goto drop;

-	return ret;
-
+	goto out;
  drop:
  	kfree_skb(skb);
+out:
+	if (!in_softirq)
+		local_bh_enable();

  	return ret;
  }
@@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
  		ret = -EINVAL;
  		goto drop;
  	}
-	lwtstate = dst->lwtstate;

+	lwtstate = dst->lwtstate;
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
  		return 0;
@@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
  		goto drop;

  	return ret;
-
  drop:
  	kfree_skb(skb);
-
  	return ret;
  }
  EXPORT_SYMBOL_GPL(lwtunnel_input);


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-11 18:34             ` Justin Iurman
@ 2025-04-14 23:13               ` Alexei Starovoitov
  2025-04-15  9:25                 ` Justin Iurman
  2025-04-15 11:56                 ` Justin Iurman
  2025-04-15  0:54               ` Andrea Mayer
  1 sibling, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-14 23:13 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer

On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Sewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> >
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
>
> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
> disabling BHs when they're already disabled), but I guess it's cleaner
> to avoid it here as you suggest. And since lwtunnel_input() is indeed
> (always) running with BHs disabled, no changes needed. Thanks for the
> reminder.
>
> > I don't remember the exact conditions where bh is disabled in xmit path.
>
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
> definitely run with or without BHs disabled. So, what I propose is the
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
> mailer messes it up, but you got the idea):
>
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
> *sk, struct sk_buff *skb)
>         const struct lwtunnel_encap_ops *ops;
>         struct lwtunnel_state *lwtstate;
>         struct dst_entry *dst;
> +       bool in_softirq;
>         int ret;
>
> +       in_softirq = in_softirq();
> +       if (!in_softirq)
> +               local_bh_disable();
> +

This looks like a hack to me.

Instead analyze the typical xmit path. If bh is not disabled
then add local_bh_disable(). It's fine if it happens to be nested
in some cases.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-11 18:34             ` Justin Iurman
  2025-04-14 23:13               ` Alexei Starovoitov
@ 2025-04-15  0:54               ` Andrea Mayer
  2025-04-15  9:10                 ` Justin Iurman
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Mayer @ 2025-04-15  0:54 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Alexei Starovoitov, Sebastian Sewior, Stanislav Fomichev,
	Network Development, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
	Stefano Salsano, Paolo Lungaroni, Andrea Mayer

On Fri, 11 Apr 2025 20:34:54 +0200
Justin Iurman <justin.iurman@uliege.be> wrote:

> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Sewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> > 
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
> 
> Having nested calls to local_bh_{disable|enable}() is fine (i.e., 
> disabling BHs when they're already disabled), but I guess it's cleaner 
> to avoid it here as you suggest. And since lwtunnel_input() is indeed 
> (always) running with BHs disabled, no changes needed. Thanks for the 
> reminder.
> 
> > I don't remember the exact conditions where bh is disabled in xmit path.
> 
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can

Justin, thanks for the Cc.

I have been looking into the behavior of the lwtunnel_xmit() function in both
task and softirq contexts. To facilitate this investigation, I have written a
simple eBPF program that only prints messages to the trace pipe. This program
is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
with a destination address (DA) pointing to an external node, referred to as
x.x.x.x, within my testbed network.
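
For illustration, a minimal setup along these lines could look as follows
(hypothetical program, object name and route arguments, not the exact ones
from my testbed):

  /* lwt_trace.bpf.c: trivial LWT BPF XMIT program, log and pass the packet. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("lwt_xmit")
  int trace_xmit(struct __sk_buff *skb)
  {
  	bpf_printk("lwtunnel_xmit() hook hit, len=%u\n", skb->len);
  	return BPF_OK;
  }

  char _license[] SEC("license") = "GPL";

  $ ip route replace x.x.x.x/32 encap bpf xmit obj lwt_trace.bpf.o \
        section lwt_xmit via <gateway> dev <iface>
  $ cat /sys/kernel/debug/tracing/trace_pipe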

To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
is then processed through the forwarding path, eventually leading to the
ip_output() function. Processing ends with a call to ip_finish_output2(), which
then calls lwtunnel_xmit().

Below is the stack trace from my testing machine, highlighting the key
functions involved in this processing path:

 ============================================
  <IRQ>                                         
  ...
  lwtunnel_xmit+0x18/0x3f0                      
  ip_finish_output2+0x45a/0xcc0                 
  ip_output+0xe2/0x380                          
  NF_HOOK.constprop.0+0x7e/0x2f0                
  ip_rcv+0x4bf/0x4d0                            
  __netif_receive_skb_one_core+0x11c/0x130      
  process_backlog+0x277/0x980
  __napi_poll.constprop.0+0x58/0x260
  net_rx_action+0x396/0x6e0
  handle_softirqs+0x116/0x640
  do_softirq+0xa9/0xe0
  </IRQ>
 ============================================

Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
x.x.x.x on the same testing node. Below is the corresponding stack trace:

 ============================================
  <TASK>                                                               
  ...
  lwtunnel_xmit+0x18/0x3f0                                             
  ip_finish_output2+0x45a/0xcc0                                        
  ip_output+0xe2/0x380                                                 
  ip_push_pending_frames+0x17a/0x200                                   
  raw_sendmsg+0x9fa/0x1060                                             
  __sys_sendto+0x294/0x2e0                                             
  __x64_sys_sendto+0x6d/0x80                                           
  do_syscall_64+0x64/0x140                                             
  entry_SYSCALL_64_after_hwframe+0x76/0x7e                             
  </TASK>    
 ============================================

So also for the lwtunnel_xmit(), we need to make sure that the functions
dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
are protected, i.e. inside a local_bh_{disable/enable} block.

> definitely run with or without BHs disabled. So, what I propose is the 
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then 
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my 
> mailer messes it up, but you got the idea):
> 
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock 
> *sk, struct sk_buff *skb)
>   	const struct lwtunnel_encap_ops *ops;
>   	struct lwtunnel_state *lwtstate;
>   	struct dst_entry *dst;
> +	bool in_softirq;
>   	int ret;
> 
> +	in_softirq = in_softirq();
> +	if (!in_softirq)
> +		local_bh_disable();
> +

In a non-real-time environment (i.e., when !PREEMPT_RT), in_softirq()
expands to softirq_count(), which in turn uses the preempt_count()
function. On my x86 architecture, preempt_count() accesses the per-CPU
__preempt_count variable.

If in_softirq() returns 0, it indicates that no softirqs are currently being
processed on the local CPU and BH are not disabled. Therefore, following the
logic above, we disable bottom halves (BH) on that particular CPU.

However, there is in my opinion an issue that can occur: between the check on
in_softirq() and the call to local_bh_disable(), the task may be scheduled on
another CPU. As a result, the check on in_softirq() becomes ineffective because
we may end up disabling BH on a CPU that is not the one we just checked (with
if (in_softirq()) { ... }).


>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock 
> *sk, struct sk_buff *skb)
>   		ret = -EINVAL;
>   		goto drop;
>   	}
> -	lwtstate = dst->lwtstate;
> 
> +	lwtstate = dst->lwtstate;
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> -	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> +		ret = 0;
> +		goto out;
> +	}
> 
>   	ret = -EOPNOTSUPP;
>   	rcu_read_lock();
> @@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock 
> *sk, struct sk_buff *skb)
>   	if (ret == -EOPNOTSUPP)
>   		goto drop;
> 
> -	return ret;
> -
> +	goto out;
>   drop:
>   	kfree_skb(skb);
> +out:
> +	if (!in_softirq)
> +		local_bh_enable();
> 
>   	return ret;
>   }
> @@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	const struct lwtunnel_encap_ops *ops;
>   	struct lwtunnel_state *lwtstate;
>   	struct dst_entry *dst;
> +	bool in_softirq;
>   	int ret;
> 
> +	in_softirq = in_softirq();
> +	if (!in_softirq)
> +		local_bh_disable();
> +
>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	}
> 
>   	lwtstate = dst->lwtstate;
> -
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> -	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> +		ret = 0;
> +		goto out;
> +	}
> 
>   	ret = -EOPNOTSUPP;
>   	rcu_read_lock();
> @@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	if (ret == -EOPNOTSUPP)
>   		goto drop;
> 
> -	return ret;
> -
> +	goto out;
>   drop:
>   	kfree_skb(skb);
> +out:
> +	if (!in_softirq)
> +		local_bh_enable();
> 
>   	return ret;
>   }
> @@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   	struct dst_entry *dst;
>   	int ret;
> 
> +	WARN_ON_ONCE(!in_softirq());
> +

What about DEBUG_NET_WARN_ON_ONCE instead?

Ciao,
Andrea

>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   		ret = -EINVAL;
>   		goto drop;
>   	}
> -	lwtstate = dst->lwtstate;
> 
> +	lwtstate = dst->lwtstate;
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
>   	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
>   		return 0;
> @@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   		goto drop;
> 
>   	return ret;
> -
>   drop:
>   	kfree_skb(skb);
> -
>   	return ret;
>   }
>   EXPORT_SYMBOL_GPL(lwtunnel_input);


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-15  0:54               ` Andrea Mayer
@ 2025-04-15  9:10                 ` Justin Iurman
  2025-04-15 14:38                   ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-15  9:10 UTC (permalink / raw)
  To: Andrea Mayer
  Cc: Alexei Starovoitov, Sebastian Sewior, Stanislav Fomichev,
	Network Development, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
	Stefano Salsano, Paolo Lungaroni

On 4/15/25 02:54, Andrea Mayer wrote:
> I have been looking into the behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
> 
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
> 
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
> 
>   ============================================
>    <IRQ>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    NF_HOOK.constprop.0+0x7e/0x2f0
>    ip_rcv+0x4bf/0x4d0
>    __netif_receive_skb_one_core+0x11c/0x130
>    process_backlog+0x277/0x980
>    __napi_poll.constprop.0+0x58/0x260
>    net_rx_action+0x396/0x6e0
>    handle_softirqs+0x116/0x640
>    do_softirq+0xa9/0xe0
>    </IRQ>
>   ============================================
> 
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
> 
>   ============================================
>    <TASK>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    ip_push_pending_frames+0x17a/0x200
>    raw_sendmsg+0x9fa/0x1060
>    __sys_sendto+0x294/0x2e0
>    __x64_sys_sendto+0x6d/0x80
>    do_syscall_64+0x64/0x140
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    </TASK>
>   ============================================
> 
> So also for the lwtunnel_xmit(), we need to make sure that the functions
> dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
> are protected, i.e. inside a local_bh_{disable/enable} block.

That's correct, and I ended up with the same conclusion as yours on the 
possible paths for lwtunnel_xmit() depending on the context (task vs 
irq). Based on your description, we're using a similar approach with 
eBPF :-) Note that paths are similar for lwtunnel_output() (see below).

> In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
> in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
> function. On my x86 architecture, preempt_count() accesses the per-CPU
> __preempt_count variable.
> 
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BH are not disabled. Therefore, following the
> logic above, we disable bottom halves (BH) on that particular CPU.
> 
> However, there is my opinion an issue that can occur: between the check on
> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> another CPU. As a result, the check on in_softirq() becomes ineffective because
> we may end up disabling BH on a CPU that is not the one we just checked (with
> if (in_softirq()) { ... }).

Hmm, I think it's correct... good catch. I went for this solution to (i) 
avoid useless nested BHs disable calls; and (ii) avoid ending up with a 
spaghetti graph of possible paths with or without BHs disabled (i.e., 
with single entry points, namely lwtunnel_xmit() and lwtunnel_output()), 
which otherwise makes it hard to maintain the code IMO.

So, if we want to follow what Alexei suggests (see his last response), 
we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
These are the common functions which are closest in depth, and so for 
both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of 
disabling BHs even when it may not be required. Indeed, ip_local_out() 
and ip6_local_out() both call dst_output(), which usually does not resolve to 
lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call 
either).

The other solution is to always call local_bh_disable() in both 
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
they were already. Which was basically -v1 and received a NACK from Alexei.

At the moment, I'm not sure what's best.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-14 23:13               ` Alexei Starovoitov
@ 2025-04-15  9:25                 ` Justin Iurman
  2025-04-15 11:56                 ` Justin Iurman
  1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15  9:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer

On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
>> *sk, struct sk_buff *skb)
>>          const struct lwtunnel_encap_ops *ops;
>>          struct lwtunnel_state *lwtstate;
>>          struct dst_entry *dst;
>> +       bool in_softirq;
>>          int ret;
>>
>> +       in_softirq = in_softirq();
>> +       if (!in_softirq)
>> +               local_bh_disable();
>> +
> 
> This looks like a hack to me.
> 
> Instead analyze the typical xmit path. If bh is not disabled

This is already what I did, and it's exactly the reason why I ended up 
with the above proposal. It's not only about the xmit path but also the 
output path. Of course, having BHs disabled only where they need to be, 
without useless nested calls, would be nice, but in reality the solution 
is not perfect and makes it even more difficult to visualize the path(s) 
with or without BHs disabled IMO.

For both lwtunnel_xmit() and lwtunnel_output(), the common functions 
which are closest in depth and where BHs should be disabled are 
ip_local_out() and ip6_local_out(). And even when it's not required 
(which is the tradeoff). The other solution was -v1, which you NACK'ed. 
Please see my reply to Andrea for the whole story. To summarize, I'd say 
that it's either (a) what you suggest, i.e., non-required BH disable 
calls vs. (b) nested BH disable calls. With tradeoffs for each.

> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-14 23:13               ` Alexei Starovoitov
  2025-04-15  9:25                 ` Justin Iurman
@ 2025-04-15 11:56                 ` Justin Iurman
  1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 11:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer

On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
>> *sk, struct sk_buff *skb)
>>          const struct lwtunnel_encap_ops *ops;
>>          struct lwtunnel_state *lwtstate;
>>          struct dst_entry *dst;
>> +       bool in_softirq;
>>          int ret;
>>
>> +       in_softirq = in_softirq();
>> +       if (!in_softirq)
>> +               local_bh_disable();
>> +
> 
> This looks like a hack to me.
> 
> Instead analyze the typical xmit path. If bh is not disabled
> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.

FYI, and based on my previous response, the patch would look like this 
in that case (again, my mailer messes long lines up, sorry). I'll let 
others comment on which solution/tradeoff seems better.

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d0cb0f2f9efe 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -380,6 +382,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -428,6 +432,8 @@ int lwtunnel_input(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e18d7ec5062..89bda2f424bb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -124,10 +124,13 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
  	int err;

+	local_bh_disable();
+
  	err = __ip_local_out(net, sk, skb);
  	if (likely(err == 1))
  		err = dst_output(net, sk, skb);

+	local_bh_enable();
  	return err;
  }
  EXPORT_SYMBOL_GPL(ip_local_out);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 806d4b5dd1e6..bb40196edeb6 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -150,10 +150,13 @@ int ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
  	int err;

+	local_bh_disable();
+
  	err = __ip6_local_out(net, sk, skb);
  	if (likely(err == 1))
  		err = dst_output(net, sk, skb);

+	local_bh_enable();
  	return err;
  }
  EXPORT_SYMBOL_GPL(ip6_local_out);


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-15  9:10                 ` Justin Iurman
@ 2025-04-15 14:38                   ` Jakub Kicinski
  2025-04-15 15:12                     ` Justin Iurman
  2025-04-15 15:12                     ` Alexei Starovoitov
  0 siblings, 2 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-04-15 14:38 UTC (permalink / raw)
  To: Justin Iurman
  Cc: Andrea Mayer, Alexei Starovoitov, Sebastian Sewior,
	Stanislav Fomichev, Network Development, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
	Stefano Salsano, Paolo Lungaroni

On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > However, there is my opinion an issue that can occur: between the check on
> > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > we may end up disabling BH on a CPU that is not the one we just checked (with
> > if (in_softirq()) { ... }).  

The context is not affected by migration. The context is fully defined
by the execution stack.

> Hmm, I think it's correct... good catch. I went for this solution to (i) 
> avoid useless nested BHs disable calls; and (ii) avoid ending up with a 
> spaghetti graph of possible paths with or without BHs disabled (i.e., 
> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()), 
> which otherwise makes it hard to maintain the code IMO.
> 
> So, if we want to follow what Alexei suggests (see his last response), 
> we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
> These are the common functions which are closest in depth, and so for 
> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of 
> disabling BHs even when it may not be required. Indeed, ip_local_out() 
> and ip6_local_out() both call dst_output(), which one is usually not 
> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call 
> either).
> 
> The other solution is to always call local_bh_disable() in both 
> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
> they were already. Which was basically -v1 and received a NACK from Alexei.

I thought he nacked preempt_disable()


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-15 14:38                   ` Jakub Kicinski
@ 2025-04-15 15:12                     ` Justin Iurman
  2025-04-15 15:12                     ` Alexei Starovoitov
  1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 15:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrea Mayer, Alexei Starovoitov, Sebastian Sewior,
	Stanislav Fomichev, Network Development, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
	Stefano Salsano, Paolo Lungaroni

On 4/15/25 16:38, Jakub Kicinski wrote:
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>> However, there is my opinion an issue that can occur: between the check on
>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>> if (in_softirq()) { ... }).
> 
> The context is not affected by migration. The context is fully defined
> by the execution stack.
> 
>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>> which otherwise makes it hard to maintain the code IMO.
>>
>> So, if we want to follow what Alexei suggests (see his last response),
>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>> These are the common functions which are closest in depth, and so for
>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>> and ip6_local_out() both call dst_output(), which one is usually not
>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>> either).
>>
>> The other solution is to always call local_bh_disable() in both
>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>> they were already. Which was basically -v1 and received a NACK from Alexei.
> 
> I thought he nacked preempt_disable()

I think I wasn't clear enough, sorry. Alexei explicitly NACK'ed the 
initial patch (the one with preempt_disable()) -- you're right. Judging 
by his reply, I think he also (implicitly) NACK'ed the other solution I 
proposed (see [1]). Alexei can clarify if I'm mistaken. He seems to 
prefer the solution that disables BHs on specific paths (see [2] for the 
other proposal) instead of inside lwtunnel_xmit() and lwtunnel_output().

So, basically, we have a choice to make between [1] and [2], where [1] 
IMO is better for the following reasons: (i) no nested calls to disable 
BHs, (ii) disable BHs only when they're not already, and (iii) code 
clarity, i.e., not overly complicating the graph of paths w/ or w/o BHs 
disabled. While with [2], BHs will be disabled even when not required on 
the xmit/output path, and also resulting in an even more complex graph 
of paths regarding BHs.

   [1] https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m5a4e6a56206d9d110a5e4d664ab4ea09e7e9b33e
   [2] https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m3a88de38eceb0a53e2d173dc3675ecaa37e9d0b4


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-15 14:38                   ` Jakub Kicinski
  2025-04-15 15:12                     ` Justin Iurman
@ 2025-04-15 15:12                     ` Alexei Starovoitov
  2025-04-15 17:12                       ` Justin Iurman
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-15 15:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Justin Iurman, Andrea Mayer, Sebastian Sewior, Stanislav Fomichev,
	Network Development, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Stefano Salsano,
	Paolo Lungaroni

On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > > However, there is my opinion an issue that can occur: between the check on
> > > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > > we may end up disabling BH on a CPU that is not the one we just checked (with
> > > if (in_softirq()) { ... }).
>
> The context is not affected by migration. The context is fully defined
> by the execution stack.
>
> > Hmm, I think it's correct... good catch. I went for this solution to (i)
> > avoid useless nested BHs disable calls; and (ii) avoid ending up with a
> > spaghetti graph of possible paths with or without BHs disabled (i.e.,
> > with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
> > which otherwise makes it hard to maintain the code IMO.
> >
> > So, if we want to follow what Alexei suggests (see his last response),
> > we'd need to disable BHs in both ip_local_out() and ip6_local_out().
> > These are the common functions which are closest in depth, and so for
> > both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
> > disabling BHs even when it may not be required. Indeed, ip_local_out()
> > and ip6_local_out() both call dst_output(), which one is usually not
> > lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
> > either).
> >
> > The other solution is to always call local_bh_disable() in both
> > lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
> > they were already. Which was basically -v1 and received a NACK from Alexei.
>
> I thought he nacked preempt_disable()

+1.

imo unconditional local_bh_disable() in tx path is fine.
I didn't like the addition of local_bh_disable() in every lwt-related
function without doing the homework on whether it's needed there or not.
Like input path shouldn't need local_bh_disable


* Re: [PATCH net] net: lwtunnel: disable preemption when required
  2025-04-15 15:12                     ` Alexei Starovoitov
@ 2025-04-15 17:12                       ` Justin Iurman
  0 siblings, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 17:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Kicinski
  Cc: Andrea Mayer, Sebastian Sewior, Stanislav Fomichev,
	Network Development, David S. Miller, Eric Dumazet, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, bpf, Stefano Salsano,
	Paolo Lungaroni

On 4/15/25 17:12, Alexei Starovoitov wrote:
> On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>>> However, there is my opinion an issue that can occur: between the check on
>>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>>> if (in_softirq()) { ... }).
>>
>> The context is not affected by migration. The context is fully defined
>> by the execution stack.
>>
>>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>>> which otherwise makes it hard to maintain the code IMO.
>>>
>>> So, if we want to follow what Alexei suggests (see his last response),
>>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>>> These are the common functions which are closest in depth, and so for
>>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>>> and ip6_local_out() both call dst_output(), which one is usually not
>>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>>> either).
>>>
>>> The other solution is to always call local_bh_disable() in both
>>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>>> they were already. Which was basically -v1 and received a NACK from Alexei.
>>
>> I thought he nacked preempt_disable()
> 
> +1.
> 
> imo unconditional local_bh_disable() in tx path is fine.
> I didn't like the addition of local_bh_disable() in every lwt related
> function without doing home work whether it's needed there or not.
> Like input path shouldn't need local_bh_disable

Ack, sorry for the confusion. I'll post -v2 with that solution.
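
For clarity, lwtunnel_output() would then end up looking roughly as below in
-v2, with lwtunnel_xmit() handled the same way and lwtunnel_input() only
gaining a DEBUG_NET_WARN_ON_ONCE(!in_softirq()). This is a sketch
reconstructed from the hunks earlier in the thread, not the actual patch I
will post:

  int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
  	const struct lwtunnel_encap_ops *ops;
  	struct lwtunnel_state *lwtstate;
  	struct dst_entry *dst;
  	int ret;

  	/* Unconditional: nesting under an already BH-disabled caller is fine,
  	 * and it keeps dev_xmit_recursion_inc()/_dec() on the same CPU.
  	 */
  	local_bh_disable();

  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
  		ret = -ENETDOWN;
  		goto drop;
  	}

  	dst = skb_dst(skb);
  	if (!dst) {
  		ret = -EINVAL;
  		goto drop;
  	}

  	lwtstate = dst->lwtstate;
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
  	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
  		ret = 0;
  		goto out;
  	}

  	ret = -EOPNOTSUPP;
  	rcu_read_lock();
  	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
  	if (likely(ops && ops->output)) {
  		dev_xmit_recursion_inc();
  		ret = ops->output(net, sk, skb);
  		dev_xmit_recursion_dec();
  	}
  	rcu_read_unlock();

  	if (ret == -EOPNOTSUPP)
  		goto drop;

  	goto out;
  drop:
  	kfree_skb(skb);
  out:
  	local_bh_enable();
  	return ret;
  }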

