* [PATCH net] net: lwtunnel: disable preemption when required
@ 2025-04-03 8:39 Justin Iurman
2025-04-03 16:24 ` Stanislav Fomichev
0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-03 8:39 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, horms, kuniyu, justin.iurman,
Alexei Starovoitov, bpf, Stanislav Fomichev
In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
preemptible scope for PREEMPT kernels. This patch disables preemption
before calling dev_xmit_recursion(). Preemption is re-enabled only at
the end, since we must ensure the same CPU is used for both
dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
recursion levels in some cases) in order to maintain valid per-cpu
counters.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/
Fixes: 986ffb3a57c5 ("net: lwtunnel: fix recursion loops")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Stanislav Fomichev <stfomichev@gmail.com>
---
net/core/lwtunnel.c | 47 ++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 18 deletions(-)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..a9ad068e5707 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ preempt_disable();
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -345,11 +347,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
ret = -EINVAL;
goto drop;
}
- lwtstate = dst->lwtstate;
+ lwtstate = dst->lwtstate;
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
- lwtstate->type > LWTUNNEL_ENCAP_MAX)
- return 0;
+ lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+ ret = 0;
+ goto out;
+ }
ret = -EOPNOTSUPP;
rcu_read_lock();
@@ -364,11 +368,11 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
if (ret == -EOPNOTSUPP)
goto drop;
- return ret;
-
+ goto out;
drop:
kfree_skb(skb);
-
+out:
+ preempt_enable();
return ret;
}
EXPORT_SYMBOL_GPL(lwtunnel_output);
@@ -380,6 +384,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ preempt_disable();
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -394,10 +400,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
}
lwtstate = dst->lwtstate;
-
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
- lwtstate->type > LWTUNNEL_ENCAP_MAX)
- return 0;
+ lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+ ret = 0;
+ goto out;
+ }
ret = -EOPNOTSUPP;
rcu_read_lock();
@@ -412,11 +419,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
if (ret == -EOPNOTSUPP)
goto drop;
- return ret;
-
+ goto out;
drop:
kfree_skb(skb);
-
+out:
+ preempt_enable();
return ret;
}
EXPORT_SYMBOL_GPL(lwtunnel_xmit);
@@ -428,6 +435,8 @@ int lwtunnel_input(struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ preempt_disable();
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -440,11 +449,13 @@ int lwtunnel_input(struct sk_buff *skb)
ret = -EINVAL;
goto drop;
}
- lwtstate = dst->lwtstate;
+ lwtstate = dst->lwtstate;
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
- lwtstate->type > LWTUNNEL_ENCAP_MAX)
- return 0;
+ lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+ ret = 0;
+ goto out;
+ }
ret = -EOPNOTSUPP;
rcu_read_lock();
@@ -459,11 +470,11 @@ int lwtunnel_input(struct sk_buff *skb)
if (ret == -EOPNOTSUPP)
goto drop;
- return ret;
-
+ goto out;
drop:
kfree_skb(skb);
-
+out:
+ preempt_enable();
return ret;
}
EXPORT_SYMBOL_GPL(lwtunnel_input);
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-03 8:39 [PATCH net] net: lwtunnel: disable preemption when required Justin Iurman
@ 2025-04-03 16:24 ` Stanislav Fomichev
2025-04-03 19:08 ` Justin Iurman
0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2025-04-03 16:24 UTC (permalink / raw)
To: Justin Iurman
Cc: netdev, davem, edumazet, kuba, pabeni, horms, kuniyu,
Alexei Starovoitov, bpf
On 04/03, Justin Iurman wrote:
> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> preemptible scope for PREEMPT kernels. This patch disables preemption
> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> the end, since we must ensure the same CPU is used for both
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> recursion levels in some cases) in order to maintain valid per-cpu
> counters.
Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
track the recursion. Any reason not to do it in the generic PREEMPT case?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-03 16:24 ` Stanislav Fomichev
@ 2025-04-03 19:08 ` Justin Iurman
2025-04-03 20:35 ` Alexei Starovoitov
2025-04-03 23:38 ` Jakub Kicinski
0 siblings, 2 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-03 19:08 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, davem, edumazet, kuba, pabeni, horms, kuniyu,
Alexei Starovoitov, bpf
On 4/3/25 18:24, Stanislav Fomichev wrote:
> On 04/03, Justin Iurman wrote:
>> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
>> preemptible scope for PREEMPT kernels. This patch disables preemption
>> before calling dev_xmit_recursion(). Preemption is re-enabled only at
>> the end, since we must ensure the same CPU is used for both
>> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
>> recursion levels in some cases) in order to maintain valid per-cpu
>> counters.
>
> Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> track the recursion. Any reason not to do it in the generic PREEMPT case?
I'd say PREEMPT_RT is a different beast. IMO, softirqs can be
preempted/migrated in RT kernels, which is not true for non-RT kernels.
Maybe RT kernels could use __this_cpu_* instead of "current" though, but
it would be less trivial. For example, see commit ecefbc09e8ee ("net:
softnet_data: Make xmit per task.") on why it makes sense to use
"current" in RT kernels. I guess the opposite as you suggest (i.e.,
non-RT kernels using "current") would be technically possible, but there
must be a reason it is defined the way it is... so probably incorrect or
inefficient?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-03 19:08 ` Justin Iurman
@ 2025-04-03 20:35 ` Alexei Starovoitov
2025-04-04 14:19 ` Sebastian Sewior
2025-04-03 23:38 ` Jakub Kicinski
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-03 20:35 UTC (permalink / raw)
To: Justin Iurman, Sebastian Sewior
Cc: Stanislav Fomichev, Network Development, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, bpf
On Thu, Apr 3, 2025 at 12:08 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.
> >
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?
>
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be
> preempted/migrated in RT kernels, which is not true for non-RT kernels.
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but
> it would be less trivial. For example, see commit ecefbc09e8ee ("net:
> softnet_data: Make xmit per task.") on why it makes sense to use
> "current" in RT kernels. I guess the opposite as you suggest (i.e.,
> non-RT kernels using "current") would be technically possible, but there
> must be a reason it is defined the way it is... so probably incorrect or
> inefficient?
Stating the obvious...
Sebastian did a lot of work removing preempt_disable from the networking
stack.
We're certainly not adding them back.
This patch is no go.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-03 19:08 ` Justin Iurman
2025-04-03 20:35 ` Alexei Starovoitov
@ 2025-04-03 23:38 ` Jakub Kicinski
1 sibling, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-04-03 23:38 UTC (permalink / raw)
To: Justin Iurman
Cc: Stanislav Fomichev, netdev, davem, edumazet, pabeni, horms,
kuniyu, Alexei Starovoitov, bpf
On Thu, 3 Apr 2025 21:08:12 +0200 Justin Iurman wrote:
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.
> >
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?
>
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be
> preempted/migrated in RT kernels, which is not true for non-RT kernels.
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but
> it would be less trivial. For example, see commit ecefbc09e8ee ("net:
> softnet_data: Make xmit per task.") on why it makes sense to use
> "current" in RT kernels. I guess the opposite as you suggest (i.e.,
> non-RT kernels using "current") would be technically possible, but there
> must be a reason it is defined the way it is... so probably incorrect or
> inefficient?
I suspect it's to avoid the performance overhead.
IIUC you would be better off using local_bh_disable() here.
It doesn't disable preemption on RT.
I don't believe "disable preemption if !RT" primitive exists.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-03 20:35 ` Alexei Starovoitov
@ 2025-04-04 14:19 ` Sebastian Sewior
2025-04-06 8:59 ` Justin Iurman
0 siblings, 1 reply; 18+ messages in thread
From: Sebastian Sewior @ 2025-04-04 14:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Justin Iurman, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf
Alexei, thank you for the Cc.
On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> Stating the obvious...
> Sebastian did a lot of work removing preempt_disable from the networking
> stack.
> We're certainly not adding them back.
> This patch is no go.
While looking through the code, it looks as if lwtunnel_xmit() lacks a
local_bh_disable().
Sebastian
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-04 14:19 ` Sebastian Sewior
@ 2025-04-06 8:59 ` Justin Iurman
2025-04-07 17:54 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-06 8:59 UTC (permalink / raw)
To: Sebastian Sewior, Alexei Starovoitov
Cc: Stanislav Fomichev, Network Development, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, bpf
On 4/4/25 16:19, Sebastian Sewior wrote:
> Alexei, thank you for the Cc.
>
> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>> Stating the obvious...
>> Sebastian did a lot of work removing preempt_disable from the networking
>> stack.
>> We're certainly not adding them back.
>> This patch is no go.
>
> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> local_bh_disable().
Thanks Sebastian for the confirmation, as the initial idea was to use
local_bh_disable() as well. Then I thought preempt_disable() would be
enough in this context, but I didn't realize you made efforts to remove
it from the networking stack.
@Alexei, just to clarify: would you ACK this patch if we do
s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> Sebastian
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-06 8:59 ` Justin Iurman
@ 2025-04-07 17:54 ` Alexei Starovoitov
2025-04-11 18:34 ` Justin Iurman
0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-07 17:54 UTC (permalink / raw)
To: Justin Iurman
Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf
On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/4/25 16:19, Sebastian Sewior wrote:
> > Alexei, thank you for the Cc.
> >
> > On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >> Stating the obvious...
> >> Sebastian did a lot of work removing preempt_disable from the networking
> >> stack.
> >> We're certainly not adding them back.
> >> This patch is no go.
> >
> > While looking through the code, it looks as if lwtunnel_xmit() lacks a
> > local_bh_disable().
>
> Thanks Sebastian for the confirmation, as the initial idea was to use
> local_bh_disable() as well. Then I thought preempt_disable() would be
> enough in this context, but I didn't realize you made efforts to remove
> it from the networking stack.
>
> @Alexei, just to clarify: would you ACK this patch if we do
> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
You need to think it through and not sprinkle local_bh_disable in
every lwt related function.
Like lwtunnel_input should be running with bh disabled already.
I don't remember the exact conditions where bh is disabled in xmit path.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-07 17:54 ` Alexei Starovoitov
@ 2025-04-11 18:34 ` Justin Iurman
2025-04-14 23:13 ` Alexei Starovoitov
2025-04-15 0:54 ` Andrea Mayer
0 siblings, 2 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-11 18:34 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer
On 4/7/25 19:54, Alexei Starovoitov wrote:
> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>> Alexei, thank you for the Cc.
>>>
>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>> Stating the obvious...
>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>> stack.
>>>> We're certainly not adding them back.
>>>> This patch is no go.
>>>
>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>> local_bh_disable().
>>
>> Thanks Sebastian for the confirmation, as the initial idea was to use
>> local_bh_disable() as well. Then I thought preempt_disable() would be
>> enough in this context, but I didn't realize you made efforts to remove
>> it from the networking stack.
>>
>> @Alexei, just to clarify: would you ACK this patch if we do
>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>
> You need to think it through and not sprinkle local_bh_disable in
> every lwt related function.
> Like lwtunnel_input should be running with bh disabled already.
Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
disabling BHs when they're already disabled), but I guess it's cleaner
to avoid it here as you suggest. And since lwtunnel_input() is indeed
(always) running with BHs disabled, no changes needed. Thanks for the
reminder.
> I don't remember the exact conditions where bh is disabled in xmit path.
Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
definitely run with or without BHs disabled. So, what I propose is the
following logic (applied to lwtunnel_xmit() too): if BHs disabled then
NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
mailer messes it up, but you got the idea):
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d44d341683c5 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
*sk, struct sk_buff *skb)
const struct lwtunnel_encap_ops *ops;
struct lwtunnel_state *lwtstate;
struct dst_entry *dst;
+ bool in_softirq;
int ret;
+ in_softirq = in_softirq();
+ if (!in_softirq)
+ local_bh_disable();
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock
*sk, struct sk_buff *skb)
ret = -EINVAL;
goto drop;
}
- lwtstate = dst->lwtstate;
+ lwtstate = dst->lwtstate;
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
- lwtstate->type > LWTUNNEL_ENCAP_MAX)
- return 0;
+ lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+ ret = 0;
+ goto out;
+ }
ret = -EOPNOTSUPP;
rcu_read_lock();
@@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock
*sk, struct sk_buff *skb)
if (ret == -EOPNOTSUPP)
goto drop;
- return ret;
-
+ goto out;
drop:
kfree_skb(skb);
+out:
+ if (!in_softirq)
+ local_bh_enable();
return ret;
}
@@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
const struct lwtunnel_encap_ops *ops;
struct lwtunnel_state *lwtstate;
struct dst_entry *dst;
+ bool in_softirq;
int ret;
+ in_softirq = in_softirq();
+ if (!in_softirq)
+ local_bh_disable();
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
}
lwtstate = dst->lwtstate;
-
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
- lwtstate->type > LWTUNNEL_ENCAP_MAX)
- return 0;
+ lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+ ret = 0;
+ goto out;
+ }
ret = -EOPNOTSUPP;
rcu_read_lock();
@@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
if (ret == -EOPNOTSUPP)
goto drop;
- return ret;
-
+ goto out;
drop:
kfree_skb(skb);
+out:
+ if (!in_softirq)
+ local_bh_enable();
return ret;
}
@@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ WARN_ON_ONCE(!in_softirq());
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
ret = -EINVAL;
goto drop;
}
- lwtstate = dst->lwtstate;
+ lwtstate = dst->lwtstate;
if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
lwtstate->type > LWTUNNEL_ENCAP_MAX)
return 0;
@@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
goto drop;
return ret;
-
drop:
kfree_skb(skb);
-
return ret;
}
EXPORT_SYMBOL_GPL(lwtunnel_input);
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-11 18:34 ` Justin Iurman
@ 2025-04-14 23:13 ` Alexei Starovoitov
2025-04-15 9:25 ` Justin Iurman
2025-04-15 11:56 ` Justin Iurman
2025-04-15 0:54 ` Andrea Mayer
1 sibling, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-14 23:13 UTC (permalink / raw)
To: Justin Iurman
Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer
On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Sewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> >
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
>
> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
> disabling BHs when they're already disabled), but I guess it's cleaner
> to avoid it here as you suggest. And since lwtunnel_input() is indeed
> (always) running with BHs disabled, no changes needed. Thanks for the
> reminder.
>
> > I don't remember the exact conditions where bh is disabled in xmit path.
>
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
> definitely run with or without BHs disabled. So, what I propose is the
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
> mailer messes it up, but you got the idea):
>
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
> *sk, struct sk_buff *skb)
> const struct lwtunnel_encap_ops *ops;
> struct lwtunnel_state *lwtstate;
> struct dst_entry *dst;
> + bool in_softirq;
> int ret;
>
> + in_softirq = in_softirq();
> + if (!in_softirq)
> + local_bh_disable();
> +
This looks like a hack to me.
Instead analyze the typical xmit path. If bh is not disabled
then add local_bh_disable(). It's fine if it happens to be nested
in some cases.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-11 18:34 ` Justin Iurman
2025-04-14 23:13 ` Alexei Starovoitov
@ 2025-04-15 0:54 ` Andrea Mayer
2025-04-15 9:10 ` Justin Iurman
1 sibling, 1 reply; 18+ messages in thread
From: Andrea Mayer @ 2025-04-15 0:54 UTC (permalink / raw)
To: Justin Iurman
Cc: Alexei Starovoitov, Sebastian Sewior, Stanislav Fomichev,
Network Development, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
Stefano Salsano, Paolo Lungaroni, Andrea Mayer
On Fri, 11 Apr 2025 20:34:54 +0200
Justin Iurman <justin.iurman@uliege.be> wrote:
> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Sewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> >
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
>
> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
> disabling BHs when they're already disabled), but I guess it's cleaner
> to avoid it here as you suggest. And since lwtunnel_input() is indeed
> (always) running with BHs disabled, no changes needed. Thanks for the
> reminder.
>
> > I don't remember the exact conditions where bh is disabled in xmit path.
>
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
Justin, thanks for the Cc.
I have been looking into the behavior of the lwtunnel_xmit() function in both
task and softirq contexts. To facilitate this investigation, I have written a
simple eBPF program that only prints messages to the trace pipe. This program
is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
with a destination address (DA) pointing to an external node, referred to as
x.x.x.x, within my testbed network.
To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
is then processed through the forwarding path, eventually leading to the
ip_output() function. Processing ends with a call to ip_finish_output2(), which
then calls lwtunnel_xmit().
Below is the stack trace from my testing machine, highlighting the key
functions involved in this processing path:
============================================
<IRQ>
...
lwtunnel_xmit+0x18/0x3f0
ip_finish_output2+0x45a/0xcc0
ip_output+0xe2/0x380
NF_HOOK.constprop.0+0x7e/0x2f0
ip_rcv+0x4bf/0x4d0
__netif_receive_skb_one_core+0x11c/0x130
process_backlog+0x277/0x980
__napi_poll.constprop.0+0x58/0x260
net_rx_action+0x396/0x6e0
handle_softirqs+0x116/0x640
do_softirq+0xa9/0xe0
</IRQ>
============================================
Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
x.x.x.x on the same testing node. Below is the corresponding stack trace:
============================================
<TASK>
...
lwtunnel_xmit+0x18/0x3f0
ip_finish_output2+0x45a/0xcc0
ip_output+0xe2/0x380
ip_push_pending_frames+0x17a/0x200
raw_sendmsg+0x9fa/0x1060
__sys_sendto+0x294/0x2e0
__x64_sys_sendto+0x6d/0x80
do_syscall_64+0x64/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
============================================
So also for the lwtunnel_xmit(), we need to make sure that the functions
dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
are protected, i.e. inside a local_bh_{disable/enable} block.
> definitely run with or without BHs disabled. So, what I propose is the
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
> mailer messes it up, but you got the idea):
>
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
> *sk, struct sk_buff *skb)
> const struct lwtunnel_encap_ops *ops;
> struct lwtunnel_state *lwtstate;
> struct dst_entry *dst;
> + bool in_softirq;
> int ret;
>
> + in_softirq = in_softirq();
> + if (!in_softirq)
> + local_bh_disable();
> +
In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
function. On my x86 architecture, preempt_count() accesses the per-CPU
__preempt_count variable.
If in_softirq() returns 0, it indicates that no softirqs are currently being
processed on the local CPU and BH are not disabled. Therefore, following the
logic above, we disable bottom halves (BH) on that particular CPU.
However, there is my opinion an issue that can occur: between the check on
in_softirq() and the call to local_bh_disable(), the task may be scheduled on
another CPU. As a result, the check on in_softirq() becomes ineffective because
we may end up disabling BH on a CPU that is not the one we just checked (with
if (in_softirq()) { ... }).
> if (dev_xmit_recursion()) {
> net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
> __func__);
> @@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock
> *sk, struct sk_buff *skb)
> ret = -EINVAL;
> goto drop;
> }
> - lwtstate = dst->lwtstate;
>
> + lwtstate = dst->lwtstate;
> if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> - lwtstate->type > LWTUNNEL_ENCAP_MAX)
> - return 0;
> + lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> + ret = 0;
> + goto out;
> + }
>
> ret = -EOPNOTSUPP;
> rcu_read_lock();
> @@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock
> *sk, struct sk_buff *skb)
> if (ret == -EOPNOTSUPP)
> goto drop;
>
> - return ret;
> -
> + goto out;
> drop:
> kfree_skb(skb);
> +out:
> + if (!in_softirq)
> + local_bh_enable();
>
> return ret;
> }
> @@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
> const struct lwtunnel_encap_ops *ops;
> struct lwtunnel_state *lwtstate;
> struct dst_entry *dst;
> + bool in_softirq;
> int ret;
>
> + in_softirq = in_softirq();
> + if (!in_softirq)
> + local_bh_disable();
> +
> if (dev_xmit_recursion()) {
> net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
> __func__);
> @@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
> }
>
> lwtstate = dst->lwtstate;
> -
> if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> - lwtstate->type > LWTUNNEL_ENCAP_MAX)
> - return 0;
> + lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> + ret = 0;
> + goto out;
> + }
>
> ret = -EOPNOTSUPP;
> rcu_read_lock();
> @@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
> if (ret == -EOPNOTSUPP)
> goto drop;
>
> - return ret;
> -
> + goto out;
> drop:
> kfree_skb(skb);
> +out:
> + if (!in_softirq)
> + local_bh_enable();
>
> return ret;
> }
> @@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
> struct dst_entry *dst;
> int ret;
>
> + WARN_ON_ONCE(!in_softirq());
> +
What about DEBUG_NET_WARN_ON_ONCE instead?
Ciao,
Andrea
> if (dev_xmit_recursion()) {
> net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
> __func__);
> @@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
> ret = -EINVAL;
> goto drop;
> }
> - lwtstate = dst->lwtstate;
>
> + lwtstate = dst->lwtstate;
> if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> lwtstate->type > LWTUNNEL_ENCAP_MAX)
> return 0;
> @@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
> goto drop;
>
> return ret;
> -
> drop:
> kfree_skb(skb);
> -
> return ret;
> }
> EXPORT_SYMBOL_GPL(lwtunnel_input);
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-15 0:54 ` Andrea Mayer
@ 2025-04-15 9:10 ` Justin Iurman
2025-04-15 14:38 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 9:10 UTC (permalink / raw)
To: Andrea Mayer
Cc: Alexei Starovoitov, Sebastian Sewior, Stanislav Fomichev,
Network Development, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
Stefano Salsano, Paolo Lungaroni
On 4/15/25 02:54, Andrea Mayer wrote:> I have been looking into the
behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
>
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
>
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
>
> ============================================
> <IRQ>
> ...
> lwtunnel_xmit+0x18/0x3f0
> ip_finish_output2+0x45a/0xcc0
> ip_output+0xe2/0x380
> NF_HOOK.constprop.0+0x7e/0x2f0
> ip_rcv+0x4bf/0x4d0
> __netif_receive_skb_one_core+0x11c/0x130
> process_backlog+0x277/0x980
> __napi_poll.constprop.0+0x58/0x260
> net_rx_action+0x396/0x6e0
> handle_softirqs+0x116/0x640
> do_softirq+0xa9/0xe0
> </IRQ>
> ============================================
>
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
>
> ============================================
> <TASK>
> ...
> lwtunnel_xmit+0x18/0x3f0
> ip_finish_output2+0x45a/0xcc0
> ip_output+0xe2/0x380
> ip_push_pending_frames+0x17a/0x200
> raw_sendmsg+0x9fa/0x1060
> __sys_sendto+0x294/0x2e0
> __x64_sys_sendto+0x6d/0x80
> do_syscall_64+0x64/0x140
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> </TASK>
> ============================================
>
> So also for the lwtunnel_xmit(), we need to make sure that the functions
> dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
> are protected, i.e. inside a local_bh_{disable/enable} block.
That's correct, and I ended up with the same conclusion as yours on the
possible paths for lwtunnel_xmit() depending on the context (task vs
irq). Based on your description, we're using a similar approach with
eBPF :-) Note that paths are similar for lwtunnel_output() (see below).
> In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
> in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
> function. On my x86 architecture, preempt_count() accesses the per-CPU
> __preempt_count variable.
>
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BH are not disabled. Therefore, following the
> logic above, we disable bottom halves (BH) on that particular CPU.
>
> However, there is my opinion an issue that can occur: between the check on
> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> another CPU. As a result, the check on in_softirq() becomes ineffective because
> we may end up disabling BH on a CPU that is not the one we just checked (with
> if (in_softirq()) { ... }).
Hmm, I think it's correct... good catch. I went for this solution to (i)
avoid useless nested BHs disable calls; and (ii) avoid ending up with a
spaghetti graph of possible paths with or without BHs disabled (i.e.,
with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
which otherwise makes it hard to maintain the code IMO.
So, if we want to follow what Alexei suggests (see his last response),
we'd need to disable BHs in both ip_local_out() and ip6_local_out().
These are the common functions which are closest in depth, and so for
both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
disabling BHs even when it may not be required. Indeed, ip_local_out()
and ip6_local_out() both call dst_output(), which one is usually not
lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
either).
The other solution is to always call local_bh_disable() in both
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
they were already. Which was basically -v1 and received a NACK from Alexei.
At the moment, I'm not sure what's best.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-14 23:13 ` Alexei Starovoitov
@ 2025-04-15 9:25 ` Justin Iurman
2025-04-15 11:56 ` Justin Iurman
1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 9:25 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer
On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
>> *sk, struct sk_buff *skb)
>> const struct lwtunnel_encap_ops *ops;
>> struct lwtunnel_state *lwtstate;
>> struct dst_entry *dst;
>> + bool in_softirq;
>> int ret;
>>
>> + in_softirq = in_softirq();
>> + if (!in_softirq)
>> + local_bh_disable();
>> +
>
> This looks like a hack to me.
>
> Instead analyze the typical xmit path. If bh is not disabled
This is already what I did, and it's exactly the reason why I ended up
with the above proposal. It's not only about the xmit path but also the
output path. Of course, having BHs disabled only where they need to
without useless nested calls would be nice, but in reality the solution
is not perfect and makes it even more difficult to visualize the path(s)
with or without BHs disabled IMO.
For both lwtunnel_xmit() and lwtunnel_output(), the common functions
which are closest in depth and where BHs should be disabled are
ip_local_out() and ip6_local_out(). And even when it's not required
(which is the tradeoff). The other solution was -v1, which you NACK'ed.
Please see my reply to Andrea for the whole story. To summarize, I'd say
that it's either (a) what you suggest, i.e., non-required BH disable
calls vs. (b) nested BH disable calls. With tradeoffs for each.
> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-14 23:13 ` Alexei Starovoitov
2025-04-15 9:25 ` Justin Iurman
@ 2025-04-15 11:56 ` Justin Iurman
1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 11:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastian Sewior, Stanislav Fomichev, Network Development,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Andrea Mayer
On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Sewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock
>> *sk, struct sk_buff *skb)
>> const struct lwtunnel_encap_ops *ops;
>> struct lwtunnel_state *lwtstate;
>> struct dst_entry *dst;
>> + bool in_softirq;
>> int ret;
>>
>> + in_softirq = in_softirq();
>> + if (!in_softirq)
>> + local_bh_disable();
>> +
>
> This looks like a hack to me.
>
> Instead analyze the typical xmit path. If bh is not disabled
> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.
FYI, and based on my previous response, the patch would look like this
in that case (again, my mailer messes long lines up, sorry). I'll let
others comment on which solution/tradeoff seems better.
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d0cb0f2f9efe 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@ int lwtunnel_output(struct net *net, struct sock
*sk, struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -380,6 +382,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
@@ -428,6 +432,8 @@ int lwtunnel_input(struct sk_buff *skb)
struct dst_entry *dst;
int ret;
+ DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
if (dev_xmit_recursion()) {
net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
__func__);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e18d7ec5062..89bda2f424bb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -124,10 +124,13 @@ int ip_local_out(struct net *net, struct sock *sk,
struct sk_buff *skb)
{
int err;
+ local_bh_disable();
+
err = __ip_local_out(net, sk, skb);
if (likely(err == 1))
err = dst_output(net, sk, skb);
+ local_bh_enable();
return err;
}
EXPORT_SYMBOL_GPL(ip_local_out);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 806d4b5dd1e6..bb40196edeb6 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -150,10 +150,13 @@ int ip6_local_out(struct net *net, struct sock
*sk, struct sk_buff *skb)
{
int err;
+ local_bh_disable();
+
err = __ip6_local_out(net, sk, skb);
if (likely(err == 1))
err = dst_output(net, sk, skb);
+ local_bh_enable();
return err;
}
EXPORT_SYMBOL_GPL(ip6_local_out);
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-15 9:10 ` Justin Iurman
@ 2025-04-15 14:38 ` Jakub Kicinski
2025-04-15 15:12 ` Justin Iurman
2025-04-15 15:12 ` Alexei Starovoitov
0 siblings, 2 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-04-15 14:38 UTC (permalink / raw)
To: Justin Iurman
Cc: Andrea Mayer, Alexei Starovoitov, Sebastian Sewior,
Stanislav Fomichev, Network Development, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
Stefano Salsano, Paolo Lungaroni
On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > However, there is my opinion an issue that can occur: between the check on
> > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > we may end up disabling BH on a CPU that is not the one we just checked (with
> > if (in_softirq()) { ... }).
The context is not affected by migration. The context is fully defined
by the execution stack.
> Hmm, I think it's correct... good catch. I went for this solution to (i)
> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
> spaghetti graph of possible paths with or without BHs disabled (i.e.,
> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
> which otherwise makes it hard to maintain the code IMO.
>
> So, if we want to follow what Alexei suggests (see his last response),
> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
> These are the common functions which are closest in depth, and so for
> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
> disabling BHs even when it may not be required. Indeed, ip_local_out()
> and ip6_local_out() both call dst_output(), which one is usually not
> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
> either).
>
> The other solution is to always call local_bh_disable() in both
> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
> they were already. Which was basically -v1 and received a NACK from Alexei.
I thought he nacked preempt_disable()
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-15 14:38 ` Jakub Kicinski
@ 2025-04-15 15:12 ` Justin Iurman
2025-04-15 15:12 ` Alexei Starovoitov
1 sibling, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 15:12 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Andrea Mayer, Alexei Starovoitov, Sebastian Sewior,
Stanislav Fomichev, Network Development, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Kuniyuki Iwashima, bpf,
Stefano Salsano, Paolo Lungaroni
On 4/15/25 16:38, Jakub Kicinski wrote:
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>> However, there is my opinion an issue that can occur: between the check on
>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>> if (in_softirq()) { ... }).
>
> The context is not affected by migration. The context is fully defined
> by the execution stack.
>
>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>> which otherwise makes it hard to maintain the code IMO.
>>
>> So, if we want to follow what Alexei suggests (see his last response),
>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>> These are the common functions which are closest in depth, and so for
>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>> and ip6_local_out() both call dst_output(), which one is usually not
>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>> either).
>>
>> The other solution is to always call local_bh_disable() in both
>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>> they were already. Which was basically -v1 and received a NACK from Alexei.
>
> I thought he nacked preempt_disable()
I think I wasn't clear enough, sorry. Alexei explicitly NACK'ed the
initial patch (the one with preempt_disable()) -- you're right. I think
he also (implicitly) NACK'ed the other solution I proposed (see [1]) by
reading his reply. Alexei can clarify if I'm mistaken. He seems to
prefer the solution that disables BHs on specific paths (see [2] for the
other proposal) instead of inside lwtunnel_xmit() and lwtunnel_output().
So, basically, we have a choice to make between [1] and [2], where [1]
IMO is better for the following reasons: (i) no nested calls to disable
BHs, (ii) disable BHs only when they're not already, and (iii) code
clarity, i.e., not overly complexifying the graph of paths w/ or w/o BHs
disabled. While with [2], BHs will be disabled even when not required on
the xmit/output path, and also resulting in an even more complex graph
of paths regarding BHs.
[1]
https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m5a4e6a56206d9d110a5e4d664ab4ea09e7e9b33e
[2]
https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m3a88de38eceb0a53e2d173dc3675ecaa37e9d0b4
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-15 14:38 ` Jakub Kicinski
2025-04-15 15:12 ` Justin Iurman
@ 2025-04-15 15:12 ` Alexei Starovoitov
2025-04-15 17:12 ` Justin Iurman
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-04-15 15:12 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Justin Iurman, Andrea Mayer, Sebastian Sewior, Stanislav Fomichev,
Network Development, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Stefano Salsano,
Paolo Lungaroni
On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > > However, there is my opinion an issue that can occur: between the check on
> > > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > > we may end up disabling BH on a CPU that is not the one we just checked (with
> > > if (in_softirq()) { ... }).
>
> The context is not affected by migration. The context is fully defined
> by the execution stack.
>
> > Hmm, I think it's correct... good catch. I went for this solution to (i)
> > avoid useless nested BHs disable calls; and (ii) avoid ending up with a
> > spaghetti graph of possible paths with or without BHs disabled (i.e.,
> > with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
> > which otherwise makes it hard to maintain the code IMO.
> >
> > So, if we want to follow what Alexei suggests (see his last response),
> > we'd need to disable BHs in both ip_local_out() and ip6_local_out().
> > These are the common functions which are closest in depth, and so for
> > both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
> > disabling BHs even when it may not be required. Indeed, ip_local_out()
> > and ip6_local_out() both call dst_output(), which one is usually not
> > lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
> > either).
> >
> > The other solution is to always call local_bh_disable() in both
> > lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
> > they were already. Which was basically -v1 and received a NACK from Alexei.
>
> I thought he nacked preempt_disable()
+1.
imo unconditional local_bh_disable() in tx path is fine.
I didn't like the addition of local_bh_disable() in every lwt related
function without doing home work whether it's needed there or not.
Like input path shouldn't need local_bh_disable
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net] net: lwtunnel: disable preemption when required
2025-04-15 15:12 ` Alexei Starovoitov
@ 2025-04-15 17:12 ` Justin Iurman
0 siblings, 0 replies; 18+ messages in thread
From: Justin Iurman @ 2025-04-15 17:12 UTC (permalink / raw)
To: Alexei Starovoitov, Jakub Kicinski
Cc: Andrea Mayer, Sebastian Sewior, Stanislav Fomichev,
Network Development, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, bpf, Stefano Salsano,
Paolo Lungaroni
On 4/15/25 17:12, Alexei Starovoitov wrote:
> On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>>> However, there is my opinion an issue that can occur: between the check on
>>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>>> if (in_softirq()) { ... }).
>>
>> The context is not affected by migration. The context is fully defined
>> by the execution stack.
>>
>>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>>> which otherwise makes it hard to maintain the code IMO.
>>>
>>> So, if we want to follow what Alexei suggests (see his last response),
>>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>>> These are the common functions which are closest in depth, and so for
>>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>>> and ip6_local_out() both call dst_output(), which one is usually not
>>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>>> either).
>>>
>>> The other solution is to always call local_bh_disable() in both
>>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>>> they were already. Which was basically -v1 and received a NACK from Alexei.
>>
>> I thought he nacked preempt_disable()
>
> +1.
>
> imo unconditional local_bh_disable() in tx path is fine.
> I didn't like the addition of local_bh_disable() in every lwt related
> function without doing home work whether it's needed there or not.
> Like input path shouldn't need local_bh_disable
Ack, sorry for the confusion. I'll post -v2 with that solution.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-04-15 17:12 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-03 8:39 [PATCH net] net: lwtunnel: disable preemption when required Justin Iurman
2025-04-03 16:24 ` Stanislav Fomichev
2025-04-03 19:08 ` Justin Iurman
2025-04-03 20:35 ` Alexei Starovoitov
2025-04-04 14:19 ` Sebastian Sewior
2025-04-06 8:59 ` Justin Iurman
2025-04-07 17:54 ` Alexei Starovoitov
2025-04-11 18:34 ` Justin Iurman
2025-04-14 23:13 ` Alexei Starovoitov
2025-04-15 9:25 ` Justin Iurman
2025-04-15 11:56 ` Justin Iurman
2025-04-15 0:54 ` Andrea Mayer
2025-04-15 9:10 ` Justin Iurman
2025-04-15 14:38 ` Jakub Kicinski
2025-04-15 15:12 ` Justin Iurman
2025-04-15 15:12 ` Alexei Starovoitov
2025-04-15 17:12 ` Justin Iurman
2025-04-03 23:38 ` Jakub Kicinski
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).