* [PATCH net v2] net/sched: act_mirred: Add carrier check
@ 2023-04-24 17:08 Victor Nogueira
2023-04-24 17:36 ` Leon Romanovsky
0 siblings, 1 reply; 8+ messages in thread
From: Victor Nogueira @ 2023-04-24 17:08 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni
Cc: netdev, jhs, xiyou.wangcong, jiri, kernel, Victor Nogueira
There are cases where the device is adminstratively UP, but operationally
down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
who's cable was pulled out, here is its ip link output:
5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
altname enp179s0f1np1
As you can see, it's administratively UP but operationally down.
In this case, sending a packet to this port caused a nasty kernel hang (so
nasty that we were unable to capture it). Aborting a transmit based on
operational status (in addition to administrative status) fixes the issue.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
---
net/sched/act_mirred.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index ec43764e92e7..0a711c184c29 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -264,7 +264,7 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
goto out;
}
- if (unlikely(!(dev->flags & IFF_UP))) {
+ if (unlikely(!(dev->flags & IFF_UP)) || !netif_carrier_ok(dev)) {
net_notice_ratelimited("tc mirred to Houston: device %s is down\n",
dev->name);
goto out;
--
2.25.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 17:08 [PATCH net v2] net/sched: act_mirred: Add carrier check Victor Nogueira
@ 2023-04-24 17:36 ` Leon Romanovsky
2023-04-24 17:44 ` Stephen Hemminger
0 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2023-04-24 17:36 UTC (permalink / raw)
To: Victor Nogueira
Cc: davem, edumazet, kuba, pabeni, netdev, jhs, xiyou.wangcong, jiri,
kernel
On Mon, Apr 24, 2023 at 05:08:32PM +0000, Victor Nogueira wrote:
> There are cases where the device is adminstratively UP, but operationally
> down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
> who's cable was pulled out, here is its ip link output:
>
> 5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
> link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
> altname enp179s0f1np1
>
> As you can see, it's administratively UP but operationally down.
> In this case, sending a packet to this port caused a nasty kernel hang (so
> nasty that we were unable to capture it). Aborting a transmit based on
> operational status (in addition to administrative status) fixes the issue.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>
Please no blank line between tags.
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> ---
> net/sched/act_mirred.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
It will be great to have changelog to see the changes between versions.
Thanks
>
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index ec43764e92e7..0a711c184c29 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -264,7 +264,7 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
> goto out;
> }
>
> - if (unlikely(!(dev->flags & IFF_UP))) {
> + if (unlikely(!(dev->flags & IFF_UP)) || !netif_carrier_ok(dev)) {
> net_notice_ratelimited("tc mirred to Houston: device %s is down\n",
> dev->name);
> goto out;
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 17:36 ` Leon Romanovsky
@ 2023-04-24 17:44 ` Stephen Hemminger
2023-04-24 17:59 ` Jamal Hadi Salim
0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2023-04-24 17:44 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Victor Nogueira, davem, edumazet, kuba, pabeni, netdev, jhs,
xiyou.wangcong, jiri, kernel
On Mon, 24 Apr 2023 20:36:02 +0300
Leon Romanovsky <leon@kernel.org> wrote:
> > There are cases where the device is adminstratively UP, but operationally
> > down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
> > who's cable was pulled out, here is its ip link output:
> >
> > 5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
> > link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
> > altname enp179s0f1np1
> >
> > As you can see, it's administratively UP but operationally down.
> > In this case, sending a packet to this port caused a nasty kernel hang (so
> > nasty that we were unable to capture it). Aborting a transmit based on
> > operational status (in addition to administrative status) fixes the issue.
> >
Then fix the driver. It shouldn't hang.
Other drivers just drop packets if link is down.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 17:44 ` Stephen Hemminger
@ 2023-04-24 17:59 ` Jamal Hadi Salim
2023-04-24 21:36 ` Jakub Kicinski
0 siblings, 1 reply; 8+ messages in thread
From: Jamal Hadi Salim @ 2023-04-24 17:59 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Leon Romanovsky, Victor Nogueira, davem, edumazet, kuba, pabeni,
netdev, xiyou.wangcong, jiri, kernel
On Mon, Apr 24, 2023 at 1:44 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon, 24 Apr 2023 20:36:02 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
>
> > > There are cases where the device is adminstratively UP, but operationally
> > > down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
> > > who's cable was pulled out, here is its ip link output:
> > >
> > > 5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
> > > link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
> > > altname enp179s0f1np1
> > >
> > > As you can see, it's administratively UP but operationally down.
> > > In this case, sending a packet to this port caused a nasty kernel hang (so
> > > nasty that we were unable to capture it). Aborting a transmit based on
> > > operational status (in addition to administrative status) fixes the issue.
> > >
>
> Then fix the driver. It shouldn't hang.
> Other drivers just drop packets if link is down.
We didnt do extensive testing of drivers but consider this a safeguard
against buggy driver (its a huge process upgrading drivers in some
environments). It may even make sense to move this to dev_queue_xmit()
i.e the arguement is: why is the core sending a packet to hardware
that has link down to begin with? BTW, I believe the bridge behaves
this way ...
cheers,
jamal
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 17:59 ` Jamal Hadi Salim
@ 2023-04-24 21:36 ` Jakub Kicinski
2023-04-24 21:53 ` Jamal Hadi Salim
0 siblings, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2023-04-24 21:36 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Stephen Hemminger, Leon Romanovsky, Victor Nogueira, davem,
edumazet, pabeni, netdev, xiyou.wangcong, jiri, kernel
On Mon, 24 Apr 2023 13:59:15 -0400 Jamal Hadi Salim wrote:
> > Then fix the driver. It shouldn't hang.
> > Other drivers just drop packets if link is down.
>
> We didnt do extensive testing of drivers but consider this a safeguard
> against buggy driver (its a huge process upgrading drivers in some
> environments). It may even make sense to move this to dev_queue_xmit()
> i.e the arguement is: why is the core sending a packet to hardware
> that has link down to begin with? BTW, I believe the bridge behaves
> this way ...
I'm with Stephen, even if the check makes sense in general we should
first drill down into the real bug, and squash it.
--
pw-bot: cr
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 21:36 ` Jakub Kicinski
@ 2023-04-24 21:53 ` Jamal Hadi Salim
2023-04-24 22:10 ` Stephen Hemminger
0 siblings, 1 reply; 8+ messages in thread
From: Jamal Hadi Salim @ 2023-04-24 21:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Stephen Hemminger, Leon Romanovsky, Victor Nogueira, davem,
edumazet, pabeni, netdev, xiyou.wangcong, jiri, kernel
On Mon, Apr 24, 2023 at 5:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 24 Apr 2023 13:59:15 -0400 Jamal Hadi Salim wrote:
> > > Then fix the driver. It shouldn't hang.
> > > Other drivers just drop packets if link is down.
> >
> > We didnt do extensive testing of drivers but consider this a safeguard
> > against buggy driver (its a huge process upgrading drivers in some
> > environments). It may even make sense to move this to dev_queue_xmit()
> > i.e the arguement is: why is the core sending a packet to hardware
> > that has link down to begin with? BTW, I believe the bridge behaves
> > this way ...
>
> I'm with Stephen, even if the check makes sense in general we should
> first drill down into the real bug, and squash it.
Ok then, I guess in keeping up with the spirit of trivial patches
generating the most discussion, these are two separate issues in my
opinion: IOW, the driver bug should be fixed (we have reached out to
the vendor) - but the patch stands on its own.
cheers,
jamal
> --
> pw-bot: cr
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 21:53 ` Jamal Hadi Salim
@ 2023-04-24 22:10 ` Stephen Hemminger
2023-04-24 22:22 ` Jamal Hadi Salim
0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2023-04-24 22:10 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Jakub Kicinski, Leon Romanovsky, Victor Nogueira, davem, edumazet,
pabeni, netdev, xiyou.wangcong, jiri, kernel
On Mon, 24 Apr 2023 17:53:03 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Mon, Apr 24, 2023 at 5:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 24 Apr 2023 13:59:15 -0400 Jamal Hadi Salim wrote:
> > > > Then fix the driver. It shouldn't hang.
> > > > Other drivers just drop packets if link is down.
> > >
> > > We didnt do extensive testing of drivers but consider this a safeguard
> > > against buggy driver (its a huge process upgrading drivers in some
> > > environments). It may even make sense to move this to dev_queue_xmit()
> > > i.e the arguement is: why is the core sending a packet to hardware
> > > that has link down to begin with? BTW, I believe the bridge behaves
> > > this way ...
> >
> > I'm with Stephen, even if the check makes sense in general we should
> > first drill down into the real bug, and squash it.
>
> Ok then, I guess in keeping up with the spirit of trivial patches
> generating the most discussion, these are two separate issues in my
> opinion: IOW, the driver bug should be fixed (we have reached out to
> the vendor) - but the patch stands on its own.
There are many other ways packet could arrive at driver when link
is down. You are addressing only one small corner case by patching
mirred.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net v2] net/sched: act_mirred: Add carrier check
2023-04-24 22:10 ` Stephen Hemminger
@ 2023-04-24 22:22 ` Jamal Hadi Salim
0 siblings, 0 replies; 8+ messages in thread
From: Jamal Hadi Salim @ 2023-04-24 22:22 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Jakub Kicinski, Leon Romanovsky, Victor Nogueira, davem, edumazet,
pabeni, netdev, xiyou.wangcong, jiri, kernel
On Mon, Apr 24, 2023 at 6:10 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon, 24 Apr 2023 17:53:03 -0400
> Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> > On Mon, Apr 24, 2023 at 5:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Mon, 24 Apr 2023 13:59:15 -0400 Jamal Hadi Salim wrote:
> > > > > Then fix the driver. It shouldn't hang.
> > > > > Other drivers just drop packets if link is down.
> > > >
> > > > We didnt do extensive testing of drivers but consider this a safeguard
> > > > against buggy driver (its a huge process upgrading drivers in some
> > > > environments). It may even make sense to move this to dev_queue_xmit()
> > > > i.e the arguement is: why is the core sending a packet to hardware
> > > > that has link down to begin with? BTW, I believe the bridge behaves
> > > > this way ...
> > >
> > > I'm with Stephen, even if the check makes sense in general we should
> > > first drill down into the real bug, and squash it.
> >
> > Ok then, I guess in keeping up with the spirit of trivial patches
> > generating the most discussion, these are two separate issues in my
> > opinion: IOW, the driver bug should be fixed (we have reached out to
> > the vendor) - but the patch stands on its own.
>
> There are many other ways packet could arrive at driver when link
> is down. You are addressing only one small corner case by patching
> mirred.
Of course, we are dealing with the thing that is causing us grief is a
fair thing to do, no?
What other (non-corner use) cases do you want to see fixed?
cheers
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-24 22:22 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-24 17:08 [PATCH net v2] net/sched: act_mirred: Add carrier check Victor Nogueira
2023-04-24 17:36 ` Leon Romanovsky
2023-04-24 17:44 ` Stephen Hemminger
2023-04-24 17:59 ` Jamal Hadi Salim
2023-04-24 21:36 ` Jakub Kicinski
2023-04-24 21:53 ` Jamal Hadi Salim
2023-04-24 22:10 ` Stephen Hemminger
2023-04-24 22:22 ` Jamal Hadi Salim
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).