* [PATCH v2] net: net_failover: Fix the deadlock in slave register
@ 2026-04-27 9:29 faicker.mo
2026-04-29 1:19 ` Jakub Kicinski
0 siblings, 1 reply; 3+ messages in thread
From: faicker.mo @ 2026-04-27 9:29 UTC (permalink / raw)
To: faicker.mo
Cc: horms, Sridhar Samudrala, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Stanislav Fomichev,
netdev, linux-kernel
From: Faicker Mo <faicker.mo@gmail.com>
There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_preempt_disabled+0x15/0x30
__mutex_lock.constprop.0+0x538/0x9e0
__mutex_lock_slowpath+0x13/0x20
mutex_lock+0x3b/0x50
dev_set_mtu+0x40/0xe0
net_failover_slave_register+0x24/0x280
failover_slave_register+0x103/0x1b0
failover_event+0x15e/0x210
? dropmon_net_event+0xac/0xe0
notifier_call_chain+0x5e/0xe0
raw_notifier_call_chain+0x16/0x30
call_netdevice_notifiers_info+0x52/0xa0
register_netdevice+0x5f4/0x7c0
register_netdev+0x1e/0x40
_mlx5e_probe+0xe2/0x370 [mlx5_core]
mlx5e_probe+0x59/0x70 [mlx5_core]
? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core]
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
---
Changes since v1:
- Fix the space chars (Simon)
- Change the dev_close to netif_close (Simon)
- Change the label err_dev_open to err_netif_open
---
drivers/net/net_failover.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index d0361aaf25ef..3f7d31033bae 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -502,7 +502,7 @@ static int net_failover_slave_register(struct net_device *slave_dev,
/* Align MTU of slave with failover dev */
orig_mtu = slave_dev->mtu;
- err = dev_set_mtu(slave_dev, failover_dev->mtu);
+ err = netif_set_mtu(slave_dev, failover_dev->mtu);
if (err) {
netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
slave_dev->name, failover_dev->mtu);
@@ -512,11 +512,11 @@ static int net_failover_slave_register(struct net_device *slave_dev,
dev_hold(slave_dev);
if (netif_running(failover_dev)) {
- err = dev_open(slave_dev, NULL);
+ err = netif_open(slave_dev, NULL);
if (err && (err != -EBUSY)) {
netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
slave_dev->name, err);
- goto err_dev_open;
+ goto err_netif_open;
}
}
@@ -562,10 +562,10 @@ static int net_failover_slave_register(struct net_device *slave_dev,
err_vlan_add:
dev_uc_unsync(slave_dev, failover_dev);
dev_mc_unsync(slave_dev, failover_dev);
- dev_close(slave_dev);
-err_dev_open:
+ netif_close(slave_dev);
+err_netif_open:
dev_put(slave_dev);
- dev_set_mtu(slave_dev, orig_mtu);
+ netif_set_mtu(slave_dev, orig_mtu);
done:
return err;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v2] net: net_failover: Fix the deadlock in slave register
2026-04-27 9:29 [PATCH v2] net: net_failover: Fix the deadlock in slave register faicker.mo
@ 2026-04-29 1:19 ` Jakub Kicinski
2026-05-02 2:46 ` faicker mo
0 siblings, 1 reply; 3+ messages in thread
From: Jakub Kicinski @ 2026-04-29 1:19 UTC (permalink / raw)
To: faicker.mo
Cc: Jakub Kicinski, horms, sridhar.samudrala, andrew+netdev, davem,
edumazet, pabeni, sdf, netdev, linux-kernel
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: net_failover: Fix the deadlock in slave register
This patch replaces dev_set_mtu(), dev_open() and dev_close() in
net_failover_slave_register() with their non-locking netif_* counterparts
to avoid a deadlock when the slave is registered from the NETDEV_REGISTER
notifier, where register_netdevice() already holds the per-device ops
lock.
> diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
> index d0361aaf25ef..3f7d31033bae 100644
> --- a/drivers/net/net_failover.c
> +++ b/drivers/net/net_failover.c
> @@ -502,7 +502,7 @@ static int net_failover_slave_register(struct net_device *slave_dev,
>
> /* Align MTU of slave with failover dev */
> orig_mtu = slave_dev->mtu;
> - err = dev_set_mtu(slave_dev, failover_dev->mtu);
> + err = netif_set_mtu(slave_dev, failover_dev->mtu);
Is net_failover_slave_register() only reachable from the NETDEV_REGISTER
notifier path?
Looking at net/core/failover.c, the fops->slave_register callback is
invoked from two places. One is failover_event() on NETDEV_REGISTER,
which matches the call trace in the commit message and where
register_netdevice() has already taken the ops lock via
netdev_lock_ops().
The other is failover_register() -> failover_existing_slave_register(),
which walks existing netdevs under rtnl_lock only:
net/core/failover.c:failover_existing_slave_register() {
rtnl_lock();
for_each_netdev(net, dev) {
if (netif_is_failover(dev))
continue;
if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
failover_slave_register(dev);
}
rtnl_unlock();
}
On that path, the slave's dev->lock is not held by the caller.
netif_set_mtu() -> netif_set_mtu_ext() calls netdev_ops_assert_locked(dev)
for devices with request_ops_lock (for example mlx5e, the driver in the
commit's own call trace).
> @@ -512,11 +512,11 @@ static int net_failover_slave_register(struct net_device *slave_dev,
> dev_hold(slave_dev);
>
> if (netif_running(failover_dev)) {
> - err = dev_open(slave_dev, NULL);
> + err = netif_open(slave_dev, NULL);
Same question for this call. netif_open() -> __dev_open() also does
netdev_ops_assert_locked(dev), and on the failover_existing_slave_register()
path the slave's ops lock is not held.
Before this change, dev_set_mtu()/dev_open()/dev_close() acquired the ops
lock internally via netdev_lock_ops(), so the pre-existing-slave path was
covered. Does switching unconditionally to the non-locking variants
regress that path for slaves whose drivers use the per-device ops lock,
both as a lockdep splat when CONFIG_PROVE_LOCKING is set and as a loss of
ops-lock serialization against concurrent ethtool, queue and netdev-genl
operations on the slave?
> @@ -562,10 +562,10 @@ static int net_failover_slave_register(struct net_device *slave_dev,
> err_vlan_add:
> dev_uc_unsync(slave_dev, failover_dev);
> dev_mc_unsync(slave_dev, failover_dev);
> - dev_close(slave_dev);
> -err_dev_open:
> + netif_close(slave_dev);
> +err_netif_open:
> dev_put(slave_dev);
> - dev_set_mtu(slave_dev, orig_mtu);
> + netif_set_mtu(slave_dev, orig_mtu);
Would it be more appropriate to acquire netdev_lock_ops(slave_dev) in
net_failover_slave_register() when the caller is not the NETDEV_REGISTER
notifier (where the ops lock is already held), instead of dropping the
locking unconditionally?
The commit message explains the NETDEV_REGISTER caller but does not
mention the failover_existing_slave_register() caller; could that
rationale be extended, or the other caller handled explicitly?
--
pw-bot: cr
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v2] net: net_failover: Fix the deadlock in slave register
2026-04-29 1:19 ` Jakub Kicinski
@ 2026-05-02 2:46 ` faicker mo
0 siblings, 0 replies; 3+ messages in thread
From: faicker mo @ 2026-05-02 2:46 UTC (permalink / raw)
To: Jakub Kicinski
Cc: horms, sridhar.samudrala, andrew+netdev, davem, edumazet, pabeni,
sdf, netdev, linux-kernel
>
> > @@ -512,11 +512,11 @@ static int net_failover_slave_register(struct net_device *slave_dev,
> > dev_hold(slave_dev);
> >
> > if (netif_running(failover_dev)) {
> > - err = dev_open(slave_dev, NULL);
> > + err = netif_open(slave_dev, NULL);
>
> Same question for this call. netif_open() -> __dev_open() also does
> netdev_ops_assert_locked(dev), and on the failover_existing_slave_register()
> path the slave's ops lock is not held.
Yes.
>
> Before this change, dev_set_mtu()/dev_open()/dev_close() acquired the ops
> lock internally via netdev_lock_ops(), so the pre-existing-slave path was
> covered. Does switching unconditionally to the non-locking variants
> regress that path for slaves whose drivers use the per-device ops lock,
> both as a lockdep splat when CONFIG_PROVE_LOCKING is set and as a loss of
> ops-lock serialization against concurrent ethtool, queue and netdev-genl
> operations on the slave?
>
> > @@ -562,10 +562,10 @@ static int net_failover_slave_register(struct net_device *slave_dev,
> > err_vlan_add:
> > dev_uc_unsync(slave_dev, failover_dev);
> > dev_mc_unsync(slave_dev, failover_dev);
> > - dev_close(slave_dev);
> > -err_dev_open:
> > + netif_close(slave_dev);
> > +err_netif_open:
> > dev_put(slave_dev);
> > - dev_set_mtu(slave_dev, orig_mtu);
> > + netif_set_mtu(slave_dev, orig_mtu);
>
> Would it be more appropriate to acquire netdev_lock_ops(slave_dev) in
> net_failover_slave_register() when the caller is not the NETDEV_REGISTER
> notifier (where the ops lock is already held), instead of dropping the
> locking unconditionally?
Yes, we should add netdev_lock_ops in failover_existing_slave_register
before calling failover_slave_register.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-05-02 2:46 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-27 9:29 [PATCH v2] net: net_failover: Fix the deadlock in slave register faicker.mo
2026-04-29 1:19 ` Jakub Kicinski
2026-05-02 2:46 ` faicker mo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox