* [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
@ 2025-01-20 14:19 Kory Maincent
2025-01-20 19:12 ` Jakub Kicinski
0 siblings, 1 reply; 14+ messages in thread
From: Kory Maincent @ 2025-01-20 14:19 UTC (permalink / raw)
To: Kory Maincent, David S. Miller, netdev, linux-kernel
Cc: Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit,
Russell King, Eric Dumazet, Jakub Kicinski, Paolo Abeni
The phy_detach function can be called with or without the rtnl lock held.
When the rtnl lock is not held, using rtnl_dereference() triggers a
warning due to the lack of lock context.
Add an rcu_read_lock() to ensure the lock is acquired and to maintain
synchronization.
The path reported to not having RTNL lock acquired is the suspend path of
the ravb MAC driver. Without this fix we got this warning:
[ 39.032969] =============================
[ 39.032983] WARNING: suspicious RCU usage
[ 39.033019] -----------------------------
[ 39.033033] drivers/net/phy/phy_device.c:2004 suspicious
rcu_dereference_protected() usage!
...
[ 39.033597] stack backtrace:
[ 39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted
6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7
[ 39.033623] Hardware name: Renesas SMARC EVK version 2 based on
r9a08g045s33 (DT)
[ 39.033628] Call trace:
[ 39.033633] show_stack+0x14/0x1c (C)
[ 39.033652] dump_stack_lvl+0xb4/0xc4
[ 39.033664] dump_stack+0x14/0x1c
[ 39.033671] lockdep_rcu_suspicious+0x16c/0x22c
[ 39.033682] phy_detach+0x160/0x190
[ 39.033694] phy_disconnect+0x40/0x54
[ 39.033703] ravb_close+0x6c/0x1cc
[ 39.033714] ravb_suspend+0x48/0x120
[ 39.033721] dpm_run_callback+0x4c/0x14c
[ 39.033731] device_suspend+0x11c/0x4dc
[ 39.033740] dpm_suspend+0xdc/0x214
[ 39.033748] dpm_suspend_start+0x48/0x60
[ 39.033758] suspend_devices_and_enter+0x124/0x574
[ 39.033769] pm_suspend+0x1ac/0x274
[ 39.033778] state_store+0x88/0x124
[ 39.033788] kobj_attr_store+0x14/0x24
[ 39.033798] sysfs_kf_write+0x48/0x6c
[ 39.033808] kernfs_fop_write_iter+0x118/0x1a8
[ 39.033817] vfs_write+0x27c/0x378
[ 39.033825] ksys_write+0x64/0xf4
[ 39.033833] __arm64_sys_write+0x18/0x20
[ 39.033841] invoke_syscall+0x44/0x104
[ 39.033852] el0_svc_common.constprop.0+0xb4/0xd4
[ 39.033862] do_el0_svc+0x18/0x20
[ 39.033870] el0_svc+0x3c/0xf0
[ 39.033880] el0t_64_sync_handler+0xc0/0xc4
[ 39.033888] el0t_64_sync+0x154/0x158
[ 39.041274] ravb 11c30000.ethernet eth0: Link is Down
Tested-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reported-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Closes: https://lore.kernel.org/netdev/4c6419d8-c06b-495c-b987-d66c2e1ff848@tuxon.dev/
Fixes: 35f7cad1743e ("net: Add the possibility to support a selected hwtstamp in netdevice")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
Changes in v3:
- Update the commit message with the stack trace.
Changes in v2:
- Add a missing ;
---
drivers/net/phy/phy_device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 5b34d39d1d52..3eeee7cba923 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2001,12 +2001,14 @@ void phy_detach(struct phy_device *phydev)
if (dev) {
struct hwtstamp_provider *hwprov;
- hwprov = rtnl_dereference(dev->hwprov);
+ rcu_read_lock();
+ hwprov = rcu_dereference(dev->hwprov);
/* Disable timestamp if it is the one selected */
if (hwprov && hwprov->phydev == phydev) {
rcu_assign_pointer(dev->hwprov, NULL);
kfree_rcu(hwprov, rcu_head);
}
+ rcu_read_unlock();
phydev->attached_dev->phydev = NULL;
phydev->attached_dev = NULL;
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-20 14:19 [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage Kory Maincent @ 2025-01-20 19:12 ` Jakub Kicinski 2025-01-21 9:38 ` Kory Maincent 0 siblings, 1 reply; 14+ messages in thread From: Jakub Kicinski @ 2025-01-20 19:12 UTC (permalink / raw) To: Kory Maincent Cc: David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: > The path reported to not having RTNL lock acquired is the suspend path of > the ravb MAC driver. Without this fix we got this warning: I maintain that ravb is buggy, plenty of drivers take rtnl_lock from the .suspend callback. We need _some_ write protection here, the patch as is only silences a legitimate warning. -- pw-bot: cr ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-20 19:12 ` Jakub Kicinski @ 2025-01-21 9:38 ` Kory Maincent 2025-01-21 10:02 ` Russell King (Oracle) ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Kory Maincent @ 2025-01-21 9:38 UTC (permalink / raw) To: Jakub Kicinski Cc: David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni On Mon, 20 Jan 2025 11:12:28 -0800 Jakub Kicinski <kuba@kernel.org> wrote: > On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: > > The path reported to not having RTNL lock acquired is the suspend path of > > the ravb MAC driver. Without this fix we got this warning: > > I maintain that ravb is buggy, plenty of drivers take rtnl_lock > from the .suspend callback. We need _some_ write protection here, > the patch as is only silences a legitimate warning. Indeed if the suspend path is buggy we should fix it. Still there is lots of ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an error or in the remove path. What should we do about it? About ravb suspend, I don't have the board, Claudiu could you try this instead of the current fix: diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev) if (!netif_running(ndev)) goto reset_assert; + rtnl_lock(); netif_device_detach(ndev); - if (priv->wol_enabled) - return ravb_wol_setup(ndev); + if (priv->wol_enabled) { + ret = ravb_wol_setup(ndev); + rtnl_unlock(); + return ret; + } ret = ravb_close(ndev); - if (ret) + if (ret) { + rtnl_unlock(); return ret; + } + rtnl_unlock(); ret = pm_runtime_force_suspend(&priv->pdev->dev); if (ret) return ret; Regards, -- Köry Maincent, Bootlin Embedded Linux and kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 9:38 ` Kory Maincent @ 2025-01-21 10:02 ` Russell King (Oracle) 2025-01-21 10:29 ` Claudiu Beznea 2025-01-21 11:34 ` Paul Barker 2 siblings, 0 replies; 14+ messages in thread From: Russell King (Oracle) @ 2025-01-21 10:02 UTC (permalink / raw) To: Kory Maincent Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Eric Dumazet, Paolo Abeni On Tue, Jan 21, 2025 at 10:38:45AM +0100, Kory Maincent wrote: > On Mon, 20 Jan 2025 11:12:28 -0800 > Jakub Kicinski <kuba@kernel.org> wrote: > > > On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: > > > The path reported to not having RTNL lock acquired is the suspend path of > > > the ravb MAC driver. Without this fix we got this warning: > > > > I maintain that ravb is buggy, plenty of drivers take rtnl_lock > > from the .suspend callback. We need _some_ write protection here, > > the patch as is only silences a legitimate warning. > > Indeed if the suspend path is buggy we should fix it. Still there is lots of > ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an > error or in the remove path. What should we do about it? They could trigger the same warning, although I think they would be relatively safe because register_netdev() hasn't been called, and thus nothing that the netdev provides should be used. (If it can be used, as the driver has not completed initialisation, then it's probably racy anyway.) I don't think throwing ASSERT_RTNL() into phy_detach() will do anything to solve this. If the RCU warning doesn't trigger (because phy_detach() only gets called on error which practically never happens) then ASSERT_RTNL() isn't going to trigger either. Warnings in functions will only work when they're called in a context that will trigger the warning! So, I think it's something that can only be addressed by reviewing drivers and patching. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 9:38 ` Kory Maincent 2025-01-21 10:02 ` Russell King (Oracle) @ 2025-01-21 10:29 ` Claudiu Beznea 2025-01-21 11:34 ` Paul Barker 2 siblings, 0 replies; 14+ messages in thread From: Claudiu Beznea @ 2025-01-21 10:29 UTC (permalink / raw) To: Kory Maincent, Jakub Kicinski Cc: David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni Hi, Kory, On 21.01.2025 11:38, Kory Maincent wrote: > On Mon, 20 Jan 2025 11:12:28 -0800 > Jakub Kicinski <kuba@kernel.org> wrote: > >> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: >>> The path reported to not having RTNL lock acquired is the suspend path of >>> the ravb MAC driver. Without this fix we got this warning: >> >> I maintain that ravb is buggy, plenty of drivers take rtnl_lock >> from the .suspend callback. We need _some_ write protection here, >> the patch as is only silences a legitimate warning. > > Indeed if the suspend path is buggy we should fix it. Still there is lots of > ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an > error or in the remove path. What should we do about it? > > About ravb suspend, I don't have the board, Claudiu could you try this instead > of the current fix: > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c > b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371 > 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c > @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev) > if (!netif_running(ndev)) > goto reset_assert; > > + rtnl_lock(); > netif_device_detach(ndev); > > - if (priv->wol_enabled) > - return ravb_wol_setup(ndev); > + if (priv->wol_enabled) { > + ret = ravb_wol_setup(ndev); > + rtnl_unlock(); > + return ret; > + } > > ret = ravb_close(ndev); > - if (ret) > + if (ret) { > + rtnl_unlock(); > return ret; > + } > > + rtnl_unlock(); > ret = pm_runtime_force_suspend(&priv->pdev->dev); > if (ret) > return ret; The warning is gone with this, as well. Thank you, Claudiu > > Regards, ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 9:38 ` Kory Maincent 2025-01-21 10:02 ` Russell King (Oracle) 2025-01-21 10:29 ` Claudiu Beznea @ 2025-01-21 11:34 ` Paul Barker 2025-01-21 13:01 ` Kory Maincent 2 siblings, 1 reply; 14+ messages in thread From: Paul Barker @ 2025-01-21 11:34 UTC (permalink / raw) To: Kory Maincent, Jakub Kicinski Cc: David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov [-- Attachment #1.1.1: Type: text/plain, Size: 2819 bytes --] On 21/01/2025 09:38, Kory Maincent wrote: > On Mon, 20 Jan 2025 11:12:28 -0800 > Jakub Kicinski <kuba@kernel.org> wrote: > >> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: >>> The path reported to not having RTNL lock acquired is the suspend path of >>> the ravb MAC driver. Without this fix we got this warning: >> >> I maintain that ravb is buggy, plenty of drivers take rtnl_lock >> from the .suspend callback. We need _some_ write protection here, >> the patch as is only silences a legitimate warning. > > Indeed if the suspend path is buggy we should fix it. Still there is lots of > ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an > error or in the remove path. What should we do about it? > > About ravb suspend, I don't have the board, Claudiu could you try this instead > of the current fix: > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c > b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371 > 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c > @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev) > if (!netif_running(ndev)) > goto reset_assert; > > + rtnl_lock(); > netif_device_detach(ndev); > > - if (priv->wol_enabled) > - return ravb_wol_setup(ndev); > + if (priv->wol_enabled) { > + ret = ravb_wol_setup(ndev); > + rtnl_unlock(); > + return ret; > + } > > ret = ravb_close(ndev); > - if (ret) > + if (ret) { > + rtnl_unlock(); > return ret; > + } > > + rtnl_unlock(); > ret = pm_runtime_force_suspend(&priv->pdev->dev); > if (ret) > return ret; > > Regards, (Cc'ing Niklas and Sergey as this relates to the ravb driver) Why do we need to hold the rtnl mutex across the calls to netif_device_detach() and ravb_wol_setup()? My reading of Documentation/networking/netdevices.rst is that the rtnl mutex is held when the net subsystem calls the driver's ndo_stop method, which in our case is ravb_close(). So, we should take the rtnl mutex when we call ravb_close() directly, in both ravb_suspend() and ravb_wol_restore(). That would ensure that we do not call phy_disconnect() without holding the rtnl mutex and should fix this issue. Commit 35f7cad1743e ("net: Add the possibility to support a selected hwtstamp in netdevice") may have unearthed the issue, but the fixes tag should point to the commits adding those unlocked calls to ravb_close(). I am not super familiar with the rtnl lock so let me know if I've missed something. Thanks, -- Paul Barker [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 3577 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 236 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 11:34 ` Paul Barker @ 2025-01-21 13:01 ` Kory Maincent 2025-01-21 15:44 ` Paul Barker 0 siblings, 1 reply; 14+ messages in thread From: Kory Maincent @ 2025-01-21 13:01 UTC (permalink / raw) To: Paul Barker Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov On Tue, 21 Jan 2025 11:34:48 +0000 Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > On 21/01/2025 09:38, Kory Maincent wrote: > > On Mon, 20 Jan 2025 11:12:28 -0800 > > Jakub Kicinski <kuba@kernel.org> wrote: > > > >> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: > [...] > >> > >> I maintain that ravb is buggy, plenty of drivers take rtnl_lock > >> from the .suspend callback. We need _some_ write protection here, > >> the patch as is only silences a legitimate warning. > > > > Indeed if the suspend path is buggy we should fix it. Still there is lots of > > ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return > > an error or in the remove path. What should we do about it? > > > > About ravb suspend, I don't have the board, Claudiu could you try this > > instead of the current fix: > > > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c > > b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371 > > 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c > > +++ b/drivers/net/ethernet/renesas/ravb_main.c > > @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev) > > if (!netif_running(ndev)) > > goto reset_assert; > > > > + rtnl_lock(); > > netif_device_detach(ndev); > > > > - if (priv->wol_enabled) > > - return ravb_wol_setup(ndev); > > + if (priv->wol_enabled) { > > + ret = ravb_wol_setup(ndev); > > + rtnl_unlock(); > > + return ret; > > + } > > > > ret = ravb_close(ndev); > > - if (ret) > > + if (ret) { > > + rtnl_unlock(); > > return ret; > > + } > > > > + rtnl_unlock(); > > ret = pm_runtime_force_suspend(&priv->pdev->dev); > > if (ret) > > return ret; > > > > Regards, > > (Cc'ing Niklas and Sergey as this relates to the ravb driver) Yes, thanks. > Why do we need to hold the rtnl mutex across the calls to > netif_device_detach() and ravb_wol_setup()? > > My reading of Documentation/networking/netdevices.rst is that the rtnl > mutex is held when the net subsystem calls the driver's ndo_stop method, > which in our case is ravb_close(). So, we should take the rtnl mutex > when we call ravb_close() directly, in both ravb_suspend() and > ravb_wol_restore(). That would ensure that we do not call > phy_disconnect() without holding the rtnl mutex and should fix this > issue. Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() won't be protected by the rtnl lock. I don't know about netif_device_detach(). It doesn't seems to be the case as there is lots of driver using it without holding rtnl lock. Indeed we should add the rtnl lock also in the resume path. > Commit 35f7cad1743e ("net: Add the possibility to support a selected > hwtstamp in netdevice") may have unearthed the issue, but the fixes tag > should point to the commits adding those unlocked calls to ravb_close(). The current patch was on phy_device.c that's why the fixes tag does not point to a ravb commit, it will change. Regards, -- Köry Maincent, Bootlin Embedded Linux and kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 13:01 ` Kory Maincent @ 2025-01-21 15:44 ` Paul Barker 2025-01-21 16:11 ` Kory Maincent 0 siblings, 1 reply; 14+ messages in thread From: Paul Barker @ 2025-01-21 15:44 UTC (permalink / raw) To: Kory Maincent Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov [-- Attachment #1.1.1: Type: text/plain, Size: 3753 bytes --] On 21/01/2025 13:01, Kory Maincent wrote: > On Tue, 21 Jan 2025 11:34:48 +0000 > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > >> On 21/01/2025 09:38, Kory Maincent wrote: >>> On Mon, 20 Jan 2025 11:12:28 -0800 >>> Jakub Kicinski <kuba@kernel.org> wrote: >>> >>>> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote: >> [...] >>>> >>>> I maintain that ravb is buggy, plenty of drivers take rtnl_lock >>>> from the .suspend callback. We need _some_ write protection here, >>>> the patch as is only silences a legitimate warning. >>> >>> Indeed if the suspend path is buggy we should fix it. Still there is lots of >>> ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return >>> an error or in the remove path. What should we do about it? >>> >>> About ravb suspend, I don't have the board, Claudiu could you try this >>> instead of the current fix: >>> >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c >>> b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371 >>> 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c >>> @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev) >>> if (!netif_running(ndev)) >>> goto reset_assert; >>> >>> + rtnl_lock(); >>> netif_device_detach(ndev); >>> >>> - if (priv->wol_enabled) >>> - return ravb_wol_setup(ndev); >>> + if (priv->wol_enabled) { >>> + ret = ravb_wol_setup(ndev); >>> + rtnl_unlock(); >>> + return ret; >>> + } >>> >>> ret = ravb_close(ndev); >>> - if (ret) >>> + if (ret) { >>> + rtnl_unlock(); >>> return ret; >>> + } >>> >>> + rtnl_unlock(); >>> ret = pm_runtime_force_suspend(&priv->pdev->dev); >>> if (ret) >>> return ret; >>> >>> Regards, >> >> (Cc'ing Niklas and Sergey as this relates to the ravb driver) > > Yes, thanks. > >> Why do we need to hold the rtnl mutex across the calls to >> netif_device_detach() and ravb_wol_setup()? >> >> My reading of Documentation/networking/netdevices.rst is that the rtnl >> mutex is held when the net subsystem calls the driver's ndo_stop method, >> which in our case is ravb_close(). So, we should take the rtnl mutex >> when we call ravb_close() directly, in both ravb_suspend() and >> ravb_wol_restore(). That would ensure that we do not call >> phy_disconnect() without holding the rtnl mutex and should fix this >> issue. > > Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() won't > be protected by the rtnl lock. ravb_ptp_stop() modifies a couple of device registers and calls ptp_clock_unregister(). I don't see anything to suggest that this requires the rtnl lock to be held, unless I am missing something. > I don't know about netif_device_detach(). It doesn't seems to be the case as > there is lots of driver using it without holding rtnl lock. netif_device_detach() clears the present flag from the link state and stops all TX queues, so I don't think the rtnl lock needs to be held to call this. > > Indeed we should add the rtnl lock also in the resume path. > >> Commit 35f7cad1743e ("net: Add the possibility to support a selected >> hwtstamp in netdevice") may have unearthed the issue, but the fixes tag >> should point to the commits adding those unlocked calls to ravb_close(). > > The current patch was on phy_device.c that's why the fixes tag does not point to > a ravb commit, it will change. Ack. Thanks for digging in to this! -- Paul Barker [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 3577 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 236 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 15:44 ` Paul Barker @ 2025-01-21 16:11 ` Kory Maincent 2025-01-22 14:03 ` Paul Barker ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Kory Maincent @ 2025-01-21 16:11 UTC (permalink / raw) To: Paul Barker Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov On Tue, 21 Jan 2025 15:44:34 +0000 Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > On 21/01/2025 13:01, Kory Maincent wrote: > > On Tue, 21 Jan 2025 11:34:48 +0000 > > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > > > >> On 21/01/2025 09:38, Kory Maincent wrote: > [...] > [...] > >> [...] > [...] > [...] > >> > >> (Cc'ing Niklas and Sergey as this relates to the ravb driver) > > > > Yes, thanks. > > > >> Why do we need to hold the rtnl mutex across the calls to > >> netif_device_detach() and ravb_wol_setup()? > >> > >> My reading of Documentation/networking/netdevices.rst is that the rtnl > >> mutex is held when the net subsystem calls the driver's ndo_stop method, > >> which in our case is ravb_close(). So, we should take the rtnl mutex > >> when we call ravb_close() directly, in both ravb_suspend() and > >> ravb_wol_restore(). That would ensure that we do not call > >> phy_disconnect() without holding the rtnl mutex and should fix this > >> issue. > > > > Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() > > won't be protected by the rtnl lock. > > ravb_ptp_stop() modifies a couple of device registers and calls > ptp_clock_unregister(). I don't see anything to suggest that this > requires the rtnl lock to be held, unless I am missing something. What happens if two ptp_clock_unregister() with the same ptp_clock pointer are called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It may cause some errors. For example the ptp->kworker pointer could be used after a kfree. https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416 Regards, -- Köry Maincent, Bootlin Embedded Linux and kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 16:11 ` Kory Maincent @ 2025-01-22 14:03 ` Paul Barker 2025-01-22 16:12 ` Kory Maincent 2025-01-23 11:25 ` Claudiu Beznea 2025-01-23 15:45 ` Russell King (Oracle) 2 siblings, 1 reply; 14+ messages in thread From: Paul Barker @ 2025-01-22 14:03 UTC (permalink / raw) To: Kory Maincent Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov [-- Attachment #1.1.1: Type: text/plain, Size: 2280 bytes --] On 21/01/2025 16:11, Kory Maincent wrote: > On Tue, 21 Jan 2025 15:44:34 +0000 > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > >> On 21/01/2025 13:01, Kory Maincent wrote: >> >>> On Tue, 21 Jan 2025 11:34:48 +0000 >>> Paul Barker <paul.barker.ct@bp.renesas.com> wrote: >>> >>>> Why do we need to hold the rtnl mutex across the calls to >>>> netif_device_detach() and ravb_wol_setup()? >>>> >>>> My reading of Documentation/networking/netdevices.rst is that the rtnl >>>> mutex is held when the net subsystem calls the driver's ndo_stop method, >>>> which in our case is ravb_close(). So, we should take the rtnl mutex >>>> when we call ravb_close() directly, in both ravb_suspend() and >>>> ravb_wol_restore(). That would ensure that we do not call >>>> phy_disconnect() without holding the rtnl mutex and should fix this >>>> issue. >>> >>> Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() >>> won't be protected by the rtnl lock. >> >> ravb_ptp_stop() modifies a couple of device registers and calls >> ptp_clock_unregister(). I don't see anything to suggest that this >> requires the rtnl lock to be held, unless I am missing something. > > What happens if two ptp_clock_unregister() with the same ptp_clock pointer are > called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It > may cause some errors. > For example the ptp->kworker pointer could be used after a kfree. > https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416 I've dug into this some more today and looked at the suspend/resume paths of other Ethernet drivers for comparison. netif_device_detach() and netif_device_attach() seem to be safe to call without holding the rtnl lock, e.g. the stmmac driver does this. In the suspend path, we should hold the rtnl lock across the calls to ravb_wol_setup() and ravb_close(). In the resume path, we should hold the rtnl lock across the calls to ravb_wol_restore() and ravb_open(). I don't think there is any harm in holding the rtnl lock while we call pm_runtime_force_resume(), so we can take the lock before checking priv->wol_enabled and release after calling ravb_open(). How does that sound? Thanks, -- Paul Barker [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 3577 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 236 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-22 14:03 ` Paul Barker @ 2025-01-22 16:12 ` Kory Maincent 0 siblings, 0 replies; 14+ messages in thread From: Kory Maincent @ 2025-01-22 16:12 UTC (permalink / raw) To: Paul Barker Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov On Wed, 22 Jan 2025 14:03:37 +0000 Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > On 21/01/2025 16:11, Kory Maincent wrote: > > On Tue, 21 Jan 2025 15:44:34 +0000 > > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > > > >> On 21/01/2025 13:01, Kory Maincent wrote: > >> > [...] > [...] > [...] > >> > >> ravb_ptp_stop() modifies a couple of device registers and calls > >> ptp_clock_unregister(). I don't see anything to suggest that this > >> requires the rtnl lock to be held, unless I am missing something. > > > > What happens if two ptp_clock_unregister() with the same ptp_clock pointer > > are called simultaneously? From ravb_suspend and ravb_set_ringparam for > > example. It may cause some errors. > > For example the ptp->kworker pointer could be used after a kfree. > > https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416 > > > > I've dug into this some more today and looked at the suspend/resume > paths of other Ethernet drivers for comparison. > > netif_device_detach() and netif_device_attach() seem to be safe to call > without holding the rtnl lock, e.g. the stmmac driver does this. > > In the suspend path, we should hold the rtnl lock across the calls to > ravb_wol_setup() and ravb_close(). > > In the resume path, we should hold the rtnl lock across the calls to > ravb_wol_restore() and ravb_open(). I don't think there is any harm in > holding the rtnl lock while we call pm_runtime_force_resume(), so we can > take the lock before checking priv->wol_enabled and release after > calling ravb_open(). > > How does that sound? Yes that sound nice, and similar to what I thought. I will send a patch, I also found the same issue on sh_eth driver. Regards -- Köry Maincent, Bootlin Embedded Linux and kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 16:11 ` Kory Maincent 2025-01-22 14:03 ` Paul Barker @ 2025-01-23 11:25 ` Claudiu Beznea 2025-01-23 14:05 ` Kory Maincent 2025-01-23 15:45 ` Russell King (Oracle) 2 siblings, 1 reply; 14+ messages in thread From: Claudiu Beznea @ 2025-01-23 11:25 UTC (permalink / raw) To: Kory Maincent, Paul Barker Cc: Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov Hi, Kory, On 21.01.2025 18:11, Kory Maincent wrote: > On Tue, 21 Jan 2025 15:44:34 +0000 > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > >> On 21/01/2025 13:01, Kory Maincent wrote: >>> On Tue, 21 Jan 2025 11:34:48 +0000 >>> Paul Barker <paul.barker.ct@bp.renesas.com> wrote: >>> >>>> On 21/01/2025 09:38, Kory Maincent wrote: >> [...] >> [...] >>>> [...] >> [...] >> [...] >>>> >>>> (Cc'ing Niklas and Sergey as this relates to the ravb driver) >>> >>> Yes, thanks. >>> >>>> Why do we need to hold the rtnl mutex across the calls to >>>> netif_device_detach() and ravb_wol_setup()? >>>> >>>> My reading of Documentation/networking/netdevices.rst is that the rtnl >>>> mutex is held when the net subsystem calls the driver's ndo_stop method, >>>> which in our case is ravb_close(). So, we should take the rtnl mutex >>>> when we call ravb_close() directly, in both ravb_suspend() and >>>> ravb_wol_restore(). That would ensure that we do not call >>>> phy_disconnect() without holding the rtnl mutex and should fix this >>>> issue. >>> >>> Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() >>> won't be protected by the rtnl lock. >> >> ravb_ptp_stop() modifies a couple of device registers and calls >> ptp_clock_unregister(). I don't see anything to suggest that this >> requires the rtnl lock to be held, unless I am missing something. > > What happens if two ptp_clock_unregister() with the same ptp_clock pointer are > called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It > may cause some errors. Can this happen? I see ethtool_ops::set_ringparam() references only in ethtool or ioctl files: net/ethtool/ioctl.c:2066 net/ethtool/ioctl.c:2081 net/ethtool/rings.c:212 net/ethtool/rings.c:304 At the time the suspend/resume APIs are called the user space threads are frozen. Thank you, Claudiu > For example the ptp->kworker pointer could be used after a kfree. > https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416 > > Regards, ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-23 11:25 ` Claudiu Beznea @ 2025-01-23 14:05 ` Kory Maincent 0 siblings, 0 replies; 14+ messages in thread From: Kory Maincent @ 2025-01-23 14:05 UTC (permalink / raw) To: Claudiu Beznea Cc: Paul Barker, Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Russell King, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov On Thu, 23 Jan 2025 13:25:57 +0200 Claudiu Beznea <claudiu.beznea@tuxon.dev> wrote: > >> ravb_ptp_stop() modifies a couple of device registers and calls > >> ptp_clock_unregister(). I don't see anything to suggest that this > >> requires the rtnl lock to be held, unless I am missing something. > > > > What happens if two ptp_clock_unregister() with the same ptp_clock pointer > > are called simultaneously? From ravb_suspend and ravb_set_ringparam for > > example. It may cause some errors. > > Can this happen? I see ethtool_ops::set_ringparam() references only in > ethtool or ioctl files: > > net/ethtool/ioctl.c:2066 > net/ethtool/ioctl.c:2081 > net/ethtool/rings.c:212 > net/ethtool/rings.c:304 > > At the time the suspend/resume APIs are called the user space threads are > frozen. Maybe, I don't know the suspend path, and what the state of user space threads at that time. This was an example but Wake on Lan setup could also have some issue. IMHO I think it is more precautions to have it under rtnl lock. Regards, -- Köry Maincent, Bootlin Embedded Linux and kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage 2025-01-21 16:11 ` Kory Maincent 2025-01-22 14:03 ` Paul Barker 2025-01-23 11:25 ` Claudiu Beznea @ 2025-01-23 15:45 ` Russell King (Oracle) 2 siblings, 0 replies; 14+ messages in thread From: Russell King (Oracle) @ 2025-01-23 15:45 UTC (permalink / raw) To: Kory Maincent Cc: Paul Barker, Jakub Kicinski, David S. Miller, netdev, linux-kernel, Claudiu Beznea, thomas.petazzoni, Andrew Lunn, Heiner Kallweit, Eric Dumazet, Paolo Abeni, Niklas Söderlund, Sergey Shtylyov On Tue, Jan 21, 2025 at 05:11:56PM +0100, Kory Maincent wrote: > On Tue, 21 Jan 2025 15:44:34 +0000 > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > > > On 21/01/2025 13:01, Kory Maincent wrote: > > > On Tue, 21 Jan 2025 11:34:48 +0000 > > > Paul Barker <paul.barker.ct@bp.renesas.com> wrote: > > > > > >> On 21/01/2025 09:38, Kory Maincent wrote: > > [...] > > [...] > > >> [...] > > [...] > > [...] > > >> > > >> (Cc'ing Niklas and Sergey as this relates to the ravb driver) > > > > > > Yes, thanks. > > > > > >> Why do we need to hold the rtnl mutex across the calls to > > >> netif_device_detach() and ravb_wol_setup()? > > >> > > >> My reading of Documentation/networking/netdevices.rst is that the rtnl > > >> mutex is held when the net subsystem calls the driver's ndo_stop method, > > >> which in our case is ravb_close(). So, we should take the rtnl mutex > > >> when we call ravb_close() directly, in both ravb_suspend() and > > >> ravb_wol_restore(). That would ensure that we do not call > > >> phy_disconnect() without holding the rtnl mutex and should fix this > > >> issue. > > > > > > Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() > > > won't be protected by the rtnl lock. > > > > ravb_ptp_stop() modifies a couple of device registers and calls > > ptp_clock_unregister(). I don't see anything to suggest that this > > requires the rtnl lock to be held, unless I am missing something. > > What happens if two ptp_clock_unregister() with the same ptp_clock pointer are > called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It > may cause some errors. > For example the ptp->kworker pointer could be used after a kfree. > https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416 Taking a look at where ravb_ptp_stop() is called from: 1. ravb_set_ringparam(). ethtool operation. RTNL will be held for this. 2. ravb_open() error-cleanup. RTNL will be held for this. 3. ravb_tx_timeout_work(). rtnl_trylock() is called and we will only call through to the above function if we grabbed the RTNL. 4. ravb_close(), again RTNL will be held here. 5. ravb_wol_setup(). Another ethtool operation. (1) applies. Hence, it is not possible for two threads to execute ravb_ptp_stop() symultaneously. However, if ptp_clock_register() in ravb_ptp_init() fails, then priv->ptp.clock will be set to an error-pointer, and subsequently passed to ptp_clock_unregister() which would cause a kernel oops. No one seems to have thought about that... and that definitely needs fixing. However, one wonders why it's necessary to unregister a _user_ _interface_ when responding to a change in WoL, ring parameters, or merely handling a transmit timeout. It doesn't seem particularly nice to userspace for a device that its using to suddenly go away for these reasons. I wonder whether anyone has tested anything that uses the PTP clock interfaces while changing e.g. the WoL settings. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-01-23 15:46 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-20 14:19 [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage Kory Maincent 2025-01-20 19:12 ` Jakub Kicinski 2025-01-21 9:38 ` Kory Maincent 2025-01-21 10:02 ` Russell King (Oracle) 2025-01-21 10:29 ` Claudiu Beznea 2025-01-21 11:34 ` Paul Barker 2025-01-21 13:01 ` Kory Maincent 2025-01-21 15:44 ` Paul Barker 2025-01-21 16:11 ` Kory Maincent 2025-01-22 14:03 ` Paul Barker 2025-01-22 16:12 ` Kory Maincent 2025-01-23 11:25 ` Claudiu Beznea 2025-01-23 14:05 ` Kory Maincent 2025-01-23 15:45 ` Russell King (Oracle)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).