netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
@ 2022-02-09 12:04 Vladimir Oltean
  2022-02-09 13:30 ` patchwork-bot+netdevbpf
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Vladimir Oltean @ 2022-02-09 12:04 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, Vladimir Oltean,
	David S. Miller, Jakub Kicinski, Rafael Richter, Daniel Klauer,
	Lino Sanfilippo

Rafael reports that on a system with LX2160A and Marvell DSA switches,
if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
panic can be seen:

systemd-shutdown[1]: Rebooting.
Unable to handle kernel paging request at virtual address 00a0000800000041
[00a0000800000041] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
pc : dsa_slave_netdevice_event+0x130/0x3e4
lr : raw_notifier_call_chain+0x50/0x6c
Call trace:
 dsa_slave_netdevice_event+0x130/0x3e4
 raw_notifier_call_chain+0x50/0x6c
 call_netdevice_notifiers_info+0x54/0xa0
 __dev_close_many+0x50/0x130
 dev_close_many+0x84/0x120
 unregister_netdevice_many+0x130/0x710
 unregister_netdevice_queue+0x8c/0xd0
 unregister_netdev+0x20/0x30
 dpaa2_eth_remove+0x68/0x190
 fsl_mc_driver_remove+0x20/0x5c
 __device_release_driver+0x21c/0x220
 device_release_driver_internal+0xac/0xb0
 device_links_unbind_consumers+0xd4/0x100
 __device_release_driver+0x94/0x220
 device_release_driver+0x28/0x40
 bus_remove_device+0x118/0x124
 device_del+0x174/0x420
 fsl_mc_device_remove+0x24/0x40
 __fsl_mc_device_remove+0xc/0x20
 device_for_each_child+0x58/0xa0
 dprc_remove+0x90/0xb0
 fsl_mc_driver_remove+0x20/0x5c
 __device_release_driver+0x21c/0x220
 device_release_driver+0x28/0x40
 bus_remove_device+0x118/0x124
 device_del+0x174/0x420
 fsl_mc_bus_remove+0x80/0x100
 fsl_mc_bus_shutdown+0xc/0x1c
 platform_shutdown+0x20/0x30
 device_shutdown+0x154/0x330
 __do_sys_reboot+0x1cc/0x250
 __arm64_sys_reboot+0x20/0x30
 invoke_syscall.constprop.0+0x4c/0xe0
 do_el0_svc+0x4c/0x150
 el0_svc+0x24/0xb0
 el0t_64_sync_handler+0xa8/0xb0
 el0t_64_sync+0x178/0x17c

It can be seen from the stack trace that the problem is that the
deregistration of the master causes a dev_close(), which gets notified
as NETDEV_GOING_DOWN to dsa_slave_netdevice_event().
But dsa_switch_shutdown() has already run, and this has unregistered the
DSA slave interfaces, and yet, the NETDEV_GOING_DOWN handler attempts to
call dev_close_many() on those slave interfaces, leading to the problem.

The previous attempt to avoid the NETDEV_GOING_DOWN on the master after
dsa_switch_shutdown() was called seems improper. Unregistering the slave
interfaces is unnecessary and unhelpful. Instead, after the slaves have
stopped being uppers of the DSA master, we can now reset to NULL the
master->dsa_ptr pointer, which will make DSA start ignoring all future
notifier events on the master.

Fixes: 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")
Reported-by: Rafael Richter <rafael.richter@gin.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa2.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 909b045c9b11..e498c927c3d0 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1784,7 +1784,6 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
 void dsa_switch_shutdown(struct dsa_switch *ds)
 {
 	struct net_device *master, *slave_dev;
-	LIST_HEAD(unregister_list);
 	struct dsa_port *dp;
 
 	mutex_lock(&dsa2_mutex);
@@ -1795,25 +1794,13 @@ void dsa_switch_shutdown(struct dsa_switch *ds)
 		slave_dev = dp->slave;
 
 		netdev_upper_dev_unlink(master, slave_dev);
-		/* Just unlinking ourselves as uppers of the master is not
-		 * sufficient. When the master net device unregisters, that will
-		 * also call dev_close, which we will catch as NETDEV_GOING_DOWN
-		 * and trigger a dev_close on our own devices (dsa_slave_close).
-		 * In turn, that will call dev_mc_unsync on the master's net
-		 * device. If the master is also a DSA switch port, this will
-		 * trigger dsa_slave_set_rx_mode which will call dev_mc_sync on
-		 * its own master. Lockdep will complain about the fact that
-		 * all cascaded masters have the same dsa_master_addr_list_lock_key,
-		 * which it normally would not do if the cascaded masters would
-		 * be in a proper upper/lower relationship, which we've just
-		 * destroyed.
-		 * To suppress the lockdep warnings, let's actually unregister
-		 * the DSA slave interfaces too, to avoid the nonsensical
-		 * multicast address list synchronization on shutdown.
-		 */
-		unregister_netdevice_queue(slave_dev, &unregister_list);
 	}
-	unregister_netdevice_many(&unregister_list);
+
+	/* Disconnect from further netdevice notifiers on the master,
+	 * since netdev_uses_dsa() will now return false.
+	 */
+	dsa_switch_for_each_cpu_port(dp, ds)
+		dp->master->dsa_ptr = NULL;
 
 	rtnl_unlock();
 	mutex_unlock(&dsa2_mutex);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2022-02-09 12:04 [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown Vladimir Oltean
@ 2022-02-09 13:30 ` patchwork-bot+netdevbpf
  2024-04-10  9:03 ` Some questions " xu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-02-09 13:30 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, f.fainelli, andrew, vivien.didelot, olteanv, davem, kuba,
	rafael.richter, daniel.klauer, LinoSanfilippo

Hello:

This patch was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Wed,  9 Feb 2022 14:04:33 +0200 you wrote:
> Rafael reports that on a system with LX2160A and Marvell DSA switches,
> if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
> panic can be seen:
> 
> systemd-shutdown[1]: Rebooting.
> Unable to handle kernel paging request at virtual address 00a0000800000041
> [00a0000800000041] address between user and kernel address ranges
> Internal error: Oops: 96000004 [#1] PREEMPT SMP
> CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
> pc : dsa_slave_netdevice_event+0x130/0x3e4
> lr : raw_notifier_call_chain+0x50/0x6c
> Call trace:
>  dsa_slave_netdevice_event+0x130/0x3e4
>  raw_notifier_call_chain+0x50/0x6c
>  call_netdevice_notifiers_info+0x54/0xa0
>  __dev_close_many+0x50/0x130
>  dev_close_many+0x84/0x120
>  unregister_netdevice_many+0x130/0x710
>  unregister_netdevice_queue+0x8c/0xd0
>  unregister_netdev+0x20/0x30
>  dpaa2_eth_remove+0x68/0x190
>  fsl_mc_driver_remove+0x20/0x5c
>  __device_release_driver+0x21c/0x220
>  device_release_driver_internal+0xac/0xb0
>  device_links_unbind_consumers+0xd4/0x100
>  __device_release_driver+0x94/0x220
>  device_release_driver+0x28/0x40
>  bus_remove_device+0x118/0x124
>  device_del+0x174/0x420
>  fsl_mc_device_remove+0x24/0x40
>  __fsl_mc_device_remove+0xc/0x20
>  device_for_each_child+0x58/0xa0
>  dprc_remove+0x90/0xb0
>  fsl_mc_driver_remove+0x20/0x5c
>  __device_release_driver+0x21c/0x220
>  device_release_driver+0x28/0x40
>  bus_remove_device+0x118/0x124
>  device_del+0x174/0x420
>  fsl_mc_bus_remove+0x80/0x100
>  fsl_mc_bus_shutdown+0xc/0x1c
>  platform_shutdown+0x20/0x30
>  device_shutdown+0x154/0x330
>  __do_sys_reboot+0x1cc/0x250
>  __arm64_sys_reboot+0x20/0x30
>  invoke_syscall.constprop.0+0x4c/0xe0
>  do_el0_svc+0x4c/0x150
>  el0_svc+0x24/0xb0
>  el0t_64_sync_handler+0xa8/0xb0
>  el0t_64_sync+0x178/0x17c
> 
> [...]

Here is the summary with links:
  - [net] net: dsa: fix panic when DSA master device unbinds on shutdown
    https://git.kernel.org/netdev/net/c/ee534378f005

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2022-02-09 12:04 [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown Vladimir Oltean
  2022-02-09 13:30 ` patchwork-bot+netdevbpf
@ 2024-04-10  9:03 ` xu
  2024-04-10  9:06 ` xu
  2024-09-04  8:03 ` Sverdlin, Alexander
  3 siblings, 0 replies; 13+ messages in thread
From: xu @ 2024-04-10  9:03 UTC (permalink / raw)
  To: gregkh
  Cc: vladimir.oltean, LinoSanfilippo, andrew, daniel.klauer, davem,
	f.fainelli, kuba, netdev, olteanv, rafael.richter, vivien.didelot

Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.

Maybe I missed something.


Thanks.
xu xin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2022-02-09 12:04 [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown Vladimir Oltean
  2022-02-09 13:30 ` patchwork-bot+netdevbpf
  2024-04-10  9:03 ` Some questions " xu
@ 2024-04-10  9:06 ` xu
  2024-04-10  9:14   ` Paolo Abeni
  2024-09-04  8:03 ` Sverdlin, Alexander
  3 siblings, 1 reply; 13+ messages in thread
From: xu @ 2024-04-10  9:06 UTC (permalink / raw)
  To: gregkh
  Cc: vladimir.oltean, LinoSanfilippo, andrew, daniel.klauer, davem,
	f.fainelli, kuba, netdev, olteanv, rafael.richter, vivien.didelot,
	xu.xin16

Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.

Maybe I missed something.


Thanks.
xu xin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-04-10  9:06 ` xu
@ 2024-04-10  9:14   ` Paolo Abeni
  2024-04-10 14:34     ` Vladimir Oltean
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2024-04-10  9:14 UTC (permalink / raw)
  To: xu, gregkh
  Cc: vladimir.oltean, LinoSanfilippo, andrew, daniel.klauer, davem,
	f.fainelli, kuba, netdev, olteanv, rafael.richter, vivien.didelot,
	xu.xin16

On Wed, 2024-04-10 at 09:06 +0000, xu wrote:
> Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.

Because it lacked the CC: stable tag?

You can still ask (or do) an explicit backport, please have a look at:

Documentation/process/stable-kernel-rules.rst

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-04-10  9:14   ` Paolo Abeni
@ 2024-04-10 14:34     ` Vladimir Oltean
  2024-04-10 14:55       ` Greg KH
  2024-04-10 16:30       ` Sasha Levin
  0 siblings, 2 replies; 13+ messages in thread
From: Vladimir Oltean @ 2024-04-10 14:34 UTC (permalink / raw)
  To: Paolo Abeni, gregkh
  Cc: xu, stable, vladimir.oltean, LinoSanfilippo, andrew,
	daniel.klauer, davem, f.fainelli, kuba, netdev, rafael.richter,
	vivien.didelot, xu.xin16

On Wed, Apr 10, 2024 at 11:14:09AM +0200, Paolo Abeni wrote:
> On Wed, 2024-04-10 at 09:06 +0000, xu wrote:
> > Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.
> 
> Because it lacked the CC: stable tag?
> 
> You can still ask (or do) an explicit backport, please have a look at:
> 
> Documentation/process/stable-kernel-rules.rst
> 
> Cheers,
> 
> Paolo
> 

My email records say that it was backported to 5.16:
https://lore.kernel.org/lkml/20220214092515.419944498@linuxfoundation.org/
On 5.15 I have no idea why not (no email).

Anyway, on linux-5.15.y, "git cherry-pick -xs ee534378f00561207656663d93907583958339ae"
does apply (it says "auto-merging"), so maybe Greg can just pick up the fix with one command?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-04-10 14:34     ` Vladimir Oltean
@ 2024-04-10 14:55       ` Greg KH
  2024-04-10 16:30       ` Sasha Levin
  1 sibling, 0 replies; 13+ messages in thread
From: Greg KH @ 2024-04-10 14:55 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Paolo Abeni, xu, stable, vladimir.oltean, LinoSanfilippo, andrew,
	daniel.klauer, davem, f.fainelli, kuba, netdev, rafael.richter,
	vivien.didelot, xu.xin16

On Wed, Apr 10, 2024 at 05:34:19PM +0300, Vladimir Oltean wrote:
> On Wed, Apr 10, 2024 at 11:14:09AM +0200, Paolo Abeni wrote:
> > On Wed, 2024-04-10 at 09:06 +0000, xu wrote:
> > > Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.
> > 
> > Because it lacked the CC: stable tag?
> > 
> > You can still ask (or do) an explicit backport, please have a look at:
> > 
> > Documentation/process/stable-kernel-rules.rst
> > 
> > Cheers,
> > 
> > Paolo
> > 
> 
> My email records say that it was backported to 5.16:
> https://lore.kernel.org/lkml/20220214092515.419944498@linuxfoundation.org/
> On 5.15 I have no idea why not (no email).
> 
> Anyway, on linux-5.15.y, "git cherry-pick -xs ee534378f00561207656663d93907583958339ae"
> does apply (it says "auto-merging"), so maybe Greg can just pick up the fix with one command?
> 

Now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some questions Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-04-10 14:34     ` Vladimir Oltean
  2024-04-10 14:55       ` Greg KH
@ 2024-04-10 16:30       ` Sasha Levin
  1 sibling, 0 replies; 13+ messages in thread
From: Sasha Levin @ 2024-04-10 16:30 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Paolo Abeni, gregkh, xu, stable, vladimir.oltean, LinoSanfilippo,
	andrew, daniel.klauer, davem, f.fainelli, kuba, netdev,
	rafael.richter, vivien.didelot, xu.xin16

On Wed, Apr 10, 2024 at 05:34:19PM +0300, Vladimir Oltean wrote:
>On Wed, Apr 10, 2024 at 11:14:09AM +0200, Paolo Abeni wrote:
>> On Wed, 2024-04-10 at 09:06 +0000, xu wrote:
>> > Hi! Excuse me, I'm wondering why this patch was not merged into the 5.15 stable branch.
>>
>> Because it lacked the CC: stable tag?
>>
>> You can still ask (or do) an explicit backport, please have a look at:
>>
>> Documentation/process/stable-kernel-rules.rst
>>
>> Cheers,
>>
>> Paolo
>>
>
>My email records say that it was backported to 5.16:
>https://lore.kernel.org/lkml/20220214092515.419944498@linuxfoundation.org/
>On 5.15 I have no idea why not (no email).

Happy to answer why!

It was proposed for 5.15, but dropped due to build failures:

	https://lore.kernel.org/all/202202131427.SK7CctaU-lkp@intel.com/

The missing functions were later brought to the 5.15 tree via:

	https://lore.kernel.org/all/20230601131937.674646135@linuxfoundation.org/

At this point, Greg's current backport of the commit in question was
successful.

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2022-02-09 12:04 [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown Vladimir Oltean
                   ` (2 preceding siblings ...)
  2024-04-10  9:06 ` xu
@ 2024-09-04  8:03 ` Sverdlin, Alexander
  2024-09-05  7:11   ` Sverdlin, Alexander
  3 siblings, 1 reply; 13+ messages in thread
From: Sverdlin, Alexander @ 2024-09-04  8:03 UTC (permalink / raw)
  To: netdev@vger.kernel.org, vladimir.oltean@nxp.com
  Cc: andrew@lunn.ch, olteanv@gmail.com, daniel.klauer@gin.de,
	davem@davemloft.net, vivien.didelot@gmail.com,
	LinoSanfilippo@gmx.de, f.fainelli@gmail.com, kuba@kernel.org,
	rafael.richter@gin.de

Hello Vladimir!

On Wed, 2022-02-09 at 14:04 +0200, Vladimir Oltean wrote:
> Rafael reports that on a system with LX2160A and Marvell DSA switches,
> if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
> panic can be seen:
> 
> systemd-shutdown[1]: Rebooting.
> Unable to handle kernel paging request at virtual address 00a0000800000041
> [00a0000800000041] address between user and kernel address ranges
> Internal error: Oops: 96000004 [#1] PREEMPT SMP
> CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
> pc : dsa_slave_netdevice_event+0x130/0x3e4
> lr : raw_notifier_call_chain+0x50/0x6c
> Call trace:
>  dsa_slave_netdevice_event+0x130/0x3e4
>  raw_notifier_call_chain+0x50/0x6c
>  call_netdevice_notifiers_info+0x54/0xa0
>  __dev_close_many+0x50/0x130
>  dev_close_many+0x84/0x120
>  unregister_netdevice_many+0x130/0x710
>  unregister_netdevice_queue+0x8c/0xd0
>  unregister_netdev+0x20/0x30
>  dpaa2_eth_remove+0x68/0x190
>  fsl_mc_driver_remove+0x20/0x5c
>  __device_release_driver+0x21c/0x220
>  device_release_driver_internal+0xac/0xb0
>  device_links_unbind_consumers+0xd4/0x100
>  __device_release_driver+0x94/0x220
>  device_release_driver+0x28/0x40
>  bus_remove_device+0x118/0x124
>  device_del+0x174/0x420
>  fsl_mc_device_remove+0x24/0x40
>  __fsl_mc_device_remove+0xc/0x20
>  device_for_each_child+0x58/0xa0
>  dprc_remove+0x90/0xb0
>  fsl_mc_driver_remove+0x20/0x5c
>  __device_release_driver+0x21c/0x220
>  device_release_driver+0x28/0x40
>  bus_remove_device+0x118/0x124
>  device_del+0x174/0x420
>  fsl_mc_bus_remove+0x80/0x100
>  fsl_mc_bus_shutdown+0xc/0x1c
>  platform_shutdown+0x20/0x30
>  device_shutdown+0x154/0x330
>  __do_sys_reboot+0x1cc/0x250
>  __arm64_sys_reboot+0x20/0x30
>  invoke_syscall.constprop.0+0x4c/0xe0
>  do_el0_svc+0x4c/0x150
>  el0_svc+0x24/0xb0
>  el0t_64_sync_handler+0xa8/0xb0
>  el0t_64_sync+0x178/0x17c
> 
> It can be seen from the stack trace that the problem is that the
> deregistration of the master causes a dev_close(), which gets notified
> as NETDEV_GOING_DOWN to dsa_slave_netdevice_event().
> But dsa_switch_shutdown() has already run, and this has unregistered the
> DSA slave interfaces, and yet, the NETDEV_GOING_DOWN handler attempts to
> call dev_close_many() on those slave interfaces, leading to the problem.
> 
> The previous attempt to avoid the NETDEV_GOING_DOWN on the master after
> dsa_switch_shutdown() was called seems improper. Unregistering the slave
> interfaces is unnecessary and unhelpful. Instead, after the slaves have
> stopped being uppers of the DSA master, we can now reset to NULL the
> master->dsa_ptr pointer, which will make DSA start ignoring all future
> notifier events on the master.
> 
> Fixes: 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")
> Reported-by: Rafael Richter <rafael.richter@gin.de>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/dsa/dsa2.c | 25 ++++++-------------------
>  1 file changed, 6 insertions(+), 19 deletions(-)
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index 909b045c9b11..e498c927c3d0 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -1784,7 +1784,6 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
>  void dsa_switch_shutdown(struct dsa_switch *ds)
>  {
>  	struct net_device *master, *slave_dev;
> -	LIST_HEAD(unregister_list);
>  	struct dsa_port *dp;
>  
>  	mutex_lock(&dsa2_mutex);
> @@ -1795,25 +1794,13 @@ void dsa_switch_shutdown(struct dsa_switch *ds)
>  		slave_dev = dp->slave;
>  
>  		netdev_upper_dev_unlink(master, slave_dev);
> -		/* Just unlinking ourselves as uppers of the master is not
> -		 * sufficient. When the master net device unregisters, that will
> -		 * also call dev_close, which we will catch as NETDEV_GOING_DOWN
> -		 * and trigger a dev_close on our own devices (dsa_slave_close).
> -		 * In turn, that will call dev_mc_unsync on the master's net
> -		 * device. If the master is also a DSA switch port, this will
> -		 * trigger dsa_slave_set_rx_mode which will call dev_mc_sync on
> -		 * its own master. Lockdep will complain about the fact that
> -		 * all cascaded masters have the same dsa_master_addr_list_lock_key,
> -		 * which it normally would not do if the cascaded masters would
> -		 * be in a proper upper/lower relationship, which we've just
> -		 * destroyed.
> -		 * To suppress the lockdep warnings, let's actually unregister
> -		 * the DSA slave interfaces too, to avoid the nonsensical
> -		 * multicast address list synchronization on shutdown.
> -		 */
> -		unregister_netdevice_queue(slave_dev, &unregister_list);
>  	}
> -	unregister_netdevice_many(&unregister_list);
> +
> +	/* Disconnect from further netdevice notifiers on the master,
> +	 * since netdev_uses_dsa() will now return false.
> +	 */
> +	dsa_switch_for_each_cpu_port(dp, ds)
> +		dp->master->dsa_ptr = NULL;

This is unfortunately racy and leads to other panics:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: G           O       6.1.99+gitb7793b7d9b35 #1
pc : lan9303_rcv+0x64/0x210
lr : lan9303_rcv+0x148/0x210
Call trace:
 lan9303_rcv+0x64/0x210
 dsa_switch_rcv+0x1d8/0x350
 __netif_receive_skb_list_core+0x1f8/0x220
 netif_receive_skb_list_internal+0x18c/0x2a4
 napi_gro_receive+0x238/0x254
 fec_enet_rx_napi+0x830/0xe60
 __napi_poll+0x40/0x210
 net_rx_action+0x138/0x2d0

Even though dsa_switch_rcv() checks 

        if (unlikely(!cpu_dp)) {
                kfree_skb(skb);
                return 0;
        }

if dsa_switch_shutdown() happens to zero dsa_ptr before
dsa_conduit_find_user(dev, 0, port) call, the latter will dereference dsa_ptr==NULL:

static inline struct net_device *dsa_conduit_find_user(struct net_device *dev,
                                                       int device, int port)
{
        struct dsa_port *cpu_dp = dev->dsa_ptr;
        struct dsa_switch_tree *dst = cpu_dp->dst;

I believe there are other race patterns as well if we consider all possible

static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
                          struct packet_type *pt, struct net_device *unused)
{
        struct metadata_dst *md_dst = skb_metadata_dst(skb);
        struct dsa_port *cpu_dp = dev->dsa_ptr;

...

                nskb = cpu_dp->rcv(skb, dev);

>  
>  	rtnl_unlock();
>  	mutex_unlock(&dsa2_mutex);

I'm not sure there is a safe way to zero dsa_ptr without ensuring the port
is down and there is no ongoing receive in parallel.

-- 
Alexander Sverdlin
Siemens AG
www.siemens.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-09-04  8:03 ` Sverdlin, Alexander
@ 2024-09-05  7:11   ` Sverdlin, Alexander
  2024-09-09 14:16     ` Vladimir Oltean
  0 siblings, 1 reply; 13+ messages in thread
From: Sverdlin, Alexander @ 2024-09-05  7:11 UTC (permalink / raw)
  To: netdev@vger.kernel.org, vladimir.oltean@nxp.com
  Cc: andrew@lunn.ch, olteanv@gmail.com, daniel.klauer@gin.de,
	davem@davemloft.net, vivien.didelot@gmail.com,
	LinoSanfilippo@gmx.de, f.fainelli@gmail.com, kuba@kernel.org,
	rafael.richter@gin.de

Hello Vladimir,

On Wed, 2024-09-04 at 10:03 +0200, Alexander Sverdlin wrote:
> > +	/* Disconnect from further netdevice notifiers on the master,
> > +	 * since netdev_uses_dsa() will now return false.
> > +	 */
> > +	dsa_switch_for_each_cpu_port(dp, ds)
> > +		dp->master->dsa_ptr = NULL;
> 
> This is unfortunately racy and leads to other panics:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
> CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: G           O       6.1.99+gitb7793b7d9b35 #1
> pc : lan9303_rcv+0x64/0x210
> lr : lan9303_rcv+0x148/0x210
> Call trace:
>  lan9303_rcv+0x64/0x210
>  dsa_switch_rcv+0x1d8/0x350
>  __netif_receive_skb_list_core+0x1f8/0x220
>  netif_receive_skb_list_internal+0x18c/0x2a4
>  napi_gro_receive+0x238/0x254
>  fec_enet_rx_napi+0x830/0xe60
>  __napi_poll+0x40/0x210
>  net_rx_action+0x138/0x2d0
> 
> Even though dsa_switch_rcv() checks 
> 
>         if (unlikely(!cpu_dp)) {
>                 kfree_skb(skb);
>                 return 0;
>         }
> 
> if dsa_switch_shutdown() happens to zero dsa_ptr before
> dsa_conduit_find_user(dev, 0, port) call, the latter will dereference dsa_ptr==NULL:
> 
> static inline struct net_device *dsa_conduit_find_user(struct net_device *dev,
>                                                        int device, int port)
> {
>         struct dsa_port *cpu_dp = dev->dsa_ptr;
>         struct dsa_switch_tree *dst = cpu_dp->dst;
> 
> I believe there are other race patterns as well if we consider all possible
> 
> static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
>                           struct packet_type *pt, struct net_device *unused)
> {
>         struct metadata_dst *md_dst = skb_metadata_dst(skb);
>         struct dsa_port *cpu_dp = dev->dsa_ptr;
> 
> ...
> 
>                 nskb = cpu_dp->rcv(skb, dev);
> 
> >  
> >  	rtnl_unlock();
> >  	mutex_unlock(&dsa2_mutex);
> 
> I'm not sure there is a safe way to zero dsa_ptr without ensuring the port
> is down and there is no ongoing receive in parallel.

after my first attempts to put a band aid on this failed, I concluded
that both assignments "dsa_ptr = NULL;" in kernel are broken. Or, being more
precise, they break widely spread pattern

CPU0					CPU1
if (netdev_uses_dsa())
					dev->dsa_ptr = NULL;
        dev->dsa_ptr->...

because there is no synchronization whatsoever, so tearing down DSA is actually
broken in many places...

Seems that something lock-free is required for dsa_ptr, maybe RCU or refcounting,
I'll try to come up with some rework, but any hints are welcome!

-- 
Alexander Sverdlin
Siemens AG
www.siemens.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-09-05  7:11   ` Sverdlin, Alexander
@ 2024-09-09 14:16     ` Vladimir Oltean
  2024-09-09 14:23       ` Sverdlin, Alexander
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Oltean @ 2024-09-09 14:16 UTC (permalink / raw)
  To: Sverdlin, Alexander
  Cc: netdev@vger.kernel.org, andrew@lunn.ch, olteanv@gmail.com,
	daniel.klauer@gin.de, davem@davemloft.net,
	vivien.didelot@gmail.com, LinoSanfilippo@gmx.de,
	f.fainelli@gmail.com, kuba@kernel.org, rafael.richter@gin.de

On Thu, Sep 05, 2024 at 07:11:44AM +0000, Sverdlin, Alexander wrote:
> Hello Vladimir,
> 
> On Wed, 2024-09-04 at 10:03 +0200, Alexander Sverdlin wrote:
> > > +	/* Disconnect from further netdevice notifiers on the master,
> > > +	 * since netdev_uses_dsa() will now return false.
> > > +	 */
> > > +	dsa_switch_for_each_cpu_port(dp, ds)
> > > +		dp->master->dsa_ptr = NULL;
> > 
> > This is unfortunately racy and leads to other panics:
> > 
> > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
> > CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: G           O       6.1.99+gitb7793b7d9b35 #1
> > pc : lan9303_rcv+0x64/0x210
> > lr : lan9303_rcv+0x148/0x210
> > Call trace:
> >  lan9303_rcv+0x64/0x210
> >  dsa_switch_rcv+0x1d8/0x350
> >  __netif_receive_skb_list_core+0x1f8/0x220
> >  netif_receive_skb_list_internal+0x18c/0x2a4
> >  napi_gro_receive+0x238/0x254
> >  fec_enet_rx_napi+0x830/0xe60
> >  __napi_poll+0x40/0x210
> >  net_rx_action+0x138/0x2d0
> > 
> > Even though dsa_switch_rcv() checks 
> > 
> >         if (unlikely(!cpu_dp)) {
> >                 kfree_skb(skb);
> >                 return 0;
> >         }
> > 
> > if dsa_switch_shutdown() happens to zero dsa_ptr before
> > dsa_conduit_find_user(dev, 0, port) call, the latter will dereference dsa_ptr==NULL:
> > 
> > static inline struct net_device *dsa_conduit_find_user(struct net_device *dev,
> >                                                        int device, int port)
> > {
> >         struct dsa_port *cpu_dp = dev->dsa_ptr;
> >         struct dsa_switch_tree *dst = cpu_dp->dst;
> > 
> > I believe there are other race patterns as well if we consider all possible
> > 
> > static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
> >                           struct packet_type *pt, struct net_device *unused)
> > {
> >         struct metadata_dst *md_dst = skb_metadata_dst(skb);
> >         struct dsa_port *cpu_dp = dev->dsa_ptr;
> > 
> > ...
> > 
> >                 nskb = cpu_dp->rcv(skb, dev);
> > 
> > >  
> > >  	rtnl_unlock();
> > >  	mutex_unlock(&dsa2_mutex);
> > 
> > I'm not sure there is a safe way to zero dsa_ptr without ensuring the port
> > is down and there is no ongoing receive in parallel.
> 
> after my first attempts to put a band aid on this failed, I concluded
> that both assignments "dsa_ptr = NULL;" in kernel are broken. Or, being more
> precise, they break widely spread pattern
> 
> CPU0					CPU1
> if (netdev_uses_dsa())
> 					dev->dsa_ptr = NULL;
>         dev->dsa_ptr->...
> 
> because there is no synchronization whatsoever, so tearing down DSA is actually
> broken in many places...
> 
> Seems that something lock-free is required for dsa_ptr, maybe RCU or refcounting,
> I'll try to come up with some rework, but any hints are welcome!

I'm trying to understand if this rework still leads to NULL dereferences
of conduit->dsa_ptr in the receive path? Could you please test?

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 668c729946ea..f1ce6d8dc499 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -1576,32 +1576,7 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
  */
 void dsa_switch_shutdown(struct dsa_switch *ds)
 {
-	struct net_device *conduit, *user_dev;
-	struct dsa_port *dp;
-
-	mutex_lock(&dsa2_mutex);
-
-	if (!ds->setup)
-		goto out;
-
-	rtnl_lock();
-
-	dsa_switch_for_each_user_port(dp, ds) {
-		conduit = dsa_port_to_conduit(dp);
-		user_dev = dp->user;
-
-		netdev_upper_dev_unlink(conduit, user_dev);
-	}
-
-	/* Disconnect from further netdevice notifiers on the conduit,
-	 * since netdev_uses_dsa() will now return false.
-	 */
-	dsa_switch_for_each_cpu_port(dp, ds)
-		dp->conduit->dsa_ptr = NULL;
-
-	rtnl_unlock();
-out:
-	mutex_unlock(&dsa2_mutex);
+	dsa_unregister_switch(ds);
 }
 EXPORT_SYMBOL_GPL(dsa_switch_shutdown);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-09-09 14:16     ` Vladimir Oltean
@ 2024-09-09 14:23       ` Sverdlin, Alexander
  2024-09-10  4:49         ` Sverdlin, Alexander
  0 siblings, 1 reply; 13+ messages in thread
From: Sverdlin, Alexander @ 2024-09-09 14:23 UTC (permalink / raw)
  To: vladimir.oltean@nxp.com
  Cc: andrew@lunn.ch, olteanv@gmail.com, LinoSanfilippo@gmx.de,
	daniel.klauer@gin.de, davem@davemloft.net,
	vivien.didelot@gmail.com, netdev@vger.kernel.org,
	f.fainelli@gmail.com, kuba@kernel.org, rafael.richter@gin.de

Hi Vladimir!

Thanks for looking into this!

On Mon, 2024-09-09 at 17:16 +0300, Vladimir Oltean wrote:
> > after my first attempts to put a band aid on this failed, I concluded
> > that both assignments "dsa_ptr = NULL;" in kernel are broken. Or, being more
> > precise, they break widely spread pattern
> > 
> > CPU0					CPU1
> > if (netdev_uses_dsa())
> >  					dev->dsa_ptr = NULL;
> >          dev->dsa_ptr->...
> > 
> > because there is no synchronization whatsoever, so tearing down DSA is actually
> > broken in many places...
> > 
> > Seems that something lock-free is required for dsa_ptr, maybe RCU or refcounting,
> > I'll try to come up with some rework, but any hints are welcome!
> 
> I'm trying to understand if this rework still leads to NULL dereferences
> of conduit->dsa_ptr in the receive path? Could you please test?

I didn't test yet (I can do it though), but I belive dsa_conduit_teardown()
will trigger the same crash eventually.

We can probably trigger a NULL pointer dereference in tagging_show() vs shutdown,
etc...

I'm actually close to publishing my RCU rework of dsa_ptr, but I would need to
test it as well...

I'll keep you updated!

> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index 668c729946ea..f1ce6d8dc499 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -1576,32 +1576,7 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
>   */
>  void dsa_switch_shutdown(struct dsa_switch *ds)
>  {
> -	struct net_device *conduit, *user_dev;
> -	struct dsa_port *dp;
> -
> -	mutex_lock(&dsa2_mutex);
> -
> -	if (!ds->setup)
> -		goto out;
> -
> -	rtnl_lock();
> -
> -	dsa_switch_for_each_user_port(dp, ds) {
> -		conduit = dsa_port_to_conduit(dp);
> -		user_dev = dp->user;
> -
> -		netdev_upper_dev_unlink(conduit, user_dev);
> -	}
> -
> -	/* Disconnect from further netdevice notifiers on the conduit,
> -	 * since netdev_uses_dsa() will now return false.
> -	 */
> -	dsa_switch_for_each_cpu_port(dp, ds)
> -		dp->conduit->dsa_ptr = NULL;
> -
> -	rtnl_unlock();
> -out:
> -	mutex_unlock(&dsa2_mutex);
> +	dsa_unregister_switch(ds);
>  }
>  EXPORT_SYMBOL_GPL(dsa_switch_shutdown);

-- 
Alexander Sverdlin
Siemens AG
www.siemens.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown
  2024-09-09 14:23       ` Sverdlin, Alexander
@ 2024-09-10  4:49         ` Sverdlin, Alexander
  0 siblings, 0 replies; 13+ messages in thread
From: Sverdlin, Alexander @ 2024-09-10  4:49 UTC (permalink / raw)
  To: vladimir.oltean@nxp.com
  Cc: andrew@lunn.ch, olteanv@gmail.com, LinoSanfilippo@gmx.de,
	daniel.klauer@gin.de, davem@davemloft.net,
	vivien.didelot@gmail.com, netdev@vger.kernel.org,
	f.fainelli@gmail.com, kuba@kernel.org, rafael.richter@gin.de

Hello Vladimir!

On Mon, 2024-09-09 at 16:23 +0200, Alexander Sverdlin wrote:
> On Mon, 2024-09-09 at 17:16 +0300, Vladimir Oltean wrote:
> > > after my first attempts to put a band aid on this failed, I concluded
> > > that both assignments "dsa_ptr = NULL;" in kernel are broken. Or, being more
> > > precise, they break widely spread pattern
> > > 
> > > CPU0					CPU1
> > > if (netdev_uses_dsa())
> > >  					dev->dsa_ptr = NULL;
> > >          dev->dsa_ptr->...
> > > 
> > > because there is no synchronization whatsoever, so tearing down DSA is actually
> > > broken in many places...
> > > 
> > > Seems that something lock-free is required for dsa_ptr, maybe RCU or refcounting,
> > > I'll try to come up with some rework, but any hints are welcome!
> > 
> > I'm trying to understand if this rework still leads to NULL dereferences
> > of conduit->dsa_ptr in the receive path? Could you please test?
> 
> I didn't test yet (I can do it though), but I belive dsa_conduit_teardown()
> will trigger the same crash eventually.

I've realized, I've tested similar patch already (where I replaced lan9303_shutdown()
with lan9303_remove()), but this only fixed the race with MDIO accesses in lan9303.

However I've applied your below patch (well, onto v6.1.99, but I don't think it matters
for now) and got the same crashes once per hour (on back-to-back reboots under 100MBit
traffic):

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
...
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.1.99+git6b4dd8ce06dc #1
...
pc : lan9303_rcv+0x68/0x220
lr : lan9303_rcv+0x14c/0x220
Call trace:
 lan9303_rcv+0x68/0x220
 dsa_switch_rcv+0x1d0/0x334
 __netif_receive_skb_list_core+0x1f4/0x220
 netif_receive_skb_list_internal+0x1e0/0x2d0
 napi_complete_done+0x70/0x1cc
 fec_enet_rx_napi+0x4fc/0xe50
 __napi_poll+0x40/0x200
 net_rx_action+0x140/0x2e0
 handle_softirqs+0x120/0x360
...

> We can probably trigger a NULL pointer dereference in tagging_show() vs shutdown,
> etc...

I wasn't able to trigger these even though I thought it will be even easier...

> I'm actually close to publishing my RCU rework of dsa_ptr, but I would need to
> test it as well...

Hopefully I will be able to test an RCU conversion today with some LOCKDEP and
RCU debugging, otherwise I can just send a compile-tested RFC...

> > diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> > index 668c729946ea..f1ce6d8dc499 100644
> > --- a/net/dsa/dsa.c
> > +++ b/net/dsa/dsa.c
> > @@ -1576,32 +1576,7 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
> >   */
> >  void dsa_switch_shutdown(struct dsa_switch *ds)
> >  {
> > -	struct net_device *conduit, *user_dev;
> > -	struct dsa_port *dp;
> > -
> > -	mutex_lock(&dsa2_mutex);
> > -
> > -	if (!ds->setup)
> > -		goto out;
> > -
> > -	rtnl_lock();
> > -
> > -	dsa_switch_for_each_user_port(dp, ds) {
> > -		conduit = dsa_port_to_conduit(dp);
> > -		user_dev = dp->user;
> > -
> > -		netdev_upper_dev_unlink(conduit, user_dev);
> > -	}
> > -
> > -	/* Disconnect from further netdevice notifiers on the conduit,
> > -	 * since netdev_uses_dsa() will now return false.
> > -	 */
> > -	dsa_switch_for_each_cpu_port(dp, ds)
> > -		dp->conduit->dsa_ptr = NULL;
> > -
> > -	rtnl_unlock();
> > -out:
> > -	mutex_unlock(&dsa2_mutex);
> > +	dsa_unregister_switch(ds);
> >  }
> >  EXPORT_SYMBOL_GPL(dsa_switch_shutdown);

-- 
Alexander Sverdlin
Siemens AG
www.siemens.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-09-10  4:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-09 12:04 [PATCH net] net: dsa: fix panic when DSA master device unbinds on shutdown Vladimir Oltean
2022-02-09 13:30 ` patchwork-bot+netdevbpf
2024-04-10  9:03 ` Some questions " xu
2024-04-10  9:06 ` xu
2024-04-10  9:14   ` Paolo Abeni
2024-04-10 14:34     ` Vladimir Oltean
2024-04-10 14:55       ` Greg KH
2024-04-10 16:30       ` Sasha Levin
2024-09-04  8:03 ` Sverdlin, Alexander
2024-09-05  7:11   ` Sverdlin, Alexander
2024-09-09 14:16     ` Vladimir Oltean
2024-09-09 14:23       ` Sverdlin, Alexander
2024-09-10  4:49         ` Sverdlin, Alexander

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).