* Re: can/j1939: hung inside rtnl_dellink()
[not found] ` <aKg9mTaSxzBVpTVI@pengutronix.de>
@ 2025-08-22 10:23 ` Tetsuo Handa
2025-08-24 13:36 ` Tetsuo Handa
0 siblings, 1 reply; 5+ messages in thread
From: Tetsuo Handa @ 2025-08-22 10:23 UTC (permalink / raw)
To: Oleksij Rempel
Cc: Robin van der Gracht, kernel, Oliver Hartkopp, Marc Kleine-Budde,
linux-can, LKML, Network Development
(Adding netdev ML to ask for hints from different network protocols...)
On 2025/08/22 18:51, Oleksij Rempel wrote:
> Hello Tetsuo,
>
> On Sat, Aug 16, 2025 at 03:51:54PM +0900, Tetsuo Handa wrote:
>> Hello.
>>
>> I made a minimized C reproducer for
>>
>> unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
>>
>> problem at https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 , and
>> obtained some data using debug printk() patch. It seems that the cause is
>> net/can/j1939/ does not handle NETDEV_UNREGISTER notification
>> while net/can/j1939/ can directly call rtnl_dellink() via sendmsg().
>
> Sorry for the long delay, and thank you for your investigation!
>
>> The minimized C reproducer is shown below.
> ....
>
>> Therefore, I guess that either
>>
>> j1939_netdev_notify() is handling NETDEV_UNREGISTER notification
Oops. I wanted to write
j1939_netdev_notify() is *not* handling NETDEV_UNREGISTER notification
>>
>> or
>>
>> rtnl_dellink() can be called via sendmsg() even though j1939 sockets
>> are in use
>>
>> is wrong. How to fix this problem?
>
> I assume the first variant is correct. Can you please test the following change:
> --- a/net/can/j1939/main.c
> +++ b/net/can/j1939/main.c
> @@ -370,6 +370,7 @@
> goto notify_done;
>
> switch (msg) {
> + case NETDEV_UNREGISTER:
> case NETDEV_DOWN:
> j1939_cancel_active_session(priv, NULL);
> j1939_sk_netdev_event_netdown(priv);
>
That change is not sufficient.
As far as I have tested, the only way to drop the refcount to 1 is to
call j1939_sk_release() (which involves sock_put()) on all j1939 sockets
(i.e. something like the change shown below).
diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h
@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv);
/* notify/alert all j1939 sockets bound to ifindex */
void j1939_sk_netdev_event_netdown(struct j1939_priv *priv);
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv);
int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk);
void j1939_tp_init(struct j1939_priv *priv);
diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..e568b5928a39 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c
@@ -377,6 +377,11 @@ static int j1939_netdev_notify(struct notifier_block *nb,
j1939_sk_netdev_event_netdown(priv);
j1939_ecu_unmap_all(priv);
break;
+ case NETDEV_UNREGISTER:
+ pr_info("NETDEV_UNREGISTER notification on %px start\n", ndev);
+ j1939_sk_netdev_event_unregister(priv);
+ pr_info("NETDEV_UNREGISTER notification on %px end\n", ndev);
+ break;
}
j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 3d8b588822f9..4e53a1b10907 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1300,6 +1313,24 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
read_unlock_bh(&priv->j1939_socks_lock);
}
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
+{
+ struct j1939_sock *jsk;
+ struct socket sock = { };
+
+ rescan:
+ read_lock_bh(&priv->j1939_socks_lock);
+ list_for_each_entry(jsk, &priv->j1939_socks, list) {
+ read_unlock_bh(&priv->j1939_socks_lock);
+ pr_info("Releasing %px\n", &jsk->sk);
+ sock.sk = &jsk->sk;
+ //sock_hold(&jsk->sk);
+ j1939_sk_release(&sock);
+ goto rescan;
+ }
+ read_unlock_bh(&priv->j1939_socks_lock);
+}
+
static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd,
unsigned long arg)
{
Of course, calling j1939_sk_release() upon the NETDEV_UNREGISTER event
causes a refcount underflow bug. But calling sock_hold() before calling
j1939_sk_release() prevents the refcount from dropping to 1. :-(
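To make the dilemma concrete, here is a minimal user-space sketch of one reading of it. It is not kernel code: every model_* name is hypothetical, and the assumption (labeled in the comments) is that the device reference pinned by a bound socket is dropped only when the socket's own refcount reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only: the model_* names are hypothetical, not kernel API. */
struct model_dev {
	int usage;			/* stands in for the netdev usage count */
};

struct model_sock {
	int refcnt;			/* stands in for the socket refcount */
	struct model_dev *dev;		/* assumption: dropped only at destruct */
};

struct model_result {
	int usage_after_notifier;	/* what unregister_netdevice() would see */
	int refcnt_after_close;		/* negative means underflow */
};

static void model_sock_hold(struct model_sock *sk)
{
	assert(sk->refcnt > 0);		/* taking a ref on a dead object is a bug */
	sk->refcnt++;
}

/* Drop one socket ref; the "destructor" runs only on the 1 -> 0 transition. */
static void model_sock_put(struct model_sock *sk)
{
	if (--sk->refcnt == 0 && sk->dev) {
		sk->dev->usage--;
		sk->dev = NULL;
	}
}

static struct model_result model_unregister(bool hold_first)
{
	struct model_dev dev = { .usage = 2 };	/* 1 base ref + 1 via the bound socket */
	struct model_sock sk = { .refcnt = 1, .dev = &dev };	/* ref owned by the open fd */
	struct model_result res;

	if (hold_first)
		model_sock_hold(&sk);
	model_sock_put(&sk);		/* release performed by the notifier */
	res.usage_after_notifier = dev.usage;

	model_sock_put(&sk);		/* user space closes the fd later */
	res.refcnt_after_close = sk.refcnt;
	return res;
}
```

In this model, model_unregister(false) lets the usage count reach 1 but leaves the socket refcount at -1 after the later close, while model_unregister(true) keeps the socket refcount sane but leaves the usage count at 2 when the notifier returns.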
I think we need to somehow make it possible to logically close j1939
sockets without actually closing them. Maybe j1939 sockets need something
like the "struct in_device"->dead flag, which inetdev_destroy() sets upon
the NETDEV_UNREGISTER event...
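A rough user-space sketch of what "logically closed but not actually closed" could look like follows. The model_* names are hypothetical and the structures are stand-ins, not the real j1939 ones. The point is only that the unregister path drops the reference early and clears the pointer, so that the later destructor does not drop it a second time.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only: the model_* names are hypothetical, not kernel API. */
struct model_priv {
	int refcnt;			/* refs pinning the per-device state */
};

struct model_jsk {
	struct model_priv *priv;	/* NULL once logically unbound */
	int bound;
};

static void model_bind(struct model_jsk *jsk, struct model_priv *priv)
{
	priv->refcnt++;			/* extra ref taken by bind */
	jsk->priv = priv;
	jsk->bound = 1;
}

/* Unregister path: undo bind, but leave the socket itself open. */
static void model_unregister(struct model_jsk *jsk)
{
	if (!jsk->bound)
		return;
	jsk->bound = 0;
	jsk->priv->refcnt--;		/* drop the bind ref now ... */
	jsk->priv = NULL;		/* ... and make sure destruct skips it */
}

/* Destructor at final close: drop the ref only if it is still owned. */
static void model_destruct(struct model_jsk *jsk)
{
	if (jsk->priv) {
		jsk->priv->refcnt--;
		jsk->priv = NULL;
	}
	jsk->bound = 0;
}

/* Returns the refcount visible at unregister time, before the final close. */
static int model_priv_refs_after(int do_unregister)
{
	struct model_priv priv = { .refcnt = 1 };	/* ref held by the device itself */
	struct model_jsk jsk = { .priv = NULL, .bound = 0 };
	int seen_by_unregister;

	model_bind(&jsk, &priv);		/* refcnt: 2 */
	if (do_unregister)
		model_unregister(&jsk);		/* refcnt: 1, socket still open */
	seen_by_unregister = priv.refcnt;
	model_destruct(&jsk);			/* final close */
	assert(priv.refcnt == 1);		/* never dropped twice */
	return seen_by_unregister;
}
```

With the early drop, the count seen at unregister time is already 1 even though the socket has not been closed; without it, the count stays at 2 until the destructor runs.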
My build environment is very slow (testing on VMware on a Windows PC).
Running my simplified reproducer in your build environment would be
much faster.
* Re: can/j1939: hung inside rtnl_dellink()
2025-08-22 10:23 ` can/j1939: hung inside rtnl_dellink() Tetsuo Handa
@ 2025-08-24 13:36 ` Tetsuo Handa
2025-08-25 14:07 ` [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler Tetsuo Handa
0 siblings, 1 reply; 5+ messages in thread
From: Tetsuo Handa @ 2025-08-24 13:36 UTC (permalink / raw)
To: Oleksij Rempel
Cc: Robin van der Gracht, kernel, Oliver Hartkopp, Marc Kleine-Budde,
linux-can, LKML, Network Development
On 2025/08/22 19:23, Tetsuo Handa wrote:
> I think we need to somehow make it possible to logically close j1939
> sockets without actually closing. Maybe something like
> "struct in_device"->dead flag which is set by inetdev_destroy() upon
> NETDEV_UNREGISTER event is needed by j1939 sockets...
This change seems to fix the hang that syzbot is reporting.
Does this change look correct?
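The handler in the patch below walks the socket list with a "drop the lock, handle one entry, rescan from the head" loop. Here is a minimal single-threaded sketch of that control flow, using hypothetical model_* names and no real locking. It only illustrates why the loop terminates: every restart clears one BOUND flag first.

```c
#include <stddef.h>

#define MODEL_BOUND 0x1		/* hypothetical stand-in for J1939_SOCK_BOUND */

/* User-space model only: the model_* names are hypothetical, not kernel API. */
struct model_node {
	unsigned int state;
	struct model_node *next;
};

/* Unbind every bound entry; returns how many passes over the list were made. */
static int model_unbind_all(struct model_node *head)
{
	struct model_node *n;
	int passes = 0;

rescan:
	passes++;
	/* Kernel code would take the list lock here. */
	for (n = head; n; n = n->next) {
		if (!(n->state & MODEL_BOUND))
			continue;
		/*
		 * Kernel code would pin the entry and drop the list lock here.
		 * After that the iterator can no longer be trusted, hence the
		 * restart from the head instead of continuing the walk.
		 */
		n->state &= ~MODEL_BOUND;
		goto rescan;
	}
	/* Kernel code would drop the list lock here. */
	return passes;
}

/* Three entries, two of them bound: expect two restarts plus one clean pass. */
static int model_demo(void)
{
	struct model_node c = { MODEL_BOUND, NULL };
	struct model_node b = { 0, &c };
	struct model_node a = { MODEL_BOUND, &b };
	int passes = model_unbind_all(&a);

	return (a.state | b.state | c.state) == 0 ? passes : -1;
}
```

The cost is quadratic in the number of bound entries in the worst case, which seems an acceptable trade for a rare event such as device unregistration.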
---
net/can/j1939/j1939-priv.h | 1 +
net/can/j1939/main.c | 3 +++
net/can/j1939/socket.c | 40 ++++++++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+)
diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h
@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv);
/* notify/alert all j1939 sockets bound to ifindex */
void j1939_sk_netdev_event_netdown(struct j1939_priv *priv);
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv);
int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk);
void j1939_tp_init(struct j1939_priv *priv);
diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..3706a872ecaf 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c
@@ -377,6 +377,9 @@ static int j1939_netdev_notify(struct notifier_block *nb,
j1939_sk_netdev_event_netdown(priv);
j1939_ecu_unmap_all(priv);
break;
+ case NETDEV_UNREGISTER:
+ j1939_sk_netdev_event_unregister(priv);
+ break;
}
j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 493f49bfaf5d..0fbfdffdfc24 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1303,6 +1303,46 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
read_unlock_bh(&priv->j1939_socks_lock);
}
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
+{
+ struct sock *sk;
+ struct j1939_sock *jsk;
+
+ rescan: /* The caller is holding a ref on this "priv" via j1939_priv_get_by_ndev(). */
+ read_lock_bh(&priv->j1939_socks_lock);
+ list_for_each_entry(jsk, &priv->j1939_socks, list) {
+ /* Skip if j1939_jsk_add() is not called on this socket. */
+ if (!(jsk->state & J1939_SOCK_BOUND))
+ continue;
+ sk = &jsk->sk;
+ sock_hold(sk);
+ read_unlock_bh(&priv->j1939_socks_lock);
+ /* Check if j1939_jsk_del() is not yet called on this socket after holding
+ * socket's lock, for both j1939_sk_bind() and j1939_sk_release() call
+ * j1939_jsk_del() with socket's lock held.
+ */
+ lock_sock(sk);
+ if (jsk->state & J1939_SOCK_BOUND) {
+ /* Neither j1939_sk_bind() nor j1939_sk_release() called j1939_jsk_del().
+ * Make this socket no longer bound, by pretending as if j1939_sk_bind()
+ * dropped old references but did not get new references.
+ */
+ j1939_jsk_del(priv, jsk);
+ j1939_local_ecu_put(priv, jsk->addr.src_name, jsk->addr.sa);
+ j1939_netdev_stop(priv);
+ /* Call j1939_priv_put() now and prevent j1939_sk_sock_destruct() from
+ * calling the corresponding j1939_priv_put().
+ */
+ j1939_priv_put(priv);
+ jsk->priv = NULL;
+ }
+ release_sock(sk);
+ sock_put(sk);
+ goto rescan;
+ }
+ read_unlock_bh(&priv->j1939_socks_lock);
+}
+
static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd,
unsigned long arg)
{
--
2.51.0
* [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler
2025-08-24 13:36 ` Tetsuo Handa
@ 2025-08-25 14:07 ` Tetsuo Handa
2025-09-05 8:29 ` Oleksij Rempel
2025-09-09 11:48 ` Marc Kleine-Budde
0 siblings, 2 replies; 5+ messages in thread
From: Tetsuo Handa @ 2025-08-25 14:07 UTC (permalink / raw)
To: Oleksij Rempel
Cc: Robin van der Gracht, kernel, Oliver Hartkopp, Marc Kleine-Budde,
linux-can, LKML, Network Development, Kurt Van Dijck
syzbot is reporting
unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
problem, for j1939 protocol did not have NETDEV_UNREGISTER notification
handler for undoing changes made by j1939_sk_bind().
Commit 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct
callback") expects that a call to j1939_priv_put() can be unconditionally
delayed until j1939_sk_sock_destruct() is called. But we need to call
j1939_priv_put() against an extra ref held by j1939_sk_bind() call
(as a part of undoing changes made by j1939_sk_bind()) as soon as
NETDEV_UNREGISTER notification fires (i.e. before j1939_sk_sock_destruct()
is called via j1939_sk_release()). Otherwise, the extra ref on "struct
j1939_priv" held by j1939_sk_bind() call prevents "struct net_device" from
dropping the usage count to 1; making it impossible for
unregister_netdevice() to continue.
Reported-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
Tested-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Fixes: 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct callback")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
net/can/j1939/j1939-priv.h | 1 +
net/can/j1939/main.c | 3 +++
net/can/j1939/socket.c | 49 ++++++++++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+)
diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h
@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv);
/* notify/alert all j1939 sockets bound to ifindex */
void j1939_sk_netdev_event_netdown(struct j1939_priv *priv);
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv);
int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk);
void j1939_tp_init(struct j1939_priv *priv);
diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..3706a872ecaf 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c
@@ -377,6 +377,9 @@ static int j1939_netdev_notify(struct notifier_block *nb,
j1939_sk_netdev_event_netdown(priv);
j1939_ecu_unmap_all(priv);
break;
+ case NETDEV_UNREGISTER:
+ j1939_sk_netdev_event_unregister(priv);
+ break;
}
j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 493f49bfaf5d..72c649cec9e1 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1303,6 +1303,55 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
read_unlock_bh(&priv->j1939_socks_lock);
}
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
+{
+ struct sock *sk;
+ struct j1939_sock *jsk;
+ bool wait_rcu = false;
+
+ rescan: /* The caller is holding a ref on this "priv" via j1939_priv_get_by_ndev(). */
+ read_lock_bh(&priv->j1939_socks_lock);
+ list_for_each_entry(jsk, &priv->j1939_socks, list) {
+ /* Skip if j1939_jsk_add() is not called on this socket. */
+ if (!(jsk->state & J1939_SOCK_BOUND))
+ continue;
+ sk = &jsk->sk;
+ sock_hold(sk);
+ read_unlock_bh(&priv->j1939_socks_lock);
+ /* Check if j1939_jsk_del() is not yet called on this socket after holding
+ * socket's lock, for both j1939_sk_bind() and j1939_sk_release() call
+ * j1939_jsk_del() with socket's lock held.
+ */
+ lock_sock(sk);
+ if (jsk->state & J1939_SOCK_BOUND) {
+ /* Neither j1939_sk_bind() nor j1939_sk_release() called j1939_jsk_del().
+ * Make this socket no longer bound, by pretending as if j1939_sk_bind()
+ * dropped old references but did not get new references.
+ */
+ j1939_jsk_del(priv, jsk);
+ j1939_local_ecu_put(priv, jsk->addr.src_name, jsk->addr.sa);
+ j1939_netdev_stop(priv);
+ /* Call j1939_priv_put() now and prevent j1939_sk_sock_destruct() from
+ * calling the corresponding j1939_priv_put().
+ *
+ * j1939_sk_sock_destruct() is supposed to call j1939_priv_put() after
+ * an RCU grace period. But since the caller is holding a ref on this
+ * "priv", we can defer synchronize_rcu() until immediately before
+ * the caller calls j1939_priv_put().
+ */
+ j1939_priv_put(priv);
+ jsk->priv = NULL;
+ wait_rcu = true;
+ }
+ release_sock(sk);
+ sock_put(sk);
+ goto rescan;
+ }
+ read_unlock_bh(&priv->j1939_socks_lock);
+ if (wait_rcu)
+ synchronize_rcu();
+}
+
static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd,
unsigned long arg)
{
--
2.51.0
* Re: [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler
2025-08-25 14:07 ` [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler Tetsuo Handa
@ 2025-09-05 8:29 ` Oleksij Rempel
2025-09-09 11:48 ` Marc Kleine-Budde
1 sibling, 0 replies; 5+ messages in thread
From: Oleksij Rempel @ 2025-09-05 8:29 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Robin van der Gracht, kernel, Oliver Hartkopp, Marc Kleine-Budde,
linux-can, LKML, Network Development, Kurt Van Dijck
On Mon, Aug 25, 2025 at 11:07:24PM +0900, Tetsuo Handa wrote:
> syzbot is reporting
>
> unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
>
> problem, for j1939 protocol did not have NETDEV_UNREGISTER notification
> handler for undoing changes made by j1939_sk_bind().
>
> Commit 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct
> callback") expects that a call to j1939_priv_put() can be unconditionally
> delayed until j1939_sk_sock_destruct() is called. But we need to call
> j1939_priv_put() against an extra ref held by j1939_sk_bind() call
> (as a part of undoing changes made by j1939_sk_bind()) as soon as
> NETDEV_UNREGISTER notification fires (i.e. before j1939_sk_sock_destruct()
> is called via j1939_sk_release()). Otherwise, the extra ref on "struct
> j1939_priv" held by j1939_sk_bind() call prevents "struct net_device" from
> dropping the usage count to 1; making it impossible for
> unregister_netdevice() to continue.
>
> Reported-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
> Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
> Tested-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
> Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
> Fixes: 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct callback")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Thank you!
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
* Re: [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler
2025-08-25 14:07 ` [PATCH] can: j1939: implement NETDEV_UNREGISTER notification handler Tetsuo Handa
2025-09-05 8:29 ` Oleksij Rempel
@ 2025-09-09 11:48 ` Marc Kleine-Budde
1 sibling, 0 replies; 5+ messages in thread
From: Marc Kleine-Budde @ 2025-09-09 11:48 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Oleksij Rempel, Robin van der Gracht, kernel, Oliver Hartkopp,
linux-can, LKML, Network Development, Kurt Van Dijck
On 25.08.2025 23:07:24, Tetsuo Handa wrote:
> syzbot is reporting
>
> unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
>
> problem, for j1939 protocol did not have NETDEV_UNREGISTER notification
> handler for undoing changes made by j1939_sk_bind().
>
> Commit 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct
> callback") expects that a call to j1939_priv_put() can be unconditionally
> delayed until j1939_sk_sock_destruct() is called. But we need to call
> j1939_priv_put() against an extra ref held by j1939_sk_bind() call
> (as a part of undoing changes made by j1939_sk_bind()) as soon as
> NETDEV_UNREGISTER notification fires (i.e. before j1939_sk_sock_destruct()
> is called via j1939_sk_release()). Otherwise, the extra ref on "struct
> j1939_priv" held by j1939_sk_bind() call prevents "struct net_device" from
> dropping the usage count to 1; making it impossible for
> unregister_netdevice() to continue.
>
> Reported-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
> Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
> Tested-by: syzbot <syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com>
> Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
> Fixes: 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct callback")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Applied to linux-can.
> --- a/net/can/j1939/socket.c
> +++ b/net/can/j1939/socket.c
> @@ -1303,6 +1303,55 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
> read_unlock_bh(&priv->j1939_socks_lock);
> }
>
> +void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
> +{
> + struct sock *sk;
> + struct j1939_sock *jsk;
> + bool wait_rcu = false;
> +
> + rescan: /* The caller is holding a ref on this "priv" via j1939_priv_get_by_ndev(). */
^^
I've removed the space while applying the patch.
regards,
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung Nürnberg | Phone: +49-5121-206917-129 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |