public inbox for linux-kernel@vger.kernel.org
* [PATCH v5] driver core: enforce device_lock for driver_match_device()
@ 2026-01-13 16:28 Gui-Dong Han
  2026-01-13 16:35 ` Rafael J. Wysocki
                   ` (4 more replies)
  0 siblings, 5 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-13 16:28 UTC
  To: gregkh, rafael, dakr
  Cc: linux-kernel, baijiaju1990, Gui-Dong Han, Qiu-ji Chen

Currently, driver_match_device() is called from three sites. One site
(__device_attach_driver) holds device_lock(dev), but the other two
(bind_store and __driver_attach) do not. This inconsistency means that
bus match() callbacks are not guaranteed to be called with the lock
held.

Fix this by introducing driver_match_device_locked(), which guarantees
holding the device lock using a scoped guard. Replace the unlocked calls
in bind_store() and __driver_attach() with this new helper. Also add a
lock assertion to driver_match_device() to enforce this guarantee.

This consistency also fixes a known race condition. The driver_override
implementation relies on the device_lock, so the missing lock led to the
use-after-free (UAF) reported in Bugzilla for buses using this field.

Stress testing the two newly locked paths for 24 hours with
CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
and no lockdep warnings.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Suggested-by: Qiu-ji Chen <chenqiuji666@gmail.com>
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
---
v5:
* Introduce driver_match_device_locked() helper using guard(device) to
handle locking, as suggested by Rafael J. Wysocki.
v4:
* Remove the misleading comment above device_lock_assert(), and update
subject and commit message to focus on enforcing consistent locking,
as discussed with Danilo Krummrich.
v3:
* Remove redundant locking comments at call sites and add a blank line
after the lock assertion in driver_match_device(), as suggested by Greg KH.
v2:
* Add device_lock_assert() in driver_match_device() to enforce locking
requirement, as suggested by Greg KH.
v1:
* The Bugzilla entry contains full KASAN reports and two PoCs that reliably
reproduce the UAF on both unlocked paths using a standard QEMU setup
(default e1000 device at 0000:00:03.0).
---
 drivers/base/base.h | 9 +++++++++
 drivers/base/bus.c  | 2 +-
 drivers/base/dd.c   | 2 +-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 430cbefbc97f..677320881af1 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -182,9 +182,18 @@ void device_set_deferred_probe_reason(const struct device *dev, struct va_format
 static inline int driver_match_device(const struct device_driver *drv,
 				      struct device *dev)
 {
+	device_lock_assert(dev);
+
 	return drv->bus->match ? drv->bus->match(dev, drv) : 1;
 }
 
+static inline int driver_match_device_locked(const struct device_driver *drv,
+					     struct device *dev)
+{
+	guard(device)(dev);
+	return driver_match_device(drv, dev);
+}
+
 static inline void dev_sync_state(struct device *dev)
 {
 	if (dev->bus->sync_state)
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 9eb7771706f0..331d750465e2 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -263,7 +263,7 @@ static ssize_t bind_store(struct device_driver *drv, const char *buf,
 	int err = -ENODEV;
 
 	dev = bus_find_device_by_name(bus, NULL, buf);
-	if (dev && driver_match_device(drv, dev)) {
+	if (dev && driver_match_device_locked(drv, dev)) {
 		err = device_driver_attach(drv, dev);
 		if (!err) {
 			/* success */
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 349f31bedfa1..98feb4c77160 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -1178,7 +1178,7 @@ static int __driver_attach(struct device *dev, void *data)
 	 * is an error.
 	 */
 
-	ret = driver_match_device(drv, dev);
+	ret = driver_match_device_locked(drv, dev);
 	if (ret == 0) {
 		/* no match */
 		return 0;
-- 
2.43.0
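
As a rough userspace illustration of the pattern this patch introduces (not kernel code): the sketch below models guard(device) with the GCC/Clang cleanup attribute and device_lock_assert() with a plain assert(). Every model_* name, the lock_held flag, and the name-comparing match callback are assumptions made for the example; only the shape of driver_match_device() and driver_match_device_locked() comes from the diff above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct device; not the kernel type. */
struct model_device {
	pthread_mutex_t lock;
	bool lock_held;		/* tracked only so the assertion has something to check */
	const char *name;
};

/* Hypothetical stand-in for a driver plus its bus match() callback. */
struct model_driver {
	/* Nonzero means "match"; a NULL callback means "always match". */
	int (*match)(struct model_device *dev, const struct model_driver *drv);
	const char *name;
};

static void model_device_lock(struct model_device *dev)
{
	pthread_mutex_lock(&dev->lock);
	dev->lock_held = true;
}

static void model_device_unlock(struct model_device *dev)
{
	dev->lock_held = false;
	pthread_mutex_unlock(&dev->lock);
}

/* Cleanup handler: the compiler passes a pointer to the guarded variable. */
static void model_guard_release(struct model_device **devp)
{
	model_device_unlock(*devp);
}

/* Take the lock now and drop it automatically when the scope ends. */
#define MODEL_GUARD(dev)						\
	struct model_device *model_guard_				\
		__attribute__((cleanup(model_guard_release), unused)) =	\
		(model_device_lock(dev), (dev))

/* Example match callback (an assumption): match when the names are equal. */
static int model_name_match(struct model_device *dev,
			    const struct model_driver *drv)
{
	return strcmp(dev->name, drv->name) == 0;
}

/* Mirrors driver_match_device(): the caller must already hold the lock. */
static int model_match_device(const struct model_driver *drv,
			      struct model_device *dev)
{
	assert(dev->lock_held);

	return drv->match ? drv->match(dev, drv) : 1;
}

/* Mirrors driver_match_device_locked(): lock, match, unlock on return. */
static int model_match_device_locked(const struct model_driver *drv,
				     struct model_device *dev)
{
	MODEL_GUARD(dev);
	return model_match_device(drv, dev);
}
```

In this model, calling model_match_device() without holding the lock trips the assertion, which is roughly the userspace analogue of the warning device_lock_assert() is meant to produce when lockdep is enabled.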



* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 16:28 [PATCH v5] driver core: enforce device_lock for driver_match_device() Gui-Dong Han
@ 2026-01-13 16:35 ` Rafael J. Wysocki
  2026-01-13 19:23 ` Danilo Krummrich
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 63+ messages in thread
From: Rafael J. Wysocki @ 2026-01-13 16:35 UTC
  To: Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen

On Tue, Jan 13, 2026 at 5:29 PM Gui-Dong Han <hanguidong02@gmail.com> wrote:
>
> Currently, driver_match_device() is called from three sites. One site
> (__device_attach_driver) holds device_lock(dev), but the other two
> (bind_store and __driver_attach) do not. This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held.
>
> Fix this by introducing driver_match_device_locked(), which guarantees
> holding the device lock using a scoped guard. Replace the unlocked calls
> in bind_store() and __driver_attach() with this new helper. Also add a
> lock assertion to driver_match_device() to enforce this guarantee.
>
> This consistency also fixes a known race condition. The driver_override
> implementation relies on the device_lock, so the missing lock led to the
> use-after-free (UAF) reported in Bugzilla for buses using this field.
>
> Stress testing the two newly locked paths for 24 hours with
> CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> and no lockdep warnings.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> Suggested-by: Qiu-ji Chen <chenqiuji666@gmail.com>
> Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>

Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>

> ---
> v5:
> * Introduce driver_match_device_locked() helper using guard(device) to
> handle locking, as suggested by Rafael J. Wysocki.
> v4:
> * Remove the misleading comment above device_lock_assert(), and update
> subject and commit message to focus on enforcing consistent locking,
> as discussed with Danilo Krummrich.
> v3:
> * Remove redundant locking comments at call sites and add a blank line
> after the lock assertion in driver_match_device(), as suggested by Greg KH.
> v2:
> * Add device_lock_assert() in driver_match_device() to enforce locking
> requirement, as suggested by Greg KH.
> v1:
> * The Bugzilla entry contains full KASAN reports and two PoCs that reliably
> reproduce the UAF on both unlocked paths using a standard QEMU setup
> (default e1000 device at 0000:00:03.0).
> ---
>  drivers/base/base.h | 9 +++++++++
>  drivers/base/bus.c  | 2 +-
>  drivers/base/dd.c   | 2 +-
>  3 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index 430cbefbc97f..677320881af1 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -182,9 +182,18 @@ void device_set_deferred_probe_reason(const struct device *dev, struct va_format
>  static inline int driver_match_device(const struct device_driver *drv,
>                                       struct device *dev)
>  {
> +       device_lock_assert(dev);
> +
>         return drv->bus->match ? drv->bus->match(dev, drv) : 1;
>  }
>
> +static inline int driver_match_device_locked(const struct device_driver *drv,
> +                                            struct device *dev)
> +{
> +       guard(device)(dev);
> +       return driver_match_device(drv, dev);
> +}
> +
>  static inline void dev_sync_state(struct device *dev)
>  {
>         if (dev->bus->sync_state)
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index 9eb7771706f0..331d750465e2 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -263,7 +263,7 @@ static ssize_t bind_store(struct device_driver *drv, const char *buf,
>         int err = -ENODEV;
>
>         dev = bus_find_device_by_name(bus, NULL, buf);
> -       if (dev && driver_match_device(drv, dev)) {
> +       if (dev && driver_match_device_locked(drv, dev)) {
>                 err = device_driver_attach(drv, dev);
>                 if (!err) {
>                         /* success */
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 349f31bedfa1..98feb4c77160 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -1178,7 +1178,7 @@ static int __driver_attach(struct device *dev, void *data)
>          * is an error.
>          */
>
> -       ret = driver_match_device(drv, dev);
> +       ret = driver_match_device_locked(drv, dev);
>         if (ret == 0) {
>                 /* no match */
>                 return 0;
> --
> 2.43.0
>


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 16:28 [PATCH v5] driver core: enforce device_lock for driver_match_device() Gui-Dong Han
  2026-01-13 16:35 ` Rafael J. Wysocki
@ 2026-01-13 19:23 ` Danilo Krummrich
  2026-01-16  7:34   ` Gui-Dong Han
  2026-01-16 11:54 ` Danilo Krummrich
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-13 19:23 UTC
  To: Gui-Dong Han; +Cc: gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen

On Tue Jan 13, 2026 at 5:28 PM CET, Gui-Dong Han wrote:
> Currently, driver_match_device() is called from three sites. One site
> (__device_attach_driver) holds device_lock(dev), but the other two
> (bind_store and __driver_attach) do not. This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held.
>
> Fix this by introducing driver_match_device_locked(), which guarantees
> holding the device lock using a scoped guard. Replace the unlocked calls
> in bind_store() and __driver_attach() with this new helper. Also add a
> lock assertion to driver_match_device() to enforce this guarantee.
>
> This consistency also fixes a known race condition. The driver_override
> implementation relies on the device_lock, so the missing lock led to the
> use-after-free (UAF) reported in Bugzilla for buses using this field.
>
> Stress testing the two newly locked paths for 24 hours with
> CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> and no lockdep warnings.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789

Fixes: 49b420a13ff9 ("driver core: check bus->match without holding device lock")

> Suggested-by: Qiu-ji Chen <chenqiuji666@gmail.com>
> Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>

This looks good now!

Reviewed-by: Danilo Krummrich <dakr@kernel.org>


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 19:23 ` Danilo Krummrich
@ 2026-01-16  7:34   ` Gui-Dong Han
  2026-01-16 11:19     ` Greg KH
  0 siblings, 1 reply; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-16  7:34 UTC
  To: Danilo Krummrich; +Cc: gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen

On Wed, Jan 14, 2026 at 3:23 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue Jan 13, 2026 at 5:28 PM CET, Gui-Dong Han wrote:
> > Currently, driver_match_device() is called from three sites. One site
> > (__device_attach_driver) holds device_lock(dev), but the other two
> > (bind_store and __driver_attach) do not. This inconsistency means that
> > bus match() callbacks are not guaranteed to be called with the lock
> > held.
> >
> > Fix this by introducing driver_match_device_locked(), which guarantees
> > holding the device lock using a scoped guard. Replace the unlocked calls
> > in bind_store() and __driver_attach() with this new helper. Also add a
> > lock assertion to driver_match_device() to enforce this guarantee.
> >
> > This consistency also fixes a known race condition. The driver_override
> > implementation relies on the device_lock, so the missing lock led to the
> > use-after-free (UAF) reported in Bugzilla for buses using this field.
> >
> > Stress testing the two newly locked paths for 24 hours with
> > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> > and no lockdep warnings.
> >
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
>
> Fixes: 49b420a13ff9 ("driver core: check bus->match without holding device lock")

Thanks for the review! The Fixes tag looks correct as the
inconsistency dates back to that commit.

Since a Fixes tag is present, I recall the patch bot often warns when
a Fixes tag is provided without a corresponding Cc:
stable@vger.kernel.org tag. Perhaps we should include it? This would
also allow us to fix the UAF in older kernels.

If you agree, could you please add it when picking up the patch?

Thank you very much!


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-16  7:34   ` Gui-Dong Han
@ 2026-01-16 11:19     ` Greg KH
  2026-01-16 11:38       ` Gui-Dong Han
  0 siblings, 1 reply; 63+ messages in thread
From: Greg KH @ 2026-01-16 11:19 UTC
  To: Gui-Dong Han
  Cc: Danilo Krummrich, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen

On Fri, Jan 16, 2026 at 03:34:53PM +0800, Gui-Dong Han wrote:
> On Wed, Jan 14, 2026 at 3:23 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Tue Jan 13, 2026 at 5:28 PM CET, Gui-Dong Han wrote:
> > > Currently, driver_match_device() is called from three sites. One site
> > > (__device_attach_driver) holds device_lock(dev), but the other two
> > > (bind_store and __driver_attach) do not. This inconsistency means that
> > > bus match() callbacks are not guaranteed to be called with the lock
> > > held.
> > >
> > > Fix this by introducing driver_match_device_locked(), which guarantees
> > > holding the device lock using a scoped guard. Replace the unlocked calls
> > > in bind_store() and __driver_attach() with this new helper. Also add a
> > > lock assertion to driver_match_device() to enforce this guarantee.
> > >
> > > This consistency also fixes a known race condition. The driver_override
> > > implementation relies on the device_lock, so the missing lock led to the
> > > use-after-free (UAF) reported in Bugzilla for buses using this field.
> > >
> > > Stress testing the two newly locked paths for 24 hours with
> > > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> > > and no lockdep warnings.
> > >
> > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> >
> > Fixes: 49b420a13ff9 ("driver core: check bus->match without holding device lock")
> 
> Thanks for the review! The Fixes tag looks correct as the
> inconsistency dates back to that commit.
> 
> Since a Fixes tag is present, I recall the patch bot often warns when
> a Fixes tag is provided without a corresponding Cc:
> stable@vger.kernel.org tag. Perhaps we should include it? This would
> also allow us to fix the UAF in older kernels.

Can you reproduce this UAF?  If so, yes, otherwise fixing for
theoretical things isn't good to backport.

Anyway, looks good to me:

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-16 11:19     ` Greg KH
@ 2026-01-16 11:38       ` Gui-Dong Han
  0 siblings, 0 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-16 11:38 UTC
  To: Greg KH; +Cc: Danilo Krummrich, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen

On Fri, Jan 16, 2026 at 7:19 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, Jan 16, 2026 at 03:34:53PM +0800, Gui-Dong Han wrote:
> > On Wed, Jan 14, 2026 at 3:23 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Tue Jan 13, 2026 at 5:28 PM CET, Gui-Dong Han wrote:
> > > > Currently, driver_match_device() is called from three sites. One site
> > > > (__device_attach_driver) holds device_lock(dev), but the other two
> > > > (bind_store and __driver_attach) do not. This inconsistency means that
> > > > bus match() callbacks are not guaranteed to be called with the lock
> > > > held.
> > > >
> > > > Fix this by introducing driver_match_device_locked(), which guarantees
> > > > holding the device lock using a scoped guard. Replace the unlocked calls
> > > > in bind_store() and __driver_attach() with this new helper. Also add a
> > > > lock assertion to driver_match_device() to enforce this guarantee.
> > > >
> > > > This consistency also fixes a known race condition. The driver_override
> > > > implementation relies on the device_lock, so the missing lock led to the
> > > > use-after-free (UAF) reported in Bugzilla for buses using this field.
> > > >
> > > > Stress testing the two newly locked paths for 24 hours with
> > > > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> > > > and no lockdep warnings.
> > > >
> > > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> > >
> > > Fixes: 49b420a13ff9 ("driver core: check bus->match without holding device lock")
> >
> > Thanks for the review! The Fixes tag looks correct as the
> > inconsistency dates back to that commit.
> >
> > Since a Fixes tag is present, I recall the patch bot often warns when
> > a Fixes tag is provided without a corresponding Cc:
> > stable@vger.kernel.org tag. Perhaps we should include it? This would
> > also allow us to fix the UAF in older kernels.
>
> Can you reproduce this UAF?  If so, yes, otherwise fixing for
> theoretical things isn't good to backport.

Yes, it is reproducible. The Bugzilla entry linked in the Closes tag
contains full KASAN reports and PoCs that reliably reproduce the UAF.
(This was also noted in the v1 changelog).

Thanks for the review!


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 16:28 [PATCH v5] driver core: enforce device_lock for driver_match_device() Gui-Dong Han
  2026-01-13 16:35 ` Rafael J. Wysocki
  2026-01-13 19:23 ` Danilo Krummrich
@ 2026-01-16 11:54 ` Danilo Krummrich
  2026-01-20 13:22 ` Mark Brown
  2026-02-25 20:19 ` Cristian Marussi
  4 siblings, 0 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-16 11:54 UTC
  To: Gui-Dong Han; +Cc: gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen

On Tue Jan 13, 2026 at 5:28 PM CET, Gui-Dong Han wrote:
> Currently, driver_match_device() is called from three sites. One site
> (__device_attach_driver) holds device_lock(dev), but the other two
> (bind_store and __driver_attach) do not. This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held.
>
> Fix this by introducing driver_match_device_locked(), which guarantees
> holding the device lock using a scoped guard. Replace the unlocked calls
> in bind_store() and __driver_attach() with this new helper. Also add a
> lock assertion to driver_match_device() to enforce this guarantee.
>
> This consistency also fixes a known race condition. The driver_override
> implementation relies on the device_lock, so the missing lock led to the
> use-after-free (UAF) reported in Bugzilla for buses using this field.
>
> Stress testing the two newly locked paths for 24 hours with
> CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP enabled showed no UAF recurrence
> and no lockdep warnings.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> Suggested-by: Qiu-ji Chen <chenqiuji666@gmail.com>
> Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>

Applied to driver-core-linus, thanks!


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 16:28 [PATCH v5] driver core: enforce device_lock for driver_match_device() Gui-Dong Han
                   ` (2 preceding siblings ...)
  2026-01-16 11:54 ` Danilo Krummrich
@ 2026-01-20 13:22 ` Mark Brown
  2026-01-20 13:30   ` Gui-Dong Han
                     ` (4 more replies)
  2026-02-25 20:19 ` Cristian Marussi
  4 siblings, 5 replies; 63+ messages in thread
From: Mark Brown @ 2026-01-20 13:22 UTC
  To: Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV


On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
> Currently, driver_match_device() is called from three sites. One site
> (__device_attach_driver) holds device_lock(dev), but the other two
> (bind_store and __driver_attach) do not. This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held.

I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
this commit.  The boot grinds to a halt near the end of boot:

[    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
[    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    2.623547] msm_serial: driver initialized
[    2.624058] SuperH (H)SCI(F) driver initialized
[    2.624312] STM32 USART driver initialized

with no further output, full log:

   https://lava.sirena.org.uk/scheduler/job/2387335#L862

We are also seeing similar looking boot hangs on some Qualcomm platforms
in Arm's test lab which aren't verified to be the same thing but are
hanging at a similar point in boot.

Bisect log:

# bad: [53d96388f8a868d3f79559a6cf39db81b34f9fb4] Merge branch 'slab/for-next-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git
# good: [24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7] Linux 6.19-rc6
# good: [caa329649259d0f90c0056c9860ca659d4ba3211] spi: intel-pci: Add support for Nova Lake SPI serial flash
# good: [4b58aac989c1e3fafb1c68a733811859df388250] regmap: Fix race condition in hwspinlock irqsave routine
# good: [b062a899c997df7b9ce29c62164888baa7a85833] spi: hisi-kunpeng: Fixed the wrong debugfs node name in hisi_spi debugfs initialization
# good: [f3f380ce6b3d5c9805c7e0b3d5bc28d9ec41e2e8] regmap: maple: free entry on mas_store_gfp() failure
git bisect start '53d96388f8a868d3f79559a6cf39db81b34f9fb4' '24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7' 'caa329649259d0f90c0056c9860ca659d4ba3211' '4b58aac989c1e3fafb1c68a733811859df388250' 'b062a899c997df7b9ce29c62164888baa7a85833' 'f3f380ce6b3d5c9805c7e0b3d5bc28d9ec41e2e8'
# test job: [caa329649259d0f90c0056c9860ca659d4ba3211] https://lava.sirena.org.uk/scheduler/job/2376182
# test job: [4b58aac989c1e3fafb1c68a733811859df388250] https://lava.sirena.org.uk/scheduler/job/2364586
# test job: [b062a899c997df7b9ce29c62164888baa7a85833] https://lava.sirena.org.uk/scheduler/job/2363939
# test job: [f3f380ce6b3d5c9805c7e0b3d5bc28d9ec41e2e8] https://lava.sirena.org.uk/scheduler/job/2331556
# test job: [53d96388f8a868d3f79559a6cf39db81b34f9fb4] https://lava.sirena.org.uk/scheduler/job/2387335
# bad: [53d96388f8a868d3f79559a6cf39db81b34f9fb4] Merge branch 'slab/for-next-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git
git bisect bad 53d96388f8a868d3f79559a6cf39db81b34f9fb4
# test job: [64f6a0a031257e68060f22d217c53a50a671dfc8] https://lava.sirena.org.uk/scheduler/job/2387395
# good: [64f6a0a031257e68060f22d217c53a50a671dfc8] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git
git bisect good 64f6a0a031257e68060f22d217c53a50a671dfc8
# test job: [686c6adb9882dd8d9fe1cc816bfd289dfa0abbd1] https://lava.sirena.org.uk/scheduler/job/2387470
# bad: [686c6adb9882dd8d9fe1cc816bfd289dfa0abbd1] Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git
git bisect bad 686c6adb9882dd8d9fe1cc816bfd289dfa0abbd1
# test job: [e03b29b55f2b7c345a919a6ee36633b06bf3fb56] https://lava.sirena.org.uk/scheduler/job/2387535
# good: [e03b29b55f2b7c345a919a6ee36633b06bf3fb56] comedi: dmm32at: serialize use of paged registers
git bisect good e03b29b55f2b7c345a919a6ee36633b06bf3fb56
# test job: [bb25769465146dfc54e90e4066e8f41ec6b25730] https://lava.sirena.org.uk/scheduler/job/2387693
# bad: [bb25769465146dfc54e90e4066e8f41ec6b25730] Merge branch 'mtd/fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git
git bisect bad bb25769465146dfc54e90e4066e8f41ec6b25730
# test job: [dc23806a7c47ec5f1293aba407fb69519f976ee0] https://lava.sirena.org.uk/scheduler/job/2388190
# bad: [dc23806a7c47ec5f1293aba407fb69519f976ee0] driver core: enforce device_lock for driver_match_device()
git bisect bad dc23806a7c47ec5f1293aba407fb69519f976ee0
# test job: [5f4476e98387618ce22bb93fb5c11142827458ec] https://lava.sirena.org.uk/scheduler/job/2388401
# good: [5f4476e98387618ce22bb93fb5c11142827458ec] rust: auxiliary: add Driver::unbind() callback
git bisect good 5f4476e98387618ce22bb93fb5c11142827458ec
# test job: [c1d4519e1c36ffa01973e23af4502e69dcd84f39] https://lava.sirena.org.uk/scheduler/job/2388603
# good: [c1d4519e1c36ffa01973e23af4502e69dcd84f39] rust: driver: add DEVICE_DRIVER_OFFSET to the DriverLayout trait
git bisect good c1d4519e1c36ffa01973e23af4502e69dcd84f39
# test job: [a995fe1a3aa78b7d06cc1cc7b6b8436c5e93b07f] https://lava.sirena.org.uk/scheduler/job/2388676
# good: [a995fe1a3aa78b7d06cc1cc7b6b8436c5e93b07f] rust: driver: drop device private data post unbind
git bisect good a995fe1a3aa78b7d06cc1cc7b6b8436c5e93b07f
# first bad commit: [dc23806a7c47ec5f1293aba407fb69519f976ee0] driver core: enforce device_lock for driver_match_device()



* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:22 ` Mark Brown
@ 2026-01-20 13:30   ` Gui-Dong Han
  2026-01-20 13:48     ` Mark Brown
  2026-01-21  8:55     ` Wang Jiayue
  2026-01-20 15:03   ` Danilo Krummrich
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-20 13:30 UTC
  To: Mark Brown
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV

On Tue, Jan 20, 2026 at 9:22 PM Mark Brown <broonie@kernel.org> wrote:
>
> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
> > Currently, driver_match_device() is called from three sites. One site
> > (__device_attach_driver) holds device_lock(dev), but the other two
> > (bind_store and __driver_attach) do not. This inconsistency means that
> > bus match() callbacks are not guaranteed to be called with the lock
> > held.
>
> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
>
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized
>
> with no further output, full log:
>
>    https://lava.sirena.org.uk/scheduler/job/2387335#L862
>
> We are also seeing similar looking boot hangs on some Qualcomm platforms
> in Arm's test lab which aren't verified to be the same thing but are
> hanging at a similar point in boot.

Hi Mark,

Thanks for the report and the detailed bisect log.

I verified this on x86 without issues, but it seems I missed this
regression on Arm platforms. I will investigate the cause of this hang
immediately.

Thanks


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:30   ` Gui-Dong Han
@ 2026-01-20 13:48     ` Mark Brown
  2026-01-20 14:05       ` Gui-Dong Han
  2026-01-21  8:55     ` Wang Jiayue
  1 sibling, 1 reply; 63+ messages in thread
From: Mark Brown @ 2026-01-20 13:48 UTC
  To: Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV


On Tue, Jan 20, 2026 at 09:30:53PM +0800, Gui-Dong Han wrote:
> On Tue, Jan 20, 2026 at 9:22 PM Mark Brown <broonie@kernel.org> wrote:

> > I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> > this commit.  The boot grinds to a halt near the end of boot:

> I verified this on x86 without issues, but it seems I missed this
> regression on Arm platforms. I will investigate the cause of this hang
> immediately.

Thanks.  It's not all platforms - in particular none of the main
emulation platforms triggered the issue unfortunately.  It seems to be a
fairly limited subset of physical platforms that are impacted.



* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:48     ` Mark Brown
@ 2026-01-20 14:05       ` Gui-Dong Han
  0 siblings, 0 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-20 14:05 UTC
  To: Mark Brown
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV

On Tue, Jan 20, 2026 at 9:48 PM Mark Brown <broonie@kernel.org> wrote:
>
> On Tue, Jan 20, 2026 at 09:30:53PM +0800, Gui-Dong Han wrote:
> > On Tue, Jan 20, 2026 at 9:22 PM Mark Brown <broonie@kernel.org> wrote:
>
> > > I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> > > this commit.  The boot grinds to a halt near the end of boot:
>
> > I verified this on x86 without issues, but it seems I missed this
> > regression on Arm platforms. I will investigate the cause of this hang
> > immediately.
>
> Thanks.  It's not all platforms - in particular none of the main
> emulation platforms triggered the issue unfortunately.  It seems to be a
> fairly limited subset of physical platforms that are impacted.

I am quite puzzled by this. My suspicion is a deadlock, possibly
involving PM/power domain operations within the match() callback
triggering a lock dependency.

However, since __device_attach_driver (the device-initiated binding
path) has always held device_lock while calling match(), it is strange
that this locking pattern causes issues now in the __driver_attach
(driver-initiated) path. The specific timing or device state during
late binding must be the key factor.

I do not have access to the affected hardware, but I will attempt to
reproduce it on QEMU Arm and perform a code audit of the relevant bus
match functions.

If anyone has insights into potential lock conflicts specific to this
path, please share.

Thanks.


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:22 ` Mark Brown
  2026-01-20 13:30   ` Gui-Dong Han
@ 2026-01-20 15:03   ` Danilo Krummrich
  2026-01-20 15:35     ` Mark Brown
  2026-01-20 17:38     ` Mark Brown
  2026-01-20 15:23   ` Marek Szyprowski
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-20 15:03 UTC
  To: Mark Brown
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV

On Tue Jan 20, 2026 at 2:22 PM CET, Mark Brown wrote:
> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
>> Currently, driver_match_device() is called from three sites. One site
>> (__device_attach_driver) holds device_lock(dev), but the other two
>> (bind_store and __driver_attach) do not. This inconsistency means that
>> bus match() callbacks are not guaranteed to be called with the lock
>> held.
>
> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
>
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized

Hm..sounds a bit like some match() callback manually takes the device_lock() and
the reason we're not seeing anything from lockdep is because it happens with the
serial driver.

I don't have a machine to reproduce it, but for debugging it would probably help
to not actually take the lock in __driver_attach(), but only acquire / release
the corresponding lockdep map. If my suspicion is correct, we should see a
lockdep splat pointing out the issue.
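
To illustrate the suspected failure mode, here is a minimal user-space sketch,
not kernel code. It assumes a match() callback that takes the device lock
itself while the caller already holds it. All names (fake_device,
fake_match_takes_lock, fake_match_device_locked) are hypothetical, and a POSIX
error-checking mutex stands in for dev->mutex so that the second lock attempt
reports EDEADLK rather than hanging the way a kernel mutex would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct device: only the lock matters here. */
struct fake_device {
	pthread_mutex_t mutex;
};

/* Error-checking mutex: a relock by the owner returns EDEADLK, no hang. */
static int fake_device_init(struct fake_device *dev)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&dev->mutex, &attr);

	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* Hypothetical match() callback that takes the device lock on its own. */
static int fake_match_takes_lock(struct fake_device *dev)
{
	int ret = pthread_mutex_lock(&dev->mutex);

	if (ret)
		return -ret;	/* -EDEADLK when the caller already holds it */

	pthread_mutex_unlock(&dev->mutex);
	return 1;		/* "matched" */
}

/* Models the locked helper: the lock is held across the callback. */
static int fake_match_device_locked(struct fake_device *dev)
{
	int ret;

	ret = pthread_mutex_lock(&dev->mutex);
	assert(ret == 0);

	ret = fake_match_takes_lock(dev);

	pthread_mutex_unlock(&dev->mutex);
	return ret;
}
```

Called directly, fake_match_takes_lock() succeeds; called through
fake_match_device_locked(), the inner lock attempt fails with -EDEADLK. With a
plain, non-recursive mutex the same call pattern would block forever, which
would look like the silent boot hang reported above.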

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:22 ` Mark Brown
  2026-01-20 13:30   ` Gui-Dong Han
  2026-01-20 15:03   ` Danilo Krummrich
@ 2026-01-20 15:23   ` Marek Szyprowski
  2026-01-20 15:27     ` Mark Brown
  2026-01-21 20:00     ` Jon Hunter
  2026-01-21  7:40   ` David Heidelberg
  2026-02-11 10:42   ` Alexander Stein
  4 siblings, 2 replies; 63+ messages in thread
From: Marek Szyprowski @ 2026-01-20 15:23 UTC (permalink / raw)
  To: Mark Brown, Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV

On 20.01.2026 14:22, Mark Brown wrote:
> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
>> Currently, driver_match_device() is called from three sites. One site
>> (__device_attach_driver) holds device_lock(dev), but the other two
>> (bind_store and __driver_attach) do not. This inconsistency means that
>> bus match() callbacks are not guaranteed to be called with the lock
>> held.
> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
>
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized
>
> with no further output, full log:
>
>     https://lava.sirena.org.uk/scheduler/job/2387335#L862
>
> We are also seeing similar looking boot hangs on some Qualcomm platforms
> in Arm's test lab which aren't verified to be the same thing but are
> hanging at a similar point in boot.

I've observed the same issue on a Qualcomm RB5 board, and bisecting also led
me to this patch. My kernel log doesn't reveal much information either:

...

[    3.671227] vreg_bob: Setting 3008000-4000000uV
[    3.676929] vreg_l1c_1p8: Setting 1800000-1800000uV
[    3.682826] vreg_l2c_1p2: Setting 1200000-1200000uV
[    3.688547] vreg_l3c_0p8: Setting 800000-800000uV
[    3.694080] vreg_l4c_1p7: Setting 1704000-2928000uV
[    3.699908] vreg_l5c_1p8: Setting 1800000-2928000uV
[    3.705763] vreg_l6c_2p96: Setting 1800000-2960000uV
[    3.711684] vreg_l7c_cam_vcm0_2p85: Setting 2856000-3104000uV
[    3.718408] vreg_l8c_1p8: Setting 1800000-1800000uV
[    3.724287] vreg_l9c_2p96: Setting 2704000-2960000uV
[    3.730218] vreg_l10c_3p0: Setting 3000000-3000000uV
[    3.736226] vreg_l11c_3p3: Setting 3296000-3296000uV
[    3.743413] vreg_s8c_1p3: Setting 1352000-1352000uV
[    3.771370] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    3.792020] msm_serial: driver initialized
[    3.797633] SuperH (H)SCI(F) driver initialized
[    3.802881] STM32 USART driver initialized

[hang/freeze]

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 15:23   ` Marek Szyprowski
@ 2026-01-20 15:27     ` Mark Brown
  2026-01-21 20:00     ` Jon Hunter
  1 sibling, 0 replies; 63+ messages in thread
From: Mark Brown @ 2026-01-20 15:27 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Gui-Dong Han, gregkh, rafael, dakr, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV

On Tue, Jan 20, 2026 at 04:23:27PM +0100, Marek Szyprowski wrote:
> On 20.01.2026 14:22, Mark Brown wrote:

> > We are also seeing similar looking boot hangs on some Qualcomm platforms
> > in Arm's test lab which aren't verified to be the same thing but are
> > hanging at a similar point in boot.

> I've observed the same issue on a Qualcomm RB5 board, and bisecting also led
> me to this patch. My kernel log doesn't reveal much information either:

Yeah, that's one of the boards we're seeing issues on in Arm's test lab.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 15:03   ` Danilo Krummrich
@ 2026-01-20 15:35     ` Mark Brown
  2026-01-20 17:38     ` Mark Brown
  1 sibling, 0 replies; 63+ messages in thread
From: Mark Brown @ 2026-01-20 15:35 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV

On Tue, Jan 20, 2026 at 04:03:03PM +0100, Danilo Krummrich wrote:
> On Tue Jan 20, 2026 at 2:22 PM CET, Mark Brown wrote:

> > I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> > this commit.  The boot grinds to a halt near the end of boot:

> Hm..sounds a bit like some match() callback manually takes the device_lock() and
> the reason we're not seeing anything from lockdep is because it happens with the
> serial driver.

> I don't have a machine to reproduce it, but for debugging it would probably help
> to not actually take the lock in __driver_attach(), but only acquire / release
> the corresponding lockdep map. If my suspicion is correct, we should see a
> lockdep splat pointing out the issue.

lockdep isn't enabled by any of the defconfigs so has limited test
coverage, the only one I'm seeing it enabled by is ps3_defconfig.  I'll
try to run some tests with lockdep enabled.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 15:03   ` Danilo Krummrich
  2026-01-20 15:35     ` Mark Brown
@ 2026-01-20 17:38     ` Mark Brown
  2026-01-20 18:36       ` Danilo Krummrich
  1 sibling, 1 reply; 63+ messages in thread
From: Mark Brown @ 2026-01-20 17:38 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Tue, Jan 20, 2026 at 04:03:03PM +0100, Danilo Krummrich wrote:

> I don't have a machine to reproduce it, but for debugging it would probably help
> to not actually take the lock in __driver_attach(), but only acquire / release
> the corresponding lockdep map. If my suspicion is correct, we should see a
> lockdep splat pointing out the issue.

I tried lockdep but didn't see anything different.  Instrumenting with
printk() tells me it's deadlocking trying to attach arm-smmu on Juno
(that's a v1 SMMU on this platform), I'll try to poke further but it'll
likely be tomorrow at the earliest.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 17:38     ` Mark Brown
@ 2026-01-20 18:36       ` Danilo Krummrich
  2026-01-20 20:05         ` Mark Brown
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-20 18:36 UTC (permalink / raw)
  To: Mark Brown
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Tue Jan 20, 2026 at 6:38 PM CET, Mark Brown wrote:
> On Tue, Jan 20, 2026 at 04:03:03PM +0100, Danilo Krummrich wrote:
>
>> I don't have a machine to reproduce it, but for debugging it would probably help
>> to not actually take the lock in __driver_attach(), but only acquire / release
>> the corresponding lockdep map. If my suspicion is correct, we should see a
>> lockdep splat pointing out the issue.
>
> I tried lockdep but didn't see anything different.  Instrumenting with
> printk() tells me it's deadlocking trying to attach arm-smmu on Juno
> (that's a v1 SMMU on this platform), I'll try to poke further but it'll
> likely be tomorrow at the earliest.

Maybe the following diff faking the lock for lockdep helps, as it should keep
things running, i.e. with this we have the exact same semantics as if we'd
revert the patch (except for the lockdep check of course).

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 677320881af1..4741412d7e46 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv,
 static inline int driver_match_device_locked(const struct device_driver *drv,
                                             struct device *dev)
 {
-       guard(device)(dev);
-       return driver_match_device(drv, dev);
+       int ret;
+
+       mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
+       ret = driver_match_device(drv, dev);
+       mutex_release(&dev->mutex.dep_map, _THIS_IP_);
+
+       return ret;
 }

 static inline void dev_sync_state(struct device *dev)


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 18:36       ` Danilo Krummrich
@ 2026-01-20 20:05         ` Mark Brown
  2026-01-20 21:18           ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Mark Brown @ 2026-01-20 20:05 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Tue, Jan 20, 2026 at 07:36:05PM +0100, Danilo Krummrich wrote:
> On Tue Jan 20, 2026 at 6:38 PM CET, Mark Brown wrote:

> > I tried lockdep but didn't see anything different.  Instrumenting with
> > printk() tells me it's deadlocking trying to attach arm-smmu on Juno
> > (that's a v1 SMMU on this platform), I'll try to poke further but it'll
> > likely be tomorrow at the earliest.

> Maybe the following diff faking the lock for lockdep helps, as it should keep
> things running, i.e. with this we have the exact same semantics as if we'd
> revert the patch (except for the lockdep check of course).

That does allow us to continue to make progress, the SMMU never manages
to probe AFAICT but we do boot normally.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 20:05         ` Mark Brown
@ 2026-01-20 21:18           ` Danilo Krummrich
  2026-01-21  1:11             ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-20 21:18 UTC (permalink / raw)
  To: Mark Brown
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Tue Jan 20, 2026 at 9:05 PM CET, Mark Brown wrote:
> On Tue, Jan 20, 2026 at 07:36:05PM +0100, Danilo Krummrich wrote:
>> On Tue Jan 20, 2026 at 6:38 PM CET, Mark Brown wrote:
>
>> > I tried lockdep but didn't see anything different.  Instrumenting with
>> > printk() tells me it's deadlocking trying to attach arm-smmu on Juno
>> > (that's a v1 SMMU on this platform), I'll try to poke further but it'll
>> > likely be tomorrow at the earliest.
>
>> Maybe the following diff faking the lock for lockdep helps, as it should keep
>> things running, i.e. with this we have the exact same semantics as if we'd
>> revert the patch (except for the lockdep check of course).
>
> That does allow us to continue to make progress, the SMMU never manages
> to probe AFAICT but we do boot normally.

I really would expect a lockdep splat in this case, so I was even about to ask
whether CONFIG_PROVE_LOCKING etc. is enabled. But it's me who messed it up. I
missed that we have lockdep_set_novalidate_class(&dev->mutex).

(The fact that the SMMU never manages to probe must be unrelated, i.e. a
different issue, since my diff should be equivalent to a revert of the patch,
except that it fakes that the mutex has been taken for lockdep.)

Anyways, this should work:

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 677320881af1..4741412d7e46 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv,
 static inline int driver_match_device_locked(const struct device_driver *drv,
                                             struct device *dev)
 {
-       guard(device)(dev);
-       return driver_match_device(drv, dev);
+       int ret;
+
+       mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
+       ret = driver_match_device(drv, dev);
+       mutex_release(&dev->mutex.dep_map, _THIS_IP_);
+
+       return ret;
 }

 static inline void dev_sync_state(struct device *dev)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 40de2f51a1b1..af270362aeb7 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3159,7 +3159,7 @@ void device_initialize(struct device *dev)
        kobject_init(&dev->kobj, &device_ktype);
        INIT_LIST_HEAD(&dev->dma_pools);
        mutex_init(&dev->mutex);
-       lockdep_set_novalidate_class(&dev->mutex);
+       //lockdep_set_novalidate_class(&dev->mutex);
        spin_lock_init(&dev->devres_lock);
        INIT_LIST_HEAD(&dev->devres_head);
        device_pm_init(dev);


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 21:18           ` Danilo Krummrich
@ 2026-01-21  1:11             ` Danilo Krummrich
  2026-01-21  7:18               ` Gui-Dong Han
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21  1:11 UTC (permalink / raw)
  To: Mark Brown
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Tue Jan 20, 2026 at 10:18 PM CET, Danilo Krummrich wrote:
> Anyways, this should work:

I just noticed that I pasted the wrong diff, which was nonsense of course, since
it just unlocks all the suppressed false positives. (Should not have sent it
during a meeting. :)

What I actually intended (not neat, but hopefully helps):

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 677320881af1..4741412d7e46 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv,
 static inline int driver_match_device_locked(const struct device_driver *drv,
                                             struct device *dev)
 {
-       guard(device)(dev);
-       return driver_match_device(drv, dev);
+       int ret;
+
+       mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
+       ret = driver_match_device(drv, dev);
+       mutex_release(&dev->mutex.dep_map, _THIS_IP_);
+
+       return ret;
 }
 
 static inline void dev_sync_state(struct device *dev)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 40de2f51a1b1..56c62b3016aa 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2557,6 +2557,8 @@ static void device_release(struct kobject *kobj)
 
        kfree(dev->dma_range_map);
 
+       lockdep_unregister_key(&dev->lock_key);
+
        if (dev->release)
                dev->release(dev);
        else if (dev->type && dev->type->release)
@@ -3159,7 +3161,9 @@ void device_initialize(struct device *dev)
        kobject_init(&dev->kobj, &device_ktype);
        INIT_LIST_HEAD(&dev->dma_pools);
        mutex_init(&dev->mutex);
-       lockdep_set_novalidate_class(&dev->mutex);
+       //lockdep_set_novalidate_class(&dev->mutex);
+       lockdep_register_key(&dev->lock_key);
+       lockdep_set_class(&dev->mutex, &dev->lock_key);
        spin_lock_init(&dev->devres_lock);
        INIT_LIST_HEAD(&dev->devres_head);
        device_pm_init(dev);
diff --git a/include/linux/device.h b/include/linux/device.h
index 0be95294b6e6..dc898a420bc2 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -579,6 +579,7 @@ struct device {
        struct mutex            mutex;  /* mutex to synchronize calls to
                                         * its driver.
                                         */
+       struct lock_class_key lock_key;
 
        struct dev_links_info   links;
        struct dev_pm_info      power;


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  1:11             ` Danilo Krummrich
@ 2026-01-21  7:18               ` Gui-Dong Han
  2026-01-21  7:41                 ` Gui-Dong Han
  0 siblings, 1 reply; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-21  7:18 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Wed, Jan 21, 2026 at 9:11 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue Jan 20, 2026 at 10:18 PM CET, Danilo Krummrich wrote:
> > Anyways, this should work:
>
> I just noticed that I pasted the wrong diff, which was nonsense of course, since
> it just unlocks all the suppressed false positives. (Should not have sent it
> during a meeting. :)
>
> What I actually intended (not neat, but hopefully helps):

Thanks for the updated diff.

I tested it on my QEMU setup. Since I couldn't reproduce the hang
there, I didn't see any lockdep splats regarding the deadlock.
However, since the physical lock is removed, my PoCs successfully
triggered the UAF on both paths as expected.

I did notice a lockdep warning during boot, which happens every time.
I suspect this is because faux_bus_init is an __init function, so we
are registering a key from memory that gets freed. This seems specific
to the debug code, but I'm pasting it below for reference.

[    2.093905] WARNING: CPU: 1 PID: 1 at kernel/locking/lockdep.c:1227
lockdep_register_key+0x104/0x150
[    2.094924] Modules linked in:
[    2.095682] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted
6.17.0-dirty #11 PREEMPT(voluntary)
[    2.096241] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.15.0-1 04/01/2014
[    2.096896] RIP: 0010:lockdep_register_key+0x104/0x150
[    2.097301] Code: 8b 04 ed 20 91 dd 90 4c 89 6b 08 48 89 03 48 89
1c ed 20 91 dd 90 48 85 c0 74 04 48 89 58 08 49
[    2.098909] RSP: 0018:ffff888001217e60 EFLAGS: 00000202
[    2.099260] RAX: 0000000000000001 RBX: ffffffff8f025260 RCX: 0000000000000000
[    2.099566] RDX: 1ffffffff1e04a45 RSI: 0000000000000001 RDI: ffffffff8f025260
[    2.099898] RBP: ffffffff8f025140 R08: 0000000000000004 R09: 0000000000000000
[    2.100228] R10: ffffffff8f025213 R11: 0000000000000000 R12: ffffffff8f025260
[    2.100556] R13: ffff888001209580 R14: 0000000000000000 R15: 0000000000000000
[    2.101923] FS:  0000000000000000(0000) GS:ffff8880dc98d000(0000)
knlGS:0000000000000000
[    2.102298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.102560] CR2: 0000000000000000 CR3: 000000004de8c000 CR4: 00000000000006f0
[    2.102628] Call Trace:
[    2.102628]  <TASK>
[    2.102991]  device_initialize+0xc3/0x500
[    2.103307]  device_register+0xd/0x20
[    2.103525]  faux_bus_init+0x12/0x80
[    2.103730]  driver_init+0x2e/0x50
[    2.104066]  kernel_init_freeable+0x33e/0x6e0
[    2.104335]  ? __pfx_kernel_init+0x10/0x10
[    2.104566]  kernel_init+0x1a/0x1e0
[    2.104765]  ? _raw_spin_unlock_irq+0x23/0x40
[    2.105118]  ret_from_fork+0x255/0x330
[    2.105350]  ? __pfx_kernel_init+0x10/0x10
[    2.105567]  ret_from_fork_asm+0x1a/0x30
[    2.106001]  </TASK>
[    2.106218] irq event stamp: 3545
[    2.106405] hardirqs last  enabled at (3553): [<ffffffff8ace8436>]
__up_console_sem+0x66/0x70
[    2.106892] hardirqs last disabled at (3562): [<ffffffff8ace841b>]
__up_console_sem+0x4b/0x70
[    2.107276] softirqs last  enabled at (3244): [<ffffffff8ab7de83>]
handle_softirqs+0x4f3/0x750
[    2.107644] softirqs last disabled at (3239): [<ffffffff8ab7e214>]
__irq_exit_rcu+0xc4/0x100
[    2.108924] ---[ end trace 0000000000000000 ]---

Thanks.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:22 ` Mark Brown
                     ` (2 preceding siblings ...)
  2026-01-20 15:23   ` Marek Szyprowski
@ 2026-01-21  7:40   ` David Heidelberg
  2026-02-11 10:42   ` Alexander Stein
  4 siblings, 0 replies; 63+ messages in thread
From: David Heidelberg @ 2026-01-21  7:40 UTC (permalink / raw)
  To: Mark Brown, Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV

On 20/01/2026 14:22, Mark Brown wrote:
> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
>> Currently, driver_match_device() is called from three sites. One site
>> (__device_attach_driver) holds device_lock(dev), but the other two
>> (bind_store and __driver_attach) do not. This inconsistency means that
>> bus match() callbacks are not guaranteed to be called with the lock
>> held.
> 
> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
> 
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized
> 

Similar outcome on sdm845-based:
  - Pixel 3
  - OnePlus 6T

Reverting the patch unblocks the boot process.

David

-- 
David Heidelberg


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  7:18               ` Gui-Dong Han
@ 2026-01-21  7:41                 ` Gui-Dong Han
  2026-01-21  7:56                   ` Greg KH
  0 siblings, 1 reply; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-21  7:41 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Wed, Jan 21, 2026 at 3:18 PM Gui-Dong Han <hanguidong02@gmail.com> wrote:
>
> On Wed, Jan 21, 2026 at 9:11 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Tue Jan 20, 2026 at 10:18 PM CET, Danilo Krummrich wrote:
> > > Anyways, this should work:
> >
> > I just noticed that I pasted the wrong diff, which was nonsense of course, since
> > it just unlocks all the suppressed false positives. (Should not have sent it
> > during a meeting. :)
> >
> > What I actually intended (not neat, but hopefully helps):
>
> Thanks for the updated diff.
>
> I tested it on my QEMU setup. Since I couldn't reproduce the hang
> there, I didn't see any lockdep splats regarding the deadlock.
> However, since the physical lock is removed, my PoCs successfully
> triggered the UAF on both paths as expected.
>
> I did notice a lockdep warning during boot, which happens every time.
> I suspect this is because faux_bus_init is an __init function, so we
> are registering a key from memory that gets freed. This seems specific
> to the debug code, but I'm pasting it below for reference.

I figured out the root cause.

The warning is triggered because faux_bus_root is a static object.
lockdep_register_key() has a WARN_ON_ONCE(static_obj(key)) check that
forbids registering keys residing in static memory. It is not about
__init memory being freed.

Anyway, this is not a big deal and doesn't impact the testing results.

Thanks.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  7:41                 ` Gui-Dong Han
@ 2026-01-21  7:56                   ` Greg KH
  2026-01-21  8:12                     ` Greg KH
  2026-01-21  9:54                     ` Danilo Krummrich
  0 siblings, 2 replies; 63+ messages in thread
From: Greg KH @ 2026-01-21  7:56 UTC (permalink / raw)
  To: Gui-Dong Han
  Cc: Danilo Krummrich, Mark Brown, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Wed, Jan 21, 2026 at 03:41:56PM +0800, Gui-Dong Han wrote:
> On Wed, Jan 21, 2026 at 3:18 PM Gui-Dong Han <hanguidong02@gmail.com> wrote:
> >
> > On Wed, Jan 21, 2026 at 9:11 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Tue Jan 20, 2026 at 10:18 PM CET, Danilo Krummrich wrote:
> > > > Anyways, this should work:
> > >
> > > I just noticed that I pasted the wrong diff, which was nonsense of course, since
> > > it just unlocks all the suppressed false positives. (Should not have sent it
> > > during a meeting. :)
> > >
> > > What I actually intended (not neat, but hopefully helps):
> >
> > Thanks for the updated diff.
> >
> > I tested it on my QEMU setup. Since I couldn't reproduce the hang
> > there, I didn't see any lockdep splats regarding the deadlock.
> > However, since the physical lock is removed, my PoCs successfully
> > triggered the UAF on both paths as expected.
> >
> > I did notice a lockdep warning during boot, which happens every time.
> > I suspect this is because faux_bus_init is an __init function, so we
> > are registering a key from memory that gets freed. This seems specific
> > to the debug code, but I'm pasting it below for reference.
> 
> I figured out the root cause.
> 
> The warning is triggered because faux_bus_root is a static object.
> lockdep_register_key() has a WARN_ON_ONCE(static_obj(key)) check that
> forbids registering keys residing in static memory. It is not about
> __init memory being freed.
> 
> Anyway, this is not a big deal and doesn't impact the testing results.

Ooh, nice catch.  Let me go make that a dynamic object.  It really
shouldn't be a static one, I hate static struct device usage, and
complain about it from everyone else.  So there's no reason I should
have used that myself :(

thanks,

greg k-h

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  7:56                   ` Greg KH
@ 2026-01-21  8:12                     ` Greg KH
  2026-01-21  9:54                     ` Danilo Krummrich
  1 sibling, 0 replies; 63+ messages in thread
From: Greg KH @ 2026-01-21  8:12 UTC (permalink / raw)
  To: Gui-Dong Han
  Cc: Danilo Krummrich, Mark Brown, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Wed, Jan 21, 2026 at 08:56:26AM +0100, Greg KH wrote:
> On Wed, Jan 21, 2026 at 03:41:56PM +0800, Gui-Dong Han wrote:
> > On Wed, Jan 21, 2026 at 3:18 PM Gui-Dong Han <hanguidong02@gmail.com> wrote:
> > >
> > > On Wed, Jan 21, 2026 at 9:11 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > >
> > > > On Tue Jan 20, 2026 at 10:18 PM CET, Danilo Krummrich wrote:
> > > > > Anyways, this should work:
> > > >
> > > > I just noticed that I pasted the wrong diff, which was nonsense of course, since
> > > > it just unlocks all the suppressed false positives. (Should not have sent it
> > > > during a meeting. :)
> > > >
> > > > What I actually intended (not neat, but hopefully helps):
> > >
> > > Thanks for the updated diff.
> > >
> > > I tested it on my QEMU setup. Since I couldn't reproduce the hang
> > > there, I didn't see any lockdep splats regarding the deadlock.
> > > However, since the physical lock is removed, my PoCs successfully
> > > triggered the UAF on both paths as expected.
> > >
> > > I did notice a lockdep warning during boot, which happens every time.
> > > I suspect this is because faux_bus_init is an __init function, so we
> > > are registering a key from memory that gets freed. This seems specific
> > > to the debug code, but I'm pasting it below for reference.
> > 
> > I figured out the root cause.
> > 
> > The warning is triggered because faux_bus_root is a static object.
> > lockdep_register_key() has a WARN_ON_ONCE(static_obj(key)) check that
> > forbids registering keys residing in static memory. It is not about
> > __init memory being freed.
> > 
> > Anyway, this is not a big deal and doesn't impact the testing results.
> 
> Ooh, nice catch.  Let me go make that a dynamic object.  It really
> shouldn't be a static one, I hate static struct device usage, and
> complain about it from everyone else.  So there's no reason I should
> have used that myself :(

Totally untested patch below.  Give me a few hours before I can reboot
and try this, but if you wish to use it, please do!


From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Wed, 21 Jan 2026 09:10:21 +0100
Subject: [PATCH] driver core: faux: stop using static struct device

faux_bus_root should not have been a static struct device, but rather a
dynamically created structure so that lockdep and other testing tools do
not trip over it (as well as being the right thing overall to do.)  Fix
this up by making it properly dynamic.

Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/base/faux.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/base/faux.c b/drivers/base/faux.c
index 21dd02124231..23d725817232 100644
--- a/drivers/base/faux.c
+++ b/drivers/base/faux.c
@@ -29,9 +29,7 @@ struct faux_object {
 };
 #define to_faux_object(dev) container_of_const(dev, struct faux_object, faux_dev.dev)
 
-static struct device faux_bus_root = {
-	.init_name	= "faux",
-};
+static struct device *faux_bus_root;
 
 static int faux_match(struct device *dev, const struct device_driver *drv)
 {
@@ -152,7 +150,7 @@ struct faux_device *faux_device_create_with_groups(const char *name,
 	if (parent)
 		dev->parent = parent;
 	else
-		dev->parent = &faux_bus_root;
+		dev->parent = faux_bus_root;
 	dev->bus = &faux_bus_type;
 	dev_set_name(dev, "%s", name);
 	device_set_pm_not_required(dev);
@@ -236,9 +234,15 @@ int __init faux_bus_init(void)
 {
 	int ret;
 
-	ret = device_register(&faux_bus_root);
+	faux_bus_root = kzalloc(sizeof(*faux_bus_root), GFP_KERNEL);
+	if (!faux_bus_root)
+		return -ENOMEM;
+
+	dev_set_name(faux_bus_root, "faux");
+
+	ret = device_register(faux_bus_root);
 	if (ret) {
-		put_device(&faux_bus_root);
+		put_device(faux_bus_root);
 		return ret;
 	}
 
@@ -256,6 +260,6 @@ int __init faux_bus_init(void)
 	bus_unregister(&faux_bus_type);
 
 error_bus:
-	device_unregister(&faux_bus_root);
+	device_unregister(faux_bus_root);
 	return ret;
 }
-- 
2.52.0


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:30   ` Gui-Dong Han
  2026-01-20 13:48     ` Mark Brown
@ 2026-01-21  8:55     ` Wang Jiayue
  2026-01-21  8:57       ` Gui-Dong Han
  2026-01-21 10:40       ` Danilo Krummrich
  1 sibling, 2 replies; 63+ messages in thread
From: Wang Jiayue @ 2026-01-21  8:55 UTC (permalink / raw)
  To: hanguidong02
  Cc: Aishwarya.TCV, broonie, chenqiuji666, dakr, gregkh, linux-kernel,
	rafael

> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
> 
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized

After partially modifying juno.dts, I managed to roughly emulate kernel
boot on juno board with qemu and successfully reproduced the boot hang.
Below is the gdb backtrace:

#0  0xffff800080114ae0 in mutex_spin_on_owner (lock=0xffff0000036bfc90, owner=0xffff000003510000, ww_ctx=0x0, waiter=0x0) at kernel/locking/mutex.c:377
#1  0xffff80008118cecc in mutex_optimistic_spin (waiter=<optimized out>, ww_ctx=<optimized out>, lock=<optimized out>) at kernel/locking/mutex.c:480
#2  __mutex_lock_common (use_ww_ctx=<optimized out>, ww_ctx=<optimized out>, ip=<optimized out>, nest_lock=<optimized out>, subclass=<optimized out>, state=<optimized out>, lock=<optimized out>) at kernel/locking/mutex.c:618
#3  __mutex_lock (lock=0xffff0000036bfc90, state=0x2, ip=<optimized out>, nest_lock=<optimized out>, subclass=<optimized out>) at kernel/locking/mutex.c:776
#4  0xffff80008118d1dc in __mutex_lock_slowpath (lock=0xffff0000036bfc90) at kernel/locking/mutex.c:1065
#5  0xffff80008118d230 in mutex_lock (lock=0xffff0000036bfc90) at kernel/locking/mutex.c:290
#6  0xffff8000809cdd1c in device_lock (dev=<optimized out>) at ./include/linux/device.h:895
#7  class_device_constructor (_T=<optimized out>) at ./include/linux/device.h:913
#8  driver_match_device_locked (dev=<optimized out>, drv=<optimized out>) at drivers/base/base.h:193
#9  __driver_attach (dev=0xffff0000036bfc10, data=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/dd.c:1183
#10 0xffff8000809cb17c in bus_for_each_dev (bus=0xffff0000036bfc90, start=0x0, data=0xffff800082e64440 <qcom_smmu_tbu_driver+40>, fn=0xffff8000809cdcec <__driver_attach>) at drivers/base/bus.c:383
#11 0xffff8000809cd03c in driver_attach (drv=0x0) at drivers/base/dd.c:1245
#12 0xffff8000809cc748 in bus_add_driver (drv=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/bus.c:715
#13 0xffff8000809ced28 in driver_register (drv=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/driver.c:249
#14 0xffff8000809d0254 in __platform_driver_register (drv=0x0, owner=0xffff000003510000) at drivers/base/platform.c:908
#15 0xffff8000809a6208 in qcom_smmu_impl_init (smmu=0xffff0000037c0080) at drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c:780
#16 0xffff8000809a48a0 in arm_smmu_impl_init (smmu=0xffff0000037c0080) at drivers/iommu/arm/arm-smmu/arm-smmu-impl.c:224
#17 0xffff8000809a2ae0 in arm_smmu_device_probe (pdev=0xffff0000036bfc00) at drivers/iommu/arm/arm-smmu/arm-smmu.c:2155
#18 0xffff8000809d060c in platform_probe (_dev=0xffff0000036bfc10) at drivers/base/platform.c:1446
#19 0xffff8000809cd6a4 in call_driver_probe (drv=<optimized out>, dev=<optimized out>) at drivers/base/dd.c:583
#20 really_probe (dev=0xffff0000036bfc10, drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/dd.c:661
#21 0xffff8000809cd8f8 in __driver_probe_device (drv=0xffff800082e641c0 <arm_smmu_driver+40>, dev=0xffff0000036bfc10) at drivers/base/dd.c:803
#22 0xffff8000809cdb34 in driver_probe_device (drv=0xffff0000036bfc90, dev=0xffff0000036bfc10) at drivers/base/dd.c:833
#23 0xffff8000809cddb8 in __driver_attach (data=<optimized out>, dev=<optimized out>) at drivers/base/dd.c:1227
#24 __driver_attach (dev=0xffff0000036bfc10, data=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/dd.c:1167
#25 0xffff8000809cb17c in bus_for_each_dev (bus=0xffff0000036bfc90, start=0x0, data=0xffff800082e641c0 <arm_smmu_driver+40>, fn=0xffff8000809cdcec <__driver_attach>) at drivers/base/bus.c:383
#26 0xffff8000809cd03c in driver_attach (drv=0x0) at drivers/base/dd.c:1245
#27 0xffff8000809cc748 in bus_add_driver (drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/bus.c:715
#28 0xffff8000809ced28 in driver_register (drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/driver.c:249
#29 0xffff8000809d0254 in __platform_driver_register (drv=0x0, owner=0xffff000003510000) at drivers/base/platform.c:908
#30 0xffff800081f3d12c in arm_smmu_driver_init () at drivers/iommu/arm/arm-smmu/arm-smmu.c:2368
#31 0xffff800080015218 in do_one_initcall (fn=0xffff800081f3d10c <arm_smmu_driver_init>) at init/main.c:1378
#32 0xffff800081ed13e4 in do_initcall_level (command_line=<optimized out>, level=<optimized out>) at init/main.c:1440
#33 do_initcalls () at init/main.c:1456
#34 do_basic_setup () at init/main.c:1475
#35 kernel_init_freeable () at init/main.c:1688
#36 0xffff800081187b50 in kernel_init (unused=0xffff0000036bfc90) at init/main.c:1578
#37 0xffff800080015f58 in ret_from_fork () at arch/arm64/kernel/entry.S:860
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  8:55     ` Wang Jiayue
@ 2026-01-21  8:57       ` Gui-Dong Han
  2026-01-21 10:40       ` Danilo Krummrich
  1 sibling, 0 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-21  8:57 UTC (permalink / raw)
  To: Wang Jiayue
  Cc: Aishwarya.TCV, broonie, chenqiuji666, dakr, gregkh, linux-kernel,
	rafael

On Wed, Jan 21, 2026 at 4:56 PM Wang Jiayue <akaieurus@gmail.com> wrote:
>
> > I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> > this commit.  The boot grinds to a halt near the end of boot:
> >
> > [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> > [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > [    2.623547] msm_serial: driver initialized
> > [    2.624058] SuperH (H)SCI(F) driver initialized
> > [    2.624312] STM32 USART driver initialized
>
> After partially modifying juno.dts, I managed to roughly emulate kernel
> boot on juno board with qemu and successfully reproduced the boot hang.
> Below is the gdb backtrace:

Great work, thank you very much!

This is a bit puzzling to me. Since __device_driver_lock(dev,
dev->parent) is called a few lines later in __driver_attach(), I
wonder if the original kernel would also deadlock if match() returns
true in such a nested scenario?

Thanks.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  7:56                   ` Greg KH
  2026-01-21  8:12                     ` Greg KH
@ 2026-01-21  9:54                     ` Danilo Krummrich
  2026-01-21 10:30                       ` Greg KH
  1 sibling, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21  9:54 UTC (permalink / raw)
  To: Greg KH, Gui-Dong Han
  Cc: Mark Brown, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV, Marek Szyprowski

On Wed Jan 21, 2026 at 8:56 AM CET, Greg KH wrote:
> On Wed, Jan 21, 2026 at 03:41:56PM +0800, Gui-Dong Han wrote:
>> The warning is triggered because faux_bus_root is a static object.
>> lockdep_register_key() has a WARN_ON_ONCE(static_obj(key)) check that
>> forbids registering keys residing in static memory. It is not about
>> __init memory being freed.
>> 
>> Anyway, this is not a big deal and doesn't impact the testing results.

Sorry, I was aware of this (and that it is harmless for testing purposes) and
should have mentioned it so people do not get distracted by it.

> Ooh, nice catch.  Let me go make that a dynamic object.

Also had it on my list to fix up today. :)

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  9:54                     ` Danilo Krummrich
@ 2026-01-21 10:30                       ` Greg KH
  0 siblings, 0 replies; 63+ messages in thread
From: Greg KH @ 2026-01-21 10:30 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, Mark Brown, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, Aishwarya.TCV, Marek Szyprowski

On Wed, Jan 21, 2026 at 10:54:52AM +0100, Danilo Krummrich wrote:
> On Wed Jan 21, 2026 at 8:56 AM CET, Greg KH wrote:
> > On Wed, Jan 21, 2026 at 03:41:56PM +0800, Gui-Dong Han wrote:
> >> The warning is triggered because faux_bus_root is a static object.
> >> lockdep_register_key() has a WARN_ON_ONCE(static_obj(key)) check that
> >> forbids registering keys residing in static memory. It is not about
> >> __init memory being freed.
> >> 
> >> Anyway, this is not a big deal and doesn't impact the testing results.
> 
> Sorry, I was aware of this (and that it is harmless for testing purposes) and
> should have mentioned it so people do not get distracted by it.
> 
> > Ooh, nice catch.  Let me go make that a dynamic object.
> 
> Also had it on my list to fix up today. :)

I've tested this and have sent it out for "real" now.

thanks,

greg k-h

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21  8:55     ` Wang Jiayue
  2026-01-21  8:57       ` Gui-Dong Han
@ 2026-01-21 10:40       ` Danilo Krummrich
  2026-01-21 11:02         ` Danilo Krummrich
  1 sibling, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 10:40 UTC (permalink / raw)
  To: Wang Jiayue, hanguidong02, gregkh, rafael
  Cc: Aishwarya.TCV, broonie, chenqiuji666, linux-kernel, m.szyprowski,
	robin.clark, will, robin.murphy, joro, iommu

(Cc: Rob, Will, Robin, Joerg)

On Wed Jan 21, 2026 at 9:55 AM CET, Wang Jiayue wrote:
> After partially modifying juno.dts, I managed to roughly emulate kernel
> boot on juno board with qemu and successfully reproduced the boot hang.
> Below is the gdb backtrace:
>
> #0  0xffff800080114ae0 in mutex_spin_on_owner (lock=0xffff0000036bfc90, owner=0xffff000003510000, ww_ctx=0x0, waiter=0x0) at kernel/locking/mutex.c:377
> #1  0xffff80008118cecc in mutex_optimistic_spin (waiter=<optimized out>, ww_ctx=<optimized out>, lock=<optimized out>) at kernel/locking/mutex.c:480
> #2  __mutex_lock_common (use_ww_ctx=<optimized out>, ww_ctx=<optimized out>, ip=<optimized out>, nest_lock=<optimized out>, subclass=<optimized out>, state=<optimized out>, lock=<optimized out>) at kernel/locking/mutex.c:618
> #3  __mutex_lock (lock=0xffff0000036bfc90, state=0x2, ip=<optimized out>, nest_lock=<optimized out>, subclass=<optimized out>) at kernel/locking/mutex.c:776
> #4  0xffff80008118d1dc in __mutex_lock_slowpath (lock=0xffff0000036bfc90) at kernel/locking/mutex.c:1065
> #5  0xffff80008118d230 in mutex_lock (lock=0xffff0000036bfc90) at kernel/locking/mutex.c:290
> #6  0xffff8000809cdd1c in device_lock (dev=<optimized out>) at ./include/linux/device.h:895
> #7  class_device_constructor (_T=<optimized out>) at ./include/linux/device.h:913
> #8  driver_match_device_locked (dev=<optimized out>, drv=<optimized out>) at drivers/base/base.h:193
> #9  __driver_attach (dev=0xffff0000036bfc10, data=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/dd.c:1183
> #10 0xffff8000809cb17c in bus_for_each_dev (bus=0xffff0000036bfc90, start=0x0, data=0xffff800082e64440 <qcom_smmu_tbu_driver+40>, fn=0xffff8000809cdcec <__driver_attach>) at drivers/base/bus.c:383
> #11 0xffff8000809cd03c in driver_attach (drv=0x0) at drivers/base/dd.c:1245
> #12 0xffff8000809cc748 in bus_add_driver (drv=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/bus.c:715
> #13 0xffff8000809ced28 in driver_register (drv=0xffff800082e64440 <qcom_smmu_tbu_driver+40>) at drivers/base/driver.c:249
> #14 0xffff8000809d0254 in __platform_driver_register (drv=0x0, owner=0xffff000003510000) at drivers/base/platform.c:908
> #15 0xffff8000809a6208 in qcom_smmu_impl_init (smmu=0xffff0000037c0080) at drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c:780
> #16 0xffff8000809a48a0 in arm_smmu_impl_init (smmu=0xffff0000037c0080) at drivers/iommu/arm/arm-smmu/arm-smmu-impl.c:224
> #17 0xffff8000809a2ae0 in arm_smmu_device_probe (pdev=0xffff0000036bfc00) at drivers/iommu/arm/arm-smmu/arm-smmu.c:2155
> #18 0xffff8000809d060c in platform_probe (_dev=0xffff0000036bfc10) at drivers/base/platform.c:1446
> #19 0xffff8000809cd6a4 in call_driver_probe (drv=<optimized out>, dev=<optimized out>) at drivers/base/dd.c:583
> #20 really_probe (dev=0xffff0000036bfc10, drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/dd.c:661
> #21 0xffff8000809cd8f8 in __driver_probe_device (drv=0xffff800082e641c0 <arm_smmu_driver+40>, dev=0xffff0000036bfc10) at drivers/base/dd.c:803
> #22 0xffff8000809cdb34 in driver_probe_device (drv=0xffff0000036bfc90, dev=0xffff0000036bfc10) at drivers/base/dd.c:833
> #23 0xffff8000809cddb8 in __driver_attach (data=<optimized out>, dev=<optimized out>) at drivers/base/dd.c:1227
> #24 __driver_attach (dev=0xffff0000036bfc10, data=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/dd.c:1167
> #25 0xffff8000809cb17c in bus_for_each_dev (bus=0xffff0000036bfc90, start=0x0, data=0xffff800082e641c0 <arm_smmu_driver+40>, fn=0xffff8000809cdcec <__driver_attach>) at drivers/base/bus.c:383
> #26 0xffff8000809cd03c in driver_attach (drv=0x0) at drivers/base/dd.c:1245
> #27 0xffff8000809cc748 in bus_add_driver (drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/bus.c:715
> #28 0xffff8000809ced28 in driver_register (drv=0xffff800082e641c0 <arm_smmu_driver+40>) at drivers/base/driver.c:249
> #29 0xffff8000809d0254 in __platform_driver_register (drv=0x0, owner=0xffff000003510000) at drivers/base/platform.c:908
> #30 0xffff800081f3d12c in arm_smmu_driver_init () at drivers/iommu/arm/arm-smmu/arm-smmu.c:2368
> #31 0xffff800080015218 in do_one_initcall (fn=0xffff800081f3d10c <arm_smmu_driver_init>) at init/main.c:1378
> #32 0xffff800081ed13e4 in do_initcall_level (command_line=<optimized out>, level=<optimized out>) at init/main.c:1440
> #33 do_initcalls () at init/main.c:1456
> #34 do_basic_setup () at init/main.c:1475
> #35 kernel_init_freeable () at init/main.c:1688
> #36 0xffff800081187b50 in kernel_init (unused=0xffff0000036bfc90) at init/main.c:1578
> #37 0xffff800080015f58 in ret_from_fork () at arch/arm64/kernel/entry.S:860
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thanks, this backtrace is very helpful. My lockdep patch should reveal the same
issue once run on real hardware, but with this it's probably not even necessary
anymore.

So, the problem is that in the callstack of the arm-smmu driver's (a platform
driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
registers another platform driver. Since we are still in probe() of arm-smmu the
call to platform_driver_register() happens with the device lock of the arm-smmu
platform device held.

platform_driver_register() eventually results in driver_attach() which iterates
over all the devices of a bus. Since the device we are probing and the driver we
are registering are for the same bus (i.e. the platform bus) it can now happen
that, by chance, we also match the exact same device that is currently probed
again. And since we take the device lock for matching now, we actually take the
same lock twice.
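
The double acquisition described above can be reproduced in user space. This is
only a sketch under stated assumptions: struct fake_device and the fake_*()
helpers are hypothetical stand-ins, not driver core code, and a POSIX
error-checking mutex replaces the kernel mutex so that the second lock attempt
returns EDEADLK instead of hanging the way the boot did.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct device; only the lock matters here. */
struct fake_device {
	pthread_mutex_t lock;
};

/*
 * Use an error-checking mutex so that relocking by the owner is reported
 * as EDEADLK rather than blocking forever.
 */
static int fake_device_init(struct fake_device *dev)
{
	pthread_mutexattr_t attr;
	int ret;

	assert(dev != NULL);
	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&dev->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* Models driver_match_device_locked(): take the device lock to match. */
static int fake_match_locked(struct fake_device *dev)
{
	int ret = pthread_mutex_lock(&dev->lock);

	if (ret)
		return ret;	/* EDEADLK if the caller already owns the lock */
	/* The bus match() callback would run here. */
	pthread_mutex_unlock(&dev->lock);
	return 0;
}

/*
 * Models probe() registering a second driver for the same bus: probe()
 * runs with the device lock held, and the registration walks every device
 * on the bus, including the one being probed.
 */
static int fake_probe_registers_driver(struct fake_device *dev)
{
	int ret = pthread_mutex_lock(&dev->lock);

	if (ret)
		return ret;
	ret = fake_match_locked(dev);	/* second lock of the same mutex */
	pthread_mutex_unlock(&dev->lock);
	return ret;
}
```

Calling fake_match_locked() on its own succeeds; only the nested call from
fake_probe_registers_driver() fails, which mirrors frames #9 and #23 of the
backtrace operating on the same device.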

Now, we could avoid this by not matching bound devices, but we check this
through dev->driver while holding the device lock, so that doesn't help.

But on the other hand, I don't see any reason why a driver would call
platform_driver_register() from probe() in the first place. I think drivers
should not do that and instead just register the driver through a normal
initcall.

(If, however, it turns out that registering drivers from probe() is something we
really need for some reason, it is probably best to drop the patch and not
make any guarantees about whether match() is called with the device lock held or
not.

Consequently, driver_override must be protected with a separate lock (which
would be the cleaner solution in any case).)
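
As a rough user-space illustration of the separate-lock idea, assuming a
hypothetical struct ovr_device with its own override_lock (none of these names
exist in the kernel), the override string can be protected independently of
whatever lock the caller holds during probe():

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device with a lock dedicated to driver_override. */
struct ovr_device {
	pthread_mutex_t override_lock;	/* protects driver_override only */
	char *driver_override;
};

static void ovr_device_init(struct ovr_device *dev)
{
	assert(dev != NULL);
	pthread_mutex_init(&dev->override_lock, NULL);
	dev->driver_override = NULL;
}

/* Replace the override string; NULL clears it. */
static int ovr_set_override(struct ovr_device *dev, const char *name)
{
	char *new = NULL, *old;

	if (name) {
		new = strdup(name);
		if (!new)
			return -ENOMEM;
	}

	pthread_mutex_lock(&dev->override_lock);
	old = dev->driver_override;
	dev->driver_override = new;
	pthread_mutex_unlock(&dev->override_lock);

	free(old);	/* safe: no reader can still hold a pointer to it */
	return 0;
}

/*
 * Match without touching any device-wide lock: with an override set, only
 * a driver of that name matches; without one, every driver matches.
 */
static int ovr_match(struct ovr_device *dev, const char *drv_name)
{
	int ret;

	pthread_mutex_lock(&dev->override_lock);
	ret = !dev->driver_override ||
	      strcmp(dev->driver_override, drv_name) == 0;
	pthread_mutex_unlock(&dev->override_lock);
	return ret;
}
```

Because ovr_match() only takes override_lock, it could be called from a context
that already holds a device-wide lock without taking that lock a second time.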

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 10:40       ` Danilo Krummrich
@ 2026-01-21 11:02         ` Danilo Krummrich
  2026-01-21 11:19           ` Greg KH
                             ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 11:02 UTC (permalink / raw)
  To: Wang Jiayue, hanguidong02, gregkh, rafael
  Cc: Aishwarya.TCV, broonie, chenqiuji666, linux-kernel, m.szyprowski,
	robin.clark, will, robin.murphy, joro, iommu

On Wed Jan 21, 2026 at 11:40 AM CET, Danilo Krummrich wrote:
> So, the problem is that in the callstack of the arm-smmu driver's (a platform
> driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
> registers another platform driver. Since we are still in probe() of arm-smmu the
> call to platform_driver_register() happens with the device lock of the arm-smmu
> platform device held.
>
> platform_driver_register() eventually results in driver_attach() which iterates
> over all the devices of a bus. Since the device we are probing and the driver we
> are registering are for the same bus (i.e. the platform bus) it can now happen
> that, by chance, we also match the exact same device that is currently probed
> again. And since we take the device lock for matching now, we actually take the
> same lock twice.
>
> Now, we could avoid this by not matching bound devices, but we check this
> through dev->driver while holding the device lock, so that doesn't help.
>
> But on the other hand, I don't see any reason why a driver would call
> platform_driver_register() from probe() in the first place. I think drivers
> should not do that and instead just register the driver through a normal
> initcall.
>
> (If, however, it turns out that registering drivers from probe() is something we
> really need for some reason, it is probably best to drop the patch and don't
> make any guarantees about whether match() is called with the device lock held or
> not.
>
> Consequently, driver_override must be protected with a separate lock (which
> would be the cleaner solution in any case).)

I assume that this should resolve the problem (unless there are more drivers
that register drivers in probe()):

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 573085349df3..9bb793efc35f 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -774,10 +774,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
 {
        const struct device_node *np = smmu->dev->of_node;
        const struct of_device_id *match;
-       static u8 tbu_registered;
-
-       if (!tbu_registered++)
-               platform_driver_register(&qcom_smmu_tbu_driver);

 #ifdef CONFIG_ACPI
        if (np == NULL) {
@@ -802,3 +798,5 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)

        return smmu;
 }
+
+builtin_platform_driver(qcom_smmu_tbu_driver);

@qcom maintainers: I'm aware of commit 0b4eeee2876f ("iommu/arm-smmu-qcom:
Register the TBU driver in qcom_smmu_impl_init"), but I think the above patch
should work fine, since the TBU driver is still registered even when
CONFIG_ARM_SMMU_QCOM_DEBUG is not set?

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 11:02         ` Danilo Krummrich
@ 2026-01-21 11:19           ` Greg KH
  2026-01-21 12:49           ` Mark Brown
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Greg KH @ 2026-01-21 11:19 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Wang Jiayue, hanguidong02, rafael, Aishwarya.TCV, broonie,
	chenqiuji666, linux-kernel, m.szyprowski, robin.clark, will,
	robin.murphy, joro, iommu

On Wed, Jan 21, 2026 at 12:02:15PM +0100, Danilo Krummrich wrote:
> On Wed Jan 21, 2026 at 11:40 AM CET, Danilo Krummrich wrote:
> > So, the problem is that in the callstack of the arm-smmu driver's (a platform
> > driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
> > registers another platform driver. Since we are still in probe() of arm-smmu the
> > call to platform_driver_register() happens with the device lock of the arm-smmu
> > platform device held.
> >
> > platform_driver_register() eventually results in driver_attach() which iterates
> > over all the devices of a bus. Since the device we are probing and the driver we
> > are registering are for the same bus (i.e. the platform bus) it can now happen
> > that, by chance, we also match the exact same device that is currently probed
> > again. And since we take the device lock for matching now, we actually take the
> > same lock twice.
> >
> > Now, we could avoid this by not matching bound devices, but we check this
> > through dev->driver while holding the device lock, so that doesn't help.
> >
> > But on the other hand, I don't see any reason why a driver would call
> > platform_driver_register() from probe() in the first place. I think drivers
> > should not do that and instead just register the driver through a normal
> > initcall.
> >
> > (If, however, it turns out that registering drivers from probe() is something we
> > really need for some reason, it is probably best to drop the patch and don't
> > make any guarantees about whether match() is called with the device lock held or
> > not.
> >
> > Consequently, driver_override must be protected with a separate lock (which
> > would be the cleaner solution in any case).)
> 
> I assume that this should resolve the problem (unless there are more drivers
> that register drivers in probe()):
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 573085349df3..9bb793efc35f 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -774,10 +774,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>  {
>         const struct device_node *np = smmu->dev->of_node;
>         const struct of_device_id *match;
> -       static u8 tbu_registered;
> -
> -       if (!tbu_registered++)
> -               platform_driver_register(&qcom_smmu_tbu_driver);

Ick, yeah, that should not be happening.  We should deadlock on that no
matter what.

> 
>  #ifdef CONFIG_ACPI
>         if (np == NULL) {
> @@ -802,3 +798,5 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
> 
>         return smmu;
>  }
> +
> +builtin_platform_driver(qcom_smmu_tbu_driver);

change makes sense to me.

thanks,

greg k-h

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 11:02         ` Danilo Krummrich
  2026-01-21 11:19           ` Greg KH
@ 2026-01-21 12:49           ` Mark Brown
  2026-01-21 12:50             ` Danilo Krummrich
  2026-01-21 13:03           ` Robin Murphy
  2026-01-21 13:22           ` Jiayue Wang
  3 siblings, 1 reply; 63+ messages in thread
From: Mark Brown @ 2026-01-21 12:49 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Wang Jiayue, hanguidong02, gregkh, rafael, Aishwarya.TCV,
	chenqiuji666, linux-kernel, m.szyprowski, robin.clark, will,
	robin.murphy, joro, iommu

On Wed, Jan 21, 2026 at 12:02:15PM +0100, Danilo Krummrich wrote:

> I assume that this should resolve the problem (unless there are more drivers
> that register drivers in probe()):

This makes sense to me, I guess we can just fix any other instances as
they arise.  Will you send the patch?

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 12:49           ` Mark Brown
@ 2026-01-21 12:50             ` Danilo Krummrich
  2026-01-21 13:02               ` Will Deacon
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 12:50 UTC (permalink / raw)
  To: Mark Brown
  Cc: Wang Jiayue, hanguidong02, gregkh, rafael, Aishwarya.TCV,
	chenqiuji666, linux-kernel, m.szyprowski, robin.clark, will,
	robin.murphy, joro, iommu

On Wed Jan 21, 2026 at 1:49 PM CET, Mark Brown wrote:
> On Wed, Jan 21, 2026 at 12:02:15PM +0100, Danilo Krummrich wrote:
>
>> I assume that this should resolve the problem (unless there are more drivers
>> that register drivers in probe()):
>
> This makes sense to me, I guess we can just fix any other instances as
> they arise.  Will you send the patch?

Yes, I will send it soon.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 12:50             ` Danilo Krummrich
@ 2026-01-21 13:02               ` Will Deacon
  2026-01-21 14:07                 ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Will Deacon @ 2026-01-21 13:02 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Mark Brown, Wang Jiayue, hanguidong02, gregkh, rafael,
	Aishwarya.TCV, chenqiuji666, linux-kernel, m.szyprowski,
	robin.clark, robin.murphy, joro, iommu, quic_c_gdjako,
	dmitry.baryshkov, dmitry.baryshkov

On Wed, Jan 21, 2026 at 01:50:52PM +0100, Danilo Krummrich wrote:
> On Wed Jan 21, 2026 at 1:49 PM CET, Mark Brown wrote:
> > On Wed, Jan 21, 2026 at 12:02:15PM +0100, Danilo Krummrich wrote:
> >
> >> I assume that this should resolve the problem (unless there are more drivers
> >> that register drivers in probe()):
> >
> > This makes sense to me, I guess we can just fix any other instances as
> > they arise.  Will you send the patch?
> 
> Yes, I will send it soon.

Please make sure you cc Dmitry and Georgi when you do that. I'm worried
that your proposal means we run the probe code once per TBU, which looks
like it will break.

Will

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 11:02         ` Danilo Krummrich
  2026-01-21 11:19           ` Greg KH
  2026-01-21 12:49           ` Mark Brown
@ 2026-01-21 13:03           ` Robin Murphy
  2026-01-21 14:13             ` Danilo Krummrich
  2026-01-21 13:22           ` Jiayue Wang
  3 siblings, 1 reply; 63+ messages in thread
From: Robin Murphy @ 2026-01-21 13:03 UTC (permalink / raw)
  To: Danilo Krummrich, Wang Jiayue, hanguidong02, gregkh, rafael
  Cc: Aishwarya.TCV, broonie, chenqiuji666, linux-kernel, m.szyprowski,
	robin.clark, will, joro, iommu

On 2026-01-21 11:02 am, Danilo Krummrich wrote:
> On Wed Jan 21, 2026 at 11:40 AM CET, Danilo Krummrich wrote:
>> So, the problem is that in the callstack of the arm-smmu driver's (a platform
>> driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
>> registers another platform driver. Since we are still in probe() of arm-smmu the
>> call to platform_driver_register() happens with the device lock of the arm-smmu
>> platform device held.
>>
>> platform_driver_register() eventually results in driver_attach() which iterates
>> over all the devices of a bus. Since the device we are probing and the driver we
>> are registering are for the same bus (i.e. the platform bus) it can now happen
>> that, by chance, we also match the exact same device that is currently probed
>> again. And since we take the device lock for matching now, we actually take the
>> same lock twice.
>>
>> Now, we could avoid this by not matching bound devices, but we check this
>> through dev->driver while holding the device lock, so that doesn't help.
>>
>> But on the other hand, I don't see any reason why a driver would call
>> platform_driver_register() from probe() in the first place. I think drivers
>> should not do that and instead just register the driver through a normal
>> initcall.
>>
>> (If, however, it turns out that registering drivers from probe() is something we
>> really need for some reason, it is probably best to drop the patch and don't
>> make any guarantees about whether match() is called with the device lock held or
>> not.
>>
>> Consequently, driver_override must be protected with a separate lock (which
>> would be the cleaner solution in any case).)
> 
> I assume that this should resolve the problem (unless there are more drivers
> that register drivers in probe()):
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 573085349df3..9bb793efc35f 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -774,10 +774,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>   {
>          const struct device_node *np = smmu->dev->of_node;
>          const struct of_device_id *match;
> -       static u8 tbu_registered;
> -
> -       if (!tbu_registered++)
> -               platform_driver_register(&qcom_smmu_tbu_driver);
> 
>   #ifdef CONFIG_ACPI
>          if (np == NULL) {
> @@ -802,3 +798,5 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
> 
>          return smmu;
>   }
> +
> +builtin_platform_driver(qcom_smmu_tbu_driver);
> 
> @qcom maintainers: I'm aware of commit 0b4eeee2876f ("iommu/arm-smmu-qcom:
> Register the TBU driver in qcom_smmu_impl_init"), but I think the above patch
> should work fine as it is still *not only* registered when
> CONFIG_ARM_SMMU_QCOM_DEBUG?

In principle there should be nothing wrong with registering the driver 
unconditionally - that existing tbu_registered logic looks racy in the 
face of async_probe anyway - however I don't think the *_platform_driver 
macros will work here, as this all gets combined into arm_smmu.ko 
wherein ending up with multiple module_init declarations breaks the build.

(Please do double-check all the build permutations of ARM_SMMU, 
ARM_SMMU_QCOM and ARM_SMMU_QCOM_DEBUG)
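
To illustrate why a plain "if (!tbu_registered++)" check is racy when probes can
run concurrently, here is a user-space sketch of a once-only registration. It
assumes POSIX pthread_once() as a stand-in for whatever mechanism the driver
would actually use; fake_register_tbu_driver() and register_tbu_once() are
hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>

static pthread_once_t tbu_once = PTHREAD_ONCE_INIT;
static int tbu_register_calls;	/* counts how often registration really ran */

/* Hypothetical stand-in for platform_driver_register(&qcom_smmu_tbu_driver). */
static void fake_register_tbu_driver(void)
{
	tbu_register_calls++;
}

/*
 * Safe to call from any number of threads: pthread_once() guarantees that
 * the init routine runs exactly once and that every caller returns only
 * after it has completed. A non-atomic "if (!flag++)" gives neither
 * guarantee, since two threads can both read the flag as zero.
 */
static void register_tbu_once(void)
{
	pthread_once(&tbu_once, fake_register_tbu_driver);
}
```

Repeated calls to register_tbu_once() leave tbu_register_calls at 1.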

Thanks,
Robin.

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 11:02         ` Danilo Krummrich
                             ` (2 preceding siblings ...)
  2026-01-21 13:03           ` Robin Murphy
@ 2026-01-21 13:22           ` Jiayue Wang
  3 siblings, 0 replies; 63+ messages in thread
From: Jiayue Wang @ 2026-01-21 13:22 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: hanguidong02, gregkh, rafael, Aishwarya.TCV, broonie,
	chenqiuji666, linux-kernel, m.szyprowski, robin.clark, will,
	robin.murphy, joro, iommu

>
> On Wed Jan 21, 2026 at 11:40 AM CET, Danilo Krummrich wrote:
> > So, the problem is that in the callstack of the arm-smmu driver's (a platform
> > driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
> > registers another platform driver. Since we are still in probe() of arm-smmu the
> > call to platform_driver_register() happens with the device lock of the arm-smmu
> > platform device held.
> >
> > platform_driver_register() eventually results in driver_attach() which iterates
> > over all the devices of a bus. Since the device we are probing and the driver we
> > are registering are for the same bus (i.e. the platform bus) it can now happen
> > that, by chance, we also match the exact same device that is currently probed
> > again. And since we take the device lock for matching now, we actually take the
> > same lock twice.
> >
> > Now, we could avoid this by not matching bound devices, but we check this
> > through dev->driver while holding the device lock, so that doesn't help.
> >
> > But on the other hand, I don't see any reason why a driver would call
> > platform_driver_register() from probe() in the first place. I think drivers
> > should not do that and instead just register the driver through a normal
> > initcall.
> >
> > (If, however, it turns out that registering drivers from probe() is something we
> > really need for some reason, it is probably best to drop the patch and not
> > make any guarantees about whether match() is called with the device lock held or
> > not.
> >
> > Consequently, driver_override must be protected with a separate lock (which
> > would be the cleaner solution in any case).)
>
> I assume that this should resolve the problem (unless there are more drivers
> that register drivers in probe()):

I tested this patch on qemu, and the boot appears to be normal now
without deadlocks. I encountered some other minor issues, but those
are likely due to my forced emulation in qemu and should be unrelated.

Tested-by: Wang Jiayue <akaieurus@gmail.com>

>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 573085349df3..9bb793efc35f 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -774,10 +774,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>  {
>         const struct device_node *np = smmu->dev->of_node;
>         const struct of_device_id *match;
> -       static u8 tbu_registered;
> -
> -       if (!tbu_registered++)
> -               platform_driver_register(&qcom_smmu_tbu_driver);
>
>  #ifdef CONFIG_ACPI
>         if (np == NULL) {
> @@ -802,3 +798,5 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>
>         return smmu;
>  }
> +
> +builtin_platform_driver(qcom_smmu_tbu_driver);
>
> @qcom maintainers: I'm aware of commit 0b4eeee2876f ("iommu/arm-smmu-qcom:
> Register the TBU driver in qcom_smmu_impl_init"), but I think the above patch
> should work fine as it is still *not only* registered when
> CONFIG_ARM_SMMU_QCOM_DEBUG is enabled?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 13:02               ` Will Deacon
@ 2026-01-21 14:07                 ` Danilo Krummrich
  0 siblings, 0 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 14:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Brown, Wang Jiayue, hanguidong02, gregkh, rafael,
	Aishwarya.TCV, chenqiuji666, linux-kernel, m.szyprowski,
	robin.clark, robin.murphy, joro, iommu, quic_c_gdjako,
	dmitry.baryshkov, dmitry.baryshkov

On Wed Jan 21, 2026 at 2:02 PM CET, Will Deacon wrote:
> On Wed, Jan 21, 2026 at 01:50:52PM +0100, Danilo Krummrich wrote:
>> On Wed Jan 21, 2026 at 1:49 PM CET, Mark Brown wrote:
>> > On Wed, Jan 21, 2026 at 12:02:15PM +0100, Danilo Krummrich wrote:
>> >
>> >> I assume that this should resolve the problem (unless there are more drivers
>> >> that register drivers in probe()):
>> >
>> > This makes sense to me, I guess we can just fix any other instances as
>> > they arise.  Will you send the patch?
>> 
>> Yes, I will send it soon.
>
> Please make sure you cc Dmitry and Georgi when you do that.

Sure!

> I'm worried that your proposal means we run the probe code once per TBU, which
> looks like it will break.

It shouldn't change anything in this regard. It only changes when the
qcom_smmu_tbu_driver is registered, initcall vs probe() of another driver.

- Danilo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 13:03           ` Robin Murphy
@ 2026-01-21 14:13             ` Danilo Krummrich
  0 siblings, 0 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 14:13 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Wang Jiayue, hanguidong02, gregkh, rafael, Aishwarya.TCV, broonie,
	chenqiuji666, linux-kernel, m.szyprowski, robin.clark, will, joro,
	iommu

On Wed Jan 21, 2026 at 2:03 PM CET, Robin Murphy wrote:
> On 2026-01-21 11:02 am, Danilo Krummrich wrote:
>> On Wed Jan 21, 2026 at 11:40 AM CET, Danilo Krummrich wrote:
>>> So, the problem is that in the callstack of the arm-smmu driver's (a platform
>>> driver) probe() function, the QCOM specific code (through arm_smmu_impl_init())
>>> registers another platform driver. Since we are still in probe() of arm-smmu the
>>> call to platform_driver_register() happens with the device lock of the arm-smmu
>>> platform device held.
>>>
>>> platform_driver_register() eventually results in driver_attach() which iterates
>>> over all the devices of a bus. Since the device we are probing and the driver we
>>> are registering are for the same bus (i.e. the platform bus) it can now happen
>>> that by chance we also match the exact same device that is currently probed
>>> again. And since we take the device lock for matching now, we actually take the
>>> same lock twice.
>>>
>>> Now, we could avoid this by not matching bound devices, but we check this
>>> through dev->driver while holding the device lock, so that doesn't help.
>>>
>>> But on the other hand, I don't see any reason why a driver would call
>>> platform_driver_register() from probe() in the first place. I think drivers
>>> should not do that and instead just register the driver through a normal
>>> initcall.
>>>
>>> (If, however, it turns out that registering drivers from probe() is something we
>>> really need for some reason, it is probably best to drop the patch and not
>>> make any guarantees about whether match() is called with the device lock held or
>>> not.
>>>
>>> Consequently, driver_override must be protected with a separate lock (which
>>> would be the cleaner solution in any case).)
>> 
>> I assume that this should resolve the problem (unless there are more drivers
>> that register drivers in probe()):
>> 
>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
>> index 573085349df3..9bb793efc35f 100644
>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
>> @@ -774,10 +774,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>>   {
>>          const struct device_node *np = smmu->dev->of_node;
>>          const struct of_device_id *match;
>> -       static u8 tbu_registered;
>> -
>> -       if (!tbu_registered++)
>> -               platform_driver_register(&qcom_smmu_tbu_driver);
>> 
>>   #ifdef CONFIG_ACPI
>>          if (np == NULL) {
>> @@ -802,3 +798,5 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu)
>> 
>>          return smmu;
>>   }
>> +
>> +builtin_platform_driver(qcom_smmu_tbu_driver);
>> 
>> @qcom maintainers: I'm aware of commit 0b4eeee2876f ("iommu/arm-smmu-qcom:
>> Register the TBU driver in qcom_smmu_impl_init"), but I think the above patch
>> should work fine as it is still *not only* registered when
>> CONFIG_ARM_SMMU_QCOM_DEBUG is enabled?
>
> In principle there should be nothing wrong with registering the driver 
> unconditionally - that existing tbu_registered logic looks racy in the 
> face of async_probe anyway - however I don't think the *_platform_driver 
> macros will work here, as this all gets combined into arm_smmu.ko 
> wherein ending up with multiple module_init declarations breaks the build.
>
> (Please do double-check all the build permutations of ARM_SMMU, 
> ARM_SMMU_QCOM and ARM_SMMU_QCOM_DEBUG)

Indeed, I accounted for this in the final patch I sent out, thanks!
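
The self-deadlock described in the quoted analysis above can be modelled in
userspace. This is only a minimal sketch, assuming POSIX threads:
struct fake_device, fake_driver_register() and fake_probe_registering_driver()
are hypothetical stand-ins, not driver core APIs. An error-checking mutex is
used so that relocking by the owner returns EDEADLK instead of hanging the way
the kernel's device mutex would.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: a name plus a "device lock". */
struct fake_device {
	const char *name;
	pthread_mutex_t lock;
};

/* Set up the lock as error-checking, so a relock by the owner is reported. */
static int fake_device_init(struct fake_device *dev, const char *name)
{
	pthread_mutexattr_t attr;
	int ret;

	dev->name = name;
	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&dev->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}

/*
 * Models driver registration walking every device on the bus and taking
 * each device lock around match(). Returns how many devices could not be
 * locked because the calling thread already holds their lock.
 */
static int fake_driver_register(struct fake_device *bus, size_t n)
{
	int self_deadlocks = 0;

	for (size_t i = 0; i < n; i++) {
		int ret = pthread_mutex_lock(&bus[i].lock);

		if (ret == EDEADLK) {
			self_deadlocks++;	/* a plain mutex would hang here */
			continue;
		}
		assert(ret == 0);
		/* match() would run here with the lock held */
		pthread_mutex_unlock(&bus[i].lock);
	}
	return self_deadlocks;
}

/* Models probe() of bus[probed] registering another driver for the same bus. */
static int fake_probe_registering_driver(struct fake_device *bus, size_t n,
					 size_t probed)
{
	int ret;

	if (pthread_mutex_lock(&bus[probed].lock))
		return -1;
	ret = fake_driver_register(bus, n);
	pthread_mutex_unlock(&bus[probed].lock);
	return ret;
}
```

Calling fake_driver_register() from an initcall-like context, with no device
lock held, reports no conflicts. Calling it through
fake_probe_registering_driver() reports exactly one, for the device being
probed.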

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 15:23   ` Marek Szyprowski
  2026-01-20 15:27     ` Mark Brown
@ 2026-01-21 20:00     ` Jon Hunter
  2026-01-21 21:42       ` Danilo Krummrich
  1 sibling, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-21 20:00 UTC (permalink / raw)
  To: Marek Szyprowski, Mark Brown, Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV, linux-tegra@vger.kernel.org


On 20/01/2026 15:23, Marek Szyprowski wrote:
> On 20.01.2026 14:22, Mark Brown wrote:
>> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
>>> Currently, driver_match_device() is called from three sites. One site
>>> (__device_attach_driver) holds device_lock(dev), but the other two
>>> (bind_store and __driver_attach) do not. This inconsistency means that
>>> bus match() callbacks are not guaranteed to be called with the lock
>>> held.
>> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
>> this commit.  The boot grinds to a halt near the end of boot:
>>
>> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
>> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>> [    2.623547] msm_serial: driver initialized
>> [    2.624058] SuperH (H)SCI(F) driver initialized
>> [    2.624312] STM32 USART driver initialized
>>
>> with no further output, full log:
>>
>>      https://lava.sirena.org.uk/scheduler/job/2387335#L862
>>
>> We are also seeing similar looking boot hangs on some Qualcomm platforms
>> in Arm's test lab which aren't verified to be the same thing but are
>> hanging at a similar point in boot.
> 
> I've observed the same issue on the Qualcomm RB5 board, and bisecting also led
> me to this patch. My kernel log doesn't reveal much information either:
> 
> ...
> 
> [    3.671227] vreg_bob: Setting 3008000-4000000uV
> [    3.676929] vreg_l1c_1p8: Setting 1800000-1800000uV
> [    3.682826] vreg_l2c_1p2: Setting 1200000-1200000uV
> [    3.688547] vreg_l3c_0p8: Setting 800000-800000uV
> [    3.694080] vreg_l4c_1p7: Setting 1704000-2928000uV
> [    3.699908] vreg_l5c_1p8: Setting 1800000-2928000uV
> [    3.705763] vreg_l6c_2p96: Setting 1800000-2960000uV
> [    3.711684] vreg_l7c_cam_vcm0_2p85: Setting 2856000-3104000uV
> [    3.718408] vreg_l8c_1p8: Setting 1800000-1800000uV
> [    3.724287] vreg_l9c_2p96: Setting 2704000-2960000uV
> [    3.730218] vreg_l10c_3p0: Setting 3000000-3000000uV
> [    3.736226] vreg_l11c_3p3: Setting 3296000-3296000uV
> [    3.743413] vreg_s8c_1p3: Setting 1352000-1352000uV
> [    3.771370] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    3.792020] msm_serial: driver initialized
> [    3.797633] SuperH (H)SCI(F) driver initialized
> [    3.802881] STM32 USART driver initialized
> 
> [hang/freeze]

I am seeing a similar issue on one of our Tegra boards and bisect also 
points to this commit.

It is odd because it only appears to impact the Tegra194 Jetson Xavier 
NX board (tegra194-p3509-0000+p3668-0000.dts).

It appears to boot enough so the test can SSH into the device, but the 
kernel log does not show us getting to the console prompt. It also
appears that a lot of drivers are not bound as expected. I would need to 
check if those are all modules or not.

Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 20:00     ` Jon Hunter
@ 2026-01-21 21:42       ` Danilo Krummrich
  2026-01-22 17:28         ` Jon Hunter
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-21 21:42 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Marek Szyprowski, Mark Brown, Gui-Dong Han, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote:
> It is odd because it only appears to impact the Tegra194 Jetson Xavier 
> NX board (tegra194-p3509-0000+p3668-0000.dts).
>
> It appears to boot enough so the test can SSH into the device, but the 
> kernel log does not show us getting to the console prompt. It also
> appears that a lot of drivers are not bound as expected. I would need to 
> check if those are all modules or not.

The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be
related in this case.

I quickly checked all drivers with "tegra194" in their compatible string, but
didn't see anything odd.

Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING,
CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following
diff?

(You will see a lockdep warning in faux_bus_init(), it's harmless and can be
ignored.)

[1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 677320881af1..4741412d7e46 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv,
 static inline int driver_match_device_locked(const struct device_driver *drv,
                                             struct device *dev)
 {
-       guard(device)(dev);
-       return driver_match_device(drv, dev);
+       int ret;
+
+       mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
+       ret = driver_match_device(drv, dev);
+       mutex_release(&dev->mutex.dep_map, _THIS_IP_);
+
+       return ret;
 }

 static inline void dev_sync_state(struct device *dev)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 40de2f51a1b1..56c62b3016aa 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2557,6 +2557,8 @@ static void device_release(struct kobject *kobj)

        kfree(dev->dma_range_map);

+       lockdep_unregister_key(&dev->lock_key);
+
        if (dev->release)
                dev->release(dev);
        else if (dev->type && dev->type->release)
@@ -3159,7 +3161,9 @@ void device_initialize(struct device *dev)
        kobject_init(&dev->kobj, &device_ktype);
        INIT_LIST_HEAD(&dev->dma_pools);
        mutex_init(&dev->mutex);
-       lockdep_set_novalidate_class(&dev->mutex);
+       //lockdep_set_novalidate_class(&dev->mutex);
+       lockdep_register_key(&dev->lock_key);
+       lockdep_set_class(&dev->mutex, &dev->lock_key);
        spin_lock_init(&dev->devres_lock);
        INIT_LIST_HEAD(&dev->devres_head);
        device_pm_init(dev);
diff --git a/include/linux/device.h b/include/linux/device.h
index 0be95294b6e6..dc898a420bc2 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -579,6 +579,7 @@ struct device {
        struct mutex            mutex;  /* mutex to synchronize calls to
                                         * its driver.
                                         */
+       struct lock_class_key lock_key;

        struct dev_links_info   links;
        struct dev_pm_info      power;


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-21 21:42       ` Danilo Krummrich
@ 2026-01-22 17:28         ` Jon Hunter
  2026-01-22 17:55           ` Gui-Dong Han
  0 siblings, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-22 17:28 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Marek Szyprowski, Mark Brown, Gui-Dong Han, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

Hi Danilo,

On 21/01/2026 21:42, Danilo Krummrich wrote:
> On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote:
>> It is odd because it only appears to impact the Tegra194 Jetson Xavier
>> NX board (tegra194-p3509-0000+p3668-0000.dts).
>>
>> It appears to boot enough so the test can SSH into the device, but the
>> kernel log does not show us getting to the console prompt. It also
>> appears that a lot of drivers are not bound as expected. I would need to
>> check if those are all modules or not.
> 
> The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be
> related in this case.
> 
> I quickly checked all drivers with "tegra194" in their compatible string, but
> didn't see anything odd.
> 
> Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING,
> CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following
> diff?
> 
> (You will see a lockdep warning in faux_bus_init(), it's harmless and can be
> ignored.)

Thanks. I do see the lockdep warning in faux_bus_init() but that's the only
one. I have verified that all these CONFIGs are correctly enabled in the 
build. The device boots fine with the below diff, but I am guessing that 
that is expected?

Any other thoughts?

Thanks
Jon
> [1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/
> 
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index 677320881af1..4741412d7e46 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv,
>   static inline int driver_match_device_locked(const struct device_driver *drv,
>                                               struct device *dev)
>   {
> -       guard(device)(dev);
> -       return driver_match_device(drv, dev);
> +       int ret;
> +
> +       mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
> +       ret = driver_match_device(drv, dev);
> +       mutex_release(&dev->mutex.dep_map, _THIS_IP_);
> +
> +       return ret;
>   }
> 
>   static inline void dev_sync_state(struct device *dev)
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 40de2f51a1b1..56c62b3016aa 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2557,6 +2557,8 @@ static void device_release(struct kobject *kobj)
> 
>          kfree(dev->dma_range_map);
> 
> +       lockdep_unregister_key(&dev->lock_key);
> +
>          if (dev->release)
>                  dev->release(dev);
>          else if (dev->type && dev->type->release)
> @@ -3159,7 +3161,9 @@ void device_initialize(struct device *dev)
>          kobject_init(&dev->kobj, &device_ktype);
>          INIT_LIST_HEAD(&dev->dma_pools);
>          mutex_init(&dev->mutex);
> -       lockdep_set_novalidate_class(&dev->mutex);
> +       //lockdep_set_novalidate_class(&dev->mutex);
> +       lockdep_register_key(&dev->lock_key);
> +       lockdep_set_class(&dev->mutex, &dev->lock_key);
>          spin_lock_init(&dev->devres_lock);
>          INIT_LIST_HEAD(&dev->devres_head);
>          device_pm_init(dev);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 0be95294b6e6..dc898a420bc2 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -579,6 +579,7 @@ struct device {
>          struct mutex            mutex;  /* mutex to synchronize calls to
>                                           * its driver.
>                                           */
> +       struct lock_class_key lock_key;
> 
>          struct dev_links_info   links;
>          struct dev_pm_info      power;
> 

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 17:28         ` Jon Hunter
@ 2026-01-22 17:55           ` Gui-Dong Han
  2026-01-22 18:12             ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-22 17:55 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Danilo Krummrich, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Fri, Jan 23, 2026 at 1:28 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Danilo,
>
> On 21/01/2026 21:42, Danilo Krummrich wrote:
> > On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote:
> >> It is odd because it only appears to impact the Tegra194 Jetson Xavier
> >> NX board (tegra194-p3509-0000+p3668-0000.dts).
> >>
> >> It appears to boot enough so the test can SSH into the device, but the
> >> kernel log does not show us getting to the console prompt. It also
> >> appears that a lot of drivers are not bound as expected. I would need to
> >> check if those are all modules or not.
> >
> > The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be
> > related in this case.
> >
> > I quickly checked all drivers with "tegra194" in their compatible string, but
> > didn't see anything odd.
> >
> > Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING,
> > CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following
> > diff?
> >
> > (You will see a lockdep warning in faux_bus_init(), it's harmless and can be
> > ignored.)
>
> Thanks. I do see the lockdep warning in faux_bus_init() but that's the only
> one. I have verified that all these CONFIGs are correctly enabled in the
> build. The device boots fine with the below diff, but I am guessing that
> that is expected?
>
> Any other thoughts?

Can you please try applying the following commit?

https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de

Robin Murphy confirmed that the qcom specific issue might actually
impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is
enabled), as the implementation init code is still executed:

https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/

So, this patch might fix the issue on Tegra as well.

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 17:55           ` Gui-Dong Han
@ 2026-01-22 18:12             ` Danilo Krummrich
  2026-01-22 18:58               ` Jon Hunter
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-22 18:12 UTC (permalink / raw)
  To: Gui-Dong Han
  Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Thu Jan 22, 2026 at 6:55 PM CET, Gui-Dong Han wrote:
> On Fri, Jan 23, 2026 at 1:28 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>> Hi Danilo,
>>
>> On 21/01/2026 21:42, Danilo Krummrich wrote:
>> > On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote:
>> >> It is odd because it only appears to impact the Tegra194 Jetson Xavier
>> >> NX board (tegra194-p3509-0000+p3668-0000.dts).
>> >>
>> >> It appears to boot enough so the test can SSH into the device, but the
>> >> kernel log does not show us getting to the console prompt. It also
>> >> appears that a lot of drivers are not bound as expected. I would need to
>> >> check if those are all modules or not.
>> >
>> > The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be
>> > related in this case.
>> >
>> > I quickly checked all drivers with "tegra194" in their compatible string, but
>> > didn't see anything odd.
>> >
>> > Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING,
>> > CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following
>> > diff?
>> >
>> > (You will see a lockdep warning in faux_bus_init(), it's harmless and can be
>> > ignored.)
>>
>> Thanks. I do see the lockdep warning in faux_bus_init() but that's the only
>> one. I have verified that all these CONFIGs are correctly enabled in the
>> build. The device boots fine with the below diff, but I am guessing that
>> that is expected?

Yes, that's expected: we are not actually taking the lock, but only asserting to
lockdep that we did. The fact that we use a dynamic lock class key for each
device mutex to avoid false positives should also be fine.
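
The "annotate, but do not lock" idea can be modelled with a toy held-lock
tracker. This is only a userspace sketch under stated assumptions:
struct held_locks, annotate_acquire() and annotate_release() are hypothetical
names, and real lockdep tracks lock classes and dependency chains rather than
a flat list of keys. The point is just that recording an acquisition is enough
to flag a recursive one, without anything ever blocking.

```c
#include <stddef.h>

#define MAX_HELD 16

/* Hypothetical toy validator: the keys of the locks one context "holds". */
struct held_locks {
	const void *keys[MAX_HELD];
	size_t depth;
};

/*
 * Record that key is being acquired, without touching any real lock.
 * Returns 0 on success, -1 if key is already held (a would-be
 * self-deadlock), or -2 if the tracker is full.
 */
static int annotate_acquire(struct held_locks *h, const void *key)
{
	for (size_t i = 0; i < h->depth; i++)
		if (h->keys[i] == key)
			return -1;
	if (h->depth == MAX_HELD)
		return -2;
	h->keys[h->depth++] = key;
	return 0;
}

/* Drop the record for key, if present; later entries keep their order. */
static void annotate_release(struct held_locks *h, const void *key)
{
	for (size_t i = h->depth; i-- > 0; ) {
		if (h->keys[i] != key)
			continue;
		for (size_t j = i; j + 1 < h->depth; j++)
			h->keys[j] = h->keys[j + 1];
		h->depth--;
		return;
	}
}
```

Using a distinct key per device corresponds to the per-device lock class key in
the diff: holding one device's key does not make acquiring a different device's
key look like recursion.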

>> Any other thoughts?

With this diff, if I intentionally create a deadlock condition on my machine, I
do see a lockdep splat as expected.

Anyways, another option would be to attach a hardware debugger (I assume you
have TRACE32 or something available?) and then get a backtrace from the CPU
affected by the deadlock.

> Can you please try applying the following commit?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de
>
> Robin Murphy confirmed that the qcom specific issue might actually
> impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is
> enabled), as the implementation init code is still executed:
>
> https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/
>
> So, this patch might fix the issue on Tegra as well.

I thought of that as well, but looking at the code in arm_smmu_impl_init(), it
seems that can't happen?

	if (of_device_is_compatible(np, "nvidia,tegra234-smmu") ||
	    of_device_is_compatible(np, "nvidia,tegra194-smmu") ||
	    of_device_is_compatible(np, "nvidia,tegra186-smmu"))
		return nvidia_smmu_impl_init(smmu);
	
	if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM))
		smmu = qcom_smmu_impl_init(smmu);

But maybe there is some odd case where the first if condition does not evaluate
to true on tegra194, so maybe worth a try.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 18:12             ` Danilo Krummrich
@ 2026-01-22 18:58               ` Jon Hunter
  2026-01-22 19:35                 ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-22 18:58 UTC (permalink / raw)
  To: Danilo Krummrich, Gui-Dong Han
  Cc: Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel,
	baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org



On 22/01/2026 18:12, Danilo Krummrich wrote:

...

>>> Any other thoughts?
> 
> With this diff, if I intentionally create a deadlock condition on my machine, I
> do see a lockdep splat as expected.
> 
> Anyways, another option would be to attach a hardware debugger (I assume you
> have TRACE32 or something available?) and then get a backtrace from the CPU
> affected by the deadlock.

Unfortunately, these days I don't have such tools available so that's 
not an option.

>> Can you please try applying the following commit?
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de
>>
>> Robin Murphy confirmed that the qcom specific issue might actually
>> impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is
>> enabled), as the implementation init code is still executed:
>>
>> https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/
>>
>> So, this patch might fix the issue on Tegra as well.
> 
> I thought of that as well, but looking at the code in arm_smmu_impl_init(), it
> seems that can't happen?
> 
> 	if (of_device_is_compatible(np, "nvidia,tegra234-smmu") ||
> 	    of_device_is_compatible(np, "nvidia,tegra194-smmu") ||
> 	    of_device_is_compatible(np, "nvidia,tegra186-smmu"))
> 		return nvidia_smmu_impl_init(smmu);
> 	
> 	if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM))
> 		smmu = qcom_smmu_impl_init(smmu);
> 
> But maybe there is some odd case where the first if condition does not evaluate
> to true on tegra194, so maybe worth a try.

I gave this a shot but that did not help either.

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 18:58               ` Jon Hunter
@ 2026-01-22 19:35                 ` Danilo Krummrich
  2026-01-23 13:57                   ` Jon Hunter
  2026-01-27 14:53                   ` Jon Hunter
  0 siblings, 2 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-22 19:35 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote:
> On 22/01/2026 18:12, Danilo Krummrich wrote:
>> With this diff, if I intentionally create a deadlock condition on my machine, I
>> do see a lockdep splat as expected.
>> 
>> Anyways, another option would be to attach a hardware debugger (I assume you
>> have TRACE32 or something available?) and then get a backtrace from the CPU
>> affected by the deadlock.
>
> Unfortunately, these days I don't have such tools available so that's 
> not an option.

Hm..slowly running out of options. :)

I remember you previously said that you can still SSH into the machine? If so,
can you please share the first output of

	echo l > /proc/sysrq-trigger

directly after booting?

Subsequently, can you please also run

	echo w > /proc/sysrq-trigger

and

	echo t > /proc/sysrq-trigger

If the output of the last shows a task in D state, you can also run

	cat /proc/$PID/stack

Also, are there any OOT modules loaded?

Thanks,
Danilo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 19:35                 ` Danilo Krummrich
@ 2026-01-23 13:57                   ` Jon Hunter
  2026-01-23 14:09                     ` Danilo Krummrich
  2026-01-27 14:53                   ` Jon Hunter
  1 sibling, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-23 13:57 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org


On 22/01/2026 19:35, Danilo Krummrich wrote:
> On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote:
>> On 22/01/2026 18:12, Danilo Krummrich wrote:
>>> With this diff, if I intentionally create a deadlock condition on my machine, I
>>> do see a lockdep splat as expected.
>>>
>>> Anyways, another option would be to attach a hardware debugger (I assume you
>>> have TRACE32 or something available?) and then get a backtrace from the CPU
>>> affected by the deadlock.
>>
>> Unfortunately, these days I don't have such tools available so that's
>> not an option.
> 
> Hm..slowly running out of options. :)

No worries. There appear to be a couple of issues going on with this
board. With the patch reverted, the board boots fine and tests pass. Even
in the passing case with this patch reverted, during boot I see a NULL
pointer dereference crash log from the QSPI driver. So I disabled the QSPI
device in device-tree and with this patch the board boots fine and tests
pass.

There is an ongoing thread for the QSPI driver to fix these NULL pointer
dereference crashes [0]. So the QSPI driver seems to be the root of the
problem.

Cheers
Jon

[0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 13:57                   ` Jon Hunter
@ 2026-01-23 14:09                     ` Danilo Krummrich
  2026-01-23 14:29                       ` Jon Hunter
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-23 14:09 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Fri Jan 23, 2026 at 2:57 PM CET, Jon Hunter wrote:
> No worries. There appear to be a couple of issues going on with this
> board. With the patch reverted, the board boots fine and tests pass. Even
> in the passing case with this patch reverted, during boot I see a NULL
> pointer dereference crash log from the QSPI driver. So I disabled the QSPI
> device in device-tree and with this patch the board boots fine and tests
> pass.
>
> There is an ongoing thread for the QSPI driver to fix these NULL pointer
> dereference crashes [0]. So the QSPI driver seems to be the root of the
> problem.
>
> [0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t

So, are you saying the problems you are seeing are unrelated to this patch and
there is no deadlock? (At least this would explain why we couldn't get a lockdep
splat with the diff I shared. :)

Otherwise, can you please share the output of the commands I shared in my
previous mail?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 14:09                     ` Danilo Krummrich
@ 2026-01-23 14:29                       ` Jon Hunter
  2026-01-23 16:54                         ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-23 14:29 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org


On 23/01/2026 14:09, Danilo Krummrich wrote:
> On Fri Jan 23, 2026 at 2:57 PM CET, Jon Hunter wrote:
>> No worries. There appears to be a couple issues going on with this
>> board. With the patch reverted the board boots fine and tests pass. Even
>> in the passing case with this patch reverted, during boot I see a NULL
>> pointer deference crash log from the QSPI driver. So I disabled the QSPI
>> device in device-tree and with this patch the board boots fine and tests
>> pass.
>>
>> There is a on-going thread for the QSPI driver to fix these NULL pointer
>> deference crashes [0]. So the QSPI driver seems to be the root of the
>> problem.
>>
>> [0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t
> 
> So, are you saying the problems you are seeing are unrelated to this patch and
> there is no deadlock? (At least this would explain why we couldn't get a lockdep
> splat with the diff I shared. :)

Not exactly. With vanilla -next I see various tests fail on this board 
and I can see various devices are not probed as expected. Bisect pointed 
to this patch.

I can fix this by either:

1. Reverting this patch.
2. Disabling the QSPI driver.

Now, the QSPI driver has issues which need to be fixed, and I am
wondering whether this problem will go away once they are fixed.

However, I guess regardless of the QSPI issue, should this patch be 
having such an impact?

Looking at the boot log [0], you can see the crash occurs during
tegra_qspi_probe(), so I am guessing this is what leads to the
deadlock? And maybe there is no way to avoid that?

Please note that a lot of the boards I test are in a farm and I don't 
have direct access. So although I can see the test harness SSH'ing into 
the board, I am not accessing it directly. However, we can run whatever
tests we want.

There are no OOT drivers being used; this is just vanilla -next.

Jon

[0] https://pastebin.com/wJheruPP

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 14:29                       ` Jon Hunter
@ 2026-01-23 16:54                         ` Danilo Krummrich
  2026-01-23 18:53                           ` Gui-Dong Han
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-23 16:54 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Fri Jan 23, 2026 at 3:29 PM CET, Jon Hunter wrote:
> I can fix this by either:
>
> 1. Reverting this patch.
> 2. Disabling the QSPI driver.
>
> Now the QSPI driver has issues which need to be fixed which I am 
> wondering once fix will avoid this problem.
>
> However, I guess regardless of the QSPI issue, should this patch be 
> having such an impact?

So, this patch by itself is correct, but it reveals when drivers do the wrong
thing, that is, register drivers from contexts where it neither makes sense nor
is supported by the driver core.

The deadlock happens when a driver (A) registers another driver (B) from a
context where the device lock of the device bound to (A) is held, e.g. from bus
callbacks, such as probe(). See also [1].

While never valid, the deadlock only occurs when (A) and (B) are on the same
bus, e.g. when a platform driver registers another platform driver in its
probe() callback.

However, it is a bit more tricky than that: Let's say a platform driver
registers an SPI controller, then spi_register_controller() might scan the SPI
bus and register SPI devices (not drivers), which are then probed as well. So
far this is all fine, but if now in one of the SPI drivers' probe() callbacks a
platform driver is registered, you have a deadlock condition as well.

So it seems that something of this kind is going on with
drivers/spi/spi-tegra210-quad.c.

I already ran quite a thorough analysis throughout the whole kernel tree with
various static analyzers and also played around with LLMs for finding this
pattern.

The tools gave me two results:

  (1) The IOMMU one I already fixed [2].
  (2) The GPIO driver I posted a patch for in [3].

I specifically also looked for all drivers that are required to run all the
peripherals in the tegra194-p3509-0000+p3668-0000.dts hierarchy, but couldn't
catch anything.

(This is also why I asked about OOT, because there are quite some compatible
strings that are not supported by any upstream driver.)

I think to really see what's going on with spi-tegra210-quad.c, we need the
dumps of the sysrq-triggers I provided in a previous mail.

I'd also recommend picking a stable state of the spi-tegra210-quad.c driver and
apply this patch on top (or just apply the spi-tegra210-quad.c fixes as well).

Subsequently, we could try and retest with the diff I provided and the
corresponding lockdep options enabled and with the sysrq-triggers (without the
diff).

[1] https://lore.kernel.org/lkml/DFU7CEPUSG9A.1KKGVW4HIPMSH@kernel.org/
[2] https://lore.kernel.org/all/20260121141215.29658-1-dakr@kernel.org/
[3] https://lore.kernel.org/all/20260123133614.72586-1-dakr@kernel.org/

> Please note that a lot of the boards I test are in a farm and I don't 
> have direct access. So although I can see the test harness SSH'ing into 
> the board, I am not accessing directly. However, we can run whatever 
> tests we want.

Maybe you can trigger the sysrq-trigger from a custom test?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 16:54                         ` Danilo Krummrich
@ 2026-01-23 18:53                           ` Gui-Dong Han
  2026-01-23 19:07                             ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-23 18:53 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Sat, Jan 24, 2026 at 12:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Fri Jan 23, 2026 at 3:29 PM CET, Jon Hunter wrote:
> > I can fix this by either:
> >
> > 1. Reverting this patch.
> > 2. Disabling the QSPI driver.
> >
> > Now the QSPI driver has issues which need to be fixed which I am
> > wondering once fix will avoid this problem.
> >
> > However, I guess regardless of the QSPI issue, should this patch be
> > having such an impact?
>
> So, this patch by itself is correct, but it reveals when drivers do the wrong
> thing, that is register drivers from contexts where it neither makes sense nor
> it is supported by the driver core.
>
> The deadlock happens when a driver (A) registers another driver (B) from a
> context where the device lock of the device bound to (A) is held, e.g. from bus
> callbacks, such as probe(). See also [1].
>
> While never valid, the deadlock does only occur when (A) and (B) are on the same
> bus, e.g. when a platform driver registers another platform driver in its
> probe() callback.
>
> However, it is a bit more tricky than that: Let's say a platform driver
> registers an SPI controller, then spi_register_controller() might scan the SPI
> bus and register SPI devices (not drivers), which are then probed as well. So
> far this is all fine, but if now in one of the SPI drivers probe() callbacks a
> platform driver is registered, you have a deadlock condition as well.
>
> So it seems that something of this kind is going on with
> drivers/spi/spi-tegra210-quad.c.
>
> I did already run quite thorough analysis throughout the whole kernel tree with
> various static analyzers and also played around with LLMs for finding this
> pattern.
>
> The tools gave me two results:
>
>   (1) The IOMMU one I already fixed [2].
>   (2) The GPIO driver I posted a patch for in [3].
>
> I specifically also looked for all drivers that are required to run all the
> peripherals in the tegra194-p3509-0000+p3668-0000.dts hierarchy, but couldn't
> catch anything.
>
> (This is also why I asked about OOT, because there are quite some compatible
> strings that are not supported by any upstream driver.)
>
> I think to really see what's going in with spi-tegra210-quad.c, we need the
> dumps of the sysrq-triggers I provided in a previous mail.

It seems the issue is simpler than a recursive registration deadlock.
Looking at the logs, tegra_qspi_probe triggers a NULL pointer
dereference (Oops) while holding the device_lock. The mutex likely
remains marked as held/orphaned, blocking subsequent driver bindings
on the same bus.

This likely explains why lockdep was silent. Since this is not a lock
dependency cycle or a recursive locking violation, but rather a lock
remaining held by a terminated task, lockdep would not flag it as a
deadlock pattern.

This is indeed a side effect of enforcing the lock here—it amplifies
the impact of a crash. However, an Oops while holding the device_lock
is generally catastrophic regardless.

Following up on our previous discussion [1], refactoring
driver_override would resolve this. We could move driver_override to
struct device and protect it with a dedicated lock (e.g.,
driver_override_lock). We would then replace driver_set_override with
dev_set_driver_override and add dev_access_driver_override with
internal lock assertions. This allows us to remove device_lock from
the 2 match paths, reducing contention and preventing a single crash
from stalling the whole bus.

However, this deviates from the current paradigm where device_lock
protects sysfs attributes (like waiting_for_supplier and
power/control). If other sysfs attributes are found to share similar
constraints or would benefit from finer-grained locking (which
requires further investigation), we might have a stronger argument for
introducing a more generic sysfs_lock to handle this class of
attributes. We would also need to carefully verify safety during
device removal.

Danilo, what are your thoughts on this refactoring plan? I am willing
to attempt it, but since it touches the driver core, documentation,
and 10+ bus drivers, and I haven't submitted such a large series
before, it may take me a few weeks to get an initial version out, and
additional time to iterate based on review feedback until it is ready
for merging. If you prefer to handle it yourself to expedite things,
please let me know so we don't duplicate efforts.

[1] https://lore.kernel.org/all/DFNI14L1K1I0.3FZ84OAWXY0LP@kernel.org/

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 18:53                           ` Gui-Dong Han
@ 2026-01-23 19:07                             ` Danilo Krummrich
  2026-01-27 14:58                               ` Jon Hunter
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-23 19:07 UTC (permalink / raw)
  To: Gui-Dong Han
  Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote:
> It seems the issue is simpler than a recursive registration deadlock.
> Looking at the logs, tegra_qspi_probe triggers a NULL pointer
> dereference (Oops) while holding the device_lock. The mutex likely
> remains marked as held/orphaned, blocking subsequent driver bindings
> on the same bus.
>
> This likely explains why lockdep was silent. Since this is not a lock
> dependency cycle or a recursive locking violation, but rather a lock
> remaining held by a terminated task, lockdep would not flag it as a
> deadlock pattern.
>
> This is indeed a side effect of enforcing the lock here—it amplifies
> the impact of a crash. However, an Oops while holding the device_lock
> is generally catastrophic regardless.

This makes sense to me; it might indeed be as simple as that.

> Following up on our previous discussion [1], refactoring
> driver_override would resolve this. We could move driver_override to
> struct device and protect it with a dedicated lock (e.g.,
> driver_override_lock). We would then replace driver_set_override with
> dev_set_driver_override and add dev_access_driver_override with
> internal lock assertions. This allows us to remove device_lock from
> the 2 match paths, reducing contention and preventing a single crash
> from stalling the whole bus.
>
> However, this deviates from the current paradigm where device_lock
> protects sysfs attributes (like waiting_for_supplier and
> power/control). If other sysfs attributes are found to share similar
> constraints or would benefit from finer-grained locking (which
> requires further investigation), we might have a stronger argument for
> introducing a more generic sysfs_lock to handle this class of
> attributes. We would also need to carefully verify safety during
> device removal.
>
> Danilo, what are your thoughts on this refactoring plan? I am willing
> to attempt it, but since it touches the driver core, documentation,
> and 10+ bus drivers, and I haven't submitted such a large series
> before, it may take me a few weeks to get an initial version out, and
> additional time to iterate based on review feedback until it is ready
> for merging. If you prefer to handle it yourself to expedite things,
> please let me know so we don't duplicate efforts.

I think moving driver_override to struct device and providing accessors with
proper lockdep assertions is the correct thing to do. With that, I do not think
a separate lock is necessary.

Please feel free to follow up on this.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-22 19:35                 ` Danilo Krummrich
  2026-01-23 13:57                   ` Jon Hunter
@ 2026-01-27 14:53                   ` Jon Hunter
  2026-01-27 15:05                     ` Gui-Dong Han
  1 sibling, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-27 14:53 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

Hi Danilo,

On 22/01/2026 19:35, Danilo Krummrich wrote:
> On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote:
>> On 22/01/2026 18:12, Danilo Krummrich wrote:
>>> With this diff, if I intentionally create a deadlock condition on my machine, I
>>> do see a lockdep splat as expected.
>>>
>>> Anyways, another option would be to attach a hardware debugger (I assume you
>>> have TRACE32 or something available?) and then get a backtrace from the CPU
>>> affected of the deadlock.
>>
>> Unfortunately, these days I don't have such tools available so that's
>> not an option.
> 
> Hm..slowly running out of options. :)
> 
> I remember you previously said that you can still SSH into the machine? If so,
> can you please share the the first output of
> 
> 	echo l > /proc/sysrq-trigger
> 
> directly after booting?
> 
> Subsequently, can you please also run
> 
> 	echo w > /proc/sysrq-trigger
> 
> and
> 
> 	echo t > /proc/sysrq-trigger

You can find the output of the above commands here:

https://pastebin.com/PuhFURwh

If you search for 'state:D', you can see that various drivers are stuck 
waiting for the mutex. I believe that this all happens because the SPI 
driver crashed during the probe and prevents any further drivers from 
probing.

Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-23 19:07                             ` Danilo Krummrich
@ 2026-01-27 14:58                               ` Jon Hunter
  2026-01-27 15:18                                 ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Jon Hunter @ 2026-01-27 14:58 UTC (permalink / raw)
  To: Danilo Krummrich, Gui-Dong Han
  Cc: Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel,
	baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org


On 23/01/2026 19:07, Danilo Krummrich wrote:
> On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote:
>> It seems the issue is simpler than a recursive registration deadlock.
>> Looking at the logs, tegra_qspi_probe triggers a NULL pointer
>> dereference (Oops) while holding the device_lock. The mutex likely
>> remains marked as held/orphaned, blocking subsequent driver bindings
>> on the same bus.
>>
>> This likely explains why lockdep was silent. Since this is not a lock
>> dependency cycle or a recursive locking violation, but rather a lock
>> remaining held by a terminated task, lockdep would not flag it as a
>> deadlock pattern.
>>
>> This is indeed a side effect of enforcing the lock here—it amplifies
>> the impact of a crash. However, an Oops while holding the device_lock
>> is generally catastrophic regardless.
> 
> This makes sense to me; it might indeed be as simple as that.

Yes I believe that this is the case too.

BTW, if I apply the SPI series from Breno [0], which fixes the crash in the
SPI driver, then everything works fine.

Jon

[0] https://lore.kernel.org/linux-tegra/20260126-tegra_xfer-v2-0-6d2115e4f387@debian.org/T/#t
-- 
nvpublic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-27 14:53                   ` Jon Hunter
@ 2026-01-27 15:05                     ` Gui-Dong Han
  0 siblings, 0 replies; 63+ messages in thread
From: Gui-Dong Han @ 2026-01-27 15:05 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Danilo Krummrich, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Tue, Jan 27, 2026 at 10:53 PM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Danilo,
>
> On 22/01/2026 19:35, Danilo Krummrich wrote:
> > On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote:
> >> On 22/01/2026 18:12, Danilo Krummrich wrote:
> >>> With this diff, if I intentionally create a deadlock condition on my machine, I
> >>> do see a lockdep splat as expected.
> >>>
> >>> Anyways, another option would be to attach a hardware debugger (I assume you
> >>> have TRACE32 or something available?) and then get a backtrace from the CPU
> >>> affected of the deadlock.
> >>
> >> Unfortunately, these days I don't have such tools available so that's
> >> not an option.
> >
> > Hm..slowly running out of options. :)
> >
> > I remember you previously said that you can still SSH into the machine? If so,
> > can you please share the the first output of
> >
> >       echo l > /proc/sysrq-trigger
> >
> > directly after booting?
> >
> > Subsequently, can you please also run
> >
> >       echo w > /proc/sysrq-trigger
> >
> > and
> >
> >       echo t > /proc/sysrq-trigger
>
> You can find the output of the above commands here:
>
> https://pastebin.com/PuhFURwh

Thanks for the logs.

Looking at the trace, it confirms the previous suspicion. Since there
is no circular dependency shown in the logs, it is not a classic
recursive deadlock but rather the device_lock remaining held due to
the earlier crash in the QSPI driver. This prevented other devices on
the same bus from completing their probe.

I'm glad to hear that Breno's SPI fixes resolve the issue. It's a
happy ending for this case.

Thanks for the hard work on debugging this!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-27 14:58                               ` Jon Hunter
@ 2026-01-27 15:18                                 ` Danilo Krummrich
  0 siblings, 0 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-01-27 15:18 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV,
	linux-tegra@vger.kernel.org

On Tue Jan 27, 2026 at 3:58 PM CET, Jon Hunter wrote:
>
> On 23/01/2026 19:07, Danilo Krummrich wrote:
>> On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote:
>>> It seems the issue is simpler than a recursive registration deadlock.
>>> Looking at the logs, tegra_qspi_probe triggers a NULL pointer
>>> dereference (Oops) while holding the device_lock. The mutex likely
>>> remains marked as held/orphaned, blocking subsequent driver bindings
>>> on the same bus.
>>>
>>> This likely explains why lockdep was silent. Since this is not a lock
>>> dependency cycle or a recursive locking violation, but rather a lock
>>> remaining held by a terminated task, lockdep would not flag it as a
>>> deadlock pattern.
>>>
>>> This is indeed a side effect of enforcing the lock here—it amplifies
>>> the impact of a crash. However, an Oops while holding the device_lock
>>> is generally catastrophic regardless.
>> 
>> This makes sense to me; it might indeed be as simple as that.
>
> Yes I believe that this is the case too.
>
> BTW, if I apply the SPI series from Breno [0], which fixes crash in the 
> SPI driver, then everything works fine.

Thanks for confirming!

- Danilo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-20 13:22 ` Mark Brown
                     ` (3 preceding siblings ...)
  2026-01-21  7:40   ` David Heidelberg
@ 2026-02-11 10:42   ` Alexander Stein
  2026-02-11 13:56     ` Danilo Krummrich
  4 siblings, 1 reply; 63+ messages in thread
From: Alexander Stein @ 2026-02-11 10:42 UTC (permalink / raw)
  To: Gui-Dong Han, Mark Brown
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	Aishwarya.TCV

Hi,

Am Dienstag, 20. Januar 2026, 14:22:19 CET schrieb Mark Brown:
> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
> > Currently, driver_match_device() is called from three sites. One site
> > (__device_attach_driver) holds device_lock(dev), but the other two
> > (bind_store and __driver_attach) do not. This inconsistency means that
> > bus match() callbacks are not guaranteed to be called with the lock
> > held.
> 
> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to
> this commit.  The boot grinds to a halt near the end of boot:
> 
> [    2.570549] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    2.623547] msm_serial: driver initialized
> [    2.624058] SuperH (H)SCI(F) driver initialized
> [    2.624312] STM32 USART driver initialized
> 
> with no further output, full log:
> 
>    https://lava.sirena.org.uk/scheduler/job/2387335#L862
> 
> We are also seeing similar looking boot hangs on some Qualcomm platforms
> in Arm's test lab which aren't verified to be the same thing but are
> hanging at a similar point in boot.

I've also experienced a boot problem on an NXP i.MX8QM based board
(DT currently in process of upstreaming). Bisect also resulted in this
commit.

Boot suddenly blocks; after about 4 minutes I get a blocked task backtrace:
[    1.053132] NET: Registered PF_PACKET protocol family
[    1.057958] Key type dns_resolver registered
[    1.070822] registered taskstats version 1
[    1.074738] Loading compiled-in X.509 certificates
[    1.091208] Demotion targets for Node 0: null
[    1.099632] imx-scu system-controller: NXP i.MX SCU Initialized
[  242.855834] INFO: task swapper/0:1 blocked for more than 120 seconds.
[  242.861994]       Not tainted 6.19.0-rc5+ #41
[  242.866304] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.874122] task:swapper/0       state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x0140 flags:0x00000010
[  242.884981] Call trace:
[  242.887406]  __switch_to+0xdc/0x120 (T)
[  242.891233]  __schedule+0x310/0x7a0
[  242.894708]  schedule+0x30/0xa4
[  242.897835]  schedule_timeout+0x10c/0x118
[  242.901831]  wait_for_completion+0xd4/0x170
[  242.906002]  __flush_work+0x6c/0xdc
[  242.909476]  flush_work+0x2c/0x38
[  242.912777]  deferred_probe_initcall+0x58/0xb0
[  242.917208]  do_one_initcall+0x40/0x24c
[  242.921030]  do_initcalls+0xf0/0x1c4
[  242.924592]  kernel_init_freeable+0x114/0x224
[  242.928936]  kernel_init+0x20/0x1d4
[  242.932413]  ret_from_fork+0x10/0x20
[  242.935990] INFO: task kworker/u16:1:42 blocked for more than 120 seconds.
[  242.942836]       Not tainted 6.19.0-rc5+ #41
[  242.947179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.955000] task:kworker/u16:1   state:D stack:0     pid:42    tgid:42    ppid:2      task_flags:0x4208060 flags:0x00000010
[  242.966127] Workqueue: events_unbound deferred_probe_work_func
[  242.971940] Call trace:
[  242.974364]  __switch_to+0xdc/0x120 (T)
[  242.978195]  __schedule+0x310/0x7a0
[  242.981669]  schedule+0x30/0xa4
[  242.984803]  schedule_preempt_disabled+0x20/0x38
[  242.989401]  __mutex_lock.constprop.0+0x4c4/0xc90
[  242.994094]  __mutex_lock_slowpath+0x10/0x18
[  242.998349]  mutex_lock+0x68/0x74
[  243.001650]  __driver_attach+0x28/0x254
[  243.005473]  bus_for_each_dev+0x74/0xd4
[  243.009295]  driver_attach+0x20/0x28
[  243.012857]  bus_add_driver+0x118/0x238
[  243.016681]  driver_register+0x58/0x114
[  243.020502]  __platform_driver_register+0x20/0x28
[  243.025193]  imx_clk_scu_init+0xe8/0xf8
[  243.029016]  imx8qxp_clk_probe+0x3c/0x11c0
[  243.033099]  platform_probe+0x58/0x90
[  243.036748]  really_probe+0xbc/0x3d4
[  243.040310]  __driver_probe_device+0x7c/0x174
[  243.044653]  driver_probe_device+0x38/0x108
[  243.048823]  __device_attach_driver+0xb0/0x154
[  243.053254]  bus_for_each_drv+0x80/0xdc
[  243.057076]  __device_attach+0x9c/0x1bc
[  243.060899]  device_initial_probe+0x48/0x4c
[  243.065069]  bus_probe_device+0x34/0x98
[  243.068898]  device_add+0x4bc/0x6b4
[  243.072367]  of_device_add+0x40/0x58
[  243.075929]  of_platform_device_create_pdata+0x94/0x158
[  243.081141]  of_platform_bus_create+0xcc/0x280
[  243.085571]  of_platform_populate+0x78/0x174
[  243.089828]  devm_of_platform_populate+0x54/0xb0
[  243.094435]  imx_scu_probe+0x1f8/0x30c
[  243.098169]  platform_probe+0x58/0x90
[  243.101819]  really_probe+0xbc/0x3d4
[  243.105379]  __driver_probe_device+0x7c/0x174
[  243.109728]  driver_probe_device+0x38/0x108
[  243.113896]  __device_attach_driver+0xb0/0x154
[  243.118326]  bus_for_each_drv+0x80/0xdc
[  243.122148]  __device_attach+0x9c/0x1bc
[  243.125971]  device_initial_probe+0x48/0x4c
[  243.130141]  bus_probe_device+0x34/0x98
[  243.133963]  deferred_probe_work_func+0x9c/0xf4
[  243.138480]  process_one_work+0x14c/0x3c0
[  243.142477]  worker_thread+0x2dc/0x3f0
[  243.146212]  kthread+0x124/0x1e4
[  243.149427]  ret_from_fork+0x10/0x20
[  243.152992] INFO: task kworker/u16:1:42 is blocked on a mutex likely owned by task kworker/u16:1:42.

The patches regarding QSPI or SMMU in this thread don't seem to apply to / affect this platform.

Best regards,
Alexander
-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
http://www.tq-group.com/



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-02-11 10:42   ` Alexander Stein
@ 2026-02-11 13:56     ` Danilo Krummrich
  0 siblings, 0 replies; 63+ messages in thread
From: Danilo Krummrich @ 2026-02-11 13:56 UTC (permalink / raw)
  To: Alexander Stein
  Cc: Gui-Dong Han, Mark Brown, gregkh, rafael, linux-kernel,
	baijiaju1990, Qiu-ji Chen, Aishwarya.TCV

On Wed Feb 11, 2026 at 11:42 AM CET, Alexander Stein wrote:
> Am Dienstag, 20. Januar 2026, 14:22:19 CET schrieb Mark Brown:
>> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
> I've also experienced a boot problem on an NXP i.MX8QM based board
> (DT currently in process of upstreaming). Bisect also resulted in this
> commit.

Thanks for reporting (and the backtrace, that helps a lot).

This is another case where a platform driver registers another platform driver
in probe(). The following diff should fix the issue, I will send out a patch
soon.

diff --git a/drivers/clk/imx/clk-imx8qxp.c b/drivers/clk/imx/clk-imx8qxp.c
index 3ae162625bb1..d89a2f40771e 100644
--- a/drivers/clk/imx/clk-imx8qxp.c
+++ b/drivers/clk/imx/clk-imx8qxp.c
@@ -346,7 +346,29 @@ static struct platform_driver imx8qxp_clk_driver = {
        },
        .probe = imx8qxp_clk_probe,
 };
-module_platform_driver(imx8qxp_clk_driver);
+
+static int __init imx8qxp_init(void)
+{
+       int ret;
+
+       ret = platform_driver_register(&imx8qxp_clk_driver);
+       if (ret)
+               return ret;
+
+       ret = imx_clk_scu_module_init();
+       if (ret)
+               platform_driver_unregister(&imx8qxp_clk_driver);
+
+       return ret;
+}
+module_init(imx8qxp_init);
+
+static void __exit imx8qxp_exit(void)
+{
+       imx_clk_scu_module_exit();
+       platform_driver_unregister(&imx8qxp_clk_driver);
+}
+module_exit(imx8qxp_exit);

 MODULE_AUTHOR("Aisheng Dong <aisheng.dong@nxp.com>");
 MODULE_DESCRIPTION("NXP i.MX8QXP clock driver");
diff --git a/drivers/clk/imx/clk-scu.c b/drivers/clk/imx/clk-scu.c
index 34c9dc1fb20e..c90d21e05f91 100644
--- a/drivers/clk/imx/clk-scu.c
+++ b/drivers/clk/imx/clk-scu.c
@@ -191,6 +191,16 @@ static bool imx_scu_clk_is_valid(u32 rsrc_id)
        return p != NULL;
 }

+int __init imx_clk_scu_module_init(void)
+{
+       return platform_driver_register(&imx_clk_scu_driver);
+}
+
+void __exit imx_clk_scu_module_exit(void)
+{
+       platform_driver_unregister(&imx_clk_scu_driver);
+}
+
 int imx_clk_scu_init(struct device_node *np,
                     const struct imx_clk_scu_rsrc_table *data)
 {
@@ -215,7 +225,7 @@ int imx_clk_scu_init(struct device_node *np,
                rsrc_table = data;
        }

-       return platform_driver_register(&imx_clk_scu_driver);
+       return 0;
 }

 /*
diff --git a/drivers/clk/imx/clk-scu.h b/drivers/clk/imx/clk-scu.h
index af7b697f51ca..ca82f2cce897 100644
--- a/drivers/clk/imx/clk-scu.h
+++ b/drivers/clk/imx/clk-scu.h
@@ -25,6 +25,8 @@ extern const struct imx_clk_scu_rsrc_table imx_clk_scu_rsrc_imx8dxl;
 extern const struct imx_clk_scu_rsrc_table imx_clk_scu_rsrc_imx8qxp;
 extern const struct imx_clk_scu_rsrc_table imx_clk_scu_rsrc_imx8qm;

+int __init imx_clk_scu_module_init(void);
+void __exit imx_clk_scu_module_exit(void);
 int imx_clk_scu_init(struct device_node *np,
                     const struct imx_clk_scu_rsrc_table *data);
 struct clk_hw *imx_scu_of_clk_src_get(struct of_phandle_args *clkspec,


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-01-13 16:28 [PATCH v5] driver core: enforce device_lock for driver_match_device() Gui-Dong Han
                   ` (3 preceding siblings ...)
  2026-01-20 13:22 ` Mark Brown
@ 2026-02-25 20:19 ` Cristian Marussi
  2026-02-25 20:38   ` Danilo Krummrich
  4 siblings, 1 reply; 63+ messages in thread
From: Cristian Marussi @ 2026-02-25 20:19 UTC (permalink / raw)
  To: Gui-Dong Han
  Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen,
	cristian.marussi, gatien.chevallier, sudeep.holla, arm-scmi

On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote:
> Currently, driver_match_device() is called from three sites. One site
> (__device_attach_driver) holds device_lock(dev), but the other two
> (bind_store and __driver_attach) do not. This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held.
> 
> Fix this by introducing driver_match_device_locked(), which guarantees
> holding the device lock using a scoped guard. Replace the unlocked calls
> in bind_store() and __driver_attach() with this new helper. Also add a
> lock assertion to driver_match_device() to enforce this guarantee.

Hi,

it has been reported by Gatien (in CC) that this breaks the SCMI OPTEE
transport.

Moreover, I still have to verify this, BUT I think it also breaks the SCMI
Virtio transport: both call platform_driver_register() during their
probe, because a few years ago the SCMI transports were reworked into
standalone, full-fledged drivers.

I'll have a look in the next few days to see if we can cope with this.

Thanks,
Cristian



* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-02-25 20:19 ` Cristian Marussi
@ 2026-02-25 20:38   ` Danilo Krummrich
  2026-02-26  8:54     ` Gatien CHEVALLIER
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-02-25 20:38 UTC (permalink / raw)
  To: Cristian Marussi
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, gatien.chevallier, sudeep.holla, arm-scmi

On Wed Feb 25, 2026 at 9:19 PM CET, Cristian Marussi wrote:
> it has been reported by Gatien (in CC) that this breaks the SCMI OPTEE
> transport.
>
> Moreover I still have to verify, BUT I think this also breaks SCMI
> Virtio transport since both call platform_driver_register() during their
> probe, since a few years ago the SCMI transports have been reworked to be
> standalone full-fledged drivers.

I've had a quick look and I'm pretty sure that both transports/virtio.c and
transports/optee.c are broken.

Both cases look identical and I think the fix should be as trivial as moving
platform_driver_register() into module_init().
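
For illustration only, a minimal userspace sketch of that shape. It is not
the actual SCMI change: struct model_driver and every model_*/transport_*
name are hypothetical stand-ins, not kernel APIs. The only point is that
registration moves out of probe() and into the module init/exit pair:

```c
/* Userspace model only; nothing here is a real kernel API. */
#include <stdbool.h>

/* Hypothetical stand-in for struct platform_driver. */
struct model_driver {
	const char *name;
	bool registered;
};

/* Hypothetical stand-in for platform_driver_register(). */
static int model_driver_register(struct model_driver *drv)
{
	drv->registered = true;
	return 0;
}

/* Hypothetical stand-in for platform_driver_unregister(). */
static void model_driver_unregister(struct model_driver *drv)
{
	drv->registered = false;
}

static struct model_driver transport_driver = { .name = "transport-model" };

/* probe() only sets up per-device state; it no longer registers a driver. */
static int transport_probe(void)
{
	return 0;
}

/* Registration happens once, at module load time. */
static int transport_module_init(void)
{
	return model_driver_register(&transport_driver);
}

/* Unregistration mirrors it at module unload time. */
static void transport_module_exit(void)
{
	model_driver_unregister(&transport_driver);
}
```

Note that this sketch says nothing about probe ordering between the two
drivers; it only shows where the registration call lives.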

> I'll have a look in the next days if we can cope with this.

I already fixed those three [1,2,3]. If you can provide a patch for both of
them, that'd be great. Otherwise, please let me know if you want me to send
something.

Thanks,
Danilo

[1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/
[2] https://lore.kernel.org/driver-core/20260123133614.72586-1-dakr@kernel.org/
[3] https://lore.kernel.org/driver-core/20260212235842.85934-1-dakr@kernel.org/


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-02-25 20:38   ` Danilo Krummrich
@ 2026-02-26  8:54     ` Gatien CHEVALLIER
  2026-02-26 11:15       ` Danilo Krummrich
  0 siblings, 1 reply; 63+ messages in thread
From: Gatien CHEVALLIER @ 2026-02-26  8:54 UTC (permalink / raw)
  To: Danilo Krummrich, Cristian Marussi
  Cc: Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990,
	Qiu-ji Chen, sudeep.holla, arm-scmi



On 2/25/26 21:38, Danilo Krummrich wrote:
> On Wed Feb 25, 2026 at 9:19 PM CET, Cristian Marussi wrote:
>> it has been reported by Gatien (in CC) that this breaks the SCMI OPTEE
>> transport.
>>
>> Moreover I still have to verify, BUT I think this also breaks SCMI
>> Virtio transport since both call platform_driver_register() during their
>> probe, since a few years ago the SCMI transports have been reworked to be
>> standalone full-fledged drivers.
> 
> I've had a quick look and I'm pretty sure that both transports/virtio.c and
> transports/optee.c are broken.
> 
> Both cases look identical and I think the fix should be as trivial as moving
> platform_driver_register() into module_init().
> 

Hello,

(Regarding OP-TEE SCMI transport)

It may not be that trivial because unconditionally moving it to
module_init() would not guarantee that the TEE bus driver probes
before the platform device driver. I think there's a bit more to it.

The platform device driver, as is, is not functional while the
TEE services are not available. Therefore, it could spuriously fail to
probe if it is probed before they are. Should we consider some sort of
API to know whether the services are available?

Gatien

>> I'll have a look in the next days if we can cope with this.
> 
> I already fixed those three [1,2,3]. If you can provide a patch for both of
> them, that'd be great. Otherwise, please let me know if you want me to send
> something.
> 
> Thanks,
> Danilo
> 
> [1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/
> [2] https://lore.kernel.org/driver-core/20260123133614.72586-1-dakr@kernel.org/
> [3] https://lore.kernel.org/driver-core/20260212235842.85934-1-dakr@kernel.org/



* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-02-26  8:54     ` Gatien CHEVALLIER
@ 2026-02-26 11:15       ` Danilo Krummrich
  2026-02-26 12:21         ` Cristian Marussi
  0 siblings, 1 reply; 63+ messages in thread
From: Danilo Krummrich @ 2026-02-26 11:15 UTC (permalink / raw)
  To: Gatien CHEVALLIER
  Cc: Cristian Marussi, Gui-Dong Han, gregkh, rafael, linux-kernel,
	baijiaju1990, Qiu-ji Chen, sudeep.holla, arm-scmi

On Thu Feb 26, 2026 at 9:54 AM CET, Gatien CHEVALLIER wrote:
> It may not be that trivial because unconditionally moving it to
> module_init() would not guarantee that the TEE bus driver probes
> before the platform device driver. I think there's a bit more to it.
>
> The platform device driver, as is, is not functional whilst the
> TEE services are not available. Therefore, it could falsely fail to
> probe if probed before. We may consider some sort of API to know if
> the services are available?

Ok, so there are three drivers somehow involved:

  (1) module_tee_client_driver(scmi_optee_service_driver)

  (2) scmi_optee_driver with compatible = "linaro,scmi-optee"

  (3) scmi_driver with name = "arm-scmi"

(1) registers (2) in probe(), and the only thing (2) does is allocate a
platform_device with name = "arm-scmi", such that (3) is probed.

So, to me it seems that the indirection through a fake platform device is
unnecessary and instead (3) should just have compatible = "linaro,scmi-optee"
and return -EPROBE_DEFER if (1) is not ready. Besides that, this seems like a
use-case for device links.
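
As a rough userspace sketch of the -EPROBE_DEFER idea (not a patch; the
tee_service_ready flag and the model_* functions are hypothetical, and only
the numeric value of EPROBE_DEFER is taken from the kernel):

```c
/* Userspace model only; nothing here is a real kernel API. */
#include <stdbool.h>

/* Same numeric value the kernel uses for EPROBE_DEFER. */
#define EPROBE_DEFER 517

/* Hypothetical flag: set once the TEE client driver (1) has probed. */
static bool tee_service_ready;

/* Models probe() of (1): the TEE service becomes available. */
static void model_tee_service_probe(void)
{
	tee_service_ready = true;
}

/*
 * Models probe() of (3): ask to be retried later for as long as the
 * TEE service it depends on is not available yet.
 */
static int model_scmi_probe(void)
{
	if (!tee_service_ready)
		return -EPROBE_DEFER;

	return 0;
}
```

In this model a probe attempt made too early is not a hard failure; it just
reports that it wants to be retried once (1) is there.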

Thanks,
Danilo


* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
  2026-02-26 11:15       ` Danilo Krummrich
@ 2026-02-26 12:21         ` Cristian Marussi
  0 siblings, 0 replies; 63+ messages in thread
From: Cristian Marussi @ 2026-02-26 12:21 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Gatien CHEVALLIER, Cristian Marussi, Gui-Dong Han, gregkh, rafael,
	linux-kernel, baijiaju1990, Qiu-ji Chen, sudeep.holla, arm-scmi

On Thu, Feb 26, 2026 at 12:15:33PM +0100, Danilo Krummrich wrote:
> On Thu Feb 26, 2026 at 9:54 AM CET, Gatien CHEVALLIER wrote:

Hi,

I think my mail thread is a bit messed up...my previous reply is nowhere
to be found, so I will repeat here...

> > It may not be that trivial because unconditionally moving it to
> > module_init() would not guarantee that the TEE bus driver probes
> > before the platform device driver. I think there's a bit more to it.

@Gatien

Yes, that is just what I was going to reply...these SCMI transport drivers
basically have to probe first against their related subsystem, like
OPTEE/Virtio, and only once fully probed at that level are they fully
functional and able to become a transport supplier for the SCMI core...
...registering the platform driver at the end of the OPTEE/VIRTIO probe
allowed us to avoid any kind of EPROBE_DEFER handling that can come up,
especially when all of this is split into loadable modules...

...I have to review this in depth, but I think I had a first version of this
transport rework that worked a bit differently...

> >
> > The platform device driver, as is, is not functional whilst the
> > TEE services are not available. Therefore, it could falsely fail to
> > probe if probed before. We may consider some sort of API to know if
> > the services are available?
> 
> Ok, so there are three drivers somehow involved:
> 
>   (1) module_tee_client_driver(scmi_optee_service_driver)
> 
>   (2) scmi_optee_driver with compatible = "linaro,scmi-optee"
> 
>   (3) scmi_driver with name = "arm-scmi"
> 
> (1) registers (2) in probe(), and the only thing (2) does is allocate a
> platform_device with name = "arm-scmi", such that (3) is probed.
> 
> So, to me it seems that the indirection through a fake platform device is
> unnecessary and instead (3) should just have compatible = "linaro,scmi-optee"

@Danilo

That was the initial state: all the transports and related compatibles embedded
into the SCMI core stack...then we split out the transports as loadable standalone
drivers, since it seemed to be cleaner (and some vendors also asked for it at the
time...).

	https://lore.kernel.org/arm-scmi/20240812173340.3912830-1-cristian.marussi@arm.com/

> and return -EPROBE_DEFER if (1) is not ready. Besides that, this seems like a
> use-case for device links.

We tried to unify the handling because other transports (mailbox/smc) do
NOT rely on a 3rd party subsystem being initialized, and this approach
indeed simplified everything: you don't need any DEFER, simply because
the SCMI core probing will be kicked off ONLY after at least one transport
has been successfully probed and initialized...there is indeed a devlink later
that classifies the used transport as a supplier of the core SCMI
arm-scmi driver...

I have to recall the full details, especially of the reqs that led to
this design, but I suppose at this point we should reintroduce a bit of
EDEFER...

Thanks,
Cristian


