* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() [not found] ` <ef37ed64-24ad-4b82-bc6c-d3bc72aaf232@samsung.com> @ 2026-01-21 20:00 ` Jon Hunter 2026-01-21 21:42 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-21 20:00 UTC (permalink / raw) To: Marek Szyprowski, Mark Brown, Gui-Dong Han Cc: gregkh, rafael, dakr, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On 20/01/2026 15:23, Marek Szyprowski wrote: > On 20.01.2026 14:22, Mark Brown wrote: >> On Wed, Jan 14, 2026 at 12:28:43AM +0800, Gui-Dong Han wrote: >>> Currently, driver_match_device() is called from three sites. One site >>> (__device_attach_driver) holds device_lock(dev), but the other two >>> (bind_store and __driver_attach) do not. This inconsistency means that >>> bus match() callbacks are not guaranteed to be called with the lock >>> held. >> I'm seeing boot hangs on Arm Juno in next/pending-fixes which bisect to >> this commit. The boot grinds to a halt near the end of boot: >> >> [ 2.570549] ledtrig-cpu: registered to indicate activity on CPUs >> [ 2.618301] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled >> [ 2.623547] msm_serial: driver initialized >> [ 2.624058] SuperH (H)SCI(F) driver initialized >> [ 2.624312] STM32 USART driver initialized >> >> with no further output, full log: >> >> https://lava.sirena.org.uk/scheduler/job/2387335#L862 >> >> We are also seeing similar looking boot hangs on some Qualcomm platforms >> in Arm's test lab which aren't verified to be the same thing but are >> hanging at a similar point in boot. > > I've observed the same issue on Qualcomm RB5 board and bisecting lead me > also to this patch. My kernel log also doesn't reveal much information: > > ... > > [ 3.671227] vreg_bob: Setting 3008000-4000000uV > [ 3.676929] vreg_l1c_1p8: Setting 1800000-1800000uV > [ 3.682826] vreg_l2c_1p2: Setting 1200000-1200000uV > [ 3.688547] vreg_l3c_0p8: Setting 800000-800000uV > [ 3.694080] vreg_l4c_1p7: Setting 1704000-2928000uV > [ 3.699908] vreg_l5c_1p8: Setting 1800000-2928000uV > [ 3.705763] vreg_l6c_2p96: Setting 1800000-2960000uV > [ 3.711684] vreg_l7c_cam_vcm0_2p85: Setting 2856000-3104000uV > [ 3.718408] vreg_l8c_1p8: Setting 1800000-1800000uV > [ 3.724287] vreg_l9c_2p96: Setting 2704000-2960000uV > [ 3.730218] vreg_l10c_3p0: Setting 3000000-3000000uV > [ 3.736226] vreg_l11c_3p3: Setting 3296000-3296000uV > [ 3.743413] vreg_s8c_1p3: Setting 1352000-1352000uV > [ 3.771370] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled > [ 3.792020] msm_serial: driver initialized > [ 3.797633] SuperH (H)SCI(F) driver initialized > [ 3.802881] STM32 USART driver initialized > > [hang/freeze] I am seeing a similar issue on one of our Tegra boards and bisect also points to this commit. It is odd because it only appears to impact the Tegra194 Jetson Xavier NX board (tegra194-p3509-0000+p3668-0000.dts). It appears to boot enough so the test can SSH into the device, but the kernel log does not show the us getting to the console prompt. It also appears that a lot of drivers are not bound as expected. I would need to check if those are all modules or not. Jon -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-21 20:00 ` [PATCH v5] driver core: enforce device_lock for driver_match_device() Jon Hunter @ 2026-01-21 21:42 ` Danilo Krummrich 2026-01-22 17:28 ` Jon Hunter 0 siblings, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2026-01-21 21:42 UTC (permalink / raw) To: Jon Hunter Cc: Marek Szyprowski, Mark Brown, Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote: > It is odd because it only appears to impact the Tegra194 Jetson Xavier > NX board (tegra194-p3509-0000+p3668-0000.dts). > > It appears to boot enough so the test can SSH into the device, but the > kernel log does not show the us getting to the console prompt. It also > appears that a lot of drivers are not bound as expected. I would need to > check if those are all modules or not. The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be related in this case. I quickyl checked all drivers with "tegra194" in their compatible string, but didn't see anything odd. Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING, CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following diff? (You will see a lockdep warning in faux_bus_init(), it's harmless and can be ignored.) [1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/ diff --git a/drivers/base/base.h b/drivers/base/base.h index 677320881af1..4741412d7e46 100644 --- a/drivers/base/base.h +++ b/drivers/base/base.h @@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv, static inline int driver_match_device_locked(const struct device_driver *drv, struct device *dev) { - guard(device)(dev); - return driver_match_device(drv, dev); + int ret; + + mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_); + ret = driver_match_device(drv, dev); + mutex_release(&dev->mutex.dep_map, _THIS_IP_); + + return ret; } static inline void dev_sync_state(struct device *dev) diff --git a/drivers/base/core.c b/drivers/base/core.c index 40de2f51a1b1..56c62b3016aa 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -2557,6 +2557,8 @@ static void device_release(struct kobject *kobj) kfree(dev->dma_range_map); + lockdep_unregister_key(&dev->lock_key); + if (dev->release) dev->release(dev); else if (dev->type && dev->type->release) @@ -3159,7 +3161,9 @@ void device_initialize(struct device *dev) kobject_init(&dev->kobj, &device_ktype); INIT_LIST_HEAD(&dev->dma_pools); mutex_init(&dev->mutex); - lockdep_set_novalidate_class(&dev->mutex); + //lockdep_set_novalidate_class(&dev->mutex); + lockdep_register_key(&dev->lock_key); + lockdep_set_class(&dev->mutex, &dev->lock_key); spin_lock_init(&dev->devres_lock); INIT_LIST_HEAD(&dev->devres_head); device_pm_init(dev); diff --git a/include/linux/device.h b/include/linux/device.h index 0be95294b6e6..dc898a420bc2 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -579,6 +579,7 @@ struct device { struct mutex mutex; /* mutex to synchronize calls to * its driver. */ + struct lock_class_key lock_key; struct dev_links_info links; struct dev_pm_info power; ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-21 21:42 ` Danilo Krummrich @ 2026-01-22 17:28 ` Jon Hunter 2026-01-22 17:55 ` Gui-Dong Han 0 siblings, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-22 17:28 UTC (permalink / raw) To: Danilo Krummrich Cc: Marek Szyprowski, Mark Brown, Gui-Dong Han, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org Hi Danilo, On 21/01/2026 21:42, Danilo Krummrich wrote: > On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote: >> It is odd because it only appears to impact the Tegra194 Jetson Xavier >> NX board (tegra194-p3509-0000+p3668-0000.dts). >> >> It appears to boot enough so the test can SSH into the device, but the >> kernel log does not show the us getting to the console prompt. It also >> appears that a lot of drivers are not bound as expected. I would need to >> check if those are all modules or not. > > The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be > related in this case. > > I quickyl checked all drivers with "tegra194" in their compatible string, but > didn't see anything odd. > > Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING, > CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following > diff? > > (You will see a lockdep warning in faux_bus_init(), it's harmless and can be > ignored.) Thanks. I do the lockdep warning in faux_bus_init() but that's the only one. I have verified that all these CONFIGs are correctly enabled in the build. The device boots fine with the below diff, but I am guessing that that is expected? Any other thoughts? Thanks Jon > [1] https://lore.kernel.org/driver-core/20260121141215.29658-1-dakr@kernel.org/ > > diff --git a/drivers/base/base.h b/drivers/base/base.h > index 677320881af1..4741412d7e46 100644 > --- a/drivers/base/base.h > +++ b/drivers/base/base.h > @@ -190,8 +190,13 @@ static inline int driver_match_device(const struct device_driver *drv, > static inline int driver_match_device_locked(const struct device_driver *drv, > struct device *dev) > { > - guard(device)(dev); > - return driver_match_device(drv, dev); > + int ret; > + > + mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_); > + ret = driver_match_device(drv, dev); > + mutex_release(&dev->mutex.dep_map, _THIS_IP_); > + > + return ret; > } > > static inline void dev_sync_state(struct device *dev) > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 40de2f51a1b1..56c62b3016aa 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -2557,6 +2557,8 @@ static void device_release(struct kobject *kobj) > > kfree(dev->dma_range_map); > > + lockdep_unregister_key(&dev->lock_key); > + > if (dev->release) > dev->release(dev); > else if (dev->type && dev->type->release) > @@ -3159,7 +3161,9 @@ void device_initialize(struct device *dev) > kobject_init(&dev->kobj, &device_ktype); > INIT_LIST_HEAD(&dev->dma_pools); > mutex_init(&dev->mutex); > - lockdep_set_novalidate_class(&dev->mutex); > + //lockdep_set_novalidate_class(&dev->mutex); > + lockdep_register_key(&dev->lock_key); > + lockdep_set_class(&dev->mutex, &dev->lock_key); > spin_lock_init(&dev->devres_lock); > INIT_LIST_HEAD(&dev->devres_head); > device_pm_init(dev); > diff --git a/include/linux/device.h b/include/linux/device.h > index 0be95294b6e6..dc898a420bc2 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -579,6 +579,7 @@ struct device { > struct mutex mutex; /* mutex to synchronize calls to > * its driver. > */ > + struct lock_class_key lock_key; > > struct dev_links_info links; > struct dev_pm_info power; > -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 17:28 ` Jon Hunter @ 2026-01-22 17:55 ` Gui-Dong Han 2026-01-22 18:12 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Gui-Dong Han @ 2026-01-22 17:55 UTC (permalink / raw) To: Jon Hunter Cc: Danilo Krummrich, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Fri, Jan 23, 2026 at 1:28 AM Jon Hunter <jonathanh@nvidia.com> wrote: > > Hi Danilo, > > On 21/01/2026 21:42, Danilo Krummrich wrote: > > On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote: > >> It is odd because it only appears to impact the Tegra194 Jetson Xavier > >> NX board (tegra194-p3509-0000+p3668-0000.dts). > >> > >> It appears to boot enough so the test can SSH into the device, but the > >> kernel log does not show the us getting to the console prompt. It also > >> appears that a lot of drivers are not bound as expected. I would need to > >> check if those are all modules or not. > > > > The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be > > related in this case. > > > > I quickyl checked all drivers with "tegra194" in their compatible string, but > > didn't see anything odd. > > > > Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING, > > CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following > > diff? > > > > (You will see a lockdep warning in faux_bus_init(), it's harmless and can be > > ignored.) > > Thanks. I do the lockdep warning in faux_bus_init() but that's the only > one. I have verified that all these CONFIGs are correctly enabled in the > build. The device boots fine with the below diff, but I am guessing that > that is expected? > > Any other thoughts? Can you please try applying the following commit? https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de Robin Murphy confirmed that the qcom specific issue might actually impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is enabled), as the implementation init code is still executed: https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/ So, this patch might fix the issue on Tegra as well. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 17:55 ` Gui-Dong Han @ 2026-01-22 18:12 ` Danilo Krummrich 2026-01-22 18:58 ` Jon Hunter 0 siblings, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2026-01-22 18:12 UTC (permalink / raw) To: Gui-Dong Han Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Thu Jan 22, 2026 at 6:55 PM CET, Gui-Dong Han wrote: > On Fri, Jan 23, 2026 at 1:28 AM Jon Hunter <jonathanh@nvidia.com> wrote: >> >> Hi Danilo, >> >> On 21/01/2026 21:42, Danilo Krummrich wrote: >> > On Wed Jan 21, 2026 at 9:00 PM CET, Jon Hunter wrote: >> >> It is odd because it only appears to impact the Tegra194 Jetson Xavier >> >> NX board (tegra194-p3509-0000+p3668-0000.dts). >> >> >> >> It appears to boot enough so the test can SSH into the device, but the >> >> kernel log does not show the us getting to the console prompt. It also >> >> appears that a lot of drivers are not bound as expected. I would need to >> >> check if those are all modules or not. >> > >> > The other reports were fixed by [1], but the issue in arm-smmu-qcom shouldn't be >> > related in this case. >> > >> > I quickyl checked all drivers with "tegra194" in their compatible string, but >> > didn't see anything odd. >> > >> > Can you please try to enable CONFIG_LOCKDEP, CONFIG_PROVE_LOCKING, >> > CONFIG_DEBUG_MUTEXES and see if you get a lockdep splat using the following >> > diff? >> > >> > (You will see a lockdep warning in faux_bus_init(), it's harmless and can be >> > ignored.) >> >> Thanks. I do the lockdep warning in faux_bus_init() but that's the only >> one. I have verified that all these CONFIGs are correctly enabled in the >> build. The device boots fine with the below diff, but I am guessing that >> that is expected? Yes, that's expected, we not actually taking the lock, but assert to lockdep that we did. The fact that we use a dynamic lock class key for each device mutex to avoid false positives should also be fine. >> Any other thoughts? With this diff, if I intentionally create a deadlock condition on my machine, I do see a lockdep splat as expected. Anyways, another option would be to attach a hardware debugger (I assume you have TRACE32 or something available?) and then get a backtrace from the CPU affected of the deadlock. > Can you please try applying the following commit? > > https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de > > Robin Murphy confirmed that the qcom specific issue might actually > impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is > enabled), as the implementation init code is still executed: > > https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/ > > So, this patch might fix the issue on Tegra as well. I thought of that as well, but looking at the code in arm_smmu_impl_init(), it seems that can't happen? if (of_device_is_compatible(np, "nvidia,tegra234-smmu") || of_device_is_compatible(np, "nvidia,tegra194-smmu") || of_device_is_compatible(np, "nvidia,tegra186-smmu")) return nvidia_smmu_impl_init(smmu); if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM)) smmu = qcom_smmu_impl_init(smmu); But maybe there is some odd case where the first if condition does not evaluate to true on tegra194, so maybe worth a try. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 18:12 ` Danilo Krummrich @ 2026-01-22 18:58 ` Jon Hunter 2026-01-22 19:35 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-22 18:58 UTC (permalink / raw) To: Danilo Krummrich, Gui-Dong Han Cc: Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On 22/01/2026 18:12, Danilo Krummrich wrote: ... >>> Any other thoughts? > > With this diff, if I intentionally create a deadlock condition on my machine, I > do see a lockdep splat as expected. > > Anyways, another option would be to attach a hardware debugger (I assume you > have TRACE32 or something available?) and then get a backtrace from the CPU > affected of the deadlock. Unfortunately, these days I don't have such tools available so that's not an option. >> Can you please try applying the following commit? >> >> https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git/commit/?h=driver-core-linus&id=ed1ac3c977dd6b119405fa36dd41f7151bd5b4de >> >> Robin Murphy confirmed that the qcom specific issue might actually >> impact other hardware platforms (provided ARM_SMMU_QCOM/ARCH_QCOM is >> enabled), as the implementation init code is still executed: >> >> https://lore.kernel.org/driver-core/d2ddbb72-30a8-44da-b761-876b2d37567e@arm.com/ >> >> So, this patch might fix the issue on Tegra as well. > > I thought of that as well, but looking at the code in arm_smmu_impl_init(), it > seems that can't happen? > > if (of_device_is_compatible(np, "nvidia,tegra234-smmu") || > of_device_is_compatible(np, "nvidia,tegra194-smmu") || > of_device_is_compatible(np, "nvidia,tegra186-smmu")) > return nvidia_smmu_impl_init(smmu); > > if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM)) > smmu = qcom_smmu_impl_init(smmu); > > But maybe there is some odd case where the first if condition does not evaluate > to true on tegra194, so maybe worth a try. I gave this a shot but that did not help either. Thanks Jon -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 18:58 ` Jon Hunter @ 2026-01-22 19:35 ` Danilo Krummrich 2026-01-23 13:57 ` Jon Hunter 2026-01-27 14:53 ` Jon Hunter 0 siblings, 2 replies; 17+ messages in thread From: Danilo Krummrich @ 2026-01-22 19:35 UTC (permalink / raw) To: Jon Hunter Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote: > On 22/01/2026 18:12, Danilo Krummrich wrote: >> With this diff, if I intentionally create a deadlock condition on my machine, I >> do see a lockdep splat as expected. >> >> Anyways, another option would be to attach a hardware debugger (I assume you >> have TRACE32 or something available?) and then get a backtrace from the CPU >> affected of the deadlock. > > Unfortunately, these days I don't have such tools available so that's > not an option. Hm..slowly running out of options. :) I remember you previously said that you can still SSH into the machine? If so, can you please share the the first output of echo l > /proc/sysrq-trigger directly after booting? Subsequently, can you please also run echo w > /proc/sysrq-trigger and echo t > /proc/sysrq-trigger If the output of the last shows a task in D state, you can also run cat /proc/$PID/stack Also, are there any OOT modules loaded? Thanks, Danilo ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 19:35 ` Danilo Krummrich @ 2026-01-23 13:57 ` Jon Hunter 2026-01-23 14:09 ` Danilo Krummrich 2026-01-27 14:53 ` Jon Hunter 1 sibling, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-23 13:57 UTC (permalink / raw) To: Danilo Krummrich Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On 22/01/2026 19:35, Danilo Krummrich wrote: > On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote: >> On 22/01/2026 18:12, Danilo Krummrich wrote: >>> With this diff, if I intentionally create a deadlock condition on my machine, I >>> do see a lockdep splat as expected. >>> >>> Anyways, another option would be to attach a hardware debugger (I assume you >>> have TRACE32 or something available?) and then get a backtrace from the CPU >>> affected of the deadlock. >> >> Unfortunately, these days I don't have such tools available so that's >> not an option. > > Hm..slowly running out of options. :) No worries. There appears to be a couple issues going on with this board. With the patch reverted the board boots fine and tests pass. Even in the passing case with this patch reverted, during boot I see a NULL pointer deference crash log from the QSPI driver. So I disabled the QSPI device in device-tree and with this patch the board boots fine and tests pass. There is a on-going thread for the QSPI driver to fix these NULL pointer deference crashes [0]. So the QSPI driver seems to be the root of the problem. Cheers Jon [0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 13:57 ` Jon Hunter @ 2026-01-23 14:09 ` Danilo Krummrich 2026-01-23 14:29 ` Jon Hunter 0 siblings, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2026-01-23 14:09 UTC (permalink / raw) To: Jon Hunter Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Fri Jan 23, 2026 at 2:57 PM CET, Jon Hunter wrote: > No worries. There appears to be a couple issues going on with this > board. With the patch reverted the board boots fine and tests pass. Even > in the passing case with this patch reverted, during boot I see a NULL > pointer deference crash log from the QSPI driver. So I disabled the QSPI > device in device-tree and with this patch the board boots fine and tests > pass. > > There is a on-going thread for the QSPI driver to fix these NULL pointer > deference crashes [0]. So the QSPI driver seems to be the root of the > problem. > > [0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t So, are you saying the problems you are seeing are unrelated to this patch and there is no deadlock? (At least this would explain why we couldn't get a lockdep splat with the diff I shared. :) Otherwise, can you please share the output of the commands I shared in my pevious mail? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 14:09 ` Danilo Krummrich @ 2026-01-23 14:29 ` Jon Hunter 2026-01-23 16:54 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-23 14:29 UTC (permalink / raw) To: Danilo Krummrich Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On 23/01/2026 14:09, Danilo Krummrich wrote: > On Fri Jan 23, 2026 at 2:57 PM CET, Jon Hunter wrote: >> No worries. There appears to be a couple issues going on with this >> board. With the patch reverted the board boots fine and tests pass. Even >> in the passing case with this patch reverted, during boot I see a NULL >> pointer deference crash log from the QSPI driver. So I disabled the QSPI >> device in device-tree and with this patch the board boots fine and tests >> pass. >> >> There is a on-going thread for the QSPI driver to fix these NULL pointer >> deference crashes [0]. So the QSPI driver seems to be the root of the >> problem. >> >> [0] https://lore.kernel.org/linux-tegra/aXJWRUhAe8F67-zG@gmail.com/T/#t > > So, are you saying the problems you are seeing are unrelated to this patch and > there is no deadlock? (At least this would explain why we couldn't get a lockdep > splat with the diff I shared. :) Not exactly. With vanilla -next I see various tests fail on this board and I can see various devices are not probed as expected. Bisect pointed to this patch. I can fix this by either: 1. Reverting this patch. 2. Disabling the QSPI driver. Now the QSPI driver has issues which need to be fixed which I am wondering once fix will avoid this problem. However, I guess regardless of the QSPI issue, should this patch be having such an impact? Looking at the bootlog [0], you can see the crash is occurring during the tegra_qspi_probe() and so I am guessing this what leads to the deadlock? And may be there is no way to avoid that? Please note that a lot of the boards I test are in a farm and I don't have direct access. So although I can see the test harness SSH'ing into the board, I am not accessing directly. However, we can run whatever tests we want. There are no OOT drivers being used this is just vanilla -next. Jon [0] https://pastebin.com/wJheruPP -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 14:29 ` Jon Hunter @ 2026-01-23 16:54 ` Danilo Krummrich 2026-01-23 18:53 ` Gui-Dong Han 0 siblings, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2026-01-23 16:54 UTC (permalink / raw) To: Jon Hunter Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Fri Jan 23, 2026 at 3:29 PM CET, Jon Hunter wrote: > I can fix this by either: > > 1. Reverting this patch. > 2. Disabling the QSPI driver. > > Now the QSPI driver has issues which need to be fixed which I am > wondering once fix will avoid this problem. > > However, I guess regardless of the QSPI issue, should this patch be > having such an impact? So, this patch by itself is correct, but it reveals when drivers do the wrong thing, that is register drivers from contexts where it neither makes sense nor it is supported by the driver core. The deadlock happens when a driver (A) registers another driver (B) from a context where the device lock of the device bound to (A) is held, e.g. from bus callbacks, such as probe(). See also [1]. While never valid, the deadlock does only occur when (A) and (B) are on the same bus, e.g. when a platform driver registers another platform driver in its probe() callback. However, it is a bit more tricky than that: Let's say a platform driver registers an SPI controller, then spi_register_controller() might scan the SPI bus and register SPI devices (not drivers), which are then probed as well. So far this is all fine, but if now in one of the SPI drivers probe() callbacks a platform driver is registered, you have a deadlock condition as well. So it seems that something of this kind is going on with drivers/spi/spi-tegra210-quad.c. I did already run quite thorough analysis throughout the whole kernel tree with various static analyzers and also played around with LLMs for finding this pattern. The tools gave me two results: (1) The IOMMU one I already fixed [2]. (2) The GPIO driver I posted a patch for in [3]. I specifically also looked for all drivers that are required to run all the peripherals in the tegra194-p3509-0000+p3668-0000.dts hierarchy, but couldn't catch anything. (This is also why I asked about OOT, because there are quite some compatible strings that are not supported by any upstream driver.) I think to really see what's going in with spi-tegra210-quad.c, we need the dumps of the sysrq-triggers I provided in a previous mail. I'd also recommend to pick a stable state of the spi-tegra210-quad.c driver and apply this patch on top (or just apply the spi-tegra210-quad.c fixes as well). Subsequently, we could try and retest with the diff I provided and the corresponding lockdep options enabled and with the sysrq-triggers (without the diff). [1] https://lore.kernel.org/lkml/DFU7CEPUSG9A.1KKGVW4HIPMSH@kernel.org/ [2] https://lore.kernel.org/all/20260121141215.29658-1-dakr@kernel.org/ [3] https://lore.kernel.org/all/20260123133614.72586-1-dakr@kernel.org/ > Please note that a lot of the boards I test are in a farm and I don't > have direct access. So although I can see the test harness SSH'ing into > the board, I am not accessing directly. However, we can run whatever > tests we want. Maybe you can trigger the sysrq-trigger from a custom test? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 16:54 ` Danilo Krummrich @ 2026-01-23 18:53 ` Gui-Dong Han 2026-01-23 19:07 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Gui-Dong Han @ 2026-01-23 18:53 UTC (permalink / raw) To: Danilo Krummrich Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Sat, Jan 24, 2026 at 12:54 AM Danilo Krummrich <dakr@kernel.org> wrote: > > On Fri Jan 23, 2026 at 3:29 PM CET, Jon Hunter wrote: > > I can fix this by either: > > > > 1. Reverting this patch. > > 2. Disabling the QSPI driver. > > > > Now the QSPI driver has issues which need to be fixed which I am > > wondering once fix will avoid this problem. > > > > However, I guess regardless of the QSPI issue, should this patch be > > having such an impact? > > So, this patch by itself is correct, but it reveals when drivers do the wrong > thing, that is register drivers from contexts where it neither makes sense nor > it is supported by the driver core. > > The deadlock happens when a driver (A) registers another driver (B) from a > context where the device lock of the device bound to (A) is held, e.g. from bus > callbacks, such as probe(). See also [1]. > > While never valid, the deadlock does only occur when (A) and (B) are on the same > bus, e.g. when a platform driver registers another platform driver in its > probe() callback. > > However, it is a bit more tricky than that: Let's say a platform driver > registers an SPI controller, then spi_register_controller() might scan the SPI > bus and register SPI devices (not drivers), which are then probed as well. So > far this is all fine, but if now in one of the SPI drivers probe() callbacks a > platform driver is registered, you have a deadlock condition as well. > > So it seems that something of this kind is going on with > drivers/spi/spi-tegra210-quad.c. > > I did already run quite thorough analysis throughout the whole kernel tree with > various static analyzers and also played around with LLMs for finding this > pattern. > > The tools gave me two results: > > (1) The IOMMU one I already fixed [2]. > (2) The GPIO driver I posted a patch for in [3]. > > I specifically also looked for all drivers that are required to run all the > peripherals in the tegra194-p3509-0000+p3668-0000.dts hierarchy, but couldn't > catch anything. > > (This is also why I asked about OOT, because there are quite some compatible > strings that are not supported by any upstream driver.) > > I think to really see what's going in with spi-tegra210-quad.c, we need the > dumps of the sysrq-triggers I provided in a previous mail. It seems the issue is simpler than a recursive registration deadlock. Looking at the logs, tegra_qspi_probe triggers a NULL pointer dereference (Oops) while holding the device_lock. The mutex likely remains marked as held/orphaned, blocking subsequent driver bindings on the same bus. This likely explains why lockdep was silent. Since this is not a lock dependency cycle or a recursive locking violation, but rather a lock remaining held by a terminated task, lockdep would not flag it as a deadlock pattern. This is indeed a side effect of enforcing the lock here—it amplifies the impact of a crash. However, an Oops while holding the device_lock is generally catastrophic regardless. Following up on our previous discussion [1], refactoring driver_override would resolve this. We could move driver_override to struct device and protect it with a dedicated lock (e.g., driver_override_lock). We would then replace driver_set_override with dev_set_driver_override and add dev_access_driver_override with internal lock assertions. This allows us to remove device_lock from the 2 match paths, reducing contention and preventing a single crash from stalling the whole bus. However, this deviates from the current paradigm where device_lock protects sysfs attributes (like waiting_for_supplier and power/control). If other sysfs attributes are found to share similar constraints or would benefit from finer-grained locking (which requires further investigation), we might have a stronger argument for introducing a more generic sysfs_lock to handle this class of attributes. We would also need to carefully verify safety during device removal. Danilo, what are your thoughts on this refactoring plan? I am willing to attempt it, but since it touches the driver core, documentation, and 10+ bus drivers, and I haven't submitted such a large series before, it may take me a few weeks to get an initial version out, and additional time to iterate based on review feedback until it is ready for merging. If you prefer to handle it yourself to expedite things, please let me know so we don't duplicate efforts. [1] https://lore.kernel.org/all/DFNI14L1K1I0.3FZ84OAWXY0LP@kernel.org/ Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 18:53 ` Gui-Dong Han @ 2026-01-23 19:07 ` Danilo Krummrich 2026-01-27 14:58 ` Jon Hunter 0 siblings, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2026-01-23 19:07 UTC (permalink / raw) To: Gui-Dong Han Cc: Jon Hunter, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote: > It seems the issue is simpler than a recursive registration deadlock. > Looking at the logs, tegra_qspi_probe triggers a NULL pointer > dereference (Oops) while holding the device_lock. The mutex likely > remains marked as held/orphaned, blocking subsequent driver bindings > on the same bus. > > This likely explains why lockdep was silent. Since this is not a lock > dependency cycle or a recursive locking violation, but rather a lock > remaining held by a terminated task, lockdep would not flag it as a > deadlock pattern. > > This is indeed a side effect of enforcing the lock here—it amplifies > the impact of a crash. However, an Oops while holding the device_lock > is generally catastrophic regardless. This makes sense to me; it might indeed be as simple as that. > Following up on our previous discussion [1], refactoring > driver_override would resolve this. We could move driver_override to > struct device and protect it with a dedicated lock (e.g., > driver_override_lock). We would then replace driver_set_override with > dev_set_driver_override and add dev_access_driver_override with > internal lock assertions. This allows us to remove device_lock from > the 2 match paths, reducing contention and preventing a single crash > from stalling the whole bus. > > However, this deviates from the current paradigm where device_lock > protects sysfs attributes (like waiting_for_supplier and > power/control). If other sysfs attributes are found to share similar > constraints or would benefit from finer-grained locking (which > requires further investigation), we might have a stronger argument for > introducing a more generic sysfs_lock to handle this class of > attributes. We would also need to carefully verify safety during > device removal. > > Danilo, what are your thoughts on this refactoring plan? I am willing > to attempt it, but since it touches the driver core, documentation, > and 10+ bus drivers, and I haven't submitted such a large series > before, it may take me a few weeks to get an initial version out, and > additional time to iterate based on review feedback until it is ready > for merging. If you prefer to handle it yourself to expedite things, > please let me know so we don't duplicate efforts. I think moving driver_override to struct device and providing accessors with proper lockdep assertions is the correct thing to do. With that, I do not think a separate lock is necessary. Please feel free to follow up on this. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-23 19:07 ` Danilo Krummrich @ 2026-01-27 14:58 ` Jon Hunter 2026-01-27 15:18 ` Danilo Krummrich 0 siblings, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-27 14:58 UTC (permalink / raw) To: Danilo Krummrich, Gui-Dong Han Cc: Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On 23/01/2026 19:07, Danilo Krummrich wrote: > On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote: >> It seems the issue is simpler than a recursive registration deadlock. >> Looking at the logs, tegra_qspi_probe triggers a NULL pointer >> dereference (Oops) while holding the device_lock. The mutex likely >> remains marked as held/orphaned, blocking subsequent driver bindings >> on the same bus. >> >> This likely explains why lockdep was silent. Since this is not a lock >> dependency cycle or a recursive locking violation, but rather a lock >> remaining held by a terminated task, lockdep would not flag it as a >> deadlock pattern. >> >> This is indeed a side effect of enforcing the lock here—it amplifies >> the impact of a crash. However, an Oops while holding the device_lock >> is generally catastrophic regardless. > > This makes sense to me; it might indeed be as simple as that. Yes I believe that this is the case too. BTW, if I apply the SPI series from Breno [0], which fixes crash in the SPI driver, then everything works fine. Jon [0] https://lore.kernel.org/linux-tegra/20260126-tegra_xfer-v2-0-6d2115e4f387@debian.org/T/#t -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-27 14:58 ` Jon Hunter @ 2026-01-27 15:18 ` Danilo Krummrich 0 siblings, 0 replies; 17+ messages in thread From: Danilo Krummrich @ 2026-01-27 15:18 UTC (permalink / raw) To: Jon Hunter Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Tue Jan 27, 2026 at 3:58 PM CET, Jon Hunter wrote: > > On 23/01/2026 19:07, Danilo Krummrich wrote: >> On Fri Jan 23, 2026 at 7:53 PM CET, Gui-Dong Han wrote: >>> It seems the issue is simpler than a recursive registration deadlock. >>> Looking at the logs, tegra_qspi_probe triggers a NULL pointer >>> dereference (Oops) while holding the device_lock. The mutex likely >>> remains marked as held/orphaned, blocking subsequent driver bindings >>> on the same bus. >>> >>> This likely explains why lockdep was silent. Since this is not a lock >>> dependency cycle or a recursive locking violation, but rather a lock >>> remaining held by a terminated task, lockdep would not flag it as a >>> deadlock pattern. >>> >>> This is indeed a side effect of enforcing the lock here—it amplifies >>> the impact of a crash. However, an Oops while holding the device_lock >>> is generally catastrophic regardless. >> >> This makes sense to me; it might indeed be as simple as that. > > Yes I believe that this is the case too. > > BTW, if I apply the SPI series from Breno [0], which fixes crash in the > SPI driver, then everything works fine. Thanks for confirming! - Danilo ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-22 19:35 ` Danilo Krummrich 2026-01-23 13:57 ` Jon Hunter @ 2026-01-27 14:53 ` Jon Hunter 2026-01-27 15:05 ` Gui-Dong Han 1 sibling, 1 reply; 17+ messages in thread From: Jon Hunter @ 2026-01-27 14:53 UTC (permalink / raw) To: Danilo Krummrich Cc: Gui-Dong Han, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org Hi Danilo, On 22/01/2026 19:35, Danilo Krummrich wrote: > On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote: >> On 22/01/2026 18:12, Danilo Krummrich wrote: >>> With this diff, if I intentionally create a deadlock condition on my machine, I >>> do see a lockdep splat as expected. >>> >>> Anyways, another option would be to attach a hardware debugger (I assume you >>> have TRACE32 or something available?) and then get a backtrace from the CPU >>> affected of the deadlock. >> >> Unfortunately, these days I don't have such tools available so that's >> not an option. > > Hm..slowly running out of options. :) > > I remember you previously said that you can still SSH into the machine? If so, > can you please share the the first output of > > echo l > /proc/sysrq-trigger > > directly after booting? > > Subsequently, can you please also run > > echo w > /proc/sysrq-trigger > > and > > echo t > /proc/sysrq-trigger You can find the output of the above commands here: https://pastebin.com/PuhFURwh If you search for 'state:D', you can see that various drivers are stuck waiting for the mutex. I believe that this all happens because the SPI driver crashed during the probe and prevents any further drivers from probing. Jon -- nvpublic ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v5] driver core: enforce device_lock for driver_match_device() 2026-01-27 14:53 ` Jon Hunter @ 2026-01-27 15:05 ` Gui-Dong Han 0 siblings, 0 replies; 17+ messages in thread From: Gui-Dong Han @ 2026-01-27 15:05 UTC (permalink / raw) To: Jon Hunter Cc: Danilo Krummrich, Marek Szyprowski, Mark Brown, gregkh, rafael, linux-kernel, baijiaju1990, Qiu-ji Chen, Aishwarya.TCV, linux-tegra@vger.kernel.org On Tue, Jan 27, 2026 at 10:53 PM Jon Hunter <jonathanh@nvidia.com> wrote: > > Hi Danilo, > > On 22/01/2026 19:35, Danilo Krummrich wrote: > > On Thu Jan 22, 2026 at 7:58 PM CET, Jon Hunter wrote: > >> On 22/01/2026 18:12, Danilo Krummrich wrote: > >>> With this diff, if I intentionally create a deadlock condition on my machine, I > >>> do see a lockdep splat as expected. > >>> > >>> Anyways, another option would be to attach a hardware debugger (I assume you > >>> have TRACE32 or something available?) and then get a backtrace from the CPU > >>> affected of the deadlock. > >> > >> Unfortunately, these days I don't have such tools available so that's > >> not an option. > > > > Hm..slowly running out of options. :) > > > > I remember you previously said that you can still SSH into the machine? If so, > > can you please share the the first output of > > > > echo l > /proc/sysrq-trigger > > > > directly after booting? > > > > Subsequently, can you please also run > > > > echo w > /proc/sysrq-trigger > > > > and > > > > echo t > /proc/sysrq-trigger > > You can find the output of the above commands here: > > https://pastebin.com/PuhFURwh Thanks for the logs. Looking at the trace, it confirms the previous suspicion. Since there is no circular dependency shown in the logs, it is not a classic recursive deadlock but rather the device_lock remaining held due to the earlier crash in the QSPI driver. This prevented other devices on the same bus from completing their probe. I'm glad to hear that Breno's SPI fixes resolve the issue. It's a happy ending for this case. Thanks for the hard work on debugging this! ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-01-27 15:18 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260113162843.12712-1-hanguidong02@gmail.com>
[not found] ` <7ae38e31-ef31-43ad-9106-7c76ea0e8596@sirena.org.uk>
[not found] ` <CGME20260120152328eucas1p1024a7488ae10b8b7f2fcb74baee24c75@eucas1p1.samsung.com>
[not found] ` <ef37ed64-24ad-4b82-bc6c-d3bc72aaf232@samsung.com>
2026-01-21 20:00 ` [PATCH v5] driver core: enforce device_lock for driver_match_device() Jon Hunter
2026-01-21 21:42 ` Danilo Krummrich
2026-01-22 17:28 ` Jon Hunter
2026-01-22 17:55 ` Gui-Dong Han
2026-01-22 18:12 ` Danilo Krummrich
2026-01-22 18:58 ` Jon Hunter
2026-01-22 19:35 ` Danilo Krummrich
2026-01-23 13:57 ` Jon Hunter
2026-01-23 14:09 ` Danilo Krummrich
2026-01-23 14:29 ` Jon Hunter
2026-01-23 16:54 ` Danilo Krummrich
2026-01-23 18:53 ` Gui-Dong Han
2026-01-23 19:07 ` Danilo Krummrich
2026-01-27 14:58 ` Jon Hunter
2026-01-27 15:18 ` Danilo Krummrich
2026-01-27 14:53 ` Jon Hunter
2026-01-27 15:05 ` Gui-Dong Han
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox