* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 7:44 ` Linus Torvalds
@ 2026-03-01 7:56 ` Linus Torvalds
2026-03-01 13:01 ` Gary Guo
` (2 subsequent siblings)
3 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2026-03-01 7:56 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sat, 28 Feb 2026 at 23:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In other words, it makes fragile drivers go from "you get an oops" to
> something much worse. The oops becomes unrecoverable - with typically
> a black screen at boot - because the probe is holding a lock that then
> makes everything else come to a grinding halt when the driver fails.
See this message:
https://lore.kernel.org/all/67f655bb-4d81-4609-b008-68d200255dd2@davidgow.net/
on the interaction with the driver core merge with the firewire oops bug.
In this case we were actually somewhat lucky, in that the hardware is
fairly common, but seldom actually used - a combination that meant tht
wer had several people who caught it fairly quickly and were willing
and able to bisect it. I had the first report the day after -rc1 was
released.
IOW, it could have been *so* much worse. Imagine some random driver
bug on hardware that isn't common and just causes inexplicable boot
failures for the people who see it and are likely using distro
kernels.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 7:44 ` Linus Torvalds
2026-03-01 7:56 ` Linus Torvalds
@ 2026-03-01 13:01 ` Gary Guo
2026-03-01 13:04 ` Danilo Krummrich
2026-04-14 18:39 ` Uwe Kleine-König
3 siblings, 0 replies; 13+ messages in thread
From: Gary Guo @ 2026-03-01 13:01 UTC (permalink / raw)
To: Linus Torvalds, Danilo Krummrich
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun Mar 1, 2026 at 7:44 AM GMT, Linus Torvalds wrote:
> On Wed, 11 Feb 2026 at 15:04, Danilo Krummrich <dakr@kernel.org> wrote:
>>
>> Driver core changes for 7.0-rc1
>>
>> - Bus:
>> - Ensure bus->match() is consistently called with the device lock held
>
> So I'm coming back to this, because it turns out this sounds like a
> horrible mistake in the end.
>
> You document it as being about consistent locking, but it appears this
> change is what made the "firewire oops at driver attach" turn an oops
> into just a silently dead machine.
>
> In other words, it makes fragile drivers go from "you get an oops" to
> something much worse. The oops becomes unrecoverable - with typically
> a black screen at boot - because the probe is holding a lock that then
> makes everything else come to a grinding halt when the driver fails.
>
> And yes, this obviously only happens for buggy driver and doesn't
> matter for _correct_ code, but about half of the kernel code is
> drivers, and that half of the kernel code is also the typically the
> most badly tested and often questionably implemented half.
>
> No, not all drivers a bad, but there are a lot of drivers, and some of
> them have problems.
>
> So if a driver problem causes problems for the whole machine, the
> driver core design is bad.
>
> I really think this should be re-thought. Perhaps just reverted
> outright. Instead of saying
>
> "This inconsistency means that
> bus match() callbacks are not guaranteed to be called with the lock
> held"
>
> as if it's automatically a bad thing, just don't depend on the device
> match having to be called with a lock held if that lock has this
> problem.
Note that taking lock on match() fixes a real bug where data race can lead to
use-after-free https://bugzilla.kernel.org/show_bug.cgi?id=220789. It is
mentioned in the patch
https://lore.kernel.org/lkml/20260113162843.12712-1-hanguidong02@gmail.com/.
>
> It's not clear why anybody should *care* about the lock at driver
> attach time, when nothing else can access the device that hasn't been
> brought up yet.
We have always been taking the device lock when probing. This is needed as
obviously you don't want to have two drivers attaching to the same device at the
same time. When probing oops, the device lock is never going to be unlocked
again.
However, before matching starts to take the lock, we're "fine" in a sense that,
everything else keeps working as unless a device is matched and would actually
require probing, the device lock is not touched.
Perhaps what we should do is to defend against drivers oopsing inside probe and
have a mechanism so that device locks are unlocked even when probe oops. Another
option is to have `driver_override` protected by a different lock so match()
takes that lock instead of the device lock.
Best,
Gary
>
> Put another way: the downsides seem worse than the upsides.
> "Consistency" is not an upside if it causes problems.
>
> Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 7:44 ` Linus Torvalds
2026-03-01 7:56 ` Linus Torvalds
2026-03-01 13:01 ` Gary Guo
@ 2026-03-01 13:04 ` Danilo Krummrich
2026-03-01 18:17 ` Linus Torvalds
2026-03-01 18:20 ` Danilo Krummrich
2026-04-14 18:39 ` Uwe Kleine-König
3 siblings, 2 replies; 13+ messages in thread
From: Danilo Krummrich @ 2026-03-01 13:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun Mar 1, 2026 at 8:44 AM CET, Linus Torvalds wrote:
> So I'm coming back to this, because it turns out this sounds like a
> horrible mistake in the end.
I came to the same conclusion following the discussion around the firewire oops.
> You document it as being about consistent locking
It happens that quite a few busses rely on this, and there is a possible race
condition that can lead to UAF bugs in the context of driver_override.
I think it is rather unlikely to happen though, as it would require a user to
change a device's driver_override field through sysfs while the device is
matched with a driver.
In any case, this can easily be solved with a separate lock.
> In other words, it makes fragile drivers go from "you get an oops" to
> something much worse. The oops becomes unrecoverable - with typically
> a black screen at boot - because the probe is holding a lock that then
> makes everything else come to a grinding halt when the driver fails.
Yes, the problem is that when a device is already present in the system and a
driver is registered on the same bus, we iterate over all devices registered on
this bus to see if one of them matches. If we come across an already bound one
where the corresponding driver crashed while holding the device lock (e.g. in
probe()) we can't make any progress anymore.
Obviously, this is not an issue the other way around, i.e. when the driver is
present in the system first and the device is added subsequently.
> And yes, this obviously only happens for buggy driver and doesn't
> matter for _correct_ code, but about half of the kernel code is
> drivers, and that half of the kernel code is also the typically the
> most badly tested and often questionably implemented half.
I agree, it is a case that will happen regularly, and besides hurting developer
ergonomics, it potentially decreases chances of shutting things down cleanly and
obtaining logs in a production environment as well.
> I really think this should be re-thought. Perhaps just reverted
> outright.
Yes, I agree and in fact I already have a few local changes to move
driver_override to struct device, provide corresponding accessors for busses and
handle locking with a separate lock.
(Technically, the "move driver_override to struct device" part is orthogonal,
but doing it right away results in less (and much cleaner) changes.)
I do not consider those changes to be complicated and risky, but I'm not sure
you want to see those for one of the upcoming -rc releases (probably -rc4/5).
Independently, I can send a revert for -rc3.
Thanks,
Danilo
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 13:04 ` Danilo Krummrich
@ 2026-03-01 18:17 ` Linus Torvalds
2026-03-01 20:21 ` Danilo Krummrich
2026-03-01 18:20 ` Danilo Krummrich
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2026-03-01 18:17 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun, 1 Mar 2026 at 05:04, Danilo Krummrich <dakr@kernel.org> wrote:
>
> It happens that quite a few busses rely on this, and there is a possible race
> condition that can lead to UAF bugs in the context of driver_override.
>
> I think it is rather unlikely to happen though, as it would require a user to
> change a device's driver_override field through sysfs while the device is
> matched with a driver.
>
> In any case, this can easily be solved with a separate lock.
Yes, if it's literally just about driver_override, please just fix the locking.
Use some really simple local spinlock lock to just copy the string
into a local copy when accessing it - it's not like it's even some
arbitrarily long string afaik (how long can driver names be?)
Don't use a huge sleeping lock that has other semantics for something
trivial like this.
(Or is there some other driver_override thing I'm not aware of?)
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 18:17 ` Linus Torvalds
@ 2026-03-01 20:21 ` Danilo Krummrich
2026-03-01 21:01 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Danilo Krummrich @ 2026-03-01 20:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun Mar 1, 2026 at 7:17 PM CET, Linus Torvalds wrote:
> Use some really simple local spinlock lock to just copy the string
> into a local copy when accessing it - it's not like it's even some
> arbitrarily long string afaik (how long can driver names be?)
Yes, that's what my code in [1] already does. Actually, I think we don't even
need a local copy for accessing the string. We should be good with something
like
int device_match_driver_override(struct device *dev,
const struct device_driver *drv)
which internally compares the strings holding the spinlock.
Otherwise, since you asked, the string length is currently limited to
PAGE_SIZE - 1 as it will be copied with sysfs_emit(). But as mentioned, we
shouldn't need a copy anyways.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/dakr/linux.git/log/?h=driver_override
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 20:21 ` Danilo Krummrich
@ 2026-03-01 21:01 ` Linus Torvalds
2026-03-02 19:19 ` Danilo Krummrich
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2026-03-01 21:01 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun, 1 Mar 2026 at 12:21, Danilo Krummrich <dakr@kernel.org> wrote:
>
> Otherwise, since you asked, the string length is currently limited to
> PAGE_SIZE - 1 as it will be copied with sysfs_emit().
Well, the string length from /proc itself might be long, but do we *care*?
It's only meaningful when it matches a driver name, so anything longer
than the longest driver name is irrelevant - it's not going to match.
So the thing that would matter is the longest actual real driver name.
Aren't those typically just a few bytes (eg "regulator-bus-drv" or
"__typec_altmode_driver" being long ones I find with a bad grep
pattern that might miss millions of other cases)
IOW, it looks like it would be fine to just say "use just a 32-byte
buffer" if a buffer is needed.
Of course, if no buffer is needed that's even better.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 21:01 ` Linus Torvalds
@ 2026-03-02 19:19 ` Danilo Krummrich
0 siblings, 0 replies; 13+ messages in thread
From: Danilo Krummrich @ 2026-03-02 19:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun Mar 1, 2026 at 10:01 PM CET, Linus Torvalds wrote:
> So the thing that would matter is the longest actual real driver name.
> Aren't those typically just a few bytes (eg "regulator-bus-drv" or
> "__typec_altmode_driver" being long ones I find with a bad grep
> pattern that might miss millions of other cases)
Yes, generally they should be pretty short.
OOC, I asked an LLM to figure it out (so take this with the necessary grain of
salt):
Out of ~7,560 unique driver names:
┌────────────┬───────────┐
│ Threshold │ Count │
├────────────┼───────────┤
│ > 32 chars │ 6 (0.08%) │
├────────────┼───────────┤
│ > 48 chars │ 1 (0.01%) │
├────────────┼───────────┤
│ > 64 chars │ 0 │
└────────────┴───────────┘
Length distribution
1-10 chars: 3,513 (46.5%)
11-16 chars: 3,111 (41.1%)
17-24 chars: 880 (11.6%)
25-32 chars: 50 ( 0.7%)
33-49 chars: 6 ( 0.1%)
> IOW, it looks like it would be fine to just say "use just a 32-byte
> buffer" if a buffer is needed.
Assuming the above is roughly correct, you are pretty close. :)
> Of course, if no buffer is needed that's even better.
Yes, I don't think we ever need to obtain a copy other than through
sysfs_emit().
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 13:04 ` Danilo Krummrich
2026-03-01 18:17 ` Linus Torvalds
@ 2026-03-01 18:20 ` Danilo Krummrich
1 sibling, 0 replies; 13+ messages in thread
From: Danilo Krummrich @ 2026-03-01 18:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
On Sun Mar 1, 2026 at 2:04 PM CET, Danilo Krummrich wrote:
> Yes, I agree and in fact I already have a few local changes to move
> driver_override to struct device, provide corresponding accessors for busses and
> handle locking with a separate lock.
>
> (Technically, the "move driver_override to struct device" part is orthogonal,
> but doing it right away results in less (and much cleaner) changes.)
>
> I do not consider those changes to be complicated and risky, but I'm not sure
> you want to see those for one of the upcoming -rc releases (probably -rc4/5).
This is roughly what I have in mind [1] (partially compile and runtime tested).
Given that we agree that the driver_match_device() change should be reverted, it
may make sense to land the first patch in an upcoming -rc, split up the treewide
one, and let subsystems follow-up with those fixes individually.
Please let me know what you prefer.
- Danilo
[1] https://git.kernel.org/pub/scm/linux/kernel/git/dakr/linux.git/log/?h=driver_override
> Independently, I can send a revert for -rc3.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-03-01 7:44 ` Linus Torvalds
` (2 preceding siblings ...)
2026-03-01 13:04 ` Danilo Krummrich
@ 2026-04-14 18:39 ` Uwe Kleine-König
2026-04-14 23:04 ` Danilo Krummrich
3 siblings, 1 reply; 13+ messages in thread
From: Uwe Kleine-König @ 2026-04-14 18:39 UTC (permalink / raw)
To: Danilo Krummrich, Linus Torvalds
Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Saravana Kannan,
Andrew Morton, driver-core, rust-for-linux, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1904 bytes --]
Hello,
On Sat, Feb 28, 2026 at 11:44:25PM -0800, Linus Torvalds wrote:
> On Wed, 11 Feb 2026 at 15:04, Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > Driver core changes for 7.0-rc1
> >
> > - Bus:
> > - Ensure bus->match() is consistently called with the device lock held
>
> So I'm coming back to this, because it turns out this sounds like a
> horrible mistake in the end.
>
> You document it as being about consistent locking, but it appears this
> change is what made the "firewire oops at driver attach" turn an oops
> into just a silently dead machine.
>
> In other words, it makes fragile drivers go from "you get an oops" to
> something much worse. The oops becomes unrecoverable - with typically
> a black screen at boot - because the probe is holding a lock that then
> makes everything else come to a grinding halt when the driver fails.
>
> And yes, this obviously only happens for buggy driver and doesn't
> matter for _correct_ code, but about half of the kernel code is
> drivers, and that half of the kernel code is also the typically the
> most badly tested and often questionably implemented half.
I have a machine here (stm32mp135f-dk, ARCH=arm) that fails to boot with
dc23806a7c47 ("driver core: enforce device_lock for
driver_match_device()"), but doesn't oops on dc23806a7c47^.
(Fails to boot = no kernel messages on console.)
I didn't try to debug that yet, but I wonder if that is an understood
problem of said commit. I know that the commit was reverted in the
meantime (and the machine boots fine on 9de68394a615 ("Revert "driver
core: enforce device_lock for driver_match_device()""), but does that
mean that there is a driver involved that somehow violates driver core
assumptions and should be fixed even without the consistent locking?
Hints about how to approach the issue (if there is any) welcome.
Best regards
Uwe
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [GIT PULL] Driver core changes for 7.0-rc1
2026-04-14 18:39 ` Uwe Kleine-König
@ 2026-04-14 23:04 ` Danilo Krummrich
0 siblings, 0 replies; 13+ messages in thread
From: Danilo Krummrich @ 2026-04-14 23:04 UTC (permalink / raw)
To: Uwe Kleine-König
Cc: Linus Torvalds, Greg Kroah-Hartman, Rafael J. Wysocki,
Saravana Kannan, Andrew Morton, driver-core, rust-for-linux,
linux-kernel
On Tue Apr 14, 2026 at 8:39 PM CEST, Uwe Kleine-König wrote:
> does that mean that there is a driver involved that somehow violates driver
> core assumptions and should be fixed even without the consistent locking?
Most likely. There are two known cases where interactions with this commit are
expected.
(1) One of the drivers probed on your machine gets stuck within probe() (or
any other place where the device lock is held, e.g. bus callbacks) for
some reason, e.g. due to a deadlock. In this case this commit would
potentially cause other tasks to get stuck in driver_attach() when they
attempt to register a driver for the same bus the bad one sits on.
This is also the main reason why we eventually reverted this commit, i.e.
despite not being the root cause of an issue, it makes an already bad
situation worse.
(2) If there is a driver probed on your machine that registers another driver
from within its probe() function for the same bus it results in a
deadlock. Note that this is transitive -- if a driver is probed on bus A,
which e.g. deploys devices on bus B that are subsequently probed, and then
in one of the probe() calls on bus B a driver is registered for bus A,
that is a deadlock as well.
For instance, this could happen when a platform driver that runs a PCIe
root complex deploys the corresponding PCI devices and one of the
corresponding PCI drivers registers a platform driver from probe().
Anyways, for the underlying problem this reveals, the exact constellation
doesn't matter. The anti-pattern it reveals is that drivers shouldn't be
registered from another driver's probe() function in the first place.
I fixed a few drivers having this anti-pattern and all of them had other
(lifetime) issues due to this and I think there are other potential
deadlock scenarios as well.
> Hints about how to approach the issue (if there is any) welcome.
For (1) I think it's obvious, and I think it wouldn't have gone unnoticed if any
of the drivers were bad to the point that they're getting stuck in probe() or
any other place where the device lock is held.
As for (2) I think the best way to catch it is lockdep. Unfortunately, lockdep
won't be very helpful without some additional tricks, since the driver core
calls lockdep_set_novalidate_class() for the device lock to avoid false
positives.
However, we can work around this by registering a dynamic lock class key for
every struct device individually [1] and fake taking the device lock with
mutex_acquire() and mutex_release() in __driver_attach().
This way your box should still boot properly, and in case it got stuck due to
(2), print a proper lockdep splat.
I hope this helps!
- Danilo
[1]
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 763e17e9f148..6770eba83fbd 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2555,6 +2555,7 @@ static void device_release(struct kobject *kobj)
*/
devres_release_all(dev);
+ lockdep_unregister_key(&dev->mutex_key);
kfree(dev->dma_range_map);
kfree(dev->driver_override.name);
@@ -3160,9 +3161,9 @@ void device_initialize(struct device *dev)
dev->kobj.kset = devices_kset;
kobject_init(&dev->kobj, &device_ktype);
INIT_LIST_HEAD(&dev->dma_pools);
- mutex_init(&dev->mutex);
+ lockdep_register_key(&dev->mutex_key);
+ __mutex_init(&dev->mutex, "dev->mutex", &dev->mutex_key);
spin_lock_init(&dev->driver_override.lock);
- lockdep_set_novalidate_class(&dev->mutex);
spin_lock_init(&dev->devres_lock);
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index cb5046f0634d..a81a4ec2284c 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -1228,7 +1228,9 @@ static int __driver_attach(struct device *dev, void *data)
* is an error.
*/
+ mutex_acquire(&dev->mutex.dep_map, 0, 0, _THIS_IP_);
ret = driver_match_device(drv, dev);
+ mutex_release(&dev->mutex.dep_map, _THIS_IP_);
if (ret == 0) {
/* no match */
return 0;
diff --git a/include/linux/device.h b/include/linux/device.h
index f0d52e1a6e07..2185d50f1c1d 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -585,6 +585,7 @@ struct device {
struct mutex mutex; /* mutex to synchronize calls to
* its driver.
*/
+ struct lock_class_key mutex_key;
struct dev_links_info links;
struct dev_pm_info power;
^ permalink raw reply related [flat|nested] 13+ messages in thread