* [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
@ 2026-02-19 23:57 Dmitry Torokhov
2026-02-20 0:11 ` Dmitry Torokhov
2026-02-28 3:44 ` Zijun Hu
0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-19 23:57 UTC (permalink / raw)
To: Vinod Koul
Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven,
Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert,
Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu,
linux-phy, linux-kernel
The implementation put_device()s located device and then uses
container_of() on the pointer. The device may disappear by that time,
resulting in UAF.
Fix the problem by keeping the reference to the framer device,
avoiding getting an extra reference to it in framer_get(), and making
sure to drop the reference in error path when we fail to get the module.
Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation")
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
---
drivers/phy/phy-core.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
index 4ad396214d0c..cf62eb9ddca9 100644
--- a/drivers/phy/phy-core.c
+++ b/drivers/phy/phy-core.c
@@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
if (IS_ERR(phy))
return phy;
- if (!try_module_get(phy->ops->owner))
+ if (!try_module_get(phy->ops->owner)) {
+ put_device(&phy->dev);
return ERR_PTR(-EPROBE_DEFER);
-
- get_device(&phy->dev);
+ }
return phy;
}
@@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev,
if (!target_dev)
return ERR_PTR(-ENODEV);
- put_device(target_dev);
return to_phy(target_dev);
}
EXPORT_SYMBOL_GPL(of_phy_simple_xlate);
--
2.53.0.345.g96ddfc5eaa-goog
--
Dmitry
^ permalink raw reply related [flat|nested] 9+ messages in thread* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov @ 2026-02-20 0:11 ` Dmitry Torokhov 2026-02-21 1:01 ` Dmitry Torokhov 2026-02-28 3:44 ` Zijun Hu 1 sibling, 1 reply; 9+ messages in thread From: Dmitry Torokhov @ 2026-02-20 0:11 UTC (permalink / raw) To: Vinod Koul Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote: > The implementation put_device()s located device and then uses > container_of() on the pointer. The device may disappear by that time, > resulting in UAF. > > Fix the problem by keeping the reference to the framer device, > avoiding getting an extra reference to it in framer_get(), and making > sure to drop the reference in error path when we fail to get the module. Hmm, I was too rash. There are bunch of other xlate functions that need to be updated to take the reference. > > Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation") > Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> > --- > drivers/phy/phy-core.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c > index 4ad396214d0c..cf62eb9ddca9 100644 > --- a/drivers/phy/phy-core.c > +++ b/drivers/phy/phy-core.c > @@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id) > if (IS_ERR(phy)) > return phy; > > - if (!try_module_get(phy->ops->owner)) > + if (!try_module_get(phy->ops->owner)) { > + put_device(&phy->dev); > return ERR_PTR(-EPROBE_DEFER); > - > - get_device(&phy->dev); > + } > > return phy; > } > @@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev, > if (!target_dev) > return ERR_PTR(-ENODEV); > > - put_device(target_dev); > return to_phy(target_dev); > } > EXPORT_SYMBOL_GPL(of_phy_simple_xlate); > -- > 2.53.0.345.g96ddfc5eaa-goog > > -- Dmitry ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-20 0:11 ` Dmitry Torokhov @ 2026-02-21 1:01 ` Dmitry Torokhov 2026-02-23 23:15 ` Vladimir Oltean 0 siblings, 1 reply; 9+ messages in thread From: Dmitry Torokhov @ 2026-02-21 1:01 UTC (permalink / raw) To: Vinod Koul, Kishon Vijay Abraham I Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote: > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote: > > The implementation put_device()s located device and then uses > > container_of() on the pointer. The device may disappear by that time, > > resulting in UAF. > > > > Fix the problem by keeping the reference to the framer device, > > avoiding getting an extra reference to it in framer_get(), and making > > sure to drop the reference in error path when we fail to get the module. > > Hmm, I was too rash. There are bunch of other xlate functions that need > to be updated to take the reference. So I am convinced that xlate functions need to bump up the reference to phy devices they return. The question is how to deal with the ones that do not. I can either convert them in the same patch (the changes are quite mechanical) or we can do the whole song and dance, introduce a flag, set it up in converted xlate functions, have the core respect it, and then remove it from xlates and from the core when it is all done. Please let me know. Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-21 1:01 ` Dmitry Torokhov @ 2026-02-23 23:15 ` Vladimir Oltean 2026-02-24 0:37 ` Dmitry Torokhov 0 siblings, 1 reply; 9+ messages in thread From: Vladimir Oltean @ 2026-02-23 23:15 UTC (permalink / raw) To: Dmitry Torokhov Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel Hi Dmitry, On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote: > On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote: > > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote: > > > The implementation put_device()s located device and then uses > > > container_of() on the pointer. The device may disappear by that time, > > > resulting in UAF. > > > > > > Fix the problem by keeping the reference to the framer device, > > > avoiding getting an extra reference to it in framer_get(), and making > > > sure to drop the reference in error path when we fail to get the module. > > > > Hmm, I was too rash. There are bunch of other xlate functions that need > > to be updated to take the reference. > > So I am convinced that xlate functions need to bump up the reference to > phy devices they return. The question is how to deal with the ones that > do not. I can either convert them in the same patch (the changes are > quite mechanical) or we can do the whole song and dance, introduce a > flag, set it up in converted xlate functions, have the core respect it, > and then remove it from xlates and from the core when it is all done. > > Please let me know. You have hit on a weak spot of the Generic PHY framework and I would like to encourage you to follow through with patches for this finding (although it won't be exactly trivial, I think it's doable with some determination). On your direct question: It will impact downstream in subtle and unpleasant ways if you change the of_xlate() semantics w.r.t. reference counting, but the code will still compile just as before, i.e. when looking just at your downstream driver (what 99% of developers do) there will be no obvious way of knowing that something in the API has changed. I would avoid doing that. But first the problem. Too little has been said about the problem, and too much about the solution. We can't find a good solution if we don't call out the problem properly first. The phy_get() function follows a "lookup-then-get" approach for PHY providers, rather than "atomic-lookup-and-get". Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(), returns a struct phy * with its underlying device reference count nominally at 1. All callers of _of_phy_get() do later call get_device() to bump this refcount for each PHY consumer, but it is "too late" in the sense that a window has been opened where the PHY provider driver can unbind, and the reference count of &phy->dev can drop to 0 and it can be freed. Immediately after being created by phy_create(), the &phy->dev has a refcount of 1, and only the action of its provider driver (calling phy_destroy() directly or through devres, on driver unbind) can drop that refcount to 0. As long as the driver doesn't do that, the &phy->dev refcount is not in danger. Then the PHY is exposed to the outside world using of_phy_provider_register(), where the fact that the provider driver can concurrently unregister starts being a problem. Therefore, I would say that the problem is that consumers, aka phy_get() callers, can get a phy with a &phy->dev refcount that is not already bumped by the time they are handed over that PHY. Mechanically, PHY lookup happens under &phy_provider_mutex, and I believe it would be sufficient to call get_device() under this lock, in order for consumers to get PHYs with their reference bumped. Why? Let's consider what a PHY provider does when unbinding. It runs the PHY registration and creation processes in reverse, i.e. - it calls of_phy_provider_unregister() directly or through devres - it calls phy_destroy() directly or through devres Since of_phy_provider_unregister() also acquires &phy_provider_mutex, this acts like a synchronization barrier. There are 2 cases: Consumer Provider phy_get() -> mutex_lock(&phy_provider_mutex) -> get_device(&phy->dev) -> 2 -> mutex_unlock(&phy_provider_mutex) of_phy_provider_unregister() -> mutex_lock(&phy_provider_mutex) -> removes phy from list -> mutex_unlock(&phy_provider_mutex) phy_destroy() -> device_unregister(&phy->dev) -> device_del(dev); -> put_device(dev); // -> 1 and the consumer remains with a "zombie" PHY device (more later) Or Consumer Provider phy_get() of_phy_provider_unregister() -> mutex_lock(&phy_provider_mutex) -> removes phy from list -> mutex_unlock(&phy_provider_mutex) -> mutex_lock(&phy_provider_mutex) -> doesn't find the PHY device! It was unpublished even if not freed -> mutex_unlock(&phy_provider_mutex) phy_destroy() -> device_unregister(&phy->dev) -> device_del(dev); -> put_device(dev); // -> 1 and the consumer won't be allowed to even find the PHY device in this case. So this is why get_device() under phy_provider_mutex actually helps. Now, it is much less important whether you push the get_device() to the actual of_xlate() PHY vendor implementation or not. For simplicity sake, I suggest you don't do that, but instead keep it in the core, to preserve driver API semantics (patch not even compile-tested): -- >8 -- diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c index 21aaf2f76e53..c50e38f057a8 100644 --- a/drivers/phy/phy-core.c +++ b/drivers/phy/phy-core.c @@ -122,17 +122,19 @@ EXPORT_SYMBOL_GPL(phy_remove_lookup); static struct phy *phy_find(struct device *dev, const char *con_id) { const char *dev_id = dev_name(dev); - struct phy_lookup *p, *pl = NULL; + struct phy *phy = NULL; + struct phy_lookup *p; mutex_lock(&phy_provider_mutex); list_for_each_entry(p, &phys, node) if (!strcmp(p->dev_id, dev_id) && !strcmp(p->con_id, con_id)) { - pl = p; + phy = p->phy; + get_device(&phy->dev); break; } mutex_unlock(&phy_provider_mutex); - return pl ? pl->phy : ERR_PTR(-ENODEV); + return phy ? phy : ERR_PTR(-ENODEV); } static struct phy_provider *of_phy_provider_lookup(struct device_node *node) @@ -649,6 +651,7 @@ static struct phy *_of_phy_get(struct device_node *np, int index) } phy = phy_provider->of_xlate(phy_provider->dev, &args); + get_device(&phy->dev); out_put_module: module_put(phy_provider->owner); @@ -682,10 +685,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id) if (IS_ERR(phy)) return phy; - if (!try_module_get(phy->ops->owner)) + if (!try_module_get(phy->ops->owner)) { + put_device(&phy->dev); return ERR_PTR(-EPROBE_DEFER); - - get_device(&phy->dev); + } return phy; } @@ -803,10 +806,10 @@ struct phy *phy_get(struct device *dev, const char *string) if (IS_ERR(phy)) return phy; - if (!try_module_get(phy->ops->owner)) + if (!try_module_get(phy->ops->owner)) { + put_device(&phy->dev); return ERR_PTR(-EPROBE_DEFER); - - get_device(&phy->dev); + } link = device_link_add(dev, &phy->dev, DL_FLAG_STATELESS); if (!link) @@ -969,11 +972,10 @@ struct phy *devm_of_phy_get_by_index(struct device *dev, struct device_node *np, if (!try_module_get(phy->ops->owner)) { devres_free(ptr); + put_device(&phy->dev); return ERR_PTR(-EPROBE_DEFER); } - get_device(&phy->dev); - *ptr = phy; devres_add(dev, ptr); -- >8 -- The patch above leaves a few loose ends. The most obvious is of_phy_simple_xlate(), which has that put_device() to balance out class_find_device_by_of_node() - which bumps the device refcount to 2. It "looks" wrong but it is consistent with vendor implementations of of_xlate(), which also provide a phy->dev refcount of 1. And as explained, the refcount never _actually_ drops to 0 with the above patch. I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake. Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I would offer a new helper, devm_of_phy_simple_provider_register(), which would internally use a callback that doesn't drop the refcount (say __of_phy_simple_xlate() for lack of imagination). I would convert vendor PHY drivers one by one to use this (it would be valid at any time to use either the old or the new method). You'd notice that most of the of_phy_simple_xlate() occurrences go away, except for freescale/phy-fsl-lynx-28g.c which calls this function directly from its own lynx_28g_xlate(). It will still have to keep calling it, and for refcount equalization purposes that weird put_device() will have to continue to exist. Just leave a comment as to why that is. But there are more important loose ends still. I mentioned that "zombie" device. We've solved the memory safety issue, but it's possible for consumers to hold onto a phy whose provider has disappeared. The refcount of &phy->dev hasn't dropped to 0, so technically it's a valid object, but from PHY API perspective, it's still possible to call phy_init(), phy_power_on() and friends on this PHY, and the Generic PHY core will be happy to further call into the phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, so it should really be left alone. If we fix the UAF but leave the zombie PHY problem, we've effectively done nothing but silence static analysis checkers, while the code path where the PHY provider unbinds effectively remains treated as poorly as before, just moving the crashes to a different place. I suspect what needs to be done here is to introduce a "bool dead" or similar, which is to be set from phy_destroy() and checked from every API call. The idea is that API functions on zombie PHYs should fail without calling into their driver. I **suppose** that try_module_get(phy->ops->owner) "tried" to avoid this situation, but it just protects against module removal, not against "echo device > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. Finally, I have identified one more loose end still. /** * of_phy_put() - release the PHY * @phy: the phy returned by of_phy_get() * * Releases a refcount the caller received from of_phy_get(). */ void of_phy_put(struct phy *phy) { if (!phy || IS_ERR(phy)) return; mutex_lock(&phy->mutex); if (phy->ops->release) phy->ops->release(phy); mutex_unlock(&phy->mutex); module_put(phy->ops->owner); put_device(&phy->dev); } EXPORT_SYMBOL_GPL(of_phy_put); This function is called by PHY consumers. A PHY provider can have multiple consumers (example in the case of Ethernet: a QSGMII SerDes lane has 4 MAC ports multiplexed onto it). If a single consumer calls of_phy_put() to release its reference to the SerDes lane, the phy->ops->release() method makes absolutely no sense. There are 3 remaining consumers with handles to the lane! But we aren't even telling the PHY which consumer has disappeared! It has nothing useful to do with this information. Looking at actual phy_ops implementations, what they want to know is when _all_ consumers went away, not when individual consumers did. So the phy->ops->release() call needs to be put somewhere which is executed when all consumers disappear. If I were to guess, that would be the phy_release() class callback, but this is completely untested. Sorry that this email is a bit long, it got a bit late writing it and I don't have a lot of energy left to trim it down. In summary I found 4 highly related problems: - use after free - misleading of_phy_simple_xlate() API pattern (not a functional issue) - zombie PHYs - premature release call ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-23 23:15 ` Vladimir Oltean @ 2026-02-24 0:37 ` Dmitry Torokhov 2026-02-24 15:47 ` Vladimir Oltean 0 siblings, 1 reply; 9+ messages in thread From: Dmitry Torokhov @ 2026-02-24 0:37 UTC (permalink / raw) To: Vladimir Oltean Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel Hi Vladimir, On Tue, Feb 24, 2026 at 01:15:00AM +0200, Vladimir Oltean wrote: > Hi Dmitry, > > On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote: > > On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote: > > > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote: > > > > The implementation put_device()s located device and then uses > > > > container_of() on the pointer. The device may disappear by that time, > > > > resulting in UAF. > > > > > > > > Fix the problem by keeping the reference to the framer device, > > > > avoiding getting an extra reference to it in framer_get(), and making > > > > sure to drop the reference in error path when we fail to get the module. > > > > > > Hmm, I was too rash. There are bunch of other xlate functions that need > > > to be updated to take the reference. > > > > So I am convinced that xlate functions need to bump up the reference to > > phy devices they return. The question is how to deal with the ones that > > do not. I can either convert them in the same patch (the changes are > > quite mechanical) or we can do the whole song and dance, introduce a > > flag, set it up in converted xlate functions, have the core respect it, > > and then remove it from xlates and from the core when it is all done. > > > > Please let me know. > > You have hit on a weak spot of the Generic PHY framework and I would > like to encourage you to follow through with patches for this finding > (although it won't be exactly trivial, I think it's doable with some > determination). Thank you for the detailed response. I am not sure how much time I can spend on phy rework as this change was an essentially a drive-by. I only happen to look there because I want to remove class_find_device_by_of_node() in favor of class_find_device_by_fwnode(). > > On your direct question: > It will impact downstream in subtle and unpleasant ways if you change > the of_xlate() semantics w.r.t. reference counting, but the code will > still compile just as before, i.e. when looking just at your downstream > driver (what 99% of developers do) there will be no obvious way of > knowing that something in the API has changed. I would avoid doing that. Hmm, I was not aware that we needed to take particular measures for benefit of downstream trees... > > But first the problem. Too little has been said about the problem, and > too much about the solution. We can't find a good solution if we don't > call out the problem properly first. > > The phy_get() function follows a "lookup-then-get" approach for PHY > providers, rather than "atomic-lookup-and-get". > > Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(), > returns a struct phy * with its underlying device reference count > nominally at 1. All callers of _of_phy_get() do later call get_device() > to bump this refcount for each PHY consumer, but it is "too late" in the > sense that a window has been opened where the PHY provider driver can > unbind, and the reference count of &phy->dev can drop to 0 and it can be > freed. Yes. > > Immediately after being created by phy_create(), the &phy->dev has a > refcount of 1, and only the action of its provider driver (calling > phy_destroy() directly or through devres, on driver unbind) can drop > that refcount to 0. As long as the driver doesn't do that, the &phy->dev > refcount is not in danger. Right. > > Then the PHY is exposed to the outside world using of_phy_provider_register(), > where the fact that the provider driver can concurrently unregister > starts being a problem. > > Therefore, I would say that the problem is that consumers, aka phy_get() > callers, can get a phy with a &phy->dev refcount that is not already bumped > by the time they are handed over that PHY. > > Mechanically, PHY lookup happens under &phy_provider_mutex, and I > believe it would be sufficient to call get_device() under this lock, in > order for consumers to get PHYs with their reference bumped. <skip the explanation and a patch> Yes, given that there is 1:1 mapping between device and provider (not phy instance but its parent) it looks like we can drop the reference and then bump it up again as long as we hold that mutex. > > The patch above leaves a few loose ends. > > The most obvious is of_phy_simple_xlate(), which has that put_device() > to balance out class_find_device_by_of_node() - which bumps the device > refcount to 2. It "looks" wrong but it is consistent with vendor > implementations of of_xlate(), which also provide a phy->dev refcount of 1. > And as explained, the refcount never _actually_ drops to 0 with the > above patch. > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake. > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I > would offer a new helper, devm_of_phy_simple_provider_register(), which > would internally use a callback that doesn't drop the refcount (say > __of_phy_simple_xlate() for lack of imagination). I would convert vendor > PHY drivers one by one to use this (it would be valid at any time to use > either the old or the new method). > > You'd notice that most of the of_phy_simple_xlate() occurrences go away, > except for freescale/phy-fsl-lynx-28g.c which calls this function > directly from its own lynx_28g_xlate(). It will still have to keep > calling it, and for refcount equalization purposes that weird > put_device() will have to continue to exist. Just leave a comment as to > why that is. I this case I would simply add a comment telling why the reference should (and can) be dropped where it is being dropped in of_phy_simple_xlate() and call it a day. There is not much benefit in adding another helper. > > > But there are more important loose ends still. > > I mentioned that "zombie" device. We've solved the memory safety issue, > but it's possible for consumers to hold onto a phy whose provider has > disappeared. The refcount of &phy->dev hasn't dropped to 0, so > technically it's a valid object, but from PHY API perspective, it's > still possible to call phy_init(), phy_power_on() and friends on this > PHY, and the Generic PHY core will be happy to further call into the > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, > so it should really be left alone. > > If we fix the UAF but leave the zombie PHY problem, we've effectively > done nothing but silence static analysis checkers, while the code path > where the PHY provider unbinds effectively remains treated as poorly as > before, just moving the crashes to a different place. > > I suspect what needs to be done here is to introduce a "bool dead" or > similar, which is to be set from phy_destroy() and checked from every > API call. The idea is that API functions on zombie PHYs should fail > without calling into their driver. I **suppose** that > try_module_get(phy->ops->owner) "tried" to avoid this situation, but > it just protects against module removal, not against "echo device > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. I think this is a problem common to many kernel subsystems, where there are devices that are vital for other devices to function, but are not parents of said devices. Think about regulators or clocks and such. In many such cases even validating at API level is not sufficient, because if you enable a clock you typically keep it running at least for a while, if not for entire duration of device being bound to a driver. If that clock goes away the device will break even though you are not calling any clock APIs at that particular time. We need some kind of revocation mechanism to signal consumers that providers they rely upon are going away. > > > Finally, I have identified one more loose end still. > > /** > * of_phy_put() - release the PHY > * @phy: the phy returned by of_phy_get() > * > * Releases a refcount the caller received from of_phy_get(). > */ > void of_phy_put(struct phy *phy) > { > if (!phy || IS_ERR(phy)) > return; > > mutex_lock(&phy->mutex); > if (phy->ops->release) > phy->ops->release(phy); > mutex_unlock(&phy->mutex); > > module_put(phy->ops->owner); > put_device(&phy->dev); > } > EXPORT_SYMBOL_GPL(of_phy_put); > > This function is called by PHY consumers. A PHY provider can have > multiple consumers (example in the case of Ethernet: a QSGMII SerDes > lane has 4 MAC ports multiplexed onto it). > > If a single consumer calls of_phy_put() to release its reference to the > SerDes lane, the phy->ops->release() method makes absolutely no sense. > There are 3 remaining consumers with handles to the lane! But we aren't > even telling the PHY which consumer has disappeared! It has nothing > useful to do with this information. > > Looking at actual phy_ops implementations, what they want to know is > when _all_ consumers went away, not when individual consumers did. > So the phy->ops->release() call needs to be put somewhere which is > executed when all consumers disappear. If I were to guess, that would be > the phy_release() class callback, but this is completely untested. Shouldn't phy->ops->release() keep track of users and decide when to destroy the resources? The rest seem fine as they simply drop references (I assume each user of shared phy will bump up references as needed?) > > Sorry that this email is a bit long, it got a bit late writing it and I > don't have a lot of energy left to trim it down. > In summary I found 4 highly related problems: > - use after free > - misleading of_phy_simple_xlate() API pattern (not a functional issue) > - zombie PHYs > - premature release call Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-24 0:37 ` Dmitry Torokhov @ 2026-02-24 15:47 ` Vladimir Oltean 2026-02-24 20:38 ` Dmitry Torokhov 0 siblings, 1 reply; 9+ messages in thread From: Vladimir Oltean @ 2026-02-24 15:47 UTC (permalink / raw) To: Dmitry Torokhov Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote: > > You have hit on a weak spot of the Generic PHY framework and I would > > like to encourage you to follow through with patches for this finding > > (although it won't be exactly trivial, I think it's doable with some > > determination). > > Thank you for the detailed response. I am not sure how much time I can > spend on phy rework as this change was an essentially a drive-by. I only > happen to look there because I want to remove > class_find_device_by_of_node() in favor of class_find_device_by_fwnode(). OK, I believe you can still do that without any dependency. > > On your direct question: > > It will impact downstream in subtle and unpleasant ways if you change > > the of_xlate() semantics w.r.t. reference counting, but the code will > > still compile just as before, i.e. when looking just at your downstream > > driver (what 99% of developers do) there will be no obvious way of > > knowing that something in the API has changed. I would avoid doing that. > > Hmm, I was not aware that we needed to take particular measures for > benefit of downstream trees... I think you're entirely missing the point. Obviously it benefits the maintainer/reviewer first and foremost. If you don't make the API change unmissable, drivers expecting the old convention will eventually creep their way into the mainline kernel. Why put avoidable pressure on reviewers watching out for things like this for years to come, when you can force downstream to notice and adapt (OR not make the change in the first place). Note that downstream means "API consumers invisible to you, the API changer", not only "hopelessly un-upstreamable drivers". > > The patch above leaves a few loose ends. > > > > The most obvious is of_phy_simple_xlate(), which has that put_device() > > to balance out class_find_device_by_of_node() - which bumps the device > > refcount to 2. It "looks" wrong but it is consistent with vendor > > implementations of of_xlate(), which also provide a phy->dev refcount of 1. > > And as explained, the refcount never _actually_ drops to 0 with the > > above patch. > > > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake. > > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I > > would offer a new helper, devm_of_phy_simple_provider_register(), which > > would internally use a callback that doesn't drop the refcount (say > > __of_phy_simple_xlate() for lack of imagination). I would convert vendor > > PHY drivers one by one to use this (it would be valid at any time to use > > either the old or the new method). > > > > You'd notice that most of the of_phy_simple_xlate() occurrences go away, > > except for freescale/phy-fsl-lynx-28g.c which calls this function > > directly from its own lynx_28g_xlate(). It will still have to keep > > calling it, and for refcount equalization purposes that weird > > put_device() will have to continue to exist. Just leave a comment as to > > why that is. > > I this case I would simply add a comment telling why the reference > should (and can) be dropped where it is being dropped in > of_phy_simple_xlate() and call it a day. There is not much benefit in > adding another helper. OK. > > But there are more important loose ends still. > > > > I mentioned that "zombie" device. We've solved the memory safety issue, > > but it's possible for consumers to hold onto a phy whose provider has > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so > > technically it's a valid object, but from PHY API perspective, it's > > still possible to call phy_init(), phy_power_on() and friends on this > > PHY, and the Generic PHY core will be happy to further call into the > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, > > so it should really be left alone. > > > > If we fix the UAF but leave the zombie PHY problem, we've effectively > > done nothing but silence static analysis checkers, while the code path > > where the PHY provider unbinds effectively remains treated as poorly as > > before, just moving the crashes to a different place. > > > > I suspect what needs to be done here is to introduce a "bool dead" or > > similar, which is to be set from phy_destroy() and checked from every > > API call. The idea is that API functions on zombie PHYs should fail > > without calling into their driver. I **suppose** that > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but > > it just protects against module removal, not against "echo device > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. > > I think this is a problem common to many kernel subsystems, where there > are devices that are vital for other devices to function, but are not > parents of said devices. Think about regulators or clocks and such. > In many such cases even validating at API level is not sufficient, > because if you enable a clock you typically keep it running at least for > a while, if not for entire duration of device being bound to a driver. > If that clock goes away the device will break even though you are not > calling any clock APIs at that particular time. > > We need some kind of revocation mechanism to signal consumers that > providers they rely upon are going away. Yeah, this gave me pause. When you unbind a Generic PHY you can kind of expect that the data path it affects to not work. But you'd sort of expect it to start working again when you rebind the PHY driver. That will not be the case, though. The problem, simply put, is that the struct phy that the consumer has will be different from the second struct phy that the provider registers when it rebinds. Using the stale struct phy, we cannot reach the phy_ops of the new provider. If we implement something similar to what Bartosz Golaszewski suggested here: https://lpc.events/event/17/contributions/1627/ aka decouple the struct phy given to the consumer from the struct phy provided by the provider, and perform a PHY lookup during each phy_init() / phy_power_up() / etc etc operation; we are kinda able to: - match the consumer phy to the provider phy and forward the call to phy_ops, if the provider is present (or rebound) - return -ENODEV instead of calling into phy_ops, if the provider is not present But there's still one big gap. PHY consumers are driving the PHY through a state machine when they do phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...]. When the provider unbinds and then rebinds, that state is lost. But the consumer has no idea. It knows it is in a state where it called phy_power_up(), and next thing it knows, the power_count is 1 and it must call phy_power_down(). The Generic PHY core cannot actually replay the entire state in a meaningful way so as to make the provider reprobing completely transparent (think of phy_calibrate() calls). So I guess we're looking at what Bartosz refers to as a notification side-channel that the PHY provider is going away. At least, for drivers that care in a meaningful way. For drivers that don't care about such complexity, there seems to be a simpler answer: device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER). I can also answer why device links with "autoremove consumer" are not a universal answer. This is because the consumer itself may be a multi-port network switch (single device). If you unbind the Generic PHY driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and 2 to also disappear from the kernel. You need the more complex PHY provider disappearance notification which does something more localized (calls phylink_stop(), I don't know). I can say right off the bat that this is too complicated for me to even think about in more detail than that, at the moment. I would be quite happy if it's possible to unbind the PHY driver without the possibility to rebind it, as a first step. > > Finally, I have identified one more loose end still. > > > > /** > > * of_phy_put() - release the PHY > > * @phy: the phy returned by of_phy_get() > > * > > * Releases a refcount the caller received from of_phy_get(). > > */ > > void of_phy_put(struct phy *phy) > > { > > if (!phy || IS_ERR(phy)) > > return; > > > > mutex_lock(&phy->mutex); > > if (phy->ops->release) > > phy->ops->release(phy); > > mutex_unlock(&phy->mutex); > > > > module_put(phy->ops->owner); > > put_device(&phy->dev); > > } > > EXPORT_SYMBOL_GPL(of_phy_put); > > > > This function is called by PHY consumers. A PHY provider can have > > multiple consumers (example in the case of Ethernet: a QSGMII SerDes > > lane has 4 MAC ports multiplexed onto it). > > > > If a single consumer calls of_phy_put() to release its reference to the > > SerDes lane, the phy->ops->release() method makes absolutely no sense. > > There are 3 remaining consumers with handles to the lane! But we aren't > > even telling the PHY which consumer has disappeared! It has nothing > > useful to do with this information. > > > > Looking at actual phy_ops implementations, what they want to know is > > when _all_ consumers went away, not when individual consumers did. > > So the phy->ops->release() call needs to be put somewhere which is > > executed when all consumers disappear. If I were to guess, that would be > > the phy_release() class callback, but this is completely untested. > > Shouldn't phy->ops->release() keep track of users and decide when to > destroy the resources? The rest seem fine as they simply drop references > (I assume each user of shared phy will bump up references as needed?) AFAICS, the only instance is phy/ti/phy-am654-serdes.c: serdes_am654_release(). This undoes the effect of serdes_am654_xlate(), i.e. calls mux_control_deselect(phy->control). The fact that phy_provider->of_xlate() has side effects on this platform, rather than just return a struct phy * for a given struct device_node * (as it's supposed to do), is "interesting". I don't have extra info why the PHY maintainer didn't add extra phy_ops for consumer_bind() and consumer_unbind() or something like that, to make it clear that these calls are supposed to be _per consumer_. Notice how the first of_xlate() call sets am654_phy->busy = true, and second of_xlate() call is designed to fail due to am654_phy->busy. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-24 15:47 ` Vladimir Oltean @ 2026-02-24 20:38 ` Dmitry Torokhov 2026-02-25 23:50 ` Vladimir Oltean 0 siblings, 1 reply; 9+ messages in thread From: Dmitry Torokhov @ 2026-02-24 20:38 UTC (permalink / raw) To: Vladimir Oltean Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On Tue, Feb 24, 2026 at 05:47:30PM +0200, Vladimir Oltean wrote: > On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote: > > > You have hit on a weak spot of the Generic PHY framework and I would > > > like to encourage you to follow through with patches for this finding > > > (although it won't be exactly trivial, I think it's doable with some > > > determination). > > > > Thank you for the detailed response. I am not sure how much time I can > > spend on phy rework as this change was an essentially a drive-by. I only > > happen to look there because I want to remove > > class_find_device_by_of_node() in favor of class_find_device_by_fwnode(). > > OK, I believe you can still do that without any dependency. Right, this is independent, I just happened to look at that function. > > > > On your direct question: > > > It will impact downstream in subtle and unpleasant ways if you change > > > the of_xlate() semantics w.r.t. reference counting, but the code will > > > still compile just as before, i.e. when looking just at your downstream > > > driver (what 99% of developers do) there will be no obvious way of > > > knowing that something in the API has changed. I would avoid doing that. > > > > Hmm, I was not aware that we needed to take particular measures for > > benefit of downstream trees... > > I think you're entirely missing the point. > > Obviously it benefits the maintainer/reviewer first and foremost. If you > don't make the API change unmissable, drivers expecting the old convention > will eventually creep their way into the mainline kernel. Why put avoidable > pressure on reviewers watching out for things like this for years to come, > when you can force downstream to notice and adapt (OR not make the > change in the first place). Note that downstream means "API consumers > invisible to you, the API changer", not only "hopelessly un-upstreamable > drivers". Well, this is whole "stable kernel API" argument. Rules of lifetime around objects may change and the kernel needs to adapt. Anyway, this is your subsystem so you get to decide. > > > > The patch above leaves a few loose ends. > > > > > > The most obvious is of_phy_simple_xlate(), which has that put_device() > > > to balance out class_find_device_by_of_node() - which bumps the device > > > refcount to 2. It "looks" wrong but it is consistent with vendor > > > implementations of of_xlate(), which also provide a phy->dev refcount of 1. > > > And as explained, the refcount never _actually_ drops to 0 with the > > > above patch. > > > > > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake. > > > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I > > > would offer a new helper, devm_of_phy_simple_provider_register(), which > > > would internally use a callback that doesn't drop the refcount (say > > > __of_phy_simple_xlate() for lack of imagination). I would convert vendor > > > PHY drivers one by one to use this (it would be valid at any time to use > > > either the old or the new method). > > > > > > You'd notice that most of the of_phy_simple_xlate() occurrences go away, > > > except for freescale/phy-fsl-lynx-28g.c which calls this function > > > directly from its own lynx_28g_xlate(). It will still have to keep > > > calling it, and for refcount equalization purposes that weird > > > put_device() will have to continue to exist. Just leave a comment as to > > > why that is. > > > > I this case I would simply add a comment telling why the reference > > should (and can) be dropped where it is being dropped in > > of_phy_simple_xlate() and call it a day. There is not much benefit in > > adding another helper. > > OK. > > > > But there are more important loose ends still. > > > > > > I mentioned that "zombie" device. We've solved the memory safety issue, > > > but it's possible for consumers to hold onto a phy whose provider has > > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so > > > technically it's a valid object, but from PHY API perspective, it's > > > still possible to call phy_init(), phy_power_on() and friends on this > > > PHY, and the Generic PHY core will be happy to further call into the > > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, > > > so it should really be left alone. > > > > > > If we fix the UAF but leave the zombie PHY problem, we've effectively > > > done nothing but silence static analysis checkers, while the code path > > > where the PHY provider unbinds effectively remains treated as poorly as > > > before, just moving the crashes to a different place. > > > > > > I suspect what needs to be done here is to introduce a "bool dead" or > > > similar, which is to be set from phy_destroy() and checked from every > > > API call. The idea is that API functions on zombie PHYs should fail > > > without calling into their driver. I **suppose** that > > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but > > > it just protects against module removal, not against "echo device > > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. > > > > I think this is a problem common to many kernel subsystems, where there > > are devices that are vital for other devices to function, but are not > > parents of said devices. Think about regulators or clocks and such. > > In many such cases even validating at API level is not sufficient, > > because if you enable a clock you typically keep it running at least for > > a while, if not for entire duration of device being bound to a driver. > > If that clock goes away the device will break even though you are not > > calling any clock APIs at that particular time. > > > > We need some kind of revocation mechanism to signal consumers that > > providers they rely upon are going away. > > Yeah, this gave me pause. > > When you unbind a Generic PHY you can kind of expect that the data path > it affects to not work. But you'd sort of expect it to start working > again when you rebind the PHY driver. > > That will not be the case, though. > > The problem, simply put, is that the struct phy that the consumer has > will be different from the second struct phy that the provider registers > when it rebinds. Using the stale struct phy, we cannot reach the phy_ops > of the new provider. > > If we implement something similar to what Bartosz Golaszewski suggested > here: > https://lpc.events/event/17/contributions/1627/ There is also "revocable" from Tzung-Bi Shih. > > aka decouple the struct phy given to the consumer from the struct phy > provided by the provider, and perform a PHY lookup during each > phy_init() / phy_power_up() / etc etc operation; > > we are kinda able to: > - match the consumer phy to the provider phy and forward the call to > phy_ops, if the provider is present (or rebound) > - return -ENODEV instead of calling into phy_ops, if the provider is not > present > > But there's still one big gap. > > PHY consumers are driving the PHY through a state machine when they do > phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...]. > When the provider unbinds and then rebinds, that state is lost. But the > consumer has no idea. It knows it is in a state where it called > phy_power_up(), and next thing it knows, the power_count is 1 and it > must call phy_power_down(). > > The Generic PHY core cannot actually replay the entire state in a > meaningful way so as to make the provider reprobing completely > transparent (think of phy_calibrate() calls). > > So I guess we're looking at what Bartosz refers to as a notification > side-channel that the PHY provider is going away. > > At least, for drivers that care in a meaningful way. For drivers that > don't care about such complexity, there seems to be a simpler answer: > device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER). Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the consumer when provider goes away? And if it happens does it happen in the right order? > > I can also answer why device links with "autoremove consumer" are not a > universal answer. This is because the consumer itself may be a > multi-port network switch (single device). If you unbind the Generic PHY > driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and > 2 to also disappear from the kernel. You need the more complex PHY > provider disappearance notification which does something more localized > (calls phylink_stop(), I don't know). Aren't the ports represented as individual devices with their own state and lifetime? > > > I can say right off the bat that this is too complicated for me to even > think about in more detail than that, at the moment. I would be quite > happy if it's possible to unbind the PHY driver without the possibility > to rebind it, as a first step. Yes, this is something that is not phy specific and I think should be solved at driver core (or at least driver core should provide subsystems tools to solve this). Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-24 20:38 ` Dmitry Torokhov @ 2026-02-25 23:50 ` Vladimir Oltean 0 siblings, 0 replies; 9+ messages in thread From: Vladimir Oltean @ 2026-02-25 23:50 UTC (permalink / raw) To: Dmitry Torokhov Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On Tue, Feb 24, 2026 at 12:38:48PM -0800, Dmitry Torokhov wrote: > > Obviously it benefits the maintainer/reviewer first and foremost. If you > > don't make the API change unmissable, drivers expecting the old convention > > will eventually creep their way into the mainline kernel. Why put avoidable > > pressure on reviewers watching out for things like this for years to come, > > when you can force downstream to notice and adapt (OR not make the > > change in the first place). Note that downstream means "API consumers > > invisible to you, the API changer", not only "hopelessly un-upstreamable > > drivers". > > Well, this is whole "stable kernel API" argument. Rules of lifetime > around objects may change and the kernel needs to adapt. Actually it's not the "stable kernel API" argument. I just said that if the of_xlate() callbacks currently need to do: return phy; and you propose a new scheme where they have to do: get_device(&phy->dev); return phy; , then you are creating the possibility for: - the API change to be missed by anyone who hasn't been _actively_ monitoring you, because the old code will still compile just as well, and won't show any superficial runtime problem either - the reviewers and maintainers to not notice that patches were developed according to the old convention and such patches may even get accepted, but be buggy. I actually don't care whether the kernel API is stable or not. But if you change it, make it obvious it has changed. > Anyway, this is your subsystem so you get to decide. Not my subsystem. Sorry if I gave you the wrong impression. > > > > But there are more important loose ends still. > > > > > > > > I mentioned that "zombie" device. We've solved the memory safety issue, > > > > but it's possible for consumers to hold onto a phy whose provider has > > > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so > > > > technically it's a valid object, but from PHY API perspective, it's > > > > still possible to call phy_init(), phy_power_on() and friends on this > > > > PHY, and the Generic PHY core will be happy to further call into the > > > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, > > > > so it should really be left alone. > > > > > > > > If we fix the UAF but leave the zombie PHY problem, we've effectively > > > > done nothing but silence static analysis checkers, while the code path > > > > where the PHY provider unbinds effectively remains treated as poorly as > > > > before, just moving the crashes to a different place. > > > > > > > > I suspect what needs to be done here is to introduce a "bool dead" or > > > > similar, which is to be set from phy_destroy() and checked from every > > > > API call. The idea is that API functions on zombie PHYs should fail > > > > without calling into their driver. I **suppose** that > > > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but > > > > it just protects against module removal, not against "echo device > > > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. > > > > > > I think this is a problem common to many kernel subsystems, where there > > > are devices that are vital for other devices to function, but are not > > > parents of said devices. Think about regulators or clocks and such. > > > In many such cases even validating at API level is not sufficient, > > > because if you enable a clock you typically keep it running at least for > > > a while, if not for entire duration of device being bound to a driver. > > > If that clock goes away the device will break even though you are not > > > calling any clock APIs at that particular time. > > > > > > We need some kind of revocation mechanism to signal consumers that > > > providers they rely upon are going away. > > > > Yeah, this gave me pause. > > > > When you unbind a Generic PHY you can kind of expect that the data path > > it affects to not work. But you'd sort of expect it to start working > > again when you rebind the PHY driver. > > > > That will not be the case, though. > > > > The problem, simply put, is that the struct phy that the consumer has > > will be different from the second struct phy that the provider registers > > when it rebinds. Using the stale struct phy, we cannot reach the phy_ops > > of the new provider. > > > > If we implement something similar to what Bartosz Golaszewski suggested > > here: > > https://lpc.events/event/17/contributions/1627/ > > There is also "revocable" from Tzung-Bi Shih. OK. Thanks for the heads up. > > aka decouple the struct phy given to the consumer from the struct phy > > provided by the provider, and perform a PHY lookup during each > > phy_init() / phy_power_up() / etc etc operation; > > > > we are kinda able to: > > - match the consumer phy to the provider phy and forward the call to > > phy_ops, if the provider is present (or rebound) > > - return -ENODEV instead of calling into phy_ops, if the provider is not > > present > > > > But there's still one big gap. > > > > PHY consumers are driving the PHY through a state machine when they do > > phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...]. > > When the provider unbinds and then rebinds, that state is lost. But the > > consumer has no idea. It knows it is in a state where it called > > phy_power_up(), and next thing it knows, the power_count is 1 and it > > must call phy_power_down(). > > > > The Generic PHY core cannot actually replay the entire state in a > > meaningful way so as to make the provider reprobing completely > > transparent (think of phy_calibrate() calls). > > > > So I guess we're looking at what Bartosz refers to as a notification > > side-channel that the PHY provider is going away. > > > > At least, for drivers that care in a meaningful way. For drivers that > > don't care about such complexity, there seems to be a simpler answer: > > device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER). > > Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the > consumer when provider goes away? And if it happens does it happen in > the right order? If "the right order" means "consumer before provider", yes. > > I can also answer why device links with "autoremove consumer" are not a > > universal answer. This is because the consumer itself may be a > > multi-port network switch (single device). If you unbind the Generic PHY > > driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and > > 2 to also disappear from the kernel. You need the more complex PHY > > provider disappearance notification which does something more localized > > (calls phylink_stop(), I don't know). > > Aren't the ports represented as individual devices with their own state > and lifetime? Nope. A single platform_device/pci_device/spi_device/i2c_client/mdio_device/etc represents an entire switch, and its driver will call register_netdevice() multiple times, once per port. This is the switchdev model as per Documentation/networking/switchdev.rst. Technically each netdev has its own "class" device (for /sys/class/net/) but it's not a "bus" device, doesn't have a bound driver, and their parent is the aforementioned platform_device/etc. > > I can say right off the bat that this is too complicated for me to even > > think about in more detail than that, at the moment. I would be quite > > happy if it's possible to unbind the PHY driver without the possibility > > to rebind it, as a first step. > > Yes, this is something that is not phy specific and I think should be > solved at driver core (or at least driver core should provide subsystems > tools to solve this). Some subsystems I'm slightly more familiar with do have mechanisms to handle this. Networking has (a pretty complicated) netdev_wait_allrefs_any(), which waits for all references to a net_device to be dropped before fully unregistering it (these are net_device :: dev_refcnt, different from struct device references). Drivers are supposed to monitor the netdev notifier call chain for the NETDEV_UNREGISTER event, and let go of the reference. Sadly I don't know enough about the Generic PHY subsystem to be making generalizations with any claim of correctness, but I suspect the majority of consumers are in a 1:1 relationship with the PHY provider, and would be happy with the cost/benefit ratio given by the device_link(DL_FLAG_AUTOREMOVE_CONSUMER) behaviour. Only a select few would care to continue their lifetime in a PHY-less environment (normal operation on other ports / degraded on the PHY-less port, reestablish connection to provider when it comes back, etc). I'm not sure that there exists a space for "revocable" in Generic PHY between the above two options. Only if the device link idea doesn't work out, for some reason I'm not seeing. I'll have to put this thought in the background and let it simmer for a while. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() 2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov 2026-02-20 0:11 ` Dmitry Torokhov @ 2026-02-28 3:44 ` Zijun Hu 1 sibling, 0 replies; 9+ messages in thread From: Zijun Hu @ 2026-02-28 3:44 UTC (permalink / raw) To: Dmitry Torokhov, Vinod Koul Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy, linux-kernel On 2/20/26 07:57, Dmitry Torokhov wrote: > The implementation put_device()s located device and then uses > container_of() on the pointer. The device may disappear by that time, > resulting in UAF. > > Fix the problem by keeping the reference to the framer device, > avoiding getting an extra reference to it in framer_get(), and making > sure to drop the reference in error path when we fail to get the module. > > Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation") this fix tag is wrong as explained by below comments. > Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> > --- > drivers/phy/phy-core.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c > index 4ad396214d0c..cf62eb9ddca9 100644 > --- a/drivers/phy/phy-core.c > +++ b/drivers/phy/phy-core.c > @@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id) > if (IS_ERR(phy)) > return phy; > > - if (!try_module_get(phy->ops->owner)) > + if (!try_module_get(phy->ops->owner)) { > + put_device(&phy->dev); > return ERR_PTR(-EPROBE_DEFER); > - > - get_device(&phy->dev); > + } > > return phy; > } > @@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev, > if (!target_dev) > return ERR_PTR(-ENODEV); > > - put_device(target_dev); put reference count of @target_dev got by class_find_device_by_of_node() so the following commit mentioned by the fix tag does not change the reference count. https://lore.kernel.org/all/20241213-phy_core_fix-v6-6-40ae28f5015a@quicinc.com/ > return to_phy(target_dev); > } > EXPORT_SYMBOL_GPL(of_phy_simple_xlate); ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-02-28 3:44 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov 2026-02-20 0:11 ` Dmitry Torokhov 2026-02-21 1:01 ` Dmitry Torokhov 2026-02-23 23:15 ` Vladimir Oltean 2026-02-24 0:37 ` Dmitry Torokhov 2026-02-24 15:47 ` Vladimir Oltean 2026-02-24 20:38 ` Dmitry Torokhov 2026-02-25 23:50 ` Vladimir Oltean 2026-02-28 3:44 ` Zijun Hu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox