[PATCH] phy: core: fix potential UAF in of_phy_simple

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
@ 2026-02-19 23:57 Dmitry Torokhov
  2026-02-20  0:11 ` Dmitry Torokhov
  2026-02-28  3:44 ` Zijun Hu
  0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-19 23:57 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven,
	Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert,
	Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu,
	linux-phy, linux-kernel

The implementation put_device()s located device and then uses
container_of() on the pointer. The device may disappear by that time,
resulting in UAF.

Fix the problem by keeping the reference to the framer device,
avoiding getting an extra reference to it in framer_get(), and making
sure to drop the reference in error path when we fail to get the module.

Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation")
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
---
 drivers/phy/phy-core.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
index 4ad396214d0c..cf62eb9ddca9 100644
--- a/drivers/phy/phy-core.c
+++ b/drivers/phy/phy-core.c
@@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
 	if (IS_ERR(phy))
 		return phy;
 
-	if (!try_module_get(phy->ops->owner))
+	if (!try_module_get(phy->ops->owner)) {
+		put_device(&phy->dev);
 		return ERR_PTR(-EPROBE_DEFER);
-
-	get_device(&phy->dev);
+	}
 
 	return phy;
 }
@@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev,
 	if (!target_dev)
 		return ERR_PTR(-ENODEV);
 
-	put_device(target_dev);
 	return to_phy(target_dev);
 }
 EXPORT_SYMBOL_GPL(of_phy_simple_xlate);
-- 
2.53.0.345.g96ddfc5eaa-goog


-- 
Dmitry

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov
@ 2026-02-20  0:11 ` Dmitry Torokhov
  2026-02-21  1:01   ` Dmitry Torokhov
  2026-02-28  3:44 ` Zijun Hu
  1 sibling, 1 reply; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-20  0:11 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven,
	Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert,
	Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu,
	linux-phy, linux-kernel

On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> The implementation put_device()s located device and then uses
> container_of() on the pointer. The device may disappear by that time,
> resulting in UAF.
> 
> Fix the problem by keeping the reference to the framer device,
> avoiding getting an extra reference to it in framer_get(), and making
> sure to drop the reference in error path when we fail to get the module.

Hmm, I was too rash. There are bunch of other xlate functions that need
to be updated to take the reference.

> 
> Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation")
> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> ---
>  drivers/phy/phy-core.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
> index 4ad396214d0c..cf62eb9ddca9 100644
> --- a/drivers/phy/phy-core.c
> +++ b/drivers/phy/phy-core.c
> @@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
>  	if (IS_ERR(phy))
>  		return phy;
>  
> -	if (!try_module_get(phy->ops->owner))
> +	if (!try_module_get(phy->ops->owner)) {
> +		put_device(&phy->dev);
>  		return ERR_PTR(-EPROBE_DEFER);
> -
> -	get_device(&phy->dev);
> +	}
>  
>  	return phy;
>  }
> @@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev,
>  	if (!target_dev)
>  		return ERR_PTR(-ENODEV);
>  
> -	put_device(target_dev);
>  	return to_phy(target_dev);
>  }
>  EXPORT_SYMBOL_GPL(of_phy_simple_xlate);
> -- 
> 2.53.0.345.g96ddfc5eaa-goog
> 
> 

-- 
Dmitry

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-20  0:11 ` Dmitry Torokhov
@ 2026-02-21  1:01   ` Dmitry Torokhov
  2026-02-23 23:15     ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-21  1:01 UTC (permalink / raw)
  To: Vinod Koul, Kishon Vijay Abraham I
  Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven,
	Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert,
	Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu,
	linux-phy, linux-kernel

On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote:
> On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> > The implementation put_device()s located device and then uses
> > container_of() on the pointer. The device may disappear by that time,
> > resulting in UAF.
> > 
> > Fix the problem by keeping the reference to the framer device,
> > avoiding getting an extra reference to it in framer_get(), and making
> > sure to drop the reference in error path when we fail to get the module.
> 
> Hmm, I was too rash. There are bunch of other xlate functions that need
> to be updated to take the reference.

So I am convinced that xlate functions need to bump up the reference to
phy devices they return. The question is how to deal with the ones that
do not. I can either convert them in the same patch (the changes are
quite mechanical) or we can do the whole song and dance, introduce a
flag, set it up in converted xlate functions, have the core respect it,
and then remove it from xlates and from the core when it is all done.

Please let me know.

Thanks.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-21  1:01   ` Dmitry Torokhov
@ 2026-02-23 23:15     ` Vladimir Oltean
  2026-02-24  0:37       ` Dmitry Torokhov
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2026-02-23 23:15 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong,
	Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold,
	Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin,
	Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy,
	linux-kernel

Hi Dmitry,

On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote:
> On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote:
> > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> > > The implementation put_device()s located device and then uses
> > > container_of() on the pointer. The device may disappear by that time,
> > > resulting in UAF.
> > > 
> > > Fix the problem by keeping the reference to the framer device,
> > > avoiding getting an extra reference to it in framer_get(), and making
> > > sure to drop the reference in error path when we fail to get the module.
> > 
> > Hmm, I was too rash. There are bunch of other xlate functions that need
> > to be updated to take the reference.
> 
> So I am convinced that xlate functions need to bump up the reference to
> phy devices they return. The question is how to deal with the ones that
> do not. I can either convert them in the same patch (the changes are
> quite mechanical) or we can do the whole song and dance, introduce a
> flag, set it up in converted xlate functions, have the core respect it,
> and then remove it from xlates and from the core when it is all done.
> 
> Please let me know.

You have hit on a weak spot of the Generic PHY framework and I would
like to encourage you to follow through with patches for this finding
(although it won't be exactly trivial, I think it's doable with some
determination).

On your direct question:
It will impact downstream in subtle and unpleasant ways if you change
the of_xlate() semantics w.r.t. reference counting, but the code will
still compile just as before, i.e. when looking just at your downstream
driver (what 99% of developers do) there will be no obvious way of
knowing that something in the API has changed. I would avoid doing that.

But first the problem. Too little has been said about the problem, and
too much about the solution. We can't find a good solution if we don't
call out the problem properly first.

The phy_get() function follows a "lookup-then-get" approach for PHY
providers, rather than "atomic-lookup-and-get".

Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(),
returns a struct phy * with its underlying device reference count
nominally at 1. All callers of _of_phy_get() do later call get_device()
to bump this refcount for each PHY consumer, but it is "too late" in the
sense that a window has been opened where the PHY provider driver can
unbind, and the reference count of &phy->dev can drop to 0 and it can be
freed.

Immediately after being created by phy_create(), the &phy->dev has a
refcount of 1, and only the action of its provider driver (calling
phy_destroy() directly or through devres, on driver unbind) can drop
that refcount to 0. As long as the driver doesn't do that, the &phy->dev
refcount is not in danger.

Then the PHY is exposed to the outside world using of_phy_provider_register(),
where the fact that the provider driver can concurrently unregister
starts being a problem.

Therefore, I would say that the problem is that consumers, aka phy_get()
callers, can get a phy with a &phy->dev refcount that is not already bumped
by the time they are handed over that PHY.

Mechanically, PHY lookup happens under &phy_provider_mutex, and I
believe it would be sufficient to call get_device() under this lock, in
order for consumers to get PHYs with their reference bumped.

Why?

Let's consider what a PHY provider does when unbinding. It runs the PHY
registration and creation processes in reverse, i.e.
- it calls of_phy_provider_unregister() directly or through devres
- it calls phy_destroy() directly or through devres

Since of_phy_provider_unregister() also acquires &phy_provider_mutex,
this acts like a synchronization barrier. There are 2 cases:

Consumer                                 Provider
phy_get()
-> mutex_lock(&phy_provider_mutex)
-> get_device(&phy->dev) -> 2
-> mutex_unlock(&phy_provider_mutex)
                                         of_phy_provider_unregister()
                                         -> mutex_lock(&phy_provider_mutex)
                                         -> removes phy from list
                                         -> mutex_unlock(&phy_provider_mutex)
                                         phy_destroy()
                                         -> device_unregister(&phy->dev)
                                            -> device_del(dev);
                                            -> put_device(dev); // -> 1

and the consumer remains with a "zombie" PHY device (more later)

Or

Consumer                                 Provider
phy_get()
                                         of_phy_provider_unregister()
                                         -> mutex_lock(&phy_provider_mutex)
                                         -> removes phy from list
                                         -> mutex_unlock(&phy_provider_mutex)
-> mutex_lock(&phy_provider_mutex)
-> doesn't find the PHY device!
   It was unpublished even if not freed
-> mutex_unlock(&phy_provider_mutex)
                                         phy_destroy()
                                         -> device_unregister(&phy->dev)
                                            -> device_del(dev);
                                            -> put_device(dev); // -> 1

and the consumer won't be allowed to even find the PHY device in this case.
So this is why get_device() under phy_provider_mutex actually helps.

Now, it is much less important whether you push the get_device() to the
actual of_xlate() PHY vendor implementation or not. For simplicity sake,
I suggest you don't do that, but instead keep it in the core, to
preserve driver API semantics (patch not even compile-tested):

-- >8 --
diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
index 21aaf2f76e53..c50e38f057a8 100644
--- a/drivers/phy/phy-core.c
+++ b/drivers/phy/phy-core.c
@@ -122,17 +122,19 @@ EXPORT_SYMBOL_GPL(phy_remove_lookup);
 static struct phy *phy_find(struct device *dev, const char *con_id)
 {
 	const char *dev_id = dev_name(dev);
-	struct phy_lookup *p, *pl = NULL;
+	struct phy *phy = NULL;
+	struct phy_lookup *p;

 	mutex_lock(&phy_provider_mutex);
 	list_for_each_entry(p, &phys, node)
 		if (!strcmp(p->dev_id, dev_id) && !strcmp(p->con_id, con_id)) {
-			pl = p;
+			phy = p->phy;
+			get_device(&phy->dev);
 			break;
 		}
 	mutex_unlock(&phy_provider_mutex);

-	return pl ? pl->phy : ERR_PTR(-ENODEV);
+	return phy ? phy : ERR_PTR(-ENODEV);
 }

 static struct phy_provider *of_phy_provider_lookup(struct device_node *node)
@@ -649,6 +651,7 @@ static struct phy *_of_phy_get(struct device_node *np, int index)
 	}

 	phy = phy_provider->of_xlate(phy_provider->dev, &args);
+	get_device(&phy->dev);

 out_put_module:
 	module_put(phy_provider->owner);
@@ -682,10 +685,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
 	if (IS_ERR(phy))
 		return phy;

-	if (!try_module_get(phy->ops->owner))
+	if (!try_module_get(phy->ops->owner)) {
+		put_device(&phy->dev);
 		return ERR_PTR(-EPROBE_DEFER);
-
-	get_device(&phy->dev);
+	}

 	return phy;
 }
@@ -803,10 +806,10 @@ struct phy *phy_get(struct device *dev, const char *string)
 	if (IS_ERR(phy))
 		return phy;

-	if (!try_module_get(phy->ops->owner))
+	if (!try_module_get(phy->ops->owner)) {
+		put_device(&phy->dev);
 		return ERR_PTR(-EPROBE_DEFER);
-
-	get_device(&phy->dev);
+	}

 	link = device_link_add(dev, &phy->dev, DL_FLAG_STATELESS);
 	if (!link)
@@ -969,11 +972,10 @@ struct phy *devm_of_phy_get_by_index(struct device *dev, struct device_node *np,

 	if (!try_module_get(phy->ops->owner)) {
 		devres_free(ptr);
+		put_device(&phy->dev);
 		return ERR_PTR(-EPROBE_DEFER);
 	}

-	get_device(&phy->dev);
-
 	*ptr = phy;
 	devres_add(dev, ptr);

-- >8 --

The patch above leaves a few loose ends.

The most obvious is of_phy_simple_xlate(), which has that put_device()
to balance out class_find_device_by_of_node() - which bumps the device
refcount to 2. It "looks" wrong but it is consistent with vendor
implementations of of_xlate(), which also provide a phy->dev refcount of 1.
And as explained, the refcount never _actually_ drops to 0 with the
above patch.

I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
would offer a new helper, devm_of_phy_simple_provider_register(), which
would internally use a callback that doesn't drop the refcount (say
__of_phy_simple_xlate() for lack of imagination). I would convert vendor
PHY drivers one by one to use this (it would be valid at any time to use
either the old or the new method).

You'd notice that most of the of_phy_simple_xlate() occurrences go away,
except for freescale/phy-fsl-lynx-28g.c which calls this function
directly from its own lynx_28g_xlate(). It will still have to keep
calling it, and for refcount equalization purposes that weird
put_device() will have to continue to exist. Just leave a comment as to
why that is.

But there are more important loose ends still.

I mentioned that "zombie" device. We've solved the memory safety issue,
but it's possible for consumers to hold onto a phy whose provider has
disappeared. The refcount of &phy->dev hasn't dropped to 0, so
technically it's a valid object, but from PHY API perspective, it's
still possible to call phy_init(), phy_power_on() and friends on this
PHY, and the Generic PHY core will be happy to further call into the
phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
so it should really be left alone.

If we fix the UAF but leave the zombie PHY problem, we've effectively
done nothing but silence static analysis checkers, while the code path
where the PHY provider unbinds effectively remains treated as poorly as
before, just moving the crashes to a different place.

I suspect what needs to be done here is to introduce a "bool dead" or
similar, which is to be set from phy_destroy() and checked from every
API call. The idea is that API functions on zombie PHYs should fail
without calling into their driver. I **suppose** that
try_module_get(phy->ops->owner) "tried" to avoid this situation, but
it just protects against module removal, not against "echo device >
/sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.

Finally, I have identified one more loose end still.

/**
 * of_phy_put() - release the PHY
 * @phy: the phy returned by of_phy_get()
 *
 * Releases a refcount the caller received from of_phy_get().
 */
void of_phy_put(struct phy *phy)
{
	if (!phy || IS_ERR(phy))
		return;

	mutex_lock(&phy->mutex);
	if (phy->ops->release)
		phy->ops->release(phy);
	mutex_unlock(&phy->mutex);

	module_put(phy->ops->owner);
	put_device(&phy->dev);
}
EXPORT_SYMBOL_GPL(of_phy_put);

This function is called by PHY consumers. A PHY provider can have
multiple consumers (example in the case of Ethernet: a QSGMII SerDes
lane has 4 MAC ports multiplexed onto it).

If a single consumer calls of_phy_put() to release its reference to the
SerDes lane, the phy->ops->release() method makes absolutely no sense.
There are 3 remaining consumers with handles to the lane! But we aren't
even telling the PHY which consumer has disappeared! It has nothing
useful to do with this information.

Looking at actual phy_ops implementations, what they want to know is
when _all_ consumers went away, not when individual consumers did.
So the phy->ops->release() call needs to be put somewhere which is
executed when all consumers disappear. If I were to guess, that would be
the phy_release() class callback, but this is completely untested.

Sorry that this email is a bit long, it got a bit late writing it and I
don't have a lot of energy left to trim it down.
In summary I found 4 highly related problems:
- use after free
- misleading of_phy_simple_xlate() API pattern (not a functional issue)
- zombie PHYs
- premature release call

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-23 23:15     ` Vladimir Oltean
@ 2026-02-24  0:37       ` Dmitry Torokhov
  2026-02-24 15:47         ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-24  0:37 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong,
	Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold,
	Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin,
	Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy,
	linux-kernel

Hi Vladimir,

On Tue, Feb 24, 2026 at 01:15:00AM +0200, Vladimir Oltean wrote:
> Hi Dmitry,
> 
> On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote:
> > On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote:
> > > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> > > > The implementation put_device()s located device and then uses
> > > > container_of() on the pointer. The device may disappear by that time,
> > > > resulting in UAF.
> > > > 
> > > > Fix the problem by keeping the reference to the framer device,
> > > > avoiding getting an extra reference to it in framer_get(), and making
> > > > sure to drop the reference in error path when we fail to get the module.
> > > 
> > > Hmm, I was too rash. There are bunch of other xlate functions that need
> > > to be updated to take the reference.
> > 
> > So I am convinced that xlate functions need to bump up the reference to
> > phy devices they return. The question is how to deal with the ones that
> > do not. I can either convert them in the same patch (the changes are
> > quite mechanical) or we can do the whole song and dance, introduce a
> > flag, set it up in converted xlate functions, have the core respect it,
> > and then remove it from xlates and from the core when it is all done.
> > 
> > Please let me know.
> 
> You have hit on a weak spot of the Generic PHY framework and I would
> like to encourage you to follow through with patches for this finding
> (although it won't be exactly trivial, I think it's doable with some
> determination).

Thank you for the detailed response. I am not sure how much time I can
spend on phy rework as this change was an essentially a drive-by. I only
happen to look there because I want to remove
class_find_device_by_of_node() in favor of class_find_device_by_fwnode().

> 
> On your direct question:
> It will impact downstream in subtle and unpleasant ways if you change
> the of_xlate() semantics w.r.t. reference counting, but the code will
> still compile just as before, i.e. when looking just at your downstream
> driver (what 99% of developers do) there will be no obvious way of
> knowing that something in the API has changed. I would avoid doing that.

Hmm, I was not aware that we needed to take particular measures for
benefit of downstream trees... 

> 
> But first the problem. Too little has been said about the problem, and
> too much about the solution. We can't find a good solution if we don't
> call out the problem properly first.
> 
> The phy_get() function follows a "lookup-then-get" approach for PHY
> providers, rather than "atomic-lookup-and-get".
> 
> Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(),
> returns a struct phy * with its underlying device reference count
> nominally at 1. All callers of _of_phy_get() do later call get_device()
> to bump this refcount for each PHY consumer, but it is "too late" in the
> sense that a window has been opened where the PHY provider driver can
> unbind, and the reference count of &phy->dev can drop to 0 and it can be
> freed.

Yes.

> 
> Immediately after being created by phy_create(), the &phy->dev has a
> refcount of 1, and only the action of its provider driver (calling
> phy_destroy() directly or through devres, on driver unbind) can drop
> that refcount to 0. As long as the driver doesn't do that, the &phy->dev
> refcount is not in danger.

Right.

> 
> Then the PHY is exposed to the outside world using of_phy_provider_register(),
> where the fact that the provider driver can concurrently unregister
> starts being a problem.
> 
> Therefore, I would say that the problem is that consumers, aka phy_get()
> callers, can get a phy with a &phy->dev refcount that is not already bumped
> by the time they are handed over that PHY.
> 
> Mechanically, PHY lookup happens under &phy_provider_mutex, and I
> believe it would be sufficient to call get_device() under this lock, in
> order for consumers to get PHYs with their reference bumped.

<skip the explanation and a patch>

Yes, given that there is 1:1 mapping between device and provider (not
phy instance but its parent) it looks like we can drop the reference and
then bump it up again as long as we hold that mutex.

> 
> The patch above leaves a few loose ends.
> 
> The most obvious is of_phy_simple_xlate(), which has that put_device()
> to balance out class_find_device_by_of_node() - which bumps the device
> refcount to 2. It "looks" wrong but it is consistent with vendor
> implementations of of_xlate(), which also provide a phy->dev refcount of 1.
> And as explained, the refcount never _actually_ drops to 0 with the
> above patch.
> 
> I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
> Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
> would offer a new helper, devm_of_phy_simple_provider_register(), which
> would internally use a callback that doesn't drop the refcount (say
> __of_phy_simple_xlate() for lack of imagination). I would convert vendor
> PHY drivers one by one to use this (it would be valid at any time to use
> either the old or the new method).
> 
> You'd notice that most of the of_phy_simple_xlate() occurrences go away,
> except for freescale/phy-fsl-lynx-28g.c which calls this function
> directly from its own lynx_28g_xlate(). It will still have to keep
> calling it, and for refcount equalization purposes that weird
> put_device() will have to continue to exist. Just leave a comment as to
> why that is.

I this case I would simply add a comment telling why the reference
should (and can) be dropped where it is being dropped in 
of_phy_simple_xlate() and call it a day. There is not much benefit in
adding another helper.

> 
> 
> But there are more important loose ends still.
> 
> I mentioned that "zombie" device. We've solved the memory safety issue,
> but it's possible for consumers to hold onto a phy whose provider has
> disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> technically it's a valid object, but from PHY API perspective, it's
> still possible to call phy_init(), phy_power_on() and friends on this
> PHY, and the Generic PHY core will be happy to further call into the
> phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> so it should really be left alone.
> 
> If we fix the UAF but leave the zombie PHY problem, we've effectively
> done nothing but silence static analysis checkers, while the code path
> where the PHY provider unbinds effectively remains treated as poorly as
> before, just moving the crashes to a different place.
> 
> I suspect what needs to be done here is to introduce a "bool dead" or
> similar, which is to be set from phy_destroy() and checked from every
> API call. The idea is that API functions on zombie PHYs should fail
> without calling into their driver. I **suppose** that
> try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> it just protects against module removal, not against "echo device >
> /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.

I think this is a problem common to many kernel subsystems, where there
are devices that are vital for other devices to function, but are not
parents of said devices. Think about regulators or clocks and such.
In many such cases even validating at API level is not sufficient,
because if you enable a clock you typically keep it running at least for
a while, if not for entire duration of device being bound to a driver.
If that clock goes away the device will break even though you are not
calling any clock APIs at that particular time.

We need some kind of revocation mechanism to signal consumers that
providers they rely upon are going away.

> 
> 
> Finally, I have identified one more loose end still.
> 
> /**
>  * of_phy_put() - release the PHY
>  * @phy: the phy returned by of_phy_get()
>  *
>  * Releases a refcount the caller received from of_phy_get().
>  */
> void of_phy_put(struct phy *phy)
> {
> 	if (!phy || IS_ERR(phy))
> 		return;
> 
> 	mutex_lock(&phy->mutex);
> 	if (phy->ops->release)
> 		phy->ops->release(phy);
> 	mutex_unlock(&phy->mutex);
> 
> 	module_put(phy->ops->owner);
> 	put_device(&phy->dev);
> }
> EXPORT_SYMBOL_GPL(of_phy_put);
> 
> This function is called by PHY consumers. A PHY provider can have
> multiple consumers (example in the case of Ethernet: a QSGMII SerDes
> lane has 4 MAC ports multiplexed onto it).
> 
> If a single consumer calls of_phy_put() to release its reference to the
> SerDes lane, the phy->ops->release() method makes absolutely no sense.
> There are 3 remaining consumers with handles to the lane! But we aren't
> even telling the PHY which consumer has disappeared! It has nothing
> useful to do with this information.
> 
> Looking at actual phy_ops implementations, what they want to know is
> when _all_ consumers went away, not when individual consumers did.
> So the phy->ops->release() call needs to be put somewhere which is
> executed when all consumers disappear. If I were to guess, that would be
> the phy_release() class callback, but this is completely untested.

Shouldn't phy->ops->release() keep track of users and decide when to
destroy the resources? The rest seem fine as they simply drop references
(I assume each user of shared phy will bump up references as needed?)

> 
> Sorry that this email is a bit long, it got a bit late writing it and I
> don't have a lot of energy left to trim it down.
> In summary I found 4 highly related problems:
> - use after free
> - misleading of_phy_simple_xlate() API pattern (not a functional issue)
> - zombie PHYs
> - premature release call

Thanks.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-24  0:37       ` Dmitry Torokhov
@ 2026-02-24 15:47         ` Vladimir Oltean
  2026-02-24 20:38           ` Dmitry Torokhov
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2026-02-24 15:47 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong,
	Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold,
	Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin,
	Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy,
	linux-kernel

On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote:
> > You have hit on a weak spot of the Generic PHY framework and I would
> > like to encourage you to follow through with patches for this finding
> > (although it won't be exactly trivial, I think it's doable with some
> > determination).
> 
> Thank you for the detailed response. I am not sure how much time I can
> spend on phy rework as this change was an essentially a drive-by. I only
> happen to look there because I want to remove
> class_find_device_by_of_node() in favor of class_find_device_by_fwnode().

OK, I believe you can still do that without any dependency.

> > On your direct question:
> > It will impact downstream in subtle and unpleasant ways if you change
> > the of_xlate() semantics w.r.t. reference counting, but the code will
> > still compile just as before, i.e. when looking just at your downstream
> > driver (what 99% of developers do) there will be no obvious way of
> > knowing that something in the API has changed. I would avoid doing that.
> 
> Hmm, I was not aware that we needed to take particular measures for
> benefit of downstream trees... 

I think you're entirely missing the point.

Obviously it benefits the maintainer/reviewer first and foremost. If you
don't make the API change unmissable, drivers expecting the old convention
will eventually creep their way into the mainline kernel. Why put avoidable
pressure on reviewers watching out for things like this for years to come,
when you can force downstream to notice and adapt (OR not make the
change in the first place). Note that downstream means "API consumers
invisible to you, the API changer", not only "hopelessly un-upstreamable
drivers".

> > The patch above leaves a few loose ends.
> > 
> > The most obvious is of_phy_simple_xlate(), which has that put_device()
> > to balance out class_find_device_by_of_node() - which bumps the device
> > refcount to 2. It "looks" wrong but it is consistent with vendor
> > implementations of of_xlate(), which also provide a phy->dev refcount of 1.
> > And as explained, the refcount never _actually_ drops to 0 with the
> > above patch.
> > 
> > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
> > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
> > would offer a new helper, devm_of_phy_simple_provider_register(), which
> > would internally use a callback that doesn't drop the refcount (say
> > __of_phy_simple_xlate() for lack of imagination). I would convert vendor
> > PHY drivers one by one to use this (it would be valid at any time to use
> > either the old or the new method).
> > 
> > You'd notice that most of the of_phy_simple_xlate() occurrences go away,
> > except for freescale/phy-fsl-lynx-28g.c which calls this function
> > directly from its own lynx_28g_xlate(). It will still have to keep
> > calling it, and for refcount equalization purposes that weird
> > put_device() will have to continue to exist. Just leave a comment as to
> > why that is.
> 
> I this case I would simply add a comment telling why the reference
> should (and can) be dropped where it is being dropped in 
> of_phy_simple_xlate() and call it a day. There is not much benefit in
> adding another helper.

OK.

> > But there are more important loose ends still.
> > 
> > I mentioned that "zombie" device. We've solved the memory safety issue,
> > but it's possible for consumers to hold onto a phy whose provider has
> > disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> > technically it's a valid object, but from PHY API perspective, it's
> > still possible to call phy_init(), phy_power_on() and friends on this
> > PHY, and the Generic PHY core will be happy to further call into the
> > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> > so it should really be left alone.
> > 
> > If we fix the UAF but leave the zombie PHY problem, we've effectively
> > done nothing but silence static analysis checkers, while the code path
> > where the PHY provider unbinds effectively remains treated as poorly as
> > before, just moving the crashes to a different place.
> > 
> > I suspect what needs to be done here is to introduce a "bool dead" or
> > similar, which is to be set from phy_destroy() and checked from every
> > API call. The idea is that API functions on zombie PHYs should fail
> > without calling into their driver. I **suppose** that
> > try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> > it just protects against module removal, not against "echo device >
> > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.
> 
> I think this is a problem common to many kernel subsystems, where there
> are devices that are vital for other devices to function, but are not
> parents of said devices. Think about regulators or clocks and such.
> In many such cases even validating at API level is not sufficient,
> because if you enable a clock you typically keep it running at least for
> a while, if not for entire duration of device being bound to a driver.
> If that clock goes away the device will break even though you are not
> calling any clock APIs at that particular time.
> 
> We need some kind of revocation mechanism to signal consumers that
> providers they rely upon are going away.

Yeah, this gave me pause.

When you unbind a Generic PHY you can kind of expect that the data path
it affects to not work. But you'd sort of expect it to start working
again when you rebind the PHY driver.

That will not be the case, though.

The problem, simply put, is that the struct phy that the consumer has
will be different from the second struct phy that the provider registers
when it rebinds. Using the stale struct phy, we cannot reach the phy_ops
of the new provider.

If we implement something similar to what Bartosz Golaszewski suggested
here:
https://lpc.events/event/17/contributions/1627/

aka decouple the struct phy given to the consumer from the struct phy
provided by the provider, and perform a PHY lookup during each
phy_init() / phy_power_up() / etc etc operation;

we are kinda able to:
- match the consumer phy to the provider phy and forward the call to
  phy_ops, if the provider is present (or rebound)
- return -ENODEV instead of calling into phy_ops, if the provider is not
  present

But there's still one big gap.

PHY consumers are driving the PHY through a state machine when they do
phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...].
When the provider unbinds and then rebinds, that state is lost. But the
consumer has no idea. It knows it is in a state where it called
phy_power_up(), and next thing it knows, the power_count is 1 and it
must call phy_power_down().

The Generic PHY core cannot actually replay the entire state in a
meaningful way so as to make the provider reprobing completely
transparent (think of phy_calibrate() calls).

So I guess we're looking at what Bartosz refers to as a notification
side-channel that the PHY provider is going away.

At least, for drivers that care in a meaningful way. For drivers that
don't care about such complexity, there seems to be a simpler answer:
device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER).

I can also answer why device links with "autoremove consumer" are not a
universal answer. This is because the consumer itself may be a
multi-port network switch (single device). If you unbind the Generic PHY
driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and
2 to also disappear from the kernel. You need the more complex PHY
provider disappearance notification which does something more localized
(calls phylink_stop(), I don't know).

I can say right off the bat that this is too complicated for me to even
think about in more detail than that, at the moment. I would be quite
happy if it's possible to unbind the PHY driver without the possibility
to rebind it, as a first step.

> > Finally, I have identified one more loose end still.
> > 
> > /**
> >  * of_phy_put() - release the PHY
> >  * @phy: the phy returned by of_phy_get()
> >  *
> >  * Releases a refcount the caller received from of_phy_get().
> >  */
> > void of_phy_put(struct phy *phy)
> > {
> > 	if (!phy || IS_ERR(phy))
> > 		return;
> > 
> > 	mutex_lock(&phy->mutex);
> > 	if (phy->ops->release)
> > 		phy->ops->release(phy);
> > 	mutex_unlock(&phy->mutex);
> > 
> > 	module_put(phy->ops->owner);
> > 	put_device(&phy->dev);
> > }
> > EXPORT_SYMBOL_GPL(of_phy_put);
> > 
> > This function is called by PHY consumers. A PHY provider can have
> > multiple consumers (example in the case of Ethernet: a QSGMII SerDes
> > lane has 4 MAC ports multiplexed onto it).
> > 
> > If a single consumer calls of_phy_put() to release its reference to the
> > SerDes lane, the phy->ops->release() method makes absolutely no sense.
> > There are 3 remaining consumers with handles to the lane! But we aren't
> > even telling the PHY which consumer has disappeared! It has nothing
> > useful to do with this information.
> > 
> > Looking at actual phy_ops implementations, what they want to know is
> > when _all_ consumers went away, not when individual consumers did.
> > So the phy->ops->release() call needs to be put somewhere which is
> > executed when all consumers disappear. If I were to guess, that would be
> > the phy_release() class callback, but this is completely untested.
> 
> Shouldn't phy->ops->release() keep track of users and decide when to
> destroy the resources? The rest seem fine as they simply drop references
> (I assume each user of shared phy will bump up references as needed?)

AFAICS, the only instance is phy/ti/phy-am654-serdes.c: serdes_am654_release().

This undoes the effect of serdes_am654_xlate(), i.e. calls mux_control_deselect(phy->control).

The fact that phy_provider->of_xlate() has side effects on this
platform, rather than just return a struct phy * for a given struct
device_node * (as it's supposed to do), is "interesting".

I don't have extra info why the PHY maintainer didn't add extra phy_ops
for consumer_bind() and consumer_unbind() or something like that, to
make it clear that these calls are supposed to be _per consumer_.
Notice how the first of_xlate() call sets am654_phy->busy = true, and
second of_xlate() call is designed to fail due to am654_phy->busy.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-24 15:47         ` Vladimir Oltean
@ 2026-02-24 20:38           ` Dmitry Torokhov
  2026-02-25 23:50             ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Torokhov @ 2026-02-24 20:38 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong,
	Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold,
	Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin,
	Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy,
	linux-kernel

On Tue, Feb 24, 2026 at 05:47:30PM +0200, Vladimir Oltean wrote:
> On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote:
> > > You have hit on a weak spot of the Generic PHY framework and I would
> > > like to encourage you to follow through with patches for this finding
> > > (although it won't be exactly trivial, I think it's doable with some
> > > determination).
> > 
> > Thank you for the detailed response. I am not sure how much time I can
> > spend on phy rework as this change was an essentially a drive-by. I only
> > happen to look there because I want to remove
> > class_find_device_by_of_node() in favor of class_find_device_by_fwnode().
> 
> OK, I believe you can still do that without any dependency.

Right, this is independent, I just happened to look at that function.

> 
> > > On your direct question:
> > > It will impact downstream in subtle and unpleasant ways if you change
> > > the of_xlate() semantics w.r.t. reference counting, but the code will
> > > still compile just as before, i.e. when looking just at your downstream
> > > driver (what 99% of developers do) there will be no obvious way of
> > > knowing that something in the API has changed. I would avoid doing that.
> > 
> > Hmm, I was not aware that we needed to take particular measures for
> > benefit of downstream trees... 
> 
> I think you're entirely missing the point.
> 
> Obviously it benefits the maintainer/reviewer first and foremost. If you
> don't make the API change unmissable, drivers expecting the old convention
> will eventually creep their way into the mainline kernel. Why put avoidable
> pressure on reviewers watching out for things like this for years to come,
> when you can force downstream to notice and adapt (OR not make the
> change in the first place). Note that downstream means "API consumers
> invisible to you, the API changer", not only "hopelessly un-upstreamable
> drivers".

Well, this is whole "stable kernel API" argument. Rules of lifetime
around objects may change and the kernel needs to adapt. Anyway, this is
your subsystem so you get to decide.

> 
> > > The patch above leaves a few loose ends.
> > > 
> > > The most obvious is of_phy_simple_xlate(), which has that put_device()
> > > to balance out class_find_device_by_of_node() - which bumps the device
> > > refcount to 2. It "looks" wrong but it is consistent with vendor
> > > implementations of of_xlate(), which also provide a phy->dev refcount of 1.
> > > And as explained, the refcount never _actually_ drops to 0 with the
> > > above patch.
> > > 
> > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
> > > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
> > > would offer a new helper, devm_of_phy_simple_provider_register(), which
> > > would internally use a callback that doesn't drop the refcount (say
> > > __of_phy_simple_xlate() for lack of imagination). I would convert vendor
> > > PHY drivers one by one to use this (it would be valid at any time to use
> > > either the old or the new method).
> > > 
> > > You'd notice that most of the of_phy_simple_xlate() occurrences go away,
> > > except for freescale/phy-fsl-lynx-28g.c which calls this function
> > > directly from its own lynx_28g_xlate(). It will still have to keep
> > > calling it, and for refcount equalization purposes that weird
> > > put_device() will have to continue to exist. Just leave a comment as to
> > > why that is.
> > 
> > I this case I would simply add a comment telling why the reference
> > should (and can) be dropped where it is being dropped in 
> > of_phy_simple_xlate() and call it a day. There is not much benefit in
> > adding another helper.
> 
> OK.
> 
> > > But there are more important loose ends still.
> > > 
> > > I mentioned that "zombie" device. We've solved the memory safety issue,
> > > but it's possible for consumers to hold onto a phy whose provider has
> > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> > > technically it's a valid object, but from PHY API perspective, it's
> > > still possible to call phy_init(), phy_power_on() and friends on this
> > > PHY, and the Generic PHY core will be happy to further call into the
> > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> > > so it should really be left alone.
> > > 
> > > If we fix the UAF but leave the zombie PHY problem, we've effectively
> > > done nothing but silence static analysis checkers, while the code path
> > > where the PHY provider unbinds effectively remains treated as poorly as
> > > before, just moving the crashes to a different place.
> > > 
> > > I suspect what needs to be done here is to introduce a "bool dead" or
> > > similar, which is to be set from phy_destroy() and checked from every
> > > API call. The idea is that API functions on zombie PHYs should fail
> > > without calling into their driver. I **suppose** that
> > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> > > it just protects against module removal, not against "echo device >
> > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.
> > 
> > I think this is a problem common to many kernel subsystems, where there
> > are devices that are vital for other devices to function, but are not
> > parents of said devices. Think about regulators or clocks and such.
> > In many such cases even validating at API level is not sufficient,
> > because if you enable a clock you typically keep it running at least for
> > a while, if not for entire duration of device being bound to a driver.
> > If that clock goes away the device will break even though you are not
> > calling any clock APIs at that particular time.
> > 
> > We need some kind of revocation mechanism to signal consumers that
> > providers they rely upon are going away.
> 
> Yeah, this gave me pause.
> 
> When you unbind a Generic PHY you can kind of expect that the data path
> it affects to not work. But you'd sort of expect it to start working
> again when you rebind the PHY driver.
> 
> That will not be the case, though.
> 
> The problem, simply put, is that the struct phy that the consumer has
> will be different from the second struct phy that the provider registers
> when it rebinds. Using the stale struct phy, we cannot reach the phy_ops
> of the new provider.
> 
> If we implement something similar to what Bartosz Golaszewski suggested
> here:
> https://lpc.events/event/17/contributions/1627/

There is also "revocable" from Tzung-Bi Shih.

> 
> aka decouple the struct phy given to the consumer from the struct phy
> provided by the provider, and perform a PHY lookup during each
> phy_init() / phy_power_up() / etc etc operation;
> 
> we are kinda able to:
> - match the consumer phy to the provider phy and forward the call to
>   phy_ops, if the provider is present (or rebound)
> - return -ENODEV instead of calling into phy_ops, if the provider is not
>   present
> 
> But there's still one big gap.
> 
> PHY consumers are driving the PHY through a state machine when they do
> phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...].
> When the provider unbinds and then rebinds, that state is lost. But the
> consumer has no idea. It knows it is in a state where it called
> phy_power_up(), and next thing it knows, the power_count is 1 and it
> must call phy_power_down().
> 
> The Generic PHY core cannot actually replay the entire state in a
> meaningful way so as to make the provider reprobing completely
> transparent (think of phy_calibrate() calls).
> 
> So I guess we're looking at what Bartosz refers to as a notification
> side-channel that the PHY provider is going away.
> 
> At least, for drivers that care in a meaningful way. For drivers that
> don't care about such complexity, there seems to be a simpler answer:
> device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER).

Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the
consumer when provider goes away? And if it happens does it happen in
the right order?

> 
> I can also answer why device links with "autoremove consumer" are not a
> universal answer. This is because the consumer itself may be a
> multi-port network switch (single device). If you unbind the Generic PHY
> driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and
> 2 to also disappear from the kernel. You need the more complex PHY
> provider disappearance notification which does something more localized
> (calls phylink_stop(), I don't know).

Aren't the ports represented as individual devices with their own state
and lifetime?

> 
> 
> I can say right off the bat that this is too complicated for me to even
> think about in more detail than that, at the moment. I would be quite
> happy if it's possible to unbind the PHY driver without the possibility
> to rebind it, as a first step.

Yes, this is something that is not phy specific and I think should be
solved at driver core (or at least driver core should provide subsystems
tools to solve this).

Thanks.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-24 20:38           ` Dmitry Torokhov
@ 2026-02-25 23:50             ` Vladimir Oltean
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Oltean @ 2026-02-25 23:50 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Vinod Koul, Kishon Vijay Abraham I, Neil Armstrong,
	Rafael J. Wysocki, Geert Uytterhoeven, Johan Hovold,
	Claudiu Beznea, Dr. David Alan Gilbert, Peter Griffin,
	Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu, linux-phy,
	linux-kernel

On Tue, Feb 24, 2026 at 12:38:48PM -0800, Dmitry Torokhov wrote:
> > Obviously it benefits the maintainer/reviewer first and foremost. If you
> > don't make the API change unmissable, drivers expecting the old convention
> > will eventually creep their way into the mainline kernel. Why put avoidable
> > pressure on reviewers watching out for things like this for years to come,
> > when you can force downstream to notice and adapt (OR not make the
> > change in the first place). Note that downstream means "API consumers
> > invisible to you, the API changer", not only "hopelessly un-upstreamable
> > drivers".
> 
> Well, this is whole "stable kernel API" argument. Rules of lifetime
> around objects may change and the kernel needs to adapt.

Actually it's not the "stable kernel API" argument.

I just said that if the of_xlate() callbacks currently need to do:

	return phy;

and you propose a new scheme where they have to do:

	get_device(&phy->dev);
	return phy;

, then you are creating the possibility for:

- the API change to be missed by anyone who hasn't been _actively_
  monitoring you, because the old code will still compile just as well,
  and won't show any superficial runtime problem either
- the reviewers and maintainers to not notice that patches were
  developed according to the old convention

and such patches may even get accepted, but be buggy.

I actually don't care whether the kernel API is stable or not. But
if you change it, make it obvious it has changed.

> Anyway, this is your subsystem so you get to decide.

Not my subsystem. Sorry if I gave you the wrong impression.

> > > > But there are more important loose ends still.
> > > > 
> > > > I mentioned that "zombie" device. We've solved the memory safety issue,
> > > > but it's possible for consumers to hold onto a phy whose provider has
> > > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> > > > technically it's a valid object, but from PHY API perspective, it's
> > > > still possible to call phy_init(), phy_power_on() and friends on this
> > > > PHY, and the Generic PHY core will be happy to further call into the
> > > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> > > > so it should really be left alone.
> > > > 
> > > > If we fix the UAF but leave the zombie PHY problem, we've effectively
> > > > done nothing but silence static analysis checkers, while the code path
> > > > where the PHY provider unbinds effectively remains treated as poorly as
> > > > before, just moving the crashes to a different place.
> > > > 
> > > > I suspect what needs to be done here is to introduce a "bool dead" or
> > > > similar, which is to be set from phy_destroy() and checked from every
> > > > API call. The idea is that API functions on zombie PHYs should fail
> > > > without calling into their driver. I **suppose** that
> > > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> > > > it just protects against module removal, not against "echo device >
> > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.
> > > 
> > > I think this is a problem common to many kernel subsystems, where there
> > > are devices that are vital for other devices to function, but are not
> > > parents of said devices. Think about regulators or clocks and such.
> > > In many such cases even validating at API level is not sufficient,
> > > because if you enable a clock you typically keep it running at least for
> > > a while, if not for entire duration of device being bound to a driver.
> > > If that clock goes away the device will break even though you are not
> > > calling any clock APIs at that particular time.
> > > 
> > > We need some kind of revocation mechanism to signal consumers that
> > > providers they rely upon are going away.
> > 
> > Yeah, this gave me pause.
> > 
> > When you unbind a Generic PHY you can kind of expect that the data path
> > it affects to not work. But you'd sort of expect it to start working
> > again when you rebind the PHY driver.
> > 
> > That will not be the case, though.
> > 
> > The problem, simply put, is that the struct phy that the consumer has
> > will be different from the second struct phy that the provider registers
> > when it rebinds. Using the stale struct phy, we cannot reach the phy_ops
> > of the new provider.
> > 
> > If we implement something similar to what Bartosz Golaszewski suggested
> > here:
> > https://lpc.events/event/17/contributions/1627/
> 
> There is also "revocable" from Tzung-Bi Shih.

OK. Thanks for the heads up.

> > aka decouple the struct phy given to the consumer from the struct phy
> > provided by the provider, and perform a PHY lookup during each
> > phy_init() / phy_power_up() / etc etc operation;
> > 
> > we are kinda able to:
> > - match the consumer phy to the provider phy and forward the call to
> >   phy_ops, if the provider is present (or rebound)
> > - return -ENODEV instead of calling into phy_ops, if the provider is not
> >   present
> > 
> > But there's still one big gap.
> > 
> > PHY consumers are driving the PHY through a state machine when they do
> > phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...].
> > When the provider unbinds and then rebinds, that state is lost. But the
> > consumer has no idea. It knows it is in a state where it called
> > phy_power_up(), and next thing it knows, the power_count is 1 and it
> > must call phy_power_down().
> > 
> > The Generic PHY core cannot actually replay the entire state in a
> > meaningful way so as to make the provider reprobing completely
> > transparent (think of phy_calibrate() calls).
> > 
> > So I guess we're looking at what Bartosz refers to as a notification
> > side-channel that the PHY provider is going away.
> > 
> > At least, for drivers that care in a meaningful way. For drivers that
> > don't care about such complexity, there seems to be a simpler answer:
> > device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER).
> 
> Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the
> consumer when provider goes away? And if it happens does it happen in
> the right order?

If "the right order" means "consumer before provider", yes.

> > I can also answer why device links with "autoremove consumer" are not a
> > universal answer. This is because the consumer itself may be a
> > multi-port network switch (single device). If you unbind the Generic PHY
> > driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and
> > 2 to also disappear from the kernel. You need the more complex PHY
> > provider disappearance notification which does something more localized
> > (calls phylink_stop(), I don't know).
> 
> Aren't the ports represented as individual devices with their own state
> and lifetime?

Nope. A single platform_device/pci_device/spi_device/i2c_client/mdio_device/etc
represents an entire switch, and its driver will call register_netdevice()
multiple times, once per port. This is the switchdev model as per
Documentation/networking/switchdev.rst. Technically each netdev has its
own "class" device (for /sys/class/net/) but it's not a "bus" device,
doesn't have a bound driver, and their parent is the aforementioned
platform_device/etc.

> > I can say right off the bat that this is too complicated for me to even
> > think about in more detail than that, at the moment. I would be quite
> > happy if it's possible to unbind the PHY driver without the possibility
> > to rebind it, as a first step.
> 
> Yes, this is something that is not phy specific and I think should be
> solved at driver core (or at least driver core should provide subsystems
> tools to solve this).

Some subsystems I'm slightly more familiar with do have mechanisms to
handle this.

Networking has (a pretty complicated) netdev_wait_allrefs_any(), which
waits for all references to a net_device to be dropped before fully
unregistering it (these are net_device :: dev_refcnt, different from
struct device references). Drivers are supposed to monitor the netdev
notifier call chain for the NETDEV_UNREGISTER event, and let go of the
reference.

Sadly I don't know enough about the Generic PHY subsystem to be making
generalizations with any claim of correctness, but I suspect the
majority of consumers are in a 1:1 relationship with the PHY provider,
and would be happy with the cost/benefit ratio given by the
device_link(DL_FLAG_AUTOREMOVE_CONSUMER) behaviour. Only a select few
would care to continue their lifetime in a PHY-less environment
(normal operation on other ports / degraded on the PHY-less port,
reestablish connection to provider when it comes back, etc).

I'm not sure that there exists a space for "revocable" in Generic PHY
between the above two options. Only if the device link idea doesn't work
out, for some reason I'm not seeing. I'll have to put this thought in
the background and let it simmer for a while.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()
  2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov
  2026-02-20  0:11 ` Dmitry Torokhov
@ 2026-02-28  3:44 ` Zijun Hu
  1 sibling, 0 replies; 9+ messages in thread
From: Zijun Hu @ 2026-02-28  3:44 UTC (permalink / raw)
  To: Dmitry Torokhov, Vinod Koul
  Cc: Neil Armstrong, Rafael J. Wysocki, Geert Uytterhoeven,
	Johan Hovold, Claudiu Beznea, Dr. David Alan Gilbert,
	Peter Griffin, Dmitry Baryshkov, Krzysztof Kozlowski, Zijun Hu,
	linux-phy, linux-kernel

On 2/20/26 07:57, Dmitry Torokhov wrote:
> The implementation put_device()s located device and then uses
> container_of() on the pointer. The device may disappear by that time,
> resulting in UAF.
> 
> Fix the problem by keeping the reference to the framer device,
> avoiding getting an extra reference to it in framer_get(), and making
> sure to drop the reference in error path when we fail to get the module.
> 
> Fixes: e6625db66212 ("phy: core: Simplify API of_phy_simple_xlate() implementation")

this fix tag is wrong as explained by below comments.

> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> ---
>  drivers/phy/phy-core.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
> index 4ad396214d0c..cf62eb9ddca9 100644
> --- a/drivers/phy/phy-core.c
> +++ b/drivers/phy/phy-core.c
> @@ -682,10 +682,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
>  	if (IS_ERR(phy))
>  		return phy;
>  
> -	if (!try_module_get(phy->ops->owner))
> +	if (!try_module_get(phy->ops->owner)) {
> +		put_device(&phy->dev);
>  		return ERR_PTR(-EPROBE_DEFER);
> -
> -	get_device(&phy->dev);
> +	}
>  
>  	return phy;
>  }
> @@ -765,7 +765,6 @@ struct phy *of_phy_simple_xlate(struct device *dev,
>  	if (!target_dev)
>  		return ERR_PTR(-ENODEV);
>  
> -	put_device(target_dev);

put reference count of @target_dev got by class_find_device_by_of_node()
so the following commit mentioned by the fix tag does not change the reference
count.
https://lore.kernel.org/all/20241213-phy_core_fix-v6-6-40ae28f5015a@quicinc.com/


>  	return to_phy(target_dev);
>  }
>  EXPORT_SYMBOL_GPL(of_phy_simple_xlate);


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-02-28  3:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-19 23:57 [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Dmitry Torokhov
2026-02-20  0:11 ` Dmitry Torokhov
2026-02-21  1:01   ` Dmitry Torokhov
2026-02-23 23:15     ` Vladimir Oltean
2026-02-24  0:37       ` Dmitry Torokhov
2026-02-24 15:47         ` Vladimir Oltean
2026-02-24 20:38           ` Dmitry Torokhov
2026-02-25 23:50             ` Vladimir Oltean
2026-02-28  3:44 ` Zijun Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox