public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
@ 2026-04-15 17:41 Guangshuo Li
  2026-04-15 18:19 ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Guangshuo Li @ 2026-04-15 17:41 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland, Anshuman Khandual, linux-arm-kernel,
	linux-perf-users, linux-kernel
  Cc: Guangshuo Li, stable

When platform_device_register() fails in arm_acpi_register_pmu_device(),
the embedded struct device in pdev has already been initialized by
device_initialize(), but the failure path only unregisters the GSI and
does not drop the device reference for the current platform device:

  arm_acpi_register_pmu_device()
    -> platform_device_register(pdev)
       -> device_initialize(&pdev->dev)
       -> setup_pdev_dma_masks(pdev)
       -> platform_device_add(pdev)

This leads to a reference leak when platform_device_register() fails.
Fix this by calling platform_device_put() after unregistering the GSI.

The issue was identified by a static analysis tool I developed and
confirmed by manual review.

Fixes: 81e5ee4716098 ("arm_pmu: acpi: Refactor arm_spe_acpi_register_device()")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
---
 drivers/perf/arm_pmu_acpi.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/perf/arm_pmu_acpi.c b/drivers/perf/arm_pmu_acpi.c
index e80f76d95e68..5ce382661e34 100644
--- a/drivers/perf/arm_pmu_acpi.c
+++ b/drivers/perf/arm_pmu_acpi.c
@@ -119,8 +119,10 @@ arm_acpi_register_pmu_device(struct platform_device *pdev, u8 len,
 
 	pdev->resource[0].start = irq;
 	ret = platform_device_register(pdev);
-	if (ret)
+	if (ret) {
 		acpi_unregister_gsi(gsi);
+		platform_device_put(pdev);
+	}
 
 	return ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-15 17:41 [PATCH] arm_pmu: acpi: fix reference leak on failed device registration Guangshuo Li
@ 2026-04-15 18:19 ` Mark Rutland
  2026-04-16  4:40   ` Greg Kroah-Hartman
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2026-04-15 18:19 UTC (permalink / raw)
  To: Guangshuo Li, Greg Kroah-Hartman
  Cc: Will Deacon, Anshuman Khandual, linux-arm-kernel,
	linux-perf-users, linux-kernel, stable

Hi,

Thanks for the patch, but from a quick skim, I don't think this is the right
fix.

Greg, I think we might want to rework the core API here; question for
you at the end.

On Thu, Apr 16, 2026 at 01:41:59AM +0800, Guangshuo Li wrote:
> When platform_device_register() fails in arm_acpi_register_pmu_device(),
> the embedded struct device in pdev has already been initialized by
> device_initialize(), but the failure path only unregisters the GSI and
> does not drop the device reference for the current platform device:
> 
>   arm_acpi_register_pmu_device()
>     -> platform_device_register(pdev)
>        -> device_initialize(&pdev->dev)
>        -> setup_pdev_dma_masks(pdev)
>        -> platform_device_add(pdev)
> 
> This leads to a reference leak when platform_device_register() fails.

AFAICT you're saying that the reference was taken *within*
platform_device_register(), and then platform_device_register() itself
has failed. I think it's surprising that platform_device_register()
doesn't clean that up itself in the case of an error.

There are *tonnes* of calls to platform_device_register() throughout the
kernel that don't even bother to check the return value, and many that
just pass the return onto a caller that can't possibly know to call
platform_device_put().

Code in the same file as platform_device_register() expects it to clean up
after itself, e.g.

| int platform_add_devices(struct platform_device **devs, int num) 
| {
|         int i, ret = 0; 
| 
|         for (i = 0; i < num; i++) {
|                 ret = platform_device_register(devs[i]);
|                 if (ret) {
|                         while (--i >= 0)
|                                 platform_device_unregister(devs[i]);
|                         break;
|                 }    
|         }    
| 
|         return ret; 
| }

That's been there since the initial git commit, and back then,
platform_device_register() didn't mention that callers needed to perform
any cleanup.

I see a comment was added to platform_device_register() in commit:

  67e532a42cf4 ("driver core: platform: document registration-failure requirement")

... and that copied the commend added for device_register() in commit:

  5739411acbaa ("Driver core: Clarify device cleanup.")

... but the potential brokenness is so widespread, and the behaviour is
so surprising, that I'd argue the real but is that device_register()
doesn't clean up in case of error. I don't think it's worth changing
this single instance given the prevalance and churn fixing all of that
would involve.

I think it would be far better to fix the core driver API such that when
those functions return an error, they've already cleaned up for
themselves.

Greg, am I missing some functional reason why we can't rework
device_register() and friends to handle cleanup themselves? I appreciate
that'll involve churn for some callers, but AFAICT the majority of
callers don't have the required cleanup.

Mark.

> Fix this by calling platform_device_put() after unregistering the GSI.
> 
> The issue was identified by a static analysis tool I developed and
> confirmed by manual review.
> 
> Fixes: 81e5ee4716098 ("arm_pmu: acpi: Refactor arm_spe_acpi_register_device()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
> ---
>  drivers/perf/arm_pmu_acpi.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/perf/arm_pmu_acpi.c b/drivers/perf/arm_pmu_acpi.c
> index e80f76d95e68..5ce382661e34 100644
> --- a/drivers/perf/arm_pmu_acpi.c
> +++ b/drivers/perf/arm_pmu_acpi.c
> @@ -119,8 +119,10 @@ arm_acpi_register_pmu_device(struct platform_device *pdev, u8 len,
>  
>  	pdev->resource[0].start = irq;
>  	ret = platform_device_register(pdev);
> -	if (ret)
> +	if (ret) {
>  		acpi_unregister_gsi(gsi);
> +		platform_device_put(pdev);
> +	}
>  
>  	return ret;
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-15 18:19 ` Mark Rutland
@ 2026-04-16  4:40   ` Greg Kroah-Hartman
  2026-04-16  6:34     ` Guangshuo Li
  2026-04-16  7:23     ` Johan Hovold
  0 siblings, 2 replies; 8+ messages in thread
From: Greg Kroah-Hartman @ 2026-04-16  4:40 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Guangshuo Li, Will Deacon, Anshuman Khandual, linux-arm-kernel,
	linux-perf-users, linux-kernel, stable

On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:
> Hi,
> 
> Thanks for the patch, but from a quick skim, I don't think this is the right
> fix.
> 
> Greg, I think we might want to rework the core API here; question for
> you at the end.
> 
> On Thu, Apr 16, 2026 at 01:41:59AM +0800, Guangshuo Li wrote:
> > When platform_device_register() fails in arm_acpi_register_pmu_device(),
> > the embedded struct device in pdev has already been initialized by
> > device_initialize(), but the failure path only unregisters the GSI and
> > does not drop the device reference for the current platform device:
> > 
> >   arm_acpi_register_pmu_device()
> >     -> platform_device_register(pdev)
> >        -> device_initialize(&pdev->dev)
> >        -> setup_pdev_dma_masks(pdev)
> >        -> platform_device_add(pdev)
> > 
> > This leads to a reference leak when platform_device_register() fails.
> 
> AFAICT you're saying that the reference was taken *within*
> platform_device_register(), and then platform_device_register() itself
> has failed. I think it's surprising that platform_device_register()
> doesn't clean that up itself in the case of an error.
> 
> There are *tonnes* of calls to platform_device_register() throughout the
> kernel that don't even bother to check the return value, and many that
> just pass the return onto a caller that can't possibly know to call
> platform_device_put().
> 
> Code in the same file as platform_device_register() expects it to clean up
> after itself, e.g.
> 
> | int platform_add_devices(struct platform_device **devs, int num) 
> | {
> |         int i, ret = 0; 
> | 
> |         for (i = 0; i < num; i++) {
> |                 ret = platform_device_register(devs[i]);
> |                 if (ret) {
> |                         while (--i >= 0)
> |                                 platform_device_unregister(devs[i]);
> |                         break;
> |                 }    
> |         }    
> | 
> |         return ret; 
> | }
> 
> That's been there since the initial git commit, and back then,
> platform_device_register() didn't mention that callers needed to perform
> any cleanup.
> 
> I see a comment was added to platform_device_register() in commit:
> 
>   67e532a42cf4 ("driver core: platform: document registration-failure requirement")
> 
> ... and that copied the commend added for device_register() in commit:
> 
>   5739411acbaa ("Driver core: Clarify device cleanup.")
> 
> ... but the potential brokenness is so widespread, and the behaviour is
> so surprising, that I'd argue the real but is that device_register()
> doesn't clean up in case of error. I don't think it's worth changing
> this single instance given the prevalance and churn fixing all of that
> would involve.
> 
> I think it would be far better to fix the core driver API such that when
> those functions return an error, they've already cleaned up for
> themselves.
> 
> Greg, am I missing some functional reason why we can't rework
> device_register() and friends to handle cleanup themselves? I appreciate
> that'll involve churn for some callers, but AFAICT the majority of
> callers don't have the required cleanup.

Yes, we should fix the platform core code here, this should not be
required to do everywhere as obviously we all got it wrong.

Guangshuo, can you submit a patch to do that instead and ask for all of
your other patches to not be applied as well?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-16  4:40   ` Greg Kroah-Hartman
@ 2026-04-16  6:34     ` Guangshuo Li
  2026-04-16  7:23     ` Johan Hovold
  1 sibling, 0 replies; 8+ messages in thread
From: Guangshuo Li @ 2026-04-16  6:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mark Rutland, Will Deacon, Anshuman Khandual, linux-arm-kernel,
	linux-perf-users, linux-kernel, stable

Hi Mark, Greg,

Thanks for the feedback.

On Thu, 16 Apr 2026 at 12:41, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:
> > Hi,
> >
> > Thanks for the patch, but from a quick skim, I don't think this is the right
> > fix.
> >
> > Greg, I think we might want to rework the core API here; question for
> > you at the end.
> >
> > On Thu, Apr 16, 2026 at 01:41:59AM +0800, Guangshuo Li wrote:
> > > When platform_device_register() fails in arm_acpi_register_pmu_device(),
> > > the embedded struct device in pdev has already been initialized by
> > > device_initialize(), but the failure path only unregisters the GSI and
> > > does not drop the device reference for the current platform device:
> > >
> > >   arm_acpi_register_pmu_device()
> > >     -> platform_device_register(pdev)
> > >        -> device_initialize(&pdev->dev)
> > >        -> setup_pdev_dma_masks(pdev)
> > >        -> platform_device_add(pdev)
> > >
> > > This leads to a reference leak when platform_device_register() fails.
> >
> > AFAICT you're saying that the reference was taken *within*
> > platform_device_register(), and then platform_device_register() itself
> > has failed. I think it's surprising that platform_device_register()
> > doesn't clean that up itself in the case of an error.
> >
> > There are *tonnes* of calls to platform_device_register() throughout the
> > kernel that don't even bother to check the return value, and many that
> > just pass the return onto a caller that can't possibly know to call
> > platform_device_put().
> >
> > Code in the same file as platform_device_register() expects it to clean up
> > after itself, e.g.
> >
> > | int platform_add_devices(struct platform_device **devs, int num)
> > | {
> > |         int i, ret = 0;
> > |
> > |         for (i = 0; i < num; i++) {
> > |                 ret = platform_device_register(devs[i]);
> > |                 if (ret) {
> > |                         while (--i >= 0)
> > |                                 platform_device_unregister(devs[i]);
> > |                         break;
> > |                 }
> > |         }
> > |
> > |         return ret;
> > | }
> >
> > That's been there since the initial git commit, and back then,
> > platform_device_register() didn't mention that callers needed to perform
> > any cleanup.
> >
> > I see a comment was added to platform_device_register() in commit:
> >
> >   67e532a42cf4 ("driver core: platform: document registration-failure requirement")
> >
> > ... and that copied the commend added for device_register() in commit:
> >
> >   5739411acbaa ("Driver core: Clarify device cleanup.")
> >
> > ... but the potential brokenness is so widespread, and the behaviour is
> > so surprising, that I'd argue the real but is that device_register()
> > doesn't clean up in case of error. I don't think it's worth changing
> > this single instance given the prevalance and churn fixing all of that
> > would involve.
> >
> > I think it would be far better to fix the core driver API such that when
> > those functions return an error, they've already cleaned up for
> > themselves.
> >
> > Greg, am I missing some functional reason why we can't rework
> > device_register() and friends to handle cleanup themselves? I appreciate
> > that'll involve churn for some callers, but AFAICT the majority of
> > callers don't have the required cleanup.
>
> Yes, we should fix the platform core code here, this should not be
> required to do everywhere as obviously we all got it wrong.
>
> Guangshuo, can you submit a patch to do that instead and ask for all of
> your other patches to not be applied as well?
>
> thanks,
>
> greg k-h

I agree that fixing this in the platform core makes more sense than
handling it in individual callers.

I'll look into the core code and send a patch for that instead. I'll
also ask for my other related patches not to be applied.

Thanks,
Guangshuo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-16  4:40   ` Greg Kroah-Hartman
  2026-04-16  6:34     ` Guangshuo Li
@ 2026-04-16  7:23     ` Johan Hovold
  2026-04-16  8:59       ` Guangshuo Li
  2026-04-16  9:30       ` Mark Rutland
  1 sibling, 2 replies; 8+ messages in thread
From: Johan Hovold @ 2026-04-16  7:23 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mark Rutland, Guangshuo Li, Will Deacon, Anshuman Khandual,
	linux-arm-kernel, linux-perf-users, linux-kernel, stable

On Thu, Apr 16, 2026 at 06:40:55AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:

> > AFAICT you're saying that the reference was taken *within*
> > platform_device_register(), and then platform_device_register() itself
> > has failed. I think it's surprising that platform_device_register()
> > doesn't clean that up itself in the case of an error.
> > 
> > There are *tonnes* of calls to platform_device_register() throughout the
> > kernel that don't even bother to check the return value, and many that
> > just pass the return onto a caller that can't possibly know to call
> > platform_device_put().
> > 
> > Code in the same file as platform_device_register() expects it to clean up
> > after itself, e.g.
> > 
> > | int platform_add_devices(struct platform_device **devs, int num) 
> > | {
> > |         int i, ret = 0; 
> > | 
> > |         for (i = 0; i < num; i++) {
> > |                 ret = platform_device_register(devs[i]);
> > |                 if (ret) {
> > |                         while (--i >= 0)
> > |                                 platform_device_unregister(devs[i]);
> > |                         break;
> > |                 }    
> > |         }    
> > | 
> > |         return ret; 
> > | }
> > 
> > That's been there since the initial git commit, and back then,
> > platform_device_register() didn't mention that callers needed to perform
> > any cleanup.
> > 
> > I see a comment was added to platform_device_register() in commit:
> > 
> >   67e532a42cf4 ("driver core: platform: document registration-failure requirement")
> > 
> > ... and that copied the commend added for device_register() in commit:
> > 
> >   5739411acbaa ("Driver core: Clarify device cleanup.")
> > 
> > ... but the potential brokenness is so widespread, and the behaviour is
> > so surprising, that I'd argue the real but is that device_register()
> > doesn't clean up in case of error. I don't think it's worth changing
> > this single instance given the prevalance and churn fixing all of that
> > would involve.
> > 
> > I think it would be far better to fix the core driver API such that when
> > those functions return an error, they've already cleaned up for
> > themselves.
> > 
> > Greg, am I missing some functional reason why we can't rework
> > device_register() and friends to handle cleanup themselves? I appreciate
> > that'll involve churn for some callers, but AFAICT the majority of
> > callers don't have the required cleanup.
> 
> Yes, we should fix the platform core code here, this should not be
> required to do everywhere as obviously we all got it wrong.

It's not just the platform code as this directly reflects the behaviour
of device_register() as Mark pointed out.

It is indeed an unfortunate quirk of the driver model, but one can argue
that having a registration function that frees its argument on errors
would be even worse. And even more so when many (or most) users get this
right.

So if we want to change this, I think we would need to deprecate
device_register() in favour of explicit device_initialize() and
device_add().

That said, most users of platform_device_register() appear to operate
on static platform devices which don't even have a release function and
would trigger a WARN() if we ever drop the reference (which is arguably
worse than leaking a tiny bit of memory).

So leaving things as-is is also an option.

Johan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-16  7:23     ` Johan Hovold
@ 2026-04-16  8:59       ` Guangshuo Li
  2026-04-16  9:50         ` Mark Rutland
  2026-04-16  9:30       ` Mark Rutland
  1 sibling, 1 reply; 8+ messages in thread
From: Guangshuo Li @ 2026-04-16  8:59 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Greg Kroah-Hartman, Mark Rutland, Will Deacon, Anshuman Khandual,
	linux-arm-kernel, linux-perf-users, linux-kernel, stable

Hi Greg, Mark, Johan,

Thanks for the further comments.

On Thu, 16 Apr 2026 at 15:23, Johan Hovold <johan@kernel.org> wrote:
>
> On Thu, Apr 16, 2026 at 06:40:55AM +0200, Greg Kroah-Hartman wrote:
> > On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:
>
> > > AFAICT you're saying that the reference was taken *within*
> > > platform_device_register(), and then platform_device_register() itself
> > > has failed. I think it's surprising that platform_device_register()
> > > doesn't clean that up itself in the case of an error.
> > >
> > > There are *tonnes* of calls to platform_device_register() throughout the
> > > kernel that don't even bother to check the return value, and many that
> > > just pass the return onto a caller that can't possibly know to call
> > > platform_device_put().
> > >
> > > Code in the same file as platform_device_register() expects it to clean up
> > > after itself, e.g.
> > >
> > > | int platform_add_devices(struct platform_device **devs, int num)
> > > | {
> > > |         int i, ret = 0;
> > > |
> > > |         for (i = 0; i < num; i++) {
> > > |                 ret = platform_device_register(devs[i]);
> > > |                 if (ret) {
> > > |                         while (--i >= 0)
> > > |                                 platform_device_unregister(devs[i]);
> > > |                         break;
> > > |                 }
> > > |         }
> > > |
> > > |         return ret;
> > > | }
> > >
> > > That's been there since the initial git commit, and back then,
> > > platform_device_register() didn't mention that callers needed to perform
> > > any cleanup.
> > >
> > > I see a comment was added to platform_device_register() in commit:
> > >
> > >   67e532a42cf4 ("driver core: platform: document registration-failure requirement")
> > >
> > > ... and that copied the commend added for device_register() in commit:
> > >
> > >   5739411acbaa ("Driver core: Clarify device cleanup.")
> > >
> > > ... but the potential brokenness is so widespread, and the behaviour is
> > > so surprising, that I'd argue the real but is that device_register()
> > > doesn't clean up in case of error. I don't think it's worth changing
> > > this single instance given the prevalance and churn fixing all of that
> > > would involve.
> > >
> > > I think it would be far better to fix the core driver API such that when
> > > those functions return an error, they've already cleaned up for
> > > themselves.
> > >
> > > Greg, am I missing some functional reason why we can't rework
> > > device_register() and friends to handle cleanup themselves? I appreciate
> > > that'll involve churn for some callers, but AFAICT the majority of
> > > callers don't have the required cleanup.
> >
> > Yes, we should fix the platform core code here, this should not be
> > required to do everywhere as obviously we all got it wrong.
>
> It's not just the platform code as this directly reflects the behaviour
> of device_register() as Mark pointed out.
>
> It is indeed an unfortunate quirk of the driver model, but one can argue
> that having a registration function that frees its argument on errors
> would be even worse. And even more so when many (or most) users get this
> right.
>
> So if we want to change this, I think we would need to deprecate
> device_register() in favour of explicit device_initialize() and
> device_add().
>
> That said, most users of platform_device_register() appear to operate
> on static platform devices which don't even have a release function and
> would trigger a WARN() if we ever drop the reference (which is arguably
> worse than leaking a tiny bit of memory).
>
> So leaving things as-is is also an option.
>
> Johan

I did some more investigation, and it looks like directly changing the
semantics of the existing API would break code that is already correct
today.

In particular, there seem to be at least two different kinds of callers:

Callers that already handle the failure path explicitly after
platform_device_register() fails. For these users, changing
platform_device_register() itself to drop the reference internally
would lead to double put / use-after-free issues.

Callers that operate on static struct platform_device objects. Many of
these do not have a release callback, so blindly dropping the
reference on failure would trigger a WARN.

Because of this, changing platform_device_register() itself to always
clean up on failure does not look safe.

One possible direction may be to leave platform_device_register()
unchanged, and instead add new helper APIs for the different cases.

For case (1), I was thinking of a helper like:

platform_device_register_and_put()

The implementation would simply call platform_device_register(), and if
that fails, call platform_device_put(). Callers converted to this helper
would then no longer perform their own put on the failure path.

For case (2), I was thinking of a helper like:

platform_device_register_static()

The implementation would first install a no-op release callback when
pdev->dev.release is not set, and then call
platform_device_register_and_put(). This would make the failure path
well-defined for static platform_device users, avoiding the reference
leak without triggering a WARN.

If this direction sounds reasonable, I would be happy to work on it and
send a patch, and I would also be very willing to help with the related
API conversion work for existing callers.

Thanks,
Guangshuo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-16  7:23     ` Johan Hovold
  2026-04-16  8:59       ` Guangshuo Li
@ 2026-04-16  9:30       ` Mark Rutland
  1 sibling, 0 replies; 8+ messages in thread
From: Mark Rutland @ 2026-04-16  9:30 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Greg Kroah-Hartman, Guangshuo Li, Will Deacon, Anshuman Khandual,
	linux-arm-kernel, linux-perf-users, linux-kernel, stable

On Thu, Apr 16, 2026 at 09:23:33AM +0200, Johan Hovold wrote:
> On Thu, Apr 16, 2026 at 06:40:55AM +0200, Greg Kroah-Hartman wrote:
> > On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:
> 
> > > AFAICT you're saying that the reference was taken *within*
> > > platform_device_register(), and then platform_device_register() itself
> > > has failed. I think it's surprising that platform_device_register()
> > > doesn't clean that up itself in the case of an error.
> > > 
> > > There are *tonnes* of calls to platform_device_register() throughout the
> > > kernel that don't even bother to check the return value, and many that
> > > just pass the return onto a caller that can't possibly know to call
> > > platform_device_put().
> > > 
> > > Code in the same file as platform_device_register() expects it to clean up
> > > after itself, e.g.
> > > 
> > > | int platform_add_devices(struct platform_device **devs, int num) 
> > > | {
> > > |         int i, ret = 0; 
> > > | 
> > > |         for (i = 0; i < num; i++) {
> > > |                 ret = platform_device_register(devs[i]);
> > > |                 if (ret) {
> > > |                         while (--i >= 0)
> > > |                                 platform_device_unregister(devs[i]);
> > > |                         break;
> > > |                 }    
> > > |         }    
> > > | 
> > > |         return ret; 
> > > | }
> > > 
> > > That's been there since the initial git commit, and back then,
> > > platform_device_register() didn't mention that callers needed to perform
> > > any cleanup.
> > > 
> > > I see a comment was added to platform_device_register() in commit:
> > > 
> > >   67e532a42cf4 ("driver core: platform: document registration-failure requirement")
> > > 
> > > ... and that copied the commend added for device_register() in commit:
> > > 
> > >   5739411acbaa ("Driver core: Clarify device cleanup.")
> > > 
> > > ... but the potential brokenness is so widespread, and the behaviour is
> > > so surprising, that I'd argue the real but is that device_register()
> > > doesn't clean up in case of error. I don't think it's worth changing
> > > this single instance given the prevalance and churn fixing all of that
> > > would involve.
> > > 
> > > I think it would be far better to fix the core driver API such that when
> > > those functions return an error, they've already cleaned up for
> > > themselves.
> > > 
> > > Greg, am I missing some functional reason why we can't rework
> > > device_register() and friends to handle cleanup themselves? I appreciate
> > > that'll involve churn for some callers, but AFAICT the majority of
> > > callers don't have the required cleanup.
> > 
> > Yes, we should fix the platform core code here, this should not be
> > required to do everywhere as obviously we all got it wrong.
> 
> It's not just the platform code as this directly reflects the behaviour
> of device_register() as Mark pointed out.
> 
> It is indeed an unfortunate quirk of the driver model, but one can argue
> that having a registration function that frees its argument on errors
> would be even worse. And even more so when many (or most) users get this
> right.

Ah, sorry; I had missed that the _put() step would actually free the
object (and as you explain below, how that won't work for many callers).

> So if we want to change this, I think we would need to deprecate
> device_register() in favour of explicit device_initialize() and
> device_add().

Is is possible to have {platfom_,}device_uninitialize() functions that
does everything except the ->release() call? If we had that, then we'd
be able to have a flow along the lines of:

	int some_init_function(void)
	{
		int err;
	
		platform_device_init(&static_pdev);
	
		err = platform_device_add(&static_pdev))
		if (err)
			goto out_uninit;
	
		return 0;
	
	out_uninit:
		platform_device_uninit(&static_pdev);
		return err;
	}

... which I think would align with what people generally expect to have
to do.

Those would have to check that only a single reference was held (from
the corresponding _initialize()), and could WARN/fail if more were held.

> That said, most users of platform_device_register() appear to operate
> on static platform devices which don't even have a release function and
> would trigger a WARN() if we ever drop the reference (which is arguably
> worse than leaking a tiny bit of memory).
> 
> So leaving things as-is is also an option.

I suspect that might be the best option for now.

Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm_pmu: acpi: fix reference leak on failed device registration
  2026-04-16  8:59       ` Guangshuo Li
@ 2026-04-16  9:50         ` Mark Rutland
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Rutland @ 2026-04-16  9:50 UTC (permalink / raw)
  To: Guangshuo Li
  Cc: Johan Hovold, Greg Kroah-Hartman, Will Deacon, Anshuman Khandual,
	linux-arm-kernel, linux-perf-users, linux-kernel, stable

On Thu, Apr 16, 2026 at 04:59:01PM +0800, Guangshuo Li wrote:
> On Thu, 16 Apr 2026 at 15:23, Johan Hovold <johan@kernel.org> wrote:
> > On Thu, Apr 16, 2026 at 06:40:55AM +0200, Greg Kroah-Hartman wrote:
> > > On Wed, Apr 15, 2026 at 07:19:06PM +0100, Mark Rutland wrote:

> > > > Greg, am I missing some functional reason why we can't rework
> > > > device_register() and friends to handle cleanup themselves? I appreciate
> > > > that'll involve churn for some callers, but AFAICT the majority of
> > > > callers don't have the required cleanup.
> > >
> > > Yes, we should fix the platform core code here, this should not be
> > > required to do everywhere as obviously we all got it wrong.
> >
> > It's not just the platform code as this directly reflects the behaviour
> > of device_register() as Mark pointed out.
> >
> > It is indeed an unfortunate quirk of the driver model, but one can argue
> > that having a registration function that frees its argument on errors
> > would be even worse. And even more so when many (or most) users get this
> > right.
> >
> > So if we want to change this, I think we would need to deprecate
> > device_register() in favour of explicit device_initialize() and
> > device_add().
> >
> > That said, most users of platform_device_register() appear to operate
> > on static platform devices which don't even have a release function and
> > would trigger a WARN() if we ever drop the reference (which is arguably
> > worse than leaking a tiny bit of memory).
> >
> > So leaving things as-is is also an option.
> >
> > Johan
> 
> I did some more investigation, and it looks like directly changing the
> semantics of the existing API would break code that is already correct
> today.

Evidently this wasn't entirely clear, but when I suggested changing the
semantic, I had implicitly meant that we'd also go and fix up callers to
handle the new semantic.

I agree that whatever we do, we'll have to change some callers, given
that existing callers have inconsistent expectations.

> In particular, there seem to be at least two different kinds of callers:
> 
> Callers that already handle the failure path explicitly after
> platform_device_register() fails. For these users, changing
> platform_device_register() itself to drop the reference internally
> would lead to double put / use-after-free issues.

Yes; for those we could drop the explicit cleanup.

As an alternative (as Johan mentioned above), if we deprecated
*_register() in favour of separate *_initialize() and *_add() calls,
then we could require that callers had explicit cleanup. As that cleanup
would more obviously pair with the *_initialize() step, it would be less
surprising than cleaning up for a function that returned an error.

As I mentioned in my other reply to Johan, that might also give options
for how to handle the static platform_device case, e.g. with an
*_uninitialize() function.

> Callers that operate on static struct platform_device objects. Many of
> these do not have a release callback, so blindly dropping the
> reference on failure would trigger a WARN.
> 
> Because of this, changing platform_device_register() itself to always
> clean up on failure does not look safe.

I agree that we probably can't have _*register() do all the necessary
cleanup, since callers want different things.

As per Johan's suggestion, and my reply, I suspect the best option
for a consistent API would be to deprecate *_register() in favour of
separate *_initialize() and *_add() calls.

> One possible direction may be to leave platform_device_register()
> unchanged, and instead add new helper APIs for the different cases.
> 
> For case (1), I was thinking of a helper like:
> 
> platform_device_register_and_put()
> 
> The implementation would simply call platform_device_register(), and if
> that fails, call platform_device_put(). Callers converted to this helper
> would then no longer perform their own put on the failure path.

I think that's going to be a source of confusion, because there's no
clear way to name that function. A '_and_put' suffix makes it sound like
it does a put unconditionally, rather than when the *_add() step fails.

Otherwise, I agree that would work for those callers.

> For case (2), I was thinking of a helper like:
> 
> platform_device_register_static()
> 
> The implementation would first install a no-op release callback when
> pdev->dev.release is not set, and then call
> platform_device_register_and_put(). This would make the failure path
> well-defined for static platform_device users, avoiding the reference
> leak without triggering a WARN.

Something like that might work.

As above, I think my preference would be to have separate
init/add/uninit calls, as that way each of the functions succeeds or
fails atomically, which is more aligned with general conventions.

> If this direction sounds reasonable, I would be happy to work on it and
> send a patch, and I would also be very willing to help with the related
> API conversion work for existing callers.

Fantastic!

I think we should hear what Greg thinks of the options before we start
on that, but it's great to hear that you're willing!

Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-16  9:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-15 17:41 [PATCH] arm_pmu: acpi: fix reference leak on failed device registration Guangshuo Li
2026-04-15 18:19 ` Mark Rutland
2026-04-16  4:40   ` Greg Kroah-Hartman
2026-04-16  6:34     ` Guangshuo Li
2026-04-16  7:23     ` Johan Hovold
2026-04-16  8:59       ` Guangshuo Li
2026-04-16  9:50         ` Mark Rutland
2026-04-16  9:30       ` Mark Rutland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox