public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
@ 2025-10-22 10:30 Ketil Johnsen
  2025-10-22 10:53 ` Steven Price
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ketil Johnsen @ 2025-10-22 10:30 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Heiko Stuebner
  Cc: Ketil Johnsen, dri-devel, linux-kernel

The function panthor_fw_unplug() will free the FW memory sections.
The problem is that there could still be pending FW events which have
not yet been handled at this point. process_fw_events_work() can in this
case try to access said freed memory.

This fix introduces a destroyed state for the panthor_scheduler object,
and we check for this before processing FW events.

Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
---
v2:
- Followed Boris's advice and handle the race purely within the
  scheduler block (by adding a destroyed state)
---
 drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 0cc9055f4ee52..4996f987b8183 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -315,6 +315,13 @@ struct panthor_scheduler {
 		 */
 		struct list_head stopped_groups;
 	} reset;
+
+	/**
+	 * @destroyed: Scheduler object is (being) destroyed
+	 *
+	 * Normal scheduler operations should no longer take place.
+	 */
+	bool destroyed;
 };
 
 /**
@@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
 	u32 events = atomic_xchg(&sched->fw_events, 0);
 	struct panthor_device *ptdev = sched->ptdev;
 
-	mutex_lock(&sched->lock);
+	guard(mutex)(&sched->lock);
+
+	if (sched->destroyed)
+		return;
 
 	if (events & JOB_INT_GLOBAL_IF) {
 		sched_process_global_irq_locked(ptdev);
@@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
 		sched_process_csg_irq_locked(ptdev, csg_id);
 		events &= ~BIT(csg_id);
 	}
-
-	mutex_unlock(&sched->lock);
 }
 
 /**
@@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
 	cancel_delayed_work_sync(&sched->tick_work);
 
 	mutex_lock(&sched->lock);
+	sched->destroyed = true;
 	if (sched->pm.has_ref) {
 		pm_runtime_put(ptdev->base.dev);
 		sched->pm.has_ref = false;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 10:30 [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing Ketil Johnsen
@ 2025-10-22 10:53 ` Steven Price
  2025-10-22 11:02 ` Liviu Dudau
  2025-10-22 12:37 ` Boris Brezillon
  2 siblings, 0 replies; 11+ messages in thread
From: Steven Price @ 2025-10-22 10:53 UTC (permalink / raw)
  To: Ketil Johnsen, Boris Brezillon, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Heiko Stuebner
  Cc: dri-devel, linux-kernel

On 22/10/2025 11:30, Ketil Johnsen wrote:
> The function panthor_fw_unplug() will free the FW memory sections.
> The problem is that there could still be pending FW events which are yet
> not handled at this point. process_fw_events_work() can in this case try
> to access said freed memory.
> 
> This fix introduces a destroyed state for the panthor_scheduler object,
> and we check for this before processing FW events.
> 
> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
> v2:
> - Followed Boris's advice and handle the race purely within the
>   scheduler block (by adding a destroyed state)
> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 0cc9055f4ee52..4996f987b8183 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>  		 */
>  		struct list_head stopped_groups;
>  	} reset;
> +
> +	/**
> +	 * @destroyed: Scheduler object is (being) destroyed
> +	 *
> +	 * Normal scheduler operations should no longer take place.
> +	 */
> +	bool destroyed;
>  };
>  
>  /**
> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>  	struct panthor_device *ptdev = sched->ptdev;
>  
> -	mutex_lock(&sched->lock);
> +	guard(mutex)(&sched->lock);
> +
> +	if (sched->destroyed)
> +		return;
>  
>  	if (events & JOB_INT_GLOBAL_IF) {
>  		sched_process_global_irq_locked(ptdev);
> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>  		sched_process_csg_irq_locked(ptdev, csg_id);
>  		events &= ~BIT(csg_id);
>  	}
> -
> -	mutex_unlock(&sched->lock);
>  }
>  
>  /**
> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>  	cancel_delayed_work_sync(&sched->tick_work);
>  
>  	mutex_lock(&sched->lock);
> +	sched->destroyed = true;
>  	if (sched->pm.has_ref) {
>  		pm_runtime_put(ptdev->base.dev);
>  		sched->pm.has_ref = false;


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 10:30 [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing Ketil Johnsen
  2025-10-22 10:53 ` Steven Price
@ 2025-10-22 11:02 ` Liviu Dudau
  2025-10-22 12:37 ` Boris Brezillon
  2 siblings, 0 replies; 11+ messages in thread
From: Liviu Dudau @ 2025-10-22 11:02 UTC (permalink / raw)
  To: Ketil Johnsen
  Cc: Boris Brezillon, Steven Price, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On Wed, Oct 22, 2025 at 12:30:13PM +0200, Ketil Johnsen wrote:
> The function panthor_fw_unplug() will free the FW memory sections.
> The problem is that there could still be pending FW events which are yet
> not handled at this point. process_fw_events_work() can in this case try
> to access said freed memory.
> 
> This fix introduces a destroyed state for the panthor_scheduler object,
> and we check for this before processing FW events.
> 
> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")

Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>

Best regards,
Liviu

> ---
> v2:
> - Followed Boris's advice and handle the race purely within the
>   scheduler block (by adding a destroyed state)
> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 0cc9055f4ee52..4996f987b8183 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>  		 */
>  		struct list_head stopped_groups;
>  	} reset;
> +
> +	/**
> +	 * @destroyed: Scheduler object is (being) destroyed
> +	 *
> +	 * Normal scheduler operations should no longer take place.
> +	 */
> +	bool destroyed;
>  };
>  
>  /**
> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>  	struct panthor_device *ptdev = sched->ptdev;
>  
> -	mutex_lock(&sched->lock);
> +	guard(mutex)(&sched->lock);
> +
> +	if (sched->destroyed)
> +		return;
>  
>  	if (events & JOB_INT_GLOBAL_IF) {
>  		sched_process_global_irq_locked(ptdev);
> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>  		sched_process_csg_irq_locked(ptdev, csg_id);
>  		events &= ~BIT(csg_id);
>  	}
> -
> -	mutex_unlock(&sched->lock);
>  }
>  
>  /**
> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>  	cancel_delayed_work_sync(&sched->tick_work);
>  
>  	mutex_lock(&sched->lock);
> +	sched->destroyed = true;
>  	if (sched->pm.has_ref) {
>  		pm_runtime_put(ptdev->base.dev);
>  		sched->pm.has_ref = false;
> -- 
> 2.47.2
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 10:30 [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing Ketil Johnsen
  2025-10-22 10:53 ` Steven Price
  2025-10-22 11:02 ` Liviu Dudau
@ 2025-10-22 12:37 ` Boris Brezillon
  2025-10-22 13:36   ` Steven Price
  2 siblings, 1 reply; 11+ messages in thread
From: Boris Brezillon @ 2025-10-22 12:37 UTC (permalink / raw)
  To: Ketil Johnsen
  Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On Wed, 22 Oct 2025 12:30:13 +0200
Ketil Johnsen <ketil.johnsen@arm.com> wrote:

> The function panthor_fw_unplug() will free the FW memory sections.
> The problem is that there could still be pending FW events which are yet
> not handled at this point. process_fw_events_work() can in this case try
> to access said freed memory.
> 
> This fix introduces a destroyed state for the panthor_scheduler object,
> and we check for this before processing FW events.
> 
> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
> ---
> v2:
> - Followed Boris's advice and handle the race purely within the
>   scheduler block (by adding a destroyed state)
> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 0cc9055f4ee52..4996f987b8183 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>  		 */
>  		struct list_head stopped_groups;
>  	} reset;
> +
> +	/**
> +	 * @destroyed: Scheduler object is (being) destroyed
> +	 *
> +	 * Normal scheduler operations should no longer take place.
> +	 */
> +	bool destroyed;

Do we really need a new field for that? Can't we just reset
panthor_device::scheduler to NULL early enough in the unplug path?
I guess it's not that simple if we have works going back to ptdev
and then dereferencing ptdev->scheduler, but I think it's also
fundamentally broken to have scheduler works active after the
scheduler teardown has started, so we might want to add some more
checks in the work callbacks too.

>  };
>  
>  /**
> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>  	struct panthor_device *ptdev = sched->ptdev;
>  
> -	mutex_lock(&sched->lock);
> +	guard(mutex)(&sched->lock);
> +
> +	if (sched->destroyed)
> +		return;
>  
>  	if (events & JOB_INT_GLOBAL_IF) {
>  		sched_process_global_irq_locked(ptdev);
> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>  		sched_process_csg_irq_locked(ptdev, csg_id);
>  		events &= ~BIT(csg_id);
>  	}
> -
> -	mutex_unlock(&sched->lock);
>  }
>  
>  /**
> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>  	cancel_delayed_work_sync(&sched->tick_work);
>  
>  	mutex_lock(&sched->lock);
> +	sched->destroyed = true;
>  	if (sched->pm.has_ref) {
>  		pm_runtime_put(ptdev->base.dev);
>  		sched->pm.has_ref = false;

Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
rather than letting the work execute after we've started tearing down
the scheduler object.

If you follow my suggestion to reset the ptdev->scheduler field, I
guess something like that would do:

void panthor_sched_unplug(struct panthor_device *ptdev)
{
	struct panthor_scheduler *sched = ptdev->scheduler;

	/* Make the scheduler unreachable before cancelling its works,
	 * so nothing can queue new work against it from now on. */
	WRITE_ONCE(ptdev->scheduler, NULL);

	cancel_work_sync(&sched->fw_events_work);
	cancel_delayed_work_sync(&sched->tick_work);

	mutex_lock(&sched->lock);
	if (sched->pm.has_ref) {
		pm_runtime_put(ptdev->base.dev);
		sched->pm.has_ref = false;
	}
	mutex_unlock(&sched->lock);
}

and

void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events)
{
	struct panthor_scheduler *sched = READ_ONCE(ptdev->scheduler);

	/* Scheduler is not initialized, or it's gone. */
	if (!sched)
		return;

	atomic_or(events, &sched->fw_events);
	sched_queue_work(sched, fw_events);
}


sched_queue_[delayed_]work() could also be made to issue a drm_WARN_ON()
when called while ptdev->scheduler is NULL.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 12:37 ` Boris Brezillon
@ 2025-10-22 13:36   ` Steven Price
  2025-10-22 14:00     ` Boris Brezillon
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Price @ 2025-10-22 13:36 UTC (permalink / raw)
  To: Boris Brezillon, Ketil Johnsen
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Heiko Stuebner, dri-devel,
	linux-kernel

On 22/10/2025 13:37, Boris Brezillon wrote:
> On Wed, 22 Oct 2025 12:30:13 +0200
> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> 
>> The function panthor_fw_unplug() will free the FW memory sections.
>> The problem is that there could still be pending FW events which are yet
>> not handled at this point. process_fw_events_work() can in this case try
>> to access said freed memory.
>>
>> This fix introduces a destroyed state for the panthor_scheduler object,
>> and we check for this before processing FW events.
>>
>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
>> ---
>> v2:
>> - Followed Boris's advice and handle the race purely within the
>>   scheduler block (by adding a destroyed state)
>> ---
>>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>> index 0cc9055f4ee52..4996f987b8183 100644
>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>>  		 */
>>  		struct list_head stopped_groups;
>>  	} reset;
>> +
>> +	/**
>> +	 * @destroyed: Scheduler object is (being) destroyed
>> +	 *
>> +	 * Normal scheduler operations should no longer take place.
>> +	 */
>> +	bool destroyed;
> 
> Do we really need a new field for that? Can't we just reset
> panthor_device::scheduler to NULL early enough in the unplug path?
> I guess it's not that simple if we have works going back to ptdev
> and then dereferencing ptdev->scheduler, but I think it's also
> fundamentally broken to have scheduler works active after the
> scheduler teardown has started, so we might want to add some more
> checks in the work callbacks too.
> 
>>  };
>>  
>>  /**
>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>>  	struct panthor_device *ptdev = sched->ptdev;
>>  
>> -	mutex_lock(&sched->lock);
>> +	guard(mutex)(&sched->lock);
>> +
>> +	if (sched->destroyed)
>> +		return;
>>  
>>  	if (events & JOB_INT_GLOBAL_IF) {
>>  		sched_process_global_irq_locked(ptdev);
>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>>  		sched_process_csg_irq_locked(ptdev, csg_id);
>>  		events &= ~BIT(csg_id);
>>  	}
>> -
>> -	mutex_unlock(&sched->lock);
>>  }
>>  
>>  /**
>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>>  	cancel_delayed_work_sync(&sched->tick_work);
>>  
>>  	mutex_lock(&sched->lock);
>> +	sched->destroyed = true;
>>  	if (sched->pm.has_ref) {
>>  		pm_runtime_put(ptdev->base.dev);
>>  		sched->pm.has_ref = false;
> 
> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
> rather than letting the work execute after we've started tearing down
> the scheduler object.
> 
> If you follow my suggestion to reset the ptdev->scheduler field, I
> guess something like that would do:
> 
> void panthor_sched_unplug(struct panthor_device *ptdev)
> {
>         struct panthor_scheduler *sched = ptdev->scheduler;
> 
> 	/* We want the schedu */
> 	WRITE_ONCE(*ptdev->scheduler, NULL);
> 
> 	cancel_work_sync(&sched->fw_events_work);
>         cancel_delayed_work_sync(&sched->tick_work);
> 
>         mutex_lock(&sched->lock);
>         if (sched->pm.has_ref) {
>                 pm_runtime_put(ptdev->base.dev);
>                 sched->pm.has_ref = false;
>         }
>         mutex_unlock(&sched->lock);
> }
> 
> and
> 
> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
> 
> 	/* Scheduler is not initialized, or it's gone. */
>         if (!sched)
>                 return;
> 
>         atomic_or(events, &sched->fw_events);
>         sched_queue_work(sched, fw_events);
> }

Note there's also the path of panthor_mmu_irq_handler() calling
panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
to be safe.

I agree having an extra bool is ugly, but it is easier to reason about than
the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
be regressed in the future. I can't immediately see how to wrap this in
a helper to ensure this is kept correct.

Thanks,
Steve

> 
> 
> sched_queue_[delayed_]work() could also be automated to issue a drm_WARN_ON()
> when it's called and ptdev->scheduler = NULL.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 13:36   ` Steven Price
@ 2025-10-22 14:00     ` Boris Brezillon
  2025-10-22 14:28       ` Steven Price
  0 siblings, 1 reply; 11+ messages in thread
From: Boris Brezillon @ 2025-10-22 14:00 UTC (permalink / raw)
  To: Steven Price
  Cc: Ketil Johnsen, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On Wed, 22 Oct 2025 14:36:23 +0100
Steven Price <steven.price@arm.com> wrote:

> On 22/10/2025 13:37, Boris Brezillon wrote:
> > On Wed, 22 Oct 2025 12:30:13 +0200
> > Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> >   
> >> The function panthor_fw_unplug() will free the FW memory sections.
> >> The problem is that there could still be pending FW events which are yet
> >> not handled at this point. process_fw_events_work() can in this case try
> >> to access said freed memory.
> >>
> >> This fix introduces a destroyed state for the panthor_scheduler object,
> >> and we check for this before processing FW events.
> >>
> >> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> >> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
> >> ---
> >> v2:
> >> - Followed Boris's advice and handle the race purely within the
> >>   scheduler block (by adding a destroyed state)
> >> ---
> >>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
> >>  1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> >> index 0cc9055f4ee52..4996f987b8183 100644
> >> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >> @@ -315,6 +315,13 @@ struct panthor_scheduler {
> >>  		 */
> >>  		struct list_head stopped_groups;
> >>  	} reset;
> >> +
> >> +	/**
> >> +	 * @destroyed: Scheduler object is (being) destroyed
> >> +	 *
> >> +	 * Normal scheduler operations should no longer take place.
> >> +	 */
> >> +	bool destroyed;  
> > 
> > Do we really need a new field for that? Can't we just reset
> > panthor_device::scheduler to NULL early enough in the unplug path?
> > I guess it's not that simple if we have works going back to ptdev
> > and then dereferencing ptdev->scheduler, but I think it's also
> > fundamentally broken to have scheduler works active after the
> > scheduler teardown has started, so we might want to add some more
> > checks in the work callbacks too.
> >   
> >>  };
> >>  
> >>  /**
> >> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
> >>  	u32 events = atomic_xchg(&sched->fw_events, 0);
> >>  	struct panthor_device *ptdev = sched->ptdev;
> >>  
> >> -	mutex_lock(&sched->lock);
> >> +	guard(mutex)(&sched->lock);
> >> +
> >> +	if (sched->destroyed)
> >> +		return;
> >>  
> >>  	if (events & JOB_INT_GLOBAL_IF) {
> >>  		sched_process_global_irq_locked(ptdev);
> >> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
> >>  		sched_process_csg_irq_locked(ptdev, csg_id);
> >>  		events &= ~BIT(csg_id);
> >>  	}
> >> -
> >> -	mutex_unlock(&sched->lock);
> >>  }
> >>  
> >>  /**
> >> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
> >>  	cancel_delayed_work_sync(&sched->tick_work);
> >>  
> >>  	mutex_lock(&sched->lock);
> >> +	sched->destroyed = true;
> >>  	if (sched->pm.has_ref) {
> >>  		pm_runtime_put(ptdev->base.dev);
> >>  		sched->pm.has_ref = false;  
> > 
> > Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
> > rather than letting the work execute after we've started tearing down
> > the scheduler object.
> > 
> > If you follow my suggestion to reset the ptdev->scheduler field, I
> > guess something like that would do:
> > 
> > void panthor_sched_unplug(struct panthor_device *ptdev)
> > {
> >         struct panthor_scheduler *sched = ptdev->scheduler;
> > 
> > 	/* We want the schedu */
> > 	WRITE_ONCE(*ptdev->scheduler, NULL);
> > 
> > 	cancel_work_sync(&sched->fw_events_work);
> >         cancel_delayed_work_sync(&sched->tick_work);
> > 
> >         mutex_lock(&sched->lock);
> >         if (sched->pm.has_ref) {
> >                 pm_runtime_put(ptdev->base.dev);
> >                 sched->pm.has_ref = false;
> >         }
> >         mutex_unlock(&sched->lock);
> > }
> > 
> > and
> > 
> > void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
> > 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
> > 
> > 	/* Scheduler is not initialized, or it's gone. */
> >         if (!sched)
> >                 return;
> > 
> >         atomic_or(events, &sched->fw_events);
> >         sched_queue_work(sched, fw_events);
> > }  
> 
> Note there's also the path of panthor_mmu_irq_handler() calling
> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
> to be safe.

This could be hidden behind a panthor_device_get_sched() helper, I
guess. Anyway, it's not so much that I'm against the addition of an
extra bool, but AFAICT, the problem is not entirely solved, as there
could be a pending work that gets executed after sched_unplug()
returns, and adding this bool check just papers over the real bug
(which is that we never cancel the fw_events work).

> 
> I agree having an extra bool is ugly, but it easier to reason about than
> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
> be regressed in the future. I can't immediately see how to wrap this in
> a helper to ensure this is kept correct.

Sure, but you're not really catching cases where the work runs after
the scheduler component has been unplugged because someone forgot to
cancel it. I think I'd rather identify those cases with a kernel panic
than with a random UAF when the work is executed. Ultimately, we
should probably audit all works used in the driver to make sure they
are properly cancelled at unplug() time by the relevant
<component>_unplug() functions.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 14:00     ` Boris Brezillon
@ 2025-10-22 14:28       ` Steven Price
  2025-10-22 15:32         ` Boris Brezillon
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Price @ 2025-10-22 14:28 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Ketil Johnsen, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On 22/10/2025 15:00, Boris Brezillon wrote:
> On Wed, 22 Oct 2025 14:36:23 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 22/10/2025 13:37, Boris Brezillon wrote:
>>> On Wed, 22 Oct 2025 12:30:13 +0200
>>> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
>>>   
>>>> The function panthor_fw_unplug() will free the FW memory sections.
>>>> The problem is that there could still be pending FW events which are yet
>>>> not handled at this point. process_fw_events_work() can in this case try
>>>> to access said freed memory.
>>>>
>>>> This fix introduces a destroyed state for the panthor_scheduler object,
>>>> and we check for this before processing FW events.
>>>>
>>>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
>>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
>>>> ---
>>>> v2:
>>>> - Followed Boris's advice and handle the race purely within the
>>>>   scheduler block (by adding a destroyed state)
>>>> ---
>>>>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>>>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>>>> index 0cc9055f4ee52..4996f987b8183 100644
>>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>>>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>>>>  		 */
>>>>  		struct list_head stopped_groups;
>>>>  	} reset;
>>>> +
>>>> +	/**
>>>> +	 * @destroyed: Scheduler object is (being) destroyed
>>>> +	 *
>>>> +	 * Normal scheduler operations should no longer take place.
>>>> +	 */
>>>> +	bool destroyed;  
>>>
>>> Do we really need a new field for that? Can't we just reset
>>> panthor_device::scheduler to NULL early enough in the unplug path?
>>> I guess it's not that simple if we have works going back to ptdev
>>> and then dereferencing ptdev->scheduler, but I think it's also
>>> fundamentally broken to have scheduler works active after the
>>> scheduler teardown has started, so we might want to add some more
>>> checks in the work callbacks too.
>>>   
>>>>  };
>>>>  
>>>>  /**
>>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>>>>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>>>>  	struct panthor_device *ptdev = sched->ptdev;
>>>>  
>>>> -	mutex_lock(&sched->lock);
>>>> +	guard(mutex)(&sched->lock);
>>>> +
>>>> +	if (sched->destroyed)
>>>> +		return;
>>>>  
>>>>  	if (events & JOB_INT_GLOBAL_IF) {
>>>>  		sched_process_global_irq_locked(ptdev);
>>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>>>>  		sched_process_csg_irq_locked(ptdev, csg_id);
>>>>  		events &= ~BIT(csg_id);
>>>>  	}
>>>> -
>>>> -	mutex_unlock(&sched->lock);
>>>>  }
>>>>  
>>>>  /**
>>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>>>>  	cancel_delayed_work_sync(&sched->tick_work);
>>>>  
>>>>  	mutex_lock(&sched->lock);
>>>> +	sched->destroyed = true;
>>>>  	if (sched->pm.has_ref) {
>>>>  		pm_runtime_put(ptdev->base.dev);
>>>>  		sched->pm.has_ref = false;  
>>>
>>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
>>> rather than letting the work execute after we've started tearing down
>>> the scheduler object.
>>>
>>> If you follow my suggestion to reset the ptdev->scheduler field, I
>>> guess something like that would do:
>>>
>>> void panthor_sched_unplug(struct panthor_device *ptdev)
>>> {
>>>         struct panthor_scheduler *sched = ptdev->scheduler;
>>>
>>> 	/* We want the schedu */
>>> 	WRITE_ONCE(*ptdev->scheduler, NULL);
>>>
>>> 	cancel_work_sync(&sched->fw_events_work);
>>>         cancel_delayed_work_sync(&sched->tick_work);
>>>
>>>         mutex_lock(&sched->lock);
>>>         if (sched->pm.has_ref) {
>>>                 pm_runtime_put(ptdev->base.dev);
>>>                 sched->pm.has_ref = false;
>>>         }
>>>         mutex_unlock(&sched->lock);
>>> }
>>>
>>> and
>>>
>>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
>>> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
>>>
>>> 	/* Scheduler is not initialized, or it's gone. */
>>>         if (!sched)
>>>                 return;
>>>
>>>         atomic_or(events, &sched->fw_events);
>>>         sched_queue_work(sched, fw_events);
>>> }  
>>
>> Note there's also the path of panthor_mmu_irq_handler() calling
>> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
>> to be safe.
> 
> This could be hidden behind a panthor_device_get_sched() helper, I
> guess. Anyway, it's not so much that I'm against the addition of an
> extra bool, but AFAICT, the problem is not entirely solved, as there
> could be a pending work that gets executed after sched_unplug()
> returns, and I adding this bool check just papers over the real bug
> (which is that we never cancel the fw_event work).
> 
>>
>> I agree having an extra bool is ugly, but it easier to reason about than
>> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
>> be regressed in the future. I can't immediately see how to wrap this in
>> a helper to ensure this is kept correct.
> 
> Sure, but you're not really catching cases where the work runs after
> the scheduler component has been unplugged in case someone forgot to
> cancel some works. I think I'd rather identify those cases with a
> kernel panic, than a random UAF when the work is being executed.
> Ultimately, we should probably audit all works used in the driver, to
> make sure they are properly cancelled at unplug() time by the relevant
> <component>_unplug() functions.

Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work)
call somewhere on the unplug path. That needs to be after the job irq
has been disabled, which is currently done in panthor_fw_unplug().

Thanks,
Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 14:28       ` Steven Price
@ 2025-10-22 15:32         ` Boris Brezillon
  2025-10-22 15:36           ` Steven Price
  2025-10-23 14:22           ` Ketil Johnsen
  0 siblings, 2 replies; 11+ messages in thread
From: Boris Brezillon @ 2025-10-22 15:32 UTC (permalink / raw)
  To: Steven Price
  Cc: Ketil Johnsen, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On Wed, 22 Oct 2025 15:28:51 +0100
Steven Price <steven.price@arm.com> wrote:

> On 22/10/2025 15:00, Boris Brezillon wrote:
> > On Wed, 22 Oct 2025 14:36:23 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 22/10/2025 13:37, Boris Brezillon wrote:  
> >>> On Wed, 22 Oct 2025 12:30:13 +0200
> >>> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> >>>     
> >>>> The function panthor_fw_unplug() will free the FW memory sections.
> >>>> The problem is that there could still be pending FW events which are yet
> >>>> not handled at this point. process_fw_events_work() can in this case try
> >>>> to access said freed memory.
> >>>>
> >>>> This fix introduces a destroyed state for the panthor_scheduler object,
> >>>> and we check for this before processing FW events.
> >>>>
> >>>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> >>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
> >>>> ---
> >>>> v2:
> >>>> - Followed Boris's advice and handle the race purely within the
> >>>>   scheduler block (by adding a destroyed state)
> >>>> ---
> >>>>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
> >>>>  1 file changed, 12 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> >>>> index 0cc9055f4ee52..4996f987b8183 100644
> >>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >>>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
> >>>>  		 */
> >>>>  		struct list_head stopped_groups;
> >>>>  	} reset;
> >>>> +
> >>>> +	/**
> >>>> +	 * @destroyed: Scheduler object is (being) destroyed
> >>>> +	 *
> >>>> +	 * Normal scheduler operations should no longer take place.
> >>>> +	 */
> >>>> +	bool destroyed;    
> >>>
> >>> Do we really need a new field for that? Can't we just reset
> >>> panthor_device::scheduler to NULL early enough in the unplug path?
> >>> I guess it's not that simple if we have works going back to ptdev
> >>> and then dereferencing ptdev->scheduler, but I think it's also
> >>> fundamentally broken to have scheduler works active after the
> >>> scheduler teardown has started, so we might want to add some more
> >>> checks in the work callbacks too.
> >>>     
> >>>>  };
> >>>>  
> >>>>  /**
> >>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
> >>>>  	u32 events = atomic_xchg(&sched->fw_events, 0);
> >>>>  	struct panthor_device *ptdev = sched->ptdev;
> >>>>  
> >>>> -	mutex_lock(&sched->lock);
> >>>> +	guard(mutex)(&sched->lock);
> >>>> +
> >>>> +	if (sched->destroyed)
> >>>> +		return;
> >>>>  
> >>>>  	if (events & JOB_INT_GLOBAL_IF) {
> >>>>  		sched_process_global_irq_locked(ptdev);
> >>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
> >>>>  		sched_process_csg_irq_locked(ptdev, csg_id);
> >>>>  		events &= ~BIT(csg_id);
> >>>>  	}
> >>>> -
> >>>> -	mutex_unlock(&sched->lock);
> >>>>  }
> >>>>  
> >>>>  /**
> >>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
> >>>>  	cancel_delayed_work_sync(&sched->tick_work);
> >>>>  
> >>>>  	mutex_lock(&sched->lock);
> >>>> +	sched->destroyed = true;
> >>>>  	if (sched->pm.has_ref) {
> >>>>  		pm_runtime_put(ptdev->base.dev);
> >>>>  		sched->pm.has_ref = false;    
> >>>
> >>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
> >>> rather than letting the work execute after we've started tearing down
> >>> the scheduler object.
> >>>
> >>> If you follow my suggestion to reset the ptdev->scheduler field, I
> >>> guess something like that would do:
> >>>
> >>> void panthor_sched_unplug(struct panthor_device *ptdev)
> >>> {
> >>>         struct panthor_scheduler *sched = ptdev->scheduler;
> >>>
> >>> 	/* We want the schedu */
> >>> 	WRITE_ONCE(*ptdev->scheduler, NULL);
> >>>
> >>> 	cancel_work_sync(&sched->fw_events_work);
> >>>         cancel_delayed_work_sync(&sched->tick_work);
> >>>
> >>>         mutex_lock(&sched->lock);
> >>>         if (sched->pm.has_ref) {
> >>>                 pm_runtime_put(ptdev->base.dev);
> >>>                 sched->pm.has_ref = false;
> >>>         }
> >>>         mutex_unlock(&sched->lock);
> >>> }
> >>>
> >>> and
> >>>
> >>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
> >>> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
> >>>
> >>> 	/* Scheduler is not initialized, or it's gone. */
> >>>         if (!sched)
> >>>                 return;
> >>>
> >>>         atomic_or(events, &sched->fw_events);
> >>>         sched_queue_work(sched, fw_events);
> >>> }    
> >>
> >> Note there's also the path of panthor_mmu_irq_handler() calling
> >> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
> >> to be safe.  
> > 
> > This could be hidden behind a panthor_device_get_sched() helper, I
> > guess. Anyway, it's not so much that I'm against the addition of an
> > extra bool, but AFAICT, the problem is not entirely solved, as there
> > could be a pending work that gets executed after sched_unplug()
> > returns, and I adding this bool check just papers over the real bug
> > (which is that we never cancel the fw_event work).
> >   
> >>
> >> I agree having an extra bool is ugly, but it easier to reason about than
> >> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
> >> be regressed in the future. I can't immediately see how to wrap this in
> >> a helper to ensure this is kept correct.  
> > 
> > Sure, but you're not really catching cases where the work runs after
> > the scheduler component has been unplugged in case someone forgot to
> > cancel some works. I think I'd rather identify those cases with a
> > kernel panic, than a random UAF when the work is being executed.
> > Ultimately, we should probably audit all works used in the driver, to
> > make sure they are properly cancelled at unplug() time by the relevant
> > <component>_unplug() functions.  
> 
> Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work)
> call somewhere on the unplug path. That needs to be after the job irq
> has been disabled which is currently done in panthor_fw_unplug().

Not necessarily. If we prevent any further FW events from queuing the
fw_events work, we can just cancel it in the sched_unplug() path, after
we've transitioned to this "sched-is-gone" state.


* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 15:32         ` Boris Brezillon
@ 2025-10-22 15:36           ` Steven Price
  2025-10-23 14:22           ` Ketil Johnsen
  1 sibling, 0 replies; 11+ messages in thread
From: Steven Price @ 2025-10-22 15:36 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Ketil Johnsen, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On 22/10/2025 16:32, Boris Brezillon wrote:
> On Wed, 22 Oct 2025 15:28:51 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 22/10/2025 15:00, Boris Brezillon wrote:
>>> On Wed, 22 Oct 2025 14:36:23 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>   
>>>> On 22/10/2025 13:37, Boris Brezillon wrote:  
>>>>> On Wed, 22 Oct 2025 12:30:13 +0200
>>>>> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
>>>>>     
>>>>>> The function panthor_fw_unplug() will free the FW memory sections.
>>>>>> The problem is that there could still be pending FW events which are yet
>>>>>> not handled at this point. process_fw_events_work() can in this case try
>>>>>> to access said freed memory.
>>>>>>
>>>>>> This fix introduces a destroyed state for the panthor_scheduler object,
>>>>>> and we check for this before processing FW events.
>>>>>>
>>>>>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
>>>>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
>>>>>> ---
>>>>>> v2:
>>>>>> - Followed Boris's advice and handle the race purely within the
>>>>>>   scheduler block (by adding a destroyed state)
>>>>>> ---
>>>>>>  drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>>>>>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> index 0cc9055f4ee52..4996f987b8183 100644
>>>>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>>>>>>  		 */
>>>>>>  		struct list_head stopped_groups;
>>>>>>  	} reset;
>>>>>> +
>>>>>> +	/**
>>>>>> +	 * @destroyed: Scheduler object is (being) destroyed
>>>>>> +	 *
>>>>>> +	 * Normal scheduler operations should no longer take place.
>>>>>> +	 */
>>>>>> +	bool destroyed;    
>>>>>
>>>>> Do we really need a new field for that? Can't we just reset
>>>>> panthor_device::scheduler to NULL early enough in the unplug path?
>>>>> I guess it's not that simple if we have works going back to ptdev
>>>>> and then dereferencing ptdev->scheduler, but I think it's also
>>>>> fundamentally broken to have scheduler works active after the
>>>>> scheduler teardown has started, so we might want to add some more
>>>>> checks in the work callbacks too.
>>>>>     
>>>>>>  };
>>>>>>  
>>>>>>  /**
>>>>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>>>>>>  	u32 events = atomic_xchg(&sched->fw_events, 0);
>>>>>>  	struct panthor_device *ptdev = sched->ptdev;
>>>>>>  
>>>>>> -	mutex_lock(&sched->lock);
>>>>>> +	guard(mutex)(&sched->lock);
>>>>>> +
>>>>>> +	if (sched->destroyed)
>>>>>> +		return;
>>>>>>  
>>>>>>  	if (events & JOB_INT_GLOBAL_IF) {
>>>>>>  		sched_process_global_irq_locked(ptdev);
>>>>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>>>>>>  		sched_process_csg_irq_locked(ptdev, csg_id);
>>>>>>  		events &= ~BIT(csg_id);
>>>>>>  	}
>>>>>> -
>>>>>> -	mutex_unlock(&sched->lock);
>>>>>>  }
>>>>>>  
>>>>>>  /**
>>>>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>>>>>>  	cancel_delayed_work_sync(&sched->tick_work);
>>>>>>  
>>>>>>  	mutex_lock(&sched->lock);
>>>>>> +	sched->destroyed = true;
>>>>>>  	if (sched->pm.has_ref) {
>>>>>>  		pm_runtime_put(ptdev->base.dev);
>>>>>>  		sched->pm.has_ref = false;    
>>>>>
>>>>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
>>>>> rather than letting the work execute after we've started tearing down
>>>>> the scheduler object.
>>>>>
>>>>> If you follow my suggestion to reset the ptdev->scheduler field, I
>>>>> guess something like that would do:
>>>>>
>>>>> void panthor_sched_unplug(struct panthor_device *ptdev)
>>>>> {
>>>>>         struct panthor_scheduler *sched = ptdev->scheduler;
>>>>>
>>>>> 	/* We want the schedu */
>>>>> 	WRITE_ONCE(*ptdev->scheduler, NULL);
>>>>>
>>>>> 	cancel_work_sync(&sched->fw_events_work);
>>>>>         cancel_delayed_work_sync(&sched->tick_work);
>>>>>
>>>>>         mutex_lock(&sched->lock);
>>>>>         if (sched->pm.has_ref) {
>>>>>                 pm_runtime_put(ptdev->base.dev);
>>>>>                 sched->pm.has_ref = false;
>>>>>         }
>>>>>         mutex_unlock(&sched->lock);
>>>>> }
>>>>>
>>>>> and
>>>>>
>>>>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
>>>>> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
>>>>>
>>>>> 	/* Scheduler is not initialized, or it's gone. */
>>>>>         if (!sched)
>>>>>                 return;
>>>>>
>>>>>         atomic_or(events, &sched->fw_events);
>>>>>         sched_queue_work(sched, fw_events);
>>>>> }    
>>>>
>>>> Note there's also the path of panthor_mmu_irq_handler() calling
>>>> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
>>>> to be safe.  
>>>
>>> This could be hidden behind a panthor_device_get_sched() helper, I
>>> guess. Anyway, it's not so much that I'm against the addition of an
>>> extra bool, but AFAICT, the problem is not entirely solved, as there
>>> could be a pending work that gets executed after sched_unplug()
>>> returns, and I adding this bool check just papers over the real bug
>>> (which is that we never cancel the fw_event work).
>>>   
>>>>
>>>> I agree having an extra bool is ugly, but it easier to reason about than
>>>> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
>>>> be regressed in the future. I can't immediately see how to wrap this in
>>>> a helper to ensure this is kept correct.  
>>>
>>> Sure, but you're not really catching cases where the work runs after
>>> the scheduler component has been unplugged in case someone forgot to
>>> cancel some works. I think I'd rather identify those cases with a
>>> kernel panic, than a random UAF when the work is being executed.
>>> Ultimately, we should probably audit all works used in the driver, to
>>> make sure they are properly cancelled at unplug() time by the relevant
>>> <component>_unplug() functions.  
>>
>> Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work)
>> call somewhere on the unplug path. That needs to be after the job irq
>> has been disabled which is currently done in panthor_fw_unplug().
> 
> Not necessarily. If we prevent any further FW events to queue the
> fw_events work, we can just cancel it in the sched_unplug() path, after
> we've transition to this "sched-is-gone" state.

True, that would also work.

Steve



* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-22 15:32         ` Boris Brezillon
  2025-10-22 15:36           ` Steven Price
@ 2025-10-23 14:22           ` Ketil Johnsen
  2025-10-24  6:33             ` Boris Brezillon
  1 sibling, 1 reply; 11+ messages in thread
From: Ketil Johnsen @ 2025-10-23 14:22 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Heiko Stuebner, dri-devel,
	linux-kernel

On 22/10/2025 17:32, Boris Brezillon wrote:
> On Wed, 22 Oct 2025 15:28:51 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 22/10/2025 15:00, Boris Brezillon wrote:
>>> On Wed, 22 Oct 2025 14:36:23 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>    
>>>> On 22/10/2025 13:37, Boris Brezillon wrote:
>>>>> On Wed, 22 Oct 2025 12:30:13 +0200
>>>>> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
>>>>>      
>>>>>> The function panthor_fw_unplug() will free the FW memory sections.
>>>>>> The problem is that there could still be pending FW events which are yet
>>>>>> not handled at this point. process_fw_events_work() can in this case try
>>>>>> to access said freed memory.
>>>>>>
>>>>>> This fix introduces a destroyed state for the panthor_scheduler object,
>>>>>> and we check for this before processing FW events.
>>>>>>
>>>>>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
>>>>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
>>>>>> ---
>>>>>> v2:
>>>>>> - Followed Boris's advice and handle the race purely within the
>>>>>>    scheduler block (by adding a destroyed state)
>>>>>> ---
>>>>>>   drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
>>>>>>   1 file changed, 12 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> index 0cc9055f4ee52..4996f987b8183 100644
>>>>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>>>>>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
>>>>>>   		 */
>>>>>>   		struct list_head stopped_groups;
>>>>>>   	} reset;
>>>>>> +
>>>>>> +	/**
>>>>>> +	 * @destroyed: Scheduler object is (being) destroyed
>>>>>> +	 *
>>>>>> +	 * Normal scheduler operations should no longer take place.
>>>>>> +	 */
>>>>>> +	bool destroyed;
>>>>>
>>>>> Do we really need a new field for that? Can't we just reset
>>>>> panthor_device::scheduler to NULL early enough in the unplug path?
>>>>> I guess it's not that simple if we have works going back to ptdev
>>>>> and then dereferencing ptdev->scheduler, but I think it's also
>>>>> fundamentally broken to have scheduler works active after the
>>>>> scheduler teardown has started, so we might want to add some more
>>>>> checks in the work callbacks too.
>>>>>      
>>>>>>   };
>>>>>>   
>>>>>>   /**
>>>>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
>>>>>>   	u32 events = atomic_xchg(&sched->fw_events, 0);
>>>>>>   	struct panthor_device *ptdev = sched->ptdev;
>>>>>>   
>>>>>> -	mutex_lock(&sched->lock);
>>>>>> +	guard(mutex)(&sched->lock);
>>>>>> +
>>>>>> +	if (sched->destroyed)
>>>>>> +		return;
>>>>>>   
>>>>>>   	if (events & JOB_INT_GLOBAL_IF) {
>>>>>>   		sched_process_global_irq_locked(ptdev);
>>>>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
>>>>>>   		sched_process_csg_irq_locked(ptdev, csg_id);
>>>>>>   		events &= ~BIT(csg_id);
>>>>>>   	}
>>>>>> -
>>>>>> -	mutex_unlock(&sched->lock);
>>>>>>   }
>>>>>>   
>>>>>>   /**
>>>>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
>>>>>>   	cancel_delayed_work_sync(&sched->tick_work);
>>>>>>   
>>>>>>   	mutex_lock(&sched->lock);
>>>>>> +	sched->destroyed = true;
>>>>>>   	if (sched->pm.has_ref) {
>>>>>>   		pm_runtime_put(ptdev->base.dev);
>>>>>>   		sched->pm.has_ref = false;
>>>>>
>>>>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
>>>>> rather than letting the work execute after we've started tearing down
>>>>> the scheduler object.
>>>>>
>>>>> If you follow my suggestion to reset the ptdev->scheduler field, I
>>>>> guess something like that would do:
>>>>>
>>>>> void panthor_sched_unplug(struct panthor_device *ptdev)
>>>>> {
>>>>>          struct panthor_scheduler *sched = ptdev->scheduler;
>>>>>
>>>>> 	/* We want the schedu */
>>>>> 	WRITE_ONCE(*ptdev->scheduler, NULL);
>>>>>
>>>>> 	cancel_work_sync(&sched->fw_events_work);
>>>>>          cancel_delayed_work_sync(&sched->tick_work);
>>>>>
>>>>>          mutex_lock(&sched->lock);
>>>>>          if (sched->pm.has_ref) {
>>>>>                  pm_runtime_put(ptdev->base.dev);
>>>>>                  sched->pm.has_ref = false;
>>>>>          }
>>>>>          mutex_unlock(&sched->lock);
>>>>> }
>>>>>
>>>>> and
>>>>>
>>>>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
>>>>> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
>>>>>
>>>>> 	/* Scheduler is not initialized, or it's gone. */
>>>>>          if (!sched)
>>>>>                  return;
>>>>>
>>>>>          atomic_or(events, &sched->fw_events);
>>>>>          sched_queue_work(sched, fw_events);
>>>>> }
>>>>
>>>> Note there's also the path of panthor_mmu_irq_handler() calling
>>>> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
>>>> to be safe.
>>>
>>> This could be hidden behind a panthor_device_get_sched() helper, I
>>> guess. Anyway, it's not so much that I'm against the addition of an
>>> extra bool, but AFAICT, the problem is not entirely solved, as there
>>> could be a pending work that gets executed after sched_unplug()
>>> returns, and I adding this bool check just papers over the real bug
>>> (which is that we never cancel the fw_event work).
>>>    
>>>>
>>>> I agree having an extra bool is ugly, but it easier to reason about than
>>>> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
>>>> be regressed in the future. I can't immediately see how to wrap this in
>>>> a helper to ensure this is kept correct.
>>>
>>> Sure, but you're not really catching cases where the work runs after
>>> the scheduler component has been unplugged in case someone forgot to
>>> cancel some works. I think I'd rather identify those cases with a
>>> kernel panic, than a random UAF when the work is being executed.
>>> Ultimately, we should probably audit all works used in the driver, to
>>> make sure they are properly cancelled at unplug() time by the relevant
>>> <component>_unplug() functions.
>>
>> Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work)
>> call somewhere on the unplug path. That needs to be after the job irq
>> has been disabled which is currently done in panthor_fw_unplug().
> 
> Not necessarily. If we prevent any further FW events to queue the
> fw_events work, we can just cancel it in the sched_unplug() path, after
> we've transition to this "sched-is-gone" state.

I don't see how panthor_sched_report_fw_events() could easily avoid 
queuing more work, without making this more complicated than it already 
is with this patch.

panthor_sched_unplug() needs to know that
panthor_sched_report_fw_events() won't schedule more work before it can
safely proceed and cancel pending work.

Ideally we would have disabled/suspended the IRQs to achieve this, but
that happens later, in panthor_fw_unplug().

If we hold the sched->lock in panthor_sched_report_fw_events() over both
the check of the scheduler's validity and the enqueuing of more work, then
we achieve that, but modprobe will crash, since
panthor_sched_report_fw_events() is called during FW init, before
ptdev->scheduler is assigned for the first time.

If we go down that route, then we also need to check whether the scheduler
is valid in panthor_sched_report_fw_events(), and only take the lock if so.
More complexity!
Otherwise we must introduce another mechanism to synchronize from
panthor_sched_report_fw_events() back to panthor_sched_unplug(), but
that would also add more complexity.

PS: We cannot hold the sched->lock while cancelling the work either, as
process_fw_events_work() already takes the lock. That would deadlock!

I'm currently not able to see how we can make this fix any simpler.

Also, in my mind, the root of the pain is that the scheduler and the FW
module are two "equal" modules. If the scheduler owned the FW module, this
would be more straightforward: panthor_sched_unplug() would then just stop
scheduling new work, "unplug" the FW, and then cancel any pending work.
But that would of course be a much bigger change, potentially
introducing a lot of other complications.

--
Ketil


* Re: [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing
  2025-10-23 14:22           ` Ketil Johnsen
@ 2025-10-24  6:33             ` Boris Brezillon
  0 siblings, 0 replies; 11+ messages in thread
From: Boris Brezillon @ 2025-10-24  6:33 UTC (permalink / raw)
  To: Ketil Johnsen
  Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Heiko Stuebner,
	dri-devel, linux-kernel

On Thu, 23 Oct 2025 16:22:48 +0200
Ketil Johnsen <ketil.johnsen@arm.com> wrote:

> On 22/10/2025 17:32, Boris Brezillon wrote:
> > On Wed, 22 Oct 2025 15:28:51 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 22/10/2025 15:00, Boris Brezillon wrote:  
> >>> On Wed, 22 Oct 2025 14:36:23 +0100
> >>> Steven Price <steven.price@arm.com> wrote:
> >>>      
> >>>> On 22/10/2025 13:37, Boris Brezillon wrote:  
> >>>>> On Wed, 22 Oct 2025 12:30:13 +0200
> >>>>> Ketil Johnsen <ketil.johnsen@arm.com> wrote:
> >>>>>        
> >>>>>> The function panthor_fw_unplug() will free the FW memory sections.
> >>>>>> The problem is that there could still be pending FW events which are yet
> >>>>>> not handled at this point. process_fw_events_work() can in this case try
> >>>>>> to access said freed memory.
> >>>>>>
> >>>>>> This fix introduces a destroyed state for the panthor_scheduler object,
> >>>>>> and we check for this before processing FW events.
> >>>>>>
> >>>>>> Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> >>>>>> Fixes: de85488138247 ("drm/panthor: Add the scheduler logical block")
> >>>>>> ---
> >>>>>> v2:
> >>>>>> - Followed Boris's advice and handle the race purely within the
> >>>>>>    scheduler block (by adding a destroyed state)
> >>>>>> ---
> >>>>>>   drivers/gpu/drm/panthor/panthor_sched.c | 15 ++++++++++++---
> >>>>>>   1 file changed, 12 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> >>>>>> index 0cc9055f4ee52..4996f987b8183 100644
> >>>>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >>>>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >>>>>> @@ -315,6 +315,13 @@ struct panthor_scheduler {
> >>>>>>   		 */
> >>>>>>   		struct list_head stopped_groups;
> >>>>>>   	} reset;
> >>>>>> +
> >>>>>> +	/**
> >>>>>> +	 * @destroyed: Scheduler object is (being) destroyed
> >>>>>> +	 *
> >>>>>> +	 * Normal scheduler operations should no longer take place.
> >>>>>> +	 */
> >>>>>> +	bool destroyed;  
> >>>>>
> >>>>> Do we really need a new field for that? Can't we just reset
> >>>>> panthor_device::scheduler to NULL early enough in the unplug path?
> >>>>> I guess it's not that simple if we have works going back to ptdev
> >>>>> and then dereferencing ptdev->scheduler, but I think it's also
> >>>>> fundamentally broken to have scheduler works active after the
> >>>>> scheduler teardown has started, so we might want to add some more
> >>>>> checks in the work callbacks too.
> >>>>>        
> >>>>>>   };
> >>>>>>   
> >>>>>>   /**
> >>>>>> @@ -1765,7 +1772,10 @@ static void process_fw_events_work(struct work_struct *work)
> >>>>>>   	u32 events = atomic_xchg(&sched->fw_events, 0);
> >>>>>>   	struct panthor_device *ptdev = sched->ptdev;
> >>>>>>   
> >>>>>> -	mutex_lock(&sched->lock);
> >>>>>> +	guard(mutex)(&sched->lock);
> >>>>>> +
> >>>>>> +	if (sched->destroyed)
> >>>>>> +		return;
> >>>>>>   
> >>>>>>   	if (events & JOB_INT_GLOBAL_IF) {
> >>>>>>   		sched_process_global_irq_locked(ptdev);
> >>>>>> @@ -1778,8 +1788,6 @@ static void process_fw_events_work(struct work_struct *work)
> >>>>>>   		sched_process_csg_irq_locked(ptdev, csg_id);
> >>>>>>   		events &= ~BIT(csg_id);
> >>>>>>   	}
> >>>>>> -
> >>>>>> -	mutex_unlock(&sched->lock);
> >>>>>>   }
> >>>>>>   
> >>>>>>   /**
> >>>>>> @@ -3882,6 +3890,7 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
> >>>>>>   	cancel_delayed_work_sync(&sched->tick_work);
> >>>>>>   
> >>>>>>   	mutex_lock(&sched->lock);
> >>>>>> +	sched->destroyed = true;
> >>>>>>   	if (sched->pm.has_ref) {
> >>>>>>   		pm_runtime_put(ptdev->base.dev);
> >>>>>>   		sched->pm.has_ref = false;  
> >>>>>
> >>>>> Hm, I'd really like to see a cancel_work_sync(&sched->fw_events_work)
> >>>>> rather than letting the work execute after we've started tearing down
> >>>>> the scheduler object.
> >>>>>
> >>>>> If you follow my suggestion to reset the ptdev->scheduler field, I
> >>>>> guess something like that would do:
> >>>>>
> >>>>> void panthor_sched_unplug(struct panthor_device *ptdev)
> >>>>> {
> >>>>>          struct panthor_scheduler *sched = ptdev->scheduler;
> >>>>>
> >>>>> 	/* We want the schedu */
> >>>>> 	WRITE_ONCE(*ptdev->scheduler, NULL);
> >>>>>
> >>>>> 	cancel_work_sync(&sched->fw_events_work);
> >>>>>          cancel_delayed_work_sync(&sched->tick_work);
> >>>>>
> >>>>>          mutex_lock(&sched->lock);
> >>>>>          if (sched->pm.has_ref) {
> >>>>>                  pm_runtime_put(ptdev->base.dev);
> >>>>>                  sched->pm.has_ref = false;
> >>>>>          }
> >>>>>          mutex_unlock(&sched->lock);
> >>>>> }
> >>>>>
> >>>>> and
> >>>>>
> >>>>> void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events) {
> >>>>> 	struct panthor_scheduler *sched = READ_ONCE(*ptdev->scheduler);
> >>>>>
> >>>>> 	/* Scheduler is not initialized, or it's gone. */
> >>>>>          if (!sched)
> >>>>>                  return;
> >>>>>
> >>>>>          atomic_or(events, &sched->fw_events);
> >>>>>          sched_queue_work(sched, fw_events);
> >>>>> }  
> >>>>
> >>>> Note there's also the path of panthor_mmu_irq_handler() calling
> >>>> panthor_sched_report_mmu_fault() which will need to READ_ONCE() as well
> >>>> to be safe.  
> >>>
> >>> This could be hidden behind a panthor_device_get_sched() helper, I
> >>> guess. Anyway, it's not so much that I'm against the addition of an
> >>> extra bool, but AFAICT, the problem is not entirely solved, as there
> >>> could be a pending work that gets executed after sched_unplug()
> >>> returns, and I adding this bool check just papers over the real bug
> >>> (which is that we never cancel the fw_event work).
> >>>      
> >>>>
> >>>> I agree having an extra bool is ugly, but it easier to reason about than
> >>>> the lock-free WRITE_ONCE/READ_ONCE dance. It worries me that this will
> >>>> be regressed in the future. I can't immediately see how to wrap this in
> >>>> a helper to ensure this is kept correct.  
> >>>
> >>> Sure, but you're not really catching cases where the work runs after
> >>> the scheduler component has been unplugged in case someone forgot to
> >>> cancel some works. I think I'd rather identify those cases with a
> >>> kernel panic, than a random UAF when the work is being executed.
> >>> Ultimately, we should probably audit all works used in the driver, to
> >>> make sure they are properly cancelled at unplug() time by the relevant
> >>> <component>_unplug() functions.  
> >>
> >> Yes I agree, we should have a cancel_work_sync(&sched->fw_events_work)
> >> call somewhere on the unplug path. That needs to be after the job irq
> >> has been disabled which is currently done in panthor_fw_unplug().  
> > 
> > Not necessarily. If we prevent any further FW events to queue the
> > fw_events work, we can just cancel it in the sched_unplug() path, after
> > we've transition to this "sched-is-gone" state.  
> 
> I don't see how panthor_sched_report_fw_events() could easily avoid 
> queuing more work, without making this more complicated than it already 
> is with this patch.
> 
> panthor_sched_unplug() need to know that 
> panthor_sched_report_fw_events() won't schedule more work before it can
> safely proceed and cancel pending work.
> 
> Ideally we would have disabled/suspended the IRQs to achieve this but 
> that happens later in panthor_fw_unplug().
> 
> If we hold the sched->lock in panthor_sched_report_fw_events() over both 
> the checking of schedulers validity and enqueuing of more work, then we 
> achieve that, but modprobe will crash, since 
> panthor_sched_report_fw_events() will be called during FW init, before 
> ptdev->scheduler is assigned for the first time.
> 
> If we go down that route, then we need to also check if scheduler is 
> valid in panthor_sched_report_fw_events(), and only take the lock if so. 
> More complexity!
> Otherwise we must introduce another mechanism to synchronize from 
> panthor_sched_report_fw_events() back to panthor_sched_unplug(), but 
> that would also add more complexity.
> 
> PS: We can not hold the sched->lock while cancelling the work either, as 
> process_fw_events_work() already takes the lock. This will deadlock!
> 
> I'm currently not able to see how we can make this fix any simpler.

I think I just found one. There are disable[_delayed]_work_sync() helpers
that do exactly what we want: flag the work as disabled, cancel the work
if it's pending, and wait for completion if it's currently executing. With
that, I don't think we need anything but the disable[_delayed]_work_sync()
calls in the various unplug paths we have. I guess it would also be good
to transition to {disable,enable}[_delayed]_work_sync() in the reset path,
so we can get rid of some open-coded logic doing the same thing in the
sched_queue[_delayed]_work() helpers, but we can keep that for later.

--->8---
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 81df49880bd8..c771d66a9690 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -120,7 +120,7 @@ static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
 {
        struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
 
-       cancel_work_sync(&ptdev->reset.work);
+       disable_work_sync(&ptdev->reset.work);
        destroy_workqueue(ptdev->reset.wq);
 }
 
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index df767e82148a..5d9f38f301dc 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -1163,7 +1163,7 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 {
        struct panthor_fw_section *section;
 
-       cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
+       disable_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
 
        if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
                /* Make sure the IRQ handler cannot be called after that point. */
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 3d1f57e3990f..adc4fd71175e 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3855,7 +3855,10 @@ void panthor_sched_unplug(struct panthor_device *ptdev)
 {
        struct panthor_scheduler *sched = ptdev->scheduler;
 
-       cancel_delayed_work_sync(&sched->tick_work);
+       /* Disable all works before proceeding with the teardown. */
+       disable_work_sync(&sched->sync_upd_work);
+       disable_work_sync(&sched->fw_events_work);
+       disable_delayed_work_sync(&sched->tick_work);
 
        mutex_lock(&sched->lock);
        if (sched->pm.has_ref) {




Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2025-10-22 10:30 [PATCH v2] drm/panthor: Fix UAF race between device unplug and FW event processing Ketil Johnsen
2025-10-22 10:53 ` Steven Price
2025-10-22 11:02 ` Liviu Dudau
2025-10-22 12:37 ` Boris Brezillon
2025-10-22 13:36   ` Steven Price
2025-10-22 14:00     ` Boris Brezillon
2025-10-22 14:28       ` Steven Price
2025-10-22 15:32         ` Boris Brezillon
2025-10-22 15:36           ` Steven Price
2025-10-23 14:22           ` Ketil Johnsen
2025-10-24  6:33             ` Boris Brezillon
