Re: [PATCH] drm/panthor: Evict groups before VM termination

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boris Brezillon <boris.brezillon@collabora.com>
To: Chia-I Wu <olvaffe@gmail.com>
Cc: Steven Price <steven.price@arm.com>,
	Ketil Johnsen <ketil.johnsen@arm.com>,
	Liviu Dudau <liviu.dudau@arm.com>,
	Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	Maxime Ripard <mripard@kernel.org>,
	Thomas Zimmermann <tzimmermann@suse.de>,
	David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
	Grant Likely <grant.likely@linaro.org>,
	Heiko Stuebner <heiko@sntech.de>,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] drm/panthor: Evict groups before VM termination
Date: Fri, 19 Dec 2025 08:39:33 +0100	[thread overview]
Message-ID: <20251219083933.2dc9c009@fedora> (raw)
In-Reply-To: <CAPaKu7SY+jsUxyoiiBRUgH7_M3C0V1-Fv_ibCak_6X-saEZ+fw@mail.gmail.com>

On Thu, 18 Dec 2025 14:54:35 -0800
Chia-I Wu <olvaffe@gmail.com> wrote:

> On Thu, Dec 18, 2025 at 10:36 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > On Thu, 18 Dec 2025 16:57:28 +0000
> > Steven Price <steven.price@arm.com> wrote:
> >  
> > > On 18/12/2025 16:26, Ketil Johnsen wrote:  
> > > > Ensure all related groups are evicted and suspended before VM
> > > > destruction takes place.
> > > >
> > > > This fixes an issue where panthor_vm_destroy() destroys and unmaps the
> > > > heap context while there are still on slot groups using this.
> > > > The FW will do a write out to the heap context when a CSG (group) is
> > > > suspended, so a premature unmap of the heap context will cause a
> > > > GPU page fault.
> > > > This page fault is quite harmless, and do not affect the continued
> > > > operation of the GPU.
> > > >
> > > > Fixes: 647810ec2476 ("drm/panthor: Add the MMU/VM logical block")
> > > > Co-developed-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > > Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
> > > > ---
> > > >  drivers/gpu/drm/panthor/panthor_mmu.c   |  4 ++++
> > > >  drivers/gpu/drm/panthor/panthor_sched.c | 16 ++++++++++++++++
> > > >  drivers/gpu/drm/panthor/panthor_sched.h |  1 +
> > > >  3 files changed, 21 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > index 74230f7199121..0e4b301a9c70e 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > @@ -1537,6 +1537,10 @@ static void panthor_vm_destroy(struct panthor_vm *vm)
> > > >
> > > >     vm->destroyed = true;
> > > >
> > > > +   /* Tell scheduler to stop all GPU work related to this VM */
> > > > +   if (refcount_read(&vm->as.active_cnt) > 0)
> > > > +           panthor_sched_prepare_for_vm_destruction(vm->ptdev);
> > > > +
> > > >     mutex_lock(&vm->heaps.lock);
> > > >     panthor_heap_pool_destroy(vm->heaps.pool);
> > > >     vm->heaps.pool = NULL;  
> Is it better to remove the panthor_heap_pool_destroy call here instead
> and let panthor_vm_free take care of it?

We can't because the heap_pool contains heap chunks (kernel BOs) that
are mapped in the very same VM, thus creating a circular ref. The whole
point of calling panthor_heap_pool_destroy() here is to kill this
circular ref. We could introduce the concept of weak and hard VM refs,
but I'm not sure it's worth it.

> 
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > > > index f680edcd40aad..fbbaab9b25efb 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > > > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > > > @@ -2930,6 +2930,22 @@ void panthor_sched_report_mmu_fault(struct panthor_device *ptdev)
> > > >             sched_queue_delayed_work(ptdev->scheduler, tick, 0);
> > > >  }
> > > >
> > > > +void panthor_sched_prepare_for_vm_destruction(struct panthor_device *ptdev)
> > > > +{
> > > > +   /* FW can write out internal state, like the heap context, during CSG
> > > > +    * suspend. It is therefore important that the scheduler has fully
> > > > +    * evicted any pending and related groups before VM destruction can
> > > > +    * safely continue. Failure to do so can lead to GPU page faults.
> > > > +    * A controlled termination of a Panthor instance involves destroying
> > > > +    * the group(s) before the VM. This means any relevant group eviction
> > > > +    * has already been initiated by this point, and we just need to
> > > > +    * ensure that any pending tick_work() has been completed.
> > > > +    */
> > > > +   if (ptdev->scheduler) {
> > > > +           flush_work(&ptdev->scheduler->tick_work.work);
> > > > +   }  
> > >
> > > NIT: braces not needed.
> > >
> > > But I'm also struggling to understand in what situation ptdev->scheduler
> > > would be NULL?  
> >
> > I thought it could happen if the FW initialization fails in the middle,
> > and the FW VM is destroyed before the scheduler had a chance to
> > initialize, but it turns out the FW logic never calls
> > panthor_vm_destroy().
> >  
> > >
> > > Thanks,
> > > Steve
> > >  
> > > > +}
> > > > +
> > > >  void panthor_sched_resume(struct panthor_device *ptdev)
> > > >  {
> > > >     /* Force a tick to re-evaluate after a resume. */
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
> > > > index f4a475aa34c0a..9a8692de8aded 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_sched.h
> > > > +++ b/drivers/gpu/drm/panthor/panthor_sched.h
> > > > @@ -50,6 +50,7 @@ void panthor_sched_suspend(struct panthor_device *ptdev);
> > > >  void panthor_sched_resume(struct panthor_device *ptdev);
> > > >
> > > >  void panthor_sched_report_mmu_fault(struct panthor_device *ptdev);
> > > > +void panthor_sched_prepare_for_vm_destruction(struct panthor_device *ptdev);
> > > >  void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events);
> > > >
> > > >  void panthor_fdinfo_gather_group_samples(struct panthor_file *pfile);  
> > >  
> >

next prev parent reply	other threads:[~2025-12-19  7:39 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-18 16:26 [PATCH] drm/panthor: Evict groups before VM termination Ketil Johnsen
2025-12-18 16:57 ` Steven Price
2025-12-18 17:59   ` Boris Brezillon
2025-12-18 22:54     ` Chia-I Wu
2025-12-19  7:39       ` Boris Brezillon [this message]
2025-12-19  9:19     ` Ketil Johnsen
2025-12-18 17:53 ` Boris Brezillon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251219083933.2dc9c009@fedora \
    --to=boris.brezillon@collabora.com \
    --cc=airlied@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=grant.likely@linaro.org \
    --cc=heiko@sntech.de \
    --cc=ketil.johnsen@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liviu.dudau@arm.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=mripard@kernel.org \
    --cc=olvaffe@gmail.com \
    --cc=simona@ffwll.ch \
    --cc=steven.price@arm.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.