All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>
Cc: amd-gfx@lists.freedesktop.org, Alex Deucher <alexander.deucher@amd.com>
Subject: Re: [PATCH 1/6] drm/amdgpu: don't reemit ring contents more than once
Date: Thu, 18 Dec 2025 09:47:13 -0600	[thread overview]
Message-ID: <2513938.h6RI2rZIcs@timur-max> (raw)
In-Reply-To: <CADnq5_MrscrX+4vpNwpzK47+nLvF0ognLrB=56hqfv89y-LA6g@mail.gmail.com>

On 2025. december 18., csütörtök 9:36:57 középső államokbeli zónaidő Alex 
Deucher wrote:
> On Thu, Dec 18, 2025 at 12:36 AM Timur Kristóf <timur.kristof@gmail.com> 
wrote:
> > On 2025. december 15., hétfő 10:07:06 középső államokbeli zónaidő Alex
> > Deucher> 
> > wrote:
> > > If we cancel a bad job and reemit the ring contents, and
> > > we get another timeout, cancel everything rather than reemitting.
> > > The wptr markers are only relevant for the original emit.  If
> > > we reemit, the wptr markers are no longer correct.
> > 
> > I see the point of not reemitting, considering the wptrs are no longer
> > correct. However, would it be possible to correct the wptrs instead?
> 
> Yes, that could be done, but it's more complicated.  This is just a
> short term fix until that is implemented.
> 
> Alex

I see. The patch is:
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>

> 
> > As it is, this sounds like it would make the reset less reliable when
> > there is more than one job emitted that can cause hangs.
> > 
> > > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > > ---
> > > 
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 22 +++++++++++++++++-----
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  2 ++
> > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index
> > > 1fe31d2f27060..334ddd6e48c06 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > @@ -710,6 +710,7 @@ void
> > > amdgpu_fence_driver_guilty_force_completion(struct
> > > amdgpu_fence *af) struct amdgpu_ring *ring = af->ring;
> > > 
> > >       unsigned long flags;
> > >       u32 seq, last_seq;
> > > 
> > > +     bool reemitted = false;
> > > 
> > >       last_seq = amdgpu_fence_read(ring) & ring-
> > >
> > >fence_drv.num_fences_mask;
> > >
> > >       seq = ring->fence_drv.sync_seq & ring->fence_drv.num_fences_mask;
> > > 
> > > @@ -727,7 +728,9 @@ void
> > > amdgpu_fence_driver_guilty_force_completion(struct
> > > amdgpu_fence *af) if (unprocessed &&
> > > !dma_fence_is_signaled_locked(unprocessed)) { fence =
> > > container_of(unprocessed, struct amdgpu_fence, base);
> > > 
> > > -                     if (fence == af)
> > > +                     if (fence->reemitted > 1)
> > > +                             reemitted = true;
> > > +                     else if (fence == af)
> > > 
> > >                               dma_fence_set_error(&fence->base,
> > 
> > -ETIME);
> > 
> > >                       else if (fence->context == af->context)
> > >                       
> > >                               dma_fence_set_error(&fence->base,
> > 
> > -ECANCELED);
> > 
> > > @@ -735,9 +738,16 @@ void
> > > amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af)
> > > rcu_read_unlock();
> > > 
> > >       } while (last_seq != seq);
> > >       spin_unlock_irqrestore(&ring->fence_drv.lock, flags);
> > > 
> > > -     /* signal the guilty fence */
> > > -     amdgpu_fence_write(ring, (u32)af->base.seqno);
> > > -     amdgpu_fence_process(ring);
> > > +
> > > +     if (reemitted) {
> > > +             /* if we've already reemitted once then just cancel
> > 
> > everything */
> > 
> > > +             amdgpu_fence_driver_force_completion(af->ring);
> > > +             af->ring->ring_backup_entries_to_copy = 0;
> > > +     } else {
> > > +             /* signal the guilty fence */
> > > +             amdgpu_fence_write(ring, (u32)af->base.seqno);
> > > +             amdgpu_fence_process(ring);
> > > +     }
> > > 
> > >  }
> > >  
> > >  void amdgpu_fence_save_wptr(struct amdgpu_fence *af)
> > > 
> > > @@ -785,10 +795,12 @@ void
> > > amdgpu_ring_backup_unprocessed_commands(struct
> > > amdgpu_ring *ring, /* save everything if the ring is not guilty,
> > > otherwise
> > > 
> > >                        * just save the content from other
> > 
> > contexts.
> > 
> > >                        */
> > > 
> > > -                     if (!guilty_fence || (fence->context !=
> > 
> > guilty_fence->context))
> > 
> > > +                     if (!fence->reemitted &&
> > > +                         (!guilty_fence || (fence->context !=
> > 
> > guilty_fence->context)))
> > 
> > amdgpu_ring_backup_unprocessed_command(ring, wptr,
> > 
> >        fence->wptr);
> >        
> > >                       wptr = fence->wptr;
> > > 
> > > +                     fence->reemitted++;
> > > 
> > >               }
> > >               rcu_read_unlock();
> > >       
> > >       } while (last_seq != seq);
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h index
> > > a1fb0fadb6eab..d881829528976 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > @@ -150,6 +150,8 @@ struct amdgpu_fence {
> > > 
> > >       u64                             wptr;
> > >       /* fence context for resets */
> > >       u64                             context;
> > > 
> > > +     /* has this fence been reemitted */
> > > +     unsigned int                    reemitted;
> > > 
> > >  };
> > >  
> > >  extern const struct drm_sched_backend_ops amdgpu_sched_ops;





      reply	other threads:[~2025-12-18 15:47 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-15 16:07 [PATCH 1/6] drm/amdgpu: don't reemit ring contents more than once Alex Deucher
2025-12-15 16:07 ` [PATCH 2/6] drm/amdgpu: always backup and reemit fences Alex Deucher
2025-12-18  5:17   ` Timur Kristóf
2025-12-15 16:07 ` [PATCH 3/6] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
2025-12-15 16:07 ` [PATCH 4/6] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
2025-12-18  5:15   ` Timur Kristóf
2025-12-18 15:41     ` Alex Deucher
2025-12-18 15:49       ` Timur Kristóf
2025-12-15 16:07 ` [PATCH 5/6] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
2025-12-15 16:07 ` [PATCH 6/6] drm/amdgpu/gfx9: Implement KGQ " Alex Deucher
2025-12-18  5:11   ` Timur Kristóf
2025-12-18  5:28     ` Timur Kristóf
2025-12-18 15:58     ` Alex Deucher
2025-12-18 17:03       ` Timur Kristóf
2025-12-18 19:02         ` Alex Deucher
2025-12-18  5:20 ` [PATCH 1/6] drm/amdgpu: don't reemit ring contents more than once Timur Kristóf
2025-12-18 15:36   ` Alex Deucher
2025-12-18 15:47     ` Timur Kristóf [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2513938.h6RI2rZIcs@timur-max \
    --to=timur.kristof@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=alexdeucher@gmail.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.