Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>,
	"Shetaia, Amir" <Amir.Shetaia@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Marek Olšák" <maraeo@gmail.com>,
	"Natalie Vock" <natalie.vock@gmx.de>,
	"Melissa Wen" <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Fri, 12 Jun 2026 09:16:27 +0200	[thread overview]
Message-ID: <16497357.Emhk5qWAgF@timur-max> (raw)
In-Reply-To: <5044340.OV4Wx5bFTl@timur-hyperion>

Hi Everyone,

Can we please make some progress here?

I've still got 3 patch series on the mailing list to improve retry fault 
handling, without any feedback for many weeks now.
Can someone please review these please?

Thanks & best regards,
Timur

On 2026. május 14., csütörtök 23:24:49 közép-európai nyári idő Timur Kristóf 
wrote:
> Hi Amir,
> 
> Thanks, glad to hear that!
> 
> I recommend to also apply "Add fence argument to amdgpu_vm_handle_fault()"
> and "ACK the retry CAM after VM update finishes" as these fix the race
> between the CAM and the SDMA, and should improve the reliability of the
> whole mechanism in my opinion.
> 
> In the meantime I think I figured out what I was missing.
> Considering that the FORCE_MISS bits didn't change the behavior, I was
> thinking that the curlpit must be something else, so I started looking into
> the PTE that is being added.
> 
> It all started working for me when I added some new PTE flags to
> amdgpu_vm_handle_fault(): AMDGPU_PTE_IS_PTE and AMDGPU_PTE_NOALLOC.
> With that, I cleaned up my patches and force-pushed them to the same branch:
> 
> https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4
> 
> What do you think?
> 
> Best regards,
> Timur
> 
> 
> On Thursday, May 14, 2026 9:32:16 PM Central European Summer Time Shetaia,
> 
> Amir wrote:
> > AMD General
> > 
> > Hi Timur,
> > 
> > I cherry-picked your soft-IH-ring + timestamp-filter patches onto our
> > build,> 
> > ran malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):
> >                                                     PASS    1st hang  
> >                                                     sR_r
> > 
> > entered
> 
>  Baseline (no patches)            1/10    trial 6       97k
> 
> >   With your patches                  2/10    trial 8      43k
> > 
> > What I noticed on NV4:
> > - 2× trials before failure (soft-IH unbound workqueue helping)
> > - ~55% reduction in spurious svm_range_restore_pages (your
> > timestamp-filter
> > fix catches a real bug here too, not just Strix Halo)
> 
>  - Soft IH worker
> 
> > visibly on events_unbound queue
> > 
> > Same dma_fence_wait_timeout signature underneath, now on TWO workers
> > (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete,
> 
>  TTM cleanup gated on
> 
> > the same fence chain). Central BO-clear/GCR deadlock unchanged, your
> > patches fix adjacent bugs but that one needs the HW fix.
> > Also tested UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are
> > regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY
> 
>  (the TCP-level register moved
> 
> > to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA
> > banks → 0/10 PASS, hung sooner. So, slowing UTCL0 retries doesn't help.
> > deadlock isn't pressure-driven on UTCL0.
> > 
> > AMIR SHETAIA
> > Senior Software Development Engineer  |  AMD
> > Software Platform Architecture Team
> > --------------------------------------------------------------------------
> > -- ------------------------------------------------------
> 
>  1 Commerce Valley
> 
> > Drive, Markham, ON L3T 7X6
> > LinkedIn  |  Instagram  |  X  |  amd.com
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: Timur Kristóf <timur.kristof@gmail.com>
> > Sent: Thursday, May 14, 2026 1:03 PM
> > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> > <Amir.Shetaia@amd.com>
> 
>  Cc: amd-gfx@lists.freedesktop.org; Deucher,
> 
> > Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> > <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock
> > <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com> Subject: Re: [PATCH
> > 0/6] drm/amdgpu: Improve retry fault handling
> > Hi Amir,
> > 
> > I pushed a work in progress branch here:
> > https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_nav
> > i4
> > 
> > 
> > This contains the two series that I sent to the mailing list yesterday:
> > "Improve retry fault handling"
> > "Improve soft IH ring"
> > 
> > If we have to rely on the soft IH ring on Navi4, then we absolutely need
> > the patches from "Improve soft IH ring", as it fixes a few bugs and
> > switches the soft IH ring to use the system unbound workqueue. Otherwise
> > it can happen that the soft IH ring fills up even before your CPU can
> > process anything in it.
> 
>  (This is especially bad on HW that doesn't have the filter
> 
> > CAM enabled, eg. Strix Halo.)
> > On top of that,
> > 
> > There is a patch to solve the race condition and make sure that the retry
> > CAM is ACKed after the VM update is complete. On Navi 31, this works and
> > successfully makes sure that the race is eliminated and each VM fault is
> > only processed once. I'd like to submit these if you guys agree on the
> > approach.
> > 
> > And finally, there is a very messy patch that contains what I tried to do
> > to get it all working on Navi 48. This has a lot of comments with my
> > notes during development, and some attempts I tried to fix the issue
> > (without success).
> > 
> > Basically, what I did was:
> > 
> > - Enable the retry CAM the same way as I did on IH 6.0 exactly like you
> > said
> 
>  - Added the cam_index to gmc_v12_0_process_interrupt
> 
> > - Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry
> > faults (without that, the faults don't have the retry bit in src_data[1])
> > 
> > Now, what I observe on Navi 48 is:
> > 
> > 1. Retry fault interrupt is triggered
> > 2. Dispatched on the soft IH ring
> > 3. The interrupt handler is called again on the soft IRQ ring 4. The fault
> > is mitigated by amdgpu_vm_handle_fault() 5. Finally, the retry CAM is
> > ACKed
> > 6. As soon as the CAM is ACKed, I get another interrupt for the same
> > address, and it goes on infinitely or until GPU reset is triggered.
> > 
> > Attempts at fixing what I saw on Navi 48:
> > 
> > - Tried to add a TLB flush to various places. Now that the race is
> > mitigated, the TLB flush doesn't make things worse anymore, but it's also
> > not helping.
> 
>  - Tried to set the FORCE_MISS bits on various cache related
> 
> > registers, hoping that it would work around the possibility of the PTE/PDE
> > not being updated in the cache. This didn't help either. - Tried to use
> > amdgpu.vm_update_mode=3 to use the CPU to update the page tables, but that
> > didn't help on Navi 48 - Tried to halt the CP using CP_ME_CNTL to see if
> > it
> > would stop sending interrupts, but it didn't. - Using umr I tried to
> > disable the L2 cache and reset it with the GRBM, but that just broke the
> > whole system.
> > Other notes on the retry fault topic in general:
> > 
> > - On GPUs that don't have the retry CAM (eg. Navi 1-2 and APUs) we'll need
> > to consider disabling the VM fault interrupt while processing the retry,
> > because the soft IH ring can fill up really quickly and the CPU may
> > struggle to keep up with it.
> 
>  - In general (for all GPUs), I think it would
> 
> > be better to update a larger VA range of subsequent pages rather than just
> > a single page, in order to more efficiently mitigate VM faults to
> > neighbouring pages. - Currently all of the invalid accesses are redirected
> > to the same dummy page, which is a security hole because it means that a
> > process that had an out of bounds write can leak data to a process that
> > does an out of bounds read. I already talked to Christian about this, he
> > has a few ideas how to fix it.
> > Thanks & best regards,
> > Timur
> > 
> > 
> > On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia,
> > 
> > Amir wrote:
> > > AMD General
> > > 
> > > 
> > > 
> > > Hi Timur,
> > > 
> > > 
> > > 
> > > Thanks for clarifying. yes, please push the Navi 48 WIP branch when
> > > you have a chance, useful to compare side-by-side.
> > > 
> > > 
> > > 
> > > Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that
> > > doorbell ACK does not free the CAM slot on the NV4 family
> >  
> >  (gfx1200/1201). I
> >  
> > > verified empirically, slots walk monotonically until the CAM fills,
> > > then HW silently drops retry events. MMIO write to regIH_RETRY_CAM_ACK
> > > does free it. Almost certainly required for your Navi 48 attempt.
> > > Why PTE update doesn't reach UTC L0: Most likely a race between the
> > > SDMA PTE-update job completing and the CAM ACK firing the
> >  
> >  IRQ for re-handling.
> >  
> > > Until SDMA finishes the write + invalidation propagates, UTC L0 keeps
> > > getting the stale "invalid" reply which is exactly your symptom.
> > > To inspect with umr: clone
> > > https://gitlab.freedesktop.org/tomstdenis/umr and use
> > > scripts/diag/dump_all_cpc_info.sh (needs
> >  
> >  halt_if_hws_hang=1
> >  
> > > gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> > > '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY
> > > per-VMID
> > > bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> > > translation timeout, not denial). Page-table walk to confirm PTE
> > > landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.
> > > Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that
> > > race because the BO-clear deadlock dominates on NV4, but
> >  
> >  the ordering bug is
> >  
> > > almost certainly there too. Fence callback is the right shape, cleaner
> > > than blocking the worker. Would be a generic fix benefiting both Navi 3
> > > and 4.
> 
>  Happy to review if you take a swing.
> 
> > > IH1 routing: fair point. I just inherited the existing IH0 path and
> > > haven't tried IH1. On the list once CAM ACK and fence ordering are
> > > stable.
> > > 
> > > 
> > > 
> > > AMIR SHETAIA
> > > Senior Software Development Engineer  |  AMD Software Platform
> > > Architecture Team
> > > ----------------------------------------------------------------------
> > > ------
> > > ------------------------------------------------------
> >  
> >  1 Commerce Valley
> >  
> > > Drive, Markham, ON L3T 7X6
> > > LinkedIn  |  Instagram  |  X  |  amd.com
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Timur Kristóf <timur.kristof@gmail.com>
> > > Sent: Wednesday, May 13, 2026 6:12 PM
> > > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> > > <Amir.Shetaia@amd.com>
> >  
> >  Cc: amd-gfx@lists.freedesktop.org; Deucher,
> >  
> > > Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> > > <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie
> > > Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com> Subject: Re:
> > > [PATCH 0/6] drm/amdgpu: Improve retry fault handling Hi Amir,
> > > 
> > > > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > > > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > > > ih_v7_0 implementation, which still does WDOORBELL. that's where I'd
> > > > suggest swapping in MMIO for NV4.
> > > 
> > > First, let me clear up a slight misunderstanding here. The patch
> > > series that I sent here only contains what I managed to get working on
> > > Navi 31, it does not contain any code from my attempt at Navi 48.
> > > 
> > > 
> > > 
> > > The patch 5 is just a slight refactor of the pre-existing code and
> > > (intentionally) doesn't contain any functional changes. It seems that
> > > IH 7.1 relies on the doorbell, so I didn't want to remove it, albeit I
> > > have no means to verify if that actually works or not.
> > > 
> > > 
> > > 
> > > If you like, I can push a few WIP patches to a different branch
> > > tomorrow to show you exactly what I did on Navi 48. However, I
> > > wouldn't want to submit that to the mailing list without first making
> > > sure
> > > that it works well.
> > > 
> > > > 1. "Fault never resolves on NV48" different shape from our
> > > > broken-CAM-ACK symptom.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > You're right, those are different. Our cam-walk-monotonically
> > > > symptom only shows up when CAM is enabled but the ACK is broken.
> > > > 
> > >  > On your NV48 setup CAM
> > > > 
> > > > probably isn't enabled at all (your patch 6 only enables it for
> > > > ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)
> > > 
> > > For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way
> > > I do for
> >  
> >  ih_v6_0 in the series. But, because I couldn't get it fully working, I
> >  
> > > didn't include any of that code in the series.
> > > 
> > > > so retries fire
> > > > repeatedly on the IH ring instead of being deduped by CAM. That
> > > > matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> > > > called but each call is on a fresh IRQ for the same address.
> > > > Two things that could be happening underneath:
> > > > - The fault handler runs but the updated PTE never reaches UTC L0
> > > > (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > > > translate"
> > > > in our UMR captures.
> > > 
> > > I think this explanation may fit what I saw.
> > > Why is it not reaching UTC L0?
> > > Also, how do you inspect this stuff in umr?
> > > 
> > > > 2. What bits we check on src_data[2]:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > Honestly, we don't use src_data[2] for retry detection. We use it
> > > > only for
> > > > the cam_index: cam_index = entry->src_data[2] & 0x3ff;   /* low 10
> > > > bits
> > > > =
> > > > CAM slot */
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > For retry detection we initially used the gfx9 constant on
> > > > src_data[1] like you, but observed the bit cleared on a lot of NV4
> > > > events that should have been retries (waves were hung in xnack-stall
> > > > but no IH event matched).
> >  
> >  So
> >  
> > > > we just go through the retry path unconditionally on NV4
> > > 
> > > I see. I think I needed to change some gfxhub registers to get those
> > > src_data bits that you are missing.
> > > 
> > > > and let
> > > > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > > > specific to gfx1201 / our test path
> > > 
> > > My test case is a simple Vulkan shader which I am executing with
> > > vkrunner.
> > > Each shader invocation does an out of bounds read from a different
> > > page. For Navi 31 (and Strix Halo), I started out with just 1 page
> > > fault, and once I could mitigate that reliably, I turned it up to
> > > several
> > > hundred faults.
> > > 
> > > > 3. TLB flush making it worse .. clue about what to do:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > > > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > > > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush)
> > > > request, and the GC cache block isn't ACK'ing the GCR while UTC L2
> > > > is saturated by the user shader's XNACK retry storm. Adding a TLB
> > > > flush adds another translation request to the same saturated UTC,
> > > > which is why it makes things worse.
> > > 
> > > This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> > > what the function does is first call amdgpu_vm_handle_fault() which
> > > kicks off an SDMA job to update page tables, and then it ACKs the filter
> > > CAM.
> 
>  However, at the moment when the CAM is ACKed, the SDMA job is not
> 
> > > finished yet, so the CP sees that the page is still invalid and the page
> > > fault interrupt is fired again. I actually noticed that on Navi 31 too,
> > > but it's just not fatal there.
> >  
> >  It just basically handles the same page fault twice.
> >  
> > > Once we solve this flaw, I would like to propose to enable retry
> > > faults by default on Navi 3.
> > > Here is an idea for a solution:
> > > 
> > > 
> > > 
> > > Instead of ACKing the CAM right away, we should do it after the SDMA
> > > fence is signalled, ie. when we are sure the page tables are updated.
> > > Maybe we can set a callback on the fence and do it there, though it
> > > would require a slight code churn to get that to work.
> > > 
> > > 
> > > 
> > > What do you think?
> > > 
> > > > 4. IH1 ring on NV4:
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > Same as you ... retry faults on NV4 always come in on IH0. We
> > > > delegate from
> > > > IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > > > SVM/migration path can sleep, but the original entry is on IH0. We
> > > > haven't tried IH1 routing.
> > > 
> > > Why, though?
> > > 
> > > 
> > > 
> > > The ih_v7_0 code does set up the IH1 ring and configure it exactly the
> > > same as the ih_v6_0, so I don't see why it wouldn't work?
> > > 
> > > > Re your branch: thanks for the gitlab link, easier than digging
> > > > through patchwork.
> > > > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch
> > > > 4 cleans up the timestamp filter delta we're seeing (97k entered /
> > > > 2.8k completed at 1 GiB might be partly explained by your Strix Halo
> > > > bug).
> > > 
> > > The timestamp issue actually gave me an endless headache when I first
> > > got into this topic in December. I hope the patch helps!
> > > 
> > > 
> > > 
> > > Best regards,
> > > Timur

     prev parent reply	other threads:[~2026-06-12  7:16 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-18 14:57   ` Christian König
2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-19  7:00   ` Christian König
2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
2026-05-13 16:43   ` Timur Kristóf
2026-05-13 17:28     ` Shetaia, Amir
2026-05-13 17:32       ` Deucher, Alexander
2026-05-13 17:51       ` Timur Kristóf
2026-05-13 20:32         ` Shetaia, Amir
2026-05-13 22:12           ` Timur Kristóf
2026-05-14 15:04             ` Shetaia, Amir
2026-05-14 17:02               ` Timur Kristóf
2026-05-14 19:32                 ` Shetaia, Amir
2026-05-14 21:24                   ` Timur Kristóf
2026-06-12  7:16                     ` Timur Kristóf [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=16497357.Emhk5qWAgF@timur-max \
    --to=timur.kristof@gmail.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Amir.Shetaia@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=alexdeucher@gmail.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=maraeo@gmail.com \
    --cc=mwen@igalia.com \
    --cc=natalie.vock@gmx.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox