From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>,
	"Shetaia, Amir" <Amir.Shetaia@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Marek Olšák" <maraeo@gmail.com>,
	"Natalie Vock" <natalie.vock@gmx.de>,
	"Melissa Wen" <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Thu, 14 May 2026 23:24:49 +0200	[thread overview]
Message-ID: <5044340.OV4Wx5bFTl@timur-hyperion> (raw)
In-Reply-To: <PH8PR12MB6889CDC1BE1667402FDE152087072@PH8PR12MB6889.namprd12.prod.outlook.com>

Hi Amir,

Thanks, glad to hear that!

I recommend also applying "Add fence argument to amdgpu_vm_handle_fault()" and 
"ACK the retry CAM after VM update finishes", as these fix the race between the 
CAM and the SDMA and should, in my opinion, improve the reliability of the 
whole mechanism.
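To illustrate the ordering those two patches enforce, here is a minimal
stand-in sketch. This is not the real dma_fence API; `mock_fence`,
`ack_cam_after_fence` and `cam_ack` are made-up names that only model the
"ACK after the fence signals" idea:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a fence with a single completion callback.
 * The real kernel primitive (dma_fence_add_callback) is analogous
 * but richer; this models only the ordering we care about. */
struct mock_fence {
    bool signaled;
    void (*cb)(void *data);
    void *cb_data;
};

static bool cam_acked;

/* Hypothetical stand-in for the MMIO write to regIH_RETRY_CAM_ACK. */
static void cam_ack(void *data)
{
    (void)data;
    cam_acked = true;
}

/* Defer the CAM ACK until the fence signals, i.e. until the SDMA
 * page-table update has actually landed. */
static void ack_cam_after_fence(struct mock_fence *f)
{
    if (f->signaled) {
        cam_ack(NULL);      /* update already visible, ACK now */
    } else {
        f->cb = cam_ack;    /* otherwise ACK from the completion path */
        f->cb_data = NULL;
    }
}

static void fence_signal(struct mock_fence *f)
{
    f->signaled = true;
    if (f->cb)
        f->cb(f->cb_data);
}
```

The point is only the ordering: the CAM slot is freed after the PTE write is
visible, so the GPU cannot re-fault on a stale translation in the window
between the handler returning and the SDMA job completing.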

In the meantime, I think I figured out what I was missing.
Considering that the FORCE_MISS bits didn't change the behavior, I suspected 
that the culprit must be something else, so I started looking into the PTE 
that is being added.

It all started working for me when I added some new PTE flags to 
amdgpu_vm_handle_fault(): AMDGPU_PTE_IS_PTE and AMDGPU_PTE_NOALLOC.
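For illustration, the flag combination looks something like the sketch below.
The bit positions here are made-up stand-ins, not the real AMDGPU_PTE_*
values (those live in the amdgpu headers and differ per generation), and the
comment about NOALLOC/MALL behavior is my reading, not a verified statement:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in bit positions only; the real AMDGPU_PTE_*
 * definitions are in the amdgpu driver headers. */
#define PTE_VALID    (1ULL << 0)
#define PTE_SYSTEM   (1ULL << 1)
#define PTE_IS_PTE   (1ULL << 10)   /* hypothetical position */
#define PTE_NOALLOC  (1ULL << 11)   /* hypothetical position */

/* Build the flags for the dummy mapping installed by the retry-fault
 * handler: on gfx12 the walker apparently needs IS_PTE set, and
 * NOALLOC reportedly keeps the dummy page out of the MALL cache. */
static uint64_t retry_fault_pte_flags(void)
{
    return PTE_VALID | PTE_SYSTEM | PTE_IS_PTE | PTE_NOALLOC;
}
```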
With that, I cleaned up my patches and force-pushed them to the same branch:

https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4

What do you think?

Best regards,
Timur


On Thursday, May 14, 2026 9:32:16 PM Central European Summer Time Shetaia, 
Amir wrote:
> 
> Hi Timur,
> 
> I cherry-picked your soft-IH-ring + timestamp-filter patches onto our build,
> ran a malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):
> 
>                           PASS    1st hang    svm_range_restore_pages entered
> Baseline (no patches)     1/10    trial 6     97k
> With your patches         2/10    trial 8     43k
> 
> What I noticed on NV4:
> - 2× trials before failure (soft-IH unbound workqueue helping)
> - ~55% reduction in spurious svm_range_restore_pages (your timestamp-filter
> fix catches a real bug here too, not just Strix Halo)
> - Soft IH worker visibly on events_unbound queue
> 
> Same dma_fence_wait_timeout signature underneath, now on TWO workers
> (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete, TTM cleanup gated on
> the same fence chain). Central BO-clear/GCR deadlock unchanged; your
> patches fix adjacent bugs, but that one needs the HW fix.
> Also tested the UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are
> regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY (the TCP-level register moved
> to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA
> banks → 0/10 PASS, hung sooner. So slowing UTCL0 retries doesn't help;
> the deadlock isn't pressure-driven on UTCL0.
> 
> AMIR SHETAIA
> Senior Software Development Engineer  |  AMD
> Software Platform Architecture Team
> 1 Commerce Valley Drive, Markham, ON L3T 7X6
> LinkedIn  |  Instagram  |  X  |  amd.com
> 
> 
> 
> 
> -----Original Message-----
> From: Timur Kristóf <timur.kristof@gmail.com>
> Sent: Thursday, May 14, 2026 1:03 PM
> To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir <Amir.Shetaia@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>;
> Koenig, Christian <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>;
> Natalie Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
> 
> Hi Amir,
> 
> I pushed a work in progress branch here:
> https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4
> 
> This contains the two series that I sent to the mailing list yesterday:
> "Improve retry fault handling"
> "Improve soft IH ring"
> 
> If we have to rely on the soft IH ring on Navi4, then we absolutely need the
> patches from "Improve soft IH ring", as it fixes a few bugs and switches
> the soft IH ring to use the system unbound workqueue. Otherwise it can
> happen that the soft IH ring fills up even before the CPU can process
> anything in it. (This is especially bad on HW that doesn't have the
> filter CAM enabled, e.g. Strix Halo.)
> On top of that, there is a patch to solve the race condition and make
> sure that the retry
> CAM is ACKed after the VM update is complete. On Navi 31, this works and
> successfully makes sure that the race is eliminated and each VM fault is
> only processed once. I'd like to submit these if you guys agree on the
> approach.
> 
> And finally, there is a very messy patch that contains what I tried to do to
> get it all working on Navi 48. This has a lot of comments with my notes
> during development, and some attempts I tried to fix the issue (without
> success).
> 
> Basically, what I did was:
> 
> - Enabled the retry CAM the same way as on IH 6.0, exactly like you said
> - Added the cam_index to gmc_v12_0_process_interrupt
> - Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry
> faults (without that, the faults don't have the retry bit in src_data[1])
> 
> Now, what I observe on Navi 48 is:
> 
> 1. Retry fault interrupt is triggered
> 2. Dispatched on the soft IH ring
> 3. The interrupt handler is called again on the soft IH ring
> 4. The fault is mitigated by amdgpu_vm_handle_fault()
> 5. Finally, the retry CAM is ACKed
> 6. As soon as the CAM is ACKed, I get another interrupt for the same
> address, and it goes on infinitely or until GPU reset is triggered.
> 
> Attempts at fixing what I saw on Navi 48:
> 
> - Tried to add a TLB flush to various places. Now that the race is
> mitigated, the TLB flush doesn't make things worse anymore, but it's also
> not helping.
> - Tried to set the FORCE_MISS bits on various cache-related registers,
> hoping that it would work around the possibility of the PTE/PDE not being
> updated in the cache. This didn't help either.
> - Tried to use amdgpu.vm_update_mode=3 to use the CPU to update the page
> tables, but that didn't help on Navi 48.
> - Tried to halt the CP using CP_ME_CNTL to see if it would stop sending
> interrupts, but it didn't.
> - Using umr, I tried to disable the L2 cache and reset it with the GRBM,
> but that just broke the whole system.
> 
> Other notes on the retry fault topic in general:
> 
> - On GPUs that don't have the retry CAM (e.g. Navi 1-2 and APUs) we'll need
> to consider disabling the VM fault interrupt while processing the retry,
> because the soft IH ring can fill up really quickly and the CPU may
> struggle to keep up with it.
> - In general (for all GPUs), I think it would be better to update a larger
> VA range of subsequent pages rather than just a single page, in order to
> more efficiently mitigate VM faults to neighbouring pages.
> - Currently all of the invalid accesses are redirected to the same dummy
> page, which is a security hole because it means that a process that had an
> out-of-bounds write can leak data to a process that does an out-of-bounds
> read. I already talked to Christian about this; he has a few ideas how to
> fix it.
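The larger-VA-range idea from the second bullet above could look something
like this sketch. The window size and the helper name are hypothetical; real
code would also have to clamp the window to the VM bounds and to neighbouring
mappings:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12ULL
/* Hypothetical window: map 512 KiB (128 pages) around each fault
 * instead of a single 4 KiB page. */
#define WINDOW_SHIFT 19ULL
#define WINDOW_SIZE  (1ULL << WINDOW_SHIFT)

/* Given a faulting VA, compute the aligned [start, end) range that a
 * single VM page-table update would then cover, so faults to
 * neighbouring pages are mitigated by the same update. */
static void fault_window(uint64_t fault_va, uint64_t *start, uint64_t *end)
{
    *start = fault_va & ~(WINDOW_SIZE - 1);
    *end   = *start + WINDOW_SIZE;
}
```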
> Thanks & best regards,
> Timur
> 
> 
> On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia,
> Amir wrote:
> > Hi Timur,
> >
> >
> >
> > Thanks for clarifying. Yes, please push the Navi 48 WIP branch when
> > you have a chance; it would be useful to compare side-by-side.
> 
> 
> 
> > Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that
> > doorbell ACK does not free the CAM slot on the NV4 family (gfx1200/1201).
> > I verified empirically: slots walk monotonically until the CAM fills,
> > then HW silently drops retry events. An MMIO write to regIH_RETRY_CAM_ACK
> > does free it. Almost certainly required for your Navi 48 attempt.
> > Why the PTE update doesn't reach UTC L0: most likely a race between the
> > SDMA PTE-update job completing and the CAM ACK firing the IRQ for
> > re-handling. Until SDMA finishes the write and the invalidation
> > propagates, UTC L0 keeps getting the stale "invalid" reply, which is
> > exactly your symptom.
> > To inspect with umr: clone
> > https://gitlab.freedesktop.org/tomstdenis/umr and use
> > scripts/diag/dump_all_cpc_info.sh (needs halt_if_hws_hang=1
> > gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> > '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY
> > per-VMID
> > bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> > translation timeout, not denial). Page-table walk to confirm PTE
> > landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.
> > Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that
> > race because the BO-clear deadlock dominates on NV4, but the ordering
> > bug is almost certainly there too. A fence callback is the right shape,
> > cleaner than blocking the worker, and would be a generic fix benefiting
> > both Navi 3 and 4. Happy to review if you take a swing.
> > IH1 routing: fair point. I just inherited the existing IH0 path and
> > haven't tried IH1. On the list once CAM ACK and fence ordering are
> > stable.
> 
> 
> 
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Timur Kristóf <timur.kristof@gmail.com>
> > Sent: Wednesday, May 13, 2026 6:12 PM
> > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir <Amir.Shetaia@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>;
> > Koenig, Christian <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>;
> > Natalie Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> > Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
> > 
> > Hi Amir,
> >
> >
> >
> >
> > > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > > ih_v7_0 implementation, which still does WDOORBELL; that's where I'd
> > > suggest swapping in MMIO for NV4.
> >
> >
> >
> >
> > First, let me clear up a slight misunderstanding here. The patch
> > series that I sent here only contains what I managed to get working on
> > Navi 31, it does not contain any code from my attempt at Navi 48.
> 
> 
> 
> > The patch 5 is just a slight refactor of the pre-existing code and
> > (intentionally) doesn't contain any functional changes. It seems that
> > IH 7.1 relies on the doorbell, so I didn't want to remove it, though I
> > have no means to verify whether that actually works.
> 
> 
> 
> > If you like, I can push a few WIP patches to a different branch
> > tomorrow to show you exactly what I did on Navi 48. However, I
> > wouldn't want to submit that to the mailing list without first making sure
> > that it works well.
> 
> 
> >
> >
> > > 1. "Fault never resolves on NV48" different shape from our
> > > broken-CAM-ACK symptom.
> >
> >
> >
> >
> >
> > > You're right, those are different. Our CAM-walk-monotonically
> > > symptom only shows up when the CAM is enabled but the ACK is broken.
> > > On your NV48 setup the CAM probably isn't enabled at all (your patch 6
> > > only enables it for ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)
> >
> >
> >
> >
> > For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way I
> > do for ih_v6_0 in the series. But because I couldn't get it fully
> > working, I didn't include any of that code in the series.
> >
> >
> >
> > > so retries fire
> > > repeatedly on the IH ring instead of being deduped by the CAM. That
> > > matches what you're seeing: amdgpu_vm_handle_fault keeps being
> > > called, but each call is on a fresh IRQ for the same address.
> > > Two things that could be happening underneath:
> > > - The fault handler runs but the updated PTE never reaches UTC L0
> > > (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > > translate" in our UMR captures.
> >
> >
> >
> >
> > I think this explanation may fit what I saw.
> > Why is it not reaching UTC L0?
> > Also, how do you inspect this stuff in umr?
> >
> >
> >
> >
> > > 2. What bits we check on src_data[2]:
> > >
> > >
> > >
> > >
> > >
> > > Honestly, we don't use src_data[2] for retry detection. We use it
> > > only for the cam_index:
> > > cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */
> >
> >
> >
> >
> >
> > > For retry detection we initially used the gfx9 constant on
> > > src_data[1] like you, but observed the bit cleared on a lot of NV4
> > > events that should have been retries (waves were hung in xnack-stall
> > > but no IH event matched). So we just go through the retry path
> > > unconditionally on NV4
> >
> >
> >
> >
> > I see. I think I needed to change some gfxhub registers to get those
> > src_data bits that you are missing.
> 
> 
> 
> >
> >
> > > and let
> > > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > > specific to gfx1201 / our test path
> >
> >
> >
> >
> > My test case is a simple Vulkan shader which I am executing with
> > vkrunner.
> > Each shader invocation does an out of bounds read from a different
> > page. For Navi 31 (and Strix Halo), I started out with just 1 page
> > fault, and once I could mitigate that reliably, I turned it up to several
> > hundred faults.
> 
> 
> >
> >
> > > 3. TLB flush making it worse ... a clue about what to do:
> > >
> > >
> > >
> > >
> > >
> > > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush)
> > > request, and the GC cache block isn't ACK'ing the GCR while UTC L2
> > > is saturated by the user shader's XNACK retry storm. Adding a TLB
> > > flush adds another translation request to the same saturated UTC,
> > > which is why it makes things worse.
> >
> >
> >
> >
> > This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> > the function first calls amdgpu_vm_handle_fault(), which kicks off an
> > SDMA job to update the page tables, and then it ACKs the filter CAM.
> > However, at the moment the CAM is ACKed, the SDMA job is not finished
> > yet, so the CP sees that the page is still invalid and the page fault
> > interrupt fires again. I actually noticed that on Navi 31 too, but it's
> > just not fatal there; it basically just handles the same page fault
> > twice.
> 
> > Once we solve this flaw, I would like to propose to enable retry
> > faults by default on Navi 3.
> > Here is an idea for a solution:
> >
> >
> >
> > Instead of ACKing the CAM right away, we should do it after the SDMA
> > fence is signalled, i.e. when we are sure the page tables are updated.
> > Maybe we can set a callback on the fence and do it there, though it
> > would require some code churn to get that to work.
> 
> 
> 
> > What do you think?
> >
> >
> >
> >
> > > 4. IH1 ring on NV4:
> > >
> > >
> > >
> > >
> > >
> > > Same as you ... retry faults on NV4 always come in on IH0. We
> > > delegate from
> > > IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > > SVM/migration path can sleep, but the original entry is on IH0. We
> > > haven't tried IH1 routing.
> >
> >
> >
> >
> > Why, though?
> >
> >
> >
> > The ih_v7_0 code does set up the IH1 ring and configures it exactly the
> > same way as ih_v6_0, so I don't see why it wouldn't work.
> 
> 
> 
> >
> >
> > > Re your branch: thanks for the gitlab link, easier than digging
> > > through patchwork.
> > > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch
> > > 4 cleans up the timestamp filter delta we're seeing (97k entered /
> > > 2.8k completed at 1 GiB might be partly explained by your Strix Halo
> > > bug).
> >
> >
> >
> >
> > The timestamp issue actually gave me an endless headache when I first
> > got into this topic in December. I hope the patch helps!
> 
> 
> 
> > Best regards,
> > Timur
> >
> >
> >
> 
> 
> 
> 




Thread overview: 18+ messages
2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
2026-05-13 16:43   ` Timur Kristóf
2026-05-13 17:28     ` Shetaia, Amir
2026-05-13 17:32       ` Deucher, Alexander
2026-05-13 17:51       ` Timur Kristóf
2026-05-13 20:32         ` Shetaia, Amir
2026-05-13 22:12           ` Timur Kristóf
2026-05-14 15:04             ` Shetaia, Amir
2026-05-14 17:02               ` Timur Kristóf
2026-05-14 19:32                 ` Shetaia, Amir
2026-05-14 21:24                   ` Timur Kristóf [this message]
