From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>,
"Shetaia, Amir" <Amir.Shetaia@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"Marek Olšák" <maraeo@gmail.com>,
"Natalie Vock" <natalie.vock@gmx.de>,
"Melissa Wen" <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Fri, 12 Jun 2026 09:16:27 +0200 [thread overview]
Message-ID: <16497357.Emhk5qWAgF@timur-max> (raw)
In-Reply-To: <5044340.OV4Wx5bFTl@timur-hyperion>
Hi Everyone,
Can we please make some progress here?
I've still got 3 patch series on the mailing list to improve retry fault
handling, without any feedback for many weeks now.
Can someone please review these please?
Thanks & best regards,
Timur
On 2026. május 14., csütörtök 23:24:49 közép-európai nyári idő Timur Kristóf
wrote:
> Hi Amir,
>
> Thanks, glad to hear that!
>
> I recommend to also apply "Add fence argument to amdgpu_vm_handle_fault()"
> and "ACK the retry CAM after VM update finishes" as these fix the race
> between the CAM and the SDMA, and should improve the reliability of the
> whole mechanism in my opinion.
>
> In the meantime I think I figured out what I was missing.
> Considering that the FORCE_MISS bits didn't change the behavior, I was
> thinking that the curlpit must be something else, so I started looking into
> the PTE that is being added.
>
> It all started working for me when I added some new PTE flags to
> amdgpu_vm_handle_fault(): AMDGPU_PTE_IS_PTE and AMDGPU_PTE_NOALLOC.
> With that, I cleaned up my patches and force-pushed them to the same branch:
>
> https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4
>
> What do you think?
>
> Best regards,
> Timur
>
>
> On Thursday, May 14, 2026 9:32:16 PM Central European Summer Time Shetaia,
>
> Amir wrote:
> > AMD General
> >
> > Hi Timur,
> >
> > I cherry-picked your soft-IH-ring + timestamp-filter patches onto our
> > build,>
> > ran malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):
> > PASS 1st hang
> > sR_r
> >
> > entered
>
> Baseline (no patches) 1/10 trial 6 97k
>
> > With your patches 2/10 trial 8 43k
> >
> > What I noticed on NV4:
> > - 2× trials before failure (soft-IH unbound workqueue helping)
> > - ~55% reduction in spurious svm_range_restore_pages (your
> > timestamp-filter
> > fix catches a real bug here too, not just Strix Halo)
>
> - Soft IH worker
>
> > visibly on events_unbound queue
> >
> > Same dma_fence_wait_timeout signature underneath, now on TWO workers
> > (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete,
>
> TTM cleanup gated on
>
> > the same fence chain). Central BO-clear/GCR deadlock unchanged, your
> > patches fix adjacent bugs but that one needs the HW fix.
> > Also tested UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are
> > regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY
>
> (the TCP-level register moved
>
> > to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA
> > banks → 0/10 PASS, hung sooner. So, slowing UTCL0 retries doesn't help.
> > deadlock isn't pressure-driven on UTCL0.
> >
> > AMIR SHETAIA
> > Senior Software Development Engineer | AMD
> > Software Platform Architecture Team
> > --------------------------------------------------------------------------
> > -- ------------------------------------------------------
>
> 1 Commerce Valley
>
> > Drive, Markham, ON L3T 7X6
> > LinkedIn | Instagram | X | amd.com
> >
> >
> >
> >
> > -----Original Message-----
> > From: Timur Kristóf <timur.kristof@gmail.com>
> > Sent: Thursday, May 14, 2026 1:03 PM
> > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> > <Amir.Shetaia@amd.com>
>
> Cc: amd-gfx@lists.freedesktop.org; Deucher,
>
> > Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> > <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock
> > <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com> Subject: Re: [PATCH
> > 0/6] drm/amdgpu: Improve retry fault handling
> > Hi Amir,
> >
> > I pushed a work in progress branch here:
> > https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_nav
> > i4
> >
> >
> > This contains the two series that I sent to the mailing list yesterday:
> > "Improve retry fault handling"
> > "Improve soft IH ring"
> >
> > If we have to rely on the soft IH ring on Navi4, then we absolutely need
> > the patches from "Improve soft IH ring", as it fixes a few bugs and
> > switches the soft IH ring to use the system unbound workqueue. Otherwise
> > it can happen that the soft IH ring fills up even before your CPU can
> > process anything in it.
>
> (This is especially bad on HW that doesn't have the filter
>
> > CAM enabled, eg. Strix Halo.)
> > On top of that,
> >
> > There is a patch to solve the race condition and make sure that the retry
> > CAM is ACKed after the VM update is complete. On Navi 31, this works and
> > successfully makes sure that the race is eliminated and each VM fault is
> > only processed once. I'd like to submit these if you guys agree on the
> > approach.
> >
> > And finally, there is a very messy patch that contains what I tried to do
> > to get it all working on Navi 48. This has a lot of comments with my
> > notes during development, and some attempts I tried to fix the issue
> > (without success).
> >
> > Basically, what I did was:
> >
> > - Enable the retry CAM the same way as I did on IH 6.0 exactly like you
> > said
>
> - Added the cam_index to gmc_v12_0_process_interrupt
>
> > - Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry
> > faults (without that, the faults don't have the retry bit in src_data[1])
> >
> > Now, what I observe on Navi 48 is:
> >
> > 1. Retry fault interrupt is triggered
> > 2. Dispatched on the soft IH ring
> > 3. The interrupt handler is called again on the soft IRQ ring 4. The fault
> > is mitigated by amdgpu_vm_handle_fault() 5. Finally, the retry CAM is
> > ACKed
> > 6. As soon as the CAM is ACKed, I get another interrupt for the same
> > address, and it goes on infinitely or until GPU reset is triggered.
> >
> > Attempts at fixing what I saw on Navi 48:
> >
> > - Tried to add a TLB flush to various places. Now that the race is
> > mitigated, the TLB flush doesn't make things worse anymore, but it's also
> > not helping.
>
> - Tried to set the FORCE_MISS bits on various cache related
>
> > registers, hoping that it would work around the possibility of the PTE/PDE
> > not being updated in the cache. This didn't help either. - Tried to use
> > amdgpu.vm_update_mode=3 to use the CPU to update the page tables, but that
> > didn't help on Navi 48 - Tried to halt the CP using CP_ME_CNTL to see if
> > it
> > would stop sending interrupts, but it didn't. - Using umr I tried to
> > disable the L2 cache and reset it with the GRBM, but that just broke the
> > whole system.
> > Other notes on the retry fault topic in general:
> >
> > - On GPUs that don't have the retry CAM (eg. Navi 1-2 and APUs) we'll need
> > to consider disabling the VM fault interrupt while processing the retry,
> > because the soft IH ring can fill up really quickly and the CPU may
> > struggle to keep up with it.
>
> - In general (for all GPUs), I think it would
>
> > be better to update a larger VA range of subsequent pages rather than just
> > a single page, in order to more efficiently mitigate VM faults to
> > neighbouring pages. - Currently all of the invalid accesses are redirected
> > to the same dummy page, which is a security hole because it means that a
> > process that had an out of bounds write can leak data to a process that
> > does an out of bounds read. I already talked to Christian about this, he
> > has a few ideas how to fix it.
> > Thanks & best regards,
> > Timur
> >
> >
> > On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia,
> >
> > Amir wrote:
> > > AMD General
> > >
> > >
> > >
> > > Hi Timur,
> > >
> > >
> > >
> > > Thanks for clarifying. yes, please push the Navi 48 WIP branch when
> > > you have a chance, useful to compare side-by-side.
> > >
> > >
> > >
> > > Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that
> > > doorbell ACK does not free the CAM slot on the NV4 family
> >
> > (gfx1200/1201). I
> >
> > > verified empirically, slots walk monotonically until the CAM fills,
> > > then HW silently drops retry events. MMIO write to regIH_RETRY_CAM_ACK
> > > does free it. Almost certainly required for your Navi 48 attempt.
> > > Why PTE update doesn't reach UTC L0: Most likely a race between the
> > > SDMA PTE-update job completing and the CAM ACK firing the
> >
> > IRQ for re-handling.
> >
> > > Until SDMA finishes the write + invalidation propagates, UTC L0 keeps
> > > getting the stale "invalid" reply which is exactly your symptom.
> > > To inspect with umr: clone
> > > https://gitlab.freedesktop.org/tomstdenis/umr and use
> > > scripts/diag/dump_all_cpc_info.sh (needs
> >
> > halt_if_hws_hang=1
> >
> > > gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> > > '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY
> > > per-VMID
> > > bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> > > translation timeout, not denial). Page-table walk to confirm PTE
> > > landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.
> > > Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that
> > > race because the BO-clear deadlock dominates on NV4, but
> >
> > the ordering bug is
> >
> > > almost certainly there too. Fence callback is the right shape, cleaner
> > > than blocking the worker. Would be a generic fix benefiting both Navi 3
> > > and 4.
>
> Happy to review if you take a swing.
>
> > > IH1 routing: fair point. I just inherited the existing IH0 path and
> > > haven't tried IH1. On the list once CAM ACK and fence ordering are
> > > stable.
> > >
> > >
> > >
> > > AMIR SHETAIA
> > > Senior Software Development Engineer | AMD Software Platform
> > > Architecture Team
> > > ----------------------------------------------------------------------
> > > ------
> > > ------------------------------------------------------
> >
> > 1 Commerce Valley
> >
> > > Drive, Markham, ON L3T 7X6
> > > LinkedIn | Instagram | X | amd.com
> > >
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Timur Kristóf <timur.kristof@gmail.com>
> > > Sent: Wednesday, May 13, 2026 6:12 PM
> > > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> > > <Amir.Shetaia@amd.com>
> >
> > Cc: amd-gfx@lists.freedesktop.org; Deucher,
> >
> > > Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> > > <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie
> > > Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com> Subject: Re:
> > > [PATCH 0/6] drm/amdgpu: Improve retry fault handling Hi Amir,
> > >
> > > > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > > > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > > > ih_v7_0 implementation, which still does WDOORBELL. that's where I'd
> > > > suggest swapping in MMIO for NV4.
> > >
> > > First, let me clear up a slight misunderstanding here. The patch
> > > series that I sent here only contains what I managed to get working on
> > > Navi 31, it does not contain any code from my attempt at Navi 48.
> > >
> > >
> > >
> > > The patch 5 is just a slight refactor of the pre-existing code and
> > > (intentionally) doesn't contain any functional changes. It seems that
> > > IH 7.1 relies on the doorbell, so I didn't want to remove it, albeit I
> > > have no means to verify if that actually works or not.
> > >
> > >
> > >
> > > If you like, I can push a few WIP patches to a different branch
> > > tomorrow to show you exactly what I did on Navi 48. However, I
> > > wouldn't want to submit that to the mailing list without first making
> > > sure
> > > that it works well.
> > >
> > > > 1. "Fault never resolves on NV48" different shape from our
> > > > broken-CAM-ACK symptom.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > You're right, those are different. Our cam-walk-monotonically
> > > > symptom only shows up when CAM is enabled but the ACK is broken.
> > > >
> > > > On your NV48 setup CAM
> > > >
> > > > probably isn't enabled at all (your patch 6 only enables it for
> > > > ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)
> > >
> > > For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way
> > > I do for
> >
> > ih_v6_0 in the series. But, because I couldn't get it fully working, I
> >
> > > didn't include any of that code in the series.
> > >
> > > > so retries fire
> > > > repeatedly on the IH ring instead of being deduped by CAM. That
> > > > matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> > > > called but each call is on a fresh IRQ for the same address.
> > > > Two things that could be happening underneath:
> > > > - The fault handler runs but the updated PTE never reaches UTC L0
> > > > (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > > > translate"
> > > > in our UMR captures.
> > >
> > > I think this explanation may fit what I saw.
> > > Why is it not reaching UTC L0?
> > > Also, how do you inspect this stuff in umr?
> > >
> > > > 2. What bits we check on src_data[2]:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Honestly, we don't use src_data[2] for retry detection. We use it
> > > > only for
> > > > the cam_index: cam_index = entry->src_data[2] & 0x3ff; /* low 10
> > > > bits
> > > > =
> > > > CAM slot */
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > For retry detection we initially used the gfx9 constant on
> > > > src_data[1] like you, but observed the bit cleared on a lot of NV4
> > > > events that should have been retries (waves were hung in xnack-stall
> > > > but no IH event matched).
> >
> > So
> >
> > > > we just go through the retry path unconditionally on NV4
> > >
> > > I see. I think I needed to change some gfxhub registers to get those
> > > src_data bits that you are missing.
> > >
> > > > and let
> > > > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > > > specific to gfx1201 / our test path
> > >
> > > My test case is a simple Vulkan shader which I am executing with
> > > vkrunner.
> > > Each shader invocation does an out of bounds read from a different
> > > page. For Navi 31 (and Strix Halo), I started out with just 1 page
> > > fault, and once I could mitigate that reliably, I turned it up to
> > > several
> > > hundred faults.
> > >
> > > > 3. TLB flush making it worse .. clue about what to do:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > > > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > > > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush)
> > > > request, and the GC cache block isn't ACK'ing the GCR while UTC L2
> > > > is saturated by the user shader's XNACK retry storm. Adding a TLB
> > > > flush adds another translation request to the same saturated UTC,
> > > > which is why it makes things worse.
> > >
> > > This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> > > what the function does is first call amdgpu_vm_handle_fault() which
> > > kicks off an SDMA job to update page tables, and then it ACKs the filter
> > > CAM.
>
> However, at the moment when the CAM is ACKed, the SDMA job is not
>
> > > finished yet, so the CP sees that the page is still invalid and the page
> > > fault interrupt is fired again. I actually noticed that on Navi 31 too,
> > > but it's just not fatal there.
> >
> > It just basically handles the same page fault twice.
> >
> > > Once we solve this flaw, I would like to propose to enable retry
> > > faults by default on Navi 3.
> > > Here is an idea for a solution:
> > >
> > >
> > >
> > > Instead of ACKing the CAM right away, we should do it after the SDMA
> > > fence is signalled, ie. when we are sure the page tables are updated.
> > > Maybe we can set a callback on the fence and do it there, though it
> > > would require a slight code churn to get that to work.
> > >
> > >
> > >
> > > What do you think?
> > >
> > > > 4. IH1 ring on NV4:
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Same as you ... retry faults on NV4 always come in on IH0. We
> > > > delegate from
> > > > IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > > > SVM/migration path can sleep, but the original entry is on IH0. We
> > > > haven't tried IH1 routing.
> > >
> > > Why, though?
> > >
> > >
> > >
> > > The ih_v7_0 code does set up the IH1 ring and configure it exactly the
> > > same as the ih_v6_0, so I don't see why it wouldn't work?
> > >
> > > > Re your branch: thanks for the gitlab link, easier than digging
> > > > through patchwork.
> > > > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch
> > > > 4 cleans up the timestamp filter delta we're seeing (97k entered /
> > > > 2.8k completed at 1 GiB might be partly explained by your Strix Halo
> > > > bug).
> > >
> > > The timestamp issue actually gave me an endless headache when I first
> > > got into this topic in December. I hope the patch helps!
> > >
> > >
> > >
> > > Best regards,
> > > Timur
prev parent reply other threads:[~2026-06-12 7:16 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-18 14:57 ` Christian König
2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-19 7:00 ` Christian König
2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
2026-05-13 16:43 ` Timur Kristóf
2026-05-13 17:28 ` Shetaia, Amir
2026-05-13 17:32 ` Deucher, Alexander
2026-05-13 17:51 ` Timur Kristóf
2026-05-13 20:32 ` Shetaia, Amir
2026-05-13 22:12 ` Timur Kristóf
2026-05-14 15:04 ` Shetaia, Amir
2026-05-14 17:02 ` Timur Kristóf
2026-05-14 19:32 ` Shetaia, Amir
2026-05-14 21:24 ` Timur Kristóf
2026-06-12 7:16 ` Timur Kristóf [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=16497357.Emhk5qWAgF@timur-max \
--to=timur.kristof@gmail.com \
--cc=Alexander.Deucher@amd.com \
--cc=Amir.Shetaia@amd.com \
--cc=Christian.Koenig@amd.com \
--cc=alexdeucher@gmail.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=maraeo@gmail.com \
--cc=mwen@igalia.com \
--cc=natalie.vock@gmx.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox