From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>,
"Shetaia, Amir" <Amir.Shetaia@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"Marek Olšák" <maraeo@gmail.com>,
"Natalie Vock" <natalie.vock@gmx.de>,
"Melissa Wen" <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Wed, 13 May 2026 19:51:49 +0200
Message-ID: <10056920.eNJFYEL58v@timur-hyperion>
In-Reply-To: <PH8PR12MB688984F5D361A30D77FB046E87062@PH8PR12MB6889.namprd12.prod.outlook.com>
Hi Amir,
Thanks for the quick response!
See my replies below.
On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia,
Amir wrote:
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen.
> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO
> (WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
I agree. That's my conclusion as well and that's exactly what I'm doing in my
series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"
> "fault never resolves" is exactly the symptom you'd see if the
> CAM never gets cleared.
Not exactly.
When the CAM never gets cleared, the first page fault is still resolved, but
subsequent page faults that belong to the same CAM entry will cause a hang:
the IRQ is filtered out, so the IRQ handler is never called.
That's not what I see on Navi 48. Instead, what I see is that the IRQ fires
repeatedly and amdgpu_vm_handle_fault() is called repeatedly, but it just
doesn't resolve the fault.
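To make the distinction concrete, here is a toy userspace model of the
filtering behaviour described above. It is not driver code and not a
description of the real hardware: the slot count, the struct, the toy_*
function names, and the idea of keying entries by page address are all
assumptions made for illustration. Only the 0x3ff mask comes from the quoted
MMIO write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CAM_SLOTS 8 /* hypothetical size, chosen only for this example */

/* Toy stand-in for the retry CAM: one valid flag and one page per slot. */
struct toy_cam {
	bool valid[TOY_CAM_SLOTS];
	uint64_t page[TOY_CAM_SLOTS];
};

/*
 * Report a fault for 'page'.
 * Returns the slot index if the fault is forwarded to the IRQ handler,
 * or -1 if it is filtered (page already in the CAM) or the CAM is full.
 */
static int toy_cam_fault(struct toy_cam *cam, uint64_t page)
{
	int free_slot = -1;

	assert(cam);
	for (int i = 0; i < TOY_CAM_SLOTS; i++) {
		if (cam->valid[i] && cam->page[i] == page)
			return -1; /* duplicate: filtered, handler never runs */
		if (!cam->valid[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1; /* full: this toy model simply drops the fault */

	cam->valid[free_slot] = true;
	cam->page[free_slot] = page;
	return free_slot;
}

/* Acknowledge a slot so that later faults on the same page are forwarded. */
static void toy_cam_ack(struct toy_cam *cam, uint32_t cam_index)
{
	assert(cam);
	cam_index &= 0x3ff; /* same mask as in the quoted MMIO write */
	if (cam_index < TOY_CAM_SLOTS)
		cam->valid[cam_index] = false;
}
```

In this model, a missing ACK means the second fault on the same page returns
-1 and is never delivered, which corresponds to the hang case. A handler that
is called over and over without resolving anything, as on Navi 48, is a
different failure that this model does not reproduce.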
> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.
Interesting. Could you share what bits you checked on src_data[2]?
The gfx9-era constants worked for me on both Navi 31 and 48 for detecting
retry faults; however, I needed to program some extra register fields in the
gfxhub code to actually enable retry fault interrupts.
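For reference, the kind of check being discussed can be sketched as a
standalone helper. This is only an illustration under stated assumptions: the
struct and the toy_* names are hypothetical, and the 0x80 bit value is assumed
to match AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY and should be verified against
amdgpu_gmc.h. Which bit, if any, a gfx12-native path tests in src_data[2] is
not stated in this thread, so the word index is left as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed retry flag value; verify against amdgpu_gmc.h before relying on it. */
#define TOY_FAULT_SOURCE_DATA_RETRY 0x80u

#define TOY_SRC_DATA_WORDS 4 /* hypothetical size for this example */

/* Hypothetical stand-in for a decoded interrupt vector entry. */
struct toy_iv_entry {
	uint32_t src_data[TOY_SRC_DATA_WORDS];
};

/* Test the assumed retry flag in the given src_data word. */
static bool toy_is_retry_fault(const struct toy_iv_entry *entry,
			       unsigned int word)
{
	assert(entry && word < TOY_SRC_DATA_WORDS);
	return (entry->src_data[word] & TOY_FAULT_SOURCE_DATA_RETRY) != 0;
}
```

With this helper, the gfx9-era check corresponds to word 1, and a variant that
reads src_data[2] would pass word 2, possibly with a different mask.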
>
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2 already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy
> on the user VMID with SDMA parked on a GCR ack.
I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?
> 4. Up to ~512 MiB our patches resolve faults cleanly;
That's pretty impressive! Nice work!
> at 1 GiB we see random hangs that we've isolated to an SDMA -> GCR ->
> GC-cache deadlock when the BO-clear runs in ih_soft_work context.
Actually, something I forgot to ask: on Navi 4x is it possible to use the IH1
ring? On my machine it seemed that the retry fault interrupts always come in
on the IH0 ring even though the IH1 is enabled and configured upstream already.
> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.
You can view it on patchwork or the mailing list archives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500
Or if that's more comfortable for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults
Thanks & best regards,
Timur
Thread overview: 14+ messages
2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
2026-05-13 16:43 ` Timur Kristóf
2026-05-13 17:28 ` Shetaia, Amir
2026-05-13 17:32 ` Deucher, Alexander
2026-05-13 17:51 ` Timur Kristóf [this message]
2026-05-13 20:32 ` Shetaia, Amir
2026-05-13 22:12 ` Timur Kristóf