Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

From: "Timur Kristóf" <timur.kristof@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>,
	"Shetaia, Amir" <Amir.Shetaia@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Marek Olšák" <maraeo@gmail.com>,
	"Natalie Vock" <natalie.vock@gmx.de>,
	"Melissa Wen" <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Wed, 13 May 2026 19:51:49 +0200
Message-ID: <10056920.eNJFYEL58v@timur-hyperion>
In-Reply-To: <PH8PR12MB688984F5D361A30D77FB046E87062@PH8PR12MB6889.namprd12.prod.outlook.com>

Hi Amir,

Thanks for the quick response!
See my replies below.

On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia, Amir wrote:
> 
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen.
>
> Quick highlights from my work:
> 
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO
> (WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).

I agree. That's my conclusion as well and that's exactly what I'm doing in my 
series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"

> "fault never resolves" is exactly the symptom you'd see if the
> CAM never gets cleared. 

Not exactly.

When the CAM never gets cleared, the first page fault is still resolved, but 
subsequent page faults that belong to the same CAM entry will cause a hang: 
the IRQ is filtered out, so the IRQ handler is never called.
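
To illustrate what I mean by filtering, here is a toy user-space model of the
behaviour described above. It is only a sketch: the struct, the function
names and the 4-entry size are made up for illustration and do not correspond
to the real IH hardware or to any amdgpu code.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_CAM_ENTRIES 4 /* hypothetical size, not the real CAM depth */

/* Hypothetical model of a retry CAM: one faulting page per slot. */
struct toy_cam {
	uint64_t addr[TOY_CAM_ENTRIES];
	bool valid[TOY_CAM_ENTRIES];
};

/*
 * Model a page fault arriving at the CAM.
 * Returns the slot index if the fault would raise an interrupt, or -1 if
 * it is filtered out (a matching slot is still held, or no slot is free).
 */
static int toy_cam_fault(struct toy_cam *cam, uint64_t page_addr)
{
	int i, free_slot = -1;

	for (i = 0; i < TOY_CAM_ENTRIES; i++) {
		if (cam->valid[i] && cam->addr[i] == page_addr)
			return -1; /* duplicate of a held entry: no IRQ */
		if (!cam->valid[i] && free_slot < 0)
			free_slot = i;
	}

	if (free_slot < 0)
		return -1; /* CAM full: no IRQ either */

	cam->valid[free_slot] = true;
	cam->addr[free_slot] = page_addr;
	return free_slot;
}

/* Model the ACK: frees the slot so later faults on that page get through. */
static void toy_cam_ack(struct toy_cam *cam, int index)
{
	cam->valid[index] = false;
}
```

In this model, if the ACK never frees the slot, every later fault on the same
page returns -1, so the handler is never invoked again. That is the hang I
described above.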

That's not what I see on Navi 48. Instead, the IRQ fires repeatedly and 
amdgpu_vm_handle_fault() is called repeatedly, but it never resolves the 
fault.

> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

Interesting. Could you share what bits you checked on src_data[2]?

The gfx9-era constants worked for me on both Navi 31 and 48 for detecting 
retry faults; however I needed to program some extra register fields in the 
gfxhub code to actually enable retry fault interrupts.
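
For reference, the kind of check I mean is roughly the following. This is a
simplified stand-alone sketch, not the kernel code: the struct is a stand-in
for the decoded IV entry, all toy_* names are hypothetical, and the bit value
is a placeholder (the real AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY definition is
in the amdgpu headers).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder value for illustration only; see the amdgpu headers for the
 * real AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY definition.
 */
#define TOY_FAULT_SOURCE_DATA_RETRY 0x80u

/* Hypothetical stand-in for the decoded interrupt vector entry. */
struct toy_iv_entry {
	uint32_t src_data[4];
};

/* Treat the fault as a retry fault when the retry bit is set in src_data[1]. */
static bool toy_is_retry_fault(const struct toy_iv_entry *entry)
{
	return (entry->src_data[1] & TOY_FAULT_SOURCE_DATA_RETRY) != 0;
}
```

A check like this only looks at src_data[1], so it would not notice a retry
indication that is reported in src_data[2] instead.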

> 
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2 already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy
> on the user VMID with SDMA parked on a GCR ack.

I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?

> 4. Up to ~512 MiB our patches resolve faults cleanly;

That's pretty impressive! Nice work!

> at 1 GiB we see random hangs that we've isolated to an SDMA -> GCR ->
> GC-cache deadlock when the BO-clear runs in ih_soft_work context.

Actually, something I forgot to ask: is it possible to use the IH1 ring on 
Navi 4x? On my machine, the retry fault interrupts always seem to come in on 
the IH0 ring, even though IH1 is already enabled and configured upstream.

> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.

You can view it on patchwork or the mailing list archives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500

Or if that's more comfortable for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults

Thanks & best regards,
Timur