Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Timur Kristóf" <timur.kristof@gmail.com>
To: amd-gfx@lists.freedesktop.org,
	"Alex Deucher" <alexander.deucher@amd.com>,
	christian.koenig@amd.com, "Natalie Vock" <natalie.vock@gmx.de>,
	"Mario Limonciello" <mario.limonciello@amd.com>,
	"Amir Shetaia" <Amir.Shetaia@amd.com>,
	"Marek Olšák" <maraeo@gmail.com>,
	"Tvrtko Ursulin" <tursulin@ursulin.net>
Subject: Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
Date: Mon, 15 Jun 2026 17:32:25 +0200	[thread overview]
Message-ID: <10078559.eNJFYEL58v@timur-hyperion> (raw)
In-Reply-To: <0b18193b-9f2d-4ea9-8db3-08579325ab0c@ursulin.net>

On Monday, June 15, 2026 5:23:52 PM Central European Summer Time Tvrtko 
Ursulin wrote:
> On 15/06/2026 15:52, Timur Kristóf wrote:
> > On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko
> > 
> > Ursulin wrote:
> >> On 25/05/2026 12:45, Timur Kristóf wrote:
> >>> Different interrupts may have different timestamp sources,
> >>> which shouldn't be compared.
> >>> 
> >>> If we compare the timestamps of retry faults to timestamps
> >>> of other interrupts, it may result in all retry fault
> >>> interrupts being filtered out, because of the different
> >>> time stamp source.
> >>> 
> >>> This issue was observed on Strix Halo.
> >>> Solved by storing the timestamp of the last page fault interrupt.
> > 
> > Hi,
> > 
> >> This one may require access to AMD docs to review. For example I am
> >> immediately curious as to how many different clock sources on a single
> >> IH there are
> > 
> > As far as I know there are various timestamp sources in the GPU and some
> > interrupts use different ones. I am not aware of any documentation on this
> > topic, unfortunately.
> > 
> >> how does that relate to the timestamp_src field
> > 
> > The timestamp_src field is set differently when the timestamp source is
> > different. So, it could happen that we accidentally filter out all page
> > faults when we shouldn't.
> > 
> >> and if there are indeed multiple clock domains should the patch perhaps
> >> be
> >> generalized to something like
> >> ih->processed_timestamp[entry->timestamp_src] or something?
> > 
> > For the context of this patch, I think it doesn't matter how many
> > different
> > kinds of time stamps there are. What's important is that we just shouldn't
> > compare timestamps of page faults with time stamps of other interrupts.
> 
> True, thank you!
> 
> Another question is why the backward timestamp check is needed only for
> fault interrupts? I do not see it elsewhere.

Correct, this is only used for retry fault interrupts and only when they are 
dispatched to the soft IH ring.

The reason this was added is because when retry faults are enabled and the GPU 
hits a VM fault, it keeps spamming the CPU with many interrupts for the same 
fault until the fault is resolved. The CPU needs to filter out the faults which 
it is already handling, otherwise we would end up handling the same fault 
multiple times.

(As a side note, I should also probably look into how to reduce the frequency 
of how often these interrupts are repeated.)

> 
> Let me also ask two more things below.
> 
> > As far as I see the timestamp doesn't really matter for other interrupts
> > as we only use it to filter out page faults and nothing else.
> > 
> > Hope this helps,
> > Timur
> > 
> >>> ---
> >>> 
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
> >>>    2 files changed, 5 insertions(+), 1 deletion(-)
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> >>> 13bec8461cde..52258f1341c2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
> >>> *adev,>
> >>> 
> >>>    	uint32_t hash;
> >>>    	
> >>>    	/* Stale retry fault if timestamp goes backward */
> >>> 
> >>> -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
> >>> +	if (timestamp == adev->gmc.processed_fault_timestamp ||
> >>> +		amdgpu_ih_ts_after(timestamp, adev-
> >> 
> >> gmc.processed_fault_timestamp))
> 
> First thing is whether you are confident the equality check is either
> safe or required?

I don't see why it wouldn't be safe. But maybe it isn't required.
What do you suggest instead?

> For example can two blocks fault with the same timestamp on different
> addresses?

They might. But keep in mind that the GFX block just keeps spamming the 
interrupts until the fault is handled. So, if we filter one out by mistake, we 
know we will just receive the same fault again very soon.

> Or from a different angle, is the clock granularity good enough to not
> coalesce two separate faults to a single timestamp?

I am not sure about that.

> 
> >>>    		return true;
> >>> 
> >>> +	adev->gmc.processed_fault_timestamp = MAX(timestamp,
> >>> adev->gmc.processed_fault_timestamp); +
> 
> Doesn't a plain assign work here? The if above has already verified new
> timestamp is larger than the old.
> 
> Regards,
> 
> Tvrtko
> 
> >>>    	/* If we don't have space left in the ring buffer return
> > 
> > immediately */
> > 
> >>>    	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
> >>>    	
> >>>    		AMDGPU_GMC_FAULT_TIMEOUT;
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
> >>> 676e3aaa1f27..77eb15380284 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
> >>> 
> >>>    	u64 noretry_flags;
> >>>    	u64 init_pte_flags;
> >>> 
> >>> +	u64 processed_fault_timestamp;
> >>> 
> >>>    	bool flush_tlb_needs_extra_type_0;
> >>>    	bool flush_tlb_needs_extra_type_2;

next prev parent reply	other threads:[~2026-06-15 15:32 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-26 15:00   ` Alex Deucher
2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-06-15 14:32   ` Tvrtko Ursulin
2026-06-15 14:52     ` Timur Kristóf
2026-06-15 15:23       ` Tvrtko Ursulin
2026-06-15 15:32         ` Timur Kristóf [this message]
2026-06-15 15:48           ` Tvrtko Ursulin
2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-06-15 14:44   ` Tvrtko Ursulin
2026-06-15 15:02     ` Timur Kristóf
2026-05-25 11:45 ` [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-25 11:45 ` [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1 Timur Kristóf
2026-05-25 11:45 ` [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=10078559.eNJFYEL58v@timur-hyperion \
    --to=timur.kristof@gmail.com \
    --cc=Amir.Shetaia@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=maraeo@gmail.com \
    --cc=mario.limonciello@amd.com \
    --cc=natalie.vock@gmx.de \
    --cc=tursulin@ursulin.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.