From: "Timur Kristóf" <timur.kristof@gmail.com>
To: "Tvrtko Ursulin" <tursulin@ursulin.net>,
amd-gfx@lists.freedesktop.org,
"Alex Deucher" <alexander.deucher@amd.com>,
"Natalie Vock" <natalie.vock@gmx.de>,
"Mario Limonciello" <mario.limonciello@amd.com>,
"Amir Shetaia" <Amir.Shetaia@amd.com>,
"Marek Olšák" <maraeo@gmail.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
Date: Tue, 16 Jun 2026 13:17:36 +0200 [thread overview]
Message-ID: <2899310.lGaqSPkdTl@timur-hyperion> (raw)
In-Reply-To: <eb84897e-f4b3-4638-b7e7-e8a9dc787297@amd.com>
On Tuesday, June 16, 2026 12:15:12 PM Central European Summer Time Christian
König wrote:
> A bit late to the discussion, sorry. Trying to answer questions now.
>
> On 6/15/26 17:48, Tvrtko Ursulin wrote:
> > On 15/06/2026 16:32, Timur Kristóf wrote:
> >> On Monday, June 15, 2026 5:23:52 PM Central European Summer Time Tvrtko
> >>
> >> Ursulin wrote:
> >>> On 15/06/2026 15:52, Timur Kristóf wrote:
> >>>> On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko
> >>>>
> >>>> Ursulin wrote:
> >>>>> On 25/05/2026 12:45, Timur Kristóf wrote:
> >>>>>> Different interrupts may have different timestamp sources,
> >>>>>> which shouldn't be compared.
> >>>>>>
> >>>>>> If we compare the timestamps of retry faults to timestamps
> >>>>>> of other interrupts, it may result in all retry fault
> >>>>>> interrupts being filtered out, because of the different
> >>>>>> time stamp source.
> >>>>>>
> >>>>>> This issue was observed on Strix Halo.
> >>>>>> Solved by storing the timestamp of the last page fault interrupt.
> >>>>
> >>>> Hi,
> >>>>
> >>>>> This one may require access to AMD docs to review. For example I am
> >>>>> immediately curious as to how many different clock sources on a single
> >>>>> IH there are
> >>>>
> >>>> As far as I know there are various timestamp sources in the GPU and
> >>>> some
> >>>> interrupts use different ones. I am not aware of any documentation on
> >>>> this
> >>>> topic, unfortunately.
> >>>>
> >>>>> how does that relate to the timestamp_src field
>
> The timestamp_src bit indicates if the timestamp came from the IH block
> which wrote the IV to memory or the original IP block which signaled the IH
> that an interrupt happened.
> >>>> The timestamp_src field is set differently when the timestamp source is
> >>>> different. So, it could happen that we accidentally filter out all page
> >>>> faults when we shouldn't.
> >>>>
> >>>>> and if there are indeed multiple clock domains should the patch
> >>>>> perhaps
> >>>>> be
> >>>>> generalized to something like
> >>>>> ih->processed_timestamp[entry->timestamp_src] or something?
> >>>>
> >>>> For the context of this patch, I think it doesn't matter how many
> >>>> different
> >>>> kinds of time stamps there are. What's important is that we just
> >>>> shouldn't
> >>>> compare timestamps of page faults with time stamps of other interrupts.
> >>>
> >>> True, thank you!
>
> Yeah, completely agree.
>
> >>> Another question is why the backward timestamp check is needed only for
> >>> fault interrupts? I do not see it elsewhere.
> >>
> >> Correct, this is only used for retry fault interrupts and only when they
> >> are dispatched to the soft IH ring.
> >>
> >> The reason this was added is because when retry faults are enabled and
> >> the GPU hits a VM fault, it keeps spamming the CPU with many interrupts
> >> for the same fault until the fault is resolved. The CPU needs to filter
> >> out the faults which it is already handling, otherwise we would end up
> >> handling the same fault multiple times.
> >
> > Got it, thank you!
> >
> >> (As a side note, I should also probably look into how to reduce the
> >> frequency of how often these interrupts are repeated.)
> >>
> >>> Let me also ask two more things below.
> >>>
> >>>> As far as I see the timestamp doesn't really matter for other
> >>>> interrupts
> >>>> as we only use it to filter out page faults and nothing else.
> >>>>
> >>>> Hope this helps,
> >>>> Timur
> >>>>
> >>>>>> ---
> >>>>>>
> >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
> >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
> >>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> >>>>>> 13bec8461cde..52258f1341c2 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>>>>> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct
> >>>>>> amdgpu_device
> >>>>>> *adev,>
> >>>>>>
> >>>>>> uint32_t hash;
> >>>>>>
> >>>>>> /* Stale retry fault if timestamp goes backward */
> >>>>>>
> >>>>>> - if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
> >>>>>> + if (timestamp == adev->gmc.processed_fault_timestamp ||
> >>>>>> + amdgpu_ih_ts_after(timestamp, adev-
> >>>>>
> >>>>> gmc.processed_fault_timestamp))
> >>>
> >>> First thing is whether you are confident the equality check is either
> >>> safe or required?
> >>
> >> I don't see why it wouldn't be safe. But maybe it isn't required.
> >> What do you suggest instead?
> >
> > Safe as is whether it has potential to swallow a legitimate unseen faults.
> >
> > Looking at amdgpu_gmc_filter_faults() a bit lower down, it does appear to
> > filter out repeated faults on the same address. Would it be safe to rely
> > on that instead of the timestamp equality check?
> Yes.
>
> The check was added with patch to make retry page faults more resilent to IH
> ring buffer overflow: commit 3c2d6ea27955cfac8590884d207353eece8c2cee
> Author: Philip Yang <Philip.Yang@amd.com>
> Date: Thu Nov 18 15:24:55 2021 -0500
>
> drm/amdgpu: handle IH ring1 overflow
>
> IH ring1 is used to process GPU retry fault, overflow is enabled to
> drain retry fault because we want receive other interrupts while
> handling retry fault to recover range. There is no overflow flag set
> when wptr pass rptr. Use timestamp of rptr and wptr to handle overflow
> and drain retry fault.
>
> If fault timestamp goes backward, the fault is filtered and should not
> be processed. Drain fault is finished if processed_timestamp is equal to
> or larger than checkpoint timestamp.
>
> Add amdgpu_ih_functions interface decode_iv_ts for different chips to
> get timestamp from IV entry with different iv size and timestamp offset.
> amdgpu_ih_decode_iv_ts_helper is used for vega10, vega20, navi10.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Acked-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>
> But as far as I can see the whole idea is completely broken. The timestamp
> can also go backward in case of a reset for example.
>
> So having this check like this is clearly a bad idea.
>
> What we could do in amdgpu_gmc_filter_faults() is to check some range for
> the timestamp, e.g. last seen timestamp (in amdgpu_gmc_filter_faults(),
> e.g. only faults) - value X is considered a duplicate caused by ring buffer
> wrap around.
>
Hi,
I agree that filtering based on timestamps is not a good idea in general, and
we should probably re-think it. I wish someone had thought of that in 2021
when the above patch was accepted.
Maybe this patch should be dropped and the timestamp check should be removed
instead. What do you guys think about that?
Best regards,
Timur
>
> > I appreciate that may cause a transient interrupt handling storm if the
> > clock granularity is poor, but maybe that is better than losing a fault.>
> >>> For example can two blocks fault with the same timestamp on different
> >>> addresses?
> >>
> >> They might. But keep in mind that the GFX block just keeps spamming the
> >> interrupts until the fault is handled. So, if we filter one out by
> >> mistake, we know we will just receive the same fault again very soon.
> >>
> >>> Or from a different angle, is the clock granularity good enough to not
> >>> coalesce two separate faults to a single timestamp?
> >>
> >> I am not sure about that.
> >
> > I guess if the equality filter can be removed then this concern also goes
> > away.
> >
> > Regards,
> >
> > Tvrtko
> >
> >>>>>> return true;
> >>>>>>
> >>>>>> + adev->gmc.processed_fault_timestamp = MAX(timestamp,
> >>>>>> adev->gmc.processed_fault_timestamp); +
> >>>
> >>> Doesn't a plain assign work here? The if above has already verified new
> >>> timestamp is larger than the old.
> >>>
> >>> Regards,
> >>>
> >>> Tvrtko
> >>>
> >>>>>> /* If we don't have space left in the ring buffer return
> >>>>
> >>>> immediately */
> >>>>
> >>>>>> stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
> >>>>>>
> >>>>>> AMDGPU_GMC_FAULT_TIMEOUT;
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
> >>>>>> 676e3aaa1f27..77eb15380284 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>>>>> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
> >>>>>>
> >>>>>> u64 noretry_flags;
> >>>>>> u64 init_pte_flags;
> >>>>>>
> >>>>>> + u64 processed_fault_timestamp;
> >>>>>>
> >>>>>> bool flush_tlb_needs_extra_type_0;
> >>>>>> bool flush_tlb_needs_extra_type_2;
next prev parent reply other threads:[~2026-06-16 11:17 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-26 15:00 ` Alex Deucher
2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-06-15 14:32 ` Tvrtko Ursulin
2026-06-15 14:52 ` Timur Kristóf
2026-06-15 15:23 ` Tvrtko Ursulin
2026-06-15 15:32 ` Timur Kristóf
2026-06-15 15:48 ` Tvrtko Ursulin
2026-06-16 10:15 ` Christian König
2026-06-16 11:17 ` Timur Kristóf [this message]
2026-06-16 12:48 ` Christian König
2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-06-15 14:44 ` Tvrtko Ursulin
2026-06-15 15:02 ` Timur Kristóf
2026-06-16 10:34 ` Christian König
2026-05-25 11:45 ` [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-06-16 8:02 ` Tvrtko Ursulin
2026-06-16 11:54 ` Timur Kristóf
2026-05-25 11:45 ` [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1 Timur Kristóf
2026-06-16 8:09 ` Tvrtko Ursulin
2026-06-16 11:57 ` Timur Kristóf
2026-06-16 12:16 ` Tvrtko Ursulin
2026-06-16 12:36 ` Timur Kristóf
2026-05-25 11:45 ` [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2899310.lGaqSPkdTl@timur-hyperion \
--to=timur.kristof@gmail.com \
--cc=Amir.Shetaia@amd.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=maraeo@gmail.com \
--cc=mario.limonciello@amd.com \
--cc=natalie.vock@gmx.de \
--cc=tursulin@ursulin.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.