From: Timur Kristóf
To: Alex Deucher, "Shetaia, Amir"
Cc: amd-gfx@lists.freedesktop.org, "Deucher, Alexander", "Koenig, Christian",
 Marek Olšák, Natalie Vock, Melissa Wen
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
Date: Thu, 14 May 2026 23:24:49 +0200
Message-ID: <5044340.OV4Wx5bFTl@timur-hyperion>
References: <20260513163043.8725-1-timur.kristof@gmail.com> <2342353.vFx2qVVIhK@timur-hyperion>

Hi Amir,

Thanks, glad to hear that!

I recommend also applying "Add fence argument to amdgpu_vm_handle_fault()" and
"ACK the retry CAM after VM update finishes", as these fix the race between
the CAM and the SDMA and should improve the reliability of the whole
mechanism, in my opinion.

In the meantime, I think I figured out what I was missing.

Considering that the FORCE_MISS bits didn't change the behavior, I figured the
culprit must be something else, so I started looking into the PTE that is
being added.

It all started working for me when I added some new PTE flags to
amdgpu_vm_handle_fault(): AMDGPU_PTE_IS_PTE and AMDGPU_PTE_NOALLOC.
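For reference, roughly what this amounts to inside the fault handler. This is
only a sketch: the surrounding flag setup is paraphrased rather than the exact
upstream code, and the write_fault variable name is assumed from context.

    /* PTE flags for the entry that redirects the faulting address to the
     * dummy page (sketch; see amdgpu_vm_handle_fault() for the real code). */
    uint64_t flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED | AMDGPU_PTE_SYSTEM;

    if (write_fault)
            flags |= AMDGPU_PTE_WRITEABLE;

    /* gfx12: explicitly mark the entry as a PTE and set the NOALLOC bit;
     * without these two bits the address kept re-faulting in my tests. */
    flags |= AMDGPU_PTE_IS_PTE | AMDGPU_PTE_NOALLOC;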
With that, I cleaned up my patches and force-pushed them to the same branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4

What do you think?

Best regards,
Timur

On Thursday, May 14, 2026 9:32:16 PM Central European Summer Time Shetaia,
Amir wrote:
> AMD General
>
> Hi Timur,
>
> I cherry-picked your soft-IH-ring + timestamp-filter patches onto our build
> and ran the malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):
>
>                          PASS   1st hang   sR_r entered
>  Baseline (no patches)   1/10   trial 6    97k
>  With your patches       2/10   trial 8    43k
>
> What I noticed on NV4:
> - 2× trials before failure (soft-IH unbound workqueue helping)
> - ~55% reduction in spurious svm_range_restore_pages (your timestamp-filter
>   fix catches a real bug here too, not just Strix Halo)
> - Soft IH worker visibly on the events_unbound queue
>
> Same dma_fence_wait_timeout signature underneath, now on TWO workers
> (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete, TTM cleanup gated on
> the same fence chain). The central BO-clear/GCR deadlock is unchanged; your
> patches fix adjacent bugs, but that one needs the HW fix.
>
> Also tested the UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are
> regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY (the TCP-level register moved
> to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA
> banks → 0/10 PASS, hung sooner. So, slowing UTCL0 retries doesn't help; the
> deadlock isn't pressure-driven on UTCL0.
>
> AMIR SHETAIA
> Senior Software Development Engineer | AMD
> Software Platform Architecture Team
> 1 Commerce Valley Drive, Markham, ON L3T 7X6
> LinkedIn | Instagram | X | amd.com
>
> -----Original Message-----
> From: Timur Kristóf
> Sent: Thursday, May 14, 2026 1:03 PM
> To: Alex Deucher; Shetaia, Amir
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander; Koenig, Christian;
> Marek Olšák; Natalie Vock; Melissa Wen
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> Hi Amir,
>
> I pushed a work-in-progress branch here:
> https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4
>
> This contains the two series that I sent to the mailing list yesterday:
> "Improve retry fault handling"
> "Improve soft IH ring"
>
> If we have to rely on the soft IH ring on Navi4, then we absolutely need the
> patches from "Improve soft IH ring", as they fix a few bugs and switch the
> soft IH ring over to the system unbound workqueue. Otherwise it can happen
> that the soft IH ring fills up even before your CPU can process anything in
> it. (This is especially bad on HW that doesn't have the filter CAM enabled,
> e.g. Strix Halo.)
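(For reference, the workqueue switch mentioned above is essentially a one-line
change. Sketch only: the ih_soft_work field and its call site follow the
driver's existing naming, but they are assumptions here, not verified against
the patch.)

    /* Drain the soft IH ring from the system unbound workqueue so the
     * worker can run on any idle CPU instead of being pinned to the CPU
     * that took the interrupt. */
    queue_work(system_unbound_wq, &adev->irq.ih_soft_work);
    /* previously: schedule_work(&adev->irq.ih_soft_work); */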
> On top of that, there is a patch to solve the race condition and make sure
> that the retry CAM is ACKed after the VM update is complete. On Navi 31,
> this works and successfully makes sure that the race is eliminated and each
> VM fault is only processed once. I'd like to submit these if you guys agree
> on the approach.
>
> And finally, there is a very messy patch that contains what I tried to do
> to get it all working on Navi 48. This has a lot of comments with my notes
> during development, and some attempts I made to fix the issue (without
> success).
>
> Basically, what I did was:
>
> - Enabled the retry CAM the same way as I did on IH 6.0, exactly like you
>   said
> - Added the cam_index to gmc_v12_0_process_interrupt
> - Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry
>   faults (without that, the faults don't have the retry bit in src_data[1])
>
> Now, what I observe on Navi 48 is:
>
> 1. Retry fault interrupt is triggered
> 2. Dispatched to the soft IH ring
> 3. The interrupt handler is called again on the soft IH ring
> 4. The fault is mitigated by amdgpu_vm_handle_fault()
> 5. Finally, the retry CAM is ACKed
> 6. As soon as the CAM is ACKed, I get another interrupt for the same
>    address, and it goes on infinitely or until GPU reset is triggered.
>
> Attempts at fixing what I saw on Navi 48:
>
> - Tried to add a TLB flush in various places. Now that the race is
>   mitigated, the TLB flush doesn't make things worse anymore, but it's also
>   not helping.
> - Tried to set the FORCE_MISS bits on various cache-related registers,
>   hoping that it would work around the possibility of the PTE/PDE not being
>   updated in the cache. This didn't help either.
> - Tried amdgpu.vm_update_mode=3 to use the CPU to update the page tables,
>   but that didn't help on Navi 48.
> - Tried to halt the CP using CP_ME_CNTL to see if it would stop sending
>   interrupts, but it didn't.
> - Using umr, I tried to disable the L2 cache and reset it with the GRBM,
>   but that just broke the whole system.
>
> Other notes on the retry fault topic in general:
>
> - On GPUs that don't have the retry CAM (e.g. Navi 1-2 and APUs) we'll need
>   to consider disabling the VM fault interrupt while processing the retry,
>   because the soft IH ring can fill up really quickly and the CPU may
>   struggle to keep up with it.
> - In general (for all GPUs), I think it would be better to update a larger
>   VA range of subsequent pages rather than just a single page, in order to
>   more efficiently mitigate VM faults to neighbouring pages.
> - Currently all of the invalid accesses are redirected to the same dummy
>   page, which is a security hole, because it means that a process that had
>   an out-of-bounds write can leak data to a process that does an
>   out-of-bounds read. I already talked to Christian about this; he has a
>   few ideas how to fix it.
>
> Thanks & best regards,
> Timur
>
> On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia,
> Amir wrote:
> > AMD General
> >
> > Hi Timur,
> >
> > Thanks for clarifying. Yes, please push the Navi 48 WIP branch when you
> > have a chance, useful to compare side-by-side.
> >
> > Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that the
> > doorbell ACK does not free the CAM slot on the NV4 family (gfx1200/1201).
> > I verified it empirically: slots walk monotonically until the CAM fills,
> > then HW silently drops retry events. An MMIO write to regIH_RETRY_CAM_ACK
> > does free it. Almost certainly required for your Navi 48 attempt.
> >
> > Why the PTE update doesn't reach UTC L0: Most likely a race between the
> > SDMA PTE-update job completing and the CAM ACK firing the IRQ for
> > re-handling. Until SDMA finishes the write and the invalidation
> > propagates, UTC L0 keeps getting the stale "invalid" reply, which is
> > exactly your symptom.
> >
> > To inspect with umr: clone
> > https://gitlab.freedesktop.org/tomstdenis/umr and use
> > scripts/diag/dump_all_cpc_info.sh (needs halt_if_hws_hang=1
> > gpu_recovery=0). For UTC specifically: umr -i -O bits -r
> > '*.*.regGCVM_L2_STATUS' (look at the L2_BUSY and CONTEXT_DOMAIN_BUSY
> > per-VMID bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> > translation timeout, not denial). Page-table walk to confirm the PTE
> > landed: sudo umr -i -vm @ 1.
> >
> > Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that race
> > because the BO-clear deadlock dominates on NV4, but the ordering bug is
> > almost certainly there too. A fence callback is the right shape, cleaner
> > than blocking the worker. It would be a generic fix benefiting both
> > Navi 3 and 4. Happy to review if you take a swing.
> >
> > IH1 routing: fair point. I just inherited the existing IH0 path and
> > haven't tried IH1. On the list once CAM ACK and fence ordering are
> > stable.
> >
> > AMIR SHETAIA
> > Senior Software Development Engineer | AMD
> > Software Platform Architecture Team
> > 1 Commerce Valley Drive, Markham, ON L3T 7X6
> > LinkedIn | Instagram | X | amd.com
> >
> > -----Original Message-----
> > From: Timur Kristóf
> > Sent: Wednesday, May 13, 2026 6:12 PM
> > To: Alex Deucher; Shetaia, Amir
> > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander; Koenig, Christian;
> > Marek Olšák; Natalie Vock; Melissa Wen
> > Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
> >
> > Hi Amir,
> >
> > > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > > ih_v7_0 implementation, which still does WDOORBELL. That's where I'd
> > > suggest swapping in MMIO for NV4.
> >
> > First, let me clear up a slight misunderstanding here. The patch series
> > that I sent here only contains what I managed to get working on Navi 31;
> > it does not contain any code from my attempt at Navi 48.
> >
> > Patch 5 is just a slight refactor of the pre-existing code and
> > (intentionally) doesn't contain any functional changes. It seems that
> > IH 7.1 relies on the doorbell, so I didn't want to remove it, although I
> > have no means to verify whether that actually works or not.
> >
> > If you like, I can push a few WIP patches to a different branch tomorrow
> > to show you exactly what I did on Navi 48. However, I wouldn't want to
> > submit that to the mailing list without first making sure that it works
> > well.
> >
> > > 1. "Fault never resolves on NV48" .. different shape from our
> > > broken-CAM-ACK symptom.
> > >
> > > You're right, those are different. Our CAM-walks-monotonically symptom
> > > only shows up when the CAM is enabled but the ACK is broken. On your
> > > NV48 setup the CAM probably isn't enabled at all (your patch 6 only
> > > enables it for ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init),
> >
> > For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way I
> > do for ih_v6_0 in the series. But, because I couldn't get it fully
> > working, I didn't include any of that code in the series.
> >
> > > so retries fire repeatedly on the IH ring instead of being deduped by
> > > the CAM. That matches what you're seeing .. amdgpu_vm_handle_fault
> > > keeps being called, but each call is on a fresh IRQ for the same
> > > address.
> > > Two things that could be happening underneath:
> > > - The fault handler runs but the updated PTE never reaches UTC L0
> > >   (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > >   translate" in our UMR captures.
> >
> > I think this explanation may fit what I saw.
> > Why is it not reaching UTC L0?
> > Also, how do you inspect this stuff in umr?
> >
> > > 2. What bits we check on src_data[2]:
> > >
> > > Honestly, we don't use src_data[2] for retry detection.
> > > We use it only for the cam_index:
> > >     cam_index = entry->src_data[2] & 0x3ff; /* low 10 bits = CAM slot */
> > >
> > > For retry detection we initially used the gfx9 constant on src_data[1]
> > > like you, but observed the bit cleared on a lot of NV4 events that
> > > should have been retries (waves were hung in xnack-stall but no IH
> > > event matched). So we just go through the retry path unconditionally
> > > on NV4
> >
> > I see. I think I needed to change some gfxhub registers to get those
> > src_data bits that you are missing.
> >
> > > and let amdgpu_vm_handle_fault sort it out via SVM range migration.
> > > May be specific to gfx1201 / our test path.
> >
> > My test case is a simple Vulkan shader which I am executing with
> > vkrunner. Each shader invocation does an out-of-bounds read from a
> > different page. For Navi 31 (and Strix Halo), I started out with just
> > 1 page fault, and once I could mitigate that reliably, I turned it up to
> > several hundred faults.
> >
> > > 3. TLB flush making it worse .. clue about what to do:
> > >
> > > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
> > > and the GC cache block isn't ACK'ing the GCR while UTC L2 is saturated
> > > by the user shader's XNACK retry storm. Adding a TLB flush adds
> > > another translation request to the same saturated UTC, which is why it
> > > makes things worse.
> >
> > This may be related to a flaw in amdgpu_gmc_handle_retry_fault(): what
> > the function does is first call amdgpu_vm_handle_fault(), which kicks
> > off an SDMA job to update the page tables, and then it ACKs the filter
> > CAM. However, at the moment when the CAM is ACKed, the SDMA job is not
> > finished yet, so the CP sees that the page is still invalid and the page
> > fault interrupt is fired again. I actually noticed that on Navi 31 too,
> > but it's just not fatal there: it basically just handles the same page
> > fault twice.
> >
> > Once we solve this flaw, I would like to propose enabling retry faults
> > by default on Navi 3.
> >
> > Here is an idea for a solution:
> >
> > Instead of ACKing the CAM right away, we should do it after the SDMA
> > fence is signalled, i.e. when we are sure the page tables are updated.
> > Maybe we can set a callback on the fence and do it there, though it
> > would require a bit of code churn to get that to work.
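(For illustration, a rough sketch of that callback idea, assuming
amdgpu_vm_handle_fault() hands back the fence of the page-table update, as in
the "Add fence argument" patch. The cam_ack_cb struct and the
retry_cam_ack_write() helper are placeholders made up for this example, not
existing driver symbols, and error handling is simplified.)

    struct cam_ack_cb {
            struct dma_fence_cb base;
            struct amdgpu_device *adev;
            u32 cam_index;
    };

    static void cam_ack_on_fence(struct dma_fence *fence, struct dma_fence_cb *cb)
    {
            struct cam_ack_cb *ack = container_of(cb, struct cam_ack_cb, base);

            /* The SDMA page-table update has landed, so releasing the CAM
             * slot can no longer re-trigger a fault on a stale PTE. */
            retry_cam_ack_write(ack->adev, ack->cam_index);
            kfree(ack);
    }

    /* ... in amdgpu_gmc_handle_retry_fault(), with the fence returned by
     * the page-table update: */
    struct cam_ack_cb *ack = kzalloc(sizeof(*ack), GFP_ATOMIC);

    if (ack) {
            ack->adev = adev;
            ack->cam_index = cam_index;
    }
    if (!ack || dma_fence_add_callback(fence, &ack->base, cam_ack_on_fence)) {
            /* No memory, or the fence already signalled: ACK right away. */
            retry_cam_ack_write(adev, cam_index);
            kfree(ack);
    }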
> >
> > What do you think?
> >
> > > 4. IH1 ring on NV4:
> > >
> > > Same as you ... retry faults on NV4 always come in on IH0. We delegate
> > > from IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > > SVM/migration path can sleep, but the original entry is on IH0. We
> > > haven't tried IH1 routing.
> >
> > Why, though?
> > The ih_v7_0 code does set up the IH1 ring and configures it exactly the
> > same way as ih_v6_0 does, so I don't see why it wouldn't work?
> >
> > > Re your branch: thanks for the gitlab link, easier than digging
> > > through patchwork.
> > > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch 4
> > > cleans up the timestamp filter delta we're seeing (97k entered /
> > > 2.8k completed at 1 GiB might be partly explained by your Strix Halo
> > > bug).
> >
> > The timestamp issue actually gave me an endless headache when I first
> > got into this topic in December. I hope the patch helps!
> >
> > Best regards,
> > Timur