From: Donet Tom
To: "Yat Sin, David", Alex Deucher
Cc: "Koenig, Christian", "Ritesh Harjani (IBM)", amd-gfx@lists.freedesktop.org,
 "Kuehling, Felix", "Deucher, Alexander", "Russell, Kent",
 Vaidyanathan Srinivasan, Mukesh Kumar Chaurasiya
Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K page size systems
Date: Sat, 3 Jan 2026 00:23:08 +0530

On 12/18/25 3:01 AM, Yat Sin, David wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> HDP flush is done in ROCm using these 3 methods:
>
> 1. For AQL packets, this is done by setting the system-scope acquire
>    and release fences in the packet header.
>    For example, it is set here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_kernel.cpp#L878
>
>    And the headers are defined here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L85
>
> 2. Via an SDMA packet.
>    This is done before doing a memory copy.
>    The function is called here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L484
>    And the packet (POLL_REGMEM) is generated here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/rocr-runtime/runtime/hsa-runtime/core/runtime/amd_blit_sdma.cpp#L1154
>
> 3. By writing to an MMIO remapped address.
>    The address is stored in rocclr here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocdevice.cpp#L607
>
>    And the flush is triggered by writing a 1, e.g. here:
>    https://github.com/ROCm/rocm-systems/blob/develop/projects/clr/rocclr/device/rocm/rocvirtual.cpp#L3831

Thank you, David.
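For my own notes, method 3 then boils down to something like the sketch
below. This is only an illustration, not rocclr code: 'hdp_flush' stands
for a CPU mapping of the remapped 4K HDP register page, and the barrier
choice is my assumption.

#include <stdint.h>

/*
 * Sketch of method 3: trigger the HDP write-FIFO flush by writing a 1
 * to the remapped HDP flush register page.  'hdp_flush' is assumed to
 * be a volatile CPU mapping of that 4K MMIO page; how the mapping is
 * obtained is omitted here.
 */
static inline void hdp_flush_mmio(volatile uint32_t *hdp_flush)
{
	*hdp_flush = 1;		/* writing a 1 triggers the flush, per rocclr */
	__sync_synchronize();	/* keep the MMIO write ordered with later accesses */
}

This is exactly the path that is unavailable on our 64K page size setup.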
>
> Regards,
> David
>
>> -----Original Message-----
>> From: Alex Deucher
>> Sent: Wednesday, December 17, 2025 9:23 AM
>> To: Donet Tom; Yat Sin, David
>> Cc: Koenig, Christian; Ritesh Harjani (IBM); amd-gfx@lists.freedesktop.org;
>> Kuehling, Felix; Deucher, Alexander; Russell, Kent; Vaidyanathan
>> Srinivasan; Mukesh Kumar Chaurasiya
>> Subject: Re: [RFC PATCH v1 0/8] amdgpu/amdkfd: Add support for non-4K
>> page size systems
>>
>> On Wed, Dec 17, 2025 at 4:03 AM Donet Tom wrote:
>>>
>>> On 12/16/25 7:32 PM, Alex Deucher wrote:
>>>> On Tue, Dec 16, 2025 at 8:55 AM Donet Tom wrote:
>>>>> On 12/15/25 7:39 PM, Alex Deucher wrote:
>>>>>> On Mon, Dec 15, 2025 at 4:47 AM Christian König wrote:
>>>>>>> On 12/12/25 18:24, Alex Deucher wrote:
>>>>>>>> On Fri, Dec 12, 2025 at 8:19 AM Christian König wrote:
>>>>>>>>> On 12/12/25 11:45, Ritesh Harjani (IBM) wrote:
>>>>>>>>>> Christian König writes:
>>>>>>>>>>>> Setup details:
>>>>>>>>>>>> ============
>>>>>>>>>>>> System details: Power10 LPAR using 64K page size.
>>>>>>>>>>>> AMD GPU:
>>>>>>>>>>>>   Name: gfx90a
>>>>>>>>>>>>   Marketing Name: AMD Instinct MI210
>>>>>>>>>>>>
>>>>>>>>>>>> Queries:
>>>>>>>>>>>> =======
>>>>>>>>>>>> 1. We currently ran rocr-debug agent tests [1] and rccl unit
>>>>>>>>>>>>    tests [2] to test these changes. Is there anything else that
>>>>>>>>>>>>    you would suggest we run to shake out any other page size
>>>>>>>>>>>>    related issues w.r.t. the kernel driver?
>>>>>>>>>>> The ROCm team needs to answer that.
>>>>>>>>>>>
>>>>>>>>>> Is there any separate mailing list or list of people whom we can
>>>>>>>>>> cc then?
>>>>>>>>> With Felix on CC you already got the right person, but he's on
>>>>>>>>> vacation and will not be back before the end of the year.
>>>>>>>>> I can check on Monday if some people are still around who could
>>>>>>>>> answer a couple of questions, but in general don't expect a quick
>>>>>>>>> response.
>>>>>>>>>>>> 2. Patch 1/8: We have a query regarding the EOP buffer size. Is
>>>>>>>>>>>>    this EOP ring buffer size HW dependent? Should it be made
>>>>>>>>>>>>    PAGE_SIZE?
>>>>>>>>>>> Yes and no.
>>>>>>>>>>>
>>>>>>>>>> Could you elaborate more on this, please? I am assuming you would
>>>>>>>>>> anyway respond with more context / details on Patch-1 itself. If
>>>>>>>>>> yes, that would be great!
>>>>>>>>> Well, in general the EOP (End of Pipe) buffer contains a ring
>>>>>>>>> buffer of all the events and actions the CP should execute when
>>>>>>>>> shaders and cache flushes finish.
>>>>>>>>> The size depends on the HW generation and configuration of the GPU
>>>>>>>>> etc., but don't ask me for details how that is calculated.
>>>>>>>>> The point is that the size is completely unrelated to the CPU, so
>>>>>>>>> using PAGE_SIZE is clearly incorrect.
>>>>>>>>>>>> 3. Patch 5/8: We also have a query w.r.t. the error paths when
>>>>>>>>>>>>    system page size > 4K. Do we need to lift this restriction
>>>>>>>>>>>>    and add MMIO remap support for systems with non-4K page
>>>>>>>>>>>>    sizes?
>>>>>>>>>>> The problem is the HW can't do this.
>>>>>>>>>>>
>>>>>>>>>> We aren't that familiar with the HW / SW stack here. Wanted to
>>>>>>>>>> understand what functionality will be unsupported due to this HW
>>>>>>>>>> limitation then?
>>>>>>>>> The problem is that the CPU must map some of the
>>>>>>>>> registers/resources of the GPU into the address space of the
>>>>>>>>> application, and you run into security issues when you map more
>>>>>>>>> than 4k at a time.
>>>>>>>> Right. There are some 4K pages within the MMIO register BAR which
>>>>>>>> are empty, and registers can be remapped into them. In this case we
>>>>>>>> remap the HDP flush registers into one of those register pages.
>>>>>>>> This allows applications to flush the HDP write FIFO from either
>>>>>>>> the CPU or another device. This is needed to flush data written by
>>>>>>>> the CPU or another device to the VRAM BAR out to VRAM (i.e., so the
>>>>>>>> GPU can see it). This is flushed internally as part of the shader
>>>>>>>> dispatch packets,
>>>>>>> As far as I know this is only done for graphics shader submissions
>>>>>>> to the classic CS interface, but not for compute dispatches through
>>>>>>> ROCm queues.
>>>>>> There is an explicit PM4 packet to flush the HDP cache for userqs,
>>>>>> and for AQL the flush is handled via one of the flags in the dispatch
>>>>>> packet. The MMIO remap is needed for more fine-grained use cases
>>>>>> where you might have the CPU or another device operating in a
>>>>>> gang-like scenario with the GPU.
>>>>> Thank you, Alex.
>>>>>
>>>>> We were encountering an issue while running the RCCL unit tests.
>>>>> With 2 GPUs, all tests passed successfully; however, when running
>>>>> with more than 2 GPUs, the tests began to fail at random points with
>>>>> the following errors:
>>>>>
>>>>> [ 598.576821] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>>>>> [ 606.696820] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>>>>> [ 606.696826] amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
>>>>> [ 610.696852] amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
>>>>> [ 610.696869] amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
>>>>> [ 610.696942] amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues
>>>>>
>>>>> After applying patches 7/8 and 8/8, we are no longer seeing this
>>>>> issue.
>>>>>
>>>>> One question I have is: we only started observing this problem when
>>>>> the number of GPUs increased. Could this be related to MMIO remapping
>>>>> not being available?
>>>> It could be. E.g., if the CPU or a GPU writes data to VRAM on another
>>>> GPU, you will need to flush the HDP to make sure that data hits VRAM
>>>> before the GPU attached to the VRAM can see it.
>>>
>>> Thanks, Alex.
>>>
>>> I am now suspecting that the queue preemption issue may be related to
>>> the unavailability of MMIO remapping. I am not very familiar with this
>>> area.
>>>
>>> Could you please point me to the relevant code path where the PM4
>>> packet is issued to flush the HDP cache?
>> + David, who is more familiar with the ROCm runtime.
>>
>> PM4 has a packet called HDP_FLUSH which flushes the HDP. For AQL, it's
>> handled by one of the flags, I think. Most things in ROCm use AQL.
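If I am reading hsa.h correctly, the AQL flag Alex mentions would be the
fence-scope bits in the packet header, roughly like the sketch below (my
own sketch, using only the public HSA runtime enums):

#include <stdint.h>
#include <hsa/hsa.h>

/*
 * Build an AQL kernel-dispatch packet header with system-scope acquire
 * and release fences.  Per David's method 1 above, these scopes are
 * what make the packet processor flush/invalidate device caches (HDP
 * included) around the dispatch.
 */
static uint16_t dispatch_header_system_scope(void)
{
	uint16_t header = 0;

	header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
	header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE;
	header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE;

	return header;
}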
>>
>> @David Yat Sin Can you point to how HDP flushes are handled in the
>> ROCm runtime?
>>
>> Alex
>>
>>> I am consistently able to reproduce this issue on my system when using
>>> more than three GPUs if patches 7/8 and 8/8 are not applied. In your
>>> opinion, is there anything that can be done to speed up the HDP flush
>>> or to avoid this situation altogether?
>>>
>>>> Alex
>>>>
>>>>>> Alex
>>>>>>
>>>>>>> That's the reason why ROCm needs the remapped MMIO register BAR.
>>>>>>>
>>>>>>>> but there are certain cases where an application may want more
>>>>>>>> control. This is probably not a showstopper for most ROCm apps.
>>>>>>> Well, the problem is that you absolutely need the HDP
>>>>>>> flush/invalidation for 100% correctness. It does work most of the
>>>>>>> time without it, but you then risk data corruption.
>>>>>>> Apart from making the flush/invalidate an IOCTL, I think we could
>>>>>>> also just use a global dummy page in VRAM.
>>>>>>> If you make two 32-bit writes which are apart from each other and
>>>>>>> then read back a 32-bit value from VRAM, that should invalidate the
>>>>>>> HDP as well. It's less efficient than the MMIO BAR remap but still
>>>>>>> much better than going through an IOCTL.
>>>>>>> The only tricky part is that you need to get the HW barriers with
>>>>>>> the doorbell write right.....
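If the MMIO remap stays unavailable on 64K page size systems, the
dummy-page fallback Christian describes would look roughly like the
following. Everything in it is an assumption on my side (the mapping,
the 64-byte spacing, the barrier), and as he notes the HW barriers
around the doorbell write still need to be worked out:

#include <stdint.h>

/*
 * Sketch of the proposed fallback: two 32-bit writes to a global dummy
 * page in VRAM, spaced apart, followed by a 32-bit read-back which
 * should invalidate the HDP.  'dummy' is assumed to be a CPU mapping of
 * that VRAM page; the 64-byte spacing is illustrative only, not a
 * documented requirement.
 */
static inline void hdp_invalidate_dummy_page(volatile uint32_t *dummy)
{
	dummy[0]  = 1;		/* first 32-bit write */
	dummy[16] = 1;		/* second write, 64 bytes away */
	__sync_synchronize();	/* order the writes before the read-back */
	(void)dummy[0];		/* read back from VRAM to invalidate the HDP */
}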
>>>>>>>> That said, the region is only 4K, so if you allow applications to
>>>>>>>> map a larger region they would get access to GPU register pages
>>>>>>>> which they shouldn't have access to.
>>>>>>> But don't we also have problems with the doorbell? E.g. the global
>>>>>>> aggregated one needs to be 4k as well, or is it ok to over-allocate
>>>>>>> there?
>>>>>>> Thinking more about it, there is also a major problem with page
>>>>>>> tables. Those are 4k by default on modern systems as well, and
>>>>>>> while over-allocating them to 64k is possible, that not only wastes
>>>>>>> some VRAM but can also result in OOM situations because we can't
>>>>>>> allocate the necessary page tables to switch from 2MiB to 4k pages
>>>>>>> in some cases.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>>>>> [1] ROCr debug agent tests:
>>>>>>>>>>>>     https://github.com/ROCm/rocr_debug_agent
>>>>>>>>>>>> [2] RCCL tests:
>>>>>>>>>>>>     https://github.com/ROCm/rccl/tree/develop/test
>>>>>>>>>>>>
>>>>>>>>>>>> Please note that the changes in this series are on a
>>>>>>>>>>>> best-effort basis from our end. We therefore request the
>>>>>>>>>>>> amd-gfx community (who have deeper knowledge of the HW & SW
>>>>>>>>>>>> stack) to kindly help with the review and provide feedback /
>>>>>>>>>>>> comments on these patches. The idea here is to also have
>>>>>>>>>>>> non-4K page sizes (e.g. 64K) well supported by the amd gpu
>>>>>>>>>>>> kernel driver.
>>>>>>>>>>> Well, this is generally nice to have, but there are
>>>>>>>>>>> unfortunately some HW limitations which make ROCm pretty much
>>>>>>>>>>> unusable on non-4k page size systems.
>>>>>>>>>> That's a bummer :(
>>>>>>>>>> - Do we have some HW documentation around what these limitations
>>>>>>>>>>   around non-4K page size are? Any links to such, please?
>>>>>>>>> You already mentioned MMIO remap, which obviously has that
>>>>>>>>> problem, but if I'm not completely mistaken the PCIe doorbell BAR
>>>>>>>>> and some global seq counter resources will also cause problems
>>>>>>>>> here.
>>>>>>>>> This can all be worked around by delegating those MMIO accesses
>>>>>>>>> into the kernel, but that means tons of extra IOCTL overhead.
>>>>>>>>> Especially the cache flushes which are necessary to avoid
>>>>>>>>> corruption are really bad for performance in such an approach.
>>>>>>>>>> - Are there any newer AMD GPU versions which maybe lift such
>>>>>>>>>>   restrictions?
>>>>>>>>> Not that I know of any.
>>>>>>>>>
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>>>>> already work out of the box.
>>>>>>>>>> - Maybe we should also document what will work and what won't
>>>>>>>>>>   work due to these HW limitations.
>>>>>>>>> Well, pretty much everything. I need to double-check how ROCm
>>>>>>>>> does HDP flushing/invalidating when the MMIO remap isn't
>>>>>>>>> available.
>>>>>>>>> Could be that there is already a fallback path and that's the
>>>>>>>>> reason why this approach actually works at all.
>>>>>>>>>>> What we can do is to support graphics and MM, but that should
>>>>>>>>>>> already work out of the box.
>>>>>>>>>> So these patches helped us resolve most of the issues, like SDMA
>>>>>>>>>> hangs and GPU kernel page faults, which we saw with rocr and
>>>>>>>>>> rccl tests at 64K page size. Meaning, we didn't see this working
>>>>>>>>>> out of the box, perhaps due to the 64K page size.
>>>>>>>>> Yeah, but this is all for ROCm and not the graphics side.
>>>>>>>>>
>>>>>>>>> To be honest, I'm not sure how ROCm even works when you have 64k
>>>>>>>>> pages at the moment. I would expect many more issues lurking in
>>>>>>>>> the kernel driver.
>>>>>>>>>> AFAIU, some of these patches may require re-work based on
>>>>>>>>>> reviews, but at least with these changes we were able to see all
>>>>>>>>>> the tests passing.
>>>>>>>>>>> I need to talk with Alex and the ROCm team about whether
>>>>>>>>>>> workarounds can be implemented for those issues.
>>>>>>>>>> Thanks a lot! That would be super helpful!
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>> Thanks again for the quick response on the patch series.
>>>>>>>>> You are welcome, but since it's so near to the end of the year,
>>>>>>>>> not all people are available any more.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> -ritesh