From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Donet Tom <donettom@linux.ibm.com>,
amd-gfx@lists.freedesktop.org,
Felix Kuehling <Felix.Kuehling@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Alex Deucher <alexdeucher@gmail.com>,
christian.koenig@amd.com, Philip Yang <yangp@amd.com>
Cc: David.YatSin@amd.com, Kent.Russell@amd.com,
Vaidyanathan Srinivasan <svaidy@linux.ibm.com>,
donettom@linux.ibm.com, stable@vger.kernel.org
Subject: Re: [RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages
Date: Sun, 01 Mar 2026 15:06:10 +0530
Message-ID: <87seajj3hx.ritesh.list@gmail.com>
In-Reply-To: <f6b7f3e49ea54fc9c5c3f8dae607382ba9d6f58e.1771656655.git.donettom@linux.ibm.com>
Donet Tom <donettom@linux.ibm.com> writes:
> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
> 4K pages, both values match (8KB), so allocation and reserved space
> are consistent.
>
> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
> while the reserved trap area remains 8KB. This mismatch causes the
> kernel to crash when running rocminfo or rccl unit tests.
>
Looking at the current definitions in amdgpu_vm.h:
#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12)
#define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
                                             - AMDGPU_VA_RESERVED_TRAP_SIZE)
#define AMDGPU_VA_RESERVED_BOTTOM (1ULL << 16)
#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_TRAP_SIZE + \
                                AMDGPU_VA_RESERVED_SEQ64_SIZE + \
                                AMDGPU_VA_RESERVED_CSA_SIZE)
In kfd_init_apertures_v9()...
/*
* Place TBA/TMA on opposite side of VM hole to prevent
* stray faults from triggering SVM on these pages.
*/
pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START(pdd->dev->adev);
And in kfd_process_device_init_cwsr_dgpu()...
/* cwsr_base is only set for dGPU */
ret = kfd_process_alloc_gpuvm(pdd, qpd->cwsr_base,
KFD_CWSR_TBA_TMA_SIZE, flags, &mem, &kaddr);
This shows that it expects a region of KFD_CWSR_TBA_TMA_SIZE (2 * PAGE_SIZE)
bytes starting at cwsr_base. However, AMDGPU_VA_RESERVED_TRAP_SIZE only
reserves 8KB. This works on 4K page-size systems, but on non-4K page sizes
(say 64K) the 128KB allocation overflows past the 8KB trap window into the
SEQ64 region, since AMDGPU_VA_RESERVED_TRAP_START sits directly below
AMDGPU_VA_RESERVED_SEQ64_START.
Hence the fix in this patch looks right to me. That said, I am not an
expert on the AMD GPU driver side, so I would let the experts review this
as well. But FWIW -
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
> BUG: Kernel NULL pointer dereference on read at 0x00000002
> Faulting instruction address: 0xc0000000002c8a64
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
> NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 XER: 00000036
> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 IRQMASK: 1
> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 c00000013d814540
> GPR04: 0000000000000002 c00000013d814550 0000000000000045 0000000000000000
> GPR08: c00000013444d000 c00000013d814538 c00000013d814538 0000000084002268
> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff 0000000000020000
> GPR16: 0000000000000000 0000000000000002 c00000015f653000 0000000000000000
> GPR20: c000000138662400 c00000013d814540 0000000000000000 c00000013d814500
> GPR24: 0000000000000000 0000000000000002 c0000001e0957888 c0000001e0957878
> GPR28: c00000013d814548 0000000000000000 c00000013d814540 c0000001e0957888
> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
> Call Trace:
> 0xc0000001e0957890 (unreliable)
> __mutex_lock.constprop.0+0x58/0xd00
> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
> kfd_ioctl+0x514/0x670 [amdgpu]
> sys_ioctl+0x134/0x180
> system_call_exception+0x114/0x300
> system_call_vectored_common+0x15c/0x2ec
>
> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
> ensuring that the reserved trap area matches the allocation size
> across all page sizes.
>
> cc: stable@vger.kernel.org
Cc: stable makes sense, so that older kernel versions get this fix too!
> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 139642eacdd0..a5eae49f9471 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20)
>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev) (AMDGPU_VA_RESERVED_CSA_START(adev) \
>                                                - AMDGPU_VA_RESERVED_SEQ64_SIZE)
> -#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12)
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << PAGE_SHIFT)
>  #define AMDGPU_VA_RESERVED_TRAP_START(adev) (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>                                               - AMDGPU_VA_RESERVED_TRAP_SIZE)
>  #define AMDGPU_VA_RESERVED_BOTTOM (1ULL << 16)
> --
> 2.52.0
Thread overview: 9+ messages
2026-02-21 7:09 [RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom
2026-02-21 7:09 ` [RFC PATCH v3 1/6] drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 2 PAGE_SIZE pages Donet Tom
2026-03-01 9:36 ` Ritesh Harjani [this message]
2026-02-21 7:09 ` [RFC PATCH v3 2/6] drm/amdkfd: Align expected_queue_size to PAGE_SIZE Donet Tom
2026-02-21 7:09 ` [RFC PATCH v3 3/6] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Donet Tom
2026-02-21 7:09 ` [RFC PATCH v3 4/6] drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K page size Donet Tom
2026-02-21 7:09 ` [RFC PATCH v3 5/6] drm/amd: Fix MQD and control stack alignment for non-4K Donet Tom
2026-02-21 7:09 ` [RFC PATCH v3 6/6] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Donet Tom
2026-03-06 17:54 ` [RFC PATCH v3 0/6] drm/amd: Add support for non-4K page size systems Donet Tom