* [PATCH] drm/amdgpu: clamp user gartsize against device capacity
@ 2026-05-14 19:49 Ahmed Elmetwally
0 siblings, 0 replies; only message in thread
From: Ahmed Elmetwally @ 2026-05-14 19:49 UTC (permalink / raw)
To: alexander.deucher, christian.koenig
Cc: airlied, simona, amd-gfx, dri-devel, linux-kernel,
Ahmed Elmetwally
When the user-supplied amdgpu.gartsize= module parameter requests a GART
aperture so large that the resulting page-table BO cannot be allocated
from VRAM, gmc_v11_0_sw_init() aborts with -ENOMEM during the kernel BO
pin in amdgpu_gart_table_vram_alloc(), which fails amdgpu probe entirely
and leaves the device without a /dev/dri node.
The GART page table is 8 bytes per AMDGPU_GPU_PAGE_SIZE-sized page,
i.e. table_size = gart_size / 512. The table BO is allocated from real
VRAM. On small-VRAM SoCs (APUs with stolen VRAM -- Strix Halo at ~1 GiB
is the case at hand) a user-supplied gartsize that produces a table BO
larger than what can be pinned in VRAM is silently accepted today and
only fails far downstream.
Reproducer on Strix Halo (gfx1151, 1 GiB stolen VRAM):
/etc/modprobe.d/amdgpu-tuning.conf:
options amdgpu gartsize=262144 # 256 GiB -> 512 MiB page table
dmesg:
amdgpu: [gmc_v11_0]*ERROR* GART aperture is needed by the driver but
no memory has been pinned for it ...
amdgpu 0000:c6:00.0: amdgpu: SW IP initialize failed
amdgpu 0000:c6:00.0: amdgpu: sw_init of IP block <gmc_v11_0> failed -12
Recovery requires booting with amdgpu.gartsize=N on the kernel command
line -- i.e. the user must already know the cause, from rescue media,
with the GPU offline.
The user-supplied gartsize is a tuning hint, not a correctness
requirement. Compute the maximum sensible value such that the page-table
BO fits within real_vram_size / 8 (leaves 7/8 of VRAM for everything
else), and if the user value exceeds it, log a warning and fall back to
the per-IP auto default. With this patch the same modprobe.d entry
above produces:
amdgpu 0000:c6:00.0: amdgpu: amdgpu.gartsize=262144 MiB exceeds device
capacity (real_vram=1024 MiB, max sensible=65536 MiB); clamping to
default 512 MiB
and the device probes normally.
The helper lives in amdgpu_gmc.c so the other gmc_v*_0 backends can
adopt it in follow-up patches; this patch only wires gmc_v11_0 because
that is where the failure was observed.
Signed-off-by: Ahmed Elmetwally <en22ue@gmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 47 +++++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 +
drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 6 +---
3 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -330,6 +330,53 @@ void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
mc->gart_size >> 20, mc->gart_start, mc->gart_end);
}
+
+/**
+ * amdgpu_gmc_validate_gart_size - clamp amdgpu.gartsize against VRAM capacity
+ *
+ * @adev: amdgpu device (real_vram_size must already be populated)
+ * @user_mb: value of the amdgpu_gart_size module parameter (MiB),
+ * or -1 for auto
+ * @default_mb: per-IP auto default in MiB
+ *
+ * The GART page table is allocated from VRAM and sized as
+ * gart_size / AMDGPU_GPU_PAGE_SIZE * sizeof(u64). A typoed or mis-pasted
+ * amdgpu.gartsize value (e.g. 262144 MiB on a 1 GiB-VRAM APU) produces a
+ * page-table BO that cannot be pinned, aborting GPU probe. Cap the page
+ * table at 1/8 of real VRAM, warn, and fall back to the auto default.
+ * Returns gart_size in bytes.
+ */
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+ int user_mb, u32 default_mb)
+{
+ u64 want_bytes, max_bytes;
+
+ if (user_mb == -1)
+ return (u64)default_mb << 20;
+
+ want_bytes = (u64)user_mb << 20;
+
+ /*
+ * page-table BO bytes = gart_bytes / AMDGPU_GPU_PAGE_SIZE * 8
+ * Constraint: page-table BO <= real_vram_size / 8
+ * gart_bytes <= (real_vram_size / 8) * (AMDGPU_GPU_PAGE_SIZE / 8)
+ */
+ max_bytes = (adev->gmc.real_vram_size / 8) *
+ (AMDGPU_GPU_PAGE_SIZE / 8);
+
+ if (want_bytes > max_bytes) {
+ dev_warn(adev->dev,
+ "amdgpu.gartsize=%u MiB exceeds device capacity "
+ "(real_vram=%llu MiB, max sensible=%llu MiB); "
+ "clamping to default %u MiB\n",
+ user_mb,
+ adev->gmc.real_vram_size >> 20,
+ max_bytes >> 20,
+ default_mb);
+ return (u64)default_mb << 20;
+ }
+
+ return want_bytes;
+}
/**
* amdgpu_gmc_agp_location - try to find AGP location
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -416,6 +416,8 @@ void amdgpu_gmc_vram_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
struct amdgpu_gmc *mc,
enum amdgpu_gart_placement gart_placement);
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+ int user_mb, u32 default_mb);
void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
struct amdgpu_gmc *mc);
void amdgpu_gmc_set_agp_default(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -708,11 +708,7 @@ static int gmc_v11_0_mc_init(struct amdgpu_device *adev)
if (adev->gmc.visible_vram_size > adev->gmc.real_vram_size)
adev->gmc.visible_vram_size = adev->gmc.real_vram_size;
- /* set the gart size */
- if (amdgpu_gart_size == -1)
- adev->gmc.gart_size = 512ULL << 20;
- else
- adev->gmc.gart_size = (u64)amdgpu_gart_size << 20;
+ adev->gmc.gart_size = amdgpu_gmc_validate_gart_size(adev,
+ amdgpu_gart_size, 512);
gmc_v11_0_vram_gtt_location(adev, &adev->gmc);
--
2.45.0
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2026-05-14 19:49 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 19:49 [PATCH] drm/amdgpu: clamp user gartsize against device capacity Ahmed Elmetwally
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.