The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: clamp user gartsize against device capacity
@ 2026-05-14 19:49 Ahmed Elmetwally
  0 siblings, 0 replies; only message in thread
From: Ahmed Elmetwally @ 2026-05-14 19:49 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig
  Cc: airlied, simona, amd-gfx, dri-devel, linux-kernel,
	Ahmed Elmetwally

When the user-supplied amdgpu.gartsize= module parameter requests a GART
aperture so large that the resulting page-table BO cannot be allocated
from VRAM, gmc_v11_0_sw_init() aborts with -ENOMEM during the kernel BO
pin in amdgpu_gart_table_vram_alloc(), which fails amdgpu probe entirely
and leaves the device without a /dev/dri node.

The GART page table is 8 bytes per AMDGPU_GPU_PAGE_SIZE-sized page,
i.e. table_size = gart_size / 512. The table BO is allocated from real
VRAM. On small-VRAM SoCs (APUs with stolen VRAM -- Strix Halo at ~1 GiB
is the case at hand) a user-supplied gartsize that produces a table BO
larger than what can be pinned in VRAM is silently accepted today and
only fails far downstream.

Reproducer on Strix Halo (gfx1151, 1 GiB stolen VRAM):

  /etc/modprobe.d/amdgpu-tuning.conf:
    options amdgpu gartsize=262144     # 256 GiB -> 512 MiB page table

  dmesg:
    amdgpu: [gmc_v11_0]*ERROR* GART aperture is needed by the driver but
            no memory has been pinned for it ...
    amdgpu 0000:c6:00.0: amdgpu: SW IP initialize failed
    amdgpu 0000:c6:00.0: amdgpu: sw_init of IP block <gmc_v11_0> failed -12

Recovery requires booting with amdgpu.gartsize=N on the kernel command
line -- i.e. the user must already know the cause, from rescue media,
with the GPU offline.

The user-supplied gartsize is a tuning hint, not a correctness
requirement. Compute the maximum sensible value such that the page-table
BO fits within real_vram_size / 8 (leaves 7/8 of VRAM for everything
else), and if the user value exceeds it, log a warning and fall back to
the per-IP auto default. With this patch the same modprobe.d entry
above produces:

  amdgpu 0000:c6:00.0: amdgpu: amdgpu.gartsize=262144 MiB exceeds device
    capacity (real_vram=1024 MiB, max sensible=65536 MiB); clamping to
    default 512 MiB

and the device probes normally.

The helper lives in amdgpu_gmc.c so the other gmc_v*_0 backends can
adopt it in follow-up patches; this patch only wires gmc_v11_0 because
that is where the failure was observed.

Signed-off-by: Ahmed Elmetwally <en22ue@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 47 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  6 +---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -330,6 +330,53 @@ void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
 			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
 }
+
+/**
+ * amdgpu_gmc_validate_gart_size - clamp amdgpu.gartsize against VRAM capacity
+ *
+ * @adev:        amdgpu device (real_vram_size must already be populated)
+ * @user_mb:     value of the amdgpu_gart_size module parameter (MiB),
+ *               or -1 for auto
+ * @default_mb:  per-IP auto default in MiB
+ *
+ * The GART page table is allocated from VRAM and sized as
+ * gart_size / AMDGPU_GPU_PAGE_SIZE * sizeof(u64). A typoed or mis-pasted
+ * amdgpu.gartsize value (e.g. 262144 MiB on a 1 GiB-VRAM APU) produces a
+ * page-table BO that cannot be pinned, aborting GPU probe. Cap the page
+ * table at 1/8 of real VRAM, warn, and fall back to the auto default.
+ * Returns gart_size in bytes.
+ */
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb)
+{
+	u64 want_bytes, max_bytes;
+
+	if (user_mb == -1)
+		return (u64)default_mb << 20;
+
+	want_bytes = (u64)user_mb << 20;
+
+	/*
+	 * page-table BO bytes = gart_bytes / AMDGPU_GPU_PAGE_SIZE * 8
+	 * Constraint: page-table BO <= real_vram_size / 8
+	 *   gart_bytes <= (real_vram_size / 8) * (AMDGPU_GPU_PAGE_SIZE / 8)
+	 */
+	max_bytes = (adev->gmc.real_vram_size / 8) *
+		    (AMDGPU_GPU_PAGE_SIZE / 8);
+
+	if (want_bytes > max_bytes) {
+		dev_warn(adev->dev,
+			 "amdgpu.gartsize=%u MiB exceeds device capacity "
+			 "(real_vram=%llu MiB, max sensible=%llu MiB); "
+			 "clamping to default %u MiB\n",
+			 user_mb,
+			 adev->gmc.real_vram_size >> 20,
+			 max_bytes >> 20,
+			 default_mb);
+		return (u64)default_mb << 20;
+	}
+
+	return want_bytes;
+}

 /**
  * amdgpu_gmc_agp_location - try to find AGP location
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -416,6 +416,8 @@ void amdgpu_gmc_vram_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
 void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 			      struct amdgpu_gmc *mc,
 			      enum amdgpu_gart_placement gart_placement);
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb);
 void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
 			     struct amdgpu_gmc *mc);
 void amdgpu_gmc_set_agp_default(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -708,11 +708,7 @@ static int gmc_v11_0_mc_init(struct amdgpu_device *adev)
 	if (adev->gmc.visible_vram_size > adev->gmc.real_vram_size)
 		adev->gmc.visible_vram_size = adev->gmc.real_vram_size;

-	/* set the gart size */
-	if (amdgpu_gart_size == -1)
-		adev->gmc.gart_size = 512ULL << 20;
-	else
-		adev->gmc.gart_size = (u64)amdgpu_gart_size << 20;
+	adev->gmc.gart_size = amdgpu_gmc_validate_gart_size(adev,
+				amdgpu_gart_size, 512);

 	gmc_v11_0_vram_gtt_location(adev, &adev->gmc);

--
2.45.0

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2026-05-14 19:49 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 19:49 [PATCH] drm/amdgpu: clamp user gartsize against device capacity Ahmed Elmetwally

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox