[PATCH] drm/amdgpu: clamp user gartsize against device capacity

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

From: Ahmed Elmetwally <en22ue@gmail.com>
To: alexander.deucher@amd.com, christian.koenig@amd.com
Cc: airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, Ahmed Elmetwally <en22ue@gmail.com>
Subject: [PATCH] drm/amdgpu: clamp user gartsize against device capacity
Date: Thu, 14 May 2026 20:49:37 +0100	[thread overview]
Message-ID: <20260514194937.35649-1-en22ue@gmail.com> (raw)

When the user-supplied amdgpu.gartsize= module parameter requests a GART
aperture so large that the resulting page-table BO cannot be allocated
from VRAM, gmc_v11_0_sw_init() aborts with -ENOMEM during the kernel BO
pin in amdgpu_gart_table_vram_alloc(), which fails amdgpu probe entirely
and leaves the device without a /dev/dri node.

The GART page table is 8 bytes per AMDGPU_GPU_PAGE_SIZE-sized page,
i.e. table_size = gart_size / 512. The table BO is allocated from real
VRAM. On small-VRAM SoCs (APUs with stolen VRAM -- Strix Halo at ~1 GiB
is the case at hand) a user-supplied gartsize that produces a table BO
larger than what can be pinned in VRAM is silently accepted today and
only fails far downstream.

Reproducer on Strix Halo (gfx1151, 1 GiB stolen VRAM):

  /etc/modprobe.d/amdgpu-tuning.conf:
    options amdgpu gartsize=262144     # 256 GiB -> 512 MiB page table

  dmesg:
    amdgpu: [gmc_v11_0]*ERROR* GART aperture is needed by the driver but
            no memory has been pinned for it ...
    amdgpu 0000:c6:00.0: amdgpu: SW IP initialize failed
    amdgpu 0000:c6:00.0: amdgpu: sw_init of IP block <gmc_v11_0> failed -12

Recovery requires booting with amdgpu.gartsize=N on the kernel command
line -- i.e. the user must already know the cause, from rescue media,
with the GPU offline.

The user-supplied gartsize is a tuning hint, not a correctness
requirement. Compute the maximum sensible value such that the page-table
BO fits within real_vram_size / 8 (leaves 7/8 of VRAM for everything
else), and if the user value exceeds it, log a warning and fall back to
the per-IP auto default. With this patch the same modprobe.d entry
above produces:

  amdgpu 0000:c6:00.0: amdgpu: amdgpu.gartsize=262144 MiB exceeds device
    capacity (real_vram=1024 MiB, max sensible=65536 MiB); clamping to
    default 512 MiB

and the device probes normally.

The helper lives in amdgpu_gmc.c so the other gmc_v*_0 backends can
adopt it in follow-up patches; this patch only wires gmc_v11_0 because
that is where the failure was observed.

Signed-off-by: Ahmed Elmetwally <en22ue@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 47 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  6 +---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -330,6 +330,53 @@ void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n",
 			mc->gart_size >> 20, mc->gart_start, mc->gart_end);
 }
+
+/**
+ * amdgpu_gmc_validate_gart_size - clamp amdgpu.gartsize against VRAM capacity
+ *
+ * @adev:        amdgpu device (real_vram_size must already be populated)
+ * @user_mb:     value of the amdgpu_gart_size module parameter (MiB),
+ *               or -1 for auto
+ * @default_mb:  per-IP auto default in MiB
+ *
+ * The GART page table is allocated from VRAM and sized as
+ * gart_size / AMDGPU_GPU_PAGE_SIZE * sizeof(u64). A typoed or mis-pasted
+ * amdgpu.gartsize value (e.g. 262144 MiB on a 1 GiB-VRAM APU) produces a
+ * page-table BO that cannot be pinned, aborting GPU probe. Cap the page
+ * table at 1/8 of real VRAM, warn, and fall back to the auto default.
+ * Returns gart_size in bytes.
+ */
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb)
+{
+	u64 want_bytes, max_bytes;
+
+	if (user_mb == -1)
+		return (u64)default_mb << 20;
+
+	want_bytes = (u64)user_mb << 20;
+
+	/*
+	 * page-table BO bytes = gart_bytes / AMDGPU_GPU_PAGE_SIZE * 8
+	 * Constraint: page-table BO <= real_vram_size / 8
+	 *   gart_bytes <= (real_vram_size / 8) * (AMDGPU_GPU_PAGE_SIZE / 8)
+	 */
+	max_bytes = (adev->gmc.real_vram_size / 8) *
+		    (AMDGPU_GPU_PAGE_SIZE / 8);
+
+	if (want_bytes > max_bytes) {
+		dev_warn(adev->dev,
+			 "amdgpu.gartsize=%u MiB exceeds device capacity "
+			 "(real_vram=%llu MiB, max sensible=%llu MiB); "
+			 "clamping to default %u MiB\n",
+			 user_mb,
+			 adev->gmc.real_vram_size >> 20,
+			 max_bytes >> 20,
+			 default_mb);
+		return (u64)default_mb << 20;
+	}
+
+	return want_bytes;
+}

 /**
  * amdgpu_gmc_agp_location - try to find AGP location
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -416,6 +416,8 @@ void amdgpu_gmc_vram_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc,
 void amdgpu_gmc_gart_location(struct amdgpu_device *adev,
 			      struct amdgpu_gmc *mc,
 			      enum amdgpu_gart_placement gart_placement);
+u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev,
+				  int user_mb, u32 default_mb);
 void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
 			     struct amdgpu_gmc *mc);
 void amdgpu_gmc_set_agp_default(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -708,11 +708,7 @@ static int gmc_v11_0_mc_init(struct amdgpu_device *adev)
 	if (adev->gmc.visible_vram_size > adev->gmc.real_vram_size)
 		adev->gmc.visible_vram_size = adev->gmc.real_vram_size;

-	/* set the gart size */
-	if (amdgpu_gart_size == -1)
-		adev->gmc.gart_size = 512ULL << 20;
-	else
-		adev->gmc.gart_size = (u64)amdgpu_gart_size << 20;
+	adev->gmc.gart_size = amdgpu_gmc_validate_gart_size(adev,
+				amdgpu_gart_size, 512);

 	gmc_v11_0_vram_gtt_location(adev, &adev->gmc);

--
2.45.0

                 reply	other threads:[~2026-05-14 19:49 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260514194937.35649-1-en22ue@gmail.com \
    --to=en22ue@gmail.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=simona@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox