From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2F98CCD8C9D for ; Thu, 11 Jun 2026 18:32:57 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8F8F110F0E5; Thu, 11 Jun 2026 18:32:56 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.b="CQZxW/7u"; dkim-atps=neutral Received: from sea.source.kernel.org (sea.source.kernel.org [172.234.252.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 9A0C110F0D3 for ; Thu, 11 Jun 2026 18:32:55 +0000 (UTC) Received: from smtp.kernel.org (quasi.space.kernel.org [100.103.45.18]) by sea.source.kernel.org (Postfix) with ESMTP id 54EDD404C4; Thu, 11 Jun 2026 18:32:55 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id F065E1F000E9; Thu, 11 Jun 2026 18:32:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781202775; bh=wj07LR9OwbvYyeJXyCaTmGJwF1ISfPKiU5ozJFMFyjc=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=CQZxW/7uwpKAFWHIeN3vVnwE+GYpay1cigDyMah6uclp8F/xRkZDuPc+GpTowpPHj fCEXh/fAuOQq/jfqk55jvjdMDZNMKh/KDRyoPG0VDR08bVKLrBHge4MAXFOfZMdDEZ EOTGkbVQwFXj5exVXVuU2yINvu/q+kxEzXB8xTXC5v2NRjK6iDeqqR4AgyM2+kVZjk NtD5ha5csVfHxeeExfA+ONqSvqXBK/Y/stYCJoYb3iBqzB4oR0j0sqR1XPGrVjzNAD woyFDL9mymVAW9k+gEPVYopPITX7+VEDLE0QjxlEIuUbxi1jNLFxZjhiHs008on2bS B0NFltVGRgkGA== From: sashiko-bot@kernel.org Subject: Re: [PATCH v6 6/6] drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager To: =?utf-8?b?VGhvbWFzIEhlbGxzdHLDtm0=?= Cc: dri-devel@lists.freedesktop.org, tj@kernel.org, mkoutny@suse.com, hannes@cmpxchg.org In-Reply-To: <20260611173301.17473-7-thomas.hellstrom@linux.intel.com> References: <20260611173301.17473-7-thomas.hellstrom@linux.intel.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 11 Jun 2026 18:32:53 +0000 Message-Id: <20260611183254.F065E1F000E9@smtp.kernel.org> X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: sashiko-reviews@lists.linux.dev Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Thank you for your contribution! Sashiko AI review found 3 potential issue(= s) to consider: New issues: - [Critical] Early return in `amdgpu_vram_mgr_fini()` on eviction failure s= kips `dmem_cgroup_unregister_region()`, leading to a use-after-free. - [Medium] The cgroup region name is hardcoded to 'vram', breaking the devi= ce-specific namespace prefix and causing collisions on multi-GPU setups. Pre-existing issues: - [High] Unconditional cleanup of the buddy allocator causes a UAF/NULL der= eference if `gpu_buddy_init()` fails during initialization. -- commit 869c423a8a9487b5c3e7321e4f9a0e14f69318b4 Author: Thomas Hellstr=C3=B6m drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager Register the VRAM manager with the dmem cgroup reclaim infrastructure so that lowering dmem.max below current VRAM usage triggers TTM eviction rather than failing with -EBUSY. > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/d= rm/amd/amdgpu/amdgpu_vram_mgr.c > index 08f05c3aed1d5..2250bab0970dd 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c [ ... ] > @@ -933,12 +938,16 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev) > if (err) > return err; > =20 > - man->cg =3D drmm_cgroup_register_region(adev_to_drm(adev), "vram", > - &(struct dmem_cgroup_init){ > - .size =3D adev->gmc.real_vram_size, > - }); > - if (IS_ERR(man->cg)) > - return PTR_ERR(man->cg); > + cg =3D dmem_cgroup_register_region(&(struct dmem_cgroup_init){ > + .size =3D adev->gmc.real_vram_size, > + .ops =3D &amdgpu_vram_mgr_dmem_ops, > + .reclaim_priv =3D man, > + }, "vram"); [Severity: Medium] Does this hardcoded "vram" string break the device-specific namespace prefi= x? Before this patch, drmm_cgroup_register_region() automatically prefixed the cgroup region name with the DRM device's unique identifier. On systems with multiple AMD GPUs, won't using just "vram" here cause all GPUs to register a region with the exact same name, breaking userspace tools' ability to distinguish between the GPUs' memory cgroups? [ ... ] > @@ -966,6 +975,16 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev) > ret =3D ttm_resource_manager_evict_all(&adev->mman.bdev, man); > if (ret) > return; > =20 > + /* > + * Drain any in-flight dmem cgroup reclaim callbacks and remove the > + * region from the global list. This must happen after evict_all() > + * so that ttm_resource_free() can still uncharge via man->cg while > + * BOs are being evicted. > + */ > + dmem_cgroup_unregister_region(mgr->cg_region); [Severity: Critical] If ttm_resource_manager_evict_all() fails (for example, due to pinned BOs) and returns an error, the early return above skips the newly added dmem_cgroup_unregister_region() call. Could this permanently leak the region and leave a dangling pointer as its reclaim callback argument? When the DRM device is eventually released and the amdgpu_device struct is freed, any subsequent memory pressure triggering dmem cgroup reclaim would dereference the freed man pointer, causing a kernel use-after-free. > + mgr->cg_region =3D NULL; > + man->cg =3D NULL; > + > mutex_lock(&mgr->lock); > list_for_each_entry_safe(rsv, temp, &mgr->reservations_pending, blocks) > kfree(rsv); [ ... ] > list_for_each_entry_safe(rsv, temp, &mgr->reserved_pages, blocks) { > gpu_buddy_free_list(&mgr->mm, &rsv->allocated, 0); > kfree(rsv); > } > if (!adev->gmc.is_app_apu) > gpu_buddy_fini(&mgr->mm); [Severity: High] This isn't a bug introduced by this patch, but does this unconditional cleanup cause a use-after-free or NULL pointer dereference if gpu_buddy_init() failed during initialization? If gpu_buddy_init() fails, it returns an error and frees its internal arrays without setting them to NULL. The driver error path then calls amdgpu_device_fini_sw(), which triggers amdgpu_vram_mgr_fini(). Since mm->n_roots was populated before the allocation failure, won't gpu_buddy_fini() iterate over mm->n_roots and dereference the uninitialized or NULL mm->roots pointer? --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260611173301.1747= 3-1-thomas.hellstrom@linux.intel.com?part=3D6