public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: "Mario Limonciello" <mario.limonciello@amd.com>,
	post@davidak.de, "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Sasha Levin" <sashal@kernel.org>,
	Xinhui.Pan@amd.com, airlied@gmail.com, daniel@ffwll.ch,
	andrey.grodzovsky@amd.com, evan.quan@amd.com,
	Amaranath.Somalapuram@amd.com, lang.yu@amd.com,
	Jack.Xiao@amd.com, amd-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.0 21/30] drm/amd: Fail the suspend if resources can't be evicted
Date: Thu, 10 Nov 2022 21:33:29 -0500	[thread overview]
Message-ID: <20221111023340.227279-21-sashal@kernel.org> (raw)
In-Reply-To: <20221111023340.227279-1-sashal@kernel.org>

From: Mario Limonciello <mario.limonciello@amd.com>

[ Upstream commit 8d4de331f1b24a22d18e3c6116aa25228cf54854 ]

If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on proceeding to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9170aeaad93e..e0c960cc1d2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4055,15 +4055,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
  * at suspend time.
  *
  */
-static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
+static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
 {
+	int ret;
+
 	/* No need to evict vram on APUs for suspend to ram or s2idle */
 	if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
-		return;
+		return 0;
 
-	if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
+	ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
+	if (ret)
 		DRM_WARN("evicting device resources failed\n");
-
+	return ret;
 }
 
 /*
@@ -4113,7 +4116,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 	if (!adev->in_s0ix)
 		amdgpu_amdkfd_suspend(adev, adev->in_runpm);
 
-	amdgpu_device_evict_resources(adev);
+	r = amdgpu_device_evict_resources(adev);
+	if (r)
+		return r;
 
 	amdgpu_fence_driver_hw_fini(adev);
 
-- 
2.35.1


  parent reply	other threads:[~2022-11-11  2:35 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-11  2:33 [PATCH AUTOSEL 6.0 01/30] cxl/mbox: Add a check on input payload size Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 02/30] RDMA/efa: Add EFA 0xefa2 PCI ID Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 03/30] btrfs: raid56: properly handle the error when unable to find the missing stripe Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 04/30] NFSv4: Retry LOCK on OLD_STATEID during delegation return Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 05/30] SUNRPC: Fix crasher in gss_unwrap_resp_integ() Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 06/30] ACPI: x86: Add another system to quirk list for forcing StorageD3Enable Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 07/30] drm/rockchip: vop2: fix null pointer in plane_atomic_disable Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 08/30] drm/rockchip: vop2: disable planes when disabling the crtc Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 09/30] ksefltests: pidfd: Fix wait_states: Test terminated by timeout Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 10/30] powerpc/64e: Fix amdgpu build on Book3E w/o AltiVec Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 11/30] block: blk_add_rq_to_plug(): clear stale 'last' after flush Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 12/30] firmware: arm_scmi: Cleanup the core driver removal callback Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 13/30] firmware: arm_scmi: Make tx_prepare time out eventually Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 14/30] i2c: tegra: Allocate DMA memory for DMA engine Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 15/30] i2c: i801: add lis3lv02d's I2C address for Vostro 5568 Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 16/30] drm/imx: imx-tve: Fix return type of imx_tve_connector_mode_valid Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 17/30] btrfs: remove pointless and double ulist frees in error paths of qgroup tests Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 18/30] drm/amd/display: Ignore Cable ID Feature Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 19/30] drm/amd/display: Enable timing sync on DCN32 Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 20/30] drm/amdgpu: set fb_modifiers_not_supported in vkms Sasha Levin
2022-11-11  2:33 ` Sasha Levin [this message]
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 22/30] drm/amd/display: Fix DCN32 DSC delay calculation Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 23/30] drm/amd/display: Use forced DSC bpp in DML Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 24/30] drm/amd/display: Round up DST_after_scaler to nearest int Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 25/30] drm/amd/display: Investigate tool reported FCLK P-state deviations Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 26/30] Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 27/30] cxl/pmem: Use size_add() against integer overflow Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 28/30] x86/cpu: Add several Intel server CPU model numbers Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 29/30] tools/testing/cxl: Fix some error exits Sasha Levin
2022-11-11  2:33 ` [PATCH AUTOSEL 6.0 30/30] cifs: always iterate smb sessions using primary channel Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221111023340.227279-21-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Amaranath.Somalapuram@amd.com \
    --cc=Jack.Xiao@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrey.grodzovsky@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=evan.quan@amd.com \
    --cc=lang.yu@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mario.limonciello@amd.com \
    --cc=post@davidak.de \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox