From: Jiang Liu <gerry@linux.alibaba.com>
To: alexander.deucher@amd.com, christian.koenig@amd.com,
	Xinhui.Pan@amd.com, airlied@gmail.com, simona@ffwll.ch,
	sunil.khatri@amd.com, lijo.lazar@amd.com, Hawking.Zhang@amd.com,
	mario.limonciello@amd.com, xiaogang.chen@amd.com,
	Kent.Russell@amd.com, shuox.liu@linux.alibaba.com,
	amd-gfx@lists.freedesktop.org
Cc: Jiang Liu <gerry@linux.alibaba.com>
Subject: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs
Date: Tue, 14 Jan 2025 17:54:56 +0800
Message-ID: <cover.1736847835.git.gerry@linux.alibaba.com>

For virtual machines with AMD SR-IOV vGPUs, the following workflow may
be used to support virtual machine hibernation (suspend):
1) Suspend the virtual machine with AMD vGPU A.
2) The hypervisor dumps the guest RAM content to a disk image.
3) The hypervisor loads the guest system image from disk.
4) Resume the guest OS with a different AMD vGPU B.

Step 4 is special because the guest resumes with a different AMD vGPU
device, so the amdgpu driver may observe changed device properties. To
support the workflow above, we need to fix up the changed device
properties that the amdgpu driver has cached.

Based on the amdgpu driver source code (we haven't read the
corresponding hardware specs yet), we have identified the following
device properties that may change:
1) PCI MMIO address. This can be fixed by the hypervisor.
2) serial_number, unique_id, xgmi_device_id and fru_id in sysfs. These
   seem to be informational only.
3) xgmi_physical_id if xgmi is enabled, which affects the VRAM MC
   address.
4) mc_fb_offset, which affects the VRAM physical address.

We focus on the VRAM address related changes here, because GPU
functionality is sensitive to them. The original data sources include
.get_mc_fb_offset(), .get_fb_location() and the xgmi hardware registers.
The main values cached by the amdgpu driver are adev->gmc.vram_start and
adev->vm_manager.vram_base_offset, and the major consumers of the
cached information are ip_block.hw_init() and the GMU page table builder.

After code analysis, we found that most consumers of adev->gmc.vram_start
and adev->vm_manager.vram_base_offset read the values directly from
these two fields on demand instead of keeping their own copies. So if
we fix up these two cached fields on resume, everything should work as
expected.
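
A minimal userspace sketch of this idea, not the actual patch: all toy_*
names are hypothetical stand-ins for the amdgpu structures and callbacks
named above, and the xgmi contribution to the addresses is ignored for
simplicity. It shows why a consumer that reads the cached field on
demand picks up the new base as soon as the field is refreshed on resume.

```c
#include <stdint.h>

/* Toy stand-ins for the cached amdgpu fields; not the real kernel structs. */
struct toy_gmc { uint64_t vram_start; };
struct toy_vm_manager { uint64_t vram_base_offset; };

/* Hypothetical model of what the currently attached vGPU reports. */
struct toy_hw {
	uint64_t fb_location;  /* models the value behind .get_fb_location() */
	uint64_t mc_fb_offset; /* models the value behind .get_mc_fb_offset() */
};

struct toy_adev {
	struct toy_gmc gmc;
	struct toy_vm_manager vm_manager;
	const struct toy_hw *hw; /* vGPU currently backing the device */
};

/*
 * Hypothetical helper: re-read the base addresses from the hardware
 * model and overwrite the cached copies. Assumed to run at init time
 * and again on resume.
 */
void toy_refresh_vram_bases(struct toy_adev *adev)
{
	adev->gmc.vram_start = adev->hw->fb_location;
	adev->vm_manager.vram_base_offset = adev->hw->mc_fb_offset;
}

/* A consumer that reads the cached field on demand, keeping no copy. */
uint64_t toy_vram_mc_addr(const struct toy_adev *adev, uint64_t offset)
{
	return adev->gmc.vram_start + offset;
}
```

In this model, swapping adev->hw to a different vGPU without calling
toy_refresh_vram_bases() leaves toy_vram_mc_addr() returning addresses
based on the old vGPU; calling the helper once on resume corrects every
on-demand consumer at the same time.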

But there is one very important exception: callers of
amdgpu_bo_create_kernel()/amdgpu_bo_create_reserved() may cache VRAM
addresses. Further analysis shows that the callers of these interfaces
follow three different patterns:
1) This pattern is safe.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_free_kernel() in ip_block.suspend()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
2) This pattern works with the current implementation of
   amdgpu_bo_create_reserved(), but leaves bo.pin_count incorrect.
   - call amdgpu_bo_create_reserved() in ip_block.hw_init()
   - call amdgpu_bo_create_reserved() in ip_block.resume()
3) This pattern needs to be enhanced.
   - call amdgpu_bo_create_reserved() in ip_block.sw_init()

So my question is: which pattern should we use here? Personally I prefer
pattern 2, with an enhancement to fix bo.pin_count.
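
The pin_count concern in pattern 2 can be illustrated with a toy
userspace model. This is only a sketch under assumptions: the toy_*
functions are hypothetical, their signatures do not match the real
amdgpu interfaces, and the model assumes that a second create call on an
existing buffer object skips allocation but pins it again.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy buffer object: a pin counter and an offset inside VRAM. */
struct toy_bo {
	unsigned int pin_count;
	uint64_t offset;
};

/*
 * Models the assumed pattern 2 behaviour: allocate only when *bo_ptr is
 * NULL, but pin on every call, and report the GPU address computed from
 * the VRAM base passed in by the caller.
 */
int toy_bo_create_reserved(struct toy_bo **bo_ptr, uint64_t vram_start,
			   uint64_t *gpu_addr)
{
	if (!*bo_ptr) {
		*bo_ptr = calloc(1, sizeof(**bo_ptr));
		if (!*bo_ptr)
			return -1;
		(*bo_ptr)->offset = 0x2000; /* arbitrary placement for the demo */
	}
	(*bo_ptr)->pin_count++; /* a second call pins the same bo again */
	*gpu_addr = vram_start + (*bo_ptr)->offset;
	return 0;
}

/*
 * Hypothetical enhancement: recompute the address of an already pinned
 * bo against the new VRAM base without taking another pin reference.
 */
int toy_bo_get_pinned_gpu_addr(const struct toy_bo *bo, uint64_t vram_start,
			       uint64_t *gpu_addr)
{
	if (!bo || bo->pin_count == 0)
		return -1;
	*gpu_addr = vram_start + bo->offset;
	return 0;
}
```

In this model, calling toy_bo_create_reserved() once in hw_init() and
again in resume() yields the right address but a pin count of 2, while
using toy_bo_get_pinned_gpu_addr() in resume() yields the same address
and keeps the pin count at 1.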

Currently there are still bugs in SR-IOV suspend/resume, so we can't
test our hypothesis yet, and we are not sure whether there are other
blockers to enabling resume with different AMD SR-IOV vGPUs.

Help is needed to identify more task items to enable resume with
different AMD SR-IOV vGPUs :)

Jiang Liu (2):
  drm/amdgpu: update cached vram base addresses on resume
  drm/amdgpu: introduce helper amdgpu_bo_get_pinned_gpu_addr()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 15 +++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h      |  6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c   |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c |  9 +++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  6 ++++++
 7 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.43.5

