CRIU (Checkpoint/Restore in Userspace) mailing list
* [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
@ 2026-04-10 14:45 David Francis
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: David Francis @ 2026-04-10 14:45 UTC (permalink / raw)
  To: criu; +Cc: tvrtko.ursulin, David Francis

These patches allow processes that have renderD device files open,
but no kfd device file, to be dumped and restored.

This is mostly a proof of concept / baseline for future work, since
there is currently no way for such a process to dump / restore its
queues.

David Francis (3):
  plugin/amdgpu: Add plugin to inventory even if there are no vmas
  plugin/amdgpu: Add topology dump file
  plugins/amdgpu: Make next_fd without kfd

 plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
 plugins/amdgpu/criu-amdgpu.proto    |   5 ++
 4 files changed, 180 insertions(+), 11 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
@ 2026-04-10 14:45 ` David Francis
  2026-04-10 15:46   ` Tvrtko Ursulin
  2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: David Francis @ 2026-04-10 14:45 UTC (permalink / raw)
  To: criu; +Cc: tvrtko.ursulin, David Francis

The amdgpu plugin is added to the plugin inventory when the first
vma map is dumped. But a process can have driver files open without
having any vma maps, and in that case the plugin is still needed
on restore.

Add the plugin to the inventory whenever a device file is dumped.
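
The guard added here is the standard register-once pattern; in
isolation, with the inventory call mocked (names hypothetical), it
behaves like:

```c
#include <stdbool.h>

static bool plugin_added_to_inventory;
static int inventory_calls; /* how many times the (mock) registration ran */

/* Stand-in for add_inventory_plugin(); always succeeds. */
static int mock_add_inventory_plugin(void)
{
	inventory_calls++;
	return 0;
}

/* Called once per dumped device file; registers the plugin at most once. */
static int on_device_file_dumped(void)
{
	if (!plugin_added_to_inventory) {
		int ret = mock_add_inventory_plugin();

		if (ret)
			return ret;
		plugin_added_to_inventory = true;
	}
	return 0;
}
```

Dumping any number of device files thus records the plugin in the
inventory exactly once, with or without vma maps.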

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index e3ba0de64..89ab10dac 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -1439,6 +1439,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
 		if (ret)
 			return ret;
 
+		if (!plugin_added_to_inventory) {
+			ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
+			if (ret) {
+				pr_err("Failed to add AMDGPU plugin to inventory image\n");
+				return ret;
+			}
+			plugin_added_to_inventory = true;
+		}
+
 		ret = record_dumped_fd(fd, true);
 		if (ret)
 			return ret;
@@ -1543,6 +1552,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
 	if (ret)
 		goto exit;
 
+	if (!plugin_added_to_inventory) {
+		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
+		if (ret) {
+			pr_err("Failed to add AMDGPU plugin to inventory image\n");
+			goto exit;
+		}
+		plugin_added_to_inventory = true;
+	}
+
 exit:
 	xfree((void *)args.devices);
 	xfree((void *)args.bos);
-- 
2.34.1



* [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
@ 2026-04-10 14:45 ` David Francis
  2026-04-16 14:54   ` Tvrtko Ursulin
  2026-04-10 14:45 ` [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd David Francis
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
  3 siblings, 1 reply; 15+ messages in thread
From: David Francis @ 2026-04-10 14:45 UTC (permalink / raw)
  To: criu; +Cc: tvrtko.ursulin, David Francis

The state of the source topology (the GPUs, CPUs, and links
between them) is saved by the plugin as part of kfd dump.

If there is no kfd dump, we still need to save the topology.

Do so in a new file, amdgpu-topology.img.
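
The actual image uses the protobuf message added below; the
pack/write-then-read/unpack round trip it relies on can be sketched
with a hypothetical fixed-layout encoding (illustration only, not the
real wire format):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the per-node device info. */
struct devinfo {
	uint32_t node_id;
	uint32_t gpu_id;
	uint32_t simd_count;
};

/* Pack |n| entries into a newly allocated buffer; caller frees. */
static unsigned char *devinfo_pack(const struct devinfo *di, size_t n, size_t *len)
{
	uint32_t count = (uint32_t)n;
	unsigned char *buf;

	*len = sizeof(count) + n * sizeof(*di);
	buf = malloc(*len);
	if (!buf)
		return NULL;
	memcpy(buf, &count, sizeof(count));
	memcpy(buf + sizeof(count), di, n * sizeof(*di));
	return buf;
}

/* Unpack a buffer produced by devinfo_pack(); returns the entry count
 * and allocates |*out| (caller frees), or returns 0 on a short buffer. */
static size_t devinfo_unpack(const unsigned char *buf, size_t len, struct devinfo **out)
{
	uint32_t count;

	if (len < sizeof(count))
		return 0;
	memcpy(&count, buf, sizeof(count));
	if (len < sizeof(count) + count * sizeof(**out))
		return 0;
	*out = malloc(count * sizeof(**out));
	if (!*out)
		return 0;
	memcpy(*out, buf + sizeof(count), count * sizeof(**out));
	return count;
}
```

On restore, the unpacked entries play the role the kfd image normally
plays when building src_topology.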

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
 plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
 plugins/amdgpu/criu-amdgpu.proto    |  5 ++
 4 files changed, 151 insertions(+), 11 deletions(-)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index 89ab10dac..1e9785440 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -91,6 +91,9 @@ int current_pid;
  */
 bool parallel_disabled = false;
 
+bool kfd_dump_complete = false;
+bool amdgpu_topology_dump_complete = false;
+
 pthread_t parallel_thread = 0;
 int parallel_thread_result = 0;
 /**************************************************************************************************/
@@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
 		devinfo->node_id = node->id;
 
 		if (NODE_IS_GPU(node)) {
-			devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
-			if (!devinfo->gpu_id)
-				continue;
+			if (maps) {
+				devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
+				if (!devinfo->gpu_id)
+					continue;
+			} else {
+				devinfo->gpu_id = node->gpu_id;
+			}
+
 
 			devinfo->simd_count = node->simd_count;
 			devinfo->mem_banks_count = node->mem_banks_count;
@@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
 				if (!iolink->valid)
 					continue;
 
-				list_for_each_entry(node2, &sys->nodes, listm_system)
-					if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
-						link_to_present_node = true;
+				if (maps) {
+					list_for_each_entry(node2, &sys->nodes, listm_system)
+						if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
+							link_to_present_node = true;
+				} else {
+					link_to_present_node = true;
+				}
 
 				if (!link_to_present_node)
 					continue;
@@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
 	maps_init(&checkpoint_maps);
 	maps_init(&restore_maps);
 
+	if (stage == CR_PLUGIN_STAGE__DUMP) {
+		kfd_dump_complete = false;
+		amdgpu_topology_dump_complete = false;
+	}
+
 	if (stage == CR_PLUGIN_STAGE__RESTORE) {
 		if (has_children(root_item)) {
 			pr_info("Parallel restore disabled\n");
@@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
 	if (ret)
 		goto exit;
 
+	kfd_dump_complete = true;
 	if (!plugin_added_to_inventory) {
 		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
 		if (ret) {
@@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
 
 		pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
 
+		if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
+			AmdgpuDevinfo *ad;
+
+			pr_info("No restore maps found, making them from topology file\n");
+
+			img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
+			if (!img_fp) {
+				pr_err("Failed to find either kfd or amdgpu src topology information\n");
+				ret = -EINVAL;
+				goto exit;
+			}
+
+			buf = xmalloc(img_size);
+			if (!buf) {
+				pr_err("Failed to allocate memory\n");
+				return -ENOMEM;
+			}
+
+			ret = read_fp(img_fp, buf, img_size);
+			if (ret) {
+				pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
+				ret = -EINVAL;
+				goto exit;
+			}
+
+			ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
+			if (ad == NULL) {
+				pr_err("Unable to parse the amdgpu topology message\n");
+				fclose(img_fp);
+				ret = -EINVAL;
+				goto exit;
+			}
+			fclose(img_fp);
+
+			ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
+			if (ret) {
+				pr_err("Failed to convert amdgpu device information to topology\n");
+				ret = -EINVAL;
+				goto exit;
+			}
+
+			ret = topology_parse(&dest_topology, "Local");
+			if (ret) {
+				pr_err("Failed to parse local system topology\n");
+				goto exit;
+			}
+
+			ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
+			if (ret) {
+				pr_err("Failed to map GPUs\n");
+				goto exit;
+			}
+		}
+
 		target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
 		if (!target_gpu_id) {
 			fd = -ENODEV;
diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
index c1dfb2dd4..a4c650753 100644
--- a/plugins/amdgpu/amdgpu_plugin_drm.c
+++ b/plugins/amdgpu/amdgpu_plugin_drm.c
@@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
 		return -ENODEV;
 	}
 
-	/* Get the GPU_ID of the DRM device */
-	rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
-	if (!rd->gpu_id) {
-		pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
-		return -ENODEV;
+	if (kfd_dump_complete) {
+		/* Get the GPU_ID of the DRM device */
+		rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
+		if (!rd->gpu_id) {
+			pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
+			return -ENODEV;
+		}
+	} else {
+		rd->gpu_id = tp_node->gpu_id;
+
+		if (!amdgpu_topology_dump_complete) {
+			AmdgpuDevinfo *ad = NULL;
+			unsigned char *buf;
+
+			ad = xmalloc(sizeof(*ad));
+			amdgpu_devinfo__init(ad);
+
+			ad->num_of_devices = src_topology.num_nodes;
+
+			ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
+			if (!ad->device_entries) {
+				pr_err("Failed to allocate device_entries\n");
+				return -ENOMEM;
+			}
+
+			for (int i = 0; i < ad->num_of_devices; i++) {
+				KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
+
+				if (!entry) {
+					pr_err("Failed to allocate entry\n");
+					return -ENOMEM;
+				}
+
+				kfd_device_entry__init(entry);
+
+				ad->device_entries[i] = entry;
+				ad->n_device_entries++;
+			}
+
+			topology_to_devinfo(&src_topology, NULL, ad->device_entries);
+
+			len = amdgpu_devinfo__get_packed_size(ad);
+
+			buf = xmalloc(len);
+			if (!buf) {
+				pr_perror("Failed to allocate memory to store protobuf");
+				return -ENOMEM;
+			}
+
+			amdgpu_devinfo__pack(ad, buf);
+
+			ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
+			if (ret) {
+				pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
+				return -EINVAL;
+			}
+
+			amdgpu_topology_dump_complete = true;
+		}
 	}
 
 	len = criu_render_node__get_packed_size(rd);
diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
index 69b98a31c..ccfe30b49 100644
--- a/plugins/amdgpu/amdgpu_plugin_util.h
+++ b/plugins/amdgpu/amdgpu_plugin_util.h
@@ -2,6 +2,7 @@
 #define __AMDGPU_PLUGIN_UTIL_H__
 
 #include <libdrm/amdgpu.h>
+#include "criu-amdgpu.pb-c.h"
 
 #ifndef _GNU_SOURCE
 #define _GNU_SOURCE 1
@@ -59,6 +60,9 @@
 /* Name of file having serialized data of DRM device buffer objects (BOs) */
 #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
 
+/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not generated) */
+#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
+
 /* Helper macros to Checkpoint and Restore a ROCm file */
 #define HSAKMT_SHM_PATH			"/dev/shm/hsakmt_shared_mem"
 #define HSAKMT_SHM				"/hsakmt_shared_mem"
@@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
 extern bool kfd_numa_check;
 extern bool kfd_capability_check;
 
+extern bool kfd_dump_complete;
+extern bool amdgpu_topology_dump_complete;
+
 int read_fp(FILE *fp, void *buf, const size_t buf_len);
 int write_fp(FILE *fp, const void *buf, const size_t buf_len);
 int read_file(const char *file_path, void *buf, const size_t buf_len);
@@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
 
 int serve_out_dmabuf_fd(int handle, int fd);
 
+int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
+
 #endif		/* __AMDGPU_PLUGIN_UTIL_H__ */
diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
index 7682a8f21..6e44e22aa 100644
--- a/plugins/amdgpu/criu-amdgpu.proto
+++ b/plugins/amdgpu/criu-amdgpu.proto
@@ -93,3 +93,8 @@ message criu_render_node {
 message criu_dmabuf_node {
 	required uint32 gem_handle = 1;
 }
+
+message amdgpu_devinfo {
+	required uint32 num_of_devices = 1;
+	repeated kfd_device_entry device_entries = 2;
+}
\ No newline at end of file
-- 
2.34.1



* [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
  2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
@ 2026-04-10 14:45 ` David Francis
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
  3 siblings, 0 replies; 15+ messages in thread
From: David Francis @ 2026-04-10 14:45 UTC (permalink / raw)
  To: criu; +Cc: tvrtko.ursulin, David Francis

The amdgpu plugin needs to know which high fds are safe to use
for storing drm fds during restore. This is kept in fd_next, which
is normally set as part of kfd restore.

If there is no kfd restore, set fd_next anyway. Use the current
pid as the pid to check; by this point in the restore, the pid has
already been changed to the correct value.
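
find_unused_fd_pid() itself is not in this diff; a plausible sketch of
the idea — scanning /proc/<pid>/fd for the highest fd currently in use
and stepping past it — looks like this (an assumption for
illustration, not the actual CRIU implementation):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Return a fd number above every fd currently open in |pid|,
 * or -1 if the pid's fd table cannot be read. */
static int find_unused_fd_above(pid_t pid)
{
	struct dirent *de;
	long max_fd = 2; /* never hand out the stdio fds */
	char path[64];
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/fd", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		char *end;
		long fd = strtol(de->d_name, &end, 10);

		if (end != de->d_name && *end == '\0' && fd > max_fd)
			max_fd = fd; /* numeric entry: an open fd */
	}
	closedir(dir);
	return (int)(max_fd + 1);
}
```

Using getpid() here matches the commit message: by this restore stage
the process already runs under the target pid, so scanning its own fd
table is equivalent to scanning the restored task's.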

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index 1e9785440..5dcf80363 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -1926,6 +1926,17 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
 
 		pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
 
+		if (fd_next == -1) {
+			current_pid = getpid();
+			fd_next = find_unused_fd_pid(getpid());
+			if (fd_next <= 0) {
+				pr_err("Failed to find unused fd (fd:%d)\n", fd_next);
+				ret = -EINVAL;
+				goto exit;
+			}
+			pr_info("High fd for pid %d set to %d\n", current_pid, fd_next);
+		}
+
 		if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
 			AmdgpuDevinfo *ad;
 
-- 
2.34.1



* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
                   ` (2 preceding siblings ...)
  2026-04-10 14:45 ` [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd David Francis
@ 2026-04-10 15:42 ` Tvrtko Ursulin
  2026-04-10 15:46   ` Francis, David
  3 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-10 15:42 UTC (permalink / raw)
  To: David Francis, criu



On 10/04/2026 15:45, David Francis wrote:
> These patches allow processes that have renderD device files open
> but not kfd device files to dump / restore.

The series allows the first/minimal test case from my IGT to pass?

Regards,

Tvrtko

> 
> Mostly a proof-of-concept / baseline for future work since there's
> currently no way for such a process to dump / restore its queues.
> 
> David Francis (3):
>    plugin/amdgpu: Add plugin to inventory even if there are no vmas
>    plugin/amdgpu: Add topology dump file
>    plugins/amdgpu: Make next_fd without kfd
> 
>   plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>   plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>   plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>   plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>   4 files changed, 180 insertions(+), 11 deletions(-)
> 



* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
@ 2026-04-10 15:46   ` Francis, David
  2026-04-10 16:04     ` Tvrtko Ursulin
  0 siblings, 1 reply; 15+ messages in thread
From: Francis, David @ 2026-04-10 15:46 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev

I haven't tested with a BO created / mapped, but since that code should already be in from the dmabuf IPC work, it should work.

That's what I'm testing next.

David.

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 10, 2026 11:42 AM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.



On 10/04/2026 15:45, David Francis wrote:
> These patches allow processes that have renderD device files open
> but not kfd device files to dump / restore.

The series allows the first/minimal test case from my IGT to pass?

Regards,

Tvrtko

>
> Mostly a proof-of-concept / baseline for future work since there's
> currently no way for such a process to dump / restore its queues.
>
> David Francis (3):
>    plugin/amdgpu: Add plugin to inventory even if there are no vmas
>    plugin/amdgpu: Add topology dump file
>    plugins/amdgpu: Make next_fd without kfd
>
>   plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>   plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>   plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>   plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>   4 files changed, 180 insertions(+), 11 deletions(-)
>



* Re: [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
@ 2026-04-10 15:46   ` Tvrtko Ursulin
  2026-04-10 16:15     ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-10 15:46 UTC (permalink / raw)
  To: David Francis, criu


On 10/04/2026 15:45, David Francis wrote:
> The amdgpu plugin is added to the plugin inventory as the first
> vma map is dumped. But it's completely possible for a process to
> have driver files open but no vma maps. In this case, we still
> will need the plugin on restore.
> 
> Add the plugin to the inventory whenever a device file is dumped.
> 
> Signed-off-by: David Francis <David.Francis@amd.com>
> ---
>   plugins/amdgpu/amdgpu_plugin.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
> index e3ba0de64..89ab10dac 100644
> --- a/plugins/amdgpu/amdgpu_plugin.c
> +++ b/plugins/amdgpu/amdgpu_plugin.c
> @@ -1439,6 +1439,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
>   		if (ret)
>   			return ret;
>   
> +		if (!plugin_added_to_inventory) {
> +			ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
> +			if (ret) {
> +				pr_err("Failed to add AMDGPU plugin to inventory image\n");
> +				return ret;
> +			}
> +			plugin_added_to_inventory = true;
> +		}
> +
>   		ret = record_dumped_fd(fd, true);
>   		if (ret)
>   			return ret;
> @@ -1543,6 +1552,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
>   	if (ret)
>   		goto exit;
>   
> +	if (!plugin_added_to_inventory) {
> +		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
> +		if (ret) {
> +			pr_err("Failed to add AMDGPU plugin to inventory image\n");
> +			goto exit;
> +		}
> +		plugin_added_to_inventory = true;
> +	}
> +
>   exit:
>   	xfree((void *)args.devices);
>   	xfree((void *)args.bos);

This is very similar to the patch I submitted and you reviewed. It just 
so happens I have to rebase, since by the time I had collected all the 
r-b's the series no longer applied. I don't mind hugely if you take 
authorship, but some sort of acknowledgement tag would have been nice.

Regards,

Tvrtko



* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 15:46   ` Francis, David
@ 2026-04-10 16:04     ` Tvrtko Ursulin
  2026-04-10 16:12       ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-10 16:04 UTC (permalink / raw)
  To: Francis, David, criu@lists.linux.dev


On 10/04/2026 16:46, Francis, David wrote:
> Haven't tested with having a BO created / mapped, but since that code should already be in from the dmabuf IPC work, it should work.
> 
> That's what I'm testing next.

I meant the first test case from 
https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d 
ie. igt_subtest("open")?

Regards,

Tvrtko

> 
> David.
> 
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 10, 2026 11:42 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
> 
> 
> 
> On 10/04/2026 15:45, David Francis wrote:
>> These patches allow processes that have renderD device files open
>> but not kfd device files to dump / restore.
> 
> The series allows the first/minimal test case from my IGT to pass?
> 
> Regards,
> 
> Tvrtko
> 
>>
>> Mostly a proof-of-concept / baseline for future work since there's
>> currently no way for such a process to dump / restore its queues.
>>
>> David Francis (3):
>>     plugin/amdgpu: Add plugin to inventory even if there are no vmas
>>     plugin/amdgpu: Add topology dump file
>>     plugins/amdgpu: Make next_fd without kfd
>>
>>    plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>>    plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>>    plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>>    plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>>    4 files changed, 180 insertions(+), 11 deletions(-)
>>
> 



* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 16:04     ` Tvrtko Ursulin
@ 2026-04-10 16:12       ` Francis, David
  0 siblings, 0 replies; 15+ messages in thread
From: Francis, David @ 2026-04-10 16:12 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev


> I meant the first test case from
> https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d
> ie. igt_subtest("open")?

Whoops, misread before. Yes, that should work.

David

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 10, 2026 12:04 PM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.


On 10/04/2026 16:46, Francis, David wrote:
> Haven't tested with having a BO created / mapped, but since that code should already be in from the dmabuf IPC work, it should work.
>
> That's what I'm testing next.

I meant the first test case from
https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d
ie. igt_subtest("open")?

Regards,

Tvrtko

>
> David.
>
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 10, 2026 11:42 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
>
>
>
> On 10/04/2026 15:45, David Francis wrote:
>> These patches allow processes that have renderD device files open
>> but not kfd device files to dump / restore.
>
> The series allows the first/minimal test case from my IGT to pass?
>
> Regards,
>
> Tvrtko
>
>>
>> Mostly a proof-of-concept / baseline for future work since there's
>> currently no way for such a process to dump / restore its queues.
>>
>> David Francis (3):
>>     plugin/amdgpu: Add plugin to inventory even if there are no vmas
>>     plugin/amdgpu: Add topology dump file
>>     plugins/amdgpu: Make next_fd without kfd
>>
>>    plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>>    plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>>    plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>>    plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>>    4 files changed, 180 insertions(+), 11 deletions(-)
>>
>



* Re: [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
  2026-04-10 15:46   ` Tvrtko Ursulin
@ 2026-04-10 16:15     ` Francis, David
  0 siblings, 0 replies; 15+ messages in thread
From: Francis, David @ 2026-04-10 16:15 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev


> This is very similar to the patch I submitted and you reviewed. Just so
> happens I have to rebase since when I collected all r-b's the series
> does not apply any longer. I don't mind hugely if you take authorship
> but some sort of a acknowledgement tag would have been nice.

My apologies; I was working on this in parallel with reviewing your
patches and missed the similarity. Yours is fine, or I can add
an attribution tag.

Sorry again,
David

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 10, 2026 11:46 AM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas


On 10/04/2026 15:45, David Francis wrote:
> The amdgpu plugin is added to the plugin inventory as the first
> vma map is dumped. But it's completely possible for a process to
> have driver files open but no vma maps. In this case, we still
> will need the plugin on restore.
>
> Add the plugin to the inventory whenever a device file is dumped.
>
> Signed-off-by: David Francis <David.Francis@amd.com>
> ---
>   plugins/amdgpu/amdgpu_plugin.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
> index e3ba0de64..89ab10dac 100644
> --- a/plugins/amdgpu/amdgpu_plugin.c
> +++ b/plugins/amdgpu/amdgpu_plugin.c
> @@ -1439,6 +1439,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
>               if (ret)
>                       return ret;
>
> +             if (!plugin_added_to_inventory) {
> +                     ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
> +                     if (ret) {
> +                             pr_err("Failed to add AMDGPU plugin to inventory image\n");
> +                             return ret;
> +                     }
> +                     plugin_added_to_inventory = true;
> +             }
> +
>               ret = record_dumped_fd(fd, true);
>               if (ret)
>                       return ret;
> @@ -1543,6 +1552,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
>       if (ret)
>               goto exit;
>
> +     if (!plugin_added_to_inventory) {
> +             ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
> +             if (ret) {
> +                     pr_err("Failed to add AMDGPU plugin to inventory image\n");
> +                     goto exit;
> +             }
> +             plugin_added_to_inventory = true;
> +     }
> +
>   exit:
>       xfree((void *)args.devices);
>       xfree((void *)args.bos);

This is very similar to the patch I submitted and you reviewed. Just so
happens I have to rebase since when I collected all r-b's the series
does not apply any longer. I don't mind hugely if you take authorship
but some sort of a acknowledgement tag would have been nice.

Regards,

Tvrtko



* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
@ 2026-04-16 14:54   ` Tvrtko Ursulin
  2026-04-16 18:26     ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-16 14:54 UTC (permalink / raw)
  To: David Francis, criu


On 10/04/2026 15:45, David Francis wrote:
> The state of the source topology (the GPUs, CPUs, and links
> between them) is saved by the plugin as part of kfd dump.
> 
> If there is no kfd dump, we need to save the topology anyways.
> 
> Do so in new file amdgpu-topology.img.

Ah neat, but I am guessing the data comes from kfd? At some point we 
will need to bite the bullet and start designing uapi for an 
amdgpu-only world.

In the meantime, would you like me to review this in detail to perhaps 
have it merged in the interim? Although in that case backward 
compatibility gets more complicated.

Regards,

Tvrtko

> 
> Signed-off-by: David Francis <David.Francis@amd.com>
> ---
>   plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
>   plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
>   plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
>   plugins/amdgpu/criu-amdgpu.proto    |  5 ++
>   4 files changed, 151 insertions(+), 11 deletions(-)
> 
> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
> index 89ab10dac..1e9785440 100644
> --- a/plugins/amdgpu/amdgpu_plugin.c
> +++ b/plugins/amdgpu/amdgpu_plugin.c
> @@ -91,6 +91,9 @@ int current_pid;
>    */
>   bool parallel_disabled = false;
>   
> +bool kfd_dump_complete = false;
> +bool amdgpu_topology_dump_complete = false;
> +
>   pthread_t parallel_thread = 0;
>   int parallel_thread_result = 0;
>   /**************************************************************************************************/
> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>   		devinfo->node_id = node->id;
>   
>   		if (NODE_IS_GPU(node)) {
> -			devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
> -			if (!devinfo->gpu_id)
> -				continue;
> +			if (maps) {
> +				devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
> +				if (!devinfo->gpu_id)
> +					continue;
> +			} else {
> +				devinfo->gpu_id = node->gpu_id;
> +			}
> +
>   
>   			devinfo->simd_count = node->simd_count;
>   			devinfo->mem_banks_count = node->mem_banks_count;
> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>   				if (!iolink->valid)
>   					continue;
>   
> -				list_for_each_entry(node2, &sys->nodes, listm_system)
> -					if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
> -						link_to_present_node = true;
> +				if (maps) {
> +					list_for_each_entry(node2, &sys->nodes, listm_system)
> +						if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
> +							link_to_present_node = true;
> +				} else {
> +					link_to_present_node = true;
> +				}
>   
>   				if (!link_to_present_node)
>   					continue;
> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
>   	maps_init(&checkpoint_maps);
>   	maps_init(&restore_maps);
>   
> +	if (stage == CR_PLUGIN_STAGE__DUMP) {
> +		kfd_dump_complete = false;
> +		amdgpu_topology_dump_complete = false;
> +	}
> +
>   	if (stage == CR_PLUGIN_STAGE__RESTORE) {
>   		if (has_children(root_item)) {
>   			pr_info("Parallel restore disabled\n");
> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
>   	if (ret)
>   		goto exit;
>   
> +	kfd_dump_complete = true;
>   	if (!plugin_added_to_inventory) {
>   		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
>   		if (ret) {
> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
>   
>   		pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
>   
> +		if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
> +			AmdgpuDevinfo *ad;
> +
> +			pr_info("No restore maps found, making them from topology file\n");
> +
> +			img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
> +			if (!img_fp) {
> +				pr_err("Failed to find either kfd or amdgpu src topology information\n");
> +				ret = -EINVAL;
> +				goto exit;
> +			}
> +
> +			buf = xmalloc(img_size);
> +			if (!buf) {
> +				pr_err("Failed to allocate memory\n");
> +				return -ENOMEM;
> +			}
> +
> +			ret = read_fp(img_fp, buf, img_size);
> +			if (ret) {
> +				pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
> +				ret = -EINVAL;
> +				goto exit;
> +			}
> +
> +			ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
> +			if (ad == NULL) {
> +				pr_err("Unable to parse the amdgpu topology message\n");
> +				fclose(img_fp);
> +				ret = -EINVAL;
> +				goto exit;
> +			}
> +			fclose(img_fp);
> +
> +			ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
> +			if (ret) {
> +				pr_err("Failed to convert amdgpu device information to topology\n");
> +				ret = -EINVAL;
> +				goto exit;
> +			}
> +
> +			ret = topology_parse(&dest_topology, "Local");
> +			if (ret) {
> +				pr_err("Failed to parse local system topology\n");
> +				goto exit;
> +			}
> +
> +			ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
> +			if (ret) {
> +				pr_err("Failed to map GPUs\n");
> +				goto exit;
> +			}
> +		}
> +
>   		target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
>   		if (!target_gpu_id) {
>   			fd = -ENODEV;
> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
> index c1dfb2dd4..a4c650753 100644
> --- a/plugins/amdgpu/amdgpu_plugin_drm.c
> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c
> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
>   		return -ENODEV;
>   	}
>   
> -	/* Get the GPU_ID of the DRM device */
> -	rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
> -	if (!rd->gpu_id) {
> -		pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
> -		return -ENODEV;
> +	if (kfd_dump_complete) {
> +		/* Get the GPU_ID of the DRM device */
> +		rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
> +		if (!rd->gpu_id) {
> +			pr_err("Failed to find valid gpu_id for the device = %d\n", tp_node->gpu_id);
> +			return -ENODEV;
> +		}
> +	} else {
> +		rd->gpu_id = tp_node->gpu_id;
> +
> +		if (!amdgpu_topology_dump_complete) {
> +			AmdgpuDevinfo *ad = NULL;
> +			unsigned char *buf;
> +
> +			ad = xmalloc(sizeof(*ad));
> +			amdgpu_devinfo__init(ad);
> +
> +			ad->num_of_devices = src_topology.num_nodes;
> +
> +			ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
> +			if (!ad->device_entries) {
> +				pr_err("Failed to allocate device_entries\n");
> +				return -ENOMEM;
> +			}
> +
> +			for (int i = 0; i < ad->num_of_devices; i++) {
> +				KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
> +
> +				if (!entry) {
> +					pr_err("Failed to allocate entry\n");
> +					return -ENOMEM;
> +				}
> +
> +				kfd_device_entry__init(entry);
> +
> +				ad->device_entries[i] = entry;
> +				ad->n_device_entries++;
> +			}
> +
> +			topology_to_devinfo(&src_topology, NULL, ad->device_entries);
> +
> +			len = amdgpu_devinfo__get_packed_size(ad);
> +
> +			buf = xmalloc(len);
> +			if (!buf) {
> +				pr_perror("Failed to allocate memory to store protobuf");
> +				return -ENOMEM;
> +			}
> +
> +			amdgpu_devinfo__pack(ad, buf);
> +
> +			ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
> +			if (ret) {
> +				pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
> +				return -EINVAL;
> +			}
> +
> +			amdgpu_topology_dump_complete = true;
> +		}
>   	}
>   
>   	len = criu_render_node__get_packed_size(rd);
> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
> index 69b98a31c..ccfe30b49 100644
> --- a/plugins/amdgpu/amdgpu_plugin_util.h
> +++ b/plugins/amdgpu/amdgpu_plugin_util.h
> @@ -2,6 +2,7 @@
>   #define __AMDGPU_PLUGIN_UTIL_H__
>   
>   #include <libdrm/amdgpu.h>
> +#include "criu-amdgpu.pb-c.h"
>   
>   #ifndef _GNU_SOURCE
>   #define _GNU_SOURCE 1
> @@ -59,6 +60,9 @@
>   /* Name of file having serialized data of DRM device buffer objects (BOs) */
>   #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
>   
> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not) */
> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
> +
>   /* Helper macros to Checkpoint and Restore a ROCm file */
>   #define HSAKMT_SHM_PATH			"/dev/shm/hsakmt_shared_mem"
>   #define HSAKMT_SHM				"/hsakmt_shared_mem"
> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
>   extern bool kfd_numa_check;
>   extern bool kfd_capability_check;
>   
> +extern bool kfd_dump_complete;
> +extern bool amdgpu_topology_dump_complete;
> +
>   int read_fp(FILE *fp, void *buf, const size_t buf_len);
>   int write_fp(FILE *fp, const void *buf, const size_t buf_len);
>   int read_file(const char *file_path, void *buf, const size_t buf_len);
> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
>   
>   int serve_out_dmabuf_fd(int handle, int fd);
>   
> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
> +
>   #endif		/* __AMDGPU_PLUGIN_UTIL_H__ */
> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
> index 7682a8f21..6e44e22aa 100644
> --- a/plugins/amdgpu/criu-amdgpu.proto
> +++ b/plugins/amdgpu/criu-amdgpu.proto
> @@ -93,3 +93,8 @@ message criu_render_node {
>   message criu_dmabuf_node {
>   	required uint32 gem_handle = 1;
>   }
> +
> +message amdgpu_devinfo {
> +	required uint32 num_of_devices = 1;
> +	repeated kfd_device_entry device_entries = 2;
> +}
> \ No newline at end of file


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-16 14:54   ` Tvrtko Ursulin
@ 2026-04-16 18:26     ` Francis, David
  2026-04-17  9:07       ` Tvrtko Ursulin
  0 siblings, 1 reply; 15+ messages in thread
From: Francis, David @ 2026-04-16 18:26 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev

> Ah neat, but I am guessing the data comes from kfd? At some point we
> will need to bite the bullet and start designing uapi for an amdgpu only
> world.

No, the data comes solely from sysfs in this case.

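To make that concrete, here is a minimal, hypothetical sketch (not code from
the plugin; the helper name is made up) of pulling one property such as gpu_id
out of the sysfs topology tree under /sys/class/kfd/kfd/topology/nodes/<N>/:

```c
#include <stdio.h>

/*
 * Illustrative helper, not the plugin's real parser: read one unsigned
 * integer property (e.g. gpu_id) from a sysfs-style file such as
 * /sys/class/kfd/kfd/topology/nodes/<N>/gpu_id.
 *
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_sysfs_uint(const char *path, unsigned int *val)
{
	FILE *fp = fopen(path, "r");
	int ret = -1;

	if (!fp)
		return -1;
	if (fscanf(fp, "%u", val) == 1)
		ret = 0;
	fclose(fp);
	return ret;
}
```

The plugin's actual parser (topology_parse()) is of course more involved; it
walks every node directory and reads many more properties, but each individual
read is of this shape.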
There's no urgency to merge this since it doesn't do much without more patches,
but it's also unlikely to cause problems on its own: if kfd is present, the
plugin will continue to make the same dump it always did.

Review would be appreciated, thanks.


________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Thursday, April 16, 2026 10:54 AM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file


On 10/04/2026 15:45, David Francis wrote:
> The state of the source topology (the GPUs, CPUs, and links
> between them) is saved by the plugin as part of kfd dump.
>
> If there is no kfd dump, we need to save the topology anyway.
>
> Do so in new file amdgpu-topology.img.

Ah neat, but I am guessing the data comes from kfd? At some point we
will need to bite the bullet and start designing uapi for an amdgpu only
world.

In the meantime, would you like me to review this in detail to perhaps
have it merged in the interim? Although in that case backward
compatibility gets more complicated.

Regards,

Tvrtko



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-16 18:26     ` Francis, David
@ 2026-04-17  9:07       ` Tvrtko Ursulin
  2026-04-17 13:21         ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-17  9:07 UTC (permalink / raw)
  To: Francis, David, criu@lists.linux.dev


On 16/04/2026 19:26, Francis, David wrote:
>> Ah neat, but I am guessing the data comes from kfd? At some point we
>> will need to bite the bullet and start designing uapi for an amdgpu only
>> world.
> 
> No, the data comes solely from sysfs in this case.

But exported by kfd, no? I.e. /sys/class/kfd/kfd/topology/nodes/. I
understand the goal is to migrate away from kfd, which is why I wondered
whether it makes sense to add an image based on both internal and external
kfd data.

> No urgency on merging this since it doesn't do much without more patches
> but also unlikely to cause problems on its own - if kfd is present, the plugin
> will continue to make the same dump it always did.
> 
> Review would be appreciated, thanks.

In one respect it is better than my hack of allowing restore from one
single-GPU system to another, although even there the gpu id is not stable.
On the other hand it does add an image format which looks like a dead end
from a design perspective.

Hm, how does this patch handle gpu id changing across reboots?

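To illustrate with a toy model (made-up names, not the plugin's real
device_maps): restore translates every recorded id through a src-to-dest
table, so this only works if set_restore_gpu_maps() matches nodes by
topology rather than by the raw numeric id.

```c
/*
 * Toy illustration of the src -> dest gpu_id translation that
 * maps_get_dest_gpu() performs against the maps built by
 * set_restore_gpu_maps(); the struct and function names here are
 * invented for the example.
 */
struct toy_gpu_map {
	unsigned int src_gpu_id;	/* id recorded at dump time */
	unsigned int dest_gpu_id;	/* id of the matched local GPU */
};

static unsigned int toy_get_dest_gpu(const struct toy_gpu_map *maps,
				     int count, unsigned int src)
{
	int i;

	for (i = 0; i < count; i++)
		if (maps[i].src_gpu_id == src)
			return maps[i].dest_gpu_id;
	return 0;	/* 0 means "no mapping", as in the plugin */
}
```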
Regards,

Tvrtko

> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-17  9:07       ` Tvrtko Ursulin
@ 2026-04-17 13:21         ` Francis, David
  2026-04-22 13:58           ` Tvrtko Ursulin
  0 siblings, 1 reply; 15+ messages in thread
From: Francis, David @ 2026-04-17 13:21 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev

>>> Ah neat, but I am guessing the data comes from kfd? At some point we
>>> will need to bite the bullet and start designing uapi for an amdgpu only
>>> world.
>>
>> No, the data comes solely from sysfs in this case.
>
> But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I
> understand the goal is to migrate away from kfd which is why I wondered
> if it makes sense to add an image based on both internal and external
> kfd data.

I think current plans are to deprecate the kfd device file first -
/sys/class/kfd will continue to exist for a good while longer.

> In one aspect it is better than my hack of allowing restore from a single-GPU
> to a single-GPU system, although even there the gpu id is not stable,
> but on the other hand it does add an image format which looks like a
> dead end from a design perspective.
>
> Hm, how does this patch handle gpu id changing across reboots?

It does the whole amdgpu_plugin_topology matching thing to find a
mapping between the old GPUs and the new ones.

The renderD minor numbers of the new GPUs determine which files are
opened to replace the open renderD handles. It doesn't really matter
what the gpu ids are at that point. To target a specific GPU in amdgpu,
you don't need gpu id, you just point the operation at the right renderD
file.

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 17, 2026 5:07 AM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file


On 16/04/2026 19:26, Francis, David wrote:
>> Ah neat, but I am guessing the data comes from kfd? At some point we
>> will need to bite the bullet and start designing uapi for an amdgpu only
>> world.
>
> No, the data comes solely from sysfs in this case.

But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I
understand the goal is to migrate away from kfd which is why I wondered
if it makes sense to add an image based on both internal and external
kfd data.

> No urgency on merging this since it doesn't do much without more patches
> but also unlikely to cause problems on its own - if kfd is present, the plugin
> will continue to make the same dump it always did.
>
> Review would be appreciated, thanks.

In one aspect it is better than my hack of allowing restore from a single-GPU
to a single-GPU system, although even there the gpu id is not stable,
but on the other hand it does add an image format which looks like a
dead end from a design perspective.

Hm, how does this patch handle gpu id changing across reboots?

Regards,

Tvrtko

> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Thursday, April 16, 2026 10:54 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
>
>
> On 10/04/2026 15:45, David Francis wrote:
>> The state of the source topology (the GPUs, CPUs, and links
>> between them) is saved by the plugin as part of kfd dump.
>>
>> If there is no kfd dump, we need to save the topology anyway.
>>
>> Do so in new file amdgpu-topology.img.
>
> Ah neat, but I am guessing the data comes from kfd? At some point we
> will need to bite the bullet and start designing uapi for an amdgpu only
> world.
>
> In the meantime, would you like me to review this in detail to perhaps
> have it merged in the interim? Although in that case backward
> compatibility gets more complicated.
>
> Regards,
>
> Tvrtko
>
>>
>> Signed-off-by: David Francis <David.Francis@amd.com>
>> ---
>>    plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
>>    plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
>>    plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
>>    plugins/amdgpu/criu-amdgpu.proto    |  5 ++
>>    4 files changed, 151 insertions(+), 11 deletions(-)
>>
>> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
>> index 89ab10dac..1e9785440 100644
>> --- a/plugins/amdgpu/amdgpu_plugin.c
>> +++ b/plugins/amdgpu/amdgpu_plugin.c
>> @@ -91,6 +91,9 @@ int current_pid;
>>     */
>>    bool parallel_disabled = false;
>>
>> +bool kfd_dump_complete = false;
>> +bool amdgpu_topology_dump_complete = false;
>> +
>>    pthread_t parallel_thread = 0;
>>    int parallel_thread_result = 0;
>>    /**************************************************************************************************/
>> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>                devinfo->node_id = node->id;
>>
>>                if (NODE_IS_GPU(node)) {
>> -                     devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>> -                     if (!devinfo->gpu_id)
>> -                             continue;
>> +                     if (maps) {
>> +                             devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>> +                             if (!devinfo->gpu_id)
>> +                                     continue;
>> +                     } else {
>> +                             devinfo->gpu_id = node->gpu_id;
>> +                     }
>> +
>>
>>                        devinfo->simd_count = node->simd_count;
>>                        devinfo->mem_banks_count = node->mem_banks_count;
>> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>                                if (!iolink->valid)
>>                                        continue;
>>
>> -                             list_for_each_entry(node2, &sys->nodes, listm_system)
>> -                                     if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>> -                                             link_to_present_node = true;
>> +                             if (maps) {
>> +                                     list_for_each_entry(node2, &sys->nodes, listm_system)
>> +                                             if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>> +                                                     link_to_present_node = true;
>> +                             } else {
>> +                                     link_to_present_node = true;
>> +                             }
>>
>>                                if (!link_to_present_node)
>>                                        continue;
>> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
>>        maps_init(&checkpoint_maps);
>>        maps_init(&restore_maps);
>>
>> +     if (stage == CR_PLUGIN_STAGE__DUMP) {
>> +             kfd_dump_complete = false;
>> +             amdgpu_topology_dump_complete = false;
>> +     }
>> +
>>        if (stage == CR_PLUGIN_STAGE__RESTORE) {
>>                if (has_children(root_item)) {
>>                        pr_info("Parallel restore disabled\n");
>> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
>>        if (ret)
>>                goto exit;
>>
>> +     kfd_dump_complete = true;
>>        if (!plugin_added_to_inventory) {
>>                ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
>>                if (ret) {
>> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
>>
>>                pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
>>
>> +             if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
>> +                     AmdgpuDevinfo *ad;
>> +
>> +                     pr_info("No restore maps found, making them from topology file\n");
>> +
>> +                     img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
>> +                     if (!img_fp) {
>> +                             pr_err("Failed to find either kfd or amdgpu src topology information\n");
>> +                             ret = -EINVAL;
>> +                             goto exit;
>> +                     }
>> +
>> +                     buf = xmalloc(img_size);
>> +                     if (!buf) {
>> +                             pr_err("Failed to allocate memory\n");
>> +                             return -ENOMEM;
>> +                     }
>> +
>> +                     ret = read_fp(img_fp, buf, img_size);
>> +                     if (ret) {
>> +                             pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>> +                             ret = -EINVAL;
>> +                             goto exit;
>> +                     }
>> +
>> +                     ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
>> +                     if (ad == NULL) {
>> +                             pr_err("Unable to parse the amdgpu topology message\n");
>> +                             fclose(img_fp);
>> +                             ret = -EINVAL;
>> +                             goto exit;
>> +                     }
>> +                     fclose(img_fp);
>> +
>> +                     ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
>> +                     if (ret) {
>> +                             pr_err("Failed to convert amdgpu device information to topology\n");
>> +                             ret = -EINVAL;
>> +                             goto exit;
>> +                     }
>> +
>> +                     ret = topology_parse(&dest_topology, "Local");
>> +                     if (ret) {
>> +                             pr_err("Failed to parse local system topology\n");
>> +                             goto exit;
>> +                     }
>> +
>> +                     ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
>> +                     if (ret) {
>> +                             pr_err("Failed to map GPUs\n");
>> +                             goto exit;
>> +                     }
>> +             }
>> +
>>                target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
>>                if (!target_gpu_id) {
>>                        fd = -ENODEV;
>> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
>> index c1dfb2dd4..a4c650753 100644
>> --- a/plugins/amdgpu/amdgpu_plugin_drm.c
>> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c
>> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
>>                return -ENODEV;
>>        }
>>
>> -     /* Get the GPU_ID of the DRM device */
>> -     rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>> -     if (!rd->gpu_id) {
>> -             pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>> -             return -ENODEV;
>> +     if (kfd_dump_complete) {
>> +             /* Get the GPU_ID of the DRM device */
>> +             rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>> +             if (!rd->gpu_id) {
>> +                     pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>> +                     return -ENODEV;
>> +             }
>> +     } else {
>> +             rd->gpu_id = tp_node->gpu_id;
>> +
>> +             if (!amdgpu_topology_dump_complete) {
>> +                     AmdgpuDevinfo *ad = NULL;
>> +                     unsigned char *buf;
>> +
>> +                     ad = xmalloc(sizeof(*ad));
>> +                     amdgpu_devinfo__init(ad);
>> +
>> +                     ad->num_of_devices = src_topology.num_nodes;
>> +
>> +                     ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
>> +                     if (!ad->device_entries) {
>> +                             pr_err("Failed to allocate device_entries\n");
>> +                             return -ENOMEM;
>> +                     }
>> +
>> +                     for (int i = 0; i < ad->num_of_devices; i++) {
>> +                             KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
>> +
>> +                             if (!entry) {
>> +                                     pr_err("Failed to allocate entry\n");
>> +                                     return -ENOMEM;
>> +                             }
>> +
>> +                             kfd_device_entry__init(entry);
>> +
>> +                             ad->device_entries[i] = entry;
>> +                             ad->n_device_entries++;
>> +                     }
>> +
>> +                     topology_to_devinfo(&src_topology, NULL, ad->device_entries);
>> +
>> +                     len = amdgpu_devinfo__get_packed_size(ad);
>> +
>> +                     buf = xmalloc(len);
>> +                     if (!buf) {
>> +                             pr_perror("Failed to allocate memory to store protobuf");
>> +                             return -ENOMEM;
>> +                     }
>> +
>> +                     amdgpu_devinfo__pack(ad, buf);
>> +
>> +                     ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
>> +                     if (ret) {
>> +                             pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>> +                             return -EINVAL;
>> +                     }
>> +
>> +                     amdgpu_topology_dump_complete = true;
>> +             }
>>        }
>>
>>        len = criu_render_node__get_packed_size(rd);
>> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
>> index 69b98a31c..ccfe30b49 100644
>> --- a/plugins/amdgpu/amdgpu_plugin_util.h
>> +++ b/plugins/amdgpu/amdgpu_plugin_util.h
>> @@ -2,6 +2,7 @@
>>    #define __AMDGPU_PLUGIN_UTIL_H__
>>
>>    #include <libdrm/amdgpu.h>
>> +#include "criu-amdgpu.pb-c.h"
>>
>>    #ifndef _GNU_SOURCE
>>    #define _GNU_SOURCE 1
>> @@ -59,6 +60,9 @@
>>    /* Name of file having serialized data of DRM device buffer objects (BOs) */
>>    #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
>>
>> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not) */
>> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
>> +
>>    /* Helper macros to Checkpoint and Restore a ROCm file */
>>    #define HSAKMT_SHM_PATH                     "/dev/shm/hsakmt_shared_mem"
>>    #define HSAKMT_SHM                          "/hsakmt_shared_mem"
>> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
>>    extern bool kfd_numa_check;
>>    extern bool kfd_capability_check;
>>
>> +extern bool kfd_dump_complete;
>> +extern bool amdgpu_topology_dump_complete;
>> +
>>    int read_fp(FILE *fp, void *buf, const size_t buf_len);
>>    int write_fp(FILE *fp, const void *buf, const size_t buf_len);
>>    int read_file(const char *file_path, void *buf, const size_t buf_len);
>> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
>>
>>    int serve_out_dmabuf_fd(int handle, int fd);
>>
>> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
>> +
>>    #endif              /* __AMDGPU_PLUGIN_UTIL_H__ */
>> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
>> index 7682a8f21..6e44e22aa 100644
>> --- a/plugins/amdgpu/criu-amdgpu.proto
>> +++ b/plugins/amdgpu/criu-amdgpu.proto
>> @@ -93,3 +93,8 @@ message criu_render_node {
>>    message criu_dmabuf_node {
>>        required uint32 gem_handle = 1;
>>    }
>> +
>> +message amdgpu_devinfo {
>> +     required uint32 num_of_devices = 1;
>> +     repeated kfd_device_entry device_entries = 2;
>> +}
>> \ No newline at end of file
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-17 13:21         ` Francis, David
@ 2026-04-22 13:58           ` Tvrtko Ursulin
  0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-22 13:58 UTC (permalink / raw)
  To: Francis, David, criu@lists.linux.dev


On 17/04/2026 14:21, Francis, David wrote:
>>>> Ah neat, but I am guessing the data comes from kfd? At some point we
>>>> will need to bite the bullet and start designing uapi for an amdgpu only
>>>> world.
>>>
>>> No, the data comes solely from sysfs in this case.
>>
>> But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I
>> understand the goal is to migrate away from kfd which is why I wondered
>> if it makes sense to add an image based on both internal and external
>> kfd data.
> 
> I think current plans are to deprecate the kfd device file first -
> /sys/class/kfd will continue to exist for a good while longer.

Exported by the kfd driver, or you mean the plan is to export them from 
amdgpu to achieve uapi compatibility in a way?

Regards,

Tvrtko

>> In one aspect it is better than my hack of allowing restore from a single-GPU
>> to a single-GPU system, although even there the gpu id is not stable,
>> but on the other hand it does add an image format which looks like a
>> dead end from a design perspective.
>>
>> Hm, how does this patch handle gpu id changing across reboots?
> 
> It does the whole amdgpu_plugin_topology matching thing to find a
> mapping between the old GPUs and the new ones.
> 
> The renderD minor numbers of the new GPUs determine which files are
> opened to replace the open renderD handles. It doesn't really matter
> what the gpu ids are at that point. To target a specific GPU in amdgpu,
> you don't need gpu id, you just point the operation at the right renderD
> file.
> 
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 17, 2026 5:07 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
> 
> 
> On 16/04/2026 19:26, Francis, David wrote:
>>> Ah neat, but I am guessing the data comes from kfd? At some point we
>>> will need to bite the bullet and start designing uapi for an amdgpu only
>>> world.
>>
>> No, the data comes solely from sysfs in this case.
> 
> But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I
> understand the goal is to migrate away from kfd which is why I wondered
> if it makes sense to add an image based on both internal and external
> kfd data.
> 
>> No urgency on merging this since it doesn't do much without more patches
>> but also unlikely to cause problems on its own - if kfd is present, the plugin
>> will continue to make the same dump it always did.
>>
>> Review would be appreciated, thanks.
> 
> In one aspect it is better than my hack of allowing restore from a single-GPU
> to a single-GPU system, although even there the gpu id is not stable,
> but on the other hand it does add an image format which looks like a
> dead end from a design perspective.
> 
> Hm, how does this patch handle gpu id changing across reboots?
> 
> Regards,
> 
> Tvrtko
> 
>> ________________________________________
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Sent: Thursday, April 16, 2026 10:54 AM
>> To: Francis, David; criu@lists.linux.dev
>> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
>>
>>
>> On 10/04/2026 15:45, David Francis wrote:
>>> The state of the source topology (the GPUs, CPUs, and links
>>> between them) is saved by the plugin as part of kfd dump.
>>>
>>> If there is no kfd dump, we need to save the topology anyway.
>>>
>>> Do so in new file amdgpu-topology.img.
>>
>> Ah neat, but I am guessing the data comes from kfd? At some point we
>> will need to bite the bullet and start designing uapi for an amdgpu only
>> world.
>>
>> In the meantime, would you like me to review this in detail to perhaps
>> have it merged in the interim? Although in that case backward
>> compatibility gets more complicated.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Signed-off-by: David Francis <David.Francis@amd.com>
>>> ---
>>>     plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
>>>     plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
>>>     plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
>>>     plugins/amdgpu/criu-amdgpu.proto    |  5 ++
>>>     4 files changed, 151 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
>>> index 89ab10dac..1e9785440 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin.c
>>> +++ b/plugins/amdgpu/amdgpu_plugin.c
>>> @@ -91,6 +91,9 @@ int current_pid;
>>>      */
>>>     bool parallel_disabled = false;
>>>
>>> +bool kfd_dump_complete = false;
>>> +bool amdgpu_topology_dump_complete = false;
>>> +
>>>     pthread_t parallel_thread = 0;
>>>     int parallel_thread_result = 0;
>>>     /**************************************************************************************************/
>>> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>>                 devinfo->node_id = node->id;
>>>
>>>                 if (NODE_IS_GPU(node)) {
>>> -                     devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>>> -                     if (!devinfo->gpu_id)
>>> -                             continue;
>>> +                     if (maps) {
>>> +                             devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>>> +                             if (!devinfo->gpu_id)
>>> +                                     continue;
>>> +                     } else {
>>> +                             devinfo->gpu_id = node->gpu_id;
>>> +                     }
>>> +
>>>
>>>                         devinfo->simd_count = node->simd_count;
>>>                         devinfo->mem_banks_count = node->mem_banks_count;
>>> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>>                                 if (!iolink->valid)
>>>                                         continue;
>>>
>>> -                             list_for_each_entry(node2, &sys->nodes, listm_system)
>>> -                                     if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>>> -                                             link_to_present_node = true;
>>> +                             if (maps) {
>>> +                                     list_for_each_entry(node2, &sys->nodes, listm_system)
>>> +                                             if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>>> +                                                     link_to_present_node = true;
>>> +                             } else {
>>> +                                     link_to_present_node = true;
>>> +                             }
>>>
>>>                                 if (!link_to_present_node)
>>>                                         continue;
>>> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
>>>         maps_init(&checkpoint_maps);
>>>         maps_init(&restore_maps);
>>>
>>> +     if (stage == CR_PLUGIN_STAGE__DUMP) {
>>> +             kfd_dump_complete = false;
>>> +             amdgpu_topology_dump_complete = false;
>>> +     }
>>> +
>>>         if (stage == CR_PLUGIN_STAGE__RESTORE) {
>>>                 if (has_children(root_item)) {
>>>                         pr_info("Parallel restore disabled\n");
>>> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
>>>         if (ret)
>>>                 goto exit;
>>>
>>> +     kfd_dump_complete = true;
>>>         if (!plugin_added_to_inventory) {
>>>                 ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
>>>                 if (ret) {
>>> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
>>>
>>>                 pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
>>>
>>> +             if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
>>> +                     AmdgpuDevinfo *ad;
>>> +
>>> +                     pr_info("No restore maps found, making them from topology file\n");
>>> +
>>> +                     img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
>>> +                     if (!img_fp) {
>>> +                             pr_err("Failed to find either kfd or amdgpu src topology information\n");
>>> +                             ret = -EINVAL;
>>> +                             goto exit;
>>> +                     }
>>> +
>>> +                     buf = xmalloc(img_size);
>>> +                     if (!buf) {
>>> +                             pr_err("Failed to allocate memory\n");
>>> +                             return -ENOMEM;
>>> +                     }
>>> +
>>> +                     ret = read_fp(img_fp, buf, img_size);
>>> +                     if (ret) {
>>> +                             pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>>> +                             ret = -EINVAL;
>>> +                             goto exit;
>>> +                     }
>>> +
>>> +                     ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
>>> +                     if (ad == NULL) {
>>> +                             pr_err("Unable to parse the amdgpu topology message\n");
>>> +                             fclose(img_fp);
>>> +                             ret = -EINVAL;
>>> +                             goto exit;
>>> +                     }
>>> +                     fclose(img_fp);
>>> +
>>> +                     ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
>>> +                     if (ret) {
>>> +                             pr_err("Failed to convert amdgpu device information to topology\n");
>>> +                             ret = -EINVAL;
>>> +                             goto exit;
>>> +                     }
>>> +
>>> +                     ret = topology_parse(&dest_topology, "Local");
>>> +                     if (ret) {
>>> +                             pr_err("Failed to parse local system topology\n");
>>> +                             goto exit;
>>> +                     }
>>> +
>>> +                     ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
>>> +                     if (ret) {
>>> +                             pr_err("Failed to map GPUs\n");
>>> +                             goto exit;
>>> +                     }
>>> +             }
>>> +
>>>                 target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
>>>                 if (!target_gpu_id) {
>>>                         fd = -ENODEV;
>>> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
>>> index c1dfb2dd4..a4c650753 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin_drm.c
>>> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c
>>> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
>>>                 return -ENODEV;
>>>         }
>>>
>>> -     /* Get the GPU_ID of the DRM device */
>>> -     rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>>> -     if (!rd->gpu_id) {
>>> -             pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>>> -             return -ENODEV;
>>> +     if (kfd_dump_complete) {
>>> +             /* Get the GPU_ID of the DRM device */
>>> +             rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>>> +             if (!rd->gpu_id) {
>>> +                     pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>>> +                     return -ENODEV;
>>> +             }
>>> +     } else {
>>> +             rd->gpu_id = tp_node->gpu_id;
>>> +
>>> +             if (!amdgpu_topology_dump_complete) {
>>> +                     AmdgpuDevinfo *ad = NULL;
>>> +                     unsigned char *buf;
>>> +
>>> +                     ad = xmalloc(sizeof(*ad));
>>> +                     amdgpu_devinfo__init(ad);
>>> +
>>> +                     ad->num_of_devices = src_topology.num_nodes;
>>> +
>>> +                     ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
>>> +                     if (!ad->device_entries) {
>>> +                             pr_err("Failed to allocate device_entries\n");
>>> +                             return -ENOMEM;
>>> +                     }
>>> +
>>> +                     for (int i = 0; i < ad->num_of_devices; i++) {
>>> +                             KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
>>> +
>>> +                             if (!entry) {
>>> +                                     pr_err("Failed to allocate entry\n");
>>> +                                     return -ENOMEM;
>>> +                             }
>>> +
>>> +                             kfd_device_entry__init(entry);
>>> +
>>> +                             ad->device_entries[i] = entry;
>>> +                             ad->n_device_entries++;
>>> +                     }
>>> +
>>> +                     topology_to_devinfo(&src_topology, NULL, ad->device_entries);
>>> +
>>> +                     len = amdgpu_devinfo__get_packed_size(ad);
>>> +
>>> +                     buf = xmalloc(len);
>>> +                     if (!buf) {
>>> +                             pr_perror("Failed to allocate memory to store protobuf");
>>> +                             return -ENOMEM;
>>> +                     }
>>> +
>>> +                     amdgpu_devinfo__pack(ad, buf);
>>> +
>>> +                     ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
>>> +                     if (ret) {
>>> +                             pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>>> +                             return -EINVAL;
>>> +                     }
>>> +
>>> +                     amdgpu_topology_dump_complete = true;
>>> +             }
>>>         }
>>>
>>>         len = criu_render_node__get_packed_size(rd);
>>> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
>>> index 69b98a31c..ccfe30b49 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin_util.h
>>> +++ b/plugins/amdgpu/amdgpu_plugin_util.h
>>> @@ -2,6 +2,7 @@
>>>     #define __AMDGPU_PLUGIN_UTIL_H__
>>>
>>>     #include <libdrm/amdgpu.h>
>>> +#include "criu-amdgpu.pb-c.h"
>>>
>>>     #ifndef _GNU_SOURCE
>>>     #define _GNU_SOURCE 1
>>> @@ -59,6 +60,9 @@
>>>     /* Name of file having serialized data of DRM device buffer objects (BOs) */
>>>     #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
>>>
>>> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not) */
>>> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
>>> +
>>>     /* Helper macros to Checkpoint and Restore a ROCm file */
>>>     #define HSAKMT_SHM_PATH                     "/dev/shm/hsakmt_shared_mem"
>>>     #define HSAKMT_SHM                          "/hsakmt_shared_mem"
>>> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
>>>     extern bool kfd_numa_check;
>>>     extern bool kfd_capability_check;
>>>
>>> +extern bool kfd_dump_complete;
>>> +extern bool amdgpu_topology_dump_complete;
>>> +
>>>     int read_fp(FILE *fp, void *buf, const size_t buf_len);
>>>     int write_fp(FILE *fp, const void *buf, const size_t buf_len);
>>>     int read_file(const char *file_path, void *buf, const size_t buf_len);
>>> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
>>>
>>>     int serve_out_dmabuf_fd(int handle, int fd);
>>>
>>> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
>>> +
>>>     #endif              /* __AMDGPU_PLUGIN_UTIL_H__ */
>>> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
>>> index 7682a8f21..6e44e22aa 100644
>>> --- a/plugins/amdgpu/criu-amdgpu.proto
>>> +++ b/plugins/amdgpu/criu-amdgpu.proto
>>> @@ -93,3 +93,8 @@ message criu_render_node {
>>>     message criu_dmabuf_node {
>>>         required uint32 gem_handle = 1;
>>>     }
>>> +
>>> +message amdgpu_devinfo {
>>> +     required uint32 num_of_devices = 1;
>>> +     repeated kfd_device_entry device_entries = 2;
>>> +}
>>> \ No newline at end of file
>>
> 



end of thread, other threads:[~2026-04-22 13:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
2026-04-10 15:46   ` Tvrtko Ursulin
2026-04-10 16:15     ` Francis, David
2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
2026-04-16 14:54   ` Tvrtko Ursulin
2026-04-16 18:26     ` Francis, David
2026-04-17  9:07       ` Tvrtko Ursulin
2026-04-17 13:21         ` Francis, David
2026-04-22 13:58           ` Tvrtko Ursulin
2026-04-10 14:45 ` [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd David Francis
2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
2026-04-10 15:46   ` Francis, David
2026-04-10 16:04     ` Tvrtko Ursulin
2026-04-10 16:12       ` Francis, David
