* [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
@ 2026-04-10 14:45 David Francis
From: David Francis @ 2026-04-10 14:45 UTC
To: criu; +Cc: tvrtko.ursulin, David Francis
These patches allow processes that have renderD device files open,
but no kfd device file, to dump / restore.

This is mostly a proof of concept / baseline for future work, since
there is currently no way for such a process to dump / restore its
queues.
David Francis (3):
plugin/amdgpu: Add plugin to inventory even if there are no vmas
plugin/amdgpu: Add topology dump file
plugins/amdgpu: Make next_fd without kfd
 plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
 plugins/amdgpu/criu-amdgpu.proto    |   5 ++
4 files changed, 180 insertions(+), 11 deletions(-)
--
2.34.1
* [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
From: David Francis @ 2026-04-10 14:45 UTC
To: criu; +Cc: tvrtko.ursulin, David Francis

The amdgpu plugin is added to the plugin inventory as the first
vma map is dumped. But it's completely possible for a process to
have driver files open but no vma maps. In this case, we still
will need the plugin on restore.

Add the plugin to the inventory whenever a device file is dumped.

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index e3ba0de64..89ab10dac 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -1439,6 +1439,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
 	if (ret)
 		return ret;
 
+	if (!plugin_added_to_inventory) {
+		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
+		if (ret) {
+			pr_err("Failed to add AMDGPU plugin to inventory image\n");
+			return ret;
+		}
+		plugin_added_to_inventory = true;
+	}
+
 	ret = record_dumped_fd(fd, true);
 	if (ret)
 		return ret;
@@ -1543,6 +1552,15 @@ int amdgpu_plugin_dump_file(int fd, int id)
 	if (ret)
 		goto exit;
 
+	if (!plugin_added_to_inventory) {
+		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
+		if (ret) {
+			pr_err("Failed to add AMDGPU plugin to inventory image\n");
+			goto exit;
+		}
+		plugin_added_to_inventory = true;
+	}
+
 exit:
 	xfree((void *)args.devices);
 	xfree((void *)args.bos);
-- 
2.34.1
* Re: [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
From: Tvrtko Ursulin @ 2026-04-10 15:46 UTC
To: David Francis, criu

On 10/04/2026 15:45, David Francis wrote:
> The amdgpu plugin is added to the plugin inventory as the first
> vma map is dumped. But it's completely possible for a process to
> have driver files open but no vma maps. In this case, we still
> will need the plugin on restore.
>
> Add the plugin to the inventory whenever a device file is dumped.

This is very similar to the patch I submitted and you reviewed. Just so
happens I have to rebase since when I collected all r-b's the series
does not apply any longer. I don't mind hugely if you take authorship
but some sort of a acknowledgement tag would have been nice.

Regards,

Tvrtko
* Re: [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas
From: Francis, David @ 2026-04-10 16:15 UTC
To: Tvrtko Ursulin, criu@lists.linux.dev

> This is very similar to the patch I submitted and you reviewed. Just so
> happens I have to rebase since when I collected all r-b's the series
> does not apply any longer. I don't mind hugely if you take authorship
> but some sort of a acknowledgement tag would have been nice.

My apologies; I was working on this in parallel with reviewing your
patches and missed the similarity. Yours is fine, or I can add an
attribution tag.

Sorry again,
David
* [PATCH 2/3] plugin/amdgpu: Add topology dump file
From: David Francis @ 2026-04-10 14:45 UTC
To: criu; +Cc: tvrtko.ursulin, David Francis

The state of the source topology (the GPUs, CPUs, and links
between them) is saved by the plugin as part of kfd dump.

If there is no kfd dump, we need to save the topology anyways.

Do so in new file amdgpu-topology.img.

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
 plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
 plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
 plugins/amdgpu/criu-amdgpu.proto    |  5 ++
 4 files changed, 151 insertions(+), 11 deletions(-)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index 89ab10dac..1e9785440 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -91,6 +91,9 @@ int current_pid;
  */
 bool parallel_disabled = false;
 
+bool kfd_dump_complete = false;
+bool amdgpu_topology_dump_complete = false;
+
 pthread_t parallel_thread = 0;
 int parallel_thread_result = 0;
 /**************************************************************************************************/
@@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
 		devinfo->node_id = node->id;
 
 		if (NODE_IS_GPU(node)) {
-			devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
-			if (!devinfo->gpu_id)
-				continue;
+			if (maps) {
+				devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
+				if (!devinfo->gpu_id)
+					continue;
+			} else {
+				devinfo->gpu_id = node->gpu_id;
+			}
+
 
 			devinfo->simd_count = node->simd_count;
 			devinfo->mem_banks_count = node->mem_banks_count;
@@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
 			if (!iolink->valid)
 				continue;
 
-			list_for_each_entry(node2, &sys->nodes, listm_system)
-				if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
-					link_to_present_node = true;
+			if (maps) {
+				list_for_each_entry(node2, &sys->nodes, listm_system)
+					if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
+						link_to_present_node = true;
+			} else {
+				link_to_present_node = true;
+			}
 
 			if (!link_to_present_node)
 				continue;
@@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
 	maps_init(&checkpoint_maps);
 	maps_init(&restore_maps);
 
+	if (stage == CR_PLUGIN_STAGE__DUMP) {
+		kfd_dump_complete = false;
+		amdgpu_topology_dump_complete = false;
+	}
+
 	if (stage == CR_PLUGIN_STAGE__RESTORE) {
 		if (has_children(root_item)) {
 			pr_info("Parallel restore disabled\n");
@@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
 	if (ret)
 		goto exit;
 
+	kfd_dump_complete = true;
 	if (!plugin_added_to_inventory) {
 		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
 		if (ret) {
@@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
 
 	pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
 
+	if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
+		AmdgpuDevinfo *ad;
+
+		pr_info("No restore maps found, making them from topology file\n");
+
+		img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
+		if (!img_fp) {
+			pr_err("Failed to find either kfd or amdgpu src topology information\n");
+			ret = -EINVAL;
+			goto exit;
+		}
+
+		buf = xmalloc(img_size);
+		if (!buf) {
+			pr_err("Failed to allocate memory\n");
+			return -ENOMEM;
+		}
+
+		ret = read_fp(img_fp, buf, img_size);
+		if (ret) {
+			pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
+			ret = -EINVAL;
+			goto exit;
+		}
+
+		ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
+		if (ad == NULL) {
+			pr_perror("Unable to parse the amdgpu topology message\n");
+			fclose(img_fp);
+			ret = -EINVAL;
+			goto exit;
+		}
+		fclose(img_fp);
+
+		ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
+		if (ret) {
+			pr_err("Failed to convert amdgpu device information to topology\n");
+			ret = -EINVAL;
+			goto exit;
+		}
+
+		ret = topology_parse(&dest_topology, "Local");
+		if (ret) {
+			pr_err("Failed to parse local system topology\n");
+			goto exit;
+		}
+
+		ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
+		if (ret) {
+			pr_err("Failed to map GPUs\n");
+			goto exit;
+		}
+	}
+
 	target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
 	if (!target_gpu_id) {
 		fd = -ENODEV;
diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
index c1dfb2dd4..a4c650753 100644
--- a/plugins/amdgpu/amdgpu_plugin_drm.c
+++ b/plugins/amdgpu/amdgpu_plugin_drm.c
@@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
 		return -ENODEV;
 	}
 
-	/* Get the GPU_ID of the DRM device */
-	rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
-	if (!rd->gpu_id) {
-		pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
-		return -ENODEV;
+	if (kfd_dump_complete) {
+		/* Get the GPU_ID of the DRM device */
+		rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
+		if (!rd->gpu_id) {
+			pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
+			return -ENODEV;
+		}
+	} else {
+		rd->gpu_id = tp_node->gpu_id;
+
+		if (!amdgpu_topology_dump_complete) {
+			AmdgpuDevinfo *ad = NULL;
+			unsigned char *buf;
+
+			ad = xmalloc(sizeof(*ad));
+			amdgpu_devinfo__init(ad);
+
+			ad->num_of_devices = src_topology.num_nodes;
+
+			ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
+			if (!ad->device_entries) {
+				pr_err("Failed to allocate device_entries\n");
+				return -ENOMEM;
+			}
+
+			for (int i = 0; i < ad->num_of_devices; i++) {
+				KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
+
+				if (!entry) {
+					pr_err("Failed to allocate entry\n");
+					return -ENOMEM;
+				}
+
+				kfd_device_entry__init(entry);
+
+				ad->device_entries[i] = entry;
+				ad->n_device_entries++;
+			}
+
+			topology_to_devinfo(&src_topology, NULL, ad->device_entries);
+
+			len = amdgpu_devinfo__get_packed_size(ad);
+
+			buf = xmalloc(len);
+			if (!buf) {
+				pr_perror("Failed to allocate memory to store protobuf");
+				return -ENOMEM;
+			}
+
+			amdgpu_devinfo__pack(ad, buf);
+
+			ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
+			if (ret) {
+				pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
+				return -EINVAL;
+			}
+
+			amdgpu_topology_dump_complete = true;
+		}
 	}
 
 	len = criu_render_node__get_packed_size(rd);
diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
index 69b98a31c..ccfe30b49 100644
--- a/plugins/amdgpu/amdgpu_plugin_util.h
+++ b/plugins/amdgpu/amdgpu_plugin_util.h
@@ -2,6 +2,7 @@
 #define __AMDGPU_PLUGIN_UTIL_H__
 
 #include <libdrm/amdgpu.h>
+#include "criu-amdgpu.pb-c.h"
 
 #ifndef _GNU_SOURCE
 #define _GNU_SOURCE 1
@@ -59,6 +60,9 @@
 /* Name of file having serialized data of DRM device buffer objects (BOs) */
 #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
 
+/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not) */
+#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
+
 /* Helper macros to Checkpoint and Restore a ROCm file */
 #define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem"
 #define HSAKMT_SHM "/hsakmt_shared_mem"
@@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
 extern bool kfd_numa_check;
 extern bool kfd_capability_check;
 
+extern bool kfd_dump_complete;
+extern bool amdgpu_topology_dump_complete;
+
 int read_fp(FILE *fp, void *buf, const size_t buf_len);
 int write_fp(FILE *fp, const void *buf, const size_t buf_len);
 int read_file(const char *file_path, void *buf, const size_t buf_len);
@@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
 
 int serve_out_dmabuf_fd(int handle, int fd);
 
+int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
+
 #endif /* __AMDGPU_PLUGIN_UTIL_H__ */
diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
index 7682a8f21..6e44e22aa 100644
--- a/plugins/amdgpu/criu-amdgpu.proto
+++ b/plugins/amdgpu/criu-amdgpu.proto
@@ -93,3 +93,8 @@ message criu_render_node {
 message criu_dmabuf_node {
 	required uint32 gem_handle = 1;
 }
+
+message amdgpu_devinfo {
+	required uint32 num_of_devices = 1;
+	repeated kfd_device_entry device_entries = 2;
+}
\ No newline at end of file
-- 
2.34.1
* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
From: Tvrtko Ursulin @ 2026-04-16 14:54 UTC
To: David Francis, criu

On 10/04/2026 15:45, David Francis wrote:
> The state of the source topology (the GPUs, CPUs, and links
> between them) is saved by the plugin as part of kfd dump.
>
> If there is no kfd dump, we need to save the topology anyways.
>
> Do so in new file amdgpu-topology.img.

Ah neat, but I am guessing the data comes from kfd? At some point we
will need to bite the bullet and start designing uapi for an amdgpu only
world.

In the meantime, would you like me to review this in detail to perhaps
have it merged in the interim? Although in that case backward
compatibility gets more complicated.

Regards,

Tvrtko
* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
From: Francis, David @ 2026-04-16 18:26 UTC
To: Tvrtko Ursulin, criu@lists.linux.dev

> Ah neat, but I am guessing the data comes from kfd? At some point we
> will need to bite the bullet and start designing uapi for an amdgpu only
> world.

No, the data comes solely from sysfs in this case.

No urgency on merging this since it doesn't do much without more
patches but also unlikely to cause problems on its own - if kfd is
present, the plugin will continue to make the same dump it always did.

Review would be appreciated, thanks.
**deviceEntries); > + > #endif /* __AMDGPU_PLUGIN_UTIL_H__ */ > diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto > index 7682a8f21..6e44e22aa 100644 > --- a/plugins/amdgpu/criu-amdgpu.proto > +++ b/plugins/amdgpu/criu-amdgpu.proto > @@ -93,3 +93,8 @@ message criu_render_node { > message criu_dmabuf_node { > required uint32 gem_handle = 1; > } > + > +message amdgpu_devinfo { > + required uint32 num_of_devices = 1; > + repeated kfd_device_entry device_entries = 2; > +} > \ No newline at end of file ^ permalink raw reply [flat|nested] 15+ messages in thread
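The dump and restore paths in the patch above share one shape: pack the topology message into a freshly allocated buffer, write it to an image file, then on restore read the file back and unpack it, bailing out if unpacking returns NULL. A minimal self-contained sketch of that round trip follows; it uses a plain fixed-layout struct and memcpy as a stand-in for the protobuf-c `amdgpu_devinfo__pack()`/`amdgpu_devinfo__unpack()` calls, so the layout and names here are illustrative, not the plugin's actual wire format:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the packed AmdgpuDevinfo message (hypothetical layout). */
struct devinfo_img {
	uint32_t num_of_devices;
	uint32_t gpu_ids[4];
};

/* Dump side: serialize into a freshly allocated buffer, as the
 * protobuf-c pack call would. Returns NULL on allocation failure. */
static unsigned char *devinfo_pack(const struct devinfo_img *di, size_t *len)
{
	unsigned char *buf = malloc(sizeof(*di));

	if (!buf)
		return NULL;
	memcpy(buf, di, sizeof(*di));
	*len = sizeof(*di);
	return buf;
}

/* Restore side: parse the buffer back, returning NULL on a short or
 * corrupt image, mirroring the NULL check after
 * amdgpu_devinfo__unpack() in the patch. */
static struct devinfo_img *devinfo_unpack(const unsigned char *buf, size_t len)
{
	struct devinfo_img *di;

	if (len < sizeof(*di))
		return NULL;
	di = malloc(sizeof(*di));
	if (di)
		memcpy(di, buf, sizeof(*di));
	return di;
}
```

The length check plays the role of the NULL test after unpacking: a truncated image should fail the restore early rather than feed garbage into the topology matcher.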
* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
  2026-04-16 18:26 ` Francis, David
@ 2026-04-17  9:07 ` Tvrtko Ursulin
  2026-04-17 13:21 ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-17 9:07 UTC (permalink / raw)
To: Francis, David, criu@lists.linux.dev

On 16/04/2026 19:26, Francis, David wrote:
>> Ah neat, but I am guessing the data comes from kfd? At some point we
>> will need to bite the bullet and start designing uapi for an amdgpu only
>> world.
>
> No, the data comes solely from sysfs in this case.

But exported by kfd, no? I.e. /sys/class/kfd/kfd/topology/nodes/. I
understand the goal is to migrate away from kfd, which is why I wondered
if it makes sense to add an image based on both internal and external
kfd data.

> No urgency on merging this since it doesn't do much without more patches
> but also unlikely to cause problems on its own - if kfd is present, the plugin
> will continue to make the same dump it always did.
>
> Review would be appreciated, thanks.

In one aspect it is better than my hack of allowing restore from a
single-GPU to a single-GPU system, although even there the gpu id is not
stable; on the other hand it does add an image format which looks like a
dead end from a design perspective.

Hm, how does this patch handle gpu id changing across reboots?

Regards,

Tvrtko

> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Thursday, April 16, 2026 10:54 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
>
>
> On 10/04/2026 15:45, David Francis wrote:
>> The state of the source topology (the GPUs, CPUs, and links
>> between them) is saved by the plugin as part of kfd dump.
>>
>> If there is no kfd dump, we need to save the topology anyways.
>>
>> Do so in new file amdgpu-topology.img.
>
> Ah neat, but I am guessing the data comes from kfd?
At some point we > will need to bite the bullet and start designing uapi for an amdgpu only > world. > > In the meantime, would you like me to review this in detail to perhaps > have it merged in the interim? Although in that case backward > compatibility gets more complicated. > > Regards, > > Tvrtko > >> >> Signed-off-by: David Francis <David.Francis@amd.com> >> --- >> plugins/amdgpu/amdgpu_plugin.c | 84 ++++++++++++++++++++++++++--- >> plugins/amdgpu/amdgpu_plugin_drm.c | 64 ++++++++++++++++++++-- >> plugins/amdgpu/amdgpu_plugin_util.h | 9 ++++ >> plugins/amdgpu/criu-amdgpu.proto | 5 ++ >> 4 files changed, 151 insertions(+), 11 deletions(-) >> >> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c >> index 89ab10dac..1e9785440 100644 >> --- a/plugins/amdgpu/amdgpu_plugin.c >> +++ b/plugins/amdgpu/amdgpu_plugin.c >> @@ -91,6 +91,9 @@ int current_pid; >> */ >> bool parallel_disabled = false; >> >> +bool kfd_dump_complete = false; >> +bool amdgpu_topology_dump_complete = false; >> + >> pthread_t parallel_thread = 0; >> int parallel_thread_result = 0; >> /**************************************************************************************************/ >> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi >> devinfo->node_id = node->id; >> >> if (NODE_IS_GPU(node)) { >> - devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); >> - if (!devinfo->gpu_id) >> - continue; >> + if (maps) { >> + devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); >> + if (!devinfo->gpu_id) >> + continue; >> + } else { >> + devinfo->gpu_id = node->gpu_id; >> + } >> + >> >> devinfo->simd_count = node->simd_count; >> devinfo->mem_banks_count = node->mem_banks_count; >> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi >> if (!iolink->valid) >> continue; >> >> - list_for_each_entry(node2, &sys->nodes, listm_system) >> - if (node2->id == iolink->node_to_id && 
maps_get_dest_gpu(maps, node2->gpu_id) != 0) >> - link_to_present_node = true; >> + if (maps) { >> + list_for_each_entry(node2, &sys->nodes, listm_system) >> + if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0) >> + link_to_present_node = true; >> + } else { >> + link_to_present_node = true; >> + } >> >> if (!link_to_present_node) >> continue; >> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage) >> maps_init(&checkpoint_maps); >> maps_init(&restore_maps); >> >> + if (stage == CR_PLUGIN_STAGE__DUMP) { >> + kfd_dump_complete = false; >> + amdgpu_topology_dump_complete = false; >> + } >> + >> if (stage == CR_PLUGIN_STAGE__RESTORE) { >> if (has_children(root_item)) { >> pr_info("Parallel restore disabled\n"); >> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id) >> if (ret) >> goto exit; >> >> + kfd_dump_complete = true; >> if (!plugin_added_to_inventory) { >> ret = add_inventory_plugin(CR_PLUGIN_DESC.name); >> if (ret) { >> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed) >> >> pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id); >> >> + if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) { >> + AmdgpuDevinfo *ad; >> + >> + pr_info("No restore maps found, making them from topology file\n"); >> + >> + img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true); >> + if (!img_fp) { >> + pr_err("Failed to find either kfd or amdgpu src topology information\n"); >> + ret = -EINVAL; >> + goto exit; >> + } >> + >> + buf = xmalloc(img_size); >> + if (!buf) { >> + pr_err("Failed to allocate memory\n"); >> + return -ENOMEM; >> + } >> + >> + ret = read_fp(img_fp, buf, img_size); >> + if (ret) { >> + pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE); >> + ret = -EINVAL; >> + goto exit; >> + } >> + >> + ad = amdgpu_devinfo__unpack(NULL, img_size, buf); >> + if (rd == NULL) { >> + pr_perror("Unable to parse the amdgpu topology message\n"); >> + 
fclose(img_fp); >> + ret = -EINVAL; >> + goto exit; >> + } >> + fclose(img_fp); >> + >> + ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology); >> + if (ret) { >> + pr_err("Failed to convert amdgpu device information to topology\n"); >> + ret = -EINVAL; >> + goto exit; >> + } >> + >> + ret = topology_parse(&dest_topology, "Local"); >> + if (ret) { >> + pr_err("Failed to parse local system topology\n"); >> + goto exit; >> + } >> + >> + ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps); >> + if (ret) { >> + pr_err("Failed to map GPUs\n"); >> + goto exit; >> + } >> + } >> + >> target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id); >> if (!target_gpu_id) { >> fd = -ENODEV; >> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c >> index c1dfb2dd4..a4c650753 100644 >> --- a/plugins/amdgpu/amdgpu_plugin_drm.c >> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c >> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm) >> return -ENODEV; >> } >> >> - /* Get the GPU_ID of the DRM device */ >> - rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); >> - if (!rd->gpu_id) { >> - pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); >> - return -ENODEV; >> + if (kfd_dump_complete) { >> + /* Get the GPU_ID of the DRM device */ >> + rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); >> + if (!rd->gpu_id) { >> + pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); >> + return -ENODEV; >> + } >> + } else { >> + rd->gpu_id = tp_node->gpu_id; >> + >> + if (!amdgpu_topology_dump_complete) { >> + AmdgpuDevinfo *ad = NULL; >> + unsigned char *buf; >> + >> + ad = xmalloc(sizeof(*ad)); >> + amdgpu_devinfo__init(ad); >> + >> + ad->num_of_devices = src_topology.num_nodes; >> + >> + ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices); >> + if (!ad->device_entries) { >> + pr_err("Failed to 
allocate device_entries\n"); >> + return -ENOMEM; >> + } >> + >> + for (int i = 0; i < ad->num_of_devices; i++) { >> + KfdDeviceEntry *entry = xzalloc(sizeof(*entry)); >> + >> + if (!entry) { >> + pr_err("Failed to allocate entry\n"); >> + return -ENOMEM; >> + } >> + >> + kfd_device_entry__init(entry); >> + >> + ad->device_entries[i] = entry; >> + ad->n_device_entries++; >> + } >> + >> + topology_to_devinfo(&src_topology, NULL, ad->device_entries); >> + >> + len = amdgpu_devinfo__get_packed_size(ad); >> + >> + buf = xmalloc(len); >> + if (!buf) { >> + pr_perror("Failed to allocate memory to store protobuf"); >> + return -ENOMEM; >> + } >> + >> + amdgpu_devinfo__pack(ad, buf); >> + >> + ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len); >> + if (ret) { >> + pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE); >> + return -EINVAL; >> + } >> + >> + amdgpu_topology_dump_complete = true; >> + } >> } >> >> len = criu_render_node__get_packed_size(rd); >> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h >> index 69b98a31c..ccfe30b49 100644 >> --- a/plugins/amdgpu/amdgpu_plugin_util.h >> +++ b/plugins/amdgpu/amdgpu_plugin_util.h >> @@ -2,6 +2,7 @@ >> #define __AMDGPU_PLUGIN_UTIL_H__ >> >> #include <libdrm/amdgpu.h> >> +#include "criu-amdgpu.pb-c.h" >> >> #ifndef _GNU_SOURCE >> #define _GNU_SOURCE 1 >> @@ -59,6 +60,9 @@ >> /* Name of file having serialized data of DRM device buffer objects (BOs) */ >> #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img" >> >> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not)*/ >> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img" >> + >> /* Helper macros to Checkpoint and Restore a ROCm file */ >> #define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem" >> #define HSAKMT_SHM "/hsakmt_shared_mem" >> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check; >> extern bool kfd_numa_check; >> extern bool kfd_capability_check; >> >> 
+extern bool kfd_dump_complete; >> +extern bool amdgpu_topology_dump_complete; >> + >> int read_fp(FILE *fp, void *buf, const size_t buf_len); >> int write_fp(FILE *fp, const void *buf, const size_t buf_len); >> int read_file(const char *file_path, void *buf, const size_t buf_len); >> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp, >> >> int serve_out_dmabuf_fd(int handle, int fd); >> >> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries); >> + >> #endif /* __AMDGPU_PLUGIN_UTIL_H__ */ >> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto >> index 7682a8f21..6e44e22aa 100644 >> --- a/plugins/amdgpu/criu-amdgpu.proto >> +++ b/plugins/amdgpu/criu-amdgpu.proto >> @@ -93,3 +93,8 @@ message criu_render_node { >> message criu_dmabuf_node { >> required uint32 gem_handle = 1; >> } >> + >> +message amdgpu_devinfo { >> + required uint32 num_of_devices = 1; >> + repeated kfd_device_entry device_entries = 2; >> +} >> \ No newline at end of file > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file 2026-04-17 9:07 ` Tvrtko Ursulin @ 2026-04-17 13:21 ` Francis, David 2026-04-22 13:58 ` Tvrtko Ursulin 0 siblings, 1 reply; 15+ messages in thread From: Francis, David @ 2026-04-17 13:21 UTC (permalink / raw) To: Tvrtko Ursulin, criu@lists.linux.dev >>> Ah neat, but I am guessing the data comes from kfd? At some point we >>> will need to bite the bullet and start designing uapi for an amdgpu only >>> world. >> >> No, the data comes solely from sysfs in this case. > > But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I > understand the goal is to migrate away from kfd which is why I wondered > if it makes sense to add an image based of both internal and external > kfd data. I think current plans are to deprecate the kfd device file first - /sys/class/kfd will continue to exist for a good while longer. > In one aspect it is better than my hack of allow restore from a single > to a single GPU system, although even there the gpu id is not stable, > but on the other hand it does add an image format which looks like a > dead end from a design persective. > > Hm, how does this patch handle gpu id changing across reboots? It does the whole amdgpu_plugin_topology matching thing to find a mapping between the old GPUs and the new ones. The renderD minor numbers of the new GPUs determine which files are opened to replace the open renderD handles. It doesn't really matter what the gpu ids are at that point. To target a specific GPU in amdgpu, you don't need gpu id, you just point the operation at the right renderD file. ________________________________________ From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Sent: Friday, April 17, 2026 5:07 AM To: Francis, David; criu@lists.linux.dev Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file On 16/04/2026 19:26, Francis, David wrote: >> Ah neat, but I am guessing the data comes from kfd? 
At some point we >> will need to bite the bullet and start designing uapi for an amdgpu only >> world. > > No, the data comes solely from sysfs in this case. But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I understand the goal is to migrate away from kfd which is why I wondered if it makes sense to add an image based of both internal and external kfd data. > No urgency on merging this since it doesn't do much without more patches > but also unlikely to cause problems on its own - if kfd is present, the plugin > will continue to make the same dump it always did. > > Review would be appreciated, thanks. In one aspect it is better than my hack of allow restore from a single to a single GPU system, although even there the gpu id is not stable, but on the other hand it does add an image format which looks like a dead end from a design persective. Hm, how does this patch handle gpu id changing across reboots? Regards, Tvrtko > ________________________________________ > From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> > Sent: Thursday, April 16, 2026 10:54 AM > To: Francis, David; criu@lists.linux.dev > Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file > > > On 10/04/2026 15:45, David Francis wrote: >> The state of the source topology (the GPUs, CPUs, and links >> between them) is saved by the plugin as part of kfd dump. >> >> If there is no kfd dump, we need to save the topology anyways. >> >> Do so in new file amdgpu-topology.img. > > Ah neat, but I am guessing the data comes from kfd? At some point we > will need to bite the bullet and start designing uapi for an amdgpu only > world. > > In the meantime, would you like me to review this in detail to perhaps > have it merged in the interim? Although in that case backward > compatibility gets more complicated. 
> > Regards, > > Tvrtko > >> >> Signed-off-by: David Francis <David.Francis@amd.com> >> --- >> plugins/amdgpu/amdgpu_plugin.c | 84 ++++++++++++++++++++++++++--- >> plugins/amdgpu/amdgpu_plugin_drm.c | 64 ++++++++++++++++++++-- >> plugins/amdgpu/amdgpu_plugin_util.h | 9 ++++ >> plugins/amdgpu/criu-amdgpu.proto | 5 ++ >> 4 files changed, 151 insertions(+), 11 deletions(-) >> >> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c >> index 89ab10dac..1e9785440 100644 >> --- a/plugins/amdgpu/amdgpu_plugin.c >> +++ b/plugins/amdgpu/amdgpu_plugin.c >> @@ -91,6 +91,9 @@ int current_pid; >> */ >> bool parallel_disabled = false; >> >> +bool kfd_dump_complete = false; >> +bool amdgpu_topology_dump_complete = false; >> + >> pthread_t parallel_thread = 0; >> int parallel_thread_result = 0; >> /**************************************************************************************************/ >> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi >> devinfo->node_id = node->id; >> >> if (NODE_IS_GPU(node)) { >> - devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); >> - if (!devinfo->gpu_id) >> - continue; >> + if (maps) { >> + devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); >> + if (!devinfo->gpu_id) >> + continue; >> + } else { >> + devinfo->gpu_id = node->gpu_id; >> + } >> + >> >> devinfo->simd_count = node->simd_count; >> devinfo->mem_banks_count = node->mem_banks_count; >> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi >> if (!iolink->valid) >> continue; >> >> - list_for_each_entry(node2, &sys->nodes, listm_system) >> - if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0) >> - link_to_present_node = true; >> + if (maps) { >> + list_for_each_entry(node2, &sys->nodes, listm_system) >> + if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0) >> + link_to_present_node = true; >> 
+ } else { >> + link_to_present_node = true; >> + } >> >> if (!link_to_present_node) >> continue; >> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage) >> maps_init(&checkpoint_maps); >> maps_init(&restore_maps); >> >> + if (stage == CR_PLUGIN_STAGE__DUMP) { >> + kfd_dump_complete = false; >> + amdgpu_topology_dump_complete = false; >> + } >> + >> if (stage == CR_PLUGIN_STAGE__RESTORE) { >> if (has_children(root_item)) { >> pr_info("Parallel restore disabled\n"); >> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id) >> if (ret) >> goto exit; >> >> + kfd_dump_complete = true; >> if (!plugin_added_to_inventory) { >> ret = add_inventory_plugin(CR_PLUGIN_DESC.name); >> if (ret) { >> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed) >> >> pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id); >> >> + if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) { >> + AmdgpuDevinfo *ad; >> + >> + pr_info("No restore maps found, making them from topology file\n"); >> + >> + img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true); >> + if (!img_fp) { >> + pr_err("Failed to find either kfd or amdgpu src topology information\n"); >> + ret = -EINVAL; >> + goto exit; >> + } >> + >> + buf = xmalloc(img_size); >> + if (!buf) { >> + pr_err("Failed to allocate memory\n"); >> + return -ENOMEM; >> + } >> + >> + ret = read_fp(img_fp, buf, img_size); >> + if (ret) { >> + pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE); >> + ret = -EINVAL; >> + goto exit; >> + } >> + >> + ad = amdgpu_devinfo__unpack(NULL, img_size, buf); >> + if (rd == NULL) { >> + pr_perror("Unable to parse the amdgpu topology message\n"); >> + fclose(img_fp); >> + ret = -EINVAL; >> + goto exit; >> + } >> + fclose(img_fp); >> + >> + ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology); >> + if (ret) { >> + pr_err("Failed to convert amdgpu device information to topology\n"); >> + ret = -EINVAL; 
>> + goto exit; >> + } >> + >> + ret = topology_parse(&dest_topology, "Local"); >> + if (ret) { >> + pr_err("Failed to parse local system topology\n"); >> + goto exit; >> + } >> + >> + ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps); >> + if (ret) { >> + pr_err("Failed to map GPUs\n"); >> + goto exit; >> + } >> + } >> + >> target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id); >> if (!target_gpu_id) { >> fd = -ENODEV; >> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c >> index c1dfb2dd4..a4c650753 100644 >> --- a/plugins/amdgpu/amdgpu_plugin_drm.c >> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c >> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm) >> return -ENODEV; >> } >> >> - /* Get the GPU_ID of the DRM device */ >> - rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); >> - if (!rd->gpu_id) { >> - pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); >> - return -ENODEV; >> + if (kfd_dump_complete) { >> + /* Get the GPU_ID of the DRM device */ >> + rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); >> + if (!rd->gpu_id) { >> + pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); >> + return -ENODEV; >> + } >> + } else { >> + rd->gpu_id = tp_node->gpu_id; >> + >> + if (!amdgpu_topology_dump_complete) { >> + AmdgpuDevinfo *ad = NULL; >> + unsigned char *buf; >> + >> + ad = xmalloc(sizeof(*ad)); >> + amdgpu_devinfo__init(ad); >> + >> + ad->num_of_devices = src_topology.num_nodes; >> + >> + ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices); >> + if (!ad->device_entries) { >> + pr_err("Failed to allocate device_entries\n"); >> + return -ENOMEM; >> + } >> + >> + for (int i = 0; i < ad->num_of_devices; i++) { >> + KfdDeviceEntry *entry = xzalloc(sizeof(*entry)); >> + >> + if (!entry) { >> + pr_err("Failed to allocate entry\n"); >> + return -ENOMEM; >> + } >> + >> + 
kfd_device_entry__init(entry); >> + >> + ad->device_entries[i] = entry; >> + ad->n_device_entries++; >> + } >> + >> + topology_to_devinfo(&src_topology, NULL, ad->device_entries); >> + >> + len = amdgpu_devinfo__get_packed_size(ad); >> + >> + buf = xmalloc(len); >> + if (!buf) { >> + pr_perror("Failed to allocate memory to store protobuf"); >> + return -ENOMEM; >> + } >> + >> + amdgpu_devinfo__pack(ad, buf); >> + >> + ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len); >> + if (ret) { >> + pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE); >> + return -EINVAL; >> + } >> + >> + amdgpu_topology_dump_complete = true; >> + } >> } >> >> len = criu_render_node__get_packed_size(rd); >> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h >> index 69b98a31c..ccfe30b49 100644 >> --- a/plugins/amdgpu/amdgpu_plugin_util.h >> +++ b/plugins/amdgpu/amdgpu_plugin_util.h >> @@ -2,6 +2,7 @@ >> #define __AMDGPU_PLUGIN_UTIL_H__ >> >> #include <libdrm/amdgpu.h> >> +#include "criu-amdgpu.pb-c.h" >> >> #ifndef _GNU_SOURCE >> #define _GNU_SOURCE 1 >> @@ -59,6 +60,9 @@ >> /* Name of file having serialized data of DRM device buffer objects (BOs) */ >> #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img" >> >> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not)*/ >> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img" >> + >> /* Helper macros to Checkpoint and Restore a ROCm file */ >> #define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem" >> #define HSAKMT_SHM "/hsakmt_shared_mem" >> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check; >> extern bool kfd_numa_check; >> extern bool kfd_capability_check; >> >> +extern bool kfd_dump_complete; >> +extern bool amdgpu_topology_dump_complete; >> + >> int read_fp(FILE *fp, void *buf, const size_t buf_len); >> int write_fp(FILE *fp, const void *buf, const size_t buf_len); >> int read_file(const char *file_path, void *buf, const size_t 
buf_len); >> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp, >> >> int serve_out_dmabuf_fd(int handle, int fd); >> >> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries); >> + >> #endif /* __AMDGPU_PLUGIN_UTIL_H__ */ >> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto >> index 7682a8f21..6e44e22aa 100644 >> --- a/plugins/amdgpu/criu-amdgpu.proto >> +++ b/plugins/amdgpu/criu-amdgpu.proto >> @@ -93,3 +93,8 @@ message criu_render_node { >> message criu_dmabuf_node { >> required uint32 gem_handle = 1; >> } >> + >> +message amdgpu_devinfo { >> + required uint32 num_of_devices = 1; >> + repeated kfd_device_entry device_entries = 2; >> +} >> \ No newline at end of file > ^ permalink raw reply [flat|nested] 15+ messages in thread
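Francis's explanation above — topology matching yields a source-to-destination gpu_id mapping, with `maps_get_dest_gpu()` returning 0 for an unmapped device (hence the `if (!rd->gpu_id)` and `if (!target_gpu_id)` checks in the patch), while the renderD minor of the matched destination GPU picks the actual device file to open — can be sketched as a simple lookup. The struct layout and field names below are illustrative, not the plugin's real `device_maps`:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative pairing of a checkpoint-time gpu_id with the gpu_id
 * (and renderD minor) found on the restore system. */
struct gpu_map_entry {
	uint32_t src_gpu_id;
	uint32_t dest_gpu_id;
	int dest_render_minor; /* which /dev/dri/renderD<minor> to open */
};

/* Return the destination gpu_id for a source id, or 0 when topology
 * matching produced no mapping -- the same "0 means missing"
 * convention the plugin's maps_get_dest_gpu() callers rely on. */
static uint32_t map_dest_gpu(const struct gpu_map_entry *map, size_t n,
			     uint32_t src_gpu_id)
{
	for (size_t i = 0; i < n; i++)
		if (map[i].src_gpu_id == src_gpu_id)
			return map[i].dest_gpu_id;
	return 0;
}
```

Carrying the renderD minor alongside the mapping is what lets restore open the right /dev/dri/renderD* node without comparing gpu ids again afterwards, as Francis describes.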
* Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file 2026-04-17 13:21 ` Francis, David @ 2026-04-22 13:58 ` Tvrtko Ursulin 0 siblings, 0 replies; 15+ messages in thread From: Tvrtko Ursulin @ 2026-04-22 13:58 UTC (permalink / raw) To: Francis, David, criu@lists.linux.dev On 17/04/2026 14:21, Francis, David wrote: >>>> Ah neat, but I am guessing the data comes from kfd? At some point we >>>> will need to bite the bullet and start designing uapi for an amdgpu only >>>> world. >>> >>> No, the data comes solely from sysfs in this case. >> >> But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I >> understand the goal is to migrate away from kfd which is why I wondered >> if it makes sense to add an image based of both internal and external >> kfd data. > > I think current plans are to deprecate the kfd device file first - > /sys/class/kfd will continue to exist for a good while longer. Exported by the kfd driver, or you mean the plan is to export them from amdgpu to achieve uapi compatibility in a way? Regards, Tvrtko >> In one aspect it is better than my hack of allow restore from a single >> to a single GPU system, although even there the gpu id is not stable, >> but on the other hand it does add an image format which looks like a >> dead end from a design persective. >> >> Hm, how does this patch handle gpu id changing across reboots? > > It does the whole amdgpu_plugin_topology matching thing to find a > mapping between the old GPUs and the new ones. > > The renderD minor numbers of the new GPUs determine which files are > opened to replace the open renderD handles. It doesn't really matter > what the gpu ids are at that point. To target a specific GPU in amdgpu, > you don't need gpu id, you just point the operation at the right renderD > file. 
> >
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 17, 2026 5:07 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
>
>
> On 16/04/2026 19:26, Francis, David wrote:
>>> Ah neat, but I am guessing the data comes from kfd? At some point we
>>> will need to bite the bullet and start designing uapi for an amdgpu only
>>> world.
>>
>> No, the data comes solely from sysfs in this case.
>
> But exported by kfd, no? Ie. /sys/class/kfd/kfd/topology/nodes/. I
> understand the goal is to migrate away from kfd which is why I wondered
> if it makes sense to add an image based of both internal and external
> kfd data.
>
>> No urgency on merging this since it doesn't do much without more patches
>> but also unlikely to cause problems on its own - if kfd is present, the plugin
>> will continue to make the same dump it always did.
>>
>> Review would be appreciated, thanks.
>
> In one aspect it is better than my hack of allow restore from a single
> to a single GPU system, although even there the gpu id is not stable,
> but on the other hand it does add an image format which looks like a
> dead end from a design persective.
>
> Hm, how does this patch handle gpu id changing across reboots?
>
> Regards,
>
> Tvrtko
>
>> ________________________________________
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Sent: Thursday, April 16, 2026 10:54 AM
>> To: Francis, David; criu@lists.linux.dev
>> Subject: Re: [PATCH 2/3] plugin/amdgpu: Add topology dump file
>>
>>
>> On 10/04/2026 15:45, David Francis wrote:
>>> The state of the source topology (the GPUs, CPUs, and links
>>> between them) is saved by the plugin as part of kfd dump.
>>>
>>> If there is no kfd dump, we need to save the topology anyways.
>>>
>>> Do so in new file amdgpu-topology.img.
>>
>> Ah neat, but I am guessing the data comes from kfd? At some point we
>> will need to bite the bullet and start designing uapi for an amdgpu only
>> world.
>>
>> In the meantime, would you like me to review this in detail to perhaps
>> have it merged in the interim? Although in that case backward
>> compatibility gets more complicated.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Signed-off-by: David Francis <David.Francis@amd.com>
>>> ---
>>>   plugins/amdgpu/amdgpu_plugin.c      | 84 ++++++++++++++++++++++++++---
>>>   plugins/amdgpu/amdgpu_plugin_drm.c  | 64 ++++++++++++++++++++--
>>>   plugins/amdgpu/amdgpu_plugin_util.h |  9 ++++
>>>   plugins/amdgpu/criu-amdgpu.proto    |  5 ++
>>>   4 files changed, 151 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
>>> index 89ab10dac..1e9785440 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin.c
>>> +++ b/plugins/amdgpu/amdgpu_plugin.c
>>> @@ -91,6 +91,9 @@ int current_pid;
>>>    */
>>>   bool parallel_disabled = false;
>>>
>>> +bool kfd_dump_complete = false;
>>> +bool amdgpu_topology_dump_complete = false;
>>> +
>>>   pthread_t parallel_thread = 0;
>>>   int parallel_thread_result = 0;
>>>   /**************************************************************************************************/
>>> @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>>   		devinfo->node_id = node->id;
>>>
>>>   		if (NODE_IS_GPU(node)) {
>>> -			devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>>> -			if (!devinfo->gpu_id)
>>> -				continue;
>>> +			if (maps) {
>>> +				devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id);
>>> +				if (!devinfo->gpu_id)
>>> +					continue;
>>> +			} else {
>>> +				devinfo->gpu_id = node->gpu_id;
>>> +			}
>>> +
>>>
>>>   			devinfo->simd_count = node->simd_count;
>>>   			devinfo->mem_banks_count = node->mem_banks_count;
>>> @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi
>>>   			if (!iolink->valid)
>>>   				continue;
>>>
>>> -			list_for_each_entry(node2, &sys->nodes, listm_system)
>>> -				if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>>> -					link_to_present_node = true;
>>> +			if (maps) {
>>> +				list_for_each_entry(node2, &sys->nodes, listm_system)
>>> +					if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0)
>>> +						link_to_present_node = true;
>>> +			} else {
>>> +				link_to_present_node = true;
>>> +			}
>>>
>>>   			if (!link_to_present_node)
>>>   				continue;
>>> @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage)
>>>   	maps_init(&checkpoint_maps);
>>>   	maps_init(&restore_maps);
>>>
>>> +	if (stage == CR_PLUGIN_STAGE__DUMP) {
>>> +		kfd_dump_complete = false;
>>> +		amdgpu_topology_dump_complete = false;
>>> +	}
>>> +
>>>   	if (stage == CR_PLUGIN_STAGE__RESTORE) {
>>>   		if (has_children(root_item)) {
>>>   			pr_info("Parallel restore disabled\n");
>>> @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id)
>>>   	if (ret)
>>>   		goto exit;
>>>
>>> +	kfd_dump_complete = true;
>>>   	if (!plugin_added_to_inventory) {
>>>   		ret = add_inventory_plugin(CR_PLUGIN_DESC.name);
>>>   		if (ret) {
>>> @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
>>>
>>>   	pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
>>>
>>> +	if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
>>> +		AmdgpuDevinfo *ad;
>>> +
>>> +		pr_info("No restore maps found, making them from topology file\n");
>>> +
>>> +		img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true);
>>> +		if (!img_fp) {
>>> +			pr_err("Failed to find either kfd or amdgpu src topology information\n");
>>> +			ret = -EINVAL;
>>> +			goto exit;
>>> +		}
>>> +
>>> +		buf = xmalloc(img_size);
>>> +		if (!buf) {
>>> +			pr_err("Failed to allocate memory\n");
>>> +			return -ENOMEM;
>>> +		}
>>> +
>>> +		ret = read_fp(img_fp, buf, img_size);
>>> +		if (ret) {
>>> +			pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>>> +			ret = -EINVAL;
>>> +			goto exit;
>>> +		}
>>> +
>>> +		ad = amdgpu_devinfo__unpack(NULL, img_size, buf);
>>> +		if (rd == NULL) {
>>> +			pr_perror("Unable to parse the amdgpu topology message\n");
>>> +			fclose(img_fp);
>>> +			ret = -EINVAL;
>>> +			goto exit;
>>> +		}
>>> +		fclose(img_fp);
>>> +
>>> +		ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology);
>>> +		if (ret) {
>>> +			pr_err("Failed to convert amdgpu device information to topology\n");
>>> +			ret = -EINVAL;
>>> +			goto exit;
>>> +		}
>>> +
>>> +		ret = topology_parse(&dest_topology, "Local");
>>> +		if (ret) {
>>> +			pr_err("Failed to parse local system topology\n");
>>> +			goto exit;
>>> +		}
>>> +
>>> +		ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps);
>>> +		if (ret) {
>>> +			pr_err("Failed to map GPUs\n");
>>> +			goto exit;
>>> +		}
>>> +	}
>>> +
>>>   	target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id);
>>>   	if (!target_gpu_id) {
>>>   		fd = -ENODEV;
>>> diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c
>>> index c1dfb2dd4..a4c650753 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin_drm.c
>>> +++ b/plugins/amdgpu/amdgpu_plugin_drm.c
>>> @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm)
>>>   		return -ENODEV;
>>>   	}
>>>
>>> -	/* Get the GPU_ID of the DRM device */
>>> -	rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>>> -	if (!rd->gpu_id) {
>>> -		pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>>> -		return -ENODEV;
>>> +	if (kfd_dump_complete) {
>>> +		/* Get the GPU_ID of the DRM device */
>>> +		rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id);
>>> +		if (!rd->gpu_id) {
>>> +			pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id);
>>> +			return -ENODEV;
>>> +		}
>>> +	} else {
>>> +		rd->gpu_id = tp_node->gpu_id;
>>> +
>>> +		if (!amdgpu_topology_dump_complete) {
>>> +			AmdgpuDevinfo *ad = NULL;
>>> +			unsigned char *buf;
>>> +
>>> +			ad = xmalloc(sizeof(*ad));
>>> +			amdgpu_devinfo__init(ad);
>>> +
>>> +			ad->num_of_devices = src_topology.num_nodes;
>>> +
>>> +			ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices);
>>> +			if (!ad->device_entries) {
>>> +				pr_err("Failed to allocate device_entries\n");
>>> +				return -ENOMEM;
>>> +			}
>>> +
>>> +			for (int i = 0; i < ad->num_of_devices; i++) {
>>> +				KfdDeviceEntry *entry = xzalloc(sizeof(*entry));
>>> +
>>> +				if (!entry) {
>>> +					pr_err("Failed to allocate entry\n");
>>> +					return -ENOMEM;
>>> +				}
>>> +
>>> +				kfd_device_entry__init(entry);
>>> +
>>> +				ad->device_entries[i] = entry;
>>> +				ad->n_device_entries++;
>>> +			}
>>> +
>>> +			topology_to_devinfo(&src_topology, NULL, ad->device_entries);
>>> +
>>> +			len = amdgpu_devinfo__get_packed_size(ad);
>>> +
>>> +			buf = xmalloc(len);
>>> +			if (!buf) {
>>> +				pr_perror("Failed to allocate memory to store protobuf");
>>> +				return -ENOMEM;
>>> +			}
>>> +
>>> +			amdgpu_devinfo__pack(ad, buf);
>>> +
>>> +			ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len);
>>> +			if (ret) {
>>> +				pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE);
>>> +				return -EINVAL;
>>> +			}
>>> +
>>> +			amdgpu_topology_dump_complete = true;
>>> +		}
>>>   	}
>>>
>>>   	len = criu_render_node__get_packed_size(rd);
>>> diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h
>>> index 69b98a31c..ccfe30b49 100644
>>> --- a/plugins/amdgpu/amdgpu_plugin_util.h
>>> +++ b/plugins/amdgpu/amdgpu_plugin_util.h
>>> @@ -2,6 +2,7 @@
>>>   #define __AMDGPU_PLUGIN_UTIL_H__
>>>
>>>   #include <libdrm/amdgpu.h>
>>> +#include "criu-amdgpu.pb-c.h"
>>>
>>>   #ifndef _GNU_SOURCE
>>>   #define _GNU_SOURCE 1
>>> @@ -59,6 +60,9 @@
>>>   /* Name of file having serialized data of DRM device buffer objects (BOs) */
>>>   #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img"
>>>
>>> +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not)*/
>>> +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img"
>>> +
>>>   /* Helper macros to Checkpoint and Restore a ROCm file */
>>>   #define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem"
>>>   #define HSAKMT_SHM "/hsakmt_shared_mem"
>>> @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check;
>>>   extern bool kfd_numa_check;
>>>   extern bool kfd_capability_check;
>>>
>>> +extern bool kfd_dump_complete;
>>> +extern bool amdgpu_topology_dump_complete;
>>> +
>>>   int read_fp(FILE *fp, void *buf, const size_t buf_len);
>>>   int write_fp(FILE *fp, const void *buf, const size_t buf_len);
>>>   int read_file(const char *file_path, void *buf, const size_t buf_len);
>>> @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t size, FILE *storage_fp,
>>>
>>>   int serve_out_dmabuf_fd(int handle, int fd);
>>>
>>> +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries);
>>> +
>>>   #endif /* __AMDGPU_PLUGIN_UTIL_H__ */
>>> diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto
>>> index 7682a8f21..6e44e22aa 100644
>>> --- a/plugins/amdgpu/criu-amdgpu.proto
>>> +++ b/plugins/amdgpu/criu-amdgpu.proto
>>> @@ -93,3 +93,8 @@ message criu_render_node {
>>>   message criu_dmabuf_node {
>>>   	required uint32 gem_handle = 1;
>>>   }
>>> +
>>> +message amdgpu_devinfo {
>>> +	required uint32 num_of_devices = 1;
>>> +	repeated kfd_device_entry device_entries = 2;
>>> +}
>>> \ No newline at end of file
>>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
  2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
@ 2026-04-10 14:45 ` David Francis
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
  3 siblings, 0 replies; 15+ messages in thread
From: David Francis @ 2026-04-10 14:45 UTC (permalink / raw)
  To: criu; +Cc: tvrtko.ursulin, David Francis

The amdgpu plugin needs to know which high fds are safe to use
to store drm fds during the restore process. This is kept in
fd_next, which is set as part of kfd restore.

If there is no kfd restore, set fd_next anyway.

Use the current pid as the pid to check; by this point in the
restore the pid has been changed to the correct value.

Signed-off-by: David Francis <David.Francis@amd.com>
---
 plugins/amdgpu/amdgpu_plugin.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c
index 1e9785440..5dcf80363 100644
--- a/plugins/amdgpu/amdgpu_plugin.c
+++ b/plugins/amdgpu/amdgpu_plugin.c
@@ -1926,6 +1926,17 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed)
 
 	pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id);
 
+	if (fd_next == -1) {
+		current_pid = getpid();
+		fd_next = find_unused_fd_pid(getpid());
+		if (fd_next <= 0) {
+			pr_err("Failed to find unused fd (fd:%d)\n", fd_next);
+			ret = -EINVAL;
+			goto exit;
+		}
+		pr_info("High fd for pid %d set to %d\n", current_pid, fd_next);
+	}
+
 	if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) {
 		AmdgpuDevinfo *ad;
 
--
2.34.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
                   ` (2 preceding siblings ...)
  2026-04-10 14:45 ` [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd David Francis
@ 2026-04-10 15:42 ` Tvrtko Ursulin
  2026-04-10 15:46   ` Francis, David
  3 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-10 15:42 UTC (permalink / raw)
  To: David Francis, criu

On 10/04/2026 15:45, David Francis wrote:
> These patches allow processes that have renderD device files open
> but not kfd device files to dump / restore.

The series allows the first/minimal test case from my IGT to pass?

Regards,

Tvrtko

>
> Mostly a proof-of-concept / baseline for future work since there's
> currently no way for such a process to dump / restore its queues.
>
> David Francis (3):
>   plugin/amdgpu: Add plugin to inventory even if there are no vmas
>   plugin/amdgpu: Add topology dump file
>   plugins/amdgpu: Make next_fd without kfd
>
>  plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>  plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>  plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>  plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>  4 files changed, 180 insertions(+), 11 deletions(-)
>

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
@ 2026-04-10 15:46   ` Francis, David
  2026-04-10 16:04     ` Tvrtko Ursulin
  0 siblings, 1 reply; 15+ messages in thread
From: Francis, David @ 2026-04-10 15:46 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev

Haven't tested with having a BO created / mapped, but since that code
should already be in from the dmabuf IPC work, it should work.

That's what I'm testing next.

David.

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 10, 2026 11:42 AM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.

On 10/04/2026 15:45, David Francis wrote:
> These patches allow processes that have renderD device files open
> but not kfd device files to dump / restore.

The series allows the first/minimal test case from my IGT to pass?

Regards,

Tvrtko

>
> Mostly a proof-of-concept / baseline for future work since there's
> currently no way for such a process to dump / restore its queues.
>
> David Francis (3):
>   plugin/amdgpu: Add plugin to inventory even if there are no vmas
>   plugin/amdgpu: Add topology dump file
>   plugins/amdgpu: Make next_fd without kfd
>
>  plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>  plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>  plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>  plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>  4 files changed, 180 insertions(+), 11 deletions(-)
>

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 15:46   ` Francis, David
@ 2026-04-10 16:04     ` Tvrtko Ursulin
  2026-04-10 16:12       ` Francis, David
  0 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2026-04-10 16:04 UTC (permalink / raw)
  To: Francis, David, criu@lists.linux.dev

On 10/04/2026 16:46, Francis, David wrote:
> Haven't tested with having a BO created / mapped, but since that code should already be in from the dmabuf IPC work, it should work.
>
> That's what I'm testing next.

I meant the first test case from
https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d
ie. igt_subtest("open")?

Regards,

Tvrtko

>
> David.
>
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 10, 2026 11:42 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
>
>
>
> On 10/04/2026 15:45, David Francis wrote:
>> These patches allow processes that have renderD device files open
>> but not kfd device files to dump / restore.
>
> The series allows the first/minimal test case from my IGT to pass?
>
> Regards,
>
> Tvrtko
>
>>
>> Mostly a proof-of-concept / baseline for future work since there's
>> currently no way for such a process to dump / restore its queues.
>>
>> David Francis (3):
>>   plugin/amdgpu: Add plugin to inventory even if there are no vmas
>>   plugin/amdgpu: Add topology dump file
>>   plugins/amdgpu: Make next_fd without kfd
>>
>>  plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>>  plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>>  plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>>  plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>>  4 files changed, 180 insertions(+), 11 deletions(-)
>>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
  2026-04-10 16:04     ` Tvrtko Ursulin
@ 2026-04-10 16:12       ` Francis, David
  0 siblings, 0 replies; 15+ messages in thread
From: Francis, David @ 2026-04-10 16:12 UTC (permalink / raw)
  To: Tvrtko Ursulin, criu@lists.linux.dev

> I meant the first test case from
> https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d
> ie. igt_subtest("open")?

Whoops, misread before. Yes, that should work

David

________________________________________
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Sent: Friday, April 10, 2026 12:04 PM
To: Francis, David; criu@lists.linux.dev
Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.

On 10/04/2026 16:46, Francis, David wrote:
> Haven't tested with having a BO created / mapped, but since that code should already be in from the dmabuf IPC work, it should work.
>
> That's what I'm testing next.

I meant the first test case from
https://cgit.freedesktop.org/~tursulin/intel-gpu-tools/commit/?h=amd-criu&id=24487611e08c753e73b3fc650ca85793bf053f4d
ie. igt_subtest("open")?

Regards,

Tvrtko

>
> David.
>
> ________________________________________
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Friday, April 10, 2026 11:42 AM
> To: Francis, David; criu@lists.linux.dev
> Subject: Re: [PATCH 0/3] Patches to allow amdgpu restore without kfd file.
>
>
>
> On 10/04/2026 15:45, David Francis wrote:
>> These patches allow processes that have renderD device files open
>> but not kfd device files to dump / restore.
>
> The series allows the first/minimal test case from my IGT to pass?
>
> Regards,
>
> Tvrtko
>
>>
>> Mostly a proof-of-concept / baseline for future work since there's
>> currently no way for such a process to dump / restore its queues.
>>
>> David Francis (3):
>>   plugin/amdgpu: Add plugin to inventory even if there are no vmas
>>   plugin/amdgpu: Add topology dump file
>>   plugins/amdgpu: Make next_fd without kfd
>>
>>  plugins/amdgpu/amdgpu_plugin.c      | 113 ++++++++++++++++++++++++++--
>>  plugins/amdgpu/amdgpu_plugin_drm.c  |  64 ++++++++++++++--
>>  plugins/amdgpu/amdgpu_plugin_util.h |   9 +++
>>  plugins/amdgpu/criu-amdgpu.proto    |   5 ++
>>  4 files changed, 180 insertions(+), 11 deletions(-)
>>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread
End of thread; newest message 2026-04-22 13:58 UTC.

Thread overview: 15+ messages
  2026-04-10 14:45 [PATCH 0/3] Patches to allow amdgpu restore without kfd file David Francis
  2026-04-10 14:45 ` [PATCH 1/3] plugin/amdgpu: Add plugin to inventory even if there are no vmas David Francis
  2026-04-10 15:46   ` Tvrtko Ursulin
  2026-04-10 16:15     ` Francis, David
  2026-04-10 14:45 ` [PATCH 2/3] plugin/amdgpu: Add topology dump file David Francis
  2026-04-16 14:54   ` Tvrtko Ursulin
  2026-04-16 18:26     ` Francis, David
  2026-04-17  9:07       ` Tvrtko Ursulin
  2026-04-17 13:21         ` Francis, David
  2026-04-22 13:58           ` Tvrtko Ursulin
  2026-04-10 14:45 ` [PATCH 3/3] plugins/amdgpu: Make next_fd without kfd David Francis
  2026-04-10 15:42 ` [PATCH 0/3] Patches to allow amdgpu restore without kfd file Tvrtko Ursulin
  2026-04-10 15:46   ` Francis, David
  2026-04-10 16:04     ` Tvrtko Ursulin
  2026-04-10 16:12       ` Francis, David