From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from DM1PR04CU001.outbound.protection.outlook.com (mail-centralusazon11010061.outbound.protection.outlook.com [52.101.61.61]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C06543D47CC for ; Fri, 10 Apr 2026 14:45:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=52.101.61.61 ARC-Seal:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775832332; cv=fail; b=U7ruO7E/DHQrWK19sIfFRvX1dXuwOip6c9vQazzYOyGKhVjQ+ZzVyLw2LUU0rNg7Mdc790fo+w2X8he0AmIJdblpF42G0NMvuawzbdxpOl9DRrR/CHy00OfS/2Hn1Bg93CDPw7bEZEsEjDNwqrKyc3pLkuSNwWiFGUH1jHukqHU= ARC-Message-Signature:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1775832332; c=relaxed/simple; bh=Ag4vjHmtN6N6/FgxW+3wTxRF1hJxeZoMmAKEk6Dasq8=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=jklsBSqhB2lxaVLrpgrbIQq+Lk42DB1fLk3bm998f0vcpNh//iey346lT2ZdVNDBnZvA5Z4/BlIQRnSh4wxRBXApv9k+5P5NjgKk33Fpnyb39NX15NycAaaiYy7Pu6FNmB2rBEUDMyy3sVPiB2mfimPh86+gThg4Y8BH3eOofmA= ARC-Authentication-Results:i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=YY59kKid; arc=fail smtp.client-ip=52.101.61.61 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="YY59kKid" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; 
b=UMDE8LE9QsfgRA1ccM4f7CMlzTHm8borSGZHjqJtaBb343vrGzSpfGOrsyBYt1he6dMmfQ3Eax0KC/ymRnYvqwBTuYbaqKsTP2YqxH3DxW9Bi2WBiLwMFmQtxxdvKzA0ORJkqtfJ6WQhEwg+m5NDlEBBVo0tL72E0JpQgorLs7JmPoqEVKI46lpS/NZWz3Eq0hMRp3SuXMk+ktiIUtTo7Vklmn480KGgngAoDMVL30Fvk17l526KqIc8XESPFQqYP2ZfNf/0H1ZHEcrMDLo2+HMudigWH14q/da5l+qidFSluX1DZzUTFKcxm0QMKVwA+z0Qhrua/r0/479Xgj8F6g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=qTxUVFNsoKuiP/yCKraikJ4VmhPP/OoMwz6zM96thzU=; b=DDt0OJFexAqnETEvlGRybl1LK4/zTRmZWq0WWDZRhGhEuJAbHe4B9w/8kq8WwQi853tobmAQusr6Ms+hRkZ1aIAVzl8ISbQ/kAPMzTmMI25YfxUJibRkKhWv2HcxwCylxLXo7YP3cymNc2wn3pT+1Kx31+GIdMzNrixNDFU5//+aiBcqUIbetPyWgrPihMNi+aEWIVfktR062BeagvygcHWcq0ERrCwo9lJY+I2yATVwoyxPE1xVPL/xPULZfdlvuEXhrZa8fdTSWiytSLvmreIfLg4uOU31zAaC5G3pwtwqmR/idYELLx62AqW6jUu8CjSwFNNDfKeBlewK5r+l+g== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=lists.linux.dev smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=qTxUVFNsoKuiP/yCKraikJ4VmhPP/OoMwz6zM96thzU=; b=YY59kKidGSGTIntYGkWVAszonvKLHfHZvOXh2/5iWL4l5FkI8kJfnq0El6tmZSotjW7/78IAMeddxNwOmZtsPbKz/jOHV1SOzVeoNxU4Xch+rF++6pUzv7ns7zezZJ1kuMlJEd0wUd85wD7d2iXim2h3i1IMJxQvmNQBwcL9m4o= Received: from BN0PR04CA0052.namprd04.prod.outlook.com (2603:10b6:408:e8::27) by SA1PR12MB999086.namprd12.prod.outlook.com (2603:10b6:806:49f::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9769.44; Fri, 10 Apr 2026 14:45:26 +0000 Received: from 
BN2PEPF00004FBD.namprd04.prod.outlook.com (2603:10b6:408:e8:cafe::6a) by BN0PR04CA0052.outlook.office365.com (2603:10b6:408:e8::27) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9769.43 via Frontend Transport; Fri, 10 Apr 2026 14:45:26 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by BN2PEPF00004FBD.mail.protection.outlook.com (10.167.243.183) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9769.17 via Frontend Transport; Fri, 10 Apr 2026 14:45:26 +0000 Received: from fdavid-dev.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Fri, 10 Apr 2026 09:45:26 -0500 From: David Francis To: CC: , David Francis Subject: [PATCH 2/3] plugin/amdgpu: Add topology dump file Date: Fri, 10 Apr 2026 10:45:08 -0400 Message-ID: <20260410144509.738903-3-David.Francis@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260410144509.738903-1-David.Francis@amd.com> References: <20260410144509.738903-1-David.Francis@amd.com> Precedence: bulk X-Mailing-List: criu@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-ClientProxiedBy: satlexmb07.amd.com (10.181.42.216) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BN2PEPF00004FBD:EE_|SA1PR12MB999086:EE_ X-MS-Office365-Filtering-Correlation-Id: e844c81a-f5ec-4cab-8bbd-08de970fcdc6 
X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|36860700016|82310400026|376014|1800799024|17002099007|18002099003|22082099003|56012099003; X-Microsoft-Antispam-Message-Info: G6GLgj4z263SER2wMuG8BOLoWYs9sDdWXRYz/qt9CnzABjrVlnjZBqsvh25aH9MtyTb3MgPqZqyl/W4iBDsRYpdLQI+9VN8qEvuc2z0mYW4krjsjmbMXOPjvBLhONEaL4gzJ8/RqRspTqAzpYKrlAxxIIi6vZ77JRvkDv0VRh1kS193VkbZoswt/700NsyZpcd+18+yiiIk9p31RfG97taNmqojxc4ObikJNKtdFkBTx+DerGL9O//hGkGIUys9JfJUzhPPc8YEzK6SzxihsHhVlDh//WD281qoPmsnw84EQh0NUNMrztTlsGpAdjFsrZJdYJb+DfByYknCgDfk5C/PgOjVbEqfdCQc7uepPjWdw8aW5V5FSbRJwiarmcAg6kye9YqeWgoXujd42u4+VuuNYoVkf5ix2J+wkU07+2eTDKvDi25X4CCEfNBKj4V/PFLpS+Kp9qMFyUHkQyM0acIsdGAk3qlrKJW7f97/sEgKV7u6oA2QRlRNIMZ/+ZZQEFoWHW2m08+yI6o1NKEcb67knUcTVBCDNc7M1nU368fmlZd+Ac7UpMw65cK1re1sHMWZUUGcsWGq1Z2y4pXUjjKHj1L0v8GuJnv5dGYNntTSXWaf2PIGTgxBTvTKzEgaMU4Gv+ZUmwR5xGb3L3Ewu3ApLOsjOpMpUjallhwoVRq1LGKnEfuFIrUMy/Z5m4pUAtkuznuJSlkxclfP/nbsUx4KxfY2eRLJvcc/DKvQ7w/3xQCeWy5o9nKJuSoHZY2a8z16MpHdE24SGJicOO25sEw== X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(36860700016)(82310400026)(376014)(1800799024)(17002099007)(18002099003)(22082099003)(56012099003);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: EXxwUi9kX1sHjwVs+b/P4L6DK7EaVOrI/GHBybOhpGcMuZ77gvnPLgkj2+I/rnFsRW8U0kj+B0V86sriaZikVxblE6k9Gd2ofQrSBZdx9fwwqjDWoxzaYbNJnyFhPyfi6kxKJj1PU9B9+Mj2eWZgDO3ubVeiruOb0xNY1j5Pc08WvO51ZaJC2uQmcTPxAHY9LJUouXgmbja+gRDosSsPAaWXmbvshFYkPijAP4LPqj4DUK2lVfY2VD3s5+YTVMcJbBHwytIc+Cct1Pfzqv8QzKJ4coyyNCt2J3VJqPK77j1oyKCFh0PVhUagTS7qfVqwGFmep+k6OsgAnXBW6SOWLFZvGpb1WEONkpz7quOo/7aNiNCxfs36Y8l69LQWxVJdmxLaNzu03s/Sho60Y5IHgjSiOcnDRN+U0p2VJOPfCn7m/pyxky2BCzyC6BJGy3Sf X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 10 Apr 2026 14:45:26.5400 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 
e844c81a-f5ec-4cab-8bbd-08de970fcdc6 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: BN2PEPF00004FBD.namprd04.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: SA1PR12MB999086 The state of the source topology (the GPUs, CPUs, and links between them) is saved by the plugin as part of the kfd dump. If there is no kfd dump, the topology still needs to be saved. Do so in a new file, amdgpu-topology.img. Signed-off-by: David Francis --- plugins/amdgpu/amdgpu_plugin.c | 84 ++++++++++++++++++++++++++--- plugins/amdgpu/amdgpu_plugin_drm.c | 64 ++++++++++++++++++++-- plugins/amdgpu/amdgpu_plugin_util.h | 9 ++++ plugins/amdgpu/criu-amdgpu.proto | 5 ++ 4 files changed, 151 insertions(+), 11 deletions(-) diff --git a/plugins/amdgpu/amdgpu_plugin.c b/plugins/amdgpu/amdgpu_plugin.c index 89ab10dac..1e9785440 100644 --- a/plugins/amdgpu/amdgpu_plugin.c +++ b/plugins/amdgpu/amdgpu_plugin.c @@ -91,6 +91,9 @@ int current_pid; */ bool parallel_disabled = false; +bool kfd_dump_complete = false; +bool amdgpu_topology_dump_complete = false; + pthread_t parallel_thread = 0; int parallel_thread_result = 0; /**************************************************************************************************/ @@ -189,9 +192,14 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi devinfo->node_id = node->id; if (NODE_IS_GPU(node)) { - devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); - if (!devinfo->gpu_id) - continue; + if (maps) { + devinfo->gpu_id = maps_get_dest_gpu(maps, node->gpu_id); + if (!devinfo->gpu_id) + continue; + } else { + devinfo->gpu_id = node->gpu_id; + } + devinfo->simd_count = node->simd_count; 
devinfo->mem_banks_count = node->mem_banks_count; @@ -238,9 +246,13 @@ int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDevi if (!iolink->valid) continue; - list_for_each_entry(node2, &sys->nodes, listm_system) - if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0) - link_to_present_node = true; + if (maps) { + list_for_each_entry(node2, &sys->nodes, listm_system) + if (node2->id == iolink->node_to_id && maps_get_dest_gpu(maps, node2->gpu_id) != 0) + link_to_present_node = true; + } else { + link_to_present_node = true; + } if (!link_to_present_node) continue; @@ -386,6 +398,11 @@ int amdgpu_plugin_init(int stage) maps_init(&checkpoint_maps); maps_init(&restore_maps); + if (stage == CR_PLUGIN_STAGE__DUMP) { + kfd_dump_complete = false; + amdgpu_topology_dump_complete = false; + } + if (stage == CR_PLUGIN_STAGE__RESTORE) { if (has_children(root_item)) { pr_info("Parallel restore disabled\n"); @@ -1552,6 +1569,7 @@ int amdgpu_plugin_dump_file(int fd, int id) if (ret) goto exit; + kfd_dump_complete = true; if (!plugin_added_to_inventory) { ret = add_inventory_plugin(CR_PLUGIN_DESC.name); if (ret) { @@ -1908,6 +1926,60 @@ int amdgpu_plugin_restore_file(int id, bool *retry_needed) pr_info("render node gpu_id = 0x%04x\n", rd->gpu_id); + if (list_empty(&restore_maps.cpu_maps) && list_empty(&restore_maps.gpu_maps)) { + AmdgpuDevinfo *ad; + + pr_info("No restore maps found, making them from topology file\n"); + + img_fp = open_img_file(IMG_AMDGPU_TOPOLOGY_FILE, false, &img_size, true); + if (!img_fp) { + pr_err("Failed to find either kfd or amdgpu src topology information\n"); + ret = -EINVAL; + goto exit; + } + + buf = xmalloc(img_size); + if (!buf) { + pr_err("Failed to allocate memory\n"); + return -ENOMEM; + } + + ret = read_fp(img_fp, buf, img_size); + if (ret) { + pr_err("Unable to read from %s\n", IMG_AMDGPU_TOPOLOGY_FILE); + ret = -EINVAL; + goto exit; + } + + ad = amdgpu_devinfo__unpack(NULL, img_size, buf); 
+ if (ad == NULL) { + pr_err("Unable to parse the amdgpu topology message\n"); + fclose(img_fp); + ret = -EINVAL; + goto exit; + } + fclose(img_fp); + + ret = devinfo_to_topology(ad->device_entries, ad->num_of_devices, &src_topology); + if (ret) { + pr_err("Failed to convert amdgpu device information to topology\n"); + ret = -EINVAL; + goto exit; + } + + ret = topology_parse(&dest_topology, "Local"); + if (ret) { + pr_err("Failed to parse local system topology\n"); + goto exit; + } + + ret = set_restore_gpu_maps(&src_topology, &dest_topology, &restore_maps); + if (ret) { + pr_err("Failed to map GPUs\n"); + goto exit; + } + } + target_gpu_id = maps_get_dest_gpu(&restore_maps, rd->gpu_id); if (!target_gpu_id) { fd = -ENODEV; diff --git a/plugins/amdgpu/amdgpu_plugin_drm.c b/plugins/amdgpu/amdgpu_plugin_drm.c index c1dfb2dd4..a4c650753 100644 --- a/plugins/amdgpu/amdgpu_plugin_drm.c +++ b/plugins/amdgpu/amdgpu_plugin_drm.c @@ -467,11 +467,65 @@ int amdgpu_plugin_drm_dump_file(int fd, int id, struct stat *drm) return -ENODEV; } - /* Get the GPU_ID of the DRM device */ - rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); - if (!rd->gpu_id) { - pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); - return -ENODEV; + if (kfd_dump_complete) { + /* Get the GPU_ID of the DRM device */ + rd->gpu_id = maps_get_dest_gpu(&checkpoint_maps, tp_node->gpu_id); + if (!rd->gpu_id) { + pr_err("Failed to find valid gpu_id for the device = %d\n", rd->gpu_id); + return -ENODEV; + } + } else { + rd->gpu_id = tp_node->gpu_id; + + if (!amdgpu_topology_dump_complete) { + AmdgpuDevinfo *ad = NULL; + unsigned char *buf; + + ad = xmalloc(sizeof(*ad)); + amdgpu_devinfo__init(ad); + + ad->num_of_devices = src_topology.num_nodes; + + ad->device_entries = xmalloc(sizeof(KfdDeviceEntry *) * ad->num_of_devices); + if (!ad->device_entries) { + pr_err("Failed to allocate device_entries\n"); + return -ENOMEM; + } + + for (int i = 0; i < ad->num_of_devices; i++) { + 
KfdDeviceEntry *entry = xzalloc(sizeof(*entry)); + + if (!entry) { + pr_err("Failed to allocate entry\n"); + return -ENOMEM; + } + + kfd_device_entry__init(entry); + + ad->device_entries[i] = entry; + ad->n_device_entries++; + } + + topology_to_devinfo(&src_topology, NULL, ad->device_entries); + + len = amdgpu_devinfo__get_packed_size(ad); + + buf = xmalloc(len); + if (!buf) { + pr_perror("Failed to allocate memory to store protobuf"); + return -ENOMEM; + } + + amdgpu_devinfo__pack(ad, buf); + + ret = write_img_file(IMG_AMDGPU_TOPOLOGY_FILE, buf, len); + if (ret) { + pr_err("Failed to write image file %s\n", IMG_AMDGPU_TOPOLOGY_FILE); + return -EINVAL; + } + + amdgpu_topology_dump_complete = true; + } } len = criu_render_node__get_packed_size(rd); diff --git a/plugins/amdgpu/amdgpu_plugin_util.h b/plugins/amdgpu/amdgpu_plugin_util.h index 69b98a31c..ccfe30b49 100644 --- a/plugins/amdgpu/amdgpu_plugin_util.h +++ b/plugins/amdgpu/amdgpu_plugin_util.h @@ -2,6 +2,7 @@ #define __AMDGPU_PLUGIN_UTIL_H__ #include +#include "criu-amdgpu.pb-c.h" #ifndef _GNU_SOURCE #define _GNU_SOURCE 1 @@ -59,6 +60,9 @@ /* Name of file having serialized data of DRM device buffer objects (BOs) */ #define IMG_DRM_PAGES_FILE "amdgpu-drm-pages-%d-%d-%04x.img" +/* Name of file containing the source device topology (generated only if IMG_KFD_FILE is not)*/ +#define IMG_AMDGPU_TOPOLOGY_FILE "amdgpu-topology.img" + /* Helper macros to Checkpoint and Restore a ROCm file */ #define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem" #define HSAKMT_SHM "/hsakmt_shared_mem" @@ -115,6 +119,9 @@ extern bool kfd_vram_size_check; extern bool kfd_numa_check; extern bool kfd_capability_check; +extern bool kfd_dump_complete; +extern bool amdgpu_topology_dump_complete; + int read_fp(FILE *fp, void *buf, const size_t buf_len); int write_fp(FILE *fp, const void *buf, const size_t buf_len); int read_file(const char *file_path, void *buf, const size_t buf_len); @@ -142,4 +149,6 @@ int sdma_copy_bo(int shared_fd, uint64_t 
size, FILE *storage_fp, int serve_out_dmabuf_fd(int handle, int fd); +int topology_to_devinfo(struct tp_system *sys, struct device_maps *maps, KfdDeviceEntry **deviceEntries); + #endif /* __AMDGPU_PLUGIN_UTIL_H__ */ diff --git a/plugins/amdgpu/criu-amdgpu.proto b/plugins/amdgpu/criu-amdgpu.proto index 7682a8f21..6e44e22aa 100644 --- a/plugins/amdgpu/criu-amdgpu.proto +++ b/plugins/amdgpu/criu-amdgpu.proto @@ -93,3 +93,8 @@ message criu_render_node { message criu_dmabuf_node { required uint32 gem_handle = 1; } + +message amdgpu_devinfo { + required uint32 num_of_devices = 1; + repeated kfd_device_entry device_entries = 2; +} \ No newline at end of file -- 2.34.1
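
For reference, the "maps == NULL" branch this patch adds to topology_to_devinfo() can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the plugin code: struct gpu_maps, lookup_dest_gpu() and resolve_gpu_id() are hypothetical stand-ins for struct device_maps and maps_get_dest_gpu(), and the sketch assumes, as the checks in the patch suggest, that a lookup result of 0 means the GPU has no mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the plugin's device maps: src -> dest gpu_id pairs. */
struct gpu_map_entry {
	uint32_t src;
	uint32_t dest;
};

struct gpu_maps {
	const struct gpu_map_entry *entries;
	size_t count;
};

/* Return the mapped gpu_id, or 0 when src has no mapping (assumed convention). */
static uint32_t lookup_dest_gpu(const struct gpu_maps *maps, uint32_t src)
{
	assert(maps != NULL); /* callers must take the NULL branch themselves */

	for (size_t i = 0; i < maps->count; i++)
		if (maps->entries[i].src == src)
			return maps->entries[i].dest;
	return 0;
}

/*
 * Without maps, record the source gpu_id unchanged; with maps, translate it.
 * A result of 0 tells the caller to skip the node.
 */
static uint32_t resolve_gpu_id(const struct gpu_maps *maps, uint32_t src)
{
	if (!maps)
		return src;
	return lookup_dest_gpu(maps, src);
}
```

In this sketch a caller on the no-kfd dump path would pass NULL and get the source gpu_id back as-is, while a caller that has maps would skip any node for which the result is 0.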