From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 73FC2EE645F for ; Thu, 12 Sep 2024 08:21:28 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1sof4V-0005xh-MX; Thu, 12 Sep 2024 04:20:59 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sof4T-0005qc-OV for qemu-devel@nongnu.org; Thu, 12 Sep 2024 04:20:58 -0400 Received: from mail-mw2nam04on20620.outbound.protection.outlook.com ([2a01:111:f403:240a::620] helo=NAM04-MW2-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sof4Q-0005zZ-CE for qemu-devel@nongnu.org; Thu, 12 Sep 2024 04:20:57 -0400 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=h71ImSF6Mr4BNjH/37AK2PVbVfIFeXR6fh8jT3v8+fHDnDfeiH0zpxyMoJqqa9iY3nga8TtUhAfFOWvokvCWiSCtOXcEGtkcwrKoV12MeuBEQ90wrNS8nmZvf8DP+jHv/m5Ik1P4O5ZI1ZbkNJdWgwctGXX86LuCx5GYEit0u9TAv8+mx9qZsoIjZZU/lO6ftJ3O8adNdWwHvwHBvjwFR2PREtKHpFwdcu7FXZvAHG4AyLnjagNTyzWbjVf2GNw1rGgfozg12X72e3GLEl+6i3+h4mZRFDyVwEvdaAYZLS2I/Zi75g6saNC3Nwql8UPHmjoG2mrem/bHr+mvyqiG5Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Y/+BYWs3wrrF8ivuj4hx8vslvttn4Q5EkH9QRawbfd0=; b=Y6eHgtao2nTSOTI7jRgkSeqfDlwmuKakP1DLKsTUjMMTQzKKOCupTdJosayxrbDPbMk0txe6EytnF45qtucxDqAYlvUDEzKQ28i9cY+hx3sUEMqqw50IOgb++G6xMaku8q2HvsSfm919TgGVLT1002iuag1nfJVi696R3xXMeWNqWeMWA7J4Q4QNAeVfQV+bT6FXinfBYlFxrX9VB7h0+Cy2Hv7f2JHut3oBa4I4RnAnYA2XmGrS10VbBkiPy/zEiu4s9eE/uoNtkzdECdW7O02SycogPBW+Ea0D8pNj+GvDpCrH4LTwSdDi+cSZPJyzKgkVQTDgp882/wJWA2VX8A== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com; dkim=pass header.d=nvidia.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=Y/+BYWs3wrrF8ivuj4hx8vslvttn4Q5EkH9QRawbfd0=; b=dt+hrRlJhe+uoRbkpxkK4XVDlvHMSge3GtqGypyQ2DLl9zgC6g5YEMwkeoU2gfxLmclDYqSU5uve96HEbL/Bq3TcbFWIuGsBcOgsbpgOQhdzgdkqKa6j4uwDidaj1GZGEPc2MxXJtbXMVhUaefzVl7TTVe3R8b8EB2e4BkSdstqm8fE4ODIcsbEYXoNUdRIhx0hXTI0wvN6H2+ABC47XBdEE603ciPF6a1/AwkdQkryNoO9VP3Z2cRjYwJydKp0edW0HeC/KpRlk9xYjFf/8bhy2YMhnDZmv3Jsz7/59CK7vTaEk8RaJdfDTpaBEnwB8G7bskFjZt6pTBpASkPX6lg== Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=nvidia.com; Received: from DM6PR12MB5549.namprd12.prod.outlook.com (2603:10b6:5:209::13) by DS7PR12MB8371.namprd12.prod.outlook.com (2603:10b6:8:e9::18) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7939.20; Thu, 12 Sep 2024 08:20:47 +0000 Received: from DM6PR12MB5549.namprd12.prod.outlook.com ([fe80::e2a0:b00b:806b:dc91]) by DM6PR12MB5549.namprd12.prod.outlook.com ([fe80::e2a0:b00b:806b:dc91%6]) with mapi id 15.20.7962.017; Thu, 12 Sep 2024 08:20:46 +0000 Message-ID: <5a9dba27-96e5-4404-b429-c36d9b1c5707@nvidia.com> Date: Thu, 12 Sep 2024 11:20:39 +0300 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 15/17] vfio/migration: Multifd device state transfer support - receive side To: "Maciej S. Szmigiero" , Peter Xu Cc: Alex Williamson , =?UTF-8?Q?C=C3=A9dric_Le_Goater?= , Fabiano Rosas , Eric Blake , Markus Armbruster , =?UTF-8?Q?Daniel_P_=2E_Berrang=C3=A9?= , Joao Martins , qemu-devel@nongnu.org References: <4133ce80174fa3b81070adaeeb068554beba2854.1724701542.git.maciej.szmigiero@oracle.com> <12ebbb56-c2a7-42d0-8dad-9b459a490c39@nvidia.com> Content-Language: en-US From: Avihai Horon In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-ClientProxiedBy: LO3P123CA0009.GBRP123.PROD.OUTLOOK.COM (2603:10a6:600:ba::14) To DM6PR12MB5549.namprd12.prod.outlook.com (2603:10b6:5:209::13) MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM6PR12MB5549:EE_|DS7PR12MB8371:EE_ X-MS-Office365-Filtering-Correlation-Id: 1dd1c4ec-32f3-42a0-cd17-08dcd303cd9a X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|7416014|376014|1800799024|366016; X-Microsoft-Antispam-Message-Info: =?utf-8?B?OTFyQjNPS3pST3ZMY2R6ZWRIRjluYU8vWDBlcWJnellRUUQ4aU5KM2F5d3g1?= =?utf-8?B?YWRXRE5Jc0tEVEV1RWwya2dVaSsyZUNDM05kakdnV2k1Z2xLaU1tY0dpZGRY?= =?utf-8?B?S241aldlSElaZVlmSWNVSzNraVRlQlo5a1ZEM1lxVUpIbUxCd0xpdXpnUjFM?= =?utf-8?B?VXZXcVRDQ2YxMG9Lc3hoS1ordk44T29Uc3hnSzQ1bXFyUkdmY2d4UW1MUkdu?= =?utf-8?B?SXl4SmdHOXFJRk1EMjRocGFMMVo3VTluVjhhQlE5Qi93NmhINUFoS3FEQmtZ?= =?utf-8?B?dU9JdTVWMXdRYjMwTnpseURtV1hlU0tvYkJDUnNjVnQ3UzAvcnJ0YlJJaHhE?= =?utf-8?B?cStBWktFMjAyd2NobytSMjk2bFhXUk9TdW9ybTRVaDlQdWp2YmdMMjByTVlt?= =?utf-8?B?RGpaRnRhditudlRxRkVqd1dDdXVzek5YdU9CRjY5Tjkxd1N3QkN2cGM2Vzhm?= =?utf-8?B?MXJwQW1TSE5VWGh5VmRha3krNSt5UmdPZ2dIUUtGOVNYakJzbWRMQkdYNFFm?= =?utf-8?B?bDV0bFY5Ukl4NGJQSGhDZzduVFNiT04rZVZoRXdNVWJRRzJxZ0MzY1Q3Qkpx?= =?utf-8?B?WnRDa1A5YWpqTmhwem5KY1lPM1hGU2N5MC9MY2MwZTJmMmtLTXJUandsaWp0?= =?utf-8?B?UmcxZ2ovNzQxR0xRSWpYL1dITWMzUFBVT2lLVXJLSUY3M1dWMTY5K2Y3VlRm?= =?utf-8?B?WVBIakhncFhvblpyMEpZN2t0ZFgyRUpuNXlRMUxFaHJYTnJqVlZvUEJOQ0hH?= =?utf-8?B?N1BZU2UzUnVqbXV5ZjI0T0dKSkVqcnozVXR4TFY3SlFudHVFZVlIaUNqRkg3?= =?utf-8?B?QzAwbUsxK0pDQVFSS1VYN2VqREFMWFVxVnEwMkNnb01ibFp2dWt4OExsam5C?= =?utf-8?B?UmRvelpVdXlpZFJ5RmhFN1hHK1g2TlVkWm5rQ2dYSmd6SmtQak5vRG1Wd3Va?= =?utf-8?B?WnZ4ODFybjNiN25aUzVKZ1hjQ3EwcWVHQWU0dExmQmM3RGppdGVwOEFRVWlU?= =?utf-8?B?UzRhL3ZEOVpKVEVyY3M3QTM2VE90WVM4azRWUWRPajBFYWlvTWU4NHdqWHFP?= =?utf-8?B?UmNVVHVDQWQzcWlvQVpxck9wbjZLZUxYNzVPM2JpY25kWEU4WGpvaWFtdnFV?= =?utf-8?B?UUtOSU4vQzF4K1FRa0VpeUx6YWlHZHgxVE42RWd0MVhyUEhRL3EvU2ZwODAz?= =?utf-8?B?MmQvd284S1BkNHN5Y1RkY0pkUjlXbG85WDg4T0tndm9qblo4c21qZzFYVGM3?= =?utf-8?B?VmZMdXpZaExwZUROb0hIMlZnS1Vtd1JjenYySCtkNmpubUozazU1OWljVENq?= =?utf-8?B?TkZ1LzRIOGE0NWJVb1MyNkdsemlXTXVNaGFnbThLQklEQ3NuRVEvenFjL0Yw?= =?utf-8?B?UkUxd0praTNOSzZ5Mnl6cUdmSGZrS2gxbzROOXQvRXBjSEdEbGh5cm9UUHhQ?= =?utf-8?B?Wm1FUUM4OFFCWDhWTmUxTmtVcENORnErWDNZa29OM3JDZi9zdFU4UVZBRnZN?= =?utf-8?B?Z0hhcVdaNDRrN2JoNFdwa1pxSUUwV3kwai9xNUE3TGJxbEdUZXEvaUxhYUxQ?= =?utf-8?B?SGI5RmF5eHlDQ3RaVEdUdDcvUTZRVEFXMmdqTFFRS3M0dVdrM0l5Ukx6d3E4?= =?utf-8?B?VzBBSlJlT2xyMU9Ea1lzdHl5M2wyd25jVUxRaVM0K1dadVZPNlMwMDV3VEdp?= =?utf-8?B?RUx4cjlKb25iT2QwT2M3QUpCa2I1V1ROSFVhODF5Q1JZWkU5TVl3elUxNjEr?= =?utf-8?B?Z2xnVzBvdC9SODZ4S1NWaFNSbnNJTVlsVDRnbWRMNVp3RTI3ZGFPd2tNazVY?= =?utf-8?B?TlJDTmdYeDN4M3dhcUE5QT09?= X-Forefront-Antispam-Report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:DM6PR12MB5549.namprd12.prod.outlook.com; PTR:; CAT:NONE; SFS:(13230040)(7416014)(376014)(1800799024)(366016); DIR:OUT; SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?MjBkcURtRjFXcm1pZk1FbjBYdHdpeitmUVRIY0JVT2ZjN1VpQVVZY0g1ZkdG?= =?utf-8?B?d1R2bDFyU2Ixdk5UYUE2NUhDT1FvWDlNL2QrblRha2xTeWc0WDkvNmFZRzBi?= =?utf-8?B?VEdJMUk1UGNGb2VudTZOU21NZDRzTGp3azg3aUk1eGd0NEpLalBmYXQ4bG4y?= =?utf-8?B?dUhmR3FVbVR2SVlucDkrZllzS2pQL1gzenhuWFd0MlNiN1h5RjVIYnFwaVNN?= =?utf-8?B?VzZMVERBN001dFZsbVFnakFQOE94ZUFYSXVqOExOMEh2UkNvdkRVOG5rZ21q?= =?utf-8?B?RHJrbHg1MTd2SGh2YXNRNkZLS1VwWWJrMXFBQkMxOGZxSXJZTnd3cVErWHRy?= =?utf-8?B?ZGRUdjlyZFR1bWtLS3ZnVkZRVFRpTWs5dGZvUGtlcTZIeFJTU0NNSmdmb3lI?= =?utf-8?B?YkN1VXVKZTBOdzUxTFFGTWIzNUFFTmNZc1NiVHpzUzk5MDFmMEVLWWIvYlpV?= =?utf-8?B?eEtPcGJuVmluc3pmcytrb1VEQXlLN0duTENqZDBtWm4reE5scmNKVG1YQVJC?= =?utf-8?B?ejNWQS9Scjcxc1RHK1pjS2kxN1NhR0dhRG93RVR0eXFEUmhMaWxlZGg3SS82?= =?utf-8?B?YVR4R2tyQVk5OVpZNnlnMDFSWk9nbEtSSWlqZ05qbU5rZ0pHU25OdDlweGpS?= =?utf-8?B?MWQxTjd5WHZxekxhVHc2cktGVXY3REJuandscStucG5HK0RpNmd2S2FpT1Jn?= =?utf-8?B?VXBxQXRTV3JYQUNkYWd4TmZLanZkc3Q0RzJCdmxsdkJuQmJsUkMxVUJHUEZ1?= =?utf-8?B?T1AzeUpkUVNVY0JMZEtoR1EwTU81RHFOTjZzcXEzempkbHFEL0svem9lL29M?= =?utf-8?B?RnZwYlVZd3hxcWhFaTRlL1VucUFENS8wczkwSGNzYm1tc2hTaWcyV242OUcw?= =?utf-8?B?alJDbkJTbGVxaFRDYXdYdkorNEcrUk1lZVkybTIrMkhDMVVzdlRyd3ZiR0VT?= =?utf-8?B?aW1GNDZJTER5OXBsZlJZYzVpQzA5dEdWZStoeHpBRVd0RWZnTzBCNEVKQzhK?= =?utf-8?B?eU9IOGIxaEZ3dWtoMmgwUElHOGIxRmdwenNJMGFIQmVjM2plTzFPWTJaV2Zl?= =?utf-8?B?R3IwMlpBSkdhbzZ2Tmo0amgzSWRib0lYSm41TVQ1RFp5RXhENXZxTEhEV1VF?= =?utf-8?B?Zi9ndFpBcUdIdVl3Wjk3b2d3aHp2bFB5ZGZaSGhrZmozcjdIR2tZc3R6YmlO?= =?utf-8?B?TjlCSHdMK2ZCWVp5UExaREkva1dUWTQ5dXRBQlA3ZFFhM1NkaGtQRjltUi9l?= =?utf-8?B?KzNONEFMd2IzbkNBN1huM1kxaUxDNjV5c09uUjdlVUQ2REs5OWdoMWZGVHpv?= =?utf-8?B?TVl0VWVRcUk2bEgyUE42RmtwRHFzS3JybllUU0p5SmRDaHdIWjh2UVhCbW45?= =?utf-8?B?MkVvOWVDMzExYnVRSzEydFBaMmRLeUtuU0tkWS9PODhtb21HV3U1dDNHVEQx?= =?utf-8?B?aWRJMmFNc3VzNmNzTk1sZ1cwRkduakxxS3ZWM2p5b3czZ2lVUURzenMrUjFY?= =?utf-8?B?R1ZTeHFSNGZ2eEVtbWdHWUh2bXIyNGhjb1NTbzN6bjlaUXA3dk15SVBDVlVv?= =?utf-8?B?eXJUcVk3SGNOZ1ZsR0JHbHkzS1NJUldwV0d4aGFyM3BreFVCcUJPU0REdGE2?= =?utf-8?B?dGxxaG41dVpBYm9LdkFvWEdqa0xOVzNCNTVJb0o1c2QxSkFQanFab2JET0lE?= =?utf-8?B?UE83VDk1eDNwd1JKcTZUdWhPcWVQZXNVTjNTVkxJaXlTOGVHa0ZuWWxQbk96?= =?utf-8?B?Y01WdklZYzZkYjhXMHR1MUxkRjdFZG5nOFJVRVQ3MFZ4cUxWUmM2eWxYekdQ?= =?utf-8?B?Ym1LR3F5dmw3ODVxNGlzdWw0d09xWWtnbUhmSTZ2SEtScy9CWUgxQWowRWVM?= =?utf-8?B?c2N3QlcwczdHc3AxZHd6UlhlMkRoS0dMWFpPTng2aVIzNm9PTnVGV0NJbUo1?= =?utf-8?B?dmZQMU8vRTg5R0ttdXhpdVJSUlR6TTh0TVNweWkrbHhtMnkvclhsNnVwclNT?= =?utf-8?B?WHpyRG5Pd3Mvb2UvdkcrbGxnZk12clpaRk1ML0E5dm91ckZNSUdJZDJSTHdp?= =?utf-8?B?enVIL1dtZEFPd3NrTTZrMUJrdFc2QjkwM09yOE1QYmpZZFlUZTlVb2dVaW5x?= =?utf-8?Q?yqkqXVpXKFQ4Usjsu2q+ArXhv?= X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-Network-Message-Id: 1dd1c4ec-32f3-42a0-cd17-08dcd303cd9a X-MS-Exchange-CrossTenant-AuthSource: DM6PR12MB5549.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 12 Sep 2024 08:20:46.9409 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: CSkmp6chGQA9sDLdncjkzsDvdTGTnQtkv6ZIuyyJS1lT7snXABxmN59FeERf4dn/GLzIuRQCgqFwXqxWX2X0DA== X-MS-Exchange-Transport-CrossTenantHeadersStamped: DS7PR12MB8371 Received-SPF: softfail client-ip=2a01:111:f403:240a::620; envelope-from=avihaih@nvidia.com; helo=NAM04-MW2-obe.outbound.protection.outlook.com X-Spam_score_int: -21 X-Spam_score: -2.2 X-Spam_bar: -- X-Spam_report: (-2.2 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.144, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org On 09/09/2024 21:06, Maciej S. Szmigiero wrote: > External email: Use caution opening links or attachments > > > On 9.09.2024 10:55, Avihai Horon wrote: >> >> On 27/08/2024 20:54, Maciej S. Szmigiero wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> From: "Maciej S. Szmigiero" >>> >>> The multifd received data needs to be reassembled since device state >>> packets sent via different multifd channels can arrive out-of-order. >>> >>> Therefore, each VFIO device state packet carries a header indicating >>> its position in the stream. >>> >>> The last such VFIO device state packet should have >>> VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config >>> state. >>> >>> Since it's important to finish loading device state transferred via >>> the main migration channel (via save_live_iterate handler) before >>> starting loading the data asynchronously transferred via multifd >>> a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to >>> mark the end of the main migration channel data. >>> >>> The device state loading process waits until that flag is seen before >>> commencing loading of the multifd-transferred device state. >>> >>> Signed-off-by: Maciej S. Szmigiero >>> --- >>>   hw/vfio/migration.c           | 338 >>> +++++++++++++++++++++++++++++++++- >>>   hw/vfio/pci.c                 |   2 + >>>   hw/vfio/trace-events          |   9 +- >>>   include/hw/vfio/vfio-common.h |  17 ++ >>>   4 files changed, 362 insertions(+), 4 deletions(-) >>> >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >>> index 24679d8c5034..57c1542528dc 100644 >>> --- a/hw/vfio/migration.c >>> +++ b/hw/vfio/migration.c >>> @@ -15,6 +15,7 @@ >>>   #include >>>   #include >>> >>> +#include "io/channel-buffer.h" >>>   #include "sysemu/runstate.h" >>>   #include "hw/vfio/vfio-common.h" >>>   #include "migration/misc.h" >>> @@ -47,6 +48,7 @@ >>>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL) >>>   #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL) >>>   #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL) >>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xffffffffef100006ULL) >>> >>>   /* >>>    * This is an arbitrary size based on migration of mlx5 devices, >>> where typically >>> @@ -55,6 +57,15 @@ >>>    */ >>>   #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) >>> >>> +#define VFIO_DEVICE_STATE_CONFIG_STATE (1) >>> + >>> +typedef struct VFIODeviceStatePacket { >>> +    uint32_t version; >>> +    uint32_t idx; >>> +    uint32_t flags; >>> +    uint8_t data[0]; >>> +} QEMU_PACKED VFIODeviceStatePacket; >>> + >>>   static int64_t bytes_transferred; >>> >>>   static const char *mig_state_to_str(enum vfio_device_mig_state state) >>> @@ -254,6 +265,188 @@ static int vfio_load_buffer(QEMUFile *f, >>> VFIODevice *vbasedev, >>>       return ret; >>>   } >>> >>> +typedef struct LoadedBuffer { >>> +    bool is_present; >>> +    char *data; >>> +    size_t len; >>> +} LoadedBuffer; >> >> Maybe rename LoadedBuffer to a more specific name, like VFIOStateBuffer? > > Will do. > >> I also feel like LoadedBuffer deserves a separate commit. >> Plus, I think it will be good to add a full API for this, that wraps >> the g_array_* calls and holds the extra members. >> E.g, VFIOStateBuffer, VFIOStateArray (will hold load_buf_idx, >> load_buf_idx_last, etc.), vfio_state_array_destroy(), >> vfio_state_array_alloc(), vfio_state_array_get(), etc... >> IMHO, this will make it clearer. > > Will think about wrapping GArray accesses in separate methods, > however wrapping a single line GArray call in a separate function > normally would seem a bit excessive. Sure, let's do it only if it makes the code cleaner. > >>> + >>> +static void loaded_buffer_clear(gpointer data) >>> +{ >>> +    LoadedBuffer *lb = data; >>> + >>> +    if (!lb->is_present) { >>> +        return; >>> +    } >>> + >>> +    g_clear_pointer(&lb->data, g_free); >>> +    lb->is_present = false; >>> +} >>> + >>> +static int vfio_load_state_buffer(void *opaque, char *data, size_t >>> data_size, >>> +                                  Error **errp) >>> +{ >>> +    VFIODevice *vbasedev = opaque; >>> +    VFIOMigration *migration = vbasedev->migration; >>> +    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data; >>> +    QEMU_LOCK_GUARD(&migration->load_bufs_mutex); >> >> Move lock to where it's needed? I.e., after >> trace_vfio_load_state_device_buffer_incoming(vbasedev->name, >> packet->idx) > > It's a declaration of a new variable so I guess it should always be > at the top of the code block in the kernel / QEMU code style? Yes, but it's opaque to the user. Looking at other QEMU_LOCK_GUARD call sites in the code and it seems like people are using it in the middle of code blocks as well. > > Also, these checks below are very unlikely to fail and even if they do, > I doubt a failed migration due to bit stream corruption is a scenario > worth optimizing run time performance for. IMO, in this case it's more for readability, but we can go either way and let the maintainer decide. > >>> +    LoadedBuffer *lb; >>> + >>> +    if (data_size < sizeof(*packet)) { >>> +        error_setg(errp, "packet too short at %zu (min is %zu)", >>> +                   data_size, sizeof(*packet)); >>> +        return -1; >>> +    } >>> + >>> +    if (packet->version != 0) { >>> +        error_setg(errp, "packet has unknown version %" PRIu32, >>> +                   packet->version); >>> +        return -1; >>> +    } >>> + >>> +    if (packet->idx == UINT32_MAX) { >>> +        error_setg(errp, "packet has too high idx %" PRIu32, >>> +                   packet->idx); >>> +        return -1; >>> +    } >>> + >>> + trace_vfio_load_state_device_buffer_incoming(vbasedev->name, >>> packet->idx); >>> + >>> +    /* config state packet should be the last one in the stream */ >>> +    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) { >>> +        migration->load_buf_idx_last = packet->idx; >>> +    } >>> + >>> +    assert(migration->load_bufs); >>> +    if (packet->idx >= migration->load_bufs->len) { >>> +        g_array_set_size(migration->load_bufs, packet->idx + 1); >>> +    } >>> + >>> +    lb = &g_array_index(migration->load_bufs, typeof(*lb), >>> packet->idx); >>> +    if (lb->is_present) { >>> +        error_setg(errp, "state buffer %" PRIu32 " already filled", >>> packet->idx); >>> +        return -1; >>> +    } >>> + >>> +    assert(packet->idx >= migration->load_buf_idx); >>> + >>> +    migration->load_buf_queued_pending_buffers++; >>> +    if (migration->load_buf_queued_pending_buffers > >>> +        vbasedev->migration_max_queued_buffers) { >>> +        error_setg(errp, >>> +                   "queuing state buffer %" PRIu32 " would exceed >>> the max of %" PRIu64, >>> +                   packet->idx, >>> vbasedev->migration_max_queued_buffers); >>> +        return -1; >>> +    } >> >> I feel like max_queued_buffers accounting/checking/configuration >> should be split to a separate patch that will come after this patch. >> Also, should we count bytes instead of buffers? Current buffer size >> is 1MB but this could change, and the normal user should not care or >> know what is the buffer size. >> So maybe rename to migration_max_pending_bytes or such? > > Since it's Peter that asked for this limit to be introduced in the > first place > I would like to ask him what his preference here. > > @Peter: max queued buffers or bytes? > >>> + >>> +    lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet)); >>> +    lb->len = data_size - sizeof(*packet); >>> +    lb->is_present = true; >>> + >>> + qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond); >> >> There is only one thread waiting, shouldn't signal be enough? > > Will change this to _signal() since it clearly doesn't > make sense to have a future-proof API here - it's an > implementation detail. > >>> + >>> +    return 0; >>> +} >>> + >>> +static void *vfio_load_bufs_thread(void *opaque) >>> +{ >>> +    VFIODevice *vbasedev = opaque; >>> +    VFIOMigration *migration = vbasedev->migration; >>> +    Error **errp = &migration->load_bufs_thread_errp; >>> +    g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock( >>> + QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex)); >> >> Any special reason to use QemuLockable? > > I prefer automatic lock management (RAII-like) for the same reason > I prefer automatic memory management: it makes it much harder to > forget to unlock the lock (or free memory) in some error path. > > That's the reason these primitives were introduced in QEMU in the > first place (apparently modeled after its Glib equivalents) and > why these are being (slowly) introduced to Linux kernel too. Agree, I guess what I really meant is why not use QEMU_LOCK_GUARD()? > >>> +    LoadedBuffer *lb; >>> + >>> +    while (!migration->load_bufs_device_ready && >>> +           !migration->load_bufs_thread_want_exit) { >>> + qemu_cond_wait(&migration->load_bufs_device_ready_cond, >>> &migration->load_bufs_mutex); >>> +    } >>> + >>> +    while (!migration->load_bufs_thread_want_exit) { >>> +        bool starved; >>> +        ssize_t ret; >>> + >>> +        assert(migration->load_buf_idx <= >>> migration->load_buf_idx_last); >>> + >>> +        if (migration->load_buf_idx >= migration->load_bufs->len) { >>> +            assert(migration->load_buf_idx == >>> migration->load_bufs->len); >>> +            starved = true; >>> +        } else { >>> +            lb = &g_array_index(migration->load_bufs, typeof(*lb), >>> migration->load_buf_idx); >>> +            starved = !lb->is_present; >>> +        } >>> + >>> +        if (starved) { >>> + trace_vfio_load_state_device_buffer_starved(vbasedev->name, >>> migration->load_buf_idx); >>> + qemu_cond_wait(&migration->load_bufs_buffer_ready_cond, >>> &migration->load_bufs_mutex); >>> +            continue; >>> +        } >>> + >>> +        if (migration->load_buf_idx == migration->load_buf_idx_last) { >>> +            break; >>> +        } >>> + >>> +        if (migration->load_buf_idx == 0) { >>> + trace_vfio_load_state_device_buffer_start(vbasedev->name); >>> +        } >>> + >>> +        if (lb->len) { >>> +            g_autofree char *buf = NULL; >>> +            size_t buf_len; >>> +            int errno_save; >>> + >>> + trace_vfio_load_state_device_buffer_load_start(vbasedev->name, >>> + migration->load_buf_idx); >>> + >>> +            /* lb might become re-allocated when we drop the lock */ >>> +            buf = g_steal_pointer(&lb->data); >>> +            buf_len = lb->len; >>> + >>> +            /* Loading data to the device takes a while, drop the >>> lock during this process */ >>> + qemu_mutex_unlock(&migration->load_bufs_mutex); >>> +            ret = write(migration->data_fd, buf, buf_len); >>> +            errno_save = errno; >>> + qemu_mutex_lock(&migration->load_bufs_mutex); >>> + >>> +            if (ret < 0) { >>> +                error_setg(errp, "write to state buffer %" PRIu32 " >>> failed with %d", >>> +                           migration->load_buf_idx, errno_save); >>> +                break; >>> +            } else if (ret < buf_len) { >>> +                error_setg(errp, "write to state buffer %" PRIu32 " >>> incomplete %zd / %zu", >>> +                           migration->load_buf_idx, ret, buf_len); >>> +                break; >>> +            } >>> + >>> + trace_vfio_load_state_device_buffer_load_end(vbasedev->name, >>> + migration->load_buf_idx); >>> +        } >>> + >>> +        assert(migration->load_buf_queued_pending_buffers > 0); >>> +        migration->load_buf_queued_pending_buffers--; >>> + >>> +        if (migration->load_buf_idx == migration->load_buf_idx_last >>> - 1) { >>> + trace_vfio_load_state_device_buffer_end(vbasedev->name); >>> +        } >>> + >>> +        migration->load_buf_idx++; >>> +    } >>> + >>> +    if (migration->load_bufs_thread_want_exit && >>> +        !*errp) { >>> +        error_setg(errp, "load bufs thread asked to quit"); >>> +    } >>> + >>> +    g_clear_pointer(&locker, qemu_lockable_auto_unlock); >>> + >>> +    qemu_loadvm_load_finish_ready_lock(); >>> +    migration->load_bufs_thread_finished = true; >>> +    qemu_loadvm_load_finish_ready_broadcast(); >>> +    qemu_loadvm_load_finish_ready_unlock(); >>> + >>> +    return NULL; >>> +} >>> + >>>   static int vfio_save_device_config_state(QEMUFile *f, void *opaque, >>>                                            Error **errp) >>>   { >>> @@ -285,6 +478,8 @@ static int >>> vfio_load_device_config_state(QEMUFile *f, void *opaque) >>>       VFIODevice *vbasedev = opaque; >>>       uint64_t data; >>> >>> + trace_vfio_load_device_config_state_start(vbasedev->name); >> >> Maybe split this and below trace_vfio_load_device_config_state_end to >> a separate patch? > > I guess you mean to add these trace points in a separate patch? > Can do. > >>> + >>>       if (vbasedev->ops && vbasedev->ops->vfio_load_config) { >>>           int ret; >>> >>> @@ -303,7 +498,7 @@ static int >>> vfio_load_device_config_state(QEMUFile *f, void *opaque) >>>           return -EINVAL; >>>       } >>> >>> -    trace_vfio_load_device_config_state(vbasedev->name); >>> + trace_vfio_load_device_config_state_end(vbasedev->name); >>>       return qemu_file_get_error(f); >>>   } >>> >>> @@ -687,16 +882,70 @@ static void vfio_save_state(QEMUFile *f, void >>> *opaque) >>>   static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp) >>>   { >>>       VFIODevice *vbasedev = opaque; >>> +    VFIOMigration *migration = vbasedev->migration; >>> +    int ret; >>> + >>> +    ret = vfio_migration_set_state(vbasedev, >>> VFIO_DEVICE_STATE_RESUMING, >>> + vbasedev->migration->device_state, errp); >>> +    if (ret) { >>> +        return ret; >>> +    } >>> + >>> +    assert(!migration->load_bufs); >>> +    migration->load_bufs = g_array_new(FALSE, TRUE, >>> sizeof(LoadedBuffer)); >>> +    g_array_set_clear_func(migration->load_bufs, loaded_buffer_clear); >>> + >>> +    qemu_mutex_init(&migration->load_bufs_mutex); >>> + >>> +    migration->load_bufs_device_ready = false; >>> + qemu_cond_init(&migration->load_bufs_device_ready_cond); >>> + >>> +    migration->load_buf_idx = 0; >>> +    migration->load_buf_idx_last = UINT32_MAX; >>> +    migration->load_buf_queued_pending_buffers = 0; >>> + qemu_cond_init(&migration->load_bufs_buffer_ready_cond); >>> + >>> +    migration->config_state_loaded_to_dev = false; >>> + >>> +    assert(!migration->load_bufs_thread_started); >> >> Maybe do all these allocations (and de-allocations) only if multifd >> device state is supported and enabled? >> Extracting this to its own function could also be good. > > Sure, will try to avoid unnecessarily allocating multifd device state > related things if this functionality is unavailable anyway. > >>> >>> -    return vfio_migration_set_state(vbasedev, >>> VFIO_DEVICE_STATE_RESUMING, >>> - vbasedev->migration->device_state, errp); >>> +    migration->load_bufs_thread_finished = false; >>> +    migration->load_bufs_thread_want_exit = false; >>> +    qemu_thread_create(&migration->load_bufs_thread, "vfio-load-bufs", >>> +                       vfio_load_bufs_thread, opaque, >>> QEMU_THREAD_JOINABLE); >> >> The device state save threads are manged by migration core thread >> pool. Don't we want to apply the same thread management scheme for >> the load flow as well? > > I think that (in contrast with the device state saving threads) > the buffer loading / reordering thread is an implementation detail > of the VFIO driver, so I don't think it really makes sense for multifd > code > to manage it. Hmm, yes I understand. Thanks.