From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from NAM10-BN7-obe.outbound.protection.outlook.com (mail-bn7nam10on2079.outbound.protection.outlook.com [40.107.92.79]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7862A1F542C for ; Wed, 8 Jan 2025 11:20:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.92.79 ARC-Seal:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736335258; cv=fail; b=Ns7rRK5J8Lw5bUiUNtPPQuqhr7y/z2vpIEKRQ4shlrXVo+UYjyuwpQr7LOC/+CBRP7b32xZPV9Z8M5JVzY04ulR6+DXWlv0Hy9d9CB8YODXLebiVGPa30EkLJ2rxel9A9iHZpbZ69flWU5bkI8DUOWGJhNMykwV/815M8u8/4Xc= ARC-Message-Signature:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736335258; c=relaxed/simple; bh=i9uCXC1+BLqDe0yJTauSwqsJeZpMEyh9Z1onyi4QOFY=; h=Message-ID:Date:Subject:To:Cc:References:From:In-Reply-To: Content-Type:MIME-Version; b=ovlESg9r8wKH/EmPHgkMTHTJ14UxsGKXU2LfAAEYJ9nlnPFmnSVR8sZEbykgiYGFbeWs2CU/6PsFCUctPUQs6Y97PveXjXPHHxVbd0PzpmVXrnF3qyUnIpyfQsvwKxbjCeK2dUDgZM5srVPLCtrZTi7NlsOblbXgtL6sAcvGE9I= ARC-Authentication-Results:i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=Ll4IlHuh; arc=fail smtp.client-ip=40.107.92.79 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="Ll4IlHuh" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=qyPmfGbj9YLjQKg6RR9i5gR5VeO7Hrn+bIo1TupiLWfz8oY2jIVsckznnPE6A+UOX698nwrhSqpafJl0C8qmHCbIYhgUp77+jEr+qb0oEdusxjtjkKnTlj07z2iHVK758gIrwIQYxssT+RG/wwm7frF5zt/8/XEsGibpV7buDrkXozqEj4cGchPRlytQJ/yzbl0RqFx0tvCnXEgCKJRZg18xMMSENNIaq6a/bWsZbdmDjo9JJymrL9zAlegMrDj5+z6j29UkIQog9/vGBMwpmTHDAgLg4moandfaZXSa4Zf0eOOJshY4y1RzuvPB18Ef+sLXn4Sv0lYp5oxhV4k5SQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=qUH9wDwTYXnfVYzswwnwJGvo4cvkK47TXya/U/vXF+k=; b=hOunpdsBthO+ejfFThwfQRLRIjB56PI2iddERPvYHJf6I4wD+OoHSJJykRzZka0JKqs5/JubJ+LOfQ90fVr5/WfAowk9GZXAaAJ8ZNmQY15q2RfcqP/MWTHS6PCZW0VMuGSQFTGUnqTHa/6OsC0D1HOUg2O7NYMExFFojoPxzpQl04XbM34pdyVHKktbTjY7jeFpk6vi9QOvuTMCVWBnx08Btkdy6ykTUVxo93QRA3Y+X+8c1ZHGcZQAWvVIKdJ5qCzMXkixEEZplsf2UH+IT8OfRsugJ8xYW4nskKaz6v0RJfw9ydbH0yHyfXkWU7hf47rAy8jim2VXMNSPGRUj1w== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=amd.com; dmarc=pass action=none header.from=amd.com; dkim=pass header.d=amd.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=qUH9wDwTYXnfVYzswwnwJGvo4cvkK47TXya/U/vXF+k=; b=Ll4IlHuhyyB5exyindsmMG8zSFLQGw0NTZltWDdakyymNbLtsU4tbMIiL+L5b3hpaBdI4WF49AkZvO9kwIjja52VjZs2Em+nUglmbVXn2a2sk3jq5MtSFLK8QRtGc3kKM0qtNRiFvYDcPsxkLmen36rBf+LkPGBLQWAypjja5iw= Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=amd.com; Received: from CH3PR12MB9194.namprd12.prod.outlook.com (2603:10b6:610:19f::7) by CY8PR12MB7490.namprd12.prod.outlook.com (2603:10b6:930:91::13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.8314.17; Wed, 8 Jan 2025 11:20:49 +0000 Received: from CH3PR12MB9194.namprd12.prod.outlook.com ([fe80::53fb:bf76:727f:d00f]) by CH3PR12MB9194.namprd12.prod.outlook.com ([fe80::53fb:bf76:727f:d00f%5]) with mapi id 15.20.8335.011; Wed, 8 Jan 2025 11:20:48 +0000 Message-ID: <219a4a7a-7c96-4746-9aba-ed06a1a00f3e@amd.com> Date: Wed, 8 Jan 2025 22:20:35 +1100 User-Agent: Mozilla Thunderbird Beta Subject: Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Content-Language: en-US To: Chenyi Qiang , David Hildenbrand , Paolo Bonzini , Peter Xu , =?UTF-8?Q?Philippe_Mathieu-Daud=C3=A9?= , Michael Roth Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J , Peng Chao P , Gao Chao , Xu Yilun References: <20241213070852.106092-1-chenyi.qiang@intel.com> <20241213070852.106092-3-chenyi.qiang@intel.com> <80ac1338-a116-48f5-9874-72d42b5b65b4@intel.com> From: Alexey Kardashevskiy In-Reply-To: <80ac1338-a116-48f5-9874-72d42b5b65b4@intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-ClientProxiedBy: TP0P295CA0044.TWNP295.PROD.OUTLOOK.COM (2603:1096:910:4::6) To CH3PR12MB9194.namprd12.prod.outlook.com (2603:10b6:610:19f::7) Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: CH3PR12MB9194:EE_|CY8PR12MB7490:EE_ X-MS-Office365-Filtering-Correlation-Id: 7f3dddb8-83e2-491c-e4f6-08dd2fd680a8 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|7416014|1800799024|366016|376014; X-Microsoft-Antispam-Message-Info: =?utf-8?B?OG1za2tFaDYzRXN1Slp2WTh3d25Rd0ZUeDlsVUFTbGpHRDlDMFlBTTVUcmk2?= =?utf-8?B?ZlhyV3hqSGNoSXhsYnhQc005ZmgrUjJ5L2ZieUdMSDQ3YmV1NHgxU2tIa1Ar?= =?utf-8?B?OU93RjFabHZST3Zja1p5em5yUDV3N3VjS2Y5Q09XUkFOSFI1YlRJeGtpTUhO?= =?utf-8?B?MjFTTHlhWURNR0ZSbmlGU1lEc0tlRzFZbS91bmZwdW5ObDAxNzVVdGIxVjdF?= =?utf-8?B?c29NdmtWNjVvYVhvcnhiYjFvVlUzNXNTMmRxUy91c29tMGV4NGxJUnhLQzFl?= =?utf-8?B?YVZaa0haRzFFN0VUSE5BYTZyTnJheHVmT3U1MFVldnNFQXJ3cjJncnlBK3BH?= =?utf-8?B?dUltNjUvZUVJMVJSOUh2bmVvTGR5T0RlVHdVN1Y3M0FPQTJWYThUQjRUY3U5?= =?utf-8?B?S2NQS1F2L3Njc1hJVUlaa2RScXMyY0padXBFTnFqaGFrNUF0V25RNDRjMXZM?= =?utf-8?B?ZW85NDg5VndMcXMyL29WQjUxVzhTRExJTDhNb2gyNWpMU3JvekpOMHU2dWl2?= =?utf-8?B?eDJ1ZUpxbGlCV0VWN0wwT0dMN2ZGY0hrQWRkcEZKT0tZWGg4NGFGYWdzNDFR?= =?utf-8?B?cm1EK2ZSSUs4VHFpUEJYcWNrdXNLdHpVZGRMQXU0TkJFN2Yxc3hzQm83ZDhN?= =?utf-8?B?UGtZL0tYUm9JQjlmNDN4Nm5Bb2htMDkxakhDSGhVNkFnUVRCVGJwaVRkbkFV?= =?utf-8?B?WlRESXhXSStxdU9KNFo2dzc2TnBJRjVXb0FQMlIyQjZ1ZTliNGFMMk9RcWgy?= =?utf-8?B?ZDgrclBvaW0xNDJXVzlGdko2YnhtUElSaW0vMHgyUTl6UC8yL0d5Z2FpTlN3?= =?utf-8?B?WW9aTllPS2puZDQ5SVZDRDdRdldPd1hyVFVZZndxZ0ZxZ1UyME9zTmlMeGVN?= =?utf-8?B?OEVsT2Nxd0ptZS9FdFpRMXV2akh5c0JiMUdOeWFNWkpwR3V3VzNRSFRuZk5n?= =?utf-8?B?NGFtYTc0YlVrOTFYV09TelhJZlMwTW9oeXNvT2wvN0JpTDl0S1lZbURFc0VG?= =?utf-8?B?NjVRNXVKQ2l3RG1jT3B5N0NtTDZEKzZLeVlBb3hsOFdWSS80SnllMm9xQnYv?= =?utf-8?B?RHpPendRcGh6NTlLZkR0MGVMV0VPQTRQVWg2RC9yVjhTeU12NFVKU0lCWmVH?= =?utf-8?B?UldDZTZMQ1VlQ1pFZDRvNlBvNUJuZjRHQU9TcEx1UnlCRXBnNkNiNkJmUzdq?= =?utf-8?B?YjBXUVk4TEdhekp4ZFcyb0F5OWFSODBFNC91M0NkOFZjdGIwL2tWR0FXVXgr?= =?utf-8?B?SmE3eHQvNkVvZCt3cUM3eFRvdExCdGgyS1o2TjdqazgydzdxanRlUXM3a2VI?= =?utf-8?B?alFicUgyYnl5VFM0RFZRTGI5eWVGak50MElZTW4rSWNDQmViWEl6NHNhTzdZ?= =?utf-8?B?R25qZ2dkd0krcEloaFhOcEtYUm8waitybWhPS2lWMzljL3YyQjY0T1ZWVy9C?= =?utf-8?B?ZVVHNWFqNWwxTSt2NG9veDRPYkZSRmZld0x2OFczYWFYV3JwdmhsRDBkZWZJ?= =?utf-8?B?amQzRTR0dlAyNjRNUS9HdTZJQithd1BwM3JNa1RFNXk4S0NqWWJwQ0RMZHZn?= =?utf-8?B?WHFaek1jMGlwSkxzUDhUWExzbER0WDRhOENJT1Q4ODVYaDFiSFF6VS83TzBt?= =?utf-8?B?eW1saW9oYmR1dUFUeUZEZER3TWdETCs4em9LdHA4a2R6RWRvSTFnWExrbHNp?= =?utf-8?B?UUZZeE1LL0tnVTYrOUNMVzNldS8weTZtMkpIbk5ySTVrSzhtMS9hQ1BEMFQw?= =?utf-8?B?RXc5UTllUXVjb1d1NFRiMU1uSTdYamZsU3IvMWlRUWdkZXZhUUkxUFpMM3Va?= =?utf-8?B?V2tmeGFSNmxaN0ptRGsydXl1OWdPMFZiSXJ4RnBiMWNpTzJscVJRWmhvKzdp?= =?utf-8?Q?Xj70ig+vArG1i?= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:CH3PR12MB9194.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230040)(7416014)(1800799024)(366016)(376014);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?bmJmUHlOQjlvdHZsUVNsSXRYcFlERHNVSkxPMGw3MDBRcmt6VkZHR2s3TDli?= =?utf-8?B?Ym1iQTdnUmVjQWUyWlIzQ2hEREtSM1NLMVkvUmIwVStMYytjMjc2WHlpMXhX?= =?utf-8?B?eFNwZi9NcEZTTFRtQU83UWE5WStDalJkV0o4Uk1FWWNwTjN4WmhHNEQyaGZ5?= =?utf-8?B?YTFmYy9CVEJsR2JrVmlnUndNejEySkd1d2Y1YjhQTjZ4RkhRY1dYcWdLS0dV?= =?utf-8?B?MjN2dHFyVU5XWFBrU2IzMzdDMTVuWFdkREdBaklHZHZGaGcraWJOQnE1SUgx?= =?utf-8?B?QmZlQ3gxMkZ4NmhaT2VPbHlqbS9lLzZJZXpYbVkycDZrVFc4czZKK1dRNEw1?= =?utf-8?B?b1IwbE1HL01mb096WU91OSs3a3NxUzhtS0t2T2x4dzFrZTNIN0N0Qkh0MGls?= =?utf-8?B?MEkvTkVvYUIrcTY4TmFBMkNzN2UrYjF2YVhzRW40RjJaSjFheXlxWEt4akRu?= =?utf-8?B?Tkc4SWY4dld3OXZTb0t6R2RrS2FsbW1LenkxcyticXpHQWtGWVJOaGFzc2sz?= =?utf-8?B?RUVYM3c4ZkxZdEwxZmlqOXdURUFJdFVLN3VjSlRZempkTXlNS0VTdndJa2VW?= =?utf-8?B?Y1BSSUo1aDU3MDdqbGR3L3FPUU1RdFkwOUdOYVZZazJxeHNwMEhXMG1LOWZG?= =?utf-8?B?UFN3dkZzT3RvMTNZQzB1NGFjM0JkbnVKWGR0ZlhFZi8xTnhVSGZqZUdhRFF3?= =?utf-8?B?dUdkUHBsKzJ5N21pMmhPbVAxWndNNVNYaC9YbmROWVVMS1QrYnZIK0x6Mk92?= =?utf-8?B?ZmJ3TUliSFllL3B1MGp4eGYxUjRCNWVHMjd2eVBScXlwWUF5UUc2N1NQNUQv?= =?utf-8?B?OEpEYUwzK3hXcE5nelRVQm9zVXZkaDN6U3UyWE82bk9aUnAxZ1I5ZE4xd1Vp?= =?utf-8?B?RTljdCtCTCs5d1RLQ25XenFDRXdWblpRUHlQeUE1dXJubnpiRFY2ZmUvR24v?= =?utf-8?B?SUc5RkhpWHBiVVFtc3ZaRklBT3pMQWxDdnozWnFqbXlEeFZlRjRWTmZPOTVG?= =?utf-8?B?UDhlVGxUQTFvckNLNVB3M0IvZG8vcUhCckYyWFNIeGNKbFBHM2JoRHBCNmIr?= =?utf-8?B?MzJwV1JqaGR0T0dFVk5CY2dWdnZNc1ZjWEtiWUxMR29HK09yN3JtUjNJbW5v?= =?utf-8?B?L3p4VWJQRjh0STdXbExETFpSRjdrL1k2YlRrUVNmSFQxMmtFNk9UdGN2K251?= =?utf-8?B?bDdPalNhSWZ6VlE3Qkc1aVl6V2FOcVkrUG5ySlkyRzhhYW5NZWRlZ3FjMmVK?= =?utf-8?B?Mk5KUmFkV3kwcWNzVEl3cmVYenR5RE0xRmVoRE1td0ROVUs0aFNkU1Nzb1Ry?= =?utf-8?B?cWtvMUhIbnhUZEpCem11V2VlblVVOFNXOVRGTFNrR1JFNThCSUQvUHdoTm00?= =?utf-8?B?dEVDcDEyUFpMMHdIV09YUjc0ZUY3Q2dsSXhlcnMzRlJVVTl6RVAvTWJ1S2Vq?= =?utf-8?B?Wm1pdk9NbURBRjJWenJ4bFB2V3cvaVI3Q2hRdmdRSU1kL1N4NmVxRE5EWUhD?= =?utf-8?B?V2hvUnpYeHRlMHRIa2t6eFNpWkRBcEkwOThLSloxZWJlZ3BOT1M0bTdha29h?= =?utf-8?B?SnVLV1dhZ0dWeVY1dml3b2grZFU1UnRwTEZLNjd6VjErbFRyR1ZWU0lLdWln?= =?utf-8?B?Wk5SMEtyTzFabEkrbTk4QnVQMW1NaktqemdDUWFUS0tUS2crK1BVZXBvWVFh?= =?utf-8?B?R2l1Z1FLbUlLakIrOFFhY1VNSnhRdENTL2lEeHVYdDU4SGpIYUhRamY0N1Fi?= =?utf-8?B?ZXFyNG5aaDVxejZzTExaNUEzYldKNnVnMHF1SXM3WnEyRUpzZFhFOGxXQTNa?= =?utf-8?B?ZVg4N1VuYUNDN1lhQUw2YkU0dzNzZmtMdDYzdVF5bmNFQldML3VYaHZjNUNl?= =?utf-8?B?czNnQ2dzREx4Ym1CN21JL1BOYkhITjVVN2EyckxldjJLYUM4bmt1RXZzWm9I?= =?utf-8?B?USs5eEEvYVh0YkRUSEkwaDJua3dEekE0eHlqdDhobVRJSnoyT2FEQjk5eVlQ?= =?utf-8?B?SjZWSGJiOTF3cGI1ZGFtdHhNd0xlN3plK21jMlpVN05lcklteHh6SFptT3ZF?= =?utf-8?B?T1VMV3BETmgwK3k4Tk5GOEdqTVpMYmNNZzdRVSt6V1FielJkelc2ZGxySnhN?= =?utf-8?Q?s+GBrseS+XuCSPSiyjH5anq+I?= X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-Network-Message-Id: 7f3dddb8-83e2-491c-e4f6-08dd2fd680a8 X-MS-Exchange-CrossTenant-AuthSource: CH3PR12MB9194.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 08 Jan 2025 11:20:48.5998 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: 8zMl2N7X84R/6suhiFuHe/irAFBccqz7GdNP62aJxC0d81JeSMJoljG6SH6NBGpcc/4vIStJzGuCWiepR7GPog== X-MS-Exchange-Transport-CrossTenantHeadersStamped: CY8PR12MB7490 On 8/1/25 21:56, Chenyi Qiang wrote: > > > On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote: >> On 13/12/24 18:08, Chenyi Qiang wrote: >>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require >>> uncoordinated discard") highlighted, some subsystems like VFIO might >>> disable ram block discard. However, guest_memfd relies on the discard >>> operation to perform page conversion between private and shared memory. >>> This can lead to stale IOMMU mapping issue when assigning a hardware >>> device to a confidential VM via shared memory (unprotected memory >>> pages). Blocking shared page discard can solve this problem, but it >>> could cause guests to consume twice the memory with VFIO, which is not >>> acceptable in some cases. An alternative solution is to convey other >>> systems like VFIO to refresh its outdated IOMMU mappings. >>> >>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust >>> VFIO mappings in relation to VM page assignment. Effectively page >>> conversion is similar to hot-removing a page in one mode and adding it >>> back in the other, so the similar work that needs to happen in response >>> to virtio-mem changes needs to happen for page conversion events. >>> Introduce the RamDiscardManager to guest_memfd to achieve it. >>> >>> However, guest_memfd is not an object so it cannot directly implement >>> the RamDiscardManager interface. >>> >>> One solution is to implement the interface in HostMemoryBackend. Any >> >> This sounds about right. >> >>> guest_memfd-backed host memory backend can register itself in the target >>> MemoryRegion. However, this solution doesn't cover the scenario where a >>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g. >>> the virtual BIOS MemoryRegion. >> >> What is this virtual BIOS MemoryRegion exactly? What does it look like >> in "info mtree -f"? Do we really want this memory to be DMAable? > > virtual BIOS shows in a separate region: > > Root memory region: system > 0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM > ... > 00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM Looks like a normal MR which can be backed by guest_memfd. > 0000000100000000-000000017fffffff (prio 0, ram): pc.ram > @0000000080000000 KVM Anyway if there is no guest_memfd backing it and memory_region_has_ram_discard_manager() returns false, then the MR is just going to be mapped for VFIO as usual which seems... alright, right? > We also consider to implement the interface in HostMemoryBackend, but > maybe implement with guest_memfd region is more general. We don't know > if any DMAable memory would belong to HostMemoryBackend although at > present it is. > > If it is more appropriate to implement it with HostMemoryBackend, I can > change to this way. Seems cleaner imho. >> >> >>> Thus, choose the second option, i.e. define an object type named >>> guest_memfd_manager with RamDiscardManager interface. Upon creation of >>> guest_memfd, a new guest_memfd_manager object can be instantiated and >>> registered to the managed guest_memfd MemoryRegion to handle the page >>> conversion events. >>> >>> In the context of guest_memfd, the discarded state signifies that the >>> page is private, while the populated state indicated that the page is >>> shared. The state of the memory is tracked at the granularity of the >>> host page size (i.e. block_size), as the minimum conversion size can be >>> one page per request. >>> >>> In addition, VFIO expects the DMA mapping for a specific iova to be >>> mapped and unmapped with the same granularity. However, the confidential >>> VMs may do partial conversion, e.g. conversion happens on a small region >>> within a large region. To prevent such invalid cases and before any >>> potential optimization comes out, all operations are performed with 4K >>> granularity. >>> >>> Signed-off-by: Chenyi Qiang >>> --- >>>   include/sysemu/guest-memfd-manager.h |  46 +++++ >>>   system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++ >>>   system/meson.build                   |   1 + >>>   3 files changed, 297 insertions(+) >>>   create mode 100644 include/sysemu/guest-memfd-manager.h >>>   create mode 100644 system/guest-memfd-manager.c >>> >>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/ >>> guest-memfd-manager.h >>> new file mode 100644 >>> index 0000000000..ba4a99b614 >>> --- /dev/null >>> +++ b/include/sysemu/guest-memfd-manager.h >>> @@ -0,0 +1,46 @@ >>> +/* >>> + * QEMU guest memfd manager >>> + * >>> + * Copyright Intel >>> + * >>> + * Author: >>> + *      Chenyi Qiang >>> + * >>> + * This work is licensed under the terms of the GNU GPL, version 2 or >>> later. >>> + * See the COPYING file in the top-level directory >>> + * >>> + */ >>> + >>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H >>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H >>> + >>> +#include "sysemu/hostmem.h" >>> + >>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager" >>> + >>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, >>> GUEST_MEMFD_MANAGER) >>> + >>> +struct GuestMemfdManager { >>> +    Object parent; >>> + >>> +    /* Managed memory region. */ >> >> Do not need this comment. And the period. > > [...] > >> >>> +    MemoryRegion *mr; >>> + >>> +    /* >>> +     * 1-setting of the bit represents the memory is populated (shared). >>> +     */ > > Will fix it. > >> >> Could be 1 line comment. >> >>> +    int32_t bitmap_size; >> >> int or unsigned >> >>> +    unsigned long *bitmap; >>> + >>> +    /* block size and alignment */ >>> +    uint64_t block_size; >> >> unsigned? >> >> (u)int(32|64)_t make sense for migrations which is not the case (yet?). >> Thanks, > > I think these fields would be helpful for future migration support. > Maybe defining as this way is more straightforward. > >> >>> + >>> +    /* listeners to notify on populate/discard activity. */ >> >> Do not really need this comment either imho. >> > > I prefer to provide the comment for each field as virtio-mem do. If it > is not necessary, I would remove those obvious ones. [bikeshedding on] But the "RamDiscardListener" word says that already, why repeating? :) It should add information, not duplicate. Like the block_size comment which mentions "alignment" [bikeshedding off] >>> +    QLIST_HEAD(, RamDiscardListener) rdl_list; >>> +}; >>> + >>> +struct GuestMemfdManagerClass { >>> +    ObjectClass parent_class; >>> +}; >>> + >>> +#endif > > [...] > >            void *arg, >>> + >>> guest_memfd_section_cb cb) >>> +{ >>> +    unsigned long first_one_bit, last_one_bit; >>> +    uint64_t offset, size; >>> +    int ret = 0; >>> + >>> +    first_one_bit = section->offset_within_region / gmm->block_size; >>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>> first_one_bit); >>> + >>> +    while (first_one_bit < gmm->bitmap_size) { >>> +        MemoryRegionSection tmp = *section; >>> + >>> +        offset = first_one_bit * gmm->block_size; >>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size, >>> +                                          first_one_bit + 1) - 1; >>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size; >> >> This tries calling cb() on bigger chunks even though we say from the >> beginning that only page size is supported? >> >> May be simplify this for now and extend if/when VFIO learns to split >> mappings,  or  just drop it when we get in-place page state convertion >> (which will make this all irrelevant)? > > The cb() will call with big chunks but actually it do the split with the > granularity of block_size in the cb(). See the > vfio_ram_discard_notify_populate(), which do the DMA_MAP with > granularity size. Right, and this all happens inside QEMU - first the code finds bigger chunks and then it splits them anyway to call the VFIO driver. Seems pointless to bother about bigger chunks here. > >> >> >>> + >>> +        if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> +            break; >>> +        } >>> + >>> +        ret = cb(&tmp, arg); >>> +        if (ret) { >>> +            break; >>> +        } >>> + >>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>> +                                      last_one_bit + 2); >>> +    } >>> + >>> +    return ret; >>> +} >>> + >>> +static int guest_memfd_for_each_discarded_section(const >>> GuestMemfdManager *gmm, >>> +                                                  MemoryRegionSection >>> *section, >>> +                                                  void *arg, >>> + >>> guest_memfd_section_cb cb) >>> +{ >>> +    unsigned long first_zero_bit, last_zero_bit; >>> +    uint64_t offset, size; >>> +    int ret = 0; >>> + >>> +    first_zero_bit = section->offset_within_region / gmm->block_size; >>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size, >>> +                                        first_zero_bit); >>> + >>> +    while (first_zero_bit < gmm->bitmap_size) { >>> +        MemoryRegionSection tmp = *section; >>> + >>> +        offset = first_zero_bit * gmm->block_size; >>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>> +                                      first_zero_bit + 1) - 1; >>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size; >>> + >>> +        if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> +            break; >>> +        } >>> + >>> +        ret = cb(&tmp, arg); >>> +        if (ret) { >>> +            break; >>> +        } >>> + >>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm- >>>> bitmap_size, >>> +                                            last_zero_bit + 2); >>> +    } >>> + >>> +    return ret; >>> +} >>> + >>> +static uint64_t guest_memfd_rdm_get_min_granularity(const >>> RamDiscardManager *rdm, >>> +                                                    const >>> MemoryRegion *mr) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>> + >>> +    g_assert(mr == gmm->mr); >>> +    return gmm->block_size; >>> +} >>> + >>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm, >>> +                                              RamDiscardListener *rdl, >>> +                                              MemoryRegionSection >>> *section) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>> +    int ret; >>> + >>> +    g_assert(section->mr == gmm->mr); >>> +    rdl->section = memory_region_section_new_copy(section); >>> + >>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next); >>> + >>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl, >>> + >>> guest_memfd_notify_populate_cb); >>> +    if (ret) { >>> +        error_report("%s: Failed to register RAM discard listener: >>> %s", __func__, >>> +                     strerror(-ret)); >>> +    } >>> +} >>> + >>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm, >>> +                                                RamDiscardListener *rdl) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>> +    int ret; >>> + >>> +    g_assert(rdl->section); >>> +    g_assert(rdl->section->mr == gmm->mr); >>> + >>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl, >>> + >>> guest_memfd_notify_discard_cb); >>> +    if (ret) { >>> +        error_report("%s: Failed to unregister RAM discard listener: >>> %s", __func__, >>> +                     strerror(-ret)); >>> +    } >>> + >>> +    memory_region_section_free_copy(rdl->section); >>> +    rdl->section = NULL; >>> +    QLIST_REMOVE(rdl, next); >>> + >>> +} >>> + >>> +typedef struct GuestMemfdReplayData { >>> +    void *fn; >> >> s/void */ReplayRamPopulate/ > > [...] > >> >>> +    void *opaque; >>> +} GuestMemfdReplayData; >>> + >>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection >>> *section, void *arg) >>> +{ >>> +    struct GuestMemfdReplayData *data = arg; >> >> Drop "struct" here and below. > > Fixed. Thanks! > >> >>> +    ReplayRamPopulate replay_fn = data->fn; >>> + >>> +    return replay_fn(section, data->opaque); >>> +} >>> + >>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager >>> *rdm, >>> +                                            MemoryRegionSection >>> *section, >>> +                                            ReplayRamPopulate replay_fn, >>> +                                            void *opaque) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = >>> opaque }; >>> + >>> +    g_assert(section->mr == gmm->mr); >>> +    return guest_memfd_for_each_populated_section(gmm, section, &data, >>> + >>> guest_memfd_rdm_replay_populated_cb); >>> +} >>> + >>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection >>> *section, void *arg) >>> +{ >>> +    struct GuestMemfdReplayData *data = arg; >>> +    ReplayRamDiscard replay_fn = data->fn; >>> + >>> +    replay_fn(section, data->opaque); >> >> >> guest_memfd_rdm_replay_populated_cb() checks for errors though. > > It follows current definiton of ReplayRamDiscard() and > ReplayRamPopulate() where replay_discard() doesn't return errors and > replay_populate() returns errors. A trace would be appropriate imho. Thanks, >> >>> + >>> +    return 0; >>> +} >>> + >>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager >>> *rdm, >>> +                                             MemoryRegionSection >>> *section, >>> +                                             ReplayRamDiscard replay_fn, >>> +                                             void *opaque) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = >>> opaque }; >>> + >>> +    g_assert(section->mr == gmm->mr); >>> +    guest_memfd_for_each_discarded_section(gmm, section, &data, >>> + >>> guest_memfd_rdm_replay_discarded_cb); >>> +} >>> + >>> +static void guest_memfd_manager_init(Object *obj) >>> +{ >>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj); >>> + >>> +    QLIST_INIT(&gmm->rdl_list); >>> +} >>> + >>> +static void guest_memfd_manager_finalize(Object *obj) >>> +{ >>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap); >> >> >> bitmap is not allocated though. And 5/7 removes this anyway. Thanks, > > Will remove it. Thanks. > >> >> >>> +} >>> + >>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data) >>> +{ >>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc); >>> + >>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity; >>> +    rdmc->register_listener = guest_memfd_rdm_register_listener; >>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener; >>> +    rdmc->is_populated = guest_memfd_rdm_is_populated; >>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated; >>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded; >>> +} >>> diff --git a/system/meson.build b/system/meson.build >>> index 4952f4b2c7..ed4e1137bd 100644 >>> --- a/system/meson.build >>> +++ b/system/meson.build >>> @@ -15,6 +15,7 @@ system_ss.add(files( >>>     'dirtylimit.c', >>>     'dma-helpers.c', >>>     'globals.c', >>> +  'guest-memfd-manager.c', >>>     'memory_mapping.c', >>>     'qdev-monitor.c', >>>     'qtest.c', >> > -- Alexey