From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from NAM02-BN1-obe.outbound.protection.outlook.com (mail-bn1nam02on2067.outbound.protection.outlook.com [40.107.212.67]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 28BCEBE5E for ; Thu, 9 Jan 2025 02:55:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.212.67 ARC-Seal:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736391343; cv=fail; b=GM1K3HAhrjKOe+yYyuybv/WJxGdYJYWwZ+95tF2tkckepj1gCWhg8+lCPs/LanPga3LV+iI3LbWwHIF6AqZL40hVaAi1sQCqbQycwn7jDOBkW7/F02Ho6i07+8zdeEgefa1h104+QlTWOD7f09u1vmxohGEFZabVQ1ZCVLGG094= ARC-Message-Signature:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736391343; c=relaxed/simple; bh=vLfiiD0uwrVDRlb661BwWf4oFhU9fUY2TmSLuFtBZiQ=; h=Message-ID:Date:Subject:To:Cc:References:From:In-Reply-To: Content-Type:MIME-Version; b=UqkuKABEKdqQCWrcYzxH0NCqj7it/RaTMaGiXzqmWz5TmjFB+r2lsAbAkXUIIXMV6RN8XPaP4qHUQjAgNlYmA4AiGX227NFMgaLWTDWWvYzjapjqZQAdYPtZIwSIuE95FizC/ut5bG1bpJaUPCH7pPhr7kMKsStDLkYwhRcF6wE= ARC-Authentication-Results:i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=3nlMNW8S; arc=fail smtp.client-ip=40.107.212.67 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="3nlMNW8S" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=xvumNc1lVqv6fYaa6khqkk0KBPM9MddribSdxU3WWQF7Jso8qa52OKRoPCPbcRyz7Fq3Lry6vEiM5XkTKjQsnajNenF3qhUz+tcCWkiLb0UE8EJwuM0Q5PsTbHMKZQ40UsmL8KT+g/QcYhn7hxdKsY2em69yZ0RFio4Kplsf39CcDaOdjo0K5sjTFSIgI3hOncQeQlsQuYTSz4RL33UZt4efF64g4mBMxahYoQNgtfNhOVUUuZYpiY326SVHCJkV02xCerJL2F2i7hU/QzacbT6XnT6TwEDswB708Qd2+c+5CPbzn5sVWN/RB0OamnHBRHWSzUJRV5wltba8bAnzcg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=kkg/Qwmp/oUrenKBqRyP8kauzDVrHED3hvt+UOCMzDM=; b=lSEkKklj/ZJB00q9lsc/LVWcmJB7+EqrFhDVBrlS1Xyp5JxQDG32VympPi09J7SputWUudIVyuZ25O4U/GEi3FxuRoXN1oFAtz4rcsyALtGtt5Cu+6dAnF4JXHe1um47S60Jhk77ojfI5E9Xyz/ceuNq/d6lGyTX6izmReyb2uwfoxDhYP54Jhv1yFRH0Sr633GkFwRW8rrevlvlJDZnqw3Jqn05IewfXh8V3AV79unnacGjaCOrOBUinF4CcJ6ESD6QYqmUGn2xHj9xOosTlp4FnlFMj6Wa3TVcdOp6XHK2+j3b9tHLL3sT77dQR3wcMjl0c4dvYJntgx5A6Fk6Dg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=amd.com; dmarc=pass action=none header.from=amd.com; dkim=pass header.d=amd.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=kkg/Qwmp/oUrenKBqRyP8kauzDVrHED3hvt+UOCMzDM=; b=3nlMNW8SLgL44B38AS6678dD0ZLfAaFdMHn0f3A8H/QrsBf+f/eGKmWTiX3zZIYRNiZFMJWUpI8Mp99u0M5TPwUjtn5jXsXNk6QDjJokBoBV9BzlKIuENg7xJR/dsSelf6ylGDQtrOvTs85JoZVRM6ARF8CORfldq+XtOGLISOw= Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=amd.com; Received: from CH3PR12MB9194.namprd12.prod.outlook.com (2603:10b6:610:19f::7) by CH3PR12MB7500.namprd12.prod.outlook.com (2603:10b6:610:148::17) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.8335.10; Thu, 9 Jan 2025 02:55:37 +0000 Received: from CH3PR12MB9194.namprd12.prod.outlook.com ([fe80::53fb:bf76:727f:d00f]) by CH3PR12MB9194.namprd12.prod.outlook.com ([fe80::53fb:bf76:727f:d00f%5]) with mapi id 15.20.8335.011; Thu, 9 Jan 2025 02:55:37 +0000 Message-ID: <8c8e024d-03dc-4201-8038-9e9e60467fad@amd.com> Date: Thu, 9 Jan 2025 13:55:28 +1100 User-Agent: Mozilla Thunderbird Beta Subject: Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager Content-Language: en-US To: Chenyi Qiang , David Hildenbrand , Paolo Bonzini , Peter Xu , =?UTF-8?Q?Philippe_Mathieu-Daud=C3=A9?= , Michael Roth Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J , Peng Chao P , Gao Chao , Xu Yilun References: <20241213070852.106092-1-chenyi.qiang@intel.com> <20241213070852.106092-3-chenyi.qiang@intel.com> <80ac1338-a116-48f5-9874-72d42b5b65b4@intel.com> <219a4a7a-7c96-4746-9aba-ed06a1a00f3e@amd.com> <58b96b74-bf9c-45d3-8c2e-459ec2206fc8@intel.com> From: Alexey Kardashevskiy In-Reply-To: <58b96b74-bf9c-45d3-8c2e-459ec2206fc8@intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-ClientProxiedBy: MEVPR01CA0075.ausprd01.prod.outlook.com (2603:10c6:220:201::11) To CH3PR12MB9194.namprd12.prod.outlook.com (2603:10b6:610:19f::7) Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: CH3PR12MB9194:EE_|CH3PR12MB7500:EE_ X-MS-Office365-Filtering-Correlation-Id: 12ce79aa-405c-4c82-fec0-08dd305917fc X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|1800799024|376014|7416014|366016; X-Microsoft-Antispam-Message-Info: =?utf-8?B?WmJVTkY4akZuR0VBa1dIQ0RwOHhPbzlPQ0JrZjJINHhHemNYRlNKWkE0UlRH?= =?utf-8?B?UzJGQ0tzV0NiWjFYRGRBZVlVSE0xNVFjZW9ObnduSzIrd2N1M1hXWlUrUkRT?= =?utf-8?B?bnBNQ2xKSnRRcGV1UXhHa0ZiNCtQWmN2ZUd1WFZzcUVCSlREVTEvME5teFNs?= =?utf-8?B?N0NKT2VWcG5ZYk1UbFRCRUhlc2hkcDNBNjlNNTB2QmJaWUlDZmtaSnZsMkY0?= =?utf-8?B?MFVqZjZ2NGx6Y1JNSHpkdjA2V0VBeE5sTGxVU01iS0g2c0pnQkFLVXZpYVF3?= =?utf-8?B?by9nbnphS1lIK1Y2cUdUR3M2Q3B3aDd3UkpaRzVFREI0K2JaL0F3TkpZMzBw?= =?utf-8?B?SkdYZWlkRjhpd0ZDZVZveTdlYldseFJ5Zm9WY1E3bzdCU3VLV0ZKdk9yS2VB?= =?utf-8?B?bDJSUkNjNlVKZkhkZVgvS25BQTRVMzBhdk5DUTFLTE5DcEpOa0hrcFdpOWx6?= =?utf-8?B?djRtdlB2aTFTOU9QWjRpUFkrOXl5TGdXazVYaDBJb2ltZ2pLZHdvTklUN1NY?= =?utf-8?B?YjZubHNKZlBOdXBubEVRZTNzajNEU3hMMVI0UFBIMURKRjFJTHczRmtjOEFa?= =?utf-8?B?REhKdnFrcXczbmxyMHE3K0xHQkVHeXg2bU8wenYwZXBya0Mwa2duRmp5U3p3?= =?utf-8?B?ZzQ1TWgyOHFlRllyQUNScGlvdnpoZWZObC8xdGZtUVZqanVQckNGaTlOREsw?= =?utf-8?B?VHRhcVdBcUc4ZzlkUmk3V29CZ0VmUUdHbUJyd2JqRlZ2M1pidmsxSktVbnF0?= =?utf-8?B?dmIwR3Q1MDRLQ3RIYjN0Nmx6ZFpueXIwQzJEYzVDOTBWWklhejZNUUx3VS9U?= =?utf-8?B?V1VKbkhxL0htS0N4UWNDSE1QZERFVlowSHV6ZG9WZ1hOUGI0Z0k2SFhDMDh0?= =?utf-8?B?SDRwV2J5Yy9HSHpQeXhtQWFOSXBucmRLNm9vV0FPZ0QrLzVmUVZ6TG15Undi?= =?utf-8?B?WU9oMEwyeUxNU3NnYzZsbXhKWmh0UnUyRkVxNW9OWGlMVWx2akNVU1JmMmVN?= =?utf-8?B?WnpFUHh6MXNGVG8xTGpPOS9JTEtaZ3g3U2FrVkdaTVBRd2lnTnlJa0wyWE5l?= =?utf-8?B?SE5QbmplNmRBTTlFOEVtelhxdzVybW5MVENvTzZnNmRqa0dBSnE0Q2sySTdC?= =?utf-8?B?R1VLMUFrTnVqVkRHcjI0Y2tWNU5zRG41VDFsSzZSZ29TVnFMTVZEcXJMcm84?= =?utf-8?B?SWxuclJzUkNpb0dXOEQxR2xTMW5LLzRzUTE1eEVVMnhnNmE5NUR4TkYzRFlN?= =?utf-8?B?bnBSdTN1aUErbGRUWE1xTWlmTnFyY2YyZDRiWGQyS0tvUFZNbnFtUVpqM24v?= =?utf-8?B?QjR0Z0JEVFZ4VnlVbTI4dFROUFJaRFJMa00xK2JZLzBSYWdSOW5tZW4yS2x5?= =?utf-8?B?RTBzTU9qajNINWxMejV3R3ltTUwyenZVQ2pjR2NCQ2RHQXArLytyOVNmZ1N5?= =?utf-8?B?eTErQVZVamMyQVJJbG1UYno0N0kvSkl3YWhLcDMwd3lqNVJGTTdOb3M5eWFv?= =?utf-8?B?TWRtaGtpUVIrK2J6aFNtOWNDNHZlRXJlbGtnTEVmUDlOQWJrbE1uZUVmYzBu?= =?utf-8?B?Y2MwVG5zeTV1UFcxQTNNQytaSmpxOU12OHZOOVhWSjJLSkY4cHJtYng2OTJO?= =?utf-8?B?eDkyUXlCd1hUSXVLUmtqalI4Ni9ITEtIcHVyMDY4YUQ4L2FjTUV1VzNBcnVo?= =?utf-8?B?cTVFOFZKWDZ3QjZncUpHNzNiU2tQQW1ZaTNuNE9MMUJDOEpsZWJYYW9FS1Np?= =?utf-8?B?SWw0ck5uaUNWb29UWFVhUlA4L2xvMGU4Tmk5WkhRNHY2czl2VXlVeEI4QlBO?= =?utf-8?B?SDRZVUU2NFRONmE2eVhaYUg3TWpmKzFycGpFa1ZVK0tYNnVQUVlXMmQ1L0I5?= =?utf-8?Q?ZJO4iNIRr0Sx2?= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:CH3PR12MB9194.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230040)(1800799024)(376014)(7416014)(366016);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?N29RcWtTMXBnTTA4dzYxNFVDQTVqVjB0NDNqSVE2bnFydnlDamdaVWg5UlV0?= =?utf-8?B?NER2S3J1ZVNGRWI0SzdadVlOSWREUnA2ZmJXekM4L0R3N1p1dTlWMWliNXN4?= =?utf-8?B?djVhZnVyVE02RWppS0E4L2R2T2dGTDhCQlNnTDloelVRWStGTlVaZnpGSFNI?= =?utf-8?B?cWEwZmVEaUJhNHptYlVKN0lSYlBBYlhCYlVBbnRPeUJLVHdhbmRmNGpJTXVN?= =?utf-8?B?MWoyQjRGTEZjUkZKQXZhd2ZJK1MzQ1hCQm13YzhxSlZrcXlHN2VmU0lCOVIy?= =?utf-8?B?VHl2TFNVUnFpWTdkZWkwVGJwSUNkTVMrVnk3MFZqVWh3SGE2ZTdndjEweTFY?= =?utf-8?B?UFd0UVZ3eGlFOWwvWlpUZ01BMVVTRFpOSVdOWERVdDVnU1d1bnNsR0lYMTYw?= =?utf-8?B?Njh6TlZWWjVVZ3FQRDFJMDk2d2FZSUU0TVd1S2JXUGNEOWh4UmkwTHoxaG5l?= =?utf-8?B?b0hLaDRmTHlnTlJhUnNvazNSWFlVYnVWdjl4Rk5iWVU4RS9iY3VQTk9EKzdG?= =?utf-8?B?SDhlNGFMY01makk2dCtwUWtleWlCMitsWU9GaDJ4Z05WakN1OW4zTEVUTmZh?= =?utf-8?B?ZUg0b2p2K21oT2pXa3pTaXQrdEczc0tlVG1sNytDTWVWcFVsaWhmY05uRkI3?= =?utf-8?B?OUx0cnZURGgrOXlyZG1JTkRXUmRkSE02UkZIdkJVTXVJSnR4bDQwSHQyVGcy?= =?utf-8?B?VFhNNWNxVE1qeStEZlYvNms1M1hCVk55NWxaRW9Qb05mRGYxWDZJSEErL29h?= =?utf-8?B?eTlSY2dyWHprR0tKWEtMZ3lXdkxTcExJVjU2clFEWWxjdW42dlZCM09RekQw?= =?utf-8?B?Z2E2QU5qTXdBcXVDd0x6YWsrV3MzaWIyc2k2L0d1cWU3c09ydlVQV2Q5OFJy?= =?utf-8?B?cUQrZnpreE5NbTFsbmQ1K1dnNm53L1R6RW1iLzJRWFYyVnVmWkVweHBRbjBE?= =?utf-8?B?Z1g3SFZ0VThhcmRGZGtDb2lKTlR0WlJQRDl0U0dEc3lXc1lLNUlpMTRZdUkr?= =?utf-8?B?dFhvQUlSemJDRnJUbGI1K3hOK2c0R2xTMS95WjZVN2Z4MWI4UWxva05abVdI?= =?utf-8?B?MGw2T21RL1J6WnpmOGVROExHcHBYOHJCWnlKNmNhZzYydzRSSjRmS1JUVG5p?= =?utf-8?B?UDlBQUM3a3lqY3dZSldiMDdJdUh1dGd4N29vZGh0bnYxYkwvNnhCTzRkbXl4?= =?utf-8?B?V3NyVFFDaGhGQURBTFJSN2RyQ0F0MS9GRVl2M1dNVjFVSzNZSEtqaWFzQUtB?= =?utf-8?B?aC9wV3R0dEY0YjlvTnVPUHRjT3VqUlJiSG16UG9idGhBMXhjU1VGQkhtYnNO?= =?utf-8?B?QTZrYVA4RVJTenlwWDFXbW1IUHZNU0pUOVcvR2JlRlJwUWhHd0N1Tm5nSlo3?= =?utf-8?B?Z1ZwQitkZ0t2Z2NnMW1wdEtpSkJuREUveUpRaTJ5R282S1UzREJDd1BrUGJl?= =?utf-8?B?TkZiazUzMFFYNXZsZlZXMjA1dG1raVVTRjJYYURqRmhXQ1ZTU00xUXhOZ0tS?= =?utf-8?B?VzRrTG43TE02NG9SUjJJUFErZ1RnUGtYemJtTGV2bGh5ZHZkVWZES0ovazIv?= =?utf-8?B?K1hpYTlJRGxOWHNuZzdXc202MEhZL1VhQ296VUt3ZDhzVnVKRkk0K2h2RFlB?= =?utf-8?B?cTRDMmN2cnZUR2IweTZMTjI2VTdLTzZYR3V0bTI4QUEyL0Q2TG5XZXJ3clRZ?= =?utf-8?B?VVIwRURSdzJYamdHd0FCalNyS2NlS3lBWThwWno4eWJyZTlyY2xzWTZ0SktE?= =?utf-8?B?MGdkYWNPNGpIZFB4TzY3dmpQb2pzUzdWZnB1ZG5ERjBqT01pRWNob2IyNXVM?= =?utf-8?B?TGVmdW1IY3I1WW1QUWdUZ0lqLzc0TmNRbkR2SUJWWEdOR1EzRERoemZhWUZF?= =?utf-8?B?MSthVmtKUzA2a1NvekpwcmxvSmdOc1NEWUN3NjhoRVdCZWlzQmZldUhKcndy?= =?utf-8?B?c1dJUFNuV0FLemVyeFNPWE9lMFI0T2dYVVkzalZWdlJ2WlNGbXVsRlFGbFhs?= =?utf-8?B?ckswbi9lekw1cVNqVVNyaGdZUDdqcFQyRnRTdU5aa0hwOENvUm5LSHUrMnpp?= =?utf-8?B?NGNWSEJCSUg0czhGN2djYm5ybHhmcGZhaDFwcVFnWGZmdGFxUCtSaDdOR09y?= =?utf-8?Q?hqWAHTQOe2SJHJ4Cm/xiS/LNv?= X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-Network-Message-Id: 12ce79aa-405c-4c82-fec0-08dd305917fc X-MS-Exchange-CrossTenant-AuthSource: CH3PR12MB9194.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 09 Jan 2025 02:55:37.0525 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: NjxII7CcXS7/e9RuZBAGUYDAHk3i1Hec2rgPvsoOspIwiLFRc0ili7ziIDcBI8WpymeeXyRlDr1wj4zIv26XSQ== X-MS-Exchange-Transport-CrossTenantHeadersStamped: CH3PR12MB7500 On 9/1/25 13:11, Chenyi Qiang wrote: > > > On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote: >> >> >> On 8/1/25 21:56, Chenyi Qiang wrote: >>> >>> >>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote: >>>> On 13/12/24 18:08, Chenyi Qiang wrote: >>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require >>>>> uncoordinated discard") highlighted, some subsystems like VFIO might >>>>> disable ram block discard. However, guest_memfd relies on the discard >>>>> operation to perform page conversion between private and shared memory. >>>>> This can lead to stale IOMMU mapping issue when assigning a hardware >>>>> device to a confidential VM via shared memory (unprotected memory >>>>> pages). Blocking shared page discard can solve this problem, but it >>>>> could cause guests to consume twice the memory with VFIO, which is not >>>>> acceptable in some cases. An alternative solution is to convey other >>>>> systems like VFIO to refresh its outdated IOMMU mappings. >>>>> >>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust >>>>> VFIO mappings in relation to VM page assignment. Effectively page >>>>> conversion is similar to hot-removing a page in one mode and adding it >>>>> back in the other, so the similar work that needs to happen in response >>>>> to virtio-mem changes needs to happen for page conversion events. >>>>> Introduce the RamDiscardManager to guest_memfd to achieve it. >>>>> >>>>> However, guest_memfd is not an object so it cannot directly implement >>>>> the RamDiscardManager interface. >>>>> >>>>> One solution is to implement the interface in HostMemoryBackend. Any >>>> >>>> This sounds about right. >>>> >>>>> guest_memfd-backed host memory backend can register itself in the >>>>> target >>>>> MemoryRegion. However, this solution doesn't cover the scenario where a >>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g. >>>>> the virtual BIOS MemoryRegion. >>>> >>>> What is this virtual BIOS MemoryRegion exactly? What does it look like >>>> in "info mtree -f"? Do we really want this memory to be DMAable? >>> >>> virtual BIOS shows in a separate region: >>> >>>   Root memory region: system >>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM >>>    ... >>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM >> >> Looks like a normal MR which can be backed by guest_memfd. > > Yes, virtual BIOS memory region is initialized by > memory_region_init_ram_guest_memfd() which will be backed by a guest_memfd. > > The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual > BIOS image will be loaded and then copied to private region. > After that, > the loaded image will be discarded and this region become useless. I'd think it is loaded as "struct Rom" and then copied to the MR-ram_guest_memfd() which does not leave MR useless - we still see "pc.bios" in the list so it is not discarded. What piece of code are you referring to exactly? > So I > feel like this virtual BIOS should not be backed by guest_memfd? From the above it sounds like the opposite, i.e. it should :) >> >>>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram >>> @0000000080000000 KVM >> >> Anyway if there is no guest_memfd backing it and >> memory_region_has_ram_discard_manager() returns false, then the MR is >> just going to be mapped for VFIO as usual which seems... alright, right? > > Correct. As the vBIOS is backed by guest_memfd and we implement the RDM > for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO. > > If we go with the HostMemoryBackend instead of guest_memfd_manager, this > MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or > just ignore it since the MR is useless (but looks not so good). Sorry I am missing necessary details here, let's figure out the above. > >> >> >>> We also consider to implement the interface in HostMemoryBackend, but >>> maybe implement with guest_memfd region is more general. We don't know >>> if any DMAable memory would belong to HostMemoryBackend although at >>> present it is. >>> >>> If it is more appropriate to implement it with HostMemoryBackend, I can >>> change to this way. >> >> Seems cleaner imho. > > I can go this way. > >> >>>> >>>> >>>>> Thus, choose the second option, i.e. define an object type named >>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of >>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and >>>>> registered to the managed guest_memfd MemoryRegion to handle the page >>>>> conversion events. >>>>> >>>>> In the context of guest_memfd, the discarded state signifies that the >>>>> page is private, while the populated state indicated that the page is >>>>> shared. The state of the memory is tracked at the granularity of the >>>>> host page size (i.e. block_size), as the minimum conversion size can be >>>>> one page per request. >>>>> >>>>> In addition, VFIO expects the DMA mapping for a specific iova to be >>>>> mapped and unmapped with the same granularity. However, the >>>>> confidential >>>>> VMs may do partial conversion, e.g. conversion happens on a small >>>>> region >>>>> within a large region. To prevent such invalid cases and before any >>>>> potential optimization comes out, all operations are performed with 4K >>>>> granularity. >>>>> >>>>> Signed-off-by: Chenyi Qiang >>>>> --- >>>>>    include/sysemu/guest-memfd-manager.h |  46 +++++ >>>>>    system/guest-memfd-manager.c         | 250 ++++++++++++++++++++++ >>>>> +++++ >>>>>    system/meson.build                   |   1 + >>>>>    3 files changed, 297 insertions(+) >>>>>    create mode 100644 include/sysemu/guest-memfd-manager.h >>>>>    create mode 100644 system/guest-memfd-manager.c >>>>> >>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/ >>>>> guest-memfd-manager.h >>>>> new file mode 100644 >>>>> index 0000000000..ba4a99b614 >>>>> --- /dev/null >>>>> +++ b/include/sysemu/guest-memfd-manager.h >>>>> @@ -0,0 +1,46 @@ >>>>> +/* >>>>> + * QEMU guest memfd manager >>>>> + * >>>>> + * Copyright Intel >>>>> + * >>>>> + * Author: >>>>> + *      Chenyi Qiang >>>>> + * >>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or >>>>> later. >>>>> + * See the COPYING file in the top-level directory >>>>> + * >>>>> + */ >>>>> + >>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H >>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H >>>>> + >>>>> +#include "sysemu/hostmem.h" >>>>> + >>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager" >>>>> + >>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, >>>>> GUEST_MEMFD_MANAGER) >>>>> + >>>>> +struct GuestMemfdManager { >>>>> +    Object parent; >>>>> + >>>>> +    /* Managed memory region. */ >>>> >>>> Do not need this comment. And the period. >>> >>> [...] >>> >>>> >>>>> +    MemoryRegion *mr; >>>>> + >>>>> +    /* >>>>> +     * 1-setting of the bit represents the memory is populated >>>>> (shared). >>>>> +     */ >>> >>> Will fix it. >>> >>>> >>>> Could be 1 line comment. >>>> >>>>> +    int32_t bitmap_size; >>>> >>>> int or unsigned >>>> >>>>> +    unsigned long *bitmap; >>>>> + >>>>> +    /* block size and alignment */ >>>>> +    uint64_t block_size; >>>> >>>> unsigned? >>>> >>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?). >>>> Thanks, >>> >>> I think these fields would be helpful for future migration support. >>> Maybe defining as this way is more straightforward. >>> >>>> >>>>> + >>>>> +    /* listeners to notify on populate/discard activity. */ >>>> >>>> Do not really need this comment either imho. >>>> >>> >>> I prefer to provide the comment for each field as virtio-mem do. If it >>> is not necessary, I would remove those obvious ones. >> >> [bikeshedding on] But the "RamDiscardListener" word says that already, >> why repeating? :) It should add information, not duplicate. Like the >> block_size comment which mentions "alignment" [bikeshedding off] > > Got it. Thanks! > >> >>>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list; >>>>> +}; >>>>> + >>>>> +struct GuestMemfdManagerClass { >>>>> +    ObjectClass parent_class; >>>>> +}; >>>>> + >>>>> +#endif >>> >>> [...] >>> >>>             void *arg, >>>>> + >>>>> guest_memfd_section_cb cb) >>>>> +{ >>>>> +    unsigned long first_one_bit, last_one_bit; >>>>> +    uint64_t offset, size; >>>>> +    int ret = 0; >>>>> + >>>>> +    first_one_bit = section->offset_within_region / gmm->block_size; >>>>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>>>> first_one_bit); >>>>> + >>>>> +    while (first_one_bit < gmm->bitmap_size) { >>>>> +        MemoryRegionSection tmp = *section; >>>>> + >>>>> +        offset = first_one_bit * gmm->block_size; >>>>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm- >>>>>> bitmap_size, >>>>> +                                          first_one_bit + 1) - 1; >>>>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size; >>>> >>>> This tries calling cb() on bigger chunks even though we say from the >>>> beginning that only page size is supported? >>>> >>>> May be simplify this for now and extend if/when VFIO learns to split >>>> mappings,  or  just drop it when we get in-place page state convertion >>>> (which will make this all irrelevant)? >>> >>> The cb() will call with big chunks but actually it do the split with the >>> granularity of block_size in the cb(). See the >>> vfio_ram_discard_notify_populate(), which do the DMA_MAP with >>> granularity size. >> >> >> Right, and this all happens inside QEMU - first the code finds bigger >> chunks and then it splits them anyway to call the VFIO driver. Seems >> pointless to bother about bigger chunks here. >> >>> >>>> >>>> >>>>> + >>>>> +        if (!memory_region_section_intersect_range(&tmp, offset, >>>>> size)) { >>>>> +            break; >>>>> +        } >>>>> + >>>>> +        ret = cb(&tmp, arg); >>>>> +        if (ret) { >>>>> +            break; >>>>> +        } >>>>> + >>>>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>>>> +                                      last_one_bit + 2); >>>>> +    } >>>>> + >>>>> +    return ret; >>>>> +} >>>>> + >>>>> +static int guest_memfd_for_each_discarded_section(const >>>>> GuestMemfdManager *gmm, >>>>> +                                                  MemoryRegionSection >>>>> *section, >>>>> +                                                  void *arg, >>>>> + >>>>> guest_memfd_section_cb cb) >>>>> +{ >>>>> +    unsigned long first_zero_bit, last_zero_bit; >>>>> +    uint64_t offset, size; >>>>> +    int ret = 0; >>>>> + >>>>> +    first_zero_bit = section->offset_within_region / gmm->block_size; >>>>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size, >>>>> +                                        first_zero_bit); >>>>> + >>>>> +    while (first_zero_bit < gmm->bitmap_size) { >>>>> +        MemoryRegionSection tmp = *section; >>>>> + >>>>> +        offset = first_zero_bit * gmm->block_size; >>>>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, >>>>> +                                      first_zero_bit + 1) - 1; >>>>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size; >>>>> + >>>>> +        if (!memory_region_section_intersect_range(&tmp, offset, >>>>> size)) { >>>>> +            break; >>>>> +        } >>>>> + >>>>> +        ret = cb(&tmp, arg); >>>>> +        if (ret) { >>>>> +            break; >>>>> +        } >>>>> + >>>>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm- >>>>>> bitmap_size, >>>>> +                                            last_zero_bit + 2); >>>>> +    } >>>>> + >>>>> +    return ret; >>>>> +} >>>>> + >>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const >>>>> RamDiscardManager *rdm, >>>>> +                                                    const >>>>> MemoryRegion *mr) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>>>> + >>>>> +    g_assert(mr == gmm->mr); >>>>> +    return gmm->block_size; >>>>> +} >>>>> + >>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm, >>>>> +                                              RamDiscardListener *rdl, >>>>> +                                              MemoryRegionSection >>>>> *section) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>>>> +    int ret; >>>>> + >>>>> +    g_assert(section->mr == gmm->mr); >>>>> +    rdl->section = memory_region_section_new_copy(section); >>>>> + >>>>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next); >>>>> + >>>>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl, >>>>> + >>>>> guest_memfd_notify_populate_cb); >>>>> +    if (ret) { >>>>> +        error_report("%s: Failed to register RAM discard listener: >>>>> %s", __func__, >>>>> +                     strerror(-ret)); >>>>> +    } >>>>> +} >>>>> + >>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager >>>>> *rdm, >>>>> +                                                RamDiscardListener >>>>> *rdl) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>>>> +    int ret; >>>>> + >>>>> +    g_assert(rdl->section); >>>>> +    g_assert(rdl->section->mr == gmm->mr); >>>>> + >>>>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, >>>>> rdl, >>>>> + >>>>> guest_memfd_notify_discard_cb); >>>>> +    if (ret) { >>>>> +        error_report("%s: Failed to unregister RAM discard listener: >>>>> %s", __func__, >>>>> +                     strerror(-ret)); >>>>> +    } >>>>> + >>>>> +    memory_region_section_free_copy(rdl->section); >>>>> +    rdl->section = NULL; >>>>> +    QLIST_REMOVE(rdl, next); >>>>> + >>>>> +} >>>>> + >>>>> +typedef struct GuestMemfdReplayData { >>>>> +    void *fn; >>>> >>>> s/void */ReplayRamPopulate/ >>> >>> [...] >>> >>>> >>>>> +    void *opaque; >>>>> +} GuestMemfdReplayData; >>>>> + >>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection >>>>> *section, void *arg) >>>>> +{ >>>>> +    struct GuestMemfdReplayData *data = arg; >>>> >>>> Drop "struct" here and below. >>> >>> Fixed. Thanks! >>> >>>> >>>>> +    ReplayRamPopulate replay_fn = data->fn; >>>>> + >>>>> +    return replay_fn(section, data->opaque); >>>>> +} >>>>> + >>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager >>>>> *rdm, >>>>> +                                            MemoryRegionSection >>>>> *section, >>>>> +                                            ReplayRamPopulate >>>>> replay_fn, >>>>> +                                            void *opaque) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = >>>>> opaque }; >>>>> + >>>>> +    g_assert(section->mr == gmm->mr); >>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data, >>>>> + >>>>> guest_memfd_rdm_replay_populated_cb); >>>>> +} >>>>> + >>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection >>>>> *section, void *arg) >>>>> +{ >>>>> +    struct GuestMemfdReplayData *data = arg; >>>>> +    ReplayRamDiscard replay_fn = data->fn; >>>>> + >>>>> +    replay_fn(section, data->opaque); >>>> >>>> >>>> guest_memfd_rdm_replay_populated_cb() checks for errors though. >>> >>> It follows current definiton of ReplayRamDiscard() and >>> ReplayRamPopulate() where replay_discard() doesn't return errors and >>> replay_populate() returns errors. >> >> A trace would be appropriate imho. Thanks, > > Sorry, can't catch you. What kind of info to be traced? The errors > returned by replay_populate()? Yeah. imho these are useful as we expect this part to work in general too, right? Thanks, > >> >>>> >>>>> + >>>>> +    return 0; >>>>> +} >>>>> + >>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager >>>>> *rdm, >>>>> +                                             MemoryRegionSection >>>>> *section, >>>>> +                                             ReplayRamDiscard >>>>> replay_fn, >>>>> +                                             void *opaque) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm); >>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = >>>>> opaque }; >>>>> + >>>>> +    g_assert(section->mr == gmm->mr); >>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data, >>>>> + >>>>> guest_memfd_rdm_replay_discarded_cb); >>>>> +} >>>>> + >>>>> +static void guest_memfd_manager_init(Object *obj) >>>>> +{ >>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj); >>>>> + >>>>> +    QLIST_INIT(&gmm->rdl_list); >>>>> +} >>>>> + >>>>> +static void guest_memfd_manager_finalize(Object *obj) >>>>> +{ >>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap); >>>> >>>> >>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks, >>> >>> Will remove it. Thanks. >>> >>>> >>>> >>>>> +} >>>>> + >>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void >>>>> *data) >>>>> +{ >>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc); >>>>> + >>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity; >>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener; >>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener; >>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated; >>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated; >>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded; >>>>> +} >>>>> diff --git a/system/meson.build b/system/meson.build >>>>> index 4952f4b2c7..ed4e1137bd 100644 >>>>> --- a/system/meson.build >>>>> +++ b/system/meson.build >>>>> @@ -15,6 +15,7 @@ system_ss.add(files( >>>>>      'dirtylimit.c', >>>>>      'dma-helpers.c', >>>>>      'globals.c', >>>>> +  'guest-memfd-manager.c', >>>>>      'memory_mapping.c', >>>>>      'qdev-monitor.c', >>>>>      'qtest.c', >>>> >>> >> > -- Alexey