Date: Thu, 28 Mar 2024 10:10:20 +0000
From: Quentin Perret
To: David Hildenbrand
Cc: Will Deacon, Sean Christopherson, Vishal Annapurve, Matthew Wilcox,
 Fuad Tabba, kvm@vger.kernel.org, kvmarm@lists.linux.dev,
 pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
 anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
 aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk, brauner@kernel.org,
 akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
 chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com,
 dmatlack@google.com, yu.c.zhang@linux.intel.com,
 isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz,
 ackerleytng@google.com, mail@maciej.szmigiero.name,
 michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com,
 isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
 suzuki.poulose@arm.com, steven.price@arm.com,
 quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
 quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
 catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com,
 oliver.upton@linux.dev, maz@kernel.org, keirf@google.com,
 linux-mm@kvack.org
Subject: Re: folio_mmapped
References: <7470390a-5a97-475d-aaad-0f6dfb3d26ea@redhat.com>
 <40f82a61-39b0-4dda-ac32-a7b5da2a31e8@redhat.com>
 <20240319143119.GA2736@willie-the-truck>
 <2d6fc3c0-a55b-4316-90b8-deabb065d007@redhat.com>
 <20240327193454.GB11880@willie-the-truck>

Hey David,

I'll try to pick up the baton while Will is away :-)

On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote:
> On 27.03.24 20:34, Will Deacon wrote:
> > I suppose the main thing is that the architecture backend can deal with
> > these states, so the core code shouldn't really care as long as it's
> > aware that shared memory may be pinned.
>
> So IIUC, the states are:
>
> (1) Private: inaccessible by the host, accessible by the guest, "owned by
>     the guest"
>
> (2) Host Shared: accessible by the host + guest, "owned by the host"
>
> (3) Guest Shared: accessible by the host, "owned by the guest"

Yup.

> Memory ballooning is simply transitioning from (3) to (2), and then
> discarding the memory.

Well, not quite actually, see below.

> Any state I am missing?

So there is probably state (0), which is 'owned only by the host'. It's
a bit obvious, but I'll make it explicit because it matters for the rest
of the discussion.

And while at it, there are other cases (memory shared/owned with/by the
hypervisor and/or TrustZone) but they're somewhat irrelevant to this
discussion. These pages are usually backed by kernel allocations, so
they are much less problematic to deal with. So let's ignore those.

> Which transitions are possible?

Basically a page must be in the 'exclusively owned' state for an owner
to initiate a share or donation. So e.g. a shared page must be unshared
before it can be donated to someone else (that is true regardless of the
owner: host, guest, hypervisor, ...). That significantly simplifies the
state tracking in pKVM.

> (1) <-> (2) ? Not sure if the direct transition is possible.

Yep, not possible.

> (2) <-> (3) ? IIUC yes.

Actually it's not directly possible as is. The ballooning procedure is
essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a
single hypercall when doing ballooning, but it's technically just a
(3) -> (1) -> (0) sequence that has been micro-optimized.)

Note that state (2) is actually never used for protected VMs. It's
mainly used to implement standard non-protected VMs. The biggest
difference in pKVM between protected and non-protected VMs is basically
that in the former case, in the fault path KVM does a (0) -> (1)
transition, but in the latter it's (0) -> (2). That implies that in the
unprotected case, the host remains the page owner and is allowed to
decide to unshare arbitrary pages, to restrict the guest permissions for
the shared pages etc, which paves the way for implementing migration,
swap, ... relatively easily.

> (1) <-> (3) ? IIUC yes.

Yep.

> > I agree on all of these and, yes, (3) is the problem for us. We've also
> > been thinking a bit about CoW recently and I suspect the use of
> > vm_normal_page() in do_wp_page() could lead to issues similar to those
> > we hit with GUP. There are various ways to approach that, but I'm not
> > sure what's best.
>
> Would COW be required or is that just the nasty side-effect of trying to use
> anonymous memory?

That'd qualify as an undesirable side effect I think.

> >
> > > I'm curious, may there be a requirement in the future that shared memory
> > > could be mapped into other processes? (thinking vhost-user and such things).
> >
> > It's not impossible. We use crosvm as our VMM, and that has a
> > multi-process sandbox mode which I think relies on just that...
> >
>
> Okay, so basing the design on anonymous memory might not be the best choice
> ... :/

So, while we're at this stage, let me throw another idea at the wall to
see if it sticks :-)

One observation is that a standard memfd would work relatively well for
pKVM if we had a way to enforce that all mappings to it are MAP_SHARED.
KVM would still need to take an 'exclusive GUP' from the fault path
(which may fail in case of a pre-existing GUP, but that's fine), but
then CoW and friends largely become a non-issue by construction I think.
Is there any way we could enforce that cleanly? Perhaps introducing a
sort of 'mmap notifier' would do the trick? By that I mean something a
bit similar to an MMU notifier, offered by memfd, that KVM could
register against whenever the memfd is attached to a protected VM
memslot.

One of the nice things here is that we could retain a mapping of the
whole of guest memory in userspace, so conversions wouldn't require any
additional effort from userspace. A bad thing is that a process that is
being passed such a memfd may not expect the new semantics and the
inability to map !MAP_SHARED. But I guess a process that receives a
handle to private memory must be enlightened regardless of the type of
fd, so maybe it's not so bad.

Thoughts?

Thanks,
Quentin
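
For reference, the ownership rules discussed in this message can be
written down as a small toy model. This is only a sketch of one reading
of the thread, not pKVM code: every name in it (PageState, TRANSITIONS,
can_transition, path) is made up for illustration, and the (2) -> (0)
"host unshares" edge is an assumption inferred from the statement that
the host may unshare pages in the non-protected case.

```python
from collections import deque
from enum import Enum


class PageState(Enum):
    # Numbering follows the states used in the discussion.
    HOST_OWNED = 0    # (0) owned only by the host
    PRIVATE = 1       # (1) owned by the guest, inaccessible by the host
    HOST_SHARED = 2   # (2) owned by the host, shared with the guest
    GUEST_SHARED = 3  # (3) owned by the guest, shared with the host


S = PageState

# States in which the owner holds the page exclusively.
EXCLUSIVE = {S.HOST_OWNED, S.PRIVATE}

# Direct transitions only. Hypothetical table, built from the thread:
# a share or donation must start from an exclusively owned page.
TRANSITIONS = {
    (S.HOST_OWNED, S.PRIVATE): "donate to guest (protected VM fault path)",
    (S.PRIVATE, S.HOST_OWNED): "give back to host (ballooning)",
    (S.HOST_OWNED, S.HOST_SHARED): "host shares (non-protected VM fault path)",
    (S.HOST_SHARED, S.HOST_OWNED): "host unshares (assumed edge)",
    (S.PRIVATE, S.GUEST_SHARED): "guest shares with host",
    (S.GUEST_SHARED, S.PRIVATE): "guest unshares",
}

# Sanity check of the rule above: every edge starts or ends in an
# exclusively owned state, so shared -> shared is never a direct step.
assert all(a in EXCLUSIVE or b in EXCLUSIVE for (a, b) in TRANSITIONS)


def can_transition(src, dst):
    """Return True if src -> dst is a single direct transition."""
    return (src, dst) in TRANSITIONS


def path(src, dst):
    """Shortest chain of direct transitions from src to dst (BFS), or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            chain = []
            while cur is not None:
                chain.append(cur)
                cur = prev[cur]
            return chain[::-1]
        for (a, b) in TRANSITIONS:
            if a == cur and b not in prev:
                prev[b] = cur
                queue.append(b)
    return None


if __name__ == "__main__":
    # Ballooning a guest-shared page: (3) -> (1) -> (0).
    print(" -> ".join(s.name for s in path(S.GUEST_SHARED, S.HOST_OWNED)))
```

In this model, path(S.GUEST_SHARED, S.HOST_OWNED) goes through PRIVATE,
which matches the (3) -> (1) -> (0) sequence described for ballooning,
and neither (1) <-> (2) nor (2) <-> (3) exists as a direct edge.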