From: James Houghton
Date: Tue, 7 Nov 2023 08:11:09 -0800
Subject: Re: RFC: A KVM-specific alternative to UserfaultFD
To: Peter Xu
Cc: David Matlack, Axel Rasmussen, Paolo Bonzini, kvm list, Sean Christopherson, Oliver Upton, Mike Kravetz, Andrea Arcangeli, Frank van der Linden

On Tue, Nov 7, 2023 at 6:22 AM Peter Xu wrote:
>
> On Mon, Nov 06, 2023 at 03:22:05PM -0800, David Matlack wrote:
> > On Mon, Nov 6, 2023 at 3:03 PM Peter Xu wrote:
> > > On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> > > > On Mon, Nov 6, 2023 at 12:23 PM Peter Xu wrote:
> > > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > > > >
> > > > > > * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > > > > guest memory for the userspace page table entries.
> > > > >
> > > > > What is this one?
> > > >
> > > > In the way we use userfaultfd, there are two shared userspace mappings
> > > > - one non-UFFD-registered one which is used to resolve demand paging
> > > > faults, and another UFFD-registered one which is handed to KVM et al
> > > > for the guest to use. I think David is talking about the "second"
> > > > mapping as overhead here, since with the KVM-based approach he's
> > > > describing we don't need that mapping.
> > >
> > > I see, but then is it userspace relevant? IMHO we should discuss the
> > > proposal based only on the design itself, rather than relying on any
> > > details of possible userspace implementations if two mappings are not
> > > required but optional.
> >
> > What I mean here is that for UserfaultFD to track accesses at
> > PAGE_SIZE granularity, that requires 1 PTE per page, i.e. 8 bytes per
> > page. Versus the KVM-based approach which only requires 1 bit per page
> > for the present bitmap. This is inherent in the design of UserfaultFD
> > because it uses PTEs to track what is present, not specific to how we
> > use UserfaultFD.
>
> Shouldn't the userspace normally still maintain one virtual mapping anyway
> for the guest address range? As IIUC kvm still relies a lot on HVA to work
> (at least before guest memfd)? E.g., KVM_SET_USER_MEMORY_REGION, or mmu
> notifiers. If so, that 8 bytes should be there with/without userfaultfd,
> IIUC.
>
> Also, I think that's not strictly needed for any kind of file memories, as
> in those cases userfaultfd works with the page cache.

This extra ~8 bytes-per-page overhead is real, and it is the theoretical
maximum additional overhead that userfaultfd would require over a
KVM-based demand paging alternative when we are using hugepages.

Consider the case where we are using THPs and have just finished
post-copy, and we haven't done any collapsing yet:

For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
4K (because we demand-fetched at 4K), the userspace page tables are
entirely shattered. KVM has no choice but to have an entirely shattered
second-stage page table as well.

For KVM demand paging: the userspace page tables can remain entirely
populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
because we have only just finished post-copy and haven't started
collapsing yet.

So both systems end up with a shattered second-stage page table, but
userfaultfd has a shattered userspace page table as well (+8 bytes/4K
if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.), and that
is where the extra overhead comes from.

The second mapping of guest memory that we use today (through which we
install memory), given that we are using hugepages, will use PMDs and
PUDs, so its overhead is minimal.

Hope that clears things up!

Thanks,
James
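
To put rough numbers on the overhead discussed above, here is a small
back-of-the-envelope sketch. It is only illustrative: the 1 TiB guest
size is an assumed example (not a figure from this thread), and it
counts only page-table-entry bytes, ignoring struct page and any other
per-page costs.

/*
 * Illustrative sketch: extra page-table bytes for a fully shattered
 * userspace mapping vs. a 1-bit-per-4K-page present bitmap.
 * The 1 TiB guest size is a hypothetical example value.
 */
#include <stdio.h>

#define KIB 1024ULL
#define MIB (1024 * KIB)
#define GIB (1024 * MIB)

int main(void)
{
    unsigned long long guest_size = 1024 * GIB;            /* assumed 1 TiB guest */
    unsigned long long pages_4k   = guest_size / (4 * KIB);
    unsigned long long pages_2m   = guest_size / (2 * MIB);

    /* Shattered THP backing: one extra 8-byte PTE per 4 KiB page. */
    unsigned long long pte_bytes = pages_4k * 8;

    /* Shattered HugeTLB-1G backing additionally needs one 8-byte PMD entry per 2 MiB. */
    unsigned long long pmd_bytes = pages_2m * 8;

    /* KVM-based demand paging: one bit per 4 KiB page in a present bitmap. */
    unsigned long long bitmap_bytes = pages_4k / 8;

    printf("extra PTEs (THP case):             %llu MiB\n", pte_bytes / MIB);
    printf("extra PMD entries (1G case adds):  %llu MiB\n", pmd_bytes / MIB);
    printf("present bitmap:                    %llu MiB\n", bitmap_bytes / MIB);
    return 0;
}

For that assumed 1 TiB guest this works out to roughly 2 GiB of extra
userspace page-table entries when fully shattered, versus about 32 MiB
for a per-4K-page present bitmap.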
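
And as a minimal sketch of why demand-fetching at 4K shatters the
userspace page tables under userfaultfd: each fault is resolved with a
page-sized UFFDIO_COPY (or UFFDIO_CONTINUE), which can only install a
4 KiB PTE. The helper below is hypothetical and not code from this
thread; 'uffd' is assumed to be a userfaultfd already registered on the
guest memory range, and 'src_page' an assumed buffer holding the
demand-fetched contents.

/*
 * Hypothetical sketch: resolve one demand fault with a 4 KiB UFFDIO_COPY.
 * Each such copy installs a single 4 KiB PTE in the userspace page table,
 * which is why the mapping ends up fully shattered after post-copy.
 */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <string.h>

#define PAGE_SIZE 4096UL

static int resolve_fault_4k(int uffd, unsigned long fault_addr, void *src_page)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst  = fault_addr & ~(PAGE_SIZE - 1);   /* align to the faulting 4 KiB page */
    copy.src  = (unsigned long)src_page;         /* demand-fetched page contents */
    copy.len  = PAGE_SIZE;                       /* 4 KiB: one PTE gets installed */
    copy.mode = 0;

    return ioctl(uffd, UFFDIO_COPY, &copy);      /* 0 on success, -1 on error */
}

A KVM-based scheme that tracks presence in its own bitmap would not
need this per-4K PTE installation at all, leaving the PMD/PUD mappings
of the hugepage-backed VMA intact.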