Date: Tue, 31 Jan 2023 10:28:47 +0000
Message-ID: <86h6w70zhc.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Ricardo Koller <ricarkol@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>, pbonzini@redhat.com,
    oupton@google.com, yuzenghui@huawei.com, dmatlack@google.com,
    kvm@vger.kernel.org, kvmarm@lists.linux.dev, qperret@google.com,
    catalin.marinas@arm.com, andrew.jones@linux.dev, seanjc@google.com,
    alexandru.elisei@arm.com, suzuki.poulose@arm.com,
    eric.auger@redhat.com, gshan@redhat.com, reijiw@google.com,
    rananta@google.com, bgardon@google.com, ricarkol@gmail.com
Subject: Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
References: <20230113035000.480021-1-ricarkol@google.com>
    <20230113035000.480021-7-ricarkol@google.com>
    <86v8ktkqfx.wl-maz@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller wrote:
>
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes.
> > 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace-visible
> > option, selectable on a per-VM (or memslot) basis?
>
> It should be possible. I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M). However, they add
> complexity and I'm not sure they help much.
>
> (I will be using PAGE_SIZE=4K to make things simpler.)
>
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.

This is an important difference. It avoids the "thermal shock" of the
upfront allocation, giving the kernel time to reclaim memory from
somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM your guest uses.

>
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees, which is needed in order to use FEAT_BBM
> and skip the expensive TLBI broadcasts. 513 is just the number of
> pages needed to break a 1G huge-page.

I understand that. But it is also clear that 1GB huge pages are
unlikely to be THPs, and I wonder if we should treat the two
differently. Using HugeTLBFS pages is significant here.

>
> We could optimize for smaller huge-pages, like 2M, by splitting one
> huge-page at a time: only preallocate one 4K page at a time. The
> trick is how to know that we are splitting 2M huge-pages. We could
> either get the vma pagesize or use hints from userspace. I'm not sure
> that this is worth it though. The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
>
> alloc 1 page       |    | alloc 512 pages
> split 2M huge-page |    | split 2M huge-page
> alloc 1 page       |    | split 2M huge-page
> split 2M huge-page | => | split 2M huge-page
> ...
> alloc 1 page       |    | split 2M huge-page
> split 2M huge-page |    | split 2M huge-page
>
> Still thinking of what else to do.

I think the 1G case fits your own use case, but I doubt this covers
the majority of the users. Most people rely on the kernel's ability to
use THPs, which are capped at the first level of block mapping. 2MB
(and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB with 64kB pages are vanishingly rare).

Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user is not involved, this is a lot more debatable. This is why I'd
like this behaviour to be a buy-in, either directly (a first-class
userspace API) or indirectly (the provenance of the memory).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.