Date: Thu, 26 Jan 2023 20:10:10 +0000
Message-ID: <86v8ktkqfx.wl-maz@kernel.org>
From: Marc Zyngier
To: Ricardo Koller
Cc: Oliver Upton, pbonzini@redhat.com, oupton@google.com, yuzenghui@huawei.com, dmatlack@google.com, kvm@vger.kernel.org, kvmarm@lists.linux.dev, qperret@google.com, catalin.marinas@arm.com, andrew.jones@linux.dev, seanjc@google.com, alexandru.elisei@arm.com, suzuki.poulose@arm.com, eric.auger@redhat.com, gshan@redhat.com, reijiw@google.com, rananta@google.com, bgardon@google.com, ricarkol@gmail.com
Subject: Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
References: <20230113035000.480021-1-ricarkol@google.com> <20230113035000.480021-7-ricarkol@google.com>

On Thu, 26 Jan 2023 18:45:43 +0000,
Ricardo Koller wrote:
>
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort. This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> > >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> > >  2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > >  	/* The last vcpu id that ran on each physical CPU */
> > >  	int __percpu *last_vcpu_ran;
> > >
> > > +	/*
> > > +	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > +	 * pages. It is used to allocate stage2 page tables while splitting
> > > +	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > +	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > +	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > +	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > +	 *
> > > +	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > +	 * all the available huge-page sizes, and be a multiple of all the
> > > +	 * other ones; for example, 1GB when all the available huge-page sizes
> > > +	 * are (1GB, 2MB, 32MB, 512MB).
> > > +	 *
> > > +	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > +	 * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > +	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > +	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > +	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > > +	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > +	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > +	 * granules.
> > > +	 *
> > > +	 * Protected by kvm->slots_lock.
> > > +	 */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE		       SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY					\
> > > +	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +		\
> > > +	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
>
> Sounds good, will refine this for v2.
>
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
> >
> > > +	struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > >  	struct kvm_arch *arch;
> > >  };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > >  static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
>
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?

From my PoV this is a no.

If you have a flag because this is an experimental feature (like NV),
then this is a kernel option, and you taint the kernel when it is set.

If you have a flag because this is a modal option that makes different
use of the HW which cannot be exposed to userspace (like GICv4), then
this also is a kernel option.

This is neither.

The one thing that would convince me to make it an option is the
amount of memory this thing consumes. 512+ pages is a huge amount, and
I'm not overly happy about that. Why can't this be a userspace visible
option, selectable on a per VM (or memslot) basis?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
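
For reference, the cache-capacity arithmetic discussed in this thread can be sketched as a small userspace C fragment. This is a sketch under stated assumptions, not code from the patch: the two EAGER_PAGE_SPLIT_* macros are copied from the quoted hunk, the SZ_* constants and DIV_ROUND_UP_ULL are redefined locally so it builds outside the kernel, and split_pages_needed() and cache_footprint_bytes() are hypothetical helpers that only reproduce the four cases listed in the quoted comment. The footprint figure assumes the cache is filled to capacity with PAGE_SIZE pages.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's SZ_* constants (assumption: userspace build). */
#define SZ_4K   (UINT64_C(1) << 12)
#define SZ_64K  (UINT64_C(1) << 16)
#define SZ_2M   (UINT64_C(1) << 21)
#define SZ_32M  (UINT64_C(1) << 25)
#define SZ_512M (UINT64_C(1) << 29)
#define SZ_1G   (UINT64_C(1) << 30)

/* Local stand-in for the kernel macro of the same name. */
#define DIV_ROUND_UP_ULL(n, d) (((n) + (d) - 1) / (d))

/* Copied from the quoted patch: worst case is 1GB hugepages with 4KB granules. */
#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
#define EAGER_PAGE_SPLIT_CACHE_CAPACITY					\
	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +		\
	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))

/* 1 + 512 table pages for a 1GB chunk. */
static_assert(EAGER_PAGE_SPLIT_CACHE_CAPACITY == 513, "worst-case capacity");

/*
 * Hypothetical helper: table pages needed to split `chunk` bytes that are
 * mapped with blocks of `block_size`, where `last_block_size` is the
 * smallest block size for the granule (2MB for 4KB, 32MB for 16KB,
 * 512MB for 64KB).
 */
uint64_t split_pages_needed(uint64_t chunk, uint64_t block_size,
			    uint64_t last_block_size)
{
	/* One last-level table per smallest-size block in the chunk. */
	uint64_t pages = DIV_ROUND_UP_ULL(chunk, last_block_size);

	/* A larger block needs one extra intermediate table per block. */
	if (block_size > last_block_size)
		pages += DIV_ROUND_UP_ULL(chunk, block_size);

	return pages;
}

/* Hypothetical helper: bytes held by a cache of `capacity` PAGE_SIZE pages. */
uint64_t cache_footprint_bytes(uint64_t capacity, uint64_t page_size)
{
	return capacity * page_size;
}
```

With these definitions, split_pages_needed(SZ_1G, SZ_1G, SZ_2M) gives 513 and split_pages_needed(SZ_1G, SZ_512M, SZ_512M) gives 2, matching the quoted comment. A fixed capacity of 513 pages comes to 2101248 bytes (about 2 MiB) with 4KB pages, and to 33619968 bytes (just over 32 MiB) with 64KB pages if the cache were ever filled to capacity, which is the kind of per-VM cost the thread is concerned about.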