Date: Fri, 29 Mar 2024 12:15:37 -0700
From: Krister Johansen <kjlx@templeofstupid.com>
To: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Ali Saidi <alisaidi@amazon.com>,
	David Reaver <me@davidreaver.com>,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to
 smallest block
Message-ID: <20240329191537.GA2051@templeofstupid.com>
References: <cover.1711649501.git.kjlx@templeofstupid.com>
 <ebf0fac84cb1d19bdc6e73576e3cc40a9cab0635.1711649501.git.kjlx@templeofstupid.com>
 <ZgbGtpj5mStTkAkn@linux.dev>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <ZgbGtpj5mStTkAkn@linux.dev>

Hi Oliver,
Thanks for the response.

On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > stage2_apply_range() for unmap operations can interfere with the
> > performance of IO if the device's interrupts share the CPU where the
> > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > stage2_apply_range() batch size to largest block") improved this.  Prior
> > to that commit, workloads that were unfortunate enough to have their IO
> > interrupts pinned to the same CPU as the unmap operation would observe a
> > complete stall.  With the switch to using the largest block size, it is
> > possible for IO to make progress, albeit at a reduced speed.
> 
> Can you describe the workload a bit more? I'm having a hard time
> understanding how you're unmapping that much memory on the fly in
> your workload. Is guest memory getting swapped? Are VMs being torn
> down?

Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
causing the problems.  The container VMs don't have long lifetimes, but
some may be up to 256GB in size, depending on the user.  The workloads
running the VMs aren't especially performance sensitive, but their users
do notice when network connections time out.  IOW, if the performance is
bad enough to temporarily prevent new TCP connections from being
established or requests / responses being received in a timely fashion,
we'll hear about it.  Users deploy their services a lot, so there's a
lot of container VM churn.  (Really it's automation redeploying the
services on behalf of the users in response to new commits to their
repos...)

> Also, it seems a bit odd to steer interrupts *into* the workload you
> care about...

Ah, that was only done intentionally to measure the impact.  It isn't
done on purpose in production.

Nevertheless, the example we tend to run into is that a box may have 2
NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
What happens in practice is that a VM will get torn down, and that has a
1-in-64 chance of impacting the performance of the subset of the flows
that are mapped via RSS to the interrupt that happens to be assigned to
the CPU where the VM is being torn down.

Of course, the obvious next question is why not just bind the VM's flows
to the CPUs the VM is running on?  We don't have a 1:1 mapping of
network device to VM, or VM to CPU right now, which frustrates this
approach.

> > Further reducing the stage2_apply_range() batch size has substantial
> > performance improvements for IO that share a CPU performing an unmap
> > operation.  By switching to a 2mb chunk, IO performance regressions were
> > no longer observed in this author's tests.  E.g. it was possible to
> > obtain the advertised device throughput despite an unmap operation
> > occurring on the CPU where the interrupt was running.  There is a
> > tradeoff, however.  No changes were observed in per-operation timings
> > when running the kvm_pagetable_test without an interrupt load.  However,
> > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > by about 15% and unmap times increased by about 58%.  In essence, this
> > trades slower map/unmap times for improved IO throughput.
> 
> There are other users of the range-based operations, like
> write-protection. Live migration is especially sensitive to the latency
> of page table updates as it can affect the VMM's ability to converge
> with the guest.

To be clear, the reduction in performance was observed when I
concurrently executed both the kvm_pagetable_test and a networking
benchmark where the NIC's interrupts were assigned to the same CPU where
the pagetable test was executing.  I didn't see a slowdown just running
the pagetable test.
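
For a sense of scale, here is a userspace sketch of the batching being
discussed.  range_addr_end() is modeled on __stage2_range_addr_end();
count_batches() is a hypothetical helper that only counts how many pieces
a range is split into, and does not touch any page tables.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Next size-aligned boundary after addr, clamped to end. */
static phys_addr_t range_addr_end(phys_addr_t addr, phys_addr_t end,
				  phys_addr_t size)
{
	phys_addr_t boundary;

	assert(size && (size & (size - 1)) == 0);	/* power of two */
	boundary = (addr + size) & ~(size - 1);
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Hypothetical helper: number of batches a walk over [addr, end) uses. */
static uint64_t count_batches(phys_addr_t addr, phys_addr_t end,
			      phys_addr_t size)
{
	uint64_t batches = 0;

	assert(addr < end);
	do {
		phys_addr_t next = range_addr_end(addr, end, size);

		/*
		 * The real loop would apply the operation to [addr, next)
		 * here and could reschedule before the next batch.
		 */
		batches++;
		addr = next;
	} while (addr != end);

	return batches;
}
```

Under these assumptions a 256GB range is split into 256 batches with a
1GB batch size and 131072 batches with a 2MB batch size, so the smaller
size gives far more points at which the CPU can do other work.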

> > Cc: <stable@vger.kernel.org> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
> > Cc: <stable@vger.kernel.org> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
> > Cc: <stable@vger.kernel.org> # 5.15.x
> 
> This is a performance improvement, *not* a correctness fix. Please don't
> cc stable for it.

Apologies.  I consulted the Stable Rules[1] before applying these tags and
the guidance it gave was that a patch "must either fix a real bug that
bothers people or just add a device ID."

In our case, TCP throughput drops from 9.5Gbps to about 2Gbps during a
teardown, which is something that does bother our users.

> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
> >  arch/arm64/kvm/mmu.c                 | 2 +-
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 19278dfe7978..b0c4651a4d9a 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -19,11 +19,15 @@
> >   *  - 4K (level 1):	1GB
> >   *  - 16K (level 2):	32MB
> >   *  - 64K (level 2):	512MB
> > + *
> > + *  The max block level is the _smallest_ supported block size for KVM.
> 
> This feels like a non sequitur given the old comment is left in place...

I'll fix if we keep this approach.  Is the objection to the name
KVM_PGTABLE_MAX_BLOCK_LEVEL or just the comment?
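
For reference, this is how the level numbers map to sizes.  It is a
userspace sketch; level_shift() and granule_size() are hypothetical names,
and the arithmetic is meant to follow ARM64_HW_PGTABLE_LEVEL_SHIFT() with
the page shift passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Each level resolves (page_shift - 3) bits of address, and a level 3
 * entry maps a single page.
 */
static unsigned int level_shift(unsigned int page_shift, unsigned int level)
{
	assert(level <= 3);
	return (page_shift - 3) * (4 - level) + 3;
}

/* Bytes mapped by one entry at the given level. */
static uint64_t granule_size(unsigned int page_shift, unsigned int level)
{
	return 1ULL << level_shift(page_shift, level);
}
```

With 4K pages (page_shift 12) level 1 is 1GB and level 2 is 2MB.  With
16K and 64K pages level 2 is 32MB and 512MB, matching the existing
comment.  A larger level number therefore means a smaller block.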

> >   */
> >  #ifdef CONFIG_ARM64_4K_PAGES
> >  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
> > +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	2
> >  #else
> >  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
> > +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	KVM_PGTABLE_MIN_BLOCK_LEVEL
> >  #endif
> >  
> >  #define kvm_lpa2_is_enabled()		system_supports_lpa2()
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index dc04bc767865..1e927b306aee 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -41,7 +41,7 @@ static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> >  
> >  static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> >  {
> > -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
> >  
> >  	return __stage2_range_addr_end(addr, end, size);
> >  }
> 
> This doesn't feel right to me. A property that we had before is that
> leaf entries are visited at most once, since every mapping size was
> evenly divisible into KVM_PGTABLE_MIN_BLOCK_LEVEL.
> 
> Seems like we could wind up visiting a PUD mapping 512 times, at least
> for 4K pages.

I have an idea, but it seems to go against the current design of the
pagetable walkers.  My sense was that they've been written to
discourage passing mutable state to the function that calls
kvm_pgtable_walk().  If we were willing to permit this, it seems like we
could leverage __kvm_pgtable_visit()'s knowledge about the size of the
mapping it walked to determine whether range_addr_end should be
incremented by our BLOCK_LEVEL constant, or advanced to the end of the
mapping that was already successfully walked.  (If I'm reading right,
anyway)  Does that seem like a reasonable approach?
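
To make the idea concrete, here is a toy userspace model, not kernel
code.  struct mapping, struct walk_state, walk() and apply_range() are
all hypothetical; the only thing borrowed from the kernel is the shape of
the range_addr_end() calculation.  The walker records the end of the
last leaf it visited, and the caller skips ahead to it when it lies
beyond the current chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Hypothetical stand-in for a stage-2 leaf entry (block or page). */
struct mapping {
	phys_addr_t start;
	phys_addr_t size;
	unsigned int visits;
};

/* Hypothetical state handed back from the walker to its caller. */
struct walk_state {
	phys_addr_t last_leaf_end;	/* 0 if no leaf was visited */
};

static phys_addr_t range_addr_end(phys_addr_t addr, phys_addr_t end,
				  phys_addr_t size)
{
	phys_addr_t boundary = (addr + size) & ~(size - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Visit each leaf overlapping [addr, end); remember the furthest end. */
static void walk(struct mapping *maps, size_t n, phys_addr_t addr,
		 phys_addr_t end, struct walk_state *st)
{
	for (size_t i = 0; i < n; i++) {
		phys_addr_t m_end = maps[i].start + maps[i].size;

		if (maps[i].start < end && m_end > addr) {
			maps[i].visits++;
			if (m_end > st->last_leaf_end)
				st->last_leaf_end = m_end;
		}
	}
}

static void apply_range(struct mapping *maps, size_t n, phys_addr_t addr,
			phys_addr_t end, phys_addr_t chunk, int use_state)
{
	assert(addr < end);
	do {
		struct walk_state st = { 0 };
		phys_addr_t next = range_addr_end(addr, end, chunk);

		walk(maps, n, addr, next, &st);
		/* A leaf reached past this chunk: skip to its end. */
		if (use_state && st.last_leaf_end > next)
			next = st.last_leaf_end < end ? st.last_leaf_end : end;
		addr = next;
	} while (addr != end);
}
```

In this model, a 1GB block walked in 2MB chunks is visited 512 times
without the state and once with it.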

If we do modify the walk to allow state to be passed back, I have a
second patch I'd like to send you.  Ali found that there was a
performance regression on the kvm_pagetable_test on the map creation
step when a large number of threads operated on a comparatively small
memory range.  (E.g. 64 CPUs and 8GB of RAM.)  We debugged this a bit and
found that there's an unmap in the map creation step if the test ends up
instantiating a readable zero page that needs to be copied and made
writable.  With the deferred TLBI logic, the tlb invalidation happens at
the end of the unmap operation whether a PTE is cleared or not.  With so
many threads, the clear doesn't always succeed.  The prior approach of just
doing the invalidation in stage2_unmap_put_pte() outperforms the
deferred invalidation, because stage2_unmap_put_pte() only calls
__kvm_tlb_flush_vmid_ipa() if it clears a valid PTE.  If we modify the
walk to keep state on whether any PTEs are successfully cleared, and
condition the deferred invalidation on that state, we obtain performance
that is equivalent to the pre-range-based deferred invalidation
approach.
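
Roughly, the conditional invalidation would look like the toy model
below.  This is a userspace sketch under my own assumptions: struct
toy_pgt, struct unmap_state, unmap_walk() and unmap_range() are
hypothetical, and a counter stands in for the actual TLB invalidation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy page table: valid[i] is true if PTE i is valid. */
struct toy_pgt {
	bool *valid;
	size_t nr_ptes;
	unsigned int tlb_flushes;	/* invalidations issued so far */
};

/* Hypothetical state the walker passes back to its caller. */
struct unmap_state {
	bool cleared_any;
};

/* Clear every valid PTE in [start, end) and note whether any was cleared. */
static void unmap_walk(struct toy_pgt *pgt, size_t start, size_t end,
		       struct unmap_state *st)
{
	assert(start <= end && end <= pgt->nr_ptes);
	for (size_t i = start; i < end; i++) {
		if (pgt->valid[i]) {
			pgt->valid[i] = false;
			st->cleared_any = true;
		}
	}
}

static void unmap_range(struct toy_pgt *pgt, size_t start, size_t end,
			bool conditional)
{
	struct unmap_state st = { .cleared_any = false };

	unmap_walk(pgt, start, end, &st);
	/*
	 * Deferred invalidation: one flush at the end of the walk.  The
	 * conditional variant skips it when no PTE was actually cleared.
	 */
	if (!conditional || st.cleared_any)
		pgt->tlb_flushes++;
}
```

In this model, a second unmap of an already-empty range costs a flush in
the unconditional variant and nothing in the conditional one.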

Thanks,

-K
