Date: Tue, 2 Apr 2024 10:00:52 -0700
From: Krister Johansen <kjlx@templeofstupid.com>
To: Marc Zyngier
Cc: Krister Johansen, Oliver Upton, James Morse, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Ali Saidi, David Reaver,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block
Message-ID: <20240402170052.GA1988@templeofstupid.com>
References: <20240329191537.GA2051@templeofstupid.com> <87r0fsrpko.wl-maz@kernel.org>
In-Reply-To: <87r0fsrpko.wl-maz@kernel.org>

Hi Marc,

On Sat, Mar 30, 2024 at 10:17:43AM +0000, Marc Zyngier wrote:
> On Fri, 29 Mar 2024 19:15:37 +0000,
> Krister Johansen wrote:
> > On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > > stage2_apply_range() for unmap operations can interfere with the
> > > > performance of IO if the device's interrupts share the CPU where the
> > > > unmap operation is occurring. commit 5994bc9e05c2 ("KVM: arm64: Limit
> > > > stage2_apply_range() batch size to largest block") improved this. Prior
> > > > to that commit, workloads that were unfortunate enough to have their IO
> > > > interrupts pinned to the same CPU as the unmap operation would observe a
> > > > complete stall. With the switch to using the largest block size, it is
> > > > possible for IO to make progress, albeit at a reduced speed.
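(For anyone skimming the thread: the batching being discussed boils down
to walking the unmap range in fixed-size chunks, with a lock-drop /
resched point between chunks. The toy sketch below is plain userspace C
with made-up names -- CHUNK_SIZE, unmap_chunk(), reschedule_point() --
and is not the actual stage2_apply_range() code; it only illustrates why
a smaller chunk shortens the stall seen by IO sharing the CPU.)

	/*
	 * Toy model of chunked unmap, not kernel code.  All names are
	 * invented for illustration.  The idea: never process more than
	 * one chunk of the range before letting other work on this CPU
	 * run again.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define CHUNK_SIZE	((uint64_t)2 << 20)	/* 2MiB batch */

	static void unmap_chunk(uint64_t ipa, uint64_t len)
	{
		/* Stand-in for the table walk + TLBI + CMO for one batch. */
		printf("unmap [%#llx, %#llx)\n",
		       (unsigned long long)ipa,
		       (unsigned long long)(ipa + len));
	}

	static void reschedule_point(void)
	{
		/*
		 * Stand-in for dropping the lock / cond_resched() between
		 * batches: this is where pending interrupts and other vCPU
		 * work get to run.  Smaller chunks mean more of these points
		 * and shorter stalls, at the cost of more iterations (the
		 * map/unmap slowdown discussed further down).
		 */
	}

	static void unmap_range_batched(uint64_t ipa, uint64_t end)
	{
		while (ipa < end) {
			/* Clamp each batch so it never crosses a chunk boundary. */
			uint64_t next = (ipa + CHUNK_SIZE) & ~(CHUNK_SIZE - 1);
			uint64_t len = (next < end ? next : end) - ipa;

			unmap_chunk(ipa, len);
			reschedule_point();
			ipa += len;
		}
	}

	int main(void)
	{
		/* Example: a 7MiB range starting 1MiB into a chunk -> 4 batches. */
		unmap_range_batched(0x100000, 0x100000 + ((uint64_t)7 << 20));
		return 0;
	}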
> > > Can you describe the workload a bit more? I'm having a hard time
> > > understanding how you're unmapping that much memory on the fly in
> > > your workload. Is guest memory getting swapped? Are VMs being torn
> > > down?
> >
> > Sorry I wasn't clear here. Yes, it's the VMs getting torn down that's
> > causing the problems. The container VMs don't have long lifetimes, but
> > some may be up to 256GB in size, depending on the user. The workloads
> > running the VMs aren't especially performance sensitive, but their users
> > do notice when network connections time out. IOW, if the performance is
> > bad enough to temporarily prevent new TCP connections from being
> > established, or requests / responses from being received in a timely
> > fashion, we'll hear about it. Users deploy their services a lot, so
> > there's a lot of container VM churn. (Really it's automation
> > redeploying the services on behalf of the users in response to new
> > commits to their repos...)
>
> I think this advocates for a teardown-specific code path rather than
> just relying on the usual S2 unmapping which is really designed for
> eviction. There are two things to consider here:
>
> - TLB invalidation: this should only take a single VMALLS12E1, rather
>   than iterating over the PTs
>
> - Cache maintenance: this could be elided with FWB, or *optionally*
>   elided if userspace buys in a "I don't need to see the memory of the
>   guest after teardown" type of behaviour

This approach would work for this workload, I think. The hardware
supports FWB, and AFAIK the workload isn't looking at the guest memory
after teardown. This is also desirable because in the future we'd like
to support hotplug of VFIO devices. A separate path for unmapping the
memory used by the device vs. unmapping all of the guest seems smart.
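Just to make sure I'm reading the proposal correctly, the teardown path
would look roughly like the toy model below -- ordinary userspace C with
printf stand-ins, every name invented for illustration, not a sketch of
the actual KVM code. The walk only has to free table pages; the
per-mapping TLBI work collapses into one VMID-wide invalidation, and the
CMO pass is skipped when FWB (or the userspace opt-out) makes it
unnecessary.

	/*
	 * Toy model of a teardown-specific stage-2 unmap, not kernel code.
	 * Every function here is a made-up stand-in for illustration only.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	/* Stand-in for "FWB is implemented, or userspace opted out of
	 * seeing guest memory after teardown". */
	static bool fwb_or_optout = true;

	static void clean_invalidate_guest_memory(void)
	{
		/*
		 * Only needed when FWB can't guarantee coherency and someone
		 * still wants to read guest memory afterwards.  In real code
		 * this would happen during the walk, while mappings exist.
		 */
		printf("CMO pass over guest memory\n");
	}

	static void free_stage2_tables(void)
	{
		/* Free the page-table pages; no per-entry TLBI or CMO. */
		printf("freeing stage-2 page-table pages\n");
	}

	static void invalidate_whole_vmid(void)
	{
		/* One TLBI VMALLS12E1IS for the VMID, not one per mapping. */
		printf("TLBI VMALLS12E1IS\n");
	}

	static void teardown_vm_stage2(void)
	{
		if (!fwb_or_optout)
			clean_invalidate_guest_memory();

		free_stage2_tables();
		invalidate_whole_vmid();
	}

	int main(void)
	{
		teardown_vm_stage2();
		return 0;
	}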
> > > Also, it seems a bit odd to steer interrupts *into* the workload you
> > > care about...
> >
> > Ah, that was only done intentionally for the purposes of measuring the
> > impact. That's not done on purpose in production.
> >
> > Nevertheless, the example we tend to run into is that a box may have 2
> > NICs and each NIC has 32 Tx-Rx queues. This means we've got 64 NIC
> > interrupts, each assigned to a different CPU. Our systems have 64 CPUs.
> > What happens in practice is that a VM will get torn down, and that has a
> > 1-in-64 chance of impacting the performance of the subset of the flows
> > that are mapped via RSS to the interrupt that happens to be assigned to
> > the CPU where the VM is being torn down.
> >
> > Of course, the obvious next question is why not just bind the VM's
> > flows to the CPUs the VM is running on? We don't have a 1:1 mapping of
> > network device to VM, or VM to CPU right now, which frustrates this
> > approach.
> >
> > > > Further reducing the stage2_apply_range() batch size has substantial
> > > > performance improvements for IO that shares a CPU with an unmap
> > > > operation. By switching to a 2MB chunk, IO performance regressions were
> > > > no longer observed in this author's tests. E.g. it was possible to
> > > > obtain the advertised device throughput despite an unmap operation
> > > > occurring on the CPU where the interrupt was running. There is a
> > > > tradeoff, however. No changes were observed in per-operation timings
> > > > when running the kvm_pagetable_test without an interrupt load. However,
> > > > with a 64GB VM, 1 vcpu, 4k pages, and an IO load, map times increased
> > > > by about 15% and unmap times increased by about 58%. In essence, this
> > > > trades slower map/unmap times for improved IO throughput.
> > >
> > > There are other users of the range-based operations, like
> > > write-protection. Live migration is especially sensitive to the latency
> > > of page table updates as it can affect the VMM's ability to converge
> > > with the guest.
> >
> > To be clear, the reduction in performance was observed when I
> > concurrently executed both the kvm_pagetable_test and a networking
> > benchmark where the NIC's interrupts were assigned to the same CPU where
> > the pagetable test was executing. I didn't see a slowdown just running
> > the pagetable test.
>
> Any chance you could share more details about your HW configuration
> (what CPU is that?) and the type of traffic? This is the sort of
> thing I'd like to be able to reproduce in order to experiment with
> various strategies.

Sure, I only have access to documentation that is publicly available.
The hardware where we ran into this initially was Graviton 3, which is
a Neoverse-V1-based core. It does not support FEAT_TLBIRANGE. I've
also tested on Graviton 4, which is Neoverse-V2-based. It _does_
support FEAT_TLBIRANGE. The deferred range-based invalidation support
was enough to allow us to tear down a large VM based on 4k pages and
not incur a visible performance penalty. I haven't had a chance to
test to see if and how Will's patches change this, though.

The tests themselves were not especially fancy. The networking
hardware was an ENA device on an EC2 box with a 30Gbps limit (5/10
Gbps per flow, depending on the config). The storage tested was a gp3
EBS device configured for max IOPS/throughput (16,000 IOPS / 1,000
MB/s). Networking tests were iperf3 with a 9001-byte packet size. The
storage tests were fio's randwrite workload in directio mode using the
libaio backend. The "IOPS" test used a 4k blocksize and a queue depth
of 128. The "throughput" test used a blocksize of 64k and an iodepth
of 32. For the fio tests, it was a 10GB file and 2 workers, mostly
because the EBS devices have two hardware queues for data. I ran the
kvm_page_table_test with a few different sizes, but settled on 64G
with 1 vcpu for most tests.

Let me know if there's anything else I can share here.

-K