From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-182.mta0.migadu.com (out-182.mta0.migadu.com [91.218.175.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C0CE829BDBC for ; Sat, 27 Sep 2025 07:38:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.182 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758958720; cv=none; b=Os49fMHuBxQanIC5atRKyuk9i4C2ROHe9UbzQbxFSahDTrgDNI3WCGXnyxzrWdODg+OoHJdYfyY6mh7DrHX2gqQba+ml2AoDOQqC2lyj1FUgioqRiO6na6GQPVwyQ+PodrgI3a85IMNXSvuKGoXq3CEedEiA5iFZTa0eAdJhXKc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758958720; c=relaxed/simple; bh=tr3cVw/fHsBObw+4/03vyNgdWXv1m3E2RPdz5cDO+i4=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=fsGxPZ9PqFCnRLIAA5xm7fgCePyjr0t0v0ejZ73xE8nJnpdBAIlylmyVPr/PIuZiQ9d3/U/J81KPK71g9OcGLPhQnieLLCSrbMjKCu7kQ+9FdZOfiDHu+7r/AHqCFcRO7rt773ftg528daEOFPJKwCG15evRokkvKbl0vNrTdX0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=MDhzVIrG; arc=none smtp.client-ip=91.218.175.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="MDhzVIrG" Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1758958715; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Wr7c37MXYsi1SZjRIiCMW6TvRd9zbqLAteCUC2bU040=; b=MDhzVIrG9XqOhn+X4EyvqT5WFEC48i0ScQW1M2TEQNfH+Abi956kVrFuXVs5+tNbHUeEBS HSH1dZ6zZI6FfoXFd0qRETX0UB3fscW0ypzjqRJvDzeO0w95mYEUlUuw8Y//m8zDINyAux 7aMnDrKrTlaK0SG5hxR2LTl6ILFDhs4= Date: Sat, 27 Sep 2025 08:38:29 +0100 Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing To: David Hildenbrand , Will Deacon Cc: Dave Hansen , "Roy, Patrick" , "pbonzini@redhat.com" , "corbet@lwn.net" , "maz@kernel.org" , "oliver.upton@linux.dev" , "joey.gouly@arm.com" , "suzuki.poulose@arm.com" , "yuzenghui@huawei.com" , "catalin.marinas@arm.com" , "tglx@linutronix.de" , "mingo@redhat.com" , "bp@alien8.de" , "dave.hansen@linux.intel.com" , "x86@kernel.org" , "hpa@zytor.com" , "luto@kernel.org" , "peterz@infradead.org" , "willy@infradead.org" , "akpm@linux-foundation.org" , "lorenzo.stoakes@oracle.com" , "Liam.Howlett@oracle.com" , "vbabka@suse.cz" , "rppt@kernel.org" , "surenb@google.com" , "mhocko@suse.com" , "song@kernel.org" , "jolsa@kernel.org" , "ast@kernel.org" , "daniel@iogearbox.net" , "andrii@kernel.org" , "martin.lau@linux.dev" , "eddyz87@gmail.com" , "yonghong.song@linux.dev" , "john.fastabend@gmail.com" , "kpsingh@kernel.org" , "sdf@fomichev.me" , "haoluo@google.com" , "jgg@ziepe.ca" , "jhubbard@nvidia.com" , "peterx@redhat.com" , "jannh@google.com" , "pfalcato@suse.de" , "shuah@kernel.org" , "seanjc@google.com" , "kvm@vger.kernel.org" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-arm-kernel@lists.infradead.org" , "kvmarm@lists.linux.dev" , "linux-fsdevel@vger.kernel.org" , "linux-mm@kvack.org" , "bpf@vger.kernel.org" , "linux-kselftest@vger.kernel.org" , "Cali, Marco" , "Kalyazin, Nikita" , "Thomson, Jack" , "derekmn@amazon.co.uk" , "tabba@google.com" , "ackerleytng@google.com" References: <20250924151101.2225820-4-patrick.roy@campus.lmu.de> <20250924152214.7292-1-roypat@amazon.co.uk> <20250924152214.7292-3-roypat@amazon.co.uk> <82bff1c4-987f-46cb-833c-bd99eaa46e7a@intel.com> <5d11b5f7-3208-4ea8-bbff-f535cf62d576@redhat.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Patrick Roy Content-Language: en-US In-Reply-To: <5d11b5f7-3208-4ea8-bbff-f535cf62d576@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote: > On 26.09.25 12:53, Will Deacon wrote: >> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote: >>> >>> >>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote: >>>> On 25.09.25 21:59, Dave Hansen wrote: >>>>> On 9/25/25 12:20, David Hildenbrand wrote: >>>>>> On 25.09.25 20:27, Dave Hansen wrote: >>>>>>> On 9/24/25 08:22, Roy, Patrick wrote: >>>>>>>> Add an option to not perform TLB flushes after direct map manipulations. >>>>>>> >>>>>>> I'd really prefer this be left out for now. It's a massive can of worms. >>>>>>> Let's agree on something that works and has well-defined behavior before >>>>>>> we go breaking it on purpose. >>>>>> >>>>>> May I ask what the big concern here is? >>>>> >>>>> It's not a _big_ concern. >>>> >>>> Oh, I read "can of worms" and thought there is something seriously problematic :) >>>> >>>>> I just think we want to start on something >>>>> like this as simple, secure, and deterministic as possible. >>>> >>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly. >>> >>> Yes, I am definitely happy to have the 100% secure behavior be the >>> default, and the skipping of TLB flushes be an opt-in, with thorough >>> documentation! >>> >>> But I would like to include the "skip tlb flushes" option as part of >>> this patch series straight away, because as I was alluding to in the >>> commit message, with TLB flushes this is not usable for Firecracker for >>> performance reasons :( >> >> I really don't want that option for arm64. If we're going to bother >> unmapping from the linear map, we should invalidate the TLB. > > Reading "TLB flushes result in a up to 40x elongation of page faults in > guest_memfd (scaling with the number of CPU cores), or a 5x elongation > of memory population,", I can understand why one would want that optimization :) > > @Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB flush within such an operation? > > That is, we wouldn't flush after each individual direct-map modification but after multiple ones part of a single operation like fallocate of a larger range. > > Likely wouldn't make all use cases happy. > For Firecracker, we rely a lot on not preallocating _all_ VM memory, and trying to ensure only the actual "working set" of a VM is faulted in (we pack a lot more VMs onto a physical host than there is actual physical memory available). For VMs that are restored from a snapshot, we know pretty well what memory needs to be faulted in (that's where @Nikita's write syscall comes in), so there we could try such an optimization. But for everything else we very much rely on the on-demand nature of guest memory allocation (and hence direct map removal). And even right now, the long pole performance-wise are these on-demand faults, so really, we don't want them to become even slower :( Also, can we really batch multiple TLB flushes as you suggest? Even if pages are at consecutive indices in guest_memfd, they're not guaranteed to be continguous physically, e.g. we couldn't just coalesce multiple TLB flushes into a single TLB flush of a larger range. There's probably other things we can try. Backing guest_memfd with hugepages would reduce the number TLB flushes by 512x (although not all users of Firecracker at Amazon [can] use hugepages). And I do still wonder if it's possible to have "async TLB flushes" where we simply don't wait for the IPI (x86 terminology, not sure what the mechanism on arm64 is). Looking at smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it seems so? Although seems like on ARM it's actually just handled by a single instruction (TLBI) and not some interprocess communication thingy. Maybe there's a variant that's faster / better for this usecase?