From: Patrick Roy
Date: Sat, 11 Oct 2025 16:32:34 +0200
Subject: Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
To: David Hildenbrand, Will Deacon
Cc: Dave Hansen, seanjc@google.com, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-doc@vger.kernel.org
"seanjc@google.com" , "kvm@vger.kernel.org" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-arm-kernel@lists.infradead.org" , "kvmarm@lists.linux.dev" , "linux-fsdevel@vger.kernel.org" , "linux-mm@kvack.org" , "bpf@vger.kernel.org" , "linux-kselftest@vger.kernel.org" , "Cali, Marco" , "Kalyazin, Nikita" , "Thomson, Jack" , "derekmn@amazon.co.uk" , "tabba@google.com" , "ackerleytng@google.com" References: <20250924151101.2225820-4-patrick.roy@campus.lmu.de> <20250924152214.7292-1-roypat@amazon.co.uk> <20250924152214.7292-3-roypat@amazon.co.uk> <82bff1c4-987f-46cb-833c-bd99eaa46e7a@intel.com> <5d11b5f7-3208-4ea8-bbff-f535cf62d576@redhat.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Patrick Roy Content-Language: en-US In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT Hey all, sorry it took me a while to get back to this, turns out moving internationally is move time consuming than I expected. On Mon, 2025-09-29 at 12:20 +0200, David Hildenbrand wrote: > On 27.09.25 09:38, Patrick Roy wrote: >> On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote: >>> On 26.09.25 12:53, Will Deacon wrote: >>>> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote: >>>>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote: >>>>>> On 25.09.25 21:59, Dave Hansen wrote: >>>>>>> On 9/25/25 12:20, David Hildenbrand wrote: >>>>>>>> On 25.09.25 20:27, Dave Hansen wrote: >>>>>>>>> On 9/24/25 08:22, Roy, Patrick wrote: >>>>>>>>>> Add an option to not perform TLB flushes after direct map manipulations. >>>>>>>>> >>>>>>>>> I'd really prefer this be left out for now. It's a massive can of worms. >>>>>>>>> Let's agree on something that works and has well-defined behavior before >>>>>>>>> we go breaking it on purpose. >>>>>>>> >>>>>>>> May I ask what the big concern here is? >>>>>>> >>>>>>> It's not a _big_ concern. >>>>>> >>>>>> Oh, I read "can of worms" and thought there is something seriously problematic :) >>>>>> >>>>>>> I just think we want to start on something >>>>>>> like this as simple, secure, and deterministic as possible. >>>>>> >>>>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly. >>>>> >>>>> Yes, I am definitely happy to have the 100% secure behavior be the >>>>> default, and the skipping of TLB flushes be an opt-in, with thorough >>>>> documentation! >>>>> >>>>> But I would like to include the "skip tlb flushes" option as part of >>>>> this patch series straight away, because as I was alluding to in the >>>>> commit message, with TLB flushes this is not usable for Firecracker for >>>>> performance reasons :( >>>> >>>> I really don't want that option for arm64. If we're going to bother >>>> unmapping from the linear map, we should invalidate the TLB. >>> >>> Reading "TLB flushes result in a up to 40x elongation of page faults in >>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation >>> of memory population,", I can understand why one would want that optimization :) >>> >>> @Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB flush within such an operation? >>> >>> That is, we wouldn't flush after each individual direct-map modification but after multiple ones part of a single operation like fallocate of a larger range. >>> >>> Likely wouldn't make all use cases happy. 
>> For Firecracker, we rely a lot on not preallocating _all_ VM memory,
>> and trying to ensure that only the actual "working set" of a VM is
>> faulted in (we pack a lot more VMs onto a physical host than there is
>> actual physical memory available). For VMs that are restored from a
>> snapshot, we know pretty well what memory needs to be faulted in
>> (that's where @Nikita's write syscall comes in), so there we could
>> try such an optimization. But for everything else we very much rely
>> on the on-demand nature of guest memory allocation (and hence direct
>> map removal). And even right now, the long pole performance-wise is
>> these on-demand faults, so we really don't want them to become even
>> slower :(
>
> Makes sense. I guess even without support for large folios one could
> implement a kind of "fault-around": for example, on access to one
> address, allocate+prepare all pages in the same 2M chunk, flushing the
> TLB only once after adjusting all the direct map entries.
>
>> Also, can we really batch multiple TLB flushes as you suggest? Even
>> if pages are at consecutive indices in guest_memfd, they're not
>> guaranteed to be contiguous physically, e.g. we couldn't just
>> coalesce multiple TLB flushes into a single TLB flush of a larger
>> range.
>
> Well, there is of course the option of just flushing the complete TLB
> :) When trying to flush a range, you would indeed run into the problem
> of flushing an ever-growing range.

In the last guest_memfd upstream call (over a week ago now), we
discussed the option of batching and deferring TLB flushes while
providing a sort of "deadline" at which a TLB flush will
deterministically be done. E.g., guest_memfd would keep a counter of
how many pages have had their direct map entries zapped, and flush a
range containing all zapped pages every 512 allocated pages (and, to
ensure the flushes still happen in a timely manner if no allocations
happen for a long time, also every, say, 5 seconds or so; I've put a
rough sketch of what I mean further below). Would that work for
everyone?

I briefly tested the performance of batch flushes with secretmem in
QEMU, and it's within 30% of the "no TLB flushes at all" solution in a
simple benchmark that just memsets 2GiB of memory. I think something
like this, together with the batch flushing at the end of
fallocate()/write() as David suggested above, should work for
Firecracker.

>> There are probably other things we can try. Backing guest_memfd with
>> hugepages would reduce the number of TLB flushes by 512x (although
>> not all users of Firecracker at Amazon [can] use hugepages).
>
> Right.
>
>> And I do still wonder if it's possible to have "async TLB flushes"
>> where we simply don't wait for the IPI (x86 terminology, not sure
>> what the mechanism on arm64 is). Looking at
>> smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it
>> seems so? Although it seems like on arm64 it's actually just handled
>> by a single instruction (TLBI) and not some inter-processor interrupt
>> mechanism. Maybe there's a variant that's faster / better for this
>> use case?
>
> Right, some architectures (and IIRC also x86 with some extension) are
> able to flush remote TLBs without IPIs.
>
> Doing a quick search, there seems to be some research on async TLB
> flushing, e.g., [1].
>
> In the context here, I wonder whether an async TLB flush would be
> significantly better than not doing an explicit TLB flush: in both
> cases, it's not really deterministic when the relevant TLB entries
> will vanish; with the async variant it might happen faster on
> average, I guess.
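Coming back to the counter + deadline scheme from above, here is
roughly the bookkeeping I have in mind. Only a sketch: all names, the
512-page threshold, and the 5-second deadline are made up, it does a
full flush instead of tracking the zapped range, and the counter
handling is simplistic about races.

#include <linux/atomic.h>
#include <linux/workqueue.h>
#include <asm/tlbflush.h>

#define GMEM_TLB_FLUSH_BATCH	512		/* flush every N zapped pages */
#define GMEM_TLB_FLUSH_DEADLINE	(5 * HZ)	/* ...or after 5s at the latest */

static atomic_t gmem_unflushed_zaps = ATOMIC_INIT(0);

static void gmem_deadline_flush(struct work_struct *work)
{
	/* Only flush if there are still unflushed zaps pending. */
	if (atomic_xchg(&gmem_unflushed_zaps, 0))
		flush_tlb_all();
}
static DECLARE_DELAYED_WORK(gmem_flush_work, gmem_deadline_flush);

/* Call after zapping a direct map entry without flushing the TLB. */
static void gmem_account_unflushed_zap(void)
{
	if (atomic_inc_return(&gmem_unflushed_zaps) >= GMEM_TLB_FLUSH_BATCH) {
		atomic_set(&gmem_unflushed_zaps, 0);
		flush_tlb_all();
	} else {
		/*
		 * Arm the deadline; this is a no-op if the work is already
		 * pending, so a flush happens within the deadline of the
		 * first unflushed zap.
		 */
		schedule_delayed_work(&gmem_flush_work,
				      GMEM_TLB_FLUSH_DEADLINE);
	}
}

The point being that the zap path stays cheap, while the delayed work
guarantees an upper bound on how long a zapped entry can linger in the
TLB.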
As for async flushes: I actually did end up playing around with those a
while ago, and they made things slightly better performance-wise, but
it was still not good enough to be useful :(

> [1] https://cs.yale.edu/homes/abhishek/kumar-taco20.pdf

Best,
Patrick