Date: Fri, 15 Nov 2024 16:23:30 +0000
From: Catalin Marinas
To: Marc Zyngier
Cc: "Aneesh Kumar K.V", linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	Suzuki K Poulose, Steven Price, Will Deacon, Mark Rutland,
	Oliver Upton, Joey Gouly, Zenghui Yu
Subject: Re: [PATCH 4/4] arm64: mte: Use stage-2 NoTagAccess memory attribute if supported
References: <20241028094014.2596619-1-aneesh.kumar@kernel.org>
	<20241028094014.2596619-5-aneesh.kumar@kernel.org>
	<87o734ts4m.wl-maz@kernel.org>
	<87jzdst6os.wl-maz@kernel.org>
	<86ldxozn3i.wl-maz@kernel.org>
In-Reply-To: <86ldxozn3i.wl-maz@kernel.org>
X-Mailing-List: kvmarm@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

(joining the thread as well, though not sure I'm bringing anything new)

On Tue, Nov 12, 2024 at 11:51:45AM +0000, Marc Zyngier wrote:
> On Fri, 08 Nov 2024 07:59:31 +0000, Aneesh Kumar K.V wrote:
> > Marc Zyngier writes:
> > > On Mon, 28 Oct 2024 13:28:42 +0000, Aneesh Kumar K.V wrote:
> > >> Marc Zyngier writes:
> > >> > On Mon, 28 Oct 2024 09:40:14 +0000, "Aneesh Kumar K.V (Arm)" wrote:
> > >> >> Currently, the kernel won't start a guest if the MTE feature is enabled
> > >> >> and the guest RAM is backed by memory which doesn't support access tags.
> > >> >> Update this such that the kernel uses the NoTagAccess memory attribute
> > >> >> while mapping pages from VMAs for which MTE is not allowed. The fault
> > >> >> from accessing the access tags with such pages is forwarded to VMM so
> > >> >> that VMM can decide to kill the guest or remap the pages so that
> > >> >> access tag storage is allowed.
> > >> >
> > >> > I only have questions here:
> > >> >
> > >> > - what is the benefit of such approach? why shouldn't that be the
> > >> >   kernel's job to fix it?
> > >>
> > >> IMHO leaving that policy decision to VMM makes the kernel changes
> > >> simpler. In most cases, VMM will kill the guest, because these
> > >> restrictions of MTE_ALLOWED are applied at the memslot/vma.
> > >
> > > Where is that captured? The whole idea behind FEAT_MTE_PERM was that
> > > it would be the hypervisor's task to lazily allocate MTE-capable
> > > memory as tagged-access would occur.
> >
> > Lazily allocating MTE-capable memory requires changes to different
> > kernel subsystems and previous attempts got dropped [1] because it
> > was not clear whether the benefit of saving 3% memory overhead was worth
> > the complexity we add to the kernel.

I'd say the most complex part in Alex's approach was the need to reuse
the tag storage for classic data and kick the pages around when some
other page needs to store tags in there. That approach is pretty much
dead. In theory, having MTE and non-MTE memory (heterogeneous) without a
carveout reuse would be a bit more manageable - not that far from the
NUMA migration and at least you only migrate the page being accessed
rather than unrelated ones where the tags need to go. But this was not
Alex's goal for Android since people were asking for the reuse of the 3%
carveout rather than a smaller carveout.

Other future deployments, CXL-attached memory etc. may benefit from a
new scheme but I wouldn't rush in implementing anything in the kernel
for now.

The VMM may be in a better position to manage such heterogeneous memory
for the guest if it knows the capabilities of the slots (e.g. some DAX
mmap() vs anonymous mmap()). This would require the VMM replacing a page
within a slot from one memory type to another (while preserving the
data). I don't think we have a concrete use-case yet to be worth the
hassle.

> That's not the point. Tagged memory doesn't have to cover the whole of
> physical memory, and it can be statically allocated. The architecture
> doesn't mandate that all of the memory is MTE-capable.

There's some vague wording that general purpose memory should be MTE
capable if FEAT_MTE2 or later is advertised. But that's not well defined
and one can have other types of memory in the physical space (e.g. CXL)
that don't support tags. Last time I looked we still haven't got a way
to describe memory capabilities in firmware.

For the time being, I think a real use-case for FEAT_MTE_PERM is in the
context of cacheable MMIO (there are some patches around from Nvidia to
do this with VFIO). That memory, if exposed to guest as WB and the guest
enables MTE, may trigger some SErrors. With FEAT_MTE_PERM KVM could trap
and inject a fault back into the guest - maybe SEA.
Is it easier to do this from KVM itself or would we rather exit to the
VMM and let it handle it? The latter allows room for other fancier
things in the future but the former may be quicker, in the absence of
other strong use-cases.

> > This patchset is not looking at that feature. Instead, it can be used to
> > enable MTE in configurations that currently won't allow MTE. One such
> > example is libkrun which includes linux kernel as firmware in a
> > dynamically linked library (libkrunfw). libkrun can insert the kernel
> > region which got mmaped as part of the library load, directly into the
> > guest memory map instead of copying the kernel. Such a guest config
> > can't enable MTE currently even though we will never use the newly
> > inserted memory regions as tag access memory.

I've never played with libkrunfw. Does it handle inserting a Linux
kernel? Such an approach may not work well with MTE. The kernel frees
the init text/data sections back into the page allocator. With MTE
advertised as present, the guest will try to reuse that memory,
potentially as tagged. However, since the VMM mmap'ed the guest kernel
from a file, VM_MTE is not supported in the VMM address space. Even if
the mapping is MAP_PRIVATE and the page copied on write, the vma remains
the original one associated with the file, so VM_MTE_ALLOWED is not set.
We could revisit this and allow mprotect(PROT_MTE) to force the CoW on
private file mappings but we'd need some strong requirement for this
(MTE+libkrunfw could be such a thing if people need this combination).

> > Similarly, virtiofs dax support can use a page cache region as
> > virtio-shm region. We can use MTE_PERM to enable MTE in this config.
>
> And this use case doesn't contradict what I am stating above. But it
> definitely contradicts what you wrote: "In most cases, VMM will kill
> the guest".

To simplify things, I think whatever is presented to the VM as standard
RAM (typically present at the VM boot) should either support MTE or MTE
will be disabled for the guest. For other types of memory, whether WB
MMIO or RAM presented as virtio-(pmem, shm etc.) backed by files in the
VMM, the VM should be aware it is not standard RAM and should not
attempt to enable MTE on it. If it does (either by mistake or malice),
FEAT_MTE_PERM should trap and inject a fault back into the guest (or
kill it altogether but for debugging, I'd rather inject a fault if
possible).

-- 
Catalin