Re: [PATCH v3 00/14] arm64: Support for running as a guest in Arm CCA

From: Catalin Marinas <catalin.marinas@arm.com>
To: Michael Kelley <mhklinux@outlook.com>
Cc: Steven Price <steven.price@arm.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"kvmarm@lists.linux.dev" <kvmarm@lists.linux.dev>,
	Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>,
	James Morse <james.morse@arm.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Joey Gouly <joey.gouly@arm.com>,
	Alexandru Elisei <alexandru.elisei@arm.com>,
	Christoffer Dall <christoffer.dall@arm.com>,
	Fuad Tabba <tabba@google.com>,
	"linux-coco@lists.linux.dev" <linux-coco@lists.linux.dev>,
	Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Subject: Re: [PATCH v3 00/14] arm64: Support for running as a guest in Arm CCA
Date: Mon, 10 Jun 2024 18:46:06 +0100
Message-ID: <Zmc73jAL2XdLU49P@arm.com>
In-Reply-To: <SN6PR02MB4157E83EAFA5EBEF5C5889BFD4C62@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Jun 10, 2024 at 05:03:44PM +0000, Michael Kelley wrote:
> From: Catalin Marinas <catalin.marinas@arm.com> Sent: Monday, June 10, 2024 3:34 AM
> > I wonder whether something like __GFP_DECRYPTED could be used to get
> > shared memory from the allocation time and avoid having to change the
> > vmalloc() ranges. This way functions like netvsc_init_buf() would get
> > decrypted memory from the start and vmbus_establish_gpadl() would not
> > need to call set_memory_decrypted() on a vmalloc() address.
> 
> I would not have any conceptual objections to such an approach. But I'm
> certainly not an expert in that area so I'm not sure what it would take
> to make that work for vmalloc(). I presume that __GFP_DECRYPTED
> should also work for kmalloc()?
> 
> I've seen the separate discussion about a designated pool of decrypted
> memory, to avoid always allocating a new page and decrypting when a
> smaller allocation is sufficient. If such a pool could also work for page size
> or larger allocations, it would have the additional benefit of concentrating
> decrypted allocations in fewer 2 Meg large pages vs. scattering wherever
> and forcing the break-up of more large page mappings in the direct map.

Yeah, my quick, not fully tested hack here:

https://lore.kernel.org/linux-arm-kernel/ZmNJdSxSz-sYpVgI@arm.com/

It's the underlying page allocator that gives back decrypted pages when
the flag is passed, so it should work for alloc_pages() and friends. The
kmalloc() changes only ensure that we have separate caches for this
memory and they are not merged. It needs some more work on kmem_cache,
maybe introducing a SLAB_DECRYPTED flag as well, so as not to rely on
the GFP flag.

For vmalloc(), we'd need a pgprot_decrypted() macro to ensure the
decrypted pages are marked with the appropriate attributes (arch
specific), otherwise it's fairly easy to wire up if alloc_pages() gives
back decrypted memory.
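
A rough user-space model of the pgprot_decrypted() idea, not kernel
code. The MODEL_* names and bit positions are made up for illustration
(the real attribute is arch specific), and the helper only mimics what a
vmalloc()-style mapping would do: apply one protection value to a set of
pages that need not be physically contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* Hypothetical layout, chosen only for this model. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PTE_VALID		(1ULL << 0)
#define MODEL_PTE_WRITE		(1ULL << 1)
/* Stand-in for the arch-specific "decrypted/shared" attribute. */
#define MODEL_PTE_SHARED	(1ULL << 51)

/* Return prot with the shared attribute set. */
static inline pteval_t model_pgprot_decrypted(pteval_t prot)
{
	return prot | MODEL_PTE_SHARED;
}

/* Return prot with the shared attribute cleared. */
static inline pteval_t model_pgprot_encrypted(pteval_t prot)
{
	return prot & ~MODEL_PTE_SHARED;
}

/*
 * Build one PTE per page. The PFNs may be scattered, as they are for
 * vmalloc() memory; every entry receives the same protection bits.
 */
static void model_map_pages(pteval_t *ptes, const uint64_t *pfns,
			    size_t n, pteval_t prot)
{
	assert(ptes && pfns);

	for (size_t i = 0; i < n; i++)
		ptes[i] = (pfns[i] << MODEL_PAGE_SHIFT) | prot;
}
```

If the page allocator already hands back decrypted pages, the mapping
side only has to pass model_pgprot_decrypted(prot) instead of prot;
nothing else in the mapping loop changes.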

> I'll note that netvsc devices can be added or removed from a running VM.
> The vmalloc() memory allocated by netvsc_init_buf() can be freed, and/or
> additional calls to netvsc_init_buf() can be made at any time -- they aren't
> limited to initial Linux boot.  So the mechanism for getting decrypted
> memory at allocation time must be reasonably dynamic.

I think the above should work. But, of course, we'd have to get this
past the mm maintainers; it's likely that I missed something.

> Rejecting vmalloc() addresses may work for the moment -- I don't know
> when CCA guests might be tried on Hyper-V.  The original SEV-SNP and TDX
> work started that way as well. :-) Handling the vmalloc() case was added
> later, though I think on x86 the machinery to also flip all the alias PTEs was
> already mostly or completely in place, probably for other reasons. So
> fixing the vmalloc() case was more about not assuming that the underlying
> physical address range is contiguous. Instead, each page must be processed
> independently, which was straightforward.

There may be a slight performance impact but I guess that's not on a
critical path. Walking the page tables and changing the vmalloc ptes
should be fine but for each page, we'd have to break the linear map,
flush the TLBs, re-create the linear map. Those TLB flushes may become a
bottleneck, especially on hardware with lots of CPUs, depending on the
microarchitecture. Note that even with a __GFP_DECRYPTED attribute, we'd
still need to go for individual pages in the linear map.
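
To make the per-page cost concrete, here is a small user-space model,
not kernel code. All names (struct model_linear_map,
model_set_pfn_shared(), model_set_range_shared()) are hypothetical, and
a "TLB flush" is just a counter. It assumes one linear-map entry per
page and one flush per entry whose attribute actually changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_PFNS	1024

struct model_linear_map {
	bool shared[MODEL_NR_PFNS];	/* per-PFN attribute of the linear alias */
	unsigned long tlb_flushes;	/* how many flushes were issued */
};

/*
 * Change one linear-map entry: break the old entry, flush, re-create it
 * with the new attribute. Only the flush count and the final state are
 * modelled. Returns 0 on success, -1 for an out-of-range PFN.
 */
static int model_set_pfn_shared(struct model_linear_map *lm, uint64_t pfn,
				bool shared)
{
	if (pfn >= MODEL_NR_PFNS)
		return -1;
	if (lm->shared[pfn] == shared)
		return 0;		/* already in the right state, no flush */

	lm->tlb_flushes++;		/* flush after breaking the old entry */
	lm->shared[pfn] = shared;	/* re-create with the new attribute */
	return 0;
}

/*
 * Process a vmalloc()-like range page by page; the PFNs are not assumed
 * to be contiguous. Stops at the first error, leaving earlier pages
 * already converted.
 */
static int model_set_range_shared(struct model_linear_map *lm,
				  const uint64_t *pfns, size_t n, bool shared)
{
	for (size_t i = 0; i < n; i++) {
		int ret = model_set_pfn_shared(lm, pfns[i], shared);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this model, converting N scattered pages costs N flushes, which is
the part that could hurt on systems with many CPUs. A real
implementation might batch the flushes; that is not shown here.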

-- 
Catalin