Linux-HyperV List


* Re: [PATCH v2 6/6] drivers/pci/hyperv/arm64: vPCI MSI IRQ domain from DT
From: Roman Kisel @ 2024-06-11 14:40 UTC
  To: Bjorn Helgaas
  Cc: Saurabh Singh Sengar, arnd, bhelgaas, bp, catalin.marinas,
	dave.hansen, decui, haiyangz, hpa, kw, kys, lenb, lpieralisi,
	mingo, mhklinux, rafael, robh, tglx, wei.liu, will, linux-acpi,
	linux-arch, linux-arm-kernel, linux-hyperv, linux-kernel,
	linux-pci, x86, ssengar, sunilmut, vdso
In-Reply-To: <20240607195501.GA858122@bhelgaas>



On 6/7/2024 12:55 PM, Bjorn Helgaas wrote:
> On Wed, May 15, 2024 at 01:12:38PM -0500, Bjorn Helgaas wrote:
>> On Wed, May 15, 2024 at 09:34:09AM -0700, Roman Kisel wrote:
>>>
>>>
>>> On 5/15/2024 2:48 AM, Saurabh Singh Sengar wrote:
>>>> On Tue, May 14, 2024 at 03:43:53PM -0700, Roman Kisel wrote:
>>>>> The hyperv-pci driver uses ACPI for MSI IRQ domain configuration
>>>>> on arm64, so it cannot do that in VTL mode, where only DeviceTree
>>>>> can be used.
>>>>>
>>>>> Update the hyperv-pci driver to discover interrupt configuration
>>>>> via DeviceTree.
>>>>
>>>> Subject prefix should be "PCI: hv:"
> 
> I forgot to also suggest that the subject line begin with a verb,
> e.g., "Get vPCI MSI IRQ domain from DT" or similar, again so it reads
> consistently with previous commits.
> 
> Oh, I see patch 5/6, "Get the irq number from DeviceTree" is also very
> similar.  It would be nice if they matched, e.g., both used "IRQ" and
> "DT".
> 
> Bjorn

Will update, thanks! I will most likely send another version next week.

-- 
Thank you,
Roman


* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-11 10:06 UTC
  To: linux-kernel, Andrew Morton
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	Mike Rapoport, Oscar Salvador, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Alexander Potapenko, Marco Elver,
	Dmitry Vyukov
In-Reply-To: <20240607090939.89524-2-david@redhat.com>

On 07.06.24 11:09, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
> 
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

@Andrew, can you squash the following?

 From 0a7921cf21cacf178ca7485da0138fc38a97a28e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 11 Jun 2024 12:05:09 +0200
Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
  long"

Fixup the memblock comment.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/page_alloc.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0c8a8354be36..fc53f96db58a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1245,7 +1245,7 @@ void __free_pages_core(struct page *page, unsigned int order,
  		debug_pagealloc_map_pages(page, nr_pages);
  		adjust_managed_page_count(page, nr_pages);
  	} else {
-		/* memblock adjusts totalram_pages() ahead of time. */
+		/* memblock adjusts totalram_pages() manually. */
  		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
  	}
  
-- 
2.45.2



-- 
Cheers,

David / dhildenb
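
Illustrative sketch (not part of the thread): the two accounting paths that the fixed-up comment distinguishes can be modelled in userspace as below. This is a toy model under stated assumptions, not kernel code. The names zone_model, totalram_pages_model, adjust_managed_page_count_model() and free_pages_core_model() are hypothetical stand-ins. The model assumes that the kernel's adjust_managed_page_count() updates both the zone's managed page counter and the global total RAM counter, while the non-hotplug path touches only the zone counter because memblock handles totalram_pages() itself.

```c
#include <stdatomic.h>

/* Mirrors the kernel's enum of the same name (values assumed). */
enum meminit_context { MEMINIT_EARLY, MEMINIT_HOTPLUG };

/* Hypothetical stand-in for the per-zone managed page counter. */
struct zone_model {
	atomic_long managed_pages;
};

/* Hypothetical stand-in for the global total RAM counter. */
static atomic_long totalram_pages_model;

/* Model of adjust_managed_page_count(): updates both counters. */
static void adjust_managed_page_count_model(struct zone_model *z, long count)
{
	atomic_fetch_add(&z->managed_pages, count);
	atomic_fetch_add(&totalram_pages_model, count);
}

/* Model of the accounting tail of __free_pages_core(). */
static void free_pages_core_model(struct zone_model *z, unsigned int order,
				  enum meminit_context context)
{
	long nr_pages = 1L << order;

	if (context == MEMINIT_HOTPLUG) {
		/* Hotplug path: both counters are adjusted here. */
		adjust_managed_page_count_model(z, nr_pages);
	} else {
		/* memblock adjusts the total RAM counter manually. */
		atomic_fetch_add(&z->managed_pages, nr_pages);
	}
}
```

In this model, freeing an order-3 block in the hotplug context raises both counters by 8, whereas an early-boot free raises only the zone counter.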



* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: David Hildenbrand @ 2024-06-11  9:42 UTC
  To: linux-kernel, Andrew Morton
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	Mike Rapoport, Oscar Salvador, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Alexander Potapenko, Marco Elver,
	Dmitry Vyukov
In-Reply-To: <20240607090939.89524-3-david@redhat.com>

On 07.06.24 11:09, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
> 
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Further, we want to further enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
> 
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
> 
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE as
> currently only implemented by virtio-mem.
> 
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
> 
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
> 
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
> 
> We'll leave the ZONE_DEVICE case alone for now.
> 

@Andrew, can we add here:

"Note that self-hosted vmemmap pages will no longer be marked as 
reserved. This matches ordinary vmemmap pages allocated from the buddy 
during memory hotplug. Now, really only vmemmap pages allocated from 
memblock during early boot will be marked reserved. Existing 
PageReserved() checks seem to be handling all relevant cases correctly 
even after this change."

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: David Hildenbrand @ 2024-06-11  8:04 UTC
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <ZmgAsolx7SAHeDW7@localhost.localdomain>

On 11.06.24 09:45, Oscar Salvador wrote:
> On Mon, Jun 10, 2024 at 10:56:02AM +0200, David Hildenbrand wrote:
>> There are fortunately not that many left.
>>
>> I'd even say marking them (vmemmap) reserved is more wrong than right: note
>> that ordinary vmemmap pages after memory hotplug are not reserved! Only
>> bootmem should be reserved.
> 
> Ok, that is a very good point that I missed.
> I thought that hotplugged-vmemmap pages (not selfhosted) were marked as
> Reserved, that is why I thought this would be inconsistent.
> But then, if that is the case, I think we are safe as the kernel can already
> encounter vmemmap pages that are not reserved and it deals with them
> somehow.
> 
>> Let's take a look at the relevant core-mm ones (arch stuff is mostly just for MMIO
>> remapping)
>>
> ...
>> Any PageReserved user that I am missing, or why we should handle these
>> vmemmap pages differently than the ones allocated during ordinary memory
>> hotplug?
> 
> No, I cannot think of a reason why normal vmemmap pages should behave
> differently from self-hosted ones.
> 
> I was also confused because I thought that after this change
> pfn_to_online_page() would be different for self-hosted vmemmap pages,
> because I thought that somehow we relied on PageOffline(), but it is not
> the case.

Fortunately not :) PageFakeOffline() or PageLogicallyOffline() might be
clearer, but I don't quite like these names. If you have a good idea, 
please let me know.

> 
>> In the future, we might want to consider using a dedicated page type for
>> them, so we can stop using a bit that doesn't allow to reliably identify
>> them. (we should mark all vmemmap with that type then)
> 
> Yes, an all-vmemmap page type would be a good thing, so we do not have
> to special case.
> 
> Just one last thing.
> Now self-hosted vmemmap pages will have the PageOffline cleared, and that
> will still remain after the memory-block they belong to has gone
> offline, which is ok because those vmemmap pages lie around until the
> chunk of memory gets removed.

Yes, and that memmap might even get poisoned in debug kernels to catch 
any wrong access.

> 
> Ok, just wanted to convince myself that there will be no surprises.
> 
> Thanks David for clarifying.

Thanks for the review and raising that. I'll add more details to the 
patch description!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: Oscar Salvador @ 2024-06-11  8:01 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-3-david@redhat.com>

On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
> 
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Further, we want to further enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
> 
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
> 
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE as
> currently only implemented by virtio-mem.
> 
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
> 
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
> 
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
> 
> We'll leave the ZONE_DEVICE case alone for now.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Oscar Salvador <osalvador@suse.de> # for the generic
memory-hotplug bits


-- 
Oscar Salvador
SUSE Labs


* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: Oscar Salvador @ 2024-06-11  7:45 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <5d9583e1-3374-437d-8eea-6ab1e1400a30@redhat.com>

On Mon, Jun 10, 2024 at 10:56:02AM +0200, David Hildenbrand wrote:
> There are fortunately not that many left.
> 
> I'd even say marking them (vmemmap) reserved is more wrong than right: note
> that ordinary vmemmap pages after memory hotplug are not reserved! Only
> bootmem should be reserved.

Ok, that is a very good point that I missed.
I thought that hotplugged-vmemmap pages (not selfhosted) were marked as
Reserved, that is why I thought this would be inconsistent.
But then, if that is the case, I think we are safe as the kernel can already
encounter vmemmap pages that are not reserved and it deals with them
somehow.

> Let's take a look at the relevant core-mm ones (arch stuff is mostly just for MMIO
> remapping)
> 
... 
> Any PageReserved user that I am missing, or why we should handle these
> vmemmap pages differently than the ones allocated during ordinary memory
> hotplug?

No, I cannot think of a reason why normal vmemmap pages should behave
differently from self-hosted ones.

I was also confused because I thought that after this change
pfn_to_online_page() would be different for self-hosted vmemmap pages,
because I thought that somehow we relied on PageOffline(), but it is not
the case.

> In the future, we might want to consider using a dedicated page type for
> them, so we can stop using a bit that doesn't allow to reliably identify
> them. (we should mark all vmemmap with that type then)

Yes, an all-vmemmap page type would be a good thing, so we do not have
to special case.

Just one last thing.
Now self-hosted vmemmap pages will have the PageOffline cleared, and that
will still remain after the memory-block they belong to has gone
offline, which is ok because those vmemmap pages lie around until the
chunk of memory gets removed.

Ok, just wanted to convince myself that there will be no surprises.

Thanks David for clarifying.

-- 
Oscar Salvador
SUSE Labs


* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Shradha Gupta @ 2024-06-11  5:31 UTC
  To: Simon Horman
  Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
	Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
	Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
	Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Leon Romanovsky, Jason Gunthorpe, Long Li,
	Shradha Gupta
In-Reply-To: <20240606163334.GO791188@kernel.org>

On Thu, Jun 06, 2024 at 05:33:34PM +0100, Simon Horman wrote:
> On Wed, Jun 05, 2024 at 01:39:06AM -0700, Shradha Gupta wrote:
> > On Tue, Jun 04, 2024 at 10:33:49AM +0100, Simon Horman wrote:
> > > On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > > > Allow variable size indirection table allocation in MANA instead
> > > > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > > > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > > > indirection table is allocated dynamically.
> > > > 
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > 
> > > ...
> > > 
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > 
> > > ...
> > > 
> > > > @@ -2344,11 +2352,33 @@ static int mana_create_vport(struct mana_port_context *apc,
> > > >  	return mana_create_txq(apc, net);
> > > >  }
> > > >  
> > > > +static int mana_rss_table_alloc(struct mana_port_context *apc)
> > > > +{
> > > > +	if (!apc->indir_table_sz) {
> > > > +		netdev_err(apc->ndev,
> > > > +			   "Indirection table size not set for vPort %d\n",
> > > > +			   apc->port_idx);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	apc->indir_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
> > > > +	if (!apc->indir_table)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	apc->rxobj_table = kcalloc(apc->indir_table_sz, sizeof(mana_handle_t), GFP_KERNEL);
> > > > +	if (!apc->rxobj_table) {
> > > > +		kfree(apc->indir_table);
> > > 
> > > Hi, Shradha
> > > 
> > > Perhaps I am on the wrong track here, but I have some concerns
> > > about clean-up paths.
> > > 
> > > Firstly, I think that apc->indir_table should be set to NULL here for
> > > consistency with other clean-up paths. Or alternatively, fields of apc
> > > should not be set to NULL elsewhere after being freed.
> > 
> > Hi Simon,
> > 
> > Thanks for the comments. This makes sense, I am planning on consistently
> > removing the NULL assignments from other places too, as per Leon's comments.
> 
> Great!
> 
> > > In looking into this I noticed that mana_probe() does not call
> > > mana_remove() or return an error in the cases where mana_probe_port()
> > > or mana_attach() fail unless add_adev also fails. If so, is that
> > > intentional?
> > 
> > Right, so most calls like mana_probe_port(), mana_attach() clean up after
> > themselves in the code if there is any error. So, not having to call
> > mana_remove() in these cases in mana_probe() is intentional. But I do
> > agree that an error is returned in mana_probe() only if add_adev also
> > fails. I'll fix that too in the next version
> 
> I'm not entirely sure, but perhaps that is a candidate for a separate patch.
> 
> > > 
> > > In any case, I would suggest as a follow-up, arranging things so that
> > > when an error occurs in a function, anything that was allocated is
> > > unwound before returning an error.
> > > 
> > > I think this would make allocation/deallocation easier to reason about.
> > > And I suspect it would avoid both the need for fields of structures to
> > > be zeroed after being freed, and the need to call mana_remove() from
> > > mana_probe().
> > 
> > Agreed
> > > 
> > > > +		return -ENOMEM;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static void mana_rss_table_init(struct mana_port_context *apc)
> > > >  {
> > > >  	int i;
> > > >  
> > > > -	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
> > > > +	for (i = 0; i < apc->indir_table_sz; i++)
> > > >  		apc->indir_table[i] =
> > > >  			ethtool_rxfh_indir_default(i, apc->num_queues);
> > > >  }
> > > 
> > > ...
> > > 
> > > > @@ -2739,11 +2772,17 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> > > >  	err = register_netdev(ndev);
> > > >  	if (err) {
> > > >  		netdev_err(ndev, "Unable to register netdev.\n");
> > > > -		goto reset_apc;
> > > > +		goto free_indir;
> > > >  	}
> > > >  
> > > >  	return 0;
> > > >  
> > > > +free_indir:
> > > > +	apc->indir_table_sz = 0;
> > > > +	kfree(apc->indir_table);
> > > > +	apc->indir_table = NULL;
> > > > +	kfree(apc->rxobj_table);
> > > > +	apc->rxobj_table = NULL;
> > > >  reset_apc:
> > > >  	kfree(apc->rxqs);
> > > >  	apc->rxqs = NULL;
> > > 
> > > nit: Not strictly related to this patch, but the reset_apc code should
> > >      probably be a call to mana_cleanup_port_context() as it is the dual of
> > >      mana_init_port_context() which is called earlier in mana_probe_port()
> > 
> > Sure, let me do that too.
> 
> FWIW, I think it would be appropriate to put that change in a separate patch.
I'll fix this and other similar issues in a separate patch. Thanks.
> 
> > > 
> > > ...
> > > 
> > > > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > > >  		}
> > > >  
> > > >  		unregister_netdevice(ndev);
> > > > +		apc->indir_table_sz = 0;
> > > > +		kfree(apc->indir_table);
> > > > +		apc->indir_table = NULL;
> > > > +		kfree(apc->rxobj_table);
> > > > +		apc->rxobj_table = NULL;
> > > 
> > > The code to free and zero indir_table_sz and indir_table appears twice
> > > in this patch. Perhaps a helper to do this, which would be the dual
> > > of mana_rss_table_alloc is in order.
> > Makes sense, will change this too.
> 
> Thanks.
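
Illustrative sketch (not part of the thread): the review comments above ask for a free helper that is the dual of mana_rss_table_alloc(), and for each function to unwind its own allocations before returning an error. A userspace approximation of that shape follows. The struct port_ctx and the helper names rss_table_alloc() and rss_table_free() are hypothetical, calloc()/free() stand in for kcalloc()/kfree(), and uint64_t stands in for mana_handle_t. The sketch resets freed pointers to NULL, which is only one of the two consistent options Simon mentions; the thread suggests the author may instead drop the NULL assignments everywhere.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, pared-down stand-in for struct mana_port_context. */
struct port_ctx {
	uint32_t  indir_table_sz;
	uint32_t *indir_table;
	uint64_t *rxobj_table;
};

/* Dual of rss_table_alloc(); safe on a partially set up context. */
static void rss_table_free(struct port_ctx *apc)
{
	free(apc->indir_table);		/* free(NULL) is a no-op */
	apc->indir_table = NULL;
	free(apc->rxobj_table);
	apc->rxobj_table = NULL;
	apc->indir_table_sz = 0;
}

/* Allocate both tables, or neither: failures unwind before returning. */
static int rss_table_alloc(struct port_ctx *apc)
{
	if (!apc->indir_table_sz)
		return -EINVAL;

	apc->indir_table = calloc(apc->indir_table_sz,
				  sizeof(*apc->indir_table));
	if (!apc->indir_table)
		return -ENOMEM;

	apc->rxobj_table = calloc(apc->indir_table_sz,
				  sizeof(*apc->rxobj_table));
	if (!apc->rxobj_table) {
		/* Undo the first allocation so the caller sees no partial state. */
		free(apc->indir_table);
		apc->indir_table = NULL;
		return -ENOMEM;
	}

	return 0;
}
```

With such a pair, both the free_indir error label and the teardown in mana_remove() could call the single free helper instead of repeating the same five lines.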


* [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Haiyang Zhang @ 2024-06-10 21:22 UTC
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, stephen, kys, paulros, olaf, vkuznets, davem,
	wei.liu, edumazet, kuba, pabeni, leon, longli, ssengar,
	linux-rdma, daniel, john.fastabend, bpf, ast, hawk, tglx,
	shradhagupta, linux-kernel

As defined by the MANA hardware spec, the queue size for DMA must be at
least 4KB and a power of 2.
To support the variable page sizes (4KB, 16KB, 64KB) of ARM64, define
the minimum queue size as a macro separate from PAGE_SIZE, which was
always assumed to be 4KB before ARM64 support was added.
Also, update the relevant code related to size alignment, DMA region
calculations, etc.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/Kconfig        |  2 +-
 .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
 .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
 drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
 .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
 include/net/mana/gdma.h                       |  7 +++++-
 include/net/mana/mana.h                       |  3 ++-
 7 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 286f0d5697a1..901fbffbf718 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
 config MICROSOFT_MANA
 	tristate "Microsoft Azure Network Adapter (MANA) support"
 	depends on PCI_MSI
-	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
+	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
 	depends on PCI_HYPERV
 	select AUXILIARY_BUS
 	select PAGE_POOL
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 1332db9a08eb..c9df942d0d02 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc, unsigned int length,
 	dma_addr_t dma_handle;
 	void *buf;
 
-	if (length < PAGE_SIZE || !is_power_of_2(length))
+	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
 		return -EINVAL;
 
 	gmi->dev = gc->dev;
@@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region, NET_MANA);
 static int mana_gd_create_dma_region(struct gdma_dev *gd,
 				     struct gdma_mem_info *gmi)
 {
-	unsigned int num_page = gmi->length / PAGE_SIZE;
+	unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
 	struct gdma_create_dma_region_req *req = NULL;
 	struct gdma_create_dma_region_resp resp = {};
 	struct gdma_context *gc = gd->gdma_context;
@@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
 	int err;
 	int i;
 
-	if (length < PAGE_SIZE || !is_power_of_2(length))
+	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
 		return -EINVAL;
 
 	if (offset_in_page(gmi->virt_addr) != 0)
@@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
 	req->page_addr_list_len = num_page;
 
 	for (i = 0; i < num_page; i++)
-		req->page_addr_list[i] = gmi->dma_handle +  i * PAGE_SIZE;
+		req->page_addr_list[i] = gmi->dma_handle +  i * MANA_MIN_QSIZE;
 
 	err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp), &resp);
 	if (err)
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index bbc4f9e16c98..038dc31e09cd 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct hw_channel_context *hwc, u16 q_depth,
 	int err;
 
 	eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
-	if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
-		eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
+	if (eq_size < MANA_MIN_QSIZE)
+		eq_size = MANA_MIN_QSIZE;
 
 	cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
-	if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
-		cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
+	if (cq_size < MANA_MIN_QSIZE)
+		cq_size = MANA_MIN_QSIZE;
 
 	hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
 	if (!hwc_cq)
@@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct hw_channel_context *hwc, u16 q_depth,
 
 	dma_buf->num_reqs = q_depth;
 
-	buf_size = PAGE_ALIGN(q_depth * max_msg_size);
+	buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
 
 	gmi = &dma_buf->mem_info;
 	err = mana_gd_alloc_memory(gc, buf_size, gmi);
@@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct hw_channel_context *hwc,
 	else
 		queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE * q_depth);
 
-	if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
-		queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
+	if (queue_size < MANA_MIN_QSIZE)
+		queue_size = MANA_MIN_QSIZE;
 
 	hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
 	if (!hwc_wq)
@@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct gdma_context *gc, u16 *q_depth,
 	init_completion(&hwc->hwc_init_eqe_comp);
 
 	err = mana_smc_setup_hwc(&gc->shm_channel, false,
-				 eq->mem_info.dma_handle,
-				 cq->mem_info.dma_handle,
-				 rq->mem_info.dma_handle,
-				 sq->mem_info.dma_handle,
+				 virt_to_phys(eq->mem_info.virt_addr),
+				 virt_to_phys(cq->mem_info.virt_addr),
+				 virt_to_phys(rq->mem_info.virt_addr),
+				 virt_to_phys(sq->mem_info.virt_addr),
 				 eq->eq.msix_index);
 	if (err)
 		return err;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d087cf954f75..6a891dbce686 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1889,10 +1889,10 @@ static int mana_create_txq(struct mana_port_context *apc,
 	 *  to prevent overflow.
 	 */
 	txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
-	BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
+	BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
 
 	cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
-	cq_size = PAGE_ALIGN(cq_size);
+	cq_size = MANA_MIN_QALIGN(cq_size);
 
 	gc = gd->gdma_context;
 
@@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	if (err)
 		goto out;
 
-	rq_size = PAGE_ALIGN(rq_size);
-	cq_size = PAGE_ALIGN(cq_size);
+	rq_size = MANA_MIN_QALIGN(rq_size);
+	cq_size = MANA_MIN_QALIGN(cq_size);
 
 	/* Create RQ */
 	memset(&spec, 0, sizeof(spec));
diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c b/drivers/net/ethernet/microsoft/mana/shm_channel.c
index 5553af9c8085..9a54a163d8d1 100644
--- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
@@ -6,6 +6,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 
+#include <net/mana/gdma.h>
 #include <net/mana/shm_channel.h>
 
 #define PAGE_FRAME_L48_WIDTH_BYTES 6
@@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
 
 	/* EQ addr: low 48 bits of frame address */
 	shmem = (u64 *)ptr;
-	frame_addr = PHYS_PFN(eq_addr);
+	frame_addr = MANA_PFN(eq_addr);
 	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
 	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
 		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
@@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
 
 	/* CQ addr: low 48 bits of frame address */
 	shmem = (u64 *)ptr;
-	frame_addr = PHYS_PFN(cq_addr);
+	frame_addr = MANA_PFN(cq_addr);
 	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
 	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
 		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
@@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
 
 	/* RQ addr: low 48 bits of frame address */
 	shmem = (u64 *)ptr;
-	frame_addr = PHYS_PFN(rq_addr);
+	frame_addr = MANA_PFN(rq_addr);
 	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
 	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
 		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
@@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
 
 	/* SQ addr: low 48 bits of frame address */
 	shmem = (u64 *)ptr;
-	frame_addr = PHYS_PFN(sq_addr);
+	frame_addr = MANA_PFN(sq_addr);
 	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
 	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
 		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 27684135bb4d..b392559c33e9 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -224,7 +224,12 @@ struct gdma_dev {
 	struct auxiliary_device *adev;
 };
 
-#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
+/* These are defined by HW */
+#define MANA_MIN_QSHIFT 12
+#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
+#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
+#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr), MANA_MIN_QSIZE)
+#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))
 
 #define GDMA_CQE_SIZE 64
 #define GDMA_EQE_SIZE 16
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 561f6719fb4e..43e8fc574354 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -42,7 +42,8 @@ enum TRI_STATE {
 
 #define MAX_SEND_BUFFERS_PER_QUEUE 256
 
-#define EQ_SIZE (8 * PAGE_SIZE)
+#define EQ_SIZE (8 * MANA_MIN_QSIZE)
+
 #define LOG2_EQ_THROTTLE 3
 
 #define MAX_PORTS_IN_MANA_DEV 256
-- 
2.34.1
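
Illustrative sketch (not part of the patch): the arithmetic behind the new macros can be checked in userspace as below. The constants come from the patch. The functions mana_min_qalign(), mana_pfn() and mana_num_hw_pages() are hypothetical stand-ins for MANA_MIN_QALIGN(), MANA_PFN() and the num_page calculation. The kernel's compile-time PAGE_SHIFT is replaced by a page_shift parameter, assumed to be at least 12, so that 4KB, 16KB and 64KB configurations can be compared side by side.

```c
#include <stdint.h>

/* Values taken from the patch: the hardware minimum queue size is 4 KiB. */
#define MANA_MIN_QSHIFT 12
#define MANA_MIN_QSIZE  (1u << MANA_MIN_QSHIFT)

/* Stand-in for MANA_MIN_QALIGN(): round x up to a multiple of 4 KiB. */
static inline uint64_t mana_min_qalign(uint64_t x)
{
	return (x + MANA_MIN_QSIZE - 1) & ~((uint64_t)MANA_MIN_QSIZE - 1);
}

/*
 * Stand-in for MANA_PFN(): take the kernel page frame number, then
 * rescale it into units of 4 KiB hardware pages.
 * Assumption: page_shift >= MANA_MIN_QSHIFT.
 */
static inline uint64_t mana_pfn(uint64_t phys_addr, unsigned int page_shift)
{
	uint64_t kernel_pfn = phys_addr >> page_shift;	/* PHYS_PFN(a) */

	return kernel_pfn << (page_shift - MANA_MIN_QSHIFT);
}

/* Number of 4 KiB entries needed to describe a DMA region of this length. */
static inline uint64_t mana_num_hw_pages(uint64_t length)
{
	return length / MANA_MIN_QSIZE;
}
```

For a physical address of 0x10000, mana_pfn() yields 0x10 with both 4KB and 64KB kernel pages, and a 64KB region always needs 16 hardware page entries regardless of the kernel page size. Because the frame number is computed first, any offset below the kernel page size is dropped, so the sketch is only meaningful for addresses aligned to the kernel page size.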



* [PATCH 1/1] Documentation: hyperv: Add overview of Confidential Computing VM support
From: mhkelley58 @ 2024-06-10 20:28 UTC
  To: kys, haiyangz, wei.liu, decui, corbet, linux-kernel, linux-hyperv,
	linux-doc, linux-coco

From: Michael Kelley <mhklinux@outlook.com>

Add documentation topic for Confidential Computing (CoCo) VM support
in Linux guests on Hyper-V.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 Documentation/virt/hyperv/coco.rst  | 258 ++++++++++++++++++++++++++++
 Documentation/virt/hyperv/index.rst |   1 +
 2 files changed, 259 insertions(+)
 create mode 100644 Documentation/virt/hyperv/coco.rst

diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst
new file mode 100644
index 000000000000..ffd6ba7a1d64
--- /dev/null
+++ b/Documentation/virt/hyperv/coco.rst
@@ -0,0 +1,258 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Confidential Computing VMs
+==========================
+Hyper-V can create and run Linux guests that are Confidential Computing
+(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
+the confidentiality and integrity of data in the VM's memory, even in the
+face of a hypervisor/VMM that has been compromised and may behave maliciously.
+CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
+objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
+that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
+"isolation VMs".
+
+A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
+following:
+
+* Physical hardware with a processor that supports CoCo VMs
+
+* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
+
+* The VM runs a version of Linux that supports being a CoCo VM
+
+The physical hardware requirements are as follows:
+
+* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
+  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
+  VM on Hyper-V.
+
+* Intel processor with TDX
+
+To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
+when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
+or vice versa, after it is created.
+
+Operational Modes
+-----------------
+Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
+created and cannot be changed during the life of the VM.
+
+* Fully-enlightened mode. In this mode, the guest operating system is
+  enlightened to understand and manage all aspects of running as a CoCo VM.
+
+* Paravisor mode. In this mode, a paravisor layer between the guest and the
+  host provides some operations needed to run as a CoCo VM. The guest operating
+  system can have fewer CoCo enlightenments than are required in the
+  fully-enlightened case.
+
+Conceptually, fully-enlightened mode and paravisor mode may be treated as
+points on a spectrum spanning the degree of guest enlightenment needed to run
+as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
+implementation of paravisor mode is the other end of the spectrum, where all
+aspects of running as a CoCo VM are handled by the paravisor, and a normal
+guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
+can run successfully. However, the Hyper-V implementation of paravisor mode
+does not go this far, and is somewhere in the middle of the spectrum. Some
+aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
+must be enlightened for other aspects. Unfortunately, there is no
+standardized enumeration of features/functions that might be provided in the
+paravisor, and there is no standardized mechanism for a guest OS to query the
+paravisor for the features/functions it provides. The understanding of what
+the paravisor provides is hard-coded in the guest OS.
+
+Paravisor mode has similarities to the Coconut project, which aims to provide
+a limited paravisor that offers services to the guest such as a virtual TPM.
+However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
+than is currently envisioned for Coconut, and so is further toward the "no
+guest enlightenments required" end of the spectrum.
+
+In the CoCo VM threat model, the paravisor is in the guest security domain
+and must be trusted by the guest OS. By implication, the hypervisor/VMM must
+protect itself against a potentially malicious paravisor just like it
+protects against a potentially malicious guest.
+
+The hardware architectural approach to fully-enlightened vs. paravisor mode
+varies depending on the underlying processor.
+
+* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
+  VMPL 0 and has full control of the guest context. In paravisor mode, the
+  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
+  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
+  Certain operations require the guest to invoke the paravisor. Furthermore, in
+  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
+  as defined by the SEV-SNP architecture. This mode simplifies guest management
+  of memory encryption when a paravisor is used.
+
+* With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
+  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
+  L1 VM, and the guest OS runs in a nested L2 VM.
+
+Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
+MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
+whether a paravisor is being used. It is straightforward to build a single
+kernel image that can boot and run properly on either architecture, and in
+either mode.
+
+Paravisor Effects
+-----------------
+Running in paravisor mode affects the following areas of generic Linux kernel
+CoCo VM functionality:
+
+* Initial guest memory setup. When a new VM is created in paravisor mode, the
+  paravisor runs first and sets up the guest physical memory as encrypted. The
+  guest Linux does normal memory initialization, except for explicitly marking
+  appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
+  perform the early boot memory setup steps that are particularly tricky with
+  AMD SEV-SNP in fully-enlightened mode.
+
+* #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
+  CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
+  respectively, and not the guest Linux. Consequently, these exception handlers
+  do not run in the guest Linux and are not a required enlightenment for a
+  Linux guest in paravisor mode.
+
+* CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
+  guest indicating that the VM is operating with the respective hardware
+  support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
+  the paravisor filters out these flags and the guest Linux does not see them.
+  Throughout the Linux kernel, explicitly testing these flags has mostly been
+  eliminated in favor of the cc_platform_has() function, with the goal of
+  abstracting the differences between SEV-SNP and TDX. But the
+  cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
+  to selectively enable aspects of CoCo VM functionality even when the CPUID
+  flags are not set. The exception is early boot memory setup on SEV-SNP, which
+  tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor
+  mode VM achieves the desired effect of not running SEV-SNP specific early
+  boot memory setup.
+
+* Device emulation. In paravisor mode, the Hyper-V paravisor provides
+  emulation of devices such as the IO-APIC and TPM. Because the emulation
+  happens in the paravisor in the guest context (instead of the hypervisor/VMM
+  context), MMIO accesses to these devices must be encrypted references instead
+  of the decrypted references that would be used in a fully-enlightened CoCo
+  VM. The __ioremap_caller() function has been enhanced to make a callback to
+  check whether a particular address range should be treated as encrypted
+  (private). See the "is_private_mmio" callback.
+
+* Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
+  memory between encrypted and decrypted requires coordinating with the
+  hypervisor/VMM. This is done via callbacks invoked from
+  __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
+  TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
+  specific set of callbacks is used. These callbacks invoke the paravisor so
+  that the paravisor can coordinate the transitions and inform the hypervisor
+  as necessary. See hv_vtom_init() where these callbacks are set up.
+
+* Interrupt injection. In fully-enlightened mode, a malicious hypervisor
+  could inject interrupts into the guest OS at times that violate x86/x64
+  architectural rules. For full protection, the guest OS should include
+  enlightenments that use the interrupt injection management features provided
+  by CoCo-capable processors. In paravisor mode, the paravisor mediates
+  interrupt injection into the guest OS, and ensures that the guest OS only
+  sees interrupts that are "legal". The paravisor uses the interrupt injection
+  management features provided by the CoCo-capable physical processor, thereby
+  masking these complexities from the guest OS.
+
+Hyper-V Hypercalls
+------------------
+When in fully-enlightened mode, hypercalls made by the Linux guest are routed
+directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
+normal hypercalls trap to the paravisor first, which may in turn invoke the
+hypervisor. However, the paravisor is idiosyncratic in this regard, and a few
+hypercalls made by the Linux guest must always be routed directly to the
+hypervisor. These hypercall sites test for a paravisor being present, and use
+a special invocation sequence. See hv_post_message(), for example.
+
+Guest Communication with Hyper-V
+--------------------------------
+Separate from the generic Linux kernel handling of memory encryption in Linux
+CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
+shared between the Linux guest and the host. This shared memory must be
+marked decrypted to enable communication. Furthermore, since the threat model
+includes a compromised and potentially malicious host, the guest must guard
+against leaking any unintended data to the host through this shared memory.
+
+These Hyper-V and VMBus memory pages are marked as decrypted:
+
+* VMBus monitor pages
+
+* Synthetic interrupt controller (synic) related pages (unless supplied by
+  the paravisor)
+
+* Per-CPU hypercall input and output pages (unless running with a paravisor)
+
+* VMBus ring buffers. The direct mapping is marked decrypted in
+  __vmbus_establish_gpadl(). The secondary mapping created in
+  hv_ringbuffer_init() must also include the "decrypted" attribute.
+
+When the guest writes data to memory that is shared with the host, it must
+ensure that only the intended data is written. Padding or unused fields must
+be initialized to zeros before copying into the shared memory so that random
+kernel data is not inadvertently given to the host.
+
+Similarly, when the guest reads memory that is shared with the host, it must
+validate the data before acting on it so that a malicious host cannot induce
+the guest to expose unintended data. Doing such validation can be tricky
+because the host can modify the shared memory areas even while or after
+validation is performed. For messages passed from the host to the guest in a
+VMBus ring buffer, the length of the message is validated, and the message is
+copied into a temporary (encrypted) buffer for further validation and
+processing. The copying adds a small amount of overhead, but is the only way
+to protect against a malicious host. See hv_pkt_iter_first().
+
+Many drivers for VMBus devices have been "hardened" by adding code to fully
+validate messages received over VMBus, instead of assuming that Hyper-V is
+acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
+vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
+CoCo VM have not been hardened, and they are not allowed to load in a CoCo
+VM. See vmbus_is_valid_offer() where such devices are excluded.
+
+Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
+storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
+Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
+memory is done implicitly. netvsc has two modes for data transfers. The first
+mode goes through send and receive buffer space that is explicitly allocated
+by the netvsc driver, and is used for most smaller packets. These send and
+receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
+the netvsc driver explicitly copies packets to/from these buffers, the
+equivalent of bounce buffering between encrypted and decrypted memory is
+already part of the data path. The second mode uses the normal Linux kernel
+DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
+storvsc.
+
+Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
+Linux PCI device drivers access PCI config space using standard APIs provided
+by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
+space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
+encryption prevents Hyper-V from reading the guest instruction stream to
+emulate the access. So in a CoCo VM, these functions must make a hypercall
+with arguments explicitly describing the access. See
+_hv_pcifront_read_config() and _hv_pcifront_write_config() and the
+"use_calls" flag indicating to use hypercalls.
+
+load_unaligned_zeropad()
+------------------------
+When transitioning memory between encrypted and decrypted, the caller of
+set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
+the memory isn't in use and isn't referenced while the transition is in
+progress. The transition has multiple steps, and includes interaction with
+the Hyper-V host. The memory is in an inconsistent state until all steps are
+complete. A reference while the state is inconsistent could result in an
+exception that can't be cleanly fixed up.
+
+However, the kernel load_unaligned_zeropad() mechanism may make stray
+references that can't be prevented by the caller of set_memory_encrypted() or
+set_memory_decrypted(), so there's specific code in the #VC or #VE exception
+handler to fix up this case. But a CoCo VM running on Hyper-V may be
+configured to run with a paravisor, with the #VC or #VE exception routed to
+the paravisor. There's no architectural way to forward the exceptions back to
+the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
+in the #VC/#VE handlers doesn't run.
+
+To avoid this problem, the Hyper-V specific functions for notifying the
+hypervisor of the transition mark pages as "not present" while a transition
+is in progress. If load_unaligned_zeropad() causes a stray reference, a
+normal page fault is generated instead of #VC or #VE, and the page-fault-
+based handlers for load_unaligned_zeropad() fix up the reference. When the
+encrypted/decrypted transition is complete, the pages are marked as "present"
+again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
index de447e11b4a5..79bc4080329e 100644
--- a/Documentation/virt/hyperv/index.rst
+++ b/Documentation/virt/hyperv/index.rst
@@ -11,3 +11,4 @@ Hyper-V Enlightenments
    vmbus
    clocks
    vpci
+   coco
-- 
2.25.1
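
A minimal userspace sketch of the "initialize padding and unused fields to
zero before copying into shared memory" rule from the "Guest communication
with Hyper-V" section above. The struct layout and the msg_publish() helper
are hypothetical and exist only for illustration; they are not VMBus
definitions. The sketch assumes the common idiom of clearing the whole
object with memset() before filling in the fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; not a real VMBus structure. */
struct msg {
	uint8_t  kind;
	/* the compiler typically inserts 3 padding bytes here */
	uint32_t value;
	uint8_t  reserved[8];	/* unused field */
};

/*
 * Build the message in private memory, clear every byte first so that
 * padding and unused fields carry no stale data, then copy the finished
 * object into the host-visible buffer in one step.
 */
static void msg_publish(void *shared, uint8_t kind, uint32_t value)
{
	struct msg m;

	assert(shared != NULL);
	memset(&m, 0, sizeof(m));	/* covers padding and reserved[] */
	m.kind = kind;
	m.value = value;
	memcpy(shared, &m, sizeof(m));
}
```

Building the object privately and publishing it with a single copy also
keeps partially initialized data out of the shared buffer.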

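The copy-then-validate handling of host-to-guest messages described in the
same section can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the hv_pkt_iter_first()
implementation: struct pkt_hdr, MAX_PKT_LEN and pkt_snapshot() are
hypothetical names. The point is that the untrusted length is fetched once,
checked, and only the private copy is used afterwards, so a later host write
to the shared buffer cannot bypass the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PKT_LEN 256u	/* hypothetical upper bound for this sketch */

/* Hypothetical packet header; not the real VMBus descriptor layout. */
struct pkt_hdr {
	uint32_t len;	/* total packet length in bytes, header included */
	uint32_t type;
};

/*
 * Copy one packet from host-visible memory into private memory.
 * Returns the validated length, or -1 if the length is out of range.
 */
static int pkt_snapshot(const void *shared, size_t shared_size,
			void *priv, size_t priv_size)
{
	uint32_t len;

	assert(shared != NULL && priv != NULL);
	if (shared_size < sizeof(struct pkt_hdr))
		return -1;

	/* Fetch the untrusted length exactly once. */
	memcpy(&len, shared, sizeof(len));
	if (len < sizeof(struct pkt_hdr) || len > MAX_PKT_LEN ||
	    len > shared_size || len > priv_size)
		return -1;

	/* Copy to private memory, then pin the length that was checked. */
	memcpy(priv, shared, len);
	memcpy(priv, &len, sizeof(len));
	return (int)len;
}
```

Any further validation of the payload would then operate on the private
copy only.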


* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Kirill A. Shutemov @ 2024-06-10 14:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <20240610134020.GCZmcCRFxuObyv1W_d@fat_crate.local>

On Mon, Jun 10, 2024 at 03:40:20PM +0200, Borislav Petkov wrote:
> On Fri, Jun 07, 2024 at 06:14:28PM +0300, Kirill A. Shutemov wrote:
> >   I was able to address this issue by switching cpa_lock to a mutex.
> >   However, this solution will only work if the callers for set_memory
> >   interfaces are not called from an atomic context. I need to verify if
> >   this is the case.
> 
> Dunno, I'd be nervous about this. Although from looking at
> 
>    ad5ca55f6bdb ("x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa")
> 
> I don't see how "So that we don't allow any other cpu" can't be done
> with a mutex. Perhaps the set_memory* interfaces should be usable in as
> many contexts as possible.
> 
> Have you run this with lockdep enabled?

Yes, it booted to the shell just fine. However, that doesn't prove
anything. The set_memory_* functions have many obscure cases.

> > - The function __flush_tlb_all() in kernel_(un)map_pages_in_pgd() must be
> >   called with preemption disabled. Once again, I am unsure why this has
> >   not caused issues in the EFI case.
> 
> It could be because EFI does all that setup on the BSP only before the
> others have arrived but I don't remember anymore... It is more than
> a decade ago when I did this...

Are you okay with this? Disabling preemption looks strange, but I don't
see a better option.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Borislav Petkov @ 2024-06-10 13:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <icu4yecqfwhmbexupo4zzei4lbe5sgavsfkm27jd6t6gyjynul@c2wap3jhtik7>

On Fri, Jun 07, 2024 at 06:14:28PM +0300, Kirill A. Shutemov wrote:
>   I was able to address this issue by switching cpa_lock to a mutex.
>   However, this solution will only work if the callers for set_memory
>   interfaces are not called from an atomic context. I need to verify if
>   this is the case.

Dunno, I'd be nervous about this. Although from looking at

   ad5ca55f6bdb ("x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa")

I don't see how "So that we don't allow any other cpu" can't be done
with a mutex. Perhaps the set_memory* interfaces should be usable in as
many contexts as possible.

Have you run this with lockdep enabled?

> - The function __flush_tlb_all() in kernel_(un)map_pages_in_pgd() must be
>   called with preemption disabled. Once again, I am unsure why this has
>   not caused issues in the EFI case.

It could be because EFI does all that setup on the BSP only before the
others have arrived but I don't remember anymore... It is more than
a decade ago when I did this...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: Oscar Salvador @ 2024-06-10 11:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <13070847-4129-490c-b228-2e52bd77566a@redhat.com>

On Mon, Jun 10, 2024 at 10:38:05AM +0200, David Hildenbrand wrote:
> On 10.06.24 06:03, Oscar Salvador wrote:
> > On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
> > > In preparation for further changes, let's teach __free_pages_core()
> > > about the differences of memory hotplug handling.
> > > 
> > > Move the memory hotplug specific handling from generic_online_page() to
> > > __free_pages_core(), use adjust_managed_page_count() on the memory
> > > hotplug path, and spell out why memory freed via memblock
> > > cannot currently use adjust_managed_page_count().
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > All looks good but I am puzzled with something.
> > 
> > > +	} else {
> > > +		/* memblock adjusts totalram_pages() ahead of time. */
> > > +		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> > > +	}
> > 
> > You say that memblock adjusts totalram_pages ahead of time, and I guess
> > you mean in memblock_free_all()
> 
> And memblock_free_late(), which uses atomic_long_inc().

Ah yes.

 
> Right (it's suboptimal, but not really problematic so far. Hopefully Wei can
> clean it up and move it in here as well)

That would be great.

> For the time being
> 
> "/* memblock adjusts totalram_pages() manually. */"

Yes, I think that is better ;-)

Thanks!
 

-- 
Oscar Salvador
SUSE Labs


* [PATCH net-next v4] net: mana: Allow variable size indirection table
From: Shradha Gupta @ 2024-06-10 10:28 UTC (permalink / raw)
  To: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma
  Cc: Shradha Gupta, Colin Ian King, Ahmed Zaki, Pavan Chebbi,
	Souradeep Chakrabarti, Konstantin Taranov, Kees Cook, Paolo Abeni,
	Jakub Kicinski, Eric Dumazet, David S. Miller, Dexuan Cui,
	Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Leon Romanovsky,
	Jason Gunthorpe, Long Li, Shradha Gupta

Allow variable size indirection table allocation in MANA instead
of using a constant value MANA_INDIRECT_TABLE_SIZE.
The size is now derived from the MANA_QUERY_VPORT_CONFIG response and
the indirection table is allocated dynamically.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 Changes in v4:
 * Skip NULLify after free
 * Log proper errors in mana_probe() if mana_attach(), mana_probe_port()
   fails
 * Implement mana_cleanup_indir_table() to avoid code duplication.

 Changes in v3:
 * Fixed the memory leak (save_table) in mana_set_rxfh()

 Changes in v2:
 * Rebased to latest net-next tree
 * Rearranged cleanup code in mana_probe_port to avoid extra operations
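
 Illustration only: a minimal userspace sketch of the two ideas this patch
 combines, namely accepting the reported table size only when it is
 non-zero, within the maximum, and a power of two, and indexing the table
 with "hash & (size - 1)". The helper names below are hypothetical; the
 constants mirror MANA_INDIRECT_TABLE_MAX_SIZE and
 MANA_INDIRECT_TABLE_DEF_SIZE from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_MAX_SIZE 512u	/* mirrors MANA_INDIRECT_TABLE_MAX_SIZE */
#define TABLE_DEF_SIZE 64u	/* mirrors MANA_INDIRECT_TABLE_DEF_SIZE */

/* True when exactly one bit is set. */
static bool is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Use the reported size if it is sane, otherwise fall back to the default. */
static uint32_t pick_table_size(uint32_t reported)
{
	if (reported > 0 && reported <= TABLE_MAX_SIZE && is_pow2(reported))
		return reported;
	return TABLE_DEF_SIZE;
}

/*
 * Masking with (size - 1) equals "hash % size" only when size is a power
 * of two, which is why the size is validated before it is stored.
 */
static uint32_t table_index(uint32_t hash, uint32_t size)
{
	assert(is_pow2(size));
	return hash & (size - 1);
}
```

 With a size that is not a power of two, the mask would skip some table
 entries entirely, so the fallback to the default size keeps the lookup in
 mana_get_tx_queue() well defined.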
---
 drivers/infiniband/hw/mana/qp.c               | 10 +--
 drivers/net/ethernet/microsoft/mana/mana_en.c | 85 ++++++++++++++++---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 27 ++++--
 include/net/mana/gdma.h                       |  4 +-
 include/net/mana/mana.h                       |  9 +-
 5 files changed, 104 insertions(+), 31 deletions(-)

diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index ba13c5abf8ef..2d411a16a127 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -21,7 +21,7 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
 
 	gc = mdev_to_gc(dev);
 
-	req_buf_size = struct_size(req, indir_tab, MANA_INDIRECT_TABLE_SIZE);
+	req_buf_size = struct_size(req, indir_tab, MANA_INDIRECT_TABLE_DEF_SIZE);
 	req = kzalloc(req_buf_size, GFP_KERNEL);
 	if (!req)
 		return -ENOMEM;
@@ -41,18 +41,18 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
 	if (log_ind_tbl_size)
 		req->rss_enable = true;
 
-	req->num_indir_entries = MANA_INDIRECT_TABLE_SIZE;
+	req->num_indir_entries = MANA_INDIRECT_TABLE_DEF_SIZE;
 	req->indir_tab_offset = offsetof(struct mana_cfg_rx_steer_req_v2,
 					 indir_tab);
 	req->update_indir_tab = true;
 	req->cqe_coalescing_enable = 1;
 
 	/* The ind table passed to the hardware must have
-	 * MANA_INDIRECT_TABLE_SIZE entries. Adjust the verb
+	 * MANA_INDIRECT_TABLE_DEF_SIZE entries. Adjust the verb
 	 * ind_table to MANA_INDIRECT_TABLE_SIZE if required
 	 */
 	ibdev_dbg(&dev->ib_dev, "ind table size %u\n", 1 << log_ind_tbl_size);
-	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+	for (i = 0; i < MANA_INDIRECT_TABLE_DEF_SIZE; i++) {
 		req->indir_tab[i] = ind_table[i % (1 << log_ind_tbl_size)];
 		ibdev_dbg(&dev->ib_dev, "index %u handle 0x%llx\n", i,
 			  req->indir_tab[i]);
@@ -137,7 +137,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	}
 
 	ind_tbl_size = 1 << ind_tbl->log_ind_tbl_size;
-	if (ind_tbl_size > MANA_INDIRECT_TABLE_SIZE) {
+	if (ind_tbl_size > MANA_INDIRECT_TABLE_DEF_SIZE) {
 		ibdev_dbg(&mdev->ib_dev,
 			  "Indirect table size %d exceeding limit\n",
 			  ind_tbl_size);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d087cf954f75..d87b57626769 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -481,7 +481,7 @@ static int mana_get_tx_queue(struct net_device *ndev, struct sk_buff *skb,
 	struct sock *sk = skb->sk;
 	int txq;
 
-	txq = apc->indir_table[hash & MANA_INDIRECT_TABLE_MASK];
+	txq = apc->indir_table[hash & (apc->indir_table_sz - 1)];
 
 	if (txq != old_q && sk && sk_fullsock(sk) &&
 	    rcu_access_pointer(sk->sk_dst_cache))
@@ -721,6 +721,13 @@ static void mana_cleanup_port_context(struct mana_port_context *apc)
 	apc->rxqs = NULL;
 }
 
+static void mana_cleanup_indir_table(struct mana_port_context *apc)
+{
+	apc->indir_table_sz = 0;
+	kfree(apc->indir_table);
+	kfree(apc->rxobj_table);
+}
+
 static int mana_init_port_context(struct mana_port_context *apc)
 {
 	apc->rxqs = kcalloc(apc->num_queues, sizeof(struct mana_rxq *),
@@ -962,7 +969,16 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 
 	*max_sq = resp.max_num_sq;
 	*max_rq = resp.max_num_rq;
-	*num_indir_entry = resp.num_indirection_ent;
+	if (resp.num_indirection_ent > 0 &&
+	    resp.num_indirection_ent <= MANA_INDIRECT_TABLE_MAX_SIZE &&
+	    is_power_of_2(resp.num_indirection_ent)) {
+		*num_indir_entry = resp.num_indirection_ent;
+	} else {
+		netdev_warn(apc->ndev,
+			    "Setting indirection table size to default %d for vPort %d\n",
+			    MANA_INDIRECT_TABLE_DEF_SIZE, apc->port_idx);
+		*num_indir_entry = MANA_INDIRECT_TABLE_DEF_SIZE;
+	}
 
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
@@ -1054,14 +1070,13 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 				   bool update_default_rxobj, bool update_key,
 				   bool update_tab)
 {
-	u16 num_entries = MANA_INDIRECT_TABLE_SIZE;
 	struct mana_cfg_rx_steer_req_v2 *req;
 	struct mana_cfg_rx_steer_resp resp = {};
 	struct net_device *ndev = apc->ndev;
 	u32 req_buf_size;
 	int err;
 
-	req_buf_size = struct_size(req, indir_tab, num_entries);
+	req_buf_size = struct_size(req, indir_tab, apc->indir_table_sz);
 	req = kzalloc(req_buf_size, GFP_KERNEL);
 	if (!req)
 		return -ENOMEM;
@@ -1072,7 +1087,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	req->hdr.req.msg_version = GDMA_MESSAGE_V2;
 
 	req->vport = apc->port_handle;
-	req->num_indir_entries = num_entries;
+	req->num_indir_entries = apc->indir_table_sz;
 	req->indir_tab_offset = offsetof(struct mana_cfg_rx_steer_req_v2,
 					 indir_tab);
 	req->rx_enable = rx;
@@ -1111,7 +1126,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	}
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
-		    apc->port_handle, num_entries);
+		    apc->port_handle, apc->indir_table_sz);
 out:
 	kfree(req);
 	return err;
@@ -2344,11 +2359,33 @@ static int mana_create_vport(struct mana_port_context *apc,
 	return mana_create_txq(apc, net);
 }
 
+static int mana_rss_table_alloc(struct mana_port_context *apc)
+{
+	if (!apc->indir_table_sz) {
+		netdev_err(apc->ndev,
+			   "Indirection table size not set for vPort %d\n",
+			   apc->port_idx);
+		return -EINVAL;
+	}
+
+	apc->indir_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
+	if (!apc->indir_table)
+		return -ENOMEM;
+
+	apc->rxobj_table = kcalloc(apc->indir_table_sz, sizeof(mana_handle_t), GFP_KERNEL);
+	if (!apc->rxobj_table) {
+		kfree(apc->indir_table);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static void mana_rss_table_init(struct mana_port_context *apc)
 {
 	int i;
 
-	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
+	for (i = 0; i < apc->indir_table_sz; i++)
 		apc->indir_table[i] =
 			ethtool_rxfh_indir_default(i, apc->num_queues);
 }
@@ -2361,7 +2398,7 @@ int mana_config_rss(struct mana_port_context *apc, enum TRI_STATE rx,
 	int i;
 
 	if (update_tab) {
-		for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+		for (i = 0; i < apc->indir_table_sz; i++) {
 			queue_idx = apc->indir_table[i];
 			apc->rxobj_table[i] = apc->rxqs[queue_idx]->rxobj;
 		}
@@ -2466,7 +2503,6 @@ static int mana_init_port(struct net_device *ndev)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	u32 max_txq, max_rxq, max_queues;
 	int port_idx = apc->port_idx;
-	u32 num_indirect_entries;
 	int err;
 
 	err = mana_init_port_context(apc);
@@ -2474,7 +2510,7 @@ static int mana_init_port(struct net_device *ndev)
 		return err;
 
 	err = mana_query_vport_cfg(apc, port_idx, &max_txq, &max_rxq,
-				   &num_indirect_entries);
+				   &apc->indir_table_sz);
 	if (err) {
 		netdev_err(ndev, "Failed to query info for vPort %d\n",
 			   port_idx);
@@ -2723,6 +2759,10 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	if (err)
 		goto free_net;
 
+	err = mana_rss_table_alloc(apc);
+	if (err)
+		goto reset_apc;
+
 	netdev_lockdep_set_classes(ndev);
 
 	ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
@@ -2739,11 +2779,13 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	err = register_netdev(ndev);
 	if (err) {
 		netdev_err(ndev, "Unable to register netdev.\n");
-		goto reset_apc;
+		goto free_indir;
 	}
 
 	return 0;
 
+free_indir:
+	mana_cleanup_indir_table(apc);
 reset_apc:
 	kfree(apc->rxqs);
 	apc->rxqs = NULL;
@@ -2872,16 +2914,30 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (!resuming) {
 		for (i = 0; i < ac->num_ports; i++) {
 			err = mana_probe_port(ac, i, &ac->ports[i]);
-			if (err)
+			/* we log the port for which the probe failed and stop
+			 * probes for subsequent ports.
+			 * Note that we keep running ports, for which the probes
+			 * were successful, unless add_adev fails too
+			 */
+			if (err) {
+				dev_err(dev, "Probe Failed for port %d\n", i);
 				break;
+			}
 		}
 	} else {
 		for (i = 0; i < ac->num_ports; i++) {
 			rtnl_lock();
 			err = mana_attach(ac->ports[i]);
 			rtnl_unlock();
-			if (err)
+			/* we log the port for which the attach failed and stop
+			 * attach for subsequent ports.
+			 * Note that we keep running ports, for which the attach
+			 * was successful, unless add_adev fails too
+			 */
+			if (err) {
+				dev_err(dev, "Attach Failed for port %d\n", i);
 				break;
+			}
 		}
 	}
 
@@ -2897,6 +2953,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 {
 	struct gdma_context *gc = gd->gdma_context;
 	struct mana_context *ac = gd->driver_data;
+	struct mana_port_context *apc;
 	struct device *dev = gc->dev;
 	struct net_device *ndev;
 	int err;
@@ -2908,6 +2965,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	for (i = 0; i < ac->num_ports; i++) {
 		ndev = ac->ports[i];
+		apc = netdev_priv(ndev);
 		if (!ndev) {
 			if (i == 0)
 				dev_err(dev, "No net device to remove\n");
@@ -2931,6 +2989,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		}
 
 		unregister_netdevice(ndev);
+		mana_cleanup_indir_table(apc);
 
 		rtnl_unlock();
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index ab2413d71f6c..146d5db1792f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -245,7 +245,9 @@ static u32 mana_get_rxfh_key_size(struct net_device *ndev)
 
 static u32 mana_rss_indir_size(struct net_device *ndev)
 {
-	return MANA_INDIRECT_TABLE_SIZE;
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->indir_table_sz;
 }
 
 static int mana_get_rxfh(struct net_device *ndev,
@@ -257,7 +259,7 @@ static int mana_get_rxfh(struct net_device *ndev,
 	rxfh->hfunc = ETH_RSS_HASH_TOP; /* Toeplitz */
 
 	if (rxfh->indir) {
-		for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
+		for (i = 0; i < apc->indir_table_sz; i++)
 			rxfh->indir[i] = apc->indir_table[i];
 	}
 
@@ -273,8 +275,8 @@ static int mana_set_rxfh(struct net_device *ndev,
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	bool update_hash = false, update_table = false;
-	u32 save_table[MANA_INDIRECT_TABLE_SIZE];
 	u8 save_key[MANA_HASH_KEY_SIZE];
+	u32 *save_table;
 	int i, err;
 
 	if (!apc->port_is_up)
@@ -284,13 +286,19 @@ static int mana_set_rxfh(struct net_device *ndev,
 	    rxfh->hfunc != ETH_RSS_HASH_TOP)
 		return -EOPNOTSUPP;
 
+	save_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
+	if (!save_table)
+		return -ENOMEM;
+
 	if (rxfh->indir) {
-		for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
-			if (rxfh->indir[i] >= apc->num_queues)
-				return -EINVAL;
+		for (i = 0; i < apc->indir_table_sz; i++)
+			if (rxfh->indir[i] >= apc->num_queues) {
+				err = -EINVAL;
+				goto cleanup;
+			}
 
 		update_table = true;
-		for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+		for (i = 0; i < apc->indir_table_sz; i++) {
 			save_table[i] = apc->indir_table[i];
 			apc->indir_table[i] = rxfh->indir[i];
 		}
@@ -306,7 +314,7 @@ static int mana_set_rxfh(struct net_device *ndev,
 
 	if (err) { /* recover to original values */
 		if (update_table) {
-			for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
+			for (i = 0; i < apc->indir_table_sz; i++)
 				apc->indir_table[i] = save_table[i];
 		}
 
@@ -316,6 +324,9 @@ static int mana_set_rxfh(struct net_device *ndev,
 		mana_config_rss(apc, TRI_STATE_TRUE, update_hash, update_table);
 	}
 
+cleanup:
+	kfree(save_table);
+
 	return err;
 }
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 27684135bb4d..c547756c4284 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -543,11 +543,13 @@ enum {
  */
 #define GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX BIT(2)
 #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
+#define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
 
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
-	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG)
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG | \
+	 GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 561f6719fb4e..59823901b74f 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,8 +30,8 @@ enum TRI_STATE {
 };
 
 /* Number of entries for hardware indirection table must be in power of 2 */
-#define MANA_INDIRECT_TABLE_SIZE 64
-#define MANA_INDIRECT_TABLE_MASK (MANA_INDIRECT_TABLE_SIZE - 1)
+#define MANA_INDIRECT_TABLE_MAX_SIZE 512
+#define MANA_INDIRECT_TABLE_DEF_SIZE 64
 
 /* The Toeplitz hash key's length in bytes: should be multiple of 8 */
 #define MANA_HASH_KEY_SIZE 40
@@ -410,10 +410,11 @@ struct mana_port_context {
 	struct mana_tx_qp *tx_qp;
 
 	/* Indirection Table for RX & TX. The values are queue indexes */
-	u32 indir_table[MANA_INDIRECT_TABLE_SIZE];
+	u32 *indir_table;
+	u32 indir_table_sz;
 
 	/* Indirection table containing RxObject Handles */
-	mana_handle_t rxobj_table[MANA_INDIRECT_TABLE_SIZE];
+	mana_handle_t *rxobj_table;
 
 	/*  Hash key used by the NIC */
 	u8 hashkey[MANA_HASH_KEY_SIZE];
-- 
2.34.1



* Re: [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining
From: David Hildenbrand @ 2024-06-10  8:56 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <ZmaBGSqchtEWnqM1@localhost.localdomain>

On 10.06.24 06:29, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:38AM +0200, David Hildenbrand wrote:
>> We currently have a hack for virtio-mem in place to handle memory
>> offlining with PageOffline pages for which we already adjusted the
>> managed page count.
>>
>> Let's enlighten memory offlining code so we can get rid of that hack,
>> and document the situation.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>
> 

Thanks for the review!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: David Hildenbrand @ 2024-06-10  8:56 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <ZmZ_3Xc7fdrL1R15@localhost.localdomain>

On 10.06.24 06:23, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
>> We currently initialize the memmap such that PG_reserved is set and the
>> refcount of the page is 1. In virtio-mem code, we have to manually clear
>> that PG_reserved flag to make memory offlining with partially hotplugged
>> memory blocks possible: has_unmovable_pages() would otherwise bail out on
>> such pages.
>>
>> We want to avoid PG_reserved where possible and move to typed pages
>> instead. Further, we want to further enlighten memory offlining code about
>> PG_offline: offline pages in an online memory section. One example is
>> handling managed page count adjustments in a cleaner way during memory
>> offlining.
>>
>> So let's initialize the pages with PG_offline instead of PG_reserved.
>> generic_online_page()->__free_pages_core() will now clear that flag before
>> handing that memory to the buddy.
>>
>> Note that the page refcount is still 1 and would forbid offlining of such
>> memory except when special care is taken during GOING_OFFLINE as
>> currently only implemented by virtio-mem.
>>
>> With this change, we can now get non-PageReserved() pages in the XEN
>> balloon list. From what I can tell, that can already happen via
>> decrease_reservation(), so that should be fine.
>>
>> HV-balloon should not really observe a change: partial online memory
>> blocks still cannot get surprise-offlined, because the refcount of these
>> PageOffline() pages is 1.
>>
>> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
>> hotplugged pages are now PageOffline() instead of PageReserved() before
>> they are handed over to the buddy.
>>
>> We'll leave the ZONE_DEVICE case alone for now.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 27e3be75edcf7..0254059efcbe1 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
>>   /*
>>    * Associate the pfn range with the given zone, initializing the memmaps
>>    * and resizing the pgdat/zone data to span the added pages. After this
>> - * call, all affected pages are PG_reserved.
>> + * call, all affected pages are PageOffline().
>>    *
>>    * All aligned pageblocks are initialized to the specified migratetype
>>    * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
>> @@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>>   
>>   	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>>   
>> -	for (i = 0; i < nr_pages; i++)
>> -		SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
>> +	for (i = 0; i < nr_pages; i++) {
>> +		struct page *page = pfn_to_page(pfn + i);
>> +
>> +		__ClearPageOffline(page);
>> +		SetPageVmemmapSelfHosted(page);
> 
> So, refresh my memory here please.
> AFAIR, those VmemmapSelfHosted pages were marked Reserved before, but now,
> memmap_init_range() will not mark them reserved anymore.

Correct.

> I do not think that is ok? I am worried about walkers getting this wrong.
> 
> We usually skip PageReserved pages in walkers because they are pages we cannot deal
> with for those purposes, but with this change, we will leak
> PageVmemmapSelfHosted, and I am not sure whether the walkers are ready for that.

There are fortunately not that many left.

I'd even say marking them (vmemmap) reserved is more wrong than right: 
note that ordinary vmemmap pages after memory hotplug are not reserved! 
Only bootmem should be reserved.

Let's take a look at the relevant core-mm ones (arch stuff is mostly just for 
MMIO remapping)

fs/proc/task_mmu.c:     if (PageReserved(page))
fs/proc/task_mmu.c:     if (PageReserved(page))

-> If we find vmemmap pages mapped into user space we already messed up
    seriously

kernel/power/snapshot.c:        if (PageReserved(page) ||
kernel/power/snapshot.c:        if (PageReserved(page)

-> There should be no change (saveable_page() would still allow saving
    them, highmem does not apply)

mm/hugetlb_vmemmap.c:           if (!PageReserved(head))
mm/hugetlb_vmemmap.c:   if (PageReserved(page))

-> Wants to identify bootmem, but we exclude these
    PageVmemmapSelfHosted() on the splitting part already properly


mm/page_alloc.c:                VM_WARN_ON_ONCE(PageReserved(p));
mm/page_alloc.c:                if (PageReserved(page))

-> pfn_range_valid_contig() would scan them, just like for ordinary
    vmemmap pages during hotplug. We'll simply fail isolating/migrating
    them similarly (like any unmovable allocations) later

mm/page_ext.c:          BUG_ON(PageReserved(page));

-> free_page_ext handling, does not apply

mm/page_isolation.c:            if (PageReserved(page))

-> has_unmovable_pages() should still detect them as unmovable (e.g.,
    neither movable nor LRU).

mm/page_owner.c:                        if (PageReserved(page))
mm/page_owner.c:                        if (PageReserved(page))

-> Simply page_ext_get() will return NULL instead and we'll similarly
    skip them

mm/sparse.c:            if (!PageReserved(virt_to_page(ms->usage))) {

-> Detecting boot memory for ms->usage allocation, does not apply to
    vmemmap.

virt/kvm/kvm_main.c:    if (!PageReserved(page))
virt/kvm/kvm_main.c:    return !PageReserved(page);

-> For MMIO remapping purposes, does not apply to vmemmap


> Moreover, boot memmap pages are marked as PageReserved, which would be
> now inconsistent with those added during hotplug operations.

Just like vmemmap pages allocated dynamically during memory hotplug. 
Now, really only bootmem-ones are PageReserved.

> All in all, I feel uneasy about this change.

I really don't want to mark these pages here PageReserved for the sake 
of it.

Any PageReserved user that I am missing, or why we should handle these 
vmemmap pages differently than the ones allocated during ordinary memory 
hotplug?

In the future, we might want to consider using a dedicated page type for 
them, so we can stop using a bit that doesn't allow us to reliably identify 
them. (we should mark all vmemmap with that type then)
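
To make the walker concern concrete, here is a minimal userspace sketch, not kernel code. Every name in it (`toy_page`, `toy_type`, `toy_walk()`, `toy_demo()`) is made up for illustration. It assumes a walker that skips reserved pages and shows how a dedicated type could let it recognise vmemmap pages explicitly instead of relying on the reserved bit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for page flags/types; not the kernel's definitions. */
enum toy_type { TOY_TYPE_NONE, TOY_TYPE_OFFLINE, TOY_TYPE_VMEMMAP };

struct toy_page {
	int reserved;       /* models PG_reserved, e.g. bootmem */
	enum toy_type type; /* models a typed page */
};

/* Count the pages a walker would act on: skip reserved and typed pages. */
size_t toy_walk(const struct toy_page *pages, size_t nr)
{
	size_t i, handled = 0;

	assert(pages || nr == 0);
	for (i = 0; i < nr; i++) {
		if (pages[i].reserved)
			continue; /* bootmem-style page, hands off */
		if (pages[i].type != TOY_TYPE_NONE)
			continue; /* identified by type, no reserved bit needed */
		handled++;
	}
	return handled;
}

/* One bootmem page, one typed vmemmap page, two ordinary pages. */
size_t toy_demo(void)
{
	struct toy_page pages[4] = {
		{ .reserved = 1, .type = TOY_TYPE_NONE },
		{ .reserved = 0, .type = TOY_TYPE_VMEMMAP },
		{ .reserved = 0, .type = TOY_TYPE_NONE },
		{ .reserved = 0, .type = TOY_TYPE_NONE },
	};

	return toy_walk(pages, 4);
}
```

In this toy model the vmemmap page is skipped because of its type alone, which is the property a dedicated page type would provide.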

Thanks!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-10  8:38 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <ZmZ7GgwJw4ucPJaM@localhost.localdomain>

On 10.06.24 06:03, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
>> In preparation for further changes, let's teach __free_pages_core()
>> about the differences of memory hotplug handling.
>>
>> Move the memory hotplug specific handling from generic_online_page() to
>> __free_pages_core(), use adjust_managed_page_count() on the memory
>> hotplug path, and spell out why memory freed via memblock
>> cannot currently use adjust_managed_page_count().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> All looks good but I am puzzled with something.
> 
>> +	} else {
>> +		/* memblock adjusts totalram_pages() ahead of time. */
>> +		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>> +	}
> 
> You say that memblock adjusts totalram_pages ahead of time, and I guess
> you mean in memblock_free_all()

And memblock_free_late(), which uses atomic_long_inc().

> 
>   pages = free_low_memory_core_early()
>   totalram_pages_add(pages);
> 
> but that is not ahead, it looks like it is updating __after__ sending
> them to buddy?

Right (it's suboptimal, but not really problematic so far. Hopefully Wei 
can clean it up and move it in here as well)

For the time being

"/* memblock adjusts totalram_pages() manually. */"

?
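
The ordering being discussed can be sketched with a small userspace model. This is only an illustration under stated assumptions: `toy_zone`, `toy_totalram`, `toy_free_pages_core()` and `toy_memblock_free_all()` are invented names, and the only kernel behaviour relied on is that adjust_managed_page_count() updates both the zone's managed pages and totalram_pages(), while the memblock path adds to totalram_pages() in bulk.

```c
#include <assert.h>

/* Toy counters; illustrative only, not the kernel's data structures. */
struct toy_zone { long managed_pages; };
long toy_totalram;

enum toy_context { TOY_MEMINIT_EARLY, TOY_MEMINIT_HOTPLUG };

/* Models __free_pages_core(): the accounting depends on the context. */
void toy_free_pages_core(struct toy_zone *zone, long nr_pages,
			 enum toy_context ctx)
{
	assert(zone && nr_pages > 0);
	zone->managed_pages += nr_pages;
	if (ctx == TOY_MEMINIT_HOTPLUG)
		toy_totalram += nr_pages; /* hotplug path adjusts both here */
	/* TOY_MEMINIT_EARLY: the memblock caller accounts totalram itself. */
}

/* Models memblock_free_all(): free all chunks, then add to totalram in bulk. */
long toy_memblock_free_all(struct toy_zone *zone, const long *chunks, int n)
{
	long pages = 0;
	int i;

	for (i = 0; i < n; i++) {
		toy_free_pages_core(zone, chunks[i], TOY_MEMINIT_EARLY);
		pages += chunks[i];
	}
	toy_totalram += pages; /* done after the pages were handed over */
	return pages;
}
```

In the model, both paths end with consistent counters; the difference is only when the total is updated, which is why "manually" describes the memblock path better than "ahead of time".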

Thanks!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining
From: Oscar Salvador @ 2024-06-10  4:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-4-david@redhat.com>

On Fri, Jun 07, 2024 at 11:09:38AM +0200, David Hildenbrand wrote:
> We currently have a hack for virtio-mem in place to handle memory
> offlining with PageOffline pages for which we already adjusted the
> managed page count.
> 
> Let's enlighten memory offlining code so we can get rid of that hack,
> and document the situation.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: Oscar Salvador @ 2024-06-10  4:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-3-david@redhat.com>

On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
> 
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Further, we want to further enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
> 
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
> 
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE as
> currently only implemented by virtio-mem.
> 
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
> 
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
> 
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
> 
> We'll leave the ZONE_DEVICE case alone for now.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 27e3be75edcf7..0254059efcbe1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
>  /*
>   * Associate the pfn range with the given zone, initializing the memmaps
>   * and resizing the pgdat/zone data to span the added pages. After this
> - * call, all affected pages are PG_reserved.
> + * call, all affected pages are PageOffline().
>   *
>   * All aligned pageblocks are initialized to the specified migratetype
>   * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
> @@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>  
>  	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>  
> -	for (i = 0; i < nr_pages; i++)
> -		SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
> +	for (i = 0; i < nr_pages; i++) {
> +		struct page *page = pfn_to_page(pfn + i);
> +
> +		__ClearPageOffline(page);
> +		SetPageVmemmapSelfHosted(page);

So, refresh my memory here please.
AFAIR, those VmemmapSelfHosted pages were marked Reserved before, but now,
memmap_init_range() will not mark them reserved anymore.
I do not think that is ok? I am worried about walkers getting this wrong.

We usually skip PageReserved pages in walkers because they are pages we cannot deal
with for those purposes, but with this change, we will leak
PageVmemmapSelfHosted, and I am not sure whether the walkers are ready for that.

Moreover, boot memmap pages are marked as PageReserved, which would be
now inconsistent with those added during hotplug operations.

All in all, I feel uneasy about this change.

-- 
Oscar Salvador
SUSE Labs


* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: Oscar Salvador @ 2024-06-10  4:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Andrew Morton, Mike Rapoport, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-2-david@redhat.com>

On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
> 
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

All looks good but I am puzzled with something.

> +	} else {
> +		/* memblock adjusts totalram_pages() ahead of time. */
> +		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> +	}

You say that memblock adjusts totalram_pages ahead of time, and I guess
you mean in memblock_free_all()

 pages = free_low_memory_core_early()
 totalram_pages_add(pages);

but that is not ahead, it looks like it is updating __after__ sending
them to buddy?


-- 
Oscar Salvador
SUSE Labs


* [PATCH 18/18] KVM: x86: hyper-v: Handle VSM hcalls in user-space
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Let user-space handle all hypercalls that fall under the AccessVsm
partition privilege flag. That is:
 - HvCallModifyVtlProtectionMask
 - HvCallEnablePartitionVtl
 - HvCallEnableVpVtl
 - HvCallVtlCall
 - HvCallVtlReturn

All these are VTL aware and as such need to be handled in user-space.
Additionally, select KVM_GENERIC_MEMORY_ATTRIBUTES when
CONFIG_KVM_HYPERV is enabled, as it's necessary in order to implement
VTL memory protections.
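
As a rough userspace sketch of the classification described above (not the kernel implementation): the hypercall codes are the ones this patch adds to hyperv-tlfs.h, while `toy_is_vsm_hcall()`, `toy_is_vtl_call_return()` and `toy_sets_result()` are hypothetical helper names. The last one models the rule that VTL call and return do not get a hypercall result written back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypercall codes as defined in this patch's hyperv-tlfs.h hunk. */
#define HVCALL_MODIFY_VTL_PROTECTION_MASK 0x000c
#define HVCALL_ENABLE_PARTITION_VTL       0x000d
#define HVCALL_ENABLE_VP_VTL              0x000f
#define HVCALL_VTL_CALL                   0x0011
#define HVCALL_VTL_RETURN                 0x0012

/* True for the hypercalls gated by the AccessVsm privilege flag. */
bool toy_is_vsm_hcall(uint16_t code)
{
	switch (code) {
	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
	case HVCALL_ENABLE_PARTITION_VTL:
	case HVCALL_ENABLE_VP_VTL:
	case HVCALL_VTL_CALL:
	case HVCALL_VTL_RETURN:
		return true;
	default:
		return false;
	}
}

/* Same test as kvm_hv_is_vtl_call_return() in the patch. */
bool toy_is_vtl_call_return(uint16_t code)
{
	return code == HVCALL_VTL_CALL || code == HVCALL_VTL_RETURN;
}

/* Hypothetical helper: should completion write a hypercall result? */
bool toy_sets_result(uint16_t code)
{
	return !toy_is_vtl_call_return(code);
}
```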

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst    | 23 +++++++++++++++++++++++
 arch/x86/kvm/Kconfig              |  1 +
 arch/x86/kvm/hyperv.c             | 29 +++++++++++++++++++++++++----
 include/asm-generic/hyperv-tlfs.h |  6 +++++-
 4 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 6d3bc5092ea63..77af2ccf49a30 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8969,3 +8969,26 @@ HvCallGetVpIndexFromApicId. Currently, it is only used in conjunction with
 HV_ACCESS_VSM, and immediately exits to userspace with KVM_EXIT_HYPERV_HCALL as
 the reason. Userspace is expected to complete the hypercall before resuming
 execution.
+
+10.4 HV_ACCESS_VSM
+------------------
+
+:Location: CPUID.40000003H:EBX[bit 16]
+
+This CPUID indicates that KVM supports HvCallModifyVtlProtectionMask,
+HvCallEnablePartitionVtl, HvCallEnableVpVtl, HvCallVtlCall, and
+HvCallVtlReturn.  Additionally, as a prerequisite to being able to implement
+Hyper-V VSM, it also identifies the availability of HvTranslateVirtualAddress,
+as well as the VTL-aware aspects of HvCallSendSyntheticClusterIpi and
+HvCallSendSyntheticClusterIpiEx.
+
+All these hypercalls immediately exit with KVM_EXIT_HYPERV_HCALL as the reason.
+Userspace is expected to complete the hypercall before resuming execution.
+Note that both IPI hypercalls will only exit to userspace if the request is
+VTL-aware, which will only happen if HV_ACCESS_VSM is exposed to the guest.
+
+Access restriction memory attributes (4.141) are available to simplify
+HvCallModifyVtlProtectionMask's implementation.
+
+Ultimately this CPUID also indicates that KVM_MP_STATE_HV_INACTIVE_VTL is
+available.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fec95a7702703..8d851fe3b8c25 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -157,6 +157,7 @@ config KVM_SMM
 config KVM_HYPERV
 	bool "Support for Microsoft Hyper-V emulation"
 	depends on KVM
+	select KVM_GENERIC_MEMORY_ATTRIBUTES
 	default y
 	help
 	  Provides KVM support for emulating Microsoft Hyper-V.  This allows KVM
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index dd64f41dc835d..1158c59a92790 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2388,7 +2388,12 @@ static void kvm_hv_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result)
 	}
 }
 
-static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
+static inline bool kvm_hv_is_vtl_call_return(u16 code)
+{
+	return code == HVCALL_VTL_CALL || code == HVCALL_VTL_RETURN;
+}
+
+static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u16 code, u64 result)
 {
 	u32 tlb_lock_count = 0;
 	int ret;
@@ -2400,9 +2405,12 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
 		result = HV_STATUS_INVALID_HYPERCALL_INPUT;
 
 	trace_kvm_hv_hypercall_done(result);
-	kvm_hv_hypercall_set_result(vcpu, result);
 	++vcpu->stat.hypercalls;
 
+	/* VTL call and return don't set a hcall result */
+	if (!kvm_hv_is_vtl_call_return(code))
+		kvm_hv_hypercall_set_result(vcpu, result);
+
 	ret = kvm_skip_emulated_instruction(vcpu);
 
 	if (tlb_lock_count)
@@ -2459,7 +2467,7 @@ static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
 	}
 
-	return kvm_hv_hypercall_complete(vcpu, result);
+	return kvm_hv_hypercall_complete(vcpu, code, result);
 }
 
 static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
@@ -2513,6 +2521,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_SEND_IPI_EX:
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
+	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
 	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		return true;
 	}
@@ -2552,6 +2561,12 @@ static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code)
 		 */
 		return !kvm_hv_is_syndbg_enabled(hv_vcpu->vcpu) ||
 			hv_vcpu->cpuid_cache.features_ebx & HV_DEBUGGING;
+	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
+	case HVCALL_ENABLE_PARTITION_VTL:
+	case HVCALL_ENABLE_VP_VTL:
+	case HVCALL_VTL_CALL:
+	case HVCALL_VTL_RETURN:
+		return hv_vcpu->cpuid_cache.features_ebx & HV_ACCESS_VSM;
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
 		return hv_vcpu->cpuid_cache.features_ebx &
@@ -2744,6 +2759,11 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 			break;
 		}
 		goto hypercall_userspace_exit;
+	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
+	case HVCALL_ENABLE_PARTITION_VTL:
+	case HVCALL_ENABLE_VP_VTL:
+	case HVCALL_VTL_CALL:
+	case HVCALL_VTL_RETURN:
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
 	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
@@ -2765,7 +2785,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	}
 
 hypercall_complete:
-	return kvm_hv_hypercall_complete(vcpu, ret);
+	return kvm_hv_hypercall_complete(vcpu, hc.code, ret);
 
 hypercall_userspace_exit:
 	vcpu->run->exit_reason = KVM_EXIT_HYPERV;
@@ -2921,6 +2941,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_POST_MESSAGES;
 			ent->ebx |= HV_SIGNAL_EVENTS;
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
+			ent->ebx |= HV_ACCESS_VSM;
 			ent->ebx |= HV_ACCESS_VP_REGISTERS;
 			ent->ebx |= HV_START_VIRTUAL_PROCESSOR;
 
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index e24b88ec4ec00..6b12e5818292c 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -149,9 +149,13 @@ union hv_reference_tsc_msr {
 /* Declare the various hypercall operations. */
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE	0x0002
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST	0x0003
-#define HVCALL_ENABLE_VP_VTL			0x000f
 #define HVCALL_NOTIFY_LONG_SPIN_WAIT		0x0008
 #define HVCALL_SEND_IPI				0x000b
+#define HVCALL_MODIFY_VTL_PROTECTION_MASK	0x000c
+#define HVCALL_ENABLE_PARTITION_VTL		0x000d
+#define HVCALL_ENABLE_VP_VTL			0x000f
+#define HVCALL_VTL_CALL				0x0011
+#define HVCALL_VTL_RETURN			0x0012
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
 #define HVCALL_SEND_IPI_EX			0x0015
-- 
2.40.1



* [PATCH 17/18] KVM: Introduce traces to track memory attributes modification.
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Introduce traces that track memory attributes modification.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 include/trace/events/kvm.h | 20 ++++++++++++++++++++
 virt/kvm/kvm_main.c        |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af42..aa6caeb16f12a 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,6 +489,26 @@ TRACE_EVENT(kvm_test_age_hva,
 	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
 );
 
+TRACE_EVENT(kvm_vm_set_mem_attributes,
+	TP_PROTO(u64 start, u64 cnt, u64 attributes),
+	TP_ARGS(start, cnt, attributes),
+
+	TP_STRUCT__entry(
+		__field(	u64,	start		)
+		__field(	u64,	cnt		)
+		__field(	u64,	attributes	)
+	),
+
+	TP_fast_assign(
+		__entry->start		= start;
+		__entry->cnt		= cnt;
+		__entry->attributes	= attributes;
+	),
+
+	TP_printk("gfn 0x%llx, cnt 0x%llx, attributes 0x%llx",
+		  __entry->start, __entry->cnt, __entry->attributes)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bd27fc01e9715..1c493ece3deb1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2556,6 +2556,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	kvm_handle_gfn_range(kvm, &post_set_range);
 
+	trace_kvm_vm_set_mem_attributes(start, end - start, attributes);
+
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
 
-- 
2.40.1



* [PATCH 16/18] KVM: x86: Take mem attributes into account when faulting memory
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Take into account access restrictions memory attributes when faulting
guest memory. Prohibited memory accesses will cause a user-space fault
exit.

Additionally, bypass a warning in the !tdp case. Access restrictions in
guest page tables might not necessarily match the host pte's when memory
attributes are in use.
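
The access check can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the patch's code: the `TOY_ATTR_*` bits are a hypothetical encoding (the series defines its own), `toy_fault` and `toy_check_access()` are invented names, and the toy uses a plain `read` flag where the patch tests `fault->user`. As in the patch, an attribute value of zero means no restrictions apply.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical "allowed access" bits; not the series' actual encoding. */
#define TOY_ATTR_R (1UL << 0)
#define TOY_ATTR_W (1UL << 1)
#define TOY_ATTR_X (1UL << 2)

struct toy_fault {
	bool read, write, exec;            /* what the faulting access wants */
	bool map_writable, map_executable; /* what the mapping may allow */
};

/* Returns 0 to continue the fault, -EFAULT to exit to userspace. */
int toy_check_access(unsigned long attrs, struct toy_fault *f)
{
	bool may_read, may_write, may_exec;

	assert(f);
	if (!attrs)
		return 0; /* no attributes set: nothing to enforce */

	may_read = attrs & TOY_ATTR_R;
	may_write = attrs & TOY_ATTR_W;
	may_exec = attrs & TOY_ATTR_X;

	if ((f->read && !may_read) || (f->write && !may_write) ||
	    (f->exec && !may_exec))
		return -EFAULT;

	/* Narrow the mapping so later accesses fault again if prohibited. */
	f->map_writable = may_write;
	f->map_executable = may_exec;
	return 0;
}
```

Narrowing `map_writable` and `map_executable` on the allowed path is what keeps a later write or instruction fetch to the same page from silently succeeding.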

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c         | 64 ++++++++++++++++++++++++++++------
 arch/x86/kvm/mmu/mmutrace.h    | 29 +++++++++++++++
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 include/linux/kvm_host.h       |  4 +++
 4 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 91edd873dcdbc..dfe50c9c31f7b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -754,7 +754,8 @@ static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
 	return sp->role.access;
 }
 
-static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
+static void kvm_mmu_page_set_translation(struct kvm *kvm,
+					 struct kvm_mmu_page *sp, int index,
 					 gfn_t gfn, unsigned int access)
 {
 	if (sp_has_gptes(sp)) {
@@ -762,10 +763,17 @@ static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
 		return;
 	}
 
-	WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
-	          "access mismatch under %s page %llx (expected %u, got %u)\n",
-	          sp->role.passthrough ? "passthrough" : "direct",
-	          sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+	/*
+	 * Userspace might have introduced memory attributes for this gfn,
+	 * breaking the assumption that the spte's access restrictions match
+	 * the guest's. Userspace is also responsible for taking care of
+	 * faults caused by these 'artificial' access restrictions.
+	 */
+	WARN_ONCE(access != kvm_mmu_page_get_access(sp, index) &&
+		  !kvm_get_memory_attributes(kvm, gfn),
+		  "access mismatch under %s page %llx (expected %u, got %u)\n",
+		  sp->role.passthrough ? "passthrough" : "direct", sp->gfn,
+		  kvm_mmu_page_get_access(sp, index), access);
 
 	WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
 	          "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
@@ -773,12 +781,12 @@ static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
 	          sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
 }
 
-static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
-				    unsigned int access)
+static void kvm_mmu_page_set_access(struct kvm *kvm, struct kvm_mmu_page *sp,
+				    int index, unsigned int access)
 {
 	gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
 
-	kvm_mmu_page_set_translation(sp, index, gfn, access);
+	kvm_mmu_page_set_translation(kvm, sp, index, gfn, access);
 }
 
 /*
@@ -1607,7 +1615,7 @@ static void __rmap_add(struct kvm *kvm,
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access);
+	kvm_mmu_page_set_translation(kvm, sp, spte_index(spte), gfn, access);
 	kvm_update_page_stats(kvm, sp->role.level, 1);
 
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
@@ -2928,7 +2936,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 		rmap_add(vcpu, slot, sptep, gfn, pte_access);
 	} else {
 		/* Already rmapped but the pte_access bits may have changed. */
-		kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access);
+		kvm_mmu_page_set_access(vcpu->kvm, sp, spte_index(sptep),
+					pte_access);
 	}
 
 	return ret;
@@ -4320,6 +4329,38 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	return RET_PF_CONTINUE;
 }
 
+static int kvm_mem_attributes_faultin_access_prots(struct kvm_vcpu *vcpu,
+						   struct kvm_page_fault *fault)
+{
+	bool may_read, may_write, may_exec;
+	unsigned long attrs;
+
+	attrs = kvm_get_memory_attributes(vcpu->kvm, fault->gfn);
+	if (!attrs)
+		return RET_PF_CONTINUE;
+
+	if (!kvm_mem_attributes_valid(vcpu->kvm, attrs)) {
+		kvm_err("Invalid mem attributes 0x%lx found for address 0x%016llx\n",
+			attrs, fault->addr);
+		return -EFAULT;
+	}
+
+	trace_kvm_mem_attributes_faultin_access_prots(vcpu, fault, attrs);
+
+	may_read = kvm_mem_attributes_may_read(attrs);
+	may_write = kvm_mem_attributes_may_write(attrs);
+	may_exec = kvm_mem_attributes_may_exec(attrs);
+
+	if ((fault->user && !may_read) || (fault->write && !may_write) ||
+	    (fault->exec && !may_exec))
+		return -EFAULT;
+
+	fault->map_writable = may_write;
+	fault->map_executable = may_exec;
+
+	return RET_PF_CONTINUE;
+}
+
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	bool async;
@@ -4375,7 +4416,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	 * Now that we have a snapshot of mmu_invalidate_seq we can check for a
 	 * private vs. shared mismatch.
 	 */
-	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn) ||
+	    kvm_mem_attributes_faultin_access_prots(vcpu, fault)) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 195d98bc8de85..ddbdd7396e9fa 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -440,6 +440,35 @@ TRACE_EVENT(
 		  __entry->gfn, __entry->spte, __entry->level, __entry->errno)
 );
 
+TRACE_EVENT(kvm_mem_attributes_faultin_access_prots,
+	TP_PROTO(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+		 u64 mem_attrs),
+	TP_ARGS(vcpu, fault, mem_attrs),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, vcpu_id)
+		__field(unsigned long, guest_rip)
+		__field(u64, fault_address)
+		__field(bool, write)
+		__field(bool, exec)
+		__field(u64, mem_attrs)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+		__entry->guest_rip = kvm_rip_read(vcpu);
+		__entry->fault_address = fault->addr;
+		__entry->write = fault->write;
+		__entry->exec = fault->exec;
+		__entry->mem_attrs = mem_attrs;
+	),
+
+	TP_printk("vcpu %d rip 0x%lx gfn 0x%016llx access %s mem_attrs 0x%llx",
+		  __entry->vcpu_id, __entry->guest_rip, __entry->fault_address,
+		  __entry->exec ? "X" : (__entry->write ? "W" : "R"),
+		  __entry->mem_attrs)
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index d3dbcf382ed2d..166f5f0e885e0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -954,7 +954,7 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
 		return 0;
 
 	/* Update the shadowed access bits in case they changed. */
-	kvm_mmu_page_set_access(sp, i, pte_access);
+	kvm_mmu_page_set_access(vcpu->kvm, sp, i, pte_access);
 
 	sptep = &sp->spt[i];
 	spte = *sptep;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 85378345e8e77..9c26161d13dea 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2463,6 +2463,10 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
+static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return 0;
+}
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
-- 
2.40.1


^ permalink raw reply related
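
The access check that kvm_mem_attributes_faultin_access_prots() adds in the
patch above can be modelled outside the kernel. What follows is only a
minimal sketch under stated assumptions: struct fault_model, the ATTR_*
names and faultin_access_prots() are hypothetical stand-ins for
struct kvm_page_fault, the KVM_MEMORY_ATTRIBUTE_* flags and the kernel
function, 0 stands in for RET_PF_CONTINUE, and the
kvm_mem_attributes_valid() step is left out. As in the patch, the read
check is keyed on the fault's user flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as proposed by the series; the ATTR_ names are local to this sketch. */
#define ATTR_NR (1ULL << 0)	/* not readable */
#define ATTR_NW (1ULL << 1)	/* not writable */
#define ATTR_NX (1ULL << 2)	/* not executable */

/* Hypothetical, reduced stand-in for struct kvm_page_fault. */
struct fault_model {
	bool user;
	bool write;
	bool exec;
	bool map_writable;
	bool map_executable;
};

/*
 * Returns 0 (standing in for RET_PF_CONTINUE) when the fault may proceed,
 * -EFAULT when the attributes forbid the access. On success with non-zero
 * attributes, the mapping permissions are narrowed to what is allowed.
 */
static inline int faultin_access_prots(struct fault_model *fault, uint64_t attrs)
{
	bool may_read = !(attrs & ATTR_NR);
	bool may_write = !(attrs & ATTR_NW);
	bool may_exec = !(attrs & ATTR_NX);

	/* No attributes set: no extra restrictions, leave the fault untouched. */
	if (!attrs)
		return 0;

	if ((fault->user && !may_read) || (fault->write && !may_write) ||
	    (fault->exec && !may_exec))
		return -EFAULT;

	fault->map_writable = may_write;
	fault->map_executable = may_exec;

	return 0;
}
```

In this model a write fault on a gfn marked NW is rejected, while an exec
fault on the same gfn proceeds with map_writable cleared, so a later write
through that mapping faults again and can be reported to userspace.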

* [PATCH 15/18] KVM: Introduce RWX memory attributes
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Declare memory attributes to map memory regions as non-readable,
non-writable, and/or non-executable.

The attributes are negated for the following reasons:
 - Setting a 0 memory attribute (attr->attributes == 0) shouldn't
   introduce any access restrictions. For example, when moving from
   private to shared mappings in the context of confidential computing.
 - In practice, with negated attributes, setting a non-private RWX memory
   attribute is analogous to a delete operation. It's a nice outcome, as
   it forces remapping the region with huge pages, doing the right thing
   for use cases that have short-lived access-restricted regions like
   Hyper-V's VSM.
 - A non-negated version of the flags has no way of expressing a
   no-access mapping (NR/NW/NX) without having to introduce an extra
   flag (since 0 isn't available).
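
To make the negated encoding concrete, here is a small userspace sketch of
the combination rules this patch enforces. It is an illustration under
stated assumptions, not the kernel code: the ATTR_* names and attrs_valid()
are hypothetical, and the 'supported' argument stands in for the value
returned by kvm_supported_mem_attributes().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the patch; the ATTR_ names are local to this sketch. */
#define ATTR_NR      (1ULL << 0)	/* not readable */
#define ATTR_NW      (1ULL << 1)	/* not writable */
#define ATTR_NX      (1ULL << 2)	/* not executable */
#define ATTR_PRIVATE (1ULL << 3)	/* guest private memory */

static inline bool may_read(uint64_t attrs)
{
	return !(attrs & ATTR_NR);
}

static inline bool may_write(uint64_t attrs)
{
	return !(attrs & ATTR_NW);
}

static inline bool may_exec(uint64_t attrs)
{
	return !(attrs & ATTR_NX);
}

/* Userspace model of the validity rules; 'supported' is a mask of allowed bits. */
static inline bool attrs_valid(uint64_t attrs, uint64_t supported)
{
	/* Reject any bit outside the supported set. */
	if (attrs & ~supported)
		return false;

	/* Private memory cannot be combined with access restrictions. */
	if ((attrs & ATTR_PRIVATE) &&
	    (!may_read(attrs) || !may_write(attrs) || !may_exec(attrs)))
		return false;

	/* Write and exec access both require read access. */
	if ((may_write(attrs) || may_exec(attrs)) && !may_read(attrs))
		return false;

	return true;
}
```

With all four bits supported, 0 means plain RWX with no restrictions,
NW | NX is a read-only mapping, and NR | NW | NX is the no-access mapping.
NR alone is rejected because it would leave a write/exec mapping without
read access.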

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst | 14 +++++++++++---
 include/linux/kvm_host.h       | 22 +++++++++++++++++++++-
 include/uapi/linux/kvm.h       |  3 +++
 virt/kvm/kvm_main.c            | 32 +++++++++++++++++++++++++++++---
 4 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 18ddea9c4c58a..6d3bc5092ea63 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6313,15 +6313,23 @@ of guest physical memory.
 	__u64 flags;
   };
 
+  #define KVM_MEMORY_ATTRIBUTE_NR                (1ULL << 0)
+  #define KVM_MEMORY_ATTRIBUTE_NW                (1ULL << 1)
+  #define KVM_MEMORY_ATTRIBUTE_NX                (1ULL << 2)
   #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 The address and size must be page aligned.  The supported attributes can be
 retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES.  If
 executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
 supported by that VM.  If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
-returns all attributes supported by KVM.  The only attribute defined at this
-time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
-guest private memory.
+returns all attributes supported by KVM.  The attributes defined at this
+time are:
+
+ - KVM_MEMORY_ATTRIBUTE_NR/NW/NX - Mark the memory region as non-readable,
+   non-writable and/or non-executable, respectively.  Note that write-only,
+   exec-only and write-exec mappings are not supported.
+ - KVM_MEMORY_ATTRIBUTE_PRIVATE - Marks the associated gfn as being guest
+   private memory.
 
 Note, there is no "get" API.  Userspace is responsible for explicitly tracking
 the state of a gfn/page as needed.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9250bf1c4db15..85378345e8e77 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2411,6 +2411,21 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
+static inline bool kvm_mem_attributes_may_read(u64 attrs)
+{
+	return !(attrs & KVM_MEMORY_ATTRIBUTE_NR);
+}
+
+static inline bool kvm_mem_attributes_may_write(u64 attrs)
+{
+	return !(attrs & KVM_MEMORY_ATTRIBUTE_NW);
+}
+
+static inline bool kvm_mem_attributes_may_exec(u64 attrs)
+{
+	return !(attrs & KVM_MEMORY_ATTRIBUTE_NX);
+}
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
@@ -2423,7 +2438,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
-
+bool kvm_mem_attributes_valid(struct kvm *kvm, unsigned long attrs);
 static inline bool kvm_memory_attributes_in_use(struct kvm *kvm)
 {
 	return !xa_empty(&kvm->mem_attr_array);
@@ -2435,6 +2450,11 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
 #else
+static inline bool kvm_mem_attributes_valid(struct kvm *kvm,
+					    unsigned long attrs)
+{
+	return false;
+}
 static inline bool kvm_memory_attributes_in_use(struct kvm *kvm)
 {
 	return false;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 516d39910f9ab..26d4477dae8c6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1550,6 +1550,9 @@ struct kvm_memory_attributes {
 	__u64 flags;
 };
 
+#define KVM_MEMORY_ATTRIBUTE_NR		       (1ULL << 0)
+#define KVM_MEMORY_ATTRIBUTE_NW		       (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_NX		       (1ULL << 2)
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 63c4b6739edee..bd27fc01e9715 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2430,10 +2430,14 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+	u64 supported_attrs = KVM_MEMORY_ATTRIBUTE_NR |
+			      KVM_MEMORY_ATTRIBUTE_NW |
+			      KVM_MEMORY_ATTRIBUTE_NX;
+
 	if (!kvm || kvm_arch_has_private_mem(kvm))
-		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+		supported_attrs |= KVM_MEMORY_ATTRIBUTE_PRIVATE;
 
-	return 0;
+	return supported_attrs;
 }
 
 static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
@@ -2557,6 +2561,28 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	return r;
 }
+
+bool kvm_mem_attributes_valid(struct kvm *kvm, unsigned long attrs)
+{
+	bool may_read = kvm_mem_attributes_may_read(attrs);
+	bool may_write = kvm_mem_attributes_may_write(attrs);
+	bool may_exec = kvm_mem_attributes_may_exec(attrs);
+	bool priv = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	if (attrs & ~kvm_supported_mem_attributes(kvm))
+		return false;
+
+	/* Private memory and access permissions are incompatible */
+	if (priv && (!may_read || !may_write || !may_exec))
+		return false;
+
+	/* Write and exec mappings require read access */
+	if ((may_write || may_exec) && !may_read)
+		return false;
+
+	return true;
+}
+
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 					   struct kvm_memory_attributes *attrs)
 {
@@ -2565,7 +2591,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 	/* flags is currently not used. */
 	if (attrs->flags)
 		return -EINVAL;
-	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+	if (!kvm_mem_attributes_valid(kvm, attrs->attributes))
 		return -EINVAL;
 	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
 		return -EINVAL;
-- 
2.40.1


^ permalink raw reply related

* [PATCH 14/18] KVM: x86/mmu: Init memslot if memory attributes available
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Systems that lack private memory support are about to start using memory
attributes. So query if the memory attributes xarray is empty in order
to decide whether it's necessary to init the hugepage information when
installing a new memslot.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 include/linux/kvm_host.h | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d56c04fbdc66b..91edd873dcdbc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7487,7 +7487,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
 {
 	int level;
 
-	if (!kvm_arch_has_private_mem(kvm))
+	if (!kvm_memory_attributes_in_use(kvm))
 		return;
 
 	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4fa16c4772269..9250bf1c4db15 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2424,12 +2424,21 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
 
+static inline bool kvm_memory_attributes_in_use(struct kvm *kvm)
+{
+	return !xa_empty(&kvm->mem_attr_array);
+}
+
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
 	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
 #else
+static inline bool kvm_memory_attributes_in_use(struct kvm *kvm)
+{
+	return false;
+}
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
-- 
2.40.1


^ permalink raw reply related

* [PATCH 13/18] KVM: x86/mmu: Avoid warning when installing non-private memory attributes
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

In preparation for introducing RWX memory attributes, only warn on
systems with no private memory support when user-space is actually
attempting to install a memory attribute that includes
KVM_MEMORY_ATTRIBUTE_PRIVATE.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 8 ++++++--
 virt/kvm/kvm_main.c    | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0c210b96419f..d56c04fbdc66b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7359,6 +7359,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range)
 {
+	unsigned long attrs = range->arg.attributes;
+	bool priv_attr = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
 	/*
 	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
 	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
@@ -7370,7 +7373,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
 	 * a hugepage can be used for affected ranges.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(priv_attr && !kvm_arch_has_private_mem(kvm)))
 		return false;
 
 	return kvm_unmap_gfn_range(kvm, range);
@@ -7415,6 +7418,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range)
 {
 	unsigned long attrs = range->arg.attributes;
+	bool priv_attr = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 	struct kvm_memory_slot *slot = range->slot;
 	int level;
 
@@ -7427,7 +7431,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * a range that has PRIVATE GFNs, and conversely converting a range to
 	 * SHARED may now allow hugepages.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(priv_attr && !kvm_arch_has_private_mem(kvm)))
 		return false;
 
 	/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14841acb8b959..63c4b6739edee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2506,6 +2506,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	struct kvm_mmu_notifier_range pre_set_range = {
 		.start = start,
 		.end = end,
+		.arg.attributes = attributes,
 		.handler = kvm_pre_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_begin,
 		.flush_on_ret = true,
-- 
2.40.1


^ permalink raw reply related
