Linux-HyperV List

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Michael Kelley @ 2024-06-06 14:34 UTC
  To: Thomas Gleixner, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <87le3i2z5g.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, June 6, 2024 2:34 AM
> 
> On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
> >>
> >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> >> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> >> >
> >> > In /proc/interrupts, the double-counting isn't a problem, and is
> >> > potentially helpful as you say. But /proc/stat, for example, shows a total
> >> > interrupt count, which will be roughly double what it was before. That
> >> > /proc/stat value then shows up in user space in vmstat, for example.
> >> > That's what I was concerned about, though it's not a huge problem in
> >> > the grand scheme of things.
> >>
> >> That's trivial to solve. We can mark interrupts to be excluded from
> >> /proc/stat accounting.
> >>
> >
> > OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> > out the HYP interrupts. But what do you envision on arm64, where
> > there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
> > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> > with no breakout by IRQ. Identifying the right IRQ and subtracting it
> > out later looks a lot uglier than the conditional stats accounting.
> 
> Sure. There are two ways to solve that:
> 
> 1) Introduce an IRQ_NO_PER_CPU_STATS flag, mark the interrupt
>    accordingly and make the stats increment conditional on it.
>    The downside is that the conditional affects every interrupt.
> 
> 2) Do something like this:
> 
> static inline
> void __handle_percpu_irq(struct irq_desc *desc,
> 			 irqreturn_t (*handle)(struct irq_desc *))
> {
> 	struct irq_chip *chip = irq_desc_get_chip(desc);
> 
> 	if (chip->irq_ack)
> 		chip->irq_ack(&desc->irq_data);
> 
> 	handle(desc);
> 
> 	if (chip->irq_eoi)
> 		chip->irq_eoi(&desc->irq_data);
> }
> 
> void handle_percpu_irq(struct irq_desc *desc)
> {
> 	/*
> 	 * PER CPU interrupts are not serialized. Do not touch
> 	 * desc->tot_count.
> 	 */
> 	__kstat_incr_irqs_this_cpu(desc);
> 	__handle_percpu_irq(desc, handle_irq_event_percpu);
> }
> 
> void handle_percpu_irq_nostat(struct irq_desc *desc)
> {
> 	__this_cpu_inc(desc->kstat_irqs->cnt);
> 	__handle_percpu_irq(desc, __handle_irq_event_percpu);
> }
> 
> So that keeps the interrupt accounted for in /proc/interrupts. If you
> don't want that remove the __this_cpu_inc() and mark the interrupt with
> irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
> /proc/interrupts too.
> 

Yes, this works for not double-counting in the first place. Account for the
control message interrupts in their own Linux IRQ. Then for the top-level
interrupt, instead of adding a new handler with conditional accounting,
add a new per-CPU handler that does no accounting. I had not noticed
the IRQ_HIDDEN flag, and that solves my concern about having an
entry in /proc/interrupts that always shows zero interrupts.  And with
no double-counting, the interrupt counts in /proc/stat won't be bloated.

On x86, I'll have to separately make the "HYP" line in /proc/interrupts
go away, but that's easy.
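
To make the accounting concrete, here is a minimal user-space sketch of
the scheme discussed above. It is not kernel code: the model_* names and
the fixed CPU count are hypothetical stand-ins, and it follows the quoted
handle_percpu_irq_nostat() variant that still bumps the per-IRQ count.
The top-level interrupt goes through the "nostat" flow and the control
message interrupt through the normal flow, so the sum that feeds
/proc/stat grows by one per event rather than two.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for struct irq_desc: only the per-CPU count. */
struct model_irq_desc {
	unsigned long cnt[MODEL_NR_CPUS];	/* per-IRQ line in /proc/interrupts */
};

/* Models kstat.irqs_sum, which feeds the total shown in /proc/stat. */
unsigned long model_irqs_sum[MODEL_NR_CPUS];

/* Normal per-CPU flow: bump the per-IRQ count and the global sum. */
void model_handle_percpu_irq(struct model_irq_desc *desc, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	desc->cnt[cpu]++;
	model_irqs_sum[cpu]++;
}

/* "nostat" flow: bump only the per-IRQ count, leave the sum alone. */
void model_handle_percpu_irq_nostat(struct model_irq_desc *desc, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	desc->cnt[cpu]++;
}

/* One event: the top-level IRQ is demultiplexed to one child IRQ. */
void model_deliver(struct model_irq_desc *top, struct model_irq_desc *child,
		   int cpu)
{
	model_handle_percpu_irq_nostat(top, cpu);
	model_handle_percpu_irq(child, cpu);
}

/* Total interrupt count as /proc/stat would report it in this model. */
unsigned long model_total_irqs(void)
{
	unsigned long total = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		total += model_irqs_sum[cpu];
	return total;
}
```

Calling model_deliver() N times raises model_total_irqs() by N, while
both descriptors still record N events each. Dropping the increment in
the "nostat" flow and hiding the IRQ would model the IRQ_HIDDEN option.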

Thanks,

Michael


* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Simon Horman @ 2024-06-06 16:33 UTC
  To: Shradha Gupta
  Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
	Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
	Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
	Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Leon Romanovsky, Jason Gunthorpe, Long Li,
	Shradha Gupta
In-Reply-To: <20240605083906.GA15889@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Wed, Jun 05, 2024 at 01:39:06AM -0700, Shradha Gupta wrote:
> On Tue, Jun 04, 2024 at 10:33:49AM +0100, Simon Horman wrote:
> > On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > > Allow variable size indirection table allocation in MANA instead
> > > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > > indirection table is allocated dynamically.
> > > 
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > 
> > ...
> > 
> > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > 
> > ...
> > 
> > > @@ -2344,11 +2352,33 @@ static int mana_create_vport(struct mana_port_context *apc,
> > >  	return mana_create_txq(apc, net);
> > >  }
> > >  
> > > +static int mana_rss_table_alloc(struct mana_port_context *apc)
> > > +{
> > > +	if (!apc->indir_table_sz) {
> > > +		netdev_err(apc->ndev,
> > > +			   "Indirection table size not set for vPort %d\n",
> > > +			   apc->port_idx);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	apc->indir_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
> > > +	if (!apc->indir_table)
> > > +		return -ENOMEM;
> > > +
> > > +	apc->rxobj_table = kcalloc(apc->indir_table_sz, sizeof(mana_handle_t), GFP_KERNEL);
> > > +	if (!apc->rxobj_table) {
> > > +		kfree(apc->indir_table);
> > 
> > Hi, Shradha
> > 
> > Perhaps I am on the wrong track here, but I have some concerns
> > about clean-up paths.
> > 
> > > Firstly, I think that apc->indir_table should be set to NULL here for
> > > consistency with other clean-up paths. Or alternatively, fields of apc
> > > should not be set to NULL elsewhere after being freed.
> 
> Hi Simon,
> 
> Thanks for the comments. This makes sense. I am planning to consistently
> remove the NULL assignments from other places too, as per Leon's comments.

Great!

> > In looking into this I noticed that mana_probe() does not call
> > mana_remove() or return an error in the cases where mana_probe_port()
> > or mana_attach() fail unless add_adev also fails. If so, is that
> > intentional?
> 
> Right, so most calls like mana_probe_port() and mana_attach() clean up after
> themselves in the code if there is any error. So, not having to call
> mana_remove() in these cases in mana_probe() is intentional. But I do
> agree that an error is returned in mana_probe() only if add_adev also
> fails. I'll fix that too in the next version.

I'm not entirely sure, but perhaps that is a candidate for a separate patch.

> > 
> > In any case, I would suggest as a follow-up, arranging things so that
> > when an error occurs in a function, anything that was allocated is
> > unwound before returning an error.
> > 
> > > I think this would make allocation/deallocation easier to reason about.
> > And I suspect it would avoid both the need for fields of structures to
> > be zeroed after being freed, and the need to call mana_remove() from
> > mana_probe().
> 
> Agreed
> > 
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static void mana_rss_table_init(struct mana_port_context *apc)
> > >  {
> > >  	int i;
> > >  
> > > -	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
> > > +	for (i = 0; i < apc->indir_table_sz; i++)
> > >  		apc->indir_table[i] =
> > >  			ethtool_rxfh_indir_default(i, apc->num_queues);
> > >  }
> > 
> > ...
> > 
> > > @@ -2739,11 +2772,17 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> > >  	err = register_netdev(ndev);
> > >  	if (err) {
> > >  		netdev_err(ndev, "Unable to register netdev.\n");
> > > -		goto reset_apc;
> > > +		goto free_indir;
> > >  	}
> > >  
> > >  	return 0;
> > >  
> > > +free_indir:
> > > +	apc->indir_table_sz = 0;
> > > +	kfree(apc->indir_table);
> > > +	apc->indir_table = NULL;
> > > +	kfree(apc->rxobj_table);
> > > +	apc->rxobj_table = NULL;
> > >  reset_apc:
> > >  	kfree(apc->rxqs);
> > >  	apc->rxqs = NULL;
> > 
> > nit: Not strictly related to this patch, but the reset_apc code should
> >      probably be a call to mana_cleanup_port_context() as it is the dual of
> >      mana_init_port_context() which is called earlier in mana_probe_port()
> 
> Sure, let me do that too.

FWIW, I think it would be appropriate to put that change in a separate patch.

> > 
> > ...
> > 
> > > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > >  		}
> > >  
> > >  		unregister_netdevice(ndev);
> > > +		apc->indir_table_sz = 0;
> > > +		kfree(apc->indir_table);
> > > +		apc->indir_table = NULL;
> > > +		kfree(apc->rxobj_table);
> > > +		apc->rxobj_table = NULL;
> > 
> > The code to free and zero indir_table_sz and indir_table appears twice
> > in this patch. Perhaps a helper to do this, which would be the dual
> > of mana_rss_table_alloc is in order.
> Makes sense, will change this too.

Thanks.
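
For illustration, the alloc/free pairing suggested above could look
roughly like the following user-space sketch. The names struct port_ctx,
rss_table_alloc() and rss_table_free() are hypothetical, calloc()/free()
stand in for kcalloc()/kfree(), and uint64_t is an assumed stand-in for
mana_handle_t. The allocator unwinds its own partial allocation on
failure, so callers only need the free helper after a successful alloc.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for struct mana_port_context. */
struct port_ctx {
	unsigned int indir_table_sz;
	uint32_t *indir_table;
	uint64_t *rxobj_table;	/* assumed element type for this sketch */
};

/* Allocate both tables; on failure nothing stays allocated. */
int rss_table_alloc(struct port_ctx *apc)
{
	if (!apc->indir_table_sz)
		return -EINVAL;

	apc->indir_table = calloc(apc->indir_table_sz,
				  sizeof(*apc->indir_table));
	if (!apc->indir_table)
		return -ENOMEM;

	apc->rxobj_table = calloc(apc->indir_table_sz,
				  sizeof(*apc->rxobj_table));
	if (!apc->rxobj_table) {
		/* Unwind the first allocation before reporting the error. */
		free(apc->indir_table);
		return -ENOMEM;
	}

	return 0;
}

/* Dual of rss_table_alloc(): release both tables in one place. */
void rss_table_free(struct port_ctx *apc)
{
	free(apc->indir_table);
	free(apc->rxobj_table);
}
```

With a single free helper, the error path in the probe function and the
teardown in the remove function can share one call instead of repeating
the same statements.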


* [PATCH v1 0/3] mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE
From: David Hildenbrand @ 2024-06-07  9:09 UTC
  To: linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov

This can be considered a long-overdue follow-up to some parts of [1].
The patches are based on [2], which is not strictly required -- it just
makes it clearer why we can use adjust_managed_page_count() for memory
hotplug without going into details about highmem.

We stop initializing pages with PageReserved() in memory hotplug code --
except when dealing with ZONE_DEVICE for now. Instead, we use
PageOffline(): all pages are initialized to PageOffline() when onlining a
memory section, and only the ones actually getting exposed to the
system/page allocator will get PageOffline cleared.
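
As a rough user-space model of that lifecycle (a sketch only; the
model_* types and functions are hypothetical and none of this is the
actual mm code): every page of a newly onlined section starts out
offline with a refcount of 1, and only pages the driver chooses to
expose are handed to the allocator and counted as managed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct zone. */
struct model_page {
	bool offline;	/* models PageOffline() */
	int refcount;
};

struct model_zone {
	unsigned long managed_pages;
};

/* Models memmap initialization on the hotplug path. */
void model_init_section(struct model_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		pages[i].offline = true;
		pages[i].refcount = 1;
	}
}

/* Models handing one page to the allocator. */
void model_online_page(struct model_zone *zone, struct model_page *page)
{
	assert(page->offline);
	page->offline = false;
	page->refcount = 0;
	zone->managed_pages++;
}

/* Online a section; expose[i] says whether page i goes to the allocator. */
void model_online_section(struct model_zone *zone, struct model_page *pages,
			  const bool *expose, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (expose[i])
			model_online_page(zone, &pages[i]);
		/* Otherwise the page stays offline with refcount 1. */
	}
}
```

Pages that are never exposed keep their offline marking, which is what
lets later code tell them apart from pages the allocator owns.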

This way, we enlighten memory hotplug more about PageOffline() pages and
can clean up some hacks we have in virtio-mem code.

What about ZONE_DEVICE? PageOffline() is wrong, but we might just stop
using PageReserved() for them later by simply checking for
is_zone_device_page() at suitable places. That will be a separate patch
set / proposal.

This primarily affects virtio-mem, HV-balloon and XEN balloon. I only
briefly tested with virtio-mem, which benefits most from these cleanups.

[1] https://lore.kernel.org/all/20191024120938.11237-1-david@redhat.com/
[2] https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>

David Hildenbrand (3):
  mm: pass meminit_context to __free_pages_core()
  mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with
    PageOffline() instead of PageReserved()
  mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline()
    pages when offlining

 drivers/hv/hv_balloon.c        |  5 ++--
 drivers/virtio/virtio_mem.c    | 29 +++++++++---------
 drivers/xen/balloon.c          |  9 ++++--
 include/linux/memory_hotplug.h |  4 +--
 include/linux/page-flags.h     | 20 +++++++------
 mm/internal.h                  |  3 +-
 mm/kmsan/init.c                |  2 +-
 mm/memory_hotplug.c            | 31 +++++++++----------
 mm/mm_init.c                   | 14 ++++++---
 mm/page_alloc.c                | 55 +++++++++++++++++++++++++++-------
 10 files changed, 108 insertions(+), 64 deletions(-)

base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
prerequisite-patch-id: ca280eafd2732d7912e0c5249dc0df9ecbef19ca
prerequisite-patch-id: 8f43ebc81fdf7b9b665b57614e9e569535094758
-- 
2.45.1


* [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-07  9:09 UTC
  To: linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-1-david@redhat.com>

In preparation for further changes, let's teach __free_pages_core()
about the differences of memory hotplug handling.

Move the memory hotplug specific handling from generic_online_page() to
__free_pages_core(), use adjust_managed_page_count() on the memory
hotplug path, and spell out why memory freed via memblock
cannot currently use adjust_managed_page_count().
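
The difference between the two paths can be modelled in user space as
follows. This is a sketch under assumptions: the model_* names are
hypothetical, and the hotplug branch assumes that
adjust_managed_page_count() updates both the zone's managed page count
and the global total, while the early branch touches only the managed
count because memblock accounted the total ahead of time.

```c
#include <assert.h>

/* Hypothetical mirror of enum meminit_context. */
enum model_meminit_context {
	MODEL_MEMINIT_EARLY,
	MODEL_MEMINIT_HOTPLUG,
};

/* Hypothetical counters standing in for zone and global accounting. */
struct model_counters {
	long managed_pages;	/* per-zone managed page count */
	long totalram_pages;	/* global total */
};

/* Models the accounting part of freeing 2^order pages to the allocator. */
void model_free_pages_core(struct model_counters *c, unsigned int order,
			   enum model_meminit_context context)
{
	long nr_pages = 1L << order;

	assert(order < 31);

	if (context == MODEL_MEMINIT_HOTPLUG) {
		/* Hotplug path: both counters move together. */
		c->managed_pages += nr_pages;
		c->totalram_pages += nr_pages;
	} else {
		/* Early path: the total was already accounted up front. */
		c->managed_pages += nr_pages;
	}
}
```

Starting from a state where the early-boot total is pre-accounted, both
paths end with the two counters in agreement, which is the invariant the
split is meant to preserve.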

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h       |  3 ++-
 mm/kmsan/init.c     |  2 +-
 mm/memory_hotplug.c |  9 +--------
 mm/mm_init.c        |  4 ++--
 mm/page_alloc.c     | 17 +++++++++++++++--
 5 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 12e95fdf61e90..3fdee779205ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
 				    int mt);
 extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
-extern void __free_pages_core(struct page *page, unsigned int order);
+extern void __free_pages_core(struct page *page, unsigned int order,
+		enum meminit_context);
 
 /*
  * This will have no effect, other than possibly generating a warning, if the
diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 3ac3b8921d36f..ca79636f858e5 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -172,7 +172,7 @@ static void do_collection(void)
 		shadow = smallstack_pop(&collect);
 		origin = smallstack_pop(&collect);
 		kmsan_setup_meta(page, shadow, origin, collect.order);
-		__free_pages_core(page, collect.order);
+		__free_pages_core(page, collect.order, MEMINIT_EARLY);
 	}
 }
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 171ad975c7cfd..27e3be75edcf7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -630,14 +630,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);
 
 void generic_online_page(struct page *page, unsigned int order)
 {
-	/*
-	 * Freeing the page with debug_pagealloc enabled will try to unmap it,
-	 * so we should map it first. This is better than introducing a special
-	 * case in page freeing fast path.
-	 */
-	debug_pagealloc_map_pages(page, 1 << order);
-	__free_pages_core(page, order);
-	totalram_pages_add(1UL << order);
+	__free_pages_core(page, order, MEMINIT_HOTPLUG);
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 019193b0d8703..feb5b6e8c8875 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1938,7 +1938,7 @@ static void __init deferred_free_range(unsigned long pfn,
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, 0);
+		__free_pages_core(page, 0, MEMINIT_EARLY);
 	}
 }
 
@@ -2513,7 +2513,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 		}
 	}
 
-	__free_pages_core(page, order);
+	__free_pages_core(page, order, MEMINIT_EARLY);
 }
 
 DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2224965ada468..e0c8a8354be36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1214,7 +1214,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	__count_vm_events(PGFREE, 1 << order);
 }
 
-void __free_pages_core(struct page *page, unsigned int order)
+void __free_pages_core(struct page *page, unsigned int order,
+		enum meminit_context context)
 {
 	unsigned int nr_pages = 1 << order;
 	struct page *p = page;
@@ -1234,7 +1235,19 @@ void __free_pages_core(struct page *page, unsigned int order)
 	__ClearPageReserved(p);
 	set_page_count(p, 0);
 
-	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
+	    unlikely(context == MEMINIT_HOTPLUG)) {
+		/*
+		 * Freeing the page with debug_pagealloc enabled will try to
+		 * unmap it; some archs don't like double-unmappings, so
+		 * map it first.
+		 */
+		debug_pagealloc_map_pages(page, nr_pages);
+		adjust_managed_page_count(page, nr_pages);
+	} else {
+		/* memblock adjusts totalram_pages() ahead of time. */
+		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
+	}
 
 	if (page_contains_unaccepted(page, order)) {
 		if (order == MAX_PAGE_ORDER && __free_unaccepted(page))
-- 
2.45.1



* [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: David Hildenbrand @ 2024-06-07  9:09 UTC
  To: linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-1-david@redhat.com>

We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1. In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.

We want to avoid PG_reserved where possible and move to typed pages
instead. Further, we want to enlighten memory offlining code about
PG_offline: offline pages in an online memory section. One example is
handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE, as
currently only implemented by virtio-mem.

With this change, we can now get non-PageReserved() pages in the XEN
balloon list. From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partially online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/hv/hv_balloon.c     |  5 ++---
 drivers/virtio/virtio_mem.c | 18 ++++++++++++------
 drivers/xen/balloon.c       |  9 +++++++--
 include/linux/page-flags.h  | 12 +++++-------
 mm/memory_hotplug.c         | 16 ++++++++++------
 mm/mm_init.c                | 10 ++++++++--
 mm/page_alloc.c             | 32 +++++++++++++++++++++++---------
 7 files changed, 67 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index e000fa3b9f978..c1be38edd8361 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -693,9 +693,8 @@ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
 		return;
-	}
-	if (PageOffline(pg))
-		__ClearPageOffline(pg);
+	} else if (!PageOffline(pg))
+		return;
 
 	/* This frame is currently backed; online the page. */
 	generic_online_page(pg, 0);
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a3857bacc8446..b90df29621c81 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1146,12 +1146,16 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__SetPageOffline(page);
-		if (!onlined) {
+		if (!onlined)
+			/*
+			 * Pages that have not been onlined yet were initialized
+			 * to PageOffline(). Remember that we have to route them
+			 * through generic_online_page().
+			 */
 			SetPageDirty(page);
-			/* FIXME: remove after cleanups */
-			ClearPageReserved(page);
-		}
+		else
+			__SetPageOffline(page);
+		VM_WARN_ON_ONCE(!PageOffline(page));
 	}
 	page_offline_end();
 }
@@ -1166,9 +1170,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__ClearPageOffline(page);
 		if (!onlined)
+			/* generic_online_page() will clear PageOffline(). */
 			ClearPageDirty(page);
+		else
+			__ClearPageOffline(page);
 	}
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index aaf2514fcfa46..528395133b4f8 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -146,7 +146,8 @@ static DECLARE_WAIT_QUEUE_HEAD(balloon_wq);
 /* balloon_append: add the given page to the balloon. */
 static void balloon_append(struct page *page)
 {
-	__SetPageOffline(page);
+	if (!PageOffline(page))
+		__SetPageOffline(page);
 
 	/* Lowmem is re-populated first, so highmem pages go at list tail. */
 	if (PageHighMem(page)) {
@@ -412,7 +413,11 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
 
 		xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
 
-		/* Relinquish the page back to the allocator. */
+		/*
+		 * Relinquish the page back to the allocator. Note that
+		 * some pages, including ones added via xen_online_page(), might
+		 * not be marked reserved; free_reserved_page() will handle that.
+		 */
 		free_reserved_page(page);
 	}
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f04fea86324d9..e0362ce7fc109 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,16 +30,11 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
  * - Device memory (e.g. PMEM, DAX, HMM)
@@ -1021,6 +1016,10 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
  *
+ * When a memory block gets onlined, all pages are initialized with a
+ * refcount of 1 and PageOffline(). generic_online_page() will
+ * take care of clearing PageOffline().
+ *
  * If a driver wants to allow to offline unmovable PageOffline() pages without
  * putting them back to the buddy, it can do so via the memory notifier by
  * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
@@ -1028,8 +1027,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * pages (now with a reference count of zero) are treated like free pages,
  * allowing the containing memory block to get offlined. A driver that
  * relies on this feature is aware that re-onlining the memory block will
- * require to re-set the pages PageOffline() and not giving them to the
- * buddy via online_page_callback_t.
+ * require not giving them to the buddy via generic_online_page().
  *
  * There are drivers that mark a page PageOffline() and expect there won't be
  * any further access to page content. PFN walkers that read content of random
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27e3be75edcf7..0254059efcbe1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * call, all affected pages are PageOffline().
  *
  * All aligned pageblocks are initialized to the specified migratetype
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
@@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
 
-	for (i = 0; i < nr_pages; i++)
-		SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pfn_to_page(pfn + i);
+
+		__ClearPageOffline(page);
+		SetPageVmemmapSelfHosted(page);
+	}
 
 	/*
 	 * It might be that the vmemmap_pages fully span sections. If that is
@@ -1959,9 +1963,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	 * Don't allow to offline memory blocks that contain holes.
 	 * Consequently, memory blocks with holes can never get onlined
 	 * via the hotplug path - online_pages() - as hotplugged memory has
-	 * no holes. This way, we e.g., don't have to worry about marking
-	 * memory holes PG_reserved, don't need pfn_valid() checks, and can
-	 * avoid using walk_system_ram_range() later.
+	 * no holes. This way, we don't have to worry about memory holes,
+	 * don't need pfn_valid() checks, and can avoid using
+	 * walk_system_ram_range() later.
 	 */
 	walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
 			      count_system_ram_pages_cb);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index feb5b6e8c8875..c066c1c474837 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -892,8 +892,14 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMINIT_HOTPLUG)
-			__SetPageReserved(page);
+		if (context == MEMINIT_HOTPLUG) {
+#ifdef CONFIG_ZONE_DEVICE
+			if (zone == ZONE_DEVICE)
+				__SetPageReserved(page);
+			else
+#endif
+				__SetPageOffline(page);
+		}
 
 		/*
 		 * Usually, we want to mark the pageblock MIGRATE_MOVABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0c8a8354be36..039bc52cc9091 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1225,18 +1225,23 @@ void __free_pages_core(struct page *page, unsigned int order,
 	 * When initializing the memmap, __init_single_page() sets the refcount
 	 * of all pages to 1 ("allocated"/"not free"). We have to set the
 	 * refcount of all involved pages to 0.
+	 *
+	 * Note that hotplugged memory pages are initialized to PageOffline().
+	 * Pages freed from memblock might be marked as reserved.
 	 */
-	prefetchw(p);
-	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
-		prefetchw(p + 1);
-		__ClearPageReserved(p);
-		set_page_count(p, 0);
-	}
-	__ClearPageReserved(p);
-	set_page_count(p, 0);
-
 	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
 	    unlikely(context == MEMINIT_HOTPLUG)) {
+		prefetchw(p);
+		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+			prefetchw(p + 1);
+			VM_WARN_ON_ONCE(PageReserved(p));
+			__ClearPageOffline(p);
+			set_page_count(p, 0);
+		}
+		VM_WARN_ON_ONCE(PageReserved(p));
+		__ClearPageOffline(p);
+		set_page_count(p, 0);
+
 		/*
 		 * Freeing the page with debug_pagealloc enabled will try to
 		 * unmap it; some archs don't like double-unmappings, so
@@ -1245,6 +1250,15 @@ void __free_pages_core(struct page *page, unsigned int order,
 		debug_pagealloc_map_pages(page, nr_pages);
 		adjust_managed_page_count(page, nr_pages);
 	} else {
+		prefetchw(p);
+		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+			prefetchw(p + 1);
+			__ClearPageReserved(p);
+			set_page_count(p, 0);
+		}
+		__ClearPageReserved(p);
+		set_page_count(p, 0);
+
 		/* memblock adjusts totalram_pages() ahead of time. */
 		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 	}
-- 
2.45.1



* [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining
From: David Hildenbrand @ 2024-06-07  9:09 UTC
  To: linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Alexander Potapenko, Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-1-david@redhat.com>

We currently have a hack for virtio-mem in place to handle memory
offlining with PageOffline pages for which we already adjusted the
managed page count.

Let's enlighten memory offlining code so we can get rid of that hack,
and document the situation.
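
A user-space sketch of the idea, with hypothetical off_page and model_*
names. It assumes that the offlining walk reports how many pages in the
range were still managed, and that only this number is subtracted from
the managed page count; the hunk that would confirm the exact return
semantics is not part of the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct off_page {
	bool offline;	/* models PageOffline() */
	int refcount;
};

/* Walk an isolated range and count the pages that were still managed. */
unsigned long model_offline_isolated_pages(const struct off_page *pages,
					   size_t nr)
{
	unsigned long managed = 0;

	for (size_t i = 0; i < nr; i++) {
		if (pages[i].offline) {
			/* The owner dropped its reference before offlining. */
			assert(pages[i].refcount == 0);
			continue;
		}
		managed++;
	}
	return managed;
}

/* Subtract only the managed pages; offline pages were never counted. */
void model_offline_pages(unsigned long *managed_pages,
			 const struct off_page *pages, size_t nr)
{
	unsigned long managed = model_offline_isolated_pages(pages, nr);

	assert(*managed_pages >= managed);
	*managed_pages -= managed;
}
```

In this model a driver no longer needs to add its unplugged pages back
to the managed count just so that offlining can subtract them again.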

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/virtio/virtio_mem.c    | 11 ++---------
 include/linux/memory_hotplug.h |  4 ++--
 include/linux/page-flags.h     |  8 ++++++--
 mm/memory_hotplug.c            |  6 +++---
 mm/page_alloc.c                | 12 ++++++++++--
 5 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index b90df29621c81..b0b8714415783 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1269,12 +1269,6 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
 	struct page *page;
 	unsigned long i;
 
-	/*
-	 * Drop our reference to the pages so the memory can get offlined
-	 * and add the unplugged pages to the managed page counters (so
-	 * offlining code can correctly subtract them again).
-	 */
-	adjust_managed_page_count(pfn_to_page(pfn), nr_pages);
 	/* Drop our reference to the pages so the memory can get offlined. */
 	for (i = 0; i < nr_pages; i++) {
 		page = pfn_to_page(pfn + i);
@@ -1293,10 +1287,9 @@ static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
 	unsigned long i;
 
 	/*
-	 * Get the reference we dropped when going offline and subtract the
-	 * unplugged pages from the managed page counters.
+	 * Get the reference again that we dropped via page_ref_dec_and_test()
+	 * when going offline.
 	 */
-	adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
 	for (i = 0; i < nr_pages; i++)
 		page_ref_inc(pfn_to_page(pfn + i));
 }
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7a9ff464608d7..ebe876930e782 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -175,8 +175,8 @@ extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group);
-extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn);
+extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
+		unsigned long end_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e0362ce7fc109..0876aca0833e7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1024,11 +1024,15 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * putting them back to the buddy, it can do so via the memory notifier by
  * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
  * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
- * pages (now with a reference count of zero) are treated like free pages,
- * allowing the containing memory block to get offlined. A driver that
+ * pages (now with a reference count of zero) are treated like free (unmanaged)
+ * pages, allowing the containing memory block to get offlined. A driver that
  * relies on this feature is aware that re-onlining the memory block will
  * require not giving them to the buddy via generic_online_page().
  *
+ * Memory offlining code will not adjust the managed page count for any
+ * PageOffline() pages, treating them like they were never exposed to the
+ * buddy using generic_online_page().
+ *
  * There are drivers that mark a page PageOffline() and expect there won't be
  * any further access to page content. PFN walkers that read content of random
  * pages should check PageOffline() and synchronize with such drivers using
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0254059efcbe1..965707a02556f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1941,7 +1941,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			struct zone *zone, struct memory_group *group)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, system_ram_pages = 0;
+	unsigned long pfn, managed_pages, system_ram_pages = 0;
 	const int node = zone_to_nid(zone);
 	unsigned long flags;
 	struct memory_notify arg;
@@ -2062,7 +2062,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn);
+	managed_pages = __offline_isolated_pages(start_pfn, end_pfn);
 	pr_debug("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -2078,7 +2078,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -managed_pages);
 	adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);
 
 	/* reinitialise watermarks and update pcp limits */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 039bc52cc9091..809bc4a816e85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6745,14 +6745,19 @@ void zone_pcp_reset(struct zone *zone)
 /*
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
+ *
+ * Returns the number of managed (non-PageOffline()) pages in the range: the
+ * number of pages for which memory offlining code must adjust managed page
+ * counters using adjust_managed_page_count().
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+unsigned long __offline_isolated_pages(unsigned long start_pfn,
+		unsigned long end_pfn)
 {
+	unsigned long already_offline = 0, flags;
 	unsigned long pfn = start_pfn;
 	struct page *page;
 	struct zone *zone;
 	unsigned int order;
-	unsigned long flags;
 
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
@@ -6774,6 +6779,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		if (PageOffline(page)) {
 			BUG_ON(page_count(page));
 			BUG_ON(PageBuddy(page));
+			already_offline++;
 			pfn++;
 			continue;
 		}
@@ -6786,6 +6792,8 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return end_pfn - start_pfn - already_offline;
 }
 #endif
 
-- 
2.45.1
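
The accounting rule this patch introduces can be sketched in userspace as
follows. This is only an illustrative model, not the kernel code: struct
toy_page and toy_count_managed() are hypothetical names, and the sketch walks
the range one page at a time, whereas __offline_isolated_pages() also skips
over free buddy pages by order.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the state this sketch needs. */
struct toy_page {
	bool offline;		/* models PageOffline() */
	unsigned int refcount;	/* models page_count() */
};

/*
 * Count the pages in [start_pfn, end_pfn) that are still "managed", i.e.
 * not marked offline. This mirrors the value the patch makes
 * __offline_isolated_pages() return:
 * end_pfn - start_pfn - already_offline.
 */
static unsigned long toy_count_managed(const struct toy_page *pages,
				       unsigned long start_pfn,
				       unsigned long end_pfn)
{
	unsigned long already_offline = 0;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (pages[pfn].offline) {
			/* The driver must already have dropped its reference. */
			assert(pages[pfn].refcount == 0);
			already_offline++;
		}
	}

	return end_pfn - start_pfn - already_offline;
}
```

With 8 pages of which 2 are marked offline, the helper returns 6, which is
the amount the offlining path would then subtract from the managed page
count instead of the full 8.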


^ permalink raw reply related

* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Kirill A. Shutemov @ 2024-06-07 15:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <20240603083930.GNZl2BQk2lQ8WtcE4o@fat_crate.local>

On Mon, Jun 03, 2024 at 10:39:30AM +0200, Borislav Petkov wrote:
> > +/*
> > + * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
> > + * the same place as in the kernel page tables. asm_acpi_mp_play_dead() switches
> > + * to the identity mapping and the function has be present at the same spot in
> > + * the virtual address space before and after switching page tables.
> > + */
> > +static int __init init_transition_pgtable(pgd_t *pgd)
> 
> This looks like a generic helper which should be in set_memory.c. And
> looking at that file, there's populate_pgd() which does pretty much the
> same thing, if I squint real hard.
> 
> Let's tone down the duplication.

Okay, there is a function called kernel_map_pages_in_pgd() in set_memory.c
that does what we need here.

I tried to use it, but encountered a few issues:

- The code in set_memory.c allocates memory using the buddy allocator,
  which is not yet ready. We can work around this limitation by delaying
  the initialization of offlining until later, using a separate
  early_initcall();

- I noticed a complaint that the allocation is being done from an atomic
  context: a spinlock called cpa_lock is taken when populate_pgd()
  allocates memory.

  I am not sure why this was not noticed before. kernel_map_pages_in_pgd()
  has only been used in EFI mapping initialization so far, so maybe it is
  somehow special, I don't know.

  I was able to address this issue by switching cpa_lock to a mutex.
  However, this solution only works if the set_memory interfaces are
  never called from an atomic context. I need to verify that this is
  the case.

- The function __flush_tlb_all() in kernel_(un)map_pages_in_pgd() must be
  called with preemption disabled. Once again, I am unsure why this has
  not caused issues in the EFI case.

- I discovered a bug in kernel_ident_mapping_free() when it is used on a
  machine with 5-level paging. I will submit a proper patch to fix this
  issue.

The fixup is below.

Any comments?

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 6cfe762be28b..fbbfe78f7f27 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -59,82 +59,55 @@ static void acpi_mp_cpu_die(unsigned int cpu)
 		pr_err("Failed to hand over CPU %d to BIOS\n", cpu);
 }
 
+static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
+{
+	cpu_hotplug_disable_offlining();
+
+	/*
+	 * ACPI MADT doesn't allow to offline a CPU after it was onlined. This
+	 * limits kexec: the second kernel won't be able to use more than one CPU.
+	 *
+	 * To prevent a kexec kernel from onlining secondary CPUs invalidate the
+	 * mailbox address in the ACPI MADT wakeup structure which prevents a
+	 * kexec kernel to use it.
+	 *
+	 * This is safe as the booting kernel has the mailbox address cached
+	 * already and acpi_wakeup_cpu() uses the cached value to bring up the
+	 * secondary CPUs.
+	 *
+	 * Note: This is a Linux specific convention and not covered by the
+	 *       ACPI specification.
+	 */
+	mp_wake->mailbox_address = 0;
+}
+
 /* The argument is required to match type of x86_mapping_info::alloc_pgt_page */
 static void __init *alloc_pgt_page(void *dummy)
 {
-	return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+	return (void *)get_zeroed_page(GFP_KERNEL);
 }
 
 static void __init free_pgt_page(void *pgt, void *dummy)
 {
-	return memblock_free(pgt, PAGE_SIZE);
+	return free_page((unsigned long)pgt);
 }
 
-/*
- * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
- * the same place as in the kernel page tables. asm_acpi_mp_play_dead() switches
- * to the identity mapping and the function has be present at the same spot in
- * the virtual address space before and after switching page tables.
- */
-static int __init init_transition_pgtable(pgd_t *pgd)
-{
-	pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
-	unsigned long vaddr, paddr;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-
-	vaddr = (unsigned long)asm_acpi_mp_play_dead;
-	pgd += pgd_index(vaddr);
-	if (!pgd_present(*pgd)) {
-		p4d = (p4d_t *)alloc_pgt_page(NULL);
-		if (!p4d)
-			return -ENOMEM;
-		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
-	}
-	p4d = p4d_offset(pgd, vaddr);
-	if (!p4d_present(*p4d)) {
-		pud = (pud_t *)alloc_pgt_page(NULL);
-		if (!pud)
-			return -ENOMEM;
-		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
-	}
-	pud = pud_offset(p4d, vaddr);
-	if (!pud_present(*pud)) {
-		pmd = (pmd_t *)alloc_pgt_page(NULL);
-		if (!pmd)
-			return -ENOMEM;
-		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
-	}
-	pmd = pmd_offset(pud, vaddr);
-	if (!pmd_present(*pmd)) {
-		pte = (pte_t *)alloc_pgt_page(NULL);
-		if (!pte)
-			return -ENOMEM;
-		set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
-	}
-	pte = pte_offset_kernel(pmd, vaddr);
-
-	paddr = __pa(vaddr);
-	set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
-
-	return 0;
-}
-
-static int __init acpi_mp_setup_reset(u64 reset_vector)
+static int __init acpi_mp_setup_reset(union acpi_subtable_headers *header,
+			      const unsigned long end)
 {
+	struct acpi_madt_multiproc_wakeup *mp_wake;
 	struct x86_mapping_info info = {
 		.alloc_pgt_page = alloc_pgt_page,
 		.free_pgt_page	= free_pgt_page,
 		.page_flag      = __PAGE_KERNEL_LARGE_EXEC,
-		.kernpg_flag    = _KERNPG_TABLE_NOENC,
+		.kernpg_flag    = _KERNPG_TABLE,
 	};
+	unsigned long vaddr, pfn;
 	pgd_t *pgd;
 
 	pgd = alloc_pgt_page(NULL);
 	if (!pgd)
-		return -ENOMEM;
+		goto err;
 
 	for (int i = 0; i < nr_pfn_mapped; i++) {
 		unsigned long mstart, mend;
@@ -143,30 +116,45 @@ static int __init acpi_mp_setup_reset(u64 reset_vector)
 		mend   = pfn_mapped[i].end << PAGE_SHIFT;
 		if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
 			kernel_ident_mapping_free(&info, pgd);
-			return -ENOMEM;
+			goto err;
 		}
 	}
 
 	if (kernel_ident_mapping_init(&info, pgd,
-				      PAGE_ALIGN_DOWN(reset_vector),
-				      PAGE_ALIGN(reset_vector + 1))) {
+				      PAGE_ALIGN_DOWN(acpi_mp_reset_vector_paddr),
+				      PAGE_ALIGN(acpi_mp_reset_vector_paddr + 1))) {
 		kernel_ident_mapping_free(&info, pgd);
-		return -ENOMEM;
+		goto err;
 	}
 
-	if (init_transition_pgtable(pgd)) {
+	/*
+	 * Make sure asm_acpi_mp_play_dead() is present in the identity mapping
+	 * at the same place as in the kernel page tables.
+	 *
+	 * asm_acpi_mp_play_dead() switches to the identity mapping and the
+	 * function has be present at the same spot in the virtual address space
+	 * before and after switching page tables.
+	 */
+	vaddr = (unsigned long)asm_acpi_mp_play_dead;
+	pfn = __pa(vaddr) >> PAGE_SHIFT;
+	if (kernel_map_pages_in_pgd(pgd, pfn, vaddr, 1, _KERNPG_TABLE)) {
 		kernel_ident_mapping_free(&info, pgd);
-		return -ENOMEM;
+		goto err;
 	}
 
 	smp_ops.play_dead = acpi_mp_play_dead;
 	smp_ops.stop_this_cpu = acpi_mp_stop_this_cpu;
 	smp_ops.cpu_die = acpi_mp_cpu_die;
 
-	acpi_mp_reset_vector_paddr = reset_vector;
 	acpi_mp_pgd = __pa(pgd);
 
 	return 0;
+err:
+	pr_warn("Failed to setup MADT reset vector\n");
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	acpi_mp_disable_offlining(mp_wake);
+	return -ENOMEM;
+
 }
 
 static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
@@ -226,28 +214,6 @@ static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
 	return 0;
 }
 
-static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
-{
-	cpu_hotplug_disable_offlining();
-
-	/*
-	 * ACPI MADT doesn't allow to offline a CPU after it was onlined. This
-	 * limits kexec: the second kernel won't be able to use more than one CPU.
-	 *
-	 * To prevent a kexec kernel from onlining secondary CPUs invalidate the
-	 * mailbox address in the ACPI MADT wakeup structure which prevents a
-	 * kexec kernel to use it.
-	 *
-	 * This is safe as the booting kernel has the mailbox address cached
-	 * already and acpi_wakeup_cpu() uses the cached value to bring up the
-	 * secondary CPUs.
-	 *
-	 * Note: This is a Linux specific convention and not covered by the
-	 *       ACPI specification.
-	 */
-	mp_wake->mailbox_address = 0;
-}
-
 int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 			      const unsigned long end)
 {
@@ -274,10 +240,7 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
 	    mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
-		if (acpi_mp_setup_reset(mp_wake->reset_vector)) {
-			pr_warn("Failed to setup MADT reset vector\n");
-			acpi_mp_disable_offlining(mp_wake);
-		}
+		acpi_mp_reset_vector_paddr = mp_wake->reset_vector;
 	} else {
 		/*
 		 * CPU offlining requires version 1 of the ACPI MADT wakeup
@@ -290,3 +253,13 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
 
 	return 0;
 }
+
+static int __init acpi_mp_offline_init(void)
+{
+	if (!acpi_mp_reset_vector_paddr)
+		return 0;
+
+	return acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+				     acpi_mp_setup_reset, 1);
+}
+early_initcall(acpi_mp_offline_init);
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 3996af7b4abf..c45127265f2f 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -60,7 +60,7 @@ static void free_p4d(struct x86_mapping_info *info, pgd_t *pgd)
 	}
 
 	if (pgtable_l5_enabled())
-		info->free_pgt_page(pgd, info->context);
+		info->free_pgt_page(p4d, info->context);
 }
 
 void kernel_ident_mapping_free(struct x86_mapping_info *info, pgd_t *pgd)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 443a97e515c0..72715674f492 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -69,7 +69,7 @@ static const int cpa_warn_level = CPA_PROTECT;
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static DEFINE_SPINLOCK(cpa_lock);
+static DEFINE_MUTEX(cpa_lock);
 
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
@@ -1186,10 +1186,10 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
 	struct page *base;
 
 	if (!debug_pagealloc_enabled())
-		spin_unlock(&cpa_lock);
+		mutex_unlock(&cpa_lock);
 	base = alloc_pages(GFP_KERNEL, 0);
 	if (!debug_pagealloc_enabled())
-		spin_lock(&cpa_lock);
+		mutex_lock(&cpa_lock);
 	if (!base)
 		return -ENOMEM;
 
@@ -1804,10 +1804,10 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int primary)
 			cpa->numpages = 1;
 
 		if (!debug_pagealloc_enabled())
-			spin_lock(&cpa_lock);
+			mutex_lock(&cpa_lock);
 		ret = __change_page_attr(cpa, primary);
 		if (!debug_pagealloc_enabled())
-			spin_unlock(&cpa_lock);
+			mutex_unlock(&cpa_lock);
 		if (ret)
 			goto out;
 
@@ -2516,7 +2516,9 @@ int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
 	cpa.mask_set = __pgprot(_PAGE_PRESENT | page_flags);
 
 	retval = __change_page_attr_set_clr(&cpa, 1);
+	preempt_disable();
 	__flush_tlb_all();
+	preempt_enable();
 
 out:
 	return retval;
@@ -2551,7 +2553,9 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
 	WARN_ONCE(num_online_cpus() > 1, "Don't call after initializing SMP");
 
 	retval = __change_page_attr_set_clr(&cpa, 1);
+	preempt_disable();
 	__flush_tlb_all();
+	preempt_enable();
 
 	return retval;
 }
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
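
The kernel_ident_mapping_free() fix in the ident_map.c hunk above can be
illustrated with a small userspace model. This is a sketch under simplifying
assumptions, not the kernel implementation: struct toy_table,
struct toy_mapping_info and the toy_*() helpers are hypothetical, and the
model has only two table levels. The point it shows is that the free callback
must be handed the lower-level table, not the upper-level one the function
was called with.

```c
#include <stdlib.h>

#define TOY_ENTRIES 4

/* Toy page-table page: each slot is NULL or points to a lower-level table. */
struct toy_table {
	struct toy_table *slot[TOY_ENTRIES];
};

/* Toy stand-in for the free callback and context of struct x86_mapping_info. */
struct toy_mapping_info {
	void (*free_pgt_page)(void *pgt, void *context);
	void *context;
};

/* Example callback: count the call in *context, then release the page. */
static void toy_count_and_free(void *pgt, void *context)
{
	(*(int *)context)++;
	free(pgt);
}

/* Free the lower-level table behind one top-level slot. */
static void toy_free_lower(struct toy_mapping_info *info,
			   struct toy_table *top, int idx)
{
	struct toy_table *lower = top->slot[idx];

	/*
	 * Release the lower-level table. Passing 'top' here instead would
	 * free the top-level page too early and leak 'lower'.
	 */
	info->free_pgt_page(lower, info->context);
	top->slot[idx] = NULL;
}

/* Free every populated lower-level table, then the top-level table. */
static void toy_mapping_free(struct toy_mapping_info *info,
			     struct toy_table *top)
{
	int i;

	for (i = 0; i < TOY_ENTRIES; i++) {
		if (top->slot[i])
			toy_free_lower(info, top, i);
	}

	info->free_pgt_page(top, info->context);
}
```

With one top-level table and two populated slots, the callback runs exactly
three times and every page is released once.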

^ permalink raw reply related

* [PATCH] Input: serio - use sizeof(*pointer) instead of sizeof(type)
From: Erick Archer @ 2024-06-07 17:04 UTC (permalink / raw)
  To: Dmitry Torokhov, Russell King, James E.J. Bottomley, Helge Deller,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
	Stephen Chandler Paul, Michal Simek, Uwe Kleine-König,
	Russell King (Oracle), Suzuki K Poulose, Krzysztof Kozlowski,
	Rob Herring, Ruan Jinjie, Ricardo B. Marliere, Greg Kroah-Hartman,
	Jiri Slaby (SUSE), Mark Brown, Yang Li, Kees Cook,
	Gustavo A. R. Silva, Justin Stitt
  Cc: Erick Archer, linux-input, linux-kernel, linux-parisc,
	linux-hyperv, linux-arm-kernel, linux-sunxi, linux-hardening

It is preferred to use sizeof(*pointer) instead of sizeof(type)
because the type of the variable can change, in which case the
former needs no update (unlike the latter). This patch has no
effect on runtime behavior.

Signed-off-by: Erick Archer <erick.archer@outlook.com>
---
 drivers/input/serio/altera_ps2.c      | 2 +-
 drivers/input/serio/ambakmi.c         | 4 ++--
 drivers/input/serio/apbps2.c          | 2 +-
 drivers/input/serio/arc_ps2.c         | 2 +-
 drivers/input/serio/ct82c710.c        | 2 +-
 drivers/input/serio/gscps2.c          | 4 ++--
 drivers/input/serio/hyperv-keyboard.c | 4 ++--
 drivers/input/serio/i8042.c           | 4 ++--
 drivers/input/serio/maceps2.c         | 2 +-
 drivers/input/serio/olpc_apsp.c       | 4 ++--
 drivers/input/serio/parkbd.c          | 2 +-
 drivers/input/serio/pcips2.c          | 4 ++--
 drivers/input/serio/ps2-gpio.c        | 4 ++--
 drivers/input/serio/ps2mult.c         | 2 +-
 drivers/input/serio/q40kbd.c          | 4 ++--
 drivers/input/serio/rpckbd.c          | 2 +-
 drivers/input/serio/sa1111ps2.c       | 4 ++--
 drivers/input/serio/serio.c           | 2 +-
 drivers/input/serio/serio_raw.c       | 4 ++--
 drivers/input/serio/serport.c         | 4 ++--
 drivers/input/serio/sun4i-ps2.c       | 4 ++--
 drivers/input/serio/userio.c          | 4 ++--
 drivers/input/serio/xilinx_ps2.c      | 4 ++--
 23 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/input/serio/altera_ps2.c b/drivers/input/serio/altera_ps2.c
index c5b634940cfc..611eb9fe2d04 100644
--- a/drivers/input/serio/altera_ps2.c
+++ b/drivers/input/serio/altera_ps2.c
@@ -100,7 +100,7 @@ static int altera_ps2_probe(struct platform_device *pdev)
 		return error;
 	}
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!serio)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/ambakmi.c b/drivers/input/serio/ambakmi.c
index 496bb7a312d2..de4b3915c37d 100644
--- a/drivers/input/serio/ambakmi.c
+++ b/drivers/input/serio/ambakmi.c
@@ -114,8 +114,8 @@ static int amba_kmi_probe(struct amba_device *dev,
 	if (ret)
 		return ret;
 
-	kmi = kzalloc(sizeof(struct amba_kmi_port), GFP_KERNEL);
-	io = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	kmi = kzalloc(sizeof(*kmi), GFP_KERNEL);
+	io = kzalloc(sizeof(*io), GFP_KERNEL);
 	if (!kmi || !io) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/drivers/input/serio/apbps2.c b/drivers/input/serio/apbps2.c
index dbbb10251520..4015e75fcb90 100644
--- a/drivers/input/serio/apbps2.c
+++ b/drivers/input/serio/apbps2.c
@@ -165,7 +165,7 @@ static int apbps2_of_probe(struct platform_device *ofdev)
 	/* Set reload register to core freq in kHz/10 */
 	iowrite32be(freq_hz / 10000, &priv->regs->reload);
 
-	priv->io = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	priv->io = kzalloc(sizeof(*priv->io), GFP_KERNEL);
 	if (!priv->io)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/arc_ps2.c b/drivers/input/serio/arc_ps2.c
index 9d8726830140..a9180a005872 100644
--- a/drivers/input/serio/arc_ps2.c
+++ b/drivers/input/serio/arc_ps2.c
@@ -155,7 +155,7 @@ static int arc_ps2_create_port(struct platform_device *pdev,
 	struct arc_ps2_port *port = &arc_ps2->port[index];
 	struct serio *io;
 
-	io = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	io = kzalloc(sizeof(*io), GFP_KERNEL);
 	if (!io)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/ct82c710.c b/drivers/input/serio/ct82c710.c
index d5c9bb3d0103..6834440b37f6 100644
--- a/drivers/input/serio/ct82c710.c
+++ b/drivers/input/serio/ct82c710.c
@@ -158,7 +158,7 @@ static int __init ct82c710_detect(void)
 
 static int ct82c710_probe(struct platform_device *dev)
 {
-	ct82c710_port = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	ct82c710_port = kzalloc(sizeof(*ct82c710_port), GFP_KERNEL);
 	if (!ct82c710_port)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/gscps2.c b/drivers/input/serio/gscps2.c
index 633c7de49d67..d94c01eb3fc9 100644
--- a/drivers/input/serio/gscps2.c
+++ b/drivers/input/serio/gscps2.c
@@ -338,8 +338,8 @@ static int __init gscps2_probe(struct parisc_device *dev)
 	if (dev->id.sversion == 0x96)
 		hpa += GSC_DINO_OFFSET;
 
-	ps2port = kzalloc(sizeof(struct gscps2port), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	ps2port = kzalloc(sizeof(*ps2port), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!ps2port || !serio) {
 		ret = -ENOMEM;
 		goto fail_nomem;
diff --git a/drivers/input/serio/hyperv-keyboard.c b/drivers/input/serio/hyperv-keyboard.c
index 31def6ce5157..31d9dacd2fd1 100644
--- a/drivers/input/serio/hyperv-keyboard.c
+++ b/drivers/input/serio/hyperv-keyboard.c
@@ -318,8 +318,8 @@ static int hv_kbd_probe(struct hv_device *hv_dev,
 	struct serio *hv_serio;
 	int error;
 
-	kbd_dev = kzalloc(sizeof(struct hv_kbd_dev), GFP_KERNEL);
-	hv_serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	kbd_dev = kzalloc(sizeof(*kbd_dev), GFP_KERNEL);
+	hv_serio = kzalloc(sizeof(*hv_serio), GFP_KERNEL);
 	if (!kbd_dev || !hv_serio) {
 		error = -ENOMEM;
 		goto err_free_mem;
diff --git a/drivers/input/serio/i8042.c b/drivers/input/serio/i8042.c
index 9fbb8d31575a..e0fb1db653b7 100644
--- a/drivers/input/serio/i8042.c
+++ b/drivers/input/serio/i8042.c
@@ -1329,7 +1329,7 @@ static int i8042_create_kbd_port(void)
 	struct serio *serio;
 	struct i8042_port *port = &i8042_ports[I8042_KBD_PORT_NO];
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!serio)
 		return -ENOMEM;
 
@@ -1359,7 +1359,7 @@ static int i8042_create_aux_port(int idx)
 	int port_no = idx < 0 ? I8042_AUX_PORT_NO : I8042_MUX_PORT_NO + idx;
 	struct i8042_port *port = &i8042_ports[port_no];
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!serio)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/maceps2.c b/drivers/input/serio/maceps2.c
index 5ccfb82759b3..42ac1eb94866 100644
--- a/drivers/input/serio/maceps2.c
+++ b/drivers/input/serio/maceps2.c
@@ -117,7 +117,7 @@ static struct serio *maceps2_allocate_port(int idx)
 {
 	struct serio *serio;
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (serio) {
 		serio->id.type		= SERIO_8042;
 		serio->write		= maceps2_write;
diff --git a/drivers/input/serio/olpc_apsp.c b/drivers/input/serio/olpc_apsp.c
index 240a714f7081..0ad95e880cc2 100644
--- a/drivers/input/serio/olpc_apsp.c
+++ b/drivers/input/serio/olpc_apsp.c
@@ -188,7 +188,7 @@ static int olpc_apsp_probe(struct platform_device *pdev)
 		return priv->irq;
 
 	/* KEYBOARD */
-	kb_serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	kb_serio = kzalloc(sizeof(*kb_serio), GFP_KERNEL);
 	if (!kb_serio)
 		return -ENOMEM;
 	kb_serio->id.type	= SERIO_8042_XL;
@@ -203,7 +203,7 @@ static int olpc_apsp_probe(struct platform_device *pdev)
 	serio_register_port(kb_serio);
 
 	/* TOUCHPAD */
-	pad_serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	pad_serio = kzalloc(sizeof(*pad_serio), GFP_KERNEL);
 	if (!pad_serio) {
 		error = -ENOMEM;
 		goto err_pad;
diff --git a/drivers/input/serio/parkbd.c b/drivers/input/serio/parkbd.c
index 0d54895428f5..328932297aad 100644
--- a/drivers/input/serio/parkbd.c
+++ b/drivers/input/serio/parkbd.c
@@ -165,7 +165,7 @@ static struct serio *parkbd_allocate_serio(void)
 {
 	struct serio *serio;
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (serio) {
 		serio->id.type = parkbd_mode;
 		serio->write = parkbd_write;
diff --git a/drivers/input/serio/pcips2.c b/drivers/input/serio/pcips2.c
index 05878750f2c2..6b9abb2e18c9 100644
--- a/drivers/input/serio/pcips2.c
+++ b/drivers/input/serio/pcips2.c
@@ -137,8 +137,8 @@ static int pcips2_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (ret)
 		goto disable;
 
-	ps2if = kzalloc(sizeof(struct pcips2_data), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	ps2if = kzalloc(sizeof(*ps2if), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!ps2if || !serio) {
 		ret = -ENOMEM;
 		goto release;
diff --git a/drivers/input/serio/ps2-gpio.c b/drivers/input/serio/ps2-gpio.c
index c3ff60859a03..0c8b390b8b4f 100644
--- a/drivers/input/serio/ps2-gpio.c
+++ b/drivers/input/serio/ps2-gpio.c
@@ -404,8 +404,8 @@ static int ps2_gpio_probe(struct platform_device *pdev)
 	struct device *dev = &pdev->dev;
 	int error;
 
-	drvdata = devm_kzalloc(dev, sizeof(struct ps2_gpio_data), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	drvdata = devm_kzalloc(dev, sizeof(*drvdata), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!drvdata || !serio) {
 		error = -ENOMEM;
 		goto err_free_serio;
diff --git a/drivers/input/serio/ps2mult.c b/drivers/input/serio/ps2mult.c
index 902e81826fbf..937ecdea491d 100644
--- a/drivers/input/serio/ps2mult.c
+++ b/drivers/input/serio/ps2mult.c
@@ -127,7 +127,7 @@ static int ps2mult_create_port(struct ps2mult *psm, int i)
 	struct serio *mx_serio = psm->mx_serio;
 	struct serio *serio;
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!serio)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/q40kbd.c b/drivers/input/serio/q40kbd.c
index 3f81f8749cd5..cd4d5be946a3 100644
--- a/drivers/input/serio/q40kbd.c
+++ b/drivers/input/serio/q40kbd.c
@@ -108,8 +108,8 @@ static int q40kbd_probe(struct platform_device *pdev)
 	struct serio *port;
 	int error;
 
-	q40kbd = kzalloc(sizeof(struct q40kbd), GFP_KERNEL);
-	port = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	q40kbd = kzalloc(sizeof(*q40kbd), GFP_KERNEL);
+	port = kzalloc(sizeof(*port), GFP_KERNEL);
 	if (!q40kbd || !port) {
 		error = -ENOMEM;
 		goto err_free_mem;
diff --git a/drivers/input/serio/rpckbd.c b/drivers/input/serio/rpckbd.c
index 9bbfefd092c0..e236bb7e1014 100644
--- a/drivers/input/serio/rpckbd.c
+++ b/drivers/input/serio/rpckbd.c
@@ -108,7 +108,7 @@ static int rpckbd_probe(struct platform_device *dev)
 	if (tx_irq < 0)
 		return tx_irq;
 
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	rpckbd = kzalloc(sizeof(*rpckbd), GFP_KERNEL);
 	if (!serio || !rpckbd) {
 		kfree(rpckbd);
diff --git a/drivers/input/serio/sa1111ps2.c b/drivers/input/serio/sa1111ps2.c
index 2724c3aa512c..1311caf7dba4 100644
--- a/drivers/input/serio/sa1111ps2.c
+++ b/drivers/input/serio/sa1111ps2.c
@@ -256,8 +256,8 @@ static int ps2_probe(struct sa1111_dev *dev)
 	struct serio *serio;
 	int ret;
 
-	ps2if = kzalloc(sizeof(struct ps2if), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	ps2if = kzalloc(sizeof(*ps2if), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!ps2if || !serio) {
 		ret = -ENOMEM;
 		goto free;
diff --git a/drivers/input/serio/serio.c b/drivers/input/serio/serio.c
index a8838b522627..04967494eeb6 100644
--- a/drivers/input/serio/serio.c
+++ b/drivers/input/serio/serio.c
@@ -258,7 +258,7 @@ static int serio_queue_event(void *object, struct module *owner,
 		}
 	}
 
-	event = kmalloc(sizeof(struct serio_event), GFP_ATOMIC);
+	event = kmalloc(sizeof(*event), GFP_ATOMIC);
 	if (!event) {
 		pr_err("Not enough memory to queue event %d\n", event_type);
 		retval = -ENOMEM;
diff --git a/drivers/input/serio/serio_raw.c b/drivers/input/serio/serio_raw.c
index 1e4770094415..0186d1b38f49 100644
--- a/drivers/input/serio/serio_raw.c
+++ b/drivers/input/serio/serio_raw.c
@@ -92,7 +92,7 @@ static int serio_raw_open(struct inode *inode, struct file *file)
 		goto out;
 	}
 
-	client = kzalloc(sizeof(struct serio_raw_client), GFP_KERNEL);
+	client = kzalloc(sizeof(*client), GFP_KERNEL);
 	if (!client) {
 		retval = -ENOMEM;
 		goto out;
@@ -293,7 +293,7 @@ static int serio_raw_connect(struct serio *serio, struct serio_driver *drv)
 	struct serio_raw *serio_raw;
 	int err;
 
-	serio_raw = kzalloc(sizeof(struct serio_raw), GFP_KERNEL);
+	serio_raw = kzalloc(sizeof(*serio_raw), GFP_KERNEL);
 	if (!serio_raw) {
 		dev_dbg(&serio->dev, "can't allocate memory for a device\n");
 		return -ENOMEM;
diff --git a/drivers/input/serio/serport.c b/drivers/input/serio/serport.c
index 1db3f30011c4..5a2b5404ffc2 100644
--- a/drivers/input/serio/serport.c
+++ b/drivers/input/serio/serport.c
@@ -82,7 +82,7 @@ static int serport_ldisc_open(struct tty_struct *tty)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	serport = kzalloc(sizeof(struct serport), GFP_KERNEL);
+	serport = kzalloc(sizeof(*serport), GFP_KERNEL);
 	if (!serport)
 		return -ENOMEM;
 
@@ -167,7 +167,7 @@ static ssize_t serport_ldisc_read(struct tty_struct * tty, struct file * file,
 	if (test_and_set_bit(SERPORT_BUSY, &serport->flags))
 		return -EBUSY;
 
-	serport->serio = serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	serport->serio = serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!serio)
 		return -ENOMEM;
 
diff --git a/drivers/input/serio/sun4i-ps2.c b/drivers/input/serio/sun4i-ps2.c
index aec66d9f5176..95cd8aaee65d 100644
--- a/drivers/input/serio/sun4i-ps2.c
+++ b/drivers/input/serio/sun4i-ps2.c
@@ -213,8 +213,8 @@ static int sun4i_ps2_probe(struct platform_device *pdev)
 	struct device *dev = &pdev->dev;
 	int error;
 
-	drvdata = kzalloc(sizeof(struct sun4i_ps2data), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	drvdata = kzalloc(sizeof(*drvdata), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!drvdata || !serio) {
 		error = -ENOMEM;
 		goto err_free_mem;
diff --git a/drivers/input/serio/userio.c b/drivers/input/serio/userio.c
index 9ab5c45c3a9f..a88e2eee55c3 100644
--- a/drivers/input/serio/userio.c
+++ b/drivers/input/serio/userio.c
@@ -77,7 +77,7 @@ static int userio_char_open(struct inode *inode, struct file *file)
 {
 	struct userio_device *userio;
 
-	userio = kzalloc(sizeof(struct userio_device), GFP_KERNEL);
+	userio = kzalloc(sizeof(*userio), GFP_KERNEL);
 	if (!userio)
 		return -ENOMEM;
 
@@ -85,7 +85,7 @@ static int userio_char_open(struct inode *inode, struct file *file)
 	spin_lock_init(&userio->buf_lock);
 	init_waitqueue_head(&userio->waitq);
 
-	userio->serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	userio->serio = kzalloc(sizeof(*userio->serio), GFP_KERNEL);
 	if (!userio->serio) {
 		kfree(userio);
 		return -ENOMEM;
diff --git a/drivers/input/serio/xilinx_ps2.c b/drivers/input/serio/xilinx_ps2.c
index bb758346a33d..1543267d02ac 100644
--- a/drivers/input/serio/xilinx_ps2.c
+++ b/drivers/input/serio/xilinx_ps2.c
@@ -252,8 +252,8 @@ static int xps2_of_probe(struct platform_device *ofdev)
 		return -ENODEV;
 	}
 
-	drvdata = kzalloc(sizeof(struct xps2data), GFP_KERNEL);
-	serio = kzalloc(sizeof(struct serio), GFP_KERNEL);
+	drvdata = kzalloc(sizeof(*drvdata), GFP_KERNEL);
+	serio = kzalloc(sizeof(*serio), GFP_KERNEL);
 	if (!drvdata || !serio) {
 		error = -ENOMEM;
 		goto failed1;
-- 
2.25.1



* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-07 18:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	Andrew Morton, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-2-david@redhat.com>

On 07.06.24 11:09, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
> 
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/internal.h       |  3 ++-
>   mm/kmsan/init.c     |  2 +-
>   mm/memory_hotplug.c |  9 +--------
>   mm/mm_init.c        |  4 ++--
>   mm/page_alloc.c     | 17 +++++++++++++++--
>   5 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 12e95fdf61e90..3fdee779205ab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
>   				    int mt);
>   extern void memblock_free_pages(struct page *page, unsigned long pfn,
>   					unsigned int order);
> -extern void __free_pages_core(struct page *page, unsigned int order);
> +extern void __free_pages_core(struct page *page, unsigned int order,
> +		enum meminit_context);
>   
>   /*
>    * This will have no effect, other than possibly generating a warning, if the
> diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
> index 3ac3b8921d36f..ca79636f858e5 100644
> --- a/mm/kmsan/init.c
> +++ b/mm/kmsan/init.c
> @@ -172,7 +172,7 @@ static void do_collection(void)
>   		shadow = smallstack_pop(&collect);
>   		origin = smallstack_pop(&collect);
>   		kmsan_setup_meta(page, shadow, origin, collect.order);
> -		__free_pages_core(page, collect.order);
> +		__free_pages_core(page, collect.order, MEMINIT_EARLY);
>   	}
>   }
>   
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 171ad975c7cfd..27e3be75edcf7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -630,14 +630,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);
>   
>   void generic_online_page(struct page *page, unsigned int order)
>   {
> -	/*
> -	 * Freeing the page with debug_pagealloc enabled will try to unmap it,
> -	 * so we should map it first. This is better than introducing a special
> -	 * case in page freeing fast path.
> -	 */
> -	debug_pagealloc_map_pages(page, 1 << order);
> -	__free_pages_core(page, order);
> -	totalram_pages_add(1UL << order);
> +	__free_pages_core(page, order, MEMINIT_HOTPLUG);
>   }
>   EXPORT_SYMBOL_GPL(generic_online_page);
>   
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 019193b0d8703..feb5b6e8c8875 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1938,7 +1938,7 @@ static void __init deferred_free_range(unsigned long pfn,
>   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>   		if (pageblock_aligned(pfn))
>   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -		__free_pages_core(page, 0);
> +		__free_pages_core(page, 0, MEMINIT_EARLY);
>   	}
>   }

The build bot just reminded me that I missed another case in this function:
(CONFIG_DEFERRED_STRUCT_PAGE_INIT)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index feb5b6e8c8875..5a0752261a795 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1928,7 +1928,7 @@ static void __init deferred_free_range(unsigned long pfn,
         if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
                 for (i = 0; i < nr_pages; i += pageblock_nr_pages)
                         set_pageblock_migratetype(page + i, MIGRATE_MOVABLE);
-               __free_pages_core(page, MAX_PAGE_ORDER);
+               __free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
                 return;
         }
  

-- 
Cheers,

David / dhildenb



* Re: [PATCH] Input: serio - use sizeof(*pointer) instead of sizeof(type)
From: Dmitry Torokhov @ 2024-06-07 19:17 UTC (permalink / raw)
  To: Erick Archer
  Cc: Russell King, James E.J. Bottomley, Helge Deller,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Chen-Yu Tsai, Jernej Skrabec, Samuel Holland,
	Stephen Chandler Paul, Michal Simek, Uwe Kleine-König,
	Russell King (Oracle), Suzuki K Poulose, Krzysztof Kozlowski,
	Rob Herring, Ruan Jinjie, Ricardo B. Marliere, Greg Kroah-Hartman,
	Jiri Slaby (SUSE), Mark Brown, Yang Li, Kees Cook,
	Gustavo A. R. Silva, Justin Stitt, linux-input, linux-kernel,
	linux-parisc, linux-hyperv, linux-arm-kernel, linux-sunxi,
	linux-hardening
In-Reply-To: <AS8PR02MB7237D3D898CCC9C50C18DE078BFB2@AS8PR02MB7237.eurprd02.prod.outlook.com>

On Fri, Jun 07, 2024 at 07:04:23PM +0200, Erick Archer wrote:
> It is preferred to use sizeof(*pointer) instead of sizeof(type)
> because the type of the variable can change and one need not
> change the former (unlike the latter). This patch has no effect
> on runtime behavior.
> 
> Signed-off-by: Erick Archer <erick.archer@outlook.com>

Applied, thank you.
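
For readers skimming the archive, here is a minimal userspace sketch of the
idiom the patch applies. It is not taken from the serio drivers: 'struct
demo_port' and demo_port_alloc() are hypothetical names, and calloc() stands
in for kzalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver-private structure. */
struct demo_port {
	unsigned long flags;
	void *serio;
};

/*
 * Allocate a zeroed object sized from the pointer's target type.
 * If the type of 'port' ever changes, sizeof(*port) follows it
 * automatically, whereas sizeof(struct demo_port) would need editing.
 */
struct demo_port *demo_port_alloc(void)
{
	struct demo_port *port;

	/* Today both spellings yield the same size. */
	assert(sizeof(*port) == sizeof(struct demo_port));

	port = calloc(1, sizeof(*port));
	return port;	/* NULL on allocation failure */
}
```

The caller owns the returned object and releases it with free().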

-- 
Dmitry


* Re: [PATCH v2 6/6] drivers/pci/hyperv/arm64: vPCI MSI IRQ domain from DT
From: Bjorn Helgaas @ 2024-06-07 19:55 UTC (permalink / raw)
  To: Roman Kisel
  Cc: Saurabh Singh Sengar, arnd, bhelgaas, bp, catalin.marinas,
	dave.hansen, decui, haiyangz, hpa, kw, kys, lenb, lpieralisi,
	mingo, mhklinux, rafael, robh, tglx, wei.liu, will, linux-acpi,
	linux-arch, linux-arm-kernel, linux-hyperv, linux-kernel,
	linux-pci, x86, ssengar, sunilmut, vdso
In-Reply-To: <20240515181238.GA2129352@bhelgaas>

On Wed, May 15, 2024 at 01:12:38PM -0500, Bjorn Helgaas wrote:
> On Wed, May 15, 2024 at 09:34:09AM -0700, Roman Kisel wrote:
> > 
> > 
> > On 5/15/2024 2:48 AM, Saurabh Singh Sengar wrote:
> > > On Tue, May 14, 2024 at 03:43:53PM -0700, Roman Kisel wrote:
> > > > The hyperv-pci driver uses ACPI for MSI IRQ domain configuration
> > > > on arm64 thereby it won't be able to do that in the VTL mode where
> > > > only DeviceTree can be used.
> > > > 
> > > > Update the hyperv-pci driver to discover interrupt configuration
> > > > via DeviceTree.
> > > 
> > > Subject prefix should be "PCI: hv:"

I forgot to also suggest that the subject line begin with a verb,
e.g., "Get vPCI MSI IRQ domain from DT" or similar, again so it reads
consistently with previous commits.

Oh, I see patch 5/6, "Get the irq number from DeviceTree" is also very
similar.  It would be nice if they matched, e.g., both used "IRQ" and
"DT".

Bjorn


* [PATCH 00/18] Introducing Core Building Blocks for Hyper-V VSM Emulation
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy

This series introduces core KVM functionality necessary to emulate Hyper-V's
Virtual Secure Mode in a Virtual Machine Monitor (VMM).

Hyper-V's Virtual Secure Mode (VSM) is a virtualization security feature that
leverages the hypervisor to create secure execution environments within a
guest. VSM is documented as part of Microsoft's Hypervisor Top Level Functional
Specification [1]. Security features that build upon VSM, like Windows
Credential Guard, are enabled by default on Windows 11 and are becoming a
prerequisite in some industries.

VSM introduces the concept of Virtual Trust Levels (VTLs). These are
independent execution contexts, each with its own CPU architectural state,
local APIC state, and a different view of memory. They are hierarchical, with
more privileged VTLs having priority over the execution of lower VTLs and
control over lower VTLs' state. Windows leverages these low-level
paravirtualized primitives, as well as the hypervisor's higher trust base, to
prevent guest data exfiltration even when the operating system itself has been
compromised.

As discussed at LPC2023 and in our previous RFC [2], we decided to model each
VTL as a distinct KVM VM. With this approach, and the RWX memory attributes
introduced in this series, we have been able to implement VTL memory
protections in a non-intrusive way, using generic KVM APIs. Additionally, each
CPU's VTL is modeled as a distinct KVM vCPU, owned by the KVM VM tracking that
VTL's state. VTL awareness is fully removed from KVM, and the responsibility
for VTL-aware hypercalls, VTL scheduling, and state transfer is delegated to
userspace.

Series overview:
- 1-8: Introduce a number of Hyper-V hypercalls, all of which are VTL-aware and
       expected to be handled in userspace. Additionally, a new VTL-specific MP
       state is introduced.
- 9-10: Pass the instruction length as part of the userspace fault exit data
        in order to simplify VSM's secure intercept generation.
- 11-17: Introduce RWX memory attributes as well as extend userspace faults.
- 18: Introduces the main VSM CPUID bit which gates all VTL configuration and
      runtime hypercalls.

The series is accompanied by two repositories:
 - A PoC QEMU implementation of VSM [3]: This PoC VSM implementation is capable
   of booting Windows Server 2016 and 2019 with Credential Guard (CG) enabled
   on VMs of any size or vCPU count. It's generally stable, but still sees
   its share of crashes. The PoC itself implements VSM interfaces to
   accommodate CG's needs, and it's by no means comprehensive. All in all,
   don't expect anything usable in production.

 - VSM kvm-unit-tests [4]: They cover all VSM hypercalls, as well as the KVM
   APIs introduced by this series. Unfortunately, they depend on the QEMU
   implementation.

We mostly tested on an Intel machine, both with and without TDP. Basic tests
were also run on AMD (build and kvm-unit-tests). Please note that v2 will
include KVM self-tests to close the testing gap, and allow merging this while
we work on the userspace bits.

The series is based on 'kvm/master', that is, commit db574f2f96d0, and also
available on GitHub [5].

This series also serves as a call-out to anyone interested in collaborating. We
have a proven design, a working PoC, and hopefully a path forward to merge
these KVM APIs. There is still plenty to do in both QEMU and KVM; I'll post a
list of ideas in the future. Feel free to get in touch!

Thanks,
Nicolas

[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
[2] https://lore.kernel.org/lkml/20231108111806.92604-1-nsaenz@amazon.com/
[3] https://github.com/vianpl/qemu/tree/vsm-v1
[4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-v1
[5] https://github.com/vianpl/linux/tree/vsm-v1

---

Anish Moorthy (1):
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
    userspace

Nicolas Saenz Julienne (17):
  KVM: x86: hyper-v: Introduce XMM output support
  KVM: x86: hyper-v: Introduce helpers to check if VSM is exposed to
    guest
  hyperv-tlfs: Update struct hv_send_ipi{_ex}'s declarations
  KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
  KVM: x86: hyper-v: Introduce MP_STATE_HV_INACTIVE_VTL
  KVM: x86: hyper-v: Exit on Get/SetVpRegisters hcall
  KVM: x86: hyper-v: Exit on TranslateVirtualAddress hcall
  KVM: x86: hyper-v: Exit on StartVirtualProcessor and
    GetVpIndexFromApicId hcalls
  KVM: x86: Keep track of instruction length during faults
  KVM: x86: Pass the instruction length on memory fault user-space exits
  KVM: x86/mmu: Introduce infrastructure to handle non-executable
    mappings
  KVM: x86/mmu: Avoid warning when installing non-private memory
    attributes
  KVM: x86/mmu: Init memslot if memory attributes available
  KVM: Introduce RWX memory attributes
  KVM: x86: Take mem attributes into account when faulting memory
  KVM: Introduce traces to track memory attributes modification.
  KVM: x86: hyper-v: Handle VSM hcalls in user-space

 Documentation/virt/kvm/api.rst     | 107 +++++++++++++++++++++++-
 arch/x86/hyperv/hv_apic.c          |   3 +-
 arch/x86/include/asm/hyperv-tlfs.h |   2 +-
 arch/x86/kvm/Kconfig               |   1 +
 arch/x86/kvm/hyperv.c              | 127 +++++++++++++++++++++++++++--
 arch/x86/kvm/hyperv.h              |  18 ++++
 arch/x86/kvm/mmu/mmu.c             |  91 +++++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h    |   9 +-
 arch/x86/kvm/mmu/mmutrace.h        |  29 +++++++
 arch/x86/kvm/mmu/paging_tmpl.h     |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         |   8 +-
 arch/x86/kvm/svm/svm.c             |   7 +-
 arch/x86/kvm/vmx/vmx.c             |  23 +++++-
 arch/x86/kvm/x86.c                 |  17 +++-
 include/asm-generic/hyperv-tlfs.h  |  16 +++-
 include/linux/kvm_host.h           |  45 +++++++++-
 include/trace/events/kvm.h         |  20 +++++
 include/uapi/linux/kvm.h           |  15 ++++
 virt/kvm/kvm_main.c                |  35 +++++++-
 19 files changed, 527 insertions(+), 48 deletions(-)

-- 
2.40.1


* [PATCH 01/18] KVM: x86: hyper-v: Introduce XMM output support
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Prepare infrastructure to be able to return data through the XMM
registers when Hyper-V hypercalls are issued in fast mode. The XMM
registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
restored on successful hypercall completion.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>

---

There was some discussion in the RFC about whether growing 'struct
kvm_hyperv_exit' is ABI breakage. IMO it isn't:
- There is padding in 'struct kvm_run' that ensures that a bigger
  'struct kvm_hyperv_exit' doesn't alter the offsets within that struct.
- Adding a new field at the bottom of the 'hcall' field within the
  'struct kvm_hyperv_exit' should be fine as well, as it doesn't alter
  the offsets within that struct either.
- Ultimately, previous updates to 'struct kvm_hyperv_exit' hint that
  its size isn't part of the uABI. It already grew when syndbg was
  introduced.

 Documentation/virt/kvm/api.rst     | 19 ++++++++++
 arch/x86/include/asm/hyperv-tlfs.h |  2 +-
 arch/x86/kvm/hyperv.c              | 56 +++++++++++++++++++++++++++++-
 include/uapi/linux/kvm.h           |  6 ++++
 4 files changed, 81 insertions(+), 2 deletions(-)
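
The offset argument in the notes above can be checked with a small userspace
sketch. The structs below are simplified models of the 'hcall' member before
and after this patch, not the uapi definitions; the names hcall_before,
hcall_after and xmm_reg are made up for illustration, and the surrounding
union and 'struct kvm_run' padding are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the 'hcall' member before the patch. */
struct hcall_before {
	uint64_t input;
	uint64_t result;
	uint64_t params[2];
};

/* Mirrors the shape of struct kvm_hyperv_xmm_reg. */
struct xmm_reg {
	uint64_t low;
	uint64_t high;
};

/* Simplified model of the 'hcall' member after the patch. */
struct hcall_after {
	uint64_t input;
	uint64_t result;
	uint64_t params[2];
	struct xmm_reg xmm[6];	/* appended at the end */
};

/* Returns 1 if every pre-existing field keeps its offset, 0 otherwise. */
int hcall_offsets_unchanged(void)
{
	return offsetof(struct hcall_before, input) ==
	       offsetof(struct hcall_after, input) &&
	       offsetof(struct hcall_before, result) ==
	       offsetof(struct hcall_after, result) &&
	       offsetof(struct hcall_before, params) ==
	       offsetof(struct hcall_after, params);
}

/* Number of bytes the appended array adds to the member. */
size_t hcall_growth(void)
{
	return sizeof(struct hcall_after) - sizeof(struct hcall_before);
}
```

In this model only the tail of the member grows, which is the property the
second bullet relies on.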

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a71d91978d9ef..17893b330b76f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8893,3 +8893,22 @@ Ordering of KVM_GET_*/KVM_SET_* ioctls
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 TBD
+
+10. Hyper-V CPUIDs
+==================
+
+This section only applies to x86.
+
+New Hyper-V feature support is no longer being tracked through KVM
+capabilities.  Userspace can check if a particular version of KVM supports a
+feature using KVM_GET_SUPPORTED_HV_CPUID.  This section documents how Hyper-V
+CPUIDs map to KVM functionality.
+
+10.1 HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE
+------------------------------------------
+
+:Location: CPUID.40000003H:EDX[bit 15]
+
+This CPUID indicates that KVM supports returning data to the guest in response
+to a hypercall using the XMM registers. It also extends ``struct
+kvm_hyperv_exit`` to allow passing the XMM data from userspace.
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 3787d26810c1c..6a18c9f77d5fe 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -49,7 +49,7 @@
 /* Support for physical CPU dynamic partitioning events is available*/
 #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE	BIT(3)
 /*
- * Support for passing hypercall input parameter block via XMM
+ * Support for passing hypercall input and output parameter block via XMM
  * registers is available
  */
 #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE		BIT(4)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 8a47f8541eab7..42f44546fe79c 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1865,6 +1865,7 @@ struct kvm_hv_hcall {
 	u16 rep_idx;
 	bool fast;
 	bool rep;
+	bool xmm_dirty;
 	sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
 
 	/*
@@ -2396,9 +2397,49 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
 	return ret;
 }
 
+static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
+{
+	int reg;
+
+	kvm_fpu_get();
+	for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
+		const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
+		_kvm_write_sse_reg(reg, &data);
+	}
+	kvm_fpu_put();
+}
+
+static bool kvm_hv_is_xmm_output_hcall(u16 code)
+{
+	return false;
+}
+
+static bool kvm_hv_xmm_output_allowed(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+	return !hv_vcpu->enforce_cpuid ||
+	       hv_vcpu->cpuid_cache.features_edx &
+		       HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
+}
+
 static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 {
-	return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
+	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
+	u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
+	u64 result = vcpu->run->hyperv.u.hcall.result;
+
+	if (hv_result_success(result) && fast &&
+	    kvm_hv_is_xmm_output_hcall(code)) {
+		if (unlikely(!kvm_hv_xmm_output_allowed(vcpu))) {
+			kvm_queue_exception(vcpu, UD_VECTOR);
+			return 1;
+		}
+
+		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
+	}
+
+	return kvm_hv_hypercall_complete(vcpu, result);
 }
 
 static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
@@ -2553,6 +2594,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	hc.rep_cnt = (hc.param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
 	hc.rep_idx = (hc.param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
 	hc.rep = !!(hc.rep_cnt || hc.rep_idx);
+	hc.xmm_dirty = false;
 
 	trace_kvm_hv_hypercall(hc.code, hc.fast, hc.var_cnt, hc.rep_cnt,
 			       hc.rep_idx, hc.ingpa, hc.outgpa);
@@ -2673,6 +2715,15 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		break;
 	}
 
+	if (hv_result_success(ret) && hc.xmm_dirty) {
+		if (unlikely(!kvm_hv_xmm_output_allowed(vcpu))) {
+			kvm_queue_exception(vcpu, UD_VECTOR);
+			return 1;
+		}
+
+		kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg *)hc.xmm);
+	}
+
 hypercall_complete:
 	return kvm_hv_hypercall_complete(vcpu, ret);
 
@@ -2682,6 +2733,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	vcpu->run->hyperv.u.hcall.input = hc.param;
 	vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
 	vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
+	if (hc.fast)
+		memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
 	vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
 	return 0;
 }
@@ -2830,6 +2883,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
+			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
 			ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
 			ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d03842abae578..fbdee8d754595 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -90,6 +90,11 @@ struct kvm_pit_config {
 
 #define KVM_PIT_SPEAKER_DUMMY     1
 
+struct kvm_hyperv_xmm_reg {
+	__u64 low;
+	__u64 high;
+};
+
 struct kvm_hyperv_exit {
 #define KVM_EXIT_HYPERV_SYNIC          1
 #define KVM_EXIT_HYPERV_HCALL          2
@@ -108,6 +113,7 @@ struct kvm_hyperv_exit {
 			__u64 input;
 			__u64 result;
 			__u64 params[2];
+			struct kvm_hyperv_xmm_reg xmm[6];
 		} hcall;
 		struct {
 			__u32 msr;
-- 
2.40.1



* [PATCH 02/18] KVM: x86: hyper-v: Introduce helpers to check if VSM is exposed to guest
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Introduce a helper function to check if the VSM CPUID bit is exposed to
the guest.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.h             | 10 ++++++++++
 include/asm-generic/hyperv-tlfs.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 923e64903da9a..d007d2203e0e4 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -265,6 +265,12 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu,
 }
 
 int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
+static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+	return hv_vcpu && (hv_vcpu->cpuid_cache.features_ebx & HV_ACCESS_VSM);
+}
 #else /* CONFIG_KVM_HYPERV */
 static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
 					 struct pvclock_vcpu_time_info *hv_clock) {}
@@ -322,6 +328,10 @@ static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
 	return vcpu->vcpu_idx;
 }
 static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {}
+static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_HYPERV */
 
 #endif /* __ARCH_X86_KVM_HYPERV_H__ */
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 814207e7c37fc..ffac04bbd0c19 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -89,6 +89,7 @@
 #define HV_ACCESS_STATS				BIT(8)
 #define HV_DEBUGGING				BIT(11)
 #define HV_CPU_MANAGEMENT			BIT(12)
+#define HV_ACCESS_VSM				BIT(16)
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
 #define HV_ISOLATION				BIT(22)
 
-- 
2.40.1



* [PATCH 03/18] hyperv-tlfs: Update struct hv_send_ipi{_ex}'s declarations
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Both 'struct hv_send_ipi' and 'struct hv_send_ipi_ex' have a 'union
hv_input_vtl' field which has been ignored until now. Expose it, as
KVM will soon provide a way of dealing with VTL-aware IPIs. While doing
so, also fix up __send_ipi_mask_ex().

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/hyperv/hv_apic.c         | 3 +--
 include/asm-generic/hyperv-tlfs.h | 6 ++++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 0569f579338b5..97907371d51ef 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -121,9 +121,8 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector,
 	if (unlikely(!ipi_arg))
 		goto ipi_mask_ex_done;
 
+	memset(ipi_arg, 0, sizeof(*ipi_arg));
 	ipi_arg->vector = vector;
-	ipi_arg->reserved = 0;
-	ipi_arg->vp_set.valid_bank_mask = 0;
 
 	/*
 	 * Use HV_GENERIC_SET_ALL and avoid converting cpumask to VP_SET
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index ffac04bbd0c19..28cde641b5474 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -425,14 +425,16 @@ struct hv_vpset {
 /* HvCallSendSyntheticClusterIpi hypercall */
 struct hv_send_ipi {
 	u32 vector;
-	u32 reserved;
+	union hv_input_vtl in_vtl;
+	u8 reserved[3];
 	u64 cpu_mask;
 } __packed;
 
 /* HvCallSendSyntheticClusterIpiEx hypercall */
 struct hv_send_ipi_ex {
 	u32 vector;
-	u32 reserved;
+	union hv_input_vtl in_vtl;
+	u8 reserved[3];
 	struct hv_vpset vp_set;
 } __packed;
 
-- 
2.40.1



* [PATCH 04/18] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

HvCallSendSyntheticClusterIpi and HvCallSendSyntheticClusterIpiEx allow
sending VTL-aware IPIs. Honour the hcall by exiting to user-space upon
receiving a request with a valid VTL target. This behaviour is only
available if the VSM CPUID flag is available and exposed to the guest.
It doesn't introduce a behaviour change otherwise.

User-space is accountable for the correct processing of the PV-IPI
before resuming execution.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
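
As a rough illustration of the fast-mode decoding in this patch, the sketch
below splits the input register into the vector and the VTL byte and applies
the same reserved-bits check. It is a userspace model, not the KVM code:
'struct fast_ipi_input' and decode_fast_ipi() are hypothetical names, and the
bit positions inside the VTL byte (target VTL in bits 0-3, use_target_vtl in
bit 4) are an assumption based on the 'union hv_input_vtl' layout.

```c
#include <stdint.h>

/* Hypothetical decoded view of the fast-mode input register. */
struct fast_ipi_input {
	uint32_t vector;	/* bits 0-31 */
	uint8_t target_vtl;	/* bits 32-35 (assumed layout) */
	uint8_t use_target_vtl;	/* bit 36 (assumed layout) */
};

/*
 * Decode 'ingpa' along the lines of the fast path. Returns 0 on success
 * and -1 if any reserved bit is set. With VSM exposed, bits 32-39 carry
 * the VTL byte and only bits 40-63 are reserved; otherwise bits 32-63
 * must all be zero.
 */
int decode_fast_ipi(uint64_t ingpa, int vsm_enabled, struct fast_ipi_input *out)
{
	int rsvd_shift = vsm_enabled ? 40 : 32;
	uint8_t vtl_byte;

	if ((ingpa >> rsvd_shift) != 0)
		return -1;

	vtl_byte = (uint8_t)(ingpa >> 32);
	out->vector = (uint32_t)ingpa;
	out->target_vtl = vtl_byte & 0x0f;
	out->use_target_vtl = (vtl_byte >> 4) & 0x01;
	return 0;
}
```

A caller would treat a decoded use_target_vtl of 1 as the cue to hand the
hypercall to user-space, which is what the -ENODEV return signals in the diff.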

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 42f44546fe79c..d00baf3ffb165 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2217,16 +2217,20 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
 
 static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 {
+	bool vsm_enabled = kvm_hv_cpuid_vsm_enabled(vcpu);
 	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
 	u64 *sparse_banks = hv_vcpu->sparse_banks;
 	struct kvm *kvm = vcpu->kvm;
 	struct hv_send_ipi_ex send_ipi_ex;
 	struct hv_send_ipi send_ipi;
+	union hv_input_vtl *in_vtl;
 	u64 valid_bank_mask;
+	int rsvd_shift;
 	u32 vector;
 	bool all_cpus;
 
 	if (hc->code == HVCALL_SEND_IPI) {
+		in_vtl = &send_ipi.in_vtl;
 		if (!hc->fast) {
 			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi,
 						    sizeof(send_ipi))))
@@ -2235,16 +2239,22 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 			vector = send_ipi.vector;
 		} else {
 			/* 'reserved' part of hv_send_ipi should be 0 */
-			if (unlikely(hc->ingpa >> 32 != 0))
+			rsvd_shift = vsm_enabled ? 40 : 32;
+			if (unlikely(hc->ingpa >> rsvd_shift != 0))
 				return HV_STATUS_INVALID_HYPERCALL_INPUT;
+			in_vtl->as_uint8 = (u8)(hc->ingpa >> 32);
 			sparse_banks[0] = hc->outgpa;
 			vector = (u32)hc->ingpa;
 		}
 		all_cpus = false;
 		valid_bank_mask = BIT_ULL(0);
 
+		if (in_vtl->use_target_vtl)
+			return -ENODEV;
+
 		trace_kvm_hv_send_ipi(vector, sparse_banks[0]);
 	} else {
+		in_vtl = &send_ipi_ex.in_vtl;
 		if (!hc->fast) {
 			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi_ex,
 						    sizeof(send_ipi_ex))))
@@ -2253,8 +2263,12 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 			send_ipi_ex.vector = (u32)hc->ingpa;
 			send_ipi_ex.vp_set.format = hc->outgpa;
 			send_ipi_ex.vp_set.valid_bank_mask = sse128_lo(hc->xmm[0]);
+			in_vtl->as_uint8 = (u8)(hc->ingpa >> 32);
 		}
 
+		if (vsm_enabled && in_vtl->use_target_vtl)
+			return -ENODEV;
+
 		trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
 					 send_ipi_ex.vp_set.format,
 					 send_ipi_ex.vp_set.valid_bank_mask);
@@ -2682,6 +2696,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 			break;
 		}
 		ret = kvm_hv_send_ipi(vcpu, &hc);
+		/* VTL-enabled ipi, let user-space handle it */
+		if (ret == -ENODEV)
+			goto hypercall_userspace_exit;
 		break;
 	case HVCALL_POST_DEBUG_DATA:
 	case HVCALL_RETRIEVE_DEBUG_DATA:
-- 
2.40.1



* [PATCH 05/18] KVM: x86: hyper-v: Introduce MP_STATE_HV_INACTIVE_VTL
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Model inactive VTL vCPUs' behaviour with a new MP state.

Inactive VTLs are in an artificial halt state. They enter this state in
response to invoking HvCallVtlCall or HvCallVtlReturn. User-space, which
is VTL aware, can process the hypercall and set the vCPU to
MP_STATE_HV_INACTIVE_VTL. When a vCPU is run in this state it'll
block until a wakeup event is received. The rules of what constitutes an
event are analogous to halt's except that VTLs ignore RFLAGS.IF.

When a wakeup event is registered, KVM will exit to user-space with a
KVM_SYSTEM_EVENT exit, and KVM_SYSTEM_EVENT_WAKEUP event type.
User-space is responsible for deciding whether the event has precedence
over the active VTL and will switch the vCPU to KVM_MP_STATE_RUNNABLE
before resuming execution on it.

Running a KVM_MP_STATE_HV_INACTIVE_VTL vCPU with pending events will
return immediately to user-space.

Note that by re-using the readily available halt infrastructure in
KVM_RUN, MP_STATE_HV_INACTIVE_VTL correctly handles (or disables)
virtualisation features like the VMX preemption timer or APICv before
blocking.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>

---

I do recall Sean mentioning using MP states for this might have
unexpected side-effects. But it was in the context of introducing a
broader `HALTED_USERSPACE` style state. I believe that by narrowing down
the MP state's semantics to the specifics of inactive VTLs --
alternatively, we could change RFLAGS.IF in user-space before updating
the mp state -- we cement this as a VSM-only API as well as limit the
ambiguity on the guest/vCPU's state upon entering into this execution
mode.

 Documentation/virt/kvm/api.rst | 19 +++++++++++++++++++
 arch/x86/kvm/hyperv.h          |  8 ++++++++
 arch/x86/kvm/svm/svm.c         |  7 ++++++-
 arch/x86/kvm/vmx/vmx.c         |  7 ++++++-
 arch/x86/kvm/x86.c             | 16 +++++++++++++++-
 include/uapi/linux/kvm.h       |  1 +
 6 files changed, 55 insertions(+), 3 deletions(-)
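
The wakeup rule described in the commit message can be modelled in a few
lines. This is a deliberately reduced sketch: it only considers maskable
interrupts (NMIs, INIT and other wake sources are left out), and the enum and
model_irq_wakes() are hypothetical names rather than KVM uapi constants or
functions.

```c
#include <stdbool.h>

/* Hypothetical model states; not the KVM uapi constants. */
enum model_mp_state {
	MODEL_RUNNABLE,
	MODEL_HALTED,
	MODEL_HV_INACTIVE_VTL,
};

/*
 * Decide whether a pending maskable interrupt wakes a blocked vCPU.
 * A halted vCPU needs RFLAGS.IF set; an inactive VTL ignores it.
 */
bool model_irq_wakes(enum model_mp_state state, bool irq_pending,
		     bool rflags_if)
{
	if (!irq_pending)
		return false;

	switch (state) {
	case MODEL_HALTED:
		return rflags_if;
	case MODEL_HV_INACTIVE_VTL:
		return true;
	default:
		return false;	/* a runnable vCPU is not blocked */
	}
}
```

In this model the two blocked states differ only when interrupts are masked.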

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 17893b330b76f..e664c54a13b04 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1517,6 +1517,8 @@ Possible values are:
                                  [s390]
    KVM_MP_STATE_SUSPENDED        the vcpu is in a suspend state and is waiting
                                  for a wakeup event [arm64]
+   KVM_MP_STATE_HV_INACTIVE_VTL  the vcpu is an inactive VTL and is waiting for
+                                 a wakeup event [x86]
    ==========================    ===============================================
 
 On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
@@ -1559,6 +1561,23 @@ KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
 On LoongArch, only the KVM_MP_STATE_RUNNABLE state is used to reflect
 whether the vcpu is runnable.
 
+For x86:
+^^^^^^^^
+
+KVM_MP_STATE_HV_INACTIVE_VTL is only available to a VM if Hyper-V's
+HV_ACCESS_VSM CPUID is exposed to the guest.  This processor state models the
+behavior of an inactive VTL and should only be used for this purpose. A
+userspace process should only switch a vCPU into this MP state in response to a
+HvCallVtlCall or HvCallVtlReturn.
+
+If a vCPU is in KVM_MP_STATE_HV_INACTIVE_VTL, KVM will emulate the
+architectural execution of a HLT instruction with the caveat that RFLAGS.IF is
+ignored when deciding whether to wake up (TLFS 12.12.2.1).  If a wakeup is
+recognized, KVM will exit to userspace with a KVM_SYSTEM_EVENT exit, where the
+event type is KVM_SYSTEM_EVENT_WAKEUP. Userspace has the responsibility to
+switch the vCPU back into KVM_MP_STATE_RUNNABLE state. Calling KVM_RUN on a
+KVM_MP_STATE_HV_INACTIVE_VTL vCPU with pending events will exit immediately.
+
 4.39 KVM_SET_MP_STATE
 ---------------------
 
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index d007d2203e0e4..d42fe3f85b002 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -271,6 +271,10 @@ static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
 
 	return hv_vcpu && (hv_vcpu->cpuid_cache.features_ebx & HV_ACCESS_VSM);
 }
+static inline bool kvm_hv_vcpu_is_idle_vtl(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.mp_state == KVM_MP_STATE_HV_INACTIVE_VTL;
+}
 #else /* CONFIG_KVM_HYPERV */
 static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
 					 struct pvclock_vcpu_time_info *hv_clock) {}
@@ -332,6 +336,10 @@ static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
 {
 	return false;
 }
+static inline bool kvm_hv_vcpu_is_idle_vtl(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_HYPERV */
 
 #endif /* __ARCH_X86_KVM_HYPERV_H__ */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 296c524988f95..9671191fef4ea 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -49,6 +49,7 @@
 #include "svm.h"
 #include "svm_ops.h"
 
+#include "hyperv.h"
 #include "kvm_onhyperv.h"
 #include "svm_onhyperv.h"
 
@@ -3797,6 +3798,10 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
 	if (!gif_set(svm))
 		return true;
 
+	/*
+	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
+	 * whether to block interrupts targeted at inactive VTLs.
+	 */
 	if (is_guest_mode(vcpu)) {
 		/* As long as interrupts are being delivered...  */
 		if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
@@ -3808,7 +3813,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
 		if (nested_exit_on_intr(svm))
 			return false;
 	} else {
-		if (!svm_get_if_flag(vcpu))
+		if (!svm_get_if_flag(vcpu) && !kvm_hv_vcpu_is_idle_vtl(vcpu))
 			return true;
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3c83c06f8265..ac0682fece604 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5057,7 +5057,12 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
 		return false;
 
-	return !(vmx_get_rflags(vcpu) & X86_EFLAGS_IF) ||
+	/*
+	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
+	 * whether to block interrupts targeted at inactive VTLs.
+	 */
+	return (!(vmx_get_rflags(vcpu) & X86_EFLAGS_IF) &&
+		!kvm_hv_vcpu_is_idle_vtl(vcpu)) ||
 	       (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c9e4281d978d..a6e2312ccb68f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -134,6 +134,7 @@ static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu);
 
 static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
 static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu);
 
 static DEFINE_MUTEX(vendor_module_lock);
 struct kvm_x86_ops kvm_x86_ops __read_mostly;
@@ -11176,7 +11177,8 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 			kvm_lapic_switch_to_sw_timer(vcpu);
 
 		kvm_vcpu_srcu_read_unlock(vcpu);
-		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
+		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED ||
+		    kvm_hv_vcpu_is_idle_vtl(vcpu))
 			kvm_vcpu_halt(vcpu);
 		else
 			kvm_vcpu_block(vcpu);
@@ -11218,6 +11220,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.halted = false;
 		break;
 	case KVM_MP_STATE_INIT_RECEIVED:
+	case KVM_MP_STATE_HV_INACTIVE_VTL:
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -11264,6 +11267,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 		if (kvm_cpu_has_pending_timer(vcpu))
 			kvm_inject_pending_timer_irqs(vcpu);
 
+		if (kvm_hv_vcpu_is_idle_vtl(vcpu) && kvm_vcpu_has_events(vcpu)) {
+			r = 0;
+			vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+			vcpu->run->system_event.type = KVM_SYSTEM_EVENT_WAKEUP;
+			break;
+		}
+
 		if (dm_request_for_irq_injection(vcpu) &&
 			kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
 			r = 0;
@@ -11703,6 +11713,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 			goto out;
 		break;
 
+	case KVM_MP_STATE_HV_INACTIVE_VTL:
+		if (is_guest_mode(vcpu) || !kvm_hv_cpuid_vsm_enabled(vcpu))
+			goto out;
+		break;
 	case KVM_MP_STATE_RUNNABLE:
 		break;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fbdee8d754595..f4864e6907e0b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -564,6 +564,7 @@ struct kvm_vapic_addr {
 #define KVM_MP_STATE_LOAD              8
 #define KVM_MP_STATE_AP_RESET_HOLD     9
 #define KVM_MP_STATE_SUSPENDED         10
+#define KVM_MP_STATE_HV_INACTIVE_VTL   11
 
 struct kvm_mp_state {
 	__u32 mp_state;
-- 
2.40.1



* [PATCH 06/18] KVM: x86: hyper-v: Exit on Get/SetVpRegisters hcall
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Let user-space handle HvGetVpRegisters and HvSetVpRegisters, as they are
VTL-aware hypercalls used solely in the context of VSM. Additionally,
expose the corresponding CPUID bit, HV_ACCESS_VP_REGISTERS.
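
To make the gating explicit, here is a stand-alone sketch of the access
check. The constants are copied from the hunks below;
vp_register_hcall_allowed() is a hypothetical name, and the sketch does
not model when KVM actually enforces the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the hunks in this patch. */
#define HV_ACCESS_VP_REGISTERS	(1u << 17)	/* CPUID.40000003H:EBX[17] */
#define HVCALL_GET_VP_REGISTERS	0x0050
#define HVCALL_SET_VP_REGISTERS	0x0051

/*
 * Hypothetical stand-alone version of the access check: the two register
 * hypercalls are permitted only if the guest-visible EBX advertises
 * HV_ACCESS_VP_REGISTERS. Other hypercall codes are out of scope for
 * this sketch and are simply allowed.
 */
static inline bool vp_register_hcall_allowed(uint32_t features_ebx,
					     uint16_t code)
{
	switch (code) {
	case HVCALL_GET_VP_REGISTERS:
	case HVCALL_SET_VP_REGISTERS:
		return (features_ebx & HV_ACCESS_VP_REGISTERS) != 0;
	default:
		return true;
	}
}
```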

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst    | 10 ++++++++++
 arch/x86/kvm/hyperv.c             | 15 +++++++++++++++
 include/asm-generic/hyperv-tlfs.h |  1 +
 3 files changed, 26 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e664c54a13b04..05b01b00a395c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8931,3 +8931,13 @@ CPUIDs map to KVM functionality.
 This CPUID indicates that KVM supports retuning data to the guest in response
 to a hypercall using the XMM registers. It also extends ``struct
 kvm_hyperv_exit`` to allow passing the XMM data from userspace.
+
+10.2 HV_ACCESS_VP_REGISTERS
+---------------------------
+
+:Location: CPUID.40000003H:EBX[bit 17]
+
+This CPUID indicates that KVM supports HvGetVpRegisters and HvSetVpRegisters.
+Currently, it is only used in conjunction with HV_ACCESS_VSM, and immediately
+exits to userspace with KVM_EXIT_HYPERV_HCALL as the reason. Userspace is
+expected to complete the hypercall before resuming execution.
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index d00baf3ffb165..d0edc2bec5a4f 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2425,6 +2425,11 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
 
 static bool kvm_hv_is_xmm_output_hcall(u16 code)
 {
+	switch (code) {
+	case HVCALL_GET_VP_REGISTERS:
+		return true;
+	}
+
 	return false;
 }
 
@@ -2505,6 +2510,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
 	case HVCALL_SEND_IPI_EX:
+	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_SET_VP_REGISTERS:
 		return true;
 	}
 
@@ -2543,6 +2550,10 @@ static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code)
 		 */
 		return !kvm_hv_is_syndbg_enabled(hv_vcpu->vcpu) ||
 			hv_vcpu->cpuid_cache.features_ebx & HV_DEBUGGING;
+	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_SET_VP_REGISTERS:
+		return hv_vcpu->cpuid_cache.features_ebx &
+			HV_ACCESS_VP_REGISTERS;
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
 		if (!(hv_vcpu->cpuid_cache.enlightenments_eax &
@@ -2727,6 +2738,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 			break;
 		}
 		goto hypercall_userspace_exit;
+	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_SET_VP_REGISTERS:
+		goto hypercall_userspace_exit;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
 		break;
@@ -2898,6 +2912,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_POST_MESSAGES;
 			ent->ebx |= HV_SIGNAL_EVENTS;
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
+			ent->ebx |= HV_ACCESS_VP_REGISTERS;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
 			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 28cde641b5474..9e909f0834598 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -90,6 +90,7 @@
 #define HV_DEBUGGING				BIT(11)
 #define HV_CPU_MANAGEMENT			BIT(12)
 #define HV_ACCESS_VSM				BIT(16)
+#define HV_ACCESS_VP_REGISTERS			BIT(17)
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
 #define HV_ISOLATION				BIT(22)
 
-- 
2.40.1



* [PATCH 07/18] KVM: x86: hyper-v: Exit on TranslateVirtualAddress hcall
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Handle HvTranslateVirtualAddress in user-space. The hypercall is
VTL-aware and only used in the context of VSM. Additionally, the TLFS
doesn't introduce an ad-hoc CPUID bit for it, so the hypercall
availability is tracked as part of the HV_ACCESS_VSM CPUID. This will be
documented with the main VSM commit.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c             | 3 +++
 include/asm-generic/hyperv-tlfs.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index d0edc2bec5a4f..cbe2aca52514b 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2427,6 +2427,7 @@ static bool kvm_hv_is_xmm_output_hcall(u16 code)
 {
 	switch (code) {
 	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		return true;
 	}
 
@@ -2512,6 +2513,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_SEND_IPI_EX:
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
+	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		return true;
 	}
 
@@ -2740,6 +2742,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		goto hypercall_userspace_exit;
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
+	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		goto hypercall_userspace_exit;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 9e909f0834598..57c791c555861 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -159,6 +159,7 @@ union hv_reference_tsc_msr {
 #define HVCALL_CREATE_VP			0x004e
 #define HVCALL_GET_VP_REGISTERS			0x0050
 #define HVCALL_SET_VP_REGISTERS			0x0051
+#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS	0x0052
 #define HVCALL_POST_MESSAGE			0x005c
 #define HVCALL_SIGNAL_EVENT			0x005d
 #define HVCALL_POST_DEBUG_DATA			0x0069
-- 
2.40.1



* [PATCH 08/18] KVM: x86: hyper-v: Exit on StartVirtualProcessor and GetVpIndexFromApicId hcalls
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Both HvCallStartVirtualProcessor and HvCallGetVpIndexFromApicId are used
as part of the Hyper-V VSM CPU bootstrap process and require VTL
awareness. As such, handle both hypercalls in user-space. Also, expose
the ad-hoc CPUID bit.

Note that these hypercalls aren't necessary on Hyper-V guests that don't
enable VSM.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst    | 11 +++++++++++
 arch/x86/kvm/hyperv.c             |  7 +++++++
 include/asm-generic/hyperv-tlfs.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 05b01b00a395c..161a772c23c6a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8941,3 +8941,14 @@ This CPUID indicates that KVM supports HvGetVpRegisters and HvSetVpRegisters.
 Currently, it is only used in conjunction with HV_ACCESS_VSM, and immediately
 exits to userspace with KVM_EXIT_HYPERV_HCALL as the reason. Userspace is
 expected to complete the hypercall before resuming execution.
+
+10.3 HV_START_VIRTUAL_PROCESSOR
+-------------------------------
+
+:Location: CPUID.40000003H:EBX[bit 21]
+
+This CPUID indicates that KVM supports HvCallStartVirtualProcessor and
+HvCallGetVpIndexFromApicId. Currently, it is only used in conjunction with
+HV_ACCESS_VSM, and immediately exits to userspace with KVM_EXIT_HYPERV_HCALL as
+the reason. Userspace is expected to complete the hypercall before resuming
+execution.
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index cbe2aca52514b..dd64f41dc835d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2556,6 +2556,10 @@ static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code)
 	case HVCALL_SET_VP_REGISTERS:
 		return hv_vcpu->cpuid_cache.features_ebx &
 			HV_ACCESS_VP_REGISTERS;
+	case HVCALL_START_VP:
+	case HVCALL_GET_VP_ID_FROM_APIC_ID:
+		return hv_vcpu->cpuid_cache.features_ebx &
+			HV_START_VIRTUAL_PROCESSOR;
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
 		if (!(hv_vcpu->cpuid_cache.enlightenments_eax &
@@ -2743,6 +2747,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
 	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
+	case HVCALL_START_VP:
+	case HVCALL_GET_VP_ID_FROM_APIC_ID:
 		goto hypercall_userspace_exit;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
@@ -2916,6 +2922,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_SIGNAL_EVENTS;
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
 			ent->ebx |= HV_ACCESS_VP_REGISTERS;
+			ent->ebx |= HV_START_VIRTUAL_PROCESSOR;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
 			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 57c791c555861..e24b88ec4ec00 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -92,6 +92,7 @@
 #define HV_ACCESS_VSM				BIT(16)
 #define HV_ACCESS_VP_REGISTERS			BIT(17)
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
+#define HV_START_VIRTUAL_PROCESSOR		BIT(21)
 #define HV_ISOLATION				BIT(22)
 
 /*
-- 
2.40.1



* [PATCH 09/18] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

From: Anish Moorthy <amoorthy@google.com>

kvm_prepare_memory_fault_exit() already takes parameters describing the
RWX-ness of the relevant access but doesn't actually do anything with
them. Define and use the flags necessary to pass this information on to
userspace.
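
For reference, the flag construction added below can be mirrored in a
few lines of stand-alone C. The flag values are the ones defined in this
patch; memory_fault_flags() and memory_fault_is_write() are hypothetical
helper names used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined by this patch. */
#define KVM_MEMORY_EXIT_FLAG_READ	(1ULL << 0)
#define KVM_MEMORY_EXIT_FLAG_WRITE	(1ULL << 1)
#define KVM_MEMORY_EXIT_FLAG_EXEC	(1ULL << 2)
#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)

/*
 * Mirrors the logic added to kvm_prepare_memory_fault_exit(): exactly one
 * of READ/WRITE/EXEC is set, with write taking priority over exec, and
 * PRIVATE is OR-ed in independently.
 */
static inline uint64_t memory_fault_flags(bool is_write, bool is_exec,
					  bool is_private)
{
	uint64_t flags = 0;

	if (is_write)
		flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
	else if (is_exec)
		flags |= KVM_MEMORY_EXIT_FLAG_EXEC;
	else
		flags |= KVM_MEMORY_EXIT_FLAG_READ;

	if (is_private)
		flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;

	return flags;
}

/* Hypothetical user-space side: test a single property of the fault. */
static inline bool memory_fault_is_write(uint64_t flags)
{
	return (flags & KVM_MEMORY_EXIT_FLAG_WRITE) != 0;
}
```

Note that the access-type flags are mutually exclusive in this scheme,
so a VMM can switch on them rather than test each bit in turn.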

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst | 5 +++++
 include/linux/kvm_host.h       | 9 ++++++++-
 include/uapi/linux/kvm.h       | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 161a772c23c6a..761b99987cf1a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7014,6 +7014,9 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+  #define KVM_MEMORY_EXIT_FLAG_READ     (1ULL << 0)
+  #define KVM_MEMORY_EXIT_FLAG_WRITE    (1ULL << 1)
+  #define KVM_MEMORY_EXIT_FLAG_EXEC     (1ULL << 2)
   #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
@@ -7025,6 +7028,8 @@ could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
 describes properties of the faulting access that are likely pertinent:
 
+ - KVM_MEMORY_EXIT_FLAG_READ/WRITE/EXEC - When set, indicates that the memory
+   fault occurred on a read/write/exec access respectively.
  - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
    on a private memory access.  When clear, indicates the fault occurred on a
    shared access.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 692c01e41a18e..59f687985ba24 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2397,8 +2397,15 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
 
-	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
+
+	if (is_write)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
+	else if (is_exec)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXEC;
+	else
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ;
+
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f4864e6907e0b..d6d8b17bfa9a7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -434,6 +434,9 @@ struct kvm_run {
 		} notify;
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+#define KVM_MEMORY_EXIT_FLAG_READ       (1ULL << 0)
+#define KVM_MEMORY_EXIT_FLAG_WRITE      (1ULL << 1)
+#define KVM_MEMORY_EXIT_FLAG_EXEC       (1ULL << 2)
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
-- 
2.40.1



* [PATCH 10/18] KVM: x86: Keep track of instruction length during faults
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

Both VMX and SVM provide the length of the instruction
being run at the time of the page fault. Save it within 'struct
kvm_page_fault', as it'll become useful in the future.
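
The two length-selection rules in the hunks below can be summarised
with a small model. It is a sketch only: both helper names are
hypothetical, and the booleans stand in for the X86_FEATURE_HYPERVISOR
check and for the 'insn' pointer passed to the emulator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * EPT misconfig: trust the VMCS instruction length only on bare metal.
 * When running under another hypervisor the field may not be set, so
 * report 0 ("unknown") instead.
 */
static inline uint8_t ept_misconfig_insn_len(bool running_under_hypervisor,
					     uint32_t vmcs_insn_len)
{
	if (running_under_hypervisor)
		return 0;

	return (uint8_t)vmcs_insn_len;
}

/*
 * Emulation path: a non-zero length is only meaningful if instruction
 * bytes were actually provided alongside it.
 */
static inline uint8_t emulator_insn_len(const void *insn, uint8_t insn_len)
{
	return insn ? insn_len : 0;
}
```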

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c          | 11 ++++++++---
 arch/x86/kvm/mmu/mmu_internal.h |  5 ++++-
 arch/x86/kvm/vmx/vmx.c          | 16 ++++++++++++++--
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d74bdef68c1d..39b113afefdfc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4271,7 +4271,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
 		return;
 
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
+	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
+			      true, NULL, 0);
 }
 
 static inline u8 kvm_max_level_for_order(int order)
@@ -5887,7 +5888,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 
 	if (r == RET_PF_INVALID) {
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
-					  &emulation_type);
+					  &emulation_type, insn_len);
 		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
 			return -EIO;
 	}
@@ -5924,8 +5925,12 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
 		emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
 emulate:
+	/*
+	 * x86_emulate_instruction() expects insn to contain data if
+	 * insn_len > 0.
+	 */
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
-				       insn_len);
+				       insn ? insn_len : 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_page_fault);
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ce2fcd19ba6be..a0cde1a0e39b0 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -192,6 +192,7 @@ struct kvm_page_fault {
 	const gpa_t addr;
 	const u64 error_code;
 	const bool prefetch;
+	const u8 insn_len;
 
 	/* Derived from error_code.  */
 	const bool exec;
@@ -288,11 +289,13 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 }
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-					u64 err, bool prefetch, int *emulation_type)
+					u64 err, bool prefetch,
+					int *emulation_type, u8 insn_len)
 {
 	struct kvm_page_fault fault = {
 		.addr = cr2_or_gpa,
 		.error_code = err,
+		.insn_len = insn_len,
 		.exec = err & PFERR_FETCH_MASK,
 		.write = err & PFERR_WRITE_MASK,
 		.present = err & PFERR_PRESENT_MASK,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ac0682fece604..9ba38e0b0c7a8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5807,11 +5807,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL,
+				  vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 {
+	u8 insn_len = 0;
 	gpa_t gpa;
 
 	if (vmx_check_emulate_instruction(vcpu, EMULTYPE_PF, NULL, 0))
@@ -5828,7 +5830,17 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}
 
-	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+	/*
+	 * Using VMCS.VM_EXIT_INSTRUCTION_LEN on EPT misconfig depends on
+	 * undefined behavior: Intel's SDM doesn't mandate the VMCS field be
+	 * set when EPT misconfig occurs.  In practice, real hardware updates
+	 * VM_EXIT_INSTRUCTION_LEN on EPT misconfig, but other hypervisors
+	 * (namely Hyper-V) don't set it due to it being undefined behavior.
+	 */
+	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+		insn_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+
+	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, insn_len);
 }
 
 static int handle_nmi_window(struct kvm_vcpu *vcpu)
-- 
2.40.1



* [PATCH 11/18] KVM: x86: Pass the instruction length on memory fault user-space exits
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

To simplify Hyper-V VSM secure memory intercept generation in
user-space (it avoids the need to implement an x86 instruction decoder
and to perform the actual decoding), pass the length of the instruction
being run at the time of the guest exit as part of the memory fault exit
information.

The presence of this additional information is indicated by a new
capability, KVM_CAP_FAULT_EXIT_INSN_LEN.
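
A VMM could consume the new field along these lines. This is a sketch
under assumptions: struct intercept_hdr_model and make_intercept_hdr()
are hypothetical and do not correspond to a real TLFS or KVM structure.
The only behaviour taken from this patch is that insn_len is 0 when the
hardware did not report a length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an intercept message header. */
struct intercept_hdr_model {
	uint64_t rip;		/* guest RIP at the time of the fault */
	uint8_t insn_len;	/* 0 when the length is unknown */
	bool needs_decode;	/* VMM must decode the instruction itself */
};

/*
 * Build the header from the memory fault exit information. A zero
 * insn_len means the hardware did not expose the length, so the VMM
 * has to fall back to its own decoding.
 */
static inline struct intercept_hdr_model
make_intercept_hdr(uint64_t rip, uint8_t insn_len)
{
	struct intercept_hdr_model hdr = {
		.rip = rip,
		.insn_len = insn_len,
		.needs_decode = (insn_len == 0),
	};

	return hdr;
}
```

User-space would presumably check KVM_CAP_FAULT_EXIT_INSN_LEN first and
treat the field as absent on kernels that do not report the capability.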

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 Documentation/virt/kvm/api.rst  | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 2 +-
 arch/x86/kvm/x86.c              | 1 +
 include/linux/kvm_host.h        | 3 ++-
 include/uapi/linux/kvm.h        | 2 ++
 5 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 761b99987cf1a..18ddea9c4c58a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7021,11 +7021,15 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
+                        __u8 insn_len;
 		} memory_fault;
 
 KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
 could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
-guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+guest physical address range [gpa, gpa + size) of the fault.  The
+'insn_len' field describes the size (in bytes) of the instruction
+that caused the fault. It is only available if the underlying HW exposes that
+information on guest exit, otherwise it's set to 0.  The 'flags' field
 describes properties of the faulting access that are likely pertinent:
 
  - KVM_MEMORY_EXIT_FLAG_READ/WRITE/EXEC - When set, indicates that the memory
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a0cde1a0e39b0..4f5c4c8af9941 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -285,7 +285,7 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 {
 	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
 				      PAGE_SIZE, fault->write, fault->exec,
-				      fault->is_private);
+				      fault->is_private, fault->insn_len);
 }
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6e2312ccb68f..d2b8b74cb48bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4704,6 +4704,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_FAULT_EXIT_INSN_LEN:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 59f687985ba24..4fa16c4772269 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2391,11 +2391,12 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 						 gpa_t gpa, gpa_t size,
 						 bool is_write, bool is_exec,
-						 bool is_private)
+						 bool is_private, u8 insn_len)
 {
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
+	vcpu->run->memory_fault.insn_len = insn_len;
 
 	vcpu->run->memory_fault.flags = 0;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d6d8b17bfa9a7..516d39910f9ab 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -441,6 +441,7 @@ struct kvm_run {
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
+			__u8 insn_len;
 		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
@@ -927,6 +928,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_FAULT_EXIT_INSN_LEN 236
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
-- 
2.40.1



* [PATCH 12/18] KVM: x86/mmu: Introduce infrastructure to handle non-executable mappings
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

The upcoming access restriction KVM memory attributes open the door to
installing non-executable mappings. Introduce a new attribute in struct
kvm_page_fault, map_executable, to control whether the gfn range should
be mapped as executable and make sure it's taken into account when
generating new sptes.
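
The access computation shared by both call sites boils down to the
following. The ACC_* values here are illustrative assumptions for the
sketch and should be checked against arch/x86/kvm/mmu; fault_access()
is a hypothetical name.

```c
#include <stdbool.h>

/* Illustrative values, assumed for this sketch. */
#define ACC_EXEC_MASK	1u
#define ACC_WRITE_MASK	2u
#define ACC_USER_MASK	4u
#define ACC_ALL		(ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/*
 * Start from full access and strip the exec bit when the fault asks for
 * a non-executable mapping. With map_executable defaulting to true the
 * result is ACC_ALL, hence no functional change.
 */
static inline unsigned int fault_access(bool map_executable)
{
	unsigned int access = ACC_ALL;

	if (!map_executable)
		access &= ~ACC_EXEC_MASK;

	return access;
}
```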

No functional change intended.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c          | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 8 ++++++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 39b113afefdfc..b0c210b96419f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3197,6 +3197,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_shadow_walk_iterator it;
+	unsigned int access = ACC_ALL;
 	struct kvm_mmu_page *sp;
 	int ret;
 	gfn_t base_gfn = fault->gfn;
@@ -3229,7 +3230,10 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
 		return -EFAULT;
 
-	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
+	if (!fault->map_executable)
+		access &= ~ACC_EXEC_MASK;
+
+	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, access,
 			   base_gfn, fault->pfn, fault);
 	if (ret == RET_PF_SPURIOUS)
 		return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 4f5c4c8af9941..af0c3a154ed89 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -241,6 +241,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool map_executable;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
@@ -313,6 +314,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 		.pfn = KVM_PFN_ERR_FAULT,
 		.hva = KVM_HVA_ERR_BAD,
+		.map_executable = true,
 	};
 	int r;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36539c1b36cd6..344781981999a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1018,6 +1018,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 					  struct tdp_iter *iter)
 {
 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
+	unsigned int access = ACC_ALL;
 	u64 new_spte;
 	int ret = RET_PF_FIXED;
 	bool wrprot = false;
@@ -1025,10 +1026,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
 		return RET_PF_RETRY;
 
+	if (!fault->map_executable)
+		access &= ~ACC_EXEC_MASK;
+
 	if (unlikely(!fault->slot))
-		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+		new_spte = make_mmio_spte(vcpu, iter->gfn, access);
 	else
-		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
+		wrprot = make_spte(vcpu, sp, fault->slot, access, iter->gfn,
 					 fault->pfn, iter->old_spte, fault->prefetch, true,
 					 fault->map_writable, &new_spte);
 
-- 
2.40.1



* [PATCH 13/18] KVM: x86/mmu: Avoid warning when installing non-private memory attributes
From: Nicolas Saenz Julienne @ 2024-06-09 15:49 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: pbonzini, seanjc, vkuznets, linux-doc, linux-hyperv, linux-arch,
	linux-trace-kernel, graf, dwmw2, paul, nsaenz, mlevitsk, jgowans,
	corbet, decui, tglx, mingo, bp, dave.hansen, x86, amoorthy
In-Reply-To: <20240609154945.55332-1-nsaenz@amazon.com>

In preparation for introducing RWX memory attributes, warn on systems
without private memory support only when user space is actually trying
to install KVM_MEMORY_ATTRIBUTE_PRIVATE. To make that check possible
in the pre-set handler, pass the requested attributes to it as well.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
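For readers skimming the diff, the change to the warning condition can be
sketched as a standalone predicate. This is only an illustration, not the
kernel code: the ATTR_* bit values, warns_before(), warns_after() and
check_examples() are hypothetical names made up for the sketch, and the
has_private_mem flag stands in for kvm_arch_has_private_mem(kvm).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not taken from the kernel headers. */
#define ATTR_READ    (1UL << 0)
#define ATTR_WRITE   (1UL << 1)
#define ATTR_EXEC    (1UL << 2)
#define ATTR_PRIVATE (1UL << 3)

/* Old condition: warns whenever the VM lacks private memory support. */
bool warns_before(unsigned long attrs, bool has_private_mem)
{
	(void)attrs;	/* the requested attributes were not consulted */
	return !has_private_mem;
}

/* New condition: warns only if PRIVATE is requested without support. */
bool warns_after(unsigned long attrs, bool has_private_mem)
{
	bool priv_attr = attrs & ATTR_PRIVATE;

	return priv_attr && !has_private_mem;
}

/* Walks through the cases that matter for the RWX use case. */
void check_examples(void)
{
	/* RWX-only request on a VM without private memory: no warning now. */
	assert(warns_before(ATTR_READ | ATTR_EXEC, false));
	assert(!warns_after(ATTR_READ | ATTR_EXEC, false));

	/* PRIVATE request without private memory support: still warns. */
	assert(warns_after(ATTR_PRIVATE, false));

	/* PRIVATE request with private memory support: never warned. */
	assert(!warns_before(ATTR_PRIVATE, true));
	assert(!warns_after(ATTR_PRIVATE, true));
}
```

The only case whose behaviour differs is a request without the PRIVATE bit
on a VM that has no private memory support, which is the case the RWX
attributes are expected to hit.
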
 arch/x86/kvm/mmu/mmu.c | 8 ++++++--
 virt/kvm/kvm_main.c    | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0c210b96419f..d56c04fbdc66b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7359,6 +7359,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range)
 {
+	unsigned long attrs = range->arg.attributes;
+	bool priv_attr = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
 	/*
 	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
 	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
@@ -7370,7 +7373,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
 	 * a hugepage can be used for affected ranges.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(priv_attr && !kvm_arch_has_private_mem(kvm)))
 		return false;
 
 	return kvm_unmap_gfn_range(kvm, range);
@@ -7415,6 +7418,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range)
 {
 	unsigned long attrs = range->arg.attributes;
+	bool priv_attr = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 	struct kvm_memory_slot *slot = range->slot;
 	int level;
 
@@ -7427,7 +7431,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * a range that has PRIVATE GFNs, and conversely converting a range to
 	 * SHARED may now allow hugepages.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(priv_attr && !kvm_arch_has_private_mem(kvm)))
 		return false;
 
 	/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14841acb8b959..63c4b6739edee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2506,6 +2506,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	struct kvm_mmu_notifier_range pre_set_range = {
 		.start = start,
 		.end = end,
+		.arg.attributes = attributes,
 		.handler = kvm_pre_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_begin,
 		.flush_on_ret = true,
-- 
2.40.1


