Linux-HyperV List
* Re: [PATCH v1 4/4] page_reporting: change PAGE_REPORTING_DEFAULT_ORDER to -1
From: Yuvraj Sakshith @ 2026-03-02  8:00 UTC
  To: David Hildenbrand (Arm), Michael Kelley
  Cc: Michael Kelley, akpm@linux-foundation.org, mst@redhat.com,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	ziy@nvidia.com, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <b1390b24-eaef-40e0-a16b-77c27decb77e@kernel.org>

On Mon, Mar 02, 2026 at 08:42:57AM +0100, David Hildenbrand (Arm) wrote:
> On 3/2/26 06:25, Michael Kelley wrote:
> > From: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Sent: Sunday, March 1, 2026 7:33 PM
> >>
> >> On Fri, Feb 27, 2026 at 09:50:15PM +0100, David Hildenbrand (Arm) wrote:
> >>>
> >>> No need for the ().
> >>>
> >>> Wondering whether we now also want to do in this patch:
> >>>
> >>>
> >>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> >>> index f0042d5743af..d432aadf9d07 100644
> >>> --- a/mm/page_reporting.c
> >>> +++ b/mm/page_reporting.c
> >>> @@ -11,8 +11,7 @@
> >>>  #include "page_reporting.h"
> >>>  #include "internal.h"
> >>>
> >>> -/* Initialize to an unsupported value */
> >>> -unsigned int page_reporting_order = -1;
> >>> +unsigned int page_reporting_order = PAGE_REPORTING_DEFAULT_ORDER;
> >>>
> >>>  static int page_order_update_notify(const char *val, const struct
> >>> kernel_param *kp)
> >>>  {
> >>> @@ -369,7 +368,7 @@ int page_reporting_register(struct
> >>> page_reporting_dev_info *prdev)
> >>>          * pageblock_order.
> >>>          */
> >>>
> >>> -       if (page_reporting_order == -1) {
> >>> +       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
> >>>
> >>>
> >>
> >> Sure. Now that I think of it, don’t you think the first nested if() will
> >> always be false? and can be compressed down to just one if()?
> > 
> > I don't think what you propose is correct. The purpose of testing
> > page_reporting_order for -1 is to see if a page reporting order has
> > been specified on the kernel boot line. If it has been specified, then
> > the page reporting order specified in the call to page_reporting_register()
> > [either a specific value or the default] is ignored and the kernel boot
> > line value prevails. But if page_reporting_order is -1 here, then
> > no kernel boot line value was specified, and the value passed to
> > page_reporting_register() should prevail.
> > 
> > With this in mind, substituting PAGE_REPORTING_DEFAULT_ORDER
> > for the -1 in the test doesn’t exactly make sense to me. The -1 in the
> > test doesn't have quite the same meaning as the -1 for
> > PAGE_REPORTING_DEFAULT_ORDER. You could even use -2 for
> > the initial value of page_reporting_order, and here in the test, in
> > order to make that distinction obvious. Or use a separate symbolic
> > name like PAGE_REPORTING_ORDER_NOT_SET.
> 
Option 1:

if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
        if (page_reporting_order != PAGE_REPORTING_DEFAULT_ORDER
                && prdev->order <= MAX_PAGE_ORDER) {
                page_reporting_order = prdev->order;
        } else {
                page_reporting_order = pageblock_order;
        }
}

Option 2:

if (page_reporting_order == PAGE_REPORTING_ORDER_NOT_SET) {
        if (page_reporting_order != PAGE_REPORTING_DEFAULT_ORDER
                && prdev->order <= MAX_PAGE_ORDER) {
                page_reporting_order = prdev->order;
        } else {
                page_reporting_order = pageblock_order;
        }
}


> I don't really see a difference between "PAGE_REPORTING_DEFAULT_ORDER"
> and "PAGE_REPORTING_ORDER_NOT_SET" that would warrant a split and adding
> confusion for the page-reporting drivers.
> 
> In both cases, we want "no special requirement, just use the default".
> Maybe we can use a better name to express that.

Agreed.

If we were to read this code without context, wouldn't it be confusing
why PAGE_REPORTING_DEFAULT_ORDER is being checked in the first place?

Option 1 checks whether page_reporting_order is equal to PAGE_REPORTING_DEFAULT_ORDER
and then immediately checks whether it is not equal to it, which is a bit confusing.

Moreover, page_reporting_order can be set from two places: the command line and
the driver itself. So PAGE_REPORTING_ORDER_NOT_SET can indicate whether it was set by the cmdline,
and PAGE_REPORTING_DEFAULT_ORDER can be used exclusively by drivers to "tell" page-reporting
to select the default value for them.

I think what Michael is pointing out is the precedence of the cmdline option over the driver's
request.

This is not obvious to the reader if we choose to have only one flag, IMO :)
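
To make the two-flag idea concrete, here is a minimal userspace sketch, not kernel code. It assumes -2 for PAGE_REPORTING_ORDER_NOT_SET (the value Michael floated), and MAX_PAGE_ORDER and PAGEBLOCK_ORDER are made-up stand-ins. pick_order() and demo_pick_order() are hypothetical names used only for illustration.

```c
#include <assert.h>

#define MAX_PAGE_ORDER  10u  /* assumed value, for this model only */
#define PAGEBLOCK_ORDER  9u  /* stand-in for pageblock_order */

/* Driver-facing: "no special requirement, use the default". */
#define PAGE_REPORTING_DEFAULT_ORDER ((unsigned int)-1)
/* Hypothetical second sentinel: nothing was given on the cmdline. */
#define PAGE_REPORTING_ORDER_NOT_SET ((unsigned int)-2)

/* Resolve the effective order; a cmdline value always wins. */
static unsigned int pick_order(unsigned int cmdline_order,
			       unsigned int drv_order)
{
	if (cmdline_order != PAGE_REPORTING_ORDER_NOT_SET)
		return cmdline_order;

	if (drv_order != PAGE_REPORTING_DEFAULT_ORDER &&
	    drv_order <= MAX_PAGE_ORDER)
		return drv_order;

	return PAGEBLOCK_ORDER;
}

/* Small self-check covering the three cases discussed in the thread. */
static void demo_pick_order(void)
{
	/* No cmdline value, driver has no preference: fall back. */
	assert(pick_order(PAGE_REPORTING_ORDER_NOT_SET,
			  PAGE_REPORTING_DEFAULT_ORDER) == PAGEBLOCK_ORDER);
	/* No cmdline value, driver asks for order 0: order 0 is honoured. */
	assert(pick_order(PAGE_REPORTING_ORDER_NOT_SET, 0) == 0);
	/* Cmdline asked for order 4: the driver's request is ignored. */
	assert(pick_order(4, 0) == 4);
}
```

With separate sentinels, each comparison in pick_order() answers exactly one question ("was a cmdline value given?" versus "did the driver ask for something specific?"), which is the distinction I am trying to keep visible.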

Thanks,
Yuvraj
 
> -- 
> Cheers,
> 
> David


* Re: [PATCH v1 4/4] page_reporting: change PAGE_REPORTING_DEFAULT_ORDER to -1
From: David Hildenbrand (Arm) @ 2026-03-02  7:42 UTC
  To: Michael Kelley, Yuvraj Sakshith
  Cc: akpm@linux-foundation.org, mst@redhat.com, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	longli@microsoft.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	ziy@nvidia.com, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D37B3D254251F0135B04D47EA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 3/2/26 06:25, Michael Kelley wrote:
> From: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Sent: Sunday, March 1, 2026 7:33 PM
>>
>> On Fri, Feb 27, 2026 at 09:50:15PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> No need for the ().
>>>
>>> Wondering whether we now also want to do in this patch:
>>>
>>>
>>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>>> index f0042d5743af..d432aadf9d07 100644
>>> --- a/mm/page_reporting.c
>>> +++ b/mm/page_reporting.c
>>> @@ -11,8 +11,7 @@
>>>  #include "page_reporting.h"
>>>  #include "internal.h"
>>>
>>> -/* Initialize to an unsupported value */
>>> -unsigned int page_reporting_order = -1;
>>> +unsigned int page_reporting_order = PAGE_REPORTING_DEFAULT_ORDER;
>>>
>>>  static int page_order_update_notify(const char *val, const struct
>>> kernel_param *kp)
>>>  {
>>> @@ -369,7 +368,7 @@ int page_reporting_register(struct
>>> page_reporting_dev_info *prdev)
>>>          * pageblock_order.
>>>          */
>>>
>>> -       if (page_reporting_order == -1) {
>>> +       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
>>>
>>>
>>
>> Sure. Now that I think of it, don’t you think the first nested if() will
>> always be false? and can be compressed down to just one if()?
> 
> I don't think what you propose is correct. The purpose of testing
> page_reporting_order for -1 is to see if a page reporting order has
> been specified on the kernel boot line. If it has been specified, then
> the page reporting order specified in the call to page_reporting_register()
> [either a specific value or the default] is ignored and the kernel boot
> line value prevails. But if page_reporting_order is -1 here, then
> no kernel boot line value was specified, and the value passed to
> page_reporting_register() should prevail.
> 
> With this in mind, substituting PAGE_REPORTING_DEFAULT_ORDER
> for the -1 in the test doesn’t exactly make sense to me. The -1 in the
> test doesn't have quite the same meaning as the -1 for
> PAGE_REPORTING_DEFAULT_ORDER. You could even use -2 for
> the initial value of page_reporting_order, and here in the test, in
> order to make that distinction obvious. Or use a separate symbolic
> name like PAGE_REPORTING_ORDER_NOT_SET.

I don't really see a difference between "PAGE_REPORTING_DEFAULT_ORDER"
and "PAGE_REPORTING_ORDER_NOT_SET" that would warrant a split and adding
confusion for the page-reporting drivers.

In both cases, we want "no special requirement, just use the default".
Maybe we can use a better name to express that.

-- 
Cheers,

David


* RE: [PATCH v1 4/4] page_reporting: change PAGE_REPORTING_DEFAULT_ORDER to -1
From: Michael Kelley @ 2026-03-02  5:25 UTC
  To: Yuvraj Sakshith, David Hildenbrand (Arm)
  Cc: akpm@linux-foundation.org, mst@redhat.com, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	longli@microsoft.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	ziy@nvidia.com, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <aaUE7M9QkfnYh12e@hu-ysakshit-lv.qualcomm.com>

From: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Sent: Sunday, March 1, 2026 7:33 PM
> 
> On Fri, Feb 27, 2026 at 09:50:15PM +0100, David Hildenbrand (Arm) wrote:
> > On 2/27/26 15:06, Yuvraj Sakshith wrote:
> > > PAGE_REPORTING_DEFAULT_ORDER is now set to zero. This means,
> > > pages of order zero cannot be reported to a client/driver -- as zero
> > > is used to signal a fallback to MAX_PAGE_ORDER.
> > >
> > > Change PAGE_REPORTING_DEFAULT_ORDER to (-1),
> > > so that zero can be used as a valid order with which pages can
> > > be reported.
> > >
> > > Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com>
> > > ---
> > >  include/linux/page_reporting.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> > > index a7e3e30f2..3eb3e26d8 100644
> > > --- a/include/linux/page_reporting.h
> > > +++ b/include/linux/page_reporting.h
> > > @@ -7,7 +7,7 @@
> > >
> > >  /* This value should always be a power of 2, see page_reporting_cycle() */
> > >  #define PAGE_REPORTING_CAPACITY		32
> > > -#define PAGE_REPORTING_DEFAULT_ORDER	0
> > > +#define PAGE_REPORTING_DEFAULT_ORDER	(-1)
> >
> > No need for the ().
> >
> > Wondering whether we now also want to do in this patch:
> >
> >
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > index f0042d5743af..d432aadf9d07 100644
> > --- a/mm/page_reporting.c
> > +++ b/mm/page_reporting.c
> > @@ -11,8 +11,7 @@
> >  #include "page_reporting.h"
> >  #include "internal.h"
> >
> > -/* Initialize to an unsupported value */
> > -unsigned int page_reporting_order = -1;
> > +unsigned int page_reporting_order = PAGE_REPORTING_DEFAULT_ORDER;
> >
> >  static int page_order_update_notify(const char *val, const struct
> > kernel_param *kp)
> >  {
> > @@ -369,7 +368,7 @@ int page_reporting_register(struct
> > page_reporting_dev_info *prdev)
> >          * pageblock_order.
> >          */
> >
> > -       if (page_reporting_order == -1) {
> > +       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
> >
> >
> 
> Sure. Now that I think of it, don’t you think the first nested if() will
> always be false? and can be compressed down to just one if()?

I don't think what you propose is correct. The purpose of testing
page_reporting_order for -1 is to see if a page reporting order has
been specified on the kernel boot line. If it has been specified, then
the page reporting order specified in the call to page_reporting_register()
[either a specific value or the default] is ignored and the kernel boot
line value prevails. But if page_reporting_order is -1 here, then
no kernel boot line value was specified, and the value passed to
page_reporting_register() should prevail.

With this in mind, substituting PAGE_REPORTING_DEFAULT_ORDER
for the -1 in the test doesn’t exactly make sense to me. The -1 in the
test doesn't have quite the same meaning as the -1 for
PAGE_REPORTING_DEFAULT_ORDER. You could even use -2 for
the initial value of page_reporting_order, and here in the test, in
order to make that distinction obvious. Or use a separate symbolic
name like PAGE_REPORTING_ORDER_NOT_SET.
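
As a rough illustration of the precedence described above, here is a small userspace model, not the kernel code itself. MAX_PAGE_ORDER and PAGEBLOCK_ORDER are assumed stand-in values, and resolve_nested(), resolve_flat() and demo_precedence() are hypothetical names. The first function mirrors the existing nested test, the second mirrors the flattened version proposed in the quoted diff.

```c
#include <assert.h>

#define MAX_PAGE_ORDER  10u  /* assumed value, for this model only */
#define PAGEBLOCK_ORDER  9u  /* stand-in for pageblock_order */
#define PAGE_REPORTING_DEFAULT_ORDER ((unsigned int)-1)

/*
 * Nested form: page_reporting_order is only overwritten when it still
 * holds its initial -1, i.e. when nothing was given on the boot line.
 */
static unsigned int resolve_nested(unsigned int cmdline_order,
				   unsigned int drv_order)
{
	unsigned int order = cmdline_order;

	if (order == (unsigned int)-1) {
		if (drv_order != PAGE_REPORTING_DEFAULT_ORDER &&
		    drv_order <= MAX_PAGE_ORDER)
			order = drv_order;
		else
			order = PAGEBLOCK_ORDER;
	}
	return order;
}

/* Flattened form: the boot-line value is never consulted. */
static unsigned int resolve_flat(unsigned int cmdline_order,
				 unsigned int drv_order)
{
	unsigned int order = PAGEBLOCK_ORDER;

	(void)cmdline_order;
	if (drv_order != PAGE_REPORTING_DEFAULT_ORDER &&
	    drv_order <= MAX_PAGE_ORDER)
		order = drv_order;
	return order;
}

/* Boot line asks for order 4, driver asks for order 2. */
static void demo_precedence(void)
{
	assert(resolve_nested(4, 2) == 4); /* boot line prevails */
	assert(resolve_flat(4, 2) == 2);   /* boot line is lost */
}
```

In this model the two forms only disagree when a boot-line value was given, which is exactly the case the outer test exists to protect.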

Michael Kelley

> 
> -       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
> -               if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
> -                       prdev->order <= MAX_PAGE_ORDER)
> -                       page_reporting_order = prdev->order;
> -               else
> -                       page_reporting_order = pageblock_order;
> -       }
> +       page_reporting_order = pageblock_order;
> +
> +       if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
> +               prdev->order <= MAX_PAGE_ORDER)
> +               page_reporting_order = prdev->order;
> 
> Thanks,
> Yuvraj
> 
> >
> > (and wondering whether we should have called it
> > PAGE_REPORTING_USE_DEFAULT_ORDER to make it clearer that it is not an
> > actual order. Leaving that up to you :) )
> >
> > --
> > Cheers,
> >
> > David



* Re: [PATCH v1 4/4] page_reporting: change PAGE_REPORTING_DEFAULT_ORDER to -1
From: Yuvraj Sakshith @ 2026-03-02  3:33 UTC
  To: David Hildenbrand (Arm)
  Cc: akpm, mst, kys, haiyangz, wei.liu, decui, longli, jasowang,
	xuanzhuo, eperezma, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, jackmanb, hannes, ziy, linux-hyperv,
	virtualization, linux-mm, linux-kernel
In-Reply-To: <c618e7a4-42c1-4438-9bc2-9c41450a81a2@kernel.org>

On Fri, Feb 27, 2026 at 09:50:15PM +0100, David Hildenbrand (Arm) wrote:
> On 2/27/26 15:06, Yuvraj Sakshith wrote:
> > PAGE_REPORTING_DEFAULT_ORDER is now set to zero. This means,
> > pages of order zero cannot be reported to a client/driver -- as zero
> > is used to signal a fallback to MAX_PAGE_ORDER.
> > 
> > Change PAGE_REPORTING_DEFAULT_ORDER to (-1),
> > so that zero can be used as a valid order with which pages can
> > be reported.
> > 
> > Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com>
> > ---
> >  include/linux/page_reporting.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> > index a7e3e30f2..3eb3e26d8 100644
> > --- a/include/linux/page_reporting.h
> > +++ b/include/linux/page_reporting.h
> > @@ -7,7 +7,7 @@
> >  
> >  /* This value should always be a power of 2, see page_reporting_cycle() */
> >  #define PAGE_REPORTING_CAPACITY		32
> > -#define PAGE_REPORTING_DEFAULT_ORDER	0
> > +#define PAGE_REPORTING_DEFAULT_ORDER	(-1)
> 
> No need for the ().
> 
> Wondering whether we now also want to do in this patch:
> 
> 
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index f0042d5743af..d432aadf9d07 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -11,8 +11,7 @@
>  #include "page_reporting.h"
>  #include "internal.h"
> 
> -/* Initialize to an unsupported value */
> -unsigned int page_reporting_order = -1;
> +unsigned int page_reporting_order = PAGE_REPORTING_DEFAULT_ORDER;
> 
>  static int page_order_update_notify(const char *val, const struct
> kernel_param *kp)
>  {
> @@ -369,7 +368,7 @@ int page_reporting_register(struct
> page_reporting_dev_info *prdev)
>          * pageblock_order.
>          */
> 
> -       if (page_reporting_order == -1) {
> +       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
> 
> 

Sure. Now that I think of it, don’t you think the first nested if() will
always be false? and can be compressed down to just one if()?

-       if (page_reporting_order == PAGE_REPORTING_DEFAULT_ORDER) {
-               if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
-                       prdev->order <= MAX_PAGE_ORDER)
-                       page_reporting_order = prdev->order;
-               else
-                       page_reporting_order = pageblock_order;
-       }
+       page_reporting_order = pageblock_order;
+
+       if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
+               prdev->order <= MAX_PAGE_ORDER)
+               page_reporting_order = prdev->order;

Thanks,
Yuvraj

> 
> (and wondering whether we should have called it
> PAGE_REPORTING_USE_DEFAULT_ORDER to make it clearer that it is not an
> actual order. Leaving that up to you :) )
> 
> -- 
> Cheers,
> 
> David


* [PATCH v2] mshv: Introduce tracing support
From: Stanislav Kinsburskii @ 2026-03-01 17:39 UTC
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Introduce various trace events and use them in the corresponding places
in the driver.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Makefile            |    1 
 drivers/hv/mshv_eventfd.c      |   14 +
 drivers/hv/mshv_irq.c          |    4 
 drivers/hv/mshv_root.h         |    1 
 drivers/hv/mshv_root_hv_call.c |   22 +-
 drivers/hv/mshv_root_main.c    |   78 +++++-
 drivers/hv/mshv_trace.c        |    9 +
 drivers/hv/mshv_trace.h        |  515 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 629 insertions(+), 15 deletions(-)
 create mode 100644 drivers/hv/mshv_trace.c
 create mode 100644 drivers/hv/mshv_trace.h
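
Reading aid for the hv_call_withdraw_memory() hunk: the loop switches from counting down "remaining" to counting up "withdrawn", so the total is available to report once the loop ends. The following is a simplified userspace model of that pattern only. fake_withdraw() is a hypothetical stand-in for the hypercall, BATCH_SIZE stands in for HV_WITHDRAW_BATCH_SIZE, and the early-exit condition is an assumption, since the real one is outside the hunk context.

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_SIZE 4u /* stand-in for HV_WITHDRAW_BATCH_SIZE */

/* Hypothetical hypercall stand-in: completes at most *avail pages. */
static uint64_t fake_withdraw(uint64_t want, uint64_t *avail)
{
	uint64_t done = want < *avail ? want : *avail;

	*avail -= done;
	return done;
}

/* Withdraw up to count pages in batches; return the running total. */
static uint64_t withdraw_all(uint64_t count, uint64_t avail)
{
	uint64_t withdrawn = 0;

	while (withdrawn < count) {
		uint64_t left = count - withdrawn;
		uint64_t batch = left < BATCH_SIZE ? left : BATCH_SIZE;
		uint64_t completed = fake_withdraw(batch, &avail);

		/* As in the hunk, the break comes before the accumulation. */
		if (completed < batch)
			break;

		withdrawn += completed;
	}

	/* This total is what a tracepoint placed after the loop can log. */
	return withdrawn;
}
```

In this model a short batch ends the loop without being added to the total, so the returned value counts only fully completed batches.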

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 2593711c3628..888a748cc7cb 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -16,6 +16,7 @@ hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
 	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
 mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
+mshv_root-$(CONFIG_TRACEPOINTS) += mshv_trace.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 492c6258045c..d2efe248ca9b 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -733,6 +733,14 @@ static int mshv_assign_ioeventfd(struct mshv_partition *pt,
 	ret = mshv_register_doorbell(pt->pt_id, ioeventfd_mmio_write,
 				     (void *)pt, p->iovntfd_addr,
 				     p->iovntfd_datamatch, doorbell_flags);
+
+	trace_mshv_assign_ioeventfd(pt->pt_id, p->iovntfd_addr,
+				    p->iovntfd_length,
+				    p->iovntfd_datamatch,
+				    p->iovntfd_wildcard,
+				    p->iovntfd_eventfd,
+				    ret);
+
 	if (ret < 0)
 		goto unlock_fail;
 
@@ -780,6 +788,12 @@ static int mshv_deassign_ioeventfd(struct mshv_partition *pt,
 		    p->iovntfd_datamatch != args->datamatch)
 			continue;
 
+		trace_mshv_deassign_ioeventfd(pt->pt_id, p->iovntfd_addr,
+					      p->iovntfd_length,
+					      p->iovntfd_datamatch,
+					      p->iovntfd_wildcard,
+					      p->iovntfd_eventfd);
+
 		hlist_del_rcu(&p->iovntfd_hnode);
 		synchronize_rcu();
 		ioeventfd_release(p, pt->pt_id);
diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index 798e7e1ab06e..aba7d3c431b8 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -71,6 +71,10 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 	mutex_unlock(&partition->pt_irq_lock);
 
 	synchronize_srcu_expedited(&partition->pt_irq_srcu);
+
+	trace_mshv_update_routing_table(partition->pt_id,
+					old, new, numents);
+
 	new = old;
 
 out:
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 04c2a1910a8a..947dfb76bb19 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -17,6 +17,7 @@
 #include <linux/build_bug.h>
 #include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
+#include "mshv_trace.h"
 
 /*
  * Hypervisor must be between these version numbers (inclusive)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 317191462b63..bdcb8de7fb47 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -44,8 +44,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 	struct hv_output_withdraw_memory *output_page;
 	struct page *page;
 	u16 completed;
-	unsigned long remaining = count;
-	u64 status;
+	u64 status, withdrawn = 0;
 	int i;
 	unsigned long flags;
 
@@ -54,7 +53,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 		return -ENOMEM;
 	output_page = page_address(page);
 
-	while (remaining) {
+	while (withdrawn < count) {
 		local_irq_save(flags);
 
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -62,7 +61,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 		memset(input_page, 0, sizeof(*input_page));
 		input_page->partition_id = partition_id;
 		status = hv_do_rep_hypercall(HVCALL_WITHDRAW_MEMORY,
-					     min(remaining, HV_WITHDRAW_BATCH_SIZE),
+					     min(count - withdrawn, HV_WITHDRAW_BATCH_SIZE),
 					     0, input_page, output_page);
 
 		local_irq_restore(flags);
@@ -78,10 +77,12 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 			break;
 		}
 
-		remaining -= completed;
+		withdrawn += completed;
 	}
 	free_page((unsigned long)output_page);
 
+	trace_mshv_hvcall_withdraw_memory(partition_id, withdrawn, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -125,6 +126,8 @@ int hv_call_create_partition(u64 flags,
 		ret = hv_deposit_memory(hv_current_partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_create_partition(flags, ret ? ret : *partition_id);
+
 	return ret;
 }
 
@@ -152,6 +155,8 @@ int hv_call_initialize_partition(u64 partition_id)
 		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_initialize_partition(partition_id, status);
+
 	return ret;
 }
 
@@ -164,6 +169,8 @@ int hv_call_finalize_partition(u64 partition_id)
 	status = hv_do_fast_hypercall8(HVCALL_FINALIZE_PARTITION,
 				       *(u64 *)&input);
 
+	trace_mshv_hvcall_finalize_partition(partition_id, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -175,6 +182,8 @@ int hv_call_delete_partition(u64 partition_id)
 	input.partition_id = partition_id;
 	status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64 *)&input);
 
+	trace_mshv_hvcall_delete_partition(partition_id, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -571,6 +580,9 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
 		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_map_vp_state_page(partition_id, vp_index,
+					    type, status);
+
 	return ret;
 }
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e6509c980763..d753f41d3b57 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -430,6 +430,17 @@ mshv_vp_dispatch(struct mshv_vp *vp, u32 flags,
 	status = hv_do_hypercall(HVCALL_DISPATCH_VP, input, output);
 	vp->run.flags.root_sched_dispatched = 0;
 
+	trace_mshv_hvcall_dispatch_vp(vp->vp_partition->pt_id,
+				      vp->vp_index, flags,
+				      output->dispatch_state,
+				      output->dispatch_event,
+#if defined(CONFIG_X86_64)
+				      vp->vp_register_page->interrupt_vectors.as_uint64,
+#else
+				      0,
+#endif
+				      status);
+
 	*res = *output;
 	preempt_enable();
 
@@ -452,6 +463,9 @@ mshv_vp_clear_explicit_suspend(struct mshv_vp *vp)
 	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
 				    1, &explicit_suspend);
 
+	trace_mshv_vp_clear_explicit_suspend(vp->vp_partition->pt_id,
+					     vp->vp_index, ret);
+
 	if (ret)
 		vp_err(vp, "Failed to unsuspend\n");
 
@@ -494,6 +508,12 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
 	if (ret)
 		return -EINTR;
 
+	trace_mshv_vp_wait_for_hv_kick(vp->vp_partition->pt_id,
+				       vp->vp_index,
+				       vp->run.kicked_by_hv,
+				       mshv_vp_dispatch_thread_blocked(vp),
+				       mshv_vp_interrupt_pending(vp));
+
 	vp->run.flags.root_sched_blocked = 0;
 	vp->run.kicked_by_hv = 0;
 
@@ -522,6 +542,12 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 
 		if (__xfer_to_guest_mode_work_pending()) {
 			ret = xfer_to_guest_mode_handle_work();
+
+			trace_mshv_xfer_to_guest_mode_work(vp->vp_partition->pt_id,
+							   vp->vp_index,
+							   read_thread_flags(),
+							   ret);
+
 			if (ret)
 				break;
 		}
@@ -673,6 +699,8 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
+	trace_mshv_run_vp_entry(vp->vp_partition->pt_id, vp->vp_index);
+
 	do {
 		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
 			rc = mshv_run_vp_with_root_scheduler(vp);
@@ -680,6 +708,10 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 			rc = mshv_run_vp_with_hyp_scheduler(vp);
 	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
+	trace_mshv_run_vp_exit(vp->vp_partition->pt_id, vp->vp_index,
+			       vp->vp_intercept_msg_page->header.message_type,
+			       rc);
+
 	if (rc)
 		return rc;
 
@@ -941,6 +973,8 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 {
 	struct mshv_vp *vp = filp->private_data;
 
+	trace_mshv_vp_release(vp->vp_partition->pt_id, vp->vp_index);
+
 	/* Rest of VP cleanup happens in destroy_partition() */
 	mshv_partition_put(vp->vp_partition);
 	return 0;
@@ -1113,7 +1147,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	partition->pt_vp_count++;
 	partition->pt_vp_array[args.vp_index] = vp;
 
-	return ret;
+	goto out;
 
 remove_debugfs_vp:
 	mshv_debugfs_vp_remove(vp);
@@ -1139,6 +1173,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 			       intercept_msg_page, input_vtl_zero);
 destroy_vp:
 	hv_call_delete_vp(partition->pt_id, args.vp_index);
+out:
+	trace_mshv_create_vp(partition->pt_id, args.vp_index, ret);
 	return ret;
 }
 
@@ -1338,6 +1374,10 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		break;
 	}
 
+	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
+				   region->start_gfn, region->nr_pages,
+				   region->hv_map_flags, ret);
+
 	if (ret)
 		goto errout;
 
@@ -1633,6 +1673,9 @@ disable_vp_dispatch(struct mshv_vp *vp)
 	if (ret)
 		vp_err(vp, "failed to suspend\n");
 
+	trace_mshv_disable_vp_dispatch(vp->vp_partition->pt_id,
+				       vp->vp_index, ret);
+
 	return ret;
 }
 
@@ -1681,6 +1724,8 @@ drain_vp_signals(struct mshv_vp *vp)
 		vp->run.kicked_by_hv = 0;
 		vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
 	}
+
+	trace_mshv_drain_vp_signals(vp->vp_partition->pt_id, vp->vp_index);
 }
 
 static void drain_all_vps(const struct mshv_partition *partition)
@@ -1734,6 +1779,8 @@ static void destroy_partition(struct mshv_partition *partition)
 		return;
 	}
 
+	trace_mshv_destroy_partition(partition->pt_id);
+
 	if (partition->pt_initialized) {
 		/*
 		 * We only need to drain signals for root scheduler. This should be
@@ -1840,6 +1887,8 @@ mshv_partition_release(struct inode *inode, struct file *filp)
 {
 	struct mshv_partition *partition = filp->private_data;
 
+	trace_mshv_partition_release(partition->pt_id);
+
 	mshv_eventfd_release(partition);
 
 	cleanup_srcu_struct(&partition->pt_irq_srcu);
@@ -1969,6 +2018,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	struct hv_partition_creation_properties creation_properties;
 	union hv_partition_isolation_properties isolation_properties;
 	struct mshv_partition *partition;
+	u64 pt_id = -1;
 	long ret;
 
 	ret = mshv_ioctl_process_pt_flags(user_arg, &creation_flags,
@@ -2008,22 +2058,29 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	ret = hv_call_create_partition(creation_flags,
 				       creation_properties,
 				       isolation_properties,
-				       &partition->pt_id);
+				       &pt_id);
 	if (ret)
 		goto cleanup_irq_srcu;
 
+	partition->pt_id = pt_id;
+
 	ret = add_partition(partition);
 	if (ret)
 		goto delete_partition;
 
 	ret = mshv_init_async_handler(partition);
-	if (!ret) {
-		ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
-							   &mshv_partition_fops,
-							   partition, O_RDWR));
-		if (ret >= 0)
-			return ret;
-	}
+	if (ret)
+		goto remove_partition;
+
+	ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
+						   &mshv_partition_fops,
+						   partition, O_RDWR));
+	if (ret < 0)
+		goto remove_partition;
+
+	goto out;
+
+remove_partition:
 	remove_partition(partition);
 delete_partition:
 	hv_call_delete_partition(partition->pt_id);
@@ -2031,7 +2088,8 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	cleanup_srcu_struct(&partition->pt_irq_srcu);
 free_partition:
 	kfree(partition);
-
+out:
+	trace_mshv_create_partition(pt_id, ret);
 	return ret;
 }
 
diff --git a/drivers/hv/mshv_trace.c b/drivers/hv/mshv_trace.c
new file mode 100644
index 000000000000..0936b2f95edd
--- /dev/null
+++ b/drivers/hv/mshv_trace.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Tracepoint definitions for mshv driver.
+ */
+
+#define CREATE_TRACE_POINTS
+#include "mshv_trace.h"
diff --git a/drivers/hv/mshv_trace.h b/drivers/hv/mshv_trace.h
new file mode 100644
index 000000000000..ba3b3f575983
--- /dev/null
+++ b/drivers/hv/mshv_trace.h
@@ -0,0 +1,515 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Tracepoint declarations for mshv driver.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mshv
+
+#if !defined(_MSHV_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define _MSHV_TRACE_H_
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/hv
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE mshv_trace
+
+TRACE_EVENT(mshv_create_partition,
+	    TP_PROTO(u64 partition_id, int vm_fd),
+	    TP_ARGS(partition_id, vm_fd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(int, vm_fd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vm_fd = vm_fd;
+	    ),
+	    TP_printk("partition_id=%llu vm_fd=%d",
+		    __entry->partition_id,
+		    __entry->vm_fd
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_create_partition,
+	    TP_PROTO(u64 flags, s64 partition_id),
+	    TP_ARGS(flags, partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, flags)
+		    __field(s64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->flags = flags;
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("flags=%#llx partition_id=%lld",
+		    __entry->flags,
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_initialize_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_partition_release,
+	    TP_PROTO(u64 partition_id),
+	    TP_ARGS(partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("partition_id=%llu",
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_destroy_partition,
+	    TP_PROTO(u64 partition_id),
+	    TP_ARGS(partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("partition_id=%llu",
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_finalize_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_withdraw_memory,
+	    TP_PROTO(u64 partition_id, u64 withdrawn, u64 status),
+	    TP_ARGS(partition_id, withdrawn, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, withdrawn)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->withdrawn = withdrawn;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu withdrawn=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->withdrawn,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_delete_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_create_vp,
+	    TP_PROTO(u64 partition_id, u32 vp_index, long vp_fd),
+	    TP_ARGS(partition_id, vp_index, vp_fd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(long, vp_fd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->vp_fd = vp_fd;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u vp_fd=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->vp_fd
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_map_vp_state_page,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u32 page_type, u64 status),
+	    TP_ARGS(partition_id, vp_index, page_type, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u32, page_type)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->page_type = page_type;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u page_type=%u status=%#llx",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->page_type,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_drain_vp_signals,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_disable_vp_dispatch,
+	    TP_PROTO(u64 partition_id, u32 vp_index, int ret),
+	    TP_ARGS(partition_id, vp_index, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u ret=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_vp_release,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_run_vp_entry,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_run_vp_exit,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u64 hv_message_type, long ret),
+	    TP_ARGS(partition_id, vp_index, hv_message_type, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u64, hv_message_type)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->hv_message_type = hv_message_type;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u hv_message_type=%#llx ret=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->hv_message_type,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_vp_clear_explicit_suspend,
+	    TP_PROTO(u64 partition_id, u32 vp_index, int ret),
+	    TP_ARGS(partition_id, vp_index, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u ret=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_xfer_to_guest_mode_work,
+	    TP_PROTO(u64 partition_id, u32 vp_index, unsigned long thread_info_flag, long ret),
+	    TP_ARGS(partition_id, vp_index, thread_info_flag, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(unsigned long, thread_info_flag)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->thread_info_flag = thread_info_flag;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u thread_info_flag=%#lx ret=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->thread_info_flag,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_dispatch_vp,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u32 flags,
+		     u32 dispatch_state, u32 dispatch_event, u64 irq_vectors, u64 status),
+	    TP_ARGS(partition_id, vp_index, flags, dispatch_state, dispatch_event, irq_vectors,
+		    status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u32, flags)
+		    __field(u32, dispatch_state)
+		    __field(u32, dispatch_event)
+		    __field(u64, irq_vectors)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->flags = flags;
+		    __entry->dispatch_state = dispatch_state;
+		    __entry->dispatch_event = dispatch_event;
+		    __entry->irq_vectors = irq_vectors;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u flags=%#x dispatch_state=%#x dispatch_event=%#x irq_vectors=%#016llx status=%#llx",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->flags,
+		    __entry->dispatch_state,
+		    __entry->dispatch_event,
+		    __entry->irq_vectors,
+		    __entry->status
+	     )
+);
+
+TRACE_EVENT(mshv_update_routing_table,
+	    TP_PROTO(u64 partition_id, void *old, void *new, u32 numents),
+	    TP_ARGS(partition_id, old, new, numents),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(struct mshv_girq_routing_table *, old)
+		    __field(struct mshv_girq_routing_table *, new)
+		    __field(u32, numents)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->old = old;
+		    __entry->new = new;
+		    __entry->numents = numents;
+	    ),
+	    TP_printk("partition_id=%llu old=%p new=%p numents=%u",
+		    __entry->partition_id,
+		    __entry->old,
+		    __entry->new,
+		    __entry->numents
+	    )
+);
+
+TRACE_EVENT(mshv_map_user_memory,
+	    TP_PROTO(u64 partition_id, u64 start_uaddr, u64 start_gfn, u64 nr_pages, u32 map_flags,
+		     long ret),
+	    TP_ARGS(partition_id, start_uaddr, start_gfn, nr_pages, map_flags, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, start_uaddr)
+		    __field(u64, start_gfn)
+		    __field(u64, nr_pages)
+		    __field(u32, map_flags)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->start_uaddr = start_uaddr;
+		    __entry->start_gfn = start_gfn;
+		    __entry->nr_pages = nr_pages;
+		    __entry->map_flags = map_flags;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu start_uaddr=%#llx start_gfn=%#llx nr_pages=%llu map_flags=%#x ret=%ld",
+		    __entry->partition_id,
+		    __entry->start_uaddr,
+		    __entry->start_gfn,
+		    __entry->nr_pages,
+		    __entry->map_flags,
+		    __entry->ret
+	     )
+);
+
+TRACE_EVENT(mshv_assign_ioeventfd,
+	    TP_PROTO(u64 partition_id, u64 addr, u64 length, u64 datamatch, bool wildcard,
+		     void *eventfd, int ret),
+	    TP_ARGS(partition_id, addr, length, datamatch, wildcard, eventfd, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, addr)
+		    __field(u64, length)
+		    __field(u64, datamatch)
+		    __field(bool, wildcard)
+		    __field(struct eventfd_ctx *, eventfd)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->addr = addr;
+		    __entry->length = length;
+		    __entry->datamatch = datamatch;
+		    __entry->wildcard = wildcard;
+		    __entry->eventfd = eventfd;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu addr=%#016llx length=%#llx datamatch=%#llx wildcard=%d eventfd=%p ret=%d",
+		    __entry->partition_id,
+		    __entry->addr,
+		    __entry->length,
+		    __entry->datamatch,
+		    __entry->wildcard,
+		    __entry->eventfd,
+		    __entry->ret
+	     )
+);
+
+TRACE_EVENT(mshv_deassign_ioeventfd,
+	    TP_PROTO(u64 partition_id, u64 addr, u64 length, u64 datamatch, bool wildcard,
+		     void *eventfd),
+	    TP_ARGS(partition_id, addr, length, datamatch, wildcard, eventfd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, addr)
+		    __field(u64, length)
+		    __field(u64, datamatch)
+		    __field(bool, wildcard)
+		    __field(struct eventfd_ctx *, eventfd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->addr = addr;
+		    __entry->length = length;
+		    __entry->datamatch = datamatch;
+		    __entry->wildcard = wildcard;
+		    __entry->eventfd = eventfd;
+	    ),
+	    TP_printk("partition_id=%llu addr=%#016llx length=%#llx datamatch=%#llx wildcard=%d eventfd=%p",
+		    __entry->partition_id,
+		    __entry->addr,
+		    __entry->length,
+		    __entry->datamatch,
+		    __entry->wildcard,
+		    __entry->eventfd
+	     )
+);
+
+TRACE_EVENT(mshv_vp_wait_for_hv_kick,
+	    TP_PROTO(u64 partition_id, u32 vp_index, bool kicked_by_hv, bool blocked,
+		     bool irq_pending),
+	    TP_ARGS(partition_id, vp_index, kicked_by_hv, blocked, irq_pending),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(bool, kicked_by_hv)
+		    __field(bool, blocked)
+		    __field(bool, irq_pending)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->kicked_by_hv = kicked_by_hv;
+		    __entry->blocked = blocked;
+		    __entry->irq_pending = irq_pending;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u kicked_by_hv=%d blocked=%d irq_pending=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->kicked_by_hv,
+		    __entry->blocked,
+		    __entry->irq_pending
+	    )
+);
+
+#endif /* _MSHV_TRACE_H_ */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>



^ permalink raw reply related

* Re: [PATCH net-next v4] net: mana: Add MAC address to vPort logs and clarify error messages
From: Simon Horman @ 2026-03-01 16:41 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, shirazsaleem, ssengar,
	shradhagupta, gargaditya, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260225192252.943534-1-ernis@linux.microsoft.com>

On Wed, Feb 25, 2026 at 11:22:41AM -0800, Erni Sri Satya Vennela wrote:
> Add MAC address to vPort configuration success message and update error
> message to be more specific about HWC message errors in
> mana_send_request.
> 
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

...

> diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c

...

> @@ -893,8 +895,8 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
>  	if (!wait_for_completion_timeout(&ctx->comp_event,
>  					 (msecs_to_jiffies(hwc->hwc_timeout)))) {
>  		if (hwc->hwc_timeout != 0)
> -			dev_err(hwc->dev, "HWC: Request timed out: %u ms\n",
> -				hwc->hwc_timeout);
> +			dev_err(hwc->dev, "%s:%d: Command 0x%x timed out: %u ms\n",
> +				__func__, __LINE__, command, hwc->hwc_timeout);

I have reservations about the usefulness of including __func__ and __LINE__
in debug messages. In a nutshell, it requires the logs to be correlated
(exactly?) with the source used to build the driver. And at that point
I think other mechanisms - e.g. dynamic trace points - are going to be
useful if the debug message (without function and line information)
is insufficient to pinpoint the problem.

This is a general statement, rather than something specifically
about this code. But nonetheless I'd advise against adding this
information here.

>  
>  		/* Reduce further waiting if HWC no response */
>  		if (hwc->hwc_timeout > 1)
> @@ -914,9 +916,9 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
>  			err = -EOPNOTSUPP;
>  			goto out;
>  		}
> -		if (req_msg->req.msg_type != MANA_QUERY_PHY_STAT)
> -			dev_err(hwc->dev, "HWC: Failed hw_channel req: 0x%x\n",
> -				ctx->status_code);
> +		if (command != MANA_QUERY_PHY_STAT)
> +			dev_err(hwc->dev, "%s:%d: Command 0x%x failed with status: 0x%x\n",
> +				__func__, __LINE__, command, ctx->status_code);

>  		err = -EPROTO;
>  		goto out;
>  	}

...

^ permalink raw reply

* Re: [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Martin K. Petersen @ 2026-02-28 23:12 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Martin K. Petersen, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, James E.J. Bottomley, linux-hyperv,
	linux-scsi, Linux Kernel Mailing List, Florian Bezdeka, RT,
	Mitchell Levy
In-Reply-To: <898e9467-0c05-46b4-a3ed-518797b829c5@siemens.com>


Jan,

>> Applied to 7.0/scsi-fixes, thanks!
>> 
>> [1/1] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
>>       https://git.kernel.org/mkp/scsi/c/57297736c082
>> 
>
> Should it be here then already?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=7.0/scsi-fixes

Because it was part of a merge of the outstanding commits from
7.0/scsi-queue, this patch doesn't appear at the top of the history for
7.0/scsi-fixes.

It is there, though:

$ git ol v7.0-rc1..7.0/scsi-fixes | grep storvsc
57297736c082 ("scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT")

-- 
Martin K. Petersen

^ permalink raw reply

* Re: [PATCH net-next v4] net: mana: Add MAC address to vPort logs and clarify error messages
From: Jakub Kicinski @ 2026-02-28 17:55 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, dipayanroy, shirazsaleem, ssengar, shradhagupta,
	gargaditya, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260225192252.943534-1-ernis@linux.microsoft.com>

On Wed, 25 Feb 2026 11:22:41 -0800 Erni Sri Satya Vennela wrote:
> -			dev_err(hwc->dev, "HWC: Request timed out: %u ms\n",
> -				hwc->hwc_timeout);
> +			dev_err(hwc->dev, "%s:%d: Command 0x%x timed out: %u ms\n",
> +				__func__, __LINE__, command, hwc->hwc_timeout);

Please don't include __LINE__; line numbers are meaningless given the amount of
backporting that usually happens in the kernel. The string should be
unique enough to identify the error, which I think yours is given the
__func__ + text you have.
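
A minimal userspace sketch of that suggestion, assuming snprintf() as a
stand-in for dev_err(): the message keeps __func__ and the command code
but drops __LINE__, so the string stays the same across backports. The
helper name format_hwc_timeout() and its buffer arguments are
hypothetical and not part of the mana driver.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: build the timeout message without __LINE__.
 * snprintf() stands in for dev_err(); __func__ expands to the name of
 * this function, so the prefix is "format_hwc_timeout".
 */
static int format_hwc_timeout(char *buf, size_t len,
			      unsigned int command, unsigned int timeout_ms)
{
	return snprintf(buf, len, "%s: Command 0x%x timed out: %u ms",
			__func__, command, timeout_ms);
}
```

In the driver the prefix would be the name of the calling function, which
is usually enough to grep for the message in any tree.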
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Andrew Cooper @ 2026-02-28 17:38 UTC (permalink / raw)
  To: Uros Bizjak, Ard Biesheuvel
  Cc: Andrew Cooper, linux-kernel, x86, Mukesh Rathor, Wei Liu,
	linux-hyperv
In-Reply-To: <CAFULd4YTkLLdvjTtGXtHgsZCiEMXAYXcjSwciwdsE-RGvnVrdg@mail.gmail.com>

On 28/02/2026 5:34 pm, Uros Bizjak wrote:
> On Sat, Feb 28, 2026 at 5:41 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>> So I don't see any reason for volatile on hv_wrmsr(). However, I do see a potential issue with the compiler assuming that code after the final asm() block is reachable, and calling unreachable() is not permitted [by Clang] due to the __naked attribute.
> As far as the compiler is concerned, there is "nothing" after the last
> asm() block. It can be marked with noreturn attribute, but I don't
> know how it interacts with the naked attribute.

You can't use __builtin_unreachable() in a naked function.  I was going
to suggest it against v1, but alas.

~Andrew

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-28 17:34 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <7c7cd72e-fd46-4f77-8bf7-8d538fec0a52@app.fastmail.com>

On Sat, Feb 28, 2026 at 5:41 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
>
>
> On Sat, 28 Feb 2026, at 11:17, Uros Bizjak wrote:
> > On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >>
> >> hv_crash_c_entry() is a C function that is entered without a stack,
> >> and this is only allowed for functions that have the __naked attribute,
> >> which informs the compiler that it must not emit the usual prologue and
> >> epilogue or emit any other kind of instrumentation that relies on a
> >> stack frame.
> >>
> >> So split up the function, and set the __naked attribute on the initial
> >> part that sets up the stack, GDT, IDT and other pieces that are needed
> >> for ordinary C execution. Given that function calls are not permitted
> >> either, use the existing long return coded in an asm() block to call the
> >> second part of the function, which is an ordinary function that is
> >> permitted to call other functions as usual.
> >>
> >> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> Cc: Wei Liu <wei.liu@kernel.org>
> >> Cc: Uros Bizjak <ubizjak@gmail.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Cc: linux-hyperv@vger.kernel.org
> >> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
> >> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >> ---
> >> v2: apply some asm tweaks suggested by Uros and Andrew
> >>
> >>  arch/x86/hyperv/hv_crash.c | 79 ++++++++++----------
> >>  1 file changed, 41 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> >> index 92da1b4f2e73..1c0965eb346e 100644
> >> --- a/arch/x86/hyperv/hv_crash.c
> >> +++ b/arch/x86/hyperv/hv_crash.c
> >> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
> >>                 cpu_relax();
> >>  }
> >>
> >> -/* This cannot be inlined as it needs stack */
> >> -static noinline __noclone void hv_crash_restore_tss(void)
> >> +static void hv_crash_restore_tss(void)
> >>  {
> >>         load_TR_desc();
> >>  }
> >>
> >> -/* This cannot be inlined as it needs stack */
> >> -static noinline void hv_crash_clear_kernpt(void)
> >> +static void hv_crash_clear_kernpt(void)
> >>  {
> >>         pgd_t *pgd;
> >>         p4d_t *p4d;
> >> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
> >>         native_p4d_clear(p4d);
> >>  }
> >>
> >> +
> >> +static void __noreturn hv_crash_handle(void)
> >> +{
> >> +       hv_crash_restore_tss();
> >> +       hv_crash_clear_kernpt();
> >> +
> >> +       /* we are now fully in devirtualized normal kernel mode */
> >> +       __crash_kexec(NULL);
> >> +
> >> +       hv_panic_timeout_reboot();
> >> +}
> >> +
> >> +/*
> >> + * __naked functions do not permit function calls, not even to __always_inline
> >> + * functions that only contain asm() blocks themselves. So use a macro instead.
> >> + */
> >> +#define hv_wrmsr(msr, val) \
> >> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
> >
> > This one should be defined as "asm volatile", otherwise the compiler
> > will remove it (it has no outputs used later in the code!).
>
> An asm() block with only input operands does not need to be marked as volatile to prevent it from being optimized away.

Oh, you are right about this detail. It won't be removed, but insn
reordering is still allowed (for insns without memory access, even when
a "memory" clobber is used).

> > Also, it
> > should be defined as "asm volatile" when it is important that the insn
> > stays where it is, relative to other "asm volatile"s. Otherwise, the
> > compiler is free to schedule other insns, including other "asm
> > volatile"s around it. Since this macro is also used to update
> > MSR_GS_BASE (so it affects memory in case of %gs prefixed access),
> > "memory" clobber should remain.
> >
>
> All the other asm() blocks except the last one read from memory, and hv_wrmsr() has a memory clobber. So I don't think there is any legal transformation that the compiler might apply except perhaps re-ordering it with the final asm() block doing the long return.

The last asm() block also reads from memory, so it is OK here, too.
>
> So I don't see any reason for volatile on hv_wrmsr(). However, I do see a potential issue with the compiler assuming that code after the final asm() block is reachable, and calling unreachable() is not permitted [by Clang] due to the __naked attribute.

As far as the compiler is concerned, there is "nothing" after the last
asm() block. It can be marked with noreturn attribute, but I don't
know how it interacts with the naked attribute.
>
> Would it be better to add a memory clobber to that one as well?

But the last asm() block also reads from memory; this prevents
scheduling of an asm() with a memory clobber around it.

Uros.

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-28 17:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <a80879c4-10d4-41a7-8043-290b92a0d9fc@app.fastmail.com>

On Sat, Feb 28, 2026 at 5:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
>
>
> On Sat, 28 Feb 2026, at 10:38, Uros Bizjak wrote:
> > On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> ...
> >> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> >> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> >> +       asm volatile("movw %0, %%ss" : : "m"(hv_crash_ctxt.ss));
> >> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
> >
> > Maybe this part should be written together as:
> >
> >       asm volatile("movw %0, %%ss\n\t"
> >                     "movq %1, %%rsp"
> >                     :: "m"(hv_crash_ctxt.ss), "m"(hv_crash_ctxt.rsp));
> >
> > This way, the stack register update is guaranteed to execute in the
> > stack segment shadow. Otherwise, the compiler is free to insert some
> > unrelated instruction in between. It probably won't happen in practice
> > in this case, but the compiler can be quite creative with moving asm
> > arguments around.
> >
>
> It also doesn't matter: setting the SS segment is not needed when running in 64-bit mode, so whether or not the RSP update occurs immediately after is irrelevant.

x86-64 still implements the stack segment interrupt shadow for MOV SS
and POP SS, even though segmentation is mostly disabled in long mode.
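
The adjacency argument can be sketched in userspace. This is an
illustration under stated assumptions, not the hv_crash.c code: harmless
arithmetic stands in for the "movw ..., %ss" / "movq ..., %rsp" pair,
which cannot be exercised from an ordinary process, and the function
name add_then_double() is hypothetical. The point it shows is that the
template of a single asm statement is emitted as one unit, so the
compiler cannot place its own instructions between the two lines.

```c
/*
 * Illustration only: two instructions in ONE asm statement are emitted
 * back to back. With two separate asm statements the compiler would be
 * free to place unrelated instructions between them.
 */
static unsigned long add_then_double(unsigned long x, unsigned long y)
{
#if defined(__x86_64__)
	__asm__ volatile("addq %1, %0\n\t"	/* x += y */
			 "addq %0, %0"		/* x += x */
			 : "+r"(x)
			 : "r"(y)
			 : "cc");
#else
	x = (x + y) * 2;	/* plain C fallback on other architectures */
#endif
	return x;
}
```

For example, add_then_double(3, 4) computes (3 + 4) * 2.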

Uros.

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Ard Biesheuvel @ 2026-02-28 16:40 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <CAFULd4YM=D9+akehA5h_sC-97otYyv1Nxr2neE8bD1AoW-8ocQ@mail.gmail.com>



On Sat, 28 Feb 2026, at 11:17, Uros Bizjak wrote:
> On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>>
>> hv_crash_c_entry() is a C function that is entered without a stack,
>> and this is only allowed for functions that have the __naked attribute,
>> which informs the compiler that it must not emit the usual prologue and
>> epilogue or emit any other kind of instrumentation that relies on a
>> stack frame.
>>
>> So split up the function, and set the __naked attribute on the initial
>> part that sets up the stack, GDT, IDT and other pieces that are needed
>> for ordinary C execution. Given that function calls are not permitted
>> either, use the existing long return coded in an asm() block to call the
>> second part of the function, which is an ordinary function that is
>> permitted to call other functions as usual.
>>
>> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Cc: Wei Liu <wei.liu@kernel.org>
>> Cc: Uros Bizjak <ubizjak@gmail.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: linux-hyperv@vger.kernel.org
>> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> ---
>> v2: apply some asm tweaks suggested by Uros and Andrew
>>
>>  arch/x86/hyperv/hv_crash.c | 79 ++++++++++----------
>>  1 file changed, 41 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>> index 92da1b4f2e73..1c0965eb346e 100644
>> --- a/arch/x86/hyperv/hv_crash.c
>> +++ b/arch/x86/hyperv/hv_crash.c
>> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
>>                 cpu_relax();
>>  }
>>
>> -/* This cannot be inlined as it needs stack */
>> -static noinline __noclone void hv_crash_restore_tss(void)
>> +static void hv_crash_restore_tss(void)
>>  {
>>         load_TR_desc();
>>  }
>>
>> -/* This cannot be inlined as it needs stack */
>> -static noinline void hv_crash_clear_kernpt(void)
>> +static void hv_crash_clear_kernpt(void)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
>>         native_p4d_clear(p4d);
>>  }
>>
>> +
>> +static void __noreturn hv_crash_handle(void)
>> +{
>> +       hv_crash_restore_tss();
>> +       hv_crash_clear_kernpt();
>> +
>> +       /* we are now fully in devirtualized normal kernel mode */
>> +       __crash_kexec(NULL);
>> +
>> +       hv_panic_timeout_reboot();
>> +}
>> +
>> +/*
>> + * __naked functions do not permit function calls, not even to __always_inline
>> + * functions that only contain asm() blocks themselves. So use a macro instead.
>> + */
>> +#define hv_wrmsr(msr, val) \
>> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
>
> This one should be defined as "asm volatile", otherwise the compiler
> will remove it (it has no outputs used later in the code!).

An asm() block with only input operands does not need to be marked as volatile to prevent it from being optimized away.
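
A small sketch of that rule, assuming GCC's documented behaviour that an
asm statement without output operands is implicitly volatile. The empty
template and the helper name count_with_barrier() are illustrative
stand-ins; the real macro issues "wrmsr", which cannot run in userspace.

```c
/*
 * Sketch: the asm statement below has inputs and a "memory" clobber but
 * no outputs. Because it has no output operands it is treated as
 * volatile, so it is kept in every loop iteration even though nothing
 * consumes a result from it.
 */
static unsigned int count_with_barrier(unsigned int n)
{
	unsigned int i, count = 0;

	for (i = 0; i < n; i++) {
		count++;
		/* input-only asm: acts as a compiler barrier here */
		__asm__("" :: "r"(count) : "memory");
	}
	return count;
}
```

The return value only shows that the loop still runs as written;
inspecting the compiler's assembly output is one way to confirm that the
asm statement itself was not dropped.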

> Also, it
> should be defined as "asm volatile" when it is important that the insn
> stays where it is, relative to other "asm volatile"s. Otherwise, the
> compiler is free to schedule other insns, including other "asm
> volatile"s around it. Since this macro is also used to update
> MSR_GS_BASE (so it affects memory in case of %gs prefixed access),
> "memory" clobber should remain.
>

All the other asm() blocks except the last one read from memory, and hv_wrmsr() has a memory clobber. So I don't think there is any legal transformation that the compiler might apply except perhaps re-ordering it with the final asm() block doing the long return.

So I don't see any reason for volatile on hv_wrmsr(). However, I do see a potential issue with the compiler assuming that code after the final asm() block is reachable, and calling unreachable() is not permitted [by Clang] due to the __naked attribute.

Would it be better to add a memory clobber to that one as well?


^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Ard Biesheuvel @ 2026-02-28 16:37 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <CAFULd4ZYiSWciqo94yaLvB43z_+jqgXa2gy8DOEQQp1W8yFF0w@mail.gmail.com>



On Sat, 28 Feb 2026, at 10:38, Uros Bizjak wrote:
> On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
...
>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +       asm volatile("movw %0, %%ss" : : "m"(hv_crash_ctxt.ss));
>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>
> Maybe this part should be written together as:
>
>       asm volatile("movw %0, %%ss\n\t"
>                     "movq %1, %%rsp"
>                     :: "m"(hv_crash_ctxt.ss), "m"(hv_crash_ctxt.rsp));
>
> This way, the stack register update is guaranteed to execute in the
> stack segment shadow. Otherwise, the compiler is free to insert some
> unrelated instruction in between. It probably won't happen in practice
> in this case, but the compiler can be quite creative with moving asm
> arguments around.
>

It also doesn't matter: setting the SS segment is not needed when running in 64-bit mode, so whether or not the RSP update occurs immediately after is irrelevant.


^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-28 10:17 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <20260227224030.299993-2-ardb@kernel.org>

On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> hv_crash_c_entry() is a C function that is entered without a stack,
> and this is only allowed for functions that have the __naked attribute,
> which informs the compiler that it must not emit the usual prologue and
> epilogue or emit any other kind of instrumentation that relies on a
> stack frame.
>
> So split up the function, and set the __naked attribute on the initial
> part that sets up the stack, GDT, IDT and other pieces that are needed
> for ordinary C execution. Given that function calls are not permitted
> either, use the existing long return coded in an asm() block to call the
> second part of the function, which is an ordinary function that is
> permitted to call other functions as usual.
>
> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Uros Bizjak <ubizjak@gmail.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: linux-hyperv@vger.kernel.org
> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> v2: apply some asm tweaks suggested by Uros and Andrew
>
>  arch/x86/hyperv/hv_crash.c | 79 ++++++++++----------
>  1 file changed, 41 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> index 92da1b4f2e73..1c0965eb346e 100644
> --- a/arch/x86/hyperv/hv_crash.c
> +++ b/arch/x86/hyperv/hv_crash.c
> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
>                 cpu_relax();
>  }
>
> -/* This cannot be inlined as it needs stack */
> -static noinline __noclone void hv_crash_restore_tss(void)
> +static void hv_crash_restore_tss(void)
>  {
>         load_TR_desc();
>  }
>
> -/* This cannot be inlined as it needs stack */
> -static noinline void hv_crash_clear_kernpt(void)
> +static void hv_crash_clear_kernpt(void)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
>         native_p4d_clear(p4d);
>  }
>
> +
> +static void __noreturn hv_crash_handle(void)
> +{
> +       hv_crash_restore_tss();
> +       hv_crash_clear_kernpt();
> +
> +       /* we are now fully in devirtualized normal kernel mode */
> +       __crash_kexec(NULL);
> +
> +       hv_panic_timeout_reboot();
> +}
> +
> +/*
> + * __naked functions do not permit function calls, not even to __always_inline
> + * functions that only contain asm() blocks themselves. So use a macro instead.
> + */
> +#define hv_wrmsr(msr, val) \
> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")

This one should be defined as "asm volatile", otherwise the compiler
will remove it (it has no outputs used later in the code!). Also, it
should be defined as "asm volatile" when it is important that the insn
stays where it is, relative to other "asm volatile"s. Otherwise, the
compiler is free to schedule other insns, including other "asm
volatile"s, around it. Since this macro is also used to update
MSR_GS_BASE (so it affects memory in case of %gs-prefixed accesses),
the "memory" clobber should remain.

Uros.
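
For readers following the hv_wrmsr() macro quoted above: WRMSR takes the
MSR index in ECX and the 64-bit value split across EDX:EAX, which is what
the "c", "a" and "d" constraints express. A minimal userspace sketch of
just that split follows. The names msr_halves and split_msr_value are
hypothetical and not part of the patch, and no privileged instruction is
executed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the EDX:EAX pair that WRMSR consumes. */
struct msr_halves {
	uint32_t lo;	/* would be passed in EAX ("a" constraint) */
	uint32_t hi;	/* would be passed in EDX ("d" constraint) */
};

/* Split a 64-bit MSR value the same way the quoted macro's casts do. */
static struct msr_halves split_msr_value(uint64_t val)
{
	struct msr_halves h = {
		.lo = (uint32_t)val,
		.hi = (uint32_t)(val >> 32),
	};

	return h;
}
```

The casts mirror (u32)val and (u32)(val >> 32) from the macro; everything
else is illustration only.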

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-28  9:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <20260227224030.299993-2-ardb@kernel.org>

On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> hv_crash_c_entry() is a C function that is entered without a stack,
> and this is only allowed for functions that have the __naked attribute,
> which informs the compiler that it must not emit the usual prologue and
> epilogue or emit any other kind of instrumentation that relies on a
> stack frame.
>
> So split up the function, and set the __naked attribute on the initial
> part that sets up the stack, GDT, IDT and other pieces that are needed
> for ordinary C execution. Given that function calls are not permitted
> either, use the existing long return coded in an asm() block to call the
> second part of the function, which is an ordinary function that is
> permitted to call other functions as usual.
>
> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Uros Bizjak <ubizjak@gmail.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: linux-hyperv@vger.kernel.org
> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> v2: apply some asm tweaks suggested by Uros and Andrew
>
>  arch/x86/hyperv/hv_crash.c | 79 ++++++++++----------
>  1 file changed, 41 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> index 92da1b4f2e73..1c0965eb346e 100644
> --- a/arch/x86/hyperv/hv_crash.c
> +++ b/arch/x86/hyperv/hv_crash.c
> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
>                 cpu_relax();
>  }
>
> -/* This cannot be inlined as it needs stack */
> -static noinline __noclone void hv_crash_restore_tss(void)
> +static void hv_crash_restore_tss(void)
>  {
>         load_TR_desc();
>  }
>
> -/* This cannot be inlined as it needs stack */
> -static noinline void hv_crash_clear_kernpt(void)
> +static void hv_crash_clear_kernpt(void)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
>         native_p4d_clear(p4d);
>  }
>
> +
> +static void __noreturn hv_crash_handle(void)
> +{
> +       hv_crash_restore_tss();
> +       hv_crash_clear_kernpt();
> +
> +       /* we are now fully in devirtualized normal kernel mode */
> +       __crash_kexec(NULL);
> +
> +       hv_panic_timeout_reboot();
> +}
> +
> +/*
> + * __naked functions do not permit function calls, not even to __always_inline
> + * functions that only contain asm() blocks themselves. So use a macro instead.
> + */
> +#define hv_wrmsr(msr, val) \
> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
> +
>  /*
>   * This is the C entry point from the asm glue code after the disable hypercall.
>   * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
> @@ -133,49 +150,35 @@ static noinline void hv_crash_clear_kernpt(void)
>   * available. We restore kernel GDT, and rest of the context, and continue
>   * to kexec.
>   */
> -static asmlinkage void __noreturn hv_crash_c_entry(void)
> +static void __naked hv_crash_c_entry(void)
>  {
> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> -
>         /* first thing, restore kernel gdt */
> -       native_load_gdt(&ctxt->gdtr);
> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>
> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> +       asm volatile("movw %0, %%ss" : : "m"(hv_crash_ctxt.ss));
> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));

Maybe this part should be written together as:

      asm volatile("movw %0, %%ss\n\t"
                    "movq %1, %%rsp"
                    :: "m"(hv_crash_ctxt.ss), "m"(hv_crash_ctxt.rsp));

This way, the stack register update is guaranteed to execute in the
stack segment shadow. Otherwise, the compiler is free to insert some
unrelated instruction in between. It probably won't happen in practice
in this case, but the compiler can be quite creative with moving asm
arguments around.

Uros.

^ permalink raw reply

* Re: [PATCH net-next v4] net: mana: Add MAC address to vPort logs and clarify error messages
From: Erni Sri Satya Vennela @ 2026-02-28  7:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, dipayanroy, shirazsaleem, ssengar, shradhagupta,
	gargaditya, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260227165226.07efbefd@kernel.org>

On Fri, Feb 27, 2026 at 04:52:26PM -0800, Jakub Kicinski wrote:
> On Fri, 27 Feb 2026 11:06:31 -0800 Erni Sri Satya Vennela wrote:
> > On Wed, Feb 25, 2026 at 11:22:41AM -0800, Erni Sri Satya Vennela wrote:
> > > Add MAC address to vPort configuration success message and update error
> > > message to be more specific about HWC message errors in
> > > mana_send_request.
> > > 
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>  
> > 
> > Gentle ping — I sent this patch on 25/02/2026 and would appreciate any
> > feedback when you have time.  
> > Happy to rebase or add more details if needed, thanks for your review.
> 
> What are you trying to achieve with this ping? Just look at patchwork,
> there are 61 patches ahead of you in the queue.
> 
> These are Microsoft review contribution scores:
>   Author score negative (-42)
>   Company score negative (-1118)
> so you expecting that someone in the community will jump onto reviewing
> your patches is... odd. How about you review something?
> 
> Read the process documentation, and please have some basic
> understanding of what is considered good manners when communicating
> upstream.

I'm sorry for causing the trouble, and I appreciate you pointing this out.
I’ll be more patient with the review process and wait my turn in the
queue.

- Vennela

^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-28  6:46 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Mukesh Rathor, Wei Liu, Andrew Cooper,
	linux-hyperv
In-Reply-To: <20260227224030.299993-2-ardb@kernel.org>

On Fri, Feb 27, 2026 at 11:40 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> hv_crash_c_entry() is a C function that is entered without a stack,
> and this is only allowed for functions that have the __naked attribute,
> which informs the compiler that it must not emit the usual prologue and
> epilogue or emit any other kind of instrumentation that relies on a
> stack frame.
>
> So split up the function, and set the __naked attribute on the initial
> part that sets up the stack, GDT, IDT and other pieces that are needed
> for ordinary C execution. Given that function calls are not permitted
> either, use the existing long return coded in an asm() block to call the
> second part of the function, which is an ordinary function that is
> permitted to call other functions as usual.
>
> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Uros Bizjak <ubizjak@gmail.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: linux-hyperv@vger.kernel.org
> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> v2: apply some asm tweaks suggested by Uros and Andrew

Acked-by: Uros Bizjak <ubizjak@gmail.com>

FYI: GCC by design inserts ud2 at the end of x86 naked functions. This
is intended to help debugging in case someone forgets ret/jmp, so
program execution does not wander into whatever follows the function.
IIRC, when ud2 follows ret, code analyzers may report "unreachable
code" warnings. I don't know if this is still the case, but
nevertheless this should be considered an important "safety net"
feature of the compiler.

Uros.

^ permalink raw reply

* Re: [PATCH net v2] net: mana: Ring doorbell at 4 CQ wraparounds
From: patchwork-bot+netdevbpf @ 2026-02-28  3:40 UTC (permalink / raw)
  To: Long Li
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, shradhagupta, ernis, linux-hyperv, netdev,
	linux-kernel, stable
In-Reply-To: <20260226192833.1050807-1-longli@microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 26 Feb 2026 11:28:33 -0800 you wrote:
> MANA hardware requires at least one doorbell ring every 8 wraparounds
> of the CQ. The driver rings the doorbell as a form of flow control to
> inform hardware that CQEs have been consumed.
> 
> The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
> poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
> has fewer than 512 entries, a single poll call can process more than
> 4 wraparounds without ringing the doorbell. The doorbell threshold
> check also uses ">" instead of ">=", delaying the ring by one extra
> CQE beyond 4 wraparounds. Combined, these issues can cause the driver
> to exceed the 8-wraparound hardware limit, leading to missed
> completions and stalled queues.
> 
> [...]
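
The off-by-one in the threshold check described above can be illustrated
with a small model. This is only a sketch under stated assumptions, not
the driver code: struct cq_model and both functions are hypothetical, and
the model checks the threshold after every consumed CQE, which may not
match where the real check sits. It reports the largest number of CQEs
consumed between two doorbell rings for ">" versus ">=".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DOORBELL_WRAP_THRESHOLD 4	/* ring at 4 wraparounds, per the commit message */

/* Hypothetical model of a completion queue; not the driver's struct. */
struct cq_model {
	uint32_t depth;			/* number of CQEs in the ring */
	uint32_t since_doorbell;	/* CQEs consumed since the last doorbell */
	uint32_t max_gap;		/* worst-case CQEs between two doorbells */
};

/* Consume one CQE; ring the doorbell once the threshold is reached. */
static void cq_consume_one(struct cq_model *cq, bool inclusive)
{
	uint32_t limit = cq->depth * DOORBELL_WRAP_THRESHOLD;
	bool ring;

	cq->since_doorbell++;
	ring = inclusive ? cq->since_doorbell >= limit
			 : cq->since_doorbell > limit;
	if (ring) {
		if (cq->since_doorbell > cq->max_gap)
			cq->max_gap = cq->since_doorbell;
		cq->since_doorbell = 0;
	}
}

/* Consume n CQEs and report the largest gap seen between doorbells. */
static uint32_t cq_worst_gap(uint32_t depth, uint32_t n, bool inclusive)
{
	struct cq_model cq = { .depth = depth };
	uint32_t i;

	for (i = 0; i < n; i++)
		cq_consume_one(&cq, inclusive);
	return cq.max_gap;
}
```

With a 16-entry CQ, four wraparounds are 64 CQEs. The ">=" variant rings
at exactly 64, while the ">" variant rings one CQE late, at 65.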

Here is the summary with links:
  - [net,v2] net: mana: Ring doorbell at 4 CQ wraparounds
    https://git.kernel.org/netdev/net/c/dabffd08545f

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v2] x86/hyperv: Use __naked attribute to fix stackless C function
From: Mukesh R @ 2026-02-28  3:36 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-kernel
  Cc: x86, Wei Liu, Uros Bizjak, Andrew Cooper, linux-hyperv
In-Reply-To: <3cd719bb-334a-d05a-d44a-f68982a76a9d@linux.microsoft.com>

On 2/27/26 15:03, Mukesh R wrote:
> On 2/27/26 14:40, Ard Biesheuvel wrote:
>> hv_crash_c_entry() is a C function that is entered without a stack,
>> and this is only allowed for functions that have the __naked attribute,
>> which informs the compiler that it must not emit the usual prologue and
>> epilogue or emit any other kind of instrumentation that relies on a
>> stack frame.
>>
>> So split up the function, and set the __naked attribute on the initial
>> part that sets up the stack, GDT, IDT and other pieces that are needed
>> for ordinary C execution. Given that function calls are not permitted
>> either, use the existing long return coded in an asm() block to call the
>> second part of the function, which is an ordinary function that is
>> permitted to call other functions as usual.
> 
> Thank you for the patch. I'll start a build on the side and test it
> out and let you know.

Well, never that simple. I am able to generate cores, both before and
after the patch, but the crash command hangs on the vmcore (with correct
vmlinux). But, since the kexec happened, I think it is fair to say that
the patch works. With that:

Reviewed-by: Mukesh R <mrathor@linux.microsoft.com>

However, I did notice a pre-existing cut-n-paste oopsie:

   asm volatile("movq %0, %%cr2" : : "r"(hv_crash_ctxt.cr4)); <== cr2, not cr4


So, if you happen to do another churn, feel free to fix it. Otherwise,
no worries, I'll submit another patch.

Thanks for all your help.
-Mukesh


> Thanks,
> -Mukesh
> 
> 
> 
>> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Cc: Wei Liu <wei.liu@kernel.org>
>> Cc: Uros Bizjak <ubizjak@gmail.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: linux-hyperv@vger.kernel.org
>> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> ---
>> v2: apply some asm tweaks suggested by Uros and Andrew
>>
>>   arch/x86/hyperv/hv_crash.c | 79 ++++++++++----------
>>   1 file changed, 41 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>> index 92da1b4f2e73..1c0965eb346e 100644
>> --- a/arch/x86/hyperv/hv_crash.c
>> +++ b/arch/x86/hyperv/hv_crash.c
>> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
>>           cpu_relax();
>>   }
>> -/* This cannot be inlined as it needs stack */
>> -static noinline __noclone void hv_crash_restore_tss(void)
>> +static void hv_crash_restore_tss(void)
>>   {
>>       load_TR_desc();
>>   }
>> -/* This cannot be inlined as it needs stack */
>> -static noinline void hv_crash_clear_kernpt(void)
>> +static void hv_crash_clear_kernpt(void)
>>   {
>>       pgd_t *pgd;
>>       p4d_t *p4d;
>> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
>>       native_p4d_clear(p4d);
>>   }
>> +
>> +static void __noreturn hv_crash_handle(void)
>> +{
>> +    hv_crash_restore_tss();
>> +    hv_crash_clear_kernpt();
>> +
>> +    /* we are now fully in devirtualized normal kernel mode */
>> +    __crash_kexec(NULL);
>> +
>> +    hv_panic_timeout_reboot();
>> +}
>> +
>> +/*
>> + * __naked functions do not permit function calls, not even to __always_inline
>> + * functions that only contain asm() blocks themselves. So use a macro instead.
>> + */
>> +#define hv_wrmsr(msr, val) \
>> +    asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
>> +
>>   /*
>>    * This is the C entry point from the asm glue code after the disable hypercall.
>>    * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>> @@ -133,49 +150,35 @@ static noinline void hv_crash_clear_kernpt(void)
>>    * available. We restore kernel GDT, and rest of the context, and continue
>>    * to kexec.
>>    */
>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>> +static void __naked hv_crash_c_entry(void)
>>   {
>> -    struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> -
>>       /* first thing, restore kernel gdt */
>> -    native_load_gdt(&ctxt->gdtr);
>> +    asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>> -    asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> -    asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +    asm volatile("movw %0, %%ss" : : "m"(hv_crash_ctxt.ss));
>> +    asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>> -    asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>> -    asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>> -    asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>> -    asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>> +    asm volatile("movw %0, %%ds" : : "m"(hv_crash_ctxt.ds));
>> +    asm volatile("movw %0, %%es" : : "m"(hv_crash_ctxt.es));
>> +    asm volatile("movw %0, %%fs" : : "m"(hv_crash_ctxt.fs));
>> +    asm volatile("movw %0, %%gs" : : "m"(hv_crash_ctxt.gs));
>> -    native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>> -    asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>> +    hv_wrmsr(MSR_IA32_CR_PAT, hv_crash_ctxt.pat);
>> +    asm volatile("movq %0, %%cr0" : : "r"(hv_crash_ctxt.cr0));
>> -    asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>> -    asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>> -    asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>> +    asm volatile("movq %0, %%cr8" : : "r"(hv_crash_ctxt.cr8));
>> +    asm volatile("movq %0, %%cr4" : : "r"(hv_crash_ctxt.cr4));
>> +    asm volatile("movq %0, %%cr2" : : "r"(hv_crash_ctxt.cr4));
>> -    native_load_idt(&ctxt->idtr);
>> -    native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>> -    native_wrmsrq(MSR_EFER, ctxt->efer);
>> +    asm volatile("lidt %0" : : "m" (hv_crash_ctxt.idtr));
>> +    hv_wrmsr(MSR_GS_BASE, hv_crash_ctxt.gsbase);
>> +    hv_wrmsr(MSR_EFER, hv_crash_ctxt.efer);
>>       /* restore the original kernel CS now via far return */
>> -    asm volatile("movzwq %0, %%rax\n\t"
>> -             "pushq %%rax\n\t"
>> -             "pushq $1f\n\t"
>> -             "lretq\n\t"
>> -             "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>> -
>> -    /* We are in asmlinkage without stack frame, hence make C function
>> -     * calls which will buy stack frames.
>> -     */
>> -    hv_crash_restore_tss();
>> -    hv_crash_clear_kernpt();
>> -
>> -    /* we are now fully in devirtualized normal kernel mode */
>> -    __crash_kexec(NULL);
>> -
>> -    hv_panic_timeout_reboot();
>> +    asm volatile("pushq %q0\n\t"
>> +             "pushq %q1\n\t"
>> +             "lretq"
>> +             :: "r"(hv_crash_ctxt.cs), "r"(hv_crash_handle));
>>   }
>>   /* Tell gcc we are using lretq long jump in the above function intentionally */
>>   STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
> 


^ permalink raw reply

* [PATCH net-next 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
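
As a rough illustration of the sharing, the diff below assigns completion
EQ i the MSI-X index (i + 1) % num_msix_usable, while the fatal error EQ
uses index 0. A sketch of that mapping follows. The helper name
eq_to_msix_index is hypothetical, and the sketch assumes num_msix_usable
is non-zero.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the expression used in the diff:
 * EQ 0 starts at MSI-X index 1, and the index wraps back to 0 once
 * the usable vectors are exhausted.
 */
static unsigned int eq_to_msix_index(unsigned int eq,
				     unsigned int num_msix_usable)
{
	return (eq + 1) % num_msix_usable;
}
```

With four usable vectors, EQs 0..2 land on indices 1..3, EQ 3 wraps to
index 0 (the one the fatal error EQ uses), and EQ 4 shares index 1 with
EQ 0.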

The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/infiniband/hw/mana/main.c | 33 ++++++++++++++++++++++++++-----
 include/net/mana/gdma.h           |  5 ++++-
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index cfa954460585..029609fb91c5 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -787,6 +787,7 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 {
 	struct gdma_context *gc = mdev_to_gc(mdev);
 	struct gdma_queue_spec spec = {};
+	struct gdma_irq_context *gic;
 	int err, i;
 
 	spec.type = GDMA_EQ;
@@ -797,9 +798,15 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
 	spec.eq.msix_index = 0;
 
+	gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+	if (!gic)
+		return -ENOMEM;
+
 	err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
-	if (err)
+	if (err) {
+		mana_gd_put_gic(gc, false, 0);
 		return err;
+	}
 
 	mdev->eqs = kcalloc(mdev->ib_dev.num_comp_vectors, sizeof(struct gdma_queue *),
 			    GFP_KERNEL);
@@ -810,31 +817,47 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
 	spec.eq.callback = NULL;
 	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
 		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+		gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+		if (!gic) {
+			err = -ENOMEM;
+			goto destroy_eqs;
+		}
+
 		err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
-		if (err)
+		if (err) {
+			mana_gd_put_gic(gc, false, spec.eq.msix_index);
 			goto destroy_eqs;
+		}
 	}
 
 	return 0;
 
 destroy_eqs:
-	while (i-- > 0)
+	while (i-- > 0) {
 		mana_gd_destroy_queue(gc, mdev->eqs[i]);
+		mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+	}
 	kfree(mdev->eqs);
 destroy_fatal_eq:
 	mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+	mana_gd_put_gic(gc, false, 0);
 	return err;
 }
 
 void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
 {
 	struct gdma_context *gc = mdev_to_gc(mdev);
-	int i;
+	int i, msi;
 
 	mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+	mana_gd_put_gic(gc, false, 0);
 
-	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+	for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
 		mana_gd_destroy_queue(gc, mdev->eqs[i]);
+		msi = (i + 1) % gc->num_msix_usable;
+		mana_gd_put_gic(gc, false, msi);
+	}
 
 	kfree(mdev->eqs);
 }
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 4eb94d1df439..f0d5c873f856 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -610,6 +610,8 @@ enum {
 
 /* Driver supports dynamic MSI-X vector allocation */
 #define GDMA_DRV_CAP_FLAG_1_DYNAMIC_IRQ_ALLOC_SUPPORT BIT(13)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
 
 /* Driver can self reset on EQE notification */
 #define GDMA_DRV_CAP_FLAG_1_SELF_RESET_ON_EQE BIT(14)
@@ -644,7 +646,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
 	 GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
+	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
+	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c   | 17 ++++++++++++++++-
 include/net/mana/gdma.h                         |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index f3dbc4881be4..61dc06dc8602 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -808,7 +808,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
 	}
 	spin_unlock_irqrestore(&gic->lock, flags);
 
-	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 	synchronize_rcu();
 }
 
@@ -923,6 +922,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 out:
 	dev_err(dev, "Failed to create EQ: %d\n", err);
 	mana_gd_destroy_eq(gc, false, queue);
+	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 	return err;
 }
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1e65670feb17..c0bd520dd54d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1600,6 +1600,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_queue *eq;
 	int i;
+	unsigned int msi;
 
 	if (!apc->eqs)
 		return;
@@ -1612,7 +1613,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
 		if (!eq)
 			continue;
 
+		msi = eq->eq.msix_index;
 		mana_gd_destroy_queue(gc, eq);
+		mana_gd_put_gic(gc, !gc->msi_sharing, msi);
 	}
 
 	kfree(apc->eqs);
@@ -1629,6 +1632,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
 	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
 	debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
 	debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+	debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
 	debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
 }
 
@@ -1639,6 +1643,7 @@ int mana_create_eq(struct mana_port_context *apc)
 	struct gdma_queue_spec spec = {};
 	int err;
 	int i;
+	struct gdma_irq_context *gic;
 
 	WARN_ON(apc->eqs);
 	apc->eqs = kcalloc(apc->num_queues, sizeof(struct mana_eq),
@@ -1656,12 +1661,22 @@ int mana_create_eq(struct mana_port_context *apc)
 	apc->mana_eqs_debugfs = debugfs_create_dir("EQs", apc->mana_port_debugfs);
 
 	for (i = 0; i < apc->num_queues; i++) {
-		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+		if (gc->msi_sharing)
+			spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+		gic = mana_gd_get_gic(gc, !gc->msi_sharing, &spec.eq.msix_index);
+		if (!gic) {
+			err = -ENOMEM;
+			goto out;
+		}
+
 		err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
 		if (err) {
 			dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+			mana_gd_put_gic(gc, !gc->msi_sharing, spec.eq.msix_index);
 			goto out;
 		}
+		apc->eqs[i].eq->eq.irq = gic->irq;
 		mana_create_eq_debugfs(apc, i);
 	}
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index be6bdd169b3d..4eb94d1df439 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -336,6 +336,7 @@ struct gdma_queue {
 			void *context;
 
 			unsigned int msix_index;
+			unsigned int irq;
 
 			u32 log2_throttle_limit;
 		} eq;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next 4/6] net: mana: Use GIC functions to allocate global EQs
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.

Signed-off-by: Long Li <longli@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 83 +++----------------
 1 file changed, 10 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index e9b839259c01..f3dbc4881be4 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1830,30 +1830,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * further used in irq_setup()
 	 */
 	for (i = 1; i <= nvec; i++) {
-		gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+		gic = mana_gd_get_gic(gc, false, &i);
 		if (!gic) {
 			err = -ENOMEM;
 			goto free_irq;
 		}
-		gic->handler = mana_gd_process_eq_events;
-		INIT_LIST_HEAD(&gic->eq_list);
-		spin_lock_init(&gic->lock);
-
-		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
-			 i - 1, pci_name(pdev));
-
-		/* one pci vector is already allocated for HWC */
-		irqs[i - 1] = pci_irq_vector(pdev, i);
-		if (irqs[i - 1] < 0) {
-			err = irqs[i - 1];
-			goto free_current_gic;
-		}
-
-		err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
-		if (err)
-			goto free_current_gic;
 
-		xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+		irqs[i - 1] = gic->irq;
 	}
 
 	/*
@@ -1875,19 +1858,11 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	kfree(irqs);
 	return 0;
 
-free_current_gic:
-	kfree(gic);
 free_irq:
 	for (i -= 1; i > 0; i--) {
 		irq = pci_irq_vector(pdev, i);
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
-			continue;
-
 		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
+		mana_gd_put_gic(gc, false, i);
 	}
 	kfree(irqs);
 	return err;
@@ -1908,34 +1883,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 	start_irqs = irqs;
 
 	for (i = 0; i < nvec; i++) {
-		gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+		gic = mana_gd_get_gic(gc, false, &i);
 		if (!gic) {
 			err = -ENOMEM;
 			goto free_irq;
 		}
 
-		gic->handler = mana_gd_process_eq_events;
-		INIT_LIST_HEAD(&gic->eq_list);
-		spin_lock_init(&gic->lock);
-
-		if (!i)
-			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
-				 pci_name(pdev));
-		else
-			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
-				 i - 1, pci_name(pdev));
-
-		irqs[i] = pci_irq_vector(pdev, i);
-		if (irqs[i] < 0) {
-			err = irqs[i];
-			goto free_current_gic;
-		}
-
-		err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
-		if (err)
-			goto free_current_gic;
-
-		xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+		irqs[i] = gic->irq;
 	}
 
 	/* If number of IRQ is one extra than number of online CPUs,
@@ -1964,19 +1918,11 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 	kfree(start_irqs);
 	return 0;
 
-free_current_gic:
-	kfree(gic);
 free_irq:
 	for (i -= 1; i >= 0; i--) {
 		irq = pci_irq_vector(pdev, i);
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
-			continue;
-
 		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
+		mana_gd_put_gic(gc, false, i);
 	}
 
 	kfree(start_irqs);
@@ -2051,26 +1997,17 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
 static void mana_gd_remove_irqs(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
-	struct gdma_irq_context *gic;
 	int irq, i;
 
 	if (gc->max_num_msix < 1)
 		return;
 
-	for (i = 0; i < gc->max_num_msix; i++) {
-		irq = pci_irq_vector(pdev, i);
-		if (irq < 0)
-			continue;
-
-		gic = xa_load(&gc->irq_contexts, i);
-		if (WARN_ON(!gic))
-			continue;
-
+	for (i = 0; i < (gc->msi_sharing ? gc->max_num_msix : 1); i++) {
 		/* Need to clear the hint before free_irq */
+		irq = pci_irq_vector(pdev, i);
 		irq_update_affinity_hint(irq, NULL);
-		free_irq(irq, gic);
-		xa_erase(&gc->irq_contexts, i);
-		kfree(gic);
+
+		mana_gd_put_gic(gc, false, i);
 	}
 
 	pci_free_irq_vectors(pdev);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.

Signed-off-by: Long Li <longli@microsoft.com>
---
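Illustrative note, not part of the patch: the get/put pairing described in the
commit message can be sketched in plain userspace C as below. This is only a
minimal model under stated assumptions. The names irq_ctx, irq_table, ctx_get,
ctx_put and NUM_VECTORS are hypothetical, a bool array stands in for the MSI
bitmap, and the IRQ request, dynamic MSI-X allocation and the mutex used by the
real functions are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NUM_VECTORS 8			/* hypothetical table size */

struct irq_ctx {
	int msi;
	unsigned int refcount;		/* all users of this vector */
	unsigned int bitmap_refs;	/* users that came in through the bitmap */
};

struct irq_table {
	struct irq_ctx *slots[NUM_VECTORS];	/* stands in for the xarray */
	bool in_use[NUM_VECTORS];		/* stands in for the MSI bitmap */
};

/* Take a reference; with use_bitmap the first free index is chosen and
 * written back through *msi, otherwise *msi selects the vector. */
static struct irq_ctx *ctx_get(struct irq_table *t, bool use_bitmap, int *msi)
{
	struct irq_ctx *ctx;
	int i;

	if (use_bitmap) {
		for (i = 0; i < NUM_VECTORS && t->in_use[i]; i++)
			;
		if (i == NUM_VECTORS)
			return NULL;	/* no free vector */
		*msi = i;
	} else if (*msi < 0 || *msi >= NUM_VECTORS) {
		return NULL;
	}

	ctx = t->slots[*msi];
	if (!ctx) {
		/* First user of this vector: create the context. */
		ctx = calloc(1, sizeof(*ctx));
		if (!ctx)
			return NULL;
		ctx->msi = *msi;
		t->slots[*msi] = ctx;
	}

	ctx->refcount++;
	if (use_bitmap) {
		ctx->bitmap_refs++;
		t->in_use[*msi] = true;
	}
	return ctx;
}

/* Drop a reference; the context is freed when the last user is gone. */
static void ctx_put(struct irq_table *t, bool use_bitmap, int msi)
{
	struct irq_ctx *ctx;

	if (msi < 0 || msi >= NUM_VECTORS || !t->slots[msi])
		return;
	ctx = t->slots[msi];

	if (use_bitmap) {
		assert(ctx->bitmap_refs > 0);
		if (--ctx->bitmap_refs == 0)
			t->in_use[msi] = false;	/* slot is free for reuse */
	}

	assert(ctx->refcount > 0);
	if (--ctx->refcount == 0) {
		t->slots[msi] = NULL;
		free(ctx);
	}
}
```

In this model a caller that passes use_bitmap=true gets the lowest free index
back through *msi, while a caller that passes use_bitmap=false with the same
index shares the existing context and only bumps refcount.
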
 .../net/ethernet/microsoft/mana/gdma_main.c   | 158 ++++++++++++++++++
 include/net/mana/gdma.h                       |  10 ++
 2 files changed, 168 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 62e3a2eb68e0..e9b839259c01 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1558,6 +1558,163 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+	struct pci_dev *dev = to_pci_dev(gc->dev);
+	struct msi_map irq_map;
+	struct gdma_irq_context *gic;
+	int irq;
+
+	mutex_lock(&gc->gic_mutex);
+
+	gic = xa_load(&gc->irq_contexts, msi);
+	if (WARN_ON(!gic)) {
+		mutex_unlock(&gc->gic_mutex);
+		return;
+	}
+
+	if (use_msi_bitmap)
+		gic->bitmap_refs--;
+
+	if (use_msi_bitmap && gic->bitmap_refs == 0)
+		clear_bit(msi, gc->msi_bitmap);
+
+	if (!refcount_dec_and_test(&gic->refcount))
+		goto out;
+
+	irq = pci_irq_vector(dev, msi);
+
+	irq_update_affinity_hint(irq, NULL);
+	free_irq(irq, gic);
+
+	if (pci_msix_can_alloc_dyn(dev)) {
+		irq_map.virq = irq;
+		irq_map.index = msi;
+		pci_msix_free_irq(dev, irq_map);
+	}
+
+	xa_erase(&gc->irq_contexts, msi);
+	kfree(gic);
+
+out:
+	mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on an MSI vector.
+ * An MSI can be shared between different EQs. This function supports either
+ * picking a free MSI from the bitmap or using a caller-supplied MSI index.
+ *
+ * @use_msi_bitmap:
+ * True: pick a free MSI from the bitmap and return it in *msi_requested.
+ * False: use the MSI index the caller passes in *msi_requested.
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+					 bool use_msi_bitmap,
+					 int *msi_requested)
+{
+	struct gdma_irq_context *gic;
+	struct pci_dev *dev = to_pci_dev(gc->dev);
+	struct msi_map irq_map = { };
+	int irq;
+	int msi;
+	int err;
+
+	mutex_lock(&gc->gic_mutex);
+
+	if (use_msi_bitmap) {
+		msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+		if (msi >= gc->num_msix_usable) {
+			dev_err(gc->dev, "No free MSI vectors available\n");
+			gic = NULL;
+			goto out;
+		}
+		*msi_requested = msi;
+	} else {
+		msi = *msi_requested;
+	}
+
+	gic = xa_load(&gc->irq_contexts, msi);
+	if (gic) {
+		refcount_inc(&gic->refcount);
+		if (use_msi_bitmap) {
+			gic->bitmap_refs++;
+			set_bit(msi, gc->msi_bitmap);
+		}
+		goto out;
+	}
+
+	irq = pci_irq_vector(dev, msi);
+	if (irq == -EINVAL) {
+		irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+		if (!irq_map.virq) {
+			err = irq_map.index;
+			dev_err(gc->dev,
+				"Failed to alloc irq_map msi %d err %d\n",
+				msi, err);
+			gic = NULL;
+			goto out;
+		}
+		irq = irq_map.virq;
+		msi = irq_map.index;
+	}
+
+	gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+	if (!gic) {
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	gic->handler = mana_gd_process_eq_events;
+	gic->msi = msi;
+	gic->irq = irq;
+	INIT_LIST_HEAD(&gic->eq_list);
+	spin_lock_init(&gic->lock);
+
+	if (!gic->msi)
+		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+			 pci_name(dev));
+	else
+		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+			 gic->msi, pci_name(dev));
+
+	err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+	if (err) {
+		dev_err(gc->dev, "Failed to request irq %d %s\n",
+			irq, gic->name);
+		kfree(gic);
+		gic = NULL;
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	refcount_set(&gic->refcount, 1);
+	gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+	err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+	if (err) {
+		dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+			msi, err);
+		free_irq(irq, gic);
+		kfree(gic);
+		gic = NULL;
+		if (irq_map.virq)
+			pci_msix_free_irq(dev, irq_map);
+		goto out;
+	}
+
+	if (use_msi_bitmap)
+		set_bit(msi, gc->msi_bitmap);
+
+out:
+	mutex_unlock(&gc->gic_mutex);
+	return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
 int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
 {
 	r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2040,6 +2197,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto release_region;
 
 	mutex_init(&gc->eq_test_event_mutex);
+	mutex_init(&gc->gic_mutex);
 	pci_set_drvdata(pdev, gc);
 	gc->bar0_pa = pci_resource_start(pdev, 0);
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 477b751f124e..be6bdd169b3d 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -382,6 +382,10 @@ struct gdma_irq_context {
 	spinlock_t lock;
 	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
+	unsigned int msi;
+	unsigned int irq;
+	refcount_t refcount;
+	unsigned int bitmap_refs;
 };
 
 enum gdma_context_flags {
@@ -441,6 +445,9 @@ struct gdma_context {
 
 	unsigned long		flags;
 
+	/* Protect access to GIC context */
+	struct mutex		gic_mutex;
+
 	/* Indicate if this device is sharing MSI for EQs on MANA */
 	bool msi_sharing;
 
@@ -1007,6 +1014,9 @@ int mana_gd_resume(struct pci_dev *pdev);
 
 bool mana_need_log(struct gdma_context *gc, int err);
 
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc, bool use_msi_bitmap,
+					 int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
 int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
 			     u32 proto_minor_ver, u32 proto_micro_ver,
 			     u16 *max_num_vports, u8 *bm_hostmode);
-- 
2.43.0



* [PATCH net-next 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The number of queues per vPort
is clamped to no less than 16. MSI-X sharing among vPorts is disabled
by default and is only enabled when there are not enough MSI-X vectors
for dedicated allocation.

Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.

Signed-off-by: Long Li <longli@microsoft.com>
---
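Illustrative note, not part of the patch: the per-vPort queue arithmetic added
to mana_gd_query_max_resources() can be worked through with the userspace
sketch below. The names queue_plan, plan_queues, roundup_pow2 and min_u are
hypothetical stand-ins, and the sharing_preset parameter models the case where
msi_sharing was already set to true before this calculation runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result container; not a driver structure. */
struct queue_plan {
	unsigned int per_vport_queues;
	bool msi_sharing;
};

/* Userspace stand-in for the kernel's roundup_pow_of_two(), for n >= 1. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * num_msix_usable: usable MSI-X vectors, including the one used by the HWC
 * hw_max_queues:   device queue limit, already capped at num_msix_usable - 1
 * sharing_preset:  true if sharing was already forced earlier
 */
static struct queue_plan plan_queues(unsigned int num_msix_usable,
				     unsigned int num_ports,
				     unsigned int hw_max_queues,
				     bool sharing_preset)
{
	struct queue_plan plan = { .msi_sharing = sharing_preset };
	unsigned int eq_vectors, q;

	assert(num_ports > 0 && num_msix_usable > 1);
	eq_vectors = num_msix_usable - 1;	/* one vector is kept for the HWC */

	/* Even split across ports, rounded up to a power of two, at least 16. */
	q = eq_vectors / num_ports;
	q = roundup_pow2(q > 1 ? q : 1);
	if (q < 16)
		q = 16;
	q = min_u(hw_max_queues, q);

	/* Not enough vectors for dedicated allocation: fall back to sharing. */
	if (q * num_ports > eq_vectors)
		plan.msi_sharing = true;

	plan.per_vport_queues = plan.msi_sharing ?
		min_u(eq_vectors, hw_max_queues) : q;
	return plan;
}
```

With 65 usable vectors, 2 ports and a device limit of 64, this yields 32 queues
per vPort with no sharing. With 49 usable vectors, 2 ports and a limit of 48,
the even split of 24 is rounded up to 32, 32 * 2 exceeds 48, and sharing is
enabled with 48 queues per vPort.
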
 .../net/ethernet/microsoft/mana/gdma_main.c   | 66 ++++++++++++++++---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 36 +++++-----
 include/net/mana/gdma.h                       | 13 +++-
 3 files changed, 91 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 0055c231acf6..62e3a2eb68e0 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -107,6 +107,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_query_max_resources_resp resp = {};
 	struct gdma_general_req req = {};
+	unsigned int max_num_queues;
+	u8 bm_hostmode;
+	u16 num_ports;
 	int err;
 
 	mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
@@ -152,6 +155,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues > gc->num_msix_usable - 1)
 		gc->max_num_queues = gc->num_msix_usable - 1;
 
+	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+				       MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+	if (err)
+		return err;
+
+	if (!num_ports)
+		return -EINVAL;
+
+	/*
+	 * Derive a per-vPort queue limit from gc->max_num_queues (as returned by
+	 * the SOC) so each vPort can have dedicated MSI-X; never go below 16.
+	 */
+	max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+	max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
+	if (max_num_queues < 16)
+		max_num_queues = 16;
+
+	/*
+	 * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
+	 * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
+	 */
+	max_num_queues = min(gc->max_num_queues, max_num_queues);
+	if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+		gc->msi_sharing = true;
+
+	/* If MSI is shared, use max allowed value */
+	if (gc->msi_sharing)
+		gc->max_num_queues_vport = min(gc->num_msix_usable - 1, gc->max_num_queues);
+	else
+		gc->max_num_queues_vport = max_num_queues;
+
+	dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+		 gc->msi_sharing, gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1802,6 +1839,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
 		/* Need 1 interrupt for HWC */
 		max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
 		min_irqs = 2;
+		gc->msi_sharing = true;
 	}
 
 	nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1880,6 +1918,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
 
 	pci_free_irq_vectors(pdev);
 
+	bitmap_free(gc->msi_bitmap);
+	gc->msi_bitmap = NULL;
 	gc->max_num_msix = 0;
 	gc->num_msix_usable = 0;
 }
@@ -1911,20 +1951,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	if (err)
 		goto destroy_hwc;
 
-	err = mana_gd_query_max_resources(pdev);
+	err = mana_gd_detect_devices(pdev);
 	if (err)
 		goto destroy_hwc;
 
-	err = mana_gd_setup_remaining_irqs(pdev);
-	if (err) {
-		dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
-		goto destroy_hwc;
-	}
-
-	err = mana_gd_detect_devices(pdev);
+	err = mana_gd_query_max_resources(pdev);
 	if (err)
 		goto destroy_hwc;
 
+	if (!gc->msi_sharing) {
+		gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+		if (!gc->msi_bitmap) {
+			err = -ENOMEM;
+			goto destroy_hwc;
+		}
+		/* Set bit for HWC */
+		set_bit(0, gc->msi_bitmap);
+	} else {
+		err = mana_gd_setup_remaining_irqs(pdev);
+		if (err) {
+			dev_err(gc->dev, "Failed to setup remaining IRQs: %d\n", err);
+			goto destroy_hwc;
+		}
+	}
+
 	dev_dbg(&pdev->dev, "mana gdma setup successful\n");
 	return 0;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 566e45a66adf..1e65670feb17 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1002,10 +1002,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
 	return !apc->rxqs ? -ENOMEM : 0;
 }
 
-static int mana_send_request(struct mana_context *ac, void *in_buf,
-			     u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+				  u32 in_len, void *out_buf, u32 out_len)
 {
-	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_resp_hdr *resp = out_buf;
 	struct gdma_req_hdr *req = in_buf;
 	struct device *dev = gc->dev;
@@ -1039,6 +1038,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
 	return 0;
 }
 
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+			     u32 in_len, void *out_buf, u32 out_len)
+{
+	struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+	return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
 static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
 				const enum mana_command_code expected_code,
 				const u32 min_size)
@@ -1172,11 +1179,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
 			   err, resp.hdr.status);
 }
 
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
-				 u32 proto_minor_ver, u32 proto_micro_ver,
-				 u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+			     u32 proto_minor_ver, u32 proto_micro_ver,
+			     u16 *max_num_vports, u8 *bm_hostmode)
 {
-	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct mana_query_device_cfg_resp resp = {};
 	struct mana_query_device_cfg_req req = {};
 	struct device *dev = gc->dev;
@@ -1191,7 +1197,7 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
 	req.proto_minor_ver = proto_minor_ver;
 	req.proto_micro_ver = proto_micro_ver;
 
-	err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+	err = gdma_mana_send_request(gc, &req, sizeof(req), &resp, sizeof(resp));
 	if (err) {
 		dev_err(dev, "Failed to query config: %d", err);
 		return err;
@@ -1219,8 +1225,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
 	else
 		*bm_hostmode = 0;
 
-	debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
 	return 0;
 }
 
@@ -3334,7 +3338,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	int err;
 
 	ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
-				 gc->max_num_queues);
+				 gc->max_num_queues_vport);
 	if (!ndev)
 		return -ENOMEM;
 
@@ -3343,8 +3347,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc = netdev_priv(ndev);
 	apc->ac = ac;
 	apc->ndev = ndev;
-	apc->max_queues = gc->max_num_queues;
-	apc->num_queues = gc->max_num_queues;
+	apc->max_queues = gc->max_num_queues_vport;
+	apc->num_queues = gc->max_num_queues_vport;
 	apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
 	apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
 	apc->port_handle = INVALID_MANA_HANDLE;
@@ -3596,13 +3600,15 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		gd->driver_data = ac;
 	}
 
-	err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
-				    MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+	err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+				       MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
 	if (err)
 		goto out;
 
 	ac->bm_hostmode = bm_hostmode;
 
+	debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
+
 	if (!resuming) {
 		ac->num_ports = num_ports;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 766f4fb25e26..477b751f124e 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -392,8 +392,10 @@ struct gdma_context {
 	struct device		*dev;
 	struct dentry		*mana_pci_debugfs;
 
-	/* Per-vPort max number of queues */
+	/* Hardware max number of queues */
 	unsigned int		max_num_queues;
+	/* Per-vPort max number of queues */
+	unsigned int		max_num_queues_vport;
 	unsigned int		max_num_msix;
 	unsigned int		num_msix_usable;
 	struct xarray		irq_contexts;
@@ -438,6 +440,12 @@ struct gdma_context {
 	struct workqueue_struct *service_wq;
 
 	unsigned long		flags;
+
+	/* Indicate if this device is sharing MSI for EQs on MANA */
+	bool msi_sharing;
+
+	/* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+	unsigned long *msi_bitmap;
 };
 
 static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -999,4 +1007,7 @@ int mana_gd_resume(struct pci_dev *pdev);
 
 bool mana_need_log(struct gdma_context *gc, int err);
 
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+			     u32 proto_minor_ver, u32 proto_micro_ver,
+			     u16 *max_num_vports, u8 *bm_hostmode);
 #endif /* _GDMA_H */
-- 
2.43.0



* [PATCH net-next 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-02-28  2:11 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
	Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
	Dexuan Cui
  Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260228021144.85054-1-longli@microsoft.com>

To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.

Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.

Signed-off-by: Long Li <longli@microsoft.com>
---
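Illustrative note, not part of the patch: with EQs now owned by each vPort, the
two index calculations used in the diff can be checked in isolation. The helper
names pick_eq_index and eq_msix_index are hypothetical; they only restate the
expressions "comp_vector % num_queues" and "(i + 1) % num_msix_usable" from the
code, on the assumption that both operands are unsigned.

```c
#include <assert.h>

/* Map an RDMA completion vector onto one of the vPort's EQs, wrapping
 * around when the vector number exceeds the number of EQs. */
static unsigned int pick_eq_index(unsigned int comp_vector,
				  unsigned int num_queues)
{
	assert(num_queues > 0);
	return comp_vector % num_queues;
}

/* MSI-X index assigned to EQ i. EQ 0 starts at index 1, and the result
 * wraps back to index 0 once i + 1 reaches num_msix_usable. */
static unsigned int eq_msix_index(unsigned int i,
				  unsigned int num_msix_usable)
{
	assert(num_msix_usable > 0);
	return (i + 1) % num_msix_usable;
}
```

For a vPort with 4 EQs, completion vector 5 lands on EQ 1. With 4 usable
vectors, EQs 0 through 2 use MSI-X indexes 1 through 3 and EQ 3 wraps to
index 0.
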
 drivers/infiniband/hw/mana/main.c             |  14 ++-
 drivers/infiniband/hw/mana/qp.c               |   4 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 111 ++++++++++--------
 include/net/mana/mana.h                       |   7 +-
 4 files changed, 83 insertions(+), 53 deletions(-)

diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index fac159f7128d..cfa954460585 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
 	pd->vport_use_count--;
 	WARN_ON(pd->vport_use_count < 0);
 
-	if (!pd->vport_use_count)
+	if (!pd->vport_use_count) {
+		mana_destroy_eq(mpc);
 		mana_uncfg_vport(mpc);
+	}
 
 	mutex_unlock(&pd->vport_mutex);
 }
@@ -55,15 +57,21 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
 		return err;
 	}
 
-	mutex_unlock(&pd->vport_mutex);
 
 	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
 	pd->tx_vp_offset = mpc->tx_vp_offset;
+	err = mana_create_eq(mpc);
+	if (err) {
+		mana_uncfg_vport(mpc);
+		pd->vport_use_count--;
+	}
+
+	mutex_unlock(&pd->vport_mutex);
 
 	ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
 		  mpc->port_handle, pd->pdn, doorbell_id);
 
-	return 0;
+	return err;
 }
 
 int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 48c1f4977f21..d71c301b29c2 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -189,7 +189,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		cq_spec.gdma_region = cq->queue.gdma_region;
 		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
 		cq_spec.modr_ctx_id = 0;
-		eq = &mpc->ac->eqs[cq->comp_vector];
+		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
 		cq_spec.attached_eq = eq->eq->id;
 
 		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
@@ -341,7 +341,7 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
 	cq_spec.modr_ctx_id = 0;
 	eq_vec = send_cq->comp_vector;
-	eq = &mpc->ac->eqs[eq_vec];
+	eq = &mpc->eqs[eq_vec % mpc->num_queues];
 	cq_spec.attached_eq = eq->eq->id;
 
 	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9b5a72ada5c4..566e45a66adf 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1590,79 +1590,83 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
 }
 EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
 
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
 {
+	struct mana_context *ac = apc->ac;
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	struct gdma_queue *eq;
 	int i;
 
-	if (!ac->eqs)
+	if (!apc->eqs)
 		return;
 
-	debugfs_remove_recursive(ac->mana_eqs_debugfs);
-	ac->mana_eqs_debugfs = NULL;
+	debugfs_remove_recursive(apc->mana_eqs_debugfs);
+	apc->mana_eqs_debugfs = NULL;
 
-	for (i = 0; i < gc->max_num_queues; i++) {
-		eq = ac->eqs[i].eq;
+	for (i = 0; i < apc->num_queues; i++) {
+		eq = apc->eqs[i].eq;
 		if (!eq)
 			continue;
 
 		mana_gd_destroy_queue(gc, eq);
 	}
 
-	kfree(ac->eqs);
-	ac->eqs = NULL;
+	kfree(apc->eqs);
+	apc->eqs = NULL;
 }
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
 
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
 {
-	struct mana_eq eq = ac->eqs[i];
+	struct mana_eq eq = apc->eqs[i];
 	char eqnum[32];
 
 	sprintf(eqnum, "eq%d", i);
-	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+	eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
 	debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
 	debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
 	debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
 }
 
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
 {
-	struct gdma_dev *gd = ac->gdma_dev;
+	struct gdma_dev *gd = apc->ac->gdma_dev;
 	struct gdma_context *gc = gd->gdma_context;
 	struct gdma_queue_spec spec = {};
 	int err;
 	int i;
 
-	ac->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
-			  GFP_KERNEL);
-	if (!ac->eqs)
+	WARN_ON(apc->eqs);
+	apc->eqs = kcalloc(apc->num_queues, sizeof(struct mana_eq),
+			   GFP_KERNEL);
+	if (!apc->eqs)
 		return -ENOMEM;
 
 	spec.type = GDMA_EQ;
 	spec.monitor_avl_buf = false;
 	spec.queue_size = EQ_SIZE;
 	spec.eq.callback = NULL;
-	spec.eq.context = ac->eqs;
+	spec.eq.context = apc->eqs;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
 
-	ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+	apc->mana_eqs_debugfs = debugfs_create_dir("EQs", apc->mana_port_debugfs);
 
-	for (i = 0; i < gc->max_num_queues; i++) {
+	for (i = 0; i < apc->num_queues; i++) {
 		spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
-		err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+		err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
 		if (err) {
 			dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
 			goto out;
 		}
-		mana_create_eq_debugfs(ac, i);
+		mana_create_eq_debugfs(apc, i);
 	}
 
 	return 0;
 out:
-	mana_destroy_eq(ac);
+	mana_destroy_eq(apc);
 	return err;
 }
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
 
 static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
 {
@@ -2381,7 +2385,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 		spec.monitor_avl_buf = false;
 		spec.queue_size = cq_size;
 		spec.cq.callback = mana_schedule_napi;
-		spec.cq.parent_eq = ac->eqs[i].eq;
+		spec.cq.parent_eq = apc->eqs[i].eq;
 		spec.cq.context = cq;
 		err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
 		if (err)
@@ -2775,13 +2779,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
 static int mana_add_rx_queues(struct mana_port_context *apc,
 			      struct net_device *ndev)
 {
-	struct mana_context *ac = apc->ac;
 	struct mana_rxq *rxq;
 	int err = 0;
 	int i;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+		rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
 		if (!rxq) {
 			err = -ENOMEM;
 			netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2800,9 +2803,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
 	return err;
 }
 
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
 {
-	struct gdma_dev *gd = apc->ac->gdma_dev;
 	struct mana_rxq *rxq;
 	u32 rxq_idx;
 
@@ -2814,8 +2816,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
 		mana_destroy_rxq(apc, rxq, true);
 		apc->rxqs[rxq_idx] = NULL;
 	}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+	struct gdma_dev *gd = apc->ac->gdma_dev;
 
-	mana_destroy_txq(apc);
 	mana_uncfg_vport(apc);
 
 	if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2836,11 +2842,7 @@ static int mana_create_vport(struct mana_port_context *apc,
 			return err;
 	}
 
-	err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
-	if (err)
-		return err;
-
-	return mana_create_txq(apc, net);
+	return mana_cfg_vport(apc, gd->pdid, gd->doorbell);
 }
 
 static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3117,21 +3119,36 @@ int mana_alloc_queues(struct net_device *ndev)
 
 	err = mana_create_vport(apc, ndev);
 	if (err) {
-		netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+		netdev_err(ndev, "Failed to create vPort %u : %d\n",
+			   apc->port_idx, err);
 		return err;
 	}
 
+	err = mana_create_eq(apc);
+	if (err) {
+		netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+			   apc->port_idx, err);
+		goto destroy_vport;
+	}
+
+	err = mana_create_txq(apc, ndev);
+	if (err) {
+		netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+			   apc->port_idx, err);
+		goto destroy_eq;
+	}
+
 	err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
 	if (err) {
 		netdev_err(ndev,
 			   "netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
 			   apc->num_queues, err);
-		goto destroy_vport;
+		goto destroy_txq;
 	}
 
 	err = mana_add_rx_queues(apc, ndev);
 	if (err)
-		goto destroy_vport;
+		goto destroy_rxq;
 
 	apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
 
@@ -3140,7 +3157,7 @@ int mana_alloc_queues(struct net_device *ndev)
 		netdev_err(ndev,
 			   "netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
 			   apc->num_queues, err);
-		goto destroy_vport;
+		goto destroy_rxq;
 	}
 
 	mana_rss_table_init(apc);
@@ -3148,19 +3165,25 @@ int mana_alloc_queues(struct net_device *ndev)
 	err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
 	if (err) {
 		netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
-		goto destroy_vport;
+		goto destroy_rxq;
 	}
 
 	if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
 		err = mana_pf_register_filter(apc);
 		if (err)
-			goto destroy_vport;
+			goto destroy_rxq;
 	}
 
 	mana_chn_setxdp(apc, mana_xdp_get(apc));
 
 	return 0;
 
+destroy_rxq:
+	mana_destroy_rxqs(apc);
+destroy_txq:
+	mana_destroy_txq(apc);
+destroy_eq:
+	mana_destroy_eq(apc);
 destroy_vport:
 	mana_destroy_vport(apc);
 	return err;
@@ -3263,6 +3286,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
 		netdev_err(ndev, "Failed to disable vPort: %d\n", err);
 
 	/* Even in err case, still need to cleanup the vPort */
+	mana_destroy_rxqs(apc);
+	mana_destroy_txq(apc);
+	mana_destroy_eq(apc);
 	mana_destroy_vport(apc);
 
 	return 0;
@@ -3570,12 +3596,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 		gd->driver_data = ac;
 	}
 
-	err = mana_create_eq(ac);
-	if (err) {
-		dev_err(dev, "Failed to create EQs: %d\n", err);
-		goto out;
-	}
-
 	err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
 				    MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
 	if (err)
@@ -3714,7 +3734,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 		free_netdev(ndev);
 	}
 
-	mana_destroy_eq(ac);
 out:
 	if (ac->per_port_queue_reset_wq) {
 		destroy_workqueue(ac->per_port_queue_reset_wq);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..787e637059df 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -478,8 +478,6 @@ struct mana_context {
 	u8 bm_hostmode;
 
 	struct mana_ethtool_hc_stats hc_stats;
-	struct mana_eq *eqs;
-	struct dentry *mana_eqs_debugfs;
 	struct workqueue_struct *per_port_queue_reset_wq;
 	/* Workqueue for querying hardware stats */
 	struct delayed_work gf_stats_work;
@@ -499,6 +497,9 @@ struct mana_port_context {
 
 	u8 mac_addr[ETH_ALEN];
 
+	struct mana_eq *eqs;
+	struct dentry *mana_eqs_debugfs;
+
 	enum TRI_STATE rss_state;
 
 	mana_handle_t default_rxobj;
@@ -1023,6 +1024,8 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
 int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
 		   u32 doorbell_pg_id);
 void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
 
 struct net_device *mana_get_primary_netdev(struct mana_context *ac,
 					   u32 port_index,
-- 
2.43.0



