Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: Dexuan Cui @ 2019-08-06  5:17 UTC (permalink / raw)
  To: netdev@vger.kernel.org, David S. Miller, Haiyang Zhang,
	Stephen Hemminger
  Cc: sashal@kernel.org, KY Srinivasan, Michael Kelley,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com


This fixes a warning of "suspicious rcu_dereference_check() usage"
when nload runs.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 44 +++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index f9209594624b5..25502d335b94f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1236,25 +1236,10 @@ static void netvsc_get_pcpu_stats(struct net_device *net,
 	}
 }
 
-static void netvsc_get_stats64(struct net_device *net,
-			       struct rtnl_link_stats64 *t)
+static void netvsc_get_per_chan_stats(struct netvsc_device *nvdev,
+				      struct rtnl_link_stats64 *t)
 {
-	struct net_device_context *ndev_ctx = netdev_priv(net);
-	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
-	struct netvsc_vf_pcpu_stats vf_tot;
-	int i;
-
-	if (!nvdev)
-		return;
-
-	netdev_stats_to_stats64(t, &net->stats);
-
-	netvsc_get_vf_stats(net, &vf_tot);
-	t->rx_packets += vf_tot.rx_packets;
-	t->tx_packets += vf_tot.tx_packets;
-	t->rx_bytes   += vf_tot.rx_bytes;
-	t->tx_bytes   += vf_tot.tx_bytes;
-	t->tx_dropped += vf_tot.tx_dropped;
+	u32 i;
 
 	for (i = 0; i < nvdev->num_chn; i++) {
 		const struct netvsc_channel *nvchan = &nvdev->chan_table[i];
@@ -1286,6 +1271,29 @@ static void netvsc_get_stats64(struct net_device *net,
 	}
 }
 
+static void netvsc_get_stats64(struct net_device *net,
+			       struct rtnl_link_stats64 *t)
+{
+	struct net_device_context *ndev_ctx = netdev_priv(net);
+	struct netvsc_device *nvdev;
+	struct netvsc_vf_pcpu_stats vf_tot;
+
+	netdev_stats_to_stats64(t, &net->stats);
+
+	netvsc_get_vf_stats(net, &vf_tot);
+	t->rx_packets += vf_tot.rx_packets;
+	t->tx_packets += vf_tot.tx_packets;
+	t->rx_bytes   += vf_tot.rx_bytes;
+	t->tx_bytes   += vf_tot.tx_bytes;
+	t->tx_dropped += vf_tot.tx_dropped;
+
+	rcu_read_lock();
+	nvdev = rcu_dereference(ndev_ctx->nvdev);
+	if (nvdev)
+		netvsc_get_per_chan_stats(nvdev, t);
+	rcu_read_unlock();
+}
+
 static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
 {
 	struct net_device_context *ndc = netdev_priv(ndev);

^ permalink raw reply related

* RE: hv_netvsc: WARNING: suspicious RCU usage?
From: Dexuan Cui @ 2019-08-06  5:25 UTC (permalink / raw)
  To: netdev@vger.kernel.org, Yidong Ren, Haiyang Zhang,
	Stephen Hemminger, David S. Miller
  Cc: linux-hyperv@vger.kernel.org
In-Reply-To: <PU1P153MB0169B7073A4865D50AED9EE5BFDA0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> Sent: Monday, August 5, 2019 4:56 PM
> The warning is caused by the rcu_dereference_rtnl() :
> 
> 1239 static void netvsc_get_stats64(struct net_device *net,
> 1240                                struct rtnl_link_stats64 *t)
> 1241 {
> 1242         struct net_device_context *ndev_ctx = netdev_priv(net);
> 1243         struct netvsc_device *nvdev =
> rcu_dereference_rtnl(ndev_ctx->nvdev);
> 
> I think here netvsc_get_stats64() neither holds rcu_read_lock() nor RTNL
> 
> IMO it should call rcu_read_lock()/unlock(), or get RTNL to fix the warning?

I just posted a patch on behalf of Stephen:
[PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
 
Thanks,
-- Dexuan


^ permalink raw reply

* Re: [PATCH] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Bjorn Helgaas @ 2019-08-06 18:55 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: sashal@kernel.org, lorenzo.pieralisi@arm.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, olaf@aepfle.de, vkuznets,
	linux-kernel@vger.kernel.org
In-Reply-To: <1564771954-9181-1-git-send-email-haiyangz@microsoft.com>

On Fri, Aug 02, 2019 at 06:52:56PM +0000, Haiyang Zhang wrote:
> Due to Azure host agent settings, the device instance ID's bytes 8 and 9
> are no longer unique. This causes some of the PCI devices not showing up
> in VMs with multiple passthrough devices, such as GPUs. So, as recommended
> by Azure host team, we now use the bytes 4 and 5 which usually provide
> unique numbers.

What does "Azure host agent settings" mean?  Would it be useful to say
something more specific, so users could ready this and say "oh, I'm
using the Azure host agent settings mentioned here, so I need this
patch"?  Is this related to a specific Azure host agent commit or
release?

"This causes some of the PCI devices ..." is not a sentence.  I think
I understand what you're saying -- "This sometimes causes device
passthrough to VMs to fail." Is there something about GPUs that makes
them more susceptible to this problem?

I think there are really two changes in this patch:

  1) Start with a domain number from bytes 4-5 instead of bytes 8-9.

  2) If the domain number is not unique, allocate another one using
  the bitmap.

It sounds like part 2) by itself would be enough to solve the problem,
and including part 1) just reduces the likelihood of having to
allocate another domain number.

> In the rare cases of collision, we will detect and find another number
> that is not in use.
> Thanks to Michael Kelley <mikelley@microsoft.com> for proposing this idea.

This looks like two paragraphs and should have a blank line between
them.

> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 91 +++++++++++++++++++++++++++++++------
>  1 file changed, 78 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 82acd61..6b9cc6e60a 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -37,6 +37,8 @@
>   * the PCI back-end driver in Hyper-V.
>   */
>  
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> @@ -2507,6 +2509,47 @@ static void put_hvpcibus(struct hv_pcibus_device *hbus)
>  		complete(&hbus->remove_event);
>  }
>  
> +#define HVPCI_DOM_MAP_SIZE (64 * 1024)
> +static DECLARE_BITMAP(hvpci_dom_map, HVPCI_DOM_MAP_SIZE);
> +
> +/* PCI domain number 0 is used by emulated devices on Gen1 VMs, so define 0
> + * as invalid for passthrough PCI devices of this driver.
> + */

Please use the usual multi-line comment style:

  /*
   * PCI domain number ...
   */

> +#define HVPCI_DOM_INVALID 0
> +
> +/**
> + * hv_get_dom_num() - Get a valid PCI domain number
> + * Check if the PCI domain number is in use, and return another number if
> + * it is in use.
> + *
> + * @dom: Requested domain number
> + *
> + * return: domain number on success, HVPCI_DOM_INVALID on failure
> + */
> +static u16 hv_get_dom_num(u16 dom)
> +{
> +	unsigned int i;
> +
> +	if (test_and_set_bit(dom, hvpci_dom_map) == 0)
> +		return dom;
> +
> +	for_each_clear_bit(i, hvpci_dom_map, HVPCI_DOM_MAP_SIZE) {
> +		if (test_and_set_bit(i, hvpci_dom_map) == 0)
> +			return i;
> +	}
> +
> +	return HVPCI_DOM_INVALID;
> +}
> +
> +/**
> + * hv_put_dom_num() - Mark the PCI domain number as free
> + * @dom: Domain number to be freed
> + */
> +static void hv_put_dom_num(u16 dom)
> +{
> +	clear_bit(dom, hvpci_dom_map);
> +}
> +
>  /**
>   * hv_pci_probe() - New VMBus channel probe, for a root PCI bus
>   * @hdev:	VMBus's tracking struct for this root PCI bus
> @@ -2518,6 +2561,7 @@ static int hv_pci_probe(struct hv_device *hdev,
>  			const struct hv_vmbus_device_id *dev_id)
>  {
>  	struct hv_pcibus_device *hbus;
> +	u16 dom_req, dom;
>  	int ret;
>  
>  	/*
> @@ -2532,19 +2576,32 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	hbus->state = hv_pcibus_init;
>  
>  	/*
> -	 * The PCI bus "domain" is what is called "segment" in ACPI and
> -	 * other specs.  Pull it from the instance ID, to get something
> -	 * unique.  Bytes 8 and 9 are what is used in Windows guests, so
> -	 * do the same thing for consistency.  Note that, since this code
> -	 * only runs in a Hyper-V VM, Hyper-V can (and does) guarantee
> -	 * that (1) the only domain in use for something that looks like
> -	 * a physical PCI bus (which is actually emulated by the
> -	 * hypervisor) is domain 0 and (2) there will be no overlap
> -	 * between domains derived from these instance IDs in the same
> -	 * VM.
> +	 * The PCI bus "domain" is what is called "segment" in ACPI and other
> +	 * specs. Pull it from the instance ID, to get something usually
> +	 * unique. In rare cases of collision, we will find out another number
> +	 * not in use.
> +	 * Note that, since this code only runs in a Hyper-V VM, Hyper-V
> +	 * together with this guest driver can guarantee that (1) The only
> +	 * domain used by Gen1 VMs for something that looks like a physical
> +	 * PCI bus (which is actually emulated by the hypervisor) is domain 0.
> +	 * (2) There will be no overlap between domains (after fixing possible
> +	 * collisions) in the same VM.

Please use blank lines between paragraphs.

>  	 */
> -	hbus->sysdata.domain = hdev->dev_instance.b[9] |
> -			       hdev->dev_instance.b[8] << 8;
> +	dom_req = hdev->dev_instance.b[5] << 8 | hdev->dev_instance.b[4];
> +	dom = hv_get_dom_num(dom_req);
> +
> +	if (dom == HVPCI_DOM_INVALID) {
> +		pr_err("Unable to use dom# 0x%hx or other numbers",
> +		       dom_req);
> +		ret = -EINVAL;
> +		goto free_bus;
> +	}
> +
> +	if (dom != dom_req)
> +		pr_info("PCI dom# 0x%hx has collision, using 0x%hx",
> +			dom_req, dom);

Can these use "dev_err/info(&hdev->device, ...)" like the other
message in this function?  It's always nicer to have a specific device
reference when one is available.  Probably don't need the new pr_fmt()
definition if you do this.

> +
> +	hbus->sysdata.domain = dom;
>  
>  	hbus->hdev = hdev;
>  	refcount_set(&hbus->remove_lock, 1);
> @@ -2559,7 +2616,7 @@ static int hv_pci_probe(struct hv_device *hdev,
>  					   hbus->sysdata.domain);
>  	if (!hbus->wq) {
>  		ret = -ENOMEM;
> -		goto free_bus;
> +		goto free_dom;
>  	}
>  
>  	ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
> @@ -2636,6 +2693,8 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	vmbus_close(hdev->channel);
>  destroy_wq:
>  	destroy_workqueue(hbus->wq);
> +free_dom:
> +	hv_put_dom_num(hbus->sysdata.domain);
>  free_bus:
>  	free_page((unsigned long)hbus);
>  	return ret;
> @@ -2717,6 +2776,9 @@ static int hv_pci_remove(struct hv_device *hdev)
>  	put_hvpcibus(hbus);
>  	wait_for_completion(&hbus->remove_event);
>  	destroy_workqueue(hbus->wq);
> +
> +	hv_put_dom_num(hbus->sysdata.domain);
> +
>  	free_page((unsigned long)hbus);
>  	return 0;
>  }
> @@ -2744,6 +2806,9 @@ static void __exit exit_hv_pci_drv(void)
>  
>  static int __init init_hv_pci_drv(void)
>  {
> +	/* Set the invalid domain number's bit, so it will not be used */
> +	set_bit(HVPCI_DOM_INVALID, hvpci_dom_map);
> +
>  	return vmbus_driver_register(&hv_pci_drv);
>  }
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Bjorn Helgaas @ 2019-08-06 18:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Haiyang Zhang, lorenzo.pieralisi@arm.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, olaf@aepfle.de, vkuznets,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190805183657.GD17747@sasha-vm>

On Mon, Aug 05, 2019 at 02:36:57PM -0400, Sasha Levin wrote:
> On Fri, Aug 02, 2019 at 06:52:56PM +0000, Haiyang Zhang wrote:
> > Due to Azure host agent settings, the device instance ID's bytes 8 and 9
> > are no longer unique. This causes some of the PCI devices not showing up
> > in VMs with multiple passthrough devices, such as GPUs. So, as recommended
> > by Azure host team, we now use the bytes 4 and 5 which usually provide
> > unique numbers.
> > 
> > In the rare cases of collision, we will detect and find another number
> > that is not in use.
> > Thanks to Michael Kelley <mikelley@microsoft.com> for proposing this idea.

> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Acked-by: Sasha Levin <sashal@kernel.org>
> 
> Bjorn, will you take it through the PCI tree or do you want me to take
> it through hyper-v?

Lorenzo usually applies patches to drivers/pci/controller/*, so that
would be my preference.

Bjorn

^ permalink raw reply

* Re: [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: Jakub Kicinski @ 2019-08-06 19:12 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: netdev@vger.kernel.org, David S. Miller, Haiyang Zhang,
	Stephen Hemminger, sashal@kernel.org, KY Srinivasan,
	Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com
In-Reply-To: <PU1P153MB0169AECABF6094A3E7BEE381BFD50@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Tue, 6 Aug 2019 05:17:44 +0000, Dexuan Cui wrote:
> This fixes a warning of "suspicious rcu_dereference_check() usage"
> when nload runs.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

Minor change in behaviour would perhaps be worth acknowledging in the
commit message (since you check ndev for NULL later now), and a Fixes
tag would be good.

But the looks pretty straightforward and correct!

^ permalink raw reply

* RE: [PATCH] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Haiyang Zhang @ 2019-08-06 19:45 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: sashal@kernel.org, lorenzo.pieralisi@arm.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, olaf@aepfle.de, vkuznets,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190806185515.GR151852@google.com>



> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, August 6, 2019 2:55 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; lorenzo.pieralisi@arm.com; linux-
> hyperv@vger.kernel.org; linux-pci@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> olaf@aepfle.de; vkuznets <vkuznets@redhat.com>; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] PCI: hv: Detect and fix Hyper-V PCI domain number
> collision
> 
> On Fri, Aug 02, 2019 at 06:52:56PM +0000, Haiyang Zhang wrote:
> > Due to Azure host agent settings, the device instance ID's bytes 8 and
> > 9 are no longer unique. This causes some of the PCI devices not
> > showing up in VMs with multiple passthrough devices, such as GPUs. So,
> > as recommended by Azure host team, we now use the bytes 4 and 5 which
> > usually provide unique numbers.
> 
> What does "Azure host agent settings" mean?  Would it be useful to say
> something more specific, so users could ready this and say "oh, I'm using the
> Azure host agent settings mentioned here, so I need this patch"?  Is this
> related to a specific Azure host agent commit or release?
> 
> "This causes some of the PCI devices ..." is not a sentence.  I think I
> understand what you're saying -- "This sometimes causes device passthrough
> to VMs to fail." Is there something about GPUs that makes them more
> susceptible to this problem?
> 
> I think there are really two changes in this patch:
> 
>   1) Start with a domain number from bytes 4-5 instead of bytes 8-9.
> 
>   2) If the domain number is not unique, allocate another one using
>   the bitmap.
> 
> It sounds like part 2) by itself would be enough to solve the problem, and
> including part 1) just reduces the likelihood of having to allocate another
> domain number.
> 
> > In the rare cases of collision, we will detect and find another number
> > that is not in use.
> > Thanks to Michael Kelley <mikelley@microsoft.com> for proposing this
> idea.
> 
> This looks like two paragraphs and should have a blank line between them.
> 
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 91
> > +++++++++++++++++++++++++++++++------
> >  1 file changed, 78 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 82acd61..6b9cc6e60a 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -37,6 +37,8 @@
> >   * the PCI back-end driver in Hyper-V.
> >   */
> >
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> >  #include <linux/kernel.h>
> >  #include <linux/module.h>
> >  #include <linux/pci.h>
> > @@ -2507,6 +2509,47 @@ static void put_hvpcibus(struct
> hv_pcibus_device *hbus)
> >  		complete(&hbus->remove_event);
> >  }
> >
> > +#define HVPCI_DOM_MAP_SIZE (64 * 1024) static
> > +DECLARE_BITMAP(hvpci_dom_map, HVPCI_DOM_MAP_SIZE);
> > +
> > +/* PCI domain number 0 is used by emulated devices on Gen1 VMs, so
> > +define 0
> > + * as invalid for passthrough PCI devices of this driver.
> > + */
> 
> Please use the usual multi-line comment style:
> 
>   /*
>    * PCI domain number ...
>    */
> 
> > +#define HVPCI_DOM_INVALID 0
> > +
> > +/**
> > + * hv_get_dom_num() - Get a valid PCI domain number
> > + * Check if the PCI domain number is in use, and return another
> > +number if
> > + * it is in use.
> > + *
> > + * @dom: Requested domain number
> > + *
> > + * return: domain number on success, HVPCI_DOM_INVALID on failure  */
> > +static u16 hv_get_dom_num(u16 dom) {
> > +	unsigned int i;
> > +
> > +	if (test_and_set_bit(dom, hvpci_dom_map) == 0)
> > +		return dom;
> > +
> > +	for_each_clear_bit(i, hvpci_dom_map, HVPCI_DOM_MAP_SIZE) {
> > +		if (test_and_set_bit(i, hvpci_dom_map) == 0)
> > +			return i;
> > +	}
> > +
> > +	return HVPCI_DOM_INVALID;
> > +}
> > +
> > +/**
> > + * hv_put_dom_num() - Mark the PCI domain number as free
> > + * @dom: Domain number to be freed
> > + */
> > +static void hv_put_dom_num(u16 dom)
> > +{
> > +	clear_bit(dom, hvpci_dom_map);
> > +}
> > +
> >  /**
> >   * hv_pci_probe() - New VMBus channel probe, for a root PCI bus
> >   * @hdev:	VMBus's tracking struct for this root PCI bus
> > @@ -2518,6 +2561,7 @@ static int hv_pci_probe(struct hv_device *hdev,
> >  			const struct hv_vmbus_device_id *dev_id)  {
> >  	struct hv_pcibus_device *hbus;
> > +	u16 dom_req, dom;
> >  	int ret;
> >
> >  	/*
> > @@ -2532,19 +2576,32 @@ static int hv_pci_probe(struct hv_device
> *hdev,
> >  	hbus->state = hv_pcibus_init;
> >
> >  	/*
> > -	 * The PCI bus "domain" is what is called "segment" in ACPI and
> > -	 * other specs.  Pull it from the instance ID, to get something
> > -	 * unique.  Bytes 8 and 9 are what is used in Windows guests, so
> > -	 * do the same thing for consistency.  Note that, since this code
> > -	 * only runs in a Hyper-V VM, Hyper-V can (and does) guarantee
> > -	 * that (1) the only domain in use for something that looks like
> > -	 * a physical PCI bus (which is actually emulated by the
> > -	 * hypervisor) is domain 0 and (2) there will be no overlap
> > -	 * between domains derived from these instance IDs in the same
> > -	 * VM.
> > +	 * The PCI bus "domain" is what is called "segment" in ACPI and other
> > +	 * specs. Pull it from the instance ID, to get something usually
> > +	 * unique. In rare cases of collision, we will find out another number
> > +	 * not in use.
> > +	 * Note that, since this code only runs in a Hyper-V VM, Hyper-V
> > +	 * together with this guest driver can guarantee that (1) The only
> > +	 * domain used by Gen1 VMs for something that looks like a physical
> > +	 * PCI bus (which is actually emulated by the hypervisor) is domain 0.
> > +	 * (2) There will be no overlap between domains (after fixing possible
> > +	 * collisions) in the same VM.
> 
> Please use blank lines between paragraphs.
> 
> >  	 */
> > -	hbus->sysdata.domain = hdev->dev_instance.b[9] |
> > -			       hdev->dev_instance.b[8] << 8;
> > +	dom_req = hdev->dev_instance.b[5] << 8 | hdev->dev_instance.b[4];
> > +	dom = hv_get_dom_num(dom_req);
> > +
> > +	if (dom == HVPCI_DOM_INVALID) {
> > +		pr_err("Unable to use dom# 0x%hx or other numbers",
> > +		       dom_req);
> > +		ret = -EINVAL;
> > +		goto free_bus;
> > +	}
> > +
> > +	if (dom != dom_req)
> > +		pr_info("PCI dom# 0x%hx has collision, using 0x%hx",
> > +			dom_req, dom);
> 
> Can these use "dev_err/info(&hdev->device, ...)" like the other message in
> this function?  It's always nicer to have a specific device reference when one
> is available.  Probably don't need the new pr_fmt() definition if you do this.
> 
> > +
> > +	hbus->sysdata.domain = dom;
> >
> >  	hbus->hdev = hdev;
> >  	refcount_set(&hbus->remove_lock, 1); @@ -2559,7 +2616,7 @@
> static
> > int hv_pci_probe(struct hv_device *hdev,
> >  					   hbus->sysdata.domain);
> >  	if (!hbus->wq) {
> >  		ret = -ENOMEM;
> > -		goto free_bus;
> > +		goto free_dom;
> >  	}
> >
> >  	ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL,
> > 0, @@ -2636,6 +2693,8 @@ static int hv_pci_probe(struct hv_device
> *hdev,
> >  	vmbus_close(hdev->channel);
> >  destroy_wq:
> >  	destroy_workqueue(hbus->wq);
> > +free_dom:
> > +	hv_put_dom_num(hbus->sysdata.domain);
> >  free_bus:
> >  	free_page((unsigned long)hbus);
> >  	return ret;
> > @@ -2717,6 +2776,9 @@ static int hv_pci_remove(struct hv_device *hdev)
> >  	put_hvpcibus(hbus);
> >  	wait_for_completion(&hbus->remove_event);
> >  	destroy_workqueue(hbus->wq);
> > +
> > +	hv_put_dom_num(hbus->sysdata.domain);
> > +
> >  	free_page((unsigned long)hbus);
> >  	return 0;
> >  }
> > @@ -2744,6 +2806,9 @@ static void __exit exit_hv_pci_drv(void)
> >
> >  static int __init init_hv_pci_drv(void)  {
> > +	/* Set the invalid domain number's bit, so it will not be used */
> > +	set_bit(HVPCI_DOM_INVALID, hvpci_dom_map);
> > +
> >  	return vmbus_driver_register(&hv_pci_drv);
> >  }
> >
> > --
> > 1.8.3.1

Thanks for the comments. I will make the recommended changes.

- Haiyang

^ permalink raw reply

* Re: [PATCH v2] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
From: Bjorn Helgaas @ 2019-08-06 20:16 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
	Michael Kelley, Stephen Hemminger, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	jackm@mellanox.com
In-Reply-To: <PU1P153MB01693F32F6BB02F9655CC84EBFD90@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

Thanks for updating this.  But you didn't update the subject line,
which is really still a little too low-level.  Maybe Lorenzo will fix
this.  Something like this, maybe?

  PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it

On Fri, Aug 02, 2019 at 10:50:20PM +0000, Dexuan Cui wrote:
> 
> The slot must be removed before the pci_dev is removed, otherwise a panic
> can happen due to use-after-free.
> 
> Fixes: 15becc2b56c6 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
> 
> Changes in v2:
>   Improved the changelog accordign to the discussion with Bjorn Helgaas:
> 	  https://lkml.org/lkml/2019/8/1/1173
> 	  https://lkml.org/lkml/2019/8/2/1559
> 
>  drivers/pci/controller/pci-hyperv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 6b9cc6e60a..68c611d 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2757,8 +2757,8 @@ static int hv_pci_remove(struct hv_device *hdev)
>  		/* Remove the bus from PCI's point of view. */
>  		pci_lock_rescan_remove();
>  		pci_stop_root_bus(hbus->pci_bus);
> -		pci_remove_root_bus(hbus->pci_bus);
>  		hv_pci_remove_slots(hbus);
> +		pci_remove_root_bus(hbus->pci_bus);
>  		pci_unlock_rescan_remove();
>  		hbus->state = hv_pcibus_removed;
>  	}
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Fix panic by calling hv_pci_remove_slots() earlier
From: Dexuan Cui @ 2019-08-06 20:41 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
	Michael Kelley, Stephen Hemminger, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	jackm@mellanox.com
In-Reply-To: <20190806201611.GT151852@google.com>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Bjorn Helgaas
> Sent: Tuesday, August 6, 2019 1:16 PM
> To: Dexuan Cui <decui@microsoft.com>
> 
> Thanks for updating this.  But you didn't update the subject line,
> which is really still a little too low-level.  Maybe Lorenzo will fix
> this.  Something like this, maybe?
> 
>   PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it

This is better. Thanks!

I hope Lorenzo can help to fix this so I could avoid a v3. :-)

Thanks,
-- Dexuan

^ permalink raw reply

* [PATCH v2] PCI: hv: Detect and fix Hyper-V PCI domain number collision
From: Haiyang Zhang @ 2019-08-06 23:52 UTC (permalink / raw)
  To: sashal@kernel.org, bhelgaas@google.com, lorenzo.pieralisi@arm.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, olaf@aepfle.de,
	vkuznets, linux-kernel@vger.kernel.org

Currently in Azure cloud, for passthrough devices including GPU, the
host sets the device instance ID's bytes 8 - 15 to a value derived from
the host HWID, which is the same on all devices in a VM. So, the device
instance ID's bytes 8 and 9 provided by the host are no longer unique.

This can cause device passthrough to VMs to fail because the bytes 8 and
9 is used as PCI domain number. So, as recommended by Azure host team,
we now use the bytes 4 and 5 which usually contain unique numbers as PCI
domain. The chance of collision is greatly reduced. In the rare cases of
collision, we will detect and find another number that is not in use.

Thanks to Michael Kelley <mikelley@microsoft.com> for proposing this idea.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Acked-by: Sasha Levin <sashal@kernel.org>
---
 drivers/pci/controller/pci-hyperv.c | 92 +++++++++++++++++++++++++++++++------
 1 file changed, 79 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 40b6254..4f3d97e 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2510,6 +2510,48 @@ static void put_hvpcibus(struct hv_pcibus_device *hbus)
 		complete(&hbus->remove_event);
 }
 
+#define HVPCI_DOM_MAP_SIZE (64 * 1024)
+static DECLARE_BITMAP(hvpci_dom_map, HVPCI_DOM_MAP_SIZE);
+
+/*
+ * PCI domain number 0 is used by emulated devices on Gen1 VMs, so define 0
+ * as invalid for passthrough PCI devices of this driver.
+ */
+#define HVPCI_DOM_INVALID 0
+
+/**
+ * hv_get_dom_num() - Get a valid PCI domain number
+ * Check if the PCI domain number is in use, and return another number if
+ * it is in use.
+ *
+ * @dom: Requested domain number
+ *
+ * return: domain number on success, HVPCI_DOM_INVALID on failure
+ */
+static u16 hv_get_dom_num(u16 dom)
+{
+	unsigned int i;
+
+	if (test_and_set_bit(dom, hvpci_dom_map) == 0)
+		return dom;
+
+	for_each_clear_bit(i, hvpci_dom_map, HVPCI_DOM_MAP_SIZE) {
+		if (test_and_set_bit(i, hvpci_dom_map) == 0)
+			return i;
+	}
+
+	return HVPCI_DOM_INVALID;
+}
+
+/**
+ * hv_put_dom_num() - Mark the PCI domain number as free
+ * @dom: Domain number to be freed
+ */
+static void hv_put_dom_num(u16 dom)
+{
+	clear_bit(dom, hvpci_dom_map);
+}
+
 /**
  * hv_pci_probe() - New VMBus channel probe, for a root PCI bus
  * @hdev:	VMBus's tracking struct for this root PCI bus
@@ -2521,6 +2563,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 			const struct hv_vmbus_device_id *dev_id)
 {
 	struct hv_pcibus_device *hbus;
+	u16 dom_req, dom;
 	int ret;
 
 	/*
@@ -2535,19 +2578,34 @@ static int hv_pci_probe(struct hv_device *hdev,
 	hbus->state = hv_pcibus_init;
 
 	/*
-	 * The PCI bus "domain" is what is called "segment" in ACPI and
-	 * other specs.  Pull it from the instance ID, to get something
-	 * unique.  Bytes 8 and 9 are what is used in Windows guests, so
-	 * do the same thing for consistency.  Note that, since this code
-	 * only runs in a Hyper-V VM, Hyper-V can (and does) guarantee
-	 * that (1) the only domain in use for something that looks like
-	 * a physical PCI bus (which is actually emulated by the
-	 * hypervisor) is domain 0 and (2) there will be no overlap
-	 * between domains derived from these instance IDs in the same
-	 * VM.
+	 * The PCI bus "domain" is what is called "segment" in ACPI and other
+	 * specs. Pull it from the instance ID, to get something usually
+	 * unique. In rare cases of collision, we will find out another number
+	 * not in use.
+	 *
+	 * Note that, since this code only runs in a Hyper-V VM, Hyper-V
+	 * together with this guest driver can guarantee that (1) The only
+	 * domain used by Gen1 VMs for something that looks like a physical
+	 * PCI bus (which is actually emulated by the hypervisor) is domain 0.
+	 * (2) There will be no overlap between domains (after fixing possible
+	 * collisions) in the same VM.
 	 */
-	hbus->sysdata.domain = hdev->dev_instance.b[9] |
-			       hdev->dev_instance.b[8] << 8;
+	dom_req = hdev->dev_instance.b[5] << 8 | hdev->dev_instance.b[4];
+	dom = hv_get_dom_num(dom_req);
+
+	if (dom == HVPCI_DOM_INVALID) {
+		dev_err(&hdev->device,
+			"Unable to use dom# 0x%hx or other numbers", dom_req);
+		ret = -EINVAL;
+		goto free_bus;
+	}
+
+	if (dom != dom_req)
+		dev_info(&hdev->device,
+			 "PCI dom# 0x%hx has collision, using 0x%hx",
+			 dom_req, dom);
+
+	hbus->sysdata.domain = dom;
 
 	hbus->hdev = hdev;
 	refcount_set(&hbus->remove_lock, 1);
@@ -2562,7 +2620,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 					   hbus->sysdata.domain);
 	if (!hbus->wq) {
 		ret = -ENOMEM;
-		goto free_bus;
+		goto free_dom;
 	}
 
 	ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
@@ -2639,6 +2697,8 @@ static int hv_pci_probe(struct hv_device *hdev,
 	vmbus_close(hdev->channel);
 destroy_wq:
 	destroy_workqueue(hbus->wq);
+free_dom:
+	hv_put_dom_num(hbus->sysdata.domain);
 free_bus:
 	free_page((unsigned long)hbus);
 	return ret;
@@ -2720,6 +2780,9 @@ static int hv_pci_remove(struct hv_device *hdev)
 	put_hvpcibus(hbus);
 	wait_for_completion(&hbus->remove_event);
 	destroy_workqueue(hbus->wq);
+
+	hv_put_dom_num(hbus->sysdata.domain);
+
 	free_page((unsigned long)hbus);
 	return 0;
 }
@@ -2747,6 +2810,9 @@ static void __exit exit_hv_pci_drv(void)
 
 static int __init init_hv_pci_drv(void)
 {
+	/* Set the invalid domain number's bit, so it will not be used */
+	set_bit(HVPCI_DOM_INVALID, hvpci_dom_map);
+
 	return vmbus_driver_register(&hv_pci_drv);
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* RE: [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: Dexuan Cui @ 2019-08-07  6:56 UTC (permalink / raw)
  To: Jakub Kicinski, Stephen Hemminger
  Cc: netdev@vger.kernel.org, David S. Miller, Haiyang Zhang,
	sashal@kernel.org, KY Srinivasan, Michael Kelley,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com
In-Reply-To: <20190806121242.141c2324@cakuba.netronome.com>

> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Tuesday, August 6, 2019 12:13 PM
> To: Dexuan Cui <decui@microsoft.com>
> 
> On Tue, 6 Aug 2019 05:17:44 +0000, Dexuan Cui wrote:
> > This fixes a warning of "suspicious rcu_dereference_check() usage"
> > when nload runs.
> >
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> Minor change in behaviour would perhaps be worth acknowledging in the
> commit message (since you check ndev for NULL later now), and a Fixes
> tag would be good.
> 
> But the looks pretty straightforward and correct!

Hi,
Yeah, it looks the minor behavior change doesn't matter, because IMO the 
'nvdev' can only be NULL when the NIC is being removed, or the MTU is
being changed, etc.

The Fixes tag is:
Fixes: 776e726bfb34 ("netvsc: fix RCU warning in get_stats")

If I should send a v2, please let me know.

Thanks,
-- Dexuan


^ permalink raw reply

* RE: [PATCH v2 4/7] Drivers: hv: vmbus: Suspend/resume the synic for hibernation
From: Michael Kelley @ 2019-08-07 15:21 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1564595464-56520-5-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Wednesday, July 31, 2019 10:52 AM
> 
> This is needed when we resume the old kernel from the "current" kernel.
> 
> Note: when hv_synic_suspend() and hv_synic_resume() run, all the
> non-boot CPUs have been offlined, and interrupts are disabled on CPU0.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH v2 5/7] Drivers: hv: vmbus: Ignore the offers when resuming from hibernation
From: Michael Kelley @ 2019-08-07 15:22 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1564595464-56520-6-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>  Sent: Wednesday, July 31, 2019 10:52 AM
> 
> When the VM resumes, the host re-sends the offers. We should not add the
> offers to the global vmbus_connection.chn_list again.
> 
> Added some debug code, in case the host screws up the exact info related to
> the offers.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/channel_mgmt.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH v2 6/7] Drivers: hv: vmbus: Suspend/resume the vmbus itself for hibernation
From: Michael Kelley @ 2019-08-07 15:22 UTC (permalink / raw)
  To: Dexuan Cui, linux-hyperv@vger.kernel.org,
	gregkh@linuxfoundation.org, Stephen Hemminger, Sasha Levin,
	sashal@kernel.org, Haiyang Zhang, KY Srinivasan,
	tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1564595464-56520-7-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Wednesday, July 31, 2019 10:52 AM
> 
> Before Linux enters hibernation, it sends the CHANNELMSG_UNLOAD message to
> the host so all the offers are gone. After hibernation, Linux needs to
> re-negotiate with the host using the same vmbus protocol version (which
> was in use before hibernation), and ask the host to re-offer the vmbus
> devices.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/connection.c   |  3 +--
>  drivers/hv/hyperv_vmbus.h |  2 ++
>  drivers/hv/vmbus_drv.c    | 50
> +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 2 deletions(-)
> 

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply

* RE: [PATCH v2 6/7] Drivers: hv: vmbus: Suspend/resume the vmbus itself for hibernation
From: Dexuan Cui @ 2019-08-07 23:09 UTC (permalink / raw)
  To: linux-hyperv@vger.kernel.org, gregkh@linuxfoundation.org,
	Stephen Hemminger, Sasha Levin, sashal@kernel.org, Haiyang Zhang,
	KY Srinivasan, Michael Kelley, tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1564595464-56520-7-git-send-email-decui@microsoft.com>

> From: Dexuan Cui <decui@microsoft.com>
> Sent: Wednesday, July 31, 2019 10:52 AM
> To: linux-hyperv@vger.kernel.org; gregkh@linuxfoundation.org; Stephen
> @@ -2050,6 +2095,10 @@ static int vmbus_acpi_add(struct acpi_device
> *device)
>  };
>  MODULE_DEVICE_TABLE(acpi, vmbus_acpi_device_ids);
> 
> +static const struct dev_pm_ops vmbus_bus_pm = {
> +	SET_SYSTEM_SLEEP_PM_OPS(vmbus_bus_suspend, vmbus_bus_resume)
> +};

It turns out this " SET_SYSTEM_SLEEP_PM_OPS" should be changed to
"SET_NOIRQ_SYSTEM_SLEEP_PM_OPS" to make NIC SR-IOV work with
hibernation.

This is because the "pci_dev_pm_ops" uses the "noirq" callbacks. 
In the resume path, the "noirq" restore callback runs before the non-noirq
callbacks, meaning the vmbus_bus_resume() and the pci-hyperv's .resume()
must also run via the "noirq" callbacks.

Similarly, I also need to make the same change in this patch:
[PATCH v2 7/7] Drivers: hv: vmbus: Implement suspend/resume for VSC drivers for hibernation

BTW, I'd like to change the print_hex_dump_debug() to print_hex_dump() in this patch:
[PATCH v2 5/7] Drivers: hv: vmbus: Ignore the offers when resuming from hibernation

print_hex_dump_debug() outputs nothing unless we enable the dyndbg. 
This is not good as we do want see the error messages here, if there is an error.

I'll do more tests and post a v3 of the patchset.

Thanks,
-- Dexuan

^ permalink raw reply

* [PATCH] PCI: PM: Also move to D0 before calling pci_legacy_resume_early()
From: Dexuan Cui @ 2019-08-08 18:46 UTC (permalink / raw)
  To: lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org
  Cc: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	Stephen Hemminger, jackm@mellanox.com, Dexuan Cui


In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN.
In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0,
but the current code misses the pci_legacy_resume_early() path, so the
state remains in PCI_UNKNOWN in that path, and during hiberantion this
causes an error for the Mellanox VF driver, which fails to enable
MSI-X: pci_msi_supported() is false due to dev->current_state != PCI_D0:

mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode
mlx4_core a6d1:00:02.0: Sending reset
mlx4_core a6d1:00:02.0: Sending vhcr0
mlx4_core a6d1:00:02.0: HCA minimum page size:512
mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode
mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting
PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
PM: Device a6d1:00:02.0 failed to thaw: error -95

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/pci/pci-driver.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 36dbe960306b..27dfc68db9e7 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device *dev)
 			return error;
 	}
 
-	if (pci_has_legacy_pm_support(pci_dev))
-		return pci_legacy_resume_early(dev);
-
 	/*
 	 * pci_restore_state() requires the device to be in D0 (because of MSI
 	 * restoration among other things), so force it into D0 in case the
 	 * driver's "freeze" callbacks put it into a low-power state directly.
 	 */
 	pci_set_power_state(pci_dev, PCI_D0);
+
+	if (pci_has_legacy_pm_support(pci_dev))
+		return pci_legacy_resume_early(dev);
+
 	pci_restore_state(pci_dev);
 
 	if (drv && drv->pm && drv->pm->thaw_noirq)
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH] PCI: PM: Also move to D0 before calling pci_legacy_resume_early()
From: Bjorn Helgaas @ 2019-08-08 19:19 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
	Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	Stephen Hemminger, jackm@mellanox.com
In-Reply-To: <PU1P153MB01695867B01987A8C239A8CCBFD70@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Thu, Aug 08, 2019 at 06:46:51PM +0000, Dexuan Cui wrote:
> 
> In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN.
> In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0,
> but the current code misses the pci_legacy_resume_early() path, so the
> state remains in PCI_UNKNOWN in that path, and during hiberantion this
> causes an error for the Mellanox VF driver, which fails to enable
> MSI-X: pci_msi_supported() is false due to dev->current_state != PCI_D0:

s/hiberantion/hibernation/

Actually, it sounds more like "during *resume*, this causes an error",
so maybe you want s/hiberantion/resume/ instead?

> mlx4_core a6d1:00:02.0: Detected virtual function - running in slave mode
> mlx4_core a6d1:00:02.0: Sending reset
> mlx4_core a6d1:00:02.0: Sending vhcr0
> mlx4_core a6d1:00:02.0: HCA minimum page size:512
> mlx4_core a6d1:00:02.0: Timestamping is not supported in slave mode
> mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting
> PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
> PM: Device a6d1:00:02.0 failed to thaw: error -95
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/pci/pci-driver.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 36dbe960306b..27dfc68db9e7 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1074,15 +1074,16 @@ static int pci_pm_thaw_noirq(struct device *dev)
>  			return error;
>  	}
>  
> -	if (pci_has_legacy_pm_support(pci_dev))
> -		return pci_legacy_resume_early(dev);
> -
>  	/*
>  	 * pci_restore_state() requires the device to be in D0 (because of MSI
>  	 * restoration among other things), so force it into D0 in case the
>  	 * driver's "freeze" callbacks put it into a low-power state directly.
>  	 */
>  	pci_set_power_state(pci_dev, PCI_D0);
> +
> +	if (pci_has_legacy_pm_support(pci_dev))
> +		return pci_legacy_resume_early(dev);
> +
>  	pci_restore_state(pci_dev);
>  
>  	if (drv && drv->pm && drv->pm->thaw_noirq)
> -- 
> 2.19.1
> 

^ permalink raw reply

* RE: [PATCH] PCI: PM: Also move to D0 before calling pci_legacy_resume_early()
From: Dexuan Cui @ 2019-08-08 20:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: lorenzo.pieralisi@arm.com, linux-pci@vger.kernel.org,
	Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org, Sasha Levin,
	Haiyang Zhang, KY Srinivasan, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	Stephen Hemminger, jackm@mellanox.com
In-Reply-To: <20190808191913.GI151852@google.com>

> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Thursday, August 8, 2019 12:19 PM
> To: Dexuan Cui <decui@microsoft.com>
> 
> On Thu, Aug 08, 2019 at 06:46:51PM +0000, Dexuan Cui wrote:
> >
> > In pci_legacy_suspend_late(), the device state is moved to PCI_UNKNOWN.
> > In pci_pm_thaw_noirq(), the state is supposed to be moved back to PCI_D0,
> > but the current code misses the pci_legacy_resume_early() path, so the
> > state remains in PCI_UNKNOWN in that path, and during hiberantion this
> > causes an error for the Mellanox VF driver, which fails to enable
> > MSI-X: pci_msi_supported() is false due to dev->current_state != PCI_D0:
> 
> s/hiberantion/hibernation/

Thanks for spoting this typo! :-)
 
> Actually, it sounds more like "during *resume*, this causes an error",
> so maybe you want s/hiberantion/resume/ instead?

Yes, it's during "resume", and to be more accurate, it happens during the
"resume" of the "thaw" phase: when we run "echo disk > /sys/power/state",
first the kernel "freezes" all the devices and create a hibernation image, then
the kernel "thaws" the devices including the disk/NIC, and writes the memory
to the disk and powers down. This patch fixes the error message for the
Mellanox VF driver in this phase. 

When the system starts again, a fresh kernel starts to run, and when the kernel
detects that a hibernation image was saved, the kernel "quiesces" the devices,
and "restores" the devices from the saved image. Here device_resume_noirq()
-> ... -> pci_pm_restore_noirq() -> pci_pm_default_resume_early() ->
pci_power_up() moves the device states back to PCI_D0. This path is not broken
and doesn't need my patch. 

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: David Miller @ 2019-08-09  1:13 UTC (permalink / raw)
  To: decui
  Cc: netdev, haiyangz, sthemmin, sashal, kys, mikelley, linux-hyperv,
	linux-kernel, olaf, apw, jasowang, vkuznets, marcelo.cerri
In-Reply-To: <PU1P153MB0169AECABF6094A3E7BEE381BFD50@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

From: Dexuan Cui <decui@microsoft.com>
Date: Tue, 6 Aug 2019 05:17:44 +0000

> 
> This fixes a warning of "suspicious rcu_dereference_check() usage"
> when nload runs.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

Please resend with appropriate fixes tag.

^ permalink raw reply

* RE: [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: Dexuan Cui @ 2019-08-09  1:16 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, KY Srinivasan, Michael Kelley,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com
In-Reply-To: <20190808.181350.1331633709956960086.davem@davemloft.net>

> From: David Miller <davem@davemloft.net>
> Sent: Thursday, August 8, 2019 6:14 PM
> To: Dexuan Cui <decui@microsoft.com>
> Cc: netdev@vger.kernel.org; Haiyang Zhang <haiyangz@microsoft.com>;
> Stephen Hemminger <sthemmin@microsoft.com>; sashal@kernel.org; KY
> Srinivasan <kys@microsoft.com>; Michael Kelley <mikelley@microsoft.com>;
> linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; olaf@aepfle.de;
> apw@canonical.com; jasowang@redhat.com; vkuznets
> <vkuznets@redhat.com>; marcelo.cerri@canonical.com
> Subject: Re: [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
> 
> From: Dexuan Cui <decui@microsoft.com>
> Date: Tue, 6 Aug 2019 05:17:44 +0000
> 
> >
> > This fixes a warning of "suspicious rcu_dereference_check() usage"
> > when nload runs.
> >
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> Please resend with appropriate fixes tag.

Will do shortly.

Thanks,
-- Dexuan

^ permalink raw reply

* [PATCH net v2] hv_netvsc: Fix a warning of suspicious RCU usage
From: Dexuan Cui @ 2019-08-09  1:58 UTC (permalink / raw)
  To: netdev@vger.kernel.org, David S. Miller, Haiyang Zhang,
	Stephen Hemminger, Jakub Kicinski
  Cc: sashal@kernel.org, KY Srinivasan, Michael Kelley,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com


This fixes a warning of "suspicious rcu_dereference_check() usage"
when nload runs.

Fixes: 776e726bfb34 ("netvsc: fix RCU warning in get_stats")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes in v2:
    Made the minimal required change.
	Added a Fixes tag.
	Removed Stephen H.'s Signed-off-by since this is somewhat different from the 
		v1 from him; if there is any bug in v2, it's all my fault. :-)

 drivers/net/hyperv/netvsc_drv.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index f9209594624b..b6357a75712c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1240,12 +1240,15 @@ static void netvsc_get_stats64(struct net_device *net,
 			       struct rtnl_link_stats64 *t)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
-	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
+	struct netvsc_device *nvdev;
 	struct netvsc_vf_pcpu_stats vf_tot;
 	int i;
 
+	rcu_read_lock();
+
+	nvdev = rcu_dereference(ndev_ctx->nvdev);
 	if (!nvdev)
-		return;
+		goto out;
 
 	netdev_stats_to_stats64(t, &net->stats);
 
@@ -1284,6 +1287,8 @@ static void netvsc_get_stats64(struct net_device *net,
 		t->rx_packets	+= packets;
 		t->multicast	+= multicast;
 	}
+out:
+	rcu_read_unlock();
 }
 
 static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
-- 
2.19.1


^ permalink raw reply related

* [PATCH 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support
From: lantianyu1986 @ 2019-08-09  9:49 UTC (permalink / raw)
  To: pbonzini, rkrcmar, corbet, kys, haiyangz, sthemmin, sashal, tglx,
	mingo, bp, hpa, x86, michael.h.kelley
  Cc: Tianyu Lan, kvm, linux-doc, linux-hyperv, linux-kernel, vkuznets

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
in L0 can delegate L1 hypervisor to handle tlb flush request from
L2 guest when direct tlb flush is enabled in L1.

Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
feature from user space. User space should enable this feature only
when Hyper-V hypervisor capability is exposed to guest and KVM profile
is hided. There is a parameter conflict between KVM and Hyper-V hypercall.
We hope L2 guest doesn't use KVM hypercall when the feature is
enabled. Detail please see comment of new API "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"

Tianyu Lan (2):
  x86/Hyper-V: Fix definition of struct hv_vp_assist_page
  KVM/Hyper-V: Add new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH

Vitaly Kuznetsov (1):
  KVM/Hyper-V/VMX: Add direct tlb flush support

 Documentation/virtual/kvm/api.txt  | 10 ++++++++++
 arch/x86/include/asm/hyperv-tlfs.h | 24 +++++++++++++++++++-----
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/vmx/evmcs.h           |  2 ++
 arch/x86/kvm/vmx/vmx.c             | 38 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                 |  8 ++++++++
 include/linux/kvm_host.h           |  1 +
 include/uapi/linux/kvm.h           |  1 +
 8 files changed, 81 insertions(+), 5 deletions(-)

-- 
2.14.2

^ permalink raw reply

* [PATCH 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page
From: lantianyu1986 @ 2019-08-09  9:49 UTC (permalink / raw)
  To: pbonzini, rkrcmar, corbet, kys, haiyangz, sthemmin, sashal, tglx,
	mingo, bp, hpa, x86, michael.h.kelley
  Cc: Tianyu Lan, kvm, linux-doc, linux-kernel, linux-hyperv, vkuznets
In-Reply-To: <20190809094939.76093-1-Tianyu.Lan@microsoft.com>

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

The struct hv_vp_assist_page was defined incorrectly.
The "vtl_control" should be u64[3], "nested_enlightenments_control"
should be a u64 and there is 7 reserved bytes following "enlighten_vmentry".
This patch is to fix it.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 arch/x86/include/asm/hyperv-tlfs.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index af78cd72b8f3..a79703c56ebe 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -517,11 +517,11 @@ struct hv_timer_message_payload {
 /* Define virtual processor assist page structure. */
 struct hv_vp_assist_page {
 	__u32 apic_assist;
-	__u32 reserved;
-	__u64 vtl_control[2];
+	__u32 reserved1;
+	__u64 vtl_control[3];
 	__u64 nested_enlightenments_control[2];
-	__u32 enlighten_vmentry;
-	__u32 padding;
+	__u8 enlighten_vmentry;
+	__u8 reserved2[7];
 	__u64 current_nested_vmcs;
 } __packed;
 
-- 
2.14.2


^ permalink raw reply related

* [PATCH 2/3] KVM/Hyper-V: Add new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH
From: lantianyu1986 @ 2019-08-09  9:49 UTC (permalink / raw)
  To: pbonzini, rkrcmar, corbet, kys, haiyangz, sthemmin, sashal, tglx,
	mingo, bp, hpa, x86, michael.h.kelley
  Cc: Tianyu Lan, kvm, linux-doc, linux-kernel, linux-hyperv, vkuznets
In-Reply-To: <20190809094939.76093-1-Tianyu.Lan@microsoft.com>

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

This patch adds new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH and let
user space to enable direct tlb flush function when only Hyper-V
hypervsior capability is exposed to VM. This patch also adds
enable_direct_tlbflush callback in the struct kvm_x86_ops and
platforms may use it to implement direct tlb flush support.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 Documentation/virtual/kvm/api.txt | 10 ++++++++++
 arch/x86/include/asm/kvm_host.h   |  2 ++
 arch/x86/kvm/x86.c                |  8 ++++++++
 include/uapi/linux/kvm.h          |  1 +
 4 files changed, 21 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2cd6250b2896..45308ed6dd75 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5289,3 +5289,13 @@ Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
+
+Architecture: x86
+
+This capability indicates that KVM supports Hyper-V direct tlb flush function.
+User space should enable this feature only when Hyper-V hypervisor capability
+is exposed to guest and KVM profile is hided. Both Hyper-V and KVM hypercalls
+use RAX and RCX registers to pass parameters. If KVM hypercall is exposed
+to L2 guest with direct tlbflush enabled, Hyper-V may mistake KVM hypercall
+for Hyper-V tlb flush Hypercall due to paremeter register overlap.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0cc5b611a113..667d154e89d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1205,6 +1205,8 @@ struct kvm_x86_ops {
 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
 
 	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
+
+	int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9d7b9e6a0939..a9d8ee7f7bf0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3183,6 +3183,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = kvm_x86_ops->get_nested_state ?
 			kvm_x86_ops->get_nested_state(NULL, NULL, 0) : 0;
 		break;
+	case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
+		r = kvm_x86_ops->enable_direct_tlbflush ? 1 : 0;
+		break;
 	default:
 		break;
 	}
@@ -3953,6 +3956,11 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				r = -EFAULT;
 		}
 		return r;
+	case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
+		if (!kvm_x86_ops->enable_direct_tlbflush)
+			return -ENOTTY;
+
+		return kvm_x86_ops->enable_direct_tlbflush(vcpu);
 
 	default:
 		return -EINVAL;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a7c19540ce21..cb959bc925b1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
 #define KVM_CAP_PMU_EVENT_FILTER 173
+#define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 174
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.14.2


^ permalink raw reply related

* [PATCH 3/3] KVM/Hyper-V/VMX: Add direct tlb flush support
From: lantianyu1986 @ 2019-08-09  9:49 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin, sashal, tglx, mingo, bp, hpa, x86,
	pbonzini, rkrcmar, michael.h.kelley
  Cc: Vitaly Kuznetsov, linux-hyperv, linux-kernel, kvm, Tianyu Lan
In-Reply-To: <20190809094939.76093-1-Tianyu.Lan@microsoft.com>

From: Vitaly Kuznetsov <vkuznets@redhat.com>

This patch is to enable Hyper-V direct tlb flush function
for VMX.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 arch/x86/include/asm/hyperv-tlfs.h | 16 +++++++++++++++-
 arch/x86/kvm/vmx/evmcs.h           |  2 ++
 arch/x86/kvm/vmx/vmx.c             | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h           |  1 +
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index a79703c56ebe..d53d6e4a6210 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -171,6 +171,7 @@
 #define HV_X64_ENLIGHTENED_VMCS_RECOMMENDED		BIT(14)
 
 /* Nested features. These are HYPERV_CPUID_NESTED_FEATURES.EAX bits. */
+#define HV_X64_NESTED_DIRECT_FLUSH			BIT(17)
 #define HV_X64_NESTED_GUEST_MAPPING_FLUSH		BIT(18)
 #define HV_X64_NESTED_MSR_BITMAP			BIT(19)
 
@@ -514,12 +515,22 @@ struct hv_timer_message_payload {
 	__u64 delivery_time;	/* When the message was delivered */
 } __packed;
 
+struct hv_nested_enlightenments_control {
+	struct {
+		__u32 directhypercall:1;
+		__u32 reserved:31;
+	} features;
+	struct {
+		__u32 reserved;
+	} hypercallControls;
+} __packed;
+
 /* Define virtual processor assist page structure. */
 struct hv_vp_assist_page {
 	__u32 apic_assist;
 	__u32 reserved1;
 	__u64 vtl_control[3];
-	__u64 nested_enlightenments_control[2];
+	struct hv_nested_enlightenments_control nested_control;
 	__u8 enlighten_vmentry;
 	__u8 reserved2[7];
 	__u64 current_nested_vmcs;
@@ -872,4 +883,7 @@ struct hv_tlb_flush_ex {
 	u64 gva_list[];
 } __packed;
 
+struct hv_partition_assist_pg {
+	u32 tlb_lock_count;
+};
 #endif
diff --git a/arch/x86/kvm/vmx/evmcs.h b/arch/x86/kvm/vmx/evmcs.h
index 39a24eec8884..07ebf6882a45 100644
--- a/arch/x86/kvm/vmx/evmcs.h
+++ b/arch/x86/kvm/vmx/evmcs.h
@@ -178,6 +178,8 @@ static inline void evmcs_load(u64 phys_addr)
 	struct hv_vp_assist_page *vp_ap =
 		hv_get_vp_assist_page(smp_processor_id());
 
+	if (current_evmcs->hv_enlightenments_control.nested_flush_hypercall)
+		vp_ap->nested_control.features.directhypercall = 1;
 	vp_ap->current_nested_vmcs = phys_addr;
 	vp_ap->enlighten_vmentry = 1;
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 84f8d49a2fd2..a49be029864e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -486,6 +486,34 @@ static int hv_remote_flush_tlb(struct kvm *kvm)
 	return hv_remote_flush_tlb_with_range(kvm, NULL);
 }
 
+static int hv_enable_direct_tlbflush(struct kvm_vcpu *vcpu)
+{
+	struct hv_enlightened_vmcs *evmcs;
+
+	/*
+	 * Synthetic VM-Exit is not enabled in current code and so All
+	 * evmcs in singe VM shares same assist page.
+	 */
+	if (!vcpu->kvm->hv_pa_pg) {
+		vcpu->kvm->hv_pa_pg = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		if (!vcpu->kvm->hv_pa_pg)
+			return -ENOMEM;
+		pr_debug("KVM: Hyper-V: allocated PA_PG for %llx\n",
+		       (u64)&vcpu->kvm);
+	}
+
+	evmcs = (struct hv_enlightened_vmcs *)to_vmx(vcpu)->loaded_vmcs->vmcs;
+
+	evmcs->partition_assist_page =
+		__pa(vcpu->kvm->hv_pa_pg);
+	evmcs->hv_vm_id = (u64)vcpu->kvm;
+	evmcs->hv_enlightenments_control.nested_flush_hypercall = 1;
+
+	pr_debug("KVM: Hyper-V: enabled DIRECT flush for %llx\n",
+		 (u64)vcpu->kvm);
+	return 0;
+}
+
 #endif /* IS_ENABLED(CONFIG_HYPERV) */
 
 /*
@@ -6516,6 +6544,9 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		current_evmcs->hv_clean_fields |=
 			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
 
+	if (static_branch_unlikely(&enable_evmcs))
+		current_evmcs->hv_vp_id = vcpu->arch.hyperv.vp_index;
+
 	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
 	if (vmx->host_debugctlmsr)
 		update_debugctlmsr(vmx->host_debugctlmsr);
@@ -6583,6 +6614,7 @@ static struct kvm *vmx_vm_alloc(void)
 
 static void vmx_vm_free(struct kvm *kvm)
 {
+	kfree(kvm->hv_pa_pg);
 	vfree(to_kvm_vmx(kvm));
 }
 
@@ -7815,6 +7847,7 @@ static void vmx_exit(void)
 			if (!vp_ap)
 				continue;
 
+			vp_ap->nested_control.features.directhypercall = 0;
 			vp_ap->current_nested_vmcs = 0;
 			vp_ap->enlighten_vmentry = 0;
 		}
@@ -7854,6 +7887,11 @@ static int __init vmx_init(void)
 			pr_info("KVM: vmx: using Hyper-V Enlightened VMCS\n");
 			static_branch_enable(&enable_evmcs);
 		}
+
+		if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
+			vmx_x86_ops.enable_direct_tlbflush
+				= hv_enable_direct_tlbflush;
+
 	} else {
 		enlightened_vmcs = false;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c5da875f19e3..479ad76661e6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -500,6 +500,7 @@ struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	struct hv_partition_assist_pg *hv_pa_pg;
 };
 
 #define kvm_err(fmt, ...) \
-- 
2.14.2


^ permalink raw reply related

* Re: [PATCH 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page
From: Vitaly Kuznetsov @ 2019-08-09 10:25 UTC (permalink / raw)
  To: lantianyu1986
  Cc: Tianyu Lan, kvm, linux-doc, linux-kernel, linux-hyperv, pbonzini,
	rkrcmar, corbet, kys, haiyangz, sthemmin, sashal, tglx, mingo, bp,
	hpa, x86, michael.h.kelley
In-Reply-To: <20190809094939.76093-2-Tianyu.Lan@microsoft.com>

lantianyu1986@gmail.com writes:

> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>
> The struct hv_vp_assist_page was defined incorrectly.
> The "vtl_control" should be u64[3], "nested_enlightenments_control"
> should be a u64 and there is 7 reserved bytes following "enlighten_vmentry".
> This patch is to fix it.
>
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index af78cd72b8f3..a79703c56ebe 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -517,11 +517,11 @@ struct hv_timer_message_payload {
>  /* Define virtual processor assist page structure. */
>  struct hv_vp_assist_page {
>  	__u32 apic_assist;
> -	__u32 reserved;
> -	__u64 vtl_control[2];
> +	__u32 reserved1;
> +	__u64 vtl_control[3];
>  	__u64 nested_enlightenments_control[2];

In PATCH3 you define 'struct hv_nested_enlightenments_control' and it is
64bit long, not 128. We should change it here too as ...

> -	__u32 enlighten_vmentry;

enlighten_vmentry filed will get a very different offset breaking
Enlightened VMCS.

> -	__u32 padding;
> +	__u8 enlighten_vmentry;
> +	__u8 reserved2[7];
>  	__u64 current_nested_vmcs;
>  } __packed;

-- 
Vitaly

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox