Linux-HyperV List

* Re: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
From: Randy Dunlap @ 2019-07-07 17:46 UTC
  To: Haiyang Zhang, LKML, linux-pci, Stephen Hemminger
  Cc: Matthew Wilcox, Jake Oshins, KY Srinivasan, Sasha Levin,
	Bjorn Helgaas, linux-hyperv@vger.kernel.org, Dexuan Cui,
	Yuehaibing, Stephen Hemminger
In-Reply-To: <DM6PR21MB13373F2B76558930CC368E17CAFB0@DM6PR21MB1337.namprd21.prod.outlook.com>

On 7/3/19 11:06 AM, Haiyang Zhang wrote:
> 
> 
>> -----Original Message-----
>> From: Randy Dunlap <rdunlap@infradead.org>
>> Sent: Wednesday, July 3, 2019 12:59 PM
>> To: LKML <linux-kernel@vger.kernel.org>; linux-pci <linux-
>> pci@vger.kernel.org>
>> Cc: Matthew Wilcox <willy@infradead.org>; Jake Oshins
>> <jakeo@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Haiyang
>> Zhang <haiyangz@microsoft.com>; Stephen Hemminger
>> <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Bjorn
>> Helgaas <bhelgaas@google.com>; linux-hyperv@vger.kernel.org; Dexuan
>> Cui <decui@microsoft.com>; Yuehaibing <yuehaibing@huawei.com>
>> Subject: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
>>
>> From: Randy Dunlap <rdunlap@infradead.org>
>>
>> Fix build of drivers/pci/controller/pci-hyperv.o when
>> CONFIG_SYSFS is not set/enabled by adding stubs for
>> pci_create_slot() and pci_destroy_slot().
>>
>> Fixes these build errors:
>>
>> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>>
>> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
>>
>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Jake Oshins <jakeo@microsoft.com>
>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>> Cc: Sasha Levin <sashal@kernel.org>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> Cc: linux-hyperv@vger.kernel.org
>> Cc: Dexuan Cui <decui@microsoft.com>
>> Cc: Yuehaibing <yuehaibing@huawei.com>
>> ---
>> v2:
>> - provide non-CONFIG_SYSFS stubs for pci_create_slot() and
>>   pci_destroy_slot() [suggested by Matthew Wilcox <willy@infradead.org>]
>> - use the correct Fixes: tag [Dexuan Cui <decui@microsoft.com>]
>>
>>  include/linux/pci.h |   12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> --- lnx-52-rc7.orig/include/linux/pci.h
>> +++ lnx-52-rc7/include/linux/pci.h
>> @@ -25,6 +25,7 @@
>>  #include <linux/ioport.h>
>>  #include <linux/list.h>
>>  #include <linux/compiler.h>
>> +#include <linux/err.h>
>>  #include <linux/errno.h>
>>  #include <linux/kobject.h>
>>  #include <linux/atomic.h>
>> @@ -947,14 +948,21 @@ int pci_scan_root_bus_bridge(struct pci_
>>  struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev,
>>  				int busnr);
>>  void pcie_update_link_speed(struct pci_bus *bus, u16 link_status);
>> +#ifdef CONFIG_SYSFS
>> +void pci_dev_assign_slot(struct pci_dev *dev);
>>  struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
>>  				 const char *name,
>>  				 struct hotplug_slot *hotplug);
>>  void pci_destroy_slot(struct pci_slot *slot);
>> -#ifdef CONFIG_SYSFS
>> -void pci_dev_assign_slot(struct pci_dev *dev);
>>  #else
>>  static inline void pci_dev_assign_slot(struct pci_dev *dev) { }
>> +static inline struct pci_slot *pci_create_slot(struct pci_bus *parent,
>> +					       int slot_nr,
>> +					       const char *name,
>> +					       struct hotplug_slot *hotplug) {
>> +	return ERR_PTR(-EINVAL);
>> +}
>> +static inline void pci_destroy_slot(struct pci_slot *slot) { }
>>  #endif
>>  int pci_scan_slot(struct pci_bus *bus, int devfn);
>>  struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
>>
> 
> The serial number in the slot info is used to match the VF NIC with the
> synthetic NIC. Without SYSFS selected, the SR-IOV feature will fail on VMs
> on Hyper-V and Azure. The first version of this patch should be used.
> 
> @Stephen Hemminger, what do you think?
> 
> Thanks,
> - Haiyang


Hi Stephen,

Please comment on this patch or v1.
v1:  https://lore.kernel.org/lkml/69c25bc3-da00-2758-92ee-13c82b51fc45@infradead.org/


thanks.
-- 
~Randy

* Re: [PATCH] ACPI: PM: Fix "multiple definition of acpi_sleep_state_supported" for ARM64
From: Rafael J. Wysocki @ 2019-07-06  7:53 UTC
  To: Dexuan Cui
  Cc: Pavel Machek, Michael Kelley, linux-acpi@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, robert.moore@intel.com,
	erik.schmauss@intel.com, Russell King, Russ Dill,
	Sebastian Capella, Lorenzo Pieralisi,
	Russell King - ARM Linux admin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Haiyang Zhang, Sasha Levin, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com
In-Reply-To: <PU1P153MB0169731042EFE4D6B08F04A5BFF50@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Fri, Jul 5, 2019 at 10:18 PM Dexuan Cui <decui@microsoft.com> wrote:
>
>
> If CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT is not set, the dummy version of
> the function should be static.
>
> Fixes: 1e2c3f0f1e93 ("ACPI: PM: Make acpi_sleep_state_supported() non-static")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> ---
>
> Sorry for not doing it right in the previous patch!
>
> The patch fixes the build errors on ARM64:
>
>    drivers/net/ethernet/qualcomm/emac/emac-phy.o: In function `acpi_sleep_state_supported':
> >> emac-phy.c:(.text+0x1d8): multiple definition of `acpi_sleep_state_supported'
>    drivers/net/ethernet/qualcomm/emac/emac.o:emac.c:(.text+0xbf8): first defined here
>    drivers/net/ethernet/qualcomm/emac/emac-sgmii.o: In function `acpi_sleep_state_supported':
>    emac-sgmii.c:(.text+0x548): multiple definition of `acpi_sleep_state_supported'
>    drivers/net/ethernet/qualcomm/emac/emac.o:emac.c:(.text+0xbf8): first defined here
>
>
>  include/acpi/acpi_bus.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> index 4ce59bdc852e..8ffc4acf2b56 100644
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -657,7 +657,7 @@ static inline int acpi_pm_set_bridge_wakeup(struct device *dev, bool enable)
>  #ifdef CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT
>  bool acpi_sleep_state_supported(u8 sleep_state);
>  #else
> -bool acpi_sleep_state_supported(u8 sleep_state) { return false; }
> +static bool acpi_sleep_state_supported(u8 sleep_state) { return false; }

This should even be static inline.

I've reapplied the original patch with this change folded in.

Thanks!
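
The header pattern discussed above can be sketched in plain C as follows. This is a minimal user-space illustration, not the kernel header: DEMO_POWER_STATES_SUPPORT is a hypothetical stand-in for CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT, and every demo_* name is invented for the example. When the option is off, the stub is static inline, so each file that includes the header gets a private copy and the linker never sees two external definitions of the same symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * DEMO_POWER_STATES_SUPPORT is a hypothetical stand-in for
 * CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT. Leave it undefined to get the stub.
 */
#ifdef DEMO_POWER_STATES_SUPPORT
/* Real implementation: declared here, defined once in a single .c file. */
bool demo_sleep_state_supported(uint8_t sleep_state);
#else
/*
 * Stub: "static" gives it internal linkage, so including this header from
 * several .c files does not produce "multiple definition" link errors.
 * "inline" additionally avoids unused-function warnings in files that
 * include the header but never call the stub.
 */
static inline bool demo_sleep_state_supported(uint8_t sleep_state)
{
	(void)sleep_state;
	return false;
}
#endif

/* Hypothetical caller; 4 stands in for the ACPI S4 (hibernate) state. */
bool demo_hibernation_possible(void)
{
	return demo_sleep_state_supported(4);
}

/* Small self-check for the stub configuration. */
void demo_header_selfcheck(void)
{
	assert(!demo_hibernation_possible());
}
```

With the option undefined, demo_hibernation_possible() returns false and callers need no #ifdef of their own.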

* [PATCH] ACPI: PM: Fix "multiple definition of acpi_sleep_state_supported" for ARM64
From: Dexuan Cui @ 2019-07-05 20:18 UTC
  To: Pavel Machek, Michael Kelley, linux-acpi@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, robert.moore@intel.com,
	erik.schmauss@intel.com, Russell King, Russ Dill,
	Sebastian Capella, Lorenzo Pieralisi,
	Russell King - ARM Linux admin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Haiyang Zhang, Sasha Levin,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com


If CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT is not set, the dummy version of
the function should be static.

Fixes: 1e2c3f0f1e93 ("ACPI: PM: Make acpi_sleep_state_supported() non-static")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reported-by: kbuild test robot <lkp@intel.com>
---

Sorry for not doing it right in the previous patch!

The patch fixes the build errors on ARM64:

   drivers/net/ethernet/qualcomm/emac/emac-phy.o: In function `acpi_sleep_state_supported':
>> emac-phy.c:(.text+0x1d8): multiple definition of `acpi_sleep_state_supported'
   drivers/net/ethernet/qualcomm/emac/emac.o:emac.c:(.text+0xbf8): first defined here
   drivers/net/ethernet/qualcomm/emac/emac-sgmii.o: In function `acpi_sleep_state_supported':
   emac-sgmii.c:(.text+0x548): multiple definition of `acpi_sleep_state_supported'
   drivers/net/ethernet/qualcomm/emac/emac.o:emac.c:(.text+0xbf8): first defined here


 include/acpi/acpi_bus.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 4ce59bdc852e..8ffc4acf2b56 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -657,7 +657,7 @@ static inline int acpi_pm_set_bridge_wakeup(struct device *dev, bool enable)
 #ifdef CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT
 bool acpi_sleep_state_supported(u8 sleep_state);
 #else
-bool acpi_sleep_state_supported(u8 sleep_state) { return false; }
+static bool acpi_sleep_state_supported(u8 sleep_state) { return false; }
 #endif
 
 #ifdef CONFIG_ACPI_SLEEP
-- 
2.17.1


* Re: [PATCH v2] PCI: hv: Fix a use-after-free bug in hv_eject_device_work()
From: Lorenzo Pieralisi @ 2019-07-05 13:41 UTC
  To: Dexuan Cui
  Cc: linux-pci@vger.kernel.org, bhelgaas@google.com, Haiyang Zhang,
	KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com,
	Michael Kelley, Lili Deng (Wicresoft North America Ltd),
	linux-kernel@vger.kernel.org,
	driverdev-devel@linuxdriverproject.org
In-Reply-To: <PU1P153MB0169D420EAB61757DF4B337FBFE70@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Fri, Jun 21, 2019 at 11:45:23PM +0000, Dexuan Cui wrote:
> 
> The commit 05f151a73ec2 itself is correct, but it exposes this
> use-after-free bug, which is caught by some memory debug options.
> 
> Add a Fixes tag to indicate the dependency.
> 
> Fixes: 05f151a73ec2 ("PCI: hv: Fix a memory leak in hv_eject_device_work()")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
> 
> In v2:
> Replaced "hpdev->hbus" with "hbus", since we have the new "hbus" variable. [Michael Kelley]
> 
>  drivers/pci/controller/pci-hyperv.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)

Applied to pci/hv for v5.3, thanks.

Lorenzo

> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 808a182830e5..5dadc964ad3b 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1880,6 +1880,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
>  static void hv_eject_device_work(struct work_struct *work)
>  {
>  	struct pci_eject_response *ejct_pkt;
> +	struct hv_pcibus_device *hbus;
>  	struct hv_pci_dev *hpdev;
>  	struct pci_dev *pdev;
>  	unsigned long flags;
> @@ -1890,6 +1891,7 @@ static void hv_eject_device_work(struct work_struct *work)
>  	} ctxt;
>  
>  	hpdev = container_of(work, struct hv_pci_dev, wrk);
> +	hbus = hpdev->hbus;
>  
>  	WARN_ON(hpdev->state != hv_pcichild_ejecting);
>  
> @@ -1900,8 +1902,7 @@ static void hv_eject_device_work(struct work_struct *work)
>  	 * because hbus->pci_bus may not exist yet.
>  	 */
>  	wslot = wslot_to_devfn(hpdev->desc.win_slot.slot);
> -	pdev = pci_get_domain_bus_and_slot(hpdev->hbus->sysdata.domain, 0,
> -					   wslot);
> +	pdev = pci_get_domain_bus_and_slot(hbus->sysdata.domain, 0, wslot);
>  	if (pdev) {
>  		pci_lock_rescan_remove();
>  		pci_stop_and_remove_bus_device(pdev);
> @@ -1909,9 +1910,9 @@ static void hv_eject_device_work(struct work_struct *work)
>  		pci_unlock_rescan_remove();
>  	}
>  
> -	spin_lock_irqsave(&hpdev->hbus->device_list_lock, flags);
> +	spin_lock_irqsave(&hbus->device_list_lock, flags);
>  	list_del(&hpdev->list_entry);
> -	spin_unlock_irqrestore(&hpdev->hbus->device_list_lock, flags);
> +	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
>  
>  	if (hpdev->pci_slot)
>  		pci_destroy_slot(hpdev->pci_slot);
> @@ -1920,7 +1921,7 @@ static void hv_eject_device_work(struct work_struct *work)
>  	ejct_pkt = (struct pci_eject_response *)&ctxt.pkt.message;
>  	ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE;
>  	ejct_pkt->wslot.slot = hpdev->desc.win_slot.slot;
> -	vmbus_sendpacket(hpdev->hbus->hdev->channel, ejct_pkt,
> +	vmbus_sendpacket(hbus->hdev->channel, ejct_pkt,
>  			 sizeof(*ejct_pkt), (unsigned long)&ctxt.pkt,
>  			 VM_PKT_DATA_INBAND, 0);
>  
> @@ -1929,7 +1930,9 @@ static void hv_eject_device_work(struct work_struct *work)
>  	/* For the two refs got in new_pcichild_device() */
>  	put_pcichild(hpdev);
>  	put_pcichild(hpdev);
> -	put_hvpcibus(hpdev->hbus);
> +	/* hpdev has been freed. Do not use it any more. */
> +
> +	put_hvpcibus(hbus);
>  }
>  
>  /**
> -- 
> 2.17.1
> 
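
The fix quoted above comes down to one rule: copy the parent pointer out of the child before dropping the references that may free the child. A minimal user-space sketch of that rule follows. The demo_* types and helpers are hypothetical and only loosely mirror hv_pci_dev, hv_pcibus_device, put_pcichild() and put_hvpcibus(); the refcounting is deliberately simplified.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the bus and child objects. */
struct demo_bus {
	int refs;
};

struct demo_child {
	int refs;
	struct demo_bus *bus;
};

static void demo_put_bus(struct demo_bus *bus)
{
	bus->refs--;
}

/* Drops one reference and frees the child when the last one goes away. */
static void demo_put_child(struct demo_child *child)
{
	if (--child->refs == 0)
		free(child);
}

/* Returns the bus refcount after the eject path has run. */
int demo_eject(struct demo_child *child)
{
	/* Cache the parent first: child may be freed by the puts below. */
	struct demo_bus *bus = child->bus;

	demo_put_child(child);
	demo_put_child(child);
	/* child may have been freed here; do not dereference it any more. */

	demo_put_bus(bus);	/* the buggy form would be demo_put_bus(child->bus) */
	return bus->refs;
}

/* Small self-check: two child refs, one bus ref held on the child's behalf. */
void demo_eject_selfcheck(void)
{
	struct demo_bus bus = { .refs = 1 };
	struct demo_child *child = malloc(sizeof(*child));

	assert(child != NULL);
	child->refs = 2;
	child->bus = &bus;
	assert(demo_eject(child) == 0);
}
```

Running the buggy form under a memory checker such as AddressSanitizer would be expected to flag the read of child->bus after the free, which matches the report that memory debug options caught the original bug.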

* Re: [PATCH] ACPI: PM: Make acpi_sleep_state_supported() non-static
From: Rafael J. Wysocki @ 2019-07-05  9:40 UTC
  To: Dexuan Cui
  Cc: Pavel Machek, Michael Kelley, linux-acpi@vger.kernel.org,
	lenb@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com,
	Russell King, Russ Dill, Sebastian Capella, Lorenzo Pieralisi,
	Russell King - ARM Linux admin, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	Haiyang Zhang, Sasha Levin, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com
In-Reply-To: <PU1P153MB0169A260911AACDA861F0029BFFA0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On Thursday, July 4, 2019 4:43:32 AM CEST Dexuan Cui wrote:
> 
> With some upcoming patches to save/restore the state of the Hyper-V
> drivers, a Linux VM running on Hyper-V will be able to hibernate. When
> a Linux VM hibernates, unfortunately we must disable the memory
> hot-add/remove and balloon up/down capabilities in the hv_balloon driver
> (drivers/hv/hv_balloon.c), because these cannot work given the design of
> the related back-end driver on the host.
> 
> By default, Hyper-V does not enable the virtual ACPI S4 state for a VM;
> on recent Hyper-V hosts, the administrator is able to enable the virtual
> ACPI S4 state for a VM, so we hope to use the presence of the virtual ACPI
> S4 state as a hint for hv_balloon to disable the aforementioned
> capabilities. In this way, hibernation will work more reliably, from the
> user's perspective.
> 
> By marking acpi_sleep_state_supported() non-static, we'll be able to
> implement a hv_is_hibernation_supported() API in the always-built-in
> module arch/x86/hyperv/hv_init.c, and the API will be called by hv_balloon.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> 
> Previously I posted a version that tries to export the function:
> https://lkml.org/lkml/2019/6/14/1077, which may be overkill.
> 
> So I proposed a second patch (which covers this patch and shows how this
> patch will be used): https://lkml.org/lkml/2019/6/19/861
> 
> I explained the situation in detail here: https://lkml.org/lkml/2019/6/21/63
> (a correction: old Hyper-V hosts can support guest hibernation, but some
> important functionality in the host's management tool stack is missing).
> 
> There has been no further reply in that discussion, so I'm sending this
> patch to draw people's attention again. :-)
> 
>  drivers/acpi/sleep.c    | 2 +-
>  include/acpi/acpi_bus.h | 6 ++++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index 8ff08e531443..d1ff303a857a 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -77,7 +77,7 @@ static int acpi_sleep_prepare(u32 acpi_state)
>  	return 0;
>  }
>  
> -static bool acpi_sleep_state_supported(u8 sleep_state)
> +bool acpi_sleep_state_supported(u8 sleep_state)
>  {
>  	acpi_status status;
>  	u8 type_a, type_b;
> diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> index 31b6c87d6240..3e6563e1a2c0 100644
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -651,6 +651,12 @@ static inline int acpi_pm_set_bridge_wakeup(struct device *dev, bool enable)
>  }
>  #endif
>  
> +#ifdef CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT
> +bool acpi_sleep_state_supported(u8 sleep_state);
> +#else
> +bool acpi_sleep_state_supported(u8 sleep_state) { return false; }
> +#endif
> +
>  #ifdef CONFIG_ACPI_SLEEP
>  u32 acpi_target_system_state(void);
>  #else
> 

Applied, thanks!




* [PATCH] ACPI: PM: Make acpi_sleep_state_supported() non-static
From: Dexuan Cui @ 2019-07-04  2:43 UTC
  To: Pavel Machek, Michael Kelley, linux-acpi@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, robert.moore@intel.com,
	erik.schmauss@intel.com, Russell King, Russ Dill,
	Sebastian Capella, Lorenzo Pieralisi,
	Russell King - ARM Linux admin
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	KY Srinivasan, Stephen Hemminger, Haiyang Zhang, Sasha Levin,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com

With some upcoming patches to save/restore the state of the Hyper-V
drivers, a Linux VM running on Hyper-V will be able to hibernate. When
a Linux VM hibernates, unfortunately we must disable the memory
hot-add/remove and balloon up/down capabilities in the hv_balloon driver
(drivers/hv/hv_balloon.c), because these cannot work given the design of
the related back-end driver on the host.

By default, Hyper-V does not enable the virtual ACPI S4 state for a VM;
on recent Hyper-V hosts, the administrator is able to enable the virtual
ACPI S4 state for a VM, so we hope to use the presence of the virtual ACPI
S4 state as a hint for hv_balloon to disable the aforementioned
capabilities. In this way, hibernation will work more reliably, from the
user's perspective.

By marking acpi_sleep_state_supported() non-static, we'll be able to
implement a hv_is_hibernation_supported() API in the always-built-in
module arch/x86/hyperv/hv_init.c, and the API will be called by hv_balloon.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Previously I posted a version that tries to export the function:
https://lkml.org/lkml/2019/6/14/1077, which may be overkill.

So I proposed a second patch (which covers this patch and shows how this
patch will be used): https://lkml.org/lkml/2019/6/19/861

I explained the situation in detail here: https://lkml.org/lkml/2019/6/21/63
(a correction: old Hyper-V hosts can support guest hibernation, but some
important functionality in the host's management tool stack is missing).

There has been no further reply in that discussion, so I'm sending this
patch to draw people's attention again. :-)

 drivers/acpi/sleep.c    | 2 +-
 include/acpi/acpi_bus.h | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 8ff08e531443..d1ff303a857a 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -77,7 +77,7 @@ static int acpi_sleep_prepare(u32 acpi_state)
 	return 0;
 }

-static bool acpi_sleep_state_supported(u8 sleep_state)
+bool acpi_sleep_state_supported(u8 sleep_state)
 {
 	acpi_status status;
 	u8 type_a, type_b;
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 31b6c87d6240..3e6563e1a2c0 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -651,6 +651,12 @@ static inline int acpi_pm_set_bridge_wakeup(struct device *dev, bool enable)
 }
 #endif

+#ifdef CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT
+bool acpi_sleep_state_supported(u8 sleep_state);
+#else
+bool acpi_sleep_state_supported(u8 sleep_state) { return false; }
+#endif
+
 #ifdef CONFIG_ACPI_SLEEP
 u32 acpi_target_system_state(void);
 #else
-- 
2.19.1

* [PATCH v3] locking/spinlocks, paravirt, hyperv: Correct the hv_nopvspin case
From: Zhenzhong Duan @ 2019-07-03  2:23 UTC
  To: linux-kernel
  Cc: Zhenzhong Duan, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Juergen Gross, Boris Ostrovsky,
	Peter Zijlstra, Waiman Long, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, linux-hyperv

With the boot parameter "hv_nopvspin" specified, a Hyper-V guest should
not make use of paravirt spinlocks, but behave as if running on bare
metal. This is currently not the case, however, as the qspinlock code
falls back to a test-and-set scheme when it detects a hypervisor.

To avoid this, disable the virt_spin_lock_key.

The same change for Xen is already in commit e6fd28eb3522
("locking/spinlocks, paravirt, xen: Correct the xen_nopvspin case").

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-hyperv@vger.kernel.org
---
v3: remove unlikely() as suggested by Sasha

 arch/x86/hyperv/hv_spinlock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/hyperv/hv_spinlock.c b/arch/x86/hyperv/hv_spinlock.c
index 07f21a0..210495b 100644
--- a/arch/x86/hyperv/hv_spinlock.c
+++ b/arch/x86/hyperv/hv_spinlock.c
@@ -64,6 +64,9 @@ __visible bool hv_vcpu_is_preempted(int vcpu)
 
 void __init hv_init_spinlocks(void)
 {
+	if (!hv_pvspin)
+		static_branch_disable(&virt_spin_lock_key);
+
 	if (!hv_pvspin || !apic ||
 	    !(ms_hyperv.hints & HV_X64_CLUSTER_IPI_RECOMMENDED) ||
 	    !(ms_hyperv.features & HV_X64_MSR_GUEST_IDLE_AVAILABLE)) {
-- 
1.8.3.1


* [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Dexuan Cui @ 2019-07-04  1:45 UTC
  To: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
	linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Borislav Petkov,
	x86@kernel.org
  Cc: linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
	driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
	apw@canonical.com, jasowang@redhat.com

When a CPU is being offlined, it usually still receives a few interrupts
(e.g. reschedule IPIs) after hv_cpu_die() disables the
HV_X64_MSR_VP_ASSIST_PAGE. If bit 0 of the apic_assist field happens to
be 1 at that point, hv_apic_eoi_write() may not write the EOI MSR. As a
result, Hyper-V may not be able to deliver all the interrupts to the CPU,
the CPU may not be stopped, and the kernel will soon hang.

The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
5.2.1 "GPA Overlay Pages"), so once it is disabled the guest sees its own
underlying page again. Allocating that page zeroed ensures the apic_assist
field is still zero after the VP ASSIST PAGE is disabled.

Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 arch/x86/hyperv/hv_init.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 0e033ef11a9f..db51a301f759 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -60,8 +60,14 @@ static int hv_cpu_init(unsigned int cpu)
 	if (!hv_vp_assist_page)
 		return 0;
 
+	/*
+	 * The ZERO flag is necessary, because in the case of CPU offlining
+	 * the page can still be used by hv_apic_eoi_write() for a while,
+	 * after the VP ASSIST PAGE is disabled in hv_cpu_die().
+	 */
 	if (!*hvp)
-		*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
+		*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
+				 PAGE_KERNEL);
 
 	if (*hvp) {
 		u64 val;
-- 
2.19.1
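
The reasoning in the commit message can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the demo_* names are invented, calloc() stands in for __vmalloc() with __GFP_ZERO, and demo_eoi_write() only models the behaviour described above (skip the EOI write when bit 0 of apic_assist is set), not the actual body of hv_apic_eoi_write().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical, cut-down view of the assist page: only the field we need. */
struct demo_assist_page {
	uint32_t apic_assist;
};

static int demo_eoi_writes;	/* counts simulated EOI MSR writes */

/*
 * Models the described logic: if bit 0 of apic_assist is set, the EOI
 * write is skipped; otherwise the (simulated) EOI MSR is written.
 */
void demo_eoi_write(struct demo_assist_page *page)
{
	if (page) {
		uint32_t old = page->apic_assist;

		page->apic_assist = 0;
		if (old & 0x1)
			return;		/* EOI skipped */
	}
	demo_eoi_writes++;
}

/*
 * Simulates a late interrupt on a CPU whose overlay has been disabled, so
 * only the guest's own backing page is visible. Returns the number of EOI
 * writes performed, or -1 on allocation failure.
 */
int demo_late_interrupt(void)
{
	/* Zeroed allocation: apic_assist is guaranteed to start at 0. */
	struct demo_assist_page *page = calloc(1, DEMO_PAGE_SIZE);

	if (!page)
		return -1;

	demo_eoi_writes = 0;
	demo_eoi_write(page);
	free(page);
	return demo_eoi_writes;
}

/* Small self-check: with a zeroed page the EOI is always written. */
void demo_assist_selfcheck(void)
{
	assert(demo_late_interrupt() == 1);
}
```

With an unzeroed allocation, bit 0 could hold stale data and the sketch would skip the EOI, which is the failure mode the patch describes.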


* Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-07-03 18:09 UTC
  To: Andrew Cooper
  Cc: Juergen Gross, Sasha Levin, linux-hyperv@vger.kernel.org,
	the arch/x86 maintainers, Stephen Hemminger, kvm list,
	Peter Zijlstra, Haiyang Zhang, Dave Hansen,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Paolo Bonzini, xen-devel,
	Thomas Gleixner, K. Y. Srinivasan, Boris Ostrovsky
In-Reply-To: <6038042c-917f-d361-5d79-f0205152fe00@citrix.com>

> On Jul 3, 2019, at 10:43 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
> On 03/07/2019 18:02, Nadav Amit wrote:
>>> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
>>> 
>>> On 03.07.19 01:51, Nadav Amit wrote:
>>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>>> and hyper-v are only compile-tested).
>>>> While the updated smp infrastructure is capable of running a function on
>>>> a single local core, it is not optimized for this case. The multiple
>>>> function calls and the indirect branch introduce some overhead, and
>>>> might make local TLB flushes slower than they were before the recent
>>>> changes.
>>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>>> is needed to restore the lost performance in this common case. This
>>>> requires checking mm_cpumask() one more time, but unless this mask is
>>>> updated very frequently, this should not impact performance negatively.
>>>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>>>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>>>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>>>> Cc: Sasha Levin <sashal@kernel.org>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: x86@kernel.org
>>>> Cc: Juergen Gross <jgross@suse.com>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>> Cc: linux-hyperv@vger.kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: kvm@vger.kernel.org
>>>> Cc: xen-devel@lists.xenproject.org
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>> ---
>>>> arch/x86/hyperv/mmu.c                 | 13 +++---
>>>> arch/x86/include/asm/paravirt.h       |  6 +--
>>>> arch/x86/include/asm/paravirt_types.h |  4 +-
>>>> arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>> arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>> arch/x86/kernel/kvm.c                 | 11 +++--
>>>> arch/x86/kernel/paravirt.c            |  2 +-
>>>> arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>>> arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>>> include/trace/events/xen.h            |  2 +-
>>>> 10 files changed, 91 insertions(+), 43 deletions(-)
>>> ...
>>> 
>>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>>> index beb44e22afdf..19e481e6e904 100644
>>>> --- a/arch/x86/xen/mmu_pv.c
>>>> +++ b/arch/x86/xen/mmu_pv.c
>>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>> 	preempt_enable();
>>>> }
>>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> -				 const struct flush_tlb_info *info)
>>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>>> +				const struct flush_tlb_info *info)
>>>> {
>>>> 	struct {
>>>> 		struct mmuext_op op;
>>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> 	const size_t mc_entry_size = sizeof(args->op) +
>>>> 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>>   	if (cpumask_empty(cpus))
>>>> 		return;		/* nothing to do */
>>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> 	args = mcs.args;
>>>> 	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>> -	/* Remove us, and any offline CPUS. */
>>>> +	/* Flush locally if needed and remove us */
>>>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>>> +		local_irq_disable();
>>>> +		flush_tlb_func_local(info);
>>> I think this isn't the correct function for PV guests.
>>> 
>>> In fact it should be much easier: just don't clear the current CPU from
>>> the mask; that's all that's needed. The hypervisor is just fine having the
>>> current CPU in the mask and it will do the right thing.
>> Thanks. I will do so in v3. Unfortunately, I don't think the Hyper-V people
>> would want to do the same, since it would induce a VM-exit on TLB flushes.
> 
> Why do you believe the vmexit matters?  You're taking one anyway for
> the IPI.
> 
> Intel only have virtualised self-IPI, and while AMD do have working
> non-self IPIs, you still take a vmexit anyway if any destination vcpu
> isn't currently running in non-root mode (IIRC).
> 
> At that point, you might as well have the hypervisor do all the hard
> work via a multi-cpu shootdown/flush hypercall, rather than trying to
> arrange it locally.

I forgot that xen_flush_tlb_multi() should actually only be called when
there are some remote CPUs (as I optimized the case in which there is only a
single local CPU that needs to be flushed), so you are right.


* RE: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
From: Haiyang Zhang @ 2019-07-03 18:06 UTC
  To: Randy Dunlap, LKML, linux-pci, Stephen Hemminger
  Cc: Matthew Wilcox, Jake Oshins, KY Srinivasan, Sasha Levin,
	Bjorn Helgaas, linux-hyperv@vger.kernel.org, Dexuan Cui,
	Yuehaibing
In-Reply-To: <535f212f-e111-399d-4ad0-82d2ae505e48@infradead.org>



> -----Original Message-----
> From: Randy Dunlap <rdunlap@infradead.org>
> Sent: Wednesday, July 3, 2019 12:59 PM
> To: LKML <linux-kernel@vger.kernel.org>; linux-pci <linux-
> pci@vger.kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>; Jake Oshins
> <jakeo@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; Bjorn
> Helgaas <bhelgaas@google.com>; linux-hyperv@vger.kernel.org; Dexuan
> Cui <decui@microsoft.com>; Yuehaibing <yuehaibing@huawei.com>
> Subject: [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
> 
> From: Randy Dunlap <rdunlap@infradead.org>
> 
> Fix build of drivers/pci/controller/pci-hyperv.o when
> CONFIG_SYSFS is not set/enabled by adding stubs for
> pci_create_slot() and pci_destroy_slot().
> 
> Fixes these build errors:
> 
> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> 
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jake Oshins <jakeo@microsoft.com>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org
> Cc: Dexuan Cui <decui@microsoft.com>
> Cc: Yuehaibing <yuehaibing@huawei.com>
> ---
> v2:
> - provide non-CONFIG_SYSFS stubs for pci_create_slot() and
>   pci_destroy_slot() [suggested by Matthew Wilcox <willy@infradead.org>]
> - use the correct Fixes: tag [Dexuan Cui <decui@microsoft.com>]
> 
>  include/linux/pci.h |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> --- lnx-52-rc7.orig/include/linux/pci.h
> +++ lnx-52-rc7/include/linux/pci.h
> @@ -25,6 +25,7 @@
>  #include <linux/ioport.h>
>  #include <linux/list.h>
>  #include <linux/compiler.h>
> +#include <linux/err.h>
>  #include <linux/errno.h>
>  #include <linux/kobject.h>
>  #include <linux/atomic.h>
> @@ -947,14 +948,21 @@ int pci_scan_root_bus_bridge(struct pci_
>  struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev,
>  				int busnr);
>  void pcie_update_link_speed(struct pci_bus *bus, u16 link_status);
> +#ifdef CONFIG_SYSFS
> +void pci_dev_assign_slot(struct pci_dev *dev);
>  struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
>  				 const char *name,
>  				 struct hotplug_slot *hotplug);
>  void pci_destroy_slot(struct pci_slot *slot);
> -#ifdef CONFIG_SYSFS
> -void pci_dev_assign_slot(struct pci_dev *dev);
>  #else
>  static inline void pci_dev_assign_slot(struct pci_dev *dev) { }
> +static inline struct pci_slot *pci_create_slot(struct pci_bus *parent,
> +					       int slot_nr,
> +					       const char *name,
> +					       struct hotplug_slot *hotplug) {
> +	return ERR_PTR(-EINVAL);
> +}
> +static inline void pci_destroy_slot(struct pci_slot *slot) { }
>  #endif
>  int pci_scan_slot(struct pci_bus *bus, int devfn);
>  struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
> 

The serial number in the slot info is used to match the VF NIC with the
synthetic NIC. If SYSFS is not selected, SR-IOV will fail on VMs running on
Hyper-V and Azure. The first version of this patch should be used instead.
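
To make the concern concrete, here is a minimal userspace sketch of what a
caller sees when the !CONFIG_SYSFS stub from the v2 patch is in effect. The
ERR_PTR()/IS_ERR()/PTR_ERR() helpers below are simplified stand-ins for the
ones in <linux/err.h>, and pci_create_slot_stub() and assign_slot() are
hypothetical names used only for illustration; this is not the pci-hyperv
code itself.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the last MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct pci_slot;	/* opaque for this sketch */

/* Mirrors the !CONFIG_SYSFS stub in the v2 patch: creation always fails. */
static inline struct pci_slot *pci_create_slot_stub(int slot_nr)
{
	(void)slot_nr;
	return ERR_PTR(-EINVAL);
}

/*
 * Hypothetical caller: returns 0 if a slot was created, or a negative
 * errno if not. With the stub above it can never succeed, so no slot
 * (and no serial number) is ever published.
 */
static int assign_slot(int slot_nr)
{
	struct pci_slot *slot = pci_create_slot_stub(slot_nr);

	if (IS_ERR(slot))
		return (int)PTR_ERR(slot);
	return 0;
}
```

With the stub, assign_slot() always returns -EINVAL: the build succeeds, but
the slot information used for VF matching is never created.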

@Stephen Hemminger what do you think?

Thanks,
- Haiyang

* Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Andrew Cooper @ 2019-07-03 17:43 UTC (permalink / raw)
  To: Nadav Amit, Juergen Gross
  Cc: Sasha Levin, linux-hyperv@vger.kernel.org,
	the arch/x86 maintainers, Stephen Hemminger, kvm list,
	Peter Zijlstra, Haiyang Zhang, Dave Hansen,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Paolo Bonzini, xen-devel,
	Thomas Gleixner, K. Y. Srinivasan, Boris Ostrovsky
In-Reply-To: <A4BC0EDE-71F0-455D-964A-7250D005FB56@vmware.com>

On 03/07/2019 18:02, Nadav Amit wrote:
>> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
>>
>> On 03.07.19 01:51, Nadav Amit wrote:
>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>> and hyper-v are only compile-tested).
>>> While the updated smp infrastructure is capable of running a function on
>>> a single local core, it is not optimized for this case. The multiple
>>> function calls and the indirect branch introduce some overhead, and
>>> might make local TLB flushes slower than they were before the recent
>>> changes.
>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>> is needed to restore the lost performance in this common case. This
>>> requires checking mm_cpumask() one more time, but unless this mask is
>>> updated very frequently, this should not impact performance negatively.
>>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>>> Cc: Sasha Levin <sashal@kernel.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Borislav Petkov <bp@alien8.de>
>>> Cc: x86@kernel.org
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> Cc: linux-hyperv@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: kvm@vger.kernel.org
>>> Cc: xen-devel@lists.xenproject.org
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>>  arch/x86/kernel/paravirt.c            |  2 +-
>>>  arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>>  arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>>  include/trace/events/xen.h            |  2 +-
>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index beb44e22afdf..19e481e6e904 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>  	preempt_enable();
>>>  }
>>>  
>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> -				 const struct flush_tlb_info *info)
>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>> +				const struct flush_tlb_info *info)
>>>  {
>>>  	struct {
>>>  		struct mmuext_op op;
>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	const size_t mc_entry_size = sizeof(args->op) +
>>>  		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>  
>>> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>  
>>>  	if (cpumask_empty(cpus))
>>>  		return;		/* nothing to do */
>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	args = mcs.args;
>>>  	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>  
>>> -	/* Remove us, and any offline CPUS. */
>>> +	/* Flush locally if needed and remove us */
>>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>> +		local_irq_disable();
>>> +		flush_tlb_func_local(info);
>> I think this isn't the correct function for PV guests.
>>
>> In fact it should be much easier: just don't clear the own cpu from the
>> mask, that's all what's needed. The hypervisor is just fine having the
>> current cpu in the mask and it will do the right thing.
> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
> the same, unfortunately, since it would induce VM-exit on TLB flushes.

Why do you believe the vmexit matters?  You're taking one anyway for
the IPI.

Intel only have virtualised self-IPI, and while AMD do have working
non-self IPIs, you still take a vmexit anyway if any destination vcpu
isn't currently running in non-root mode (IIRC).

At that point, you might as well have the hypervisor do all the hard
work via a multi-cpu shootdown/flush hypercall, rather than trying to
arrange it locally.

~Andrew

* Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-07-03 17:02 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Dave Hansen, Borislav Petkov, Peter Zijlstra,
	Sasha Levin, the arch/x86 maintainers, Thomas Gleixner,
	virtualization@lists.linux-foundation.org, xen-devel,
	Haiyang Zhang, K. Y. Srinivasan, Stephen Hemminger,
	Boris Ostrovsky, Ingo Molnar, Paolo Bonzini, kvm list,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <d89e2b57-8682-153e-33d8-98084e9983d6@suse.com>

> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 03.07.19 01:51, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>> and hyper-v are only compile-tested).
>> While the updated smp infrastructure is capable of running a function on
>> a single local core, it is not optimized for this case. The multiple
>> function calls and the indirect branch introduce some overhead, and
>> might make local TLB flushes slower than they were before the recent
>> changes.
>> Before calling the SMP infrastructure, check if only a local TLB flush
>> is needed to restore the lost performance in this common case. This
>> requires checking mm_cpumask() one more time, but unless this mask is
>> updated very frequently, this should not impact performance negatively.
>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>> Cc: Sasha Levin <sashal@kernel.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: x86@kernel.org
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Cc: linux-hyperv@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: kvm@vger.kernel.org
>> Cc: xen-devel@lists.xenproject.org
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>  arch/x86/kernel/paravirt.c            |  2 +-
>>  arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>  arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>  include/trace/events/xen.h            |  2 +-
>>  10 files changed, 91 insertions(+), 43 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index beb44e22afdf..19e481e6e904 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>  	preempt_enable();
>>  }
>>  
>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>> -				 const struct flush_tlb_info *info)
>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>> +				const struct flush_tlb_info *info)
>>  {
>>  	struct {
>>  		struct mmuext_op op;
>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>  	const size_t mc_entry_size = sizeof(args->op) +
>>  		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>  
>> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>  
>>  	if (cpumask_empty(cpus))
>>  		return;		/* nothing to do */
>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>  	args = mcs.args;
>>  	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>  
>> -	/* Remove us, and any offline CPUS. */
>> +	/* Flush locally if needed and remove us */
>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>> +		local_irq_disable();
>> +		flush_tlb_func_local(info);
> 
> I think this isn't the correct function for PV guests.
> 
> In fact it should be much easier: just don't clear the own cpu from the
> mask, that's all what's needed. The hypervisor is just fine having the
> current cpu in the mask and it will do the right thing.

Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
the same, unfortunately, since it would induce a VM-exit on TLB flushes. But
if they do, I’ll be able to avoid exposing flush_tlb_func_local().


* [PATCH v2] PCI: hv: fix pci-hyperv build when SYSFS not enabled
From: Randy Dunlap @ 2019-07-03 16:59 UTC (permalink / raw)
  To: LKML, linux-pci
  Cc: Matthew Wilcox, Jake Oshins, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Bjorn Helgaas,
	linux-hyperv@vger.kernel.org, Dexuan Cui, Yuehaibing

From: Randy Dunlap <rdunlap@infradead.org>

Fix build of drivers/pci/controller/pci-hyperv.o when
CONFIG_SYSFS is not set/enabled by adding stubs for
pci_create_slot() and pci_destroy_slot().

Fixes these build errors:

ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!

Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jake Oshins <jakeo@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Yuehaibing <yuehaibing@huawei.com>
---
v2:
- provide non-CONFIG_SYSFS stubs for pci_create_slot() and
  pci_destroy_slot() [suggested by Matthew Wilcox <willy@infradead.org>]
- use the correct Fixes: tag [Dexuan Cui <decui@microsoft.com>]

 include/linux/pci.h |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- lnx-52-rc7.orig/include/linux/pci.h
+++ lnx-52-rc7/include/linux/pci.h
@@ -25,6 +25,7 @@
 #include <linux/ioport.h>
 #include <linux/list.h>
 #include <linux/compiler.h>
+#include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/kobject.h>
 #include <linux/atomic.h>
@@ -947,14 +948,21 @@ int pci_scan_root_bus_bridge(struct pci_
 struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev,
 				int busnr);
 void pcie_update_link_speed(struct pci_bus *bus, u16 link_status);
+#ifdef CONFIG_SYSFS
+void pci_dev_assign_slot(struct pci_dev *dev);
 struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
 				 const char *name,
 				 struct hotplug_slot *hotplug);
 void pci_destroy_slot(struct pci_slot *slot);
-#ifdef CONFIG_SYSFS
-void pci_dev_assign_slot(struct pci_dev *dev);
 #else
 static inline void pci_dev_assign_slot(struct pci_dev *dev) { }
+static inline struct pci_slot *pci_create_slot(struct pci_bus *parent,
+					       int slot_nr,
+					       const char *name,
+					       struct hotplug_slot *hotplug) {
+	return ERR_PTR(-EINVAL);
+}
+static inline void pci_destroy_slot(struct pci_slot *slot) { }
 #endif
 int pci_scan_slot(struct pci_bus *bus, int devfn);
 struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);



* Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Juergen Gross @ 2019-07-03 14:04 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: Borislav Petkov, Peter Zijlstra, Sasha Levin, x86,
	Thomas Gleixner, virtualization, xen-devel, Haiyang Zhang,
	K. Y. Srinivasan, Stephen Hemminger, Boris Ostrovsky, Ingo Molnar,
	Paolo Bonzini, kvm, linux-hyperv, linux-kernel
In-Reply-To: <20190702235151.4377-5-namit@vmware.com>

On 03.07.19 01:51, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires checking mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should not impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>   arch/x86/hyperv/mmu.c                 | 13 +++---
>   arch/x86/include/asm/paravirt.h       |  6 +--
>   arch/x86/include/asm/paravirt_types.h |  4 +-
>   arch/x86/include/asm/tlbflush.h       |  9 ++--
>   arch/x86/include/asm/trace/hyperv.h   |  2 +-
>   arch/x86/kernel/kvm.c                 | 11 +++--
>   arch/x86/kernel/paravirt.c            |  2 +-
>   arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>   arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>   include/trace/events/xen.h            |  2 +-
>   10 files changed, 91 insertions(+), 43 deletions(-)

...

> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index beb44e22afdf..19e481e6e904 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>   	preempt_enable();
>   }
>   
> -static void xen_flush_tlb_others(const struct cpumask *cpus,
> -				 const struct flush_tlb_info *info)
> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
> +				const struct flush_tlb_info *info)
>   {
>   	struct {
>   		struct mmuext_op op;
> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>   	const size_t mc_entry_size = sizeof(args->op) +
>   		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>   
> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>   
>   	if (cpumask_empty(cpus))
>   		return;		/* nothing to do */
> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>   	args = mcs.args;
>   	args->op.arg2.vcpumask = to_cpumask(args->mask);
>   
> -	/* Remove us, and any offline CPUS. */
> +	/* Flush locally if needed and remove us */
> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
> +		local_irq_disable();
> +		flush_tlb_func_local(info);

I think this isn't the correct function for PV guests.

In fact it should be much easier: just don't clear the current CPU from the
mask; that's all that's needed. The hypervisor is fine with having the
current CPU in the mask and will do the right thing.
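
As a rough illustration of the difference, here is a userspace sketch that
uses a plain 64-bit integer in place of struct cpumask. The type and helper
names (cpumask64_t, hypercall_mask_others(), hypercall_mask_multi()) are
made up for this example and assume at most 64 CPUs; they are not the
actual Xen or kernel interfaces.

```c
#include <stdint.h>

/* Hypothetical 64-CPU mask standing in for struct cpumask (cpu < 64). */
typedef uint64_t cpumask64_t;

/* Returns 1 if the given CPU is set in the mask, 0 otherwise. */
static inline int mask_test_cpu(unsigned int cpu, cpumask64_t mask)
{
	return (int)((mask >> cpu) & 1);
}

/*
 * flush_tlb_others() style: the current CPU is removed from the mask
 * before it is handed to the hypervisor, so the local flush has to be
 * done separately by the caller.
 */
static cpumask64_t hypercall_mask_others(cpumask64_t cpus, unsigned int self)
{
	return cpus & ~((cpumask64_t)1 << self);
}

/*
 * Suggested flush_tlb_multi() style for PV guests: pass the mask
 * through unchanged and let the hypervisor flush the current CPU too.
 */
static cpumask64_t hypercall_mask_multi(cpumask64_t cpus, unsigned int self)
{
	(void)self;	/* nothing is cleared */
	return cpus;
}
```

For a mask covering CPUs 0, 1 and 3 with the caller running on CPU 1, the
first helper hands CPUs 0 and 3 to the hypervisor, while the second hands
over all three and needs no separate local flush.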


Juergen

* [tip:timers/core] clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
From: tip-bot for Michael Kelley @ 2019-07-03  9:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: sunilmut, pcc, huw, vincenzo.frascino, salyzyn, apw, vkuznets,
	linux-kselftest, tglx, hpa, gregkh, pbonzini, olaf, marcelo.cerri,
	will.deacon, linux-kernel, rkrcmar, linux-mips, kvm, ralf,
	linux-arm-kernel, linux, kys, shuah, mark.rutland, paul.burton,
	marc.zyngier, mingo, bp, linux, sashal, sfr, mikelley,
	catalin.marinas, 0x7f454c46, daniel.lezcano, linux-hyperv,
	jasowang, linux-arch, arnd
In-Reply-To: <1561955054-1838-3-git-send-email-mikelley@microsoft.com>

Commit-ID:  dd2cb348613b44f9d948b068775e159aad298599
Gitweb:     https://git.kernel.org/tip/dd2cb348613b44f9d948b068775e159aad298599
Author:     Michael Kelley <mikelley@microsoft.com>
AuthorDate: Mon, 1 Jul 2019 04:26:06 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 3 Jul 2019 11:00:59 +0200

clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic

Continue consolidating Hyper-V clock and timer code into an ISA
independent Hyper-V clocksource driver.

Move the existing clocksource code under drivers/hv and arch/x86 to the new
clocksource driver while separating out the ISA dependencies. Update
Hyper-V initialization to call initialization and cleanup routines since
the Hyper-V synthetic clock is not independently enumerated in ACPI.
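
The split between ISA-specific accessors and the generic driver can be
pictured with the userspace sketch below. Only hv_get_time_ref_count() is a
name taken from the patch; the fake register, fake_read_time_ref_msr(),
read_hv_clock() and hv_ticks_to_ns() are hypothetical stand-ins, and the
real x86 accessor uses rdmsrl() rather than a variable.

```c
#include <stdint.h>

/* ---- "arch" side: stand-in for the ISA-specific header ---- */

/* Pretend register contents; the real code reads a Hyper-V MSR. */
static uint64_t fake_time_ref_count = 12345;

static inline uint64_t fake_read_time_ref_msr(void)
{
	return fake_time_ref_count;
}

/* The arch header supplies the accessor under an ISA-neutral name. */
#define hv_get_time_ref_count(val) \
	((val) = fake_read_time_ref_msr())

/* ---- "driver" side: no ISA-specific register access below ---- */

/* Reads the reference counter through the arch-provided accessor. */
static uint64_t read_hv_clock(void)
{
	uint64_t current_tick;

	hv_get_time_ref_count(current_tick);
	return current_tick;
}

/* The reference counter ticks in 100 ns units. */
static uint64_t hv_ticks_to_ns(uint64_t ticks)
{
	return ticks * 100;
}
```

Another architecture would only need to provide its own definition of the
accessor macro; the driver-side functions stay unchanged.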

Update Hyper-V clocksource users in KVM and VDSO to get definitions from
the new include file.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com

---
 arch/x86/entry/vdso/vma.c                |   2 +-
 arch/x86/hyperv/hv_init.c                |  91 +-------------------
 arch/x86/include/asm/mshyperv.h          |  81 +++---------------
 arch/x86/include/asm/vdso/gettimeofday.h |   2 +-
 arch/x86/kvm/x86.c                       |   1 +
 drivers/clocksource/hyperv_timer.c       | 139 +++++++++++++++++++++++++++++++
 drivers/hv/hv_util.c                     |   1 +
 include/clocksource/hyperv_timer.h       |  80 ++++++++++++++++++
 8 files changed, 237 insertions(+), 160 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8db1f594e8b1..349a61d8bf34 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -22,7 +22,7 @@
 #include <asm/page.h>
 #include <asm/desc.h>
 #include <asm/cpufeature.h>
-#include <asm/mshyperv.h>
+#include <clocksource/hyperv_timer.h>
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1608050e9df9..0e033ef11a9f 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -17,64 +17,13 @@
 #include <linux/version.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
-#include <linux/clockchips.h>
 #include <linux/hyperv.h>
 #include <linux/slab.h>
 #include <linux/cpuhotplug.h>
-
-#ifdef CONFIG_HYPERV_TSCPAGE
-
-static struct ms_hyperv_tsc_page *tsc_pg;
-
-struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
-{
-	return tsc_pg;
-}
-EXPORT_SYMBOL_GPL(hv_get_tsc_page);
-
-static u64 read_hv_clock_tsc(struct clocksource *arg)
-{
-	u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-	if (current_tick == U64_MAX)
-		rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
-
-	return current_tick;
-}
-
-static struct clocksource hyperv_cs_tsc = {
-		.name		= "hyperv_clocksource_tsc_page",
-		.rating		= 400,
-		.read		= read_hv_clock_tsc,
-		.mask		= CLOCKSOURCE_MASK(64),
-		.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
-};
-#endif
-
-static u64 read_hv_clock_msr(struct clocksource *arg)
-{
-	u64 current_tick;
-	/*
-	 * Read the partition counter to get the current tick count. This count
-	 * is set to 0 when the partition is created and is incremented in
-	 * 100 nanosecond units.
-	 */
-	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
-	return current_tick;
-}
-
-static struct clocksource hyperv_cs_msr = {
-	.name		= "hyperv_clocksource_msr",
-	.rating		= 400,
-	.read		= read_hv_clock_msr,
-	.mask		= CLOCKSOURCE_MASK(64),
-	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
-};
+#include <clocksource/hyperv_timer.h>
 
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);
-struct clocksource *hyperv_cs;
-EXPORT_SYMBOL_GPL(hyperv_cs);
 
 u32 *hv_vp_index;
 EXPORT_SYMBOL_GPL(hv_vp_index);
@@ -343,42 +292,8 @@ void __init hyperv_init(void)
 
 	x86_init.pci.arch_init = hv_pci_init;
 
-	/*
-	 * Register Hyper-V specific clocksource.
-	 */
-#ifdef CONFIG_HYPERV_TSCPAGE
-	if (ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE) {
-		union hv_x64_msr_hypercall_contents tsc_msr;
-
-		tsc_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
-		if (!tsc_pg)
-			goto register_msr_cs;
-
-		hyperv_cs = &hyperv_cs_tsc;
-
-		rdmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
-
-		tsc_msr.enable = 1;
-		tsc_msr.guest_physical_address = vmalloc_to_pfn(tsc_pg);
-
-		wrmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
-
-		hyperv_cs_tsc.archdata.vclock_mode = VCLOCK_HVCLOCK;
-
-		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
-		return;
-	}
-register_msr_cs:
-#endif
-	/*
-	 * For 32 bit guests just use the MSR based mechanism for reading
-	 * the partition counter.
-	 */
-
-	hyperv_cs = &hyperv_cs_msr;
-	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
-		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
-
+	/* Register Hyper-V specific clocksource */
+	hv_init_clocksource();
 	return;
 
 remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index cc60e617931c..f4fa8a9d5d0b 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -105,6 +105,17 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
 #define hv_get_crash_ctl(val) \
 	rdmsrl(HV_X64_MSR_CRASH_CTL, val)
 
+#define hv_get_time_ref_count(val) \
+	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, val)
+
+#define hv_get_reference_tsc(val) \
+	rdmsrl(HV_X64_MSR_REFERENCE_TSC, val)
+#define hv_set_reference_tsc(val) \
+	wrmsrl(HV_X64_MSR_REFERENCE_TSC, val)
+#define hv_set_clocksource_vdso(val) \
+	((val).archdata.vclock_mode = VCLOCK_HVCLOCK)
+#define hv_get_raw_timer() rdtsc_ordered()
+
 void hyperv_callback_vector(void);
 void hyperv_reenlightenment_vector(void);
 #ifdef CONFIG_TRACING
@@ -133,7 +144,6 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 
 
 #if IS_ENABLED(CONFIG_HYPERV)
-extern struct clocksource *hyperv_cs;
 extern void *hv_hypercall_pg;
 extern void  __percpu  **hyperv_pcpu_input_arg;
 
@@ -387,73 +397,4 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
 }
 #endif /* CONFIG_HYPERV */
 
-#ifdef CONFIG_HYPERV_TSCPAGE
-struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
-				       u64 *cur_tsc)
-{
-	u64 scale, offset;
-	u32 sequence;
-
-	/*
-	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
-	 * Top-Level Functional Specification ver. 3.0 and above. To get the
-	 * reference time we must do the following:
-	 * - READ ReferenceTscSequence
-	 *   A special '0' value indicates the time source is unreliable and we
-	 *   need to use something else. The currently published specification
-	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
-	 *   instead of '0' as the special value, see commit c35b82ef0294.
-	 * - ReferenceTime =
-	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
-	 * - READ ReferenceTscSequence again. In case its value has changed
-	 *   since our first reading we need to discard ReferenceTime and repeat
-	 *   the whole sequence as the hypervisor was updating the page in
-	 *   between.
-	 */
-	do {
-		sequence = READ_ONCE(tsc_pg->tsc_sequence);
-		if (!sequence)
-			return U64_MAX;
-		/*
-		 * Make sure we read sequence before we read other values from
-		 * TSC page.
-		 */
-		smp_rmb();
-
-		scale = READ_ONCE(tsc_pg->tsc_scale);
-		offset = READ_ONCE(tsc_pg->tsc_offset);
-		*cur_tsc = rdtsc_ordered();
-
-		/*
-		 * Make sure we read sequence after we read all other values
-		 * from TSC page.
-		 */
-		smp_rmb();
-
-	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
-
-	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
-}
-
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
-{
-	u64 cur_tsc;
-
-	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
-}
-
-#else
-static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
-{
-	return NULL;
-}
-
-static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
-				       u64 *cur_tsc)
-{
-	BUG();
-	return U64_MAX;
-}
-#endif
 #endif
diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h
index a14039a59abd..ae91429129a6 100644
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -18,7 +18,7 @@
 #include <asm/unistd.h>
 #include <asm/msr.h>
 #include <asm/pvclock.h>
-#include <asm/mshyperv.h>
+#include <clocksource/hyperv_timer.h>
 
 #define __vdso_data (VVAR(_vdso_data))
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ec676029365..5e1db26b5e15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -67,6 +67,7 @@
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/intel_pt.h>
+#include <clocksource/hyperv_timer.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 68a28af31561..ba2c79e6a0ee 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -14,6 +14,8 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/clockchips.h>
+#include <linux/clocksource.h>
+#include <linux/sched_clock.h>
 #include <linux/mm.h>
 #include <clocksource/hyperv_timer.h>
 #include <asm/hyperv-tlfs.h>
@@ -198,3 +200,140 @@ void hv_stimer_global_cleanup(void)
 	hv_stimer_free();
 }
 EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
+
+/*
+ * Code and definitions for the Hyper-V clocksources.  Two
+ * clocksources are defined: one that reads the Hyper-V defined MSR, and
+ * the other that uses the TSC reference page feature as defined in the
+ * TLFS.  The MSR version is for compatibility with old versions of
+ * Hyper-V and 32-bit x86.  The TSC reference page version is preferred.
+ */
+
+struct clocksource *hyperv_cs;
+EXPORT_SYMBOL_GPL(hyperv_cs);
+
+#ifdef CONFIG_HYPERV_TSCPAGE
+
+static struct ms_hyperv_tsc_page *tsc_pg;
+
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return tsc_pg;
+}
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
+
+static u64 notrace read_hv_sched_clock_tsc(void)
+{
+	u64 current_tick = hv_read_tsc_page(tsc_pg);
+
+	if (current_tick == U64_MAX)
+		hv_get_time_ref_count(current_tick);
+
+	return current_tick;
+}
+
+static u64 read_hv_clock_tsc(struct clocksource *arg)
+{
+	return read_hv_sched_clock_tsc();
+}
+
+static struct clocksource hyperv_cs_tsc = {
+	.name	= "hyperv_clocksource_tsc_page",
+	.rating	= 400,
+	.read	= read_hv_clock_tsc,
+	.mask	= CLOCKSOURCE_MASK(64),
+	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+#endif
+
+static u64 notrace read_hv_sched_clock_msr(void)
+{
+	u64 current_tick;
+	/*
+	 * Read the partition counter to get the current tick count. This count
+	 * is set to 0 when the partition is created and is incremented in
+	 * 100 nanosecond units.
+	 */
+	hv_get_time_ref_count(current_tick);
+	return current_tick;
+}
+
+static u64 read_hv_clock_msr(struct clocksource *arg)
+{
+	return read_hv_sched_clock_msr();
+}
+
+static struct clocksource hyperv_cs_msr = {
+	.name	= "hyperv_clocksource_msr",
+	.rating	= 400,
+	.read	= read_hv_clock_msr,
+	.mask	= CLOCKSOURCE_MASK(64),
+	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
+};
+
+#ifdef CONFIG_HYPERV_TSCPAGE
+static bool __init hv_init_tsc_clocksource(void)
+{
+	u64		tsc_msr;
+	phys_addr_t	phys_addr;
+
+	if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
+		return false;
+
+	tsc_pg = vmalloc(PAGE_SIZE);
+	if (!tsc_pg)
+		return false;
+
+	hyperv_cs = &hyperv_cs_tsc;
+	phys_addr = page_to_phys(vmalloc_to_page(tsc_pg));
+
+	/*
+	 * The Hyper-V TLFS specifies to preserve the value of reserved
+	 * bits in registers. So read the existing value, preserve the
+	 * low order 12 bits, and add in the guest physical address
+	 * (which already has at least the low 12 bits set to zero since
+	 * it is page aligned). Also set the "enable" bit, which is bit 0.
+	 */
+	hv_get_reference_tsc(tsc_msr);
+	tsc_msr &= GENMASK_ULL(11, 0);
+	tsc_msr = tsc_msr | 0x1 | (u64)phys_addr;
+	hv_set_reference_tsc(tsc_msr);
+
+	hv_set_clocksource_vdso(hyperv_cs_tsc);
+	clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
+
+	/* sched_clock_register is needed on ARM64 but is a no-op on x86 */
+	sched_clock_register(read_hv_sched_clock_tsc, 64, HV_CLOCK_HZ);
+	return true;
+}
+#else
+static bool __init hv_init_tsc_clocksource(void)
+{
+	return false;
+}
+#endif
+
+
+void __init hv_init_clocksource(void)
+{
+	/*
+	 * Try to set up the TSC page clocksource. If it succeeds, we're
+	 * done. Otherwise, set up the MSR clocksource.  At least one of
+	 * these will always be available except on very old versions of
+	 * Hyper-V on x86.  In that case we won't have a Hyper-V
+	 * clocksource, but Linux will still run with a clocksource based
+	 * on the emulated PIT or LAPIC timer.
+	 */
+	if (hv_init_tsc_clocksource())
+		return;
+
+	if (!(ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE))
+		return;
+
+	hyperv_cs = &hyperv_cs_msr;
+	clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+
+	/* sched_clock_register is needed on ARM64 but is a no-op on x86 */
+	sched_clock_register(read_hv_sched_clock_msr, 64, HV_CLOCK_HZ);
+}
+EXPORT_SYMBOL_GPL(hv_init_clocksource);
diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index 7d3d31f099ea..e32681ee7b9f 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -17,6 +17,7 @@
 #include <linux/hyperv.h>
 #include <linux/clockchips.h>
 #include <linux/ptp_clock_kernel.h>
+#include <clocksource/hyperv_timer.h>
 #include <asm/mshyperv.h>
 
 #include "hyperv_vmbus.h"
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
index 0cd73f7bc992..a821deb8ecb2 100644
--- a/include/clocksource/hyperv_timer.h
+++ b/include/clocksource/hyperv_timer.h
@@ -13,6 +13,10 @@
 #ifndef __CLKSOURCE_HYPERV_TIMER_H
 #define __CLKSOURCE_HYPERV_TIMER_H
 
+#include <linux/clocksource.h>
+#include <linux/math64.h>
+#include <asm/mshyperv.h>
+
 #define HV_MAX_MAX_DELTA_TICKS 0xffffffff
 #define HV_MIN_DELTA_TICKS 1
 
@@ -24,4 +28,80 @@ extern void hv_stimer_cleanup(unsigned int cpu);
 extern void hv_stimer_global_cleanup(void);
 extern void hv_stimer0_isr(void);
 
+#if IS_ENABLED(CONFIG_HYPERV)
+extern struct clocksource *hyperv_cs;
+extern void hv_init_clocksource(void);
+#endif /* CONFIG_HYPERV */
+
+#ifdef CONFIG_HYPERV_TSCPAGE
+extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
+
+static inline notrace u64
+hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg, u64 *cur_tsc)
+{
+	u64 scale, offset;
+	u32 sequence;
+
+	/*
+	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
+	 * Top-Level Functional Specification ver. 3.0 and above. To get the
+	 * reference time we must do the following:
+	 * - READ ReferenceTscSequence
+	 *   A special '0' value indicates the time source is unreliable and we
+	 *   need to use something else. The currently published specification
+	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
+	 *   instead of '0' as the special value, see commit c35b82ef0294.
+	 * - ReferenceTime =
+	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
+	 * - READ ReferenceTscSequence again. In case its value has changed
+	 *   since our first reading we need to discard ReferenceTime and repeat
+	 *   the whole sequence as the hypervisor was updating the page in
+	 *   between.
+	 */
+	do {
+		sequence = READ_ONCE(tsc_pg->tsc_sequence);
+		if (!sequence)
+			return U64_MAX;
+		/*
+		 * Make sure we read sequence before we read other values from
+		 * TSC page.
+		 */
+		smp_rmb();
+
+		scale = READ_ONCE(tsc_pg->tsc_scale);
+		offset = READ_ONCE(tsc_pg->tsc_offset);
+		*cur_tsc = hv_get_raw_timer();
+
+		/*
+		 * Make sure we read sequence after we read all other values
+		 * from TSC page.
+		 */
+		smp_rmb();
+
+	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
+
+	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline notrace u64
+hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+	u64 cur_tsc;
+
+	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
+}
+
+#else /* CONFIG_HYPERV_TSCPAGE */
+static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return NULL;
+}
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
+{
+	return U64_MAX;
+}
+#endif /* CONFIG_HYPERV_TSCPAGE */
+
 #endif
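
Editorial note: the TSC page read protocol documented in the comment in
hv_read_tsc_page_tsc() can be tried out in user space. The following is
only a sketch under stated assumptions, not the kernel implementation:
struct tsc_page_sketch, sketch_read_tsc_page() and the tsc_src pointer
are hypothetical names, the raw counter is read from memory instead of
through hv_get_raw_timer(), and GCC/Clang builtins (unsigned __int128,
__atomic_thread_fence) stand in for mul_u64_u64_shr() and smp_rmb().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the fields the reader uses. */
struct tsc_page_sketch {
	volatile uint32_t sequence;	/* 0 means "page not usable" */
	volatile uint64_t scale;
	volatile uint64_t offset;
};

/* (a * b) >> 64, using the 128-bit integer extension of GCC and Clang. */
static uint64_t sketch_mul_shr64(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/*
 * Returns the reference time in 100ns units, or UINT64_MAX when the
 * sequence is 0 and the caller has to fall back to another source.
 * tsc_src is an assumption of this sketch: it points at a value that
 * plays the role of the raw TSC.
 */
static uint64_t sketch_read_tsc_page(const struct tsc_page_sketch *pg,
				     const volatile uint64_t *tsc_src)
{
	uint32_t seq;
	uint64_t scale, offset, tsc;

	assert(pg && tsc_src);

	do {
		seq = pg->sequence;
		if (seq == 0)
			return UINT64_MAX;

		/* Read the sequence before the other page fields. */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		scale = pg->scale;
		offset = pg->offset;
		tsc = *tsc_src;

		/* Re-read the sequence only after all fields are read. */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		/* A changed sequence means the page was updated: retry. */
	} while (pg->sequence != seq);

	return sketch_mul_shr64(tsc, scale) + offset;
}
```

With a scale of 2^63 the multiply-and-shift halves the raw counter, so a
raw value of 1000 and an offset of 5 yield 505. A sequence of 0 yields
UINT64_MAX, which is the case read_hv_sched_clock_tsc() handles by
falling back to the MSR-based counter.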

* [tip:timers/core] clocksource/drivers: Make Hyper-V clocksource ISA agnostic
From: tip-bot for Michael Kelley @ 2019-07-03  9:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: 0x7f454c46, linux-kernel, apw, daniel.lezcano, linux-arch, sashal,
	linux, bp, pbonzini, linux-mips, paul.burton, mikelley,
	marc.zyngier, sunilmut, jasowang, linux-kselftest, mark.rutland,
	huw, ralf, olaf, rkrcmar, linux-hyperv, arnd, kys, linux, mingo,
	sfr, will.deacon, gregkh, vincenzo.frascino, tglx, kvm, shuah,
	pcc, vkuznets, marcelo.cerri, hpa, catalin.marinas, salyzyn,
	linux-arm-kernel
In-Reply-To: <1561955054-1838-2-git-send-email-mikelley@microsoft.com>

Commit-ID:  fd1fea6834d0f9f93062ae6685862908a9baed39
Gitweb:     https://git.kernel.org/tip/fd1fea6834d0f9f93062ae6685862908a9baed39
Author:     Michael Kelley <mikelley@microsoft.com>
AuthorDate: Mon, 1 Jul 2019 04:25:56 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 3 Jul 2019 11:00:59 +0200

clocksource/drivers: Make Hyper-V clocksource ISA agnostic

Hyper-V clock/timer code and data structures are currently mixed
in with other code in the ISA independent drivers/hv directory as
well as the ISA dependent Hyper-V code under arch/x86.

Consolidate this code and data structures into a Hyper-V clocksource driver
to better follow the Linux model. In doing so, separate out the ISA
dependent portions so the new clocksource driver works for x86 and for the
in-process Hyper-V on ARM64 code.

To start, move the existing clockevents code to create the new clocksource
driver. Update the VMbus driver to call initialization and cleanup routines
since the Hyper-V synthetic timers are not independently enumerated in
ACPI.

No behavior is changed and no new functionality is added.

Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-2-git-send-email-mikelley@microsoft.com

---
 MAINTAINERS                        |   2 +
 arch/x86/include/asm/hyperv-tlfs.h |   6 ++
 arch/x86/kernel/cpu/mshyperv.c     |   4 +-
 drivers/clocksource/Makefile       |   1 +
 drivers/clocksource/hyperv_timer.c | 200 +++++++++++++++++++++++++++++++++++++
 drivers/hv/Kconfig                 |   3 +
 drivers/hv/hv.c                    | 156 +----------------------------
 drivers/hv/hyperv_vmbus.h          |   3 -
 drivers/hv/vmbus_drv.c             |  42 ++++----
 include/clocksource/hyperv_timer.h |  27 +++++
 10 files changed, 268 insertions(+), 176 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index a2812972b7b0..bfde42a76a95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7313,6 +7313,7 @@ F:	arch/x86/include/asm/trace/hyperv.h
 F:	arch/x86/include/asm/hyperv-tlfs.h
 F:	arch/x86/kernel/cpu/mshyperv.c
 F:	arch/x86/hyperv
+F:	drivers/clocksource/hyperv_timer.c
 F:	drivers/hid/hid-hyperv.c
 F:	drivers/hv/
 F:	drivers/input/serio/hyperv-keyboard.c
@@ -7323,6 +7324,7 @@ F:	drivers/uio/uio_hv_generic.c
 F:	drivers/video/fbdev/hyperv_fb.c
 F:	drivers/iommu/hyperv_iommu.c
 F:	net/vmw_vsock/hyperv_transport.c
+F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
 F:	include/uapi/linux/hyperv.h
 F:	tools/hv/
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index cdf44aa9a501..af78cd72b8f3 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -401,6 +401,12 @@ enum HV_GENERIC_SET_FORMAT {
 #define HV_STATUS_INVALID_CONNECTION_ID		18
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19
 
+/*
+ * The Hyper-V TimeRefCount register and the TSC
+ * page provide a guest VM clock with 100ns tick rate
+ */
+#define HV_CLOCK_HZ (NSEC_PER_SEC/100)
+
 typedef struct _HV_REFERENCE_TSC_PAGE {
 	__u32 tsc_sequence;
 	__u32 res1;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 7df29f08871b..1e5f7a03ddf5 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -17,6 +17,7 @@
 #include <linux/irq.h>
 #include <linux/kexec.h>
 #include <linux/i8253.h>
+#include <linux/random.h>
 #include <asm/processor.h>
 #include <asm/hypervisor.h>
 #include <asm/hyperv-tlfs.h>
@@ -80,6 +81,7 @@ __visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs)
 	inc_irq_stat(hyperv_stimer0_count);
 	if (hv_stimer0_handler)
 		hv_stimer0_handler();
+	add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0);
 	ack_APIC_irq();
 
 	exiting_irq();
@@ -89,7 +91,7 @@ __visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs)
 int hv_setup_stimer0_irq(int *irq, int *vector, void (*handler)(void))
 {
 	*vector = HYPERV_STIMER0_VECTOR;
-	*irq = 0;   /* Unused on x86/x64 */
+	*irq = -1;   /* Unused on x86/x64 */
 	hv_stimer0_handler = handler;
 	return 0;
 }
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5582252efb31..2e7936e7833f 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -86,3 +86,4 @@ obj-$(CONFIG_ATCPIT100_TIMER)		+= timer-atcpit100.o
 obj-$(CONFIG_RISCV_TIMER)		+= timer-riscv.o
 obj-$(CONFIG_CSKY_MP_TIMER)		+= timer-mp-csky.o
 obj-$(CONFIG_GX6605S_TIMER)		+= timer-gx6605s.o
+obj-$(CONFIG_HYPERV_TIMER)		+= hyperv_timer.o
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
new file mode 100644
index 000000000000..68a28af31561
--- /dev/null
+++ b/drivers/clocksource/hyperv_timer.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Clocksource driver for the synthetic counter and timers
+ * provided by the Hyper-V hypervisor to guest VMs, as described
+ * in the Hyper-V Top Level Functional Spec (TLFS). This driver
+ * is instruction set architecture independent.
+ *
+ * Copyright (C) 2019, Microsoft, Inc.
+ *
+ * Author:  Michael Kelley <mikelley@microsoft.com>
+ */
+
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/clockchips.h>
+#include <linux/mm.h>
+#include <clocksource/hyperv_timer.h>
+#include <asm/hyperv-tlfs.h>
+#include <asm/mshyperv.h>
+
+static struct clock_event_device __percpu *hv_clock_event;
+
+/*
+ * If false, we're using the old mechanism for stimer0 interrupts
+ * where it sends a VMbus message when it expires. The old
+ * mechanism is used when running on older versions of Hyper-V
+ * that don't support Direct Mode. While Hyper-V provides
+ * four stimer's per CPU, Linux uses only stimer0.
+ */
+static bool direct_mode_enabled;
+
+static int stimer0_irq;
+static int stimer0_vector;
+static int stimer0_message_sint;
+
+/*
+ * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
+ * does not use VMbus or any VMbus messages, so process here and not
+ * in the VMbus driver code.
+ */
+void hv_stimer0_isr(void)
+{
+	struct clock_event_device *ce;
+
+	ce = this_cpu_ptr(hv_clock_event);
+	ce->event_handler(ce);
+}
+EXPORT_SYMBOL_GPL(hv_stimer0_isr);
+
+static int hv_ce_set_next_event(unsigned long delta,
+				struct clock_event_device *evt)
+{
+	u64 current_tick;
+
+	current_tick = hyperv_cs->read(NULL);
+	current_tick += delta;
+	hv_init_timer(0, current_tick);
+	return 0;
+}
+
+static int hv_ce_shutdown(struct clock_event_device *evt)
+{
+	hv_init_timer(0, 0);
+	hv_init_timer_config(0, 0);
+	if (direct_mode_enabled)
+		hv_disable_stimer0_percpu_irq(stimer0_irq);
+
+	return 0;
+}
+
+static int hv_ce_set_oneshot(struct clock_event_device *evt)
+{
+	union hv_stimer_config timer_cfg;
+
+	timer_cfg.as_uint64 = 0;
+	timer_cfg.enable = 1;
+	timer_cfg.auto_enable = 1;
+	if (direct_mode_enabled) {
+		/*
+		 * When it expires, the timer will directly interrupt
+		 * on the specified hardware vector/IRQ.
+		 */
+		timer_cfg.direct_mode = 1;
+		timer_cfg.apic_vector = stimer0_vector;
+		hv_enable_stimer0_percpu_irq(stimer0_irq);
+	} else {
+		/*
+		 * When it expires, the timer will generate a VMbus message,
+		 * to be handled by the normal VMbus interrupt handler.
+		 */
+		timer_cfg.direct_mode = 0;
+		timer_cfg.sintx = stimer0_message_sint;
+	}
+	hv_init_timer_config(0, timer_cfg.as_uint64);
+	return 0;
+}
+
+/*
+ * hv_stimer_init - Per-cpu initialization of the clockevent
+ */
+void hv_stimer_init(unsigned int cpu)
+{
+	struct clock_event_device *ce;
+
+	/*
+	 * Synthetic timers are always available except on old versions of
+	 * Hyper-V on x86.  In that case, just return as Linux will use a
+	 * clocksource based on emulated PIT or LAPIC timer hardware.
+	 */
+	if (!(ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE))
+		return;
+
+	ce = per_cpu_ptr(hv_clock_event, cpu);
+	ce->name = "Hyper-V clockevent";
+	ce->features = CLOCK_EVT_FEAT_ONESHOT;
+	ce->cpumask = cpumask_of(cpu);
+	ce->rating = 1000;
+	ce->set_state_shutdown = hv_ce_shutdown;
+	ce->set_state_oneshot = hv_ce_set_oneshot;
+	ce->set_next_event = hv_ce_set_next_event;
+
+	clockevents_config_and_register(ce,
+					HV_CLOCK_HZ,
+					HV_MIN_DELTA_TICKS,
+					HV_MAX_MAX_DELTA_TICKS);
+}
+EXPORT_SYMBOL_GPL(hv_stimer_init);
+
+/*
+ * hv_stimer_cleanup - Per-cpu cleanup of the clockevent
+ */
+void hv_stimer_cleanup(unsigned int cpu)
+{
+	struct clock_event_device *ce;
+
+	/* Turn off clockevent device */
+	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
+		ce = per_cpu_ptr(hv_clock_event, cpu);
+		hv_ce_shutdown(ce);
+	}
+}
+EXPORT_SYMBOL_GPL(hv_stimer_cleanup);
+
+/* hv_stimer_alloc - Global initialization of the clockevent and stimer0 */
+int hv_stimer_alloc(int sint)
+{
+	int ret;
+
+	hv_clock_event = alloc_percpu(struct clock_event_device);
+	if (!hv_clock_event)
+		return -ENOMEM;
+
+	direct_mode_enabled = ms_hyperv.misc_features &
+			HV_STIMER_DIRECT_MODE_AVAILABLE;
+	if (direct_mode_enabled) {
+		ret = hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
+				hv_stimer0_isr);
+		if (ret) {
+			free_percpu(hv_clock_event);
+			hv_clock_event = NULL;
+			return ret;
+		}
+	}
+
+	stimer0_message_sint = sint;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hv_stimer_alloc);
+
+/* hv_stimer_free - Free global resources allocated by hv_stimer_alloc() */
+void hv_stimer_free(void)
+{
+	if (direct_mode_enabled && (stimer0_irq != 0)) {
+		hv_remove_stimer0_irq(stimer0_irq);
+		stimer0_irq = 0;
+	}
+	free_percpu(hv_clock_event);
+	hv_clock_event = NULL;
+}
+EXPORT_SYMBOL_GPL(hv_stimer_free);
+
+/*
+ * Do a global cleanup of clockevents for the cases of kexec and
+ * vmbus exit
+ */
+void hv_stimer_global_cleanup(void)
+{
+	int	cpu;
+	struct clock_event_device *ce;
+
+	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
+		for_each_present_cpu(cpu) {
+			ce = per_cpu_ptr(hv_clock_event, cpu);
+			clockevents_unbind_device(ce, cpu);
+		}
+	}
+	hv_stimer_free();
+}
+EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 1c1a2514d6f3..c423e57ae888 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -10,6 +10,9 @@ config HYPERV
 	  Select this option to run Linux as a Hyper-V client operating
 	  system.
 
+config HYPERV_TIMER
+	def_bool HYPERV
+
 config HYPERV_TSCPAGE
        def_bool HYPERV && X86_64
 
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index a1ea482183e8..6188fb7dda42 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -16,27 +16,13 @@
 #include <linux/version.h>
 #include <linux/random.h>
 #include <linux/clockchips.h>
+#include <clocksource/hyperv_timer.h>
 #include <asm/mshyperv.h>
 #include "hyperv_vmbus.h"
 
 /* The one and only */
 struct hv_context hv_context;
 
-/*
- * If false, we're using the old mechanism for stimer0 interrupts
- * where it sends a VMbus message when it expires. The old
- * mechanism is used when running on older versions of Hyper-V
- * that don't support Direct Mode. While Hyper-V provides
- * four stimer's per CPU, Linux uses only stimer0.
- */
-static bool direct_mode_enabled;
-static int stimer0_irq;
-static int stimer0_vector;
-
-#define HV_TIMER_FREQUENCY (10 * 1000 * 1000) /* 100ns period */
-#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
-#define HV_MIN_DELTA_TICKS 1
-
 /*
  * hv_init - Main initialization routine.
  *
@@ -47,9 +33,6 @@ int hv_init(void)
 	hv_context.cpu_context = alloc_percpu(struct hv_per_cpu_context);
 	if (!hv_context.cpu_context)
 		return -ENOMEM;
-
-	direct_mode_enabled = ms_hyperv.misc_features &
-			HV_STIMER_DIRECT_MODE_AVAILABLE;
 	return 0;
 }
 
@@ -88,89 +71,6 @@ int hv_post_message(union hv_connection_id connection_id,
 	return status & 0xFFFF;
 }
 
-/*
- * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
- * does not use VMbus or any VMbus messages, so process here and not
- * in the VMbus driver code.
- */
-
-static void hv_stimer0_isr(void)
-{
-	struct hv_per_cpu_context *hv_cpu;
-
-	hv_cpu = this_cpu_ptr(hv_context.cpu_context);
-	hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
-	add_interrupt_randomness(stimer0_vector, 0);
-}
-
-static int hv_ce_set_next_event(unsigned long delta,
-				struct clock_event_device *evt)
-{
-	u64 current_tick;
-
-	WARN_ON(!clockevent_state_oneshot(evt));
-
-	current_tick = hyperv_cs->read(NULL);
-	current_tick += delta;
-	hv_init_timer(0, current_tick);
-	return 0;
-}
-
-static int hv_ce_shutdown(struct clock_event_device *evt)
-{
-	hv_init_timer(0, 0);
-	hv_init_timer_config(0, 0);
-	if (direct_mode_enabled)
-		hv_disable_stimer0_percpu_irq(stimer0_irq);
-
-	return 0;
-}
-
-static int hv_ce_set_oneshot(struct clock_event_device *evt)
-{
-	union hv_stimer_config timer_cfg;
-
-	timer_cfg.as_uint64 = 0;
-	timer_cfg.enable = 1;
-	timer_cfg.auto_enable = 1;
-	if (direct_mode_enabled) {
-		/*
-		 * When it expires, the timer will directly interrupt
-		 * on the specified hardware vector/IRQ.
-		 */
-		timer_cfg.direct_mode = 1;
-		timer_cfg.apic_vector = stimer0_vector;
-		hv_enable_stimer0_percpu_irq(stimer0_irq);
-	} else {
-		/*
-		 * When it expires, the timer will generate a VMbus message,
-		 * to be handled by the normal VMbus interrupt handler.
-		 */
-		timer_cfg.direct_mode = 0;
-		timer_cfg.sintx = VMBUS_MESSAGE_SINT;
-	}
-	hv_init_timer_config(0, timer_cfg.as_uint64);
-	return 0;
-}
-
-static void hv_init_clockevent_device(struct clock_event_device *dev, int cpu)
-{
-	dev->name = "Hyper-V clockevent";
-	dev->features = CLOCK_EVT_FEAT_ONESHOT;
-	dev->cpumask = cpumask_of(cpu);
-	dev->rating = 1000;
-	/*
-	 * Avoid settint dev->owner = THIS_MODULE deliberately as doing so will
-	 * result in clockevents_config_and_register() taking additional
-	 * references to the hv_vmbus module making it impossible to unload.
-	 */
-
-	dev->set_state_shutdown = hv_ce_shutdown;
-	dev->set_state_oneshot = hv_ce_set_oneshot;
-	dev->set_next_event = hv_ce_set_next_event;
-}
-
-
 int hv_synic_alloc(void)
 {
 	int cpu;
@@ -199,14 +99,6 @@ int hv_synic_alloc(void)
 		tasklet_init(&hv_cpu->msg_dpc,
 			     vmbus_on_msg_dpc, (unsigned long) hv_cpu);
 
-		hv_cpu->clk_evt = kzalloc(sizeof(struct clock_event_device),
-					  GFP_KERNEL);
-		if (hv_cpu->clk_evt == NULL) {
-			pr_err("Unable to allocate clock event device\n");
-			goto err;
-		}
-		hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
-
 		hv_cpu->synic_message_page =
 			(void *)get_zeroed_page(GFP_ATOMIC);
 		if (hv_cpu->synic_message_page == NULL) {
@@ -229,11 +121,6 @@ int hv_synic_alloc(void)
 		INIT_LIST_HEAD(&hv_cpu->chan_list);
 	}
 
-	if (direct_mode_enabled &&
-	    hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
-				hv_stimer0_isr))
-		goto err;
-
 	return 0;
 err:
 	/*
@@ -252,7 +139,6 @@ void hv_synic_free(void)
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
 
-		kfree(hv_cpu->clk_evt);
 		free_page((unsigned long)hv_cpu->synic_event_page);
 		free_page((unsigned long)hv_cpu->synic_message_page);
 		free_page((unsigned long)hv_cpu->post_msg_page);
@@ -311,36 +197,9 @@ int hv_synic_init(unsigned int cpu)
 
 	hv_set_synic_state(sctrl.as_uint64);
 
-	/*
-	 * Register the per-cpu clockevent source.
-	 */
-	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE)
-		clockevents_config_and_register(hv_cpu->clk_evt,
-						HV_TIMER_FREQUENCY,
-						HV_MIN_DELTA_TICKS,
-						HV_MAX_MAX_DELTA_TICKS);
-	return 0;
-}
-
-/*
- * hv_synic_clockevents_cleanup - Cleanup clockevent devices
- */
-void hv_synic_clockevents_cleanup(void)
-{
-	int cpu;
+	hv_stimer_init(cpu);
 
-	if (!(ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE))
-		return;
-
-	if (direct_mode_enabled)
-		hv_remove_stimer0_irq(stimer0_irq);
-
-	for_each_present_cpu(cpu) {
-		struct hv_per_cpu_context *hv_cpu
-			= per_cpu_ptr(hv_context.cpu_context, cpu);
-
-		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
-	}
+	return 0;
 }
 
 /*
@@ -388,14 +247,7 @@ int hv_synic_cleanup(unsigned int cpu)
 	if (channel_found && vmbus_connection.conn_state == CONNECTED)
 		return -EBUSY;
 
-	/* Turn off clockevent device */
-	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
-		struct hv_per_cpu_context *hv_cpu
-			= this_cpu_ptr(hv_context.cpu_context);
-
-		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
-		hv_ce_shutdown(hv_cpu->clk_evt);
-	}
+	hv_stimer_cleanup(cpu);
 
 	hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index b8e1ff05f110..362e70e9d145 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -138,7 +138,6 @@ struct hv_per_cpu_context {
 	 * per-cpu list of the channels based on their CPU affinity.
 	 */
 	struct list_head chan_list;
-	struct clock_event_device *clk_evt;
 };
 
 struct hv_context {
@@ -176,8 +175,6 @@ extern int hv_synic_init(unsigned int cpu);
 
 extern int hv_synic_cleanup(unsigned int cpu);
 
-extern void hv_synic_clockevents_cleanup(void);
-
 /* Interface */
 
 void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 92b1874b3eb3..72d5a7cde7ea 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -30,6 +30,7 @@
 #include <linux/kdebug.h>
 #include <linux/efi.h>
 #include <linux/random.h>
+#include <clocksource/hyperv_timer.h>
 #include "hyperv_vmbus.h"
 
 struct vmbus_dynid {
@@ -955,17 +956,6 @@ static void vmbus_onmessage_work(struct work_struct *work)
 	kfree(ctx);
 }
 
-static void hv_process_timer_expiration(struct hv_message *msg,
-					struct hv_per_cpu_context *hv_cpu)
-{
-	struct clock_event_device *dev = hv_cpu->clk_evt;
-
-	if (dev->event_handler)
-		dev->event_handler(dev);
-
-	vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
-}
-
 void vmbus_on_msg_dpc(unsigned long data)
 {
 	struct hv_per_cpu_context *hv_cpu = (void *)data;
@@ -1159,9 +1149,10 @@ static void vmbus_isr(void)
 
 	/* Check if there are actual msgs to be processed */
 	if (msg->header.message_type != HVMSG_NONE) {
-		if (msg->header.message_type == HVMSG_TIMER_EXPIRED)
-			hv_process_timer_expiration(msg, hv_cpu);
-		else
+		if (msg->header.message_type == HVMSG_TIMER_EXPIRED) {
+			hv_stimer0_isr();
+			vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
+		} else
 			tasklet_schedule(&hv_cpu->msg_dpc);
 	}
 
@@ -1263,14 +1254,19 @@ static int vmbus_bus_init(void)
 	ret = hv_synic_alloc();
 	if (ret)
 		goto err_alloc;
+
+	ret = hv_stimer_alloc(VMBUS_MESSAGE_SINT);
+	if (ret < 0)
+		goto err_alloc;
+
 	/*
-	 * Initialize the per-cpu interrupt state and
-	 * connect to the host.
+	 * Initialize the per-cpu interrupt state and stimer state.
+	 * Then connect to the host.
 	 */
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vmbus:online",
 				hv_synic_init, hv_synic_cleanup);
 	if (ret < 0)
-		goto err_alloc;
+		goto err_cpuhp;
 	hyperv_cpuhp_online = ret;
 
 	ret = vmbus_connect();
@@ -1318,6 +1314,8 @@ static int vmbus_bus_init(void)
 
 err_connect:
 	cpuhp_remove_state(hyperv_cpuhp_online);
+err_cpuhp:
+	hv_stimer_free();
 err_alloc:
 	hv_synic_free();
 	hv_remove_vmbus_irq();
@@ -2064,7 +2062,7 @@ static struct acpi_driver vmbus_acpi_driver = {
 
 static void hv_kexec_handler(void)
 {
-	hv_synic_clockevents_cleanup();
+	hv_stimer_global_cleanup();
 	vmbus_initiate_unload(false);
 	vmbus_connection.conn_state = DISCONNECTED;
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
@@ -2075,6 +2073,8 @@ static void hv_kexec_handler(void)
 
 static void hv_crash_handler(struct pt_regs *regs)
 {
+	int cpu;
+
 	vmbus_initiate_unload(true);
 	/*
 	 * In crash handler we can't schedule synic cleanup for all CPUs,
@@ -2082,7 +2082,9 @@ static void hv_crash_handler(struct pt_regs *regs)
 	 * for kdump.
 	 */
 	vmbus_connection.conn_state = DISCONNECTED;
-	hv_synic_cleanup(smp_processor_id());
+	cpu = smp_processor_id();
+	hv_stimer_cleanup(cpu);
+	hv_synic_cleanup(cpu);
 	hyperv_cleanup();
 };
 
@@ -2131,7 +2133,7 @@ static void __exit vmbus_exit(void)
 	hv_remove_kexec_handler();
 	hv_remove_crash_handler();
 	vmbus_connection.conn_state = DISCONNECTED;
-	hv_synic_clockevents_cleanup();
+	hv_stimer_global_cleanup();
 	vmbus_disconnect();
 	hv_remove_vmbus_irq();
 	for_each_online_cpu(cpu) {
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
new file mode 100644
index 000000000000..0cd73f7bc992
--- /dev/null
+++ b/include/clocksource/hyperv_timer.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Definitions for the clocksource provided by the Hyper-V
+ * hypervisor to guest VMs, as described in the Hyper-V Top
+ * Level Functional Spec (TLFS).
+ *
+ * Copyright (C) 2019, Microsoft, Inc.
+ *
+ * Author:  Michael Kelley <mikelley@microsoft.com>
+ */
+
+#ifndef __CLKSOURCE_HYPERV_TIMER_H
+#define __CLKSOURCE_HYPERV_TIMER_H
+
+#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
+#define HV_MIN_DELTA_TICKS 1
+
+/* Routines called by the VMbus driver */
+extern int hv_stimer_alloc(int sint);
+extern void hv_stimer_free(void);
+extern void hv_stimer_init(unsigned int cpu);
+extern void hv_stimer_cleanup(unsigned int cpu);
+extern void hv_stimer_global_cleanup(void);
+extern void hv_stimer0_isr(void);
+
+#endif
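
Editorial note: the vmbus_bus_init() hunk above adds an err_cpuhp label
so that hv_stimer_free() runs only once hv_stimer_alloc() has succeeded.
The sketch below models that unwind order in user space. It is an
illustration under assumptions, not the driver code: sketch_bus_init(),
the step names and the log buffer are hypothetical, each step is a stub
that can be told to fail, and unrelated cleanup such as
hv_remove_vmbus_irq() is left out.

```c
#include <assert.h>
#include <string.h>

/* Which (stubbed) step should fail; STEP_NONE means all succeed. */
enum sketch_step {
	STEP_NONE,
	STEP_SYNIC_ALLOC,
	STEP_STIMER_ALLOC,
	STEP_CPUHP,
	STEP_CONNECT,
};

/* Records the order of calls so the unwind path can be inspected. */
static char sketch_log[128];

static void sketch_note(const char *name)
{
	strcat(sketch_log, name);
	strcat(sketch_log, " ");
}

static int sketch_do(const char *name, enum sketch_step me,
		     enum sketch_step fail_at)
{
	sketch_note(name);
	return me == fail_at ? -1 : 0;
}

static int sketch_bus_init(enum sketch_step fail_at)
{
	int ret;

	sketch_log[0] = '\0';

	ret = sketch_do("synic_alloc", STEP_SYNIC_ALLOC, fail_at);
	if (ret)
		goto err_alloc;

	ret = sketch_do("stimer_alloc", STEP_STIMER_ALLOC, fail_at);
	if (ret)
		goto err_alloc;

	ret = sketch_do("cpuhp_setup", STEP_CPUHP, fail_at);
	if (ret)
		goto err_cpuhp;

	ret = sketch_do("connect", STEP_CONNECT, fail_at);
	if (ret)
		goto err_connect;

	return 0;

	/* Labels fall through, undoing steps in reverse order. */
err_connect:
	sketch_note("cpuhp_remove");
err_cpuhp:
	sketch_note("stimer_free");
err_alloc:
	sketch_note("synic_free");
	return ret;
}
```

After sketch_bus_init(STEP_CPUHP), the log shows stimer_free followed by
synic_free; after sketch_bus_init(STEP_STIMER_ALLOC), only synic_free
runs, because the stimer allocation never completed.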

* [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-07-02 23:51 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel
In-Reply-To: <20190702235151.4377-1-namit@vmware.com>

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and Hyper-V (Xen
and Hyper-V are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
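
Editorial note: the local-only check described above can be pictured
with a small user-space sketch. It is a simplification under
assumptions and not the code in arch/x86/mm/tlb.c: the CPU mask is
modeled as a 64-bit word rather than a struct cpumask, and
sketch_pick_flush() and the enum values are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_flush {
	SKETCH_FLUSH_NONE,	/* no CPU in the mask */
	SKETCH_FLUSH_LOCAL,	/* only the calling CPU: skip the SMP path */
	SKETCH_FLUSH_MULTI,	/* some other CPU is set: use the SMP path */
};

/*
 * Decide which path to take. mm_cpumask has bit N set when CPU N may
 * hold TLB entries for the address space; this_cpu is the caller.
 */
static enum sketch_flush sketch_pick_flush(uint64_t mm_cpumask,
					   unsigned int this_cpu)
{
	uint64_t self;

	assert(this_cpu < 64);
	self = UINT64_C(1) << this_cpu;

	/* Any bit other than our own means remote CPUs are involved. */
	if (mm_cpumask & ~self)
		return SKETCH_FLUSH_MULTI;

	if (mm_cpumask & self)
		return SKETCH_FLUSH_LOCAL;

	return SKETCH_FLUSH_NONE;
}
```

The extra mask test is the cost mentioned above; in exchange, the common
single-CPU case avoids the multiple function calls and the indirect
branch of the SMP infrastructure.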

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 | 13 +++---
 arch/x86/include/asm/paravirt.h       |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h       |  9 ++--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c                 | 11 +++--
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
 arch/x86/xen/mmu_pv.c                 | 20 ++++++---
 include/trace/events/xen.h            |  2 +-
 10 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..1177f863e4cd 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-				    const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
 	struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	u64 status = U64_MAX;
 	unsigned long flags;
 
-	trace_hyperv_mmu_flush_tlb_others(cpus, info);
+	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
 	if (!hv_hypercall_pg)
 		goto do_native;
@@ -69,6 +69,9 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 
 	local_irq_save(flags);
 
+	if (cpumask_test_cpu(smp_processor_id(), cpus))
+		flush_tlb_func_local(info);
+
 	flush_pcpu = (struct hv_tlb_flush **)
 		     this_cpu_ptr(hyperv_pcpu_input_arg);
 
@@ -156,7 +159,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!(status & HV_HYPERCALL_RESULT_MASK))
 		return;
 do_native:
-	native_flush_tlb_others(cpus, info);
+	native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +234,6 @@ void hyperv_setup_mmu_ops(void)
 		return;
 
 	pr_info("Using hypercall for remote TLB flush\n");
-	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..316959e89258 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
 {
-	PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..54f4c718b5b0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info);
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..36aa2a9b7597 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -517,7 +517,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ *  - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
@@ -563,13 +563,14 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_func_local(const struct flush_tlb_info *info);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -593,8 +594,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info)	\
-	native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
 
 #define paravirt_tlb_remove_table(tlb, page) \
 	tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
 
 #if IS_ENABLED(CONFIG_HYPERV)
 
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus,
 		     const struct flush_tlb_info *info),
 	    TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5169b8cc35bb..d00d551d4a2a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -594,6 +594,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -603,7 +608,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -628,7 +633,7 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..7cdcffe2a028 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -363,7 +363,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
-	.mmu.flush_tlb_others	= native_flush_tlb_others,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5c9b1607191d..074288a6916e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -635,7 +635,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(void *info)
+static void __flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
 	enum tlb_flush_reason reason;
@@ -645,6 +645,11 @@ static void flush_tlb_func_local(void *info)
 	flush_tlb_func_common(f, true, reason);
 }
 
+void flush_tlb_func_local(const struct flush_tlb_info *info)
+{
+	__flush_tlb_func_local((void *)info);
+}
+
 static void flush_tlb_func_remote(void *info)
 {
 	const struct flush_tlb_info *f = info;
@@ -665,9 +670,14 @@ static bool tlb_is_not_lazy(int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -687,10 +697,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		flush_tlb_func_local(info);
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -709,8 +721,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables) {
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 __flush_tlb_func_local,
+					 (void *)info, 1);
 	} else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -737,7 +750,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
-		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 __flush_tlb_func_local,
 					 (void *)info, 1);
 	}
 }
@@ -818,16 +832,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	/*
+	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
+	 * exception: for init_mm we do not need to flush anything, and the
+	 * cpumask does not correspond with loaded_mm.
+	 */
+	VM_WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
+			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) &&
+			mm != &init_mm);
+
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		flush_tlb_multi(mm_cpumask(mm), info);
+	} else {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
-
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -890,16 +917,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
+	} else {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local((void *)&full_flush_tlb_info);
+		flush_tlb_func_local(&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..19e481e6e904 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+				const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	const size_t mc_entry_size = sizeof(args->op) +
 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 
-	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	args = mcs.args;
 	args->op.arg2.vcpumask = to_cpumask(args->mask);
 
-	/* Remove us, and any offline CPUS. */
+	/* Flush locally if needed and remove us */
+	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
+		local_irq_disable();
+		flush_tlb_func_local(info);
+		local_irq_enable();
+
+		cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
+	}
+
+	/* Remove offline CPUS */
 	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	if (info->end != TLB_FLUSH_ALL &&
@@ -2406,7 +2414,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_one_user = xen_flush_tlb_one_user,
-	.flush_tlb_others = xen_flush_tlb_others,
+	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
 	.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
 	    TP_printk("addr %lx", __entry->addr)
 	);
 
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
 		     unsigned long addr, unsigned long end),
 	    TP_ARGS(cpus, mm, addr, end),
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 0/9] x86: Concurrent TLB flushes
From: Nadav Amit @ 2019-07-02 23:51 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Borislav Petkov, Boris Ostrovsky, Haiyang Zhang,
	Josh Poimboeuf, Juergen Gross, K. Y. Srinivasan, Paolo Bonzini,
	Rik van Riel, Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
	virtualization, xen-devel

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
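
The ordering described above (kick the remote CPUs, flush locally, then
wait) can be sketched in userspace with POSIX threads. This is a rough
analogy under stated assumptions, not the kernel implementation: threads
stand in for remote CPUs, a counter increment stands in for the flush,
and all demo_* names are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_NR_REMOTE 3	/* hypothetical number of remote "CPUs" */

static atomic_int demo_flushes_done;

/* Stand-in for the per-CPU flush work (e.g. a series of INVLPGs). */
static void demo_flush(void)
{
	atomic_fetch_add(&demo_flushes_done, 1);
}

static void *demo_remote_cpu(void *unused)
{
	(void)unused;
	demo_flush();
	return NULL;
}

/* Returns the total number of flushes performed, or -1 on error. */
static int demo_flush_concurrently(void)
{
	pthread_t remote[DEMO_NR_REMOTE];
	int i, started = 0;

	atomic_store(&demo_flushes_done, 0);

	/* 1. Ask the remote "CPUs" to start flushing. */
	for (i = 0; i < DEMO_NR_REMOTE; i++) {
		if (pthread_create(&remote[i], NULL, demo_remote_cpu, NULL))
			break;
		started++;
	}

	/* 2. Flush locally while the remote flushes are in flight. */
	demo_flush();

	/* 3. Only now wait for the remote "CPUs" to finish. */
	for (i = 0; i < started; i++)
		pthread_join(remote[i], NULL);

	if (started != DEMO_NR_REMOTE)
		return -1;
	return atomic_load(&demo_flushes_done);
}
```

The point of the sketch is step 2: the local work overlaps with the
remote work instead of running before the remote requests are sent.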

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one relies on the current
behavior of on_each_cpu(), which invokes functions remotely first and
only then locally [Peter says he remembers someone might do so, but
without further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.	tip-jun28 avg (stdev)	+patch-set avg (stdev)	change
  ---	---------------------	----------------------	------
  1	1267765 (14146)		1299253 (5715)		+2.4%
  2	1734644 (11936)		1799225 (19577)		+3.7%
  4	2821268 (41184)		2919132 (40149)		+3.4%
  8	4171652 (31243)		4376925 (65416)		+4.9%
  16	5590729 (24160)		5829866 (8127)		+4.2%
  24	6250212 (24481)		6522303 (28044)		+4.3%
  32	3994314 (26606)		4077543 (10685)		+2.0%
  48	4345177 (28091)		4417821 (41337)		+1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.	tip-jun28 avg (stdev)	+patch-set avg (stdev)	change
  ---	---------------------	----------------------	------
  1	1598896 (5174)		1607903 (4091)		+0.5%
  2	2109472 (17827)		2224726 (4372)		+5.4%
  4	3448587 (11952)		3668551 (30219)		+6.3%
  8	5425778 (29641)		5606266 (33519)		+3.3%
  16	6931232 (34677)		7054052 (27873)		+1.7%
  24	7612473 (23482)		7783138 (13871)		+2.2%
  32	4296274 (18029)		4283279 (32323)		-0.3%
  48	4770029 (35541)		4764760 (13575)		-0.1%

Presumably, PTI requires two invalidations of each mapping, so
concurrency yields a larger benefit when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  13 ++-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  48 +++++----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 147 ++++++++++++++++----------
 arch/x86/xen/mmu_pv.c                 |  20 ++--
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  27 +++--
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 133 +++++++++++------------
 14 files changed, 245 insertions(+), 178 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Randy Dunlap @ 2019-07-03  5:05 UTC (permalink / raw)
  To: Dexuan Cui, LKML, linux-hyperv@vger.kernel.org
  Cc: Matthew Wilcox, Jake Oshins, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, linux-pci, Bjorn Helgaas
In-Reply-To: <PU1P153MB016931FDE7BF095FB85783EEBFFB0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>

On 7/2/19 9:33 PM, Dexuan Cui wrote:
>> From: linux-hyperv-owner@vger.kernel.org
>> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Randy Dunlap
>> Sent: Tuesday, July 2, 2019 4:25 PM
>> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>>
>> drivers/pci/slot.o is only built when SYSFS is enabled, so
>> pci-hyperv.o has an implicit dependency on SYSFS.
>> Make that explicit.
>>
>> Also, depending on X86 && X86_64 is not needed, so just change that
>> to depend on X86_64.
>>
>> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
>> Hyper-V VMs")
> 
> I think the Fixes tag should be:
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> 
> Thanks,
> -- Dexuan
> 

Thanks.  I did have a little trouble with that.

-- 
~Randy

^ permalink raw reply

* RE: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Dexuan Cui @ 2019-07-03  4:40 UTC (permalink / raw)
  To: Randy Dunlap, Matthew Wilcox, Yuehaibing
  Cc: LKML, linux-hyperv@vger.kernel.org, Jake Oshins, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Sasha Levin, linux-pci,
	Bjorn Helgaas
In-Reply-To: <139b6a64-1980-412b-5870-88706084b288@infradead.org>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Randy Dunlap
> Sent: Tuesday, July 2, 2019 8:25 PM
> To: Matthew Wilcox <willy@infradead.org>
> Cc: LKML <linux-kernel@vger.kernel.org>; linux-hyperv@vger.kernel.org; Jake
> Oshins <jakeo@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Sasha Levin <sashal@kernel.org>; linux-pci
> <linux-pci@vger.kernel.org>; Bjorn Helgaas <bhelgaas@google.com>
> Subject: Re: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
> 
> On 7/2/19 5:15 PM, Matthew Wilcox wrote:
> > On Tue, Jul 02, 2019 at 04:24:30PM -0700, Randy Dunlap wrote:
> >> From: Randy Dunlap <rdunlap@infradead.org>
> >>
> >> Fix build errors when building almost-allmodconfig but with SYSFS
> >> not set (not enabled).  Fixes these build errors:
> >>
> >> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> >> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> >>
> >> drivers/pci/slot.o is only built when SYSFS is enabled, so
> >> pci-hyperv.o has an implicit dependency on SYSFS.
> >> Make that explicit.
> >
> > I wonder if we shouldn't rather provide no-op versions of
> > pci_create|destroy_slot for when SYSFS is not set?
> >
> 
> Makes sense.  I'm test-building that now.
> 
> --
> ~Randy

+ Yuehaibing, who submitted a similar patch, which I guess was neglected
at the end of the discussion last month:

https://lkml.org/lkml/2019/5/31/559
https://lkml.org/lkml/2019/6/14/784
https://lkml.org/lkml/2019/6/15/24

Thanks,
-- Dexuan

^ permalink raw reply

* RE: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Dexuan Cui @ 2019-07-03  4:33 UTC (permalink / raw)
  To: Randy Dunlap, LKML, linux-hyperv@vger.kernel.org
  Cc: Matthew Wilcox, Jake Oshins, KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, linux-pci, Bjorn Helgaas
In-Reply-To: <69c25bc3-da00-2758-92ee-13c82b51fc45@infradead.org>

> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Randy Dunlap
> Sent: Tuesday, July 2, 2019 4:25 PM
> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> 
> drivers/pci/slot.o is only built when SYSFS is enabled, so
> pci-hyperv.o has an implicit dependency on SYSFS.
> Make that explicit.
> 
> Also, depending on X86 && X86_64 is not needed, so just change that
> to depend on X86_64.
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
> Hyper-V VMs")

I think the Fixes tag should be:
Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Randy Dunlap @ 2019-07-03  3:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: LKML, linux-hyperv, Jake Oshins, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, linux-pci, Bjorn Helgaas
In-Reply-To: <20190703001541.GG1729@bombadil.infradead.org>

On 7/2/19 5:15 PM, Matthew Wilcox wrote:
> On Tue, Jul 02, 2019 at 04:24:30PM -0700, Randy Dunlap wrote:
>> From: Randy Dunlap <rdunlap@infradead.org>
>>
>> Fix build errors when building almost-allmodconfig but with SYSFS
>> not set (not enabled).  Fixes these build errors:
>>
>> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
>>
>> drivers/pci/slot.o is only built when SYSFS is enabled, so
>> pci-hyperv.o has an implicit dependency on SYSFS.
>> Make that explicit.
> 
> I wonder if we shouldn't rather provide no-op versions of
> pci_create|destroy_slot for when SYSFS is not set?
> 

Makes sense.  I'm test-building that now.

-- 
~Randy

^ permalink raw reply

* [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Randy Dunlap @ 2019-07-02 23:24 UTC (permalink / raw)
  To: LKML, linux-hyperv
  Cc: Matthew Wilcox, Jake Oshins, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, linux-pci, Bjorn Helgaas

From: Randy Dunlap <rdunlap@infradead.org>

Fix build errors when building almost-allmodconfig but with SYSFS
not set (not enabled).  Fixes these build errors:

ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!

drivers/pci/slot.o is only built when SYSFS is enabled, so
pci-hyperv.o has an implicit dependency on SYSFS.
Make that explicit.

Also, depending on X86 && X86_64 is not needed, so just change that
to depend on X86_64.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jake Oshins <jakeo@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: linux-hyperv@vger.kernel.org
---
 drivers/pci/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- lnx-52-rc7.orig/drivers/pci/Kconfig
+++ lnx-52-rc7/drivers/pci/Kconfig
@@ -181,7 +181,7 @@ config PCI_LABEL
 
 config PCI_HYPERV
         tristate "Hyper-V PCI Frontend"
-        depends on X86 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && X86_64
+        depends on X86_64 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && SYSFS
         help
           The PCI device frontend driver allows the kernel to import arbitrary
           PCI devices from a PCI backend to support PCI driver domains.



^ permalink raw reply

* Re: [PATCH] PCI: hv: fix pci-hyperv build, depends on SYSFS
From: Matthew Wilcox @ 2019-07-03  0:15 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: LKML, linux-hyperv, Jake Oshins, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, linux-pci, Bjorn Helgaas
In-Reply-To: <69c25bc3-da00-2758-92ee-13c82b51fc45@infradead.org>

On Tue, Jul 02, 2019 at 04:24:30PM -0700, Randy Dunlap wrote:
> From: Randy Dunlap <rdunlap@infradead.org>
> 
> Fix build errors when building almost-allmodconfig but with SYSFS
> not set (not enabled).  Fixes these build errors:
> 
> ERROR: "pci_destroy_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> ERROR: "pci_create_slot" [drivers/pci/controller/pci-hyperv.ko] undefined!
> 
> drivers/pci/slot.o is only built when SYSFS is enabled, so
> pci-hyperv.o has an implicit dependency on SYSFS.
> Make that explicit.

I wonder if we shouldn't rather provide no-op versions of
pci_create|destroy_slot for when SYSFS is not set?
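
The no-op approach suggested here might look roughly like the following
self-contained sketch. It only illustrates the stub pattern: the demo_*
names and DEMO_CONFIG_SYSFS are hypothetical, and returning NULL from
the stub is an assumption made for this example, not necessarily what
the PCI headers do.

```c
#include <stddef.h>

struct demo_slot {
	const char *name;
};

#ifdef DEMO_CONFIG_SYSFS
/* Real implementations, built only when the option is enabled. */
struct demo_slot *demo_create_slot(const char *name);
void demo_destroy_slot(struct demo_slot *slot);
#else
/* No-op stubs: callers still compile and link when the option is off. */
static inline struct demo_slot *demo_create_slot(const char *name)
{
	(void)name;
	return NULL;	/* assumption: "no slot" is reported as NULL */
}

static inline void demo_destroy_slot(struct demo_slot *slot)
{
	(void)slot;
}
#endif

/* A caller that builds unchanged in both configurations. */
static int demo_probe(void)
{
	struct demo_slot *slot = demo_create_slot("demo");

	if (!slot)
		return 0;	/* no slot information; carry on without it */

	demo_destroy_slot(slot);
	return 1;
}
```

Compared with adding a Kconfig dependency, stubs keep the driver
buildable without the option, at the cost of callers having to tolerate
a missing slot.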

^ permalink raw reply

* [PATCH v5 0/2] clocksource/drivers: Create new Hyper-V clocksource driver
From: Michael Kelley @ 2019-07-01  4:25 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	x86@kernel.org
  Cc: Michael Kelley, will.deacon@arm.com, catalin.marinas@arm.com,
	mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org,
	gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	vkuznets, jasowang@redhat.com, marcelo.cerri@canonical.com,
	Sunil Muthuswamy, KY Srinivasan, sashal@kernel.org,
	vincenzo.frascino@arm.com, linux-arch@vger.kernel.org,
	linux-mips@vger.kernel.org, linux-kselftest@vger.kernel.org,
	arnd@arndb.de, linux@armlinux.org.uk, ralf@linux-mips.org,
	paul.burton@mips.com, daniel.lezcano@linaro.org,
	salyzyn@android.com, pcc@google.com, shuah@kernel.org,
	0x7f454c46@gmail.com, linux@rasmusvillemoes.dk,
	huw@codeweavers.com, sfr@canb.auug.org.au, pbonzini@redhat.com,
	rkrcmar@redhat.com, kvm@vger.kernel.org

This patch series moves Hyper-V clock/timer code to a separate Hyper-V
clocksource driver. Previously, Hyper-V clock/timer code and data
structures were mixed in with other Hyper-V code in the ISA independent
drivers/hv code as well as in ISA dependent code. The new Hyper-V
clocksource driver is ISA agnostic, with just a few dependencies on
ISA specific functions. The patch series does not change any behavior
or functionality -- it only reorganizes the existing code and fixes up
the linkages. A few places outside of Hyper-V code are fixed up to use
the new #include file structure.

This restructuring is in response to Marc Zyngier's review comments
on supporting Hyper-V running on ARM64, and is a good idea in general.
It increases the amount of code shared between the x86 and ARM64
architectures, and reduces the size of the new code for supporting
Hyper-V on ARM64. A new version of the Hyper-V on ARM64 patches will
follow once this clocksource restructuring is accepted.

The code is diff'ed against the upstream tip tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/vdso

Changes in v5:
* Revised commit summaries [Thomas Gleixner]
* Removed call to clockevents_unbind_device() [Thomas Gleixner]
* Restructured hv_init_clocksource() [Thomas Gleixner]
* Various other small code cleanups [Thomas Gleixner]

Changes in v4:
* Revised commit messages
* Rebased to upstream tip tree

Changes in v3:
* Removed boolean argument to hv_init_clocksource(). Always call
sched_clock_register(), which is needed on ARM64 but a no-op on x86.
* Removed separate cpuhp setup in hv_stimer_alloc() and instead
directly call hv_stimer_init() and hv_stimer_cleanup() from the
corresponding VMbus functions.  This more closely matches the
original code and avoids clocksource stop/restart problems on ARM64
when the VMbus code denies a CPU offlining request.

Changes in v2:
* Revised commit short descriptions so the distinction between
the first and second patches is clearer [GregKH]
* Renamed new clocksource driver files and functions to use
existing "timer" and "stimer" names instead of introducing
"syntimer". [Vitaly Kuznetsov]
* Introduced CONFIG_HYPERV_TIMER to fix build problem when
CONFIG_HYPERV=m [Vitaly Kuznetsov]
* Added "Suggested-by: Marc Zyngier"

Michael Kelley (2):
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic

 MAINTAINERS                              |   2 +
 arch/x86/entry/vdso/vma.c                |   2 +-
 arch/x86/hyperv/hv_init.c                |  91 +--------
 arch/x86/include/asm/hyperv-tlfs.h       |   6 +
 arch/x86/include/asm/mshyperv.h          |  81 +-------
 arch/x86/include/asm/vdso/gettimeofday.h |   2 +-
 arch/x86/kernel/cpu/mshyperv.c           |   4 +-
 arch/x86/kvm/x86.c                       |   1 +
 drivers/clocksource/Makefile             |   1 +
 drivers/clocksource/hyperv_timer.c       | 339 +++++++++++++++++++++++++++++++
 drivers/hv/Kconfig                       |   3 +
 drivers/hv/hv.c                          | 156 +-------------
 drivers/hv/hv_util.c                     |   1 +
 drivers/hv/hyperv_vmbus.h                |   3 -
 drivers/hv/vmbus_drv.c                   |  42 ++--
 include/clocksource/hyperv_timer.h       | 105 ++++++++++
 16 files changed, 503 insertions(+), 336 deletions(-)
 create mode 100644 drivers/clocksource/hyperv_timer.c
 create mode 100644 include/clocksource/hyperv_timer.h

-- 
1.8.3.1

