Linux-HyperV List
 help / color / mirror / Atom feed
* [PATCH 13/13] uas: set virt_boundary_mask in the scsi host
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

This ensures all proper DMA layer handling is taken care of by the
SCSI midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/usb/storage/uas.c | 36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 047c5922618f..d20919e7bbf4 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -789,29 +789,9 @@ static int uas_slave_alloc(struct scsi_device *sdev)
 {
 	struct uas_dev_info *devinfo =
 		(struct uas_dev_info *)sdev->host->hostdata;
-	int maxp;
 
 	sdev->hostdata = devinfo;
 
-	/*
-	 * We have two requirements here. We must satisfy the requirements
-	 * of the physical HC and the demands of the protocol, as we
-	 * definitely want no additional memory allocation in this path
-	 * ruling out using bounce buffers.
-	 *
-	 * For a transmission on USB to continue we must never send
-	 * a package that is smaller than maxpacket. Hence the length of each
-         * scatterlist element except the last must be divisible by the
-         * Bulk maxpacket value.
-	 * If the HC does not ensure that through SG,
-	 * the upper layer must do that. We must assume nothing
-	 * about the capabilities off the HC, so we use the most
-	 * pessimistic requirement.
-	 */
-
-	maxp = usb_maxpacket(devinfo->udev, devinfo->data_in_pipe, 0);
-	blk_queue_virt_boundary(sdev->request_queue, maxp - 1);
-
 	/*
 	 * The protocol has no requirements on alignment in the strict sense.
 	 * Controllers may or may not have alignment restrictions.
@@ -1004,6 +984,22 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
 	 */
 	shost->can_queue = devinfo->qdepth - 2;
 
+	/*
+	 * We have two requirements here. We must satisfy the requirements of
+	 * the physical HC and the demands of the protocol, as we definitely
+	 * want no additional memory allocation in this path ruling out using
+	 * bounce buffers.
+	 *
+	 * For a transmission on USB to continue we must never send a package
+	 * that is smaller than maxpacket.  Hence the length of each scatterlist
+	 * element except the last must be divisible by the Bulk maxpacket
+	 * value.  If the HC does not ensure that through SG, the upper layer
+	 * must do that.  We must assume nothing about the capabilities off the
+	 * HC, so we use the most pessimistic requirement.
+	 */
+	shost->virt_boundary_mask =
+		usb_maxpacket(udev, devinfo->data_in_pipe, 0) - 1;
+
 	usb_set_intfdata(intf, shost);
 	result = scsi_add_host(shost, &intf->dev);
 	if (result)
-- 
2.20.1


^ permalink raw reply related

* [PATCH 05/13] scsi: add a host / host template field for the virt boundary
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

This allows drivers setting it up easily instead of branching out to
block layer calls in slave_alloc, and ensures the upgraded
max_segment_size setting gets picked up by the DMA layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/hosts.c     | 3 +++
 drivers/scsi/scsi_lib.c  | 3 ++-
 include/scsi/scsi_host.h | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index ff0d8c6a8d0c..55522b7162d3 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -462,6 +462,9 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	else
 		shost->dma_boundary = 0xffffffff;
 
+	if (sht->virt_boundary_mask)
+		shost->virt_boundary_mask = sht->virt_boundary_mask;
+
 	device_initialize(&shost->shost_gendev);
 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
 	shost->shost_gendev.bus = &scsi_bus_type;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 65d0a10c76ad..d333bb6b1c59 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1775,7 +1775,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 	dma_set_seg_boundary(dev, shost->dma_boundary);
 
 	blk_queue_max_segment_size(q, shost->max_segment_size);
-	dma_set_max_seg_size(dev, shost->max_segment_size);
+	blk_queue_virt_boundary(q, shost->virt_boundary_mask);
+	dma_set_max_seg_size(dev, queue_max_segment_size(q));
 
 	/*
 	 * Set a reasonable default alignment:  The larger of 32-byte (dword),
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a5fcdad4a03e..cc139dbd71e5 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -369,6 +369,8 @@ struct scsi_host_template {
 	 */
 	unsigned long dma_boundary;
 
+	unsigned long virt_boundary_mask;
+
 	/*
 	 * This specifies "machine infinity" for host templates which don't
 	 * limit the transfer size.  Note this limit represents an absolute
@@ -587,6 +589,7 @@ struct Scsi_Host {
 	unsigned int max_sectors;
 	unsigned int max_segment_size;
 	unsigned long dma_boundary;
+	unsigned long virt_boundary_mask;
 	/*
 	 * In scsi-mq mode, the number of hardware queues supported by the LLD.
 	 *
-- 
2.20.1


^ permalink raw reply related

* [PATCH 11/13] mpt3sas: set virt_boundary_mask in the scsi host
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

This ensures all proper DMA layer handling is taken care of by the
SCSI midlayer.  Note that the effect is global, as the IOMMU merging
is based off a paramters in struct device.  We could still turn if off
if no PCIe devices are present, but I don't know how to find that out.

Also remove the bogus nomerges flag, merges do take the virt_boundary
into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 1ccfbc7eebe0..03a0df2a3379 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2361,14 +2361,6 @@ scsih_slave_configure(struct scsi_device *sdev)
 		pcie_device_put(pcie_device);
 		spin_unlock_irqrestore(&ioc->pcie_device_lock, flags);
 		scsih_change_queue_depth(sdev, qdepth);
-		/* Enable QUEUE_FLAG_NOMERGES flag, so that IOs won't be
-		 ** merged and can eliminate holes created during merging
-		 ** operation.
-		 **/
-		blk_queue_flag_set(QUEUE_FLAG_NOMERGES,
-				sdev->request_queue);
-		blk_queue_virt_boundary(sdev->request_queue,
-				ioc->page_size - 1);
 		return 0;
 	}
 
@@ -10472,6 +10464,9 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	shost->transportt = mpt3sas_transport_template;
 	shost->unique_id = ioc->id;
 
+	/* XXX: only strictly needed if NVMe devices are attached */
+	shost->virt_boundary_mask = ioc->page_size - 1;
+
 	if (ioc->is_mcpu_endpoint) {
 		/* mCPU MPI support 64K max IO */
 		shost->max_sectors = 128;
-- 
2.20.1


^ permalink raw reply related

* [PATCH 06/13] ufshcd: set max_segment_size in the scsi host template
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

We need to also mirror the value to the device to ensure IOMMU merging
doesn't undo it, and the SCSI host level parameter will ensure that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/ufs/ufshcd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8c1c551f2b42..4e524ade489e 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -4586,8 +4586,6 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
 	struct request_queue *q = sdev->request_queue;
 
 	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
-	blk_queue_max_segment_size(q, PRDT_DATA_BYTE_COUNT_MAX);
-
 	return 0;
 }
 
@@ -6990,6 +6988,7 @@ static struct scsi_host_template ufshcd_driver_template = {
 	.sg_tablesize		= SG_ALL,
 	.cmd_per_lun		= UFSHCD_CMD_PER_LUN,
 	.can_queue		= UFSHCD_CAN_QUEUE,
+	.max_segment_size	= PRDT_DATA_BYTE_COUNT_MAX,
 	.max_host_blocked	= 1,
 	.track_queue_depth	= 1,
 	.sdev_groups		= ufshcd_driver_groups,
-- 
2.20.1


^ permalink raw reply related

* [PATCH 02/13] rsxx: don't call dma_set_max_seg_size
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

This driver does never uses dma_map_sg, so the setting is rather
pointless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/rsxx/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rsxx/core.c b/drivers/block/rsxx/core.c
index de9b2d2f8654..76b73ddf8fd7 100644
--- a/drivers/block/rsxx/core.c
+++ b/drivers/block/rsxx/core.c
@@ -767,7 +767,6 @@ static int rsxx_pci_probe(struct pci_dev *dev,
 		goto failed_enable;
 
 	pci_set_master(dev);
-	dma_set_max_seg_size(&dev->dev, RSXX_HW_BLK_SIZE);
 
 	st = dma_set_mask(&dev->dev, DMA_BIT_MASK(64));
 	if (st) {
-- 
2.20.1


^ permalink raw reply related

* [PATCH 07/13] storvsc: set virt_boundary_mask in the scsi host template
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

This ensures all proper DMA layer handling is taken care of by the
SCSI midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/storvsc_drv.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 8472de1007ff..e61051c026f6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1434,9 +1434,6 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
 {
 	blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
 
-	/* Ensure there are no gaps in presented sgls */
-	blk_queue_virt_boundary(sdevice->request_queue, PAGE_SIZE - 1);
-
 	sdevice->no_write_same = 1;
 
 	/*
@@ -1709,6 +1706,8 @@ static struct scsi_host_template scsi_driver = {
 	.this_id =		-1,
 	/* Make sure we dont get a sg segment crosses a page boundary */
 	.dma_boundary =		PAGE_SIZE-1,
+	/* Ensure there are no gaps in presented sgls */
+	.virt_boundary_mask =	PAGE_SIZE-1,
 	.no_write_same =	1,
 	.track_queue_depth =	1,
 };
-- 
2.20.1


^ permalink raw reply related

* [PATCH 03/13] mtip32xx: also set max_segment_size in the device
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

If we only set the max_segment_size on the queue an IOMMU merge might
create bigger segments again, so limit the IOMMU merges as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/mtip32xx/mtip32xx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index bacfdac7161c..a14b09ab3a41 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3676,6 +3676,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 	blk_queue_physical_block_size(dd->queue, 4096);
 	blk_queue_max_hw_sectors(dd->queue, 0xffff);
 	blk_queue_max_segment_size(dd->queue, 0x400000);
+	dma_set_max_seg_size(&dd->pdev->dev, 0x400000);
 	blk_queue_io_min(dd->queue, 4096);
 
 	/* Set the capacity of the device in 512 byte sectors. */
-- 
2.20.1


^ permalink raw reply related

* [PATCH 04/13] mmc: also set max_segment_size in the device
From: Christoph Hellwig @ 2019-06-05 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Ott, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
	Ulf Hansson, Alan Stern, Oliver Neukum, linux-block, linux-rdma,
	linux-mmc, linux-nvme, linux-scsi, megaraidlinux.pdl,
	MPT-FusionLinux.pdl, linux-hyperv, linux-usb, usb-storage,
	linux-kernel
In-Reply-To: <20190605190836.32354-1-hch@lst.de>

If we only set the max_segment_size on the queue an IOMMU merge might
create bigger segments again, so limit the IOMMU merges as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/mmc/core/queue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index b5b9c6142f08..92900a095796 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -377,6 +377,8 @@ static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
 	blk_queue_max_segment_size(mq->queue,
 			round_down(host->max_seg_size, block_size));
 
+	dma_set_max_seg_size(mmc_dev(host), queue_max_segment_size(mq->queue));
+
 	INIT_WORK(&mq->recovery_work, mmc_mq_recovery_handler);
 	INIT_WORK(&mq->complete_work, mmc_blk_mq_complete_work);
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] revert async probing of VMBus scsi device
From: Christoph Hellwig @ 2019-06-05 19:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Christoph Hellwig, linux-scsi, linux-hyperv, Stephen Hemminger
In-Reply-To: <20190605120640.00358689@hermes.lan>

On Wed, Jun 05, 2019 at 12:06:40PM -0700, Stephen Hemminger wrote:
> > Which is true for every device, and why we use UUIDs or label for
> > mounts that are supposed to be stable.
> 
> Not everyone is smart enough to do that.

Sure.  But they should not get a way out for just one specific driver.

^ permalink raw reply

* Re: [PATCH] revert async probing of VMBus scsi device
From: Stephen Hemminger @ 2019-06-05 19:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-scsi, linux-hyperv, Stephen Hemminger
In-Reply-To: <20190605185637.GA31439@infradead.org>

On Wed, 5 Jun 2019 11:56:37 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Jun 05, 2019 at 11:52:05AM -0700, Stephen Hemminger wrote:
> > Doing asynchronous probing can lead to reordered device names
> > which is leads to failed mounts.  
> 
> Which is true for every device, and why we use UUIDs or label for
> mounts that are supposed to be stable.

Not everyone is smart enough to do that.

^ permalink raw reply

* Re: [PATCH] revert async probing of VMBus scsi device
From: Christoph Hellwig @ 2019-06-05 18:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: linux-scsi, linux-hyperv, Stephen Hemminger
In-Reply-To: <20190605185205.12583-1-sthemmin@microsoft.com>

On Wed, Jun 05, 2019 at 11:52:05AM -0700, Stephen Hemminger wrote:
> Doing asynchronous probing can lead to reordered device names
> which is leads to failed mounts.

Which is true for every device, and why we use UUIDs or label for
mounts that are supposed to be stable.

^ permalink raw reply

* RE: [PATCH] revert async probing of VMBus scsi device
From: Haiyang Zhang @ 2019-06-05 18:54 UTC (permalink / raw)
  To: Stephen Hemminger, linux-scsi@vger.kernel.org
  Cc: linux-hyperv@vger.kernel.org, Stephen Hemminger
In-Reply-To: <20190605185205.12583-1-sthemmin@microsoft.com>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of Stephen Hemminger
> Sent: Wednesday, June 5, 2019 2:52 PM
> To: linux-scsi@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org; Stephen Hemminger
> <sthemmin@microsoft.com>
> Subject: [PATCH] revert async probing of VMBus scsi device
> 
> Doing asynchronous probing can lead to reordered device names which is
> leads to failed mounts.
> 
> Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv
> drivers")
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

^ permalink raw reply

* RE: [PATCH] revert async probing of VMBus network devices.
From: Haiyang Zhang @ 2019-06-05 18:54 UTC (permalink / raw)
  To: Stephen Hemminger, netdev@vger.kernel.org
  Cc: linux-hyperv@vger.kernel.org, Stephen Hemminger
In-Reply-To: <20190605185114.12456-1-sthemmin@microsoft.com>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of Stephen Hemminger
> Sent: Wednesday, June 5, 2019 2:51 PM
> To: netdev@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org; Stephen Hemminger
> <sthemmin@microsoft.com>
> Subject: [PATCH] revert async probing of VMBus network devices.
> 
> Doing asynchronous probing can lead to reordered network device names.
> And because udev doesn't have any useful information to construct a
> persistent name, this causes VM's to sporadically boot with reordered device
> names and no connectivity.
> 
> This shows up on the Ubuntu image on larger VM's where 30% of the time
> eth0 and eth1 get swapped.
> 
> Note: udev MAC address policy is disabled on Azure images because the
> netvsc and PCI VF will have the same mac address.
> 
> Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv
> drivers")
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

^ permalink raw reply

* [PATCH] revert async probing of VMBus scsi device
From: Stephen Hemminger @ 2019-06-05 18:52 UTC (permalink / raw)
  To: linux-scsi; +Cc: linux-hyperv, Stephen Hemminger

Doing asynchronous probing can lead to reordered device names
which is leads to failed mounts.

Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv drivers")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 8472de1007ff..56dcaa43b652 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1942,9 +1942,6 @@ static struct hv_driver storvsc_drv = {
 	.id_table = id_table,
 	.probe = storvsc_probe,
 	.remove = storvsc_remove,
-	.driver = {
-		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
-	},
 };
 
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
-- 
2.20.1


^ permalink raw reply related

* [PATCH] revert async probing of VMBus network devices.
From: Stephen Hemminger @ 2019-06-05 18:51 UTC (permalink / raw)
  To: netdev; +Cc: linux-hyperv, Stephen Hemminger

Doing asynchronous probing can lead to reordered network device names.
And because udev doesn't have any useful information to construct a
persistent name, this causes VM's to sporadically boot with reordered
device names and no connectivity.

This shows up on the Ubuntu image on larger VM's where 30% of the
time eth0 and eth1 get swapped.

Note: udev MAC address policy is disabled on Azure images
because the netvsc and PCI VF will have the same mac address.

Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv drivers")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 06393b215102..1a2c32111106 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2411,9 +2411,6 @@ static struct  hv_driver netvsc_drv = {
 	.id_table = id_table,
 	.probe = netvsc_probe,
 	.remove = netvsc_remove,
-	.driver = {
-		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
-	},
 };
 
 /*
-- 
2.20.1


^ permalink raw reply related

* RE: [PATCH] PCI: hv: Fix build error without CONFIG_SYSFS
From: Michael Kelley @ 2019-06-01 22:59 UTC (permalink / raw)
  To: YueHaibing, bhelgaas@google.com, Stephen Hemminger,
	sashal@kernel.org, Dexuan Cui, linux-hyperv@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
In-Reply-To: <20190531150923.12376-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>  Sent: Friday, May 31, 2019 8:09 AM
> 
> while building without CONFIG_SYSFS, fails as below:
> 
> drivers/pci/controller/pci-hyperv.o: In function 'hv_pci_assign_slots':
> pci-hyperv.c:(.text+0x40a): undefined reference to 'pci_create_slot'
> drivers/pci/controller/pci-hyperv.o: In function 'pci_devices_present_work':
> pci-hyperv.c:(.text+0xc02): undefined reference to 'pci_destroy_slot'
> drivers/pci/controller/pci-hyperv.o: In function 'hv_pci_remove':
> pci-hyperv.c:(.text+0xe50): undefined reference to 'pci_destroy_slot'
> drivers/pci/controller/pci-hyperv.o: In function 'hv_eject_device_work':
> pci-hyperv.c:(.text+0x11f9): undefined reference to 'pci_destroy_slot'
> 
> Select SYSFS while PCI_HYPERV is set to fix this.
> 

I'm wondering if is the right way to fix the problem.  Conceptually
is it possible to setup & operate virtual PCI devices like 
pci-hyperv.c does, even if sysfs is not present?  Or is it right to
always required sysfs?

The function pci_dev_assign_slot() in slot.c has a null implementation
in include/linux/pci.h when CONFIG_SYSFS is not defined, which
seems to be trying to solve the same problem for that function.  And
if CONFIG_HOTPLUG_PCI is defined but CONFIG_SYSFS is not,
pci_hp_create_module_link() and pci_hp_remove_module_link()
look like they would have the same problem.  Maybe there should
be degenerate implementations of pci_create_slot() and
pci_destroy_slot() for cases when CONFIG_SYSFS is not defined?

But I'll admit I don't know the full story behind how PCI slots
are represented and used, so maybe I'm off base.  I just noticed
the inconsistency in how other functions in slot.c are handled.

Thoughts?

Michael

^ permalink raw reply

* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-05-31 19:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Paolo Bonzini, Dave Hansen, Boris Ostrovsky,
	linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
In-Reply-To: <a847ee9c-4faf-c8b4-43bb-cc30e0980796@suse.com>

> On May 31, 2019, at 4:48 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 31/05/2019 08:36, Nadav Amit wrote:
>> 
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>> 	void (*flush_tlb_user)(void);
>> 	void (*flush_tlb_kernel)(void);
>> 	void (*flush_tlb_one_user)(unsigned long addr);
>> +	/*
>> +	 * flush_tlb_multi() is the preferred interface. When it is used,
>> +	 * flush_tlb_others() should return false.
> 
> Didn't you want to remove/change this comment?

Yes! Sorry for that. Fixed now.

^ permalink raw reply

* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Juergen Gross @ 2019-05-31 11:48 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Paolo Bonzini, Dave Hansen, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel
In-Reply-To: <20190531063645.4697-5-namit@vmware.com>

On 31/05/2019 08:36, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
> 
> Add a static key to tell whether this new interface is supported.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 |  2 +
>  arch/x86/include/asm/paravirt.h       |  8 +++
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/tlbflush.h       |  6 ++
>  arch/x86/kernel/kvm.c                 |  1 +
>  arch/x86/kernel/paravirt.c            |  3 +
>  arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
>  arch/x86/xen/mmu_pv.c                 |  2 +
>  8 files changed, 96 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e65d7fe6489f..ca28b400c87c 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>  	pr_info("Using hypercall for remote TLB flush\n");
>  	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>  	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> +
> +	static_key_disable(&flush_tlb_multi_enabled.key);
>  }
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25c38a05c1c..192be7254457 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>  #endif
>  }
>  
> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
> +
>  static inline void __flush_tlb(void)
>  {
>  	PVOP_VCALL0(mmu.flush_tlb_user);
> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
>  	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>  }
>  
> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
> +				   const struct flush_tlb_info *info)
> +{
> +	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
> +}
> +
>  static inline void flush_tlb_others(const struct cpumask *cpumask,
>  				    const struct flush_tlb_info *info)
>  {
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 946f8f1f1efc..3a156e63c57d 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>  	void (*flush_tlb_user)(void);
>  	void (*flush_tlb_kernel)(void);
>  	void (*flush_tlb_one_user)(unsigned long addr);
> +	/*
> +	 * flush_tlb_multi() is the preferred interface. When it is used,
> +	 * flush_tlb_others() should return false.

Didn't you want to remove/change this comment?


Juergen

^ permalink raw reply

* [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Juergen Gross, Paolo Bonzini,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel
In-Reply-To: <20190531063645.4697-1-namit@vmware.com>

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 |  2 +
 arch/x86/include/asm/paravirt.h       |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h       |  6 ++
 arch/x86/kernel/kvm.c                 |  1 +
 arch/x86/kernel/paravirt.c            |  3 +
 arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
 arch/x86/xen/mmu_pv.c                 |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+	static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
+{
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
 				    const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
+	/*
+	 * flush_tlb_multi() is the preferred interface. When it is used,
+	 * flush_tlb_others() should return false.
+	 */
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
 				 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5492a669f658..1314f89304a8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -171,6 +171,8 @@ unsigned paravirt_patch_insns(void *insn_buff, unsigned len,
 	return insn_len;
 }
 
+DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static void native_flush_tlb(void)
 {
 	__native_flush_tlb();
@@ -375,6 +377,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.flush_tlb_others	= native_flush_tlb_others,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ac98ad76f695..73d0d51b0f61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -550,7 +550,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -634,9 +634,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason;
+
+	reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
 	flush_tlb_func_common(f, true, reason);
 }
@@ -654,14 +657,30 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static inline bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * native_flush_tlb_multi() can handle a single CPU, but it is
+	 * suboptimal if the local TLB should be flushed, and therefore should
+	 * not be used in such case. Check that it is not used in such case,
+	 * and use this assumption for tracing and accounting of remote TLB
+	 * flushes.
+	 */
+	VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));
+
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -681,10 +700,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		local_irq_disable();
+		flush_tlb_func_local((__force void *)info);
+		local_irq_enable();
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -703,11 +726,39 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables)
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
-	else
-		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-				(void *)info, 1, GFP_ATOMIC, cpumask);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	else {
+		/*
+		 * Although we could have used on_each_cpu_cond_mask(),
+		 * open-coding it has several performance advantages: (1) we can
+		 * use specialized functions for remote and local flushes; (2)
+		 * no need for indirect branch to test if TLB is lazy; (3) we
+		 * can use a designated cpumask for evaluating the condition
+		 * instead of allocating a new one.
+		 *
+		 * This works under the assumption that there are no nested TLB
+		 * flushes, an assumption that is already made in
+		 * flush_tlb_mm_range().
+		 */
+		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+		int cpu;
+
+		cpumask_clear(cond_cpumask);
+
+		for_each_cpu(cpu, cpumask) {
+			if (tlb_is_not_lazy(cpu))
+				__cpumask_set_cpu(cpu, cond_cpumask);
+		}
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	}
+}
+
+void native_flush_tlb_others(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info)
+{
+	native_flush_tlb_multi(cpumask, info);
 }
 
 /*
@@ -773,10 +824,15 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 {
 	int this_cpu = smp_processor_id();
 
+	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+		flush_tlb_multi(cpumask, info);
+		return;
+	}
+
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local((__force void *)info);
 		local_irq_enable();
 	}
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..0cb277848cb4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2474,6 +2474,8 @@ void __init xen_init_mmu_ops(void)
 
 	pv_ops.mmu = xen_mmu_ops;
 
+	static_key_disable(&flush_tlb_multi_enabled.key);
+
 	memset(dummy_mapping, 0xff, PAGE_SIZE);
 }
 
-- 
2.20.1


^ permalink raw reply related

* RE: [PATCH v3 2/2] Drivers: hv: Move Hyper-V clocksource code to new clocksource driver
From: Vitaly Kuznetsov @ 2019-05-30 13:51 UTC (permalink / raw)
  To: Michael Kelley
  Cc: catalin.marinas@arm.com, mark.rutland@arm.com,
	will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <BYAPR21MB122115920E78B7897FDC7BE9D7180@BYAPR21MB1221.namprd21.prod.outlook.com>

Michael Kelley <mikelley@microsoft.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, May 30, 2019 2:48 AM
>> > +		/*
>> > +		 * sched_clock_register is needed on ARM64 but
>> > +		 * is a no-op on x86
>> > +		 */
>> > +		sched_clock_register(read_hv_sched_clock_msr,
>> > +						64, HV_CLOCK_HZ);
>> 
>> I'm not sure about ARM, but MSR-based clocksource would be a really bad
>> choice for sched clock on x86, this will slow things down
>> significantly. Luckily, as you're validly stating above,
>> sched_clock_register() is a no-op on x86 as we don't define
>> CONFIG_GENERIC_SCHED_CLOCK.
>> 
>> Can we actually *not* do sched_clock_register() in case
>> TSC page is unavailable (and revert to counting jiffies or whatever)?
>> 
>
> We can't skip the sched_clock_register() on ARM64 because it
> does define CONFIG_GENERIC_SCHED_CLOCK.  However, Hyper-V
> should always provide REFERENCE_TSC_AVAILALBE on ARM64,
> so we should never end up in the MSR-based code on ARM64.
> Arguably that means the call to sched_clock_register() could be
> removed since it's a no-op on x86.  But I'd like to keep it for symmetry
> and in case there's a testing/debugging situation on ARM64 where
> we want to clear REFERENCE_TSC_AVAILABLE and go down the
> MSR-based code path.

Ok, so it is just a fall-back and not going to be actively used. Thanks!

-- 
Vitaly

^ permalink raw reply

* RE: [PATCH v3 2/2] Drivers: hv: Move Hyper-V clocksource code to new clocksource driver
From: Michael Kelley @ 2019-05-30 12:39 UTC (permalink / raw)
  To: vkuznets
  Cc: catalin.marinas@arm.com, mark.rutland@arm.com,
	will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <87r28gl1d1.fsf@vitty.brq.redhat.com>

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, May 30, 2019 2:48 AM
> > +		/*
> > +		 * sched_clock_register is needed on ARM64 but
> > +		 * is a no-op on x86
> > +		 */
> > +		sched_clock_register(read_hv_sched_clock_msr,
> > +						64, HV_CLOCK_HZ);
> 
> I'm not sure about ARM, but MSR-based clocksource would be a really bad
> choice for sched clock on x86, this will slow things down
> significantly. Luckily, as you're validly stating above,
> sched_clock_register() is a no-op on x86 as we don't define
> CONFIG_GENERIC_SCHED_CLOCK.
> 
> Can we actually *not* do sched_clock_register() in case
> TSC page is unavailable (and revert to counting jiffies or whatever)?
> 

We can't skip the sched_clock_register() on ARM64 because it
does define CONFIG_GENERIC_SCHED_CLOCK.  However, Hyper-V
should always provide REFERENCE_TSC_AVAILALBE on ARM64,
so we should never end up in the MSR-based code on ARM64.
Arguably that means the call to sched_clock_register() could be
removed since it's a no-op on x86.  But I'd like to keep it for symmetry
and in case there's a testing/debugging situation on ARM64 where
we want to clear REFERENCE_TSC_AVAILABLE and go down the
MSR-based code path.

Michael

^ permalink raw reply

* Re: [PATCH v3 2/2] Drivers: hv: Move Hyper-V clocksource code to new clocksource driver
From: Vitaly Kuznetsov @ 2019-05-30  9:48 UTC (permalink / raw)
  To: Michael Kelley
  Cc: catalin.marinas@arm.com, mark.rutland@arm.com,
	will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <1558969089-13204-3-git-send-email-mikelley@microsoft.com>

Michael Kelley <mikelley@microsoft.com> writes:

> Code for the Hyper-V specific clocksources is currently mixed
> in with other Hyper-V code. Move the code to the Hyper-V specific
> driver in the "clocksource" directory, while separating out
> ISA dependencies so that the clocksource driver remains ISA
> independent. Update the Hyper-V initialization code to call
> initialization and cleanup routines since the Hyper-V synthetic
> timers are not independently enumerated in ACPI. Update Hyper-V
> clocksource users KVM and VDSO to get definitions from a new
> include file.
>
> No behavior is changed and no new functionality is added.
>
> Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
>  arch/x86/entry/vdso/vclock_gettime.c |   1 +
>  arch/x86/entry/vdso/vma.c            |   2 +-
>  arch/x86/hyperv/hv_init.c            |  91 +-----------------------
>  arch/x86/include/asm/mshyperv.h      |  81 +++-------------------
>  arch/x86/kvm/x86.c                   |   1 +
>  drivers/clocksource/hyperv_timer.c   | 130 +++++++++++++++++++++++++++++++++++
>  drivers/hv/hv_util.c                 |   1 +
>  include/clocksource/hyperv_timer.h   |  78 +++++++++++++++++++++
>  8 files changed, 226 insertions(+), 159 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 98c7d12..7983ca2 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -21,6 +21,7 @@
>  #include <linux/math64.h>
>  #include <linux/time.h>
>  #include <linux/kernel.h>
> +#include <clocksource/hyperv_timer.h>
>  
>  #define gtod (&VVAR(vsyscall_gtod_data))
>  
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index babc4e7..99b38e9 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -22,7 +22,7 @@
>  #include <asm/page.h>
>  #include <asm/desc.h>
>  #include <asm/cpufeature.h>
> -#include <asm/mshyperv.h>
> +#include <clocksource/hyperv_timer.h>
>  
>  #if defined(CONFIG_X86_64)
>  unsigned int __read_mostly vdso64_enabled = 1;
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index e4ba467..79f626a 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -27,64 +27,13 @@
>  #include <linux/version.h>
>  #include <linux/vmalloc.h>
>  #include <linux/mm.h>
> -#include <linux/clockchips.h>
>  #include <linux/hyperv.h>
>  #include <linux/slab.h>
>  #include <linux/cpuhotplug.h>
> -
> -#ifdef CONFIG_HYPERV_TSCPAGE
> -
> -static struct ms_hyperv_tsc_page *tsc_pg;
> -
> -struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> -{
> -	return tsc_pg;
> -}
> -EXPORT_SYMBOL_GPL(hv_get_tsc_page);
> -
> -static u64 read_hv_clock_tsc(struct clocksource *arg)
> -{
> -	u64 current_tick = hv_read_tsc_page(tsc_pg);
> -
> -	if (current_tick == U64_MAX)
> -		rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
> -
> -	return current_tick;
> -}
> -
> -static struct clocksource hyperv_cs_tsc = {
> -		.name		= "hyperv_clocksource_tsc_page",
> -		.rating		= 400,
> -		.read		= read_hv_clock_tsc,
> -		.mask		= CLOCKSOURCE_MASK(64),
> -		.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
> -};
> -#endif
> -
> -static u64 read_hv_clock_msr(struct clocksource *arg)
> -{
> -	u64 current_tick;
> -	/*
> -	 * Read the partition counter to get the current tick count. This count
> -	 * is set to 0 when the partition is created and is incremented in
> -	 * 100 nanosecond units.
> -	 */
> -	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
> -	return current_tick;
> -}
> -
> -static struct clocksource hyperv_cs_msr = {
> -	.name		= "hyperv_clocksource_msr",
> -	.rating		= 400,
> -	.read		= read_hv_clock_msr,
> -	.mask		= CLOCKSOURCE_MASK(64),
> -	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
> -};
> +#include <clocksource/hyperv_timer.h>
>  
>  void *hv_hypercall_pg;
>  EXPORT_SYMBOL_GPL(hv_hypercall_pg);
> -struct clocksource *hyperv_cs;
> -EXPORT_SYMBOL_GPL(hyperv_cs);
>  
>  u32 *hv_vp_index;
>  EXPORT_SYMBOL_GPL(hv_vp_index);
> @@ -353,42 +302,8 @@ void __init hyperv_init(void)
>  
>  	x86_init.pci.arch_init = hv_pci_init;
>  
> -	/*
> -	 * Register Hyper-V specific clocksource.
> -	 */
> -#ifdef CONFIG_HYPERV_TSCPAGE
> -	if (ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE) {
> -		union hv_x64_msr_hypercall_contents tsc_msr;
> -
> -		tsc_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
> -		if (!tsc_pg)
> -			goto register_msr_cs;
> -
> -		hyperv_cs = &hyperv_cs_tsc;
> -
> -		rdmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
> -
> -		tsc_msr.enable = 1;
> -		tsc_msr.guest_physical_address = vmalloc_to_pfn(tsc_pg);
> -
> -		wrmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
> -
> -		hyperv_cs_tsc.archdata.vclock_mode = VCLOCK_HVCLOCK;
> -
> -		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
> -		return;
> -	}
> -register_msr_cs:
> -#endif
> -	/*
> -	 * For 32 bit guests just use the MSR based mechanism for reading
> -	 * the partition counter.
> -	 */
> -
> -	hyperv_cs = &hyperv_cs_msr;
> -	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
> -		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
> -
> +	/* Register Hyper-V specific clocksource */
> +	hv_init_clocksource();
>  	return;
>  
>  remove_cpuhp_state:
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index cc60e61..f4fa8a9 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -105,6 +105,17 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
>  #define hv_get_crash_ctl(val) \
>  	rdmsrl(HV_X64_MSR_CRASH_CTL, val)
>  
> +#define hv_get_time_ref_count(val) \
> +	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, val)
> +
> +#define hv_get_reference_tsc(val) \
> +	rdmsrl(HV_X64_MSR_REFERENCE_TSC, val)
> +#define hv_set_reference_tsc(val) \
> +	wrmsrl(HV_X64_MSR_REFERENCE_TSC, val)
> +#define hv_set_clocksource_vdso(val) \
> +	((val).archdata.vclock_mode = VCLOCK_HVCLOCK)
> +#define hv_get_raw_timer() rdtsc_ordered()
> +
>  void hyperv_callback_vector(void);
>  void hyperv_reenlightenment_vector(void);
>  #ifdef CONFIG_TRACING
> @@ -133,7 +144,6 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
>  
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
> -extern struct clocksource *hyperv_cs;
>  extern void *hv_hypercall_pg;
>  extern void  __percpu  **hyperv_pcpu_input_arg;
>  
> @@ -387,73 +397,4 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
>  }
>  #endif /* CONFIG_HYPERV */
>  
> -#ifdef CONFIG_HYPERV_TSCPAGE
> -struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
> -static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> -				       u64 *cur_tsc)
> -{
> -	u64 scale, offset;
> -	u32 sequence;
> -
> -	/*
> -	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
> -	 * Top-Level Functional Specification ver. 3.0 and above. To get the
> -	 * reference time we must do the following:
> -	 * - READ ReferenceTscSequence
> -	 *   A special '0' value indicates the time source is unreliable and we
> -	 *   need to use something else. The currently published specification
> -	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
> -	 *   instead of '0' as the special value, see commit c35b82ef0294.
> -	 * - ReferenceTime =
> -	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
> -	 * - READ ReferenceTscSequence again. In case its value has changed
> -	 *   since our first reading we need to discard ReferenceTime and repeat
> -	 *   the whole sequence as the hypervisor was updating the page in
> -	 *   between.
> -	 */
> -	do {
> -		sequence = READ_ONCE(tsc_pg->tsc_sequence);
> -		if (!sequence)
> -			return U64_MAX;
> -		/*
> -		 * Make sure we read sequence before we read other values from
> -		 * TSC page.
> -		 */
> -		smp_rmb();
> -
> -		scale = READ_ONCE(tsc_pg->tsc_scale);
> -		offset = READ_ONCE(tsc_pg->tsc_offset);
> -		*cur_tsc = rdtsc_ordered();
> -
> -		/*
> -		 * Make sure we read sequence after we read all other values
> -		 * from TSC page.
> -		 */
> -		smp_rmb();
> -
> -	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
> -
> -	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
> -}
> -
> -static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
> -{
> -	u64 cur_tsc;
> -
> -	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
> -}
> -
> -#else
> -static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> -{
> -	return NULL;
> -}
> -
> -static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> -				       u64 *cur_tsc)
> -{
> -	BUG();
> -	return U64_MAX;
> -}
> -#endif
>  #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 536b78c..3fbaac0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -70,6 +70,7 @@
>  #include <asm/mshyperv.h>
>  #include <asm/hypervisor.h>
>  #include <asm/intel_pt.h>
> +#include <clocksource/hyperv_timer.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include "trace.h"
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index 30615f3..274614d 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -14,6 +14,8 @@
>  #include <linux/percpu.h>
>  #include <linux/cpumask.h>
>  #include <linux/clockchips.h>
> +#include <linux/clocksource.h>
> +#include <linux/sched_clock.h>
>  #include <linux/mm.h>
>  #include <clocksource/hyperv_timer.h>
>  #include <asm/hyperv-tlfs.h>
> @@ -189,3 +191,131 @@ void hv_stimer_global_cleanup(void)
>  	hv_stimer_free();
>  }
>  EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
> +
> +/*
> + * Code and definitions for the Hyper-V clocksources.  Two
> + * clocksources are defined: one that reads the Hyper-V defined MSR, and
> + * the other that uses the TSC reference page feature as defined in the
> + * TLFS.  The MSR version is for compatibility with old versions of
> + * Hyper-V and 32-bit x86.  The TSC reference page version is preferred.
> + */
> +
> +struct clocksource *hyperv_cs;
> +EXPORT_SYMBOL_GPL(hyperv_cs);
> +
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +
> +static struct ms_hyperv_tsc_page *tsc_pg;
> +
> +struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> +{
> +	return tsc_pg;
> +}
> +EXPORT_SYMBOL_GPL(hv_get_tsc_page);
> +
> +static u64 read_hv_sched_clock_tsc(void)
> +{
> +	u64 current_tick = hv_read_tsc_page(tsc_pg);
> +
> +	if (current_tick == U64_MAX)
> +		hv_get_time_ref_count(current_tick);
> +
> +	return current_tick;
> +}
> +
> +static u64 read_hv_clock_tsc(struct clocksource *arg)
> +{
> +	return read_hv_sched_clock_tsc();
> +}
> +
> +static struct clocksource hyperv_cs_tsc = {
> +		.name		= "hyperv_clocksource_tsc_page",
> +		.rating		= 400,
> +		.read		= read_hv_clock_tsc,
> +		.mask		= CLOCKSOURCE_MASK(64),
> +		.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
> +};
> +#endif
> +
> +static u64 read_hv_sched_clock_msr(void)
> +{
> +	u64 current_tick;
> +	/*
> +	 * Read the partition counter to get the current tick count. This count
> +	 * is set to 0 when the partition is created and is incremented in
> +	 * 100 nanosecond units.
> +	 */
> +	hv_get_time_ref_count(current_tick);
> +	return current_tick;
> +}
> +
> +static u64 read_hv_clock_msr(struct clocksource *arg)
> +{
> +	return read_hv_sched_clock_msr();
> +}
> +
> +static struct clocksource hyperv_cs_msr = {
> +	.name		= "hyperv_clocksource_msr",
> +	.rating		= 400,
> +	.read		= read_hv_clock_msr,
> +	.mask		= CLOCKSOURCE_MASK(64),
> +	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
> +};
> +
> +void __init hv_init_clocksource(void)
> +{
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +	if (ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE) {
> +
> +		u64		tsc_msr;
> +		phys_addr_t	phys_addr;
> +
> +		tsc_pg = vmalloc(PAGE_SIZE);
> +		if (!tsc_pg)
> +			goto register_msr_cs;
> +
> +		hyperv_cs = &hyperv_cs_tsc;
> +		phys_addr = page_to_phys(vmalloc_to_page(tsc_pg));
> +
> +		/* The Hyper-V TLFS specifies to preserve the value of
> +		 * reserved bits in registers.  So read the existing
> +		 * value, preserve the low order 12 bits, and add in
> +		 * the guest physical address (which already has at
> +		 * least the low 12 bits set to zero since it is page
> +		 * aligned). Also set the "enable" bit, which is bit 0.
> +		 */
> +		hv_get_reference_tsc(tsc_msr);
> +		tsc_msr &= GENMASK_ULL(11, 0);
> +		tsc_msr = tsc_msr | 0x1 | (u64)phys_addr;
> +		hv_set_reference_tsc(tsc_msr);
> +
> +		hv_set_clocksource_vdso(hyperv_cs_tsc);
> +		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
> +
> +		/*
> +		 * sched_clock_register is needed on ARM64 but
> +		 * is a no-op on x86
> +		 */
> +		sched_clock_register(read_hv_sched_clock_tsc,
> +						64, HV_CLOCK_HZ);
> +		return;
> +	}
> +register_msr_cs:
> +#endif
> +	/*
> +	 * For 32 bit guests just use the MSR based mechanism for reading
> +	 * the partition counter.
> +	 */
> +	hyperv_cs = &hyperv_cs_msr;
> +	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE) {
> +		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
> +
> +		/*
> +		 * sched_clock_register is needed on ARM64 but
> +		 * is a no-op on x86
> +		 */
> +		sched_clock_register(read_hv_sched_clock_msr,
> +						64, HV_CLOCK_HZ);

I'm not sure about ARM, but MSR-based clocksource would be a really bad
choice for sched clock on x86, this will slow things down
significantly. Luckily, as you're validly stating above,
sched_clock_register() is a no-op on x86 as we don't define
CONFIG_GENERIC_SCHED_CLOCK.

Can we actually *not* do sched_clock_register() in case
TSC page is unavailable (and revert to counting jiffies or whatever)?

> +	}
> +}
> +EXPORT_SYMBOL_GPL(hv_init_clocksource);
> diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
> index f10eeb1..9eff85e 100644
> --- a/drivers/hv/hv_util.c
> +++ b/drivers/hv/hv_util.c
> @@ -29,6 +29,7 @@
>  #include <linux/hyperv.h>
>  #include <linux/clockchips.h>
>  #include <linux/ptp_clock_kernel.h>
> +#include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
>  
>  #include "hyperv_vmbus.h"
> diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
> index ba5dc17..e17e8a2 100644
> --- a/include/clocksource/hyperv_timer.h
> +++ b/include/clocksource/hyperv_timer.h
> @@ -13,6 +13,10 @@
>  #ifndef __CLKSOURCE_HYPERV_TIMER_H
>  #define __CLKSOURCE_HYPERV_TIMER_H
>  
> +#include <linux/clocksource.h>
> +#include <linux/math64.h>
> +#include <asm/mshyperv.h>
> +
>  #define HV_MAX_MAX_DELTA_TICKS 0xffffffff
>  #define HV_MIN_DELTA_TICKS 1
>  
> @@ -24,4 +28,78 @@
>  extern void hv_stimer_global_cleanup(void);
>  extern void hv_stimer0_isr(void);
>  
> +#if IS_ENABLED(CONFIG_HYPERV)
> +extern struct clocksource *hyperv_cs;
> +extern void hv_init_clocksource(void);
> +#endif /* CONFIG_HYPERV */
> +
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> +				       u64 *cur_tsc)
> +{
> +	u64 scale, offset;
> +	u32 sequence;
> +
> +	/*
> +	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
> +	 * Top-Level Functional Specification ver. 3.0 and above. To get the
> +	 * reference time we must do the following:
> +	 * - READ ReferenceTscSequence
> +	 *   A special '0' value indicates the time source is unreliable and we
> +	 *   need to use something else. The currently published specification
> +	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
> +	 *   instead of '0' as the special value, see commit c35b82ef0294.
> +	 * - ReferenceTime =
> +	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
> +	 * - READ ReferenceTscSequence again. In case its value has changed
> +	 *   since our first reading we need to discard ReferenceTime and repeat
> +	 *   the whole sequence as the hypervisor was updating the page in
> +	 *   between.
> +	 */
> +	do {
> +		sequence = READ_ONCE(tsc_pg->tsc_sequence);
> +		if (!sequence)
> +			return U64_MAX;
> +		/*
> +		 * Make sure we read sequence before we read other values from
> +		 * TSC page.
> +		 */
> +		smp_rmb();
> +
> +		scale = READ_ONCE(tsc_pg->tsc_scale);
> +		offset = READ_ONCE(tsc_pg->tsc_offset);
> +		*cur_tsc = hv_get_raw_timer();
> +
> +		/*
> +		 * Make sure we read sequence after we read all other values
> +		 * from TSC page.
> +		 */
> +		smp_rmb();
> +
> +	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
> +
> +	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
> +}
> +
> +static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
> +{
> +	u64 cur_tsc;
> +
> +	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
> +}
> +
> +#else /* CONFIG_HYPERV_TSC_PAGE */
> +static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> +{
> +	return NULL;
> +}
> +
> +static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> +				       u64 *cur_tsc)
> +{
> +	return U64_MAX;
> +}
> +#endif /* CONFIG_HYPERV_TSCPAGE */
> +
>  #endif

With the (theoretical) question above answered,

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly

^ permalink raw reply

* Re: [PATCH v3 1/2] Drivers: hv: Create Hyper-V clocksource driver from existing clockevents code
From: Vitaly Kuznetsov @ 2019-05-30  9:37 UTC (permalink / raw)
  To: Michael Kelley
  Cc: catalin.marinas@arm.com, mark.rutland@arm.com,
	will.deacon@arm.com, marc.zyngier@arm.com,
	linux-arm-kernel@lists.infradead.org, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <1558969089-13204-2-git-send-email-mikelley@microsoft.com>

Michael Kelley <mikelley@microsoft.com> writes:

> Clockevents code for Hyper-V synthetic timers is currently mixed
> in with other Hyper-V code. Move the code to a Hyper-V specific
> driver in the "clocksource" directory. Update the VMbus driver
> to call initialization and cleanup routines since the Hyper-V
> synthetic timers are not independently enumerated in ACPI.
>
> No behavior is changed and no new functionality is added.
>
> Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
>  MAINTAINERS                        |   2 +
>  arch/x86/include/asm/hyperv-tlfs.h |   6 ++
>  arch/x86/kernel/cpu/mshyperv.c     |   2 +
>  drivers/clocksource/Makefile       |   1 +
>  drivers/clocksource/hyperv_timer.c | 191 +++++++++++++++++++++++++++++++++++++
>  drivers/hv/Kconfig                 |   3 +
>  drivers/hv/hv.c                    | 156 +-----------------------------
>  drivers/hv/hyperv_vmbus.h          |   3 -
>  drivers/hv/vmbus_drv.c             |  42 ++++----
>  include/clocksource/hyperv_timer.h |  27 ++++++
>  10 files changed, 258 insertions(+), 175 deletions(-)
>  create mode 100644 drivers/clocksource/hyperv_timer.c
>  create mode 100644 include/clocksource/hyperv_timer.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 992f1dd..cf2a5b7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7294,6 +7294,7 @@ F:	arch/x86/include/asm/trace/hyperv.h
>  F:	arch/x86/include/asm/hyperv-tlfs.h
>  F:	arch/x86/kernel/cpu/mshyperv.c
>  F:	arch/x86/hyperv
> +F:	drivers/clocksource/hyperv_timer.c
>  F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/input/serio/hyperv-keyboard.c
> @@ -7304,6 +7305,7 @@ F:	drivers/uio/uio_hv_generic.c
>  F:	drivers/video/fbdev/hyperv_fb.c
>  F:	drivers/iommu/hyperv_iommu.c
>  F:	net/vmw_vsock/hyperv_transport.c
> +F:	include/clocksource/hyperv_timer.h
>  F:	include/linux/hyperv.h
>  F:	include/uapi/linux/hyperv.h
>  F:	tools/hv/
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index cdf44aa..af78cd7 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -401,6 +401,12 @@ enum HV_GENERIC_SET_FORMAT {
>  #define HV_STATUS_INVALID_CONNECTION_ID		18
>  #define HV_STATUS_INSUFFICIENT_BUFFERS		19
>  
> +/*
> + * The Hyper-V TimeRefCount register and the TSC
> + * page provide a guest VM clock with 100ns tick rate
> + */
> +#define HV_CLOCK_HZ (NSEC_PER_SEC/100)
> +
>  typedef struct _HV_REFERENCE_TSC_PAGE {
>  	__u32 tsc_sequence;
>  	__u32 res1;
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index faae611..b18ede4 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -21,6 +21,7 @@
>  #include <linux/irq.h>
>  #include <linux/kexec.h>
>  #include <linux/i8253.h>
> +#include <linux/random.h>
>  #include <asm/processor.h>
>  #include <asm/hypervisor.h>
>  #include <asm/hyperv-tlfs.h>
> @@ -84,6 +85,7 @@ __visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs)
>  	inc_irq_stat(hyperv_stimer0_count);
>  	if (hv_stimer0_handler)
>  		hv_stimer0_handler();
> +	add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0);
>  	ack_APIC_irq();
>  
>  	exiting_irq();
> diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
> index 236858f..2b65c5f 100644
> --- a/drivers/clocksource/Makefile
> +++ b/drivers/clocksource/Makefile
> @@ -84,3 +84,4 @@ obj-$(CONFIG_ATCPIT100_TIMER)		+= timer-atcpit100.o
>  obj-$(CONFIG_RISCV_TIMER)		+= timer-riscv.o
>  obj-$(CONFIG_CSKY_MP_TIMER)		+= timer-mp-csky.o
>  obj-$(CONFIG_GX6605S_TIMER)		+= timer-gx6605s.o
> +obj-$(CONFIG_HYPERV_TIMER)		+= hyperv_timer.o
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> new file mode 100644
> index 0000000..30615f3
> --- /dev/null
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -0,0 +1,191 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Clocksource driver for the synthetic counter and timers
> + * provided by the Hyper-V hypervisor to guest VMs, as described
> + * in the Hyper-V Top Level Functional Spec (TLFS). This driver
> + * is instruction set architecture independent.
> + *
> + * Copyright (C) 2019, Microsoft, Inc.
> + *
> + * Author:  Michael Kelley <mikelley@microsoft.com>
> + */
> +
> +#include <linux/percpu.h>
> +#include <linux/cpumask.h>
> +#include <linux/clockchips.h>
> +#include <linux/mm.h>
> +#include <clocksource/hyperv_timer.h>
> +#include <asm/hyperv-tlfs.h>
> +#include <asm/mshyperv.h>
> +
> +static struct clock_event_device __percpu *hv_clock_event;
> +
> +/*
> + * If false, we're using the old mechanism for stimer0 interrupts
> + * where it sends a VMbus message when it expires. The old
> + * mechanism is used when running on older versions of Hyper-V
> + * that don't support Direct Mode. While Hyper-V provides
> + * four stimer's per CPU, Linux uses only stimer0.
> + */
> +static bool direct_mode_enabled;
> +
> +static int stimer0_irq;
> +static int stimer0_vector;
> +static int stimer0_message_sint;
> +
> +/*
> + * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
> + * does not use VMbus or any VMbus messages, so process here and not
> + * in the VMbus driver code.
> + */
> +void hv_stimer0_isr(void)
> +{
> +	struct clock_event_device *ce;
> +
> +	ce = this_cpu_ptr(hv_clock_event);
> +	ce->event_handler(ce);
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer0_isr);
> +
> +static int hv_ce_set_next_event(unsigned long delta,
> +				struct clock_event_device *evt)
> +{
> +	u64 current_tick;
> +
> +	WARN_ON(!clockevent_state_oneshot(evt));
> +
> +	current_tick = hyperv_cs->read(NULL);
> +	current_tick += delta;
> +	hv_init_timer(0, current_tick);
> +	return 0;
> +}
> +
> +static int hv_ce_shutdown(struct clock_event_device *evt)
> +{
> +	hv_init_timer(0, 0);
> +	hv_init_timer_config(0, 0);
> +	if (direct_mode_enabled)
> +		hv_disable_stimer0_percpu_irq(stimer0_irq);
> +
> +	return 0;
> +}
> +
> +static int hv_ce_set_oneshot(struct clock_event_device *evt)
> +{
> +	union hv_stimer_config timer_cfg;
> +
> +	timer_cfg.as_uint64 = 0;
> +	timer_cfg.enable = 1;
> +	timer_cfg.auto_enable = 1;
> +	if (direct_mode_enabled) {
> +		/*
> +		 * When it expires, the timer will directly interrupt
> +		 * on the specified hardware vector/IRQ.
> +		 */
> +		timer_cfg.direct_mode = 1;
> +		timer_cfg.apic_vector = stimer0_vector;
> +		hv_enable_stimer0_percpu_irq(stimer0_irq);
> +	} else {
> +		/*
> +		 * When it expires, the timer will generate a VMbus message,
> +		 * to be handled by the normal VMbus interrupt handler.
> +		 */
> +		timer_cfg.direct_mode = 0;
> +		timer_cfg.sintx = stimer0_message_sint;
> +	}
> +	hv_init_timer_config(0, timer_cfg.as_uint64);
> +	return 0;
> +}
> +
> +/*
> + * hv_stimer_init - Per-cpu initialization of the clockevent
> + */
> +int hv_stimer_init(unsigned int cpu)
> +{
> +	struct clock_event_device *ce;
> +
> +	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> +		ce = per_cpu_ptr(hv_clock_event, cpu);
> +		ce->name = "Hyper-V clockevent";
> +		ce->features = CLOCK_EVT_FEAT_ONESHOT;
> +		ce->cpumask = cpumask_of(cpu);
> +		ce->rating = 1000;
> +		ce->set_state_shutdown = hv_ce_shutdown;
> +		ce->set_state_oneshot = hv_ce_set_oneshot;
> +		ce->set_next_event = hv_ce_set_next_event;
> +
> +		clockevents_config_and_register(ce,
> +						HV_CLOCK_HZ,
> +						HV_MIN_DELTA_TICKS,
> +						HV_MAX_MAX_DELTA_TICKS);
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer_init);
> +
> +/*
> + * hv_stimer_cleanup - Per-cpu cleanup of the clockevent
> + */
> +int hv_stimer_cleanup(unsigned int cpu)
> +{
> +	struct clock_event_device *ce;
> +
> +	/* Turn off clockevent device */
> +	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> +		ce = per_cpu_ptr(hv_clock_event, cpu);
> +		clockevents_unbind_device(ce, cpu);
> +		hv_ce_shutdown(ce);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer_cleanup);
> +
> +/* hv_stimer_alloc - Global initialization of the clockevent and stimer0 */
> +int hv_stimer_alloc(int sint)
> +{
> +	hv_clock_event = alloc_percpu(struct clock_event_device);
> +	if (!hv_clock_event)
> +		return -ENOMEM;
> +
> +	direct_mode_enabled = ms_hyperv.misc_features &
> +			HV_STIMER_DIRECT_MODE_AVAILABLE;
> +	if (direct_mode_enabled &&
> +	    hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
> +				hv_stimer0_isr)) {
> +		free_percpu(hv_clock_event);
> +		return -EINVAL;
> +	}
> +
> +	stimer0_message_sint = sint;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer_alloc);
> +
> +/* hv_stimer_free - Free global resources allocated by hv_stimer_alloc() */
> +void hv_stimer_free(void)
> +{
> +	if (direct_mode_enabled)
> +		hv_remove_stimer0_irq(stimer0_irq);
> +	free_percpu(hv_clock_event);
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer_free);
> +
> +/*
> + * Do a global cleanup of clockevents for the cases of kexec and
> + * vmbus exit
> + */
> +void hv_stimer_global_cleanup(void)
> +{
> +	int	cpu;
> +	struct clock_event_device *ce;
> +
> +	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE)
> +		for_each_present_cpu(cpu) {
> +			ce = per_cpu_ptr(hv_clock_event, cpu);
> +			clockevents_unbind_device(ce, cpu);
> +		}
> +	hv_stimer_free();
> +}
> +EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 1c1a251..c423e57 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -10,6 +10,9 @@ config HYPERV
>  	  Select this option to run Linux as a Hyper-V client operating
>  	  system.
>  
> +config HYPERV_TIMER
> +	def_bool HYPERV
> +
>  config HYPERV_TSCPAGE
>         def_bool HYPERV && X86_64
>  
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 4565302..bd6452f 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -29,6 +29,7 @@
>  #include <linux/version.h>
>  #include <linux/random.h>
>  #include <linux/clockchips.h>

(very minor nit, just an idea for future cleanup):

I think we can throw away

 #include <linux/random.h>
 #include <linux/clockchips.h>

now.

> +#include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
>  #include "hyperv_vmbus.h"
>  
> @@ -36,21 +37,6 @@
>  struct hv_context hv_context;
>  
>  /*
> - * If false, we're using the old mechanism for stimer0 interrupts
> - * where it sends a VMbus message when it expires. The old
> - * mechanism is used when running on older versions of Hyper-V
> - * that don't support Direct Mode. While Hyper-V provides
> - * four stimer's per CPU, Linux uses only stimer0.
> - */
> -static bool direct_mode_enabled;
> -static int stimer0_irq;
> -static int stimer0_vector;
> -
> -#define HV_TIMER_FREQUENCY (10 * 1000 * 1000) /* 100ns period */
> -#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
> -#define HV_MIN_DELTA_TICKS 1
> -
> -/*
>   * hv_init - Main initialization routine.
>   *
>   * This routine must be called before any other routines in here are called
> @@ -60,9 +46,6 @@ int hv_init(void)
>  	hv_context.cpu_context = alloc_percpu(struct hv_per_cpu_context);
>  	if (!hv_context.cpu_context)
>  		return -ENOMEM;
> -
> -	direct_mode_enabled = ms_hyperv.misc_features &
> -			HV_STIMER_DIRECT_MODE_AVAILABLE;
>  	return 0;
>  }
>  
> @@ -101,89 +84,6 @@ int hv_post_message(union hv_connection_id connection_id,
>  	return status & 0xFFFF;
>  }
>  
> -/*
> - * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
> - * does not use VMbus or any VMbus messages, so process here and not
> - * in the VMbus driver code.
> - */
> -
> -static void hv_stimer0_isr(void)
> -{
> -	struct hv_per_cpu_context *hv_cpu;
> -
> -	hv_cpu = this_cpu_ptr(hv_context.cpu_context);
> -	hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt);
> -	add_interrupt_randomness(stimer0_vector, 0);
> -}
> -
> -static int hv_ce_set_next_event(unsigned long delta,
> -				struct clock_event_device *evt)
> -{
> -	u64 current_tick;
> -
> -	WARN_ON(!clockevent_state_oneshot(evt));
> -
> -	current_tick = hyperv_cs->read(NULL);
> -	current_tick += delta;
> -	hv_init_timer(0, current_tick);
> -	return 0;
> -}
> -
> -static int hv_ce_shutdown(struct clock_event_device *evt)
> -{
> -	hv_init_timer(0, 0);
> -	hv_init_timer_config(0, 0);
> -	if (direct_mode_enabled)
> -		hv_disable_stimer0_percpu_irq(stimer0_irq);
> -
> -	return 0;
> -}
> -
> -static int hv_ce_set_oneshot(struct clock_event_device *evt)
> -{
> -	union hv_stimer_config timer_cfg;
> -
> -	timer_cfg.as_uint64 = 0;
> -	timer_cfg.enable = 1;
> -	timer_cfg.auto_enable = 1;
> -	if (direct_mode_enabled) {
> -		/*
> -		 * When it expires, the timer will directly interrupt
> -		 * on the specified hardware vector/IRQ.
> -		 */
> -		timer_cfg.direct_mode = 1;
> -		timer_cfg.apic_vector = stimer0_vector;
> -		hv_enable_stimer0_percpu_irq(stimer0_irq);
> -	} else {
> -		/*
> -		 * When it expires, the timer will generate a VMbus message,
> -		 * to be handled by the normal VMbus interrupt handler.
> -		 */
> -		timer_cfg.direct_mode = 0;
> -		timer_cfg.sintx = VMBUS_MESSAGE_SINT;
> -	}
> -	hv_init_timer_config(0, timer_cfg.as_uint64);
> -	return 0;
> -}
> -
> -static void hv_init_clockevent_device(struct clock_event_device *dev, int cpu)
> -{
> -	dev->name = "Hyper-V clockevent";
> -	dev->features = CLOCK_EVT_FEAT_ONESHOT;
> -	dev->cpumask = cpumask_of(cpu);
> -	dev->rating = 1000;
> -	/*
> -	 * Avoid settint dev->owner = THIS_MODULE deliberately as doing so will
> -	 * result in clockevents_config_and_register() taking additional
> -	 * references to the hv_vmbus module making it impossible to unload.
> -	 */
> -
> -	dev->set_state_shutdown = hv_ce_shutdown;
> -	dev->set_state_oneshot = hv_ce_set_oneshot;
> -	dev->set_next_event = hv_ce_set_next_event;
> -}
> -
> -
>  int hv_synic_alloc(void)
>  {
>  	int cpu;
> @@ -212,14 +112,6 @@ int hv_synic_alloc(void)
>  		tasklet_init(&hv_cpu->msg_dpc,
>  			     vmbus_on_msg_dpc, (unsigned long) hv_cpu);
>  
> -		hv_cpu->clk_evt = kzalloc(sizeof(struct clock_event_device),
> -					  GFP_KERNEL);
> -		if (hv_cpu->clk_evt == NULL) {
> -			pr_err("Unable to allocate clock event device\n");
> -			goto err;
> -		}
> -		hv_init_clockevent_device(hv_cpu->clk_evt, cpu);
> -
>  		hv_cpu->synic_message_page =
>  			(void *)get_zeroed_page(GFP_ATOMIC);
>  		if (hv_cpu->synic_message_page == NULL) {
> @@ -242,11 +134,6 @@ int hv_synic_alloc(void)
>  		INIT_LIST_HEAD(&hv_cpu->chan_list);
>  	}
>  
> -	if (direct_mode_enabled &&
> -	    hv_setup_stimer0_irq(&stimer0_irq, &stimer0_vector,
> -				hv_stimer0_isr))
> -		goto err;
> -
>  	return 0;
>  err:
>  	/*
> @@ -265,7 +152,6 @@ void hv_synic_free(void)
>  		struct hv_per_cpu_context *hv_cpu
>  			= per_cpu_ptr(hv_context.cpu_context, cpu);
>  
> -		kfree(hv_cpu->clk_evt);
>  		free_page((unsigned long)hv_cpu->synic_event_page);
>  		free_page((unsigned long)hv_cpu->synic_message_page);
>  		free_page((unsigned long)hv_cpu->post_msg_page);
> @@ -324,36 +210,9 @@ int hv_synic_init(unsigned int cpu)
>  
>  	hv_set_synic_state(sctrl.as_uint64);
>  
> -	/*
> -	 * Register the per-cpu clockevent source.
> -	 */
> -	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE)
> -		clockevents_config_and_register(hv_cpu->clk_evt,
> -						HV_TIMER_FREQUENCY,
> -						HV_MIN_DELTA_TICKS,
> -						HV_MAX_MAX_DELTA_TICKS);
> -	return 0;
> -}
> -
> -/*
> - * hv_synic_clockevents_cleanup - Cleanup clockevent devices
> - */
> -void hv_synic_clockevents_cleanup(void)
> -{
> -	int cpu;
> +	hv_stimer_init(cpu);
>  
> -	if (!(ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE))
> -		return;
> -
> -	if (direct_mode_enabled)
> -		hv_remove_stimer0_irq(stimer0_irq);
> -
> -	for_each_present_cpu(cpu) {
> -		struct hv_per_cpu_context *hv_cpu
> -			= per_cpu_ptr(hv_context.cpu_context, cpu);
> -
> -		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
> -	}
> +	return 0;
>  }
>  
>  /*
> @@ -401,14 +260,7 @@ int hv_synic_cleanup(unsigned int cpu)
>  	if (channel_found && vmbus_connection.conn_state == CONNECTED)
>  		return -EBUSY;
>  
> -	/* Turn off clockevent device */
> -	if (ms_hyperv.features & HV_MSR_SYNTIMER_AVAILABLE) {
> -		struct hv_per_cpu_context *hv_cpu
> -			= this_cpu_ptr(hv_context.cpu_context);
> -
> -		clockevents_unbind_device(hv_cpu->clk_evt, cpu);
> -		hv_ce_shutdown(hv_cpu->clk_evt);
> -	}
> +	hv_stimer_cleanup(cpu);
>  
>  	hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
>  
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index e5467b8..a8f8aee 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -151,7 +151,6 @@ struct hv_per_cpu_context {
>  	 * per-cpu list of the channels based on their CPU affinity.
>  	 */
>  	struct list_head chan_list;
> -	struct clock_event_device *clk_evt;
>  };
>  
>  struct hv_context {
> @@ -189,8 +188,6 @@ extern int hv_post_message(union hv_connection_id connection_id,
>  
>  extern int hv_synic_cleanup(unsigned int cpu);
>  
> -extern void hv_synic_clockevents_cleanup(void);
> -
>  /* Interface */
>  
>  void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 1cb9408..89aca5d 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -43,6 +43,7 @@
>  #include <linux/kdebug.h>
>  #include <linux/efi.h>
>  #include <linux/random.h>
> +#include <clocksource/hyperv_timer.h>
>  #include "hyperv_vmbus.h"
>  
>  struct vmbus_dynid {
> @@ -968,17 +969,6 @@ static void vmbus_onmessage_work(struct work_struct *work)
>  	kfree(ctx);
>  }
>  
> -static void hv_process_timer_expiration(struct hv_message *msg,
> -					struct hv_per_cpu_context *hv_cpu)
> -{
> -	struct clock_event_device *dev = hv_cpu->clk_evt;
> -
> -	if (dev->event_handler)
> -		dev->event_handler(dev);
> -
> -	vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
> -}
> -
>  void vmbus_on_msg_dpc(unsigned long data)
>  {
>  	struct hv_per_cpu_context *hv_cpu = (void *)data;
> @@ -1172,9 +1162,10 @@ static void vmbus_isr(void)
>  
>  	/* Check if there are actual msgs to be processed */
>  	if (msg->header.message_type != HVMSG_NONE) {
> -		if (msg->header.message_type == HVMSG_TIMER_EXPIRED)
> -			hv_process_timer_expiration(msg, hv_cpu);
> -		else
> +		if (msg->header.message_type == HVMSG_TIMER_EXPIRED) {
> +			hv_stimer0_isr();
> +			vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
> +		} else
>  			tasklet_schedule(&hv_cpu->msg_dpc);
>  	}
>  
> @@ -1276,14 +1267,19 @@ static int vmbus_bus_init(void)
>  	ret = hv_synic_alloc();
>  	if (ret)
>  		goto err_alloc;
> +
> +	ret = hv_stimer_alloc(VMBUS_MESSAGE_SINT);
> +	if (ret < 0)
> +		goto err_alloc;
> +
>  	/*
> -	 * Initialize the per-cpu interrupt state and
> -	 * connect to the host.
> +	 * Initialize the per-cpu interrupt state and stimer state.
> +	 * Then connect to the host.
>  	 */
>  	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vmbus:online",
>  				hv_synic_init, hv_synic_cleanup);
>  	if (ret < 0)
> -		goto err_alloc;
> +		goto err_cpuhp;
>  	hyperv_cpuhp_online = ret;
>  
>  	ret = vmbus_connect();
> @@ -1331,6 +1327,8 @@ static int vmbus_bus_init(void)
>  
>  err_connect:
>  	cpuhp_remove_state(hyperv_cpuhp_online);
> +err_cpuhp:
> +	hv_stimer_free();
>  err_alloc:
>  	hv_synic_free();
>  	hv_remove_vmbus_irq();
> @@ -2077,7 +2075,7 @@ static int vmbus_acpi_add(struct acpi_device *device)
>  
>  static void hv_kexec_handler(void)
>  {
> -	hv_synic_clockevents_cleanup();
> +	hv_stimer_global_cleanup();
>  	vmbus_initiate_unload(false);
>  	vmbus_connection.conn_state = DISCONNECTED;
>  	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
> @@ -2088,6 +2086,8 @@ static void hv_kexec_handler(void)
>  
>  static void hv_crash_handler(struct pt_regs *regs)
>  {
> +	int cpu;
> +
>  	vmbus_initiate_unload(true);
>  	/*
>  	 * In crash handler we can't schedule synic cleanup for all CPUs,
> @@ -2095,7 +2095,9 @@ static void hv_crash_handler(struct pt_regs *regs)
>  	 * for kdump.
>  	 */
>  	vmbus_connection.conn_state = DISCONNECTED;
> -	hv_synic_cleanup(smp_processor_id());
> +	cpu = smp_processor_id();
> +	hv_stimer_cleanup(cpu);
> +	hv_synic_cleanup(cpu);
>  	hyperv_cleanup();
>  };
>  
> @@ -2144,7 +2146,7 @@ static void __exit vmbus_exit(void)
>  	hv_remove_kexec_handler();
>  	hv_remove_crash_handler();
>  	vmbus_connection.conn_state = DISCONNECTED;
> -	hv_synic_clockevents_cleanup();
> +	hv_stimer_global_cleanup();
>  	vmbus_disconnect();
>  	hv_remove_vmbus_irq();
>  	for_each_online_cpu(cpu) {
> diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
> new file mode 100644
> index 0000000..ba5dc17
> --- /dev/null
> +++ b/include/clocksource/hyperv_timer.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Definitions for the clocksource provided by the Hyper-V
> + * hypervisor to guest VMs, as described in the Hyper-V Top
> + * Level Functional Spec (TLFS).
> + *
> + * Copyright (C) 2019, Microsoft, Inc.
> + *
> + * Author:  Michael Kelley <mikelley@microsoft.com>
> + */
> +
> +#ifndef __CLKSOURCE_HYPERV_TIMER_H
> +#define __CLKSOURCE_HYPERV_TIMER_H
> +
> +#define HV_MAX_MAX_DELTA_TICKS 0xffffffff
> +#define HV_MIN_DELTA_TICKS 1
> +
> +/* Routines called by the VMbus driver */
> +extern int hv_stimer_alloc(int sint);
> +extern void hv_stimer_free(void);
> +extern int hv_stimer_init(unsigned int cpu);
> +extern int hv_stimer_cleanup(unsigned int cpu);
> +extern void hv_stimer_global_cleanup(void);
> +extern void hv_stimer0_isr(void);
> +
> +#endif

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly

^ permalink raw reply

* [PATCH v2 1/1] Drivers: hv: vmbus: Break out ISA independent parts of mshyperv.h
From: Michael Kelley @ 2019-05-30  0:14 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, vkuznets, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
  Cc: Michael Kelley

Break out parts of mshyperv.h that are ISA independent into a
separate file in include/asm-generic. This move facilitates
ARM64 code reusing these definitions and avoids code
duplication. No functionality or behavior is changed.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
Changes in v2:
* Removed unneeded #includes in asm-generic/mshyperv.h;
added two #includes that are needed. [Vitaly Kuznetsov]

---

 MAINTAINERS                     |   1 +
 arch/x86/include/asm/mshyperv.h | 147 +-------------------------------
 include/asm-generic/mshyperv.h  | 180 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 185 insertions(+), 143 deletions(-)
 create mode 100644 include/asm-generic/mshyperv.h

diff --git a/MAINTAINERS b/MAINTAINERS
index cf2a5b7..521192d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7308,6 +7308,7 @@ F:	net/vmw_vsock/hyperv_transport.c
 F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
 F:	include/uapi/linux/hyperv.h
+F:	include/asm-generic/mshyperv.h
 F:	tools/hv/
 F:	Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f4fa8a9..2a793bf 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -3,84 +3,15 @@
 #define _ASM_X86_MSHYPER_H
 
 #include <linux/types.h>
-#include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <asm/io.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/nospec-branch.h>
 
-#define VP_INVAL	U32_MAX
-
-struct ms_hyperv_info {
-	u32 features;
-	u32 misc_features;
-	u32 hints;
-	u32 nested_features;
-	u32 max_vp_index;
-	u32 max_lp_index;
-};
-
-extern struct ms_hyperv_info ms_hyperv;
-
-
 typedef int (*hyperv_fill_flush_list_func)(
 		struct hv_guest_mapping_flush_list *flush,
 		void *data);
 
-/*
- * Generate the guest ID.
- */
-
-static inline  __u64 generate_guest_id(__u64 d_info1, __u64 kernel_version,
-				       __u64 d_info2)
-{
-	__u64 guest_id = 0;
-
-	guest_id = (((__u64)HV_LINUX_VENDOR_ID) << 48);
-	guest_id |= (d_info1 << 48);
-	guest_id |= (kernel_version << 16);
-	guest_id |= d_info2;
-
-	return guest_id;
-}
-
-
-/* Free the message slot and signal end-of-message if required */
-static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
-{
-	/*
-	 * On crash we're reading some other CPU's message page and we need
-	 * to be careful: this other CPU may already had cleared the header
-	 * and the host may already had delivered some other message there.
-	 * In case we blindly write msg->header.message_type we're going
-	 * to lose it. We can still lose a message of the same type but
-	 * we count on the fact that there can only be one
-	 * CHANNELMSG_UNLOAD_RESPONSE and we don't care about other messages
-	 * on crash.
-	 */
-	if (cmpxchg(&msg->header.message_type, old_msg_type,
-		    HVMSG_NONE) != old_msg_type)
-		return;
-
-	/*
-	 * Make sure the write to MessageType (ie set to
-	 * HVMSG_NONE) happens before we read the
-	 * MessagePending and EOMing. Otherwise, the EOMing
-	 * will not deliver any more messages since there is
-	 * no empty slot
-	 */
-	mb();
-
-	if (msg->header.message_flags.msg_pending) {
-		/*
-		 * This will cause message queue rescan to
-		 * possibly deliver another msg from the
-		 * hypervisor
-		 */
-		wrmsrl(HV_X64_MSR_EOM, 0);
-	}
-}
-
 #define hv_init_timer(timer, tick) \
 	wrmsrl(HV_X64_MSR_STIMER0_COUNT + (2*timer), tick)
 #define hv_init_timer_config(timer, val) \
@@ -97,6 +28,8 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
 
 #define hv_get_vp_index(index) rdmsrl(HV_X64_MSR_VP_INDEX, index)
 
+#define hv_signal_eom() wrmsrl(HV_X64_MSR_EOM, 0)
+
 #define hv_get_synint_state(int_num, val) \
 	rdmsrl(HV_X64_MSR_SINT0 + int_num, val)
 #define hv_set_synint_state(int_num, val) \
@@ -122,13 +55,6 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
 #define trace_hyperv_callback_vector hyperv_callback_vector
 #endif
 void hyperv_vector_handler(struct pt_regs *regs);
-void hv_setup_vmbus_irq(void (*handler)(void));
-void hv_remove_vmbus_irq(void);
-
-void hv_setup_kexec_handler(void (*handler)(void));
-void hv_remove_kexec_handler(void);
-void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
-void hv_remove_crash_handler(void);
 
 /*
  * Routines for stimer0 Direct Mode handling.
@@ -136,8 +62,6 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
  */
 void hv_stimer0_vector_handler(struct pt_regs *regs);
 void hv_stimer0_callback_vector(void);
-int hv_setup_stimer0_irq(int *irq, int *vector, void (*handler)(void));
-void hv_remove_stimer0_irq(int irq);
 
 static inline void hv_enable_stimer0_percpu_irq(int irq) {}
 static inline void hv_disable_stimer0_percpu_irq(int irq) {}
@@ -282,14 +206,6 @@ static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size,
 	return status;
 }
 
-/*
- * Hypervisor's notion of virtual processor ID is different from
- * Linux' notion of CPU ID. This information can only be retrieved
- * in the context of the calling CPU. Setup a map for easy access
- * to this information.
- */
-extern u32 *hv_vp_index;
-extern u32 hv_max_vp_index;
 extern struct hv_vp_assist_page **hv_vp_assist_page;
 
 static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
@@ -300,63 +216,8 @@ static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
 	return hv_vp_assist_page[cpu];
 }
 
-/**
- * hv_cpu_number_to_vp_number() - Map CPU to VP.
- * @cpu_number: CPU number in Linux terms
- *
- * This function returns the mapping between the Linux processor
- * number and the hypervisor's virtual processor number, useful
- * in making hypercalls and such that talk about specific
- * processors.
- *
- * Return: Virtual processor number in Hyper-V terms
- */
-static inline int hv_cpu_number_to_vp_number(int cpu_number)
-{
-	return hv_vp_index[cpu_number];
-}
-
-static inline int cpumask_to_vpset(struct hv_vpset *vpset,
-				    const struct cpumask *cpus)
-{
-	int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
-
-	/* valid_bank_mask can represent up to 64 banks */
-	if (hv_max_vp_index / 64 >= 64)
-		return 0;
-
-	/*
-	 * Clear all banks up to the maximum possible bank as hv_tlb_flush_ex
-	 * structs are not cleared between calls, we risk flushing unneeded
-	 * vCPUs otherwise.
-	 */
-	for (vcpu_bank = 0; vcpu_bank <= hv_max_vp_index / 64; vcpu_bank++)
-		vpset->bank_contents[vcpu_bank] = 0;
-
-	/*
-	 * Some banks may end up being empty but this is acceptable.
-	 */
-	for_each_cpu(cpu, cpus) {
-		vcpu = hv_cpu_number_to_vp_number(cpu);
-		if (vcpu == VP_INVAL)
-			return -1;
-		vcpu_bank = vcpu / 64;
-		vcpu_offset = vcpu % 64;
-		__set_bit(vcpu_offset, (unsigned long *)
-			  &vpset->bank_contents[vcpu_bank]);
-		if (vcpu_bank >= nr_bank)
-			nr_bank = vcpu_bank + 1;
-	}
-	vpset->valid_bank_mask = GENMASK_ULL(nr_bank - 1, 0);
-	return nr_bank;
-}
-
 void __init hyperv_init(void);
 void hyperv_setup_mmu_ops(void);
-void hyperv_report_panic(struct pt_regs *regs, long err);
-void hyperv_report_panic_msg(phys_addr_t pa, size_t size);
-bool hv_is_hyperv_initialized(void);
-void hyperv_cleanup(void);
 
 void hyperv_reenlightenment_intr(struct pt_regs *regs);
 void set_hv_tscchange_cb(void (*cb)(void));
@@ -379,8 +240,6 @@ static inline void hv_apic_init(void) {}
 
 #else /* CONFIG_HYPERV */
 static inline void hyperv_init(void) {}
-static inline bool hv_is_hyperv_initialized(void) { return false; }
-static inline void hyperv_cleanup(void) {}
 static inline void hyperv_setup_mmu_ops(void) {}
 static inline void set_hv_tscchange_cb(void (*cb)(void)) {}
 static inline void clear_hv_tscchange_cb(void) {}
@@ -397,4 +256,6 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
 }
 #endif /* CONFIG_HYPERV */
 
+#include <asm-generic/mshyperv.h>
+
 #endif
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
new file mode 100644
index 0000000..0becb7d
--- /dev/null
+++ b/include/asm-generic/mshyperv.h
@@ -0,0 +1,180 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Linux-specific definitions for managing interactions with Microsoft's
+ * Hyper-V hypervisor. The definitions in this file are architecture
+ * independent. See arch/<arch>/include/asm/mshyperv.h for definitions
+ * that are specific to architecture <arch>.
+ *
+ * Definitions that are specified in the Hyper-V Top Level Functional
+ * Spec (TLFS) should not go in this file, but should instead go in
+ * hyperv-tlfs.h.
+ *
+ * Copyright (C) 2019, Microsoft, Inc.
+ *
+ * Author : Michael Kelley <mikelley@microsoft.com>
+ */
+
+#ifndef _ASM_GENERIC_MSHYPERV_H
+#define _ASM_GENERIC_MSHYPERV_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <asm/ptrace.h>
+#include <asm/hyperv-tlfs.h>
+
+struct ms_hyperv_info {
+	u32 features;
+	u32 misc_features;
+	u32 hints;
+	u32 nested_features;
+	u32 max_vp_index;
+	u32 max_lp_index;
+};
+extern struct ms_hyperv_info ms_hyperv;
+
+extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
+extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
+
+
+/* Generate the guest OS identifier as described in the Hyper-V TLFS */
+static inline  __u64 generate_guest_id(__u64 d_info1, __u64 kernel_version,
+				       __u64 d_info2)
+{
+	__u64 guest_id = 0;
+
+	guest_id = (((__u64)HV_LINUX_VENDOR_ID) << 48);
+	guest_id |= (d_info1 << 48);
+	guest_id |= (kernel_version << 16);
+	guest_id |= d_info2;
+
+	return guest_id;
+}
+
+
+/* Free the message slot and signal end-of-message if required */
+static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
+{
+	/*
+	 * On crash we're reading some other CPU's message page and we need
+	 * to be careful: this other CPU may already had cleared the header
+	 * and the host may already had delivered some other message there.
+	 * In case we blindly write msg->header.message_type we're going
+	 * to lose it. We can still lose a message of the same type but
+	 * we count on the fact that there can only be one
+	 * CHANNELMSG_UNLOAD_RESPONSE and we don't care about other messages
+	 * on crash.
+	 */
+	if (cmpxchg(&msg->header.message_type, old_msg_type,
+		    HVMSG_NONE) != old_msg_type)
+		return;
+
+	/*
+	 * The cmxchg() above does an implicit memory barrier to
+	 * ensure the write to MessageType (ie set to
+	 * HVMSG_NONE) happens before we read the
+	 * MessagePending and EOMing. Otherwise, the EOMing
+	 * will not deliver any more messages since there is
+	 * no empty slot
+	 */
+	if (msg->header.message_flags.msg_pending) {
+		/*
+		 * This will cause message queue rescan to
+		 * possibly deliver another msg from the
+		 * hypervisor
+		 */
+		hv_signal_eom();
+	}
+}
+
+void hv_setup_vmbus_irq(void (*handler)(void));
+void hv_remove_vmbus_irq(void);
+void hv_enable_vmbus_irq(void);
+void hv_disable_vmbus_irq(void);
+
+void hv_setup_kexec_handler(void (*handler)(void));
+void hv_remove_kexec_handler(void);
+void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
+void hv_remove_crash_handler(void);
+
+#if IS_ENABLED(CONFIG_HYPERV)
+/*
+ * Hypervisor's notion of virtual processor ID is different from
+ * Linux' notion of CPU ID. This information can only be retrieved
+ * in the context of the calling CPU. Setup a map for easy access
+ * to this information.
+ */
+extern u32 *hv_vp_index;
+extern u32 hv_max_vp_index;
+
+/* Sentinel value for an uninitialized entry in hv_vp_index array */
+#define VP_INVAL	U32_MAX
+
+/**
+ * hv_cpu_number_to_vp_number() - Map CPU to VP.
+ * @cpu_number: CPU number in Linux terms
+ *
+ * This function returns the mapping between the Linux processor
+ * number and the hypervisor's virtual processor number, useful
+ * in making hypercalls and such that talk about specific
+ * processors.
+ *
+ * Return: Virtual processor number in Hyper-V terms
+ */
+static inline int hv_cpu_number_to_vp_number(int cpu_number)
+{
+	return hv_vp_index[cpu_number];
+}
+
+static inline int cpumask_to_vpset(struct hv_vpset *vpset,
+				    const struct cpumask *cpus)
+{
+	int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
+
+	/* valid_bank_mask can represent up to 64 banks */
+	if (hv_max_vp_index / 64 >= 64)
+		return 0;
+
+	/*
+	 * Clear all banks up to the maximum possible bank as hv_tlb_flush_ex
+	 * structs are not cleared between calls, we risk flushing unneeded
+	 * vCPUs otherwise.
+	 */
+	for (vcpu_bank = 0; vcpu_bank <= hv_max_vp_index / 64; vcpu_bank++)
+		vpset->bank_contents[vcpu_bank] = 0;
+
+	/*
+	 * Some banks may end up being empty but this is acceptable.
+	 */
+	for_each_cpu(cpu, cpus) {
+		vcpu = hv_cpu_number_to_vp_number(cpu);
+		if (vcpu == VP_INVAL)
+			return -1;
+		vcpu_bank = vcpu / 64;
+		vcpu_offset = vcpu % 64;
+		__set_bit(vcpu_offset, (unsigned long *)
+			  &vpset->bank_contents[vcpu_bank]);
+		if (vcpu_bank >= nr_bank)
+			nr_bank = vcpu_bank + 1;
+	}
+	vpset->valid_bank_mask = GENMASK_ULL(nr_bank - 1, 0);
+	return nr_bank;
+}
+
+void hyperv_report_panic(struct pt_regs *regs, long err);
+void hyperv_report_panic_msg(phys_addr_t pa, size_t size);
+bool hv_is_hyperv_initialized(void);
+void hyperv_cleanup(void);
+#else /* CONFIG_HYPERV */
+static inline bool hv_is_hyperv_initialized(void) { return false; }
+static inline void hyperv_cleanup(void) {}
+#endif /* CONFIG_HYPERV */
+
+#if IS_ENABLED(CONFIG_HYPERV)
+extern int hv_setup_stimer0_irq(int *irq, int *vector, void (*handler)(void));
+extern void hv_remove_stimer0_irq(int irq);
+#endif
+
+#endif
-- 
1.8.3.1


^ permalink raw reply related

* RE: [PATCH 1/1] Drivers: hv: vmbus: Break out ISA independent parts of mshyperv.h
From: Michael Kelley @ 2019-05-30  0:05 UTC (permalink / raw)
  To: vkuznets
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com,
	marcelo.cerri@canonical.com, Sunil Muthuswamy, KY Srinivasan
In-Reply-To: <875zptmfp4.fsf@vitty.brq.redhat.com>

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Wednesday, May 29, 2019 8:41 AM
> >  void __init hyperv_init(void);
> 
> I would actually expect to see hyperv_init() on all architectures so it
> can probably go to 'generic' too.
> 

Actually not.  The declaration is not needed on all architectures.   On ARM64,
hyperv_init() and ms_hyperv_init_platform() are folded into one routine, and
it's not externally referenced.  So the declaration isn't needed.  The hypervisor
initialization approach is fairly different on ARM64.

[....]
> 
> This seems a little bit too much, in particular, I think we don't need
> 
>  #include <linux/interrupt.h>
>  #include <linux/clocksource.h>
>  #include <linux/irq.h>
>  #include <linux/irqdesc.h>
> 
> we'll need 'struct pt_regs' definition but I'd suggest we use
> 
>  #include <linux/ptrace.h>
> 
> or even
> 
>  #include <asm/ptrace.h>
> 

Agreed.  I'll spin a v2 with this cleaned up.

Michael

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox