Linux virtualization list
 help / color / mirror / Atom feed
* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Anshuman Khandual @ 2018-04-05 14:39 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, Michael S. Tsirkin, joe, linuxppc-dev, elfring, david
In-Reply-To: <20180405105631.9514-1-khandual@linux.vnet.ibm.com>

On 04/05/2018 04:26 PM, Anshuman Khandual wrote:
> There are certian platforms which would like to use SWIOTLB based DMA API
> for bouncing purpose without actually requiring an IOMMU back end. But the
> virtio core does not allow such mechanism. Right now DMA MAP API is only
> selected for devices which have an IOMMU and then the QEMU/host back end
> will process all incoming SG buffer addresses as IOVA instead of simple
> GPA which is the case for simple bounce buffers after being processed with
> SWIOTLB API. To enable this usage, it introduces an architecture specific
> function which will just make virtio core front end select DMA operations
> structure.
> 
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>

+ "Michael S. Tsirkin" <mst@redhat.com>

^ permalink raw reply

* Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk
From: Jens Axboe @ 2018-04-05 14:29 UTC (permalink / raw)
  To: cohuck, mst, jasowang, linux-block, virtualization
In-Reply-To: <cover.1522922101.git.zhangweiping@didichuxing.com>

On 4/5/18 4:09 AM, Weiping Zhang wrote:
> Hi,
> 
> For virtio block device, actually there is no a hard limit for max request
> size, and virtio_blk driver set -1 to blk_queue_max_hw_sectors(q, -1U);.
> But it doesn't work, because there is a default upper limitation
> BLK_DEF_MAX_SECTORS (1280 sectors). So this series want to add a new helper
> blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.
> 
> Weiping Zhang (2):
>   blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
>   virtio_blk: add new module parameter to set max request size
> 
>  block/blk-settings.c       | 20 ++++++++++++++++++++
>  drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
>  include/linux/blkdev.h     |  2 ++
>  3 files changed, 52 insertions(+), 2 deletions(-)

The driver should just use blk_queue_max_hw_sectors() to set the limit,
and then the soft limit can be modified by a udev rule. Technically the
driver doesn't own the software limit, it's imposed to ensure that we
don't introduce too much latency per request.

Your situation is no different from many other setups, where the
hw limit is much higher than the default 1280k.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-04-05 14:03 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7394A7F3B@shsmsx102.ccr.corp.intel.com>

On Thu, Apr 05, 2018 at 02:05:03AM +0000, Wang, Wei W wrote:
> On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > I'm afraid the driver couldn't be aware if the added hints are stale
> > > or not,
> > 
> > 
> > No - I mean that driver has code that compares two values and stops
> > reporting. Can one of the values be stale?
> 
> The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide if it needs to stop reporting hints, and cmd_id_received is what the driver reads from host (host notifies the driver to read for the latest value). If host sends a new cmd id, it will notify the guest to read again. I'm not sure how that could be a stale cmd id (or maybe I misunderstood your point here?)
> 
> Best,
> Wei

The comparison is done in one thread, the update in another one.

-- 
MST

^ permalink raw reply

* Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk
From: Martin K. Petersen @ 2018-04-05 12:24 UTC (permalink / raw)
  To: Weiping Zhang; +Cc: axboe, mst, cohuck, virtualization, linux-block
In-Reply-To: <cover.1522922101.git.zhangweiping@didichuxing.com>


Weiping,

> For virtio block device, actually there is no a hard limit for max
> request size, and virtio_blk driver set -1 to
> blk_queue_max_hw_sectors(q, -1U);.  But it doesn't work, because there
> is a default upper limitation BLK_DEF_MAX_SECTORS (1280 sectors).

That's intentional (although it's an ongoing debate what the actual
value should be).

> So this series want to add a new helper
> blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.

BLK_DEF_MAX_SECTORS is a kernel default empirically chosen to strike a
decent balance between I/O latency and bandwidth. It sets an upper bound
for filesystem requests only. Regardless of the capabilities of the
block device driver and underlying hardware.

You can override the limit on a per-device basis via max_sectors_kb in
sysfs. People generally do it via a udev rule.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Anshuman Khandual @ 2018-04-05 11:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: robh, linux-kernel@vger.kernel.org, virtualization, Joe Perches,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Markus Elfring,
	David Gibson
In-Reply-To: <CAKTCnz=3b56bGKrqCcZOwBsu0yngS_Rw-9TieNLyHgix4eZrAQ@mail.gmail.com>

On 04/05/2018 04:44 PM, Balbir Singh wrote:
> On Thu, Apr 5, 2018 at 8:56 PM, Anshuman Khandual
> <khandual@linux.vnet.ibm.com> wrote:
>> There are certian platforms which would like to use SWIOTLB based DMA API
>> for bouncing purpose without actually requiring an IOMMU back end. But the
>> virtio core does not allow such mechanism. Right now DMA MAP API is only
>> selected for devices which have an IOMMU and then the QEMU/host back end
>> will process all incoming SG buffer addresses as IOVA instead of simple
>> GPA which is the case for simple bounce buffers after being processed with
>> SWIOTLB API. To enable this usage, it introduces an architecture specific
>> function which will just make virtio core front end select DMA operations
>> structure.
>>
>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>> ---
>> This RFC is just to get some feedback. Please ignore the function call
>> back into the architecture. It can be worked out properly later on. But
>> the question is can we have virtio devices in the guest which would like
>> to use SWIOTLB based (or any custom DMA API based) bounce buffering with
>> out actually being an IOMMU devices emulated by QEMU/host as been with
>> the current VIRTIO_F_IOMMU_PLATFORM virtio flag ?
>>
>>  arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
>>  drivers/virtio/virtio_ring.c           | 4 ++++
>>  include/linux/virtio.h                 | 2 ++
>>  3 files changed, 12 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 06f02960b439..dd15fbddbe89 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
>>  __setup("multitce=", disable_multitce);
>>
>>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
>> +
>> +bool is_virtio_dma_platform(void)
>> +{
>> +       return true;
>> +}
>> +EXPORT_SYMBOL(is_virtio_dma_platform);
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 71458f493cf8..9f205a79d378 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -144,6 +144,10 @@ struct vring_virtqueue {
>>
>>  static bool vring_use_dma_api(struct virtio_device *vdev)
>>  {
>> +       /* Use DMA API even for virtio devices without an IOMMU */
>> +       if (is_virtio_dma_platform())
>> +               return true;
>> +
>>         if (!virtio_has_iommu_quirk(vdev))
>>                 return true;
>>
>> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
>> index 988c7355bc22..d8bb83d753ea 100644
>> --- a/include/linux/virtio.h
>> +++ b/include/linux/virtio.h
>> @@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct device_driver *drv)
>>  int register_virtio_driver(struct virtio_driver *drv);
>>  void unregister_virtio_driver(struct virtio_driver *drv);
>>
>> +extern bool is_virtio_dma_platform(void);
>> +
> 
> Where is the default implementation for non-pseries platforms? Will they compile
> after these changes?

No they wont. This is just a RFC asking for suggestion/feedback on a
particular direction, will clean up the code later on once we agree
on this.

^ permalink raw reply

* Re: [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Balbir Singh @ 2018-04-05 11:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: robh, Michael Ellerman, linux-kernel@vger.kernel.org,
	virtualization, Benjamin Herrenschmidt, Joe Perches,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Markus Elfring,
	David Gibson
In-Reply-To: <20180405105631.9514-1-khandual@linux.vnet.ibm.com>

On Thu, Apr 5, 2018 at 8:56 PM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> There are certian platforms which would like to use SWIOTLB based DMA API
> for bouncing purpose without actually requiring an IOMMU back end. But the
> virtio core does not allow such mechanism. Right now DMA MAP API is only
> selected for devices which have an IOMMU and then the QEMU/host back end
> will process all incoming SG buffer addresses as IOVA instead of simple
> GPA which is the case for simple bounce buffers after being processed with
> SWIOTLB API. To enable this usage, it introduces an architecture specific
> function which will just make virtio core front end select DMA operations
> structure.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
> This RFC is just to get some feedback. Please ignore the function call
> back into the architecture. It can be worked out properly later on. But
> the question is can we have virtio devices in the guest which would like
> to use SWIOTLB based (or any custom DMA API based) bounce buffering with
> out actually being an IOMMU devices emulated by QEMU/host as been with
> the current VIRTIO_F_IOMMU_PLATFORM virtio flag ?
>
>  arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
>  drivers/virtio/virtio_ring.c           | 4 ++++
>  include/linux/virtio.h                 | 2 ++
>  3 files changed, 12 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 06f02960b439..dd15fbddbe89 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
>  __setup("multitce=", disable_multitce);
>
>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> +
> +bool is_virtio_dma_platform(void)
> +{
> +       return true;
> +}
> +EXPORT_SYMBOL(is_virtio_dma_platform);
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 71458f493cf8..9f205a79d378 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -144,6 +144,10 @@ struct vring_virtqueue {
>
>  static bool vring_use_dma_api(struct virtio_device *vdev)
>  {
> +       /* Use DMA API even for virtio devices without an IOMMU */
> +       if (is_virtio_dma_platform())
> +               return true;
> +
>         if (!virtio_has_iommu_quirk(vdev))
>                 return true;
>
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 988c7355bc22..d8bb83d753ea 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct device_driver *drv)
>  int register_virtio_driver(struct virtio_driver *drv);
>  void unregister_virtio_driver(struct virtio_driver *drv);
>
> +extern bool is_virtio_dma_platform(void);
> +

Where is the default implementation for non-pseries platforms? Will they compile
after these changes?

Balbir

^ permalink raw reply

* [RFC] virtio: Use DMA MAP API for devices without an IOMMU
From: Anshuman Khandual @ 2018-04-05 10:56 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: robh, benh, mpe, joe, linuxppc-dev, elfring, david

There are certian platforms which would like to use SWIOTLB based DMA API
for bouncing purpose without actually requiring an IOMMU back end. But the
virtio core does not allow such mechanism. Right now DMA MAP API is only
selected for devices which have an IOMMU and then the QEMU/host back end
will process all incoming SG buffer addresses as IOVA instead of simple
GPA which is the case for simple bounce buffers after being processed with
SWIOTLB API. To enable this usage, it introduces an architecture specific
function which will just make virtio core front end select DMA operations
structure.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
This RFC is just to get some feedback. Please ignore the function call
back into the architecture. It can be worked out properly later on. But
the question is can we have virtio devices in the guest which would like
to use SWIOTLB based (or any custom DMA API based) bounce buffering with
out actually being an IOMMU devices emulated by QEMU/host as been with
the current VIRTIO_F_IOMMU_PLATFORM virtio flag ?

 arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
 drivers/virtio/virtio_ring.c           | 4 ++++
 include/linux/virtio.h                 | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f02960b439..dd15fbddbe89 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1396,3 +1396,9 @@ static int __init disable_multitce(char *str)
 __setup("multitce=", disable_multitce);
 
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+bool is_virtio_dma_platform(void)
+{
+	return true;
+}
+EXPORT_SYMBOL(is_virtio_dma_platform);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..9f205a79d378 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -144,6 +144,10 @@ struct vring_virtqueue {
 
 static bool vring_use_dma_api(struct virtio_device *vdev)
 {
+	/* Use DMA API even for virtio devices without an IOMMU */
+	if (is_virtio_dma_platform())
+		return true;
+
 	if (!virtio_has_iommu_quirk(vdev))
 		return true;
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 988c7355bc22..d8bb83d753ea 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -200,6 +200,8 @@ static inline struct virtio_driver *drv_to_virtio(struct device_driver *drv)
 int register_virtio_driver(struct virtio_driver *drv);
 void unregister_virtio_driver(struct virtio_driver *drv);
 
+extern bool is_virtio_dma_platform(void);
+
 /* module_virtio_driver() - Helper macro for drivers that don't do
  * anything special in module init/exit.  This eliminates a lot of
  * boilerplate.  Each module may only use this macro once, and
-- 
2.14.1

^ permalink raw reply related

* [RFC PATCH 2/2] virtio_blk: add new module parameter to set max request size
From: Weiping Zhang @ 2018-04-05 10:10 UTC (permalink / raw)
  To: axboe, cohuck, mst, jasowang; +Cc: linux-block, virtualization
In-Reply-To: <cover.1522922101.git.zhangweiping@didichuxing.com>

Actually there is no upper limitation, so add new module parameter
to provide a way to set a proper max request size for virtio block.
Using a larger request size can improve sequence performance in theory,
and reduce the interaction between guest and hypervisor.

Signed-off-by: Weiping Zhang <zhangweiping@didichuxing.com>
---
 drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..5ac6d59 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -64,6 +64,34 @@ struct virtblk_req {
 	struct scatterlist sg[];
 };
 
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp);
+
+static const struct kernel_param_ops max_request_size_ops = {
+	.set = max_request_size_set,
+	.get = param_get_uint,
+};
+
+static unsigned long max_request_size = 4096; /* in unit of KiB */
+module_param_cb(max_request_size, &max_request_size_ops, &max_request_size,
+		0444);
+MODULE_PARM_DESC(max_request_size, "set max request size, in unit of KiB");
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+	unsigned int size_kb, page_kb = 1 << (PAGE_SHIFT - 10);
+
+	ret = kstrtouint(val, 10, &size_kb);
+	if (ret != 0)
+		return -EINVAL;
+
+	if (size_kb < page_kb)
+		return -EINVAL;
+
+	return param_set_uint(val, kp);
+}
+
 static inline blk_status_t virtblk_result(struct virtblk_req *vbr)
 {
 	switch (vbr->status) {
@@ -730,8 +758,8 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* We can handle whatever the host told us to handle. */
 	blk_queue_max_segments(q, vblk->sg_elems-2);
 
-	/* No real sector limit. */
-	blk_queue_max_hw_sectors(q, -1U);
+	/* No real sector limit, use 512b (max_request_size << 10) >> 9 */
+	blk_queue_max_hw_sectors_no_limit(q, max_request_size << 1);
 
 	/* Host can optionally specify maximum segment size and number of
 	 * segments. */
-- 
2.9.4

^ permalink raw reply related

* [RFC PATCH 1/2] blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
From: Weiping Zhang @ 2018-04-05 10:10 UTC (permalink / raw)
  To: axboe, cohuck, mst, jasowang; +Cc: linux-block, virtualization
In-Reply-To: <cover.1522922101.git.zhangweiping@didichuxing.com>

There is a default upper limitation BLK_DEF_MAX_SECTORS, but for
some virtual block device driver there is no such limitation. So
add a new help to set max request size.

Signed-off-by: Weiping Zhang <zhangweiping@didichuxing.com>
---
 block/blk-settings.c   | 20 ++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 48ebe6b..685c30c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -253,6 +253,26 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
 }
 EXPORT_SYMBOL(blk_queue_max_hw_sectors);
 
+/* same as blk_queue_max_hw_sectors but without default upper limitation */
+void blk_queue_max_hw_sectors_no_limit(struct request_queue *q,
+		unsigned int max_hw_sectors)
+{
+	struct queue_limits *limits = &q->limits;
+	unsigned int max_sectors;
+
+	if ((max_hw_sectors << 9) < PAGE_SIZE) {
+		max_hw_sectors = 1 << (PAGE_SHIFT - 9);
+		printk(KERN_INFO "%s: set to minimum %d\n",
+		       __func__, max_hw_sectors);
+	}
+
+	limits->max_hw_sectors = max_hw_sectors;
+	max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
+	limits->max_sectors = max_sectors;
+	q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
+}
+EXPORT_SYMBOL(blk_queue_max_hw_sectors_no_limit);
+
 /**
  * blk_queue_chunk_sectors - set size of the chunk for this queue
  * @q:  the request queue for the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b..2250709 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1243,6 +1243,8 @@ extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
 extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
+extern void blk_queue_max_hw_sectors_no_limit(struct request_queue *,
+						unsigned int);
 extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
-- 
2.9.4

^ permalink raw reply related

* [RFC PATCH 0/2] use larger max_request_size for virtio_blk
From: Weiping Zhang @ 2018-04-05 10:09 UTC (permalink / raw)
  To: axboe, cohuck, mst, jasowang; +Cc: linux-block, virtualization

Hi,

For virtio block device, actually there is no a hard limit for max request
size, and virtio_blk driver set -1 to blk_queue_max_hw_sectors(q, -1U);.
But it doesn't work, because there is a default upper limitation
BLK_DEF_MAX_SECTORS (1280 sectors). So this series want to add a new helper
blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.

Weiping Zhang (2):
  blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
  virtio_blk: add new module parameter to set max request size

 block/blk-settings.c       | 20 ++++++++++++++++++++
 drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
 include/linux/blkdev.h     |  2 ++
 3 files changed, 52 insertions(+), 2 deletions(-)

-- 
2.9.4

^ permalink raw reply

* RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-04-05  2:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20180405040900-mutt-send-email-mst@kernel.org>

On Thursday, April 5, 2018 9:12 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> > On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > I'm afraid the driver couldn't be aware if the added hints are stale
> > or not,
> 
> 
> No - I mean that driver has code that compares two values and stops
> reporting. Can one of the values be stale?

The driver compares "vb->cmd_id_use != vb->cmd_id_received" to decide if it needs to stop reporting hints, and cmd_id_received is what the driver reads from host (host notifies the driver to read for the latest value). If host sends a new cmd id, it will notify the guest to read again. I'm not sure how that could be a stale cmd id (or maybe I misunderstood your point here?)

Best,
Wei

^ permalink raw reply

* [PULL] fwcfg, vhost: features and fixes
From: Michael S. Tsirkin @ 2018-04-05  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, bhe, mst, netdev, somlo, linux-kernel, virtualization,
	sonnyrao, marcandre.lureau, dyoung

The following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae:

  Linux 4.16-rc5 (2018-03-11 17:25:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to dc32bb678e103afbcfa4d814489af0566307f528:

  vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)

----------------------------------------------------------------
fw_cfg, vhost: features fixes

This cleans up the qemu fw cfg device driver.
On top of this, vmcore is dumped there on crash to
help debugging witH kASLR enabled.
Also included are some fixes in vhost.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Marc-André Lureau (10):
      fw_cfg: fix sparse warnings in fw_cfg_sel_endianness()
      fw_cfg: fix sparse warnings with fw_cfg_file
      fw_cfg: fix sparse warning reading FW_CFG_ID
      fw_cfg: fix sparse warnings around FW_CFG_FILE_DIR read
      fw_cfg: remove inline from fw_cfg_read_blob()
      fw_cfg: handle fw_cfg_read_blob() error
      fw_cfg: add a public uapi header
      fw_cfg: add DMA register
      crash: export paddr_vmcoreinfo_note()
      fw_cfg: write vmcoreinfo details

Michael S. Tsirkin (1):
      ptr_ring: fix build

Sonny Rao (2):
      vhost: fix vhost ioctl signature to build with clang
      vhost: add vsock compat ioctl

 MAINTAINERS                      |   1 +
 drivers/firmware/qemu_fw_cfg.c   | 291 +++++++++++++++++++++++++++++++--------
 drivers/vhost/vhost.c            |   2 +-
 drivers/vhost/vhost.h            |   4 +-
 drivers/vhost/vsock.c            |  11 ++
 include/uapi/linux/qemu_fw_cfg.h |  97 +++++++++++++
 kernel/crash_core.c              |   1 +
 tools/virtio/ringtest/ptr_ring.c |   5 +
 8 files changed, 348 insertions(+), 64 deletions(-)
 create mode 100644 include/uapi/linux/qemu_fw_cfg.h
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-04-05  1:12 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <286AC319A985734F985F78AFA26841F7394A6E96@shsmsx102.ccr.corp.intel.com>

On Thu, Apr 05, 2018 at 12:30:27AM +0000, Wang, Wei W wrote:
> On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> > On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > > +static int add_one_sg(struct virtqueue *vq, unsigned long pfn,
> > > > > +uint32_t len) {
> > > > > +	struct scatterlist sg;
> > > > > +	unsigned int unused;
> > > > > +
> > > > > +	sg_init_table(&sg, 1);
> > > > > +	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> > > > > +
> > > > > +	/* Detach all the used buffers from the vq */
> > > > > +	while (virtqueue_get_buf(vq, &unused))
> > > > > +		;
> > > > > +
> > > > > +	/*
> > > > > +	 * Since this is an optimization feature, losing a couple of free
> > > > > +	 * pages to report isn't important. We simply return without adding
> > > > > +	 * the page hint if the vq is full.
> > > > why not stop scanning of following pages though?
> > >
> > > Because continuing to send hints is a way to deliver the maximum
> > > possible hints to host. For example, host may have a delay in taking
> > > hints at some point, and then it resumes to take hints soon. If the
> > > driver does not stop when the vq is full, it will be able to put more
> > > hints to the vq once the vq has available entries to add.
> > 
> > What this appears to be is just lack of coordination between host and guest.
> > 
> > But meanwhile you are spending cycles walking the list uselessly.
> > Instead of trying nilly-willy, the standard thing to do is to wait for host to
> > consume an entry and proceed.
> > 
> > Coding it up might be tricky, so it's probably acceptable as is for now, but
> > please replace the justification about with a TODO entry that we should
> > synchronize with the host.
> 
> Thanks. I plan to add
> 
> TODO: The current implementation could be further improved by stopping the reporting when the vq is full and continuing the reporting when host notifies that there are available entries for the driver to add.

... that entries have been used.

> 
> > 
> > 
> > >
> > > >
> > > > > +	 * We are adding one entry each time, which essentially results in no
> > > > > +	 * memory allocation, so the GFP_KERNEL flag below can be ignored.
> > > > > +	 * Host works by polling the free page vq for hints after sending the
> > > > > +	 * starting cmd id, so the driver doesn't need to kick after filling
> > > > > +	 * the vq.
> > > > > +	 * Lastly, there is always one entry reserved for the cmd id to use.
> > > > > +	 */
> > > > > +	if (vq->num_free > 1)
> > > > > +		return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static int virtio_balloon_send_free_pages(void *opaque, unsigned long
> > pfn,
> > > > > +					   unsigned long nr_pages)
> > > > > +{
> > > > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * If a stop id or a new cmd id was just received from host, stop
> > > > > +	 * the reporting, and return 1 to indicate an active stop.
> > > > > +	 */
> > > > > +	if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb-
> > >cmd_id_received)
> > > > > +		return 1;
> > 
> > functions returning int should return 0 or -errno on failure, positive return
> > code should indicate progress.
> > 
> > If you want a boolean, use bool pls.
> 
> OK. I plan to change 1  to -EBUSY to indicate the case that host actively asks the driver to stop reporting (This makes the callback return value type consistent with walk_free_mem_block). 
> 

something like EINTR might be a better fit.

> 
> > 
> > 
> > > > > +
> > > > this access to cmd_id_use and cmd_id_received without locks bothers
> > > > me. Pls document why it's safe.
> > >
> > > OK. Probably we could add below to the above comments:
> > >
> > > cmd_id_use and cmd_id_received don't need to be accessed under locks
> > > because the reporting does not have to stop immediately before
> > > cmd_id_received is changed (i.e. when host requests to stop). That is,
> > > reporting more hints after host requests to stop isn't an issue for
> > > this optimization feature, because host will simply drop the stale
> > > hints next time when it needs a new reporting.
> > 
> > What about the other direction? Can this observe a stale value and exit
> > erroneously?
> 
> I'm afraid the driver couldn't be aware if the added hints are stale or not,


No - I mean that driver has code that compares two values
and stops reporting. Can one of the values be stale?

> because host and guest actions happen asynchronously. That is, host side iothread stops taking hints as soon as the migration thread asks to stop, it doesn't wait for any ACK from the driver to stop (as we discussed before, host couldn't always assume that the driver is in a responsive state).
> 
> Btw, we also don't need to worry about any memory left in the vq, since only addresses are added to the vq, there is no real memory allocations.
> 
> Best,
> Wei

When we support DMA API we will have to unmap things there.

^ permalink raw reply

* RE: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wang, Wei W @ 2018-04-05  0:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20180404155907-mutt-send-email-mst@kernel.org>

On Wednesday, April 4, 2018 10:08 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> > On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > > +static int add_one_sg(struct virtqueue *vq, unsigned long pfn,
> > > > +uint32_t len) {
> > > > +	struct scatterlist sg;
> > > > +	unsigned int unused;
> > > > +
> > > > +	sg_init_table(&sg, 1);
> > > > +	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> > > > +
> > > > +	/* Detach all the used buffers from the vq */
> > > > +	while (virtqueue_get_buf(vq, &unused))
> > > > +		;
> > > > +
> > > > +	/*
> > > > +	 * Since this is an optimization feature, losing a couple of free
> > > > +	 * pages to report isn't important. We simply return without adding
> > > > +	 * the page hint if the vq is full.
> > > why not stop scanning of following pages though?
> >
> > Because continuing to send hints is a way to deliver the maximum
> > possible hints to host. For example, host may have a delay in taking
> > hints at some point, and then it resumes to take hints soon. If the
> > driver does not stop when the vq is full, it will be able to put more
> > hints to the vq once the vq has available entries to add.
> 
> What this appears to be is just lack of coordination between host and guest.
> 
> But meanwhile you are spending cycles walking the list uselessly.
> Instead of trying nilly-willy, the standard thing to do is to wait for host to
> consume an entry and proceed.
> 
> Coding it up might be tricky, so it's probably acceptable as is for now, but
> please replace the justification about with a TODO entry that we should
> synchronize with the host.

Thanks. I plan to add

TODO: The current implementation could be further improved by stopping the reporting when the vq is full and continuing the reporting when host notifies that there are available entries for the driver to add.


> 
> 
> >
> > >
> > > > +	 * We are adding one entry each time, which essentially results in no
> > > > +	 * memory allocation, so the GFP_KERNEL flag below can be ignored.
> > > > +	 * Host works by polling the free page vq for hints after sending the
> > > > +	 * starting cmd id, so the driver doesn't need to kick after filling
> > > > +	 * the vq.
> > > > +	 * Lastly, there is always one entry reserved for the cmd id to use.
> > > > +	 */
> > > > +	if (vq->num_free > 1)
> > > > +		return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int virtio_balloon_send_free_pages(void *opaque, unsigned long
> pfn,
> > > > +					   unsigned long nr_pages)
> > > > +{
> > > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * If a stop id or a new cmd id was just received from host, stop
> > > > +	 * the reporting, and return 1 to indicate an active stop.
> > > > +	 */
> > > > +	if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb-
> >cmd_id_received)
> > > > +		return 1;
> 
> functions returning int should return 0 or -errno on failure, positive return
> code should indicate progress.
> 
> If you want a boolean, use bool pls.

OK. I plan to change 1  to -EBUSY to indicate the case that host actively asks the driver to stop reporting (This makes the callback return value type consistent with walk_free_mem_block). 



> 
> 
> > > > +
> > > this access to cmd_id_use and cmd_id_received without locks bothers
> > > me. Pls document why it's safe.
> >
> > OK. Probably we could add below to the above comments:
> >
> > cmd_id_use and cmd_id_received don't need to be accessed under locks
> > because the reporting does not have to stop immediately before
> > cmd_id_received is changed (i.e. when host requests to stop). That is,
> > reporting more hints after host requests to stop isn't an issue for
> > this optimization feature, because host will simply drop the stale
> > hints next time when it needs a new reporting.
> 
> What about the other direction? Can this observe a stale value and exit
> erroneously?

I'm afraid the driver couldn't be aware if the added hints are stale or not, because host and guest actions happen asynchronously. That is, host side iothread stops taking hints as soon as the migration thread asks to stop, it doesn't wait for any ACK from the driver to stop (as we discussed before, host couldn't always assume that the driver is in a responsive state).

Btw, we also don't need to worry about any memory left in the vq, since only addresses are added to the vq, there is no real memory allocations.

Best,
Wei

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-04 20:08 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev, Si-Wei Liu,
	David Miller
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports.

Sounds a lot like DSA. Please ask the vendor to contribute the drivers
:-)

> The master netdev should not be mucked with by a user. It should be
> ignored by certain s/w with lldpd as just an *example*.

I have come across occasional problems with the master device in DSA.
But nothing too serious. Generally the switch will just toss frames it
gets which don't have the needed header, when they come direct from
the master device, rather than via the slave devices.

    Andrew

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Jiri Pirko @ 2018-04-04 18:20 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, sridhar.samudrala,
	virtualization, loseweigh, netdev, dsahern, si-wei.liu
In-Reply-To: <20180404.133749.1802514210170809419.davem@davemloft.net>

Wed, Apr 04, 2018 at 07:37:49PM CEST, davem@davemloft.net wrote:
>From: David Ahern <dsahern@gmail.com>
>Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
>I agree on this.
>
>What I really don't understand still is the use case... really.
>
>So there are control netdevs, what exactly is the problem with that?
>
>Are we not exporting enough information for applications to handle
>these devices sanely?  If so, then's let add that information.
>
>We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
>Another alternative is to add an interface flag like IFF_CONTROL or
>similar, and that probably is much nicer.
>
>Hiding the devices means that we acknowledge that applications are
>currently broken with control netdevs... and we want them to stay
>broken!
>
>That doesn't sound like a good plan to me.
>
>So let's fix handling of control netdevs instead of hiding them.

Exactly. Don't workaround userspace issues by kernel patches.

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-04 18:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Netdev, Si-Wei Liu,
	David Miller
In-Reply-To: <54accf73-e6cc-e03f-6a1c-34e1bbd78047@gmail.com>

On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what is it and what should it do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

If you're still thinking the visibility is part of link attribute
rather than a separate namespace, I'd say we are trying to solve
essentially different problems, and I really don't understand your
proposal in solving that problem to be honest.

So, let's step back on studying your case if that's the right thing
and let's talk about concrete examples.

-Siwei

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Stephen Hemminger @ 2018-04-04 17:44 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexander Duyck, Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev, Si-Wei Liu,
	David Miller
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>

On Wed, 4 Apr 2018 11:37:52 -0600
David Ahern <dsahern@gmail.com> wrote:

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Sorry, the linux kernel maintainers have a clear well defined attitude
about out of tree kernel modules...

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Miller @ 2018-04-04 17:42 UTC (permalink / raw)
  To: dsahern
  Cc: alexander.h.duyck, jiri, mst, kubakici, sridhar.samudrala,
	virtualization, loseweigh, netdev, si-wei.liu
In-Reply-To: <b0f5e27b-0be1-311e-f3f3-f79af5cd4521@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:37:52 -0600

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Two approaches:

1) Add an IFF_CONTROL and make userspace understand this.  It is probably
   long overdue.

2) Design the driver properly.  Have a non-netdev master device like
   mlxsw does, and control it using devlink or similar.  This is exactly
   how this stuff was meant to be architected.

> From there I think you are confusing my intentions: I fundamentally do
> not believe the kernel should be hiding anything from an admin. Not
> showing data by default is completely different than not showing that
> data at all.

It is the same David.

It measn we have no intention of fixing applications to properly know
what to do with and how to handle these devices.

If you hide these objects, we are basically giving up on fixing the
tools and or the drivers themselves to be architected differently
(see #2 above).

That really isn't acceptable in my opinion.

> The intention of my patch with the IFF_HIDDEN attribute is:
> 1. it is a netdev attribute
> 
> 2. that attribute can be used by userpsace to indicate to the kernel I
> want all or I want the default
> 
> 3. that attribute can be controlled by an admin.
> 
> The patches go beyond my specific use case (preventing a user from
> modifying a netdev it should not be touching) but to defining the
> semantics of a generic capability which is what the kernel should have.

"Teach, do not hide!" -Yoda

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Miller @ 2018-04-04 17:37 UTC (permalink / raw)
  To: dsahern
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici,
	sridhar.samudrala, virtualization, loseweigh, netdev, si-wei.liu
In-Reply-To: <54accf73-e6cc-e03f-6a1c-34e1bbd78047@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:21:54 -0600

> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

I agree on this.

What I really don't understand still is the use case... really.

So there are control netdevs, what exactly is the problem with that?

Are we not exporting enough information for applications to handle
these devices sanely?  If so, then's let add that information.

We can set netdev->type to ETH_P_LINUXCONTROL or something like that.

Another alternative is to add an interface flag like IFF_CONTROL or
similar, and that probably is much nicer.

Hiding the devices means that we acknowledge that applications are
currently broken with control netdevs... and we want them to stay
broken!

That doesn't sound like a good plan to me.

So let's fix handling of control netdevs instead of hiding them.

Thanks.

^ permalink raw reply

* Re: [PATCH v30 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-04-04 14:07 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, huangzhichao,
	pbonzini, akpm, virtualization
In-Reply-To: <5AC43377.2070607@intel.com>

On Wed, Apr 04, 2018 at 10:07:51AM +0800, Wei Wang wrote:
> On 04/04/2018 02:47 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 04, 2018 at 12:10:03AM +0800, Wei Wang wrote:
> > > +static int add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
> > > +{
> > > +	struct scatterlist sg;
> > > +	unsigned int unused;
> > > +
> > > +	sg_init_table(&sg, 1);
> > > +	sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> > > +
> > > +	/* Detach all the used buffers from the vq */
> > > +	while (virtqueue_get_buf(vq, &unused))
> > > +		;
> > > +
> > > +	/*
> > > +	 * Since this is an optimization feature, losing a couple of free
> > > +	 * pages to report isn't important. We simply return without adding
> > > +	 * the page hint if the vq is full.
> > why not stop scanning of following pages though?
> 
> Because continuing to send hints is a way to deliver the maximum possible
> hints to host. For example, host may have a delay in taking hints at some
> point, and then it resumes to take hints soon. If the driver does not stop
> when the vq is full, it will be able to put more hints to the vq once the vq
> has available entries to add.

What this appears to be is just lack of coordination between
host and guest.

But meanwhile you are spending cycles walking the list uselessly.
Instead of trying nilly-willy, the standard thing to do
is to wait for host to consume an entry and proceed.

Coding it up might be tricky, so it's probably acceptable as is
for now, but please replace the justification about with
a TODO entry that we should synchronize with the host.


> 
> > 
> > > +	 * We are adding one entry each time, which essentially results in no
> > > +	 * memory allocation, so the GFP_KERNEL flag below can be ignored.
> > > +	 * Host works by polling the free page vq for hints after sending the
> > > +	 * starting cmd id, so the driver doesn't need to kick after filling
> > > +	 * the vq.
> > > +	 * Lastly, there is always one entry reserved for the cmd id to use.
> > > +	 */
> > > +	if (vq->num_free > 1)
> > > +		return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> > > +					   unsigned long nr_pages)
> > > +{
> > > +	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > > +	uint32_t len = nr_pages << PAGE_SHIFT;
> > > +
> > > +	/*
> > > +	 * If a stop id or a new cmd id was just received from host, stop
> > > +	 * the reporting, and return 1 to indicate an active stop.
> > > +	 */
> > > +	if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb->cmd_id_received)
> > > +		return 1;

functions returning int should return 0 or -errno on failure,
positive return code should indicate progress.

If you want a boolean, use bool pls.


> > > +
> > this access to cmd_id_use and cmd_id_received without locks
> > bothers me. Pls document why it's safe.
> 
> OK. Probably we could add below to the above comments:
> 
> cmd_id_use and cmd_id_received don't need to be accessed under locks because
> the reporting does not have to stop immediately before cmd_id_received is
> changed (i.e. when host requests to stop). That is, reporting more hints
> after host requests to stop isn't an issue for this optimization feature,
> because host will simply drop the stale hints next time when it needs a new
> reporting.

What about the other direction? Can this observe a stale value and
exit erroneously?

> 
> 
> 
> > 
> > > +	return add_one_sg(vb->free_page_vq, pfn, len);
> > > +}
> > > +
> > > +static int send_start_cmd_id(struct virtio_balloon *vb, uint32_t cmd_id)
> > > +{
> > > +	struct scatterlist sg;
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +
> > > +	vb->cmd_id_use = cpu_to_virtio32(vb->vdev, cmd_id);
> > > +	sg_init_one(&sg, &vb->cmd_id_use, sizeof(vb->cmd_id_use));
> > > +	return virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > +}
> > > +
> > > +static int send_stop_cmd_id(struct virtio_balloon *vb)
> > > +{
> > > +	struct scatterlist sg;
> > > +	struct virtqueue *vq = vb->free_page_vq;
> > > +
> > > +	sg_init_one(&sg, &vb->stop_cmd_id, sizeof(vb->cmd_id_use));
> > why the inconsistency?
> 
> Thanks, will make it consistent.
> 
> Best,
> Wei

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-04  8:28 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	Si-Wei Liu, David Miller
In-Reply-To: <3bdfc39f-4935-2433-7982-9ce28c3aa166@gmail.com>

On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.

I'm still not sure I understand your case: why you want to hide the
control netdev, as I assume those devices could choose either to
silently ignore the request, or fail loudly against user operations?
Is it creating issues already, or what problem you want to solve if
not making the netdev invisible. Why couldn't lldpd check some
specific flag and ignore the control netdevice (can you please give an
example of a concrete driver for control netdevice *in tree*).

And I'm completely lost why you want an API to make a hidden netdev
visible again for these control devices.

Thanks,
-Siwei


>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden
From: Siwei Liu @ 2018-04-04  8:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Netdev, Si-Wei Liu,
	David Miller
In-Reply-To: <20180403151915-mutt-send-email-mst@kernel.org>

On Tue, Apr 3, 2018 at 5:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> index aa40664..0827b7e 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -80,6 +80,8 @@ struct virtio_net_config {
>>       __u16 max_virtqueue_pairs;
>>       /* Default maximum transmit unit advice */
>>       __u16 mtu;
>> +     /* Device at bus:slot.function backed up by virtio_net */
>> +     __u16 bsf2backup;
>>  } __attribute__((packed));
>
> I'm not sure this is a good interface.  This isn't unique even on some
> PCI systems, not to speak of non-PCI ones.

Are you suggesting adding PCI address domain besides to make it
universally unique? And what the non-PCI device you envisioned that
the main target, essetially live migration, can/should cover? Or is
there better option in your mind already?

Thanks,
-Siwei
>
>>  /*
>> --
>> 1.8.3.1
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
From: Siwei Liu @ 2018-04-04  8:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Netdev, Si-Wei Liu,
	David Miller
In-Reply-To: <20180403152308-mutt-send-email-mst@kernel.org>

On Tue, Apr 3, 2018 at 5:25 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
>> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>>      }
>>  }
>>
>> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
>> +                           uint16_t *devfn, Error **errp)
>> +{
>> +    uint16_t busnum = 0, slot = 0, func = 0;
>> +    const char *pc, *pd, *pe;
>> +    Error *local_err = NULL;
>> +    ObjectClass *class;
>> +    char value[1024];
>> +    BusState *bus;
>> +    uint64_t u64;
>> +
>> +    if (!(pc = strchr(id, ':'))) {
>> +        error_setg(errp, "Invalid id: backup=%s, "
>> +                   "correct format should be backup="
>> +                   "'<bus-id>:<slot>[.<function>]'", id);
>> +        return -1;
>> +    }
>> +    get_opt_name(value, sizeof(value), id, ':');
>> +    if (pc != id + 1) {
>> +        bus = qbus_find(value, errp);
>> +        if (!bus)
>> +            return -1;
>> +
>> +        class = object_get_class(OBJECT(bus));
>> +        if (class != object_class_by_name(TYPE_PCI_BUS) &&
>> +            class != object_class_by_name(TYPE_PCIE_BUS)) {
>> +            error_setg(errp, "%s is not a device on pci bus", id);
>> +            return -1;
>> +        }
>> +        busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
>> +    }
>
> pci_bus_num is almost always a bug if not done within
> a context of a PCI host, bridge, etc.
>
> In particular this will not DTRT if run before guest assigns bus
> numbers.
>
I was seeking means to reserve a specific pci bus slot from drivers,
and update the driver when guest assigns the bus number but it seems
there's no low-hanging fruits. Because of that reason the bus_num is
only obtained until it's really needed (during get_config) and I
assume at that point the pci bus assignment is already done. I know
the current one is not perfect, but we need that information (PCI
bus:slot.func number) to name the guest device correctly.

Regards,
-Siwei
>
>> +
>> +    if (!devfn)
>> +        goto out;
>> +
>> +    pd = strchr(pc, '.');
>> +    pe = get_opt_name(value, sizeof(value), pc + 1, '.');
>> +    if (pe != pc + 1) {
>> +        parse_option_number("slot", value, &u64, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return -1;
>> +        }
>> +        slot = (uint16_t)u64;
>> +    }
>> +    if (pd && *(pd + 1) != '\0') {
>> +        parse_option_number("function", pd, &u64, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return -1;
>> +        }
>> +        func = (uint16_t)u64;
>> +    }
>> +
>> +out:
>> +    if (busnr)
>> +        *busnr = busnum;
>> +    if (devfn)
>> +        *devfn = ((slot & 0x1F) << 3) | (func & 0x7);
>> +    return 0;
>> +}
>> +
>>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>>  {
>>      DeviceState *dev;
>> --
>> 1.8.3.1
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-04  8:01 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
	Samudrala, Sridhar, virtualization, Netdev, David Ahern,
	Si-Wei Liu, David Miller
In-Reply-To: <20180404061945.GN3313@nanopsycho>

On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc if factoring out virtio_bypass to a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just for dumping the link info i.e. perform
read-only operation on the hidden netdev, it's completely fine.
However, I am not a big fan of creating a weird mechanism to allow
user deliberately manipulate the visibility (hide/unhide) of a netdev
in any case at any time. This is subject to becoming a slippery slope
to work around any software issue that should get fixed in the right
place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just the name is misleading, I can get it renamed to something
like IFF_AUTO_MANAGED which might reflect its nature more properly.

Thanks,
-Siwei

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox