Linux virtualization list
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-15 11:48 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Netdev, aaron.f.brown
In-Reply-To: <CADGSJ23WnTmVKHezm3t0V6M2sWeHaOUoTjdXkmrvbO0EF83hMg@mail.gmail.com>

On Thu, 14 Jun 2018 18:57:11 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> Thank you for sharing your thoughts, Cornelia. With questions below, I
> think you raised really good points, some of which I don't have answer
> yet and would also like to explore here.
> 
> First off, I don't want to push the discussion to the extreme at this
> point, or sell anything about having QEMU manage everything
> automatically. Don't get me wrong, it's not there yet. Let's don't
> assume we are tied to a specific or concerte solution. I think the key
> for our discussion might be to define or refine the boundary between
> VM and guest,  e.g. what each layer is expected to control and manage
> exactly.
> 
> In my view, there might be possibly 3 different options to represent
> the failover device conceipt to QEMU and libvirt (or any upper layer
> software):
> 
> a. Seperate device: in this model, virtio and passthough remains
> separate devices just as today. QEMU exposes the standby feature bit
> for virtio, and publish status/event around the negotiation process of
> this feature bit for libvirt to react upon. Since Libvirt has the
> pairing relationship itself, maybe through MAC address or something
> else, it can control the presence of primary by hot plugging or
> unplugging the passthrough device, although it has to work tightly
> with virtio's feature negotation process. Not just for migration but
> also various corner scenarios (driver/feature ok, device reset,
> reboot, legacy guest etc) along virtio's feature negotiation.

Yes, that one has obvious tie-ins to virtio's modus operandi.

> 
> b. Coupled device: in this model, virtio and passthough devices are
> weakly coupled using some group ID, i.e. QEMU match the passthough
> device for a standby virtio instance by comparing the group ID value
> present behind each device's bridge. Libvirt provides QEMU the group
> ID for both type of devices, and only deals with hot plug for
> migration, by checking some migration status exposed (e.g. the feature
> negotiation status on the virtio device) by QEMU. QEMU manages the
> visibility of the primary in guest along virtio's feature negotiation
> process.

I'm a bit confused here. What, exactly, ties the two devices together?
If libvirt already has the knowledge that it should manage the two as a
couple, why do we need the group id (or something else for other
architectures)? (Maybe I'm simply missing something because I'm not
that familiar with pci.)

> 
> c. Fully combined device: in this model, virtio and passthough devices
> are viewed as a single VM interface altogther. QEMU not just controls
> the visibility of the primary in guest, but can also manage the
> exposure of the passthrough for migratability. It can be like that
> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> to provide group ID for grouping the two devices, if just one single
> combined device is exposed by QEMU. In either case, QEMU manages all
> aspect of such internal construct, including virtio feature
> negotiation, presence of the primary, and live migration.

Same question as above.

> 
> It looks like to me that, in your opinion, you seem to prefer go with
> (a). While I'm actually okay with either (b) or (c). Do I understand
> your point correctly?

I'm not yet preferring anything, as I'm still trying to understand how
this works :) I hope we can arrive at a model that covers the use case
and that is also flexible enough to be extended to other platforms.

> 
> The reason that I feel that (a) might not be ideal, just as Michael
> alluded to (quoting below), is that as management stack, it really
> doesn't need to care about the detailed process of feature negotiation
> (if we view the guest presence of the primary as part of feature
> negotiation at an extended level not just virtio). All it needs to be
> done is to hand in the required devices to QEMU and that's all. Why do
> we need to addd various hooks, events for whichever happens internally
> within the guest?
> 
> ''
> Primary device is added with a special "primary-failover" flag.
> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
> 
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.
> 
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.
> ''
> 
> and,
> 
> ''
> management just wants to give the primary to guest and later take it back,
> it really does not care about the details of the process,
> so I don't see what does pushing it up the stack buy you.
> 
> So I don't think it *needs* to be done in libvirt. It probably can if you
> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> unbinds from device, and can check that backup flag was set.
> If you are pushing for a setup like that please get a buy-in
> from libvirt maintainers or better write a patch.
> ''

This actually seems to mean the opposite to me: We need to know what
the guest is doing and when, as it directly drives what we need to do
with the devices. If we switch to a visibility vs a hotplug model (see
the other mail), we might be able to handle that part within qemu.
However, I don't see how you get around needing libvirt to actually set
this up in the first place and to handle migration per se.

^ permalink raw reply

* Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-06-15 11:42 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <1529037793-35521-3-git-send-email-wei.w.wang@intel.com>

On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
> support of reporting hints of guest free pages to host via virtio-balloon.
> 
> Host requests the guest to report free page hints by sending a command
> to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit
> of the host_cmd config register.
> 
> As the first step here, virtio-balloon only reports free page hints from
> the max order (10) free page list to host. This has generated similar good
> results as reporting all free page hints during our tests.
> 
> TODO:
> - support reporting free page hints from smaller order free page lists
>   when there is a need/request from users.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  drivers/virtio/virtio_balloon.c     | 187 +++++++++++++++++++++++++++++-------
>  include/uapi/linux/virtio_balloon.h |  13 +++
>  2 files changed, 163 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 6b237e3..582a03b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -43,6 +43,9 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +/* The size of memory in bytes allocated for reporting free page hints */
> +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> +
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");

Doesn't this limit memory size of the guest we can report?
Apparently to several gigabytes ...
OTOH huge guests with lots of free memory is exactly
where we would gain the most ...

-- 
MST

^ permalink raw reply

* Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting
From: Michael S. Tsirkin @ 2018-06-15 11:29 UTC (permalink / raw)
  To: Wei Wang
  Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
	liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
	akpm, virtualization, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
>       - remove the cmd id related interface. Now host can just send a free
>         page hint command to the guest (via the host_cmd config register)
>         to start the reporting.

Here we go again. And what if reporting was already started previously?
I don't think it's a good idea to tweak the host/guest interface yet
again.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-15  9:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180615052743-mutt-send-email-mst@kernel.org>

On Fri, 15 Jun 2018 05:34:24 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:

> > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > do all the
> > > > required management of the primary/standby devices within Qemu, that is
> > > > definitely a better
> > > > approach without upper layer involvement.    
> > > 
> > > Right. I would imagine in the extreme case the upper layer doesn't
> > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > The management tool can supply passthrough device and virtio with the
> > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > hot plug the device as needed before or after the migration.  
> > 
> > I do not really see how you can manage that kind of stuff in QEMU only.  
> 
> So right now failover is limited to pci passthrough devices only.
> The idea is to realize the vfio device but not expose it
> to guest. Have a separate command to expose it to guest.
> Hotunplug would also hide it from guest but not unrealize it.

So, this would not be real hot(un)plug, but 'hide it from the guest',
right? The concept of "we have it realized in QEMU, but the guest can't
discover and use it" should be translatable to non-pci as well (at
least for ccw).

> 
> This will help ensure that e.g. on migration failure we can
> re-expose the device without risk of running out of resources.

Makes sense.

Should that 'hidden' state be visible/settable from outside as well
(e.g. via a property)? I guess yes, so that management software has a
chance to see whether a device is visible. Settable may be useful if we
find another use case for hiding realized devices.

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-06-15  9:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, cohuck, pawel.moll, Tom Lendacky, Michael S. Tsirkin,
	Ram Pai, linux-kernel, virtualization, Christoph Hellwig, joe,
	Rustad, Mark D, Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <10bbd7122aaa67f51de7a8328df8154212a13f23.camel@kernel.crashing.org>

On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
> 
> There's no cache flushing and there's no architecture hooks that I can
> see other than the AMD security stuff which is probably fine.
> 
> Or am I missing something ?

You are missing the __phys_to_dma arch hook that allows architectures
to adjust the dma address.  Various systems have offsets, or even
multiple banks with different offsets there.  Most of them don't
use the dma-direct code yet (working on it), but there are a few
examples in the tree already.

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Christoph Hellwig @ 2018-06-15  7:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Juergen Gross, Jens Axboe, kvm, linux-scsi, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>

Btw, if you are on a spree to remove almost unused data structures
from target code, the lib/btree.c code is only used by the qla2xxx
target code, and doesn't really look like the best fit for it either.

^ permalink raw reply

* [PATCH v33 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
From: Wei Wang @ 2018-06-15  4:43 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
guest is using page poisoning. Guest writes to the poison_val config
field to tell host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c     | 10 ++++++++++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 582a03b..c59bb380 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -634,6 +634,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
+	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -671,6 +672,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
 			goto out_del_vqs;
 		}
 		INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+				      poison_val, &poison_val);
+		}
 		vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
 		if (!vb->hints) {
 			err = -ENOMEM;
@@ -796,6 +802,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+	if (!page_poisoning_enabled())
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
 	__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
 	return 0;
 }
@@ -805,6 +814,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+	VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 99b8416..f3b6191 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
 	__u32 actual;
 	/* Command sent from host */
 	__u32 host_cmd;
+	/* Stores PAGE_POISON if page poisoning is in use */
+	__u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints {
-- 
2.7.4

^ permalink raw reply related

* [PATCH v33 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules
From: Wei Wang @ 2018-06-15  4:43 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_poison.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
 	/*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
 		(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
 		debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
-- 
2.7.4

^ permalink raw reply related

* [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-06-15  4:43 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.

Host requests the guest to report free page hints by sending a command
to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit
of the host_cmd config register.

As the first step here, virtio-balloon only reports free page hints from
the max order (10) free page list to host. This has generated similar good
results as reporting all free page hints during our tests.

TODO:
- support reporting free page hints from smaller order free page lists
  when there is a need/request from users.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/virtio/virtio_balloon.c     | 187 +++++++++++++++++++++++++++++-------
 include/uapi/linux/virtio_balloon.h |  13 +++
 2 files changed, 163 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..582a03b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -43,6 +43,9 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/* The size of memory in bytes allocated for reporting free page hints */
+#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -51,9 +54,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+enum virtio_balloon_vq {
+	VIRTIO_BALLOON_VQ_INFLATE,
+	VIRTIO_BALLOON_VQ_DEFLATE,
+	VIRTIO_BALLOON_VQ_STATS,
+	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+	/* Balloon's own wq for cpu-intensive work items */
+	struct workqueue_struct *balloon_wq;
+	/* The free page reporting work item submitted to the balloon wq */
+	struct work_struct report_free_page_work;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -63,6 +79,8 @@ struct virtio_balloon {
 	spinlock_t stop_update_lock;
 	bool stop_update;
 
+	struct virtio_balloon_free_page_hints *hints;
+
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
@@ -326,17 +344,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-	struct virtio_balloon *vb = vdev->priv;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vb->stop_update_lock, flags);
-	if (!vb->stop_update)
-		queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-	spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	s64 target;
@@ -353,6 +360,32 @@ static inline s64 towards_target(struct virtio_balloon *vb)
 	return target - vb->num_pages;
 }
 
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long flags;
+	uint32_t host_cmd;
+	s64 diff = towards_target(vb);
+
+	if (diff) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(system_freezable_wq,
+				   &vb->update_balloon_size_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+
+	virtio_cread(vdev, struct virtio_balloon_config, host_cmd, &host_cmd);
+	if ((host_cmd & VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT) &&
+	     virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		spin_lock_irqsave(&vb->stop_update_lock, flags);
+		if (!vb->stop_update)
+			queue_work(vb->balloon_wq,
+				   &vb->report_free_page_work);
+		spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+	}
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
 	u32 actual = vb->num_pages;
@@ -425,44 +458,98 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void free_page_vq_cb(struct virtqueue *vq)
+{
+	unsigned int unused;
+
+	while (virtqueue_get_buf(vq, &unused))
+		;
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+	vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+	const char *names[VIRTIO_BALLOON_VQ_MAX];
+	struct scatterlist sg;
+	int ret;
 
 	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
+	 * Inflateq and deflateq are used unconditionally. The names[]
+	 * will be NULL if the related feature is not enabled, which will
+	 * cause no allocation for the corresponding virtqueue in find_vqs.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
-	if (err)
-		return err;
+	callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
+	names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
+	callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
+	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
+	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
+	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 
-	vb->inflate_vq = vqs[0];
-	vb->deflate_vq = vqs[1];
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-		struct scatterlist sg;
-		unsigned int num_stats;
-		vb->stats_vq = vqs[2];
+		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
+		callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
+	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
+		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = free_page_vq_cb;
+	}
+
+	ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
+					 vqs, callbacks, names, NULL, NULL);
+	if (ret)
+		return ret;
+
+	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
+	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
 		 */
-		num_stats = update_balloon_stats(vb);
-
-		sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
-		if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-		    < 0)
-			BUG();
+		sg_init_one(&sg, vb->stats, sizeof(vb->stats));
+		ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+					   GFP_KERNEL);
+		if (ret) {
+			dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+				 __func__);
+			return ret;
+		}
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
+
 	return 0;
 }
 
+static void report_free_page_func(struct work_struct *work)
+{
+	struct virtio_balloon *vb;
+	struct virtqueue *vq;
+	struct virtio_balloon_free_page_hints *hints;
+	struct scatterlist sg;
+	uint32_t hdr_size, avail_entries, added_entries;
+
+	vb = container_of(work, struct virtio_balloon, report_free_page_work);
+	vq = vb->free_page_vq;
+	hints = vb->hints;
+	hdr_size = sizeof(hints->num_hints) + sizeof(hints->size);
+	avail_entries = (FREE_PAGE_HINT_MEM_SIZE - hdr_size) / sizeof(__le64);
+
+	added_entries = get_from_free_page_list(MAX_ORDER - 1, hints->buf,
+						avail_entries);
+	hints->num_hints = cpu_to_le32(added_entries);
+	hints->size = cpu_to_le32((1 << (MAX_ORDER - 1)) << PAGE_SHIFT);
+
+	sg_init_one(&sg, vb->hints, FREE_PAGE_HINT_MEM_SIZE);
+	virtqueue_add_outbuf(vq, &sg, 1, vb->hints, GFP_KERNEL);
+	virtqueue_kick(vb->free_page_vq);
+}
+
 #ifdef CONFIG_BALLOON_COMPACTION
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
@@ -576,18 +663,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vb;
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		vb->balloon_wq = alloc_workqueue("balloon-wq",
+					WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+		if (!vb->balloon_wq) {
+			err = -ENOMEM;
+			goto out_del_vqs;
+		}
+		INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+		vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
+		if (!vb->hints) {
+			err = -ENOMEM;
+			goto out_del_balloon_wq;
+		}
+	}
+
 	vb->nb.notifier_call = virtballoon_oom_notify;
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_del_vqs;
+		goto out_del_free_page_hint;
 
 #ifdef CONFIG_BALLOON_COMPACTION
 	balloon_mnt = kern_mount(&balloon_fs);
 	if (IS_ERR(balloon_mnt)) {
 		err = PTR_ERR(balloon_mnt);
 		unregister_oom_notifier(&vb->nb);
-		goto out_del_vqs;
+		goto out_del_free_page_hint;
 	}
 
 	vb->vb_dev_info.migratepage = virtballoon_migratepage;
@@ -597,7 +699,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		kern_unmount(balloon_mnt);
 		unregister_oom_notifier(&vb->nb);
 		vb->vb_dev_info.inode = NULL;
-		goto out_del_vqs;
+		goto out_del_free_page_hint;
 	}
 	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
 #endif
@@ -607,7 +709,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
-
+out_del_free_page_hint:
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+		kfree(vb->hints);
+out_del_balloon_wq:
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+		destroy_workqueue(vb->balloon_wq);
 out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
@@ -641,6 +748,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+		cancel_work_sync(&vb->report_free_page_work);
+		destroy_workqueue(vb->balloon_wq);
+	}
+
 	remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
 	if (vb->vb_dev_info.inode)
@@ -692,6 +804,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 13b8cb5..99b8416 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,15 +34,28 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT (1 << 0)
 struct virtio_balloon_config {
 	/* Number of pages host wants Guest to give up. */
 	__u32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__u32 actual;
+	/* Command sent from host */
+	__u32 host_cmd;
+};
+
+struct virtio_balloon_free_page_hints {
+	/* Number of hints in the array below */
+	__le32 num_hints;
+	/* The size of each hint in bytes */
+	__le32 size;
+	/* Buffer for the hints */
+	__le64 buf[];
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
-- 
2.7.4

^ permalink raw reply related

* [PATCH v33 1/4] mm: add a function to get free page blocks
From: Wei Wang @ 2018-06-15  4:43 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds
In-Reply-To: <1529037793-35521-1-git-send-email-wei.w.wang@intel.com>

This patch adds a function to get free pages blocks from a free page
list. The obtained free page blocks are hints about free pages, because
there is no guarantee that they are still on the free page list after
the function returns.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are hinted as free pages but are written after this function returns will
be captured by the hypervisor, and they will be added to the next round of
memory transfer.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e49388..c58b4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2002,6 +2002,7 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07b3c23..7c816d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5043,6 +5043,58 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * get_from_free_page_list - get free page blocks from a free page list
+ * @order: the order of the free page list to check
+ * @buf: the array to store the physical addresses of the free page blocks
+ * @size: the array size
+ *
+ * This function offers hints about free pages. There is no guarantee that
+ * the obtained free pages are still on the free page list after the function
+ * returns. pfn_to_page on the obtained free pages is strongly discouraged
+ * and if there is an absolute need for that, make sure to contact MM people
+ * to discuss potential problems.
+ *
+ * The addresses are currently stored to the array in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future.
+ *
+ * Return the number of free page blocks obtained from the free page list.
+ * The maximum number of free page blocks that can be obtained is limited to
+ * the caller's array size.
+ */
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
+{
+	struct zone *zone;
+	enum migratetype mt;
+	struct page *page;
+	struct list_head *list;
+	unsigned long addr, flags;
+	uint32_t index = 0;
+
+	for_each_populated_zone(zone) {
+		spin_lock_irqsave(&zone->lock, flags);
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			list = &zone->free_area[order].free_list[mt];
+			list_for_each_entry(page, list, lru) {
+				addr = page_to_pfn(page) << PAGE_SHIFT;
+				if (likely(index < size)) {
+					buf[index++] = cpu_to_le64(addr);
+				} else {
+					spin_unlock_irqrestore(&zone->lock,
+							       flags);
+					return index;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return index;
+}
+EXPORT_SYMBOL_GPL(get_from_free_page_list);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v33 0/4] Virtio-balloon: support free page reporting
From: Wei Wang @ 2018-06-15  4:43 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
	mhocko, akpm
  Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
	nilal, torvalds

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction of this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

* Tests
- Test Environment
    Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    Guest: 8G RAM, 4 vCPU
    Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
    - Idle Guest Live Migration Time (results are averaged over 10 runs):
        - Optimization v.s. Legacy = 278ms vs 1757ms --> ~84% reduction
    - Guest with Linux Compilation Workload (make bzImage -j4):
        - Live Migration Time (average)
          Optimization v.s. Legacy = 1408ms v.s. 2528ms --> ~44% reduction
        - Linux Compilation Time
          Optimization v.s. Legacy = 5min3s v.s. 5min12s
          --> no obvious difference

ChangeLog:
v32->v33:
    - mm/get_from_free_page_list: The new implementation to get free page
      hints based on the suggestions from Linus:
      https://lkml.org/lkml/2018/6/11/764
      This avoids the complex call chain, and looks more prudent.
    - virtio-balloon: 
      - use a fix-sized buffer to get free page hints;
      - remove the cmd id related interface. Now host can just send a free
        page hint command to the guest (via the host_cmd config register)
        to start the reporting. Currentlty the guest reports only the max
        order free page hints to host, which has generated similar good
        results as before. But the interface used by virtio-balloon to
        report can support reporting more orders in the future when there
        is a need.
v31->v32:
    - virtio-balloon:
        - rename cmd_id_use to cmd_id_active;
        - report_free_page_func: detach used buffers after host sends a vq
          interrupt, instead of busy waiting for used buffers.
v30->v31:
    - virtio-balloon:
        - virtio_balloon_send_free_pages: return -EINTR rather than 1 to
          indicate an active stop requested by host; and add more
          comments to explain about access to cmd_id_received without
          locks;
        -  add_one_sg: add TODO to comments about possible improvement.
v29->v30:
    - mm/walk_free_mem_block: add cond_sched() for each order
v28->v29:
    - mm/page_poison: only expose page_poison_enabled(), rather than more
      changes did in v28, as we are not 100% confident about that for now.
    - virtio-balloon: use a separate buffer for the stop cmd, instead of
      having the start and stop cmd use the same buffer. This avoids the
      corner case that the start cmd is overridden by the stop cmd when
      the host has a delay in reading the start cmd.
v27->v28:
    - mm/page_poison: Move PAGE_POISON to page_poison.c and add a function
      to expose page poison val to kernel modules.
v26->v27:
    - add a new patch to expose page_poisoning_enabled to kernel modules
    - virtio-balloon: set poison_val to 0xaaaaaaaa, instead of 0xaa
v25->v26: virtio-balloon changes only
    - remove kicking free page vq since the host now polls the vq after
      initiating the reporting
    - report_free_page_func: detach all the used buffers after sending
      the stop cmd id. This avoids leaving the detaching burden (i.e.
      overhead) to the next cmd id. Detaching here isn't considered
      overhead since the stop cmd id has been sent, and host has already
      moved formard.
v24->v25:
    - mm: change walk_free_mem_block to return 0 (instead of true) on
          completing the report, and return a non-zero value from the
          callabck, which stops the reporting.
    - virtio-balloon:
        - use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
        - avoid __virtio_clear_bit when bailing out;
        - a new method to avoid reporting the some cmd id to host twice
        - destroy_workqueue can cancel free page work when the feature is
          negotiated;
        - fail probe when the free page vq size is less than 2.
v23->v24:
    - change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
      VIRTIO_BALLOON_F_FREE_PAGE_HINT
    - kick when vq->num_free < half full, instead of "= half full"
    - replace BUG_ON with bailing out
    - check vb->balloon_wq in probe(), if null, bail out
    - add a new feature bit for page poisoning
    - solve the corner case that one cmd id being sent to host twice
v22->v23:
    - change to kick the device when the vq is half-way full;
    - open-code batch_free_page_sg into add_one_sg;
    - change cmd_id from "uint32_t" to "__virtio32";
    - reserver one entry in the vq for the driver to send cmd_id, instead
      of busywaiting for an available entry;
    - add "stop_update" check before queue_work for prudence purpose for
      now, will have a separate patch to discuss this flag check later;
    - init_vqs: change to put some variables on stack to have simpler
      implementation;
    - add destroy_workqueue(vb->balloon_wq);
v21->v22:
    - add_one_sg: some code and comment re-arrangement
    - send_cmd_id: handle a cornercase

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/

Wei Wang (4):
  mm: add a function to get free page blocks
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c     | 197 +++++++++++++++++++++++++++++-------
 include/linux/mm.h                  |   1 +
 include/uapi/linux/virtio_balloon.h |  16 +++
 mm/page_alloc.c                     |  52 ++++++++++
 mm/page_poison.c                    |   6 ++
 5 files changed, 235 insertions(+), 37 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Wei Wang @ 2018-06-15  3:53 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Linus Torvalds, Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <0f18063c-c76b-4728-5145-810f069988ea@redhat.com>

On 06/14/2018 11:01 PM, Nitesh Narayan Lal wrote:
> Hi Wei,
>
>
> On 06/12/2018 07:05 AM, Wei Wang wrote:
>> On 06/12/2018 09:59 AM, Linus Torvalds wrote:
>>> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com>
>>> wrote:
>>>> Maybe it will help to have GFP_NONE which will make any allocation
>>>> fail if attempted. Linus, would this address your comment?
>>> It would definitely have helped me initially overlook that call chain.
>>>
>>> But then when I started looking at the whole dma_map_page() thing, it
>>> just raised my hackles again.
>>>
>>> I would seriously suggest having a much simpler version for the "no
>>> allocation, no dma mapping" case, so that it's *obvious* that that
>>> never happens.
>>>
>>> So instead of having virtio_balloon_send_free_pages() call a really
>>> generic complex chain of functions that in _some_ cases can do memory
>>> allocation, why isn't there a short-circuited "vitruque_add_datum()"
>>> that is guaranteed to never do anything like that?
>>>
>>> Honestly, I look at "add_one_sg()" and it really doesn't make me
>>> happy. It looks hacky as hell. If I read the code right, you're really
>>> trying to just queue up a simple tuple of <pfn,len>, except you encode
>>> it as a page pointer in order to play games with the SG logic, and
>>> then you hmap that to the ring, except in this case it's all a fake
>>> ring that just adds the cpu-physical address instead.
>>>
>>> And to figuer that out, it's like five layers of indirection through
>>> different helper functions that *can* do more generic things but in
>>> this case don't.
>>>
>>> And you do all of this from a core VM callback function with some
>>> _really_ core VM locks held.
>>>
>>> That makes no sense to me.
>>>
>>> How about this:
>>>
>>>    - get rid of all that code
>>>
>>>    - make the core VM callback save the "these are the free memory
>>> regions" in a fixed and limited array. One that DOES JUST THAT. No
>>> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
>>> size, pre-allocated for that virtio instance.
>>>
>>>    - make it obvious that what you do in that sequence is ten
>>> instructions and no allocations ("Look ma, I wrote a value to an array
>>> and incremented the array idex, and I'M DONE")
>>>
>>>    - then in that workqueue entry that you start *anyway*, you empty the
>>> array and do all the crazy virtio stuff.
>>>
>>> In fact, while at it, just simplify the VM interface too. Instead of
>>> traversing a random number of buddy lists, just trraverse *one* - the
>>> top-level one. Are you seriously ever going to shrink or mark
>>> read-only anythin *but* something big enough to be in the maximum
>>> order?
>>>
>>> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
>>> want the balloon code to work on smaller things, particularly since
>>> the whole interface is fundamentally racy and opportunistic to begin
>>> with?
>> OK, I will implement a new version based on the suggestions. Thanks.
> I have been working on a similar series [1] that is more generic, which
> solves the problem of giving unused memory back to the host and could be
> used to solve the migration problem as well. Can you take a look and see
> if you can use my series in some way?

Hi Nitesh,

I actually checked the last version, which dates back to last year. It 
seems the new version does not have fundamental differences.

Actually there are obvious differences between the two series. This 
series provides a simple lightweight method (will continue to post out a 
new version with simpler interfaces based on the above suggestions) to 
offer free pages hints, and the hints are quit helpful for usages like 
accelerating live migration and guest snapshot. If I read that 
correctly, that series seems to provide true (guaranteed) free pages 
with much more heavyweight logics, but true free pages are not necessary 
for the live migration optimization, which is the goal and motivation of 
this work. And from my point of view, that series seems more like an 
alternative function to ballooning, which takes out free pages (or say 
guest unused pages) via allocation.

I will join the discussion in that thread. Probably we would need to 
think about other new usages for that series.

Best,
Wei

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Matthew Wilcox @ 2018-06-15  2:37 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Juergen Gross, Jens Axboe, kvm, linux-scsi, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <yq136xols59.fsf@oracle.com>

On Thu, Jun 14, 2018 at 10:06:58PM -0400, Martin K. Petersen wrote:
> 
> Matthew,
> 
> > Removing the percpu_ida code nets over 400 lines of removal.  It's not
> > as spectacular as deleting an entire architecture, but it's still a
> > worthy reduction in lines of code.
> 
> Since most of the changes are in scsi or target, should I take this
> series through my tree?

I'd welcome that.  Nick seems to be inactive as target maintainer;
his tree on kernel.org hasn't seen any updates in five months.

Thanks!

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-15  2:34 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180614120231.0a72bd5f.cohuck@redhat.com>

On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
> 
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
> Should it be possible to add a virtio/vfio pair later on?

Yes, that's the plan, more or less.

> > >>> I think Qemu should check if guest virtio-net supports this feature and
> > >>> provide a mechanism for
> > >>> an upper layer indicating if the STANDBY feature is successfully
> > >>> negotiated or not.
> > >>> The upper layer can then decide if it should hot plug a VF with the same
> > >>> MAC and manage the 2 links.
> > >>> If VF is successfully hot plugged, virtio-net link should be disabled.  

BTW I see no reason to do this last part. The role of the
standby device is to be always there.

> > >>
> > >> Did you even talk to upper layer management about it?
> > >> Just list the steps they need to do and you will see
> > >> that's a lot of machinery to manage by the upper layer.
> > >>
> > >> What do we gain in flexibility? As far as I can see the
> > >> only gain is some resources saved for legacy VMs.
> > >>
> > >> That's not a lot as tenant of the upper layer probably already has
> > >> at least a hunch that it's a new guest otherwise
> > >> why bother specifying the feature at all - you
> > >> save even more resources without it.
> > >>  
> > >
> > > I am not all that familiar with how Qemu manages network devices. If we can
> > > do all the
> > > required management of the primary/standby devices within Qemu, that is
> > > definitely a better
> > > approach without upper layer involvement.  
> > 
> > Right. I would imagine in the extreme case the upper layer doesn't
> > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > The management tool can supply passthrough device and virtio with the
> > same group UUID, QEMU auto-manages the presence of the primary, and
> > hot plug the device as needed before or after the migration.
> 
> I do not really see how you can manage that kind of stuff in QEMU only.

So right now failover is limited to pci passthrough devices only.
The idea is to realize the vfio device but not expose it
to guest. Have a separate command to expose it to guest.
Hotunplug would also hide it from guest but not unrealize it.

This will help ensure that e.g. on migration failure we can
re-expose the device without risk of running out of resources.
-- 
MST

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Martin K. Petersen @ 2018-06-15  2:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Juergen Gross, Jens Axboe, kvm, linux-scsi, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>


Matthew,

> Removing the percpu_ida code nets over 400 lines of removal.  It's not
> as spectacular as deleting an entire architecture, but it's still a
> worthy reduction in lines of code.

Since most of the changes are in scsi or target, should I take this
series through my tree?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-15  1:57 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Netdev, aaron.f.brown
In-Reply-To: <20180614120231.0a72bd5f.cohuck@redhat.com>

Thank you for sharing your thoughts, Cornelia. With questions below, I
think you raised really good points, some of which I don't have answer
yet and would also like to explore here.

First off, I don't want to push the discussion to the extreme at this
point, or sell anything about having QEMU manage everything
automatically. Don't get me wrong, it's not there yet. Let's don't
assume we are tied to a specific or concerte solution. I think the key
for our discussion might be to define or refine the boundary between
VM and guest,  e.g. what each layer is expected to control and manage
exactly.

In my view, there might be possibly 3 different options to represent
the failover device conceipt to QEMU and libvirt (or any upper layer
software):

a. Seperate device: in this model, virtio and passthough remains
separate devices just as today. QEMU exposes the standby feature bit
for virtio, and publish status/event around the negotiation process of
this feature bit for libvirt to react upon. Since Libvirt has the
pairing relationship itself, maybe through MAC address or something
else, it can control the presence of primary by hot plugging or
unplugging the passthrough device, although it has to work tightly
with virtio's feature negotation process. Not just for migration but
also various corner scenarios (driver/feature ok, device reset,
reboot, legacy guest etc) along virtio's feature negotiation.

b. Coupled device: in this model, virtio and passthough devices are
weakly coupled using some group ID, i.e. QEMU match the passthough
device for a standby virtio instance by comparing the group ID value
present behind each device's bridge. Libvirt provides QEMU the group
ID for both type of devices, and only deals with hot plug for
migration, by checking some migration status exposed (e.g. the feature
negotiation status on the virtio device) by QEMU. QEMU manages the
visibility of the primary in guest along virtio's feature negotiation
process.

c. Fully combined device: in this model, virtio and passthough devices
are viewed as a single VM interface altogther. QEMU not just controls
the visibility of the primary in guest, but can also manage the
exposure of the passthrough for migratability. It can be like that
libvirt supplies the group ID to QEMU. Or libvirt does not even have
to provide group ID for grouping the two devices, if just one single
combined device is exposed by QEMU. In either case, QEMU manages all
aspect of such internal construct, including virtio feature
negotiation, presence of the primary, and live migration.

It looks like to me that, in your opinion, you seem to prefer go with
(a). While I'm actually okay with either (b) or (c). Do I understand
your point correctly?

The reason that I feel that (a) might not be ideal, just as Michael
alluded to (quoting below), is that as management stack, it really
doesn't need to care about the detailed process of feature negotiation
(if we view the guest presence of the primary as part of feature
negotiation at an extended level not just virtio). All it needs to be
done is to hand in the required devices to QEMU and that's all. Why do
we need to addd various hooks, events for whichever happens internally
within the guest?

''
Primary device is added with a special "primary-failover" flag.
A virtual machine is then initialized with just a standby virtio
device. Primary is not yet added.

Later QEMU detects that guest driver device set DRIVER_OK.
It then exposes the primary device to the guest, and triggers
a device addition event (hot-plug event) for it.

If QEMU detects guest driver removal, it initiates a hot-unplug sequence
to remove the primary driver.  In particular, if QEMU detects guest
re-initialization (e.g. by detecting guest reset) it immediately removes
the primary device.
''

and,

''
management just wants to give the primary to guest and later take it back,
it really does not care about the details of the process,
so I don't see what does pushing it up the stack buy you.

So I don't think it *needs* to be done in libvirt. It probably can if you
add a bunch of hooks so it knows whenever vm reboots, driver binds and
unbinds from device, and can check that backup flag was set.
If you are pushing for a setup like that please get a buy-in
from libvirt maintainers or better write a patch.
''

Let me know if my clarifications sound clear to you now.


Thanks,
-Siwei


On Thu, Jun 14, 2018 at 3:02 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> I've been pointed to this discussion (which I had missed previously)
> and I'm getting a headache. Let me first summarize how I understand how
> this feature is supposed to work, then I'll respond to some individual
> points.
>
> The basic idea is to enable guests to migrate seamlessly, while still
> making it possible for them to use a passed-through device for more
> performance etc. The means to do so is to hook a virtio-net device
> together with a network device passed through via vfio. The
> vfio-handled device is there for performance, the virtio device for
> migratability. We have a new virtio feature bit for that which needs to
> be negotiated for that 'combined' device to be available. We have to
> consider two cases:
>
> - Older guests that do not support the new feature bit. We presume that
>   those guests will be confused if they get two network devices with
>   the same MAC, so the idea is to not show them the vfio-handled device
>   at all.
> - Guests that negotiate the feature bit. We only know positively that
>   they (a) know the feature bit and (b) are prepared to handle the
>   consequences of negotiating it after they set the FEATURES_OK bit.
>   This is therefore the earliest point in time that the vfio-handled
>   device should be visible or usable for the guest.
>
> On Wed, 13 Jun 2018 18:02:01 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>> >>
>> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> >>>
>> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>> >>>>
>> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>> >>>>>
>> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>> >>>>>>
>> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> >>>>>>>
>> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>> >>>>>>> device to
>> >>>>>>> act as a standby for another device with the same MAC address.
>> >>>>>>>
>> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
>> >>>>>>> feature 'true'
>> >>>>>>> by default as i am using libvirt to start the VMs.
>> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>> >>>>>>> via libvirt
>> >>>>>>> XML file?
>> >>>>>>>
>> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >>>>>>
>> >>>>>> So I do not think we can commit to this interface: we
>> >>>>>> really need to control visibility of the primary device.
>> >>>>>
>> >>>>> The problem is legacy guest won't use primary device at all if we do
>> >>>>> this.
>> >>>>
>> >>>> And that's by design - I think it's the only way to ensure the
>> >>>> legacy guest isn't confused.
>> >>>
>> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
>> >>> the visibility
>> >>> of the primary device. The primary device may not be specified as an
>> >>> argument to Qemu. It
>> >>> may be plugged in later.
>> >>> The cloud service provider is providing a feature that enables low
>> >>> latency datapath and live
>> >>> migration capability.
>> >>> A tenant can use this feature only if he is running a VM that has
>> >>> virtio-net with failover support.
>
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
>
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
>
> Should it be possible to add a virtio/vfio pair later on?
>
>> >>
>> >> Well live migration is there already. The new feature is low latency
>> >> data path.
>> >
>> >
>> > we get live migration with just virtio.  But I meant live migration with VF
>> > as
>> > primary device.
>> >
>> >>
>> >> And it's the guest that needs failover support not the VM.
>> >
>> >
>> > Isn't guest and VM synonymous?
>
> I think we need to be really careful to not mix up the two: The VM
> contains the definitions, but it is up to the guest how it uses them.
>
>> >
>> >
>> >>
>> >>
>> >>> I think Qemu should check if guest virtio-net supports this feature and
>> >>> provide a mechanism for
>> >>> an upper layer indicating if the STANDBY feature is successfully
>> >>> negotiated or not.
>> >>> The upper layer can then decide if it should hot plug a VF with the same
>> >>> MAC and manage the 2 links.
>> >>> If VF is successfully hot plugged, virtio-net link should be disabled.
>> >>
>> >> Did you even talk to upper layer management about it?
>> >> Just list the steps they need to do and you will see
>> >> that's a lot of machinery to manage by the upper layer.
>> >>
>> >> What do we gain in flexibility? As far as I can see the
>> >> only gain is some resources saved for legacy VMs.
>> >>
>> >> That's not a lot as tenant of the upper layer probably already has
>> >> at least a hunch that it's a new guest otherwise
>> >> why bother specifying the feature at all - you
>> >> save even more resources without it.
>> >>
>> >
>> > I am not all that familiar with how Qemu manages network devices. If we can
>> > do all the
>> > required management of the primary/standby devices within Qemu, that is
>> > definitely a better
>> > approach without upper layer involvement.
>>
>> Right. I would imagine in the extreme case the upper layer doesn't
>> have to be involved at all if QEMU manages all hot plug/unplug logic.
>> The management tool can supply passthrough device and virtio with the
>> same group UUID, QEMU auto-manages the presence of the primary, and
>> hot plug the device as needed before or after the migration.
>
> I do not really see how you can manage that kind of stuff in QEMU only.
> Have you talked to some libvirt folks? (And I'm not sure what you refer
> to with 'group UUID'?)
>
> Also, I think you need to make a distinction between hotplugging a
> device and making it visible to the guest. What does 'hotplugging' mean
> here? Adding it to the VM definition? Would it be enough to have the
> vfio-based device not operational until the virtio feature bit has been
> negotiated?
>
> What happens if the guest does not use the vfio-based device after it
> has been made available? Will you still disable the virtio-net link?
> (All that link handling definitely sounds like a task for libvirt or
> the like.)
>
> Regarding hot(un)plugging during migration, I think you also need to
> keep in mind that different architectures/busses have different
> semantics there. Something that works if there's an unplug handshake may
> not work on a platform with surprise removal.
>
> Have you considered guest agents? All of this is punching through
> several layers, and I'm not sure if that is a good idea.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v5 2/3] x86/asm: add _ASM_ARG* constants for argument registers to <asm/asm.h>
From: H. Peter Anvin @ 2018-06-15  0:17 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Kate Stewart, andrea.parri, linux-efi, brijesh.singh, J. Kiszka,
	Josh Poimboeuf, Will Deacon, jarkko.sakkinen, virtualization,
	Masahiro Yamada, Manoj Gupta, boris.ostrovsky, Thiebaud Weksteen,
	mawilcox, x86, akataria, Greg Hackmann, mingo, Alistair Strachan,
	David Rientjes, geert, thomas.lendacky, Arnd Bergmann,
	Linux Kbuild mailing list, Philippe Ombredanne, rostedt, acme
In-Reply-To: <CAKwvOdngs8k_T=nDKh8JYdAwwFoJRcCU0bZu0tJHQ+keZ5+RFQ@mail.gmail.com>

On 06/14/18 13:59, Nick Desaulniers wrote:
> On Thu, Jun 14, 2018 at 1:48 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On 06/13/18 14:05, Nick Desaulniers wrote:
>>> From: "H. Peter Anvin" <hpa@linux.intel.com>
>>>
>>> i386 and x86-64 uses different registers for arguments; make them
>>> available so we don't have to #ifdef in the actual code.
>>>
>>> Native size and specified size (q, l, w, b) versions are provided.
>>>
>>> Suggested-by: Sedat Dilek <sedat.dilek@gmail.com>
>>> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
>>> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
>>
>> I still object to Suggested-by: here.  Sedat did a correction, which is
>> a Reviewed-by:, but unless I'm completely out to sea, there was no
>> suggestion on Sedat's part of this; and I had certainly not seen it when
>> I wrote the code.
> 
> I'm fine with changing it from a Suggested-by to a Reviewed-by.  Can
> you do that when you apply the set, or should I send a v6?
> 

I'm not handling patch mechanics for x86 at the moment.

	-hpa

^ permalink raw reply

* Re: [PATCH v5 2/3] x86/asm: add _ASM_ARG* constants for argument registers to <asm/asm.h>
From: H. Peter Anvin @ 2018-06-14 20:47 UTC (permalink / raw)
  To: Nick Desaulniers, akpm, mingo, tglx
  Cc: kstewart, andrea.parri, linux-efi, brijesh.singh, jan.kiszka,
	will.deacon, jarkko.sakkinen, linux-kernel, yamada.masahiro,
	manojgupta, akataria, tweek, mawilcox, x86, ghackmann, mka, geert,
	rientjes, aryabinin, thomas.lendacky, arnd, linux-kbuild,
	keescook, rostedt, acme, caoj.fnst, jpoimboe, sedat.dilek,
	boris.ostrovsky, virtualization, jgross, michal.lkml, tstellar,
	gregkh, ard.biesheuvel, astrachan, mjg59
In-Reply-To: <20180613210518.113983-3-ndesaulniers@google.com>

On 06/13/18 14:05, Nick Desaulniers wrote:
> From: "H. Peter Anvin" <hpa@linux.intel.com>
> 
> i386 and x86-64 uses different registers for arguments; make them
> available so we don't have to #ifdef in the actual code.
> 
> Native size and specified size (q, l, w, b) versions are provided.
> 
> Suggested-by: Sedat Dilek <sedat.dilek@gmail.com>
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

I still object to Suggested-by: here.  Sedat did a correction, which is
a Reviewed-by:, but unless I'm completely out to sea, there was no
suggestion on Sedat's part of this; and I had certainly not seen it when
I wrote the code.

	-hpa

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-14 12:50 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, aaron.f.brown, Jiri Pirko,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, Netdev,
	virtualization
In-Reply-To: <CADGSJ213f8tJpNXuOhv8qRew-Y5VZAwA+srNMrLZYnKdVGLdAA@mail.gmail.com>

On Wed, Jun 13, 2018 at 06:02:01PM -0700, Siwei Liu wrote:
> >> And it's the guest that needs failover support not the VM.
> >
> >
> > Isn't guest and VM synonymous?

Guest is whatever software is running on top of the hypervisor.

The virtual machine is the interface between the two.

-- 
MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-14 10:02 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Netdev, aaron.f.brown
In-Reply-To: <CADGSJ213f8tJpNXuOhv8qRew-Y5VZAwA+srNMrLZYnKdVGLdAA@mail.gmail.com>

I've been pointed to this discussion (which I had missed previously)
and I'm getting a headache. Let me first summarize how I understand how
this feature is supposed to work, then I'll respond to some individual
points.

The basic idea is to enable guests to migrate seamlessly, while still
making it possible for them to use a passed-through device for more
performance etc. The means to do so is to hook a virtio-net device
together with a network device passed through via vfio. The
vfio-handled device is there for performance, the virtio device for
migratability. We have a new virtio feature bit for that which needs to
be negotiated for that 'combined' device to be available. We have to
consider two cases:

- Older guests that do not support the new feature bit. We presume that
  those guests will be confused if they get two network devices with
  the same MAC, so the idea is to not show them the vfio-handled device
  at all.
- Guests that negotiate the feature bit. We only know positively that
  they (a) know the feature bit and (b) are prepared to handle the
  consequences of negotiating it after they set the FEATURES_OK bit.
  This is therefore the earliest point in time that the vfio-handled
  device should be visible or usable for the guest.

On Wed, 13 Jun 2018 18:02:01 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:  
> >>
> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:  
> >>>
> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:  
> >>>>
> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:  
> >>>>>
> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:  
> >>>>>>
> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:  
> >>>>>>>
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
> >>>>>>> device to
> >>>>>>> act as a standby for another device with the same MAC address.
> >>>>>>>
> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
> >>>>>>> feature 'true'
> >>>>>>> by default as i am using libvirt to start the VMs.
> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
> >>>>>>> via libvirt
> >>>>>>> XML file?
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>  
> >>>>>>
> >>>>>> So I do not think we can commit to this interface: we
> >>>>>> really need to control visibility of the primary device.  
> >>>>>
> >>>>> The problem is legacy guest won't use primary device at all if we do
> >>>>> this.  
> >>>>
> >>>> And that's by design - I think it's the only way to ensure the
> >>>> legacy guest isn't confused.  
> >>>
> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
> >>> the visibility
> >>> of the primary device. The primary device may not be specified as an
> >>> argument to Qemu. It
> >>> may be plugged in later.
> >>> The cloud service provider is providing a feature that enables low
> >>> latency datapath and live
> >>> migration capability.
> >>> A tenant can use this feature only if he is running a VM that has
> >>> virtio-net with failover support.  

So, do you know from the outset that there will be such a coupled
device? I.e., is it a property of the VM definition?

Can there be a 'prepared' virtio-net device that presents the STANDBY
feature even if there currently is no vfio-handled device available --
but making it possible to simply hotplug that device later?

Should it be possible to add a virtio/vfio pair later on?

> >>
> >> Well live migration is there already. The new feature is low latency
> >> data path.  
> >
> >
> > we get live migration with just virtio.  But I meant live migration with VF
> > as
> > primary device.
> >  
> >>
> >> And it's the guest that needs failover support not the VM.  
> >
> >
> > Isn't guest and VM synonymous?

I think we need to be really careful to not mix up the two: The VM
contains the definitions, but it is up to the guest how it uses them.

> >
> >  
> >>
> >>  
> >>> I think Qemu should check if guest virtio-net supports this feature and
> >>> provide a mechanism for
> >>> an upper layer indicating if the STANDBY feature is successfully
> >>> negotiated or not.
> >>> The upper layer can then decide if it should hot plug a VF with the same
> >>> MAC and manage the 2 links.
> >>> If VF is successfully hot plugged, virtio-net link should be disabled.  
> >>
> >> Did you even talk to upper layer management about it?
> >> Just list the steps they need to do and you will see
> >> that's a lot of machinery to manage by the upper layer.
> >>
> >> What do we gain in flexibility? As far as I can see the
> >> only gain is some resources saved for legacy VMs.
> >>
> >> That's not a lot as tenant of the upper layer probably already has
> >> at least a hunch that it's a new guest otherwise
> >> why bother specifying the feature at all - you
> >> save even more resources without it.
> >>  
> >
> > I am not all that familiar with how Qemu manages network devices. If we can
> > do all the
> > required management of the primary/standby devices within Qemu, that is
> > definitely a better
> > approach without upper layer involvement.  
> 
> Right. I would imagine in the extreme case the upper layer doesn't
> have to be involved at all if QEMU manages all hot plug/unplug logic.
> The management tool can supply passthrough device and virtio with the
> same group UUID, QEMU auto-manages the presence of the primary, and
> hot plug the device as needed before or after the migration.

I do not really see how you can manage that kind of stuff in QEMU only.
Have you talked to some libvirt folks? (And I'm not sure what you refer
to with 'group UUID'?)

Also, I think you need to make a distinction between hotplugging a
device and making it visible to the guest. What does 'hotplugging' mean
here? Adding it to the VM definition? Would it be enough to have the
vfio-based device not operational until the virtio feature bit has been
negotiated?

What happens if the guest does not use the vfio-based device after it
has been made available? Will you still disable the virtio-net link?
(All that link handling definitely sounds like a task for libvirt or
the like.)

Regarding hot(un)plugging during migration, I think you also need to
keep in mind that different architectures/busses have different
semantics there. Something that works if there's an unplug handshake may
not work on a platform with surprise removal.

Have you considered guest agents? All of this is punching through
several layers, and I'm not sure if that is a good idea.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14  1:02 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexander Duyck, virtio-dev, aaron.f.brown, Jiri Pirko,
	Michael S. Tsirkin, Jakub Kicinski, Netdev, qemu-devel,
	virtualization
In-Reply-To: <b9c88515-cf41-2a9a-078e-9c9e5adbbf14@intel.com>

On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>>>
>>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>>> device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>>>> feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>>>> via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>
>>>>>> So I do not think we can commit to this interface: we
>>>>>> really need to control visibility of the primary device.
>>>>>
>>>>> The problem is legacy guest won't use primary device at all if we do
>>>>> this.
>>>>
>>>> And that's by design - I think it's the only way to ensure the
>>>> legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio.  But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.

Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all hot plug/unplug logic.
The management tool can supply passthrough device and virtio with the
same group UUID, QEMU auto-manages the presence of the primary, and
hot plug the device as needed before or after the migration.

-Siwei

>
>
>>
>>
>>>>> How about control the visibility of standby device?
>>>>>
>>>>> Thanks
>>>>
>>>> standy the always there to guarantee no downtime.
>>>>
>>>>>> However just for testing purposes, we could add a non-stable
>>>>>> interface "x-standby" with the understanding that as any
>>>>>> x- prefix it's unstable and will be changed down the road,
>>>>>> likely in the next release.
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>>     2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>>                          true),
>>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> +                      false),
>>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>>     };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                                          * Steering */
>>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23       /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for
>>>>>>> another device
>>>>>>> +                                         * with the same MAC.
>>>>>>> +                                         */
>>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63        /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>>> --
>>>>>>> 2.14.3
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14  0:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, qemu-devel, Samudrala, Sridhar,
	virtualization
In-Reply-To: <20180612143554-mutt-send-email-mst@kernel.org>

On Tue, Jun 12, 2018 at 4:47 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
>> The thing is cloud service provider might prefer sticking to the same
>> level of service agreement (SLA) of keeping VF over migration,
>
> That requirement is trivially satisfied by just a single VF :) If your
> SLA does not require live migration, you should do just that.

The requirement is enable live migration by default without getting
user too much attention. Migration should be attempted if guest is
live migratable. If not, present a single VF to legacy guest.

You seem to think it's not a valid use case? Or impossible to do so?
What if I have something in mind that can support my theory as well?

-Siwei

>
> --
> MST

^ permalink raw reply

* [PULL v2] vhost, virtio
From: Michael S. Tsirkin @ 2018-06-13 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: ohad, kevin, kvm, mst, netdev, linux-remoteproc, linux-kernel,
	stable, bjorn.andersson, syzbot+87cfa083e727a224754b,
	virtualization

Page hints are reworked - I dropped them for now.

The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:

  Linux 4.17 (2018-06-03 14:15:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 2eb98105f8c7f4b867f7f714a998f5b8c1bb009b:

  virtio: update the comments for transport features (2018-06-12 04:59:29 +0300)

----------------------------------------------------------------
virtio, vhost: features, fixes

VF support for virtio.
DMA barriers for virtio strong barriers.
Bugfixes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (2):
      virtio_ring: switch to dma_XX barriers for rpmsg
      vhost: fix info leak due to uninitialized memory

Tiwei Bie (2):
      virtio_pci: support enabling VFs
      virtio: update the comments for transport features

 drivers/vhost/vhost.c              |  3 +++
 drivers/virtio/virtio_pci_common.c | 30 ++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 14 ++++++++++++++
 include/linux/virtio_ring.h        |  4 ++--
 include/uapi/linux/virtio_config.h | 16 ++++++++++++----
 5 files changed, 61 insertions(+), 6 deletions(-)

^ permalink raw reply

* [RFC] virtio-iommu version 0.7
From: Jean-Philippe Brucker @ 2018-06-13 17:58 UTC (permalink / raw)
  To: virtio-dev, virtualization; +Cc: eric.auger, tnowicki

This is version 0.7 of the virtio-iommu specification. The diff from 0.6,
included below, is fairly small and consists of the following changes:

* Address comments from 0.6, rework bits of the implementation notes.

* Change resv_mem parameters to be consistent with the rest of the
  spec.

* Add the MMIO flag to MAP requests. At the moment it is used by
  mapped MSIs mostly for completeness, but will be important for IDENTITY
  resv_mem regions that next versions introduce. Please find more
  information about this on the v0.6 thread [1].

  For mapped MSIs, the MMIO flag allows host userspace to easily catch MSI
  maps, and route the rest to VFIO. Without it the host needs to check
  addresses against the MSI doorbell on every map request, before passing
  them down to VFIO. This version makes the flag mandatory in MMIO map
  requests. It could be useful on the unmap request as well but adding
  such flag seems unintuitive (and changes unmap semantics, no way to do
  unmap-all anymore). Adding both flags showed barely any performance
  improvement on my kvmtool prototype. Please let me know if you notice
  anything interesting on the QEMU and vhost prototypes, if you get around
  comparing this.

* A notable change to protection semantics: write-only may imply
  read-write (but read-only never implies read-write). We need this
  behavior because some architectures do not support write-only mappings
  and the corresponding host drivers don't reject write-only map requests.
  I chose to follow POSIX mmap() semantics for this ("if the application
  requests only PROT_WRITE, the implementation may also allow read
  access.")

Sources: git://linux-arm.org/virtio-iommu.git viommu/v0.7
Docs:    http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.pdf
         http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.html
Diff:    http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-pdf-diff-v0.6-v0.7.pdf

I'll send the updated Linux driver, which you can find on my
virtio-iommu/devel branch, after the merge window.

Next RFCs should be more interesting, with support for page table sharing
and some optimizations. It is progressing nicely but there isn't any rush,
since we're currently discussing the host-side interface in VFIO, which
virtio-iommu will try to follow closely. Since I'm working on a few
related projects, expect a similar cadence (around four months) for next
versions.

[1] https://www.spinics.net/lists/linux-virtualization/msg32628.html


--- >8 ---
diff --git a/MSI.tex b/MSI.tex
index fb54af4..7758fd1 100644
--- a/MSI.tex
+++ b/MSI.tex
@@ -11,22 +11,18 @@ number and destination processing units. Additional devices between the
 endpoint and the IRQ chip may translate the doorbell address, the IRQ
 number and verify that the endpoint is allowed to send this interrupt.
 
-Different platforms implement IRQ remapping and routing in different ways.
-This section describes three ways of dealing with Message Signaled
-Interrupts in virtio-iommu devices and drivers.
-
-In simplest systems, the endpoint writes the plain interrupt number to the
+Different platforms implement IRQ remapping and routing in different ways. In
+simplest systems, the endpoint writes the plain interrupt number to the
 doorbell, and the IRQ chip signals the interruption to destination CPUs
-programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
-describes how to implement a simple system with virtio-iommu. Section
-\ref{sec:viommu / MSI / Address translation} describes the added complexity
-(from the host point of view) of translating the IRQ chip doorbell.
+programmed by software. Sections \ref{sec:viommu / MSI / Address bypass} and
+\ref{sec:viommu / MSI / Address translation} describe two ways of implementing
+MSIs with virtio-iommu.
 
 More complex systems add a level of indirection in the MSI message. The address
 or data contains an index into a remapping table, that describes interrupt
 delivery in details and is programmed by software either into the IRQ chip or
-the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
-the remapping feature of virtio-iommu.
+the IOMMU. This is shown in section \ref{sec:viommu / MSI / IRQ remapping} but
+isn't yet supported by virtio-iommu.
 
 \subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}
 
@@ -66,8 +62,8 @@ struct __attribute__((packed)) {
 	},
 	.mem = {
 		.subtype	= VIRTIO_IOMMU_RESV_MEM_T_MSI,
-		.addr		= 0xfee00000,
-		.size		= 0x00100000,
+		.start		= 0xfee00000,
+		.end		= 0xfeefffff,
 	},
 };
 \end{lstlisting}
@@ -90,13 +86,20 @@ translation can only forbid an endpoint from sending interrupts. If it is
 allowed to send MSIs, the endpoint can easily spoof another endpoint by
 sending interrupts that were not assigned to it.
 
-From the virtio-iommu point of view, this is the simplest to implement, because
-there is no special address range. The whole address space is treated the same
-by the virtio-iommu device.
-
-However, this mode of operations may add significant complexity in the host
-implementation.
-
+From the virtio-iommu point of view, this is the simplest to implement,
+because there is no special address range and no need for a PROBE request.
+The whole address space is treated the same way by the virtio-iommu
+device.
+
+However, this mode of operations may add some complexity in the host
+implementation. To setup MSIs, the guest writes an IOVA into the MSI-X
+table of the PCI endpoint. One possible host implementation emulates IRQ
+chips and captures requests that map virtual addresses to doorbell
+registers. These requests have the VIRTIO_IOMMU_MAP_F_MMIO flag set,
+making them easy to differentiate from requests that target normal memory
+and are forwarded to the physical IOMMU driver. The host also traps
+accesses to the endpoint's MSI-X table, and creates IRQ routes by
+translating the written IOVA into the corresponding doorbell.
 
 \subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}
 
diff --git a/assignment.tex b/assignment.tex
index 4e26a9c..e372713 100644
--- a/assignment.tex
+++ b/assignment.tex
@@ -35,8 +35,9 @@ struct __attribute__((packed)) {
 		.length		= sizeof(resv.mem),
 	},
 	.mem = {
-		.addr		= 0x08000000,
-		.size		= 0x00100000,
+		.subtype	= VIRTIO_IOMMU_RESV_MEM_T_RESERVED,
+		.start		= 0x08000000,
+		.end		= 0x080fffff,
 	},
 };
 \end{lstlisting}
diff --git a/device-operations.tex b/device-operations.tex
index 40c68cf..7af6fb0 100644
--- a/device-operations.tex
+++ b/device-operations.tex
@@ -294,15 +294,6 @@ If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered, the driver SHOULD
 NOT send requests with \field{domain} greater than the size described by
 \field{domain_bits}.
 
-% We mandate truncation to allow a future extension X.Y that would store
-% information in addresses and domain IDs.
-%
-% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
-% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
-% garbage. But this extension would be negotiated with a feature bit
-% anyway. If it's not, then device must assume that driver is 0.2 and must
-% keep truncating the fields.
-
 The driver SHOULD NOT use multiple descriptor chains for a single request.
 
 \devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
@@ -321,12 +312,14 @@ to zero.
 The device MUST ignore reserved fields of the head and the tail of a
 request.
 
-If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
-truncate the range described by \field{virt_start} and \field{virt_end} in
-requests to fit in the range described by \field{input_range}.
+If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered and the range
+described by fields \field{virt_start} and \field{virt_end} doesn't fit in
+the range described by \field{input_range}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
 
-If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device MUST ignore bits
-above \field{domain_bits} in field \field{domain} of requests.
+If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered and bits above
+\field{domain_bits} are set in field \field{domain}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
 
 \subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
 
@@ -415,7 +408,9 @@ struct virtio_iommu_req_detach {
 \end{lstlisting}
 
 Detach an endpoint from its domain. When this request completes, the
-endpoint cannot access any mapping from that domain anymore.
+endpoint cannot access any mapping from that domain anymore. If feature
+VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
+guest-physical address space once this request completes.
 
 After all endpoints have been successfully detached from a domain, it
 ceases to exist and its ID can be reused by the driver for another domain.
@@ -457,6 +452,7 @@ struct virtio_iommu_req_map {
 #define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
 #define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)
 #define VIRTIO_IOMMU_MAP_F_EXEC		(1 << 2)
+#define VIRTIO_IOMMU_MAP_F_MMIO		(1 << 3)
 \end{lstlisting}
 
 Map a range of virtually-contiguous addresses to a range of
@@ -478,15 +474,23 @@ guest-physical addresses for use by the host (for instance MSI doorbells).
 Guest physical boundaries are set by the host using a firmware mechanism
 outside the scope of this specification.
 
-\begin{note}
-On flags: it is unlikely that all possible combinations of flags will be
-supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might
-be invalid. We do not have a way to advertise supported and implicit (for
-instance $W \rightarrow R$) flags or combination thereof for the moment,
-you are free to send any suggestions for describing this. Please keep in
-mind that we might soon want to add more flags, such as privileged,
-device, transient, shared, etc. (whatever these would mean).
-\end{note}
+Availability and allowed combinations of \field{flags} depend of the
+underlying IOMMU architectures. VIRTIO_IOMMU_MAP_F_READ and
+VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
+sometimes implied. VIRTIO_IOMMU_MAP_F_EXEC might not be available. In
+addition combinations such as "WRITE and not READ" or "WRITE and EXEC"
+might not be supported.
+
+The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
+lag. It may be used, for example, to map Message Signaled Interrupt
+doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
+trigger interrupts the endpoint performs a direct memory write to another
+peripheral, the IRQ chip. Since it is a signal, the write must not be
+buffered, elided, or combined with other writes by the memory
+interconnect. The precise meaning of the MMIO flag depends on the
+underlying memory architecture (for example on Armv8-A it corresponds to
+the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
+isn't required to support the MMIO flag.
 
 This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
 negotiated.
@@ -497,6 +501,11 @@ The driver SHOULD set undefined \field{flags} bits to zero.
 
 \field{virt_end} MUST be strictly greater than \field{virt_start}.
 
+The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
+range corresponds to memory-mapped device registers. The physical range
+SHOULD have a single memory type: either normal memory or memory-mapped
+I/O.
+
 \devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
 
 If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
@@ -510,6 +519,15 @@ here, because the driver might be attempting to map with special flags
 that the device doesn't recognize. Creating the mapping with incompatible
 flags may introduce a security hazard.}
 
+If a flag or combination of flag isn't supported, the device MAY set the
+request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+
+The device MUST NOT allow writes to a range mapped without the
+VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
+does not support write-only mappings, the device MAY allow reads to a
+range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
+VIRTIO_IOMMU_MAP_F_READ.
+
 If \field{domain} does not exist, the device SHOULD set the request
 \field{status} to VIRTIO_IOMMU_S_NOENT.
 
@@ -730,21 +748,24 @@ allocated by the driver, or that are special.
 struct virtio_iommu_probe_resv_mem {
 	u8	subtype;
 	u8	reserved[3];
-	le64	addr;
-	le64	size;
+	le64	start;
+	le64	end;
 };
 \end{lstlisting}
 
-Fields \field{addr} and \field{size} describe the range of reserved
-virtual addresses. \field{subtype} may be one of:
+Fields \field{start} and \field{end} describe the range of reserved virtual
+addresses. \field{subtype} may be one of:
 
 \begin{description}
   \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
-    Accesses to virtual addresses in this region are not translated by the
-    device. They may either be aborted by the device (or the underlying
-    IOMMU), bypass it, or never even reach it. The guest should neither
-    use these virtual addresses in a MAP request nor instruct endpoints to
-    perform DMA on them.
+    Accesses to virtual addresses in this region have undefined behavior.
+    They may be aborted by the device, bypass it, or never even reach it.
+    The region may also be used for host mappings, for example Message
+    Signaled Interrupts (see \ref{sec:viommu / Hardware device
+    assignment}).
+
+    The guest should neither use these virtual addresses in a MAP request
+    nor instruct endpoints to perform DMA on them.
 
   \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
     This region is a doorbell for Message Signaled Interrupts (MSIs). It

^ permalink raw reply related

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-13 14:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: robh, pawel.moll, Tom Lendacky, cohuck, Ram Pai, linux-kernel,
	virtualization, Christoph Hellwig, joe, Rustad, Mark D,
	Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <59e60715f27b10bc6816193eaf324824eff69c46.camel@kernel.crashing.org>

On Mon, Jun 11, 2018 at 01:34:50PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
> > 
> > > However if the administrator
> > > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > > flag, virtio will not be able to pass control to the DMA ops associated
> > > with the virtio devices. Which means, we have no opportunity to share
> > > the I/O buffers with the hypervisor/qemu.
> > > 
> > > How do you suggest, we handle this case?
> > 
> > As step 1, ignore it as a user error.
> 
> Ugh ... not again. Ram, don't bring that subject back we ALREADY
> addressed it, and requiring the *user* to do special things is just
> utterly and completely wrong.
> 
> The *user* has no bloody idea what that stuff is, will never know to
> set whatver magic qemu flag etc... The user will just start a a VM
> normally and expect things to work. Requiring the *user* to know things
> like that iommu virtio flag is complete nonsense.

You should consider teaching QEMU that the platform has support for the
ultravisor thing so it can set flags for you.

That's true even if something else is done for virtio - if you don't
have the capability right now you will come to regret it, host userspace
is just fundamentally in charge of control-path enumeration in the KVM
stack, I understand you want to take some of it away for security but
trying to hide things from QEMU completely is a mistake IMHO.

Just my $.02.

> If by "user" you mean libvirt, then you are now requesting about 4 or 5
> different projects to be patched to add speical cases for something
> they know nothing about and is completely irrelevant, while it can be
> entirely addressed with a 1-liner in virtio kernel side to allow the
> arch to plumb alternate DMA ops.
> 
> So for some reason you seem to be dead set on a path that leads to
> mountain of user pain, changes to many different projects and overall
> havok while there is a much much simpler and elegant solution at hand
> which I described (again) in the response to Ram I sent about 5mn ago.

What you sent to Ram sounds OK to me - I only hope we do all mean
the same thing because the 3 paragraphs above are all
about the libvirt/QEMU split mess which linux or virtio
should not really care about.

And I'd like to stress that direct path isn't merely legacy.

It's a common sense approach that when you have a hypervisor doing
things like taking care of cache coherency, you do not want to do them
in the guest as well. And the same guest can use a pass-through device
where it does need to do all the coherency mess, so while it's quite
possible to build a platform-specific way to tell guests which devices
need which kind of treatment, people understandably chose to do it in a
portable way through the virtio device.

I understand you guys are working on a new project that is going to use
bounce buffers anyway so what do you care about optimizations but we
can't just slow everyone else down for your benefit.

> > Further you can for example add per-device quirks in virtio so it can be
> > switched to dma api. make extra decisions in platform code then.

Isn't above exactly what you sent to Ram?

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox