* RE: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device
From: Li, Liang Z @ 2016-03-04 2:29 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: ehabkost@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
rth@twiddle.net
In-Reply-To: <20160303125651.GA21382@redhat.com>
> Subject: Re: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon
> device
>
> On Thu, Mar 03, 2016 at 06:44:26PM +0800, Liang Li wrote:
> > Extend the virtio balloon device to support a new feature. This new
> > feature can help to get the guest's free pages information, which can be
> > used for live migration optimization.
> >
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
>
> I don't understand why we need a new interface.
> Balloon already sends free pages to host.
> Just teach host to skip these pages.
>
I just reuse the current virtio-balloon implementation; inventing a new
virtio device would be more complicated.
Actually, the balloon does not need to be inflated before live migration, and in
that case the host has no information about the guest's free pages. That is why I
added a new interface.
> Maybe instead of starting with code, you should send a high level description
> to the virtio tc for consideration?
>
> You can do it through the mailing list or using the web form:
> http://www.oasis-open.org/committees/comments/form.php?wg_abbrev=virtio
>
Thanks for your information and suggestion.
Liang
* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-04 1:52 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160303174615.GF2115@work-vm>
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
>
> * Liang Li (liang.z.li@intel.com) wrote:
> > The current QEMU live migration implementation marks all of the
> > guest's RAM pages as dirty in the RAM bulk stage. All of these pages
> > are then processed, which takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, the content of free pages does not
> > matter. We can make use of this fact and skip processing the free
> > pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
> > network traffic significantly, and noticeably speeds up the live
> > migration process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the RAM bulk stage. This makes
> > the live migration process much more efficient.
>
> Hi,
> An interesting solution; I know a few different people have been looking at
> how to speed up ballooned VM migration.
>
Ooh, different solutions for the same purpose, and both based on the balloon.
> I wonder if it would be possible to avoid the kernel changes by parsing
> /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> pages in the guest ram, would it achieve the same result?
>
Detecting only the unmapped/zero-mapped pages is not enough. In a situation
like Case 2, pages that the guest has used and then freed are still mapped on
the host, so this approach cannot achieve the same result.
> > This RFC version doesn't take post-copy and RDMA into consideration;
> > both of them may be able to benefit from this PV solution with some
> > extra modifications.
>
> For postcopy to be safe, you would still need to send a message to the
> destination telling it that there were zero pages, otherwise the destination
> can't tell if it's supposed to request the page from the source or treat the
> page as zero.
>
> Dave
I will consider this later, thanks, Dave.
Liang
>
> >
> > Performance data
> > ================
> >
> > Test environment:
> >
> > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> > Host RAM: 64GB
> > Host Linux Kernel: 4.2.0
> > Host OS: CentOS 7.1
> > Guest Linux Kernel: 4.5-rc6
> > Guest OS: CentOS 6.6
> > Network: X540-AT2 with 10 Gigabit connection
> > Guest RAM: 8GB
> >
> > Case 1: Idle guest that has just booted:
> > ============================================
> >                     | original |   pv
> > --------------------------------------------
> > total time(ms)      |   1894   |   421
> > --------------------------------------------
> > transferred ram(KB) |  398017  |  353242
> > ============================================
> >
> > Case 2: The guest has run a memory-consuming workload, which is
> > terminated just before live migration.
> > ============================================
> >                     | original |   pv
> > --------------------------------------------
> > total time(ms)      |   7436   |   552
> > --------------------------------------------
> > transferred ram(KB) |  8146291 |  361375
> > ============================================
> >
* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-04 1:35 UTC (permalink / raw)
To: Roman Kagan
Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
rth@twiddle.net
In-Reply-To: <20160303135833.GA9100@rkaganb.sw.ru>
> On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> > The current QEMU live migration implementation marks all of the
> > guest's RAM pages as dirty in the RAM bulk stage. All of these pages
> > are then processed, which takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, the content of free pages does not
> > matter. We can make use of this fact and skip processing the free
> > pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
> > network traffic significantly, and noticeably speeds up the live
> > migration process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the RAM bulk stage. This makes
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take post-copy and RDMA into consideration;
> > both of them may be able to benefit from this PV solution with some
> > extra modifications.
> >
> > Performance data
> > ================
> >
> > Test environment:
> >
> > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> > Host RAM: 64GB
> > Host Linux Kernel: 4.2.0
> > Host OS: CentOS 7.1
> > Guest Linux Kernel: 4.5-rc6
> > Guest OS: CentOS 6.6
> > Network: X540-AT2 with 10 Gigabit connection
> > Guest RAM: 8GB
> >
> > Case 1: Idle guest that has just booted:
> > ============================================
> >                     | original |   pv
> > --------------------------------------------
> > total time(ms)      |   1894   |   421
> > --------------------------------------------
> > transferred ram(KB) |  398017  |  353242
> > ============================================
> >
> > Case 2: The guest has run a memory-consuming workload, which is
> > terminated just before live migration.
> > ============================================
> >                     | original |   pv
> > --------------------------------------------
> > total time(ms)      |   7436   |   552
> > --------------------------------------------
> > transferred ram(KB) |  8146291 |  361375
> > ============================================
>
> Both cases look very artificial to me. Normally you migrate VMs which have
> started long ago and which can't have their services terminated before the
> migration, so I wouldn't expect any useful amount of free pages obtained
> this way.
>
Yes, they are somewhat artificial; they are meant to emphasize the effect, and I
think these two cases are very easy to reproduce. Testing with a real workload
in a production environment would be more convincing.
We can predict that as long as the guest has not used up its memory, this solution
can still take effect and shorten the total live migration time. (Of course, we should
consider the time cost of the virtio communication.)
> OTOH I don't see why you can't just inflate the balloon before the migration,
> and really optimize the amount of transferred data this way?
> With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
> good estimate of the optimal balloon size, and with the recently merged
> balloon deflation on OOM it's a safe thing to do without exposing the guest
> workloads to OOM risks.
>
> Roman.
Thanks for the information. The free page bitmap is not very large: for a
guest with 8GB of RAM, only 256KB of extra memory is required.
Compared to this solution, inflating the balloon is more expensive. If the balloon size
is not optimal and the guest requests more memory during live migration, the guest's
performance will be impacted.
Liang
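The 256KB figure can be checked with a little arithmetic. The sketch below assumes one bit per guest page and 4KB pages (the page size is an assumption here, although it matches VIRTIO_BALLOON_PFN_SHIFT being 12 in the patch); the function name and parameters are illustrative and do not come from the patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a bitmap holding one bit per guest page.
 * ram_bytes and page_bytes are illustrative parameters, not QEMU names. */
static uint64_t free_page_bitmap_bytes(uint64_t ram_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    uint64_t pages = (ram_bytes + page_bytes - 1) / page_bytes; /* round up */
    return (pages + 7) / 8;                                     /* 8 pages per byte */
}
```

For an 8GB guest this gives 2^33 / 2^12 = 2^21 pages, and 2^21 / 8 = 2^18 bytes, which is 256KB.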
* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Dr. David Alan Gilbert @ 2016-03-03 17:46 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
* Liang Li (liang.z.li@intel.com) wrote:
> The current QEMU live migration implementation marks all of the
> guest's RAM pages as dirty in the RAM bulk stage. All of these pages
> are then processed, which takes quite a lot of CPU cycles.
>
> From the guest's point of view, the content of free pages does not
> matter. We can make use of this fact and skip processing the free
> pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
> network traffic significantly, and noticeably speeds up the live
> migration process.
>
> This patch set is the QEMU side implementation.
>
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
>
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the RAM bulk stage. This makes
> the live migration process much more efficient.
Hi,
An interesting solution; I know a few different people have been looking
at how to speed up ballooned VM migration.
I wonder if it would be possible to avoid the kernel changes by
parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
mapped pages in the guest ram, would it achieve the same result?
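For reference, a minimal sketch of what reading /proc/self/pagemap looks like. It relies only on the documented entry layout (one 64-bit entry per virtual page, bit 63 for "present", bit 62 for "swapped"); the helper names are hypothetical, and how QEMU would walk its guest RAM blocks is not shown. It only identifies pages the host has never backed. Recognising pages mapped to the shared zero page would need more than these two bits, and pages the guest has used and later freed still appear as present.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Flag bits of a 64-bit /proc/<pid>/pagemap entry (kernel pagemap docs). */
#define PM_PRESENT (1ULL << 63) /* page is present in RAM */
#define PM_SWAPPED (1ULL << 62) /* page is in swap */

/* True if the entry describes a page that is neither in RAM nor in swap. */
static bool pagemap_entry_unmapped(uint64_t entry)
{
    return (entry & (PM_PRESENT | PM_SWAPPED)) == 0;
}

/* Read the pagemap entry covering vaddr in the calling process.
 * Returns 0 on success, -1 on error. Hypothetical helper. */
static int pagemap_read_entry(uintptr_t vaddr, long page_size, uint64_t *entry)
{
    assert(page_size > 0 && entry != NULL);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    off_t off = (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
    int ret = -1;
    if (lseek(fd, off, SEEK_SET) == off &&
        read(fd, entry, sizeof(*entry)) == (ssize_t)sizeof(*entry)) {
        ret = 0;
    }
    close(fd);
    return ret;
}
```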
> This RFC version doesn't take post-copy and RDMA into consideration;
> both of them may be able to benefit from this PV solution with some
> extra modifications.
For postcopy to be safe, you would still need to send a message to the
destination telling it that there were zero pages, otherwise the destination
can't tell if it's supposed to request the page from the source or
treat the page as zero.
Dave
>
> Performance data
> ================
>
> Test environment:
>
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel: 4.2.0
> Host OS: CentOS 7.1
> Guest Linux Kernel: 4.5-rc6
> Guest OS: CentOS 6.6
> Network: X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
>
> Case 1: Idle guest that has just booted:
> ============================================
>                     | original |   pv
> --------------------------------------------
> total time(ms)      |   1894   |   421
> --------------------------------------------
> transferred ram(KB) |  398017  |  353242
> ============================================
>
> Case 2: The guest has run a memory-consuming workload, which is
> terminated just before live migration.
> ============================================
>                     | original |   pv
> --------------------------------------------
> total time(ms)      |   7436   |   552
> --------------------------------------------
> transferred ram(KB) |  8146291 |  361375
> ============================================
>
> Liang Li (4):
> pc: Add code to get the lowmem form PCMachineState
> virtio-balloon: Add a new feature to balloon device
> migration: not set migration bitmap in setup stage
> migration: filter out guest's free pages in ram bulk stage
>
> balloon.c | 30 ++++++++-
> hw/i386/pc.c | 5 ++
> hw/i386/pc_piix.c | 1 +
> hw/i386/pc_q35.c | 1 +
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/i386/pc.h | 3 +-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> migration/ram.c | 64 +++++++++++++++----
> 10 files changed, 195 insertions(+), 18 deletions(-)
>
> --
> 1.8.3.1
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* virtio-vsock live migration
From: Stefan Hajnoczi @ 2016-03-03 15:37 UTC (permalink / raw)
To: virtualization
Cc: virtio-dev, Michael S. Tsirkin, Claudio Imbrenda,
Christian Borntraeger, Matt Benjamin, Christoffer Dall
Michael pointed out that the virtio-vsock draft specification does not
address live migration and in fact currently precludes migration.
Migration is fundamental so the device specification at least mustn't
preclude it. Having brainstormed migration with Matthew Benjamin and
Michael Tsirkin, I am now summarizing the approach that I want to
include in the next draft specification.
Feedback and comments welcome! In the meantime I will implement this in
code and update the draft specification.
1. Requirements
Virtio-vsock is a new AF_VSOCK transport. As such, it should provide at
least the same guarantees as the existing AF_VSOCK VMCI transport. This
is for consistency and to allow code reuse across any AF_VSOCK
transport.
Virtio-vsock aims to replace virtio-serial by providing the same
guest/host communication ability but with sockets API semantics that are
more popular and convenient for application developers. Therefore
virtio-vsock migration should provide at least the same level of
migration functionality as virtio-serial.
Ideally it should be possible to migrate applications using AF_VSOCK
together with the virtual machine so that guest<->host communication is
not interrupted. Neither AF_VSOCK VMCI nor virtio-serial supports this
today.
2. Basic disruptive migration flow
When the virtual machine migrates from the source host to the
destination host, the guest's CID may change. The CID namespace is
host-wide, so the destination host may find that the guest's CID collides
with one already in use and allocate a new CID for the incoming VM.
The device notifies the guest that the CID has changed. Guest sockets
are affected as follows:
* Established connections are reset (ECONNRESET) and the guest
application will have to reconnect.
* Listen sockets remain open. The only thing to note is that
connections from the host are now made to the new CID. This means
the local address of the listen socket is automatically updated to
the new CID.
* Sockets in other states are unchanged.
Applications must handle disruptive migration by reconnecting if
necessary after ECONNRESET.
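One way an application might handle the disruptive flow above is to wrap its send path in a reconnect-and-retry helper. The sketch below is an illustration under stated assumptions, not part of the proposal: `struct conn_ops`, `send_with_reconnect` and the `sim_*` simulated transport are all hypothetical names. A real program would implement the hooks with socket(AF_VSOCK, ...), connect(), send() and close().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical transport hooks, kept abstract so the retry logic is testable. */
struct conn_ops {
    int     (*connect)(void *ctx);                               /* new fd, or -1 */
    ssize_t (*send)(void *ctx, int fd, const void *buf, size_t len);
    void    (*close)(void *ctx, int fd);
};

/* Send buf; if the connection was reset, reconnect and retry,
 * at most max_reconnects times. Returns bytes sent or -1. */
static ssize_t send_with_reconnect(const struct conn_ops *ops, void *ctx,
                                   int *fd, const void *buf, size_t len,
                                   int max_reconnects)
{
    assert(ops != NULL && fd != NULL);
    for (int attempt = 0;; attempt++) {
        ssize_t n = ops->send(ctx, *fd, buf, len);
        if (n >= 0 || errno != ECONNRESET || attempt >= max_reconnects) {
            return n;
        }
        /* Established connection was reset, e.g. after a CID change. */
        ops->close(ctx, *fd);
        *fd = ops->connect(ctx);
        if (*fd < 0) {
            return -1;
        }
    }
}

/* Simulated transport: fd 100 stands for the pre-migration connection and
 * always fails with ECONNRESET; connections made afterwards succeed. */
struct sim_state { int connects; int closes; };

static int sim_connect(void *ctx)
{
    struct sim_state *s = ctx;
    return 100 + (++s->connects);
}

static ssize_t sim_send(void *ctx, int fd, const void *buf, size_t len)
{
    (void)ctx;
    (void)buf;
    if (fd == 100) {
        errno = ECONNRESET;
        return -1;
    }
    return (ssize_t)len;
}

static void sim_close(void *ctx, int fd)
{
    struct sim_state *s = ctx;
    (void)fd;
    s->closes++;
}

static const struct conn_ops sim_ops = { sim_connect, sim_send, sim_close };
```

With the simulated transport, a send on the stale descriptor fails once, the helper closes it, reconnects, and the retry succeeds on the new descriptor. Whether retrying a send is safe depends on the application protocol; a request that is not idempotent may need its own recovery logic.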
3. Checkpoint/restore for seamless migration
Applications that wish to communicate across live migration can do so
but this requires extra application-specific checkpoint/restore code.
This is similar to the approach taken by the CRIU project where
getsockopt()/setsockopt() is used to migrate socket state. The
difference is that the application process is not automatically migrated
from the source host to the destination host. Therefore, the
application needs to migrate its own state somehow.
The flow is as follows:
The application on the source host must quiesce (stop sending/receiving)
and use getsockopt() to extract socket state information from the host
kernel.
A new instance of the application is started on the destination host and
given the state so it can restore the connection. The setsockopt()
syscall is used to restore socket state information.
The guest is given a list of <host_old_cid, host_new_cid, host_port,
guest_port> tuples for established connections that must not be reset
when the guest CID update notification is received. These connections
will carry on as if nothing changed.
Note that the connection's remote address is updated from host_old_cid
to host_new_cid. This allows remapping of CIDs (if necessary).
Typically this will be unused because the host always has well-known CID
2. In a guest<->guest scenario it may be used to remap CIDs.
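The tuple list described above could be consulted roughly as follows when the guest receives the CID update notification. This is only a sketch: the struct, the function name and the field widths (64-bit CIDs, 32-bit ports) are assumptions for illustration, and the message does not define a wire format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One preserved connection. Field names follow the tuple in the text;
 * the struct itself and the integer widths are assumptions. */
struct vsock_preserve {
    uint64_t host_old_cid;
    uint64_t host_new_cid;
    uint32_t host_port;
    uint32_t guest_port;
};

/* On a guest CID update: if the connection is on the list, rewrite its
 * remote CID to host_new_cid and return true (keep the connection).
 * Otherwise return false, and the caller resets it with ECONNRESET. */
static bool vsock_keep_connection(const struct vsock_preserve *list, size_t n,
                                  uint64_t *remote_cid,
                                  uint32_t host_port, uint32_t guest_port)
{
    assert(remote_cid != NULL && (list != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (list[i].host_old_cid == *remote_cid &&
            list[i].host_port == host_port &&
            list[i].guest_port == guest_port) {
            *remote_cid = list[i].host_new_cid; /* remap, possibly to the same CID */
            return true;
        }
    }
    return false;
}
```

In the common guest<->host case both CIDs in a tuple would be 2, so the remap is a no-op and the entry only marks the connection as one to keep.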
For the time being I am focussing on the basic disruptive migration flow
only. Checkpoint/restore can be added with a feature bit in the future.
It is a lot more complex and I'm not sure whether there will be any
users yet.
Stefan
* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-03 13:58 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> The current QEMU live migration implementation marks all of the
> guest's RAM pages as dirty in the RAM bulk stage. All of these pages
> are then processed, which takes quite a lot of CPU cycles.
>
> From the guest's point of view, the content of free pages does not
> matter. We can make use of this fact and skip processing the free
> pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
> network traffic significantly, and noticeably speeds up the live
> migration process.
>
> This patch set is the QEMU side implementation.
>
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
>
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the RAM bulk stage. This makes
> the live migration process much more efficient.
>
> This RFC version doesn't take post-copy and RDMA into consideration;
> both of them may be able to benefit from this PV solution with some
> extra modifications.
>
> Performance data
> ================
>
> Test environment:
>
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel: 4.2.0
> Host OS: CentOS 7.1
> Guest Linux Kernel: 4.5-rc6
> Guest OS: CentOS 6.6
> Network: X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
>
> Case 1: Idle guest that has just booted:
> ============================================
>                     | original |   pv
> --------------------------------------------
> total time(ms)      |   1894   |   421
> --------------------------------------------
> transferred ram(KB) |  398017  |  353242
> ============================================
>
> Case 2: The guest has run a memory-consuming workload, which is
> terminated just before live migration.
> ============================================
>                     | original |   pv
> --------------------------------------------
> total time(ms)      |   7436   |   552
> --------------------------------------------
> transferred ram(KB) |  8146291 |  361375
> ============================================
Both cases look very artificial to me. Normally you migrate VMs which
have started long ago and which can't have their services terminated
before the migration, so I wouldn't expect any useful amount of free
pages obtained this way.
OTOH I don't see why you can't just inflate the balloon before the
migration, and really optimize the amount of transferred data this way?
With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
good estimate of the optimal balloon size, and with the recently merged
balloon deflation on OOM it's a safe thing to do without exposing the
guest workloads to OOM risks.
Roman.
* [PULL for-4.5] virtio/vhost: minor fixes
From: Michael S. Tsirkin @ 2016-03-03 13:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kvm, mst, netdev, lprosek, linux-kernel, virtualization
The following changes since commit fc77dbd34c5c99bce46d40a2491937c3bcbd10af:
Linux 4.5-rc6 (2016-02-28 08:41:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to e1f33be9186363da7955bcb5f0b03e6685544c50:
vhost: fix error path in vhost_init_used() (2016-03-02 17:01:49 +0200)
----------------------------------------------------------------
virtio/vhost: minor fixes
This fixes two minor bugs: error handling in vhost,
and capability processing in virtio.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Greg Kurz (1):
vhost: fix error path in vhost_init_used()
Ladi Prosek (1):
virtio-pci: read the right virtio_pci_notify_cap field
drivers/vhost/vhost.c | 15 +++++++++++----
drivers/virtio/virtio_pci_modern.c | 2 +-
2 files changed, 12 insertions(+), 5 deletions(-)
* Re: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device
From: Michael S. Tsirkin @ 2016-03-03 12:56 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-3-git-send-email-liang.z.li@intel.com>
On Thu, Mar 03, 2016 at 06:44:26PM +0800, Liang Li wrote:
> Extend the virtio balloon device to support a new feature. This
> new feature can help to get the guest's free pages information, which
> can be used for live migration optimization.
>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
I don't understand why we need a new interface.
Balloon already sends free pages to host.
Just teach host to skip these pages.
Maybe instead of starting with code, you
should send a high level description to the
virtio tc for consideration?
You can do it through the mailing list or
using the web form:
http://www.oasis-open.org/committees/comments/form.php?wg_abbrev=virtio
> ---
> balloon.c | 30 ++++++++-
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> 5 files changed, 134 insertions(+), 5 deletions(-)
>
> diff --git a/balloon.c b/balloon.c
> index f2ef50c..a37717e 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -36,6 +36,7 @@
>
> static QEMUBalloonEvent *balloon_event_fn;
> static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonFreePages *balloon_free_pages_fn;
> static void *balloon_opaque;
> static bool balloon_inhibited;
>
> @@ -65,9 +66,12 @@ static bool have_balloon(Error **errp)
> }
>
> int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> - QEMUBalloonStatus *stat_func, void *opaque)
> + QEMUBalloonStatus *stat_func,
> + QEMUBalloonFreePages *free_pages_func,
> + void *opaque)
> {
> - if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> + if (balloon_event_fn || balloon_stat_fn || balloon_free_pages_fn
> + || balloon_opaque) {
> /* We're already registered one balloon handler. How many can
> * a guest really have?
> */
> @@ -75,6 +79,7 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> }
> balloon_event_fn = event_func;
> balloon_stat_fn = stat_func;
> + balloon_free_pages_fn = free_pages_func;
> balloon_opaque = opaque;
> return 0;
> }
> @@ -86,6 +91,7 @@ void qemu_remove_balloon_handler(void *opaque)
> }
> balloon_event_fn = NULL;
> balloon_stat_fn = NULL;
> + balloon_free_pages_fn = NULL;
> balloon_opaque = NULL;
> }
>
> @@ -116,3 +122,23 @@ void qmp_balloon(int64_t target, Error **errp)
> trace_balloon_event(balloon_opaque, target);
> balloon_event_fn(balloon_opaque, target);
> }
> +
> +bool balloon_free_pages_support(void)
> +{
> + return balloon_free_pages_fn ? true : false;
> +}
> +
> +int balloon_get_free_pages(unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + if (!balloon_free_pages_fn) {
> + return -1;
> + }
> +
> + if (!free_pages_bitmap || !free_pages_count) {
> + return -1;
> + }
> +
> + return balloon_free_pages_fn(balloon_opaque,
> + free_pages_bitmap, free_pages_count);
> + }
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index e9c30e9..a5b9d08 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -76,6 +76,12 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
> return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
> }
>
> +static bool balloon_free_pages_supported(const VirtIOBalloon *s)
> +{
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES);
> +}
> +
> static bool balloon_stats_enabled(const VirtIOBalloon *s)
> {
> return s->stats_poll_interval > 0;
> @@ -293,6 +299,37 @@ out:
> }
> }
>
> +static void virtio_balloon_get_free_pages(VirtIODevice *vdev, VirtQueue *vq)
> +{
> + VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> + VirtQueueElement *elem;
> + size_t offset = 0;
> + uint64_t bitmap_bytes = 0, free_pages_count = 0;
> +
> + elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> + if (!elem) {
> + return;
> + }
> + s->free_pages_vq_elem = elem;
> +
> + if (!elem->out_num) {
> + return;
> + }
> +
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + &free_pages_count, sizeof(uint64_t));
> +
> + offset += sizeof(uint64_t);
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + &bitmap_bytes, sizeof(uint64_t));
> +
> + offset += sizeof(uint64_t);
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + s->free_pages_bitmap, bitmap_bytes);
> + s->req_status = DONE;
> + s->free_pages_count = free_pages_count;
> +}
> +
> static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
> {
> VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> @@ -362,6 +399,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
> VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> f |= dev->host_features;
> virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> + virtio_add_feature(&f, VIRTIO_BALLOON_F_GET_FREE_PAGES);
> return f;
> }
>
> @@ -372,6 +410,45 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
> VIRTIO_BALLOON_PFN_SHIFT);
> }
>
> +static int virtio_balloon_free_pages(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + VirtIOBalloon *s = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + VirtQueueElement *elem = s->free_pages_vq_elem;
> + int len;
> +
> + if (!balloon_free_pages_supported(s)) {
> + return -1;
> + }
> +
> + if (s->req_status == NOT_STARTED) {
> + s->free_pages_bitmap = free_pages_bitmap;
> + s->req_status = STARTED;
> + s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
> + if (!elem->in_num) {
> + elem = virtqueue_pop(s->fvq, sizeof(VirtQueueElement));
> + if (!elem) {
> + return 0;
> + }
> + s->free_pages_vq_elem = elem;
> + }
> + len = iov_from_buf(elem->in_sg, elem->in_num, 0, &s->mem_layout,
> + sizeof(s->mem_layout));
> + virtqueue_push(s->fvq, elem, len);
> + virtio_notify(vdev, s->fvq);
> + return 0;
> + } else if (s->req_status == STARTED) {
> + return 0;
> + } else if (s->req_status == DONE) {
> + *free_pages_count = s->free_pages_count;
> + s->req_status = NOT_STARTED;
> + }
> +
> + return 1;
> +}
> +
> static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
> {
> VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> @@ -429,7 +506,8 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> sizeof(struct virtio_balloon_config));
>
> ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> - virtio_balloon_stat, s);
> + virtio_balloon_stat,
> + virtio_balloon_free_pages, s);
>
> if (ret < 0) {
> error_setg(errp, "Only one balloon device is supported");
> @@ -440,6 +518,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> + s->fvq = virtio_add_queue(vdev, 128, virtio_balloon_get_free_pages);
>
> reset_stats(s);
>
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..fc173e4 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -23,6 +23,16 @@
> #define VIRTIO_BALLOON(obj) \
> OBJECT_CHECK(VirtIOBalloon, (obj), TYPE_VIRTIO_BALLOON)
>
> +typedef enum virtio_req_status {
> + NOT_STARTED,
> + STARTED,
> + DONE,
> +} VIRTIO_REQ_STATUS;
> +
> +typedef struct MemLayout {
> + uint64_t low_mem;
> +} MemLayout;
> +
> typedef struct virtio_balloon_stat VirtIOBalloonStat;
>
> typedef struct virtio_balloon_stat_modern {
> @@ -33,16 +43,21 @@ typedef struct virtio_balloon_stat_modern {
>
> typedef struct VirtIOBalloon {
> VirtIODevice parent_obj;
> - VirtQueue *ivq, *dvq, *svq;
> + VirtQueue *ivq, *dvq, *svq, *fvq;
> uint32_t num_pages;
> uint32_t actual;
> uint64_t stats[VIRTIO_BALLOON_S_NR];
> VirtQueueElement *stats_vq_elem;
> + VirtQueueElement *free_pages_vq_elem;
> size_t stats_vq_offset;
> QEMUTimer *stats_timer;
> int64_t stats_last_update;
> int64_t stats_poll_interval;
> uint32_t host_features;
> + uint64_t *free_pages_bitmap;
> + uint64_t free_pages_count;
> + MemLayout mem_layout;
> + VIRTIO_REQ_STATUS req_status;
> } VirtIOBalloon;
>
> #endif
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 2e2a6dc..95b7d0c 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
> #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
> #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
> #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_GET_FREE_PAGES 3 /* Get the free pages bitmap */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 3f976b4..205b272 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -18,11 +18,19 @@
>
> typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
> typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> +typedef int (QEMUBalloonFreePages)(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count);
>
> int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> - QEMUBalloonStatus *stat_func, void *opaque);
> + QEMUBalloonStatus *stat_func,
> + QEMUBalloonFreePages *free_pages_func,
> + void *opaque);
> void qemu_remove_balloon_handler(void *opaque);
> bool qemu_balloon_is_inhibited(void);
> void qemu_balloon_inhibit(bool state);
> +bool balloon_free_pages_support(void);
> +int balloon_get_free_pages(unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count);
>
> #endif
> --
> 1.8.3.1
* Re: [Qemu-devel] [RFC qemu 4/4] migration: filter out guest's free pages in ram bulk stage
From: Daniel P. Berrange @ 2016-03-03 12:45 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-5-git-send-email-liang.z.li@intel.com>
On Thu, Mar 03, 2016 at 06:44:28PM +0800, Liang Li wrote:
> Get the free pages information through virtio and filter out the free
> pages in the ram bulk stage. This can significantly reduce the total
> live migration time as well as network traffic.
>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
> migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 46 insertions(+), 6 deletions(-)
> @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> DIRTY_MEMORY_MIGRATION);
> }
> memory_global_dirty_log_start();
> +
> + if (balloon_free_pages_support() &&
> + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + qemu_mutex_unlock_iothread();
> + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + usleep(1000);
> + }
> + qemu_mutex_lock_iothread();
> +
> + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
> + }
IIUC, this code is synchronous wrt the guest OS balloon driver, i.e. it
is asking the guest for free pages and waiting for a response. If the
guest OS has crashed, QEMU is going to wait forever and thus
migration won't complete. Similarly, you need to consider that the guest
OS may be malicious and simply never respond.
So if the migration code is going to use the guest balloon driver to get
info about free pages it has to be done in an asynchronous manner so that
migration can never be stalled by a slow/crashed/malicious guest driver.
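To make the concern concrete, here is a minimal sketch of a bounded wait, which is a weaker substitute for a truly asynchronous design but already removes the unbounded loop. Everything in it is an assumption made for illustration: the `hint_status` codes, the `hint_poll_fn` callback, `wait_for_free_page_hint()` and the two demo pollers are hypothetical names and are not part of QEMU or of the posted patches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result codes for a non-blocking free-page query. */
enum hint_status { HINT_PENDING, HINT_READY, HINT_UNAVAILABLE };

/* Hypothetical poll callback: must return immediately, never block. */
typedef enum hint_status (*hint_poll_fn)(void *opaque);

static int64_t now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll the guest for at most timeout_ms milliseconds.
 * Returns true if the hint arrived and may be used, false if the caller
 * should fall back to treating every page as dirty.
 */
static bool wait_for_free_page_hint(hint_poll_fn poll, void *opaque,
                                    int64_t timeout_ms)
{
    const struct timespec nap = { 0, 1000000 }; /* 1 ms between polls */
    int64_t deadline;

    assert(poll != NULL);
    deadline = now_ms() + timeout_ms;

    for (;;) {
        enum hint_status st = poll(opaque);

        if (st == HINT_READY) {
            return true;
        }
        if (st == HINT_UNAVAILABLE || now_ms() >= deadline) {
            return false; /* slow, crashed or uncooperative guest */
        }
        nanosleep(&nap, NULL);
    }
}

/* Demo poller: models a guest that never answers. */
static enum hint_status never_ready(void *opaque)
{
    (void)opaque;
    return HINT_PENDING;
}

/* Demo poller: becomes ready once the counter in *opaque reaches zero. */
static enum hint_status ready_after_n(void *opaque)
{
    int *n = opaque;

    if (*n > 0) {
        (*n)--;
    }
    return *n == 0 ? HINT_READY : HINT_PENDING;
}
```

With a deadline, a guest that never responds only costs the optimization: migration proceeds as if no free-page information were available.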
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
^ permalink raw reply
* Re: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device
From: Cornelia Huck @ 2016-03-03 12:23 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-3-git-send-email-liang.z.li@intel.com>
On Thu, 3 Mar 2016 18:44:26 +0800
Liang Li <liang.z.li@intel.com> wrote:
> Extend the virtio balloon device to support a new feature, this
> new feature can help to get guest's free pages information, which
> can be used for live migration optimization.
Do you have a spec for this, e.g. as a patch to the virtio spec?
>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
> balloon.c | 30 ++++++++-
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> 5 files changed, 134 insertions(+), 5 deletions(-)
> +static int virtio_balloon_free_pages(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + VirtIOBalloon *s = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + VirtQueueElement *elem = s->free_pages_vq_elem;
> + int len;
> +
> + if (!balloon_free_pages_supported(s)) {
> + return -1;
> + }
> +
> + if (s->req_status == NOT_STARTED) {
> + s->free_pages_bitmap = free_pages_bitmap;
> + s->req_status = STARTED;
> + s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
Please don't leak pc-specific information into generic code.
^ permalink raw reply
* Re: [RFC qemu 4/4] migration: filter out guest's free pages in ram bulk stage
From: Cornelia Huck @ 2016-03-03 12:16 UTC (permalink / raw)
To: Liang Li
Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, amit.shah,
pbonzini, akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-5-git-send-email-liang.z.li@intel.com>
On Thu, 3 Mar 2016 18:44:28 +0800
Liang Li <liang.z.li@intel.com> wrote:
> Get the free pages information through virtio and filter out the free
> pages in the ram bulk stage. This can significantly reduce the total
> live migration time as well as network traffic.
>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
> migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 46 insertions(+), 6 deletions(-)
>
> @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> DIRTY_MEMORY_MIGRATION);
> }
> memory_global_dirty_log_start();
> +
> + if (balloon_free_pages_support() &&
> + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + qemu_mutex_unlock_iothread();
> + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + usleep(1000);
> + }
> + qemu_mutex_lock_iothread();
> +
> + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
A general comment: Using the ballooner to get information about pages
that can be filtered out is too limited (there may be other ways to do
this; we might be able to use cmma on s390, for example), and I don't
like hardcoding to a specific method.
What about the reverse approach: Code may register a handler that
populates the free_pages_bitmap which is called during this stage?
<I like the idea of filtering in general, but I haven't looked at the
code yet>
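The reverse approach suggested here could look roughly like the sketch below: migration code asks a list of registered providers to fill the bitmap and does not know which mechanism (balloon, CMMA, something else) is behind each one. All names (`free_page_provider_fn`, `free_page_provider_register()`, `free_page_hint_populate()`, the demo provider) are hypothetical and only illustrate the shape of such an interface; they are not existing QEMU APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical provider callback: set a bit in `bitmap` for every page
 * known to be free and return true, or return false if no information
 * is available right now.
 */
typedef bool (*free_page_provider_fn)(void *opaque, unsigned long *bitmap,
                                      size_t nbits);

struct free_page_provider {
    free_page_provider_fn fn;
    void *opaque;
};

#define MAX_PROVIDERS 4
static struct free_page_provider providers[MAX_PROVIDERS];
static size_t nr_providers;

/* Register a provider; returns 0 on success, -1 if rejected. */
int free_page_provider_register(free_page_provider_fn fn, void *opaque)
{
    if (!fn || nr_providers == MAX_PROVIDERS) {
        return -1;
    }
    providers[nr_providers].fn = fn;
    providers[nr_providers].opaque = opaque;
    nr_providers++;
    return 0;
}

/* Called from the migration side: the first provider that succeeds wins. */
bool free_page_hint_populate(unsigned long *bitmap, size_t nbits)
{
    size_t i;

    for (i = 0; i < nr_providers; i++) {
        if (providers[i].fn(providers[i].opaque, bitmap, nbits)) {
            return true;
        }
    }
    return false; /* nobody could help; migrate every page */
}

/* Demo provider standing in for a balloon or s390 backend. */
static bool demo_provider(void *opaque, unsigned long *bitmap, size_t nbits)
{
    (void)opaque;
    if (nbits == 0) {
        return false;
    }
    bitmap[0] |= 1UL; /* pretend page 0 is free */
    return true;
}
```

The design choice is that the migration code depends only on the callback signature, so a platform-specific source of free-page information can be added without touching it.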
> + }
> +
> migration_bitmap_sync();
> qemu_mutex_unlock_ramlist();
> qemu_mutex_unlock_iothread();
^ permalink raw reply
* [RFC kernel 2/2] virtio-balloon: extend balloon driver to support a new feature
From: Liang Li @ 2016-03-03 10:46 UTC (permalink / raw)
To: mst, linux-kernel
Cc: ehabkost, kvm, Liang Li, qemu-devel, virtualization, linux-mm,
amit.shah, pbonzini, akpm, dgilbert, rth
In-Reply-To: <1457002019-15998-1-git-send-email-liang.z.li@intel.com>
Extend the virtio balloon to support the new feature
VIRTIO_BALLOON_F_GET_FREE_PAGES, so that we can use it to send the
free pages information from guest to QEMU, and then optimize the
live migration process.
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
drivers/virtio/virtio_balloon.c | 106 ++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_balloon.h | 1 +
2 files changed, 102 insertions(+), 5 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0c3691f..7461d3e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -45,9 +45,18 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
module_param(oom_pages, int, S_IRUSR | S_IWUSR);
MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
+extern void get_free_pages(unsigned long *free_page_bitmap,
+ unsigned long *free_pages_num,
+ unsigned long lowmem);
+extern unsigned long get_total_pages_count(unsigned long lowmem);
+
+struct mem_layout {
+ unsigned long low_mem;
+};
+
struct virtio_balloon {
struct virtio_device *vdev;
- struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+ struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_pages_vq;
/* Where the ballooning thread waits for config to change. */
wait_queue_head_t config_change;
@@ -75,6 +84,11 @@ struct virtio_balloon {
unsigned int num_pfns;
u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
+ unsigned long *free_pages;
+ unsigned long free_pages_len;
+ unsigned long free_pages_num;
+ struct mem_layout mem_config;
+
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
@@ -245,6 +259,34 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(i.totalram));
}
+static void update_free_pages_stats(struct virtio_balloon *vb)
+{
+ unsigned long total_page_count, bitmap_bytes;
+
+ total_page_count = get_total_pages_count(vb->mem_config.low_mem);
+ bitmap_bytes = ALIGN(total_page_count, BITS_PER_LONG) / 8;
+
+ if (!vb->free_pages)
+ vb->free_pages = kzalloc(bitmap_bytes, GFP_KERNEL);
+ else {
+ if (bitmap_bytes < vb->free_pages_len)
+ memset(vb->free_pages, 0, bitmap_bytes);
+ else {
+ kfree(vb->free_pages);
+ vb->free_pages = kzalloc(bitmap_bytes, GFP_KERNEL);
+ }
+ }
+ if (!vb->free_pages) {
+ vb->free_pages_len = 0;
+ vb->free_pages_num = 0;
+ return;
+ }
+
+ vb->free_pages_len = bitmap_bytes;
+ get_free_pages(vb->free_pages, &vb->free_pages_num,
+ vb->mem_config.low_mem);
+}
+
/*
* While most virtqueues communicate guest-initiated requests to the hypervisor,
* the stats queue operates in reverse. The driver initializes the virtqueue
@@ -278,6 +320,39 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
}
+static void free_pages_handle_rq(struct virtio_balloon *vb)
+{
+ struct virtqueue *vq;
+ struct scatterlist sg[3];
+ unsigned int len;
+ struct mem_layout *ptr_mem_layout;
+ struct scatterlist sg_in;
+
+ vq = vb->free_pages_vq;
+ ptr_mem_layout = virtqueue_get_buf(vq, &len);
+
+ if (!ptr_mem_layout)
+ return;
+ update_free_pages_stats(vb);
+ sg_init_table(sg, 3);
+ sg_set_buf(&sg[0], &(vb->free_pages_num), sizeof(vb->free_pages_num));
+ sg_set_buf(&sg[1], &(vb->free_pages_len), sizeof(vb->free_pages_len));
+ sg_set_buf(&sg[2], vb->free_pages, vb->free_pages_len);
+
+ sg_init_one(&sg_in, &vb->mem_config, sizeof(vb->mem_config));
+
+ virtqueue_add_outbuf(vq, &sg[0], 3, vb, GFP_KERNEL);
+ virtqueue_add_inbuf(vq, &sg_in, 1, &vb->mem_config, GFP_KERNEL);
+ virtqueue_kick(vq);
+}
+
+static void free_pages_rq(struct virtqueue *vq)
+{
+ struct virtio_balloon *vb = vq->vdev->priv;
+
+ free_pages_handle_rq(vb);
+}
+
static void virtballoon_changed(struct virtio_device *vdev)
{
struct virtio_balloon *vb = vdev->priv;
@@ -386,16 +461,22 @@ static int balloon(void *_vballoon)
static int init_vqs(struct virtio_balloon *vb)
{
- struct virtqueue *vqs[3];
- vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
- static const char * const names[] = { "inflate", "deflate", "stats" };
+ struct virtqueue *vqs[4];
+ vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
+ stats_request, free_pages_rq };
+ const char *names[] = { "inflate", "deflate", "stats", "free_pages" };
int err, nvqs;
/*
* We expect two virtqueues: inflate and deflate, and
* optionally stat.
*/
- nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES))
+ nvqs = 4;
+ else
+ nvqs = virtio_has_feature(vb->vdev,
+ VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+
err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
if (err)
return err;
@@ -416,6 +497,16 @@ static int init_vqs(struct virtio_balloon *vb)
BUG();
virtqueue_kick(vb->stats_vq);
}
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES)) {
+ struct scatterlist sg_in;
+
+ vb->free_pages_vq = vqs[3];
+ sg_init_one(&sg_in, &vb->mem_config, sizeof(vb->mem_config));
+ if (virtqueue_add_inbuf(vb->free_pages_vq, &sg_in, 1,
+ &vb->mem_config, GFP_KERNEL) < 0)
+ BUG();
+ virtqueue_kick(vb->free_pages_vq);
+ }
return 0;
}
@@ -505,6 +596,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
init_waitqueue_head(&vb->acked);
vb->vdev = vdev;
vb->need_stats_update = 0;
+ vb->free_pages_num = 0;
+ vb->free_pages_len = 0;
+ vb->free_pages = NULL;
balloon_devinfo_init(&vb->vb_dev_info);
#ifdef CONFIG_BALLOON_COMPACTION
@@ -561,6 +655,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
unregister_oom_notifier(&vb->nb);
kthread_stop(vb->thread);
remove_common(vb);
+ kfree(vb->free_pages);
kfree(vb);
}
@@ -599,6 +694,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+ VIRTIO_BALLOON_F_GET_FREE_PAGES,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index d7f1cbc..54aaf20 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_GET_FREE_PAGES 3 /* Get free pages bitmap */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
--
1.8.3.1
^ permalink raw reply related
* [RFC kernel 1/2] mm: Add the functions used to get free pages information
From: Liang Li @ 2016-03-03 10:46 UTC (permalink / raw)
To: mst, linux-kernel
Cc: ehabkost, kvm, Liang Li, qemu-devel, virtualization, linux-mm,
amit.shah, pbonzini, akpm, dgilbert, rth
In-Reply-To: <1457002019-15998-1-git-send-email-liang.z.li@intel.com>
get_total_pages_count() tries to get the page count of the system
RAM.
get_free_pages() is intended to construct a free pages bitmap by
traversing the free_list.
The free pages information will be sent to QEMU through virtio
and used for live migration optimization.
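The index arithmetic used by the patch below can be sketched in user space as follows. This is an illustration under stated assumptions, not kernel code: it assumes 4 KiB pages and an x86 PC layout with a single hole between low_mem and 4 GiB, and the `SKETCH_*` macros and helper names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define SKETCH_PFN_4G (UINT64_C(0x100000000) >> SKETCH_PAGE_SHIFT)

/* Number of page frames in the hole between low_mem and 4 GiB. */
static uint64_t pfn_gap(uint64_t low_mem_bytes)
{
    return SKETCH_PFN_4G - (low_mem_bytes >> SKETCH_PAGE_SHIFT);
}

/*
 * Map a guest PFN to its bit position in the free pages bitmap.
 * Frames at or above 4 GiB are shifted down by the size of the hole,
 * so the bitmap has no unused range in the middle.
 */
static uint64_t pfn_to_bitmap_index(uint64_t pfn, uint64_t low_mem_bytes)
{
    return pfn >= SKETCH_PFN_4G ? pfn - pfn_gap(low_mem_bytes) : pfn;
}

/* Number of bits the bitmap needs for a guest whose highest PFN is max_pfn. */
static uint64_t total_pages(uint64_t max_pfn, uint64_t low_mem_bytes)
{
    return max_pfn >= SKETCH_PFN_4G ? max_pfn - pfn_gap(low_mem_bytes)
                                    : max_pfn;
}
```

For example, with low_mem at 3 GiB the hole spans 0x40000 frames, so the first frame above 4 GiB (PFN 0x100000) lands at bit 0xC0000, directly after the last low-memory frame.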
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
mm/page_alloc.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 838ca8bb..81922e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3860,6 +3860,63 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
}
+#define PFN_4G (0x100000000 >> PAGE_SHIFT)
+
+unsigned long get_total_pages_count(unsigned long low_mem)
+{
+ if (max_pfn >= PFN_4G) {
+ unsigned long pfn_gap = PFN_4G - (low_mem >> PAGE_SHIFT);
+
+ return max_pfn - pfn_gap;
+ } else
+ return max_pfn;
+}
+EXPORT_SYMBOL(get_total_pages_count);
+
+static void mark_free_pages_bitmap(struct zone *zone,
+ unsigned long *free_page_bitmap, unsigned long pfn_gap)
+{
+ unsigned long pfn, flags, i;
+ unsigned int order, t;
+ struct list_head *curr;
+
+ if (zone_is_empty(zone))
+ return;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ for_each_migratetype_order(order, t) {
+ list_for_each(curr, &zone->free_area[order].free_list[t]) {
+
+ pfn = page_to_pfn(list_entry(curr, struct page, lru));
+ for (i = 0; i < (1UL << order); i++) {
+ if ((pfn + i) >= PFN_4G)
+ set_bit_le(pfn + i - pfn_gap,
+ free_page_bitmap);
+ else
+ set_bit_le(pfn + i, free_page_bitmap);
+ }
+ }
+ }
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+void get_free_pages(unsigned long *free_page_bitmap,
+ unsigned long *free_pages_count,
+ unsigned long low_mem)
+{
+ struct zone *zone;
+ unsigned long pfn_gap;
+
+ pfn_gap = PFN_4G - (low_mem >> PAGE_SHIFT);
+ for_each_populated_zone(zone)
+ mark_free_pages_bitmap(zone, free_page_bitmap, pfn_gap);
+
+ *free_pages_count = global_page_state(NR_FREE_PAGES);
+}
+EXPORT_SYMBOL(get_free_pages);
+
static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
{
zoneref->zone = zone;
--
1.8.3.1
^ permalink raw reply related
* [RFC kernel 0/2]A PV solution for KVM live migration optimization
From: Liang Li @ 2016-03-03 10:46 UTC (permalink / raw)
To: mst, linux-kernel
Cc: ehabkost, kvm, Liang Li, qemu-devel, virtualization, linux-mm,
amit.shah, pbonzini, akpm, dgilbert, rth
The current QEMU live migration implementation marks all of the
guest's RAM pages as dirty in the ram bulk stage. All these pages
are then processed, which takes quite a lot of CPU cycles.
From the guest's point of view, the content of free pages does not
matter. We can make use of this fact and skip processing the free
pages in the ram bulk stage. This saves a lot of CPU cycles, reduces
the network traffic significantly, and speeds up the live migration
process.
This patch set is the kernel side implementation.
It gets the free pages information by traversing
zone->free_area[order].free_list and constructs a free pages bitmap.
The virtio-balloon driver is extended to send the free pages
bitmap to QEMU for live migration optimization.
Performance data
================
Test environment:
CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Host RAM: 64GB
Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
Guest Linux Kernel: 4.5-rc6 Guest OS: CentOS 6.6
Network: X540-AT2 with 10 Gigabit connection
Guest RAM: 8GB
Case 1: Idle guest just boots:
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |   1894   |    421
--------------------------------------------
transferred ram(KB) |  398017  |  353242
============================================
Case 2: The guest has run a memory-consuming workload, which is
terminated just before live migration.
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |   7436   |    552
--------------------------------------------
transferred ram(KB) | 8146291  |  361375
============================================
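To read the two tables as relative savings, a small helper like the one below is enough. It is only arithmetic on the figures reported above; the function name `percent_saved()` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Whole-percent reduction from `before` to `after`, rounded down. */
static unsigned percent_saved(uint64_t before, uint64_t after)
{
    assert(before > 0 && after <= before);
    return (unsigned)((before - after) * 100 / before);
}
```

Applied to the reported numbers and rounded down, this gives 77 and 92 percent less total time for case 1 and case 2, and 11 and 95 percent less transferred RAM.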
Liang Li (2):
mm: Add the functions used to get free pages information
virtio-balloon: extend balloon driver to support a new feature
drivers/virtio/virtio_balloon.c | 108 ++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_balloon.h | 1 +
mm/page_alloc.c | 58 +++++++++++++++++++
3 files changed, 162 insertions(+), 5 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [RFC qemu 4/4] migration: filter out guest's free pages in ram bulk stage
From: Liang Li @ 2016-03-03 10:44 UTC (permalink / raw)
To: quintela, amit.shah, qemu-devel, linux-kernel
Cc: ehabkost, kvm, mst, Liang Li, dgilbert, virtualization, linux-mm,
pbonzini, akpm, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
Get the free pages information through virtio and filter out the free
pages in the ram bulk stage. This can significantly reduce the total
live migration time as well as network traffic.
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index ee2547d..819553b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -40,6 +40,7 @@
#include "trace.h"
#include "exec/ram_addr.h"
#include "qemu/rcu_queue.h"
+#include "sysemu/balloon.h"
#ifdef DEBUG_MIGRATION_RAM
#define DPRINTF(fmt, ...) \
@@ -241,6 +242,7 @@ static struct BitmapRcu {
struct rcu_head rcu;
/* Main migration bitmap */
unsigned long *bmap;
+ unsigned long *free_pages_bmap;
/* bitmap of pages that haven't been sent even once
* only maintained and used in postcopy at the moment
* where it's used to send the dirtymap at the start
@@ -561,12 +563,7 @@ ram_addr_t migration_bitmap_find_dirty(RAMBlock *rb,
unsigned long next;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
- if (ram_bulk_stage && nr > base) {
- next = nr + 1;
- } else {
- next = find_next_bit(bitmap, size, nr);
- }
-
+ next = find_next_bit(bitmap, size, nr);
*ram_addr_abs = next << TARGET_PAGE_BITS;
return (next - base) << TARGET_PAGE_BITS;
}
@@ -1415,6 +1412,9 @@ void free_xbzrle_decoded_buf(void)
static void migration_bitmap_free(struct BitmapRcu *bmap)
{
g_free(bmap->bmap);
+ if (balloon_free_pages_support()) {
+ g_free(bmap->free_pages_bmap);
+ }
g_free(bmap->unsentmap);
g_free(bmap);
}
@@ -1873,6 +1873,28 @@ err:
return ret;
}
+static void filter_out_guest_free_pages(unsigned long *free_pages_bmap)
+{
+ RAMBlock *block;
+ DirtyMemoryBlocks *blocks;
+ unsigned long end, page;
+
+ blocks = atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
+ block = QLIST_FIRST_RCU(&ram_list.blocks);
+ end = TARGET_PAGE_ALIGN(block->offset +
+ block->used_length) >> TARGET_PAGE_BITS;
+ page = block->offset >> TARGET_PAGE_BITS;
+
+ while (page < end) {
+ unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
+ unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
+ unsigned long num = MIN(end - page, DIRTY_MEMORY_BLOCK_SIZE - offset);
+ unsigned long *p = free_pages_bmap + BIT_WORD(page);
+
+ slow_bitmap_complement(blocks->blocks[idx], p, num);
+ page += num;
+ }
+}
/* Each of ram_save_setup, ram_save_iterate and ram_save_complete has
* long-running RCU critical section. When rcu-reclaims in the code
@@ -1884,6 +1906,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
{
RAMBlock *block;
int64_t ram_bitmap_pages; /* Size of bitmap in pages, including gaps */
+ uint64_t free_pages_count = 0;
dirty_rate_high_cnt = 0;
bitmap_sync_count = 0;
@@ -1931,6 +1954,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
+ if (balloon_free_pages_support()) {
+ migration_bitmap_rcu->free_pages_bmap = bitmap_new(ram_bitmap_pages);
+ }
if (migrate_postcopy_ram()) {
migration_bitmap_rcu->unsentmap = bitmap_new(ram_bitmap_pages);
@@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
DIRTY_MEMORY_MIGRATION);
}
memory_global_dirty_log_start();
+
+ if (balloon_free_pages_support() &&
+ balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
+ &free_pages_count) == 0) {
+ qemu_mutex_unlock_iothread();
+ while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
+ &free_pages_count) == 0) {
+ usleep(1000);
+ }
+ qemu_mutex_lock_iothread();
+
+ filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
+ }
+
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
qemu_mutex_unlock_iothread();
--
1.8.3.1
^ permalink raw reply related
* [RFC qemu 3/4] migration: not set migration bitmap in setup stage
From: Liang Li @ 2016-03-03 10:44 UTC (permalink / raw)
To: quintela, amit.shah, qemu-devel, linux-kernel
Cc: ehabkost, kvm, mst, Liang Li, dgilbert, virtualization, linux-mm,
pbonzini, akpm, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
Set ram_list.dirty_memory instead of the migration bitmap; the migration
bitmap will be updated when doing migration_bitmap_sync().
Set migration_dirty_pages to 0; it will be updated by
migration_bitmap_sync() too.
The following patch is based on this change.
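Why this ordering matters for the next patch can be shown with a simplified model: if every page is first marked in a pending dirty bitmap, free pages can be cleared from it before the sync, and the sync then both fills the migration bitmap and counts the dirty pages. The sketch below uses plain 64-bit words and hypothetical helper names; it is not the QEMU dirty-memory API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits in one word (portable, no compiler builtins). */
static unsigned popcount64(uint64_t w)
{
    unsigned n = 0;

    while (w) {
        w &= w - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Drop every page the guest reported as free from the pending dirty set. */
static void filter_free_pages(uint64_t *dirty, const uint64_t *free_pages,
                              size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++) {
        dirty[i] &= ~free_pages[i];
    }
}

/*
 * Move pending dirty bits into the migration bitmap and clear them from
 * the pending set. Returns how many pages became newly dirty, which is
 * what a dirty-page counter would be incremented by.
 */
static uint64_t sync_dirty(uint64_t *migration, uint64_t *dirty,
                           size_t nwords)
{
    uint64_t newly = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        uint64_t fresh = dirty[i] & ~migration[i];

        migration[i] |= fresh;
        newly += popcount64(fresh);
        dirty[i] = 0;
    }
    return newly;
}
```

Because the counter is derived from the sync rather than preset to the total page count, filtering before the sync keeps both the bitmap and the count consistent.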
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
migration/ram.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 704f6a9..ee2547d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1931,19 +1931,19 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
- bitmap_set(migration_bitmap_rcu->bmap, 0, ram_bitmap_pages);
if (migrate_postcopy_ram()) {
migration_bitmap_rcu->unsentmap = bitmap_new(ram_bitmap_pages);
bitmap_set(migration_bitmap_rcu->unsentmap, 0, ram_bitmap_pages);
}
- /*
- * Count the total number of pages used by ram blocks not including any
- * gaps due to alignment or unplugs.
- */
- migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
+ migration_dirty_pages = 0;
+ QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+ cpu_physical_memory_set_dirty_range(block->offset,
+ block->used_length,
+ DIRTY_MEMORY_MIGRATION);
+ }
memory_global_dirty_log_start();
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
--
1.8.3.1
^ permalink raw reply related
* [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device
From: Liang Li @ 2016-03-03 10:44 UTC (permalink / raw)
To: quintela, amit.shah, qemu-devel, linux-kernel
Cc: ehabkost, kvm, mst, Liang Li, dgilbert, virtualization, linux-mm,
pbonzini, akpm, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
Extend the virtio balloon device to support a new feature. This
new feature helps to get the guest's free pages information, which
can be used for live migration optimization.
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
balloon.c | 30 ++++++++-
hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
include/hw/virtio/virtio-balloon.h | 17 +++++-
include/standard-headers/linux/virtio_balloon.h | 1 +
include/sysemu/balloon.h | 10 ++-
5 files changed, 134 insertions(+), 5 deletions(-)
diff --git a/balloon.c b/balloon.c
index f2ef50c..a37717e 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,7 @@
static QEMUBalloonEvent *balloon_event_fn;
static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonFreePages *balloon_free_pages_fn;
static void *balloon_opaque;
static bool balloon_inhibited;
@@ -65,9 +66,12 @@ static bool have_balloon(Error **errp)
}
int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonFreePages *free_pages_func,
+ void *opaque)
{
- if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+ if (balloon_event_fn || balloon_stat_fn || balloon_free_pages_fn
+ || balloon_opaque) {
/* We're already registered one balloon handler. How many can
* a guest really have?
*/
@@ -75,6 +79,7 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
}
balloon_event_fn = event_func;
balloon_stat_fn = stat_func;
+ balloon_free_pages_fn = free_pages_func;
balloon_opaque = opaque;
return 0;
}
@@ -86,6 +91,7 @@ void qemu_remove_balloon_handler(void *opaque)
}
balloon_event_fn = NULL;
balloon_stat_fn = NULL;
+ balloon_free_pages_fn = NULL;
balloon_opaque = NULL;
}
@@ -116,3 +122,23 @@ void qmp_balloon(int64_t target, Error **errp)
trace_balloon_event(balloon_opaque, target);
balloon_event_fn(balloon_opaque, target);
}
+
+bool balloon_free_pages_support(void)
+{
+ return balloon_free_pages_fn ? true : false;
+}
+
+int balloon_get_free_pages(unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count)
+{
+ if (!balloon_free_pages_fn) {
+ return -1;
+ }
+
+ if (!free_pages_bitmap || !free_pages_count) {
+ return -1;
+ }
+
+ return balloon_free_pages_fn(balloon_opaque,
+ free_pages_bitmap, free_pages_count);
+ }
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index e9c30e9..a5b9d08 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -76,6 +76,12 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
}
+static bool balloon_free_pages_supported(const VirtIOBalloon *s)
+{
+ VirtIODevice *vdev = VIRTIO_DEVICE(s);
+ return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES);
+}
+
static bool balloon_stats_enabled(const VirtIOBalloon *s)
{
return s->stats_poll_interval > 0;
@@ -293,6 +299,37 @@ out:
}
}
+static void virtio_balloon_get_free_pages(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+ VirtQueueElement *elem;
+ size_t offset = 0;
+ uint64_t bitmap_bytes = 0, free_pages_count = 0;
+
+ elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+ if (!elem) {
+ return;
+ }
+ s->free_pages_vq_elem = elem;
+
+ if (!elem->out_num) {
+ return;
+ }
+
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ &free_pages_count, sizeof(uint64_t));
+
+ offset += sizeof(uint64_t);
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ &bitmap_bytes, sizeof(uint64_t));
+
+ offset += sizeof(uint64_t);
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ s->free_pages_bitmap, bitmap_bytes);
+ s->req_status = DONE;
+ s->free_pages_count = free_pages_count;
+}
+
static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
{
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
@@ -362,6 +399,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
f |= dev->host_features;
virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+ virtio_add_feature(&f, VIRTIO_BALLOON_F_GET_FREE_PAGES);
return f;
}
@@ -372,6 +410,45 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
VIRTIO_BALLOON_PFN_SHIFT);
}
+static int virtio_balloon_free_pages(void *opaque,
+ unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count)
+{
+ VirtIOBalloon *s = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(s);
+ VirtQueueElement *elem = s->free_pages_vq_elem;
+ int len;
+
+ if (!balloon_free_pages_supported(s)) {
+ return -1;
+ }
+
+ if (s->req_status == NOT_STARTED) {
+ s->free_pages_bitmap = free_pages_bitmap;
+ s->req_status = STARTED;
+ s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
+ if (!elem->in_num) {
+ elem = virtqueue_pop(s->fvq, sizeof(VirtQueueElement));
+ if (!elem) {
+ return 0;
+ }
+ s->free_pages_vq_elem = elem;
+ }
+ len = iov_from_buf(elem->in_sg, elem->in_num, 0, &s->mem_layout,
+ sizeof(s->mem_layout));
+ virtqueue_push(s->fvq, elem, len);
+ virtio_notify(vdev, s->fvq);
+ return 0;
+ } else if (s->req_status == STARTED) {
+ return 0;
+ } else if (s->req_status == DONE) {
+ *free_pages_count = s->free_pages_count;
+ s->req_status = NOT_STARTED;
+ }
+
+ return 1;
+}
+
static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
{
VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
@@ -429,7 +506,8 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
sizeof(struct virtio_balloon_config));
ret = qemu_add_balloon_handler(virtio_balloon_to_target,
- virtio_balloon_stat, s);
+ virtio_balloon_stat,
+ virtio_balloon_free_pages, s);
if (ret < 0) {
error_setg(errp, "Only one balloon device is supported");
@@ -440,6 +518,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+ s->fvq = virtio_add_queue(vdev, 128, virtio_balloon_get_free_pages);
reset_stats(s);
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..fc173e4 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -23,6 +23,16 @@
#define VIRTIO_BALLOON(obj) \
OBJECT_CHECK(VirtIOBalloon, (obj), TYPE_VIRTIO_BALLOON)
+typedef enum virtio_req_status {
+ NOT_STARTED,
+ STARTED,
+ DONE,
+} VIRTIO_REQ_STATUS;
+
+typedef struct MemLayout {
+ uint64_t low_mem;
+} MemLayout;
+
typedef struct virtio_balloon_stat VirtIOBalloonStat;
typedef struct virtio_balloon_stat_modern {
@@ -33,16 +43,21 @@ typedef struct virtio_balloon_stat_modern {
typedef struct VirtIOBalloon {
VirtIODevice parent_obj;
- VirtQueue *ivq, *dvq, *svq;
+ VirtQueue *ivq, *dvq, *svq, *fvq;
uint32_t num_pages;
uint32_t actual;
uint64_t stats[VIRTIO_BALLOON_S_NR];
VirtQueueElement *stats_vq_elem;
+ VirtQueueElement *free_pages_vq_elem;
size_t stats_vq_offset;
QEMUTimer *stats_timer;
int64_t stats_last_update;
int64_t stats_poll_interval;
uint32_t host_features;
+ uint64_t *free_pages_bitmap;
+ uint64_t free_pages_count;
+ MemLayout mem_layout;
+ VIRTIO_REQ_STATUS req_status;
} VirtIOBalloon;
#endif
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 2e2a6dc..95b7d0c 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_GET_FREE_PAGES 3 /* Get the free pages bitmap */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index 3f976b4..205b272 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -18,11 +18,19 @@
typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
+typedef int (QEMUBalloonFreePages)(void *opaque,
+ unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count);
int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque);
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonFreePages *free_pages_func,
+ void *opaque);
void qemu_remove_balloon_handler(void *opaque);
bool qemu_balloon_is_inhibited(void);
void qemu_balloon_inhibit(bool state);
+bool balloon_free_pages_support(void);
+int balloon_get_free_pages(unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count);
#endif
--
1.8.3.1
^ permalink raw reply related
* [RFC qemu 1/4] pc: Add code to get the lowmem from PCMachineState
From: Liang Li @ 2016-03-03 10:44 UTC (permalink / raw)
To: quintela, amit.shah, qemu-devel, linux-kernel
Cc: ehabkost, kvm, mst, Liang Li, dgilbert, virtualization, linux-mm,
pbonzini, akpm, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>
The lowmem will be used by the following patch to get
a correct free pages bitmap.
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
hw/i386/pc.c | 5 +++++
hw/i386/pc_piix.c | 1 +
hw/i386/pc_q35.c | 1 +
include/hw/i386/pc.h | 3 ++-
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 0aeefd2..f794a84 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1115,6 +1115,11 @@ void pc_hot_add_cpu(const int64_t id, Error **errp)
object_unref(OBJECT(cpu));
}
+ram_addr_t pc_get_lowmem(PCMachineState *pcms)
+{
+ return pcms->lowmem;
+}
+
void pc_cpus_init(PCMachineState *pcms)
{
int i;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 6f8c2cd..268a08c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -113,6 +113,7 @@ static void pc_init1(MachineState *machine,
}
}
+ pcms->lowmem = lowmem;
if (machine->ram_size >= lowmem) {
pcms->above_4g_mem_size = machine->ram_size - lowmem;
pcms->below_4g_mem_size = lowmem;
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 46522c9..8d9bd39 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -101,6 +101,7 @@ static void pc_q35_init(MachineState *machine)
}
}
+ pcms->lowmem = lowmem;
if (machine->ram_size >= lowmem) {
pcms->above_4g_mem_size = machine->ram_size - lowmem;
pcms->below_4g_mem_size = lowmem;
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 8b3546e..3694c91 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -60,7 +60,7 @@ struct PCMachineState {
bool nvdimm;
/* RAM information (sizes, addresses, configuration): */
- ram_addr_t below_4g_mem_size, above_4g_mem_size;
+ ram_addr_t below_4g_mem_size, above_4g_mem_size, lowmem;
/* CPU and apic information: */
bool apic_xrupt_override;
@@ -229,6 +229,7 @@ void pc_hot_add_cpu(const int64_t id, Error **errp);
void pc_acpi_init(const char *default_dsdt);
void pc_guest_info_init(PCMachineState *pcms);
+ram_addr_t pc_get_lowmem(PCMachineState *pcms);
#define PCI_HOST_PROP_PCI_HOLE_START "pci-hole-start"
#define PCI_HOST_PROP_PCI_HOLE_END "pci-hole-end"
--
1.8.3.1
^ permalink raw reply related
* [RFC qemu 0/4] A PV solution for live migration optimization
From: Liang Li @ 2016-03-03 10:44 UTC (permalink / raw)
To: quintela, amit.shah, qemu-devel, linux-kernel
Cc: ehabkost, kvm, mst, Liang Li, dgilbert, virtualization, linux-mm,
pbonzini, akpm, rth
The current QEMU live migration implementation marks all of the
guest's RAM pages as dirty in the RAM bulk stage. All of these pages
are then processed, which takes quite a lot of CPU cycles.
From the guest's point of view, the content of free pages does not
matter. We can make use of this fact and skip processing the free
pages in the RAM bulk stage. This saves a lot of CPU cycles, reduces
network traffic significantly, and noticeably speeds up the live
migration process.
This patch set is the QEMU side implementation.
The virtio-balloon is extended so that QEMU can get the free pages
information from the guest through virtio.
After getting the free pages information (a bitmap), QEMU can use it
to filter out the guest's free pages in the RAM bulk stage. This makes
the live migration process much more efficient.
This RFC version does not take post-copy and RDMA into
consideration; both may be able to benefit from this PV solution
with some extra modifications.
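To illustrate the filtering step described above, here is a minimal, self-contained sketch. It is not the code from this series: the helper name filter_out_free_pages(), the bitmap layout (one bit per page, packed into unsigned long words), and the return value are assumptions made only for this example. It clears from a migration bitmap every page that the guest reported as free.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * Hypothetical helper (not part of the patch set): clear every bit in
 * migration_bitmap whose page is marked free in free_pages_bitmap.
 * Both bitmaps hold one bit per page. Returns the number of pages that
 * are still marked for transfer afterwards.
 */
static size_t filter_out_free_pages(unsigned long *migration_bitmap,
                                    const unsigned long *free_pages_bitmap,
                                    size_t nr_pages)
{
    size_t i, remaining = 0;

    assert(migration_bitmap != NULL && free_pages_bitmap != NULL);

    /* Word-at-a-time: keep a page only if it is not reported free. */
    for (i = 0; i < BITS_TO_LONGS(nr_pages); i++) {
        migration_bitmap[i] &= ~free_pages_bitmap[i];
    }

    /* Count the pages that still need to be sent. */
    for (i = 0; i < nr_pages; i++) {
        if (migration_bitmap[i / BITS_PER_LONG] &
            (1UL << (i % BITS_PER_LONG))) {
            remaining++;
        }
    }
    return remaining;
}
```

With eight pages all marked dirty and the lowest four reported free, the helper leaves only the upper four pages set, which is the effect the bulk stage wants: free pages are never read, compressed, or sent.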
Performance data
================
Test environment:
CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Host RAM: 64GB
Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
Guest Linux Kernel: 4.5-rc6 Guest OS: CentOS 6.6
Network: X540-AT2 with 10 Gigabit connection
Guest RAM: 8GB
Case 1: Idle guest, just booted:
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |   1894   |    421
--------------------------------------------
transferred ram(KB) |  398017  |  353242
============================================
Case 2: The guest has previously run a memory-consuming workload; the
workload is terminated just before live migration.
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |   7436   |    552
--------------------------------------------
transferred ram(KB) | 8146291  |  361375
============================================
Liang Li (4):
pc: Add code to get the lowmem from PCMachineState
virtio-balloon: Add a new feature to balloon device
migration: not set migration bitmap in setup stage
migration: filter out guest's free pages in ram bulk stage
balloon.c | 30 ++++++++-
hw/i386/pc.c | 5 ++
hw/i386/pc_piix.c | 1 +
hw/i386/pc_q35.c | 1 +
hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
include/hw/i386/pc.h | 3 +-
include/hw/virtio/virtio-balloon.h | 17 +++++-
include/standard-headers/linux/virtio_balloon.h | 1 +
include/sysemu/balloon.h | 10 ++-
migration/ram.c | 64 +++++++++++++++----
10 files changed, 195 insertions(+), 18 deletions(-)
--
1.8.3.1
^ permalink raw reply
* Re: [PATCH 0/2] virtio/s390 patches
From: Michael S. Tsirkin @ 2016-03-01 13:01 UTC (permalink / raw)
To: Cornelia Huck; +Cc: borntraeger, virtualization, kvm, linux-s390
In-Reply-To: <1456836293-49239-1-git-send-email-cornelia.huck@de.ibm.com>
On Tue, Mar 01, 2016 at 01:44:51PM +0100, Cornelia Huck wrote:
> Hi Michael,
>
> here are two virtio/s390 patches (one cleanup, one bugfix), prepared
> against your vhost branch of mst/vhost.git.
>
> Please apply.
Will do, thanks!
> Cornelia Huck (1):
> virtio/s390: size of SET_IND payload
>
> Geliang Tang (1):
> virtio/s390: use dev_to_virtio
>
> drivers/s390/virtio/virtio_ccw.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> --
> 2.3.9
^ permalink raw reply
* [PATCH 2/2] virtio/s390: size of SET_IND payload
From: Cornelia Huck @ 2016-03-01 12:44 UTC (permalink / raw)
To: mst; +Cc: borntraeger, virtualization, kvm, linux-s390
In-Reply-To: <1456836293-49239-1-git-send-email-cornelia.huck@de.ibm.com>
SET_IND takes as a payload the _address_ of the indicators, meaning
that we have one of the rare cases where kmalloc(sizeof(&...)) is
actually correct. Let's clarify that with a comment.
The count for the ccw, however, was only correct because the
indicators are 64 bit. Let's use the correct value.
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
drivers/s390/virtio/virtio_ccw.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 46b110a1..8688ad4 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -342,13 +342,14 @@ static void virtio_ccw_drop_indicator(struct virtio_ccw_device *vcdev,
ccw->count = sizeof(*thinint_area);
ccw->cda = (__u32)(unsigned long) thinint_area;
} else {
+ /* payload is the address of the indicators */
indicatorp = kmalloc(sizeof(&vcdev->indicators),
GFP_DMA | GFP_KERNEL);
if (!indicatorp)
return;
*indicatorp = 0;
ccw->cmd_code = CCW_CMD_SET_IND;
- ccw->count = sizeof(vcdev->indicators);
+ ccw->count = sizeof(&vcdev->indicators);
ccw->cda = (__u32)(unsigned long) indicatorp;
}
/* Deregister indicators from host. */
@@ -656,7 +657,10 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
}
}
ret = -ENOMEM;
- /* We need a data area under 2G to communicate. */
+ /*
+ * We need a data area under 2G to communicate. Our payload is
+ * the address of the indicators.
+ */
indicatorp = kmalloc(sizeof(&vcdev->indicators), GFP_DMA | GFP_KERNEL);
if (!indicatorp)
goto out;
@@ -672,7 +676,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
vcdev->indicators = 0;
ccw->cmd_code = CCW_CMD_SET_IND;
ccw->flags = 0;
- ccw->count = sizeof(vcdev->indicators);
+ ccw->count = sizeof(&vcdev->indicators);
ccw->cda = (__u32)(unsigned long) indicatorp;
ret = ccw_io_helper(vcdev, ccw, VIRTIO_CCW_DOING_SET_IND);
if (ret)
@@ -683,7 +687,7 @@ static int virtio_ccw_find_vqs(struct virtio_device *vdev, unsigned nvqs,
vcdev->indicators2 = 0;
ccw->cmd_code = CCW_CMD_SET_CONF_IND;
ccw->flags = 0;
- ccw->count = sizeof(vcdev->indicators2);
+ ccw->count = sizeof(&vcdev->indicators2);
ccw->cda = (__u32)(unsigned long) indicatorp;
ret = ccw_io_helper(vcdev, ccw, VIRTIO_CCW_DOING_SET_CONF_IND);
if (ret)
--
2.3.9
^ permalink raw reply related
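The SET_IND patch above hinges on the difference between the size of a field and the size of a pointer to that field. The sketch below shows why the old count only happened to be right. struct demo_dev and its narrow_indicators member are hypothetical stand-ins invented for this example; only the sizeof expressions mirror the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the device structure; not the kernel's type. */
struct demo_dev {
    uint64_t indicators;        /* 64-bit field, as in the patch */
    uint32_t narrow_indicators; /* invented narrower field for contrast */
};

/* Size of the payload when the payload is the ADDRESS of the field. */
static size_t count_by_address(const struct demo_dev *d)
{
    return sizeof(&d->indicators);
}

/* Size of the field itself; matches the address size only by coincidence. */
static size_t count_by_object(const struct demo_dev *d)
{
    return sizeof(d->indicators);
}

/* The same two computations for the narrower, invented field. */
static size_t narrow_count_by_address(const struct demo_dev *d)
{
    return sizeof(&d->narrow_indicators);
}

static size_t narrow_count_by_object(const struct demo_dev *d)
{
    assert(d != NULL);
    return sizeof(d->narrow_indicators);
}
```

On a platform with 64-bit pointers, count_by_address() and count_by_object() both yield 8, so the old code worked. For the 32-bit field the object size drops to 4 while the address size stays at the pointer width, which is the mismatch the patch guards against.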
* [PATCH 1/2] virtio/s390: use dev_to_virtio
From: Cornelia Huck @ 2016-03-01 12:44 UTC (permalink / raw)
To: mst; +Cc: linux-s390, kvm, Geliang Tang, virtualization, borntraeger
In-Reply-To: <1456836293-49239-1-git-send-email-cornelia.huck@de.ibm.com>
From: Geliang Tang <geliangtang@163.com>
Use dev_to_virtio() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Message-Id: <912bf59bd3a48f2d4d4994681e898dc084fe29d3.1451484163.git.geliangtang@163.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
drivers/s390/virtio/virtio_ccw.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index bf2d130..46b110a1 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -945,8 +945,7 @@ static struct virtio_config_ops virtio_ccw_config_ops = {
static void virtio_ccw_release_dev(struct device *_d)
{
- struct virtio_device *dev = container_of(_d, struct virtio_device,
- dev);
+ struct virtio_device *dev = dev_to_virtio(_d);
struct virtio_ccw_device *vcdev = to_vc_device(dev);
kfree(vcdev->status);
--
2.3.9
^ permalink raw reply related
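For readers unfamiliar with the idiom that the dev_to_virtio patch above replaces, here is a user-space sketch of container_of(). The macro is a simplified version without the kernel's type checking, and struct demo_device, struct demo_virtio_device, and dev_to_virtio_demo() are names invented for this example rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct device and struct virtio_device. */
struct demo_device {
    int id;
};

struct demo_virtio_device {
    int index;
    struct demo_device dev; /* embedded generic device */
};

/* A named wrapper, analogous in spirit to dev_to_virtio(). */
static struct demo_virtio_device *dev_to_virtio_demo(struct demo_device *d)
{
    assert(d != NULL);
    return container_of(d, struct demo_virtio_device, dev);
}
```

Wrapping the container_of() call in a named helper keeps the member name and the enclosing type in one place, so callers such as a release callback only need the generic device pointer.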
* [PATCH 0/2] virtio/s390 patches
From: Cornelia Huck @ 2016-03-01 12:44 UTC (permalink / raw)
To: mst; +Cc: borntraeger, virtualization, kvm, linux-s390
Hi Michael,
here are two virtio/s390 patches (one cleanup, one bugfix), prepared
against your vhost branch of mst/vhost.git.
Please apply.
Cornelia Huck (1):
virtio/s390: size of SET_IND payload
Geliang Tang (1):
virtio/s390: use dev_to_virtio
drivers/s390/virtio/virtio_ccw.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
--
2.3.9
^ permalink raw reply
* Re: [PATCH V3 3/3] vhost_net: basic polling support
From: Michael S. Tsirkin @ 2016-02-29 9:03 UTC (permalink / raw)
To: Jason Wang
Cc: yang.zhang.wz, RAPOPORT, kvm, netdev, linux-kernel,
virtualization
In-Reply-To: <56D3D404.6080600@redhat.com>
On Mon, Feb 29, 2016 at 01:15:48PM +0800, Jason Wang wrote:
>
>
> On 02/28/2016 10:09 PM, Michael S. Tsirkin wrote:
> > On Fri, Feb 26, 2016 at 04:42:44PM +0800, Jason Wang wrote:
> >> > This patch tries to poll for new added tx buffer or socket receive
> >> > queue for a while at the end of tx/rx processing. The maximum time
> >> > spent on polling were specified through a new kind of vring ioctl.
> >> >
> >> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > Looks good overall, but I still see one problem.
> >
> >> > ---
> >> > drivers/vhost/net.c | 79 +++++++++++++++++++++++++++++++++++++++++++---
> >> > drivers/vhost/vhost.c | 14 ++++++++
> >> > drivers/vhost/vhost.h | 1 +
> >> > include/uapi/linux/vhost.h | 6 ++++
> >> > 4 files changed, 95 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> > index 9eda69e..c91af93 100644
> >> > --- a/drivers/vhost/net.c
> >> > +++ b/drivers/vhost/net.c
> >> > @@ -287,6 +287,44 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
> >> > rcu_read_unlock_bh();
> >> > }
> >> >
> >> > +static inline unsigned long busy_clock(void)
> >> > +{
> >> > + return local_clock() >> 10;
> >> > +}
> >> > +
> >> > +static bool vhost_can_busy_poll(struct vhost_dev *dev,
> >> > + unsigned long endtime)
> >> > +{
> >> > + return likely(!need_resched()) &&
> >> > + likely(!time_after(busy_clock(), endtime)) &&
> >> > + likely(!signal_pending(current)) &&
> >> > + !vhost_has_work(dev) &&
> >> > + single_task_running();
> > So I find it quite unfortunate that this still uses single_task_running.
> > This means that for example a SCHED_IDLE task will prevent polling from
> > becoming active, and that seems like a bug, or at least
> > an undocumented feature :).
>
> Yes, it may need more thoughts.
>
> >
> > Unfortunately this logic affects the behaviour as observed
> > by userspace, so we can't merge it like this and tune
> > afterwards, since otherwise management tools will start
> > depending on this logic.
> >
> >
>
> How about removing single_task_running() here first and optimizing on top?
> We probably need something like this to handle overcommitment.
Sounds good.
--
MST
^ permalink raw reply
* Re: [PATCH V3 3/3] vhost_net: basic polling support
From: Jason Wang @ 2016-02-29 5:17 UTC (permalink / raw)
To: Christian Borntraeger, kvm, mst, virtualization, netdev,
linux-kernel
Cc: yang.zhang.wz, RAPOPORT
In-Reply-To: <56D36D25.6070903@de.ibm.com>
On 02/29/2016 05:56 AM, Christian Borntraeger wrote:
> On 02/26/2016 09:42 AM, Jason Wang wrote:
>> > This patch tries to poll for new added tx buffer or socket receive
>> > queue for a while at the end of tx/rx processing. The maximum time
>> > spent on polling were specified through a new kind of vring ioctl.
>> >
>> > Signed-off-by: Jason Wang <jasowang@redhat.com>
>> > ---
>> > drivers/vhost/net.c | 79 +++++++++++++++++++++++++++++++++++++++++++---
>> > drivers/vhost/vhost.c | 14 ++++++++
>> > drivers/vhost/vhost.h | 1 +
>> > include/uapi/linux/vhost.h | 6 ++++
>> > 4 files changed, 95 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> > index 9eda69e..c91af93 100644
>> > --- a/drivers/vhost/net.c
>> > +++ b/drivers/vhost/net.c
>> > @@ -287,6 +287,44 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
>> > rcu_read_unlock_bh();
>> > }
>> >
>> > +static inline unsigned long busy_clock(void)
>> > +{
>> > + return local_clock() >> 10;
>> > +}
>> > +
>> > +static bool vhost_can_busy_poll(struct vhost_dev *dev,
>> > + unsigned long endtime)
>> > +{
>> > + return likely(!need_resched()) &&
>> > + likely(!time_after(busy_clock(), endtime)) &&
>> > + likely(!signal_pending(current)) &&
>> > + !vhost_has_work(dev) &&
>> > + single_task_running();
>> > +}
>> > +
>> > +static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>> > + struct vhost_virtqueue *vq,
>> > + struct iovec iov[], unsigned int iov_size,
>> > + unsigned int *out_num, unsigned int *in_num)
>> > +{
>> > + unsigned long uninitialized_var(endtime);
>> > + int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>> > + out_num, in_num, NULL, NULL);
>> > +
>> > + if (r == vq->num && vq->busyloop_timeout) {
>> > + preempt_disable();
>> > + endtime = busy_clock() + vq->busyloop_timeout;
>> > + while (vhost_can_busy_poll(vq->dev, endtime) &&
>> > + vhost_vq_avail_empty(vq->dev, vq))
>> > + cpu_relax();
> Can you use cpu_relax_lowlatency (which should be the same as cpu_relax for almost
> everybody but s390)? cpu_relax (without low latency) might give up the time slice
> when running under another hypervisor (like LPAR on s390), which might not be what
> we want here.
OK, will do this in the next version.
^ permalink raw reply
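The bounded busy-poll pattern discussed in this thread can be sketched in user space as follows. This is only an approximation under stated assumptions: clock_gettime(CLOCK_MONOTONIC) stands in for local_clock(), the ">> 10" shift converts nanoseconds to units of 1024 ns as in the quoted patch, and busy_clock_demo(), time_after_demo(), and busy_poll_demo() are names invented for this example. The kernel-only conditions (need_resched(), signal_pending(), vhost_has_work(), single_task_running()) are deliberately left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Monotonic time in units of 1024 ns, roughly microseconds. */
static unsigned long busy_clock_demo(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (unsigned long)(((unsigned long long)ts.tv_sec * 1000000000ULL +
                            (unsigned long long)ts.tv_nsec) >> 10);
}

/* Wraparound-tolerant "a is later than b", in the style of time_after(). */
static bool time_after_demo(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Spin until *ready becomes true or the budget (in busy_clock_demo()
 * units) is used up. Returns whether work showed up in time.
 */
static bool busy_poll_demo(const volatile bool *ready, unsigned long budget)
{
    unsigned long endtime = busy_clock_demo() + budget;

    while (!*ready && !time_after_demo(busy_clock_demo(), endtime)) {
        /* A real implementation would issue a CPU relax hint here. */
    }
    return *ready;
}
```

The deadline keeps the cost of polling bounded: if no work arrives within the budget, the caller falls back to the normal notification path instead of spinning indefinitely.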