Linux virtualization list
* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-11  2:38 UTC
  To: Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	simhan@hpe.com, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org, jitendra.kolhe@hpe.com, linux-mm@kvack.org,
	mohan_parthasarathy@hpe.com, Amit Shah, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160310111844.GB2276@work-vm>

> 
> Hi,
>   I'm just catching back up on this thread; so without reference to any
> particular previous mail in the thread.
> 
>   1) How many of the free pages do we tell the host about?
>      Your main change is telling the host about all the
>      free pages.

Yes, all the guest's free pages.

>      If we tell the host about all the free pages, then we might
>      end up needing to allocate more pages and update the host
>      with pages we now want to use; that would have to wait for the
>      host to acknowledge that use of these pages, since if we don't
>      wait for it then it might have skipped migrating a page we
>      just started using (I don't understand how your series solves that).
>      So the guest probably needs to keep some free pages - how many?

Actually, there is no need to care whether the free pages will be used by the host.
We only care about the free pages that get reused by the guest, right?

Dirty page logging can be used to solve this: start dirty page logging before getting
the free page information from the guest. Even if some of the free pages are modified by
the guest while the free page information is being collected, the modified pages will be
tracked by the dirty page logging mechanism. So in the subsequent migration_bitmap_sync()
call, pages that are in the free page bitmap but were later modified will be reset to
dirty. We won't omit any dirtied pages.

So, guest doesn't need to keep any free pages.
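Here is a minimal C sketch of that ordering. It assumes a toy guest of 64 pages, so each bitmap fits in one 64-bit word. The sketch loosely imitates migration_bitmap_sync(). Every other name in it (migration_bitmap, dirty_log, apply_free_page_bitmap() and so on) is hypothetical and is not QEMU's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy guest: 64 pages, so one uint64_t holds a whole bitmap. */
#define NR_PAGES 64
#define PAGE_BIT(pfn) (UINT64_C(1) << (pfn))

/* Bit set = page still has to be sent to the destination. */
static uint64_t migration_bitmap;
/* Bit set = guest wrote the page since dirty logging was started. */
static uint64_t dirty_log;

/* Step 1: start dirty logging BEFORE asking the guest for free pages. */
static void start_dirty_logging(void)
{
    dirty_log = 0;
}

/* Models a guest write while logging is active. */
static void guest_write(unsigned int pfn)
{
    assert(pfn < NR_PAGES);
    dirty_log |= PAGE_BIT(pfn);
}

/* Step 2: drop the pages the guest reported as free from the send set. */
static void apply_free_page_bitmap(uint64_t free_bitmap)
{
    migration_bitmap &= ~free_bitmap;
}

/*
 * Step 3, loosely modeled on migration_bitmap_sync(): every page written
 * since logging started is marked for sending again, even if the guest
 * had reported it as free.
 */
static void sync_dirty_log(void)
{
    migration_bitmap |= dirty_log;
    dirty_log = 0;
}
```

Suppose the guest reports pages 10-12 as free but writes to page 11 before the sync. Then pages 10 and 12 are skipped and page 11 is still sent. Starting the log first is what closes the race: a write that lands between the guest's report and the sync is always recorded.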

>   2) Clearing out caches
>      Does it make sense to clean caches?  They're apparently useful data
>      so if we clean them it's likely to slow the guest down; I guess
>      they're also likely to be fairly static data - so at least fairly
>      easy to migrate.
>      The answer here partially depends on what you want from your migration;
>      if you're after the fastest possible migration time it might make
>      sense to clean the caches and avoid migrating them; but that might
>      be at the cost of more disruption to the guest - there's a trade off
>      somewhere and it's not clear to me how you set that depending on your
>      guest/network/requirements.
> 

Yes, cleaning the caches is an option. Let the users decide whether to use it or not.

>   3) Why is ballooning slow?
>      You've got a figure of 5s to balloon on an 8GB VM - but an
>      8GB VM isn't huge; so I worry about how long it would take
>      on a big VM.   We need to understand why it's slow
>        * is it due to the guest shuffling pages around?
>        * is it due to the virtio-balloon protocol sending one page
>          at a time?
>          + Do balloon pages normally clump in physical memory
>             - i.e. would a 'large balloon' message help
>             - or do we need a bitmap because it tends not to clump?
> 

I didn't do a comprehensive test, but I found that most of the time is spent
allocating the pages and sending the PFNs to the host. I don't know which is
the more time-consuming operation, allocating the pages or sending the PFNs.

>        * is it due to the madvise on the host?
>          If we were using the normal balloon messages, then we
>          could, during migration, just route those to the migration
>          code rather than bothering with the madvise.
>          If they're clumping together we could just turn that into
>          one big madvise; if they're not then would we benefit from
>          a call that lets us madvise lots of areas?
> 

My test showed that madvise() is not the main cause of the long time; it takes only
10% of the total balloon inflation time.
A big madvise could improve the performance somewhat.
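The quoted question proposes "one big madvise". One way to sketch that is to coalesce runs of consecutive PFNs into ranges, so each run needs one call instead of one call per 4 KiB page. This is a hypothetical helper, not QEMU or virtio-balloon code. It assumes the PFN list is sorted and free of duplicates. It only computes the ranges and does not call madvise():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type: pages [start_pfn, start_pfn + nr_pages). */
struct pfn_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/*
 * Coalesce a sorted, duplicate-free array of PFNs into contiguous ranges.
 * Returns the number of ranges written to 'out' ('out' must hold n entries).
 * Each range could then be discarded on the host with one
 * madvise(base + start_pfn * page_size, nr_pages * page_size, MADV_DONTNEED)
 * instead of one madvise() per page ('base' being the guest RAM mapping).
 */
static size_t coalesce_pfns(const uint64_t *pfns, size_t n, struct pfn_range *out)
{
    size_t nr = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || pfns[i] > pfns[i - 1]);   /* precondition: sorted */
        if (nr > 0 && out[nr - 1].start_pfn + out[nr - 1].nr_pages == pfns[i]) {
            out[nr - 1].nr_pages++;                /* extends the current run */
        } else {
            out[nr].start_pfn = pfns[i];           /* starts a new run */
            out[nr].nr_pages = 1;
            nr++;
        }
    }
    return nr;
}
```

How much this helps depends on whether ballooned pages actually clump together in guest-physical memory. The thread has not measured that yet.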

>   4) Speeding up the migration of those free pages
>     You're using the bitmap to avoid migrating those free pages; HPe's
>     patchset is reconstructing a bitmap from the balloon data;  OK, so
>     this all makes sense to avoid migrating them - I'd also been thinking
>     of using pagemap to spot zero pages that would help find other zero'd
>     pages, but perhaps ballooned is enough?
> 
Could you describe your idea in more detail?

>   5) Second-migrate
>     Given a VM where you've done all those tricks on, what happens when
>     you migrate it a second time?   I guess you're aiming for the guest
>     to update its bitmap;  HPe's solution is to migrate its balloon
>     bitmap along with the migration data.

Nothing is special about the second migration: QEMU will request free page
information from the guest, and the guest will traverse its current free page list to
construct a new free page bitmap and send it to QEMU, just as in the first migration.
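Here is a sketch of that guest-side step. It assumes the free lists can be walked as (start PFN, buddy order) pairs, one pair per free block of 2^order pages. struct free_block and build_free_bitmap() are hypothetical and do not reflect the RFC's actual guest code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one free buddy block: 2^order pages at pfn. */
struct free_block {
    uint64_t pfn;
    unsigned int order;
};

static void set_bit64(uint64_t *bitmap, uint64_t pfn)
{
    bitmap[pfn / 64] |= UINT64_C(1) << (pfn % 64);
}

static int test_bit64(const uint64_t *bitmap, uint64_t pfn)
{
    return (int)((bitmap[pfn / 64] >> (pfn % 64)) & 1);
}

/*
 * Rebuild the free page bitmap from scratch on every request, so a second
 * migration simply reflects the guest's current free lists.
 * 'bitmap' must have room for at least nr_pages bits.
 */
static void build_free_bitmap(const struct free_block *blocks, size_t nblocks,
                              uint64_t *bitmap, uint64_t nr_pages)
{
    memset(bitmap, 0, ((nr_pages + 63) / 64) * sizeof(uint64_t));
    for (size_t i = 0; i < nblocks; i++) {
        uint64_t count = UINT64_C(1) << blocks[i].order;

        assert(blocks[i].pfn + count <= nr_pages);
        for (uint64_t p = 0; p < count; p++)
            set_bit64(bitmap, blocks[i].pfn + p);
    }
}
```

The bitmap is cleared and rebuilt on every request, so nothing from a previous migration needs to be carried along.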

Liang
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: virtio-vsock live migration
From: Michael S. Tsirkin @ 2016-03-10 23:56 UTC
  To: Stefan Hajnoczi
  Cc: virtio-dev, Claudio Imbrenda, Christian Borntraeger,
	Matt Benjamin, virtualization, Christoffer Dall
In-Reply-To: <20160303153737.GA19780@stefanha-x1.localdomain>

On Thu, Mar 03, 2016 at 03:37:37PM +0000, Stefan Hajnoczi wrote:
> Michael pointed out that the virtio-vsock draft specification does not
> address live migration and in fact currently precludes migration.
> 
> Migration is fundamental so the device specification at least mustn't
> preclude it.  Having brainstormed migration with Matthew Benjamin and
> Michael Tsirkin, I am now summarizing the approach that I want to
> include in the next draft specification.
> 
> Feedback and comments welcome!  In the meantime I will implement this in
> code and update the draft specification.
> 
> 1. Requirements
> 
> Virtio-vsock is a new AF_VSOCK transport.  As such, it should provide at
> least the same guarantees as the existing AF_VSOCK VMCI transport.  This
> is for consistency and to allow code reuse across any AF_VSOCK
> transport.
> 
> Virtio-vsock aims to replace virtio-serial by providing the same
> guest/host communication ability but with sockets API semantics that are
> more popular and convenient for application developers.  Therefore
> virtio-vsock migration should provide at least the same level of
> migration functionality as virtio-serial.
> 
> Ideally it should be possible to migrate applications using AF_VSOCK
> together with the virtual machine so that guest<->host communication is
> uninterrupted.  Neither AF_VSOCK VMCI nor virtio-serial supports this
> today.

I'm not sure why you say this about virtio-serial.
It appears that if the host pre-connects to the destination
QEMU before migration, the backend reconnects transparently
on the destination.


> 2. Basic disruptive migration flow
> 
> When the virtual machine migrates from the source host to the
> destination host, the guest's CID may change.  The CID namespace is
> host-wide


BTW, I think CIDs would have to become per network namespace.

> so other hosts may have CID collisions and allocate a new CID
> for incoming migration VMs.

I guess all this is so that the guest can retrieve its CID and
send it to the host using some side-channel?


> The device notifies the guest that the CID has changed.  Guest sockets
> are affected as follows:
> 
>  * Established connections are reset (ECONNRESET) and the guest
>    application will have to reconnect.
> 
>  * Listen sockets remain open.  The only thing to note is that
>    connections from the host are now made to the new CID.  This means
>    the local address of the listen socket is automatically updated to
>    the new CID.
> 
>  * Sockets in other states are unchanged.
> 
> Applications must handle disruptive migration by reconnecting if
> necessary after ECONNRESET.
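Below is a minimal guest-side sketch of that reconnect requirement. It uses the Linux AF_VSOCK socket API (<linux/vm_sockets.h>, struct sockaddr_vm, VMADDR_CID_HOST). The helper names are assumptions made for illustration, as is the policy of reconnecting only on ECONNRESET. The proposal itself does not prescribe an application-side API:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Open a stream connection to 'port' on the host (well-known CID 2). */
int vsock_connect_host(unsigned int port)
{
    struct sockaddr_vm addr;
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = VMADDR_CID_HOST;
    addr.svm_port = port;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Under the proposal, established connections see ECONNRESET after a CID change. */
int should_reconnect(int err)
{
    return err == ECONNRESET;
}

/*
 * Call after send()/recv() on 'fd' failed with errno value 'err'.
 * On ECONNRESET, close the dead socket and reconnect; returns the new fd,
 * or -1 if the error was not a reset or reconnecting failed.
 */
int handle_io_error(int fd, int err, unsigned int port)
{
    assert(fd >= 0);
    if (!should_reconnect(err))
        return -1;
    close(fd);
    return vsock_connect_host(port);
}
```

A caller would invoke handle_io_error() after a failed send() or recv() and continue with the returned descriptor. Any other error is left to the application to handle.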
> 
> 3. Checkpoint/restore for seamless migration
> 
> Applications that wish to communicate across live migration can do so
> but this requires extra application-specific checkpoint/restore code.
> 
> This is similar to the approach taken by the CRIU project where
> getsockopt()/setsockopt() is used to migrate socket state.  The
> difference is that the application process is not automatically migrated
> from the source host to the destination host.  Therefore, the
> application needs to migrate its own state somehow.
> 
> The flow is as follows:
> 
> The application on the source host must quiesce (stop sending/receiving)
> and use getsockopt() to extract socket state information from the host
> kernel.
> 
> A new instance of the application is started on the destination host and
> given the state so it can restore the connection.  The setsockopt()
> syscall is used to restore socket state information.
> 
> The guest is given a list of <host_old_cid, host_new_cid, host_port,
> guest_port> tuples for established connections that must not be reset
> when the guest CID update notification is received.  These connections
> will carry on as if nothing changed.
> 
> Note that the connection's remote address is updated from host_old_cid
> to host_new_cid.  This allows remapping of CIDs (if necessary).
> Typically this will be unused because the host always has well-known CID
> 2.  In a guest<->guest scenario it may be used to remap CIDs.
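The keep-or-reset rule for established connections can be sketched as follows. struct vsock_conn, struct preserved_conn and apply_cid_update() are hypothetical, and the sketch models established connections only. A connection that matches a <host_old_cid, host_port, guest_port> entry keeps running, with its remote CID rewritten to host_new_cid. Every other connection is reset:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One tuple from the list handed to the guest before the CID update. */
struct preserved_conn {
    uint32_t host_old_cid;
    uint32_t host_new_cid;
    uint32_t host_port;
    uint32_t guest_port;
};

/* Hypothetical model of an established guest socket. */
struct vsock_conn {
    uint32_t remote_cid;   /* peer (host) side */
    uint32_t remote_port;
    uint32_t local_port;   /* guest side */
    int reset;             /* 1 if the guest would report ECONNRESET */
};

static const struct preserved_conn *
find_preserved(const struct vsock_conn *c,
               const struct preserved_conn *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].host_old_cid == c->remote_cid &&
            list[i].host_port == c->remote_port &&
            list[i].guest_port == c->local_port)
            return &list[i];
    }
    return NULL;
}

/* On a guest CID update: keep listed connections (remapping the remote
 * CID) and reset all others. */
static void apply_cid_update(struct vsock_conn *conns, size_t nconns,
                             const struct preserved_conn *list, size_t nlist)
{
    assert(conns != NULL || nconns == 0);
    for (size_t i = 0; i < nconns; i++) {
        const struct preserved_conn *p = find_preserved(&conns[i], list, nlist);

        if (p)
            conns[i].remote_cid = p->host_new_cid;
        else
            conns[i].reset = 1;
    }
}
```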
> 
> 
> For the time being I am focussing on the basic disruptive migration flow
> only.  Checkpoint/restore can be added with a feature bit in the future.
> It is a lot more complex and I'm not sure whether there will be any
> users yet.
> 
> Stefan

This makes some things harder. For example, imagine a guest
reboot mixed with migration. We don't know why the connection
died, so how long do we keep retrying?

Could you please describe a user of vsock and show how
it recovers from destructive migration?

-- 
MST


* Re: [RFC -next 2/2] virtio_net: Read and use the advised MTU
From: Sergei Shtylyov @ 2016-03-10 18:56 UTC
  To: Aaron Conole, netdev, Michael S. Tsirkin, virtualization,
	linux-kernel
In-Reply-To: <1457620092-24170-3-git-send-email-aconole@redhat.com>

Hello.

On 03/10/2016 05:28 PM, Aaron Conole wrote:

> This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it
> exists, read the advised MTU and use it.
>
> No proper error handling is provided for the case where a user changes the
> negotiated MTU. A future commit will add proper error handling. Instead, a
> warning is emitted if the guest changes the device MTU after previously being
> given advice.
>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
>   drivers/net/virtio_net.c | 15 ++++++++++++++-
>   1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 767ab11..7175563 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
[...]
> @@ -1390,8 +1391,12 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>
>   static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>   {
> +	struct virtnet_info *vi = netdev_priv(dev);
>   	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
>   		return -EINVAL;
> +	if (vi->negotiated_mtu == true) {
> +		pr_warn("changing mtu from negotiated mtu.");
> +	}

    {} not needed, see Documentation/CodingStyle.

[...]

MBR, Sergei


* KVM Forum 2016: Call For Participation
From: Paolo Bonzini @ 2016-03-10 18:09 UTC
  To: qemu-devel, KVM list, libvir-list, edk2-devel@ml01.01.org,
	seabios@seabios.org, Linux Virtualization

=================================================================
KVM Forum 2016: Call For Participation
August 24-26, 2016 - Westin Harbour Castle - Toronto, Canada

(All submissions must be received before midnight May 1, 2016)
=================================================================

KVM Forum is an annual event that presents a rare opportunity
for developers and users to meet, discuss the state of Linux
virtualization technology, and plan for the challenges ahead. 
We invite you to lead part of the discussion by submitting a speaking
proposal for KVM Forum 2016.

At this highly technical conference, developers driving innovation
in the KVM virtualization stack (Linux, KVM, QEMU, libvirt) can
meet users who depend on KVM as part of their offerings, or to
power their data centers and clouds.

KVM Forum will include sessions on the state of the KVM
virtualization stack, planning for the future, and many
opportunities for attendees to collaborate. As we celebrate ten years
of KVM development in the Linux kernel, KVM continues to be a
critical part of the FOSS cloud infrastructure.

This year, KVM Forum is joining LinuxCon and ContainerCon in Toronto,
Canada. Selected talks from KVM Forum will be presented on Wednesday,
August 24 to the full audience of LinuxCon and ContainerCon. Also,
attendees of KVM Forum will have access to all of the LinuxCon and
ContainerCon talks on Wednesday.

http://events.linuxfoundation.org/cfp

Suggested topics:

KVM and Linux
* Scaling and optimizations
* Nested virtualization
* Linux kernel performance improvements
* Resource management (CPU, I/O, memory)
* Hardening and security
* VFIO: SR-IOV, GPU, platform device assignment
* Architecture ports

QEMU
* Management interfaces: QOM and QMP
* New devices, new boards, new architectures
* Scaling and optimizations
* Desktop virtualization and SPICE
* Virtual GPU
* virtio and vhost, including non-Linux or non-virtualized uses
* Hardening and security
* New storage features
* Live migration and fault tolerance
* High availability and continuous backup
* Real-time guest support
* Emulation and TCG
* Firmware: ACPI, UEFI, coreboot, u-Boot, etc.
* Testing

Management and infrastructure
* Managing KVM: Libvirt, OpenStack, oVirt, etc.
* Storage: glusterfs, Ceph, etc.
* Software defined networking: Open vSwitch, OpenDaylight, etc.
* Network Function Virtualization
* Security
* Provisioning
* Performance tuning


===============
SUBMITTING YOUR PROPOSAL
===============
Abstracts due: May 1, 2016

Please submit a short abstract (~150 words) describing your presentation
proposal. Slots vary in length up to 45 minutes. Also include the proposal
type -- one of:
- technical talk
- end-user talk

Submit your proposal here:
http://events.linuxfoundation.org/cfp
Please only use the categories "presentation" and "panel discussion"

You will receive a notification whether or not your presentation proposal
was accepted by May 27, 2016.

Speakers will receive a complimentary pass for the event. In the instance
that your submission has multiple presenters, only the primary speaker for a
proposal will receive a complimentary event pass. For panel discussions, all
panelists will receive a complimentary event pass.

TECHNICAL TALKS

A good technical talk should not just report on what has happened over
the last year; it should present a concrete problem and how it impacts
the user and/or developer community. Whenever applicable, focus on
work that needs to be done, difficulties that haven't yet been solved,
and on decisions that other developers should be aware of. Summarizing
recent developments is okay but it should not be more than a small
portion of the overall talk.

END-USER TALKS

One of the big challenges for developers is knowing what, where and how
people actually use our software. We will reserve a few slots for end
users talking about their deployment challenges and achievements.

If you are using KVM in production, you are encouraged to submit a speaking
proposal. Simply mark it as an end-user talk. For end users, this is a
unique opportunity to get your input to developers.

HANDS-ON / BOF SESSIONS

We will reserve some time for people to get together and discuss
strategic decisions as well as other topics that are best solved within
smaller groups.

These sessions will be announced during the event. If you are interested
in organizing such a session, please add it to the list at

  http://www.linux-kvm.org/page/KVM_Forum_2016_BOF

Let people you think might be interested know about it, and encourage
them to add their names to the wiki page as well. Please try to
add your ideas to the list before KVM Forum starts.


PANEL DISCUSSIONS

If you are proposing a panel discussion, please make sure that you list
all of your potential panelists in your abstract. We will request full
biographies if a panel is accepted.


===============
HOTEL / TRAVEL
===============

This year's event will take place at the Westin Harbour Castle Toronto.
For information on discounted room rates for conference attendees
and on other hotels close to the conference, please visit
http://events.linuxfoundation.org/events/kvm-forum/attend/hotel-travel.

As of March 15, 2016, non-US citizens need either a visa or an Electronic
Travel Authorization (eTA) in order to enter Canada. Detailed information
on the travel documentation required for your country of origin can
be found at http://www.cic.gc.ca/english/visit/visas.asp and
http://events.linuxfoundation.org/events/kvm-forum/attend/hotel-travel.

** We urge you to start this process as quickly as possible to ensure
** receipt of appropriate travel documentation in time for your conference
** travel to Canada.  For processing times for visa applications, please visit
** http://www.cic.gc.ca/english/information/times/.

===============
IMPORTANT DATES
===============
Notification: May 27, 2016
Schedule announced: June 3, 2016
Event dates: August 24-26, 2016

Thank you for your interest in KVM. We're looking forward to your
submissions and seeing you at the KVM Forum 2016 in August!

-your KVM Forum 2016 Program Committee

Please contact us with any questions or comments at
kvm-forum-2016-pc@redhat.com


* Re: [RFC -next 2/2] virtio_net: Read and use the advised MTU
From: Paolo Abeni @ 2016-03-10 14:57 UTC
  To: Aaron Conole; +Cc: netdev, virtualization, linux-kernel, Michael S. Tsirkin
In-Reply-To: <1457620092-24170-3-git-send-email-aconole@redhat.com>

On Thu, 2016-03-10 at 09:28 -0500, Aaron Conole wrote:
> This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it
> exists, read the advised MTU and use it.
> 
> No proper error handling is provided for the case where a user changes the
> negotiated MTU. A future commit will add proper error handling. Instead, a
> warning is emitted if the guest changes the device MTU after previously being
> given advice.
> 
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
>  drivers/net/virtio_net.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 767ab11..7175563 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -146,6 +146,7 @@ struct virtnet_info {
>  	virtio_net_ctrl_ack ctrl_status;
>  	u8 ctrl_promisc;
>  	u8 ctrl_allmulti;
> +	bool negotiated_mtu;
>  };
>  
>  struct padded_vnet_hdr {
> @@ -1390,8 +1391,12 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>  
>  static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>  {
> +	struct virtnet_info *vi = netdev_priv(dev);
>  	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
>  		return -EINVAL;
> +	if (vi->negotiated_mtu == true) {

Why not:

if ((vi->negotiated_mtu == true) && (dev->mtu != new_mtu))

?
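Here is a self-contained model of virtnet_change_mtu() with that condition. It drops the redundant "== true" comparison and omits braces around single-statement if bodies, following Documentation/CodingStyle. struct fake_netdev and warn_mtu_change() are assumptions for this sketch, as are the MIN_MTU/MAX_MTU values (68 and 65535). fprintf() stands in for pr_warn():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Bounds assumed for this sketch. */
#define MIN_MTU 68
#define MAX_MTU 65535

/* Hypothetical stand-in for struct net_device plus struct virtnet_info. */
struct fake_netdev {
    int mtu;
    bool negotiated_mtu;   /* device advised an MTU via VIRTIO_NET_F_MTU */
    int warnings;          /* number of warnings emitted, for testing */
};

static void warn_mtu_change(struct fake_netdev *dev)
{
    fprintf(stderr, "changing mtu from negotiated mtu\n");
    dev->warnings++;
}

static int change_mtu(struct fake_netdev *dev, int new_mtu)
{
    assert(dev != NULL);
    if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
        return -EINVAL;
    /* Stay silent when the requested MTU equals the current one. */
    if (dev->negotiated_mtu && dev->mtu != new_mtu)
        warn_mtu_change(dev);
    dev->mtu = new_mtu;
    return 0;
}
```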

> +		pr_warn("changing mtu from negotiated mtu.");
> +	}
>  	dev->mtu = new_mtu;
>  	return 0;
>  }
> @@ -1836,6 +1841,13 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>  		vi->has_cvq = true;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) {
> +		vi->negotiated_mtu = true;
> +		dev->mtu = virtio_cread16(vdev,
> +					  offsetof(struct virtio_net_config,
> +						   mtu));
> +	}
> +
>  	if (vi->any_header_sg)
>  		dev->needed_headroom = vi->hdr_len;
>  
> @@ -2017,8 +2029,9 @@ static unsigned int features[] = {
>  	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ,
> -	VIRTIO_NET_F_CTRL_MAC_ADDR,
> +	VIRTIO_NET_F_CTRL_MAC_ADDR, 

A trailing whitespace slipped in here.

Otherwise LGTM.

Paolo

>  	VIRTIO_F_ANY_LAYOUT,
> +	VIRTIO_NET_F_MTU,
>  };
>  
>  static struct virtio_driver virtio_net_driver = {


* [RFC -next 2/2] virtio_net: Read and use the advised MTU
From: Aaron Conole @ 2016-03-10 14:28 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel
In-Reply-To: <1457620092-24170-1-git-send-email-aconole@redhat.com>

This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it
exists, read the advised MTU and use it.

No proper error handling is provided for the case where a user changes the
negotiated MTU. A future commit will add proper error handling. Instead, a
warning is emitted if the guest changes the device MTU after previously being
given advice.

Signed-off-by: Aaron Conole <aconole@redhat.com>
---
 drivers/net/virtio_net.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 767ab11..7175563 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -146,6 +146,7 @@ struct virtnet_info {
 	virtio_net_ctrl_ack ctrl_status;
 	u8 ctrl_promisc;
 	u8 ctrl_allmulti;
+	bool negotiated_mtu;
 };
 
 struct padded_vnet_hdr {
@@ -1390,8 +1391,12 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
 
 static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 {
+	struct virtnet_info *vi = netdev_priv(dev);
 	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
 		return -EINVAL;
+	if (vi->negotiated_mtu == true) {
+		pr_warn("changing mtu from negotiated mtu.");
+	}
 	dev->mtu = new_mtu;
 	return 0;
 }
@@ -1836,6 +1841,13 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
 		vi->has_cvq = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) {
+		vi->negotiated_mtu = true;
+		dev->mtu = virtio_cread16(vdev,
+					  offsetof(struct virtio_net_config,
+						   mtu));
+	}
+
 	if (vi->any_header_sg)
 		dev->needed_headroom = vi->hdr_len;
 
@@ -2017,8 +2029,9 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ,
-	VIRTIO_NET_F_CTRL_MAC_ADDR,
+	VIRTIO_NET_F_CTRL_MAC_ADDR, 
 	VIRTIO_F_ANY_LAYOUT,
+	VIRTIO_NET_F_MTU,
 };
 
 static struct virtio_driver virtio_net_driver = {
-- 
2.5.0


* [RFC -next 1/2] virtio: Start the advised MTU feature support
From: Aaron Conole @ 2016-03-10 14:28 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel
In-Reply-To: <1457620092-24170-1-git-send-email-aconole@redhat.com>

This commit adds the feature bit and associated mtu device entry for the
virtio network device. Future commits will make use of these bits to support
negotiated MTU.

Signed-off-by: Aaron Conole <aconole@redhat.com>
---
 include/uapi/linux/virtio_net.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index ec32293..41a6a01 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -55,6 +55,7 @@
 #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
+#define VIRTIO_NET_F_MTU 25	/* Device supports Default MTU Negotiation */
 
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
@@ -73,6 +74,8 @@ struct virtio_net_config {
 	 * Legal values are between 1 and 0x8000
 	 */
 	__u16 max_virtqueue_pairs;
+	/* Default maximum transmit unit advice */
+	__u16 mtu;
 } __attribute__((packed));
 
 /*
-- 
2.5.0


* [RFC -next 0/2] virtio-net: Advised MTU feature
From: Aaron Conole @ 2016-03-10 14:28 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel

The following series adds the ability for a hypervisor to set an MTU on the
guest during the feature negotiation phase. This is useful for VM orchestration
when, for instance, tunneling is involved and the MTU of the various systems
should be homogeneous.

The first patch adds the feature bit as described in the proposed virtio spec
addition found at
https://lists.oasis-open.org/archives/virtio-dev/201603/msg00001.html

The second patch adds a user of the bit, and a warning when the guest changes
the MTU from the hypervisor advised MTU. Future patches may add more thorough
error handling.
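To illustrate where the new field lives, here is a userspace model of the packed struct virtio_net_config from patch 1/2. It also includes the feature test for bit 25, the bit number proposed in this RFC. struct net_config_model, read_le16() and advised_mtu() are hypothetical. The sketch assumes a little-endian (VIRTIO 1.0) config space, whereas the kernel's virtio_cread16() handles the device's byte order itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_F_MTU 25   /* bit number as proposed in this RFC */

/* Userspace model of the config layout after patch 1/2. */
struct net_config_model {
    uint8_t  mac[6];
    uint16_t status;
    uint16_t max_virtqueue_pairs;
    uint16_t mtu;                 /* new: default MTU advice */
} __attribute__((packed));

static_assert(offsetof(struct net_config_model, mtu) == 10,
              "mtu follows mac, status and max_virtqueue_pairs");

static int has_feature(uint64_t features, unsigned int bit)
{
    return (int)((features >> bit) & 1);
}

/* Assumes little-endian config space (VIRTIO 1.0 devices). */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns the advised MTU, or 'fallback' if the feature was not negotiated. */
static uint16_t advised_mtu(uint64_t features, const uint8_t *config,
                            uint16_t fallback)
{
    if (!has_feature(features, VIRTIO_NET_F_MTU))
        return fallback;
    return read_le16(config + offsetof(struct net_config_model, mtu));
}
```

The static_assert records that mtu sits at byte offset 10, right after mac[6], status and max_virtqueue_pairs.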

Aaron Conole (2):
  virtio: Start the advised MTU feature support
  virtio_net: Read and use the advised MTU

 drivers/net/virtio_net.c        | 15 ++++++++++++++-
 include/uapi/linux/virtio_net.h |  3 +++
 2 files changed, 17 insertions(+), 1 deletion(-)

-- 
2.5.0


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-10 12:29 UTC
  To: Li, Liang Z
  Cc: riel@redhat.com, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, Roman Kagan,
	amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414A41D@shsmsx102.ccr.corp.intel.com>

On Thu, Mar 10, 2016 at 01:41:16AM +0000, Li, Liang Z wrote:
> > > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > > The problem is the poor performance, this PV solution
> > > > >
> > > > > Balloon is always PV. And do not call patches solutions please.
> > > > >
> > > > > > is aimed to make it more
> > > > > > efficient and reduce the performance impact on guest.
> > > > >
> > > > > We need to get a bit beyond this.  You are making multiple
> > > > > changes, it seems to make sense to split it all up, and analyse
> > > > > each change separately.
> > > >
> > > > Couldn't agree more.
> > > >
> > > > There are three stages in this optimization:
> > > >
> > > > 1) choosing which pages to skip
> > > >
> > > > 2) communicating them from guest to host
> > > >
> > > > 3) skip transferring uninteresting pages to the remote side on
> > > > migration
> > > >
> > > > For (3) there seems to be a low-hanging fruit to amend
> > > > migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This
> > > > would work for guest RAM that hasn't been touched yet or which has
> > > > been ballooned out.
> > > >
> > > > For (1) I've been trying to make a point that skipping clean pages
> > > > is much more likely to result in noticeable benefit than free pages only.
> > > >
> > >
> > > I am considering dropping the pagecache before getting the free pages.
> > >
> > > > As for (2), we do seem to have a problem with the existing balloon:
> > > > according to your measurements it's very slow; besides, I guess it
> > > > plays badly
> > >
> > > I didn't say communicating is slow. Even if it is, my solution
> > > uses a bitmap instead of PFNs, so there is less data traffic and it's
> > > faster than the existing balloon, which uses PFNs.
> > 
> > By how much?
> > 
> 
> Haven't measured yet. 
> To identify a page, 1 bit is needed with a bitmap, while 4 bytes (32 bits) are needed with a PFN.
>
> For a guest with 8GB RAM, the free page bitmap is 256KB,
> while the full PFN list is 8192KB. Assuming the balloon is inflated
> by 7GB, the PFN list is 7168KB.

Yes, but this is not how the balloon works; instead, it reuses a single
4K page multiple times. We can also trade off more memory for speed
if we want to; it's completely up to the guest.
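The figures quoted above can be checked with a few lines of C, assuming 4 KiB pages and 32-bit PFNs as in Liang's estimate. The last helper reflects the point about reusing one 4 KiB page. Each fill of that page carries 1024 PFNs, so a 7GB inflation takes 1792 batches rather than one 7168KB buffer. All names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096ULL   /* assumed 4 KiB pages */
#define PFN_BYTES  4ULL      /* one 32-bit PFN per page, as in the estimate */
#define KiB        1024ULL
#define GiB        (1024ULL * 1024ULL * 1024ULL)

static_assert(PAGE_BYTES % PFN_BYTES == 0, "PFNs must tile a page exactly");

static uint64_t nr_pages(uint64_t ram_bytes)
{
    return ram_bytes / PAGE_BYTES;
}

/* Free page bitmap: one bit per page. */
static uint64_t bitmap_bytes(uint64_t ram_bytes)
{
    return (nr_pages(ram_bytes) + 7) / 8;
}

/* PFN list: one 32-bit PFN per page. */
static uint64_t pfn_list_bytes(uint64_t ram_bytes)
{
    return nr_pages(ram_bytes) * PFN_BYTES;
}

/* Batches needed if a single 4 KiB page is reused to carry the PFNs. */
static uint64_t pfn_batches(uint64_t ram_bytes)
{
    uint64_t per_batch = PAGE_BYTES / PFN_BYTES;   /* 1024 PFNs */

    return (nr_pages(ram_bytes) + per_batch - 1) / per_batch;
}
```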

> 
> Maybe this is not the point.
> 
> Liang



> > > > with transparent huge pages (as both the guest and the host work
> > > > with one 4k page at a time).  This is a problem for other use cases
> > > > of balloon (e.g. as a facility for resource management); tackling
> > > > that appears a more natural application for optimization efforts.
> > > >
> > > > Thanks,
> > > > Roman.
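For the /proc/self/pagemap idea in (3), here is a sketch based on the documented pagemap format. The file holds one 64-bit entry per virtual page; bit 63 means the page is present and bit 62 means it is swapped out. A private anonymous page with neither bit set has either never been populated or has been discarded (for example with MADV_DONTNEED after ballooning), so it reads back as zeros. page_is_unpopulated() is a hypothetical helper, not QEMU code. It assumes guest RAM is a private anonymous mapping:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT (UINT64_C(1) << 63)   /* page present in RAM */
#define PM_SWAPPED (UINT64_C(1) << 62)   /* page swapped out */

/*
 * Returns 1 if the page containing 'addr' is neither present nor swapped
 * (a private anonymous page in that state reads back as zeros),
 * 0 if it is populated, and -1 if pagemap could not be read.
 */
static int page_is_unpopulated(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t offset;
    ssize_t n;
    int fd;

    assert(page_size > 0);
    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    offset = (off_t)(((uintptr_t)addr / (uintptr_t)page_size) * sizeof(entry));
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return !(entry & (PM_PRESENT | PM_SWAPPED));
}
```

Opening the file on every call keeps the sketch short. A real user would keep the descriptor open and read the entries for many pages with one pread().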


* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Dr. David Alan Gilbert @ 2016-03-10 11:18 UTC
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com, simhan,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	jitendra.kolhe, linux-mm@kvack.org, mohan_parthasarathy,
	Amit Shah, pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414A860@shsmsx102.ccr.corp.intel.com>

Hi,
  I'm just catching back up on this thread; so without reference to any
particular previous mail in the thread.

  1) How many of the free pages do we tell the host about?
     Your main change is telling the host about all the
     free pages.
     If we tell the host about all the free pages, then we might
     end up needing to allocate more pages and update the host
     with pages we now want to use; that would have to wait for the
     host to acknowledge that use of these pages, since if we don't
     wait for it then it might have skipped migrating a page we
     just started using (I don't understand how your series solves that).
     So the guest probably needs to keep some free pages - how many?

  2) Clearing out caches
     Does it make sense to clean caches?  They're apparently useful data
     so if we clean them it's likely to slow the guest down; I guess
     they're also likely to be fairly static data - so at least fairly
     easy to migrate.
     The answer here partially depends on what you want from your migration;
     if you're after the fastest possible migration time it might make
     sense to clean the caches and avoid migrating them; but that might
     be at the cost of more disruption to the guest - there's a trade off
     somewhere and it's not clear to me how you set that depending on your
     guest/network/requirements.

  3) Why is ballooning slow?
     You've got a figure of 5s to balloon on an 8GB VM - but an 
     8GB VM isn't huge; so I worry about how long it would take
     on a big VM.   We need to understand why it's slow 
       * is it due to the guest shuffling pages around? 
       * is it due to the virtio-balloon protocol sending one page
         at a time?
         + Do balloon pages normally clump in physical memory
            - i.e. would a 'large balloon' message help
            - or do we need a bitmap because it tends not to clump?

       * is it due to the madvise on the host?
         If we were using the normal balloon messages, then we
         could, during migration, just route those to the migration
         code rather than bothering with the madvise.
         If they're clumping together we could just turn that into
         one big madvise; if they're not then would we benefit from
         a call that lets us madvise lots of areas?

  4) Speeding up the migration of those free pages
    You're using the bitmap to avoid migrating those free pages; HPe's
    patchset is reconstructing a bitmap from the balloon data;  OK, so
    this all makes sense to avoid migrating them - I'd also been thinking
    of using pagemap to spot zero pages that would help find other zero'd
    pages, but perhaps ballooned is enough?

  5) Second-migrate
    Given a VM where you've done all those tricks on, what happens when
    you migrate it a second time?   I guess you're aiming for the guest
    to update its bitmap;  HPe's solution is to migrate its balloon
    bitmap along with the migration data.
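On point 1, Liang's reply gives an ordering: start dirty logging before fetching the free-page bitmap, then let the next migration_bitmap_sync() re-mark anything the guest wrote in the meantime. That ordering can be modelled with plain bitmaps. This is a hypothetical sketch, not QEMU code. `bulk_stage_bitmap()` and `sync_dirty_log()` are invented names, and a single 64-bit word stands in for a 64-page guest.

```c
#include <stdint.h>

/* Hypothetical model: bit i of a uint64_t stands for guest page i. */

/* Bulk stage: every page starts out "to be sent", except the pages the
 * guest reported as free. */
static uint64_t bulk_stage_bitmap(uint64_t free_pages)
{
    return ~free_pages;
}

/* Sync: merge in the pages the dirty log saw written since logging started,
 * then reset the log. A page that was reported free but reused by the guest
 * after the report gets re-marked here, so it is still migrated. */
static void sync_dirty_log(uint64_t *migration, uint64_t *dirty_log)
{
    *migration |= *dirty_log;
    *dirty_log = 0;
}
```

Under this ordering, a guest write to reported-free page 5 sets bit 5 in the log, and `sync_dirty_log()` puts that page back into the migration bitmap. Holding back free pages would then only affect how much gets resent, not correctness.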
     
Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-10 10:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: riel, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, Li, Liang Z, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160309193137-mutt-send-email-mst@redhat.com>

On Wed, Mar 09, 2016 at 07:39:18PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean pages is
> > > > much more likely to result in noticeable benefit than free pages only.
> > > 
> > > I guess when you say clean you mean zero?
> > 
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
> 
> They must be migrated unless guest actually evicts them.

If the balloon is inflated the guest will.

> It's not at all clear to me that it's always preferable
> to drop all clean pages from pagecache. It clearly is
> going to slow the guest down significantly.

That's a matter for optimization.  The current value for
/proc/meminfo:MemAvailable (which is being proposed as a member of
balloon stats, too) is a conservative estimate which will probably cover
a good deal of cases.
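For reference, MemAvailable is an ordinary field in /proc/meminfo, reported in kB. The parser below is a minimal sketch. `meminfo_kb()` is a hypothetical helper that takes the file contents as a string, so it can be exercised without a guest.

```c
#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of a /proc/meminfo field such as "MemAvailable",
 * or -1 if the field is absent. text holds the whole file contents. */
static long meminfo_kb(const char *text, const char *field)
{
    size_t len = strlen(field);
    const char *line = text;

    while (line && *line) {
        long kb;
        /* Match "Field:" exactly, so "Mem" does not match "MemTotal". */
        if (strncmp(line, field, len) == 0 && line[len] == ':' &&
            sscanf(line + len + 1, "%ld", &kb) == 1)
            return kb;
        line = strchr(line, '\n');
        if (line)
            line++;  /* skip past the newline to the next field */
    }
    return -1;
}
```

In a guest agent you would read /proc/meminfo into a buffer and call `meminfo_kb(buf, "MemAvailable")`.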

> > I must be missing something obvious, but how is that different from
> > inflating and then immediately deflating the balloon?
> 
> It's exactly the same except
> - we do not initiate this from host - it's guest doing
>   things for its own reasons
> - a bit less guest/host interaction this way

I don't quite understand why you need to deflate the balloon before the
VM is on the destination host.  deflate_on_oom will do it if the guest
is really tight on memory; otherwise there appears to be no reason for
it.  But then inflation followed immediately by deflation doubles the
guest/host interactions rather than reduces them, no?

> > it's just the granularity that makes things slow and
> > stands in the way.
> 
> So we could request a specific page size/alignment from guest.
> Send guest request to give us memory in aligned units of 2Mbytes,
> and then host can treat each of these as a single huge page.

I'd guess just coalescing contiguous pages would already speed things
up.  I'll try to find some time to experiment with it.
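The coalescing idea and the suggestion of 2 MiB aligned units can be sketched together. The first step merges a sorted PFN list (as the balloon currently sends it) into contiguous runs. The second counts how many whole, naturally aligned 2 MiB huge pages each run covers. This is a hypothetical helper pair, not code from virtio-balloon or QEMU, and it assumes 4 KiB base pages.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_HUGE 512u  /* 2 MiB / 4 KiB; assumed x86-64 page sizes */

struct pfn_run {
    uint64_t start;  /* first PFN of the run */
    uint64_t count;  /* number of contiguous pages */
};

/* Coalesce a strictly ascending PFN array into contiguous runs. Stores at
 * most max_runs runs but returns the total number found, so a return value
 * greater than max_runs means the output was truncated. */
static size_t coalesce_pfns(const uint64_t *pfns, size_t n,
                            struct pfn_run *runs, size_t max_runs)
{
    struct pfn_run cur = { 0, 0 };
    size_t nruns = 0;
    size_t i;

    for (i = 0; i <= n; i++) {
        if (i < n && cur.count > 0 && pfns[i] == cur.start + cur.count) {
            cur.count++;                 /* extends the current run */
            continue;
        }
        if (cur.count > 0) {             /* close the current run */
            if (nruns < max_runs)
                runs[nruns] = cur;
            nruns++;
        }
        if (i < n) {                     /* start a new run */
            cur.start = pfns[i];
            cur.count = 1;
        }
    }
    return nruns;
}

/* Number of complete, naturally aligned huge pages inside the PFN run
 * [start, start + count). Partial huge pages at either end are excluded. */
static uint64_t huge_pages_in_run(uint64_t start, uint64_t count)
{
    uint64_t first = (start + PAGES_PER_HUGE - 1) / PAGES_PER_HUGE; /* round up */
    uint64_t end = (start + count) / PAGES_PER_HUGE;                 /* round down */
    return end > first ? end - first : 0;
}
```

If `coalesce_pfns()` returns far fewer runs than pages, one "range" message per run (or one madvise() per run on the host) would remove most of the per-page overhead. If it does not, that argues for the bitmap alternative instead.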

Roman.

^ permalink raw reply

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-10  9:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Li, Liang Z, Dr. David Alan Gilbert, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Michael S. Tsirkin, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <1457552332.17933.24.camel@redhat.com>

On Wed, Mar 09, 2016 at 02:38:52PM -0500, Rik van Riel wrote:
> On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean
> > > > pages is
> > > > much more likely to result in noticeable benefit than free pages
> > > > only.
> > > 
> > > I guess when you say clean you mean zero?
> > 
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
> > 
> 
> Programs in the guest may have that memory mmapped.
> This could include things like libraries and executables.
> 
> How do you deal with the guest page cache containing
> references to now non-existent memory?
> 
> How do you re-populate the memory on the destination
> host?

I guess the confusion is due to the context I stripped from the previous
messages...  Actually I've been talking about doing full-fledged balloon
inflation before the migration, so, when it's deflated the guest will
fault in that data from the filesystem as usual.

Roman.

^ permalink raw reply

* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-10  8:36 UTC (permalink / raw)
  To: Amit Shah
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
	rth@twiddle.net
In-Reply-To: <20160310075728.GB4678@grmbl.mre>

> >  Could you provide more information on how to use virtio-serial to exchange
> > data?  Threads, wiki pages or code are all OK.
> >  I have not found any useful information yet.
> 
> See this commit in the Linux sources:
> 
> 108fc82596e3b66b819df9d28c1ebbc9ab5de14c
> 
> that adds a way to send guest trace data over to the host.  I think that's the
> most relevant to your use-case.  However, you'll have to add an in-kernel
> user of virtio-serial (like the virtio-console code
> -- the code that deals with tty and hvc currently).  There's no other non-tty
> user right now, and this is the right kind of use-case to add one for!
> 
> For many other (userspace) use-cases, see the qemu-guest-agent in the
> qemu sources.
> 
> The API is documented in the wiki:
> 
> http://www.linux-kvm.org/page/Virtio-serial_API
> 
> and the feature pages have some information that may help as well:
> 
> https://fedoraproject.org/wiki/Features/VirtioSerial
> 
> There are some links in here too:
> 
> http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/
> 
> Hope this helps.
> 
> 
> 		Amit

Thanks a lot !!

Liang

^ permalink raw reply

* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Amit Shah @ 2016-03-10  7:57 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
	rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414A7E3@shsmsx102.ccr.corp.intel.com>

On (Thu) 10 Mar 2016 [07:44:19], Li, Liang Z wrote:
> 
> Hi Amit,
> 
>  Could you provide more information on how to use virtio-serial to exchange data?  Threads, wiki pages or code are all OK.
>  I have not found any useful information yet.

See this commit in the Linux sources:

108fc82596e3b66b819df9d28c1ebbc9ab5de14c

that adds a way to send guest trace data over to the host.  I think
that's the most relevant to your use-case.  However, you'll have to
add an in-kernel user of virtio-serial (like the virtio-console code
-- the code that deals with tty and hvc currently).  There's no other
non-tty user right now, and this is the right kind of use-case to add
one for!

For many other (userspace) use-cases, see the qemu-guest-agent in the
qemu sources.

The API is documented in the wiki:

http://www.linux-kvm.org/page/Virtio-serial_API

and the feature pages have some information that may help as well:

https://fedoraproject.org/wiki/Features/VirtioSerial

There are some links in here too:

http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/
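From guest userspace, a virtio-serial port appears as a character device. Named ports also usually get a udev symlink under /dev/virtio-ports/. The API page above describes using them with plain open()/read()/write(). Below is a minimal sketch of pushing a buffer, such as a free-page bitmap, through a port. `port_write_all()` is a hypothetical helper. The in-kernel user discussed above would look different.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write all len bytes of buf to the port at path, retrying on short
 * writes and EINTR. Returns 0 on success, -1 on error. */
static int port_write_all(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        p += n;                    /* advance past what was accepted */
        len -= (size_t)n;
    }
    return close(fd);
}
```

A usage example would be `port_write_all("/dev/virtio-ports/org.example.freepages", bitmap, bitmap_len)`. In that call, the port name is made up and would need a matching named virtserialport on the QEMU side.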

Hope this helps.


		Amit

^ permalink raw reply

* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-10  7:44 UTC (permalink / raw)
  To: Amit Shah
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
	rth@twiddle.net
In-Reply-To: <20160308111343.GM15443@grmbl.mre>

> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This makes
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take post-copy and RDMA into
> > consideration; maybe both of them can benefit from this PV solution
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we don't
> end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
> 		Amit

Hi Amit,

 Could you provide more information on how to use virtio-serial to exchange data?  Threads, wiki pages or code are all OK.
 I have not found any useful information yet.

Thanks
Liang

^ permalink raw reply

* Re: [Qemu-devel] [RFC kernel 0/2]A PV solution for KVM live migration optimization
From: Amit Shah @ 2016-03-10  7:30 UTC (permalink / raw)
  To: Jitendra Kolhe
  Cc: ehabkost, kvm, qemu-devel, liang.z.li, dgilbert, linux-kernel,
	linux-mm, mst, mohan_parthasarathy, simhan, pbonzini, akpm,
	virtualization, rth
In-Reply-To: <1457593292-30686-1-git-send-email-jitendra.kolhe@hpe.com>

On (Thu) 10 Mar 2016 [12:31:32], Jitendra Kolhe wrote:
> On 3/8/2016 4:44 PM, Amit Shah wrote:
> >>>> Hi,
> >>>>   An interesting solution; I know a few different people have been looking at
> >>>> how to speed up ballooned VM migration.
> >>>>
> >>>
> >>> Ooh, different solutions for the same purpose, and both based on the balloon.
> >>
> >> We were also trying to address a similar problem, without actually needing to modify
> >> the guest driver. Please find the patch details in the mail with subject:
> >> migration: skip sending ram pages released by virtio-balloon driver
> >
> > The scope of this patch series seems to be wider: don't send free
> > pages to a dest at all, vs. don't send pages that are ballooned out.
> 
> Hi,
> 
> Thanks for your response. The scope of this patch series doesn’t seem to take care 
> of ballooned out pages. To balloon out a guest ram page the guest balloon driver does 
> an alloc_page() and then returns the guest pfn to Qemu, so ballooned out pages will not
> be seen as free ram pages by the guest.
> Thus we will still end up scanning (for zero page) for ballooned out pages during 
> migration. It would be ideal if we could have both solutions.

Yes, of course it would be nice to have both solutions.  My response was to the line:

> >>> Ooh, different solutions for the same purpose, and both based on the balloon.

which sounded misleading to me for a couple of reasons: 1, as you
describe, pages being considered by this patchset and yours are
different; and 2, as I mentioned in the other mail, this patchset
doesn't really depend on the balloon, and I believe it should not.


		Amit
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [Qemu-devel] [RFC kernel 0/2]A PV solution for KVM live migration optimization
From: Li, Liang Z @ 2016-03-10  7:22 UTC (permalink / raw)
  To: Jitendra Kolhe, amit.shah@redhat.com
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	mst@redhat.com, linux-kernel@vger.kernel.org, dgilbert@redhat.com,
	linux-mm@kvack.org, mohan_parthasarathy@hpe.com, simhan@hpe.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <1457593292-30686-1-git-send-email-jitendra.kolhe@hpe.com>

> On 3/8/2016 4:44 PM, Amit Shah wrote:
> > On (Fri) 04 Mar 2016 [15:02:47], Jitendra Kolhe wrote:
> >>>>
> >>>> * Liang Li (liang.z.li@intel.com) wrote:
> >>>>> The current QEMU live migration implementation marks all the
> >>>>> guest's RAM pages as dirty in the ram bulk stage; all these
> >>>>> pages will be processed and that takes quite a lot of CPU cycles.
> >>>>>
> >>>>> From the guest's point of view, it doesn't care about the content in
> >>>>> free pages. We can make use of this fact and skip processing the
> >>>>> free pages in the ram bulk stage; this can save a lot of CPU cycles and
> >>>>> reduce network traffic significantly, while noticeably speeding up
> >>>>> the live migration process.
> >>>>>
> >>>>> This patch set is the QEMU side implementation.
> >>>>>
> >>>>> The virtio-balloon is extended so that QEMU can get the free pages
> >>>>> information from the guest through virtio.
> >>>>>
> >>>>> After getting the free pages information (a bitmap), QEMU can use
> >>>>> it to filter out the guest's free pages in the ram bulk stage.
> >>>>> This makes the live migration process much more efficient.
> >>>>
> >>>> Hi,
> >>>>   An interesting solution; I know a few different people have been
> >>>> looking at how to speed up ballooned VM migration.
> >>>>
> >>>
> >>> Ooh, different solutions for the same purpose, and both based on the balloon.
> >>
> >> We were also trying to address a similar problem, without actually
> >> needing to modify the guest driver. Please find the patch details in the
> >> mail with subject:
> >> migration: skip sending ram pages released by virtio-balloon driver
> >
> > The scope of this patch series seems to be wider: don't send free
> > pages to a dest at all, vs. don't send pages that are ballooned out.
> >
> > 		Amit
> 
> Hi,
> 
> Thanks for your response. The scope of this patch series doesn’t seem to
> take care of ballooned out pages. To balloon out a guest ram page the guest
> balloon driver does an alloc_page() and then returns the guest pfn to Qemu, so
> ballooned out pages will not be seen as free ram pages by the guest.
> Thus we will still end up scanning (for zero page) for ballooned out pages
> during migration. It would be ideal if we could have both solutions.
> 

Agreed. For users who care about performance, skipping the free pages is enough.
For users who have already turned on virtio-balloon, your solution can take effect.

Liang
> Thanks,
> - Jitendra

^ permalink raw reply

* Re: [Qemu-devel] [RFC kernel 0/2]A PV solution for KVM live migration optimization
From: Jitendra Kolhe @ 2016-03-10  7:01 UTC (permalink / raw)
  To: amit.shah
  Cc: ehabkost, kvm, qemu-devel, liang.z.li, dgilbert, linux-kernel,
	linux-mm, mst, mohan_parthasarathy, simhan, pbonzini, akpm,
	virtualization, rth

On 3/8/2016 4:44 PM, Amit Shah wrote:
> On (Fri) 04 Mar 2016 [15:02:47], Jitendra Kolhe wrote:
>>>>
>>>> * Liang Li (liang.z.li@intel.com) wrote:
>>>>> The current QEMU live migration implementation marks all the
>>>>> guest's RAM pages as dirty in the ram bulk stage; all these pages
>>>>> will be processed and that takes quite a lot of CPU cycles.
>>>>>
>>>>> From the guest's point of view, it doesn't care about the content in free
>>>>> pages. We can make use of this fact and skip processing the free pages
>>>>> in the ram bulk stage; this can save a lot of CPU cycles and reduce
>>>>> network traffic significantly, while noticeably speeding up the live
>>>>> migration process.
>>>>>
>>>>> This patch set is the QEMU side implementation.
>>>>>
>>>>> The virtio-balloon is extended so that QEMU can get the free pages
>>>>> information from the guest through virtio.
>>>>>
>>>>> After getting the free pages information (a bitmap), QEMU can use it
>>>>> to filter out the guest's free pages in the ram bulk stage. This makes
>>>>> the live migration process much more efficient.
>>>>
>>>> Hi,
>>>>   An interesting solution; I know a few different people have been looking at
>>>> how to speed up ballooned VM migration.
>>>>
>>>
>>> Ooh, different solutions for the same purpose, and both based on the balloon.
>>
>> We were also trying to address a similar problem, without actually needing to modify
>> the guest driver. Please find the patch details in the mail with subject:
>> migration: skip sending ram pages released by virtio-balloon driver
>
> The scope of this patch series seems to be wider: don't send free
> pages to a dest at all, vs. don't send pages that are ballooned out.
>
> 		Amit

Hi,

Thanks for your response. The scope of this patch series doesn’t seem to take care 
of ballooned out pages. To balloon out a guest ram page the guest balloon driver does 
an alloc_page() and then returns the guest pfn to Qemu, so ballooned out pages will not
be seen as free ram pages by the guest.
Thus we will still end up scanning (for zero page) for ballooned out pages during 
migration. It would be ideal if we could have both solutions.
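The two approaches cover different pages. Reported-free pages are still owned by the guest allocator, while ballooned-out pages are held by the balloon driver. Combining them before the bulk stage is therefore just a union: a page is skipped if either source marks it. Below is a sketch over word-sized bitmaps. `skip_unneeded_pages()` and both input bitmaps are hypothetical, not the layout either patch set actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable population count (number of set bits). */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: clear from the migration bitmap every page that the
 * guest reported free OR that is ballooned out. All three bitmaps hold
 * nwords 64-bit words (64 pages per word). Returns how many pages that
 * were still marked for sending are now skipped. */
static size_t skip_unneeded_pages(uint64_t *migration,
                                  const uint64_t *free_pages,
                                  const uint64_t *ballooned, size_t nwords)
{
    size_t skipped = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        uint64_t skip = (free_pages[i] | ballooned[i]) & migration[i];
        skipped += popcount64(skip);
        migration[i] &= ~skip;
    }
    return skipped;
}
```

This only trims the bulk stage. Pages the guest touches afterwards still have to come back through dirty logging.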

Thanks,
- Jitendra

^ permalink raw reply

* Re: [PATCH V4 0/3] basic busy polling support for vhost_net
From: Michael Rapoport @ 2016-03-10  6:48 UTC (permalink / raw)
  To: Greg Kurz
  Cc: yang.zhang.wz, kvm, mst, netdev, linux-kernel, virtualization,
	borntraeger
In-Reply-To: <20160309202645.030ad7b2@bahia.lab.toulouse-stg.fr.ibm.com>

Hi Greg,

> Greg Kurz <gkurz@linux.vnet.ibm.com> wrote on 03/09/2016 09:26:45 PM:
> > On Fri,  4 Mar 2016 06:24:50 -0500
> > Jason Wang <jasowang@redhat.com> wrote:
> 
> > This series tries to add basic busy polling for vhost net. The idea is
> > simple: at the end of tx/rx processing, busy poll for newly added tx
> > descriptors and on the rx receive socket for a while. The maximum
> > time (in us) that can be spent busy polling is specified via an ioctl.
> > 
> > Tests were done with:
> > 
> > - 50 us as busy loop timeout
> > - Netperf 2.6
> > - Two machines with back to back connected mlx4
> 
> Hi Jason,
> 
> Could this also improve performance if both VMs are
> on the same host system?

I've experimented a little with Jason's patches and guest-to-guest netperf 
when both guests were on the same host, and I saw improvements for that 
case.
 
> > - Guest with 1 vcpus and 1 queue
> > 
> > Results:
> > - Obvious improvements (5% - 20%) for latency (TCP_RR).
> > - Better performance or a minor regression on most of the TX tests,
> >   but some regression at size 4096.
> > - Except for 8 sessions of size-4096 RX, performance is better or the
> >   same.
> > - CPU utilization increased as expected.
> > 
> > TCP_RR:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >     1/     1/   +8%/  -32%/   +8%/   +8%/   +7%
> >     1/    50/   +7%/  -19%/   +7%/   +7%/   +1%
> >     1/   100/   +5%/  -21%/   +5%/   +5%/    0%
> >     1/   200/   +5%/  -21%/   +7%/   +7%/   +1%
> >    64/     1/  +11%/  -29%/  +11%/  +11%/  +10%
> >    64/    50/   +7%/  -19%/   +8%/   +8%/   +2%
> >    64/   100/   +8%/  -18%/   +9%/   +9%/   +2%
> >    64/   200/   +6%/  -19%/   +6%/   +6%/    0%
> >   256/     1/   +7%/  -33%/   +7%/   +7%/   +6%
> >   256/    50/   +7%/  -18%/   +7%/   +7%/    0%
> >   256/   100/   +9%/  -18%/   +8%/   +8%/   +2%
> >   256/   200/   +9%/  -18%/  +10%/  +10%/   +3%
> >  1024/     1/  +20%/  -28%/  +20%/  +20%/  +19%
> >  1024/    50/   +8%/  -18%/   +9%/   +9%/   +2%
> >  1024/   100/   +6%/  -19%/   +5%/   +5%/    0%
> >  1024/   200/   +8%/  -18%/   +9%/   +9%/   +2%
> > Guest TX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/     1/   -5%/  -28%/  +11%/  +12%/  +10%
> >    64/     4/   -2%/  -26%/  +13%/  +13%/  +13%
> >    64/     8/   -6%/  -29%/   +9%/  +10%/  +10%
> >   512/     1/  +15%/   -7%/  +13%/  +11%/   +3%
> >   512/     4/  +17%/   -6%/  +18%/  +13%/  +11%
> >   512/     8/  +14%/   -7%/  +13%/   +7%/   +7%
> >  1024/     1/  +27%/   -2%/  +26%/  +29%/  +12%
> >  1024/     4/   +8%/   -9%/   +6%/   +1%/   +6%
> >  1024/     8/  +41%/  +12%/  +34%/  +20%/   -3%
> >  4096/     1/  -22%/  -21%/  -36%/  +81%/+1360%
> >  4096/     4/  -57%/  -58%/ +286%/  +15%/+2074%
> >  4096/     8/  +67%/  +70%/  -45%/   -8%/  +63%
> > 16384/     1/   -2%/   -5%/   +5%/   -3%/  +80%
> > 16384/     4/    0%/    0%/    0%/   +4%/ +138%
> > 16384/     8/    0%/    0%/    0%/   +1%/  +41%
> > 65535/     1/   -3%/   -6%/   +2%/  +11%/ +113%
> > 65535/     4/   -2%/   -1%/   -2%/   -3%/ +484%
> > 65535/     8/    0%/   +1%/    0%/   +2%/  +40%
> > Guest RX:
> > size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> >    64/     1/  +31%/   -3%/   +8%/   +8%/   +8%
> >    64/     4/  +11%/  -17%/  +13%/  +14%/  +15%
> >    64/     8/   +4%/  -23%/  +11%/  +11%/  +12%
> >   512/     1/  +24%/    0%/  +18%/  +14%/   -8%
> >   512/     4/   +4%/  -15%/   +6%/   +5%/   +6%
> >   512/     8/  +26%/    0%/  +21%/  +10%/   +3%
> >  1024/     1/  +88%/  +47%/  +69%/  +44%/  -30%
> >  1024/     4/  +18%/   -5%/  +19%/  +16%/   +2%
> >  1024/     8/  +15%/   -4%/  +13%/   +8%/   +1%
> >  4096/     1/   -3%/   -5%/   +2%/   -2%/  +41%
> >  4096/     4/   +2%/   +3%/  -20%/  -14%/  -24%
> >  4096/     8/  -43%/  -45%/  +69%/  -24%/  +94%
> > 16384/     1/   -3%/  -11%/  +23%/   +7%/  +42%
> > 16384/     4/   -3%/   -3%/   -4%/   +5%/ +115%
> > 16384/     8/   -1%/    0%/   -1%/   -3%/  +32%
> > 65535/     1/   +1%/    0%/   +2%/    0%/  +66%
> > 65535/     4/   -1%/   -1%/    0%/   +4%/ +492%
> > 65535/     8/    0%/   -1%/   -1%/   +4%/  +38%
> > 
> > Changes from V3:
> > - drop single_task_running()
> > - use cpu_relax_lowlatency() instead of cpu_relax()
> > 
> > Changes from V2:
> > - rename vhost_vq_more_avail() to vhost_vq_avail_empty(), and return
> > false when __get_user() fails.
> > - do not bother with preemption/timers for the good path.
> > - use vhost_vring_state as the ioctl parameter instead of reinventing a
> > new one.
> > - add the unit of timeout (us) to the comment of the newly added ioctls
> > 
> > Changes from V1:
> > - remove the buggy vq_error() in vhost_vq_more_avail().
> > - leave vhost_enable_notify() untouched.
> > 
> > Changes from RFC V3:
> > - small tweak on the code to avoid multiple duplicate conditions in
> > critical path when busy loop is not enabled.
> > - add the test result of multiple VMs
> > 
> > Changes from RFC V2:
> > - poll also at the end of rx handling
> > - factor out the polling logic and optimize the code a little bit
> > - add two ioctls to get and set the busy poll timeout
> > - test on ixgbe (which can give more stable and reproducible numbers)
> > instead of mlx4.
> > 
> > Changes from RFC V1:
> > - add a comment for vhost_has_work() to explain why it could be
> > lockless
> > - add param description for busyloop_timeout
> > - split out the busy polling logic into a new helper
> > - check and exit the loop when there's a pending signal
> > - disable preemption during busy looping to make sure local_clock() is
> > used correctly.
> > 
> > Jason Wang (3):
> >   vhost: introduce vhost_has_work()
> >   vhost: introduce vhost_vq_avail_empty()
> >   vhost_net: basic polling support
> > 
> >  drivers/vhost/net.c        | 78 +++++++++++++++++++++++++++++++++++++++++++---
> >  drivers/vhost/vhost.c      | 35 +++++++++++++++++++++
> >  drivers/vhost/vhost.h      |  3 ++
> >  include/uapi/linux/vhost.h |  6 ++++
> >  4 files changed, 117 insertions(+), 5 deletions(-)
> > 
> 

^ permalink raw reply

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-10  1:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: riel@redhat.com, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, Roman Kagan,
	amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160309172929-mutt-send-email-mst@redhat.com>

> > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > The problem is the poor performance, this PV solution
> > > >
> > > > Balloon is always PV. And do not call patches solutions please.
> > > >
> > > > > is aimed to make it more
> > > > > efficient and reduce the performance impact on guest.
> > > >
> > > > We need to get a bit beyond this.  You are making multiple
> > > > changes, it seems to make sense to split it all up, and analyse
> > > > each change separately.
> > >
> > > Couldn't agree more.
> > >
> > > There are three stages in this optimization:
> > >
> > > 1) choosing which pages to skip
> > >
> > > 2) communicating them from guest to host
> > >
> > > 3) skip transferring uninteresting pages to the remote side on
> > > migration
> > >
> > > For (3) there seems to be a low-hanging fruit to amend
> > > migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This
> > > would work for guest RAM that hasn't been touched yet or which has
> > > been ballooned out.
> > >
> > > For (1) I've been trying to make a point that skipping clean pages
> > > is much more likely to result in noticeable benefit than free pages only.
> > >
> >
> > I am considering dropping the pagecache before getting the free pages.
> >
> > > As for (2), we do seem to have a problem with the existing balloon:
> > > according to your measurements it's very slow; besides, I guess it
> > > plays badly
> >
> > I didn't say communicating is slow. Even if it were slow, my
> > solution uses a bitmap instead of PFNs, so there is less data traffic
> > and it's faster than the existing balloon, which uses PFNs.
> 
> By how much?
> 

Haven't measured yet. 
To identify a page, 1 bit is needed with a bitmap and 4 bytes (32 bits) with a PFN.

For a guest with 8GB RAM,  the corresponding free page bitmap size is 256KB.
And the corresponding total PFNs size is 8192KB. Assuming the inflating size
is 7GB, the total PFNs size is 7168KB.
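The arithmetic above can be checked with a few lines, assuming 4 KiB pages and 32-bit PFNs as stated. `bitmap_bytes()` and `pfn_list_bytes()` are illustrative helpers only.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed 4 KiB guest pages */

/* Bitmap covering all guest RAM: one bit per page, rounded up to bytes. */
static uint64_t bitmap_bytes(uint64_t ram_bytes)
{
    uint64_t pages = ram_bytes / PAGE_SIZE_BYTES;
    return (pages + 7) / 8;
}

/* PFN list: one 32-bit PFN per listed page. */
static uint64_t pfn_list_bytes(uint64_t listed_bytes)
{
    return listed_bytes / PAGE_SIZE_BYTES * sizeof(uint32_t);
}
```

For 8 GiB these give 256 KiB and 8192 KiB, and 7168 KiB for a 7 GiB list. The bitmap size is fixed by total RAM. The PFN list grows with the amount listed, so the ratio here is 32x for all of RAM and 28x for 7 GiB.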

Maybe this is not the point.

Liang

> > > with transparent huge pages (as both the guest and the host work
> > > with one 4k page at a time).  This is a problem for other use cases
> > > of balloon (e.g. as a facility for resource management); tackling
> > > that appears a more natural application for optimization efforts.
> > >
> > > Thanks,
> > > Roman.
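Roman's suggestion, quoted above, is to have the zero-page check in migration/ram.c consult /proc/self/pagemap. That file holds one 64-bit entry per virtual page: bit 63 means "present in RAM" and bit 62 means "swapped". For private anonymous memory, a page that is neither present nor swapped has never been touched, or was discarded (e.g. with MADV_DONTNEED after ballooning). Such a page reads back as zeros and could be skipped without scanning. Below is a minimal sketch of the lookup. `page_is_unbacked()` is a hypothetical helper, not QEMU's function. It is conservative: a page mapped to the shared zero page counts as present.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT (UINT64_C(1) << 63)  /* page present in RAM */
#define PM_SWAPPED (UINT64_C(1) << 62)  /* page swapped out */

/* Returns 1 if the page containing addr is neither present nor swapped,
 * 0 if it is backed, -1 on error. pagemap_fd must be an open descriptor
 * for /proc/self/pagemap (one 8-byte entry per virtual page). */
static int page_is_unbacked(int pagemap_fd, const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t off;

    if (page_size <= 0)
        return -1;
    off = (off_t)((uintptr_t)addr / (uintptr_t)page_size) * (off_t)sizeof(entry);
    if (pread(pagemap_fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
        return -1;
    return (entry & (PM_PRESENT | PM_SWAPPED)) ? 0 : 1;
}
```

A caller would open("/proc/self/pagemap", O_RDONLY) once and query each guest RAM page (or walk entries in bulk with a larger pread) before falling back to scanning the contents.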

^ permalink raw reply

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Rik van Riel @ 2016-03-09 19:38 UTC (permalink / raw)
  To: Roman Kagan, Michael S. Tsirkin
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Li, Liang Z, linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160309170438.GB9715@rkaganb.sw.ru>


On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean
> > > pages is
> > > much more likely to result in noticeable benefit than free pages
> > > only.
> > 
> > I guess when you say clean you mean zero?
> 
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.
> 

Programs in the guest may have that memory mmapped.
This could include things like libraries and executables.

How do you deal with the guest page cache containing
references to now non-existent memory?

How do you re-populate the memory on the destination
host?

-- 
All rights reversed


^ permalink raw reply

* Re: [PATCH V4 0/3] basic busy polling support for vhost_net
From: Greg Kurz @ 2016-03-09 19:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: yang.zhang.wz, RAPOPORT, kvm, mst, netdev, linux-kernel,
	virtualization, borntraeger
In-Reply-To: <1457090693-55974-1-git-send-email-jasowang@redhat.com>

On Fri,  4 Mar 2016 06:24:50 -0500
Jason Wang <jasowang@redhat.com> wrote:

> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx/rx processing, busy poll for newly added tx
> descriptors and on the rx receive socket for a while. The maximum
> time (in us) that can be spent busy polling is specified via an ioctl.
> 
> Tests were done with:
> 
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected mlx4

Hi Jason,

Could this also improve performance if both VMs are
on the same host system?

> - Guest with 1 vcpus and 1 queue
> 
> Results:
> - Obvious improvements (5% - 20%) for latency (TCP_RR).
> - Better performance or a minor regression on most of the TX tests,
>   but some regression at size 4096.
> - Except for 8 sessions of size-4096 RX, performance is better or the
>   same.
> - CPU utilization increased as expected.
> 
> TCP_RR:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>     1/     1/   +8%/  -32%/   +8%/   +8%/   +7%
>     1/    50/   +7%/  -19%/   +7%/   +7%/   +1%
>     1/   100/   +5%/  -21%/   +5%/   +5%/    0%
>     1/   200/   +5%/  -21%/   +7%/   +7%/   +1%
>    64/     1/  +11%/  -29%/  +11%/  +11%/  +10%
>    64/    50/   +7%/  -19%/   +8%/   +8%/   +2%
>    64/   100/   +8%/  -18%/   +9%/   +9%/   +2%
>    64/   200/   +6%/  -19%/   +6%/   +6%/    0%
>   256/     1/   +7%/  -33%/   +7%/   +7%/   +6%
>   256/    50/   +7%/  -18%/   +7%/   +7%/    0%
>   256/   100/   +9%/  -18%/   +8%/   +8%/   +2%
>   256/   200/   +9%/  -18%/  +10%/  +10%/   +3%
>  1024/     1/  +20%/  -28%/  +20%/  +20%/  +19%
>  1024/    50/   +8%/  -18%/   +9%/   +9%/   +2%
>  1024/   100/   +6%/  -19%/   +5%/   +5%/    0%
>  1024/   200/   +8%/  -18%/   +9%/   +9%/   +2%
> Guest TX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/     1/   -5%/  -28%/  +11%/  +12%/  +10%
>    64/     4/   -2%/  -26%/  +13%/  +13%/  +13%
>    64/     8/   -6%/  -29%/   +9%/  +10%/  +10%
>   512/     1/  +15%/   -7%/  +13%/  +11%/   +3%
>   512/     4/  +17%/   -6%/  +18%/  +13%/  +11%
>   512/     8/  +14%/   -7%/  +13%/   +7%/   +7%
>  1024/     1/  +27%/   -2%/  +26%/  +29%/  +12%
>  1024/     4/   +8%/   -9%/   +6%/   +1%/   +6%
>  1024/     8/  +41%/  +12%/  +34%/  +20%/   -3%
>  4096/     1/  -22%/  -21%/  -36%/  +81%/+1360%
>  4096/     4/  -57%/  -58%/ +286%/  +15%/+2074%
>  4096/     8/  +67%/  +70%/  -45%/   -8%/  +63%
> 16384/     1/   -2%/   -5%/   +5%/   -3%/  +80%
> 16384/     4/    0%/    0%/    0%/   +4%/ +138%
> 16384/     8/    0%/    0%/    0%/   +1%/  +41%
> 65535/     1/   -3%/   -6%/   +2%/  +11%/ +113%
> 65535/     4/   -2%/   -1%/   -2%/   -3%/ +484%
> 65535/     8/    0%/   +1%/    0%/   +2%/  +40%
> Guest RX:
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>    64/     1/  +31%/   -3%/   +8%/   +8%/   +8%
>    64/     4/  +11%/  -17%/  +13%/  +14%/  +15%
>    64/     8/   +4%/  -23%/  +11%/  +11%/  +12%
>   512/     1/  +24%/    0%/  +18%/  +14%/   -8%
>   512/     4/   +4%/  -15%/   +6%/   +5%/   +6%
>   512/     8/  +26%/    0%/  +21%/  +10%/   +3%
>  1024/     1/  +88%/  +47%/  +69%/  +44%/  -30%
>  1024/     4/  +18%/   -5%/  +19%/  +16%/   +2%
>  1024/     8/  +15%/   -4%/  +13%/   +8%/   +1%
>  4096/     1/   -3%/   -5%/   +2%/   -2%/  +41%
>  4096/     4/   +2%/   +3%/  -20%/  -14%/  -24%
>  4096/     8/  -43%/  -45%/  +69%/  -24%/  +94%
> 16384/     1/   -3%/  -11%/  +23%/   +7%/  +42%
> 16384/     4/   -3%/   -3%/   -4%/   +5%/ +115%
> 16384/     8/   -1%/    0%/   -1%/   -3%/  +32%
> 65535/     1/   +1%/    0%/   +2%/    0%/  +66%
> 65535/     4/   -1%/   -1%/    0%/   +4%/ +492%
> 65535/     8/    0%/   -1%/   -1%/   +4%/  +38%
> 
> Changes from V3:
> - drop single_task_running()
> - use cpu_relax_lowlatency() instead of cpu_relax()
> 
> Changes from V2:
> - rename vhost_vq_more_avail() to vhost_vq_avail_empty(), and return
> false when __get_user() fails.
> - do not bother with preemption/timers on the good path.
> - use vhost_vring_state as the ioctl parameter instead of reinventing a new
> one.
> - add the timeout unit (us) to the comments of the newly added ioctls
> 
> Changes from V1:
> - remove the buggy vq_error() in vhost_vq_more_avail().
> - leave vhost_enable_notify() untouched.
> 
> Changes from RFC V3:
> - small code tweak to avoid duplicate condition checks in the
> critical path when busy looping is not enabled.
> - add test results for multiple VMs
> 
> Changes from RFC V2:
> - poll also at the end of rx handling
> - factor out the polling logic and optimize the code a little bit
> - add two ioctls to get and set the busy poll timeout
> - test on ixgbe (which can give more stable and reproducible numbers)
> instead of mlx4.
> 
> Changes from RFC V1:
> - add a comment for vhost_has_work() to explain why it can be
> lockless
> - add param description for busyloop_timeout
> - split out the busy polling logic into a new helper
> - check and exit the loop when there's a pending signal
> - disable preemption during busy looping to make sure local_clock() is
> correctly used.
> 
> Jason Wang (3):
>   vhost: introduce vhost_has_work()
>   vhost: introduce vhost_vq_avail_empty()
>   vhost_net: basic polling support
> 
>  drivers/vhost/net.c        | 78 +++++++++++++++++++++++++++++++++++++++++++---
>  drivers/vhost/vhost.c      | 35 +++++++++++++++++++++
>  drivers/vhost/vhost.h      |  3 ++
>  include/uapi/linux/vhost.h |  6 ++++
>  4 files changed, 117 insertions(+), 5 deletions(-)
> 
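The changelog above describes a bounded busy-poll loop: spin until the ring has new buffers, other work is queued, a signal is pending, or the timeout (in microseconds) expires. A minimal userspace sketch of that control flow follows, assuming hypothetical atomic flags in place of the vhost_has_work(), vhost_vq_avail_empty() and pending-signal checks. It cannot disable preemption, so it simply reads CLOCK_MONOTONIC where the kernel code uses local_clock(); it is not the vhost_net code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical flags standing in for the kernel-side checks. */
struct poll_ctx {
    atomic_bool ring_nonempty;   /* like !vhost_vq_avail_empty() */
    atomic_bool work_pending;    /* like vhost_has_work() */
    atomic_bool signal_pending;  /* like a pending signal on the worker */
};

/* Monotonic clock in microseconds (the kernel code uses local_clock()). */
static uint64_t now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Spin for at most timeout_us. Returns true if the ring became non-empty,
 * false on timeout or when other work or a signal needs attention. */
static bool busy_poll(struct poll_ctx *c, uint64_t timeout_us)
{
    uint64_t deadline;

    assert(c != NULL);
    deadline = now_us() + timeout_us;
    while (now_us() < deadline) {
        if (atomic_load(&c->ring_nonempty))
            return true;
        if (atomic_load(&c->work_pending) || atomic_load(&c->signal_pending))
            return false;
        /* per the changelog, the kernel loop uses cpu_relax_lowlatency() here */
    }
    return atomic_load(&c->ring_nonempty);
}
```

Bailing out on pending work or signals keeps the spin from delaying other vhost work items or signal delivery, which is why the changelog adds those exits.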


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-09 17:39 UTC
  To: Roman Kagan, Li, Liang Z, Dr. David Alan Gilbert,
	ehabkost@redhat.com, kvm@vger.kernel.org, quintela@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net, riel
In-Reply-To: <20160309170438.GB9715@rkaganb.sw.ru>

On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean pages is
> > > much more likely to result in noticeable benefit than free pages only.
> > 
> > I guess when you say clean you mean zero?
> 
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.

They must be migrated unless the guest actually evicts them.
It's not at all clear to me that it's always preferable
to drop all clean pages from the pagecache. It is clearly
going to slow the guest down significantly.


> > Yea. In fact, one can zero out any number of pages
> > quickly by putting them in balloon and immediately
> > taking them out.
> > 
> > Access will fault a zero page in, then COW kicks in.
> 
> I must be missing something obvious, but how is that different from
> inflating and then immediately deflating the balloon?

It's exactly the same except
- we do not initiate this from host - it's guest doing
  things for its own reasons
- a bit less guest/host interaction this way


> > We could have a new zero VQ (or some other option)
> > to pass these pages guest to host, but this only
> > works well if page size matches the host page size.
> 
> I'm afraid I don't yet understand what kind of pages that would be and
> how they are different from ballooned pages.
> 
> I still tend to think that ballooning is a sensible solution to the
> problem at hand;

I think it is, too. That does not mean we can't improve things, though.
This patchset is reported to improve things; it should be
split up so that we improve them for everyone, not just
one specific workload.


> it's just the granularity that makes things slow and
> stands in the way.

So we could request a specific page size/alignment from the guest:
ask the guest to give us memory in aligned units of 2 MB,
and then the host can treat each of these as a single huge page.
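One way to picture the 2 MB idea: if the guest reports free memory as a bitmap of 4 KiB pages, the host only cares about the 2 MiB-aligned chunks whose 512 pages are all free. A minimal C sketch, assuming a hypothetical bitmap layout (bit set = page free, bit 0 = guest page frame 0, 64 bits per word); it is not an existing QEMU or virtio interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_CHUNK 512u                  /* 2 MiB / 4 KiB */
#define WORDS_PER_CHUNK (PAGES_PER_CHUNK / 64u)

/* Count 2 MiB-aligned chunks whose 512 small pages are all marked free.
 * nr_pages is the number of 4 KiB pages covered by the bitmap; a trailing
 * partial chunk is ignored because it cannot back a whole huge page. */
static size_t count_free_huge_chunks(const uint64_t *free_bitmap, size_t nr_pages)
{
    size_t nr_chunks = nr_pages / PAGES_PER_CHUNK;
    size_t free_chunks = 0;

    assert(free_bitmap != NULL || nr_chunks == 0);
    for (size_t c = 0; c < nr_chunks; c++) {
        const uint64_t *w = free_bitmap + c * WORDS_PER_CHUNK;
        size_t i = 0;

        /* a chunk is free only if all eight of its words are all-ones */
        while (i < WORDS_PER_CHUNK && w[i] == UINT64_MAX)
            i++;
        if (i == WORDS_PER_CHUNK)
            free_chunks++;
    }
    return free_chunks;
}
```

Because the chunks are aligned, each one maps to exactly eight 64-bit words, so the check compares whole words instead of testing 512 individual bits.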


> Roman.
-- 
MST


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-09 17:04 UTC
  To: Michael S. Tsirkin
  Cc: riel, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, Li, Liang Z, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160309173017-mutt-send-email-mst@redhat.com>

On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > For (1) I've been trying to make a point that skipping clean pages is
> > much more likely to result in noticeable benefit than free pages only.
> 
> I guess when you say clean you mean zero?

No I meant clean, i.e. those that could be evicted from RAM without
causing I/O.

> Yea. In fact, one can zero out any number of pages
> quickly by putting them in balloon and immediately
> taking them out.
> 
> Access will fault a zero page in, then COW kicks in.

I must be missing something obvious, but how is that different from
inflating and then immediately deflating the balloon?

> We could have a new zero VQ (or some other option)
> to pass these pages guest to host, but this only
> works well if page size matches the host page size.

I'm afraid I don't yet understand what kind of pages that would be and
how they are different from ballooned pages.

I still tend to think that ballooning is a sensible solution to the
problem at hand; it's just the granularity that makes things slow and
stands in the way.

Roman.


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-09 15:41 UTC
  To: Roman Kagan, Li, Liang Z, Dr. David Alan Gilbert,
	ehabkost@redhat.com, kvm@vger.kernel.org, quintela@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net, riel
In-Reply-To: <20160309142851.GA9715@rkaganb.sw.ru>

On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live migration
> > > > > code is in migration/ram.c.
> > > > 
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > > 
> > > 
> > > Yes, we really can teach qemu to skip these pages and it's not hard.  
> > > The problem is the poor performance, this PV solution
> > 
> > Balloon is always PV. And do not call patches solutions please.
> > 
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> > 
> > We need to get a bit beyond this. You are making multiple
> > changes; it makes sense to split them all up and analyse each
> > change separately.
> 
> Couldn't agree more.
> 
> There are three stages in this optimization:
> 
> 1) choosing which pages to skip
> 
> 2) communicating them from guest to host
> 
> 3) skip transferring uninteresting pages to the remote side on migration
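Stage (3) reduces to bitmap arithmetic on the host. A minimal sketch, assuming one bit per guest page and a hypothetical `skippable` bitmap built from whatever the guest reported. Pages in it are dropped from the to-send bitmap. Pages dirtied after the report (as captured by dirty logging started beforehand) are put back, so a page the guest reuses in the meantime is still sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in x (portable; clears the lowest set bit each pass). */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;

    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* to_send    : pages the migration code would otherwise transfer
 * skippable  : pages the guest reported as not worth sending (hypothetical)
 * dirty_since: pages written after that report was taken
 * Updates to_send in place and returns how many pages still need sending. */
static size_t filter_to_send(uint64_t *to_send, const uint64_t *skippable,
                             const uint64_t *dirty_since, size_t nr_words)
{
    size_t remaining = 0;

    assert(nr_words == 0 || (to_send && skippable && dirty_since));
    for (size_t i = 0; i < nr_words; i++) {
        to_send[i] = (to_send[i] & ~skippable[i]) | dirty_since[i];
        remaining += popcount64(to_send[i]);
    }
    return remaining;
}
```

The OR with `dirty_since` is what makes skipping safe: the guest's report is only a hint, and any later write wins.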
> 
> For (3) there seems to be low-hanging fruit: amend
> migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This
> would work for guest RAM that hasn't been touched yet or that has been
> ballooned out.
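A minimal sketch of the pagemap check, written as a standalone helper rather than as a change to QEMU's is_zero_range(). It relies on the documented /proc/self/pagemap format (one 64-bit entry per virtual page; bit 63 = present, bit 62 = swapped), whose flag bits remain readable without privileges on recent kernels even though the PFN is hidden. The helper name is hypothetical, and "not present, not swapped, therefore reads as zero" only holds for private anonymous memory; a non-present page of a file-backed or shared mapping may well hold data.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>   /* mmap() in the usage example */
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT (1ULL << 63)
#define PM_SWAPPED (1ULL << 62)

/* Returns 1 if no page in [addr, addr + len) is present in RAM or in swap,
 * 0 if at least one is, and -1 if /proc/self/pagemap cannot be read. */
static int range_unpopulated(const void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    uintptr_t psz = (uintptr_t)ps;
    uintptr_t first = (uintptr_t)addr / psz;
    uintptr_t last = ((uintptr_t)addr + len + psz - 1) / psz;  /* exclusive */
    int fd = open("/proc/self/pagemap", O_RDONLY);
    int ret = 1;

    if (fd < 0)
        return -1;
    for (uintptr_t vpn = first; vpn < last; vpn++) {
        uint64_t entry;
        off_t off = (off_t)(vpn * sizeof(entry));  /* one entry per page */

        if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry)) {
            ret = -1;
            break;
        }
        if (entry & (PM_PRESENT | PM_SWAPPED)) {
            ret = 0;
            break;
        }
    }
    close(fd);
    return ret;
}
```

A page that was only read is reported as present (it maps the shared zero page), so the helper errs on the side of sending it.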
> 
> For (1) I've been trying to make a point that skipping clean pages is
> much more likely to result in noticeable benefit than free pages only.

I guess when you say clean you mean zero?

Yea. In fact, one can zero out any number of pages
quickly by putting them in balloon and immediately
taking them out.

Access will fault a zero page in, then COW kicks in.
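The "zero pages cheaply" effect has a direct userspace analogue, and it is roughly what happens on the host side when a balloon page is discarded: on a private anonymous mapping, madvise(MADV_DONTNEED) drops the backing pages, a later read sees zero-filled memory, and a write faults in a fresh private page. A minimal sketch of that behaviour; the helper name is hypothetical, and this illustrates the mechanism, not the balloon driver.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Discard the pages backing buf (which must be page-aligned and part of a
 * private anonymous mapping), then check that the range reads back as zero.
 * Returns 1 if all bytes are zero, 0 if not, -1 if madvise() failed. */
static int discard_and_check_zero(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (madvise(buf, len, MADV_DONTNEED) != 0)
        return -1;
    /* Reads fault in zero-filled pages (typically the shared zero page). */
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

After the check, writing any byte triggers the copy-on-write fault described above and gives that page its own private frame again.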

We could have a new zero VQ (or some other option)
to pass these pages guest to host, but this only
works well if page size matches the host page size.

> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays
> badly with transparent huge pages (as both the guest and the host work
> with one 4k page at a time).  This is a problem for other use cases of
> balloon (e.g. as a facility for resource management); tackling that
> appears a more natural application for optimization efforts.
> 
> Thanks,
> Roman.
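To put the granularity point in numbers, assume (for illustration only) that 1 GiB is ballooned and every unit is requested and processed individually. At 4 KiB per page that is 512 times as many operations as with 2 MiB units:

```latex
\frac{2^{30}\,\text{B}}{2^{12}\,\text{B per request}} = 2^{18} = 262\,144 \text{ requests},
\qquad
\frac{2^{30}\,\text{B}}{2^{21}\,\text{B per request}} = 2^{9} = 512 \text{ requests},
\qquad
\frac{2^{18}}{2^{9}} = 512
```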
