Linux virtualization list
* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-09 15:41 UTC
  To: Roman Kagan, Li, Liang Z, Dr. David Alan Gilbert,
	ehabkost@redhat.com, kvm@vger.kernel.org, quintela@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net, riel
In-Reply-To: <20160309142851.GA9715@rkaganb.sw.ru>

On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live
> > > > > migration code is in migration/ram.c.
> > > > 
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > > 
> > > 
> > > Yes, we really can teach qemu to skip these pages and it's not hard.  
> > > The problem is the poor performance; this PV solution
> > 
> > Balloon is always PV. And do not call patches solutions please.
> > 
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> > 
> > We need to get a bit beyond this.  You are making multiple
> > changes; it seems to make sense to split it all up, and analyse each
> > change separately.
> 
> Couldn't agree more.
> 
> There are three stages in this optimization:
> 
> 1) choosing which pages to skip
> 
> 2) communicating them from guest to host
> 
> 3) skipping the transfer of uninteresting pages to the remote side on migration
> 
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This
> would work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
> 
> For (1) I've been trying to make a point that skipping clean pages is
> much more likely to result in noticeable benefit than free pages only.

I guess when you say clean you mean zero?

Yea. In fact, one can zero out any number of pages
quickly by putting them in balloon and immediately
taking them out.

A subsequent access will fault a zero page in; on a write, COW kicks in.
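A userspace sketch of the effect described above, assuming Linux and private anonymous memory (which is how QEMU allocates guest RAM by default). On inflate, QEMU's balloon device discards the page with madvise(MADV_DONTNEED) via its qemu_madvise() wrapper; the next access then sees zeros. demo_discard_zeroes_pages() is an illustrative name, not QEMU code.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return true if every byte in [p, p + len) is zero. */
static bool range_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0)
            return false;
    }
    return true;
}

/* Model of "put pages in the balloon": discard pages 1-2 of a four-page
 * private anonymous buffer with MADV_DONTNEED, then read them back.
 * Returns 1 if the discarded pages read as zero while pages 0 and 3 keep
 * their data, 0 if not, -1 on error. */
int demo_discard_zeroes_pages(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 4 * page;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int result;

    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xab, len);                 /* dirty all four pages */
    if (madvise(buf + page, 2 * page, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* Reading a discarded page typically maps the shared zero page;
     * a later write would get a private copy (copy-on-write). */
    result = range_is_zero(buf + page, 2 * page) &&
             buf[0] == 0xab && buf[3 * page] == 0xab;
    munmap(buf, len);
    return result;
}
```

No data has to be copied to zero the pages: the discard simply drops them, which is why this trick is cheap even for many pages.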

We could have a new zero VQ (or some other option)
to pass these pages from guest to host, but this only
works well if the guest page size matches the host page size.
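To make the page-size caveat concrete: if the host backs guest RAM with larger pages than the guest reports (say 2 MiB host pages under 4 KiB guest pages), a host page can only be dropped when every guest page inside it is free. A minimal sketch of that check, with hypothetical helper names and a bitmap of one bit per guest page:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Test bit nr of a bitmap stored as an array of unsigned long. */
static bool bitmap_test(const unsigned long *map, size_t nr)
{
    return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/* Count the host pages that can be discarded because every guest page
 * inside them is reported free.  ratio = host page size / guest page
 * size, e.g. 512 for 2 MiB host pages backing 4 KiB guest pages.
 * Trailing guest pages that do not fill a whole host page are ignored. */
static size_t discardable_host_pages(const unsigned long *guest_free,
                                     size_t nr_guest_pages, size_t ratio)
{
    size_t count = 0;

    for (size_t h = 0; h < nr_guest_pages / ratio; h++) {
        bool all_free = true;

        for (size_t g = h * ratio; g < (h + 1) * ratio; g++) {
            if (!bitmap_test(guest_free, g)) {
                all_free = false;   /* one busy guest page pins the host page */
                break;
            }
        }
        if (all_free)
            count++;
    }
    return count;
}
```

With ratio 1 (matching page sizes) every free guest page counts; with 512, a single busy 4 KiB guest page keeps an entire 2 MiB host page resident.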




> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays
> badly with transparent huge pages (as both the guest and the host work
> with one 4k page at a time).  This is a problem for other use cases of
> balloon (e.g. as a facility for resource management); tackling that
> appears a more natural application for optimization efforts.
> 
> Thanks,
> Roman.


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-09 15:30 UTC
  To: Li, Liang Z
  Cc: riel@redhat.com, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, Roman Kagan,
	amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E041498BA@shsmsx102.ccr.corp.intel.com>

On Wed, Mar 09, 2016 at 03:27:54PM +0000, Li, Liang Z wrote:
> > On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > > processed during live migration without skipping. The live
> > > > > > migration code is in migration/ram.c.
> > > > >
> > > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > > teach qemu to skip these pages.
> > > > > Want to write a patch to do this?
> > > > >
> > > >
> > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > The problem is the poor performance; this PV solution
> > >
> > > Balloon is always PV. And do not call patches solutions please.
> > >
> > > > is aimed to make it more
> > > > efficient and reduce the performance impact on guest.
> > >
> > > We need to get a bit beyond this.  You are making multiple changes; it
> > > seems to make sense to split it all up, and analyse each change
> > > separately.
> > 
> > Couldn't agree more.
> > 
> > There are three stages in this optimization:
> > 
> > 1) choosing which pages to skip
> > 
> > 2) communicating them from guest to host
> > 
> > 3) skipping the transfer of uninteresting pages to the remote side on migration
> > 
> > For (3) there seems to be a low-hanging fruit to amend
> > migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This would
> > work for guest RAM that hasn't been touched yet or which has been
> > ballooned out.
> > 
> > For (1) I've been trying to make a point that skipping clean pages is much
> > more likely to result in noticeable benefit than free pages only.
> > 
> 
> I am considering dropping the page cache before getting the free pages.
> 
> > As for (2), we do seem to have a problem with the existing balloon:
> > according to your measurements it's very slow; besides, I guess it plays badly
> 
> I didn't say communicating is slow. Even if it were, my solution uses a bitmap instead of
> PFNs, so there is less data traffic, and it's faster than the existing balloon, which uses PFNs.

By how much?

> > with transparent huge pages (as both the guest and the host work with one
> > 4k page at a time).  This is a problem for other use cases of balloon (e.g. as a
> > facility for resource management); tackling that appears a more natural
> > application for optimization efforts.
> > 
> > Thanks,
> > Roman.


* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-09 15:27 UTC
  To: Roman Kagan, Michael S. Tsirkin
  Cc: riel@redhat.com, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, Dr. David Alan Gilbert,
	rth@twiddle.net
In-Reply-To: <20160309142851.GA9715@rkaganb.sw.ru>

> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live
> > > > > migration code is in migration/ram.c.
> > > >
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > >
> > >
> > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > The problem is the poor performance; this PV solution
> >
> > Balloon is always PV. And do not call patches solutions please.
> >
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> >
> > We need to get a bit beyond this.  You are making multiple changes; it
> > seems to make sense to split it all up, and analyse each change
> > separately.
> 
> Couldn't agree more.
> 
> There are three stages in this optimization:
> 
> 1) choosing which pages to skip
> 
> 2) communicating them from guest to host
> 
> 3) skipping the transfer of uninteresting pages to the remote side on migration
> 
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This would
> work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
> 
> For (1) I've been trying to make a point that skipping clean pages is much
> more likely to result in noticeable benefit than free pages only.
> 

I am considering dropping the page cache before getting the free pages.

> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays badly

I didn't say communicating is slow. Even if it were, my solution uses a bitmap instead of
PFNs, so there is less data traffic, and it's faster than the existing balloon, which uses PFNs.
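To put rough numbers on the bitmap-versus-PFN claim, here is a back-of-the-envelope sketch. It assumes 4 KiB pages and the legacy virtio-balloon format of 32-bit PFNs sent in batches of at most 256 (VIRTIO_BALLOON_ARRAY_PFNS_MAX in the Linux driver). The function names are illustrative, and the 6 GiB free figure used below is an assumption, not a measurement from this thread.

```c
#include <stdint.h>

/* Assumptions: 4 KiB pages, one bit per page in the bitmap, and 32-bit
 * PFNs sent in batches of at most 256 as in the legacy balloon driver. */
#define PAGE_BYTES     4096ULL
#define PFN_BYTES      4ULL
#define PFNS_PER_BATCH 256ULL

/* Bitmap size depends only on guest RAM size, not on how much is free. */
uint64_t bitmap_bytes(uint64_t guest_ram_bytes)
{
    uint64_t pages = guest_ram_bytes / PAGE_BYTES;
    return (pages + 7) / 8;
}

/* A PFN list grows with the number of free pages. */
uint64_t pfn_list_bytes(uint64_t free_bytes)
{
    return free_bytes / PAGE_BYTES * PFN_BYTES;
}

/* Number of batches needed to send that PFN list. */
uint64_t pfn_batches(uint64_t free_bytes)
{
    uint64_t pfns = free_bytes / PAGE_BYTES;
    return (pfns + PFNS_PER_BATCH - 1) / PFNS_PER_BATCH;
}
```

For an 8 GiB guest with 6 GiB free, bitmap_bytes(8ULL << 30) is 262144 bytes (256 KiB), while pfn_list_bytes(6ULL << 30) is 6291456 bytes (6 MiB) spread over 6144 batches. Because a PFN costs 32 bits and a bitmap entry 1 bit, the bitmap is smaller whenever more than 1/32 of guest pages are free. This only counts bytes; it says nothing about the time spent allocating pages in the guest.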

> with transparent huge pages (as both the guest and the host work with one
> 4k page at a time).  This is a problem for other use cases of balloon (e.g. as a
> facility for resource management); tackling that appears a more natural
> application for optimization efforts.
> 
> Thanks,
> Roman.


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-09 14:28 UTC
  To: Michael S. Tsirkin
  Cc: riel, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, Li, Liang Z, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160307110852-mutt-send-email-mst@redhat.com>

On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live
> > > > migration code is in migration/ram.c.
> > > 
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > > 
> > 
> > Yes, we really can teach qemu to skip these pages and it's not hard.  
> > The problem is the poor performance; this PV solution
> 
> Balloon is always PV. And do not call patches solutions please.
> 
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
> 
> We need to get a bit beyond this.  You are making multiple
> changes; it seems to make sense to split it all up, and analyse each
> change separately.

Couldn't agree more.

There are three stages in this optimization:

1) choosing which pages to skip

2) communicating them from guest to host

3) skipping the transfer of uninteresting pages to the remote side on migration

For (3) there seems to be a low-hanging fruit to amend
migration/ram.c:is_zero_range() to consult /proc/self/pagemap.  This
would work for guest RAM that hasn't been touched yet or which has been
ballooned out.
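The pagemap idea for (3) can be sketched in userspace. Each 8-byte /proc/self/pagemap entry describes one virtual page; bit 63 means "present in RAM" and bit 62 means "swapped". If neither is set for private anonymous memory (QEMU's default guest RAM allocation), the page was never touched or has been discarded, for example by ballooning, so it is known to be zero without reading it. The helper names are hypothetical, and how this would be wired into is_zero_range() is left open. The check is conservative: a page that was only read may map the shared zero page and show up as present.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag bits of a /proc/<pid>/pagemap entry (see the kernel's pagemap
 * documentation). */
#define PM_PRESENT (1ULL << 63)
#define PM_SWAPPED (1ULL << 62)

/* Return 1 if the page containing addr has neither a RAM page nor a swap
 * entry, 0 if it is backed, -1 on error.  Only meaningful for private
 * anonymous memory, where an unpopulated page reads back as zeros. */
int page_is_unpopulated(int pagemap_fd, const void *addr)
{
    uint64_t entry;
    uint64_t page = (uint64_t)sysconf(_SC_PAGESIZE);
    off_t off = (off_t)((uintptr_t)addr / page * sizeof(entry));

    if (pread(pagemap_fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
        return -1;
    return (entry & (PM_PRESENT | PM_SWAPPED)) == 0;
}

/* Demo: states[0] for a fresh page, states[1] after writing it,
 * states[2] after discarding it with MADV_DONTNEED.
 * Returns 0 on success, -1 on error. */
int pagemap_demo(int states[3])
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    char *p;
    int rc = 0;

    if (fd < 0)
        return -1;
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    states[0] = page_is_unpopulated(fd, p);     /* never touched */
    p[0] = 1;
    states[1] = page_is_unpopulated(fd, p);     /* written: now backed */
    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0)
        rc = -1;
    states[2] = page_is_unpopulated(fd, p);     /* discarded again */
    munmap(p, (size_t)page);
    close(fd);
    return rc;
}
```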

For (1) I've been trying to make a point that skipping clean pages is
much more likely to result in noticeable benefit than free pages only.

As for (2), we do seem to have a problem with the existing balloon:
according to your measurements it's very slow; besides, I guess it plays
badly with transparent huge pages (as both the guest and the host work
with one 4k page at a time).  This is a problem for other use cases of
balloon (e.g. as a facility for resource management); tackling that
appears a more natural application for optimization efforts.

Thanks,
Roman.


* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-09 14:19 UTC
  To: Roman Kagan, Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, amit.shah@redhat.com, Paolo Bonzini,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160309132210.GA5869@rkaganb.sw.ru>

> On Fri, Mar 04, 2016 at 06:51:21PM +0000, Dr. David Alan Gilbert wrote:
> > * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > >
> > >
> > > On 04/03/2016 15:26, Li, Liang Z wrote:
> > > >> >
> > > >> > The memory usage will keep increasing due to ever growing
> > > >> > caches, etc, so you'll be left with very little free memory fairly soon.
> > > >> >
> > > > I don't think so.
> > > >
> > >
> > > Roman is right.  For example, here I am looking at a 64 GB
> > > (physical) machine which was booted about 30 minutes ago, and which
> > > is running disk-heavy workloads (installing VMs).
> > >
> > > Since I have started writing this email (2 minutes?), the amount of
> > > free memory has already gone down from 37 GB to 33 GB.  I expect
> > > that by the time I have finished running the workload, in two hours,
> > > it will not have any free memory.
> >
> > But what about a VM sitting idle, or that just has more RAM assigned
> > to it than it is currently using.
> >  I've got a host here that's been up for 46 days and has been doing
> > some heavy VM debugging a few days ago, but today:
> >
> > # free -m
> >               total        used        free      shared  buff/cache   available
> > Mem:          96536        1146       44834         184       50555       94735
> >
> > I very rarely use all its RAM, so it's got a big chunk of free RAM,
> > and yes it's got a big chunk of cache as well.
> 
> One of the promises of virtualization is better resource utilization.
> People tend to avoid purchasing VMs so oversized that they never
> touch a significant amount of their RAM.  (Well, at least this is how things
> stand in the hosting market; I guess the enterprise market is similar in this regard.)
> 
> That said, I'm not at all opposed to optimizing the migration of free memory;
> what I'm trying to say is that creating brand new infrastructure specifically for
> that case doesn't look justified when the existing one can cover it in addition
> to much more common scenarios.
> 
> Roman.

Even though the existing one can cover more common scenarios, it has a performance issue.
That's why I created a new one.

Liang


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Roman Kagan @ 2016-03-09 13:22 UTC
  To: Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, mst@redhat.com, amit.shah@redhat.com,
	Paolo Bonzini, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160304185120.GB2588@work-vm>

On Fri, Mar 04, 2016 at 06:51:21PM +0000, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > 
> > 
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> > 
> > >> > The memory usage will keep increasing due to ever growing caches, etc, so
> > >> > you'll be left with very little free memory fairly soon.
> > >> > 
> > > I don't think so.
> > > 
> > 
> > Roman is right.  For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> > 
> > Since I have started writing this email (2 minutes?), the amount of free
> > memory has already gone down from 37 GB to 33 GB.  I expect that by the
> > time I have finished running the workload, in two hours, it will not
> > have any free memory.
> 
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than it is currently using.
>  I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96536        1146       44834         184       50555       94735
> 
> I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
> it's got a big chunk of cache as well.

One of the promises of virtualization is better resource utilization.
People tend to avoid purchasing VMs so oversized that they never
touch a significant amount of their RAM.  (Well, at least this is how
things stand in the hosting market; I guess the enterprise market is
similar in this regard.)

That said, I'm not at all opposed to optimizing the migration of free
memory; what I'm trying to say is that creating brand new infrastructure
specifically for that case doesn't look justified when the existing one
can cover it in addition to much more common scenarios.

Roman.


* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-09  6:18 UTC
  To: Paolo Bonzini, Roman Kagan
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <56D9B6C2.3070708@redhat.com>

> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> >
> >> > The memory usage will keep increasing due to ever growing caches,
> >> > etc, so you'll be left with very little free memory fairly soon.
> >> >
> > I don't think so.
> >
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical) machine
> which was booted about 30 minutes ago, and which is running disk-heavy
> workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not have any
> free memory.
> 
> Paolo

I have a VM with 2GB of RAM. When the guest booted, there were about 1.4GB of free pages.
Then I downloaded a large file from the internet with the browser; after the download finished,
there were only 72MB of free pages left. As Roman pointed out, there was quite a lot of Cached memory.
Then I compiled QEMU; after the compilation finished, there were about 1.3GB of free pages.

So even if the cache grows large, it will be freed when some other workloads need the memory.
Still, cache memory is a big issue that should be taken into consideration.
How about reclaiming some cache before getting the free pages information?
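The effect of reclaiming cache before collecting free-page information can be approximated from inside a Linux guest. This is a minimal sketch, not what the RFC patch does: it reads MemFree and Cached from /proc/meminfo, and drop_pagecache() does the equivalent of "echo 1 > /proc/sys/vm/drop_caches", which needs root. The helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the value (in kB) of a /proc/meminfo field such as "MemFree" or
 * "Cached", or -1 if the file or the field is missing. */
long meminfo_kb(const char *field)
{
    char line[256], name[64];
    long value, result = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%63[^:]: %ld", name, &value) == 2 &&
            strcmp(name, field) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}

/* Drop clean page-cache pages, like "echo 1 > /proc/sys/vm/drop_caches".
 * Needs root.  sync() first so dirty pages are written back and become
 * droppable too.  Returns 0 on success, -1 on failure. */
int drop_pagecache(void)
{
    FILE *f;
    int ok;

    sync();
    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f)
        return -1;
    ok = fputs("1", f) >= 0;
    if (fclose(f) != 0)   /* the buffered write may only fail here */
        ok = 0;
    return ok ? 0 : -1;
}
```

A guest agent could compare meminfo_kb("MemFree") before and after drop_pagecache(). drop_caches is a blunt, global instrument, though: the guest pays to re-read the dropped data later, which may cost more than migrating it would have.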

Liang 


* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-08 14:17 UTC
  To: Michael S. Tsirkin
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160308160145-mutt-send-email-mst@redhat.com>

> On Fri, Mar 04, 2016 at 03:13:03PM +0000, Li, Liang Z wrote:
> > > > Maybe I am not clear enough.
> > > >
> > > > I mean if we inflate the balloon before live migration, for an 8GB
> > > > guest it takes about 5 seconds for the inflating operation to finish.
> > >
> > > And these 5 seconds are spent where?
> > >
> >
> > The time is spent on allocating the pages and sending the allocated pages'
> > PFNs to QEMU through virtio.
> 
> What if we skip allocating pages but use the existing interface to send pfns to
> QEMU?
>

I think it will be much faster; allocating pages is the main reason the operation takes so long.
An experiment is needed to measure the exact time spent on sending the PFNs.

Liang


* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-08 14:03 UTC
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04145231@shsmsx102.ccr.corp.intel.com>

On Fri, Mar 04, 2016 at 03:13:03PM +0000, Li, Liang Z wrote:
> > > Maybe I am not clear enough.
> > >
> > > I mean if we inflate the balloon before live migration, for an 8GB guest it takes
> > > about 5 seconds for the inflating operation to finish.
> > 
> > And these 5 seconds are spent where?
> > 
> 
> The time is spent on allocating the pages and sending the allocated pages' PFNs to QEMU
> through virtio.

What if we skip allocating pages but use the existing interface to send pfns
to QEMU?

> > > For the PV solution, there is no need to inflate the balloon before live
> > > migration; the only cost is traversing the free_list to construct
> > > the free pages bitmap, which takes about 20ms for an 8GB idle guest (less if
> > > there are fewer free pages). Passing the free pages info to the host takes about
> > > an extra 3ms.
> > >
> > >
> > > Liang
> > 
> > So now let's please stop talking about solutions at a high level and discuss the
> > interface changes you make in detail.
> > What makes it faster? Better host/guest interface? No need to go through
> > buddy allocator within guest? Fewer interrupts? Something else?
> > 
> 
> I assume you are familiar with the current virtio-balloon and how it works.
> The new interface is very simple: send a request to the virtio-balloon driver;
> the driver traverses '&zone->free_area[order].free_list[t]' to
> construct a 'free_page_bitmap', and then sends the content
> of 'free_page_bitmap' back to QEMU. That's all the new interface does, and
> there is no 'alloc_page'-related work, so it's faster.
> 
> 
> Some code snippet:
> ----------------------------------------------
> +/* Set a bit in free_page_bitmap for every page on the zone's buddy
> + * free lists; PFNs at or above PFN_4G are shifted down by pfn_gap. */
> +static void mark_free_pages_bitmap(struct zone *zone,
> +		 unsigned long *free_page_bitmap, unsigned long pfn_gap)
> +{
> +	unsigned long pfn, flags, i;
> +	unsigned int order, t;
> +	struct list_head *curr;
> +
> +	if (zone_is_empty(zone))
> +		return;
> +
> +	/* Hold the zone lock so the free lists cannot change under us. */
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	for_each_migratetype_order(order, t) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +
> +			/* Each entry heads a free block of 2^order pages. */
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			for (i = 0; i < (1UL << order); i++) {
> +				if ((pfn + i) >= PFN_4G)
> +					set_bit_le(pfn + i - pfn_gap,
> +						   free_page_bitmap);
> +				else
> +					set_bit_le(pfn + i, free_page_bitmap);
> +			}
> +		}
> +	}
> +
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> ----------------------------------------------------
> Sorry for my poor English and expression; if it is still unclear,
> you could glance at the patch, which is about 400 lines in total.
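The bitmap construction in the snippet can also be modelled in userspace to check the indexing. This sketch replaces the buddy free lists with a hypothetical array of (pfn, order) blocks; MODEL_PFN_4G stands in for PFN_4G, and reading pfn_gap as the size of the memory hole below 4 GiB is an assumption, since the snippet does not define it.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_UL (sizeof(unsigned long) * CHAR_BIT)
/* First PFN at 4 GiB with 4 KiB pages; stands in for PFN_4G. */
#define MODEL_PFN_4G (1UL << 20)

/* Hypothetical stand-in for a buddy free-list entry: a block of
 * 2^order contiguous free pages starting at pfn. */
struct free_block {
    unsigned long pfn;
    unsigned int order;
};

static void bitmap_set(unsigned long *map, unsigned long nr)
{
    map[nr / BITS_PER_UL] |= 1UL << (nr % BITS_PER_UL);
}

bool bitmap_get(const unsigned long *map, unsigned long nr)
{
    return (map[nr / BITS_PER_UL] >> (nr % BITS_PER_UL)) & 1UL;
}

/* Same loop shape as mark_free_pages_bitmap(): every page of every free
 * block gets a bit, and PFNs at or above 4 GiB are shifted down by
 * pfn_gap.  Unlike the patch (set_bit_le), this uses native bit order. */
void mark_free_blocks(const struct free_block *blocks, size_t n,
                      unsigned long *bitmap, unsigned long pfn_gap)
{
    for (size_t b = 0; b < n; b++) {
        for (unsigned long i = 0; i < (1UL << blocks[b].order); i++) {
            unsigned long pfn = blocks[b].pfn + i;

            bitmap_set(bitmap, pfn >= MODEL_PFN_4G ? pfn - pfn_gap : pfn);
        }
    }
}
```

The kernel version uses set_bit_le() so that the bitmap layout the host receives does not depend on the guest's endianness; the model skips that detail for simplicity.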
> > 
> > > > --
> > > > MST


* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-08 13:11 UTC
  To: Amit Shah
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com,
	rth@twiddle.net
In-Reply-To: <20160308111343.GM15443@grmbl.mre>

> On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> > The current QEMU live migration implementation marks all of the
> > guest's RAM pages as dirty in the ram bulk stage; all these pages
> > will be processed, and that takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, the content of free pages doesn't
> > matter. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage; this can save a lot of CPU cycles,
> > reduce the network traffic significantly, and noticeably speed up
> > the live migration process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This makes
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take post-copy and RDMA into
> > consideration; both of them may benefit from this PV solution
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we don't
> end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
> 		Amit

I don't like using virtio-balloon for this either; it's confusing.
It would be great if virtio-serial can be used. I will take a look at it.

Thanks for your suggestion!

Liang


* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Amit Shah @ 2016-03-08 11:14 UTC
  To: Jitendra Kolhe
  Cc: ehabkost, kvm, qemu-devel, liang.z.li, dgilbert, linux-kernel,
	linux-mm, mst, mohan_parthasarathy, simhan, pbonzini, akpm,
	virtualization, rth
In-Reply-To: <1457083967-13681-1-git-send-email-jitendra.kolhe@hpe.com>

On (Fri) 04 Mar 2016 [15:02:47], Jitendra Kolhe wrote:
> > >
> > > * Liang Li (liang.z.li@intel.com) wrote:
> > > > The current QEMU live migration implementation marks all of the
> > > > guest's RAM pages as dirty in the ram bulk stage; all these pages
> > > > will be processed, and that takes quite a lot of CPU cycles.
> > > >
> > > > From the guest's point of view, the content of free pages doesn't
> > > > matter. We can make use of this fact and skip processing the free
> > > > pages in the ram bulk stage; this can save a lot of CPU cycles,
> > > > reduce the network traffic significantly, and noticeably speed up
> > > > the live migration process.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > to filter out the guest's free pages in the ram bulk stage. This makes
> > > > the live migration process much more efficient.
> > >
> > > Hi,
> > >   An interesting solution; I know a few different people have been looking at
> > > how to speed up ballooned VM migration.
> > >
> >
> > Ooh, different solutions for the same purpose, and both based on the balloon.
> 
> We were also trying to address a similar problem, without actually needing to modify
> the guest driver. Please find the patch details in the mail with the subject:
> migration: skip sending ram pages released by virtio-balloon driver

The scope of this patch series seems to be wider: don't send free
pages to a dest at all, vs. don't send pages that are ballooned out.

		Amit


* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Amit Shah @ 2016-03-08 11:13 UTC
  To: Liang Li
  Cc: ehabkost, kvm, mst, linux-kernel, qemu-devel, linux-mm, pbonzini,
	akpm, virtualization, dgilbert, rth
In-Reply-To: <1457001868-15949-1-git-send-email-liang.z.li@intel.com>

On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> The current QEMU live migration implementation marks all of the
> guest's RAM pages as dirty in the ram bulk stage; all these pages
> will be processed, and that takes quite a lot of CPU cycles.
> 
> From the guest's point of view, the content of free pages doesn't
> matter. We can make use of this fact and skip processing the free
> pages in the ram bulk stage; this can save a lot of CPU cycles,
> reduce the network traffic significantly, and noticeably speed up
> the live migration process.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This makes
> the live migration process much more efficient.
> 
> This RFC version doesn't take post-copy and RDMA into
> consideration; both of them may benefit from this PV solution
> with some extra modifications.
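In the bulk stage every page starts out dirty, so applying the guest's free-page bitmap amounts to clearing bits, word by word: dirty &= ~free. A minimal sketch with hypothetical names (QEMU's actual migration bitmap code is not reproduced here). One assumption worth stating: the free-page report presumably has to be taken after dirty logging is enabled, so that a page the guest reuses after reporting it free is marked dirty again and sent in a later pass; that ordering is what keeps pages that matter from being skipped.

```c
#include <limits.h>
#include <stddef.h>

/* Portable population count (Kernighan's method). */
static unsigned popcount_ul(unsigned long x)
{
    unsigned n = 0;

    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Clear every page the guest reported free from the migration dirty
 * bitmap and return how many dirty pages remain.  Both bitmaps use one
 * bit per guest page and the same word layout. */
size_t filter_free_pages(unsigned long *dirty, const unsigned long *guest_free,
                         size_t nr_words)
{
    size_t remaining = 0;

    for (size_t w = 0; w < nr_words; w++) {
        dirty[w] &= ~guest_free[w];
        remaining += popcount_ul(dirty[w]);
    }
    return remaining;
}
```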

I like the idea, just have to prove (review) and test it a lot to
ensure we don't end up skipping pages that matter.

However, there are a couple of points:

In my opinion, the information that's exchanged between the guest and
the host should be exchanged over a virtio-serial channel rather than
virtio-balloon.  First, there's nothing related to the balloon here.
It just happens to be memory info.  Second, I would never enable
balloon in a guest that I want to be performance-sensitive.  So even
if you add this as part of balloon, you'll find no one is using this
solution.

Secondly, I suggest virtio-serial, because it's meant exactly to
exchange free-flowing information between a host and a guest, and you
don't need to extend any part of the protocol for it (hence no changes
necessary to the spec).  You can see how spice, vnc, etc., use
virtio-serial to exchange data.
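virtio-serial gives the guest a plain byte stream (ports typically show up as /dev/virtio-ports/<name> via udev), so any message framing is up to the application. A sketch of what sending the bitmap could look like; the header layout, magic value, and port name are hypothetical, not part of any spec or of QEMU.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical message header; not part of any spec. */
struct free_bitmap_hdr {
    uint32_t magic;      /* identifies the message type */
    uint32_t nr_bytes;   /* length of the bitmap that follows */
};

#define FREE_BITMAP_MAGIC 0x46524545u  /* "FREE" */

/* write() may transfer fewer bytes than requested; loop until done. */
static int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one free-page bitmap over an already-open port fd.
 * Returns 0 on success, -1 on error. */
int send_free_bitmap(int fd, const void *bitmap, uint32_t nr_bytes)
{
    struct free_bitmap_hdr hdr = { FREE_BITMAP_MAGIC, nr_bytes };

    if (write_all(fd, &hdr, sizeof(hdr)) != 0)
        return -1;
    return write_all(fd, bitmap, nr_bytes);
}
```

In a guest, fd would come from something like open("/dev/virtio-ports/org.example.freepages", O_WRONLY), where the port name is made up. A real protocol would also fix the byte order of the header fields, which this sketch leaves native.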


		Amit


* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-07 15:06 UTC
  To: Michael S. Tsirkin
  Cc: riel@redhat.com, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, Roman Kagan,
	amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160307110852-mutt-send-email-mst@redhat.com>

> On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live
> > > > migration code is in migration/ram.c.
> > >
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > >
> >
> > Yes, we really can teach qemu to skip these pages and it's not hard.
> > The problem is the poor performance; this PV solution
> 
> Balloon is always PV. And do not call patches solutions please.
> 

OK.
  
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
> 
> We need to get a bit beyond this.  You are making multiple changes; it seems
> to make sense to split it all up, and analyse each change separately.  If you
> don't, this patchset will be stuck: as you have seen, people aren't convinced it
> actually helps with real workloads.
> 
Agreed, changing the virtio spec requires good reasons.

> > > > >
> > > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > > migration' is simple,
> > > > > > > nothing more.
> > > > > > >
> > > > > > > That's a big advantage.  Another one is that it does
> > > > > > > something useful in real- world scenarios.
> > > > > > >
> > > > > >
> > > > > > I don't think the heave performance impaction is something
> > > > > > useful in real
> > > > > world scenarios.
> > > > > >
> > > > > > Liang
> > > > > > > Roman.
> > > > >
> > > > > So fix the performance then. You will have to try harder if you
> > > > > want to convince people that the performance is due to bad
> > > > > host/guest interface, and so we have to change *that*.
> > > > >
> > > >
> > > > Actually, the PV solution is irrelevant with the balloon
> > > > mechanism, I just use it to transfer information between host and
> guest.
> > > > I am not sure if I should implement a new virtio device, and I
> > > > want to get the answer from the community.
> > > > In this RFC patch, to make things simple, I choose to extend the
> > > > virtio-balloon and use the extended interface to transfer the
> > > > request and
> > > free_page_bimap content.
> > > >
> > > > I am not intend to change the current virtio-balloon implementation.
> > > >
> > > > Liang
> > >
> > > And the answer would depend on the answer to my question above.
> > > Does balloon need an interface passing page bitmaps around?
> >
> > Yes, I need a new interface.
> 
> Possibly, but you will need to justify this at some level if you care about
> upstreaming your patches.
> 
> > > Does this speed up any operations?
> >
> > No, a new interface will not speed up anything, but it is the easiest way to
> solve the compatibility issue.
> 
> A bunch of new code is often easier to write than to figure out the old one,
> but if we keep piling it up we'll end up with an unmaintainable mess. So we
> are rather careful about adding new interfaces, and we try to make them
> generic sometimes even at cost of slight inefficiencies.
> 
> > > OTOH what if you use the regular balloon interface with your patches?
> > >
> >
> > The regular balloon interfaces have their specific function and I can't use
> them in my patches.
> > If using these regular interface, I have to do a lot of changes to keep the
> compatibility.
> 
> Why can't you?
> 
> What exactly do we need to change?
> 
> If we put things in terms of the balloon, that supports adding and removing
> pages.
> 
> Using these terms, let's enumerate:
> - a new method (e.g. new virtqueue) that adds and immediately removes
> page in a balloon
> 	clearly, you can add then remove using the existing interfaces
> 	is a single command significantly faster than using existing two vqs?
> - a new kind of request that says "add (and immediately remove?) as many
> pages as you can"
> 	sounds rather benign
> - a new kind of message that adds multiple pages using a bitmap
>   	(instead of an address list)
> 	again, is this significantly faster?

More or less faster, because of less data traffic. I haven't measured this; I will do so, and take a deep look
at the approach you suggest if we choose to make use of the virtio-balloon interface.

> 
> Does not look like compatibility is an issue, to me.
> 
> 
> At some level, your patches look like page hints.
> If we have more patches in mind that use page hints, then a new hint device
> might make sense.
> 

Yes, I have considered implementing a new device. Using virtio-balloon to
transfer free page information, which is unrelated to the balloon mechanism,
is more or less confusing.

> However, people experimented with page hints in the past, so far this always
> went nowhere.  E.g. I CC Rick who saw some problems when page hints
> interact with huge pages. Rick, could you elaborate please?
> 

Thanks a lot. I look forward to hearing about the problems.

Liang
> 
> --
> MST
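To put rough numbers on the data-traffic question for the bitmap message discussed above, here is a minimal C sketch. It assumes 4 KiB pages and 32-bit PFNs (the balloon output handler quoted later in this thread reads 4-byte PFNs); the helper names are hypothetical and are not part of QEMU or the virtio spec. It only compares payload sizes, not end-to-end speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and 32-bit PFNs, matching the 4-byte pfn that
 * the virtio-balloon output handler reads with iov_to_buf(). */
#define SKETCH_PAGE_SHIFT 12

/* Bytes needed to report n_pages as a list of 32-bit PFNs. */
static uint64_t pfn_list_bytes(uint64_t n_pages)
{
    return n_pages * sizeof(uint32_t);
}

/* Bytes needed for a bitmap with one bit per page over range_pages pages. */
static uint64_t bitmap_bytes(uint64_t range_pages)
{
    return (range_pages + 7) / 8;
}

/* Set one bit per reported PFN; every PFN must lie below range_pages and
 * the bitmap must hold at least bitmap_bytes(range_pages) bytes. */
static void pfns_to_bitmap(const uint32_t *pfns, size_t n,
                           uint8_t *bitmap, uint64_t range_pages)
{
    for (size_t i = 0; i < n; i++) {
        assert(pfns[i] < range_pages);
        bitmap[pfns[i] / 8] |= (uint8_t)(1u << (pfns[i] % 8));
    }
}
```

For the 8G guest mentioned in this thread, (8ULL << 30) >> SKETCH_PAGE_SHIFT gives 2097152 pages: reporting all of them costs 8388608 bytes (8 MiB) as a PFN list but 262144 bytes (256 KiB) as a bitmap. Since 4N < R/8 exactly when N < R/32, a PFN list stays smaller whenever fewer than 1/32 of the pages in the covered range are reported, so any gain from a bitmap depends on how dense the reported set is.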

* Re: [PATCH 05/16] drm/gma: removed optional dummy crtc mode_fixup function.
From: Patrik Jakobsson @ 2016-03-07 14:31 UTC
  To: Carlos Palminha
  Cc: nicolas.pitre, boris.brezillon, jianwei.wang.chn, David Airlie,
	Daniel Vetter, alison.wang, dri-devel, virtualization,
	linux-renesas-soc, Jani Nikula, Laurent Pinchart,
	Benjamin Gaignard, vincent.abriou, Sudip Mukherjee
In-Reply-To: <fc5717ec27bea9b114e48b9d3de1512cf48f44b6.1455630967.git.palminha@synopsys.com>

On Tue, Feb 16, 2016 at 3:17 PM, Carlos Palminha
<CARLOS.PALMINHA@synopsys.com> wrote:
> This patch set nukes all the dummy crtc mode_fixup implementations.
> (made on top of Daniel topic/drm-misc branch)
>
> Signed-off-by: Carlos Palminha <palminha@synopsys.com>

You should try to avoid mixing code style fixes with functional
changes, but in the case of gma500 it's hard to resist. No need to change
this.

This might already have been merged but if not:

Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>

> ---
>  drivers/gpu/drm/gma500/cdv_intel_display.c   | 13 ++++++-------
>  drivers/gpu/drm/gma500/gma_display.c         |  7 -------
>  drivers/gpu/drm/gma500/gma_display.h         |  3 ---
>  drivers/gpu/drm/gma500/mdfld_intel_display.c |  2 --
>  drivers/gpu/drm/gma500/oaktrail_crtc.c       |  1 -
>  drivers/gpu/drm/gma500/psb_intel_display.c   |  1 -
>  6 files changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/gma500/cdv_intel_display.c b/drivers/gpu/drm/gma500/cdv_intel_display.c
> index 6126546..17db4b4 100644
> --- a/drivers/gpu/drm/gma500/cdv_intel_display.c
> +++ b/drivers/gpu/drm/gma500/cdv_intel_display.c
> @@ -116,7 +116,7 @@ static const struct gma_limit_t cdv_intel_limits[] = {
>          .p1 = {.min = 1, .max = 10},
>          .p2 = {.dot_limit = 225000, .p2_slow = 10, .p2_fast = 10},
>          .find_pll = cdv_intel_find_dp_pll,
> -        }
> +       }
>  };
>
>  #define _wait_for(COND, MS, W) ({ \
> @@ -245,7 +245,7 @@ cdv_dpll_set_clock_cdv(struct drm_device *dev, struct drm_crtc *crtc,
>         /* We don't know what the other fields of these regs are, so
>          * leave them in place.
>          */
> -       /*
> +       /*
>          * The BIT 14:13 of 0x8010/0x8030 is used to select the ref clk
>          * for the pipe A/B. Display spec 1.06 has wrong definition.
>          * Correct definition is like below:
> @@ -256,7 +256,7 @@ cdv_dpll_set_clock_cdv(struct drm_device *dev, struct drm_crtc *crtc,
>          *
>          * if DPLLA sets 01 and DPLLB sets 02, both use clk from DPLLA
>          *
> -        */
> +        */
>         ret = cdv_sb_read(dev, ref_sfr, &ref_value);
>         if (ret)
>                 return ret;
> @@ -646,7 +646,7 @@ static int cdv_intel_crtc_mode_set(struct drm_crtc *crtc,
>                  * for DP/eDP. When using SSC clock, the ref clk is 100MHz.Otherwise
>                  * it will be 27MHz. From the VBIOS code it seems that the pipe A choose
>                  * 27MHz for DP/eDP while the Pipe B chooses the 100MHz.
> -                */
> +                */
>                 if (pipe == 0)
>                         refclk = 27000;
>                 else
> @@ -659,7 +659,7 @@ static int cdv_intel_crtc_mode_set(struct drm_crtc *crtc,
>         }
>
>         drm_mode_debug_printmodeline(adjusted_mode);
> -
> +
>         limit = gma_crtc->clock_funcs->limit(crtc, refclk);
>
>         ok = limit->find_pll(limit, crtc, adjusted_mode->clock, refclk,
> @@ -721,7 +721,7 @@ static int cdv_intel_crtc_mode_set(struct drm_crtc *crtc,
>                         pipeconf |= PIPE_6BPC;
>         } else
>                 pipeconf |= PIPE_8BPC;
> -
> +
>         /* Set up the display plane register */
>         dspcntr = DISPPLANE_GAMMA_ENABLE;
>
> @@ -974,7 +974,6 @@ struct drm_display_mode *cdv_intel_crtc_mode_get(struct drm_device *dev,
>
>  const struct drm_crtc_helper_funcs cdv_intel_helper_funcs = {
>         .dpms = gma_crtc_dpms,
> -       .mode_fixup = gma_crtc_mode_fixup,
>         .mode_set = cdv_intel_crtc_mode_set,
>         .mode_set_base = gma_pipe_set_base,
>         .prepare = gma_crtc_prepare,
> diff --git a/drivers/gpu/drm/gma500/gma_display.c b/drivers/gpu/drm/gma500/gma_display.c
> index ff17af4..d6a5c77 100644
> --- a/drivers/gpu/drm/gma500/gma_display.c
> +++ b/drivers/gpu/drm/gma500/gma_display.c
> @@ -485,13 +485,6 @@ bool gma_encoder_mode_fixup(struct drm_encoder *encoder,
>         return true;
>  }
>
> -bool gma_crtc_mode_fixup(struct drm_crtc *crtc,
> -                        const struct drm_display_mode *mode,
> -                        struct drm_display_mode *adjusted_mode)
> -{
> -       return true;
> -}
> -
>  void gma_crtc_prepare(struct drm_crtc *crtc)
>  {
>         const struct drm_crtc_helper_funcs *crtc_funcs = crtc->helper_private;
> diff --git a/drivers/gpu/drm/gma500/gma_display.h b/drivers/gpu/drm/gma500/gma_display.h
> index ed569d8..fc64241 100644
> --- a/drivers/gpu/drm/gma500/gma_display.h
> +++ b/drivers/gpu/drm/gma500/gma_display.h
> @@ -75,9 +75,6 @@ extern void gma_crtc_load_lut(struct drm_crtc *crtc);
>  extern void gma_crtc_gamma_set(struct drm_crtc *crtc, u16 *red, u16 *green,
>                                u16 *blue, u32 start, u32 size);
>  extern void gma_crtc_dpms(struct drm_crtc *crtc, int mode);
> -extern bool gma_crtc_mode_fixup(struct drm_crtc *crtc,
> -                               const struct drm_display_mode *mode,
> -                               struct drm_display_mode *adjusted_mode);
>  extern void gma_crtc_prepare(struct drm_crtc *crtc);
>  extern void gma_crtc_commit(struct drm_crtc *crtc);
>  extern void gma_crtc_disable(struct drm_crtc *crtc);
> diff --git a/drivers/gpu/drm/gma500/mdfld_intel_display.c b/drivers/gpu/drm/gma500/mdfld_intel_display.c
> index acd3834..92e3f93e 100644
> --- a/drivers/gpu/drm/gma500/mdfld_intel_display.c
> +++ b/drivers/gpu/drm/gma500/mdfld_intel_display.c
> @@ -1026,10 +1026,8 @@ mrst_crtc_mode_set_exit:
>
>  const struct drm_crtc_helper_funcs mdfld_helper_funcs = {
>         .dpms = mdfld_crtc_dpms,
> -       .mode_fixup = gma_crtc_mode_fixup,
>         .mode_set = mdfld_crtc_mode_set,
>         .mode_set_base = mdfld__intel_pipe_set_base,
>         .prepare = gma_crtc_prepare,
>         .commit = gma_crtc_commit,
>  };
> -
> diff --git a/drivers/gpu/drm/gma500/oaktrail_crtc.c b/drivers/gpu/drm/gma500/oaktrail_crtc.c
> index 1048f0c..da9fd34 100644
> --- a/drivers/gpu/drm/gma500/oaktrail_crtc.c
> +++ b/drivers/gpu/drm/gma500/oaktrail_crtc.c
> @@ -657,7 +657,6 @@ pipe_set_base_exit:
>
>  const struct drm_crtc_helper_funcs oaktrail_helper_funcs = {
>         .dpms = oaktrail_crtc_dpms,
> -       .mode_fixup = gma_crtc_mode_fixup,
>         .mode_set = oaktrail_crtc_mode_set,
>         .mode_set_base = oaktrail_pipe_set_base,
>         .prepare = gma_crtc_prepare,
> diff --git a/drivers/gpu/drm/gma500/psb_intel_display.c b/drivers/gpu/drm/gma500/psb_intel_display.c
> index dcdbc37..398015b 100644
> --- a/drivers/gpu/drm/gma500/psb_intel_display.c
> +++ b/drivers/gpu/drm/gma500/psb_intel_display.c
> @@ -430,7 +430,6 @@ struct drm_display_mode *psb_intel_crtc_mode_get(struct drm_device *dev,
>
>  const struct drm_crtc_helper_funcs psb_intel_helper_funcs = {
>         .dpms = gma_crtc_dpms,
> -       .mode_fixup = gma_crtc_mode_fixup,
>         .mode_set = psb_intel_crtc_mode_set,
>         .mode_set_base = gma_pipe_set_base,
>         .prepare = gma_crtc_prepare,
> --
> 2.5.0
>
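The patch relies on the CRTC helper treating a NULL .mode_fixup hook as "accept the mode unchanged", which is what the first patch in this series (the drm_crtc_helper.c change) is for. A minimal sketch of that optional-callback pattern, using hypothetical, heavily simplified types rather than the real DRM structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct drm_display_mode and
 * struct drm_crtc_helper_funcs; not the real DRM definitions. */
struct mode { int clock; };

struct crtc_helper_funcs {
    /* Optional: may be NULL when the driver has nothing to adjust. */
    bool (*mode_fixup)(const struct mode *mode, struct mode *adjusted);
};

/* Caller side: a missing hook behaves like a dummy that returns true
 * and leaves 'adjusted' untouched. */
static bool call_mode_fixup(const struct crtc_helper_funcs *funcs,
                            const struct mode *mode, struct mode *adjusted)
{
    assert(funcs != NULL);
    if (!funcs->mode_fixup)
        return true;
    return funcs->mode_fixup(mode, adjusted);
}

/* The kind of dummy implementation this series deletes. */
static bool dummy_mode_fixup(const struct mode *mode, struct mode *adjusted)
{
    (void)mode;
    (void)adjusted;
    return true;
}
```

With the NULL check on the caller's side, deleting a dummy that only returns true, as this patch does for gma500, leaves the mode-set result unchanged.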

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-07 11:40 UTC
  To: Li, Liang Z
  Cc: riel, ehabkost@redhat.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, Roman Kagan,
	amit.shah@redhat.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04146308@shsmsx102.ccr.corp.intel.com>

On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > No. And it's exactly what I mean. The ballooned memory is still
> > > processed during live migration without skipping. The live migration code is
> > in migration/ram.c.
> > 
> > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > teach qemu to skip these pages.
> > Want to write a patch to do this?
> > 
> 
> Yes, we really can teach qemu to skip these pages and it's not hard.  
> The problem is the poor performance, this PV solution

Balloon is always PV. And do not call patches solutions please.

> is aimed to make it more
> efficient and reduce the performance impact on guest.

We need to get a bit beyond this.  You are making multiple
changes; it seems to make sense to split it all up, and analyse each
change separately.  If you don't, this patchset will be stuck: as you
have seen, people aren't convinced it actually helps with real workloads.

> > > >
> > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > migration' is simple,
> > > > > > nothing more.
> > > > > >
> > > > > > That's a big advantage.  Another one is that it does something
> > > > > > useful in real- world scenarios.
> > > > > >
> > > > >
> > > > > I don't think the heave performance impaction is something useful
> > > > > in real
> > > > world scenarios.
> > > > >
> > > > > Liang
> > > > > > Roman.
> > > >
> > > > So fix the performance then. You will have to try harder if you want
> > > > to convince people that the performance is due to bad host/guest
> > > > interface, and so we have to change *that*.
> > > >
> > >
> > > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > > just use it to transfer information between host and guest.
> > > I am not sure if I should implement a new virtio device, and I want to
> > > get the answer from the community.
> > > In this RFC patch, to make things simple, I choose to extend the
> > > virtio-balloon and use the extended interface to transfer the request and
> > free_page_bimap content.
> > >
> > > I am not intend to change the current virtio-balloon implementation.
> > >
> > > Liang
> > 
> > And the answer would depend on the answer to my question above.
> > Does balloon need an interface passing page bitmaps around?
> 
> Yes, I need a new interface.

Possibly, but you will need to justify this at some level if you care
about upstreaming your patches.

> > Does this speed up any operations?
> 
> No, a new interface will not speed up anything, but it is the easiest way to solve the compatibility issue.

A bunch of new code is often easier to write than to figure
out the old one, but if we keep piling it up we'll end up
with an unmaintainable mess. So we are rather careful
about adding new interfaces, and we try to make them generic,
sometimes even at the cost of slight inefficiencies.

> > OTOH what if you use the regular balloon interface with your patches?
> >
> 
> The regular balloon interfaces have their specific function and I can't use them in my patches.
> If using these regular interface, I have to do a lot of changes to keep the compatibility. 

Why can't you?

What exactly do we need to change?

If we put things in terms of the balloon, it supports
adding and removing pages.

Using these terms, let's enumerate:
- a new method (e.g. a new virtqueue) that adds and immediately removes pages in a balloon
	clearly, you can add then remove using the existing interfaces;
	is a single command significantly faster than using the existing two vqs?
- a new kind of request that says "add (and immediately remove?) as many pages as you can"
	sounds rather benign
- a new kind of message that adds multiple pages using a bitmap
	(instead of an address list)
	again, is this significantly faster?

It does not look like compatibility is an issue to me.


At some level, your patches look like page hints.
If we have more patches in mind that use page hints,
then a new hint device might make sense.

However, people experimented with page hints in the past, and so far this
has always gone nowhere.  E.g. I've CC'd Rick, who saw some problems when page
hints interact with huge pages. Rick, could you elaborate please?


-- 
MST

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-07  6:49 UTC
  To: Michael S. Tsirkin
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160305214748-mutt-send-email-mst@redhat.com>

> > No. And it's exactly what I mean. The ballooned memory is still
> > processed during live migration without skipping. The live migration code is
> in migration/ram.c.
> 
> So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> teach qemu to skip these pages.
> Want to write a patch to do this?
> 

Yes, we can teach qemu to skip these pages, and it's not hard.
The problem is the poor performance; this PV solution is aimed at making it more
efficient and reducing the performance impact on the guest.

> > >
> > > > > > The only advantage of ' inflating the balloon before live
> > > > > > migration' is simple,
> > > > > nothing more.
> > > > >
> > > > > That's a big advantage.  Another one is that it does something
> > > > > useful in real- world scenarios.
> > > > >
> > > >
> > > > I don't think the heave performance impaction is something useful
> > > > in real
> > > world scenarios.
> > > >
> > > > Liang
> > > > > Roman.
> > >
> > > So fix the performance then. You will have to try harder if you want
> > > to convince people that the performance is due to bad host/guest
> > > interface, and so we have to change *that*.
> > >
> >
> > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > just use it to transfer information between host and guest.
> > I am not sure if I should implement a new virtio device, and I want to
> > get the answer from the community.
> > In this RFC patch, to make things simple, I choose to extend the
> > virtio-balloon and use the extended interface to transfer the request and
> free_page_bimap content.
> >
> > I am not intend to change the current virtio-balloon implementation.
> >
> > Liang
> 
> And the answer would depend on the answer to my question above.
> Does balloon need an interface passing page bitmaps around?

Yes, I need a new interface.

> Does this speed up any operations?

No, a new interface will not speed up anything, but it is the easiest way to solve the compatibility issue.

> OTOH what if you use the regular balloon interface with your patches?
>

The regular balloon interfaces have their specific functions, and I can't use them in my patches.
If I used these regular interfaces, I would have to make a lot of changes to keep compatibility.

> 
> > > --
> > > MST

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-07  5:34 UTC
  To: Dr. David Alan Gilbert, Paolo Bonzini
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160304185120.GB2588@work-vm>

> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> >
> > >> > The memory usage will keep increasing due to ever growing caches,
> > >> > etc, so you'll be left with very little free memory fairly soon.
> > >> >
> > > I don't think so.
> > >
> >
> > Roman is right.  For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> >
> > Since I have started writing this email (2 minutes?), the amount of
> > free memory has already gone down from 37 GB to 33 GB.  I expect that
> > by the time I have finished running the workload, in two hours, it
> > will not have any free memory.
> 
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than is currently using.
>  I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96536        1146       44834         184       50555       94735
> 
> I very rarely use all it's RAM, so it's got a big chunk of free RAM, and yes it's
> got a big chunk of cache as well.
> 
> Dave
> 
> >
> > Paolo

I am beginning to see Roman's point. The PV solution can't handle the cache memory, while inflating the balloon can.
However, inflating the balloon so as to skip the cache memory is not good for the guest's performance.

How much free memory the guest has depends on the workload in the VM and how long the VM has been running
before live migration. Even if memory usage keeps increasing due to ever-growing caches, we don't know
when the live migration will happen, so assuming there are no or very few free pages in the guest is not quite right.

The advantage of the PV solution is its smaller performance impact compared with inflating the balloon.

Liang

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-05 19:55 UTC
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E041452EA@shsmsx102.ccr.corp.intel.com>

On Fri, Mar 04, 2016 at 03:49:37PM +0000, Li, Liang Z wrote:
> > > > > > > Only detect the unmapped/zero mapped pages is not enough.
> > > > Consider
> > > > > > the
> > > > > > > situation like case 2, it can't achieve the same result.
> > > > > >
> > > > > > Your case 2 doesn't exist in the real world.  If people could
> > > > > > stop their main memory consumer in the guest prior to migration
> > > > > > they wouldn't need live migration at all.
> > > > >
> > > > > The case 2 is just a simplified scenario, not a real case.
> > > > > As long as the guest's memory usage does not keep increasing, or
> > > > > not always run out, it can be covered by the case 2.
> > > >
> > > > The memory usage will keep increasing due to ever growing caches,
> > > > etc, so you'll be left with very little free memory fairly soon.
> > > >
> > >
> > > I don't think so.
> > 
> > Here's my laptop:
> > KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache
> > 
> > But here's a server:
> > KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers
> > 
> > What is the difference? A ton of tiny daemons not doing anything, staying
> > resident in memory.
> > 
> > > > > > I tend to think you can safely assume there's no free memory in
> > > > > > the guest, so there's little point optimizing for it.
> > > > >
> > > > > If this is true, we should not inflate the balloon either.
> > > >
> > > > We certainly should if there's "available" memory, i.e. not free but
> > > > cheap to reclaim.
> > > >
> > >
> > > What's your mean by "available" memory? if they are not free, I don't think
> > it's cheap.
> > 
> > clean pages are cheap to drop as they don't have to be written.
> > whether they will be ever be used is another matter.
> > 
> > > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > > that's made up, in particular, by the ballon, and consider
> > > > > > inflating the balloon right before migration unless you already
> > > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > > global resource manager
> > > > optimizing the VM density).
> > > > > >
> > > > >
> > > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > > take the
> > > > performance impact for consideration?
> > > > > For and 8G guest, it takes about 5s to  inflating the balloon. But
> > > > > it only takes 20ms to  traverse the free_list and construct the
> > > > > free pages
> > > > bitmap.
> > > >
> > > > I don't have any feeling of how important the difference is.  And if
> > > > the limiting factor for balloon inflation speed is the granularity
> > > > of communication it may be worth optimizing that, because quick
> > > > balloon reaction may be important in certain resource management
> > scenarios.
> > > >
> > > > > By inflating the balloon, all the guest's pages are still be
> > > > > processed (zero
> > > > page checking).
> > > >
> > > > Not sure what you mean.  If you describe the current state of
> > > > affairs that's exactly the suggested optimization point: skip unmapped
> > pages.
> > > >
> > >
> > > You'd better check the live migration code.
> > 
> > What's there to check in migration code?
> > Here's the extent of what balloon does on output:
> > 
> > 
> >         while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> >             ram_addr_t pa;
> >             ram_addr_t addr;
> >             int p = virtio_ldl_p(vdev, &pfn);
> > 
> >             pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> >             offset += 4;
> > 
> >             /* FIXME: remove get_system_memory(), but how? */
> >             section = memory_region_find(get_system_memory(), pa, 1);
> >             if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
> >                 continue;
> > 
> >             trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >                                                pa);
> >             /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> >                should be OK because we only want a single page.  */
> >             addr = section.offset_within_region;
> >             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> >                          !!(vq == s->dvq));
> >             memory_region_unref(section.mr);
> >         }
> > 
> > so all that happens when we get a page is balloon_page.
> > and
> > 
> > static void balloon_page(void *addr, int deflate)
> > {
> > #if defined(__linux__)
> >     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> >                                          kvm_has_sync_mmu())) {
> >         qemu_madvise(addr, TARGET_PAGE_SIZE,
> >                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> >     }
> > #endif
> > }
> > 
> > 
> > Do you see anything that tracks pages to help migration skip the ballooned
> > memory? I don't.
> > 
> 
> No. And it's exactly what I mean. The ballooned memory is still processed during
> live migration without skipping. The live migration code is in migration/ram.c.

So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST,
we can teach qemu to skip these pages.
Want to write a patch to do this?

> > 
> > > > > The only advantage of ' inflating the balloon before live
> > > > > migration' is simple,
> > > > nothing more.
> > > >
> > > > That's a big advantage.  Another one is that it does something
> > > > useful in real- world scenarios.
> > > >
> > >
> > > I don't think the heave performance impaction is something useful in real
> > world scenarios.
> > >
> > > Liang
> > > > Roman.
> > 
> > So fix the performance then. You will have to try harder if you want to
> > convince people that the performance is due to bad host/guest interface,
> > and so we have to change *that*.
> > 
> 
> Actually, the PV solution is irrelevant with the balloon mechanism, I just use it
> to transfer information between host and guest. 
> I am not sure if I should implement a new virtio device, and I want to get the answer from
> the community.
> In this RFC patch, to make things simple, I choose to extend the virtio-balloon and use the
> extended interface to transfer the request and free_page_bimap content.
> 
> I am not intend to change the current virtio-balloon implementation.
> 
> Liang

And the answer would depend on the answer to my question above.
Does balloon need an interface passing page bitmaps around?
Does this speed up any operations?
OTOH what if you use the regular balloon interface with your patches?


> > --
> > MST
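A minimal sketch of the skip suggested above, assuming the guest has acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, so it cannot reuse a ballooned page without first reporting it on the deflate queue. The host would record inflated pages in a bitmap at the point where balloon_page() is called today, and the RAM pass would consult it. struct balloon_tracker, tracker_update(), tracker_is_ballooned() and pages_to_send() are hypothetical names; this is not QEMU's migration/ram.c code, and a real patch would also have to make sure that pages deflated while migration is in progress still get sent.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side tracker: one bit per guest page currently
 * inflated into the balloon. Sized small for illustration. */
#define TRACK_PAGES 64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct balloon_tracker {
    unsigned long bits[(TRACK_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Update on inflate (deflate == false) or deflate (deflate == true).
 * Only trustworthy when MUST_TELL_HOST was negotiated. */
static void tracker_update(struct balloon_tracker *t, uint64_t pfn, bool deflate)
{
    assert(pfn < TRACK_PAGES);
    unsigned long mask = 1UL << (pfn % BITS_PER_WORD);
    if (deflate)
        t->bits[pfn / BITS_PER_WORD] &= ~mask;
    else
        t->bits[pfn / BITS_PER_WORD] |= mask;
}

static bool tracker_is_ballooned(const struct balloon_tracker *t, uint64_t pfn)
{
    assert(pfn < TRACK_PAGES);
    return (t->bits[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}

/* Simplified stand-in for one migration RAM pass: count the pages that
 * would still be sent once ballooned pages are skipped. */
static size_t pages_to_send(const struct balloon_tracker *t, uint64_t n_pages)
{
    assert(n_pages <= TRACK_PAGES);
    size_t sent = 0;
    for (uint64_t pfn = 0; pfn < n_pages; pfn++)
        if (!tracker_is_ballooned(t, pfn))
            sent++;
    return sent;
}
```

The bitmap is only as good as the guest's promise to tell the host before reusing a page; without MUST_TELL_HOST the host cannot know when a ballooned page has been touched again, so skipping it would risk losing data.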

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Dr. David Alan Gilbert @ 2016-03-04 18:51 UTC
  To: Paolo Bonzini
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, Li, Liang Z,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	linux-mm@kvack.org, mst@redhat.com, Roman Kagan,
	amit.shah@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <56D9B6C2.3070708@redhat.com>

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> > 
> >> > The memory usage will keep increasing due to ever growing caches, etc, so
> >> > you'll be left with very little free memory fairly soon.
> >> > 
> > I don't think so.
> > 
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical)
> machine which was booted about 30 minutes ago, and which is running
> disk-heavy workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not
> have any free memory.

But what about a VM sitting idle, or one that just has more RAM assigned to it
than it is currently using?
I've got a host here that's been up for 46 days and was doing some
heavy VM debugging a few days ago, but today:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          96536        1146       44834         184       50555       94735

I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
it's got a big chunk of cache as well.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
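The distinction drawn earlier in this thread between free memory and "available" memory (not free, but cheap to reclaim, such as clean page cache) is visible in the free -m output above: 44834 MiB free but 94735 MiB available. Inside a Linux guest both figures come from /proc/meminfo (MemFree and MemAvailable, in kB). A minimal C sketch for pulling one field out of text in that format; meminfo_kb() is a hypothetical helper, and using these numbers to choose between ballooning and free-page reporting is an assumption, not something proposed in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up one field (e.g. "MemFree", "MemAvailable") in text laid out
 * like /proc/meminfo ("Name:   <value> kB" per line).
 * Returns the value in kB, or -1 if the field is absent. */
static long meminfo_kb(const char *text, const char *field)
{
    assert(text != NULL && field != NULL);
    size_t len = strlen(field);
    const char *line = text;
    while (line && *line) {
        long value;
        /* Require an exact name match followed by ':' so that
         * "MemFree" does not match a "MemFreeX" line. */
        if (strncmp(line, field, len) == 0 && line[len] == ':' &&
            sscanf(line + len + 1, "%ld", &value) == 1)
            return value;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

On a live guest one would first read /proc/meminfo into a buffer (e.g. with fopen() and fread()) and pass it in. Kernels older than 3.14 do not report MemAvailable, in which case the function returns -1.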

* Re: [PATCH 00/16] drm crtc cleanup: nuke optional dummy crtc mode_fixup function.
From: Daniel Vetter @ 2016-03-04 17:18 UTC
  To: Carlos Palminha
  Cc: nicolas.pitre, boris.brezillon, jianwei.wang.chn, airlied,
	daniel.vetter, alison.wang, patrik.r.jakobsson, virtualization,
	linux-renesas-soc, jani.nikula, dri-devel, benjamin.gaignard,
	vincent.abriou, sudipm.mukherjee, laurent.pinchart
In-Reply-To: <cover.1455630967.git.palminha@synopsys.com>

On Tue, Feb 16, 2016 at 02:09:44PM +0000, Carlos Palminha wrote:
> This patch set nukes all the dummy crtc mode_fixup implementations.
> (made on top of Daniel topic/drm-misc branch)
> 
> Carlos Palminha (16):
>   drm: fixes crct set_mode when crtc mode_fixup is null.
>   drm/cirrus: removed optional dummy crtc mode_fixup function.
>   drm/mgag200: removed optional dummy crtc mode_fixup function.
>   drm/udl: removed optional dummy crtc mode_fixup function.
>   drm/gma: removed optional dummy crtc mode_fixup function.
>   drm/rcar-du: removed optional dummy crtc mode_fixup function.
>   drm/omapdrm: removed optional dummy crtc mode_fixup function.
>   drm/msm/mdp: removed optional dummy crtc mode_fixup function.
>   drm/shmobile: removed optional dummy crtc mode_fixup function.
>   drm/sti: removed optional dummy crtc mode_fixup function.
>   drm/atmel-hldcd: removed optional dummy crtc mode_fixup function.
>   drm/nouveau/dispnv04: removed optional dummy crtc mode_fixup function.
>   drm/virtio: removed optional dummy crtc mode_fixup function.
>   drm/fsl-dcu: removed optional dummy crtc mode_fixup function.
>   drm/bochs: removed optional dummy crtc mode_fixup function.
>   drm/ast: removed optional dummy crtc mode_fixup function.

Pulled remaining ones into drm-misc, will send off in a pull request to
Dave in a few days if it all checks out.
-Daniel

> 
>  drivers/gpu/drm/ast/ast_mode.c                 |  8 --------
>  drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_crtc.c |  9 ---------
>  drivers/gpu/drm/bochs/bochs_kms.c              |  8 --------
>  drivers/gpu/drm/cirrus/cirrus_mode.c           | 13 -------------
>  drivers/gpu/drm/drm_crtc_helper.c              |  9 ++++++---
>  drivers/gpu/drm/fsl-dcu/fsl_dcu_drm_crtc.c     |  8 --------
>  drivers/gpu/drm/gma500/cdv_intel_display.c     | 13 ++++++-------
>  drivers/gpu/drm/gma500/gma_display.c           |  7 -------
>  drivers/gpu/drm/gma500/gma_display.h           |  3 ---
>  drivers/gpu/drm/gma500/mdfld_intel_display.c   |  2 --
>  drivers/gpu/drm/gma500/oaktrail_crtc.c         |  1 -
>  drivers/gpu/drm/gma500/psb_intel_display.c     |  1 -
>  drivers/gpu/drm/mgag200/mgag200_mode.c         | 13 -------------
>  drivers/gpu/drm/msm/mdp/mdp4/mdp4_crtc.c       |  8 --------
>  drivers/gpu/drm/msm/mdp/mdp5/mdp5_crtc.c       |  8 --------
>  drivers/gpu/drm/nouveau/dispnv04/crtc.c        |  8 --------
>  drivers/gpu/drm/omapdrm/omap_crtc.c            |  8 --------
>  drivers/gpu/drm/rcar-du/rcar_du_crtc.c         |  9 ---------
>  drivers/gpu/drm/shmobile/shmob_drm_crtc.c      |  8 --------
>  drivers/gpu/drm/sti/sti_crtc.c                 |  9 ---------
>  drivers/gpu/drm/udl/udl_modeset.c              |  9 ---------
>  drivers/gpu/drm/virtio/virtgpu_display.c       |  8 --------
>  22 files changed, 12 insertions(+), 158 deletions(-)
> 
> -- 
> 2.5.0
> 
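
The first patch in the series presumably makes the CRTC helper treat a missing mode_fixup hook as "accept the mode unchanged", which is what lets the per-driver stubs go. Below is a minimal user-space sketch of that optional-callback pattern. The types are hypothetical and heavily simplified: the real drm_crtc_helper_funcs::mode_fixup also takes the struct drm_crtc and works on struct drm_display_mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a display mode. */
struct mode { int hdisplay, vdisplay; };

/* Hypothetical ops table with one optional hook. */
struct crtc_helper_funcs {
	/* May be NULL when the driver needs no mode adjustment. */
	bool (*mode_fixup)(const struct mode *mode, struct mode *adjusted);
};

/* The kind of stub the series deletes: it accepts every mode unchanged,
 * so it behaves exactly like having no hook at all. */
static bool dummy_mode_fixup(const struct mode *mode, struct mode *adjusted)
{
	(void)mode;
	(void)adjusted;
	return true;
}

/* Core-side caller: a NULL hook means "accept the mode as is". */
static bool crtc_mode_fixup(const struct crtc_helper_funcs *funcs,
			    const struct mode *mode, struct mode *adjusted)
{
	assert(funcs != NULL);
	*adjusted = *mode;	/* start from the requested mode */
	if (funcs->mode_fixup && !funcs->mode_fixup(mode, adjusted))
		return false;	/* driver rejected the mode */
	return true;
}
```

With the NULL check in the core, a driver that only ever returned true can simply leave the hook unset.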

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Paolo Bonzini @ 2016-03-04 16:24 UTC (permalink / raw)
  To: Li, Liang Z, Roman Kagan
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414516C@shsmsx102.ccr.corp.intel.com>



On 04/03/2016 15:26, Li, Liang Z wrote:
>> > 
>> > The memory usage will keep increasing due to ever growing caches, etc, so
>> > you'll be left with very little free memory fairly soon.
>> > 
> I don't think so.
> 

Roman is right.  For example, here I am looking at a 64 GB (physical)
machine which was booted about 30 minutes ago, and which is running
disk-heavy workloads (installing VMs).

Since I have started writing this email (2 minutes?), the amount of free
memory has already gone down from 37 GB to 33 GB.  I expect that by the
time I have finished running the workload, in two hours, it will not
have any free memory.

Paolo

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-04 15:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160304163246-mutt-send-email-mst@redhat.com>

> > > > > > Only detect the unmapped/zero mapped pages is not enough.  Consider the
> > > > > > situation like case 2, it can't achieve the same result.
> > > > >
> > > > > Your case 2 doesn't exist in the real world.  If people could
> > > > > stop their main memory consumer in the guest prior to migration
> > > > > they wouldn't need live migration at all.
> > > >
> > > > The case 2 is just a simplified scenario, not a real case.
> > > > As long as the guest's memory usage does not keep increasing, or
> > > > not always run out, it can be covered by the case 2.
> > >
> > > The memory usage will keep increasing due to ever growing caches,
> > > etc, so you'll be left with very little free memory fairly soon.
> > >
> >
> > I don't think so.
> 
> Here's my laptop:
> KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache
> 
> But here's a server:
> KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers
> 
> What is the difference? A ton of tiny daemons not doing anything, staying
> resident in memory.
> 
> > > > > I tend to think you can safely assume there's no free memory in
> > > > > the guest, so there's little point optimizing for it.
> > > >
> > > > If this is true, we should not inflate the balloon either.
> > >
> > > We certainly should if there's "available" memory, i.e. not free but
> > > cheap to reclaim.
> > >
> >
> > What's your mean by "available" memory? if they are not free, I don't think
> > it's cheap.
> 
> clean pages are cheap to drop as they don't have to be written.
> whether they will be ever be used is another matter.
> 
> > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > that's made up, in particular, by the ballon, and consider
> > > > > inflating the balloon right before migration unless you already
> > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > global resource manager optimizing the VM density).
> > > > >
> > > >
> > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > take the performance impact for consideration?
> > > > For and 8G guest, it takes about 5s to  inflating the balloon. But
> > > > it only takes 20ms to  traverse the free_list and construct the
> > > > free pages bitmap.
> > >
> > > I don't have any feeling of how important the difference is.  And if
> > > the limiting factor for balloon inflation speed is the granularity
> > > of communication it may be worth optimizing that, because quick
> > > balloon reaction may be important in certain resource management
> > > scenarios.
> > >
> > > > By inflating the balloon, all the guest's pages are still be
> > > > processed (zero page checking).
> > >
> > > Not sure what you mean.  If you describe the current state of
> > > affairs that's exactly the suggested optimization point: skip unmapped
> > > pages.
> > >
> >
> > You'd better check the live migration code.
> 
> What's there to check in migration code?
> Here's the extent of what balloon does on output:
> 
> 
>         while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
>             ram_addr_t pa;
>             ram_addr_t addr;
>             int p = virtio_ldl_p(vdev, &pfn);
> 
>             pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
>             offset += 4;
> 
>             /* FIXME: remove get_system_memory(), but how? */
>             section = memory_region_find(get_system_memory(), pa, 1);
>             if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
>                 continue;
> 
>             trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                                pa);
>             /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>                should be OK because we only want a single page.  */
>             addr = section.offset_within_region;
>             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                          !!(vq == s->dvq));
>             memory_region_unref(section.mr);
>         }
> 
> so all that happens when we get a page is balloon_page.
> and
> 
> static void balloon_page(void *addr, int deflate)
> {
> #if defined(__linux__)
>     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
>                                          kvm_has_sync_mmu())) {
>         qemu_madvise(addr, TARGET_PAGE_SIZE,
>                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>     }
> #endif
> }
> 
> 
> Do you see anything that tracks pages to help migration skip the ballooned
> memory? I don't.
> 

No, and that's exactly my point: the ballooned memory is still processed during
live migration rather than skipped. The live migration code is in migration/ram.c.
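
Roman's /proc/self/pagemap suggestion can be prototyped in user space. A minimal sketch, assuming Linux and guest RAM that is anonymous memory in QEMU's address space: it reads bit 63 ("page present") of a page's 64-bit pagemap entry, so a page that was never touched, or was dropped with MADV_DONTNEED as balloon_page() does on inflate, reads as not present. The helper names are hypothetical, not QEMU functions, and hooking this into migration/ram.c is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset of the pagemap entry for vaddr
 * (one 8-byte entry per virtual page). */
static off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
	assert(page_size > 0);
	return (off_t)(vaddr / (uintptr_t)page_size) * 8;
}

/*
 * Hypothetical helper: 1 if the page holding addr is present in RAM,
 * 0 if not (swapped-out pages also report 0 here), -1 if pagemap
 * cannot be read.
 */
static int page_is_present(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, &entry, sizeof(entry),
		  pagemap_offset((uintptr_t)addr, page_size));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return (int)((entry >> 63) & 1);	/* bit 63: page present */
}
```

Note that this only says whether the host has backing memory for a page; it cannot tell a free guest page that was once used from one that still holds data, which is the gap the proposed free page bitmap tries to fill.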

> 
> > > > The only advantage of ' inflating the balloon before live
> > > > migration' is simple, nothing more.
> > >
> > > That's a big advantage.  Another one is that it does something
> > > useful in real-world scenarios.
> > >
> >
> > I don't think the heave performance impaction is something useful in real
> > world scenarios.
> >
> > Liang
> > > Roman.
> 
> So fix the performance then. You will have to try harder if you want to
> convince people that the performance is due to bad host/guest interface,
> and so we have to change *that*.
> 

Actually, the PV solution is independent of the balloon mechanism; I just use it
to transfer information between host and guest.
I am not sure whether I should implement a new virtio device instead, and I would
like the community's opinion on that.
In this RFC patch, to keep things simple, I chose to extend virtio-balloon and use the
extended interface to transfer the request and the free_page_bitmap content.

I do not intend to change the current virtio-balloon implementation.

Liang

> --
> MST

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-04 15:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160304122456-mutt-send-email-mst@redhat.com>

> > Maybe I am not clear enough.
> >
> > I mean if we inflate balloon before live migration, for a 8GB guest, it takes
> > about 5 Seconds for the inflating operation to finish.
> 
> And these 5 seconds are spent where?
> 

The time is spent allocating the pages and sending the allocated pages' PFNs to QEMU
through virtio.
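
Part of the gap is visible from data volume alone. A back-of-the-envelope sketch, assuming 4 KiB pages (VIRTIO_BALLOON_PFN_SHIFT is 12) and the 4-byte PFN entries that the QEMU balloon handler reads: inflating over a whole 8 GiB guest means passing 2,097,152 PFNs, i.e. 8 MiB, while a bitmap with one bit per page is 256 KiB. It ignores the guest-side page allocation cost mentioned above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching VIRTIO_BALLOON_PFN_SHIFT == 12. */
#define PAGE_SHIFT_4K 12

/* Hypothetical: number of 4 KiB pages in mem_bytes of guest RAM. */
static uint64_t guest_pages(uint64_t mem_bytes)
{
	assert((mem_bytes & ((1ULL << PAGE_SHIFT_4K) - 1)) == 0);
	return mem_bytes >> PAGE_SHIFT_4K;
}

/* Hypothetical: bytes of 32-bit PFNs needed to balloon every page. */
static uint64_t balloon_pfn_bytes(uint64_t mem_bytes)
{
	return guest_pages(mem_bytes) * sizeof(uint32_t);
}

/* Hypothetical: bytes of a bitmap with one bit per page. */
static uint64_t bitmap_bytes(uint64_t mem_bytes)
{
	return (guest_pages(mem_bytes) + 7) / 8;
}
```

The ratio is a fixed 32:1 (32 bits per PFN entry against 1 bit per page), independent of guest size.
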

> > For the PV solution, there is no need to inflate balloon before live
> > migration, the only cost is to traversing the free_list to  construct
> > the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less if
> > there is less free pages),  passing the free pages info to host will take about
> > extra 3ms.
> >
> >
> > Liang
> 
> So now let's please stop talking about solutions at a high level and discuss the
> interface changes you make in detail.
> What makes it faster? Better host/guest interface? No need to go through
> buddy allocator within guest? Less interrupts? Something else?
> 

I assume you are familiar with the current virtio-balloon and how it works.
The new interface is very simple: QEMU sends a request to the virtio-balloon driver,
the driver traverses '&zone->free_area[order].free_list[t]' to
construct a 'free_page_bitmap', and then sends the content
of 'free_page_bitmap' back to QEMU. That is all the new interface does;
there is no 'alloc_page'-related work involved, so it's faster.


Some code snippet:
----------------------------------------------
+static void mark_free_pages_bitmap(struct zone *zone,
+		 unsigned long *free_page_bitmap, unsigned long pfn_gap)
+{
+	unsigned long pfn, flags, i;
+	unsigned int order, t;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+
+	/* Hold zone->lock so the buddy free lists cannot change while walked. */
+	spin_lock_irqsave(&zone->lock, flags);
+
+	/* Visit every free block on every (order, migratetype) free list. */
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			/* A free block of this order spans 2^order pages. */
+			for (i = 0; i < (1UL << order); i++) {
+				/*
+				 * PFNs at or above PFN_4G are shifted down by
+				 * pfn_gap before being recorded.
+				 */
+				if ((pfn + i) >= PFN_4G)
+					set_bit_le(pfn + i - pfn_gap,
+						   free_page_bitmap);
+				else
+					set_bit_le(pfn + i, free_page_bitmap);
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
----------------------------------------------------
Sorry for my poor English; if this is still unclear,
you could glance at the patch, which is about 400 lines in total.
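
What the loop above computes can be tried outside the kernel. A minimal user-space sketch under stated assumptions: a hypothetical array of (first PFN, order) pairs stands in for the zone free lists, there is no locking, and the bitmap uses native unsigned long words, which match the set_bit_le() layout only on little-endian machines. pfn_4g and pfn_gap mirror PFN_4G and pfn_gap in the patch; that pfn_gap compensates for the memory hole below 4 GiB is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for one free_list entry: first PFN and buddy order. */
struct free_block {
	unsigned long pfn;
	unsigned int order;
};

static void set_bit_in(unsigned long nr, unsigned long *bitmap)
{
	bitmap[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
}

static int test_bit_in(unsigned long nr, const unsigned long *bitmap)
{
	return (int)((bitmap[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL);
}

/*
 * Mark every page of every free block; PFNs at or above pfn_4g are
 * recorded pfn_gap bits lower. The caller sizes the bitmap.
 */
static void mark_free_blocks(const struct free_block *blocks, size_t n,
			     unsigned long pfn_4g, unsigned long pfn_gap,
			     unsigned long *bitmap)
{
	for (size_t b = 0; b < n; b++) {
		assert(blocks[b].order < BITS_PER_ULONG);
		/* A block of order k covers 2^k consecutive pages. */
		for (unsigned long i = 0; i < (1UL << blocks[b].order); i++) {
			unsigned long cur = blocks[b].pfn + i;

			if (cur >= pfn_4g) {
				assert(cur >= pfn_gap);
				set_bit_in(cur - pfn_gap, bitmap);
			} else {
				set_bit_in(cur, bitmap);
			}
		}
	}
}
```

For example, with pfn_4g = 16 and pfn_gap = 4, an order-1 block at PFN 2 sets bits 2-3 and an order-2 block at PFN 16 sets bits 12-15.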
> 
> > > --
> > > MST

* Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-04 14:45 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kernel@vger.kernel.org, Dr. David Alan Gilbert,
	linux-mm@kvack.org, Roman Kagan, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414516C@shsmsx102.ccr.corp.intel.com>

On Fri, Mar 04, 2016 at 02:26:49PM +0000, Li, Liang Z wrote:
> > Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> > optimization
> > 
> > On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > > > On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > > the same result?
> > > > >
> > > > > Only detect the unmapped/zero mapped pages is not enough.  Consider
> > > > > the situation like case 2, it can't achieve the same result.
> > > >
> > > > Your case 2 doesn't exist in the real world.  If people could stop
> > > > their main memory consumer in the guest prior to migration they
> > > > wouldn't need live migration at all.
> > >
> > > The case 2 is just a simplified scenario, not a real case.
> > > As long as the guest's memory usage does not keep increasing, or not
> > > always run out, it can be covered by the case 2.
> > 
> > The memory usage will keep increasing due to ever growing caches, etc, so
> > you'll be left with very little free memory fairly soon.
> > 
> 
> I don't think so.

Here's my laptop:
KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache

But here's a server:
KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers

What is the difference? A ton of tiny daemons not doing anything,
staying resident in memory.

> > > > I tend to think you can safely assume there's no free memory in the
> > > > guest, so there's little point optimizing for it.
> > >
> > > If this is true, we should not inflate the balloon either.
> > 
> > We certainly should if there's "available" memory, i.e. not free but cheap to
> > reclaim.
> > 
> 
> What's your mean by "available" memory? if they are not free, I don't think it's cheap.

Clean pages are cheap to drop as they don't have to be written back.
Whether they will ever be used is another matter.
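
The kernel already exposes this distinction: /proc/meminfo reports MemFree (truly unused pages) and, since Linux 3.14, MemAvailable, an estimate of memory obtainable without swapping that includes reclaimable page cache; the "available" column of free(1) comes from it. A minimal sketch for reading either field; the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse "Key:   value kB" out of /proc/meminfo-style
 * text. Returns the value in kB, or -1 if the key is absent.
 */
static long meminfo_field(const char *text, const char *key)
{
	size_t klen;
	const char *line = text;

	assert(text != NULL && key != NULL);
	klen = strlen(key);
	while (line && *line) {
		long kb;

		/* Require an exact key match followed by ':'. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
		    sscanf(line + klen + 1, "%ld", &kb) == 1)
			return kb;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Hypothetical helper: read one field from the live /proc/meminfo. */
static long read_meminfo_field(const char *key)
{
	char buf[8192];
	FILE *f = fopen("/proc/meminfo", "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return meminfo_field(buf, key);
}
```

Comparing read_meminfo_field("MemFree") with read_meminfo_field("MemAvailable") in a guest shows how much of the "not free" memory is cheap to reclaim.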

> > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > that's made up, in particular, by the ballon, and consider inflating
> > > > the balloon right before migration unless you already maintain it at
> > > > the optimal size for other reasons (like e.g. a global resource manager
> > > > optimizing the VM density).
> > > >
> > >
> > > Yes, I believe the current balloon works and it's simple. Do you take the
> > > performance impact for consideration?
> > > For and 8G guest, it takes about 5s to  inflating the balloon. But it
> > > only takes 20ms to  traverse the free_list and construct the free pages
> > > bitmap.
> > 
> > I don't have any feeling of how important the difference is.  And if the
> > limiting factor for balloon inflation speed is the granularity of communication
> > it may be worth optimizing that, because quick balloon reaction may be
> > important in certain resource management scenarios.
> > 
> > > By inflating the balloon, all the guest's pages are still be processed (zero
> > > page checking).
> > 
> > Not sure what you mean.  If you describe the current state of affairs that's
> > exactly the suggested optimization point: skip unmapped pages.
> > 
> 
> You'd better check the live migration code.

What's there to check in migration code?
Here's the extent of what balloon does on output:


        while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
            ram_addr_t pa;
            ram_addr_t addr;
            int p = virtio_ldl_p(vdev, &pfn);

            pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
            offset += 4;

            /* FIXME: remove get_system_memory(), but how? */
            section = memory_region_find(get_system_memory(), pa, 1);
            if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
                continue;

            trace_virtio_balloon_handle_output(memory_region_name(section.mr),
                                               pa);
            /* Using memory_region_get_ram_ptr is bending the rules a bit, but
               should be OK because we only want a single page.  */
            addr = section.offset_within_region;
            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                         !!(vq == s->dvq));
            memory_region_unref(section.mr);
        }

so all that happens when we get a page is balloon_page.
and

static void balloon_page(void *addr, int deflate)
{
#if defined(__linux__)
    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
                                         kvm_has_sync_mmu())) {
        qemu_madvise(addr, TARGET_PAGE_SIZE,
                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
    }
#endif
}


Do you see anything that tracks pages to help migration skip
the ballooned memory? I don't.
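
For illustration only, the missing piece could look like the following sketch: the balloon handler also records each page's state in a bitmap, and the migration pass consults it. Everything here is hypothetical (a tiny fixed-size guest, pages indexed by guest page number rather than the host virtual address balloon_page() receives), and skipping is presumably only safe when the guest has promised to tell the host before reusing a ballooned page (VIRTIO_BALLOON_F_MUST_TELL_HOST).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)
#define NR_GUEST_PAGES 1024	/* hypothetical tiny guest */

/* Hypothetical: one bit per guest page, set while the page is ballooned. */
static unsigned long ballooned[(NR_GUEST_PAGES + LONG_BITS - 1) / LONG_BITS];

/* Hypothetical hook balloon_page() could call: inflate sets, deflate clears. */
static void track_balloon_page(size_t page, bool deflate)
{
	assert(page < NR_GUEST_PAGES);
	if (deflate)
		ballooned[page / LONG_BITS] &= ~(1UL << (page % LONG_BITS));
	else
		ballooned[page / LONG_BITS] |= 1UL << (page % LONG_BITS);
}

static bool page_is_ballooned(size_t page)
{
	return (ballooned[page / LONG_BITS] >> (page % LONG_BITS)) & 1UL;
}

/* Hypothetical migration pass: count the pages that would still be sent. */
static size_t pages_to_send(void)
{
	size_t sent = 0;

	for (size_t page = 0; page < NR_GUEST_PAGES; page++) {
		if (page_is_ballooned(page))
			continue;	/* guest gave this page up: skip it */
		sent++;		/* stand-in for actually sending the page */
	}
	return sent;
}
```
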



> > > The only advantage of ' inflating the balloon before live migration' is simple,
> > > nothing more.
> > 
> > That's a big advantage.  Another one is that it does something useful in
> > real-world scenarios.
> > 
> 
> I don't think the heave performance impaction is something useful in real world scenarios.
> 
> Liang
> > Roman.

So fix the performance then. You will have to try harder if you want to
convince people that the performance is due to bad host/guest interface,
and so we have to change *that*.

-- 
MST

* RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-04 14:26 UTC (permalink / raw)
  To: Roman Kagan
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	mst@redhat.com, linux-kernel@vger.kernel.org,
	Dr. David Alan Gilbert, linux-mm@kvack.org, amit.shah@redhat.com,
	pbonzini@redhat.com, akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160304102346.GB2479@rkaganb.sw.ru>

> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
> 
> On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > > On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > same result?
> > > >
> > > > Only detect the unmapped/zero mapped pages is not enough.  Consider
> > > > the situation like case 2, it can't achieve the same result.
> > >
> > > Your case 2 doesn't exist in the real world.  If people could stop
> > > their main memory consumer in the guest prior to migration they
> > > wouldn't need live migration at all.
> >
> > The case 2 is just a simplified scenario, not a real case.
> > As long as the guest's memory usage does not keep increasing, or not
> > always run out, it can be covered by the case 2.
> 
> The memory usage will keep increasing due to ever growing caches, etc, so
> you'll be left with very little free memory fairly soon.
> 

I don't think so.

> > > I tend to think you can safely assume there's no free memory in the
> > > guest, so there's little point optimizing for it.
> >
> > If this is true, we should not inflate the balloon either.
> 
> We certainly should if there's "available" memory, i.e. not free but cheap to
> reclaim.
> 

What do you mean by "available" memory? If it is not free, I don't think it's cheap.

> > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > that's made up, in particular, by the ballon, and consider inflating
> > > the balloon right before migration unless you already maintain it at
> > > the optimal size for other reasons (like e.g. a global resource manager
> > > optimizing the VM density).
> > >
> >
> > Yes, I believe the current balloon works and it's simple. Do you take the
> > performance impact for consideration?
> > For and 8G guest, it takes about 5s to  inflating the balloon. But it
> > only takes 20ms to  traverse the free_list and construct the free pages
> > bitmap.
> 
> I don't have any feeling of how important the difference is.  And if the
> limiting factor for balloon inflation speed is the granularity of communication
> it may be worth optimizing that, because quick balloon reaction may be
> important in certain resource management scenarios.
> 
> > By inflating the balloon, all the guest's pages are still be processed (zero
> > page checking).
> 
> Not sure what you mean.  If you describe the current state of affairs that's
> exactly the suggested optimization point: skip unmapped pages.
> 

You'd better check the live migration code.

> > The only advantage of ' inflating the balloon before live migration' is simple,
> > nothing more.
> 
> That's a big advantage.  Another one is that it does something useful in
> real-world scenarios.
> 

I don't think something with such a heavy performance impact is useful in real-world scenarios.

Liang
> Roman.


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox