From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <521724A3.8050801@linux.vnet.ibm.com>
Date: Fri, 23 Aug 2013 17:00:19 +0800
From: Lei Li <lilei@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <1377069536-12658-1-git-send-email-lilei@linux.vnet.ibm.com> <1377069536-12658-14-git-send-email-lilei@linux.vnet.ibm.com> <52149B09.705@redhat.com> <52170060.4050104@linux.vnet.ibm.com> <521713DA.9010903@redhat.com>
In-Reply-To: <521713DA.9010903@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 13/18] arch_init: adjust ram_save_setup() for migrate_is_localhost
To: Paolo Bonzini
Cc: aarcange@redhat.com, quintela@redhat.com, qemu-devel@nongnu.org, mrhines@linux.vnet.ibm.com, Anthony Liguori, lagarcia@br.ibm.com, rcj@linux.vnet.ibm.com

On 08/23/2013 03:48 PM, Paolo Bonzini wrote:
> On 23/08/2013 08:25, Lei Li wrote:
>> On 08/21/2013 06:48 PM, Paolo Bonzini wrote:
>>> On 21/08/2013 09:18, Lei Li wrote:
>>>> Send all the ram blocks hooked by save_page, which will copy each
>>>> ram page and MADV_DONTNEED the page just copied.
>>> You should implement this entirely in the hook.
>>>
>>> It will be a little less efficient because of the dirty bitmap
>>> overhead, but you should aim at having *zero* changes in arch_init.c
>>> and migration.c.
>> Yes, the reason I modified migration_thread() to add a new process
>> that sends all the ram pages in the adjusted qemu_savevm_state_begin
>> stage and sends the device states in the qemu_savevm_device_state
>> stage for localhost migration is to avoid the dirty bitmap, which is
>> a little less efficient, just as you mentioned above.
>>
>> Performance assurance is very important for this feature; our goal is
>> 100ms of downtime for a 1TB guest.
> Do not _start_ by introducing encapsulation violations all over the
> place.
>
> Juan has been working on optimizing the dirty bitmap code. His patches
> could introduce a speedup of a factor of up to 64. Thus it is possible
> that his work will help you enough that you can work with the dirty
> bitmap.
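
(To make the mechanism under discussion concrete for the list: the
save_page hook quoted above boils down to roughly the following per
page. This is only a minimal sketch of the idea, not the actual patch;
the helper name localhost_save_page is made up here, and it assumes
'host' is page-aligned, as madvise() requires.)

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    /* Copy one guest ram page into the outgoing buffer, then tell the
     * kernel the source page is no longer needed, so the page is not
     * kept twice on the same host. */
    static int localhost_save_page(void *host, void *buf, size_t page_size)
    {
        memcpy(buf, host, page_size);                    /* copy ram page */
        return madvise(host, page_size, MADV_DONTNEED);  /* drop source   */
    }
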
>
> Also, this feature (not looking at the dirty bitmap if the machine is
> stopped) is not limited to localhost migration; add it later, once the
> basic vmsplice plumbing is in place. This will also let you profile the
> code and understand whether the goal is attainable.
>
> I honestly doubt that 100ms of downtime is possible while the machine
> is stopped. A 1TB guest has 2^28 = 268*10^6 pages, which you want to
> process in 100*10^6 nanoseconds. Thus, your approach would require 0.4
> nanoseconds per page, or roughly 2 clock cycles per page. This is
> impossible without _massive_ parallelization at all levels, starting
> from the kernel.
>
> As a matter of fact, 2^28 madvise system calls will take much, much
> longer than 100ms.
>
> Have you thought of using shared memory (with -mem-path) instead of
> vmsplice?

Precisely! Well, as Anthony mentioned in version 1 [1], there has been
some work by Robert Jennings on improving vmsplice() on the kernel
side [2]; a rough sketch of the sender side of that gifting path is at
the end of this mail. And yes, shared memory is an alternative; I think
the problem with shared memory is that it can't share anonymous memory.
Maybe Anthony can chime in on this, as the original idea was his. :-)

Reference links:

[1] Anthony's comments:
    https://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02577.html
[2] vmsplice support for zero-copy gifting of pages:
    http://comments.gmane.org/gmane.linux.kernel.mm/103998

> Paolo

-- 
Lei
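
P.S. For list readers not familiar with the gifting idea in [2], here
is a minimal, untested sketch of the sender side only. The helper name
gift_pages is made up; it assumes the ram chunk is page-aligned in both
address and length (a SPLICE_F_GIFT requirement), and the zero-copy
receive side, which is what the kernel work in [2] adds, is not shown:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/uio.h>

    /* Move a page-aligned chunk of guest ram into a pipe without
     * copying it.  SPLICE_F_GIFT tells the kernel that it may take
     * ownership of the pages instead of copying their contents. */
    static ssize_t gift_pages(int pipe_wr_fd, void *host, size_t len)
    {
        struct iovec iov = { .iov_base = host, .iov_len = len };
        return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT);
    }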