From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5012A3DA.20904@hp.com>
Date: Fri, 27 Jul 2012 07:21:14 -0700
From: Chegu Vinod
To: Juan Jose Quintela Carreira
Cc: Orit Wasserman, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] FW: Fwd: [RFC 00/27] Migration thread (WIP)
In-Reply-To: <4168C988EBDF2141B4E0B6475B6A73D1165CDD@G6W2493.americas.hpqcorp.net>
References: <1343155012-26316-1-git-send-email-quintela@redhat.com>
 <500EF579.5040607@redhat.com> <50118F45.6050909@hp.com>
 <5011B5EB.7080209@hp.com> <87zk6l5ur4.fsf@trasno.org>
 <4168C988EBDF2141B4E0B6475B6A73D1165CDD@G6W2493.americas.hpqcorp.net>

On 7/27/2012 7:11 AM, Vinod, Chegu wrote:
>
> -----Original Message-----
> From: Juan Quintela [mailto:quintela@redhat.com]
> Sent: Friday, July 27, 2012 4:06 AM
> To: Vinod, Chegu
> Cc: qemu-devel@nongnu.org; Orit Wasserman
> Subject: Re: Fwd: [RFC 00/27] Migration thread (WIP)
>
> Chegu Vinod wrote:
>> On 7/26/2012 11:41 AM, Chegu Vinod wrote:
>>
>> -------- Original Message --------
>> Subject: [Qemu-devel] [RFC 00/27] Migration thread (WIP)
>> Date: Tue, 24 Jul 2012 20:36:25 +0200
>> From: Juan Quintela
>> To: qemu-devel@nongnu.org
>>
>> Hi,
>>
>> This series is on top of the migration-next-v5 series just posted.
>>
>> First of all, this is an RFC/work in progress.  A lot of people asked
>> for it, and I would like review of the design.
>>
>> Hello,
>>
>> Thanks for sharing this early/WIP version for evaluation.
>>
>> Still in the middle of code review, but I wanted to share a couple of
>> quick observations.  I tried to use it to migrate a 128G/10VCPU guest
>> (speed set to 10G and downtime to 2s), once with no workload (i.e. an
>> idle guest) and a second time with SpecJBB running in the guest.
>>
>> The idle guest case seemed to migrate fine:
>>
>> capabilities: xbzrle: off
>> Migration status: completed
>> transferred ram: 3811345 kbytes
>> remaining ram: 0 kbytes
>> total ram: 134226368 kbytes
>> total time: 199743 milliseconds
>>
>> In the SpecJBB case I ran into issues during stage 3... the source
>> host's qemu and the guest hung.  I need to debug this more (if you
>> already have some hints, please let me know):
>>
>> capabilities: xbzrle: off
>> Migration status: active
>> transferred ram: 127618578 kbytes
>> remaining ram: 2386832 kbytes
>> total ram: 134226368 kbytes
>> total time: 526139 milliseconds
>> (qemu) qemu_savevm_state_complete called
>> qemu_savevm_state_complete calling ram_save_complete
>>
>> <--- hung somewhere after this (need to get more info).
>>
>> It appears to be some race condition, as there are cases when it
>> hangs and cases when it succeeds.
> Weird guess: try to use fewer vcpus, same RAM.

Ok, will try that.

> The way that we stop cpus is _hacky_, to say the least.  Will try to
> think about that part.
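For context, here is a minimal sketch of the kind of bottom-half approach
being referred to (Juan describes it further down as "the trick of using a
bottom half" so that vm_stop() runs in the iothread).  This is not the code
from the series: the helper names and the completion signalling are invented
for the illustration, and the headers/APIs loosely follow the 2012-era QEMU
tree (qemu_bh_new/qemu_bh_schedule, vm_stop).

/*
 * Illustrative sketch only (not from the series): the migration thread asks
 * the iothread to stop the guest CPUs by scheduling a bottom half, instead
 * of calling vm_stop() from the wrong thread.  Header names follow the
 * 2012-era QEMU tree and may not match exactly.
 */
#include "qemu-common.h"   /* QEMUBH, qemu_bh_new(), qemu_bh_schedule() */
#include "qemu-thread.h"   /* QemuMutex, QemuCond                       */
#include "sysemu.h"        /* vm_stop(), RUN_STATE_FINISH_MIGRATE       */

typedef struct StopRequest {
    QemuMutex lock;
    QemuCond cond;
    bool done;
} StopRequest;

/* Runs in the iothread when the bottom half fires. */
static void stop_cpus_bh(void *opaque)
{
    StopRequest *req = opaque;

    vm_stop(RUN_STATE_FINISH_MIGRATE);  /* safe here: we are in the iothread */

    qemu_mutex_lock(&req->lock);
    req->done = true;
    qemu_cond_signal(&req->cond);
    qemu_mutex_unlock(&req->lock);
}

/* Called from the migration thread when it wants the guest stopped. */
static void migration_request_vm_stop(void)
{
    StopRequest req = { .done = false };
    QEMUBH *bh;

    qemu_mutex_init(&req.lock);
    qemu_cond_init(&req.cond);

    bh = qemu_bh_new(stop_cpus_bh, &req);
    qemu_bh_schedule(bh);               /* the iothread will run it shortly */

    qemu_mutex_lock(&req.lock);
    while (!req.done) {
        qemu_cond_wait(&req.cond, &req.lock);
    }
    qemu_mutex_unlock(&req.lock);

    qemu_bh_delete(bh);                 /* mutex/cond cleanup elided */
}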
Ok.

> Thanks for the testing.  All my testing has been done with 8GB guests
> and 2 vcpus.  Will try with more vcpus to see if it makes a difference.
>
>> (qemu) info migrate
>> capabilities: xbzrle: off
>> Migration status: completed
>> transferred ram: 129937687 kbytes
>> remaining ram: 0 kbytes
>> total ram: 134226368 kbytes
>> total time: 543228 milliseconds
> Humm, _that_ is more strange.  This means that it finished.

There are cases where the migration finishes just fine... even with
larger guest configurations (256G/20VCPUs).

> Could you run qemu under gdb and send me the stack traces?
>
> I don't know your gdb thread kung-fu, so here are the instructions
> just in case:
>
> gdb --args <qemu command line>
> C-c to break when it hangs
> (gdb) info threads
>     you see all the threads running
> (gdb) thread 1
>     or whatever other number
> (gdb) bt
>     the backtrace of that thread

The hang is intermittent... I ran it 4-5 times (under gdb) just now and
I didn't see the issue :-(

> I am especially interested in the backtrace of the migration thread
> and of the iothread.

Will keep re-trying with different configs and see if I get lucky in
reproducing it (under gdb).

Vinod

> Thanks, Juan.
>
>> Need to review/debug...
>>
>> Vinod
>>
>> ---
>>
>> As with the non-migration-thread version, the SpecJBB workload
>> completed before the migration attempted to move to stage 3 (i.e. it
>> didn't converge while the workload was still active).
>>
>> BTW, with this version of the bits (i.e. while running SpecJBB, which
>> is supposed to dirty quite a bit of memory) I noticed that there
>> wasn't much change in the bandwidth usage of the dedicated 10Gb
>> private network link (it was still < ~1.5-3.0Gb/sec).  I expected
>> this to be a little better since we have a separate thread... not
>> sure what else is in play here?  (NUMA locality of where the
>> migration thread runs, or some other basic tuning in the
>> implementation?)
>>
>> I have a high-level design question... (perhaps folks have already
>> thought about it and categorized it as a potential future
>> optimization?)
>>
>> Would it be possible to offload the iothread completely [from all
>> migration-related activity] and have one thread (with the appropriate
>> protection) get involved with getting the list of the dirty pages?
>> Have one or more threads dedicated to trying to push multiple streams
>> of data to saturate the allocated network bandwidth?  This may help
>> with large + busy guests.  Comments?
>> There are perhaps other implications of doing all of this (like
>> burning more host cpu cycles), but perhaps this can be made
>> configurable based on the user's needs... e.g. fewer but larger
>> guests on a host with no oversubscription.
>>
>> Thanks
>> Vinod
>>
>> It does:
>> - get a new bitmap for migration, and that bitmap uses 1 bit per page
>> - it unfolds migration_buffered_file (only one user existed)
>> - it simplifies buffered_file a lot
>>
>> - About the migration thread: special attention was given to trying
>>   to keep the series reviewable (reviewers will tell me if I got it).
>>
>> Basic design:
>> - we create a new thread instead of a timer function
>> - we move all the migration work to that thread (but run everything
>>   except the waits with the iothread lock held)
>> - we move all the writing outside the iothread lock, i.e. we walk the
>>   state with the iothread lock held and copy everything to one
>>   buffer; then we write that buffer to the sockets outside the
>>   iothread lock
>> - once here, we move to writing synchronously to the sockets
>> - this allows us to simplify quite a lot
>>
>> And basically, that is it.  Notice that we still do the iterative
>> page walking with the iothread lock held.  Light testing shows that
>> we get similar speed and latencies to the version without the thread
>> (note that almost no optimizations have been done here yet).
>>
>> Apart from the review:
>> - Are there any locking issues that I have missed?  (I guess so.)
>> - Stop all cpus correctly.  vm_stop should be called from the
>>   iothread; I use the trick of using a bottom half to get that
>>   working correctly, but this _implementation_ is ugly as hell.  Is
>>   there an easy way of doing it?
>> - Do I really have to export last_ram_offset()?  Is there no other
>>   way of knowing the amount of RAM?
>>
>> Known issues:
>>
>> - for some reason, when it has to start a 2nd round of bitmap
>>   handling, it decides to dirty all pages.  Still haven't found why
>>   this happens.
>>
>> If you can test it and tell me where it breaks, that would also help.
>>
>> Work is based on Umesh's thread work, and on work that Paolo Bonzini
>> did on top of that.  All the migration thread code was done from
>> scratch because I was unable to debug why it was failing, but it
>> "owes" a lot to the previous design.
>>
>> Thanks in advance, Juan.
>>
>> The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:
>>
>>   Merge remote-tracking branch 'stefanha/net' into staging (2012-07-23 13:15:34 -0500)
>>
>> are available in the git repository at:
>>
>>   http://repo.or.cz/r/qemu/quintela.git migration-thread-v1
>>
>> for you to fetch changes up to 27e539b03ba97bc37e107755bcb44511ec4c8100:
>>
>>   buffered_file: unfold buffered_append in buffered_put_buffer (2012-07-24 16:46:13 +0200)
>>
>> Juan Quintela (23):
>>       buffered_file: g_realloc() can't fail
>>       savevm: Factorize ram globals reset in its own function
>>       ram: introduce migration_bitmap_set_dirty()
>>       ram: Introduce migration_bitmap_test_and_reset_dirty()
>>       ram: Export last_ram_offset()
>>       ram: introduce migration_bitmap_sync()
>>       Separate migration bitmap
>>       buffered_file: rename opaque to migration_state
>>       buffered_file: opaque is MigrationState
>>       buffered_file: unfold migrate_fd_put_buffer
>>       buffered_file: unfold migrate_fd_put_ready
>>       buffered_file: unfold migrate_fd_put_buffer
>>       buffered_file: unfold migrate_fd_put_buffer
>>       buffered_file: We can access directly to bandwidth_limit
>>       buffered_file: Move from using a timer to use a thread
>>       migration: make qemu_fopen_ops_buffered() return void
>>       migration: stop all cpus correctly
>>       migration: make writes blocking
>>       migration: remove unfreeze logic
>>       migration: take finer locking
>>       buffered_file: Unfold the trick to restart generating migration data
>>       buffered_file: don't flush on put buffer
>>       buffered_file: unfold buffered_append in buffered_put_buffer
>>
>> Paolo Bonzini (2):
>>       split MRU ram list
>>       BufferedFile: append, then flush
>>
>> Umesh Deshpande (2):
>>       add a version number to ram_list
>>       protect the ramlist with a separate mutex
>>
>>  arch_init.c      |  108 +++++++++++++++++++++++++-------
>>  buffered_file.c  |  179 +++++++++++++++++-------------------------------
>>  buffered_file.h  |   12 +---
>>  cpu-all.h        |   17 +++++-
>>  exec-obsolete.h  |   10 ---
>>  exec.c           |   45 +++++++++++---
>>  migration-exec.c |    2 -
>>  migration-fd.c   |    6 --
>>  migration-tcp.c  |    2 +-
>>  migration-unix.c |    2 -
>>  migration.c      |  111 ++++++++++++++------------------
>>  migration.h      |    6 ++
>>  qemu-file.h      |    5 --
>>  savevm.c         |    5 --
>>  14 files changed, 249 insertions(+), 261 deletions(-)
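To make the "basic design" notes quoted above a bit more concrete (walk and
copy the dirty state while the iothread lock is held, then write the buffer
to the socket with blocking I/O after dropping the lock), here is a rough,
generic sketch.  Every name in it (iothread_lock, copy_dirty_pages,
migration_done, the buffer size) is invented for the illustration; it is not
the code from the series.

/*
 * Illustration only: the copy-under-lock / write-outside-lock pattern from
 * the design notes above.  The extern declarations are hypothetical
 * stand-ins for QEMU internals.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

extern pthread_mutex_t iothread_lock;                   /* the "big lock" */
extern size_t copy_dirty_pages(void *buf, size_t len);  /* fill buf, return bytes */
extern bool migration_done(void);                       /* converged or cancelled */

static char buffer[4 * 1024 * 1024];

/* Migration thread body: replaces the old timer-driven callback. */
static void *migration_thread(void *opaque)
{
    int sock_fd = *(int *)opaque;       /* blocking socket to the destination */

    while (!migration_done()) {
        size_t filled, off = 0;

        /* Walk the guest state and copy it into the buffer under the lock... */
        pthread_mutex_lock(&iothread_lock);
        filled = copy_dirty_pages(buffer, sizeof(buffer));
        pthread_mutex_unlock(&iothread_lock);

        /* ...but do the (possibly slow) blocking writes without holding it. */
        while (off < filled) {
            ssize_t n = write(sock_fd, buffer + off, filled - off);
            if (n < 0) {
                return NULL;            /* error handling elided */
            }
            off += (size_t)n;
        }
    }
    return NULL;
}

/* The iothread would spawn this once instead of arming a timer. */
int start_migration_thread(int sock_fd)
{
    static int fd;                      /* must outlive this function */
    pthread_t tid;

    fd = sock_fd;
    return pthread_create(&tid, NULL, migration_thread, &fd);
}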