qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Testing migration under stress
@ 2012-11-02  3:10 David Gibson
  2012-11-02 12:12 ` Orit Wasserman
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Gibson @ 2012-11-02  3:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: aik, quintela

Asking for some advice on the list.

I have prototype savevm and migration support ready for the pseries
machine.  They seem to work under simple circumstances (idle guest).
To test them more extensively I've been attempting to perform live
migrations (just over tcp->localhost) while the guest is active with
something.  In particular I've tried while using Octave to do a matrix
multiply (so exercising the FP unit), and my colleague Alexey has tried
during some video encoding.

However, in each of these cases, we've found that the migration only
completes and the source instance only stops after the intensive
workload has (just) completed.  What I surmise is happening is that
the workload is touching memory pages fast enough that the ram
migration code is never getting below the threshold to complete the
migration until the guest is idle again.

Does anyone have some ideas for testing this better: workloads that
are less likely to trigger this behaviour, or settings to tweak in the
migration itself to make it more likely to complete migration while
the workload is still active?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02  3:10 [Qemu-devel] Testing migration under stress David Gibson
@ 2012-11-02 12:12 ` Orit Wasserman
  2012-11-05  0:30   ` David Gibson
  2012-11-06  5:22   ` Alexey Kardashevskiy
  2012-11-02 13:04 ` Paolo Bonzini
  2012-11-02 13:07 ` Juan Quintela
  2 siblings, 2 replies; 12+ messages in thread
From: Orit Wasserman @ 2012-11-02 12:12 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, qemu-devel, quintela

On 11/02/2012 05:10 AM, David Gibson wrote:
> Asking for some advice on the list.
> 
> I have prototype savevm and migration support ready for the pseries
> machine.  They seem to work under simple circumstances (idle guest).
> To test them more extensively I've been attempting to perform live
> migrations (just over tcp->localhost) while the guest is active with
> something.  In particular I've tried while using Octave to do a matrix
> multiply (so exercising the FP unit), and my colleague Alexey has tried
> during some video encoding.
>
As you are doing local migration, one option is to set the speed higher
than line speed, as we don't actually send the data; another is to set a high downtime.
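For example, something like this on the source monitor (the values here are
only an illustration, tune them to your setup):

(qemu) migrate_set_speed 10g
(qemu) migrate_set_downtime 2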

> However, in each of these cases, we've found that the migration only
> completes and the source instance only stops after the intensive
> workload has (just) completed.  What I surmise is happening is that
> the workload is touching memory pages fast enough that the ram
> migration code is never getting below the threshold to complete the
> migration until the guest is idle again.
> 
The workload you chose is really bad for live migration, as all the guest does is
dirtying its memory. I recommend looking for a workload that does some networking or disk IO.
Vinod succeeded in running the SwingBench and SLOB benchmarks, which converged OK; I don't
know if they run on pseries, but a similar workload should be OK (small database/warehouse).
We found that SpecJbb, on the other hand, is hard to converge.
Web workloads or video streaming also do the trick.

Cheers,
Orit

> Does anyone have some ideas for testing this better: workloads that
> are less likely to trigger this behaviour, or settings to tweak in the
> migration itself to make it more likely to complete migration while
> the workload is still active?
> 


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02  3:10 [Qemu-devel] Testing migration under stress David Gibson
  2012-11-02 12:12 ` Orit Wasserman
@ 2012-11-02 13:04 ` Paolo Bonzini
  2012-11-02 13:07 ` Juan Quintela
  2 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2012-11-02 13:04 UTC (permalink / raw)
  To: qemu-devel

On 02/11/2012 04:10, David Gibson wrote:
> Asking for some advice on the list.
> 
> I have prototype savevm and migration support ready for the pseries
> machine.  They seem to work under simple circumstances (idle guest).
> To test them more extensively I've been attempting to perform live
> migrations (just over tcp->localhost) while the guest is active with
> something.  In particular I've tried while using Octave to do a matrix
> multiply (so exercising the FP unit), and my colleague Alexey has tried
> during some video encoding.
> 
> However, in each of these cases, we've found that the migration only
> completes and the source instance only stops after the intensive
> workload has (just) completed.  What I surmise is happening is that
> the workload is touching memory pages fast enough that the ram
> migration code is never getting below the threshold to complete the
> migration until the guest is idle again.
> 
> Does anyone have some ideas for testing this better: workloads that
> are less likely to trigger this behaviour, or settings to tweak in the
> migration itself to make it more likely to complete migration while
> the workload is still active?

Have you set the migration speed (migrate_set_speed) to something higher
than the default 32MB/sec?
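For example, from the monitor (1g is only an illustrative value; anything
comfortably above the guest's dirty rate should do):

(qemu) migrate_set_speed 1g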

Paolo


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02  3:10 [Qemu-devel] Testing migration under stress David Gibson
  2012-11-02 12:12 ` Orit Wasserman
  2012-11-02 13:04 ` Paolo Bonzini
@ 2012-11-02 13:07 ` Juan Quintela
  2012-11-05  0:31   ` David Gibson
  2 siblings, 1 reply; 12+ messages in thread
From: Juan Quintela @ 2012-11-02 13:07 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, qemu-devel

David Gibson <david@gibson.dropbear.id.au> wrote:
> Asking for some advice on the list.
>
> I have prototype savevm and migration support ready for the pseries
> machine.  They seem to work under simple circumstances (idle guest).
> To test them more extensively I've been attempting to perform live
> migrations (just over tcp->localhost) while the guest is active with
> something.  In particular I've tried while using Octave to do a matrix
> multiply (so exercising the FP unit), and my colleague Alexey has tried
> during some video encoding.
>
> However, in each of these cases, we've found that the migration only
> completes and the source instance only stops after the intensive
> workload has (just) completed.  What I surmise is happening is that
> the workload is touching memory pages fast enough that the ram
> migration code is never getting below the threshold to complete the
> migration until the guest is idle again.
>
> Does anyone have some ideas for testing this better: workloads that
> are less likely to trigger this behaviour, or settings to tweak in the
> migration itself to make it more likely to complete migration while
> the workload is still active?

You can:

migrate_set_downtime 2s (or so)

I normally run stress, and the migration moves the memory that it dirties
until it converges (this depends a lot on your networking).
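For example, in the guest (sizes are only an illustration; this is the
classic "stress" tool):

  stress --vm 2 --vm-bytes 256M --timeout 300s

If the dirtied working set is small enough relative to your migration
bandwidth, the copy catches up with the dirtying and the migration
converges.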

Doing anything that is really memory intensive is basically never going
to converge.

Later, Juan.


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02 12:12 ` Orit Wasserman
@ 2012-11-05  0:30   ` David Gibson
  2012-11-05 12:21     ` Orit Wasserman
  2012-11-06  5:22   ` Alexey Kardashevskiy
  1 sibling, 1 reply; 12+ messages in thread
From: David Gibson @ 2012-11-05  0:30 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: aik, qemu-devel, quintela

On Fri, Nov 02, 2012 at 02:12:25PM +0200, Orit Wasserman wrote:
> On 11/02/2012 05:10 AM, David Gibson wrote:
> > Asking for some advice on the list.
> > 
> > I have prototype savevm and migration support ready for the pseries
> > machine.  They seem to work under simple circumstances (idle guest).
> > To test them more extensively I've been attempting to perform live
> > migrations (just over tcp->localhost) while the guest is active with
> > something.  In particular I've tried while using Octave to do a matrix
> > multiply (so exercising the FP unit), and my colleague Alexey has tried
> > during some video encoding.

> As you are doing local migration, one option is to set the speed
> higher than line speed, as we don't actually send the data; another
> is to set a high downtime.

I'm not entirely sure what you mean by that.  But based on this and
other factors I do suspect that the default bandwidth limit is
horribly, horribly low.

> > However, in each of these cases, we've found that the migration only
> > completes and the source instance only stops after the intensive
> > workload has (just) completed.  What I surmise is happening is that
> > the workload is touching memory pages fast enough that the ram
> > migration code is never getting below the threshold to complete the
> > migration until the guest is idle again.
> > 
> The workload you chose is really bad for live migration, as all the
> guest does is dirtying its memory.

Well, I realised that was true of the matrix multiply.  For video
encode though, the output data should be much, much smaller than the
input, so I wouldn't expect it to be dirtying memory that fast.

> I recommend looking for a workload
> that does some networking or disk IO.  Vinod succeeded in running the
> SwingBench and SLOB benchmarks, which converged OK; I don't know if
> they run on pseries, but a similar workload should be OK (small
> database/warehouse).  We found that SpecJbb, on the other hand, is
> hard to converge.  Web workloads or video streaming also do the
> trick.

Hrm.  As something really simple and stupid, I did try migrating an
ls -lR /, but even that didn't converge :/.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02 13:07 ` Juan Quintela
@ 2012-11-05  0:31   ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2012-11-05  0:31 UTC (permalink / raw)
  To: Juan Quintela; +Cc: aik, qemu-devel

On Fri, Nov 02, 2012 at 02:07:45PM +0100, Juan Quintela wrote:
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > Asking for some advice on the list.
> >
> > I have prototype savevm and migration support ready for the pseries
> > machine.  They seem to work under simple circumstances (idle guest).
> > To test them more extensively I've been attempting to perform live
> > migrations (just over tcp->localhost) while the guest is active with
> > something.  In particular I've tried while using Octave to do a matrix
> > multiply (so exercising the FP unit), and my colleague Alexey has tried
> > during some video encoding.
> >
> > However, in each of these cases, we've found that the migration only
> > completes and the source instance only stops after the intensive
> > workload has (just) completed.  What I surmise is happening is that
> > the workload is touching memory pages fast enough that the ram
> > migration code is never getting below the threshold to complete the
> > migration until the guest is idle again.
> >
> > Does anyone have some ideas for testing this better: workloads that
> > are less likely to trigger this behaviour, or settings to tweak in the
> > migration itself to make it more likely to complete migration while
> > the workload is still active?
> 
> You can:
> 
> migrate_set_downtime 2s (or so)
> 
> I normally run stress, and the migration moves the memory that it dirties
> until it converges (this depends a lot on your networking).

I'm using tcp to localhost, so it should be really fast, but it
doesn't seem to be :/.  I suspect there are some other bugs here.

> Doing anything that is really memory intensive is basically never going
> to converge.

Well, I didn't think the loads I chose would be memory limited
(especially the video encode), but...

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] Testing migration under stress
  2012-11-05  0:30   ` David Gibson
@ 2012-11-05 12:21     ` Orit Wasserman
  2012-11-06  1:14       ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: Orit Wasserman @ 2012-11-05 12:21 UTC (permalink / raw)
  To: David Gibson; +Cc: aik, qemu-devel, quintela

On 11/05/2012 02:30 AM, David Gibson wrote:
> On Fri, Nov 02, 2012 at 02:12:25PM +0200, Orit Wasserman wrote:
>> On 11/02/2012 05:10 AM, David Gibson wrote:
>>> Asking for some advice on the list.
>>>
>>> I have prototype savevm and migration support ready for the pseries
>>> machine.  They seem to work under simple circumstances (idle guest).
>>> To test them more extensively I've been attempting to perform live
>>> migrations (just over tcp->localhost) while the guest is active with
>>> something.  In particular I've tried while using Octave to do a matrix
>>> multiply (so exercising the FP unit), and my colleague Alexey has tried
>>> during some video encoding.
> 
>> As you are doing local migration, one option is to set the speed
>> higher than line speed, as we don't actually send the data; another
>> is to set a high downtime.
> 
> I'm not entirely sure what you mean by that.  But based on this and
> other factors I do suspect that the default bandwidth limit is
> horribly, horribly low.
> 
>>> However, in each of these cases, we've found that the migration only
>>> completes and the source instance only stops after the intensive
>>> workload has (just) completed.  What I surmise is happening is that
>>> the workload is touching memory pages fast enough that the ram
>>> migration code is never getting below the threshold to complete the
>>> migration until the guest is idle again.
>>>
>> The workload you chose is really bad for live migration, as all the
>> guest does is dirtying its memory.
> 
> Well, I realised that was true of the matrix multiply.  For video
> encode though, the output data should be much, much smaller than the
> input, so I wouldn't expect it to be dirtying memory that fast.
> 
>> I recommend looking for a workload
>> that does some networking or disk IO.  Vinod succeeded in running the
>> SwingBench and SLOB benchmarks, which converged OK; I don't know if
>> they run on pseries, but a similar workload should be OK (small
>> database/warehouse).  We found that SpecJbb, on the other hand, is
>> hard to converge.  Web workloads or video streaming also do the
>> trick.
> 
> Hrm.  As something really simple and stupid, I did try migrating an
> ls -lR /, but even that didn't converge :/.
That is strange; it should converge even with the defaults.
Anything special about your storage setup?
> 


* Re: [Qemu-devel] Testing migration under stress
  2012-11-05 12:21     ` Orit Wasserman
@ 2012-11-06  1:14       ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2012-11-06  1:14 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: aik, qemu-devel, quintela

On Mon, Nov 05, 2012 at 02:21:37PM +0200, Orit Wasserman wrote:
> On 11/05/2012 02:30 AM, David Gibson wrote:
> > On Fri, Nov 02, 2012 at 02:12:25PM +0200, Orit Wasserman wrote:
> >> On 11/02/2012 05:10 AM, David Gibson wrote:
> >>> Asking for some advice on the list.
> >>>
> >>> I have prototype savevm and migration support ready for the pseries
> >>> machine.  They seem to work under simple circumstances (idle guest).
> >>> To test them more extensively I've been attempting to perform live
> >>> migrations (just over tcp->localhost) while the guest is active with
> >>> something.  In particular I've tried while using Octave to do a matrix
> >>> multiply (so exercising the FP unit), and my colleague Alexey has tried
> >>> during some video encoding.
> > 
> >> As you are doing local migration, one option is to set the speed
> >> higher than line speed, as we don't actually send the data; another
> >> is to set a high downtime.
> > 
> > I'm not entirely sure what you mean by that.  But based on this and
> > other factors I do suspect that the default bandwidth limit is
> > horribly, horribly low.
> > 
> >>> However, in each of these cases, we've found that the migration only
> >>> completes and the source instance only stops after the intensive
> >>> workload has (just) completed.  What I surmise is happening is that
> >>> the workload is touching memory pages fast enough that the ram
> >>> migration code is never getting below the threshold to complete the
> >>> migration until the guest is idle again.
> >>>
> >> The workload you chose is really bad for live migration, as all the
> >> guest does is dirtying its memory.
> > 
> > Well, I realised that was true of the matrix multiply.  For video
> > encode though, the output data should be much, much smaller than the
> > input, so I wouldn't expect it to be dirtying memory that fast.
> > 
> >> I recommend looking for a workload
> >> that does some networking or disk IO.  Vinod succeeded in running the
> >> SwingBench and SLOB benchmarks, which converged OK; I don't know if
> >> they run on pseries, but a similar workload should be OK (small
> >> database/warehouse).  We found that SpecJbb, on the other hand, is
> >> hard to converge.  Web workloads or video streaming also do the
> >> trick.
> > 
> > Hrm.  As something really simple and stupid, I did try migrating an
> > ls -lR /, but even that didn't converge :/.
> That is strange; it should converge even with the defaults.
> Anything special about your storage setup?

I didn't think so.  Do you mean host or guest storage setup?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] Testing migration under stress
  2012-11-02 12:12 ` Orit Wasserman
  2012-11-05  0:30   ` David Gibson
@ 2012-11-06  5:22   ` Alexey Kardashevskiy
  2012-11-06  6:55     ` David Gibson
  2012-11-06 10:54     ` Orit Wasserman
  1 sibling, 2 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2012-11-06  5:22 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: quintela, qemu-devel, David Gibson

On 02/11/12 23:12, Orit Wasserman wrote:
> On 11/02/2012 05:10 AM, David Gibson wrote:
>> Asking for some advice on the list.
>>
>> I have prototype savevm and migration support ready for the pseries
>> machine.  They seem to work under simple circumstances (idle guest).
>> To test them more extensively I've been attempting to perform live
>> migrations (just over tcp->localhost) while the guest is active with
>> something.  In particular I've tried while using Octave to do a matrix
>> multiply (so exercising the FP unit), and my colleague Alexey has tried
>> during some video encoding.
>>
> As you are doing local migration, one option is to set the speed higher
> than line speed, as we don't actually send the data; another is to set a high downtime.
>
>> However, in each of these cases, we've found that the migration only
>> completes and the source instance only stops after the intensive
>> workload has (just) completed.  What I surmise is happening is that
>> the workload is touching memory pages fast enough that the ram
>> migration code is never getting below the threshold to complete the
>> migration until the guest is idle again.
>>
> The workload you chose is really bad for live migration, as all the guest does is
> dirtying its memory. I recommend looking for a workload that does some networking or disk IO.
> Vinod succeeded in running the SwingBench and SLOB benchmarks, which converged OK; I don't
> know if they run on pseries, but a similar workload should be OK (small database/warehouse).
> We found that SpecJbb, on the other hand, is hard to converge.
> Web workloads or video streaming also do the trick.


My ffmpeg workload is a simple encode from h263+ac3 to h263+ac3, 64*36 pixels.
So it should not be dirtying memory too much. Or is it?

(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
total time: 14538 milliseconds
downtime: 1273 milliseconds
transferred ram: 389961 kbytes
remaining ram: 0 kbytes
total ram: 1065024 kbytes
duplicate: 181949 pages
normal: 97446 pages
normal bytes: 389784 kbytes

How many bytes were actually transferred? "duplicate" * 4K = 745MB?

Is there any tool in QEMU to see how many pages are used/dirty/etc?
"info" does not seem to have any kind of such statistic.

btw the new guest did not resume (qemu still responds to commands) but this
is probably our problem within the "pseries" platform. What is strange is
that "info migrate" on the new guest shows nothing:

(qemu) info migrate
(qemu)




> Cheers,
> Orit
>
>> Does anyone have some ideas for testing this better: workloads that
>> are less likely to trigger this behaviour, or settings to tweak in the
>> migration itself to make it more likely to complete migration while
>> the workload is still active?
>>
>


-- 
Alexey


* Re: [Qemu-devel] Testing migration under stress
  2012-11-06  5:22   ` Alexey Kardashevskiy
@ 2012-11-06  6:55     ` David Gibson
  2012-11-06  7:55       ` Alexey Kardashevskiy
  2012-11-06 10:54     ` Orit Wasserman
  1 sibling, 1 reply; 12+ messages in thread
From: David Gibson @ 2012-11-06  6:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Orit Wasserman, qemu-devel, quintela

On Tue, Nov 06, 2012 at 04:22:11PM +1100, Alexey Kardashevskiy wrote:
> On 02/11/12 23:12, Orit Wasserman wrote:
> >On 11/02/2012 05:10 AM, David Gibson wrote:
> >>Asking for some advice on the list.
> >>
> >>I have prototype savevm and migration support ready for the pseries
> >>machine.  They seem to work under simple circumstances (idle guest).
> >>To test them more extensively I've been attempting to perform live
> >>migrations (just over tcp->localhost) while the guest is active with
> >>something.  In particular I've tried while using Octave to do a matrix
> >>multiply (so exercising the FP unit), and my colleague Alexey has tried
> >>during some video encoding.
> >>
> >As you are doing local migration, one option is to set the speed higher
> >than line speed, as we don't actually send the data; another is to set a high downtime.
> >
> >>However, in each of these cases, we've found that the migration only
> >>completes and the source instance only stops after the intensive
> >>workload has (just) completed.  What I surmise is happening is that
> >>the workload is touching memory pages fast enough that the ram
> >>migration code is never getting below the threshold to complete the
> >>migration until the guest is idle again.
> >>
> >The workload you chose is really bad for live migration, as all the guest does is
> >dirtying its memory. I recommend looking for a workload that does some networking or disk IO.
> >Vinod succeeded in running the SwingBench and SLOB benchmarks, which converged OK; I don't
> >know if they run on pseries, but a similar workload should be OK (small database/warehouse).
> >We found that SpecJbb, on the other hand, is hard to converge.
> >Web workloads or video streaming also do the trick.
> 
> 
> My ffmpeg workload is a simple encode from h263+ac3 to h263+ac3, 64*36
> pixels. So it should not be dirtying memory too much. Or is it?

Oh.. if you're encoding the same format to the same format it may well
be optimized and therefore memory limited.  I was envisaging encoding
an uncompressed format to a highly compressed format, which should be
compute limited rather than memory bandwidth limited.  The size and
resolution of the input doesn't really matter as long as:
	   * the output size is much smaller than the input size
and	   * it takes several minutes for the full encode to give a
	     reasonable amount of time for the migration to converge.
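For example, something along these lines (the file names are only an
illustration, and it assumes libx264 is available):

  ffmpeg -i input.y4m -c:v libx264 -preset veryslow output.mkv

i.e. a raw input, so the encoder is compute bound rather than churning
through freshly dirtied buffers.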
> 
> (qemu) info migrate
> capabilities: xbzrle: off
> Migration status: completed
> total time: 14538 milliseconds
> downtime: 1273 milliseconds
> transferred ram: 389961 kbytes
> remaining ram: 0 kbytes
> total ram: 1065024 kbytes
> duplicate: 181949 pages
> normal: 97446 pages
> normal bytes: 389784 kbytes
> 
> How many bytes were actually transferred? "duplicate" * 4K = 745MB?
> 
> Is there any tool in QEMU to see how many pages are used/dirty/etc?
> "info" does not seem to have any kind of such statistic.
> 
> btw the new guest did not resume (qemu still responds to commands)
> but this is probably our problem within the "pseries" platform. What is

Uh, that's a bug, and I'm not sure when it broke.  If the migration
isn't even working, we're premature in trying to work out why it
isn't completing when we expect.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] Testing migration under stress
  2012-11-06  6:55     ` David Gibson
@ 2012-11-06  7:55       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2012-11-06  7:55 UTC (permalink / raw)
  To: David Gibson; +Cc: Orit Wasserman, qemu-devel, quintela

On 06/11/12 17:55, David Gibson wrote:
> On Tue, Nov 06, 2012 at 04:22:11PM +1100, Alexey Kardashevskiy wrote:
>> On 02/11/12 23:12, Orit Wasserman wrote:
>>> On 11/02/2012 05:10 AM, David Gibson wrote:
>>>> Asking for some advice on the list.
>>>>
>>>> I have prototype savevm and migration support ready for the pseries
>>>> machine.  They seem to work under simple circumstances (idle guest).
>>>> To test them more extensively I've been attempting to perform live
>>>> migrations (just over tcp->localhost) while the guest is active with
>>>> something.  In particular I've tried while using Octave to do a matrix
>>>> multiply (so exercising the FP unit), and my colleague Alexey has tried
>>>> during some video encoding.
>>>>
>>> As you are doing local migration, one option is to set the speed higher
>>> than line speed, as we don't actually send the data; another is to set a high downtime.
>>>
>>>> However, in each of these cases, we've found that the migration only
>>>> completes and the source instance only stops after the intensive
>>>> workload has (just) completed.  What I surmise is happening is that
>>>> the workload is touching memory pages fast enough that the ram
>>>> migration code is never getting below the threshold to complete the
>>>> migration until the guest is idle again.
>>>>
>>> The workload you chose is really bad for live migration, as all the guest does is
>>> dirtying its memory. I recommend looking for a workload that does some networking or disk IO.
>>> Vinod succeeded in running the SwingBench and SLOB benchmarks, which converged OK; I don't
>>> know if they run on pseries, but a similar workload should be OK (small database/warehouse).
>>> We found that SpecJbb, on the other hand, is hard to converge.
>>> Web workloads or video streaming also do the trick.
>>
>>
>> My ffmpeg workload is a simple encode from h263+ac3 to h263+ac3, 64*36
>> pixels. So it should not be dirtying memory too much. Or is it?
>
> Oh.. if you're encoding the same format to the same format it may well
> be optimized and therefore memory limited.

No, it is not optimized; it still decodes and encodes, as I inserted a
filter into the chain.

> I was envisaging encoding
> an uncompressed format to a highly compressed format, which should be
> compute limited rather than memory bandwidth limited.

> The size and
> resolution of the input doesn't really matter as long as:
> 	   * the output size is much smaller than the input size

This is another scenario; I run both. I just tried to reduce memory
consumption, as was recommended here, to see if anything changes.

Originally it was 1280*720 to 64*36, but I am not sure it does not use much
memory, as (I suspect, at least sometimes) ffmpeg may decode a series of
full-size frames to do motion detection or something.


> and	   * it takes several minutes for the full encode to give a
> 	     reasonable amount of time for the migration to converge.

Each file is 90 seconds; if I run a script which does the encoding in a
loop, the pause between encodings is not big enough to finish the
migration anyway if I encode a big video to a small video.

However, if it is 64*36, the migration finishes (the first qemu succeeds
and stops) but the new guest does not resume.


>>
>> (qemu) info migrate
>> capabilities: xbzrle: off
>> Migration status: completed
>> total time: 14538 milliseconds
>> downtime: 1273 milliseconds
>> transferred ram: 389961 kbytes
>> remaining ram: 0 kbytes
>> total ram: 1065024 kbytes
>> duplicate: 181949 pages
>> normal: 97446 pages
>> normal bytes: 389784 kbytes
>>
>> How many bytes were actually transferred? "duplicate" * 4K = 745MB?
>>
>> Is there any tool in QEMU to see how many pages are used/dirty/etc?
>> "info" does not seem to have any kind of such statistic.
>>
>> btw the new guest did not resume (qemu still responds to commands)
>> but this is probably our problem within the "pseries" platform. What is
>
> Uh, that's a bug, and I'm not sure when it broke.  If the migration
> isn't even working, we're premature in trying to work out why it
> isn't completing when we expect.

Here I wanted to emphasize that I would like to find some way to get
information about how the migration is doing (or went) on the new guest -
there are no statistics about it.


-- 
Alexey


* Re: [Qemu-devel] Testing migration under stress
  2012-11-06  5:22   ` Alexey Kardashevskiy
  2012-11-06  6:55     ` David Gibson
@ 2012-11-06 10:54     ` Orit Wasserman
  1 sibling, 0 replies; 12+ messages in thread
From: Orit Wasserman @ 2012-11-06 10:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: quintela, qemu-devel, David Gibson

On 11/06/2012 07:22 AM, Alexey Kardashevskiy wrote:
> On 02/11/12 23:12, Orit Wasserman wrote:
>> On 11/02/2012 05:10 AM, David Gibson wrote:
>>> Asking for some advice on the list.
>>>
>>> I have prototype savevm and migration support ready for the pseries
>>> machine.  They seem to work under simple circumstances (idle guest).
>>> To test them more extensively I've been attempting to perform live
>>> migrations (just over tcp->localhost) while the guest is active with
>>> something.  In particular I've tried while using Octave to do a matrix
>>> multiply (so exercising the FP unit), and my colleague Alexey has tried
>>> during some video encoding.
>>>
>> As you are doing local migration, one option is to set the speed higher
>> than line speed, as we don't actually send the data; another is to set a high downtime.
>>
>>> However, in each of these cases, we've found that the migration only
>>> completes and the source instance only stops after the intensive
>>> workload has (just) completed.  What I surmise is happening is that
>>> the workload is touching memory pages fast enough that the ram
>>> migration code is never getting below the threshold to complete the
>>> migration until the guest is idle again.
>>>
>> The workload you chose is really bad for live migration, as all the guest does is
>> dirtying its memory. I recommend looking for a workload that does some networking or disk IO.
>> Vinod succeeded in running the SwingBench and SLOB benchmarks, which converged OK; I don't
>> know if they run on pseries, but a similar workload should be OK (small database/warehouse).
>> We found that SpecJbb, on the other hand, is hard to converge.
>> Web workloads or video streaming also do the trick.
> 
> 
> My ffmpeg workload is a simple encode from h263+ac3 to h263+ac3, 64*36 pixels. So it should not be dirtying memory too much. Or is it?
> 
> (qemu) info migrate
> capabilities: xbzrle: off
> Migration status: completed
> total time: 14538 milliseconds
> downtime: 1273 milliseconds
> transferred ram: 389961 kbytes
> remaining ram: 0 kbytes
> total ram: 1065024 kbytes
> duplicate: 181949 pages
> normal: 97446 pages
> normal bytes: 389784 kbytes
> 
> How many bytes were actually transferred? "duplicate" * 4K = 745MB?
For a duplicate page we send only one byte plus the page header; those are
usually zero pages. "transferred ram" is the actual number of bytes sent,
so here around 389M was sent.
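
Working through your numbers above (a rough sanity check, assuming 4K pages):

  normal:    97446 pages  * 4K       = 389784 kbytes  (the "normal bytes" line)
  duplicate: 181949 pages * ~1 byte ~= 178 kbytes
  total:     389784 + 178           ~= 389961 kbytes  (the "transferred ram" line)

So your 745MB is what the duplicate pages would have cost had they been sent
in full; in practice they were almost free.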
> 
> Is there any tool in QEMU to see how many pages are used/dirty/etc?
Sadly, no.
> "info" does not seem to have any kind of such statistic.
> 
> btw the new guest did not resume (qemu still responds to commands) but this is probably our problem within the "pseries" platform. What is strange is that "info migrate" on the new guest shows nothing:
> 
> (qemu) info migrate
> (qemu)
> 
the "info migrate" command displays outgoing migration information not incoming ..
> 
> 
> 
>> Cheers,
>> Orit
>>
>>> Does anyone have some ideas for testing this better: workloads that
>>> are less likely to trigger this behaviour, or settings to tweak in the
>>> migration itself to make it more likely to complete migration while
>>> the workload is still active?
>>>
>>
> 
> 


end of thread

Thread overview: 12+ messages
2012-11-02  3:10 [Qemu-devel] Testing migration under stress David Gibson
2012-11-02 12:12 ` Orit Wasserman
2012-11-05  0:30   ` David Gibson
2012-11-05 12:21     ` Orit Wasserman
2012-11-06  1:14       ` David Gibson
2012-11-06  5:22   ` Alexey Kardashevskiy
2012-11-06  6:55     ` David Gibson
2012-11-06  7:55       ` Alexey Kardashevskiy
2012-11-06 10:54     ` Orit Wasserman
2012-11-02 13:04 ` Paolo Bonzini
2012-11-02 13:07 ` Juan Quintela
2012-11-05  0:31   ` David Gibson
