* [Qemu-devel] When does live migration give up?
From: Alex Bligh @ 2013-09-04 15:24 UTC (permalink / raw)
To: qemu-devel; +Cc: Alex Bligh
We have seen a situation when migrating about 50 VMs at once where some
of them fail. I think this is because they are dirtying pages faster than
they can be transmitted.
What algorithm controls when migration fails in this way, and is it
tunable?
I am fully aware one answer to this question is "do not attempt to
migrate 50 busy VMs through a single 1GB/s NIC".
--
Alex Bligh
* Re: [Qemu-devel] When does live migration give up?
From: Paolo Bonzini @ 2013-09-04 17:07 UTC (permalink / raw)
To: Alex Bligh; +Cc: qemu-devel
On 04/09/2013 17:24, Alex Bligh wrote:
> We have seen a situation when migrating about 50 VMs at once where some
> of them fail. I think this is because they are dirtying pages faster than
> they can be transmitted.
No, migration never "gives up". It may never converge, but it keeps
trying until cancelled.
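To picture what "may never converge" means, here is a toy model of iterative pre-copy. It is only an illustrative sketch, not QEMU's actual algorithm: the function name, its parameters, and the fixed "small enough to stop the guest" threshold are all assumptions made for the example. Each pass sends the pages that are currently dirty, and the guest dirties more pages while that pass is on the wire.

```python
def passes_to_converge(total_pages, dirty_rate, bandwidth, threshold,
                       max_passes=1000):
    """Toy pre-copy model (hypothetical, not QEMU code).

    total_pages: guest RAM size in pages
    dirty_rate:  pages dirtied per second by the guest
    bandwidth:   pages sent per second over the migration link
    threshold:   remaining pages at which the final stop-and-copy is acceptable

    Returns the number of passes needed, or None if the model does not
    converge within max_passes.
    """
    remaining = total_pages
    for n in range(1, max_passes + 1):
        send_time = remaining / bandwidth          # seconds spent on this pass
        # Pages dirtied while sending, capped at the size of guest RAM.
        remaining = min(total_pages, dirty_rate * send_time)
        if remaining <= threshold:
            return n
    return None


if __name__ == "__main__":
    # Dirty rate well below bandwidth: the remaining set shrinks each pass.
    print(passes_to_converge(1000, 100, 1000, 50))
    # Dirty rate above bandwidth: the remaining set never shrinks.
    print(passes_to_converge(1000, 2000, 1000, 50))
```

In this model the remaining set shrinks by the ratio dirty_rate / bandwidth on every pass, so it only converges when the guest dirties pages more slowly than the link can send them. Sharing one link between many simultaneous migrations lowers the bandwidth each one sees.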
Could it be that you are choosing migration server ports from a small
range, and some of them are failing because two migrations pick the same
random port for the destination (which is where the server socket lies)?
Paolo
* Re: [Qemu-devel] When does live migration give up?
From: Alex Bligh @ 2013-09-04 18:05 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, Alex Bligh
Paolo,
--On 4 September 2013 19:07:53 +0200 Paolo Bonzini <pbonzini@redhat.com>
wrote:
> On 04/09/2013 17:24, Alex Bligh wrote:
>> We have seen a situation when migrating about 50 VMs at once where some
>> of them fail. I think this is because they are dirtying pages faster than
>> they can be transmitted.
>
> No, migration never "gives up". It may never converge, but it keeps
> trying until cancelled.
>
> Could it be that you are choosing migration server ports from a small
> range, and some of them are failing because two migrations pick the same
> random port for the destination (which is where the server socket lies)?
Should not be that. We create FDs (which are sockets) and pass them in at
both ends. Approx 10% of migrations die after many minutes on the
customer's platform. This does not appear to happen if migrations are
not carried out 50 at a time.
We appear to be getting something other than 'ms' returned through the
monitoring system. Unhelpfully, what that is is not logged.
Is there anything (apart from the socket closing prematurely) which can
cause a failed migration after many minutes? We've seen problems where
the destination is not set up the same as the source (e.g. different
numbers of NICs) but IIRC that fails much earlier.
To make things easier (cough), this is qemu 1.0 (as shipped with Ubuntu
Precise).
--
Alex Bligh
* Re: [Qemu-devel] When does live migration give up?
From: Paolo Bonzini @ 2013-09-04 18:34 UTC (permalink / raw)
To: Alex Bligh; +Cc: qemu-devel
On 04/09/2013 20:05, Alex Bligh wrote:
> Paolo,
>
> --On 4 September 2013 19:07:53 +0200 Paolo Bonzini <pbonzini@redhat.com>
> wrote:
>
>> On 04/09/2013 17:24, Alex Bligh wrote:
>>> We have seen a situation when migrating about 50 VMs at once where some
>>> of them fail. I think this is because they are dirtying pages faster
>>> than
>>> they can be transmitted.
>>
>> No, migration never "gives up". It may never converge, but it keeps
>> trying until cancelled.
>>
>> Could it be that you are choosing migration server ports from a small
>> range, and some of them are failing because two migrations pick the same
>> random port for the destination (which is where the server socket lies)?
>
> Should not be that. We create FDs (which are sockets) and pass them in at
> both ends.
Do you mean something like this?
    destination
        socket()
        bind() to { sin_port = 0, sin_addr.s_addr = INADDR_ANY }
        listen()
        getsockname()
        send address to source
        accept()
        start QEMU with file descriptor returned by accept

    source
        read address
        socket()
        connect()
        pass socket file descriptor to QEMU and migrate to it
Anything that doesn't use sin_port = 0 and getsockname() is prone to
race conditions.
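The sequence above can be sketched with Python's socket module. This is a minimal illustration on the loopback interface, not the orchestration code either party actually runs. The function names are hypothetical, and handing the resulting descriptors to QEMU is deliberately left out.

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Destination side: bind to port 0 so the kernel picks a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))        # sin_port = 0: the kernel chooses an unused port
    srv.listen(1)
    # getsockname() reports the port the kernel actually assigned; this is
    # the address that would be sent to the source.
    return srv, srv.getsockname()


def connect_to(addr):
    """Source side: connect to the address reported by the destination."""
    return socket.create_connection(addr)


if __name__ == "__main__":
    srv, addr = listen_on_ephemeral_port()
    src = connect_to(addr)          # source connects
    conn, peer = srv.accept()       # destination accepts
    # conn.fileno() and src.fileno() are the descriptors that would be
    # passed to the destination and source QEMU processes respectively.
    print("destination listening on", addr, "accepted from", peer)
    for s in (conn, src, srv):
        s.close()
```

Because the kernel assigns the port at bind() time, two destinations started at the same moment cannot be given the same one, which is the property the pseudocode relies on.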
> Approx 10% of migrations die after many minutes on the
> customer's platform. This does not appear to happen if migrations are
> not carried out 50 at a time.
Dying after many minutes usually means that the destination is not set
up the same as the source, as you said below.
Paolo
* Re: [Qemu-devel] When does live migration give up?
From: Alex Bligh @ 2013-09-04 18:35 UTC (permalink / raw)
To: Alex Bligh, Paolo Bonzini; +Cc: qemu-devel
--On 4 September 2013 19:05:50 +0100 Alex Bligh <alex@alex.org.uk> wrote:
> We appear to be getting something other than 'ms' returned through the
> monitoring system. Unhelpfully what that is is not logged.
By which I mean: we seem to be getting "status: failed" on the send side.
--
Alex Bligh
* Re: [Qemu-devel] When does live migration give up?
From: Alex Bligh @ 2013-09-04 22:37 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, Alex Bligh
Paolo,
>
> Do you mean something like this?
>
>     destination
>         socket()
>         bind() to { sin_port = 0, sin_addr.s_addr = INADDR_ANY }
>         listen()
>         getsockname()
>         send address to source
>         accept()
>         start QEMU with file descriptor returned by accept
>
>     source
>         read address
>         socket()
>         connect()
>         pass socket file descriptor to QEMU and migrate to it
>
> Anything that doesn't use sin_port = 0 and getsockname() is prone to
> race conditions.
From memory, we bind() to a specific randomly chosen port and, if
that fails, retry until bind() succeeds. This is because we
want the port to be within a given range. I believe that is
race free, as only one bind() to a given port can succeed.
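As a rough sketch of that approach (reconstructed from the description above, not our actual code; the function name, the loopback default, and the attempt limit are assumptions for illustration):

```python
import errno
import random
import socket


def bind_in_range(low, high, host="127.0.0.1", attempts=100):
    """Bind a listening socket to a random port in [low, high], retrying
    on collision. Hypothetical helper for illustration only."""
    for _ in range(attempts):
        port = random.randint(low, high)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # SO_REUSEADDR / SO_REUSEPORT are deliberately NOT set, so
            # binding a port another socket already holds fails with
            # EADDRINUSE.
            s.bind((host, port))
        except OSError as e:
            s.close()
            if e.errno != errno.EADDRINUSE:
                raise               # unrelated failure, do not mask it
            continue                # port taken, pick another one
        s.listen(1)
        return s
    raise RuntimeError("no free port found in range %d-%d" % (low, high))
```

The collision is detected by bind() itself rather than by checking for a free port beforehand, so there is no window between check and use.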
>> Approx 10% of migrations die after many minutes on the
>> customer's platform. This does not appear to happen if migrations are
>> not carried out 50 at a time.
>
> Dying after many minutes usually means that the destination is not set
> up the same as the source, as you said below.
Hmmm. OK, I thought that produced an immediate error. Is there any way
of logging what is going wrong to stderr or similar?
Alex
--
Alex Bligh