* Possible error restoring machine
@ 2012-05-23 9:39 Frediano Ziglio
2012-05-23 10:25 ` Ian Campbell
2012-05-23 13:30 ` Shriram Rajagopalan
0 siblings, 2 replies; 5+ messages in thread
From: Frediano Ziglio @ 2012-05-23 9:39 UTC (permalink / raw)
To: xen-devel@lists.xensource.com
I noticed a possible problem when restoring a machine.
In xc_domain_restore (xc_domain_restore.c), if it's not the last
checkpoint we set the O_NONBLOCK flag (search for fcntl) and then can call
pagebuf_get or just load other pages (see the following "goto loadpages;"
line).
Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
(xc_tmem.c), which call read_exact (xc_private.c) on the same
non-blocking socket/file, but read_exact does not handle EAGAIN/EWOULDBLOCK
(either can be returned on a non-blocking socket depending on file type and
Unix/Linux version), leading to a failure.
Does this make sense or is it impossible?
Also note that rdexact (xc_domain_restore.c) handles the data timeout, but we
can still block in read_exact called by
xc_tmem_restore/xc_tmem_restore_extra.
Last note on rdexact: isn't 1 second (HEARTBEAT_MS) too small if there
are network problems?
Frediano
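To make the failure mode above concrete, here is a minimal sketch. It is not the libxc source: set_nonblocking and naive_read_exact are hypothetical names, and the loop only assumes what the message describes, namely an exact-read loop that retries on EINTR but treats every other error, including EAGAIN/EWOULDBLOCK, as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode with fcntl,
 * as the restore path is described as doing. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Illustrative exact-read loop that retries only on EINTR.
 * Assumption: this mirrors the behaviour described in the thread,
 * it is not copied from xc_private.c. */
static int naive_read_exact(int fd, void *data, size_t size)
{
    size_t offset = 0;

    assert(data != NULL);
    while (offset < size) {
        ssize_t len = read(fd, (char *)data + offset, size - offset);
        if (len == -1 && errno == EINTR)
            continue;
        if (len == 0)
            errno = 0;      /* EOF before all bytes arrived */
        if (len <= 0)
            return -1;      /* also hit for EAGAIN/EWOULDBLOCK */
        offset += (size_t)len;
    }
    return 0;
}
```

With a non-blocking descriptor that has only part of the requested data available, the second read() fails with EAGAIN and the whole call is reported as an error, even though the rest of the data may simply not have arrived yet.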
* Re: Possible error restoring machine
2012-05-23 9:39 Possible error restoring machine Frediano Ziglio
@ 2012-05-23 10:25 ` Ian Campbell
2012-05-23 11:37 ` Frediano Ziglio
2012-05-23 13:30 ` Shriram Rajagopalan
1 sibling, 1 reply; 5+ messages in thread
From: Ian Campbell @ 2012-05-23 10:25 UTC (permalink / raw)
To: Frediano Ziglio; +Cc: Shriram Rajagopalan, xen-devel@lists.xensource.com
CCing the Remus maintainer since all this non-blocking stuff is for
Remus/checkpointing.
On Wed, 2012-05-23 at 10:39 +0100, Frediano Ziglio wrote:
> I noted a possible problem restoring a machine.
>
> In xc_domain_restore (xc_domain_restore.c) if it's not the last
> checkpoint we set O_NONBLOCK flag (search for fcntl) that we can call
> pagebuf_get or just load other pages (see following "goto loadpages;"
> line).
> Now we could ending up calling xc_tmem_restore/xc_tmem_restore_extra
> (xc_tmem.c) which call read_extract (xc_private.c) on the same non
> blocking socket/file
There's a bunch of such places in that function; the RDEXACT macro is
also == rdexact except on MiniOS.
> but read_extract does not handle EAGAIN/EWOULDBLOCK
> (both can be returned on non blocking socket depending on file type and
> Unix/Linux version) leading to a failure.
> Does this make sense or is it impossible ??
Isn't this what the if line:

    len = read(fd, buf + offset, size - offset);
    if ( (len == -1) && ((errno == EINTR) || (errno == EAGAIN)) )
        continue;

is doing?
> Also note that rdexact (xc_domain_restore.c) handle data timeout but we
> can still block in read_exact called by
> xc_tmem_restore/xc_tmem_restore_extra.
Oh, wait! read_exact != rdexact -- ouch! Those are confusingly similar!
I suspect we need either to pull xc_tmem_{save,restore} into the
appropriate file and use the non-blocking-capable versions, or to export
the non-blocking function, with an improved name, so it can be used from
xc_tmem.c.
Shriram, any thoughts?
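As a rough illustration of the second option, the sketch below shows what an exported, non-blocking-capable exact read might look like. The name read_exact_timeout and its signature are assumptions made for this example, not an existing libxc interface, and poll() is used here for brevity rather than because the existing code does so.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical name and signature, not an existing libxc function.
 * Reads exactly size bytes from fd, which may be non-blocking.
 * timeout_ms applies to each wait, not to the whole call; a negative
 * value waits forever. Returns 0 on success, -1 on error, EOF, or
 * timeout (errno set to ETIMEDOUT). */
static int read_exact_timeout(int fd, void *data, size_t size, int timeout_ms)
{
    size_t offset = 0;

    assert(data != NULL);
    while (offset < size) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready == -1 && errno == EINTR)
            continue;
        if (ready == -1)
            return -1;
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }

        ssize_t len = read(fd, (char *)data + offset, size - offset);
        if (len == -1 &&
            (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK))
            continue;       /* spurious wakeup or signal: wait again */
        if (len == 0)
            errno = 0;      /* EOF before all bytes arrived */
        if (len <= 0)
            return -1;
        offset += (size_t)len;
    }
    return 0;
}
```

Note that bytes already consumed before a timeout are lost to the caller in this sketch; a real implementation would need to decide how a partial read is reported.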
>
> Last note on rdexact, isn't 1 second (HEARTBEAT_MS) too small if there
> are network problems?
>
> Frediano
>
* Re: Possible error restoring machine
2012-05-23 10:25 ` Ian Campbell
@ 2012-05-23 11:37 ` Frediano Ziglio
0 siblings, 0 replies; 5+ messages in thread
From: Frediano Ziglio @ 2012-05-23 11:37 UTC (permalink / raw)
To: Ian Campbell; +Cc: rshriram@cs.ubc.ca, xen-devel@lists.xensource.com
On Wed, 2012-05-23 at 11:25 +0100, Ian Campbell wrote:
> CCiong the Remus maintainer since all this non-blocking stuff is for
> remus/checkpointing.
>
> On Wed, 2012-05-23 at 10:39 +0100, Frediano Ziglio wrote:
> > I noted a possible problem restoring a machine.
> >
> > In xc_domain_restore (xc_domain_restore.c) if it's not the last
> > checkpoint we set O_NONBLOCK flag (search for fcntl) that we can call
> > pagebuf_get or just load other pages (see following "goto loadpages;"
> > line).
> > Now we could ending up calling xc_tmem_restore/xc_tmem_restore_extra
> > (xc_tmem.c) which call read_extract (xc_private.c) on the same non
> > blocking socket/file
>
> There's a bunch of such places in that function, the RDEXACT macro is
> also == rdexact except on Minios.
>
> > but read_extract does not handle EAGAIN/EWOULDBLOCK
> > (both can be returned on non blocking socket depending on file type and
> > Unix/Linux version) leading to a failure.
> > Does this make sense or is it impossible ??
>
> Isn't this what the if line:
> len = read(fd, buf + offset, size - offset);
> if ( (len == -1) && ((errno == EINTR) || (errno == EAGAIN)) )
> continue;
>
> is doing?
>
> > Also note that rdexact (xc_domain_restore.c) handle data timeout but we
> > can still block in read_exact called by
> > xc_tmem_restore/xc_tmem_restore_extra.
>
> Oh, wait! read_exact != rdexact -- ouch! Those are confusingly similar!
>
> I suspect we need to pull the xc_tmem_{save,restore} into the
> appropriate file and use the non-blocking capable versions or to export
> the non-blocking function, with an improved name, so it can be used from
> xc_tmem.c.
>
I was working on a patch that tries to reduce CPU usage and the number of
read calls by buffering io_fd.
It currently works but is not yet good enough to post.
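The patch itself is not shown in this thread. Purely as an illustration of the general idea, a buffered reader along the following lines issues one large read() and then serves several small requests from memory. The struct, the function names, and the 4096-byte buffer size are all assumptions, and the sketch assumes a blocking descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical buffered reader; not taken from the patch mentioned
 * above. Assumes fd is in blocking mode. */
struct fd_buffer {
    int fd;
    size_t start;               /* index of the first unread byte */
    size_t end;                 /* one past the last valid byte */
    unsigned char data[4096];   /* illustrative size */
};

static void fd_buffer_init(struct fd_buffer *b, int fd)
{
    assert(b != NULL);
    b->fd = fd;
    b->start = b->end = 0;
}

/* Copy exactly size bytes to out, calling read() only when the buffer
 * is empty. Returns 0 on success, -1 on error or EOF. */
static int fd_buffer_read_exact(struct fd_buffer *b, void *out, size_t size)
{
    unsigned char *dst = out;

    assert(b != NULL && out != NULL);
    while (size > 0) {
        if (b->start == b->end) {
            ssize_t len = read(b->fd, b->data, sizeof(b->data));
            if (len == -1 && errno == EINTR)
                continue;
            if (len <= 0)
                return -1;
            b->start = 0;
            b->end = (size_t)len;
        }
        size_t avail = b->end - b->start;
        size_t n = avail < size ? avail : size;
        memcpy(dst, b->data + b->start, n);
        b->start += n;
        dst += n;
        size -= n;
    }
    return 0;
}
```

One design consequence: once a buffer like this exists, every reader of the descriptor has to go through it, otherwise data already pulled into the buffer is invisible to direct read() calls.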
> Shriram, any thoughts?
>
> >
> > Last note on rdexact, isn't 1 second (HEARTBEAT_MS) too small if there
> > are network problems?
> >
Frediano
* Re: Possible error restoring machine
2012-05-23 9:39 Possible error restoring machine Frediano Ziglio
2012-05-23 10:25 ` Ian Campbell
@ 2012-05-23 13:30 ` Shriram Rajagopalan
2012-05-23 14:15 ` Dan Magenheimer
1 sibling, 1 reply; 5+ messages in thread
From: Shriram Rajagopalan @ 2012-05-23 13:30 UTC (permalink / raw)
To: Frediano Ziglio; +Cc: xen-devel@lists.xensource.com, Ian Campbell
On Wed, May 23, 2012 at 5:39 AM, Frediano Ziglio <frediano.ziglio@citrix.com>
wrote:
> I noted a possible problem restoring a machine.
>
> In xc_domain_restore (xc_domain_restore.c) if it's not the last
> checkpoint we set O_NONBLOCK flag (search for fcntl) that we can call
> pagebuf_get or just load other pages (see following "goto loadpages;"
> line).
> Now we could ending up calling xc_tmem_restore/xc_tmem_restore_extra
> (xc_tmem.c) which call read_extract (xc_private.c) on the same non
> blocking socket/file but read_extract does not handle EAGAIN/EWOULDBLOCK
> (both can be returned on non blocking socket depending on file type and
> Unix/Linux version) leading to a failure.
> Does this make sense or is it impossible ??
>
It certainly is possible. But again, I have never seen anyone use tmem with
Remus. I don't even know if it would work properly, even if we fix the
read_exact code to handle non-blocking fds.
For the normal live-migration scenario, the O_NONBLOCK change does not
happen.
So, RDEXACT == rdexact == read_exact, output-wise.
> Also note that rdexact (xc_domain_restore.c) handle data timeout but we
> can still block in read_exact called by
> xc_tmem_restore/xc_tmem_restore_extra.
>
Yep. Only in the Remus case. As stated above, I haven't come across anyone
using Remus + tmem and don't know if it would work properly. I don't
know the semantics of tmem well enough to comment on Remus + tmem, whether
it makes sense or not, etc.
> Last note on rdexact, isn't 1 second (HEARTBEAT_MS) too small if there
> are network problems?
>
This won't be a problem for live migration, because that timeout code
is within the if (ctx->completed) { } block. It only becomes active when
Remus is enabled, i.e. ctx->last_checkpoint = 0. Otherwise, the read call is
still blocking.
> Frediano
>
* Re: Possible error restoring machine
2012-05-23 13:30 ` Shriram Rajagopalan
@ 2012-05-23 14:15 ` Dan Magenheimer
0 siblings, 0 replies; 5+ messages in thread
From: Dan Magenheimer @ 2012-05-23 14:15 UTC (permalink / raw)
To: rshriram, Frediano Ziglio; +Cc: xen-devel, Ian Campbell
> From: Shriram Rajagopalan [mailto:rshriram@cs.ubc.ca]
> Subject: Re: [Xen-devel] Possible error restoring machine
>
> Yep. Only in Remus case. As stated above, havent come across anyone
> using Remus + tmem and/or dont know if it would work properly. I dont
> know the semantics of tmem enough to comment on remus+tmem, whether
> it makes sense or not, etc..
An interesting question... from what I remember about Remus
(it's been a few years now since I looked at it), they
can't co-exist I think. To Remus, tmem is like a hidden
hypervisor-private local disk and the writes to it don't
get captured/replicated by Remus. I think this is fixable
but I don't think the fix would be easy.
But this is just a few seconds of thought, so I may be
all wrong.
Dan
Thread overview: 5+ messages
2012-05-23 9:39 Possible error restoring machine Frediano Ziglio
2012-05-23 10:25 ` Ian Campbell
2012-05-23 11:37 ` Frediano Ziglio
2012-05-23 13:30 ` Shriram Rajagopalan
2012-05-23 14:15 ` Dan Magenheimer