* Possible error restoring machine
From: Frediano Ziglio @ 2012-05-23 9:39 UTC
To: xen-devel@lists.xensource.com

I noted a possible problem restoring a machine.

In xc_domain_restore (xc_domain_restore.c), if it's not the last
checkpoint we set the O_NONBLOCK flag (search for fcntl) so that we can
call pagebuf_get or just load other pages (see the following
"goto loadpages;" line).
Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
(xc_tmem.c), which call read_exact (xc_private.c) on the same
non-blocking socket/file, but read_exact does not handle
EAGAIN/EWOULDBLOCK (either can be returned on a non-blocking socket,
depending on file type and Unix/Linux version), leading to a failure.
Does this make sense or is it impossible?

Also note that rdexact (xc_domain_restore.c) handles data timeouts, but
we can still block in read_exact called by
xc_tmem_restore/xc_tmem_restore_extra.

Last note on rdexact: isn't 1 second (HEARTBEAT_MS) too small if there
are network problems?

Frediano

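For reference, the flag change described in this message can be sketched as follows. This is a minimal illustration of toggling O_NONBLOCK with fcntl, not the code in xc_domain_restore.c; the names set_nonblocking and is_nonblocking are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helper (not taken from libxc): switch O_NONBLOCK on or off
 * for fd while preserving the other file status flags.
 * Returns 0 on success, -1 on error with errno set by fcntl(). */
int set_nonblocking(int fd, int enable)
{
    int flags;

    assert(fd >= 0);  /* a negative fd is treated as a caller bug */

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    if (enable)
        flags |= O_NONBLOCK;
    else
        flags &= ~O_NONBLOCK;

    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}

/* Hypothetical helper: returns 1 if fd is non-blocking, 0 if it is
 * blocking, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

Once the flag is set, every read() on that descriptor, from any caller, can return -1 with EAGAIN or EWOULDBLOCK instead of waiting for data, which is why helpers that share the descriptor need to agree on how to handle it.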
* Re: Possible error restoring machine
From: Ian Campbell @ 2012-05-23 10:25 UTC
To: Frediano Ziglio; +Cc: Shriram Rajagopalan, xen-devel@lists.xensource.com

CCing the Remus maintainer, since all this non-blocking stuff is for
Remus/checkpointing.

On Wed, 2012-05-23 at 10:39 +0100, Frediano Ziglio wrote:
> I noted a possible problem restoring a machine.
>
> In xc_domain_restore (xc_domain_restore.c), if it's not the last
> checkpoint we set the O_NONBLOCK flag (search for fcntl) so that we can
> call pagebuf_get or just load other pages (see the following
> "goto loadpages;" line).
> Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
> (xc_tmem.c), which call read_exact (xc_private.c) on the same
> non-blocking socket/file

There's a bunch of such places in that function; the RDEXACT macro is
also == rdexact, except on MiniOS.

> but read_exact does not handle EAGAIN/EWOULDBLOCK (either can be
> returned on a non-blocking socket, depending on file type and
> Unix/Linux version), leading to a failure.
> Does this make sense or is it impossible?

Isn't this what the if line:

    len = read(fd, buf + offset, size - offset);
    if ( (len == -1) && ((errno == EINTR) || (errno == EAGAIN)) )
        continue;

is doing?

> Also note that rdexact (xc_domain_restore.c) handles data timeouts, but
> we can still block in read_exact called by
> xc_tmem_restore/xc_tmem_restore_extra.

Oh, wait! read_exact != rdexact -- ouch! Those are confusingly similar!

I suspect we need either to pull xc_tmem_{save,restore} into the
appropriate file and use the non-blocking-capable versions, or to export
the non-blocking function, with an improved name, so it can be used from
xc_tmem.c.

Shriram, any thoughts?

> Last note on rdexact: isn't 1 second (HEARTBEAT_MS) too small if there
> are network problems?
>
> Frediano

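The retry loop quoted in this message re-issues read() immediately on EAGAIN, so on a non-blocking descriptor it spins until data arrives and has no way to give up. A rough sketch of what a non-blocking-capable exact read could look like is given below: it waits in poll() and fails with ETIMEDOUT after timeout_ms of silence. The names read_exact_timeout and demo_read_exact_timeout are hypothetical, the timeout applies to each wait rather than to the whole call, and this is an illustration under those assumptions, not libxc's rdexact.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not the libxc implementation.
 * Reads exactly `size` bytes from `fd`, which may be in O_NONBLOCK mode.
 * Returns 0 on success; -1 on error, on EOF (errno cleared to 0), or when
 * no data arrives within timeout_ms of a single wait (errno = ETIMEDOUT). */
int read_exact_timeout(int fd, void *data, size_t size, int timeout_ms)
{
    unsigned char *buf = data;
    size_t offset = 0;

    assert(data != NULL || size == 0);

    while (offset < size) {
        ssize_t len = read(fd, buf + offset, size - offset);

        if (len > 0) {
            offset += (size_t)len;
            continue;
        }
        if (len == 0) {              /* EOF before all bytes arrived */
            errno = 0;
            return -1;
        }
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* No data yet: sleep until the fd is readable instead of spinning. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (rc < 0 && errno != EINTR)
            return -1;
        /* Readable, hung up, or interrupted: go back and retry read(). */
    }
    return 0;
}

/* Small self-check on a non-blocking pipe; returns 0 if the helper
 * delivers available data and times out when nothing is pending. */
int demo_read_exact_timeout(void)
{
    int fds[2];
    char out[4];
    int ok = 0;
    int flags;

    if (pipe(fds) != 0)
        return -1;

    flags = fcntl(fds[0], F_GETFL);
    if (flags != -1 &&
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != -1 &&
        write(fds[1], "xen!", 4) == 4 &&
        read_exact_timeout(fds[0], out, sizeof(out), 100) == 0 &&
        memcmp(out, "xen!", 4) == 0 &&
        read_exact_timeout(fds[0], out, sizeof(out), 10) == -1 &&
        errno == ETIMEDOUT)
        ok = 1;

    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

Whether a per-wait timeout or an overall deadline is the better fit depends on how the caller interprets silence from the sender; the sketch uses the simpler per-wait form.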
* Re: Possible error restoring machine
From: Frediano Ziglio @ 2012-05-23 11:37 UTC
To: Ian Campbell; +Cc: rshriram@cs.ubc.ca, xen-devel@lists.xensource.com

On Wed, 2012-05-23 at 11:25 +0100, Ian Campbell wrote:
> CCing the Remus maintainer, since all this non-blocking stuff is for
> Remus/checkpointing.
>
> On Wed, 2012-05-23 at 10:39 +0100, Frediano Ziglio wrote:
> > I noted a possible problem restoring a machine.
> >
> > In xc_domain_restore (xc_domain_restore.c), if it's not the last
> > checkpoint we set the O_NONBLOCK flag (search for fcntl) so that we can
> > call pagebuf_get or just load other pages (see the following
> > "goto loadpages;" line).
> > Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
> > (xc_tmem.c), which call read_exact (xc_private.c) on the same
> > non-blocking socket/file
>
> There's a bunch of such places in that function; the RDEXACT macro is
> also == rdexact, except on MiniOS.
>
> > but read_exact does not handle EAGAIN/EWOULDBLOCK (either can be
> > returned on a non-blocking socket, depending on file type and
> > Unix/Linux version), leading to a failure.
> > Does this make sense or is it impossible?
>
> Isn't this what the if line:
>     len = read(fd, buf + offset, size - offset);
>     if ( (len == -1) && ((errno == EINTR) || (errno == EAGAIN)) )
>         continue;
>
> is doing?
>
> > Also note that rdexact (xc_domain_restore.c) handles data timeouts, but
> > we can still block in read_exact called by
> > xc_tmem_restore/xc_tmem_restore_extra.
>
> Oh, wait! read_exact != rdexact -- ouch! Those are confusingly similar!
>
> I suspect we need either to pull xc_tmem_{save,restore} into the
> appropriate file and use the non-blocking-capable versions, or to export
> the non-blocking function, with an improved name, so it can be used from
> xc_tmem.c.
>

I was working on a patch to reduce CPU usage and the number of read
calls by buffering io_fd. It currently works but is not yet good enough
to post.

> Shriram, any thoughts?
>
> > Last note on rdexact: isn't 1 second (HEARTBEAT_MS) too small if there
> > are network problems?

Frediano

* Re: Possible error restoring machine
From: Shriram Rajagopalan @ 2012-05-23 13:30 UTC
To: Frediano Ziglio; +Cc: xen-devel@lists.xensource.com, Ian Campbell

On Wed, May 23, 2012 at 5:39 AM, Frediano Ziglio <frediano.ziglio@citrix.com> wrote:
> I noted a possible problem restoring a machine.
>
> In xc_domain_restore (xc_domain_restore.c), if it's not the last
> checkpoint we set the O_NONBLOCK flag (search for fcntl) so that we can
> call pagebuf_get or just load other pages (see the following
> "goto loadpages;" line).
> Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
> (xc_tmem.c), which call read_exact (xc_private.c) on the same
> non-blocking socket/file, but read_exact does not handle
> EAGAIN/EWOULDBLOCK (either can be returned on a non-blocking socket,
> depending on file type and Unix/Linux version), leading to a failure.
> Does this make sense or is it impossible?
>

It certainly is possible. But again, I have never seen anyone use tmem
with Remus. I don't even know if it would work properly, even if we fix
the read_exact code to handle non-blocking fds.

For the normal live-migration scenario, the O_NONBLOCK change does not
happen. So RDEXACT == rdexact == read_exact, output-wise.

> Also note that rdexact (xc_domain_restore.c) handles data timeouts, but
> we can still block in read_exact called by
> xc_tmem_restore/xc_tmem_restore_extra.
>

Yep. Only in the Remus case. As stated above, I haven't come across
anyone using Remus + tmem and don't know if it would work properly. I
don't know the semantics of tmem well enough to comment on Remus + tmem,
whether it makes sense or not, etc.

> Last note on rdexact: isn't 1 second (HEARTBEAT_MS) too small if there
> are network problems?
>

This won't be a problem for live migration, because that timeout code is
within the if (ctx->completed) { } block. It only becomes active when
Remus is enabled, i.e. ctx->last_checkpoint = 0. Otherwise, the read
call is still blocking.

> Frediano

* Re: Possible error restoring machine
From: Dan Magenheimer @ 2012-05-23 14:15 UTC
To: rshriram, Frediano Ziglio; +Cc: xen-devel, Ian Campbell

> From: Shriram Rajagopalan [mailto:rshriram@cs.ubc.ca]
> Subject: Re: [Xen-devel] Possible error restoring machine
>
> Yep. Only in the Remus case. As stated above, I haven't come across
> anyone using Remus + tmem and don't know if it would work properly. I
> don't know the semantics of tmem well enough to comment on Remus + tmem,
> whether it makes sense or not, etc.

An interesting question... From what I remember about Remus (it's been
a few years since I looked at it), I think they can't co-exist. To
Remus, tmem is like a hidden hypervisor-private local disk, and the
writes to it don't get captured/replicated by Remus. I think this is
fixable, but I don't think the fix would be easy.

But this is just a few seconds of thought, so I may be all wrong.

Dan