From mboxrd@z Thu Jan 1 00:00:00 1970
From: Shriram Rajagopalan
Subject: Re: Possible error restoring machine
Date: Wed, 23 May 2012 09:30:41 -0400
Message-ID:
References: <7CE799CC0E4DE04B88D5FDF226E18AC2CDEB431483@LONPMAILBOX01.citrite.net>
Reply-To: rshriram@cs.ubc.ca
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4564090204801295993=="
Return-path:
In-Reply-To: <7CE799CC0E4DE04B88D5FDF226E18AC2CDEB431483@LONPMAILBOX01.citrite.net>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Frediano Ziglio
Cc: "xen-devel@lists.xensource.com" , Ian Campbell
List-Id: xen-devel@lists.xenproject.org

--===============4564090204801295993==
Content-Type: multipart/alternative; boundary=000e0ce04312a8145204c0b4262e

--000e0ce04312a8145204c0b4262e
Content-Type: text/plain; charset=ISO-8859-1

On Wed, May 23, 2012 at 5:39 AM, Frediano Ziglio <frediano.ziglio@citrix.com> wrote:

> I noted a possible problem restoring a machine.
>
> In xc_domain_restore (xc_domain_restore.c) if it's not the last
> checkpoint we set the O_NONBLOCK flag (search for fcntl) so that we can
> call pagebuf_get or just load other pages (see the following
> "goto loadpages;" line).
> Now we could end up calling xc_tmem_restore/xc_tmem_restore_extra
> (xc_tmem.c), which call read_exact (xc_private.c) on the same
> non-blocking socket/file, but read_exact does not handle
> EAGAIN/EWOULDBLOCK (both can be returned on a non-blocking socket,
> depending on file type and Unix/Linux version), leading to a failure.
> Does this make sense or is it impossible ??
>

It certainly is possible. But again, I have never seen anyone use tmem with
Remus. I don't even know if it would work properly, even if we fix the
read_exact code to handle non-blocking fds.

For the normal live-migration scenario, the O_NONBLOCK change does not
happen. So, RDEXACT == rdexact == read_exact, output-wise.

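To make "fix the read_exact code to handle non-blocking fds" concrete, here
is a rough sketch of one way it could be done. This is not the libxc code:
set_nonblocking(), read_exact_nonblock() and the timeout_ms parameter are
hypothetical names used only for illustration, and the sketch assumes a POSIX
system with poll(). It retries on EINTR, waits for readability on
EAGAIN/EWOULDBLOCK, and gives up with ETIMEDOUT if nothing arrives in time.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK on fd, preserving other flags. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper, not part of libxc: read exactly `size` bytes from
 * `fd`, which may or may not have O_NONBLOCK set.
 * Returns 0 on success, -1 on failure. On failure errno is ETIMEDOUT if
 * no data arrived within timeout_ms during a single wait, 0 if the peer
 * closed the stream early, or the error reported by read()/poll().
 */
static int read_exact_nonblock(int fd, void *buf, size_t size, int timeout_ms)
{
    unsigned char *p = buf;
    size_t offset = 0;

    assert(buf != NULL || size == 0);

    while (offset < size) {
        ssize_t len = read(fd, p + offset, size - offset);

        if (len > 0) {
            offset += (size_t)len;
            continue;
        }
        if (len == 0) {
            /* End of stream before all the requested bytes arrived. */
            errno = 0;
            return -1;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* No data yet: wait until readable, or time out. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            int rc = poll(&pfd, 1, timeout_ms);

            if (rc == 0) {
                errno = ETIMEDOUT;
                return -1;
            }
            if (rc < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1; /* Any other read error is treated as fatal. */
    }
    return 0;
}
```

Note that the timeout here applies to each wait rather than to the whole
read, so a slow but steady sender would not trip it. Whether that is the
right policy for Remus + tmem is a separate question.
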
> Also note that rdexact (xc_domain_restore.c) handles data timeout but we
> can still block in read_exact called by
> xc_tmem_restore/xc_tmem_restore_extra.
>

Yep. Only in the Remus case. As stated above, I haven't come across anyone
using Remus + tmem, and I don't know if it would work properly. I don't
know the semantics of tmem well enough to comment on Remus + tmem, whether
it makes sense or not, etc.

> Last note on rdexact, isn't 1 second (HEARTBEAT_MS) too small if there
> are network problems?
>

This won't be a problem for live migration, because that timeout code
is within the if (ctx->completed) { } block. It only becomes active when
Remus is enabled, i.e. ctx->last_checkpoint = 0. Otherwise, the read call is
still blocking.

> Frediano
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

--000e0ce04312a8145204c0b4262e
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

--000e0ce04312a8145204c0b4262e--

--===============4564090204801295993==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============4564090204801295993==--