From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([140.186.70.92]:60599) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Qp6Gr-0005sf-I3 for qemu-devel@nongnu.org; Thu, 04 Aug 2011 18:20:06 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Qp6Gp-00073T-U7 for qemu-devel@nongnu.org; Thu, 04 Aug 2011 18:20:05 -0400 Received: from fmmailgate02.web.de ([217.72.192.227]:54733) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Qp6Gp-00073E-GT for qemu-devel@nongnu.org; Thu, 04 Aug 2011 18:20:03 -0400 Message-ID: <4E3B1B07.2000400@web.de> Date: Fri, 05 Aug 2011 00:19:51 +0200 From: Jan Kiszka MIME-Version: 1.0 References: <1312383104-9565-1-git-send-email-mjt@msgid.tls.msk.ru> <20110804162441.0045215c@doriath> <4E3AF860.2070005@msgid.tls.msk.ru> <4E3AFA5F.4040708@msgid.tls.msk.ru> In-Reply-To: <4E3AFA5F.4040708@msgid.tls.msk.ru> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigDBADE4D9E18C63B88CAF2BDB" Sender: jan.kiszka@web.de Subject: Re: [Qemu-devel] [PATCH] do not call monitor_resume() from migrate_fd_put_buffer() error path List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Michael Tokarev Cc: mtosatti@redhat.com, qemu-devel@nongnu.org, Luiz Capitulino This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigDBADE4D9E18C63B88CAF2BDB Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 2011-08-04 22:00, Michael Tokarev wrote: > 04.08.2011 23:52, Michael Tokarev =D0=BF=D0=B8=D1=88=D0=B5=D1=82: >> 04.08.2011 23:24, Luiz Capitulino =D0=BF=D0=B8=D1=88=D0=B5=D1=82: >>> On Wed, 3 Aug 2011 18:51:44 +0400 >>> Michael Tokarev wrote: >>> >>>> If we do, it results in double monitor_resume() (second being called= >>>> from migrate_fd_cleanup() anyway) and monitor suspend count becoming= >>>> negative. >>> >>> Are you hitting an specific issue or did you find this by code inspec= tion? >> >> The specific case I'm talking about can trivially be triggered. >> Run migrate "exec:nosuchfile" and your monitor will be stuck >> forever. >> >>> IIRC, I asked Marcelo to add the monitor_resume() call in the fix for= >>> e447b1a603 because the monitor wasn't being resumed in some cases. Do= n't >>> remember which though, do you Marcelo? >> >> And this (e447b1a603) is actually exactly the same thing: when the >> child terminates before qemu completes sending stuff to it. >> >>> I see two possibilities here: >>> >>> 1. After e447b1a603 there was some change that made the monitor_resu= me() call >>> in migrate_fd_put_buffer() unnecessary >>> >>> 2. We're calling it in the wrong place >>> >>> Taking a quick look at the code I see that migrate_fd_cleanup() doesn= 't >>> seem to be called when qemu_savevm_state_iterate() fails, for example= =2E >> >> Yes indeed, but that will lead to a havoc in other places, like >> leaving callback notifier registered even after migration gets >> cancelled, or keeping the file open. >> >> Note that qemu_savevm_state_iterate() gets called from a callback >> routine. >> >> What I observe here is that when the exec: process exits prematurely, >> qemu will continue writing stuff to pipe up till migrate_fd_put_buffer= () >> hits error (EPIPE iirc), at which point we'll call mon->resume() twice= =2E >> The same happens in case of tcp migration when the other end closes >> the connection. >> >> I don't know when qemu_savevm_state_iterate() may fail, or why >> qemu_file_has_error() in there does not return true. >> >> Maybe the problem here is much more severe actually, -- that's why >> I initially asked what it is all about, -- before offering the patch. >=20 > This looks like a race condition: which of the two places > will detect the error first. >=20 > Sometimes I see this (-monitor stdio): >=20 > (qemu) migrate "exec:nothing" > sh: nothing: not found >=20 > (qemu) >=20 > and after this, monitor is stuck as I described. But sometimes > I see this: >=20 > (qemu) migrate "exec:nothing" > sh: nothing: not found > (qemu) >=20 > And it continues working. Note the extra newline. > If the error gets detected sooner (like in case of > exec:nothing, or even better, after removing > /bin/sh so that system(3) will fail), the chance > to have stuck monitor is much higher. If the > error gets detected later, it will survive most > likely. Or something like that anyway. Resume on error in migrate_fd_put_buffer raced with migrate_fd_cleanup triggered via migrate_fd_put_ready called from migrate_fd_connect. This migration code is a horrible maze. Patch below tries to move monitor_resume a bit out of this. Please check if it works for you, then I'll send it out properly. Thanks, Jan --- diff --git a/migration.c b/migration.c index 2a15b98..756fa62 100644 --- a/migration.c +++ b/migration.c @@ -292,18 +292,17 @@ int migrate_fd_cleanup(FdMigrationState *s) ret =3D -1; } s->file =3D NULL; + } else { + if (s->mon) { + monitor_resume(s->mon); + } } =20 - if (s->fd !=3D -1) + if (s->fd !=3D -1) { close(s->fd); - - /* Don't resume monitor until we've flushed all of the buffers */ - if (s->mon) { - monitor_resume(s->mon); + s->fd =3D -1; } =20 - s->fd =3D -1; - return ret; } =20 @@ -330,9 +329,6 @@ ssize_t migrate_fd_put_buffer(void *opaque, const voi= d *data, size_t size) if (ret =3D=3D -EAGAIN) { qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s= ); } else if (ret < 0) { - if (s->mon) { - monitor_resume(s->mon); - } s->state =3D MIG_STATE_ERROR; notifier_list_notify(&migration_state_notifiers, NULL); } @@ -458,6 +454,9 @@ int migrate_fd_close(void *opaque) { FdMigrationState *s =3D opaque; =20 + if (s->mon) { + monitor_resume(s->mon); + } qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); return s->close(s); } --------------enigDBADE4D9E18C63B88CAF2BDB Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.16 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk47GxAACgkQitSsb3rl5xSUrACfd7DxPj3dWmwL+2vIPI+xzl3I +QsAnjw8EP8K2p5Yi++IAAt02tkDwuGB =B7AW -----END PGP SIGNATURE----- --------------enigDBADE4D9E18C63B88CAF2BDB--