From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:35894) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bfSK8-0006Eb-K1 for qemu-devel@nongnu.org; Thu, 01 Sep 2016 09:46:37 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bfSK3-0007ki-0V for qemu-devel@nongnu.org; Thu, 01 Sep 2016 09:46:31 -0400
Received: from mx1.redhat.com ([209.132.183.28]:45050) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bfSK2-0007kY-OL for qemu-devel@nongnu.org; Thu, 01 Sep 2016 09:46:26 -0400
Date: Thu, 1 Sep 2016 16:46:23 +0300
From: "Michael S. Tsirkin"
Message-ID: <20160901164131-mutt-send-email-mst@kernel.org>
References: <1470842980-32481-1-git-send-email-mst@redhat.com> <1470842980-32481-4-git-send-email-mst@redhat.com> <20160812063828.GG2759@al.usersys.redhat.com> <20160814054808-mutt-send-email-mst@kernel.org> <09A27EBF-F644-45E7-949D-A5D55AE3BCB5@nutanix.com> <49c5ce42-8fa8-964d-920f-2f4126d6a229@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <49c5ce42-8fa8-964d-920f-2f4126d6a229@redhat.com>
Subject: Re: [Qemu-devel] [PULL 3/3] vhost-user: Attempt to fix a race with set_mem_table.
To: Maxime Coquelin
Cc: Prerna Saxena , "marcandre.lureau@redhat.com" , Peter Maydell , Fam Zheng , "qemu-devel@nongnu.org"

On Wed, Aug 31, 2016 at 01:19:47PM +0200, Maxime Coquelin wrote:
>
> On 08/14/2016 11:42 AM, Prerna Saxena wrote:
> > On 14/08/16 8:21 am, "Michael S. Tsirkin" wrote:
> >
> > > On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
> > > >
> > > > On 12/08/16 12:08 pm, "Fam Zheng" wrote:
> > > >
> > > > > On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
> > > > > > From: Prerna Saxena
> > > > > >
> > > > > > The set_mem_table command currently does not seek a reply. Hence, there is
> > > > > > no easy way for a remote application to notify to QEMU when it finished
> > > > > > setting up memory, or if there were errors doing so.
> > > > > >
> > > > > > As an example:
> > > > > > (1) Qemu sends a SET_MEM_TABLE to the backend (eg, a vhost-user net
> > > > > > application). SET_MEM_TABLE does not require a reply according to the spec.
> > > > > > (2) Qemu commits the memory to the guest.
> > > > > > (3) Guest issues an I/O operation over a new memory region which was configured on (1).
> > > > > > (4) The application has not yet remapped the memory, but it sees the I/O request.
> > > > > > (5) The application cannot satisfy the request because it does not know about those GPAs.
> > > > > >
> > > > > > While a guaranteed fix would require a protocol extension (committed separately),
> > > > > > a best-effort workaround for existing applications is to send a GET_FEATURES
> > > > > > message before completing the vhost_user_set_mem_table() call.
> > > > > > Since GET_FEATURES requires a reply, an application that processes vhost-user
> > > > > > messages synchronously would probably have completed the SET_MEM_TABLE before replying.
> > > > > >
> > > > > > Signed-off-by: Prerna Saxena
> > > > > > Reviewed-by: Michael S. Tsirkin
> > > > > > Signed-off-by: Michael S. Tsirkin
> > > > >
> > > > > Sporadic hangs are seen with test-vhost-user after this patch:
> > > > >
> > > > > https://travis-ci.org/qemu/qemu/builds
> > > > >
> > > > > Reverting seems to fix it for me.
> > > > >
> > > > > Is this a known problem?
> > > > >
> > > > > Fam
> > > >
> > > > Hi Fam,
> > > > Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on my Centos 6 environment, so missed this.
> > > > I am setting up the docker test env to repro this, but I think I can guess the problem :
> > > >
> > > > In tests/vhost-user-test.c:
> > > >
> > > > static void chr_read(void *opaque, const uint8_t *buf, int size)
> > > > {
> > > > ..[snip]..
> > > >
> > > >     case VHOST_USER_SET_MEM_TABLE:
> > > >         /* received the mem table */
> > > >         memcpy(&s->memory, &msg.payload.memory, sizeof(msg.payload.memory));
> > > >         s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds, G_N_ELEMENTS(s->fds));
> > > >
> > > >         /* signal the test that it can continue */
> > > >         g_cond_signal(&s->data_cond);
> > > >         break;
> > > > ..[snip]..
> > > > }
> > > >
> > > > The test seems to be marked complete as soon as mem_table is copied.
> > > > However, this patch 3/3 changes the behaviour of the SET_MEM_TABLE vhost command implementation with qemu. SET_MEM_TABLE now sends out a new message GET_FEATURES, and the call is only completed once it receives features from the remote application. (or the test framework, as is the case here.)
> > >
> > > Hmm but why does it matter that data_cond is woken up?
> >
> > Michael, sorry, I didn’t quite understand that. Could you pls explain ?
> >
> > >
> > > > While the test itself can be modified (Do not signal completion until we’ve sent a follow-up response to GET_FEATURES), I am now wondering if this patch may break existing vhost applications too ? If so, reverting it possibly better.
> > >
> > > What bothers me is that the new feature might cause the same
> > > issue once we enable it in the test.
> >
> > No it wont. The new feature is a protocol extension, and only works if it has been negotiated with. If not negotiated, that part of code is never executed.
> >
> > >
> > > How about a patch to tests/vhost-user-test.c adding the new
> > > protocol feature? I would be quite interested to see what
> > > is going on with it.
> >
> > Yes that can be done. But you can see that the protocol extension patch will not change the behaviour of the _existing_ test.
> >
> > >
> > > > What confuses me is why it doesn’t fail all the time, but only about 20% to 30% time as Fam reports.
> > >
> > > And succeeds every time on my systems :(
> >
> > +1 to that :( I have had no luck repro’ing it.
> >
> > >
> > > >
> > > > Thoughts : Michael, Fam, MarcAndre ?
>
> I have managed to reproduce the hang by adding some debug prints into
> vhost_user_get_features().
>
> Doing this the issue is reproducible quite easily.
> Another way to reproduce it in one shot is to strace (with following
> forks option) vhost-user-test execution.
>
> So, by adding debug prints at vhost_user_get_features() entry and exit,
> we can see we never return from this function when hang happens.
> Strace of Qemu instance shows that its thread keeps retrying to receive
> GET_FEATURE reply:
>
> write(1, "vhost_user_get_features IN: \n", 29) = 29
> sendmsg(11, {msg_name=NULL, msg_namelen=0,
>         msg_iov=[{iov_base="\1\0\0\0\1\0\0\0\0\0\0\0", iov_len=12}],
>         msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 12
> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
> nanosleep({0, 100000}, 0x7fff29f8dd70) = 0
> ...
> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
> nanosleep({0, 100000}, 0x7fff29f8dd70) = 0
>
> The reason is that vhost-user-test never replies to Qemu,
> because its thread handling the GET_FEATURES command is waiting for
> the s->data_mutex lock.
> This lock is held by the other vhost-user-test thread, executing
> read_guest_mem().
>
> The lock is never released because the thread is blocked in read
> syscall, when read_guest_mem() is doing the readl().
>
> This is because on Qemu side, the thread polling the qtest socket is
> waiting for the qemu_global_mutex (in os_host_main_loop_wait()), but
> the mutex is held by the thread trying to get the GET_FEATURE reply
> (the TCG one).
>
> So here is the deadlock.
>
> That said, I don't see a clean way to solve this.
> Any thoughts?
>
> Regards,
> Maxime

My thought is that we really need to do what I said:
avoid doing GET_FEATURES (and setting reply_ack)
on the first set_mem, and I quote:

	OK this all looks very reasonable (and I do like patch 1 too)
	but there's one source of waste here: we do not need to
	synchronize when we set up device the first time
	when hdev->memory_changed is false.

	I think we should test that and skip synch in both patches
	unless hdev->memory_changed is set.

With that change the test will start passing.

-- 
MST
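
To make the suggestion above concrete, here is a rough, self-contained sketch of the check: send SET_MEM_TABLE as before, but skip the synchronizing round trip unless memory_changed is set. This is not the actual QEMU code. struct sketch_dev and every sketch_* function are hypothetical stand-ins, and the counters exist only so the behaviour can be observed. The only thing taken from the thread is the idea of testing hdev->memory_changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few vhost_dev fields this sketch needs. */
struct sketch_dev {
    bool memory_changed;  /* per the mail: false when the device is first set up */
    int msgs_sent;        /* illustration only: SET_MEM_TABLE messages sent */
    int syncs_done;       /* illustration only: synchronizing round trips */
};

/* Hypothetical placeholder for sending SET_MEM_TABLE to the backend. */
static void sketch_send_set_mem_table(struct sketch_dev *dev)
{
    dev->msgs_sent++;
}

/*
 * Hypothetical placeholder for the synchronization step: either the
 * GET_FEATURES round trip (the workaround) or waiting for the reply
 * ack (the protocol extension).
 */
static void sketch_sync_with_backend(struct sketch_dev *dev)
{
    dev->syncs_done++;
}

/* Sketch of set_mem_table with the proposed check. Returns 0 on success. */
static int sketch_set_mem_table(struct sketch_dev *dev)
{
    assert(dev != NULL);

    sketch_send_set_mem_table(dev);

    /* First device setup: skip the round trip, as proposed in the mail. */
    if (!dev->memory_changed) {
        return 0;
    }

    /* The memory map changed: wait until the backend has caught up. */
    sketch_sync_with_backend(dev);
    return 0;
}
```

Calling sketch_set_mem_table() with memory_changed == false sends the message without blocking on a reply; with memory_changed == true it also performs one synchronization.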