From: Paolo Bonzini
To: "Richard W.M. Jones"
Cc: kwolf@redhat.com, lersek@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me
Date: Fri, 17 Jul 2015 06:44:45 +0200
Message-ID: <55A8883D.1010207@redhat.com>
In-Reply-To: <20150716190546.GI29283@redhat.com>
References: <1437040609-9878-1-git-send-email-pbonzini@redhat.com> <20150716190546.GI29283@redhat.com>

On 16/07/2015 21:05, Richard W.M. Jones wrote:
> 
> Sorry to spoil things, but I'm still seeing this bug, although it is
> now a lot less frequent with your patch.  I would estimate it happens
> more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
> with qemu.git + the v2 patch series.
> 
> It's the exact same hang in both cases.
> 
> Is it possible that this patch doesn't completely close any race?
> 
> Still, it is an improvement, so there is that.

At first glance this looks like a different bug.  Interestingly, adding
some "tracing" (qemu_clock_get_ns) makes the bug more likely: now it
reproduces in about 10 tries.  Of course :) adding other kinds of
tracing instead makes it go away again (>50 tries).

Perhaps this:

   i/o thread         vcpu thread                   worker thread
   ---------------------------------------------------------------------
   lock_iothread
   notify_me = 1
   ...
   unlock_iothread
                      lock_iothread
                      notify_me = 3
                      ppoll
                      notify_me = 1
                                                    bh->scheduled = 1
                                                    event_notifier_set
                      event_notifier_test_and_clear
   ppoll
     ^^ hang

In the exact shape above, it doesn't seem too likely to happen, but
perhaps there's another, simpler case.  Still, the bug exists.

The above is not really related to notify_me.  Here the notification is
not being optimized away!  So I wonder if this one has been there
forever.

Fam suggested putting the event_notifier_test_and_clear before
aio_bh_poll(), but it does not work.  I'll look at it more closely.

However, an unconditional event_notifier_test_and_clear is pretty
expensive.  On one hand, obviously correctness comes first.  On the
other hand, an expensive operation at the wrong place can mask the race
very easily; I'll let the fix run for a while, but I'm not sure a
successful test really says anything useful.

Paolo
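
As an illustration of the shape of the hang above, here is a minimal,
self-contained C sketch of the generic lost-wakeup window in an event
loop that consumes its wakeup notifier only after scanning for
scheduled bottom halves.  This is a hypothetical toy, not the actual
QEMU code: the names (notifier, bh_scheduled, schedule_bh,
loop_iteration) are made up, and C11 atomics stand in for the
EventNotifier and the ppoll() machinery.

/* Hypothetical sketch -- NOT actual QEMU code. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool notifier;       /* stands in for the EventNotifier */
static atomic_bool bh_scheduled;   /* stands in for bh->scheduled     */

/* Worker thread: schedule a bottom half, then kick the event loop
 * (the analogue of "bh->scheduled = 1" plus event_notifier_set).     */
static void schedule_bh(void)
{
    atomic_store(&bh_scheduled, true);
    atomic_store(&notifier, true);
}

/* Event-loop thread: one simplified iteration. */
static void loop_iteration(void)
{
    /* 1. Run any scheduled bottom halves (the aio_bh_poll step).     */
    if (atomic_exchange(&bh_scheduled, false)) {
        /* ... the bottom half would run here ...                     */
    }

    /* If schedule_bh() runs right here, the kick it sends is eaten
     * by step 2 below, and the bottom half it scheduled is never
     * noticed: that is the lost wakeup.                              */

    /* 2. Consume the wakeup (the event_notifier_test_and_clear step). */
    atomic_exchange(&notifier, false);

    /* 3. Block until the notifier is set again (the ppoll step;
     *    a busy-wait only to keep the sketch self-contained).         */
    while (!atomic_load(&notifier)) {
        /* would sleep in ppoll() here */
    }
}

In this toy, moving step 2 ahead of step 1 (the reordering of
event_notifier_test_and_clear before aio_bh_poll() mentioned above)
closes this particular window; since the message notes that it does not
fix the real hang, the actual race presumably has a different shape.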