From: Paolo Bonzini
To: "Richard W.M. Jones"
Cc: kwolf@redhat.com, lersek@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me
Date: Fri, 17 Jul 2015 06:44:45 +0200
Message-ID: <55A8883D.1010207@redhat.com>
In-Reply-To: <20150716190546.GI29283@redhat.com>
References: <1437040609-9878-1-git-send-email-pbonzini@redhat.com> <20150716190546.GI29283@redhat.com>

On 16/07/2015 21:05, Richard W.M. Jones wrote:
> 
> Sorry to spoil things, but I'm still seeing this bug, although it is
> now a lot less frequent with your patch.  I would estimate it happens
> more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
> with qemu.git + the v2 patch series.
> 
> It's the exact same hang in both cases.
> 
> Is it possible that this patch doesn't completely close any race?
> 
> Still, it is an improvement, so there is that.

At first glance this looks like a different bug.  Interestingly, adding
some "tracing" (qemu_clock_get_ns) makes the bug more likely: now it
reproduces in about 10 tries.  Of course :) adding other kinds of
tracing instead makes it go away again (>50 tries).

Perhaps this:

   i/o thread         vcpu thread                   worker thread
   ---------------------------------------------------------------------
   lock_iothread
   notify_me = 1
   ...
   unlock_iothread
                      lock_iothread
                      notify_me = 3
                      ppoll
                      notify_me = 1
                                                    bh->scheduled = 1
                                                    event_notifier_set
                      event_notifier_test_and_clear
   ppoll
     ^^ hang

In the exact shape above, it doesn't seem too likely to happen, but
perhaps there's another, simpler case.  Still, the bug exists.

The above is not really related to notify_me.  Here the notification is
not being optimized away!  So I wonder if this one has been there
forever.

Fam suggested putting the event_notifier_test_and_clear before
aio_bh_poll(), but it does not work.  I'll look at it more closely.

However, an unconditional event_notifier_test_and_clear is pretty
expensive.  On one hand, obviously correctness comes first.  On the
other hand, an expensive operation at the wrong place can mask the race
very easily; I'll let the fix run for a while, but I'm not sure a
successful test really says anything useful.

Paolo
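
As an illustration of the shape of the hang above, here is a minimal,
self-contained C sketch of the generic lost-wakeup window in an event
loop that consumes its wakeup notifier only after scanning for
scheduled bottom halves.  This is a hypothetical toy, not the actual
QEMU code: the names (notifier, bh_scheduled, schedule_bh,
loop_iteration) are made up, and C11 atomics stand in for the
EventNotifier and the ppoll() machinery.

/* Hypothetical sketch -- NOT actual QEMU code. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool notifier;       /* stands in for the EventNotifier */
static atomic_bool bh_scheduled;   /* stands in for bh->scheduled     */

/* Worker thread: schedule a bottom half, then kick the event loop
 * (the analogue of "bh->scheduled = 1" plus event_notifier_set).     */
static void schedule_bh(void)
{
    atomic_store(&bh_scheduled, true);
    atomic_store(&notifier, true);
}

/* Event-loop thread: one simplified iteration. */
static void loop_iteration(void)
{
    /* 1. Run any scheduled bottom halves (the aio_bh_poll step).     */
    if (atomic_exchange(&bh_scheduled, false)) {
        /* ... the bottom half would run here ...                     */
    }

    /* If schedule_bh() runs right here, the kick it sends is eaten
     * by step 2 below, and the bottom half it scheduled is never
     * noticed: that is the lost wakeup.                              */

    /* 2. Consume the wakeup (the event_notifier_test_and_clear step). */
    atomic_exchange(&notifier, false);

    /* 3. Block until the notifier is set again (the ppoll step;
     *    a busy-wait only to keep the sketch self-contained).         */
    while (!atomic_load(&notifier)) {
        /* would sleep in ppoll() here */
    }
}

In this toy, moving step 2 ahead of step 1 (the reordering of
event_notifier_test_and_clear before aio_bh_poll() mentioned above)
closes this particular window; since the message notes that it does not
fix the real hang, the actual race presumably has a different shape.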