Date: Thu, 17 Jan 2019 15:23:53 +0100
From: Kevin Wolf
Message-ID: <20190117142353.GA6761@localhost.localdomain>
References: <20181207122647.GE5119@linux.fritz.box>
 <20181212122457.GB5415@linux.fritz.box>
 <20181213122039.GC5427@linux.fritz.box>
 <2ee882bf-00cc-addf-d866-c5c916a00774@virtuozzo.com>
 <9d97e822-7c9b-c095-6a27-a74facf21e47@virtuozzo.com>
 <83610386-bc64-6691-9af8-e98eb0a68342@virtuozzo.com>
In-Reply-To: <83610386-bc64-6691-9af8-e98eb0a68342@virtuozzo.com>
Subject: Re: [Qemu-devel] PING: [PATCH] blk: postpone request execution on a context protected with "drained section"
To: Denis Plotnikov
Cc: "keith.busch@intel.com", "jsnow@redhat.com", "qemu-block@nongnu.org",
 "qemu-devel@nongnu.org", "mreitz@redhat.com", "stefanha@redhat.com",
 Denis Lunev, "famz@redhat.com", Vladimir Sementsov-Ogievskiy

On 17.01.2019 at 13:57, Denis Plotnikov wrote:
> Kevin,
>
> could you please take a look at my last comments?

I read it, and what it told me is essentially that I need to work on it
myself to fully understand the problem and possible acceptable
solutions, because you can't seem to find one yourself. I will, but I
can't guarantee when I can find the time for it.

Kevin

> On 15.01.2019 10:22, Denis Plotnikov wrote:
> > ping ping ping ping!!!!
> >
> > On 09.01.2019 11:18, Denis Plotnikov wrote:
> >> ping ping!!!
> >>
> >> On 18.12.2018 11:53, Denis Plotnikov wrote:
> >>> ping ping
> >>>
> >>> On 14.12.2018 14:54, Denis Plotnikov wrote:
> >>>>
> >>>> On 13.12.2018 15:20, Kevin Wolf wrote:
> >>>>> On 13.12.2018 at 12:07, Denis Plotnikov wrote:
> >>>>>> On 12.12.2018 15:24, Kevin Wolf wrote:
> >>>>>>> On 11.12.2018 at 17:55, Denis Plotnikov wrote:
> >>>>>>>>> Why involve the AioContext at all? This could all be kept at the
> >>>>>>>>> BlockBackend level without extending the layering violation that
> >>>>>>>>> aio_disable_external() is.
> >>>>>>>>>
> >>>>>>>>> BlockBackends get notified when their root node is drained, so
> >>>>>>>>> hooking things up there should be as easy, if not even easier
> >>>>>>>>> than in AioContext.
> >>>>>>>>
> >>>>>>>> Just want to make sure that I understood correctly what you meant
> >>>>>>>> by "BlockBackends get notified". Did you mean that bdrv_drain_end
> >>>>>>>> calls the child's role callback blk_root_drained_end by calling
> >>>>>>>> bdrv_parent_drained_end?
> >>>>>>>
> >>>>>>> Yes, blk_root_drained_begin/end calls are all you need. Specifically,
> >>>>>>> their adjustments to blk->quiesce_counter that are already there,
> >>>>>>> and in the 'if (--blk->quiesce_counter == 0)' block of
> >>>>>>> blk_root_drained_end() we can resume the queued requests.
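
As a rough illustration of the mechanism described above, a minimal
sketch of resuming queued requests when the quiesce counter drops back
to zero; the postponed_requests list and BlkPostponedRequest entry are
assumed here for illustration only, this is not the posted patch:

    /* Sketch only: blk->postponed_requests and BlkPostponedRequest are
     * hypothetical names, not existing QEMU fields. */
    static void blk_root_drained_end(BdrvChild *child)
    {
        BlockBackend *blk = child->opaque;
        assert(blk->quiesce_counter);

        if (--blk->quiesce_counter == 0) {
            /* The BlockBackend is no longer quiesced: wake every request
             * coroutine that was queued while it was drained. */
            while (!QLIST_EMPTY(&blk->postponed_requests)) {
                BlkPostponedRequest *req = QLIST_FIRST(&blk->postponed_requests);
                QLIST_REMOVE(req, entry);
                aio_co_wake(req->co);
            }
        }
    }
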
> >>>>>> Sounds like it should be so, but it doesn't work that way, and
> >>>>>> here is why: when doing mirror, we may resume the postponed
> >>>>>> coroutines too early, while the underlying bs is still protected
> >>>>>> from writing, and thus we hit the assertion on write request
> >>>>>> execution in bdrv_co_write_req_prepare when resuming the
> >>>>>> postponed coroutines.
> >>>>>>
> >>>>>> The thing is that the bs is protected for writing before the
> >>>>>> execution of bdrv_replace_node in mirror_exit_common, and
> >>>>>> bdrv_replace_node calls bdrv_replace_child_noperm which, in turn,
> >>>>>> calls child->role->drained_end, where one of the callbacks is
> >>>>>> blk_root_drained_end, which checks if (--blk->quiesce_counter == 0)
> >>>>>> and runs the postponed requests (coroutines) if the condition is
> >>>>>> true.
> >>>>>
> >>>>> Hm, so something is messed up with the drain sections in the mirror
> >>>>> driver. We have:
> >>>>>
> >>>>>     bdrv_drained_begin(target_bs);
> >>>>>     bdrv_replace_node(to_replace, target_bs, &local_err);
> >>>>>     bdrv_drained_end(target_bs);
> >>>>>
> >>>>> Obviously, the intention was to keep the BlockBackend drained during
> >>>>> bdrv_replace_node(). So how could blk->quiesce_counter ever get to 0
> >>>>> inside bdrv_replace_node() when target_bs is drained?
> >>>>>
> >>>>> Looking at bdrv_replace_child_noperm(), it seems that the function
> >>>>> has a bug: Even if old_bs and new_bs are both drained, the
> >>>>> quiesce_counter for the parent reaches 0 for a moment, because we
> >>>>> call .drained_end for the old child first and .drained_begin for
> >>>>> the new one later.
> >>>>>
> >>>>> So it seems the fix would be to reverse the order and first call
> >>>>> .drained_begin for the new child and then .drained_end for the old
> >>>>> child. Sounds like a good new testcase for tests/test-bdrv-drain.c,
> >>>>> too.
> >>>> Yes, that's true, but it's not enough...
> >>>> In mirror_exit_common() we actively manipulate the block driver
> >>>> states. When we replace a node in the snippet you showed, we can't
> >>>> allow the postponed coroutines to run, because the block tree isn't
> >>>> ready to receive the requests yet.
> >>>> To be ready, we need to insert the proper block driver state into
> >>>> the block backend, which is done here:
> >>>>
> >>>>     blk_remove_bs(bjob->blk);
> >>>>     blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
> >>>>     blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort);  << << << <<
> >>>>
> >>>>     bs_opaque->job = NULL;
> >>>>
> >>>>     bdrv_drained_end(src);
> >>>>
> >>>> If the tree isn't ready and we resume the coroutines, we'll end up
> >>>> with the requests landing in the wrong block driver state.
> >>>>
> >>>> So we should explicitly stop all activity on all the block driver
> >>>> states and their parents, and allow it again only when everything
> >>>> is ready to go.
> >>>>
> >>>> Why explicitly? Because the block driver states may belong to
> >>>> different block backends at the moment the manipulation begins.
> >>>>
> >>>> So it seems we need to disable all their contexts until the
> >>>> manipulation ends.
> >>>>
> >>>> Please correct me if I'm wrong.
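
For reference, a heavily simplified sketch of what the reordering
suggested above might look like; the real bdrv_replace_child_noperm()
also handles detach/attach callbacks, permissions and AioContext
checks, and this is not the actual upstream change:

    static void bdrv_replace_child_noperm(BdrvChild *child,
                                          BlockDriverState *new_bs)
    {
        BlockDriverState *old_bs = child->bs;
        int i;

        /* Begin the drain for the new child first ... */
        if (new_bs && child->role->drained_begin) {
            for (i = 0; i < new_bs->quiesce_counter; i++) {
                child->role->drained_begin(child);
            }
        }

        /* ... so that ending the drain for the old child can no longer
         * drop the parent's quiesce_counter to 0 while both nodes are
         * drained. */
        if (old_bs) {
            if (child->role->drained_end) {
                for (i = 0; i < old_bs->quiesce_counter; i++) {
                    child->role->drained_end(child);
                }
            }
            QLIST_REMOVE(child, next_parent);
        }

        child->bs = new_bs;

        if (new_bs) {
            QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
        }
    }
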
> >>>>
> >>>>>
> >>>>>> It seems that if the external requests are disabled on the
> >>>>>> context, we can't rely on anything, or we should check whether
> >>>>>> the underlying bs and its underlying nodes are ready to receive
> >>>>>> requests, which sounds quite complicated.
> >>>>>> Please correct me if I still don't understand something in that
> >>>>>> routine.
> >>>>>
> >>>>> I think the reason why relying on aio_disable_external() works is
> >>>>> simply because src is also drained, which keeps external events in
> >>>>> the AioContext disabled despite the bug in draining the target node.
> >>>>>
> >>>>> The bug would become apparent even with aio_disable_external() if
> >>>>> we didn't drain src, or even if we just supported src and target
> >>>>> being in different AioContexts.
> >>>>
> >>>> Why don't we disable all the contexts involved until the end of the
> >>>> block device tree reconstruction?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Denis
> >>>>>
> >>>>> Kevin
> >>>>>
> >>>>
> >>>
> >>
> >
>
> --
> Best,
> Denis
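
Picking up the earlier suggestion of a new testcase for
tests/test-bdrv-drain.c, a rough sketch of what it might look like. It
assumes the bdrv_test driver already defined in that file, and uses the
BlockDevOps drained_end callback to observe whether the BlockBackend
ever becomes unquiesced during the replacement, so the problem shows up
without relying on aio_disable_external() at all. The test name and the
counting helper are illustrative only:

    static int drained_end_calls;

    /* Fires only when blk->quiesce_counter drops back to 0. */
    static void test_cb_drained_end(void *opaque)
    {
        drained_end_calls++;
    }

    static const BlockDevOps counting_dev_ops = {
        .drained_end = test_cb_drained_end,
    };

    static void test_replace_node_keeps_parent_quiesced(void)
    {
        BlockBackend *blk = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
        BlockDriverState *old_bs = bdrv_new_open_driver(&bdrv_test, "old",
                                                        BDRV_O_RDWR,
                                                        &error_abort);
        BlockDriverState *new_bs = bdrv_new_open_driver(&bdrv_test, "new",
                                                        BDRV_O_RDWR,
                                                        &error_abort);

        blk_insert_bs(blk, old_bs, &error_abort);
        blk_set_dev_ops(blk, &counting_dev_ops, NULL);

        bdrv_drained_begin(old_bs);
        bdrv_drained_begin(new_bs);

        drained_end_calls = 0;
        bdrv_replace_node(old_bs, new_bs, &error_abort);

        /* The BlockBackend must stay quiesced across the replacement,
         * i.e. no drained_end callback in between. */
        g_assert_cmpint(drained_end_calls, ==, 0);

        bdrv_drained_end(new_bs);
        bdrv_drained_end(old_bs);
        g_assert_cmpint(drained_end_calls, ==, 1);

        bdrv_unref(old_bs);
        bdrv_unref(new_bs);
        blk_unref(blk);
    }
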