From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51864)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1XX9h8-0000J2-FY
	for qemu-devel@nongnu.org; Thu, 25 Sep 2014 10:07:04 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from )
	id 1XX9h0-0006nC-WB
	for qemu-devel@nongnu.org; Thu, 25 Sep 2014 10:06:54 -0400
Received: from mail-pd0-f172.google.com ([209.85.192.172]:54981)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1XX9h0-0006lj-PZ
	for qemu-devel@nongnu.org; Thu, 25 Sep 2014 10:06:46 -0400
Received: by mail-pd0-f172.google.com with SMTP id fp1so1193090pdb.3
	for ; Thu, 25 Sep 2014 07:06:40 -0700 (PDT)
Message-ID: <54242145.6070808@ozlabs.ru>
Date: Fri, 26 Sep 2014 00:05:57 +1000
From: Alexey Kardashevskiy
MIME-Version: 1.0
References: <20140919084703.GA7667@noname.redhat.com>
	<1411462065-6462-1-git-send-email-aik@ozlabs.ru>
	<20140924094836.GB3862@noname.redhat.com>
	<5423D523.5070009@ozlabs.ru>
	<20140925085718.GE4667@noname.redhat.com>
	<5423E686.20109@ozlabs.ru>
	<20140925102027.GH4667@noname.redhat.com>
	<54240AB0.508@ozlabs.ru>
	<20140925123944.GK4667@noname.redhat.com>
In-Reply-To: <20140925123944.GK4667@noname.redhat.com>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH] qcow2: Fix race in cache invalidation
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Kevin Wolf
Cc: "libvir-list @ redhat . com" , qemu-devel@nongnu.org, Max Reitz ,
	Stefan Hajnoczi , Paolo Bonzini , "Dr . David Alan Gilbert"

On 09/25/2014 10:39 PM, Kevin Wolf wrote:
> On 25.09.2014 at 14:29, Alexey Kardashevskiy wrote:
>> On 09/25/2014 08:20 PM, Kevin Wolf wrote:
>>> On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
>>>> Right. Cool. So is below what was suggested? I am double-checking as it does
>>>> not solve the original issue - the bottom half is called first and then
>>>> nbd_trip() crashes in qcow2_co_flush_to_os().
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index d06dd51..1e6dfd1 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs,
>>>> Error **errp)
>>>>      if (local_err) {
>>>>          error_propagate(errp, local_err);
>>>>          return;
>>>>      }
>>>>
>>>>      ret = refresh_total_sectors(bs, bs->total_sectors);
>>>>      if (ret < 0) {
>>>>          error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>>>          return;
>>>>      }
>>>> +
>>>> +    bdrv_drain_all();
>>>>  }
>>>
>>> Try moving the bdrv_drain_all() call to the top of the function (at
>>> least it must be called before bs->drv->bdrv_invalidate_cache).
>>
>>
>> Ok, I did. Did not help.
>>
>>
>>>
>>>> +static QEMUBH *migration_complete_bh;
>>>> +static void process_incoming_migration_complete(void *opaque);
>>>> +
>>>>  static void process_incoming_migration_co(void *opaque)
>>>>  {
>>>>      QEMUFile *f = opaque;
>>>> -    Error *local_err = NULL;
>>>>      int ret;
>>>>
>>>>      ret = qemu_loadvm_state(f);
>>>>      qemu_fclose(f);
>>>
>>> Paolo suggested moving everything starting from here, but as far as I
>>> can tell, leaving the next few lines here shouldn't hurt.
>>
>>
>> Ouch. I was looking at the wrong qcow2_fclose() all this time :)
>> Anyway, what you suggested did not help -
>> bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being
>> executed and the situation is still the same.
>
> Hm, do you have a backtrace? The idea with the BH was that it would be
> executed _outside_ coroutine context and therefore wouldn't be able to
> yield. If it's still executed in coroutine context, it would be
> interesting to see who that caller is.

Like this?

process_incoming_migration_complete
bdrv_invalidate_cache_all
bdrv_drain_all
aio_dispatch
node->io_read (which is nbd_read)
nbd_trip
bdrv_co_flush
[...]

-- 
Alexey
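
One possible reading of the backtrace above is that the drain loop itself dispatches a pending read handler, so a new request starts while the invalidation is still on the stack, even though the bottom half runs outside coroutine context. The following is only a toy model of that nesting, written under that assumption. Every toy_* name is hypothetical and none of them is a QEMU API; the real bdrv_drain_all(), aio_dispatch() and nbd_trip() are far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are QEMU APIs. */

#define TOY_MAX_HANDLERS 4

typedef void (*toy_handler_fn)(void);

static toy_handler_fn toy_pending[TOY_MAX_HANDLERS];
static size_t toy_pending_count;

static bool toy_in_invalidate;      /* true while the "invalidate" step runs */
static int toy_reentrant_requests;  /* requests started during invalidation */

/* Queue a handler, as if a file descriptor had become readable. */
static void toy_queue_handler(toy_handler_fn fn)
{
    assert(toy_pending_count < TOY_MAX_HANDLERS);
    toy_pending[toy_pending_count++] = fn;
}

/* Stand-in for a request handler such as the one reached via nbd_read. */
static void toy_request_handler(void)
{
    if (toy_in_invalidate) {
        /* A new request began while invalidation was still in progress. */
        toy_reentrant_requests++;
    }
}

/* Stand-in for a drain loop: it dispatches every pending handler. */
static void toy_drain_all(void)
{
    while (toy_pending_count > 0) {
        toy_handler_fn fn = toy_pending[--toy_pending_count];
        fn();
    }
}

/* Stand-in for the invalidate step invoked from the bottom half. */
static void toy_invalidate_cache(void)
{
    toy_in_invalidate = true;
    toy_drain_all();    /* toy_request_handler() runs nested in here */
    toy_in_invalidate = false;
}

/* Runs the scenario once and returns how many requests were re-entered. */
int toy_run_scenario(void)
{
    toy_pending_count = 0;
    toy_reentrant_requests = 0;
    toy_queue_handler(toy_request_handler);
    toy_invalidate_cache();
    return toy_reentrant_requests;
}
```

Calling toy_run_scenario() from a small main() reports one re-entered request, which mirrors the shape of the backtrace: the handler is reached from inside the drain call rather than from a separate main loop iteration.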