From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50970)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1XYCR1-0007CP-Lg
	for qemu-devel@nongnu.org; Sun, 28 Sep 2014 07:14:43 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from )
	id 1XYCQu-0001Ob-5n
	for qemu-devel@nongnu.org; Sun, 28 Sep 2014 07:14:35 -0400
Received: from mail-pd0-f171.google.com ([209.85.192.171]:65192)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1XYCQt-0001N9-VE
	for qemu-devel@nongnu.org; Sun, 28 Sep 2014 07:14:28 -0400
Received: by mail-pd0-f171.google.com with SMTP id y13so15549161pdi.2
	for ; Sun, 28 Sep 2014 04:14:20 -0700 (PDT)
Message-ID: <5427ED85.3040909@ozlabs.ru>
Date: Sun, 28 Sep 2014 21:14:13 +1000
From: Alexey Kardashevskiy
MIME-Version: 1.0
References: <20140919084703.GA7667@noname.redhat.com>
	<1411462065-6462-1-git-send-email-aik@ozlabs.ru>
	<20140924094836.GB3862@noname.redhat.com>
	<5423D523.5070009@ozlabs.ru>
	<20140925085718.GE4667@noname.redhat.com>
	<5423E686.20109@ozlabs.ru>
	<20140925102027.GH4667@noname.redhat.com>
	<54240AB0.508@ozlabs.ru>
	<20140925123944.GK4667@noname.redhat.com>
	<54242145.6070808@ozlabs.ru>
In-Reply-To: <54242145.6070808@ozlabs.ru>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH] qcow2: Fix race in cache invalidation
To: Kevin Wolf
Cc: "libvir-list @ redhat . com", qemu-devel@nongnu.org, Max Reitz,
	Stefan Hajnoczi, Paolo Bonzini, "Dr . David Alan Gilbert"

On 09/26/2014 12:05 AM, Alexey Kardashevskiy wrote:
> On 09/25/2014 10:39 PM, Kevin Wolf wrote:
>> On 25.09.2014 at 14:29, Alexey Kardashevskiy wrote:
>>> On 09/25/2014 08:20 PM, Kevin Wolf wrote:
>>>> On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
>>>>> Right. Cool. So is below what was suggested?
>>>>> I am double-checking as it does not solve the original issue - the
>>>>> bottom half is called first and then nbd_trip() crashes in
>>>>> qcow2_co_flush_to_os().
>>>>>
>>>>> diff --git a/block.c b/block.c
>>>>> index d06dd51..1e6dfd1 100644
>>>>> --- a/block.c
>>>>> +++ b/block.c
>>>>> @@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs,
>>>>> Error **errp)
>>>>>      if (local_err) {
>>>>>          error_propagate(errp, local_err);
>>>>>          return;
>>>>>      }
>>>>>
>>>>>      ret = refresh_total_sectors(bs, bs->total_sectors);
>>>>>      if (ret < 0) {
>>>>>          error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>>>>          return;
>>>>>      }
>>>>> +
>>>>> +    bdrv_drain_all();
>>>>>  }
>>>>
>>>> Try moving the bdrv_drain_all() call to the top of the function (at
>>>> least it must be called before bs->drv->bdrv_invalidate_cache).
>>>
>>>
>>> Ok, I did. Did not help.
>>>
>>>
>>>>
>>>>> +static QEMUBH *migration_complete_bh;
>>>>> +static void process_incoming_migration_complete(void *opaque);
>>>>> +
>>>>>  static void process_incoming_migration_co(void *opaque)
>>>>>  {
>>>>>      QEMUFile *f = opaque;
>>>>> -    Error *local_err = NULL;
>>>>>      int ret;
>>>>>
>>>>>      ret = qemu_loadvm_state(f);
>>>>>      qemu_fclose(f);
>>>>
>>>> Paolo suggested to move everything starting from here, but as far as I
>>>> can tell, leaving the next few lines here shouldn't hurt.
>>>
>>>
>>> Ouch. I was looking at wrong qcow2_fclose() all this time :)
>>> Anyway, what you suggested did not help -
>>> bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being
>>> executed and the situation is still the same.
>>
>> Hm, do you have a backtrace? The idea with the BH was that it would be
>> executed _outside_ coroutine context and therefore wouldn't be able to
>> yield. If it's still executed in coroutine context, it would be
>> interesting to see who that caller is.
>
> Like this?
> process_incoming_migration_complete
> bdrv_invalidate_cache_all
> bdrv_drain_all
> aio_dispatch
> node->io_read (which is nbd_read)
> nbd_trip
> bdrv_co_flush
> [...]

Ping?

I do not know how to interpret this backtrace - in fact, in gdb at the
moment of the crash I only see traces up to nbd_trip and
coroutine_trampoline (below). What is the context here then?...


Program received signal SIGSEGV, Segmentation fault.
0x000000001050a8d4 in qcow2_cache_flush (bs=0x100363531a0, c=0x0) at
/home/alexey/p/qemu/block/qcow2-cache.c:174
(gdb) bt
#0  0x000000001050a8d4 in qcow2_cache_flush (bs=0x100363531a0, c=0x0) at /home/alexey/p/qemu/block/qcow2-cache.c:174
#1  0x00000000104fbc4c in qcow2_co_flush_to_os (bs=0x100363531a0) at /home/alexey/p/qemu/block/qcow2.c:2162
#2  0x00000000104c7234 in bdrv_co_flush (bs=0x100363531a0) at /home/alexey/p/qemu/block.c:4978
#3  0x00000000104b7e68 in nbd_trip (opaque=0x1003653e530) at /home/alexey/p/qemu/nbd.c:1260
#4  0x00000000104d7d84 in coroutine_trampoline (i0=0x100, i1=0x36549850) at /home/alexey/p/qemu/coroutine-ucontext.c:118
#5  0x000000804db01a9c in .__makecontext () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()


--
Alexey